Automating Mathematics

Working journal

Andrews-Curtis: First Run

Having finally run something in ProvingGround involving the main learning method, here are some observations.

Simple Andrews-Curtis run

Starting with an initial distribution, evolution took place using the basic Andrews-Curtis moves. I ran two loops with two steps each, then resumed and ran 3 more loops. The raw results are at first run results.


  • In terms of the highest weights, the method worked well even with such a short run. For example, the presentation $\langle a, b; \bar{a}, b\rangle$ was one of those with high weight.

  • In general, the ordering by weight agreed with expected behaviour.

  • However, virtually all theorems were simply remembered rather than deduced.

  • This stems from two weaknesses:
    • too high a weight for theorems relative to proofs, leading to memorization being favoured over deduction.
    • not enough methods of deduction, specifically the exclusion of multiplication-inversion moves.
  • These can be corrected by:
    • adding more moves: multiplication-inversions, transpositions; or allowing composite moves.
    • lowering the probability of word-continuation in generating presentations (just a runtime parameter).
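The basic moves can be sketched concretely. The following is a toy version, assuming a word is a vector of signed generator indices (with $k$ and $-k$ inverse); the `Word` and `Presentation` representations and all names are illustrative, not ProvingGround's actual types:

```scala
object ACMoves {
  type Word = Vector[Int]          // k and -k denote inverse generators
  type Presentation = Vector[Word] // a presentation as a vector of relators

  // Free reduction: cancel adjacent x, x^{-1} pairs.
  def reduce(w: Word): Word =
    w.foldLeft(Vector.empty[Int]) { (acc, x) =>
      if (acc.nonEmpty && acc.last == -x) acc.init
      else acc :+ x
    }

  def inverse(w: Word): Word = w.reverse.map(-_)

  // Replace relator i by its inverse.
  def invert(p: Presentation, i: Int): Presentation =
    p.updated(i, inverse(p(i)))

  // Replace relator i by its product with relator j (for i != j).
  def multiply(p: Presentation, i: Int, j: Int): Presentation =
    p.updated(i, reduce(p(i) ++ p(j)))

  // Conjugate relator i by the generator with index k.
  def conjugate(p: Presentation, i: Int, k: Int): Presentation =
    p.updated(i, reduce(k +: p(i) :+ (-k)))
}
```

The multiplication-inversion and transposition moves mentioned above would be further combinators in the same style.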

Lambda Islands

A single lambda island

  • We are given a type $T$.
  • Generate a new variable $x : T$. The result and gradient do not depend on $x$ (up to equality).
  • We have an inclusion map $i_x: FD(V) \to FD(V)$. This is almost the identity, except the gradient purges the weight of $x$.
  • From this we get the initializer on distributions, introducing $x$ with weight $p(m)$ with $m$ the $\lambda$-parameter and scaling the rest by $1 - p(m)$.
  • It is probably easiest to directly define the initializer, defining its application and gradient.
  • We compose the initializer with a given dynamical system.
  • Finally, export by mapping $y$ to $x \mapsto y$. This is composed with the initializer and the given dynamical system.
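The initializer step can be sketched directly on finite distributions: introduce $x$ with weight $p(m)$ and scale the rest by $1 - p(m)$. Here `FD` is a toy stand-in for ProvingGround's FiniteDistribution, and the names are illustrative:

```scala
// Toy finite distribution as a map from elements to weights.
case class FD[T](weights: Map[T, Double]) {
  def scale(c: Double): FD[T] = FD(weights.map { case (t, w) => (t, w * c) })
  def apply(t: T): Double = weights.getOrElse(t, 0.0)
}

// Introduce the new variable x with weight pm; scale the rest by 1 - pm.
def initializer[T](x: T, pm: Double)(d: FD[T]): FD[T] = {
  val scaled = d.scale(1 - pm)
  FD(scaled.weights.updated(x, scaled(x) + pm))
}
```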

Combining lambda islands

  • Given an initial distribution, we have weights $p(T)$ associated to each type $T$.
  • Given a dynamical system $f$ and type $T$, we have a new system $\Lambda_T(f)$.
  • We take the weighted sum of these, i.e., $\Lambda(f) = \sum_T w(T) \Lambda_T(f)$.

Using recurrence

  • Given a dynamical system, generate a family $g_k$ indexed by natural numbers $k$ corresponding to iterating $2^k$ times.
  • Mix in lambda to get a new family $f_k$ with $f_k = g_k + \Lambda(f_{k-1})$, with the last term a priori vanishing for $k \leq 0$.
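The recurrence can be sketched abstractly over any type with a zero and a sum; all names here are illustrative:

```scala
// f_k = g_k + lambda(f_{k-1}), with f_k = zero for k < 0.
def recFamily[A](g: Int => A, lambda: A => A,
                 zero: A, plus: (A, A) => A): Int => A = {
  def f(k: Int): A =
    if (k < 0) zero else plus(g(k), lambda(f(k - 1)))
  f
}
```

For instance, with `A = Double`, `g = k => k.toDouble` and `lambda = x => x / 2`, the family unfolds as $f_0 = 0$, $f_1 = 1$, $f_2 = 2.5$, and so on.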

Dynamical Actors


  • Dynamical system $f: X\to X$ depending on parameters.
  • Goals to check.


  • Called with state, parameters, iterations, and goals.
  case class LoopWork(x: X, p: P, loops: Int, goals: List[X => Boolean] = List())
  • Iterates until success or until the loops are completed.
  case class LoopsDone(id: String, result: X)

  case class Success(id: String, result: X)


  • commits
  • solutions: commit, goal.
  • goals


  • State
  • Hash-tag: a string giving the hex code of the hash of the commit.
  • Ancestor
  • Meta-data: who committed, why.
  def commit(x: X, parent: String) : String // returns hash-tag
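A minimal sketch of such a commit store, assuming the hash-tag is the SHA-1 hex digest of the serialized state together with its parent; the `Repo` and `Commit` names and the serialization are illustrative, not an actual API:

```scala
import java.security.MessageDigest

// A commit records the state, its ancestor's hash-tag, and meta-data.
case class Commit[X](state: X, parent: String, meta: String)

class Repo[X] {
  private var commits = Map.empty[String, Commit[X]]

  // Store a commit and return its hash-tag (SHA-1 hex digest).
  def commit(x: X, parent: String, meta: String = ""): String = {
    val bytes = (x.toString + parent).getBytes("UTF-8")
    val tag = MessageDigest.getInstance("SHA-1")
      .digest(bytes).map("%02x".format(_)).mkString
    commits += (tag -> Commit(x, parent, meta))
    tag
  }

  def lookup(tag: String): Option[Commit[X]] = commits.get(tag)
}
```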


  • Maintains all workers and communication with the outside. Messages include the following (and a few similar ones).
  case object ActorList // returns actor ids and whether they are paused.

  case class QueryState(id: String, pause: Boolean = true)

  case class QueryParams(id: String) // look up parameters at hub.

  case class UpdateParams(id: String, p: P)

  case class UpdateState(id: String, x: X)

  case class PauseWorker(id: String)

  case class ResumeWorker(id: String)

  case class StopWorker(id: String)

  case class UpdateGoals(id: String, goals: List[X => Boolean])

  case class ActorLoops(id: String, loops: Int)
  • For simplicity, assume that pausing on success is independent of actor, and pausing on query is part of the query string.
  case class ActorState(x: X, p: P,
    loops: Int,
    paused: Boolean,
    lastCommit: Long)

Callback on success

  • Commit the state.
  • Record that the commit contains a success.
  • Possibly update goals removing those attained.
  • SSE log the goals.


  • from interface: post JSON to the server.
  • to interface: send SSE messages, with the type used in a switch to decide the action.



  • A feedback is a function $f: X \to X$, typically depending on parameters, but which is not differentiable (i.e., no gradient).


  • Linear Structure: We can define a linear structure on A => B given one on B. This does not collide with that on differentiable functions as linear structures are not covariant.
  • Conjugation: Given a differentiable function $f: X \to Y$ and a (feedback) function $g: Y\to Y$, we can define a conjugate function $g^f: X\to X$ by $g^f(x) = (\nabla f)_x(g(f(x)))$, i.e., by back-propagating the feedback through the gradient of $f$ at $x$.
  • This behaves well with respect to composition of differentiable functions and linear combinations of feedbacks.
  • Partial conjugation: Sometimes the feedback also depends on $x \in X$, so we have $g: X\to Y$. We are then back-propagating $g$ using $f$.
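Conjugation is a one-liner given a gradient; the `Diffble` trait below is a minimal stand-in for the document's DiffbleFunction:

```scala
// Minimal differentiable function: a function with an adjoint gradient.
trait Diffble[A, B] extends (A => B) {
  def grad(a: A): B => A
}

// Conjugate a feedback g: Y => Y by f, i.e. g^f(x) = grad(f)(x)(g(f(x))).
def conjugate[X, Y](f: Diffble[X, Y])(g: Y => Y): X => X =
  (x: X) => f.grad(x)(g(f(x)))
```

For example, with $f(x) = x^2$ on the reals (gradient $c \mapsto 2ac$) and the identity feedback, the conjugate at $x = 3$ gives $2 \cdot 3 \cdot 9 = 54$.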

Matching distributions

  • Given a target distribution on $Y$, we get a feedback towards the target.
  • Given a distribution on $X$ and $f: X\to Y$, we have a pushforward distribution. We get an image feedback.
  • To get a feedback on $X$, we need to distribute among all terms with a given image proportional to the term, or depending only on the coefficient of the image.
  • Question Can we define this shift on $X$ in one step correctly?
  • Gradient view: Just view $f$ as a differentiable function and back-propagate by the gradient. This shifts weights independently.
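The pushforward and the feedback toward a target can be sketched on toy distributions-as-maps (illustrative names, signed weights allowed in the feedback):

```scala
// Toy finite distribution: a map from elements to weights.
type FD[T] = Map[T, Double]

// Pushforward of a distribution along f: sum weights over each fiber.
def pushforward[X, Y](f: X => Y)(d: FD[X]): FD[Y] =
  d.groupBy { case (x, _) => f(x) }
    .map { case (y, xs) => (y, xs.values.sum) }

// Feedback toward a target: the (signed) difference target - image.
def feedback[Y](target: FD[Y], image: FD[Y]): FD[Y] = {
  val keys = target.keySet ++ image.keySet
  keys.map(y =>
    y -> (target.getOrElse(y, 0.0) - image.getOrElse(y, 0.0))).toMap
}
```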

Terms matching types

  • We compare the distribution of terms that are types with the map from terms to types.
  • The difference between these gives a flow on types.
  • We back-propagate to get a flow on terms.

Approximate matches

  • We can filter by having a specific type, and then an error matching having a refined type (after some function application).
  • For a given term, we get the product of its weight with an error factor.
  • In terms of combinators, we replace inclusion by a weighted inclusion, or a product of the inclusion with a conformal factor. We need a new basic differentiable function.

Blending feedbacks

  • As in the case of matching types, suppose
    • $X$ is a finite distribution.
    • various feedbacks are based on atoms.
    • we have weights for the various feedbacks.
  • We give more credit for feedback components attained by fewer terms.
  • This is achieved by sharing the feedback coming from one component, instead of just giving the gradient.
  • The terms representing types feedback is a special instance of this.

Blending Pruning and Backprop

The problem

  • One way to evolve a system is the genetic algorithm style, where we consider the fitness of the final elements and prune based on this. This however has trivial credit assignment.
  • At the other extreme is pure back-propagation. This has nice credit assignment, but, done directly, the support never changes.


  • Firstly, there must be an identity term in the evolution, so credit can go to objects just persisting.
  • Note that while the result of a big sum does not depend on terms that are zero, specifically $\mu(v)f$ with $\mu(v) = 0$, the gradient still depends on such terms. Specifically, we can get a flow back to add weight to $v$.
  • We decouple the big sum component, so that when we apply to an argument which is a distribution, we do not just take a set of functions depending on that distribution. We can instead take a specified support.
  • We take a big sum over all generated elements, to allow some of them to acquire weight, while computing the gradient.
  • We should periodically prune by weight, but only after flowing for long enough to allow for picking up weight.
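Periodic pruning by weight can be sketched on a toy distribution: drop support elements below a threshold and renormalize what remains (illustrative names throughout):

```scala
// Toy finite distribution with pruning by a weight threshold.
case class FD[T](weights: Map[T, Double]) {
  def total: Double = weights.values.sum

  // Keep only elements with weight >= epsilon, then renormalize.
  def prune(epsilon: Double): FD[T] = {
    val kept = weights.filter { case (_, w) => w >= epsilon }
    val t = kept.values.sum
    FD(if (t > 0) kept.map { case (k, w) => (k, w / t) } else kept)
  }
}
```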


  • We consider differentiable functions that also depend on a subset of V, by taking linear combinations including bigsums that are the union of the support of a distribution with the given subset.
  • For propagation, we just take an empty subset of V.
  • Once we have the image, we take its (essential) support as the given subset of V. We recompute the differentiable function. Note that the value is unchanged.
  • We use the gradient of the new function to back-propagate.
  • This gives a single loop, which we repeat to evolve the system.
  • Purge after looping for enough time.

Using the identity

  • The gradient of the identity does not depend on any form of a priori support.
  • Hence the gradient of a persistence term, i.e., the identity multiplied by a weight, attributes weight to everything in the final distribution.
  • Note: This lets us avoid the above complications.

Differentiable Function Combinators

Basic functions and Combinators (all implemented)

We build differentiable functions for our basic learning system using some basic ones and combinators:

  • compositions (in class)
  • sums and scalar products of differentiable functions. (done)
  • product of a real valued and a vector valued function - the subtlest case (done).
  • identity function (done).
  • projections on (V, W) (done)
  • inclusions to (V, W) (done)
  • evaluation of a finite distribution at a point (done).
  • atomic distribution as a function of weight (done).
  • point-wise multiplication of a finite distribution by a given function (done).
  • sum of a set of functions, with even the set depending on argument (done).
    • this can be interpreted as the sum of a fixed set of functions, but with all but finitely many zero.
  • repeated squaring $k$ times, with $k=0$ and $k<0$ cases (done).
  • recursive definitions for families indexed by integers - generic given zero. Done (for vector spaces).
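Repeated squaring, with the $k=0$ and $k<0$ cases as listed, can be sketched for plain functions (the same shape applies to differentiable functions):

```scala
// Iterate f exactly 2^k times using k self-compositions;
// k = 0 gives f itself and k < 0 is taken to give the identity.
def iterPow2[A](f: A => A, k: Int): A => A =
  if (k < 0) (a: A) => a
  else if (k == 0) f
  else {
    val g = iterPow2(f, k - 1) // iterates 2^(k-1) times
    g andThen g                // so g andThen g iterates 2^k times
  }
```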

Derived from these

  • (optional) moves
  • (optional) pairings
  • linear combinations.

Convenience code

  • Have implicit conversion from
    • a type T implicitly depending on a linear structure on T to have methods ++ and *:
    • DiffbleFunction[V, V] to a class with a multiplication method **:

To Do:

  • Construct differentiable functions for Finite Distributions - atom, evaluate and point-wise product.

Gradient of Product

A subtle differentiable function is the scalar product,

$$(a, v) \mapsto av, \qquad \mathbb{R} \times V \to V,$$

for a vector space $V$.

By the Leibniz rule, the total derivative at $(a, v)$ is

$$D_{(a, v)}(\alpha, u) = \alpha v + a u.$$

Note that this depends on the vector space structure on $(\mathbb{R}, V)$. For a vector $w$ in $V$, note that we can write

$$\langle \alpha v + a u, w \rangle = \alpha \langle v, w \rangle + \langle u, a w \rangle.$$

By definition of the gradient (as the adjoint of the derivative), we have

$$\nabla_{(a, v)}(w) = (\langle v, w \rangle, a w).$$

We need an additional implicit structure with combinators:

case class InnerProduct[V](dot: (V, V) => Double)

def vdot[V](implicit ip: InnerProduct[V]) = ip.dot
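Instances can then be built by combinators, analogously to linear structures. A self-contained sketch for reals and pairs (illustrative names):

```scala
case class InnerProduct[V](dot: (V, V) => Double)

// Inner product on the reals: ordinary multiplication.
implicit val realIP: InnerProduct[Double] =
  InnerProduct((x, y) => x * y)

// Inner product on pairs: the sum of componentwise inner products.
implicit def pairIP[A, B](implicit ipA: InnerProduct[A],
                          ipB: InnerProduct[B]): InnerProduct[(A, B)] =
  InnerProduct((p, q) => ipA.dot(p._1, q._1) + ipB.dot(p._2, q._2))

def vdot[V](implicit ip: InnerProduct[V]) = ip.dot
```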

Linear Structure, Differentiable Functions (Implicit) Combinators

The best way to build differentiable functions for the typical learning system in function-finder is using combinators. At present there is some ad hoc version of this. The better way is to use linear structures systematically.

Linear structures

case class LinearStructure[A](sum: (A, A) => A, mult : (Double, A) => A){
  def diff(frm: A, remove: A) = sum(frm, mult(-1.0, remove))
}

Linear Spaces can be built with

  • Real numbers - Double
  • Finite Distributions have ++ and *
  • Pairs of linear spaces
  • Differentiable functions between linear spaces
  • (Eventually) Maps with values in Linear spaces - for representation learning.

implicit val RealsAsLinearStructure = LinearStructure[Double]((_+_), (_*_))

implicit def VectorPairs[A, B](implicit lsa: LinearStructure[A], lsb: LinearStructure[B]): LinearStructure[(A, B)] = {
  def sumpair(fst: (A, B), scnd: (A, B)) = (lsa.sum(fst._1, scnd._1), lsb.sum(fst._2, scnd._2))

  def multpair(sc: Double, vect: (A, B)) = (lsa.mult(sc, vect._1), lsb.mult(sc, vect._2))

  LinearStructure(sumpair, multpair)
}

implicit def FiniteDistVec[T] = LinearStructure[FiniteDistribution[T]](_++_, (w, d) => d * w)

To Do:

  • Add a field for the zero vector (done).
  • Create a linear structure for differentiable functions(done).
  • Have functions that implicitly use linear structures(done).

More vector structures

In addition, we need inner products for computing certain gradients as well as totals for normalization. These are built for

  • Finite distributions
  • pairs
  • Real numbers
  • (Eventually) Maps with co-domain having inner products and totals.

Differentiable functions

trait DiffbleFunction[A, B] extends (A => B){self =>
  def grad(a: A) : B => A

  /** Composition f *: g is f(g(_)) */
  def *:[C](that: DiffbleFunction[B, C]) = andThen(that)

  def andThen[C](that: => DiffbleFunction[B, C]): DiffbleFunction[A, C] =
    DiffbleFunction((a: A) => that(self(a)))(
      (a: A) => (c: C) => self.grad(a)(that.grad(self(a))(c)))
}

object DiffbleFunction{
  def apply[A, B](f: => A => B)(grd: => A => (B => A)) = new DiffbleFunction[A, B]{
    def apply(a: A) = f(a)

    def grad(a: A) = grd(a)
  }
}

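As a sanity check, here is a self-contained miniature of the trait, verifying that the gradient of a composition follows the chain rule (the `DF` name and the example function are illustrative):

```scala
// Miniature differentiable function with gradient-propagating composition:
// grad(g . f)(a) = grad(f)(a) compose grad(g)(f(a)).
trait DF[A, B] extends (A => B) { self =>
  def grad(a: A): B => A
  def andThen[C](that: DF[B, C]): DF[A, C] = new DF[A, C] {
    def apply(a: A) = that(self(a))
    def grad(a: A) = (c: C) => self.grad(a)(that.grad(self(a))(c))
  }
}

val square = new DF[Double, Double] {
  def apply(x: Double) = x * x
  def grad(a: Double) = (c: Double) => 2 * a * c
}

val fourth = square andThen square // x => x^4
```

At $x = 2$ the composite gives $16$, and its gradient applied to $1$ gives $4 \cdot 2^3 = 32$, as the chain rule predicts.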
Differentiable functions are built from

  • Composing and iterating differentiable functions.
  • (Partial) Moves on X giving FD(X) -> FD(X)
  • (Partial) Combinations on X giving FD(X) -> FD(X)
  • Co-ordinate functions on FD(X).
  • Inclusion of a basis vector.
  • Projections and Inclusions for pairs.
  • Scalar product of differentiable functions, $x\mapsto f(x) * g(x)$ with $f: V \to \mathbb{R}$ and $g: V\to W$.
  • The sum of a set of differentiable functions X \to Y, with the set of functions depending on $x : X$.

From the above, we should be able to build.

  • Islands: Recursive definitions in terms of the function itself - more precisely a sequence of functions at different depths (in a more rational form).
  • Multiple islands: A recurrence corresponding to a lot of island terms.

To Do:

  • Clean up the present ad hoc constructors replacing them by combinators.


  • After adding an island, we get a family of differentiable functions $f_d$ indexed by depth.
  • This can be simply viewed as a recursive definition.
  • Given a function $g$, we get a function at depth $d$ by iterating $2^d$ times by repeated squaring.

  • Determined by:

    • An initial family of differentiable functions $g_d$, corresponding to iterating $2^d$ times.
    • A transformation on differentiable functions $f \mapsto L(f)$.
  • Recurrence relation:

    • $f_d = g_d + L(f_{d-1})$, for $d \geq 0$, with $f_{-1} = 0$.
    • Here $L(f) = i \circ f \circ j$.
    • When $d=0$, the recurrence relation just gives $id = id$.
    • When $d=1$, we get $f_1 = g_1 + L(id)$.

Multiple Islands

  • We have a set of islands associated to a collection of objects, with a differentiable function associated to each.
  • We can assume the set to be larger, with functions for the rest being zero. So the set is really an a priori bound.
  • We get a recurrence relation:
  • $f_d(\cdot) = g_d(\cdot) + \sum_v i_v(\cdot) * L_v(f_{d-1}(\cdot)).$
  • Here $i_v$ is evaluation at $v$, for a pair of finite distributions.

Quasi-literate Programming

This blog will now have various comments, often technical, concerning the issues I face while working, mainly programming for Automated theorem proving.