# Andrews-Curtis: First Run

Having finally run something in ProvingGround involving the main learning method, here are some observations.

## Simple Andrews-Curtis run

Starting with an initial distribution, evolution took place using the basic Andrews-Curtis moves. I ran two loops with two steps each, then resumed and ran three more loops. The raw results are at first run results.

## Conclusions

• In terms of the highest weights, the method worked well even with such a short run. For example, the presentation $\langle a, b; \bar{a}, b\rangle$ was one of those with high weight.

• In general, the order of weights was in line with expected behaviour.

• However, virtually all theorems were remembered.

• This stems from two weaknesses:
  • too high a weight for theorems relative to proofs, leading to memos being favoured over deduction;
  • not enough methods of deduction, specifically exclusion of multiplication-inversion moves.
• These can be corrected by:
  • adding more moves: multiplication-inversions, transpositions; or allowing composite moves;
  • lowering the probability of word-continuation in generating presentations (just a runtime parameter).
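For concreteness, the basic Andrews-Curtis moves on a balanced presentation can be sketched in code. ProvingGround itself is in Scala; the Python encoding below (words as tuples of signed integers, with `-k` the inverse of generator `k`) is purely illustrative and not the project's representation.

```python
# Sketch of basic Andrews-Curtis moves on the relators of a presentation.
# A word is a tuple of nonzero ints: k is a generator, -k its inverse.

def reduce_word(w):
    """Freely reduce a word by cancelling adjacent inverse pairs."""
    out = []
    for g in w:
        if out and out[-1] == -g:
            out.pop()
        else:
            out.append(g)
    return tuple(out)

def invert(w):
    """Inverse of a word: reverse it and invert each letter."""
    return tuple(-g for g in reversed(w))

def multiply(rels, i, j):
    """Replace relator i by (relator i) * (relator j), for i != j."""
    rels = list(rels)
    rels[i] = reduce_word(rels[i] + rels[j])
    return tuple(rels)

def invert_relator(rels, i):
    """Replace relator i by its inverse."""
    rels = list(rels)
    rels[i] = invert(rels[i])
    return tuple(rels)

def conjugate(rels, i, w):
    """Replace relator i by w * (relator i) * w^{-1}."""
    rels = list(rels)
    rels[i] = reduce_word(w + rels[i] + invert(w))
    return tuple(rels)

# The high-weight presentation from the run, <a, b; a^{-1}, b>,
# encoded with a = 1, b = 2:
pres = ((-1,), (2,))
```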

# Lambda Islands

## A single lambda island

• We are given a type $T$.
• Generate a new variable $x : T$. The result and gradient do not depend on $x$ (up to equality).
• We have an inclusion map $i_x: FD(V) \to FD(V)$. This is almost the identity, except the gradient purges the weight of $x$.
• From this we get the initializer on distributions, introducing $x$ with weight $p(m)$ with $m$ the $\lambda$-parameter and scaling the rest by $1 - p(m)$.
• It is probably easiest to directly define the initializer, defining its application and gradient.
• We compose the initializer with a given dynamical system.
• Finally, export by mapping $y$ to $x \mapsto y$. This is composed with the initializer and the given dynamical system.
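The steps above can be sketched with finite distributions as dicts from terms to weights. The names `initializer`, `island` and the tuple encoding of lambdas are illustrative assumptions, not ProvingGround's Scala API.

```python
# Sketch of a single lambda island on finite distributions (dicts of weights).

def initializer(x, p):
    """Introduce the new variable x with weight p, scaling the rest by 1 - p."""
    def init(dist):
        out = {y: (1 - p) * w for y, w in dist.items()}
        out[x] = out.get(x, 0.0) + p
        return out
    return init

def island(x, p, evolve, export):
    """Compose the initializer, the inner dynamics, and export (y -> lambda x. y)."""
    init = initializer(x, p)
    def run(dist):
        inner = evolve(init(dist))
        return {export(x, y): w for y, w in inner.items()}
    return run
```

For example, with `evolve` the identity and lambdas encoded as tuples, `island("x", 0.3, ...)` sends the distribution `{"a": 1.0}` to one supported on terms `("lambda", "x", _)`.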

## Combining lambda islands

• Given an initial distribution, we have weights $p(T)$ associated to each type $T$.
• Given a dynamical system $f$ and type $T$, we have a new system $\Lambda_T(f)$.
• We take the weighted sum of these, i.e., $\Lambda(f) = \sum_T p(T) \Lambda_T(f)$.

## Using recurrence

• Given a dynamical system, generate a family $g_k$ indexed by natural numbers $k$ corresponding to iterating $2^k$ times.
• Mix in lambda to get a new family $f_k$ with $f_k = g_k + \Lambda(f_{k-1})$, with the last term a priori vanishing for $k \leq 0$.
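The recurrence can be sketched directly, with $g_k$ built by repeated squaring and $\Lambda$ supplied as a transformation on functions. All names here are illustrative; the actual construction lives in ProvingGround's Scala code.

```python
# Sketch of the family f_k = g_k + Lambda(f_{k-1}), with g_k the base map g
# iterated 2^k times via repeated squaring, and the island term vanishing
# for k <= 0.

def iterate_pow2(g, k):
    """Return g iterated 2^k times by repeated squaring (k < 0 gives identity)."""
    if k < 0:
        return lambda x: x
    f = g
    for _ in range(k):
        f = (lambda h: (lambda x: h(h(x))))(f)  # square: h -> h o h
    return f

def family(g, Lam, k):
    """f_k = g_k + Lambda(f_{k-1}); the last term vanishes for k <= 0."""
    g_k = iterate_pow2(g, k)
    if k <= 0:
        return g_k
    lam_prev = Lam(family(g, Lam, k - 1))
    return lambda x: g_k(x) + lam_prev(x)
```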

# Dynamical Actors

## Context

• Dynamical system $f: X\to X$ depending on parameters.
• Goals to check.

## Worker

• Called with state, parameter, iterations, goals.
• Iterates until success or loops completed.
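A minimal sketch of the worker loop, with illustrative names (the actual workers are actors in the Scala code):

```python
# Sketch of a worker: iterate the dynamical system on the state until some
# goal is attained or the allotted loops run out.

def worker(state, step, goals, max_loops):
    """Return (final state, attained goals, loops used)."""
    for n in range(max_loops):
        state = step(state)
        attained = [g for g in goals if g(state)]
        if attained:
            return state, attained, n + 1
    return state, [], max_loops
```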

## Database

• commits
• solutions: commit, goal.
• goals

### Commits

• State
• Hash-tag: a string representing the hex code of the hash.
• Ancestor
• Meta-data: who committed, why.

## Hub

• Maintains all workers and communication with the outside, via messages (and a few similar ones).
• For simplicity, assume that pausing on success is independent of actor, and pausing on query is part of the query string.

### Callback on success

• Commit the state.
• Record that the commit contains a success.
• Possibly update goals removing those attained.
• Log the goals via SSE.

## Channels

• from interface: post JSON to server.
• to interface: send SSE events, with a type field used in a switch to decide the action.

# Feedbacks

## Feedbacks

• A feedback is a function $f: X \to X$, typically depending on parameters, but which is not differentiable (i.e., no gradient).

## Combinators

• Linear Structure: We can define a linear structure on A => B given one on B. This does not collide with that on differentiable functions as linear structures are not covariant.
• Conjugation: Given a differentiable function $f: X \to Y$ and a (feedback) function $g: Y\to Y$, we can define a conjugate function $g^f: X\to X$ by $g^f(x) = (\nabla f)_x(g(f(x)))$, where $(\nabla f)_x$ is the gradient (adjoint derivative) of $f$ at $x$.
• This behaves well with respect to composition of differentiable functions and linear combinations of feedbacks.
• Partial conjugation: Sometimes the feedback also depends on $x \in X$, so we have $g: X\to Y$. We are then back-propagating $g$ using $f$.
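The conjugation combinator can be sketched with a differentiable function carried as a value/gradient pair. The class name `Diffble` and the formula `g^f(x) = grad_f(x)(g(f(x)))` are reconstructions from the notes, not the project's Scala `DiffbleFunction` API.

```python
# Sketch of conjugating a feedback g: Y -> Y by a differentiable f: X -> Y.

class Diffble:
    """A differentiable function: a value map plus a gradient (adjoint
    derivative) at each point."""
    def __init__(self, value, gradient):
        self.value = value        # x -> y
        self.gradient = gradient  # (x, dy) -> dx, back-propagating dy at x

def conjugate(f, g):
    """The conjugate feedback g^f: X -> X, back-propagating g(f(x))
    through the gradient of f at x."""
    return lambda x: f.gradient(x, g(f.value(x)))
```

For instance, with $f(x) = 2x$ on the reals (gradient: multiply by 2) and $g$ the identity feedback, the conjugate sends $x$ to $4x$.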

## Matching distributions

• Given a target distribution on $Y$, we get a feedback towards the target.
• Given a distribution on $X$ and $f: X\to Y$, we have a pushforward distribution. We get an image feedback.
• To get a feedback on $X$, we need to distribute among all terms with a given image proportional to the term, or depending only on the coefficient of the image.
• Question: Can we define this shift on $X$ in one step correctly?
• Gradient view: Just view $f$ as a differentiable function and back-propagate by the gradient. This shifts weights independently.
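The proportional variant above can be sketched with dict-based distributions; the function names are illustrative assumptions.

```python
# Sketch of a matching feedback: push a distribution on X forward along f,
# compare with a target on Y, and distribute each image error among the
# preimage terms in proportion to their weights.

def pushforward(dist, f):
    """Pushforward of a finite distribution along f."""
    out = {}
    for x, w in dist.items():
        y = f(x)
        out[y] = out.get(y, 0.0) + w
    return out

def match_feedback(dist, f, target):
    """Shift on X: each term gets its proportional share of the error
    at its image."""
    image = pushforward(dist, f)
    error = {y: target.get(y, 0.0) - image.get(y, 0.0)
             for y in set(image) | set(target)}
    shift = {}
    for x, w in dist.items():
        y = f(x)
        if image[y] > 0:
            shift[x] = error[y] * (w / image[y])
    return shift
```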

## Terms matching types

• We compare the distribution of terms that are types with the map from terms to types.
• The difference between these gives a flow on types.
• We back-propagate to get a flow on terms.

## Approximate matches

• We can filter by having a specific type, and then allow matching a refined type (after some function application) with an error.
• For a given term, we get the product of its weight with an error factor.
• In terms of combinators, we replace inclusion by a weighted inclusion, or a product of the inclusion with a conformal factor. We need a new basic differentiable function.

## Blending feedbacks

• As in the case of matching types, suppose
• $X$ is a finite distribution.
• various feedbacks are based on atoms.
• we have weights for the various feedbacks.
• We give more credit for feedback components attained by fewer terms.
• This is achieved by sharing the feedback coming from one component among the terms attaining it, instead of just giving the gradient to each.
• The terms representing types feedback is a special instance of this.

# Blending Pruning and Backprop

## The problem

• One way to evolve a system is the genetic algorithm style, where we consider the fitness of the final elements and prune based on this. This however has trivial credit assignment.
• At the other extreme is pure back-propagation. This has nice credit assignment, but done directly the support never changes.

## Solution

• Firstly, there must be an identity term in the evolution, so credit can go to objects just persisting.
• Note that while the result of a big sum does not depend on terms that are zero, specifically $\mu(v)f$ with $\mu(v) = 0$, the gradient still depends on such terms. Specifically, we can get a flow back to add weight to $v$.
• We decouple the big sum component, so that when we apply to an argument which is a distribution, we do not just take a set of functions depending on that distribution. We can instead take a specified support.
• We take a big sum over all generated elements, to allow some of them to acquire weight, while computing the gradient.
• We should periodically prune by weight, but only after flowing for long enough to allow for picking up weight.

## Code

• We consider differentiable functions that also depend on a subset of V, by taking linear combinations including big sums over the union of the support of a distribution with the given subset.
• For propagation, we just take an empty subset of V.
• Once we have the image, we take its (essential) support as the given subset of V. We recompute the differentiable function. Note that the value is unchanged.
• We use the gradient of the new function to back-propagate.
• This gives a single loop, which we repeat to evolve the system.
• Purge after looping for enough time.
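A single loop of this scheme can be sketched on dict-based distributions. This is a minimal sketch of the evolve/back-propagate/prune idea, with all names assumed for illustration; the real loop works on differentiable functions with extended support as described above.

```python
# Sketch of the loop: flow the distribution, apply a feedback shift, and
# only prune by weight after enough loops to let new terms pick up weight.

def normalize(dist):
    """Rescale a distribution to total weight 1 (if nonzero)."""
    total = sum(dist.values())
    return {x: w / total for x, w in dist.items()} if total > 0 else dist

def evolve(dist, flow, feedback, loops, prune_after, cutoff):
    for n in range(loops):
        dist = flow(dist)
        shift = feedback(dist)
        # Apply the feedback, clipping weights at zero.
        dist = {x: max(dist.get(x, 0.0) + shift.get(x, 0.0), 0.0)
                for x in set(dist) | set(shift)}
        # Prune by weight only after flowing for long enough.
        if n + 1 >= prune_after:
            dist = {x: w for x, w in dist.items() if w >= cutoff}
        dist = normalize(dist)
    return dist
```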

## Using the identity

• The gradient of the identity does not depend on any form of a priori support.
• Hence the gradient of a persistence term, i.e., the identity multiplied by a weight, attributes weight to everything in the final distribution.
• Note: This lets us avoid the above complications.

# Differentiable Function Combinators

## Basic functions and Combinators (all implemented)

We build differentiable functions for our basic learning system using some basic ones and combinators:

• compositions (in class)
• sums and scalar products of differentiable functions. (done)
• product of a real valued and a vector valued function - the subtlest case (done).
• identity function (done).
• projections on (V, W) (done)
• inclusions to (V, W) (done)
• evaluation of a finite distribution at a point (done).
• atomic distribution as a function of weight (done).
• point-wise multiplication of a finite distribution by a given function (done).
• sum of a set of functions, with even the set depending on argument (done).
• this can be interpreted as the sum of a fixed set of functions, but with all but finitely many zero.
• repeated squaring $k$ times, with $k=0$ and $k<0$ cases (done).
• recursive definitions for families indexed by integers - generic given zero. Done (for vector spaces).
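A few of these combinators can be sketched with a value/gradient pair standing in for the Scala `DiffbleFunction`; the Python class and combinator names are illustrative, not the project's API.

```python
# Sketch of basic combinators on differentiable functions, each carried as a
# (value, gradient) pair; gradients are adjoint derivatives.

class Diffble:
    def __init__(self, value, gradient):
        self.value = value        # x -> y
        self.gradient = gradient  # (x, dy) -> dx

def compose(f, g):
    """g after f: the gradient back-propagates through g, then f."""
    return Diffble(
        lambda x: g.value(f.value(x)),
        lambda x, dz: f.gradient(x, g.gradient(f.value(x), dz)))

def add(f, g):
    """Sum of two functions (real-valued here); gradients add."""
    return Diffble(
        lambda x: f.value(x) + g.value(x),
        lambda x, dy: f.gradient(x, dy) + g.gradient(x, dy))

# The identity function: its gradient passes feedback through unchanged.
identity = Diffble(lambda x: x, lambda x, dy: dy)
```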

## Derived from these

• (optional) moves
• (optional) pairings
• linear combinations.

## Convenience code

• Have implicit conversions from:
  • a type T, implicitly depending on a linear structure on T, to a class with methods ++ and *:
  • DiffbleFunction[V, V] to a class with a multiplication method **:

### To Do:

• Construct differentiable functions for Finite Distributions - atom, evaluate and point-wise product.

# Gradient of Product

A subtle differentiable function is the scalar product,

$$(a, v) \mapsto a\,v,$$

defined on $(\mathbb{R}, V)$ for a vector space $V$.

## Derivative

By the Leibniz rule, the total derivative at $(a, v)$ is

$$D_{(a,v)}(b, w') = b\,v + a\,w'.$$

Note that this depends on the vector space structure on $(\mathbb{R}, V)$. For a vector $w$ in $V$, note that we can write

$$\langle D_{(a,v)}(b, w'), w \rangle = b\,\langle v, w\rangle + a\,\langle w', w\rangle = \langle (b, w'), (\langle v, w\rangle, a\,w) \rangle.$$

By definition of the gradient (as the adjoint of the derivative), we have

$$\nabla_{(a,v)}(w) = (\langle v, w\rangle, a\,w).$$

## Code

We need an additional implicit structure with combinators:
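As a numerical illustration (not the actual Scala implicits), the derivative and gradient formulas above can be checked to be adjoint to each other, with vectors as plain lists:

```python
# Numerical sketch of the scalar product (a, v) -> a*v on R x R^n, checking
# the adjoint property <D_(a,v)(b, w'), w> = <(b, w'), grad_(a,v)(w)>.

def dot(u, v):
    """Standard inner product on R^n."""
    return sum(x * y for x, y in zip(u, v))

def derivative(a, v, b, w2):
    """Total derivative by the Leibniz rule: D_(a,v)(b, w') = b*v + a*w'."""
    return [b * x + a * y for x, y in zip(v, w2)]

def gradient(a, v, w):
    """Adjoint of the derivative: grad_(a,v)(w) = (<v, w>, a*w)."""
    return dot(v, w), [a * x for x in w]
```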

# Linear Structure, Differentiable Functions (Implicit) Combinators

The best way to build differentiable functions for the typical learning system in function-finder is using combinators. At present there is some ad hoc version of this. The better way is to use linear structures systematically.

## Linear structures

Linear Spaces can be built with

• Real numbers - Double
• Finite Distributions have ++ and *
• Pairs of linear spaces
• Differentiable functions between linear spaces
• (Eventually) Maps with values in Linear spaces - for representation learning.

### To Do:

• Add a field for the zero vector (done).
• Create a linear structure for differentiable functions (done).
• Have functions that implicitly use linear structures (done).

## More vector structures

In addition, we need inner products for computing certain gradients as well as totals for normalization. These are built for

• Finite distributions
• pairs
• Real numbers
• (Eventually) Maps with co-domain having inner products and totals.

## Differentiable functions

Differentiable functions are built from

• Composing and iterating differentiable functions.
• (Partial) Moves on X giving FD(X) -> FD(X)
• (Partial) Combinations on X giving FD(X) -> FD(X)
• Co-ordinate functions on FD(X).
• Inclusion of a basis vector.
• Projections and Inclusions for pairs.
• Scalar product of differentiable functions, $x\mapsto f(x) * g(x)$ with $f: V \to \mathbb{R}$ and $g: V\to W$.
• The sum of a set of differentiable functions $X \to Y$, with the set of functions depending on $x : X$.

From the above, we should be able to build:

• Islands: Recursive definitions in terms of the function itself - more precisely a sequence of functions at different depths (in a more rational form).
• Multiple islands: A recurrence corresponding to a lot of island terms.

### To Do:

• Clean up the present ad hoc constructors replacing them by combinators.

## Islands

• After adding an island, we get a family of differentiable functions $f_d$ indexed by depth.
• This can be simply viewed as a recursive definition.
• Given a function $g$, we get a function at depth $d$ by iterating $2^d$ times by repeated squaring.

• Determined by:

• An initial family of differentiable functions $g_d$, corresponding to iterating $2^d$ times.
• A transformation on differentiable functions $f \mapsto L(f)$.
• Recurrence relation:

• $f_d = g_d + L(f_{d-1})$, for $d \geq 0$, with $f_{-1} = 0$.
• Here $L(f) = i \circ f \circ j$.
• When $d=0$, the recurrence relation just gives $id = id$.
• When $d=1$, we get $f_1 = g_1 + L(id)$.

## Multiple Islands

• We have a set of islands associated to a collection of objects, with a differentiable function associated to each.
• We can assume the set to be larger, with functions for the rest being zero. So the set is really an a priori bound.
• We get a recurrence relation:
• $f_d(\cdot) = g_d(\cdot) + \sum_v i_v(\cdot) * L_v(f_{d-1}(\cdot)).$
• Here $i_v$ is evaluation at $v$, for a pair of finite distributions.
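The multiple-islands recurrence can be sketched numerically, with $i_v$ evaluating a distribution (a dict) at $v$. The function names and the untyped treatment of values are illustrative assumptions.

```python
# Sketch of f_d(x) = g_d(x) + sum_v i_v(x) * L_v(f_{d-1}(x)), where i_v(x)
# is the weight of v in the distribution x, and the island terms vanish
# for d <= 0.

def multi_island(g, islands, d, x):
    """g: (depth, distribution) -> value; islands: dict v -> L_v,
    each L_v acting on values; x: a finite distribution as a dict."""
    base = g(d, x)
    if d <= 0:
        return base
    prev = multi_island(g, islands, d - 1, x)
    return base + sum(x.get(v, 0.0) * L(prev) for v, L in islands.items())
```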

# Quasi-literate Programming

This blog will now have various comments, often technical, concerning the issues I face while working, mainly programming for Automated theorem proving.