- algorithms for category theory,
- data structures for category theory,
- applied category theory (ACT),
- building the ACT community,
- and open problems in the field.

Below, I’ve written up a partial summary of these discussions.

**Discussion 1 (Ryan): Ryan has a wish list of algorithms.**

Ryan wants pullbacks and pushouts of schemas (for pullbacks you currently have to finitize to get an algorithm), and a way of computing right Kan extensions (Carmody and Walters have a semi-algorithm for computing left Kan extensions; see also Knuth–Bendix completion).

**Discussion 2 (Spencer): how do we represent a category as a data structure?**

Decision: parsimony vs precomputation. Common representation as the intersection of UI, Algorithms, and Translation areas.

- Conclusion: XML
- Ryan: currently FQL implements categories as a Java class
- Bob: embedded dependency / lifting problem
- Spencer: using inductive types?
- Mike: want lazy evaluation
- David: use string diagrams (as geometric basis?)
- Sub: need MATLAB as funding
- Peter: compare to DBMS metastasis, e.g. MongoDB, NoSQL, wide databases, etc. The big question is **interoperability**

**Discussion 3 (Sub): how do we apply category theory to actual applications?**

In Sub’s experience, the majority of the problems take place at the interface between different departments, due to differences in assumptions and meanings. To solve them, we need mappings between decision-makers, the system itself, and records/models of the system. (A conceptual model may have 600 parameters, which gets passed to several analytic tools, etc.) The problem is that models may be iterated in response to changes in the spec. With reference to engineering’s models: a *context*, for an engineering model, is the layers and layers of previous and existing models, and we want to keep track of all of these. How much generic support can CT provide to engineers? (Other standards that we may want to augment or replace: the S88, S91 specification standards; UML; SysML?)

**Discussion 4 (David): how do we sustain a community effort?**

There was a workshop on applied category theory in ’96 featuring Bob Rosebrugh and Mike Johnson, but little happened after it. How do we make sure we keep up momentum? Ideas: a list of collaborations, hiring and working with each other’s graduate students, an annual conference, publications, applying for large funding programs together, more quantitative data of CCT for the sake of getting funding, coming up with a new brand and marketing strategy (new name? “functional” instead of “categorical”?), a categorist’s temp agency, a blog / mailing list, a web clearinghouse for applied category theory, a 3-year roadmap, a list of open math problems.

Vision: integrate information/knowledge and models. A user-friendly formal specification. A unifying framework for science and engineering. A tool for synthesis, a “theory of anything”.

**Discussion 5 (Everyone): open problems in the field.**

- How do we model the grouping and aggregation database operations in category theory?
- How can CT handle “fudge” and approximate data? For example, the idea that a line and “nearly a line” are close to the same object.
- How can we model machine learning in CT?
- How to model serializability (of schema functors) in CT?
- What is an end-to-end model of the data pipeline in CT, from data sources to outputs to decision-makers?

This year’s theme is “All The World’s Futures”. What a hopeful title. The future is associated with kids (America), robots (Belgium), and chrome skinsuits (South Korea), so all the futures can only mean all the kids, all the robots, and all the metal eyeshadow. The theme, whatever it’s supposed to mean, functions as a trick. The more you stare at some icon of the future hanging or projected on the beige prop wall, the more you feel like you are being dragged relentlessly into some regressive French movie about the American 80’s, like you have opened the door to a dour, fat-faced salesman trying to sell you on the next new gospel. The future has never seemed so off-kilter, so imbecile.

Art, like bread, is meant to be consumed. Art fills you, it soaks up excess alcohol, and it makes you sick if you consume too much. The Biennale is a feast. You walk around, and there are Adrian Pipers to enjoy, Young British Artists to mock, an epically boring live reading of *Das Kapital*, tourists going on benders with selfie-sticks, Venetians glowering in the backlight. Some of the pavilions were atrocious; some of them were sublime. The German pavilion was a hot mess of hipster lawn art and commercials for video games I would never play. At the Japanese pavilion, Chiharu Shiota somehow both submerged and elevated the entire exhibit under a skein of red thread, keys, and sunken boats, creating another horizon where heaven meets the sea. The Norwegian pavilion was anomalous, architectural, modern, and striking. The French pavilion, like the Belgian, has robots. You can walk through the entire exhibition hall, from the Giardini through the Arsenale, and find good art, bad art, blameable art, art which is forgivable because it is, after all, only art. What you will not find is art that gives you hope for the future. Art—fine art, collected art, curated art—does not belong to the future, at least not as robots belong to the future.

*Art is a sideshow to progress*.

(to be continued…)

| Representation of corpus | How to learn / compute |
| --- | --- |
| n-grams | string matching |
| word embeddings | standard algorithms, e.g. neural nets |
| presyntactic category | computational category theory |
| bar construction | Koszul dual |
| formal concept lattice | TBD |

There are some very nice connections between all five representations. In a way, they’re all struggling to get away from the raw, syntactic, “1-dimensional” data of word co-location to something higher-order, something *semantic*. (For example, “royal + woman = queen” is semantic; “royal + woman = royal woman” is not.) I’d like to tell a bit of that story here.

Consider the following 35-word corpus.

>>> let T = "Do you like green eggs and ham? I would not like them here or there. I would not like them anywhere. I do not like them, Sam-I-am. I do not like green eggs and ham!"

This is a text string. In a standard natural language processing (NLP) scenario, we begin with a nice parser, which when applied on the above gives us a sequence of words

>>> let words = nice_parse(T) -- e.g. words(T) in Haskell or T.split() in Python
>>> words
['do', 'you', 'like', 'green', 'eggs', 'and', 'ham', ...]
>>> let dict = unique_words(words)
>>> dict
['do', 'you', 'like', 'green', 'eggs', 'and', 'ham', 'I', 'would', 'not', 'them', 'here', 'or', 'there', 'anywhere', 'Sam-I-am']
>>> let l = size(dict)

For now let’s abstract away from the parsing / word segmentation problem (which can be nontrivial, especially in languages like Chinese or Korean) and assume we’re given such a list of words. We can then compute a list of 2-gram probabilities like P(‘like’ | ‘you’) = .2 and P(‘like’ | ‘Sam-I-am’) = 0 in the usual way. So to every unique word in my dictionary (of size *l*), I associate a vector of probabilities of length *l* whose entries sum to 1.

>>> x = dict[2]
>>> x
'like'
>>> x' = some_n_gram_embedding(word='like', n=2, text=words, vocabulary=dict)
>>> zip(dict, x')
[('do', 0), ('you', .2), ('like', 0), ('green', 0), ...]

In an *n*-gram model, I would associate to each (n−1)-word context a vector of probabilities of length *l* whose entries sum to 1, so the final output of the algorithm is an $l^{n-1} \times l$ matrix. Note that there is a Markov assumption at work here—one expects that the probability of the next word relies on every word that went before it in its sentence (or in its entire paragraph), but we suppose that P(‘green’ | ‘I do not like’) ≈ P(‘green’ | ‘not like’) ≈ P(‘green’ | ‘like’).

Significant variations of *n*-grams exist, for example smoothing (words that never show up in the corpus are modified to have non-zero probability) or skip-grams (you can skip words when determining the *n*-gram probability). Most of the legwork of computing these probabilities comes down to *string matching*, i.e. counting occurrences of similar strings in a large text.

Once learned, I can apply my *n*-gram model to a range of tasks, for example testing whether a sentence is grammatically correct (‘I would not like them’ vs ‘I would green like them’) or predicting the next word in a sentence (‘I would not like …’).
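To make the counting concrete, here is a minimal executable sketch of a bigram model over the corpus above (with punctuation stripped; the helpers `bigram_probs` and `score` are my own names, not from any library):

```python
from collections import Counter, defaultdict

T = ("do you like green eggs and ham I would not like them here or there "
     "I would not like them anywhere I do not like them Sam-I-am "
     "I do not like green eggs and ham")
words = T.split()

def bigram_probs(words):
    """Bigram probabilities P(v | u) by plain counting ("string matching")."""
    pair_counts = Counter(zip(words, words[1:]))
    unigram_counts = Counter(words[:-1])
    probs = defaultdict(float)  # unseen bigrams default to probability 0
    for (u, v), c in pair_counts.items():
        probs[(u, v)] = c / unigram_counts[u]
    return probs

P = bigram_probs(words)
# 'not' is followed by 'like' every time it occurs in this corpus
print(P[('not', 'like')])  # 1.0

def score(sentence, P):
    """Product of bigram probabilities; 0 if any bigram never occurs."""
    ws = sentence.split()
    s = 1.0
    for u, v in zip(ws, ws[1:]):
        s *= P[(u, v)]
    return s

print(score('I would not like them', P) > 0)   # grammatical: True
print(score('I would green like them', P) > 0) # ungrammatical: False
```

With smoothing, the unseen bigrams would instead get a small non-zero probability rather than zeroing out the whole sentence score.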

*n*-grams are a particularly simple kind of vector space representation for text strings (reminiscent of ‘sliding windows’ in time-series analysis) where we represent every word as a vector of probabilities. We can consider more general embeddings of text strings into a vector space, whose value is often self-evident.

I can then further extract clusters of words using the ambient metric in $\mathbb{R}^d$. (Where did the metric or the Euclidean structure come from? Answer: we chose $\mathbb{R}^d$ because it was a convenient template for clustering, but in principle we could use other mathematical objects as templates.) Alternately, one can compute a list of the nearest neighbors to a word.

The examples above are borrowed from Chris Olah’s blog, which has a good introduction to word embeddings in the context of neural networks. In such examples, the topology of the embedding space can tell us something about the *meaning* of the strings. For more detail, consider Mnih and Kavukcuoglu (2013).
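As a toy illustration of a nearest-neighbors query, here is a cosine-similarity computation; the 3-dimensional vectors below are invented for illustration, since real embeddings are learned and live in a higher-dimensional $\mathbb{R}^d$:

```python
import math

# Made-up toy embedding; in practice W is learned from a corpus.
W = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.2, 0.1],
    'man':   [0.1, 0.8, 0.2],
    'woman': [0.1, 0.2, 0.2],
    'ham':   [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word, k=2):
    """The k words whose vectors are closest to `word` in cosine similarity."""
    others = [w for w in W if w != word]
    return sorted(others, key=lambda w: -cosine(W[word], W[w]))[:k]

print(nearest('king'))
```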

Computing these embeddings can be a bit tricky. For now let’s restrict ourselves to strings of single words and consider word embeddings $W$. One way of computing $W$ is through a neural network, as in Bengio et al.’s original neural probabilistic language model, which trains the word embedding simultaneously with a standard *n*-gram model, e.g. by taking the entries of a projection matrix as the weights of a projection layer in a neural network. The network is fed through another hidden layer and then trained in the standard way on a corpus, using backpropagation and SGD.

On our little corpus, this looks something like

-- define parameters
>>> let n = 2  -- e.g. 2-grams
>>> let d = 10 -- d is the dimension of the embedding space
>>> let h = 5  -- number of neurons in the hidden layer of the neural network

-- set up the neural network (this is based on the PyBrain syntax)
>>> let net = FeedForwardNetwork()
>>> net.addInput(input_layer)           -- inputs are "windows" of n words
>>> net.addLayer(projection_layer(n,d)) -- weights of the word embedding
>>> net.addLayer(hidden_layer(h))       -- stores reaction to proj. layer
>>> net.addOutput(output_layer)         -- a softmax layer gives the likelihood
>>> net.addConnection(full, input_layer, projection_layer)
>>> net.addConnection(full, projection_layer, hidden_layer)
>>> net.addConnection(full, hidden_layer, output_layer)

-- prep the inputs
>>> let ngram n xs = take n xs : ngram n (tail xs)
>>> bigram_T = ngram 2 words
>>> bigram_T
[['do', 'you'], ['you', 'like'], ['like', 'green'], ['green', 'eggs'], ...]

-- train the neural network
>>> net.train(bigram_T)

-- visualize the model in two dimensions
>>> for word in dict: project_2d(net.activate(word))
[FIGURE]

In the king-queen example, Mikolov et al. mention how word embeddings learned by Bengio’s neural network model satisfy approximate linear equations like W(king) – W(man) ≈ W(queen) – W(woman).

Mikolov’s paper focuses on how to modify the neural network model in order to improve the accuracy of these linear analogies, but notably, Bengio’s original paper doesn’t discuss these relations at all—it introduces word embeddings as a way of doing dimensionality reduction on the space of possible *n*-grams. The analogies and “quadratic” relations seem to come for free whenever we use a neural network!

In principle, there should be other ways of capturing and making use of this “free structure”. I.e. I don’t think that the use of neural networks is special. There is probably some elemental (geometric?) fact about the data and/or natural language that the neural network is picking up on.

So we would like a better theory for why and when these quadratic relations show up. Such a theory should help us solve problems like how to obtain cubic relations, e.g. analogies between analogies such as [king – man = queen – woman] = [niece – girl = nephew – boy].

Back in 2014, Misha Gromov (personal communication, May 8, 2014) introduced the idea of the *presyntactic category* of a given text $T$. We regard $T$ as a string, where strings come with notions of *isomorphism* (when two strings in different locations share the same characters), *equality* (when two strings share the same location and characters), and *string inclusion* (when one string occurs as a substring of another). The presyntactic category of $T$ is then the category with objects $T$ and all of its substrings, and morphisms the string inclusions.

Naively, we can represent the category as two lists. (There are more sophisticated representations, e.g. most typeclasses in a functional programming language are, in fact, categories.)

>>> let T = "do you like green eggs and ham"
>>> Ob(T)
['do you like green eggs and ham', 'o you like green eggs and ham', ..., 'a', 'n', 'd', 'h', 'a', 'm']
>>> Mor(T)
[('d', 'do'), ('d', 'do '), ('d', 'do y'), ..., ('o you like green eggs and ham', 'do you like green eggs and ham')]
-- note, we can allow for "approximate" morphisms that allow spaces between characters, e.g. ('gen', 'green').
-- This is more obviously useful when we restrict Ob(T) to include only words, e.g. so we can draw arrows
-- like ('fast car', 'fast red car').

Gromov introduced presyntactic categories in order to study the algebraic / combinatorial properties of language. For example, one can imagine a subcategory which is *minimal* with respect to $T$, in the sense that it is the smallest subcategory which generates $T$ “as a monoid”, i.e. under all compositions of its strings (satisfying identity and associativity). This minimal subcategory will usually be some subcategory of *words* in the text $T$.

Despite the somewhat unfamiliar setting, one can do all the usual *n*-gram analysis in the presyntactic category: we have the same probability measure on bigrams, given by $P(u' \mid u) = N_T(u, u') / N_T(u)$, where $N_T(u, u')$ denotes the occurrences of the bigram $(u, u')$ in $T$. Note that $N_T(u)$ can be computed directly by counting the morphisms from the word *u* in the presyntactic category.
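Concretely, here is a quick sketch of the bigram counts by direct counting (the names `N_T` and `N` follow the pseudocode later in this post; counting morphisms between words in the presyntactic category would give the same numbers):

```python
T = "do you like green eggs and ham I would like them here or there"
words = T.split()

def N_T(u, v):
    """Number of times word u is immediately followed by word v in T."""
    return sum(1 for a, b in zip(words, words[1:]) if (a, b) == (u, v))

def N(u):
    """Occurrences of the word u in T."""
    return words.count(u)

print(N_T('like', 'green'), N('like'))   # counts: 1 and 2
print(N_T('like', 'green') / N('like'))  # P('green' | 'like') = 0.5
```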

The payoff for all this comes when we want to classify similar words. A pair of words is *similar*, written u ~ v, when the two words “similarly interact”, respectively, with words which are already similar. “Similarly interact” can be understood in the *n*-gram way as having high bigram counts or probabilities with the same neighboring words. For example, the word ‘bat’ is similar to the word ‘bird’ because they often show up next to the word ‘fly’, which is similar to itself. Or ‘hammock’ is similar to ‘bed’ since ‘hammock’ shows up with ‘nap’, which is similar to ‘sleep’.

>>> let T = "do you like green eggs and ham I would like them here or there"

-- define "similarly interacts"
>>> interacts :: String -> String -> Bool
>>> interacts u u' = N_T(u, u') > 0

-- define the similarity relation ~ recursively, based on interacts
>>> (~) :: String -> String -> Bool
>>> v ~ v = true  -- note, Gromov actually discounts self-similarity
>>> u ~ v = exists u' v' such that interacts(u,u') && interacts(v,v') && u' ~ v'

How do we compute or “learn” similarity relations on a text corpus? A collection of similar words is called a *cluster*, and finding such collections in a corpus is called clustering. *Finding clusters of words is a lot like learning a word embedding*: the difference is that instead of embedding words into a (finite) metric space, from which we derive clusters, we are embedding the words directly into a set or category of clusters. The underlying “facts” of clustering are the same interacting pairs as they are in a word embedding problem. We then compute the similarity relations (which in some sense “lie over” the interaction data, just as the presyntactic category lies over the bare corpus) through a recursive process. This process, in the case of pairs of words (u, u’) and (v, v’), looks a lot like biclustering, since we are determining similarity simultaneously on two axes: formation of a cluster {u, v} and formation of a cluster {u’, v’}.
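As a rough executable rendering of this recursive process (my own naive version, not Gromov’s definition verbatim, over an invented mini-corpus): seed with self-similarity and iterate to a fixed point.

```python
T = ("the bat can fly the bird can fly "
     "a hammock for nap a bed for sleep the nap is like sleep")
words = T.split()
bigrams = set(zip(words, words[1:]))

def interacts(u, v):
    """Symmetric 'shows up next to' relation, read off the bigrams."""
    return (u, v) in bigrams or (v, u) in bigrams

vocab = sorted(set(words))
sim = {(w, w) for w in vocab}  # seed: every word is similar to itself
changed = True
while changed:                 # iterate to a fixed point
    changed = False
    for u in vocab:
        for v in vocab:
            if (u, v) in sim:
                continue
            # u ~ v if they interact with an already-similar pair (a, b)
            if any(interacts(u, a) and interacts(v, b) for (a, b) in sim):
                sim.add((u, v))
                sim.add((v, u))
                changed = True

print(('bat', 'bird') in sim)      # True: both interact with 'can'
print(('hammock', 'bed') in sim)   # True: both interact with 'for'
```

Note how quickly this naive version over-generates: any two words sharing a neighbor become similar on the first pass, which is one reason the choice of interaction relation matters so much.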

Similarly, the linear relation we saw between ‘king’ and ‘man’ and ‘queen’ and ‘woman’ can be read as a special kind of similarity relation. ‘King’ is similar to ‘man’ because ‘king’ interacts with ‘queen’, which is similar to ‘woman’. Of course there is some more structure to the relation. One way to interpret the “linear” structure we found in the word embedding is to say that ‘king’ interacts with ‘queen’ *in the same way* that ‘man’ interacts with ‘woman’:

<king, queen> ~ <man, woman>

Gromov might say that the interaction relations here have the same “color”, in the sense of a colored operad. Certainly there are many different “colors” of interaction relations: we can define a “nearby” *n*-gram relation, a “frequently nearby” relation, a “far from” relation, a noun-verb relation, a short-long relation, any sort of grammatical relation, and so on.

Let’s understand the problem: there are many different ways to define “similarly interact”, and not only may the different interaction relations be themselves related, but changing from one to the other will also change the computational structure of the clustering problem on the corpus. For example, the similarity relation ~ is not always transitive if we use an *n*-gram interpretation.

(Originally I was going to write a whole section here on functorial clustering and how that might help us do analysis of algorithms while varying the interaction relations. Instead, I’ll put those thoughts into another post (draft). Addendum: functorial clustering covers hierarchical clustering, e.g. algorithms where I form one set of clusters, then combine clusters to form a smaller set of larger clusters.)

I like colored operads for different reasons, but I’m not sure about using colors here. Instead, I’d prefer to think that any similarity relation between two edges, <king, queen> and <man, woman>, should be defined in the same, recursive way that we defined similarity between two words: <king, queen> ~ <man, woman> just whenever <king, queen> interacts with some word/relation *u* and <man, woman> interacts with some word/relation *v* such that *u ~ v*. I wonder: can similarity be thought of as an analogue to homotopy for finite sets? (Wow, I just googled this question and it could lead to some pretty interesting topology!) Maybe this formulation can help clarify some of our assumptions about higher-order relations.

Clustering can be used to simplify languages (e.g. Gromov thinks of clustering as making a “dictionary”), but it can also be seen as a way of enriching language by “continuing” the context of a word, i.e. the space of other words it could *potentially* interact with in the text. A really simple way of continuing a context is by varying or adding new interaction conditions (e.g. from 2-grams to 3-grams), and functorial clustering suggests a way to control that process. But when we ask about higher-order relations like “[king – man = queen – woman] = [niece – girl = nephew – boy]” (if king – man = queen – woman is a second-order relation, this would be a third-order relation), it seems like we are really asking for a way of continuing the context of a word through the set of its associated clusters.

These higher-order relations aren’t just cute, isolated artifacts. Consider a metaphor: “the time slipped through her fingers.” This requires not only recognizing that time ~ sand, but that time *can* slip, i.e. that ‘time’ interacts with ‘slipped’ similarly to how ‘sand’ interacts with ‘slipped’. We might write this as <time, slipped> ~ <sand, slipped>, where <n, v> represents here the interaction relation “follows after” (or perhaps some other interaction; the details may vary). With the metaphor established, it’s easier to imagine building yet more connections going the opposite way, perhaps <ticks of, time> ~ <ticks of, sand>. Poetry is replete with these metaphors, and they are pretty much endemic to language.

Just to bring it back to word embedding for a second: the reason we like word embeddings is because we can translate the poorly-organized data of word co-location into the topology of a vector space. Now we want to translate the data of word co-location (among other interaction data) directly into a cluster of “similar” objects. Are presyntactic categories the right generalization of word embedding? If so, do they suggest any practical algorithms for computing these higher-order clusters? I don’t know the answer to that question yet, but let’s try playing around with some algebraic constructions before circling back to the categorical formalism.

In homological algebra, the bar construction is a canonical way of getting a simplicial complex out of an algebra $A$ by considering all the actions of $A$ on itself. It underlies many classical constructions in algebraic topology (e.g. Ext, Tor, homotopy colimits, classifying spaces, realizations, etc.). Personally, I like Baez’s description: “The bar construction ‘puffs up’ any algebraic gadget, replacing equations by edges, syzygies by triangles and so on, with the result being a simplicial object with one contractible component for each element of the original gadget.”

As we noted before while talking about how to generate the text “as a monoid”, an object in the presyntactic category can be understood as a very simple algebraic gadget assembled from all its substrings. In fact, each string can be represented as an associative algebra. The bar construction is the obvious way to extract the “qualitative” information from this gadget. In particular, the bar complex of a text $T$, viewed as an associative algebra, will allow us to compute the *cohomology* of the text as an algebra. The cohomology captures all the words and phrases which do *not* appear in the text.

According to Yiannis (joint with Maxim Kontsevich, personal communication), for any given text $T$, we can associate a unital, associative algebra $A_T$, which we define as the free algebra generated by all substrings of $T$ modulo the ideal of all the impossible strings (e.g. ‘greeSam’ is modded out in our example). Given substrings $u, v$ in the text, we define $u \cdot v = uv$ if the concatenated string exists in the text, and $u \cdot v = 0$ if not. So $A_T$ has as generators the substrings of $T$ (the same as the objects of the presyntactic category), and as a basis the words of $T$. Given such a basis, we can make $A_T$ a graded algebra where the degree of a given substring is the number of words it contains.

The bar construction associates to the algebra $A_T$ a coalgebra (graded as before) whose elements are all the strings in $T$. The comultiplication $\Delta$ is defined by splitting a string into all pairs of substrings whose formal concatenation recovers it.

>>> let T = "do you like green eggs and ham"
>>> Delta(T)
[('d', 'o you like green eggs and ham'),
 ('do', ' you like green eggs and ham'),
 ...
 ('do you like green eggs and ha', 'm')]
>>> Delta(Delta(T))
[[(0, ('o', ' you like green eggs and ham')),
  (0, ('o ', 'you like green eggs and ham')),
  ...
  (0, ('o you like green eggs and ha', 'm'))],
 [(('d', 'o'), (' you like green eggs and ham')),
  ...
  (('d', 'o'), (' you like green eggs and ha', 'm'))],
 ...]
-- on singletons 'x', we set the differential d('x') = 0 = ''

Intuitively, the comultiplication captures all the different ways in which I can *split up* a given string *s* in the text. The coalgebra structure is generated by applying $\Delta$ iteratively at each level, with the bar differential defined by the alternating sum of concatenations of adjacent factors. Geometrically, text strings correspond to paths, string concatenation to path composition, and the “context” of a string to an *open neighborhood* of a path. The idea is that every non-contractible *n*-simplex in the bar complex will stand for some concatenation of words which is an actual sentence in $T$. The cohomology of the complex will detect when adding or removing a word takes a sentence out of the text $T$, i.e. when a word is used out-of-context.
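Under the description above, one plausible way to write the comultiplication and bar differential symbolically (this is my reconstruction; signs follow one common convention):

```latex
% Deconcatenation comultiplication: all ways to split a string s
\Delta(s) \;=\; \sum_{s = s_1 s_2} s_1 \otimes s_2
% Bar differential: alternating sum of concatenations of adjacent factors,
% with d('x') = 0 on singletons, as in the code above
d(s_1 \otimes \cdots \otimes s_n) \;=\; \sum_{i=1}^{n-1} (-1)^i \,
  s_1 \otimes \cdots \otimes s_i s_{i+1} \otimes \cdots \otimes s_n
```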

Obviously the bar construction can become quite unwieldy for even a medium-sized string. Remember, however, that the point of computing the bar construction is to extract the cohomology. Can we “preprocess” the complex in order to get a more computable representation?

% [read up on koszul complex and finish this section later.]

Recently, Yiannis brought up the example of *formal concept analysis*, and he hypothesized that one could get the same results out of formal concept analysis that one gets from neural networks. Formal concept analysis is a graph-based method for analyzing data—one puts in some labeled data, and the algorithm spits out a concept graph (in fact, a lattice!) of “formal concepts”, which are clusters of data with similar labels. Besides the lattice structure, formal concept lattices are also much easier for a human to interpret than the layers and weights of a neural network.

A *formal context* is a triple (X, Y, J) of *objects* X, *attributes* Y, and a *correspondence* $J \subseteq X \times Y$ that tells us which objects have which attributes. One can visualize the correspondence as a binary matrix with objects denoted by rows and attributes denoted by columns.

There is a natural notion of clustering in formal concept analysis. First, we define two operators on subsets of X and Y respectively: the *intent* of a set of objects $A \subseteq X$, given by $A' = \{ y \in Y : (x, y) \in J \text{ for all } x \in A \}$, and the *extent* of a set of attributes $B \subseteq Y$, given by $B' = \{ x \in X : (x, y) \in J \text{ for all } y \in B \}$.

In other words, $A'$ represents all the attributes shared by every object in A, while $B'$ represents all the objects which satisfy every attribute in B.

In that case a *formal concept* is a pair $(A, B)$, where $A \subseteq X$ and $B \subseteq Y$, such that $A' = B$ and $B' = A$. Alternatively, we say that a subset $A \subseteq X$ is (the extent of) a formal concept just when $A'' = A$, i.e. when A is a *fixed point* of the composition of the two operators. In the matrix representation of $J$, each formal concept corresponds to a maximal block of rows and columns which cannot be expanded by either row or column interchanges.
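Here is a small self-contained example of a formal context and a brute-force enumeration of its formal concepts as fixed points $A'' = A$ (the objects, attributes, and correspondence are invented for illustration):

```python
from itertools import chain, combinations

X = ['duck', 'owl', 'frog']
Y = ['flies', 'swims', 'nocturnal']
J = {('duck', 'flies'), ('duck', 'swims'),
     ('owl', 'flies'), ('owl', 'nocturnal'),
     ('frog', 'swims')}

def intent(A):
    """Attributes shared by every object in A."""
    return frozenset(y for y in Y if all((x, y) in J for x in A))

def extent(B):
    """Objects having every attribute in B."""
    return frozenset(x for x in X if all((x, y) in J for y in B))

def powerset(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Closing any A under extent(intent(-)) yields a formal concept (A'', A').
concepts = {(extent(intent(A)), intent(A)) for A in map(frozenset, powerset(X))}
for A, B in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(A), sorted(B))
```

The six concepts printed here, ordered by inclusion of extents, form exactly the concept lattice of this little context. (Brute force is exponential; real algorithms like NextClosure do much better.)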

In principle this looks a lot like biclustering—which we saw before in the context of Gromov’s presyntactic categories—since we are also constructing maximal, contiguous blocks of similar values. The additional piece here is computing the entire *lattice* of formal concepts. (Actually, computing the join and meet of formal concepts feels sort of like computing data migrations for a functor between schemas. To see an example of these on instances, look at the FQL manual.) There are many algorithms for computing concept lattices, but I’m not sure if any of them are explicitly related to biclustering.

% is computing higher-order relations the same thing as computing the concept lattice?

Now we’ve come full circle. We started with an observation: neural networks can learn analogies like “W(king) – W(man) = W(queen) – W(woman)” as part of a good word embedding W. We then asked two questions: what’s the right way to represent these relations outside of a neural network, and how do we learn even more complicated relations? In Gromov’s theory, analogies such as the above are modeled by a recursively-defined similarity relation, man ~ king and woman ~ queen, and the word embedding is modeled by a map to a set of clusters (while glossing over how to write the embedding as an actual functor). We discussed how to capture the extra structure of the word embedding using a higher-order relation between relations like <man, king> ~ <woman, queen>. We then re-told this story from another perspective—via algebra / the bar construction—and saw the connection to cohomology. Finally, we looked at formal concept analysis and tried to understand how the lattice structure relates to “typical” clustering on the corpus, which led us back to neural networks, by Yiannis’s conjecture!

I wish I could give a neater summary, but I’m not sure yet how to interpret all this—some of it feels like it should be old hat to people who do neural networks, and some of it feels wildly speculative. The strength of category theory is its ability to make connections between far-ranging bodies of knowledge, but I’m sure that I’m still using it in a very naive way here. I’ll update my analysis as I get more data; for example, Chris Olah recently posted some speculation connecting neural networks to functional programming, which has a well-known categorical semantics. I have some thoughts I plan to post later about applying topos theory to the problem, because in many ways it feels like the same concerns about interaction conditions—which are akin to “coherent” guides to writing “local” systems of data into the morphisms of a (postsyntactic?) category—are reflected in the way we define sheaves and Grothendieck topologies on categories.

In the meantime, I hope this was a moderately useful illustration of some very different approaches to analyzing natural language texts. If you have any feedback, please let me know in the comments.

I’d like to formalize subsumption using operads.

*But why would anyone want to formalize a derelict robotics architecture with high-falutin’ mathematics?*

The answer is simple. It’s not that subsumption on its own is important (though it is) or that it requires formalizing (though it does). What I’d really like to understand is how *operads give domain-specific languages* (and probably much more) and whether categories are the *right* way to pose problems that involve combining and stacking many such DSLs—think of a robot that can move, plan, and learn all at the same time—which, for lack of a better term, I will call *hard integration problems*.

(The rest of this post is currently in process! I will come back throughout the fall and update it.)

The psychologist Robert Sternberg once said of wisdom: “practical wisdom is domain-specific […] but wisdom can exist on a general level as knowing the limits of this domain-specific wisdom.” But how can we study the limits of this domain-specific wisdom?

I think we can start by formalizing what we mean by *domain-specific knowledge*. I’d like to focus on something really simple and dumb: the idea of “domain-specific” is that there is a *domain*: a closed-off, special area where some kind of unique logic or principle applies. The operad gives the architecture of that domain—a suite of tools for thinking about and comparing the particular *algebras over the operad*, i.e. the particular examples of objects in that domain.

Example: matrix multiplication vs tensor product vs using a simple 3-box wiring diagram.

Example: a man walking through a series of rooms: this is an algebra over the same operad!

Example: context-free grammars, e.g. postal address = {name = {first, m, last}, address, zip}. This is a “free operad”.

Example: Matriarch at MIT.

Example: a diagram in the operad gives morphisms in a cospan category (e.g. of finite sets). Given an algebra Rel : U → Set, a box with ports S can be understood as a relation $R \subseteq \mathbb{R}^S$, or a “table of real numbers with S-many columns.”
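A toy rendering of that last example, with relations stored as “tables” of rows keyed by wire names, and composition along a shared wire implemented as a relational join (all names and values here are my own illustration):

```python
# Two "boxes": relations over wire sets S1 = {x, y} and S2 = {y, z}.
S1 = ('x', 'y')
R1 = {(1.0, 2.0), (3.0, 4.0)}                 # R1 as a table with columns x, y
S2 = ('y', 'z')
R2 = {(2.0, 5.0), (4.0, 6.0), (2.0, 7.0)}     # R2 as a table with columns y, z

def join(R1, S1, R2, S2):
    """Glue two relations along their shared column names (here just 'y')."""
    shared = [c for c in S1 if c in S2]
    out_cols = S1 + tuple(c for c in S2 if c not in S1)
    rows = set()
    for r1 in R1:
        for r2 in R2:
            d1, d2 = dict(zip(S1, r1)), dict(zip(S2, r2))
            if all(d1[c] == d2[c] for c in shared):
                rows.add(tuple({**d1, **d2}[c] for c in out_cols))
    return out_cols, rows

cols, R = join(R1, S1, R2, S2)
print(cols, sorted(R))  # the composite relation over wires x, y, z
```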

An *easy integration problem* is…

% Operads were originally developed in algebraic topology by Peter May to express the natural, many-to-one operations in iterated loop spaces. [Do we need the history or is that a distraction?]

Why subsumption?

- Subsumption is distinguished for being one of the first *reactive* architectures. Reactive, e.g. “pre-wired”: sensors project directly to motors, without intervening abstraction.
- It’s a platform on which we can build more complicated faculties like planning, inference, and learning.
- Subsumption is cheap compared to deliberative, sequential, symbolic architectures like Sense-Plan-Act (Shakey).
- Distributed control leads to graceful degradation.
- The key to subsumption is arbitration: higher priority tasks, when flagged “true”, *subsume* lower priority tasks.

Why operads for subsumption?

- Better theoretical insight into subsumption and what it means to be “reactive”.
- Want new design tools for subsumption, that will help us handle the combinatorial explosion as we add behaviors.
- Use category theory to relate it to other methods in AI (e.g. symbolic logic, ML, KR), e.g. we want a good theory of how to “add in” subsumption to some other architecture when reactivity is needed.

What are operads?

- The tree operad
- May’s box operad
- Spivak’s wiring diagrams
- Despite the pictures, fundamentally an algebraic construct
- Define colored operads
- Intuition from Massey products and homotopical algebra, e.g. this survey

Construct an example of a subsumption diagram, e.g. the classic one for Genghis, as described here.

- Composed of augmented finite state machines (augmented with timer, so there needs to be a notion of time or at least recurrence—see ideas from operads for functional reactive programming, and FRP + robots.)
- Each AFSM has input signals and output signals. Like a neuron, it “activates” (i.e. sends an output signal) when a certain threshold is passed, given by the sum of input signals.
- Suppression gates (and task priority)
- Inhibition gates
- Code it up in a robot simulator, e.g. Simbad.
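Before reaching for a full simulator, the AFSM-plus-suppression idea above can be sketched in a few lines. This is a hedged toy, not Brooks’s actual implementation; the class and function names are hypothetical.

```python
class AFSM:
    """Fires (outputs 1.0) when the sum of its input signals passes a
    threshold, like the neuron-style activation described above."""

    def __init__(self, threshold):
        self.threshold = threshold

    def step(self, inputs):
        return 1.0 if sum(inputs) >= self.threshold else 0.0

def suppress(higher, lower):
    """Suppression gate: when the higher-priority signal is active, it
    replaces the lower-priority signal on the output wire."""
    return higher if higher != 0.0 else lower

# Two behaviors: wandering (low priority) and obstacle avoidance (high).
wander = AFSM(threshold=0.5)
avoid = AFSM(threshold=1.0)

# Obstacle nearby: avoidance fires and subsumes the wander output.
motor = suppress(avoid.step([0.4, 0.7]), wander.step([0.6]))
```

When no obstacle signal is present, `avoid` stays silent and the wander output passes through the gate unchanged; that pass-through is exactly the arbitration behavior the diagram encodes.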

Define the colored operad for this diagram. Actually, better yet, generate the operad in OPL.

Recently I met some people who have also been thinking about different variants of a “Wikipedia for math” and, more generally, about tools for mathematicians like visualizations, databases, and proof assistants. People are coming together; a context is emerging; it feels like the time is ripe for something good! So I thought I’d dust off my old notes and see if I can build some momentum around these ideas.

- In part 1, examples, examples, examples. I will discuss type systems for wikis, Florian Rabe’s “module system for mathematical theories”, Carette and O’Connor’s work on theory presentation combinators, and the pros and cons of a scalable “library” of mathematics.
- In part 2, I’d like to understand what kind of theoretical foundation would be needed for an attack on mathematical *pragmatics* (a.k.a. “real mathematics”) and check whether homotopy type theory could be a good candidate.
- In part 3, I will talk about *mathematical experiments* (everything we love about examples, done fancier!), their relationship with “data”, and what they can do for the working mathematician.

Let’s start with the math wikis: Wikipedia, PlanetMath, Wolfram MathWorld, the nLab, the Stacks Project, and to some degree PolyMath and MathOverflow. As for repositories of formal math, we have: the Mizar library, the Coq standard library, various libraries in Isabelle/HOL, and whatever content dictionaries happen to exist in OpenMath/OMDoc. We may also consider computer algebra systems such as Sage, Macaulay2, and MathScheme along with numerical solvers like MATLAB or Julia, and general software frameworks like Mathematica and Python’s Scipy.

Wiki-style reference sites are useful to most mathematicians. Numerical solvers are *very* useful to a significant subset of mathematicians. Formalized repositories are useless to almost all mathematicians, theoretical or applied. That’s a big problem, because almost every idea for applying computers to math depends on *some* degree of formalization of mathematics (specifically, of proofs and expressions). But formalization is hard, and tedious. Can we make something useful—whether a library of math or a proof assistant with human-style outputs—without requiring a full formalization of mathematics?

To reiterate, my goal here is to *build something useful for working mathematicians*.

Just to make things concrete, let’s consider the problem of *variable binding* over multiple documents. In the easy version, variable binding amounts to defining a type (e.g. a connected graph is a type, a boolean is a type) for every keyed variable, and making sure that different variables representing the same thing across different documents reflect the binding or instantiation given in any one of the documents; the displayed name of the variable (e.g. “G”) is one part of the instantiation.

For example, if a document consists of the statement, “G is a group with two elements g_1, g_2 and no relations,” then clicking on the word ‘group’ should bring up a new document where the name and type of the variables such as G, g_1, g_2 are preserved, as well as their relation (set-membership), e.g. “A group is a set G of elements closed under an operation satisfying three conditions: identity, associativity, and inverse.” Clicking on ‘set’ should bring up another document such as “A set G is a collection of elements satisfying the ZFC axioms.”

Note that we are *not* performing inference in the easy version of variable binding; we are (1) clarifying what it means for two variables to represent “the same thing” and (2) choosing a data structure to keep track of multiple variables across multiple documents. In the hard version (with inference), variable binding is another name for unification, which is intimately connected to automated reasoning in proof assistants (there’s also a broader literature on variable binding in other parts of AI, e.g. variable binding in neural networks).

% Note: every series of nested problems is ripe for abstraction…

There are two broad ways to solve the easy version of variable binding (without inference).

First and most obviously, we could use a namespace solution to scope over multiple documents, which would then be bound together as modules (e.g. “G” is the displayed name of a graph in the graph theory module, while “G” is the name of a group in the group theory module). The namespace data structure and its methods are built into the module; all documents in a module share a primary pool of names, each with a unique key. Different variables are not distinguished when they refer to the same name in this primary pool.

A second method, suggested by Trevor Dilley and David Spivak, encodes the naming convention directly through links between documents satisfying a few conditions. A link is a partial function between documents that maps source variables (in the source document) to target variables (in the target document). One thinks of links as potential instantiations. If we further allow links only if they are *onto* and only if all variables in both documents are well-typed, then we can regard any composition of linked modules as its own module. If we think of documents as tables in a database schema, then in database theory the onto condition is called *referential integrity*; the variables in the source document form the primary key and their targets in the target document form the foreign key. The well-typing condition is called *domain integrity*. The fact that each appearance of a variable is keyed uniquely corresponds to *entity integrity*.
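The three integrity conditions can be sketched in a few lines, under the assumption that a document is just a mapping from variable names to declared types and a link is a partial function between them. The two toy documents and the function names below are illustrative, not from any existing library.

```python
# A document maps variable names to their declared types.
doc_group = {"G": "group", "g1": "element", "g2": "element"}
doc_set = {"G": "set"}

# A link regarding the group G as (an instantiation of) the set G.
link = {"G": "G"}

def domain_integrity(link, source, target):
    """Every linked variable is declared (hence typed) in both documents."""
    return all(v in source and link[v] in target for v in link)

def referential_integrity(link, target):
    """The link is onto: every variable of the target document is hit."""
    return set(link.values()) == set(target)
```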

The two approaches are related—after all, practically the only distinction is whether we track variable appearances or variable names! In principle we could take every document to be its own module and reproduce the link-based method. Alternatively, thanks to referential integrity, *any* composition of linked documents can be regarded as belonging to the same module. But what happens if there are two links from competing source documents into the same target document? There the two approaches will differ.

A more formal way of understanding referential integrity, domain integrity, and entity integrity is to see them as the basis of an *adjunction* (in the sense of category theory) between link-based solutions and module-based solutions to variable binding. This is related to Spivak’s functorial data migration formalism, which regards mappings between database schemas as functors. Suppose we form a category whose objects are documents specified up to unique variable appearances (*not* variable names) and whose morphisms are links between documents satisfying referential integrity. We don’t distinguish documents aside from the appearance of variables in them. Then compare this to a category whose objects are modules—each given by a signature of variable names (*not* variable appearances) plus axioms defining properties of those names—and whose morphisms are inclusions between modules.

% It feels like we should want to distinguish documents even beyond the appearance of variables, e.g. by the “stuff” in the sentences that contextualize the variable appearances. What about using higher coherence conditions on links?

In the rest of this post, I’ll compare the module- and link-based approaches to building a mathematical “library” by studying two examples: (1) a module system for mathematics currently being developed in Bremen, Germany and (2) a hypothetical type system for mathematical papers.

One example of the module-based approach is Florian Rabe’s “module system for mathematical theories” (MMT). I first learned about MMT from Paul-Olivier Dehaye, who proposed it as something one could use to think about knowledge representation more generally.

Here’s MMT in Rabe’s own words:

> MMT is a generic, formal module system for mathematical knowledge. It is designed to be applicable to a large collection of declarative formal base languages, and all MMT notions are fully abstract in the choice of base language. MMT is designed to be applicable to all base languages based on theories [corresponding to “modules” as we have been using them, above]. Theories are the primary modules in the sense of Section 2. In the simplest case, they are defined by a set of typed symbols (the signature) and a set of axioms describing the properties of the symbols. A signature morphism σ from a theory S to a theory T translates or interprets the symbols of S in T.

It’s been a while since I did any model theory, but we can demystify the above description by reviewing MMT’s lineage through OMDoc, OpenMath, and MathML. These are all classes of simple, XML languages for mathematical expressions designed for the web.

Take a formula, for example b² − 4ac = 0. We can write it in MathML as

    <math xmlns='http://www.w3.org/1998/Math/MathML'>
      <mrow>
        <msup> <mi>b</mi> <mn>2</mn> </msup>
        <mo>-</mo>
        <mrow>
          <mn>4</mn> <mo>⁢</mo> <mi>ac</mi>
        </mrow>
      </mrow>
    </math>

and pass the formula to a browser for display. But the language doesn’t know what the symbols “mean”.

Take the same formula and rewrite it in OpenMath as

    <OMA>
      <OMS cd="relation1" name="eq"/>
      <OMA>
        <OMS cd="arith1" name="minus"/>
        <OMA>
          <OMS cd="arith1" name="power"/>
          <OMV name="b"/>
          <OMF dec="2"/>
        </OMA>
        <OMA>
          <OMS cd="arith1" name="times"/>
          <OMF dec="4"/>
          <OMV name="a"/>
          <OMV name="c"/>
        </OMA>
      </OMA>
      <OMF dec="0"/>
    </OMA>

Now we know that the symbols stand for variables (note the OMV tag), that 4 is a constant, and that = is a relation. Further, given another expression in the same theory or “content dictionary”, we know that b, a, and c refer to the same variables in both expressions.

The expression of b² − 4ac = 0 in OMDoc is the same as that in OpenMath above. OMDoc extends OpenMath’s content dictionaries (e.g. theories) so that we can talk about their interrelationships (e.g. group theory → monoid theory → magma theory).

OpenMath, OMDoc, and MMT are all managed by Kohlhase’s group in Bremen, which is now refocusing on MMT. Why? For one, the OpenMath and OMDoc representations do not have enough structure to support automated, “internal” reasoning of the kind that goes on in theorem provers and proof assistants like Isabelle, HOL, and Coq. MMT was designed to address this issue: its main contribution is a standardized, proof-theoretic semantics for communicating between different proof systems. When he talks about knowledge representation, Rabe is really talking about preserving this proof semantics across different modules, formal theories, and especially technical languages. Knowledge representation and integration becomes a matter of building interfaces between those formal theories and the primary MMT vocabulary, as evidenced by all the effort Rabe put into rounding up the disparate terminology for concepts like theories, functors, and views.

% Write up a from-the-ground-up example of what we mean by proof-theoretic semantics: an elementary vocabulary of theory, of implication, of proofs, of variables and quantifiers. Also of theory morphisms, views, and modules.

Assuming all the interfaces can be constructed, Rabe’s MMT gives us a unified interface that can relate formal expressions between mathematical theories, so that knowledge (formal expressions) in one theory becomes knowledge (formal expressions) in the other. Classically, one imagines the inclusion from the theory of monoids to the theory of groups.

% This is variable binding, writ large, at the module level. Make the connection, segue into Carette’s paper.

Carette and O’Connor discuss this structural redundancy in theories…

- A theory is just a context specified over some dependent type theory.
- We can define a category of contexts over a dependent type theory.
- We will use this category to distinguish different labelings of variables.
- A given labeling of the variables in a document D is called a *concept* on D.
- We will *not* regard labelings as equal under α-equivalence.

MMT reminds me of blackboard systems; these are architectures designed to allow efficient, simultaneous communication between independent modules (think camera modules, motor modules, reasoning modules in a robot) via message-passing over a shared, global “blackboard”. These systems are often programming-language independent. Here’s a summary. MMT feels like a marriage between such blackboard systems and a good old-fashioned package management system, all enriched (or should I say flattened?) over a nice, formal type system. Maybe that’s the secret sauce; I’m not sure. I used to like blackboard systems before I realized that, in practice, they suffered many of the same limitations as other logical methods in AI.

The question is whether most situations faced by a working mathematician support such a semantics. I am honestly not sure. Likely the situations do support a proof-theoretic semantics, but using MMT (or any formal system) requires that we abstract out too far from each situation for the semantics to be useful.

% For example?

So we are missing a picture of how these theories “face the world”. We don’t know how mathematicians *use* the expressions in these theories. A statement of group theory may show up in a geometry textbook, but *why* does this statement show up? If a given variable has multiple interpretations, then what is the *intended* interpretation? What is the *context of interpretation*? In natural language processing, the problem of guessing a statement’s intended meaning from its real-world (i.e. human) context is called *pragmatics*. In ontology matching, the problem is called *semiotic heterogeneity*. In interactive theorem proving and in mathematical software more generally, we might call the problem *UX*, for *user experience*.

In this AMS post, Pitman and Lynch report some findings from their recent “Committee on Planning a Global Library of the Mathematical Sciences”, which was hosted by the NSF in 2014. (The full report can be found here.) In another article, Tausczik et al. discuss mathematical collaboration with reference to MathOverflow.

[Insert discussion of what UX would mean, first for the example of a mathematical library. Consider Fateman’s article (see Section 3).]

Continued in part 2.

Persistent homology has been applied to neural data before, particularly by Singh et al. to model vision and by Dabaghian et al. to model the behavior of place cells. Place cells are phenomenal, highly-structured little cells in your hippocampus that fire whenever you find yourself in a particular, local “place”. If you’re in a room, there might be a place cell for the corner of the room, a place cell for the center, and another one for the area right in front of the door! The problem with Dabaghian et al.’s analysis, however, is that they are using persistent homology not (only) as a data analytics tool but as part of a *biological hypothesis* about how the place cells work. That is, they conjecture that the neurons downstream of the place cells interpret the co-firing of place cells as *natively topological* information—information about “connectedness, adjacency, and containment”—rather than exact geometric information about distances. They don’t necessarily suggest that our brain is running persistent homology, but they point out that the co-firing of juxtaposed place cells certainly suggests something like that is very possible. (This shouldn’t be a big surprise for us. After all, homology itself comes down to a very basic and universal phenomenon: the need to piece together many small pieces of information in order to get a global picture of the environment.)

A couple of problems I’ve encountered in applying this to the spike train data I have.

- The task I’m facing is straight data analysis—the data comes from neurons in mice reacting to different odors—and there are no obvious “geometric hooks” like place cells that encourage a topological perspective. In other words, there’s no obvious topology in the odorant space (as opposed to the topology of a room with a hole in the middle) that lets me verify my findings.
- In their experiment, Dabaghian et al. are generating the “temporal” simplicial complex directly from the time-binned data, not forming (as is usual in persistent homology) a Rips complex from a cover over points in a metric space.
- Okay, so I want points in a metric space. But this is a time series… and because there are so many neurons firing (or not firing) at each point in time (really a very small interval of time), the points are almost maximally distant from one another in Hamming distance!

What do I do? There’s one standard approach to time series: smooth (e.g. interpolate), then decompose the smoothed data into a Fourier basis, then compare the spike train data for a particular neuron in this metric space. Intuitively I find this dubious (how is this better than PCA if I’m throwing out a lot of information when I smooth?). And it also misses the fact that what I really want to study is the co-firing information.
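For concreteness, the smooth-then-Fourier recipe looks roughly like this (synthetic spikes; the kernel width and the number of Fourier coefficients kept are arbitrary choices, not anything from the papers cited):

```python
import numpy as np

rng = np.random.default_rng(0)
spikes = (rng.random(256) < 0.1).astype(float)  # one neuron, 256 time bins

# Smooth with a normalized Gaussian kernel to get a firing-rate estimate.
t = np.arange(-25, 26)
kernel = np.exp(-t**2 / (2 * 5.0**2))
kernel /= kernel.sum()
rate = np.convolve(spikes, kernel, mode="same")

# Keep a few low-frequency Fourier magnitudes as the feature vector.
features = np.abs(np.fft.rfft(rate))[:16]
```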

I thought another approach might be to use the “persistent vineyard” approach discussed in Applying Persistent Homology to Time Varying Systems. In her thesis, Munch describes methods for combining many “slices” of the persistent homology of a point cloud as it varies (continuously). But the point cloud I have at each “slice” is very sparse—it’s sitting in some finite product space!

Okay, I’m not throwing up my hands yet. These papers look promising. But to make progress we really have to dig back into persistent homology and see what this technique is really doing, and see what the “natural” modification for time-series might be.

Continued in…

To contact me, send me an email at joshua dot z dot tan at gmail dot com (remember the “z” in the middle, otherwise you’ll get someone different!).

In this case the number of books in each room (32 per bookshelf x 5 shelves per wall x 4 walls = 640 books) is far less than the length in characters of each book, i.e. the dimension of the book if we represent it in vector form. We would like to pare down this representation so that we can analyze and compare just the most relevant features of these books, i.e. those that suggest their subjects, their settings, the language they are written in, and so on.

Typically, we simplify our representations by “throwing out the unimportant dimensions”, and the most typical way to do this is to keep just the two to three dimensions with the largest singular values. On a well-structured data set, singular value decomposition, a.k.a. PCA, might leave only 2-3 dimensions which together account for over 90% of the variance… but clearly this approach will never work on books so long as we represent them as huge strings of characters. No single character can say anything about the subject or classification of an entire book.
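A toy run of that recipe, on synthetic data built to have two dominant hidden directions (everything here is made up for illustration): center the data, take the SVD, keep the two directions with the largest singular values, and check how much variance they capture.

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 50))
data = latent + 0.05 * rng.normal(size=(100, 50))  # 100 "books", 50 features

centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

explained = (s[:2] ** 2).sum() / (s ** 2).sum()  # fraction of variance in top 2
reduced = centered @ Vt[:2].T                    # each "book" is now 2-dimensional
```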

Another way of simplifying data is to look at the data in a qualitative way: by studying its shape or, more precisely, its *connectedness*. This is the idea behind persistent homology.

To be continued…
