# Formal concepts and natural languages

Back in January, Yiannis (Vlassopoulos) and I were talking about “quadratic relations” and higher concepts in language, for example the analogy between “(king, queen)” and “(man, woman)”. Deep neural networks can learn such relations from a set of natural language texts, called a corpus. But there are other ways of learning such relations:

| Representation of corpus | How to learn / compute |
| --- | --- |
| n-grams | string matching |
| word embeddings | standard algorithms, e.g. neural nets |
| presyntactic category | computational category theory |
| bar construction | Koszul dual |
| formal concept lattice | TBD |

There are some very nice connections between all five representations. In a way, they’re all struggling to get away from the raw, syntactic, “1-dimensional” data of word co-location to something higher-order, something semantic. (For example, “royal + woman = queen” is semantic; “royal + woman = royal woman” is not.) I’d like to tell a bit of that story here.

# n-grams

Consider the following 35-word corpus.

>>> let T = "Do you like green eggs and ham?
I would not like them here or there.
I would not like them anywhere.
I do not like them, Sam-I-am.
I do not like green eggs and ham!"

This is a text string. In a standard natural language processing (NLP) scenario, we begin with a nice parser, which when applied to the above gives us a sequence of words

>>> let words = nice_parse(T) -- eg words(T) in Haskell or T.split() in Python
>>> words
['do', 'you', 'like', 'green', 'eggs', 'and', 'ham', ...]

>>> let dict = unique_words(words)
>>> dict
['do', 'you', 'like', 'green', 'eggs', 'and', 'ham', 'I', 'would', 'not', 'them', 'here', 'or', 'there', 'anywhere', 'Sam-I-am']

>>> let l = size(dict)

For now let’s abstract from the parsing / word segmentation problem (which can be nontrivial, especially in languages like Chinese or Korean) and assume we’re given such a list of words. We can then compute a list of 2-gram probabilities in the usual way: for example, the probability that an occurrence of ‘like’ is preceded by ‘you’ is .2, while the probability that it is preceded by ‘Sam-I-am’ is 0. So to every unique word in my dictionary (of size l), I associate a vector of l probabilities whose entries sum to 1.

>>> x = dict[2]
>>> x
'like'

>>> x' = some_n_gram_embedding(word='like', n=2, text=words, vocabulary=dict)
>>> zip(dict, x')
[('do', 0), ('you', .2), ('like', 0), ('green', 0), ...]
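To make this concrete, here is a minimal Python sketch of what a function like `some_n_gram_embedding` could compute. The function name comes from the pseudocode above, but the implementation is my guess at its intent: each word is mapped to the distribution of words that immediately precede it in the corpus, which reproduces the (‘you’, .2) entry.

```python
from collections import Counter

def bigram_embedding(word, tokens, vocab):
    """Probability vector over vocab: how often each vocab word
    immediately precedes `word` in the token stream."""
    counts = Counter(prev for prev, cur in zip(tokens, tokens[1:]) if cur == word)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]

tokens = ("do you like green eggs and ham I would not like them here or there "
          "I would not like them anywhere I do not like them Sam-I-am "
          "I do not like green eggs and ham").split()
vocab = ['do', 'you', 'like', 'green', 'eggs', 'and', 'ham', 'I', 'would',
         'not', 'them', 'here', 'or', 'there', 'anywhere', 'Sam-I-am']

x2 = bigram_embedding('like', tokens, vocab)
print([(w, p) for w, p in zip(vocab, x2) if p > 0])  # [('you', 0.2), ('not', 0.8)]
```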

In an n-gram model, I would associate to each word a vector of probabilities of size $l^{n-1}$ whose entries sum to 1, so the final output of an algorithm is an $l \times l^{n-1}$ matrix. Note that there is a Markov assumption at work here—in general one expects the probability of the next word to depend on every word that came before it in its sentence (or in its entire paragraph), but we suppose that P(‘green’ | ‘I do not like’) ≈ P(‘green’ | ‘not like’) ≈ P(‘green’ | ‘like’).

Significant variations of n-grams exist, for example smoothing (words that never show up in the corpus are assigned a small non-zero probability) or skip-grams (you can skip words when counting n-gram probabilities). Most of the legwork of computing these probabilities comes down to string matching, i.e. counting occurrences of similar strings in a large text.

Once learned, I can apply my n-gram model to a range of tasks, for example testing whether a sentence is grammatically correct (‘I would not like them’ vs ‘I would green like them’) or predicting the next word in a sentence (‘I would not like …’).
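As a quick sketch of the grammaticality task, here is one way to score sentences with bigram probabilities. The add-one smoothing scheme and the lowercasing are my own choices, not from the text:

```python
import math
from collections import Counter

tokens = ("do you like green eggs and ham I would not like them here or there "
          "I would not like them anywhere I do not like them Sam-I-am "
          "I do not like green eggs and ham").lower().split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
V = len(set(tokens))

def score(sentence):
    """Add-one-smoothed bigram log-probability (higher = more corpus-like)."""
    ws = sentence.lower().split()
    return sum(math.log((bigrams[(u, v)] + 1) / (unigrams[u] + V))
               for u, v in zip(ws, ws[1:]))

print(score("I would not like them") > score("I would green like them"))  # True
```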

# Word embeddings

n-grams are a particularly simple kind of vector space representation for text strings (reminiscent of ‘sliding windows’ in time-series analysis) where we represent every word as a vector of probabilities. We can consider more general embeddings of text strings into a vector space, whose value is often self-evident:

word embeddings (projected to 2D) from Turian et al. (2010)

I can then further extract clusters of words using the ambient metric in $\mathbb{R}^d$. (Where did the metric or the Euclidean structure come from? Answer: we chose $\mathbb{R}^d$ because it was a convenient template for clustering, but in principle we could use other mathematical objects as templates.) Alternately, one can compute a list of the nearest neighbors to a word, such as

nearest neighbors from Collobert et al. (2011)

The examples above are borrowed from Chris Olah’s blog, which has a good introduction to word embeddings in the context of neural networks. In such examples, the topology of the embedding space can tell us something about the meaning of the strings. For more detail, consider Mnih and Kavukcuoglu (2013).

Computing these embeddings can be a bit tricky. For now let’s restrict ourselves to strings of single words and consider word embeddings $W : \text{words} \to \mathbb{R}^d$. One way of computing $W$ is through a neural network, as in Bengio et al.’s original neural probabilistic language model, which trains the word embedding simultaneously with a standard n-gram model, e.g. by taking the entries of an $l \times d$ projection matrix (one $d$-dimensional row per dictionary word) as the weights of a projection layer in a neural network. The projection layer is fed through another hidden layer, and the network is then trained in the standard way on a corpus, using backpropagation and SGD.

On our little corpus, this looks something like

-- define parameters
>>> let n = 2    -- eg. 2-grams
>>> let d = 10   -- d is the dimension of the embedding space
>>> let h = 5    -- number of neurons in the hidden layer of the neural network

-- set up the neural network (this is based on the PyBrain syntax)
>>> let net = FeedForwardNetwork()
>>> net.addInput(input_layer)              -- inputs are "windows" of n words
>>> net.addLayer(projection_layer(n,d))    -- weights of the word embedding
>>> net.addLayer(hidden_layer(h))          -- stores reaction to proj. layer
>>> net.addOutput(output_layer)            -- a softmax layer gives the likelihood

>>> net.addConnection(full, input_layer, projection_layer)
>>> net.addConnection(full, projection_layer, hidden_layer)
>>> net.addConnection(full, hidden_layer, output_layer)

-- prep the inputs
>>> let ngram n xs = if length xs < n then [] else take n xs : ngram n (tail xs)
>>> bigram_T = ngram 2 words
>>> bigram_T
[['do', 'you'], ['you', 'like'], ['like', 'green'], ['green', 'eggs'], ...]

-- train the neural network
>>> net.train(bigram_T)

-- visualize the model in two dimensions
>>> for word in dict:
...     project_2d(net.activate(word))

[FIGURE]

In the king-queen example, Mikolov et al. mention how word embeddings learned by Bengio’s neural network model satisfy approximate linear equations like

$W(\text{king})-W(\text{man})+W(\text{woman})\approx W(\text{queen})$

Mikolov’s paper focuses on how to modify the neural network model in order to improve the accuracy of these linear analogies, but notably, Bengio’s original paper doesn’t discuss these relations at all—they introduce word embedding as a way of doing dimensionality reduction on the space of possible n-grams. The analogies and “quadratic” relations seem to come for free whenever we use a neural network!
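The arithmetic itself is easy to demonstrate. Below are hand-picked toy vectors (not learned embeddings) chosen so that the king-queen relation holds exactly, just to show the nearest-neighbor lookup that makes analogies work:

```python
import math

# Hand-picked toy embeddings (not learned): axis 0 = "royalty", axis 1 = "gender"
W = {'king': [1.0, 1.0], 'queen': [1.0, -1.0],
     'man':  [0.0, 1.0], 'woman': [0.0, -1.0]}

def nearest(v, exclude=()):
    """Word whose embedding is closest (Euclidean distance) to the vector v."""
    return min((w for w in W if w not in exclude),
               key=lambda w: math.dist(W[w], v))

# W(king) - W(man) + W(woman) lands on W(queen)
target = [k - m + w for k, m, w in zip(W['king'], W['man'], W['woman'])]
print(nearest(target, exclude={'king', 'man', 'woman'}))  # queen
```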

# Presyntactic categories

In principle, there should be other ways of capturing and making use of this “free structure”. I.e. I don’t think that the use of neural networks is special. There is probably some elemental (geometric?) fact about the data and/or natural language that the neural network is picking up on.

So we would like a better theory for why and when these quadratic relations show up. Such a theory should help us solve problems like how to obtain cubic relations, e.g. analogies between analogies such as

$[W(\text{king})-W(\text{man})=W(\text{queen})-W(\text{woman})]=[W(\text{niece})-W(\text{girl})=W(\text{nephew})-W(\text{boy})].$

Back in 2014, Misha Gromov (personal communication, May 8, 2014) introduced the idea of the presyntactic category of a given text $T$. We define $T$ as a string, where strings $s_1, s_2, ...$ come with notions of isomorphism $s_1 \simeq s_2$ when two strings in different locations share the same characters, equality $s_1 = s_2$ when two strings share the same location and characters, and string inclusions $s_1 \hookrightarrow s_2$ when there exists some substring $s_1' \subseteq s_2$ such that $s_1' \simeq s_1$. In that case, the presyntactic category of $T$, $\mathcal{C}_{\hookrightarrow} =\mathcal{C}_{\hookrightarrow}(T)$, is the category with objects T and all of its substrings, and morphisms the string inclusions.

Naively, we can represent the category as two lists. (There are more sophisticated representations, e.g. most typeclasses in a functional programming language are, in fact, categories.)

>>> let T = "do you like green eggs and ham"

>>> Ob(T)
['do you like green eggs and ham', 'o you like green eggs and ham', ..., 'a', 'n', 'd', 'h', 'a', 'm' ]

>>> Mor(T)
[('d', 'do'), ('d', 'do '), ('d', 'do y'), ..., ('o you like green eggs and ham', 'do you like green eggs and ham')]

-- note, we can allow for "approximate" morphisms that allow spaces between characters, e.g. ('gen', 'green'). This is more obviously useful when we restrict Ob(T) to include only words, e.g. so we can draw arrows like ('fast car', 'fast red car').
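A naive but runnable version of the two-list representation might look as follows, treating objects as distinct substrings and writing an inclusion $s_1 \hookrightarrow s_2$ whenever $s_1$ occurs inside $s_2$ (identity morphisms are omitted, and a small $T$ keeps the lists manageable):

```python
def Ob(T):
    """All distinct substrings of T (including T itself)."""
    return {T[i:j] for i in range(len(T)) for j in range(i + 1, len(T) + 1)}

def Mor(T):
    """String inclusions s1 -> s2: s1 occurs as a contiguous substring of s2."""
    obs = Ob(T)
    return {(s1, s2) for s1 in obs for s2 in obs if s1 != s2 and s1 in s2}

T = "do you"
print(('d', 'do') in Mor(T))        # True
print(('you', 'do you') in Mor(T))  # True
```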

Gromov introduced presyntactic categories in order to study the algebraic / combinatorial properties of language. For example, given $\mathcal{C}_{\hookrightarrow}$ one can imagine a subcategory $\mathcal{D}_{\hookrightarrow} \subset\mathcal{C}_{\hookrightarrow}$ which is minimal with respect to $\mathcal{C}_{\hookrightarrow}$, in the sense that $\mathcal{D}_{\hookrightarrow}$ will be the smallest subcategory in $\mathcal{C}_{\hookrightarrow}$ which generates $\mathcal{C}_{\hookrightarrow}$ “as a monoid”, e.g. under all compositions of strings $s . t$ in $\mathcal{D}_{\hookrightarrow}$, satisfying identity $s . \varepsilon = s$ (where $\varepsilon$ is the empty string) and associativity $(s . (t . u)) = ((s . t) . u)$. This minimal subcategory will usually be some subcategory of words in the text $T$.

Despite the somewhat unfamiliar setting, one can do all the usual n-gram analysis in $\mathcal{D}_{\hookrightarrow}$: we have the same probability measure on bigrams given by $P_u^{\hookrightarrow}(v) = P(v|u) = N_T(u.v)/N_T(u)$, where $N_T(u)$ denotes the occurrences of $u$ in $T$. Note that $N_T$ can be computed directly by counting the morphisms in $\mathcal{D}_{\hookrightarrow}$ from the word u.
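For instance, here is a small Python sketch of $N_T$ and the bigram probability $P(v|u) = N_T(u.v)/N_T(u)$, counting over the token stream of our Seuss corpus (counting at the word level, with punctuation dropped, is my simplification):

```python
tokens = ("do you like green eggs and ham I would not like them here or there "
          "I would not like them anywhere I do not like them Sam-I-am "
          "I do not like green eggs and ham").split()

def N(seq):
    """N_T: number of occurrences of the word sequence `seq` in the corpus."""
    k = len(seq)
    return sum(tokens[i:i + k] == list(seq) for i in range(len(tokens) - k + 1))

def P(v, u):
    """P(v | u) = N_T(u.v) / N_T(u)."""
    return N([u, v]) / N([u])

print(N(['like']), P('them', 'like'))  # 5 0.6
```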

The payoff for all this comes when we want to classify similar words. A pair of words $(u, u')$ is similar, written $u \sim u'$, when the two words “similarly interact”, respectively, with words $(v, v')$ which are already similar. “Similarly interact” can be understood in the n-gram way as having high values for $P(v|u), P(v'|u')$ or $P(u|v), P(u'|v')$. For example, the word ‘bat’ is similar to the word ‘bird’ because they often show up next to the word ‘fly’, which is similar to itself. Or ‘hammock’ is similar to ‘bed’ since ‘hammock’ shows up with ‘nap’, which is similar to ‘sleep’.

>>> let T = "do you like green eggs and ham I would like them here or there"

-- define "similarly interacts", e.g. as a non-zero bigram count
>>> interacts :: String -> String -> Bool
>>> interacts u v = N_T(u.v) > 0

-- define the similarity relation ~ recursively, based on interacts
>>> (~) :: String -> String -> Bool
>>> u ~ u' | u == u' = True    -- note, Gromov actually discounts self-similarity
>>> u ~ u' = or [ interacts(u,v) && interacts(u',v') && v ~ v'
                | (v,v') <- pairs ]

How do we compute or “learn” similarity relations on a text corpus? A collection of similar words is called a cluster, and finding such collections in a corpus is called clustering. Finding clusters of words is a lot like learning a word embedding: the difference is that instead of embedding words into a (finite) metric space, from which we derive clusters, we are embedding the words directly into a set or category of clusters. The underlying “facts” of clustering are the same interacting pairs $\langle u,v \rangle$ as they are in a word embedding problem. We then compute the similarity relations (which in some sense “lie over” $\mathcal{D}_{\hookrightarrow}$ just as $\mathcal{D}_{\hookrightarrow}$ lies over the bare corpus $T$) through a recursive process. This process, in the case of pairs of words (u,u’) and (v, v’), looks a lot like biclustering, since we are determining similarity simultaneously on two axes: formation of a cluster $u, u', u'', ...$ and formation of a cluster $v, v', v'', ...$.
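Here is a toy Python sketch of the recursive similarity relation, using the bat/bird and hammock/bed examples from above. The interaction pairs are hand-made stand-ins for co-occurrence counts, and the extra word 'tired' (linking 'nap' to 'sleep') and the depth bound are my own additions to make the recursion terminate:

```python
# Hand-made interaction pairs (stand-ins for co-occurrence counts)
interactions = {('bat', 'fly'), ('bird', 'fly'),
                ('hammock', 'nap'), ('bed', 'sleep'),
                ('nap', 'tired'), ('sleep', 'tired')}

def interacts(u, v):
    return (u, v) in interactions or (v, u) in interactions

words = {w for pair in interactions for w in pair}

def sim(u, v, depth=2):
    """u ~ v: equal, or each interacts with one of an already-similar pair."""
    if u == v:
        return True          # note: Gromov actually discounts self-similarity
    if depth == 0:
        return False
    return any(interacts(u, a) and interacts(v, b) and sim(a, b, depth - 1)
               for a in words for b in words)

print(sim('bat', 'bird'))     # True: both interact with 'fly'
print(sim('hammock', 'bed'))  # True: 'nap' ~ 'sleep' via 'tired'
```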

Similarly, the linear relation we saw between ‘king’ and ‘man’ and ‘queen’ and ‘woman’ can be read as a special kind of similarity relation. ‘King’ is similar to ‘man’ because ‘king’ interacts with ‘queen’, which is similar to ‘woman’. Of course there is some more structure to the relation. One way to interpret the “linear” structure we found in the word embedding is to say that ‘king’ interacts with ‘queen’ in the same way that ‘man’ interacts with ‘woman’:

<king, queen> ~ <man, woman>

Gromov might say that the interaction relations here have the same “color”, in the sense of a colored operad. Certainly there are many different “colors” of interaction relations: we can define $\langle u,v \rangle \in \hom(u,v)$ as a “nearby” n-gram relation, as a “frequently nearby” relation, as a “far from” relation, as a noun-verb relation, as a short-long relation, as any sort of grammatical relation, and so on.

Let’s understand the problem: there are many different ways to define “similarly interact”, and not only may the different interaction relations be themselves related, but changing from one to the other will also change the computational structure of the clustering problem on $\mathcal{D}_{\hookrightarrow}$. For example, $\sim$ is not always transitive if we use an n-gram interpretation.

(Originally I was going to write a whole section here on functorial clustering and how that might help us do analysis of algorithms while varying the interaction relations. Instead, I’ll put those thoughts into another post (draft). Addendum: functorial clustering covers hierarchical clustering, e.g. algorithms where I form one set of clusters, then combine clusters to form a smaller set of larger clusters.)

I like colored operads for different reasons, but I’m not sure about using colors here. Instead, I’d prefer to think that any similarity relation between two edges, <king, queen> and <man, woman>, should be defined in the same, recursive way that we defined similarity between two words: <king, queen> ~ <man, woman> just whenever <king, queen> interacts with some word/relation u and <man, woman> interacts with some word/relation v such that u ~ v. I wonder: can similarity be thought of as an analogue to homotopy for finite sets? (Wow, I just googled this question and it could lead to some pretty interesting topology!) Maybe this formulation can help clarify some of our assumptions about higher-order relations.

# What are higher-order relations?

Clustering can be used to simplify languages (e.g. Gromov thinks of clustering as making a “dictionary”) but it can also be seen as a way of enriching language by “continuing” the context of a word, i.e. the space of other words it could potentially interact with in $\mathcal{D}_{\hookrightarrow}$. A really simple way of continuing a context is by varying or adding new interaction conditions (e.g. from 2-grams to 3-grams), and functorial clustering suggests a way to control that process. But when we ask about higher-order relations like “[king – man = queen – woman] = [niece – girl = nephew – boy]” (if king – man = queen – woman is a second-order relation, this would be a third-order relation), it seems like we are really asking for a way of continuing the context of a word through the set of its associated clusters.

These higher-order relations aren’t just cute, isolated artifacts. Consider a metaphor: “the time slipped through her fingers.” This requires not only recognizing that time ~ sand, but that time can slip, i.e. that ‘time’ interacts with ‘slipped’ similarly to how ‘sand’ interacts with ‘slipped’. We might write this as <time, slipped> ~ <sand, slipped>, where <n,v> represents here the interaction relation “follows after” (or perhaps some other interaction; the details may vary). With the metaphor established, it’s easier to imagine building yet more connections going the opposite way, perhaps <ticks of, time> ~ <ticks of, sand>. Poetry is replete with these metaphors, and they are pretty much endemic to language.

Just to bring it back to word embedding for a second: the reason we like word embeddings is because we can translate the poorly-organized data of word co-location into the topology of a vector space. Now we want to translate the data of word co-location (among other interaction data) directly into a cluster of “similar” objects. Are presyntactic categories the right generalization of word embedding? If so, do they suggest any practical algorithms for computing these higher-order clusters? I don’t know the answer to that question yet, but let’s try playing around with some algebraic constructions before circling back to the categorical formalism.

# The bar construction

In homological algebra, the bar construction $BA$ is a canonical way of getting a simplicial complex out of an $R$-algebra $A$ by considering all the actions of $R$ on $A$. It underlies many classical constructions in algebraic topology (e.g. Ext, Tor, homotopy colimits, classifying spaces, realizations, etc.). Personally, I like Baez’s description: “The bar construction ‘puffs up’ any algebraic gadget, replacing equations by edges, syzygies by triangles and so on, with the result being a simplicial object with one contractible component for each element of the original gadget.”

As we noted before while talking about how to “generate $\mathcal{C}_{\hookrightarrow}$ as a monoid”, an object in the presyntactic category can be understood as a very simple algebraic gadget assembled from all its substrings. In fact, the text itself can be represented as an associative algebra. The bar construction is the obvious way to extract the “qualitative” information from this gadget. In particular, the bar complex of the text, viewed as an associative algebra $A$, will allow us to compute the cohomology of the text as an algebra. The cohomology captures all the words and phrases which do not appear in the text.

According to Yiannis (joint with Maxim Kontsevich, personal communication), for any given text $T$, we can associate a unital, associative algebra $A = A_T$ which we define as the free algebra generated by all substrings of $T$ modulo the ideal of all the impossible strings (e.g. ‘greeSam’ is modded out in $A$). Given $u,v$ as words or strings in the text, we define the product $u.v$ to be the concatenated string if that string occurs in the text, and $u.v=0$ if not. So $A$ has as elements the substrings of $T$ (the same as those of $Ob(\mathcal{D}_{\hookrightarrow})$), and basis the words of $T$. Given such a basis, we can make $A$ a $\mathbb{Z}$-graded algebra where the degree of a given substring is the number of words it contains.

The bar construction associates to the algebra $A$ a coalgebra $C$ (graded over $\mathbb{Z}$) whose elements are all the strings in $T$. The comultiplication $\Delta:C\to C\otimes C$ is defined by $\Delta(u) := \sum_i a^{(i)} \otimes b^{(i)}$, where the sum runs over all ways of splitting $u$ as a concatenation $u = a^{(i)} . b^{(i)}$ and $\otimes$ represents a formal pairing.

>>> let T = "do you like green eggs and ham"
>>> Delta(T)
[('d', 'o you like green eggs and ham'),
('do', ' you like green eggs and ham'),
...
('do you like green eggs and ha', 'm')
]

>>> Delta(Delta(T))
[
[(0, ('o', ' you like green eggs and ham')),
(0, ('o ', 'you like green eggs and ham'))
...
(0, ('o you like green eggs and ha', 'm')],
[(('d', 'o'), (' you like green eggs and ham')),
...
(('d', 'o'), (' you like green eggs and ha', 'm')],
...
]

-- on singletons 'x', the differential vanishes: d('x') = 0

Intuitively, the comultiplication captures all the different ways in which I can split up a given string s in the text. The coalgebra structure is generated from $A$ iteratively at each level, with bar differential defined by $\bar{\partial}(a \otimes b) = (\partial a \otimes b) + (-1)^{|a|} (a \otimes \partial b)$. Geometrically, text strings correspond to paths, string concatenation to path composition, and the “context” of a string to an open neighborhood of a path. The idea is that every non-contractible n-simplex in the bar complex will stand for some concatenation of words $u_1 . u_2 . ... . u_n$ which is an actual sentence in $T$. The cohomology of the complex will detect when adding or removing a word takes a sentence out of the text $T$, i.e. when a word is used out of context.
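The splitting itself (without signs or the differential) is a one-liner, which may help make the `Delta(T)` listing above concrete:

```python
def delta(s):
    """All ways to split s into a nonempty prefix and suffix."""
    return [(s[:i], s[i:]) for i in range(1, len(s))]

print(delta("do")[0])        # ('d', 'o')
print(len(delta("do you")))  # 5 splits of a 6-character string
```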

Obviously the bar construction can become quite unwieldy for even a medium sized string. Remember, however, that the point of computing the bar construction is to extract the cohomology. Can we “preprocess” $BA$ in order to get a more computable representation?

% [read up on koszul complex and finish this section later.]

# Formal concept lattices

Recently, Yiannis brought up the example of formal concept analysis, and he hypothesized that one could get the same results out of formal concept analysis that one gets from neural networks. Formal concept analysis is a graph-based method for analyzing data—one puts in some labeled data, and the algorithm spits out a concept graph (in fact, a lattice!) of “formal concepts”, which are clusters of data with similar labels. Besides the lattice structure, formal concept lattices are also much easier for a human to interpret than the layers and weights of a neural network.

Example of a simple concept lattice, from fcahome.org.uk. Each edge can be read as “is an example of”.

A formal context is a triple (X, Y, J) of objects X, attributes Y, and a correspondence $J \subset X\times Y$ that tells us which objects have which attributes. One can visualize the correspondence $J$ as a binary matrix $M_{i,j}$ with objects denoted by rows and attributes denoted by columns.

There is a natural notion of clustering in formal concept analysis. First, we define two operators on subsets of X and Y respectively: the intent of a set of objects $A \subset X$ given by $I(A) := \{ y \in Y : (x,y) \in J \; \forall x \in A \}$ and the extent of a set of attributes $B \subset Y$ defined by $E(B) := \{ x \in X : (x,y) \in J \; \forall y \in B \}$.

In other words, $I(A)$ represents all the attributes shared by every object in A, while $E(B)$ represents all the objects which satisfy every attribute in B.

In that case a formal concept is a pair $(A,B)$ where $A \subset X$ and $B \subset Y$ such that $I(A) = B$ and $E(B)=A$. Alternatively, we say that a subset $A \subset X$ is a formal concept just when $E(I(A)) = A$, i.e. when $A$ is a fixed point of the composition $H = E \circ I$. In the matrix representation of $J$, each formal concept corresponds to a maximal all-1’s block of rows and columns (after suitable row and column permutations) that cannot be enlarged by adding another row or column.

Given the matrix of a correspondence J, we can rearrange it by row and column interchanges. The red 1’s form one formal concept; the blue 1’s form another. In this example, there are 2 more formal concepts not colored.
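The intent and extent operators, and the fixed-point test for a formal concept, are short enough to sketch directly in Python (the animal context below is my own toy example, not the figure’s):

```python
# Toy formal context: which animals have which attributes
X = ['frog', 'duck', 'eagle']
Y = ['swims', 'flies']
J = {('frog', 'swims'), ('duck', 'swims'), ('duck', 'flies'), ('eagle', 'flies')}

def intent(A):
    """I(A): attributes shared by every object in A."""
    return {y for y in Y if all((x, y) in J for x in A)}

def extent(B):
    """E(B): objects having every attribute in B."""
    return {x for x in X if all((x, y) in J for y in B)}

A = {'frog', 'duck'}
print(intent(A))               # {'swims'}
print(extent(intent(A)) == A)  # True: A is a fixed point, hence a formal concept
```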

In principle this looks a lot like biclustering—which we saw before in the context of Gromov’s presyntactic categories—since we are also constructing maximal, contiguous blocks of similar values. The additional piece here is computing the entire lattice of formal concepts. (Actually, computing the join and meet of formal concepts feels sort of like computing data migrations $\Sigma_F, \Pi_F, \Delta_F$ for a functor $F: C \to D$. To see an example of these on instances, look at the FQL manual.) There are many algorithms for computing concept lattices, but I’m not sure if any of them are explicitly related to biclustering.

% is computing higher-order relations the same thing as computing the concept lattice?

# Summing up

Now we’ve come full circle. We started with an observation: neural networks can learn analogies like “W(king) – W(man) = W(queen) – W(woman)” as part of a good word embedding W. We then asked two questions: what’s the right way to represent these relations outside of a neural network, and how do we learn even more complicated relations? In Gromov’s theory, analogies such as the above are modeled by a recursively-defined similarity relation, man ~ king and woman ~ queen, and the word embedding is modeled by a map to a set of clusters (while glossing over how to write the embedding as an actual functor). We discussed how to capture the extra structure of the word embedding using a higher-order relation between relations like <man, king> ~ <woman, queen>. We then re-told this story from another perspective—via algebra / the bar construction—and saw the connection to cohomology. Finally, we looked at formal concept analysis, and tried to understand how the lattice structure relates to “typical” clustering on $\mathcal{D}_{\hookrightarrow}$, which led us back to neural networks, via Yiannis’s conjecture!

I wish I could give a neater summary, but I’m not sure how to interpret all this yet—some of it feels like it should be old hat to people who do neural networks, and some of it feels wildly speculative. The strength of category theory is its ability to make connections between far-ranging bodies of knowledge, but I’m sure that I’m still using it in a very naive way here. I’ll update my analysis as I get more data; for example, Chris Olah recently posted some speculation connecting neural networks to functional programming, which has a well-known categorical semantics. I have some thoughts I plan to post later about applying topos theory to the problem, because in many ways it feels like the same concerns about interaction conditions—which are akin to “coherent” guides to writing “local” systems of data into the morphisms of a (postsyntactic?) category—are reflected in the way we define sheaves and Grothendieck topologies on categories.

In the meantime, I hope this was a moderately useful illustration of some very different approaches to analyzing natural language texts. If you have any feedback, please let me know in the comments.

1. gottfried

some other word classes (prepositions, conjunctions) are much more difficult to treat than nouns. but I think topos (or the pregroup of Lambek) should be the right idea