A Theory of Learning for Categorial Grammars

1. Introduction and Summary of the Theory


Categorial grammars, being fully lexicalised, embody all the language-specific structure of any language in its lexical items. Because there are so many lexical items, their structure cannot be determined innately, and they must be acquired from language data. So categorial grammars require a theory of how lexical items are learnt from language data. This paper describes a working theory of language learning for categorial grammars, in which:

* All lexical items, of all categories, are learnt by the same simple mechanism
* Any lexical item can be learnt by observing about six examples of its use
* A word's sound, semantics and prosodic constraints are learnt together
* Any language can be learnt in a bootstrap manner, starting from zero vocabulary
* Learning is robust against noisy or misleading data

(Feature structure unification is written A u B. Generalisation is written A ^ B. The inverse of subsumption (extension) is written A > B, meaning that A contains all the structure of B.)

We adopt a radically lexicalist formulation of categorial grammar, in which each lexical item is represented by a single feature structure. This structure embodies the syntax, semantics and phonology of the word. On hearing a sentence whose words have known feature structures U, V, W ... we understand it by forming the derivation D = U u V u W ..., unifying the word feature structures. To generate the same sentence from its meaning feature structure, we form the same derivation D by unifying the same word feature structures in a different order.
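As a rough illustration of understanding by unification, the Python sketch below treats a feature structure as a nested dictionary and unifies two word structures into a derivation. It is a toy under stated assumptions: it omits re-entrant (co-indexing) links and time-order constraints, and the feature names (`sounds`, `meaning`, `t0`, `agent`, `pred`) are hypothetical, not taken from the figures.

```python
def unify(a, b):
    """Return the unification of a and b, or None if they clash.

    Assumption: feature structures are nested dicts with atomic leaves,
    and there are no re-entrant links.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)  # shallow copy, so the inputs are not modified
        for key, b_val in b.items():
            if key in out:
                merged = unify(out[key], b_val)
                if merged is None:
                    return None  # clash somewhere below this feature
                out[key] = merged
            else:
                out[key] = b_val
        return out
    # Two atoms (or an atom and a dict) unify only if they are equal.
    return a if a == b else None


# Hypothetical word feature structures for 'John walks'.
john = {"sounds": {"t0": "john"}, "meaning": {"agent": "john"}}
walks = {"sounds": {"t1": "walks"}, "meaning": {"pred": "walk"}}

D = unify(john, walks)
print(D)
```

In this toy the order of unification does not matter: `unify(walks, john)` gives the same derivation, which is the property that lets understanding and generation share one structure D.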


Word feature structures and the sentence meaning over-determine the derivation D. Therefore if a child hears a sentence in which she knows all the word feature structures except for one word W, and can also infer the intended meaning by non-linguistic means, she can still construct the derivation. This is the key to learning the new word W. Suppose the child hears several sentences all containing the same unknown word W, and constructs their derivations D1, D2, D3 and so on. The feature structure for W subsumes each of these derivations, that is, Di > W for all i. So if we generalise the derivations, G = (D1 ^ D2 ^ D3 ...) > W.


G has all the structure of W, but may contain extra structure as well, from any other coincidental similarities between all the Di. We can show statistically that as the number of Di increases, any extra information in G is rapidly pruned away; when there are six or more distinct Di, then to a very good approximation G = W. In this way the child can learn the feature structure W for any new word. This learning mechanism is embedded in a Bayesian theory of learning, which gives rapid, robust learning. By this mechanism, words of any linguistic category can be learnt, in a bootstrap fashion starting with nouns and working up to full adult language.

2. Unification Categorial Grammars


In many versions of unification categorial grammars (e.g. Uszkoreit '86; Zeevat et al '87; Bouma '88) each lexical item is represented by a 'deep' feature structure; every nested bracket in a category such as ((np\s)/np) adds an extra unit of depth to the structure. There is a branching node for each '/' or '\'. Some simple feature structures of this form are shown in figure 1.

In these feature structures, re-entrant links are represented by curved 'threads'. The threads can be pulled tight to give a DAG-like representation. The re-entrant links implement beta-reduction in function application - putting an argument at the requisite places in the representation of a function. The time-order information of '\' and '/' is built in (following Steedman '90) by adding a 'start time' and 'end time' feature to some nodes, and co-indexing their values. These features are shown as boxes at the sides of the nodes, and their value co-indexing is represented by shared integer values (1, 2, ..) rather than by curved threads, to avoid cluttering the diagram. Weakly-ordered constructs could be represented by weakening these co-indexing constraints. In a category (A/B) or (B\A), A is the 'res' branch and B is the 'arg' branch. Full category information (bracketing structure, '/' and '\' time ordering) can be recovered from each feature structure. Nodes denoting the sound of a word are shaded (with the word sound in quotes below) and semantic features are indicated in italics.

To use these feature structures in language understanding or generation, we may either do unifications interspersed with rearrangements of the feature structures (e.g. the 'stripping' of Zeevat et al '87), or we may unify the word feature structures with another 'function application' feature structure (e.g. Uszkoreit '86; Karttunen '86; Sanfilippo '93). We illustrate the second approach below.

The function application feature structure R is shown in figure 2. It has three main branches, used for 'argument', 'function' and 'result' from left to right. Its use to understand the sentence 'John walks', by forming the unification D = (R u J u W), is also shown in figure 2.

The sentence 'John hits Fred' is understood by forming D' = (R u J u (R u F u H)). In these derivations, the initial state (which is simply a time-ordered sequence of words) appears at the left end of D, and the sentence meaning is in the rightmost branch of D. Sentence generation can start with this meaning structure and form D by doing the same unifications in a different order. The co-indexed start and end times are like nodes of a chart, so unification builds a chart parse.

It is possible to make the learning theory work with these deep feature structures, but the presence of R makes the learning mechanism more complex than it need be. We will use instead a more radically lexical theory, in which understanding proceeds just by unifying word feature structures, without any need for R. Some of the required feature structures are shown in figure 3, and the derivation D = (J u W) for 'John walks' is also shown.

In these feature structures, the tree structures have been flattened, so that no extra structure R is required to reconcile different depths of nodes in derivations; word feature structures are simply unified together directly. Time order information is still embodied in co-indexing of node start and end times. The ordered word sounds are still at the left end of D, and the full sentence meaning at the right end. Understanding forms D by unification from left to right, and generation forms D by unification from right to left.


Because the trees are shallow, they now encode category information more liberally. For instance, the feature structure for 'hits' may be read as either ((np\s)/np) or (np\(s/np)) ; it does not define the order in which it consumes its two np arguments. However, in spite of the flattening of the feature structures, unification is still highly selective (e.g. because of the time order constraints). Relative to the 'deep structure' formulation of categorial grammar, the theory does not markedly over-generate or accept ungrammatical sentences. However, because the feature structures effectively represent multi-argument functions, and do not have any one preferred argument, the need for type-raising is diminished. For instance, the co-ordination 'John cooked and Henry ate the fish' can be handled just as simply as 'John cooked the fish and drank the wine', without type raising.

3. The Learning Mechanism


Suppose a word W appears in N different sentences S1, S2 ... SN which have derivation feature structures D1 = A u B u W ..., D2 = F u W u G ..., and so on. Form the generalisation G = (D1 ^ D2 ^ ... ^ DN). It is then easy to show that G > W: since each of the derivations Di contains W within it, their generalisation must contain all the structure of W.
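The generalisation step can be sketched in Python as follows. This is a minimal toy, assuming feature structures are nested dictionaries with atomic leaves and no re-entrant links; the derivations and feature names (`verb`, `subj`, `place`, and so on) are invented for illustration. It shows a coincidental feature surviving two derivations and being pruned by a third.

```python
from functools import reduce


def generalise(a, b):
    """Return the structure shared by a and b (None if two leaves differ)."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = {}
        for key in a.keys() & b.keys():
            shared = generalise(a[key], b[key])
            # Drop features with no shared content below them.
            if shared is not None and shared != {}:
                out[key] = shared
        return out
    return a if a == b else None


def extends(d, w):
    """True if d > w, i.e. d contains all the structure of w."""
    if isinstance(w, dict):
        return isinstance(d, dict) and all(
            k in d and extends(d[k], w[k]) for k in w
        )
    return d == w


# Hypothetical target word structure and three derivations containing it.
W = {"verb": {"sound": "walks", "pred": "walk"}}
D1 = {"verb": {"sound": "walks", "pred": "walk"},
      "subj": {"sound": "john", "ref": "john"}, "place": "park"}
D2 = {"verb": {"sound": "walks", "pred": "walk"},
      "subj": {"sound": "fred", "ref": "fred"}, "place": "park"}
D3 = {"verb": {"sound": "walks", "pred": "walk"},
      "subj": {"sound": "mary", "ref": "mary"}, "place": "beach"}

G12 = generalise(D1, D2)            # still carries the coincidental 'place'
G = reduce(generalise, [D1, D2, D3])  # the third derivation prunes it
print(G12)
print(G)
```

Here every derivation satisfies `extends(Di, W)`, the two-way generalisation keeps the accidental `place` feature, and the three-way generalisation equals W in this toy example.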

For any feature structure A, we can define its information content I(A) (measured in bits) as a sum of the information in all its features. Each derivation can be written as Di = W u Xi, where the Xi differ from one another in quasi-random ways. Then G = W u Z, where Z = (X1 ^ X2 ^ ... ^ XN). Since the Xi have no guaranteed features in common, we can show statistically that the expected information content I(Z) decreases exponentially with increasing N, I(Z) ~ exp(-mu*N) where mu is of order 1. So even for quite modest N (say N = 6), I(Z) is very small, and so to a very good approximation G = W. From six derivations, we can reliably recover the feature structure W for a shared word. This is the key to the learning algorithm.
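One simple way to see where an exponential decay of this kind can come from is the following back-of-envelope model, which is an assumption of this sketch and not the paper's own statistical argument. Suppose the incidental structure Xi has a fixed number of features, each taking one of `values` equally likely values independently in each sentence. A feature survives in Z only if all N sentences agree on it, so the expected I(Z) falls by a factor of `values` per extra sentence, giving mu = ln(values).

```python
import math


def expected_extra_bits(n, features=20, values=3):
    """Expected bits of coincidental structure left after n derivations.

    Hypothetical model: `features` independent features, each uniform
    over `values` alternatives; a feature survives generalisation only
    if all n derivations happen to agree on its value.
    """
    bits_per_feature = math.log2(values)
    p_survive = values ** -(n - 1)  # the n-1 later sentences must match the first
    return features * bits_per_feature * p_survive


mu = math.log(3)  # decay rate for values=3, which is of order 1
for n in range(1, 8):
    print(n, round(expected_extra_bits(n), 3))
```

With these illustrative numbers the incidental information starts at about 32 bits for a single sentence and falls well below one bit by N = 6.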

This does not yet give a working learning algorithm because, it appears, you need to know the feature structures for all words in any sentence in order to construct its derivation D = (A u B u W..). If you do not yet know the feature structure for W, you cannot construct the derivations D1, D2, D3 ... needed in order to learn it. Catch-22.

However, suppose a child hears a sentence when she knows the feature structures for all its words except one word W, and can also guess the intended meaning of the sentence. In these circumstances, she can always construct the full derivation D, except for those coindexing links that originate from W itself.
For instance, suppose a child hears 'John walks', knowing the feature structure J for 'John' and knowing the intended meaning, but not knowing the feature structure W for 'walks'. She can recover the whole derivation D = (J u W) in figure 3 as follows: labelling the sub-trees 1-4 from left to right, sub-trees 1 and 2 come from the heard word sounds, sub-tree 3 comes from the feature structure for 'John' and sub-tree 4 comes from the inferred meaning. The only part missing is the co-indexing link from the 'walks' feature structure.

In general we can always find a derivation D for a sentence in which we know feature structures for all words except one, if we can infer (by non-linguistic means) the intended meaning of the sentence. The only thing missing from D will be the coindexing links from the missing word. Therefore the derivations Di for sentences containing an unknown word W can be constructed, and their generalisation (D1 ^ D2 ^ D3 ...) is (to a very good approximation) the new feature structure W, without its coindexing links.

There is a simple test to discover the co-indexing links of the new feature structure W. Denoting nodes of any feature structure X by a, b, c.., and denoting the subtree rooted at node a by X[a], then if (Di[a] u Di[b]) exists for all i, W should have a coindexing link from node a to node b.
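The test for co-indexing links can be sketched as below. This is a toy under stated assumptions: feature structures are nested dictionaries with atomic leaves, nodes are addressed by hypothetical paths of feature names, and the example derivations are invented. For atomic leaves, "the unification exists" reduces to the two values being equal in every derivation.

```python
def unify(a, b):
    """Unify two nested-dict feature structures; None signals a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, b_val in b.items():
            if key in out:
                merged = unify(out[key], b_val)
                if merged is None:
                    return None
                out[key] = merged
            else:
                out[key] = b_val
        return out
    return a if a == b else None


def subtree(fs, path):
    """Return the subtree X[a] rooted at the node addressed by `path`."""
    for key in path:
        fs = fs[key]
    return fs


def coindexed(derivations, path_a, path_b):
    """True if Di[a] u Di[b] exists for every derivation Di."""
    return all(
        unify(subtree(d, path_a), subtree(d, path_b)) is not None
        for d in derivations
    )


# Hypothetical derivations for three sentences containing 'walks'.
derivations = [
    {"subj": {"ref": name}, "meaning": {"agent": name, "pred": "walk"}}
    for name in ("john", "fred", "mary")
]

print(coindexed(derivations, ("subj", "ref"), ("meaning", "agent")))
print(coindexed(derivations, ("subj", "ref"), ("meaning", "pred")))
```

In this example the subject referent and the agent of the meaning agree in every derivation, so a link between them is proposed, while the subject referent and the predicate clash and no link is proposed.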

It may appear from this description that you need to know some words in order to learn any others. However, a 'bootstrap' learning process can start with nouns, knowing no other words. The process can then continue to learn any parts of speech in a language. (I have built a program which implements this bootstrap learning, from nouns up to many other parts of speech.) Knowledge of the word feature structures gives the child a working language capability, which can be extended indefinitely to large vocabularies, higher types and complete syntactic coverage.

4. Bayesian Learning Theory


While a child may sometimes hear all words in a sentence clearly, and may sometimes infer the intended meaning correctly by non-linguistic means, she will not always do so reliably. We need to show that the learning mechanism continues to work reliably, even if many (or most) of the example sentences are not correctly heard or construed. We can do this by embedding the learning mechanism in a Bayesian theory of feature structure learning.

The Bayesian learning theory hinges on a relation between the information content I(D) of a feature structure D, and its probability of occurring. If an event is described by the feature structure D, then in the absence of any other knowledge the probability of the event occurring is approximately 2**[-I(D)] . Similarly for a sequence of events D1 ... DM, in the absence of any other knowledge the probability of the sequence is 2**[ - sigmaM I(DM)]. sigmaM denotes summation over the index M.

This small probability can be made larger if we know some causal regularities about the domain, so that feature structures Di do not simply occur at random. Each causal regularity can be summarised by a feature structure W with two branches - a 'cause' branch C and an 'effect' branch E. The regularity means that for any event feature structure D, if D > C then with probability Q(W), D > W (which also implies that D > E). That is, the cause C implies the effect E with probability Q(W). In the presence of this regularity, the probability of D occurring is no longer 2**[-I(D)], but is Q(W) 2**[I(E)-I(D)], which may be much larger if I(E) is not small.
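A small worked example of these two probabilities, using illustrative numbers that are assumptions of this sketch (a 30-bit event, a 10-bit effect branch, and Q(W) = 0.5):

```python
def prob_random(info_d):
    """Probability of an event with I(D) = info_d bits, with no other knowledge."""
    return 2.0 ** -info_d


def prob_with_regularity(q_w, info_e, info_d):
    """Probability of the same event when regularity W explains its effect branch."""
    return q_w * 2.0 ** (info_e - info_d)


# Illustrative values only.
info_d, info_e, q_w = 30, 10, 0.5

p0 = prob_random(info_d)
p1 = prob_with_regularity(q_w, info_e, info_d)
print(p0, p1, p1 / p0)
```

The ratio of the two is Q(W) 2**[I(E)], which here is 512: the regularity makes the event several hundred times more probable because 10 of its 30 bits no longer have to be accounted for by chance.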

The learning problem is to infer from a sequence of feature structures [D1 ... DM] what causal regularities W hold over the sequence. In the case of language, each Di consists initially of the word sounds of a sentence (the left end of a derivation Di) and its meaning, inferred by non-linguistic means (the right end of Di). For a word W of category (X/Y), the 'cause' branch includes the word sound and Y, while the 'effect' branch is X. The causal probability Q(W) can, for instance, model the frequencies of different word senses.

In the absence of any restrictions on the W, it is possible to 'overfit' the data Di by postulating a special regularity Wi just to account for each Di. This would just learn a set of sentence-meaning pairs, rather than a language. Bayesian learning theory avoids this overfitting of the data by assuming that each possible set of regularities {W1 ... WK} has a prior probability P({W1 ... WK}), representing the probability of just that set of regularities holding in the learner's environment. To make the infinite sum of these prior probabilities convergent, we need to penalise the more complex regularities WK. A simple form which accomplishes this is P({W1 ... WK}) = 2**[- sigmaK lambda I(WK)]; lambda = 2 is sufficient to ensure convergence of the sum over all possible 'languages' {W1 ... WK}. This prior probability is equivalent to a 'minimum description length' learning bias in the hypothesis space.

We can now apply Bayes' theorem to the learning of each possible word W in the language. There are two possible states - 'W holds' and 'W does not hold' (written as 'not W'). In the absence of any learning data Di, their prior probabilities, written as PA, are PA(W holds) = 2**[ - lambda*I(W)], and (approximately) PA(not W) = 1.

Suppose that amongst the M learning examples [D1 ... DM] there are N cases where the regularity W holds - that is, where Di > W. Bayes' theorem then gives:

P(not W|[D1 ... DM]) = c PA(not W) P([D1 ... DM]| not W)
= c 2**[ - sigmaM I(DM)].

P(W holds|[D1 ... DM]) = c PA(W holds) P([D1 ... DM]|W)
= c 2**[ - lambda*I(W)] [Q(W)]**N 2**[ N*I(E) - sigmaM I(DM)].

The regularity W is learnt when the assumption that it holds gives the most likely account of the data - that is, when P(W holds|D) exceeds P(not W|D). This occurs approximately when N I(E) > lambda I(W). In typical cases where lambda = 2 and I(E) = I(W)/3 this requires N > 6; the word can be learnt from around six positive examples. From that point onwards, the initial log-likelihood penalty lambda*I(W) from assuming W holds is more than compensated by higher log-likelihoods of the Di which obey Di > W - those Di which W helps to explain.
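The threshold can be checked numerically with the sketch below. It compares the two posteriors in log base 2, where the common factors c and 2**[ - sigmaM I(DM)] cancel, leaving N*(I(E) + log2 Q(W)) - lambda*I(W). The function names and the example values (I(W) = 30 bits, I(E) = 10 bits) are assumptions chosen to match the "typical case" I(E) = I(W)/3.

```python
import math


def log2_odds(n, info_w, info_e, q_w=1.0, lam=2.0):
    """log2 of P(W holds | data) / P(not W | data) after n positive examples."""
    return n * (info_e + math.log2(q_w)) - lam * info_w


def examples_needed(info_w, info_e, q_w=1.0, lam=2.0):
    """Smallest n for which 'W holds' becomes the more likely account."""
    gain = info_e + math.log2(q_w)  # bits gained per positive example
    if gain <= 0:
        return None  # the regularity never pays for its prior penalty
    return math.floor(lam * info_w / gain) + 1


# Illustrative values with I(E) = I(W)/3 and lambda = 2.
print(log2_odds(6, 30, 10))              # exactly balanced at N = 6
print(examples_needed(30, 10))           # first N where W is preferred
print(examples_needed(30, 10, q_w=0.5))  # with a less reliable regularity
```

With Q(W) close to 1 the odds are exactly even at N = 6 and tip in favour of W from the next positive example, in line with the estimate of around six examples. A Q(W) below 1 reduces the gain per example by log2 Q(W) bits, a correction the approximate condition N I(E) > lambda I(W) leaves out.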

Note that learning depends only on the number of positive examples accumulated in the evidence [D1 ... DM] where Di > W. There can be an unlimited number of other examples where the child mishears the word sound, or mis-construes the intended meaning; but learning will still occur as long as six positive examples are found. Bayesian learning is very robust against misleading evidence.

The algorithm for detecting new words W involves forming pairwise generalisations Di ^ Dj of the derivations, and testing if these generalisations have significant information content. If there are many derivations, this might seem computationally prohibitive. However, if we assume an efficient associative retrieval of derivations according to the word sounds they contain, we need never try to pair two derivations unless they have an interesting word sound in common; this avoids a square-law expansion of generalisations.
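The retrieval idea can be sketched with an inverted index from word sounds to the derivations that contain them, so that only derivations sharing a word sound are ever paired. The data layout (each derivation reduced to its list of word sounds) and the `skip` parameter for uninteresting sounds, such as words already known, are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(sentences, skip=frozenset()):
    """Return index pairs (i, j), i < j, of sentences sharing a word sound.

    `sentences` is a list of word-sound lists; sounds in `skip` are
    treated as uninteresting and never trigger a pairing.
    """
    index = defaultdict(list)  # word sound -> indices of sentences containing it
    for i, words in enumerate(sentences):
        for sound in set(words) - set(skip):
            index[sound].append(i)

    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))
    return pairs


sentences = [
    ["john", "walks"],
    ["fred", "walks"],
    ["mary", "sings"],
    ["john", "sings"],
]

print(sorted(candidate_pairs(sentences)))
print(sorted(candidate_pairs(sentences, skip={"john"})))
```

Of the six possible pairs of these four sentences, only three share a word sound, and only two remain once 'john' is treated as already known. Pairing is still quadratic within the bucket of a single very frequent word sound, so a fuller implementation would presumably cap or sample such buckets.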

Learning theories based on discrete mathematics have led to very pessimistic conclusions (Gold etc.). If language learning is characterised as a problem of narrowing down from a set of possible languages to one actual language with 100% certainty, then either the set must be highly restricted or the learning data set must be immense. Even Valiant's 'probably approximately correct' (pac-learning) framework requires very large learning data sets. However, if we address instead the problem of learning the most likely possible language given the data, then the Bayesian learning theory gives rapid and robust language learning, as we observe in children.

5. Further Implications and Conclusions


This radically lexical form of categorial grammar, with its learning theory, has further implications which can only be briefly summarised here:


Categorial grammar is arguably the most constrained and predictive theory of language today, accounting for many otherwise puzzling phenomena in an economical and unforced manner. However, it has until now lacked a theory of language learning - an important omission, given the necessity of language learning before language use.

This theory of learning uses one core mechanism, of feature structure generalisation, to learn the whole of any language - all parts of speech, syntax, semantics, phonology, lexical and morphological rules. It has been shown to work, in a computer implementation. This simple, uniform mechanism is in keeping with the economy of categorial grammars, and is in broad agreement with the known features of language learning, such as:

* ability to learn all parts of speech in diverse languages
* rapid, robust learning from noisy data
* learning of regularities after learning individual words
* learning exceptions without explicit negative evidence

Much remains to be done in consolidating the formal base of the theory and making comparisons with learning data. However, I believe that the existence of a working theory of language learning strengthens the position of categorial grammar (and more broadly, of unification-based grammars) as the leading theory of grammar for natural language.

I thank Ted Briscoe for comments and discussion on an earlier draft.
