Kernel vocabulary and Zipf's Law in maternal input to syntactic development
The Hebrew University
Boston University Conference on Language Development,
November 4-6, 2005
Power-law distribution of form-class items
It is a very old finding, with the status of a universal law, that if we take a very large text and rank the words in order of their frequency of use, we get a severely skewed distribution, with a few very frequent items and very many infrequent ones. Formally, when the frequency of words is plotted as a function of their rank-order in the vocabulary, the resultant graph follows a power law. This is the Zipf-Mandelbrot law (Mandelbrot, 1966; Zipf, 1935/1965), which in fact applies to all sorts of large sets of items forming a complex system, but which was established very early on for the behavior of words used in a text.
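As a minimal illustration (a sketch, not the study's procedure), the rank-frequency profile behind Zipf's law can be computed from any tokenized text; the toy sentence below is invented for the example:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word first."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Toy text: a few words dominate, as Zipf's law predicts
text = ("the cat sat on the mat and the dog saw the cat "
        "and the cat ran").split()
for rank, freq in rank_frequency(text)[:3]:
    print(rank, freq)  # prints 1 5, then 2 3, then 3 2
```

On a real corpus, plotting these pairs on log-log axes should yield the roughly straight line characteristic of a power law.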
In the reported study we tested the hypothesis that the same would be true of verbs appearing in a specific type of syntactic word combination, namely, verbs belonging to a specific form-class. A form-class is a subgroup of the total lexicon with similar grammatical behaviour. Such a group of items is a priori quite homogeneous, as its items are selected for possessing identical syntax, so it is not trivial to expect that they, too, would show a severely skewed, power-law distribution.
Let us look first at the results, and then consider what they can teach us about language and language acquisition.
We were in particular interested in the global statistical features of the so-called "motherese" speech register, namely, the speech of mothers addressed to young children, which is the input to the acquisition process. Previous studies showed that maternal input has a skewed distribution, with some verbs accounting for a very high percentage of all utterance tokens in various syntactic patterns. For example, Goldberg, Casenhiser & Sethuraman (2004) analyzed English-speaking mothers' utterance tokens in a large corpus and concluded that one specific verb accounts for a large percentage of utterance tokens in different argument structure constructions, among them the Subject-Verb-Oblique intransitive and the Subject-Verb-Object-Object2 ditransitive frames. Similar skewed distributions in English maternal speech samples were reported by Naigles & Hoff-Ginsberg (1995), Sethuraman & Goodman (2004) and Theakston, Lieven, Pine & Rowland (2004), and for Hebrew maternal samples by Ninio (1999a, 1999b).
Previous studies, however, did not go beyond estimating the relative frequency of a few items from the total distribution. In the reported study, we plotted the whole range of items used by Hebrew-speaking mothers in one specific syntactic construction, and tested the hypothesis that the distribution would follow a power law.
The focus of the study is maternal input sentences containing a verb or adjective followed by an indirect object with a le- preposition. Like the English 'to', the Hebrew le- has a homophone used as an adverb of direction; sentences with the adverbial use were not included in the study.
The maternal speech sample used in the study was the pooled corpus of 48 Hebrew-speaking mothers talking with their young children, who were between 0;10 and 2;8 at the time of the observations. The speech samples were taken from a videotaped observational study (Ninio, 1984). There are more than 56,000 multiword utterances in the pooled corpus. The corpus was searched for sentences in which there was a verb or adjective followed by an indirect object with a le- preposition. Overall,
there were 6956 utterances of this kind, representing 230 different verbs and adjectives.
Figure 1 presents the rank/frequency distribution of all the verbs appearing with an indirect object in the pooled corpus of 48 mothers.
Figure 1. Rank/frequency of maternal VI sentences with fitted power-law Zipf curve (fitted curve y = 5589.8x^-1.6199, R² = 0.9804; X axis: rank order of verb, Y axis: token frequency).
We can observe a very clear power-law distribution across the 230 verbs; the fit of the power-law curve is excellent (R² = 0.98). We may conclude that the use frequency of verbs participating in the indirect-object construction follows a typical Zipf distribution.
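This kind of fit can be sketched as follows: a power law y = a·x^b becomes a straight line in log-log space, so its parameters and R² can be estimated by ordinary least squares on the logs. This is an illustrative stand-in for whatever curve-fitting software produced the figures, not the study's actual code:

```python
import math

def fit_power_law(freqs):
    """Fit f(r) = a * r**b by least squares in log-log space.

    freqs: token frequencies ordered by rank (rank 1 first).
    Returns (a, b, r_squared), with r_squared computed on the logs."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope = power-law exponent
    loga = my - b * mx                  # intercept = log of the constant
    ss_res = sum((y - (loga + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return math.exp(loga), b, 1 - ss_res / ss_tot

# Sanity check on exact power-law data: the exponent is recovered
freqs = [1000 * r ** -1.5 for r in range(1, 51)]
a, b, r2 = fit_power_law(freqs)
print(round(a), round(b, 3), round(r2, 3))  # prints 1000 -1.5 1.0
```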
There is another way to present the same statistical feature: the Pareto presentation. Mathematically the two are transformations of each other; graphically, the axes are interchanged.
Figure 2. Cumulative probability of maternal VI sentences having larger token frequency than X, with fitted Pareto function (fitted line y = -0.6665x + 0.0411, R² = 0.9776; X axis: token frequency (log), Y axis: cumulative probability > X).
On the X axis we plot the token frequency of items ordered by rank, and it increases as we get higher values; on the Y axis we plot the cumulative probability of an item having equal or larger frequency than X. For the lowest value, the probability is 100%, and it decreases as the frequencies get higher.
The cumulative probabilities also distribute under a power-law function, but Figure 2 presents them on a log-log plot, in which the power-law distribution shows up as a straight line.
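The Pareto presentation can be sketched the same way: for each token frequency x, compute the proportion of items whose frequency is at least x. The verb counts below are hypothetical, not the study's data:

```python
def pareto_points(freqs):
    """Empirical complementary CDF: for each distinct token frequency x,
    the probability that a randomly chosen type has frequency >= x."""
    n = len(freqs)
    points = []
    for x in sorted(set(freqs)):
        points.append((x, sum(1 for f in freqs if f >= x) / n))
    return points

# Hypothetical token counts for 10 verb types
counts = [50, 20, 20, 10, 5, 5, 5, 1, 1, 1]
for x, p in pareto_points(counts):
    print(x, p)
```

The lowest frequency always gets probability 1.0, and the probability falls as frequencies rise; plotting both coordinates on log scales gives the straight line of Figure 2.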
The fit is as good as in the Zipf presentation we saw before, namely R² = 0.98. However, in the Pareto graph we can see very clearly that the fit of the curve is poorer at the highest values.
Zipf himself pointed out that Zipf's Law does not apply to the very frequent items at the start of the distribution. These exceptions to Zipf's Law can be exploited for a closer examination of the maternal input. Ferrer i Cancho and Solé (2001) and Montemurro (2001) systematically explored the exceptions in general corpora of English texts and found that the total vocabulary can be divided into two or more power-law registers, differing in their mathematical distribution as well as in their content. They showed that regardless of sample size, there is a set of very frequent items with a less steep decay exponent in the rank-frequency power-law distribution than the other items. The crucial finding was that the quantitative difference is also a qualitative one. The registers represent two kinds of vocabulary items: the very frequent items belong to a basic vocabulary, while the less frequent items are specific words. Their estimate of the basic vocabulary of English is about 4,000-6,000 words.
We wanted to use this method to identify the nuclear items of verb sub-categories.
The idea that classes of verbs have nuclear items comes from Dixon (1982:121-125), who claims that, universally, all open classes in the lexicon of natural languages contain subsets, each of which consists of one or very few generic items and a larger set of more specific items which are their almost-hyponyms. The generics share the semantics and syntax of their more specific relatives; however, they have more extensive and more varied semantic and syntactic options than the more specific items.
This analysis was repeated on the indirect-object construction in Hebrew maternal speech.
Figure 3 presents the Pareto graph, but this time the set has been cut into 5 domains of 10 values each.
All domains but the first had the same distribution; the first 10 items were different.
Figure 3. Pareto graph: cumulative probability of maternal VI sentences having larger token frequency than X, for 6 regimes (separate linear fits shown for the first 10 items, items 11-20, 21-30, 31-40, etc.; X axis: token frequency (log), Y axis: cumulative probability > X).
We may conclude that there are two different domains, shown in Figure 4:
Figure 4. Cumulative probability of maternal VI sentences having larger token frequency than X, with fitted Pareto functions for 2 regimes (log-log plot). Fitted lines: y = -0.6432x + 0.0088 (R² = 0.9954) and y = -1.2693x + 1.6393 (R² = 0.9793). X axis: token frequency (log); Y axis: cumulative probability > X.
The 10 most frequent items -- all verbs -- form a separate register with a different power-law exponent, whereas the less frequent items all share a single exponent. We can switch back to a Zipf presentation:
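The two-regime structure can be sketched by fitting the head and the tail of the rank-frequency curve separately in log-log space. The synthetic data below only mimics the shape reported in Figure 5; the exponents and constants are illustrative, not the study's data:

```python
import math

def loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) vs log(rank): the Zipf exponent."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic frequencies: a shallow head (first 10 ranks) and a steeper tail
head = [1400 * r ** -0.77 for r in range(1, 11)]
tail = [7600 * r ** -1.69 for r in range(11, 231)]
split = 10
b_head = loglog_slope(range(1, split + 1), head)
b_tail = loglog_slope(range(split + 1, 231), tail)
print(round(b_head, 2), round(b_tail, 2))  # prints -0.77 -1.69
```

Fitting the two pieces separately recovers a distinct exponent for each regime, which is what the two fitted curves in Figure 5 display.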
Figure 5. Rank/frequency of maternal VI sentences with fitted power-law Zipf curves for 2 regimes. Fitted curves: y = 7610.5x^-1.6861 (R² = 0.9795) and y = 1384.8x^-0.7715 (R² = 0.9793). X axis: rank order of verb; Y axis: token frequency.
We see that the head of the total Zipf curve is in fact a separate power-law regime. The curves of both regimes fit very well.
The quantitative analysis thus identifies 10 verbs as the kernel vocabulary for this construction in the
maternal input. So what are these verbs and why are there 10 of them?
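One way such a regime boundary could be located automatically (a sketch, not necessarily the procedure used in the study) is to scan candidate split ranks and keep the one whose two separate log-log regressions fit best:

```python
import math

def linfit_r2(xs, ys):
    """R^2 of a least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def best_split(freqs, lo=5, hi=30):
    """Split rank whose two log-log regressions fit best on average."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    def score(k):
        return (linfit_r2(xs[:k], ys[:k]) + linfit_r2(xs[k:], ys[k:])) / 2
    return max(range(lo, hi), key=score)

# Synthetic two-regime data with a break after rank 10
freqs = [1400 * r ** -0.77 for r in range(1, 11)]
freqs += [7600 * r ** -1.69 for r in range(11, 231)]
print(best_split(freqs))  # prints 10
```

On data with a genuine break, the scan recovers the boundary; on the maternal corpus such a procedure would single out the 10-verb kernel.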
Table 1 presents the 10 most frequent verbs in the VI pattern:
Table 1. First 10 most frequent verbs in the VI pattern:
Hebrew verb Translation equivalent
ASA 'make, do'
Two phenomena are evident: first, all the verbs are very general. Second, they cover a wide variety of semantics. The verbs belong to at least 6 different semantic fields:
Transfer of objects (give, bring, put); Possession (have); Creation of effected object (make/do);
Transfer of information (show, tell story, say); Provision of service (help); and Reference.
As an example, Table 2 presents all the verbs that appeared in the maternal
sentences with a VI pattern, in the semantic field of "Provide a Service".
Table 2. Verbs in the semantic field "PROVIDE A SERVICE"
change (diaper) clean
cut/trim nails dress (tr)
The generic verb HELP was among the 10 kernel items; it appeared in sentences like "Let me help you". The other 32 verbs refer to more specific services that can be provided, as in "Do you want me to tie it for you?". In Hebrew, both types have an identical indirect-object manifestation.
Similar analyses can be made for the other kernel verbs; each is a generic verb for one of the semantic subsets of the verbs that occur in the VI pattern.
The most important conclusion for language development is that the kernel vocabulary for the verb/adjective-indirect-object construction consists of a group of verbs with varied semantics. There is no single syntax-semantics mapping in this grammatical pattern in Hebrew; in fact there are at least 8 different semantic categories of verbs taking an indirect object in Hebrew. However, each slice of the syntactic pattern is covered by one or more frequently used, generic verbs of its own. The whole category repeats the generic-specific structure of the total vocabulary in a fractal manner. Namely, we should be open to the possibility that syntax-semantics linking is a multiple phenomenon even within a single syntactic pattern, and that each of these slices of grammar makes some use of kernel items which are presented to children with extremely high frequency in the input.
References
Ferrer i Cancho, R., & Solé, R. V. (2001). Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited. Journal of Quantitative Linguistics, 8, 165-173.
Goldberg, A. E., Casenhiser, D. & Sethuraman, N. (2004). Learning argument structure
generalizations.Cognitive Linguistics, 15, 289-316.
Mandelbrot, B. (1966). Information theory and psycholinguistics: A theory of word frequencies. In P. Lazarsfeld & N. Henry (Eds.), Readings in mathematical social science. Cambridge, MA: MIT Press.
Montemurro, M. A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A, 300, 567-578.
Naigles, L. & Hoff-Ginsberg, E. (1995). Input to verb learning: evidence for the plausibility of
syntactic bootstrapping. Developmental Psychology, 31, 827-837.
Ninio, A. (1999a). Model learning in syntactic development: intransitive verbs. International Journal of Bilingualism, 3, 111-131.
Ninio, A. (1999b). Pathbreaking verbs in syntactic development and the question of prototypical transitivity. Journal of Child Language, 26, 619-653.