Language Data

Modern language technology is mostly based – one way or another – on “big data”. Each language has been around for a long time, and some have been used across a very wide range of communities. Modern digital storage makes it easy to bring together the record of a language’s past use and to look for patterns in it.

These patterns may reflect the effects of grammatical rules, the traditional way of understanding language structure; but they may reveal other regularities too, and in a direct, “in your face” way. At last, it seems, we have some objective evidence of language structure. We have no choice but to accept the repeated collocations, substitution equivalences, and gaps that emerge. And, reassuringly for those disappointed by twentieth-century linguistics, there is no need to appeal to the personal intuitions of native speakers – or worse still, of theoretical linguists themselves.

The revulsion against cognitive methods and elicited text has been so great that some have come to believe in the “unreasonable effectiveness of data”, seeing computer-accessible databases of language records as playing a role analogous to that of mathematics in the natural sciences. Electronic pattern recognition, it seems, gives us the means to find structure in the vast hinterland of a language’s back catalogue.

The impression is beginning to creep in that computational techniques can beat humanity at its own game, namely the correct and meaningful deployment of human language. Oh, and if performance is still a little substandard, as native speakers judge: that is only because the exceptions are too rare, or conditioned at too long a distance, for their causes to show up as yet in the amount of data that has been collected and the processing to which it is submitted. If billions of words do not suffice, just wait until the trillions… or the decillions come into play. Have patience, therefore, until the triumph of Moore’s Law and pattern-matching is fulfilled.

In a way, this apparent effectiveness of past data in revealing patterns is unsurprising. How, after all, do people learn their languages but through being exposed to others’ usage? Mostly they become comprehending listeners, and active users, without any attempt by parents or teachers actually to instruct them.

Nevertheless, there is something unsatisfying in the radical assumption that all the information needed is in the past record. Language, in use, is not just a matter of recalling what one has heard. It is a productive skill. We understand new messages by re-configuring memories, actively putting together fragments of past language experience. We produce new utterances likewise. Somehow we are able, not just to mimic past utterances, but to innovate. We apply patterns actively as rules. We speculate about the limits of what is possible, and then we go ahead and explore it.

Clearly, memory – fed by teachers who enlarge our experience of past practice – plays a large part in what we call culture, and education. Much of our formal learning is made up essentially of repetitions: nursery rhymes, songs and poems are learnt as complete structures to be repeated, and so are quotations, even large-scale recitations. In the learning of languages, this is the kind of contribution made by dictionaries and phrase-books.

But there are also items which are learnt not for use ready-made, but rather as abstract recipes: patterns which can be applied either to organize other items, or to indicate their role in larger structures. These range from phonotactic principles for the structure of words (e.g. in English, the sound written ng can only occur at the end of a syllable) and grammar rules (e.g. inverting the subject and an auxiliary verb can indicate a question) to systems of morphology, such as the principles of conjugation and declension in Latin; and even rhymes predicting the gender of nouns:

“To nouns that cannot be declined | The neuter gender is assigned…”
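The two English examples above – the phonotactic restriction on ng and subject–auxiliary inversion – can be sketched as abstract recipes in code. This is a hypothetical toy illustration, not anything from the text; the inversion function assumes, for simplicity, a one-word subject followed directly by its auxiliary.

```python
# Toy sketches (illustrative only) of patterns applied as productive
# rules rather than as memorised, ready-made items.

def violates_ng_onset(word):
    """Phonotactic check: in English the sound written 'ng' (/ng/)
    can only end a syllable, so no English word begins with 'ng'."""
    return word.lower().startswith("ng")

def invert_to_question(sentence):
    """Grammar rule: invert subject and auxiliary to form a yes/no
    question. Toy assumption: a one-word subject immediately followed
    by the auxiliary, as in 'She can swim.'"""
    words = sentence.rstrip(".").split()
    subject, aux, rest = words[0], words[1], words[2:]
    return " ".join([aux.capitalize(), subject.lower()] + rest) + "?"

print(violates_ng_onset("ngombe"))          # True: an un-English onset
print(invert_to_question("She can swim."))  # Can she swim?
```

The point of the sketch is that neither function stores any sentence: each encodes a recipe that applies to inputs it has never seen.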

It is principles like these which may be applied dynamically to produce more of a language. Although there will be such principles in a dead language (such as Old English), they exist only historically. The corpus of a dead language is now closed – unless it should be revived. But by definition, a living language is open-ended: its principles apply productively, even innovatively, and are known (usually only implicitly) to all those who are competent in the language.

The fault in using language data as an implicit definition of a language is that the data cannot tell us which principles are dynamic: and so this approach misses the distinction between a dead and a living language. At any one time, a corpus contains just the sentences that it does: and so it might as well represent a language that will never produce any more data.

Of course, it is possible to derive rules, statistically or stochastically, which will be compatible with the set of sentences in a corpus. Such rules might be taken as a simple substitute for a grammar of the language. These may, or may not, correspond to dynamic principles used by speakers in actually using the language. But as pointed out by Wittgenstein, a series of items, however long, does not determine the choice among the possible rules that may have generated the series; and as pointed out by Quine (in his Thesis of the Indeterminacy of Translation), an equivalence between sentences in two languages, however extended, will never fully determine the principles needed to interpret one language in terms of another. Hence, however useful and practical language data, and rules derived from them, may be as approximations of a living language, they can never ultimately pin it down.
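What such statistically derived rules look like can be sketched with a minimal bigram model – hypothetical code over a made-up three-sentence corpus, not anything from the text. It learns which word may follow which, and can then apply those “rules” productively, generating sentences the corpus never contained:

```python
import random
from collections import defaultdict

# A made-up miniature corpus, purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]

# Derive bigram "rules": which word may follow which,
# with markers for sentence start (<s>) and end (</s>).
follows = defaultdict(list)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for a, b in zip(words, words[1:]):
        follows[a].append(b)

def generate(rng, max_len=20):
    """Apply the derived rules productively; the length cap
    guarantees the generator halts."""
    word, out = "<s>", []
    for _ in range(max_len):
        word = rng.choice(follows[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

# May yield a sentence absent from the corpus,
# such as "the dog sat on the mat".
print(generate(random.Random(0)))
```

Note that any number of different rule systems – trigram models, probabilistic grammars, neural models – would be equally compatible with these few sentences, which is exactly the underdetermination that Wittgenstein and Quine describe.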

They will not generate rules to interpret sentences (as an actual user of a language must); nor will they produce a theory of rhetoric – of how to achieve effects with the language. They cannot progress from rules for the incidence of words to the properties of a semantic model: the picture of the world that the language user has in mind. Techniques for expressing anything outside language, or for communication between language users, remain a mystery.