Are 52% of words really not included in dictionaries?
In 2011, a remarkable article appeared in the journal Science that argued, based on a computational analysis of five million books, that “52 percent of the English lexicon—the majority of the words used in English books—consists of lexical ‘dark matter’ undocumented in standard references”.
Taken at face value, this might seem like an astonishing claim. Fifty-two percent of words in English written usage don’t appear in dictionaries! Take that, prescriptivists and lexicographers! We are the 52 percent! In this post, I would like to contextualize the article’s findings by taking into account a factor that most of the journalistic coverage written when the article appeared did not: namely, the presence of derivatives in the ‘dark matter’ lexicon. First, however, I would like to underline the meaning of the 52 percent by discussing the article’s stress on the word lexicon, especially in the context of a phenomenon (which the authors cite) called Zipf’s Law.
Lexicon vs word count
The lexicon of a corpus is different from the word count of the corpus. A lexicon is a list of all of the different words that appear in a corpus: for example, if we used as a corpus a book about dogs, the lexicon might include the words dog, hound, puppy, spaniel, and so forth. The word count of a corpus is a list of all of the words in that corpus, ranked by the number of times they appear: for example, the word dog might appear (let’s say) 100 times, followed by the word hound (twenty times), followed by the word puppy (five times), and so forth.
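To make the distinction concrete, here is a minimal Python sketch. The miniature corpus and the `tokenize` helper are my own stand-ins for illustration, not anything used in the Science article; real corpus work needs far more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical miniature corpus standing in for "a book about dogs".
corpus = "The dog chased the hound. The puppy watched the dog. A dog barked."

tokens = tokenize(corpus)

# Word count: every word paired with the number of times it appears.
word_count = Counter(tokens)

# Lexicon: every distinct word listed exactly once, however often it occurs.
lexicon = sorted(word_count)

print("running words:", len(tokens))
print("distinct words:", len(lexicon))
print("most frequent:", word_count.most_common(2))
```

The lexicon treats dog and hound as equals, one entry each; only the word count records that dog turns up three times as often in this toy text.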
So far as the Science article is concerned, the ‘dark matter’ that comprises 52 percent of the lexicon is interesting not because the words grouped under this description are common, but because they are rare. In a list of all of the different words that have appeared in print in English, considered without regard to their popularity—every word appears just once in the list, regardless of whether it is dog or bichon frise—52 percent of the words do not appear in standard dictionaries. Of that 52 percent, the vast majority are words that have appeared in print just once or a few times.
If we were, instead, to take into account the popularity of words (allowing the word dog to appear in the list millions of times), we would find that the ‘dark matter’ group is positively dwarfed by the group of words that appear in standard dictionaries. This is what Zipf’s law, a distributional regularity that was first discovered through corpus analysis, affirms in a linguistic context: if you graph the rank of each word in a corpus (10 if a word is the 10th most common, for example) against the number of times that the word appears in the corpus, the two values will be approximately inversely proportional. That is, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third, and so on down the list, so that a small number of very common words accounts for most of the running text while the long tail of rare words accounts for very little.
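A rough way to see how lopsided this is: the sketch below builds an idealized Zipf distribution and asks what share of all word occurrences comes from the rarest 52 percent of distinct words. Everything here is an assumption made for illustration: the perfectly Zipfian frequencies, the lexicon size of 100,000, and the simplification that the ‘dark matter’ is exactly the rarest 52 percent of the lexicon. None of these figures comes from the article.

```python
def zipf_weights(n_words):
    """Idealized Zipf weights: the word of rank r gets weight 1/r."""
    return [1.0 / rank for rank in range(1, n_words + 1)]

def rare_tail_share(n_words, tail_fraction):
    """Share of all occurrences contributed by the rarest
    `tail_fraction` of distinct words, under ideal Zipf weights."""
    weights = zipf_weights(n_words)
    tail_size = round(n_words * tail_fraction)
    tail = weights[n_words - tail_size:]   # the lowest-frequency ranks
    return sum(tail) / sum(weights)

weights = zipf_weights(1000)
# Under the ideal law, rank 1 is twice as frequent as rank 2.
print("rank 1 vs rank 2:", weights[0] / weights[1])

# Assumed lexicon size and tail fraction (illustrative only).
share = rare_tail_share(100_000, 0.52)
print(f"share of occurrences from the rarest 52% of words: {share:.1%}")
```

In this toy model the rarest 52 percent of the lexicon supplies only about 6 percent of the running words. The real figure for the five-million-book corpus would depend on the actual frequencies, which this sketch does not attempt to reproduce.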
Derivatives: to include or not include?
Even so, there is reason to suspect that the article’s analysis could be further improved. The authors seem to include derivatives in their ‘dark matter’ list, which may obscure the actual proportion of undocumented words in the English lexicon. Derivatives are words that reflect various inflections added to core words, or lexemes: for example, gloom is a lexeme, whereas gloomy and gloomier are derivatives. Derivatives are not documented in standard dictionaries in the same way that lexemes are, but they are also not linguistic inventions on the level of, say, the word jabberwocky.
Including derivatives in the count is why Shakespeare has sometimes been credited with inventing many thousands of words. As Ward E.Y. Elliott and Robert J. Valenza, for example, have recently discussed, many of the words in Shakespeare’s supposed corpus of inventions already existed; he is merely the first writer of record to have used them in a form that incorporates an ending like –ed, –less, or –ment, or a beginning like un–, well–, or dis–. (For example, he has been credited with misplaced, although he did not coin misplace.) In recent years, language historians have sometimes advocated focusing, in cases like this, on what Elliott and Valenza call ‘lexeme’ coinages: coinages that produce new lexemes rather than merely new derivatives of existing lexemes. Of course, even with this correction, Shakespeare can still be credited with adding many hundreds of lexeme coinages to the English language, which is nothing to sneeze at.
The Science article seems to include derivatives in its pool of words that do not “appear in any dictionary”. The authors mention, for example, the words deletable and aridification as examples of words that fail to appear in the dictionaries they consulted (the 2002 Webster’s Third New International Dictionary; the 2000 American Heritage Dictionary of the English Language; the 1993 Oxford English Dictionary) and therefore count under their rubric. But these words represent regular derivatives of the words delete and aridify, which do appear in the dictionary. (This issue is further complicated by the fact that these words currently appear on the OED website. The article refers to print dictionaries only—in particular, the three above, which were published in the early years of the Web.)
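One crude way to separate regular derivatives from genuinely undocumented words would be to undo a common ending and check whether the result is a dictionary headword. The sketch below does this with a handful of hand-picked rewrite rules and a four-word stand-in for a dictionary. Both are hypothetical and chosen only to cover the examples in this post; this is not the method used by the article’s authors, and a serious attempt would need a proper morphological analyzer.

```python
# Hypothetical rewrite rules: (ending to strip, text to put back).
SUFFIX_RULES = [
    ("ification", "ify"),  # aridification -> aridify
    ("able", "e"),         # deletable -> delete
    ("able", ""),          # readable -> read
    ("ed", "e"),           # misplaced -> misplace
    ("ed", ""),            # walked -> walk
    ("less", ""),          # hopeless -> hope
    ("ment", ""),          # amazement -> amaze would need ("ment", "") on "amaze"
    ("y", ""),             # gloomy -> gloom
]

def candidate_bases(word):
    """Yield every form obtained by undoing exactly one suffix rule."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: -len(suffix)] + replacement

def is_regular_derivative(word, headwords):
    """True if `word` is not itself a headword but one rule maps it to one."""
    if word in headwords:
        return False
    return any(base in headwords for base in candidate_bases(word))

# Tiny stand-in for a dictionary's headword list (illustrative only).
headwords = {"delete", "aridify", "misplace", "gloom"}

for word in ["deletable", "aridification", "misplaced", "gloomy", "jabberwocky"]:
    print(word, is_regular_derivative(word, headwords))
```

With these rules, deletable and aridification are flagged as derivatives of documented words, while jabberwocky is not, which is the distinction the 52 percent figure appears to blur.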
Why are some words ‘missing’?
The authors argue that many of the words in their dark-matter list fail to appear in the dictionary simply because dictionaries must omit many rare and obsolete words for reasons of space. (Dictionaries also usually avoid including many proper nouns, which the authors therefore omitted from their count.) Thus the word diestock, which was once in more common use (it appears in the OED) but is now obsolete, appears in the dark-matter list because it does not appear in many dictionaries. The word netiquette also appears in the dark-matter list; in 2011, many dictionaries still omitted it, presumably because it was rare or because they were waiting to see whether it was merely passing slang. Overall, the authors found that the rarer a word is in the corpus of five million books, the less likely it is to have a listing in the dictionary.
But the inclusion of derivatives alongside, for example, obsolete words, rare compound words, whimsical one-offs, and trendy new words like netiquette may obscure our picture of the state of our communication. A print dictionary does not omit the word deletable specifically because of its rarity, but rather because, for reasons of space, dictionaries usually keep the list of derivatives for a given verb very short: delete, deleted, deleting, or, for some words, a few unusual conjugations. A search for the word deletes on the OED website pulls up the entry for delete, but the word deletes, which is common enough, does not itself have an entry in the OED.
Still, even if full-blown linguistic inventions are rarer than this article may suggest, doing new things with words adds color and whimsy to our lives. Even derivatives can suddenly make a word that seemed boring into a new toy bristling with funny connector pieces. By way of celebrating coinages of every shade, I’d like to end with a list of some of the ‘dark matter’ words I’ve heard in the past few months, which get the point across even if they may never appear in the dictionary: powerpointable, invisibilized, crankypants, unsharkulated, unregulable, dragonization, and bookified.