Text analysis tracks 200 years of language and societal trends
There’s a handy new way for researchers and the just plain curious to track how many words get added to the English language each year and the speed at which celebrities flame out of public consciousness: Google them.
The dominant search engine’s digital archive of books from around the world offers a vast new resource for investigating vocabulary and grammar changes, the rate at which new technologies get adopted, collective memory for major events and the changing nature of fame, to name a few research topics, according to a report published online December 16 in Science.
A team led by biologist Jean-Baptiste Michel and bioengineer Erez Lieberman-Aiden, both of Harvard, tracked the frequency with which various words appeared in nearly 5.2 million digitized books published between 1800 and 2000. That works out to about 4 percent of all books ever published, and roughly one-third of Google’s digital archive.
Michel, Lieberman-Aiden and their colleagues, including researchers at Google, Encyclopedia Britannica and the American Heritage Dictionary, refer to their mathematical analysis of texts over time as culturomics.
“The sheer scale of this research — 500 billion words traced over two centuries — takes the breath away,” comments Harvard cultural historian Robert Darnton, who was not involved in the project. “The first results point the way toward a rigorous, quantitative, historical linguistics.”
In one part of its analysis, the culturomics team estimates that about 8,500 new words annually entered the English language between 1950 and 2000. That process fueled a 70 percent growth in the number of English words, from 597,000 to 1,022,000.
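The figures in that estimate can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers quoted above; the variable names are illustrative and not taken from the study.

```python
# Vocabulary sizes quoted in the article for 1950 and 2000.
words_1950 = 597_000
words_2000 = 1_022_000

# Total words added over the 50-year span.
added = words_2000 - words_1950

# Average number of new words per year.
per_year = added / 50

# Percentage growth relative to the 1950 vocabulary.
growth_pct = 100 * added / words_1950

print(added, per_year, round(growth_pct))
```

The difference comes to 425,000 words, or 8,500 a year, and the growth works out to roughly 71 percent, consistent with the rounded 70 percent figure.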
About half of the words used in English-language books don’t appear in standard dictionaries, the researchers say. Dictionaries often omit words that rarely turn up in books, such as slenthem (an Indonesian musical instrument) and, ironically, deletable.
Michel’s group also documented a shift toward applying a regular past tense, denoted by adding -ed, to 16 percent of English irregular verbs, which are conjugated in unusual ways. For the word speed, for example, a past-tense change from sped to speeded may have been accelerated by a shift in the word’s meaning from “to move rapidly” toward “to exceed the speed limit.” Linguists regard transitions from irregular to regular verb forms as key markers of grammatical change (SN: 10/13/07, p. 227).
Intriguingly, the verbs light and wake, already known to have been irregular 500 years ago, became mostly regular by 1800 (lighted and waked) and have now returned to irregular past tenses (lit and woke).
Regardless of what happens to past tenses, society increasingly forgets the past, the researchers say. They estimated the frequency with which each year from 1875 to 1975 appeared in books from other years, as a measure of interest in events from particular years. It took 32 years for mentions of 1880 to peak and then fall by half, a mark considered crucial by the researchers, compared with a 10-year span for references to 1975.
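One way to picture that half-life measure is as a small calculation on a year-by-year frequency series. The Python sketch below is only an illustration under stated assumptions: the function name years_to_half_peak is hypothetical, the frequencies are made-up toy values rather than data from the study, and the researchers' actual procedure may differ.

```python
from typing import Mapping, Optional


def years_to_half_peak(mentioned_year: int,
                       freq_by_year: Mapping[int, float]) -> Optional[int]:
    """Years from `mentioned_year` until its mentions fall to half their peak.

    `freq_by_year` maps a publication year to how often `mentioned_year`
    appears in books from that year. Returns None if the series never
    drops to half of its peak.
    """
    years = sorted(freq_by_year)
    # Earliest publication year with the highest frequency.
    peak_year = max(years, key=lambda y: freq_by_year[y])
    half_peak = freq_by_year[peak_year] / 2
    # First year after the peak at or below half the peak value.
    for year in years:
        if year > peak_year and freq_by_year[year] <= half_peak:
            return year - mentioned_year
    return None


# Toy frequencies (invented for illustration only, not real data).
toy_mentions_of_1880 = {
    1880: 2.0, 1881: 5.0, 1882: 4.0, 1883: 3.0, 1884: 2.4, 1885: 1.9,
}
lag = years_to_half_peak(1880, toy_mentions_of_1880)  # 4 for this toy series
```

Applied to real counts, a shorter lag for a later year would correspond to the faster forgetting the researchers describe.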
As attention to the past has declined, the investigators say, society has embraced new inventions with increasing speed. Technologies invented from 1840 to 1880 took an average of 50 years to achieve widespread mention in books, versus 27 years for devices invented from 1880 to 1920.
Fame similarly mushrooms faster than ever, even as it becomes more fleeting, Michel notes. His group estimates that the most written-about people born in 1950 achieved fame at an average age of 29, compared to age 43 for luminaries born in 1800. Yet the number of references to ultrafamous folks declined sharply over progressively shorter periods during the 19th century.
“People are getting more famous than ever before but are being forgotten more rapidly than ever,” Lieberman-Aiden says.
J.-B. Michel et al. Quantitative analysis of culture using millions of digitized books. Science. doi:10.1126/science.1199644.