A Computer Science Approach to Linguistic Archeology and Forensic Science

Last week (Sept 2014), I heard a story on NPR’s Morning Edition that really got me thinking… (Side note: I’m in Ontario, so there is no NPR, but my favourite station is WKSU via TuneIn radio on my smartphone.) It was a short story, but I thought it was one of the most interesting I’ve heard in the last few months, and it made me consider how computer science has been used to understand natural language cognition.

Linguistic Archeology

Here is a link to the actual story (with transcript). MIT computer scientist Boris Katz realized that when people learn English as a second language, they make certain errors that are a function of their native language (e.g. native Russian speakers leave out articles in English). This is not a novel finding; people have known this. Katz, by the way, is one of many scientists who worked with Watson, the IBM computer that competed on Jeopardy.

Katz trained a computer model on samples of written English so that it could detect the writer’s native language from the errors in the text. But the model also learned similarities among the native languages themselves. Based only on errors in English, it discovered that Polish and Russian have historical overlap. In short, the model was able to recover the well-known linguistic family tree among many natural languages.
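To make the idea concrete, here is a toy sketch in Python. It is not Katz's model, and the story does not describe how his model works internally. I am assuming each native language can be summarized as a vector of error rates in English, and I compare those vectors with cosine similarity. The error categories, the numbers, and the function names are all invented for illustration.

```python
import math

# Hypothetical error profiles: how often each error type shows up per
# 100 sentences of learner English. These numbers are made up for
# illustration and are NOT taken from Katz's data.
ERROR_TYPES = ["missing_article", "wrong_preposition", "verb_tense", "word_order"]
PROFILES = {
    "Russian": [9.0, 4.0, 2.0, 3.0],
    "Polish": [8.0, 5.0, 2.0, 3.0],
    "Spanish": [1.0, 6.0, 5.0, 2.0],
    "French": [1.0, 5.0, 6.0, 2.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def guess_native_language(error_vector):
    """Pick the language whose error profile is most similar to the writer's."""
    return max(PROFILES, key=lambda lang: cosine(PROFILES[lang], error_vector))


def closest_language(lang):
    """Find the other language whose error profile is most similar to lang's."""
    others = [other for other in PROFILES if other != lang]
    return max(others, key=lambda other: cosine(PROFILES[lang], PROFILES[other]))


if __name__ == "__main__":
    # A writer who drops a lot of articles looks most like the Russian profile.
    print(guess_native_language([7.0, 3.0, 1.0, 2.0]))
    # Languages with similar error profiles end up paired together.
    print(closest_language("Russian"))
```

The same similarity scores that classify a writer also group the languages, which is the gist of how a family tree could fall out of English errors alone. A real system would learn its features from a large corpus rather than use a hand-written table.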

The next step is to use the model to uncover new things about dying or disappearing languages. As Katz says:

“But if those dying languages have left traces in the brains of some of those speakers and those traces show up in the mistakes those speakers make when they’re speaking and writing in English, we can use the errors to learn something about those disappearing languages.”

Computational Linguistic Forensics

This is only one example. Another one that fascinated me was the work of Ian Lancashire, an English professor at the University of Toronto, and Graeme Hirst, a professor in the computer science department. They noticed that the output of Agatha Christie (she wrote around 80 novels and many short stories) declined in quality in her later years. That itself is not surprising, but they thought there was a pattern. After digitizing her work, they analyzed the technical quality of her output and found that the richness of her vocabulary fell by one-fifth between the earliest two works and the final two works. That, and other patterns, are more consistent with Alzheimer’s than with normal aging. In short, they are tentatively diagnosing Christie with Alzheimer’s disease, based on her written work. You can read a summary HERE and you can read the actual paper HERE. It’s really cool work.
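For a rough sense of what "richness of vocabulary" can mean in practice, here is a minimal Python sketch. I am assuming a simple type-token ratio (distinct words divided by total words) over a fixed-size sample from the start of each text. That is my stand-in, not necessarily the exact measure in the paper, and the function names and sample size are my own.

```python
import re


def vocabulary_richness(text, sample_size=50000):
    """Share of distinct words among the first sample_size words of text.

    Using the same sample size for every text keeps long and short
    works comparable, since longer texts naturally repeat more words.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())[:sample_size]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def relative_change(early_text, late_text, sample_size=50000):
    """Fractional change in richness from an early text to a late one."""
    early = vocabulary_richness(early_text, sample_size)
    late = vocabulary_richness(late_text, sample_size)
    return (late - early) / early


if __name__ == "__main__":
    # Tiny made-up strings, just to show the calculation.
    early = "one two three four five"
    late = "one one two two three"
    print(vocabulary_richness(early, 5))
    print(vocabulary_richness(late, 5))
    print(relative_change(early, late, 5))
```

Run on the full text of an early novel and a late one, a drop of one-fifth would show up as a relative change of about -0.2.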

Text Analysis at Large

I think this work is really fascinating and exciting. It highlights just how much can be understood via text analysis. Some of this is already commonplace. We educators rely on software to detect plagiarism. Facebook and Google are using these tools as well. One assumes that the NSA might be able to rely on many of these same ideas to infer and predict information and characteristics about the author of some set of written statements. And if a computer can detect a person’s linguistic origin from English textual errors, I’d imagine it can be trained to mimic the same effects and produce English that looks like it was written by a native speaker of another language… but was not. That’s slightly unnerving…