September 20, 2010

Frederick Jelinek’s legacy in language and speech processing technology

Shortly after the recent death of Johns Hopkins University faculty member Frederick Jelinek, one of his colleagues, Jason Eisner, spoke about Jelinek’s critical contributions to the field of language and speech processing technology. Eisner is an associate professor of computer science at Johns Hopkins.

Q. To begin with, what do you mean by “speech recognition”?

A. Well, you talk, and the computer writes down what you said.

Q. Who uses speech recognition?

A. The original application was dictation. On your cell phone, for example, speaking is easier than typing, and there are people who cannot easily type at all because they are handicapped or illiterate or their hands are busy. The other side of the coin is that reading is easier than listening, and not only for the deaf. You can now use speech recognition to skim or automatically search voice mail, meeting transcripts, lectures, political speeches or YouTube video soundtracks, as if those things had been typed out in the first place.

Q. Why is speech recognition difficult for computers?

A. Because human speakers are not robotic voices. The same word can come out sounding many different ways. Even supposing that you could control for the speaker’s voice, accent, speech rate, emotional state, microphone quality, background noise, etc., the same word would still sound very different in different contexts. And then there would still be a lot of random variation on top of that.

Q. How did Dr. Jelinek’s approach solve this?

A. The computer rapidly considers billions of possibilities for what you might have said. Some of these possible transcriptions get low scores because they contain unusual sequences of words, or because they seem to be at odds with the audio in places. So even though the unpredictability of speech means that billions of transcriptions are possible, the computer tries to find the one that is most plausible overall.
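
In rough, toy form, that recipe can be sketched in a few lines of code. Everything below is invented for illustration (the two candidate transcriptions and all of the numbers); in a real recognizer both kinds of scores come from statistical models trained on data:

    import math

    # Two hypothetical guesses at what the speaker said.
    candidates = ["recognize speech", "wreck a nice beach"]

    # How plausible each word sequence is as English, on its own.
    # (Invented numbers; real systems learn these from text.)
    language_score = {
        "recognize speech": math.log(3e-5),
        "wreck a nice beach": math.log(1e-7),
    }

    # How well each word sequence matches the recorded audio.
    # (Also invented; real systems learn these from transcribed speech.)
    acoustic_score = {
        "recognize speech": math.log(0.02),
        "wreck a nice beach": math.log(0.03),
    }

    def overall_plausibility(words):
        # Log-probabilities add, so a higher total means a transcription
        # that is more plausible overall.
        return language_score[words] + acoustic_score[words]

    print(max(candidates, key=overall_plausibility))  # -> recognize speech

Real systems differ in scale rather than in kind: they weigh an enormous number of candidate word sequences against both sources of evidence and keep the one with the best combined score.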

Q. Dr. Jelinek is said to have recast language problems as mathematics. Where do the statistical formulas come in?

A. The technique rests on a method for scoring transcriptions. One of the transformative insights was that these methods didn’t have to be rules manually devised by linguists. The computer could automatically learn how good transcriptions differed from bad ones by looking at lots of examples of correctly transcribed speech. What it learned in this way could be enormously detailed.

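As a toy illustration of that learning step, here is a sketch of one of the simplest such models: it judges a word sequence purely by how often each pair of adjacent words occurred in the example transcriptions. The three-sentence "corpus" is invented for this sketch; real systems learn from millions of words of transcribed speech:

    from collections import Counter
    import math

    # A tiny stand-in for a large collection of correctly transcribed speech.
    transcripts = [
        "please call home",
        "please call the office",
        "call home now",
    ]

    pair_counts = Counter()  # how often word B follows word A
    word_counts = Counter()  # how often word A is followed by anything
    for sentence in transcripts:
        words = ["<start>"] + sentence.split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[(prev, nxt)] += 1
            word_counts[prev] += 1

    vocabulary_size = len(word_counts) + 1

    def score(sentence):
        """Higher score: the word sequence looks more like the example transcripts."""
        words = ["<start>"] + sentence.split()
        total = 0.0
        for prev, nxt in zip(words, words[1:]):
            # Add-one smoothing: word pairs never seen in the examples still
            # get a small probability instead of zero.
            p = (pair_counts[(prev, nxt)] + 1) / (word_counts[prev] + vocabulary_size)
            total += math.log(p)
        return total

    print(score("please call home") > score("home the please call"))  # -> True

Fred’s point was that the computer can estimate all of these numbers for itself from transcribed examples, and that the same recipe extends to far richer and more detailed models than word pairs.
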
Q. Is this approach still used in practice?

A. Yes, Fred’s methods are at the heart of all systems today. Speech recognition is a $5 billion business. The first commercial products appeared in the early 1980s, built by IBM and by ex-IBMers from Fred’s group. The scoring models are getting better and better. In the past decade, Fred focused on developing sophisticated models that paid more attention to linguistic properties like grammar, meaning and context.

Q. Did Dr. Jelinek do anything else?

A. Plenty. For one thing, he applied the same ideas to translation between languages. This allows you, for example, to type in a sentence in French, and the computer will tell you what it says in English. Fred saw that it was really the same problem: There are billions of possible English translations, too, and a computer can automatically learn how to score them. Try it yourself at translate.google.com. In fact, this way of thinking has taken over across artificial intelligence. Most of AI is now about learning statistical models from uncertain data and using them to make predictions.
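
For readers who want the one-line version: in the statistical translation framework that grew out of this work, the computer's guess for the English translation of a French sentence f is conventionally written as

    choose the English sentence e that maximizes  P(e) × P(f | e)

where the first factor scores how plausible e is as an English sentence and the second scores how well e could account for the French, both learned automatically from examples, just as in speech recognition.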