Asia Institute Seminar with Mark Hasegawa-Johnson, University of Illinois

“The Future of Voice Recognition”

KOREA IT TIMES

 

Tuesday, May 29th, 2012

Professor Mark Hasegawa-Johnson received his Ph.D. from MIT in 1996 and serves as an associate professor in the Department of Electrical and Computer Engineering at the University of Illinois and a full-time faculty member in the Artificial Intelligence group at the Beckman Institute. His field of interest is speech production and recognition by humans and computers, including landmark-based speech recognition, integration of prosody in speech recognition and understanding, audiovisual speech recognition, computational auditory scene analysis, and biomedical imaging of the muscular and neurological correlates of speech production and perception. Professor Hasegawa-Johnson is visiting Korea to meet with experts in academia and industry to discuss his research.

Emanuel Pastreich:

Can we say that the focus of your research is voice recognition? Or is that speech recognition?

Mark Hasegawa-Johnson:

I would say that’s one of the two areas in which I’ve published most of my work. The other area is multimedia analytics: generally, audio and video event detection and multimedia database browsing.

Emanuel Pastreich:

Let us start with voice recognition—is that the same as speech recognition?

Mark Hasegawa-Johnson:

In the field we usually say “speech recognition.” The term “voice recognition” is ambiguous; it could mean recognizing what a person is saying, or it could mean recognizing who is saying it. As it happens, I’ve worked on both problems.

Emanuel Pastreich:

Has the field of speech recognition changed significantly, and how is progress in technology changing that field of study?

Mark Hasegawa-Johnson:

The field has changed dramatically in the past ten years. Basically, the technology has gone from being an object of study in research labs, to being a ubiquitous product that everybody uses.

Emanuel Pastreich:

When you say “the technology,” are you talking about the recorded message on the phone that asks you which city you want? Are there other, less obvious applications?

Mark Hasegawa-Johnson:

The highest-profile application this year has been Siri, the application shipped with the iPhone 4S that allows one to send an e-mail, make a calendar entry, search the maps for restaurants, or search the web for arbitrary information, all through voice interaction with a pseudo-intelligent agent.

Siri seems a little funny to experts in this research field, because the speech recognition technology behind Siri is almost identical to the recognition software that was released by the same company in 2008. What is new in Siri (the part that SRI developed) is the dialogue manager, the pseudo-intelligent agent interface. That user-friendly interface has become much more robust in the past three years.

Emanuel Pastreich:

Is Siri part of a larger, long-term trend?

Mark Hasegawa-Johnson:

Prior to Siri, I’d say the previous big milestone was the first commercial desktop dictation software to achieve more than 99% recognition accuracy for a large vocabulary, for anybody with a standard North American accent. That happened in 2006.

Emanuel Pastreich:

So as voice recognition becomes ubiquitous, how will the world change? What do you imagine this technology will be like in ten or twenty years? I assume we are talking about having computers we can talk with comfortably, maybe not even that far away. Five years?

The growing strength of these automated interface systems can be linked to services such as Facebook. Facebook is quite useful now, but what if we start to populate it with more and more “people” who are in fact just programs? Oddly, those “people” might be more responsive to our needs than actual friends, who do not have time to interact 24 hours a day and reply to every email. So could it be that human-machine interaction becomes preferable to human-human interaction?

Mark Hasegawa-Johnson:

That’s an interesting thought, the idea of pseudo-intelligent agents interacting with people on the web. It is quite possible to imagine. There are spammers trying to create such programs right now (I’ve had spam from some of them), but they are still relatively easy to fool. It’s possible that might not be the case five years from now.

But on the other hand, would that be very much different from a world in which human beings can be easily hired to advertise to you? It just means that you have to be careful about who you allow to “friend” you on Facebook — the other person might be a computer.

Emanuel Pastreich:

As for networking, I have often thought it would be great to have a computer do my networking for me on Facebook and LinkedIn; it could search out possible partners around the world 24 hours a day, write them all personalized letters, or send personal voice messages in my voice. Speaking of voice reproduction, what might be the implications of having a computer that can speak in my own voice, indistinguishable from me, and make up what it will say without my direct input?

Mark Hasegawa-Johnson:

That situation could pose some serious challenges.

Emanuel Pastreich:

So what are the big challenges in the field today? Are there fundamental disagreements as to the linguistic and neurological aspects of language, or have we come to a consensus? Are there specific technological breakthroughs we are waiting for?

Mark Hasegawa-Johnson:

The big challenges, I think, are (1) increasing robustness for individual variations within the human-computer interface, and (2) developing multimedia search applications that will be able to recognize speech in low quality recordings, such as open-microphone recordings of group meetings, and the like.

Right now, if you speak a standard North American dialect and you talk directly into the microphone, you can use a human-computer interface, no problem. But if you have a disability like cerebral palsy, you can’t; if you speak with a strong South African accent (not one of the standard accents well modeled by software manufacturers), you can’t. For that matter, if you speak a language with fewer than about nine million native speakers, you can’t.

Emanuel Pastreich:

So, looking forward to the solution of these problems, is it merely a matter of increasing computing power and functionality, or might there be more complex theoretical issues involved? Are there algorithms that can solve the problem, or do we have to reach a new level in our understanding of language?

Mark Hasegawa-Johnson:

Actually, it’s really hard to say whether the issues are gaps in our understanding, or just gaps in our sets of labeled training data. I think that both are true. The problem of collecting more data is one that companies and universities and governments are working collectively to solve. I have seen some intriguing new proposals recently for on-line data sharing sites for linguists and speech scientists.

In my own research, I have been pursuing possible theoretical solutions to these problems. Specifically, I am working on approaches to semi-supervised learning (learning from a small amount of labeled data plus a large amount of unlabeled audio data); transfer learning (learning accurate speech recognition in one language or dialect by leveraging systems that already work well in related languages and dialects); and the use of prior knowledge in machine learning systems (reducing the amount of training data necessary to achieve a certain level of accuracy by constraining the way in which the system learns from data). The last approach can consist of a bias in the learning process, so that the system is encouraged to find a solution similar to an initial solution that we already know to be approximately true based on linguistic research.
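The first of these ideas, semi-supervised self-training, can be illustrated with a toy sketch. The Python below is not Professor Hasegawa-Johnson’s system; it is a minimal illustration, assuming synthetic one-dimensional “feature” values and a nearest-centroid classifier, of how a model fit on a few labeled points can pseudo-label its most confident unlabeled points and refit.

```python
# Minimal self-training sketch: a nearest-centroid classifier is fit on a few
# labeled points, then repeatedly labels the unlabeled point it is most
# confident about and refits.  All data here is synthetic.

def centroids(labeled):
    """Mean feature value per class, from (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def self_train(labeled, unlabeled, rounds=5):
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        cents = centroids(labeled)

        def margin(x):
            # Confidence = gap between the two nearest class centroids.
            d = sorted(abs(x - c) for c in cents.values())
            return d[1] - d[0] if len(d) > 1 else d[0]

        pool.sort(key=margin, reverse=True)
        x = pool.pop(0)                                  # most confident point
        y = min(cents, key=lambda k: abs(x - cents[k]))  # nearest class
        labeled.append((x, y))                           # pseudo-label it
    return centroids(labeled)

# Two classes clustered near 0.0 and 1.0; only four points carry labels.
labeled = [(0.1, "A"), (0.2, "A"), (0.9, "B"), (1.1, "B")]
unlabeled = [0.0, 0.15, 0.85, 1.05, 0.95]
model = self_train(labeled, unlabeled)
```

After self-training, the class centroids have been refined using the unlabeled points, which is the basic promise of the approach: the unlabeled audio sharpens a model that a small labeled set only roughly positions.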

Emanuel Pastreich:

And when it comes to actually getting a computer to speak, say in a person’s voice…

Mark Hasegawa-Johnson:

For generating your voice, the most recent standards competitions, the most famous being the Blizzard Challenge for speech synthesis, have featured new systems that come very close to being indistinguishable from a human voice. Of course that process involves quite a lot of careful engineering. Commercial systems are not there yet in terms of verisimilitude, but I think they will be soon.

Emanuel Pastreich:

But as we imagine what might be possible in the future, in terms of technology, we are, in a sense, forced to go back to the basics of language, back to Noam Chomsky and his suggestions as to universal attributes of language and their neurological basis. So, will there ultimately be one algorithm that works for all languages, or will there be different strategies and approaches for different language groups? Can all languages be seen as part of a larger meta-language in a technological sense? That is to say, there could be a divergence between how brains operate in the reproduction of multiple languages and how computers do, or maybe not.

Mark Hasegawa-Johnson:

With regard to Chomsky: actually, there is a set of algorithms that seems to work very well for all languages, but they need a lot of language-dependent training data in order to learn the right parameters. The best commercial systems right now are using hundreds of times as much training data as a child hears in the first five years of life, so you know that we are missing something.

With regard to building technologies that leverage what we know about the neurology and psychology of speech processing, that is very hard, because, frankly, what we know about the neurology of language is a very small fraction of what we would need to know. Language production and perception involve neural circuits with a poorly understood, approximately universal physiological substrate. On top of this physiological substrate you get at least twelve years of precision fine-tuning of the synaptic connection weights in order to optimize the child’s speed and accuracy of language processing, and the only way that we can measure those synaptic weights is by carefully testing the behavior of the child or adult. Simply put, the amount that we know is dwarfed by the amount that we don’t know.

The result of our ignorance is that if you try to develop an algorithm that simply implements what you know, in the form of a set of rules, the algorithm will fail completely. Such technologies were obsolete by 1990.

On the other hand, if you use general-purpose machine learning algorithms, and completely ignore what we know about human speech and language, then you wind up in the situation we have currently: you have algorithms that require 100 times as much training data as a human child, and that achieve a speech recognition accuracy that is, at best, ten times worse than that of a human child.

One of the things I’ve been trying to do in my research is to find ways to get the best of both worlds, by encoding scientific knowledge in a form that can guide machine learning algorithms. For example, we know that humans are differentially sensitive to salient phonetic landmarks, e.g., consonant releases and consonant closures, so I have developed landmark-based speech recognition algorithms that incorporate that differential sensitivity, and I’ve shown that doing so improves recognition accuracy.
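As a rough illustration of the landmark idea (not the published algorithm itself), one could treat abrupt rises and drops in short-time energy as candidate consonant releases and closures. The energy contour and threshold below are invented for the example.

```python
# Toy landmark detector: consonant releases and closures show up as abrupt
# rises and drops in short-time energy.  A synthetic frame-energy contour
# stands in for a real signal; a "landmark" is any frame-to-frame change
# whose magnitude exceeds a threshold.

def find_landmarks(energy, threshold=0.4):
    """Return (frame_index, 'release' | 'closure') for abrupt energy changes."""
    landmarks = []
    for t in range(1, len(energy)):
        delta = energy[t] - energy[t - 1]
        if delta > threshold:
            landmarks.append((t, "release"))   # energy jumps up: release
        elif delta < -threshold:
            landmarks.append((t, "closure"))   # energy drops: closure
    return landmarks

# Silence -> vowel onset (release), then vowel -> stop closure (closure).
contour = [0.05, 0.06, 0.9, 0.95, 0.92, 0.1, 0.08]
print(find_landmarks(contour))   # [(2, 'release'), (5, 'closure')]
```

In a real system the landmarks would be found by trained classifiers over rich acoustic features rather than a fixed energy threshold, but the sketch shows the shape of the idea: anchor recognition at a sparse set of salient events instead of treating every frame equally.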

Emanuel Pastreich:

Will computers approach humans in terms of how they handle voice, or will the fact that the computer is a silicon-based, not carbon-based, system ultimately mean that it will mimic very accurately, but not really function the way a human does in the recognition or replication of language?

Mark Hasegawa-Johnson:

How would you know? Until we have a method for imaging synapse weights and neural activation patterns on a scale of nanometers and milliseconds in a living brain, I don’t think it will ever be possible to know for sure whether the brain is recognizing speech in the same way as the computer. I think it’s interesting, though, that in many ways the technology and the science converge toward the same model, despite very different starting points.

Emanuel Pastreich:

Is this your first visit to Korea?

Mark Hasegawa-Johnson:

I visited Jeju Island for the Interspeech 2004 conference. This is, however, my first trip to Seoul.

Emanuel Pastreich:

So what is interesting to you about Korea and the Korean language with regards to your field? I know you have worked a bit on Japanese.

Mark Hasegawa-Johnson:

The Korean language is interesting in many respects. There are ways of distinguishing phonemes in Korean, and patterns of coarticulation, that are quite different from any other language I’ve worked with. I would be very much interested in applying my landmark-based speech recognition methods (basically, learning high-precision binary classifiers that distinguish pairs of speech sounds) to Korean, because the binary distinctions we would have to learn would be so different from those in English (or Arabic, or Mandarin, or Japanese, or German, or Spanish).

On the other hand, I’ve been told that automatic speech recognition already works very well in Korean, so there is some disincentive against further research in this area for the Korean language. I haven’t seen demos, so I’m not sure what to expect.

Emanuel Pastreich:

Do you think Korean is quite different from Japanese in your field of research? Are people doing the two languages together as a meta-language, for example? Could you develop a program that recognizes both Japanese and Korean?

Mark Hasegawa-Johnson:

Yes, in terms of the speech acoustics, Korean is very different from Japanese.

In terms of syllable structure, and some aspects of morphology and syntax, one could perhaps work on them as a meta-language, but not with the same kind of transfer that you could achieve between English and Dutch, or between Swahili and other Bantu languages, for example.

The other way in which Korean is similar to Japanese is prosody, or at least the prosody is less dissimilar than the segmental phonetics. I’ve done a lot of work on the automatic recognition of prosodically prominent words and prosodic phrase boundaries, so now that you mention it, it might be interesting to see whether one could learn automatic phrase-boundary detectors for Korean and Japanese that take advantage of some of the similarities.

Emanuel Pastreich:

Are there any specific features of the research and technology landscape in Korea that are intriguing? That is to say, do you find unusual combinations of strengths that might be different than what you find in the US or Japan?

Mark Hasegawa-Johnson:

I think that the research and technology landscape in Korea is very different from the US, but I’m not sure that I know of substantial differences between the Korean and Japanese research landscapes, at least not in the way that these bear on my field. Both Japan and Korea have very strong traditions of research in both universities and corporate research labs. There is perhaps more collaboration between university and corporate labs in Korea than in Japan, if I understand the situation correctly. Such collaborations in Japan tend to be less direct, I think, as they are in the US.
