One of the talks I attended yesterday was about speech recognition and synthesis in Windows Vista and how developers can leverage the System.Speech APIs to make their own applications "speech-enabled." The session was presented by Robert Brown (whom we were first introduced to in an old Channel9 video), Philipp Schmid and Steve Chang.
In Robert's first demo, he showed us how it's now possible to control the entire Windows UI using just speech. He was able to change his desktop wallpaper, open applications and dictate text to them, and even navigate through an MSN Virtual Earth map without touching his keyboard or mouse. It was clear that the speech recognition engine in Vista is already miles ahead of the one in XP, and there's still almost a year for them to improve upon it.
He then gave us a preview of the new speech synthesis engine in Vista. Microsoft Sam and Mary, the two robotic voices in XP, have been replaced by Anna, who sounds much more natural. Even better, Vista will ship with recognition and synthesis support for eight languages. Robert showed us how the system reads Chinese text: Lili, the Chinese voice, sounds nothing like the English voice and is much better suited to reading Chinese.
Philipp Schmid then took the stage and showed us how ISVs can enable speech recognition and synthesis in their own applications. Microsoft's goal with Vista is to take speech mainstream, and to facilitate this, the System.Speech APIs are both easy to use and very powerful. For example, to enable speech recognition in an existing application, all one has to do is create an instance of SpeechRecognizer and Grammar, load the Grammar instance into the recognizer, and subscribe to the SpeechRecognized event. The grammar is essentially a finite state machine whose transitions build up a sentence or command one word at a time.
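To make the "grammar as a finite state machine" idea a bit more concrete, here is a rough Python sketch. It is not the System.Speech API (that's .NET, and I'm not reproducing its signatures here); ToyGrammar, ToyRecognizer, load_grammar, subscribe and hear are all names I made up, and typed text stands in for audio. It only mirrors the three steps from the talk: build a grammar, load it into a recognizer, and subscribe to a "recognized" event.

```python
from typing import Callable, Dict, List, Set, Tuple


class ToyGrammar:
    """A tiny finite state machine: (state, word) -> next state."""

    def __init__(self, name: str, start: str, accepting: Set[str],
                 transitions: Dict[Tuple[str, str], str]) -> None:
        self.name = name
        self.start = start
        self.accepting = accepting
        self.transitions = transitions

    def accepts(self, words: List[str]) -> bool:
        # Walk the machine one word at a time; fail on any unknown transition.
        state = self.start
        for word in words:
            key = (state, word)
            if key not in self.transitions:
                return False
            state = self.transitions[key]
        # The phrase is only valid if we end in an accepting state.
        return state in self.accepting


class ToyRecognizer:
    """Hypothetical stand-in for a recognizer; it 'hears' text, not audio."""

    def __init__(self) -> None:
        self._grammars: List[ToyGrammar] = []
        self._handlers: List[Callable[[str, str], None]] = []

    def load_grammar(self, grammar: ToyGrammar) -> None:
        self._grammars.append(grammar)

    def subscribe(self, handler: Callable[[str, str], None]) -> None:
        self._handlers.append(handler)

    def hear(self, utterance: str) -> bool:
        # Fire the 'recognized' handlers for the first grammar that accepts.
        words = utterance.lower().split()
        for grammar in self._grammars:
            if grammar.accepts(words):
                for handler in self._handlers:
                    handler(grammar.name, utterance)
                return True
        return False


# Grammar for: "open notepad" | "open calculator"
open_app = ToyGrammar(
    name="open_app",
    start="START",
    accepting={"DONE"},
    transitions={
        ("START", "open"): "APP",
        ("APP", "notepad"): "DONE",
        ("APP", "calculator"): "DONE",
    },
)

recognized: List[Tuple[str, str]] = []
recognizer = ToyRecognizer()
recognizer.load_grammar(open_app)
recognizer.subscribe(lambda name, text: recognized.append((name, text)))

recognizer.hear("open notepad")            # matches the grammar
recognizer.hear("open the pod bay doors")  # no matching path, so it is ignored
```

A real engine obviously has to deal with audio, confidence scores and much richer grammars, but the shape of the calling code (create, load, subscribe) is the part that struck me as simple.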
Next, Steve Chang, who manages the Microsoft Speech Server team, demonstrated how applications can become even more ubiquitous by making them accessible through any telephone line. The first demo app was a simple one that allowed users to dial in and book concert tickets. The second one was more interesting and was developed by a team of SDETs at Microsoft. It lets users dial in and tell the system where they're leaving from, where they want to go, and at what time, and the system responds with the time the next shuttle is scheduled to arrive. It goes one step further, though, and calls the user back five minutes before the shuttle arrives. Speech Server, like the speech engine in Vista, is also multilingual, and to demonstrate that, Steve interacted with the system in French. :)
Finally, he explained the concept of "mixed initiative," which allows speech recognition systems to feel more natural. In most current applications, the system prompts you for something, you respond, and this cycle continues until the system has asked you for all the information it needs. This becomes tedious after a while. Wouldn't it be nice if you could, in one go, tell the system everything it needed to know, and it could intelligently break up what you said into multiple pieces and do its job? That's what "mixed initiative" is all about - the user and system jointly control the dialog flow. As an example, Steve called the shuttle service app and, in one go, told the system where he was leaving from, where he wanted to go and at what time. The system then broke his command down into pieces, recognized the source, destination and time separately, and replied with the time the next shuttle was going to arrive. :)
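Here's a small Python sketch of how I picture mixed initiative working, as a "slot filling" loop. This is my own illustration, not how Speech Server implements it: the slot names, the regular expressions, the "building N" stop names and the prompts are all assumptions, and plain text again stands in for recognized speech. The idea is that each utterance fills as many slots as it can, and the system only prompts for whatever is still missing.

```python
import re
from typing import Dict, Optional

# Hypothetical slots for the shuttle demo: where from, where to, and when.
SLOT_PATTERNS = {
    "source": re.compile(r"\bfrom (building \d+)"),
    "destination": re.compile(r"\bto (building \d+)"),
    "time": re.compile(r"\bat (\d{1,2}(?::\d{2})? ?(?:am|pm))"),
}

PROMPTS = {
    "source": "Where are you leaving from?",
    "destination": "Where do you want to go?",
    "time": "What time do you want to leave?",
}


def fill_slots(utterance: str, slots: Dict[str, str]) -> Dict[str, str]:
    """Fill every slot the utterance mentions, keeping earlier answers."""
    text = utterance.lower()
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match and slots.get(name) is None:
            slots[name] = match.group(1)
    return slots


def next_prompt(slots: Dict[str, str]) -> Optional[str]:
    """Return the prompt for the first missing slot, or None when done."""
    for name in SLOT_PATTERNS:
        if slots.get(name) is None:
            return PROMPTS[name]
    return None


# User-led: everything arrives in one utterance, so no prompts are needed.
one_shot = fill_slots("I want to go from building 40 to building 9 at 3:30 pm", {})

# System-led: the caller answers piece by piece and gets prompted in between.
step_by_step = fill_slots("from building 40", {})
prompt_after_first_answer = next_prompt(step_by_step)
step_by_step = fill_slots("to building 9 at 10 am", step_by_step)
```

The same two functions handle both the one-shot caller and the caller who answers one question at a time, which is more or less what "the user and system jointly control the dialog flow" means to me.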
Perfecting speech recognition is an incredibly difficult problem, and it was great to see how much progress has been made since the release of XP a few years ago. The presentation was pretty fascinating, and I'm now curious to see how Build 5219 responds to my voice!
Tags: Vista, Longhorn, Speech, PDC05