Public interest in speech technology increased dramatically after the recent demonstration of Google Duplex, which elicited a wide range of reactions. Most observers were stunned at the apparent sophistication and power of the application, which can book appointments over the telephone by “conducting natural conversations” (to quote the Google blog that accompanied the public release of Duplex). However, Duplex also generated substantial controversy, with criticisms ranging from concerns about ethics (is Duplex misleading people into thinking that they are talking to a human?) to privacy (how much consent is required from the companies that are called by Duplex?), all the way to feasibility (were the Duplex demonstrations faked, or at least edited to a substantial degree?).
These are interesting issues, and should also raise general awareness that speech technology is continually improving, along with many other forms of pattern recognition. Both automatic speech recognition (ASR) and speech synthesis (also known as “text to speech” – TTS) have benefited greatly from breakthroughs in Deep Learning during the past decade. These improvements have advanced the capabilities of applications such as voice-based Web search and on-demand audio content generation far beyond the levels achievable five years ago.
However, it is also important to understand the limitations that remain in place despite such advances. Although the most advanced ASR systems are able to recognize certain spoken utterances with human-level accuracy, and the best TTS systems sound every bit as natural as a (human) voice artist under appropriate circumstances, these algorithmic listeners and talkers still lack anything like the understanding that characterizes human intelligence. As a consequence, ASR and TTS systems fail in ways that are deeply unintuitive to anybody who is not intimately familiar with the technology. (In fact, even those of us who have been working on ASR and TTS for decades are sometimes surprised at their failures!)
Such failures make for funny YouTube videos, but can be disastrous in customer-facing applications. The first big wave of speech-technology deployments, in the late 1990s, was only partially successful since many consumers found it too confusing to interact with such unpredictable systems. Corporations that had invested millions of dollars in speech technology were sometimes surprised to find that the withdrawal of their newly installed systems was a crucial step towards customer satisfaction! Modern ASR and TTS are much better than the technologies of that era, but it is still not clear how widely they can succeed in automating dialogues with customers, where any mistakes of the technology translate directly into unhappy customers.
Another major trend of this decade has, however, created a huge range of additional applications for ASR: in this era of Big Data, we now realize that there is great commercial and social value in the analysis of aggregated data sources. For numerical data, such as consumer purchasing, movement or communication patterns, Big Data algorithms have been spectacularly successful and underpin the operations of companies such as Amazon and Uber. Similarly, textual data has been extensively utilized in the modern economy. However, speech data has been harder to use in the same fashion. Many companies have potential treasure troves of spoken information (e.g., collected in their call centres) which could provide crucial business intelligence, but are currently not accessible for analysis. This is a perfect application domain for fallible speech technology, since occasional mistakes will not obscure the significant trends that are the main aim of analytics for business intelligence.
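The point that occasional recognition errors need not obscure aggregate trends can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not a model of any real system: the weekly topic rates, the miss and false-alarm rates, the call volume and the function name `observed_rate` are all hypothetical, chosen purely for illustration.

```python
import random


def observed_rate(true_rate, miss_rate, false_alarm_rate, n_calls, rng):
    """Share of calls that a fallible recognizer flags as mentioning a topic.

    All parameters are illustrative assumptions, not measured values.
    """
    flagged = 0
    for _ in range(n_calls):
        mentioned = rng.random() < true_rate
        if mentioned:
            # A genuine mention is missed with probability miss_rate.
            detected = rng.random() >= miss_rate
        else:
            # A call without the topic is wrongly flagged with
            # probability false_alarm_rate.
            detected = rng.random() < false_alarm_rate
        flagged += detected
    return flagged / n_calls


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical share of calls that truly mention a topic, week by week.
true_rates = [0.05, 0.10, 0.15, 0.20]

observed = [
    observed_rate(p, miss_rate=0.2, false_alarm_rate=0.02,
                  n_calls=20000, rng=rng)
    for p in true_rates
]

for week, (p, q) in enumerate(zip(true_rates, observed), start=1):
    print(f"week {week}: true share {p:.3f}, observed share {q:.3f}")
```

Under these assumed error rates, the observed shares are biased in each individual week, but the week-over-week rise remains clearly visible, because the errors affect every week in roughly the same way. A single misrecognized call matters a great deal in a live dialogue, but very little once thousands of calls are aggregated.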
It will be interesting to see what the future holds for systems such as Google Duplex, and also for less ambitious attempts to perform live communication with end users through speech technology. However, the most impactful applications of ASR right now are likely to operate behind the scenes, and to make speech information a seamless component of Big Data – that is, speech analytics.