EPSRC (EP/M018946/1)
Spoken Dialogue Systems (SDS) encompass the technologies required to build effective man-machine interfaces that depend primarily on voice. To date, they have mostly been deployed in telephone-based call centre applications such as banking, billing queries and travel information, and they have been built using hand-crafted rules.
Spoken dialogue systems are difficult to build because human language is rich in ambiguity and a user's intentions are frequently expressed imprecisely. In practice, this uncertainty is further exacerbated by speech recognition errors: in noisy environments, word error rates frequently reach 25% or more. Rule-based systems are particularly vulnerable to these errors, and one of the primary motivations for the statistical approach to SDS is to significantly increase robustness by explicitly modelling uncertainty. Hence, rather than committing to a single "best guess" at the user's intended goal, a statistical system maintains a probability distribution, called the belief state, over all possible goals. Decisions about how the system should then respond are made by a dialogue policy which maps belief states into actions. The two key processes at the heart of a statistical SDS are therefore (a) how to robustly estimate the belief state at each turn based on the evidence provided by the noisy user input; and (b) how to optimise the policy so as to ensure that the sequence of actions leads to the best possible outcome as defined by a reward function.
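The two processes can be illustrated with a toy sketch. This is not the project's method: it assumes a tiny fixed set of goals, a hypothetical noise model in which the recogniser reports the true goal with probability 1 - error_rate and otherwise picks one of the other goals uniformly, and a hand-set confidence threshold standing in for a policy that would in practice be optimised against a reward function. The names update_belief and policy, and the example goals, are illustrative only.

```python
def update_belief(belief, observed_goal, error_rate=0.25):
    """Bayesian update of the belief state after one noisy user turn.

    Assumed noise model (hypothetical): the recogniser outputs the true
    goal with probability 1 - error_rate, and otherwise outputs one of
    the remaining goals uniformly at random. Requires at least two goals.
    """
    n = len(belief)
    posterior = {}
    for goal, prior in belief.items():
        if goal == observed_goal:
            likelihood = 1.0 - error_rate
        else:
            likelihood = error_rate / (n - 1)
        posterior[goal] = prior * likelihood
    # Normalise so the belief state remains a probability distribution.
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}


def policy(belief, threshold=0.8):
    """Map a belief state to a system action.

    A hand-written threshold rule used purely for illustration; a
    statistical system would learn this mapping from a reward function.
    """
    top_goal = max(belief, key=belief.get)
    if belief[top_goal] >= threshold:
        return ("inform", top_goal)
    return ("confirm", top_goal)


# Example: three possible goals, uniform prior, two turns of evidence.
goals = ["italian", "chinese", "indian"]
belief = {goal: 1.0 / len(goals) for goal in goals}

belief = update_belief(belief, "italian")
first_action = policy(belief)   # belief in "italian" is 0.75: below threshold

belief = update_belief(belief, "italian")
second_action = policy(belief)  # belief in "italian" is about 0.947
```

After one observation the system is not yet confident enough to act and asks for confirmation; after a second consistent observation the belief in the top goal exceeds the threshold and the system commits. A rule-based system that trusted the first recognition result outright would have no comparable way to recover from a misrecognition.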
The recent introduction of Apple Siri and Google Now has moved voice-based interfaces into the mainstream. These virtual personal assistants offer the potential to revolutionise the way we interact with machines, and they open the way to properly controlling and managing the emerging Internet of Things: the rapidly growing network of smart devices which lack any form of conventional user interface. However, current personal assistants are built using the same technology as limited-domain spoken dialogue systems. They cannot sustain conversational dialogues outside the limited domains which they have been explicitly programmed to handle.