The BUDS POMDP Dialogue System

M. Szummer, M. Henderson, C. Breslin, M. Gasic, D. Kim, B. Thomson, P. Tsiakoulis, S. Young

A demonstration at the NIPS conference, December 2012.

The Bayesian Update of Dialogue State (BUDS) system is a state-of-the-art system for spoken human-computer dialogue. Here, it is employed to build a speech-driven intelligent assistant. The system manages the conversation to help the user achieve their goal as quickly as possible. Its main challenge is to converse in a way that overcomes speech recognition errors and ambiguous user utterances. The system can ask for confirmations, pose choices, and ask for additional information to gain certainty, all in order to maximize the utility of the dialogue.

Here, we demonstrate conversations in a restaurant-finding domain. The system helps users find suitable restaurants using criteria including area, price range, and cuisine, and can provide each venue's address, phone number, and signature dishes.

System Components

The system employs a very long machine learning pipeline, going all the way from the input of raw sound samples to the semantics of language and the pragmatics of maintaining a natural dialogue.

Components of system

The recorded demos below show dialogues between the user and the machine. They also visualize the information available at the following points in the system.

Speech Recognition Output (1)

A list of the top speech recognition hypotheses, with probabilities.

Semantic Decoding (2)

A representation of the meaning of the user utterance, shown as a list of the most likely meanings with probabilities. Each meaning takes the form of an ActType with optional arguments. The ActTypes include:

  • inform(): The user supplies a constraint on the venue, such as on area, cuisine, or price range.
  • reqalts(): The user requests alternative venues.
  • affirm(): The user says 'Yes, ...'.
  • negate(): The user says 'No, ...'.
  • null(): The user meant something other than our semantics can represent, or the semantic decoding was unsuccessful.

ActTypes may have arguments, e.g. inform(food=british, pricerange=expensive), signifying that the user has requested a venue with British cuisine in the expensive price range.
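The act notation above is regular enough to parse mechanically. The following is a minimal sketch of such a parser, not part of BUDS itself: the names `DialogueAct` and `parse_act` are hypothetical, and the sketch assumes that argument values never contain commas.

```python
import re
from typing import NamedTuple, Optional, Tuple


class DialogueAct(NamedTuple):
    """A parsed act such as inform(food=british, pricerange=expensive)."""
    act_type: str
    # Each argument is (slot, value); value is None for bare slots like "phone".
    args: Tuple[Tuple[str, Optional[str]], ...]


# ActType, then everything between the outer parentheses.
_ACT_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_act(text: str) -> DialogueAct:
    """Parse 'acttype(slot=value, ...)' into a DialogueAct (hypothetical helper)."""
    match = _ACT_RE.match(text)
    if match is None:
        raise ValueError(f"not a dialogue act: {text!r}")
    act_type, body = match.groups()
    args = []
    for item in body.split(","):  # assumes no commas inside values
        item = item.strip()
        if not item:
            continue  # empty argument list, e.g. reqalts()
        if "=" in item:
            slot, value = item.split("=", 1)
            args.append((slot.strip(), value.strip().strip('"')))
        else:
            args.append((item, None))
    return DialogueAct(act_type, tuple(args))


act = parse_act("inform(food=british, pricerange=expensive)")
print(act.act_type, dict(act.args))
```

Bare arguments such as those in request(phone, addr) are kept as slots with no value, which distinguishes "the user asks for the phone number" from "the user states a phone number".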

System Action (3)

The chosen system action (the one with the highest utility) is shown in a semantic representation. The system ActTypes include:

  • request(): Ask the user to provide a value for an attribute.
  • describe(): Give feedback on the dialogue, and possibly mention the number of matching restaurants.
  • inform(): Detail the name, phone, or address of a particular restaurant.
  • confirm(): Ask the user to confirm an attribute.
  • select(): Ask the user to choose between two values for an attribute.
  • bye(): End the dialogue.

The ActTypes come with arguments, e.g. inform(food=british, pricerange=expensive).
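To illustrate what "highest utility" means when the system is uncertain, here is a toy sketch that picks the action with the highest expected utility under a belief over the user's goal. It is an illustration only: the two goal states, the utility values, and the function names are assumptions, and the actual BUDS policy is learned rather than hand-written like this.

```python
# Hypothetical utilities: (true user goal, system action) -> reward.
# Informing on a wrong constraint is costly; confirming costs one extra turn.
UTILITY = {
    ("wants_british", "inform"): 10.0,
    ("wants_other", "inform"): -20.0,
    ("wants_british", "confirm"): -1.0,
    ("wants_other", "confirm"): -1.0,
}
ACTIONS = ["confirm", "inform"]


def expected_utility(belief, action, utility=UTILITY):
    """Average the utility of an action over the belief about the user's goal."""
    return sum(prob * utility[(state, action)] for state, prob in belief.items())


def choose_action(belief, actions=ACTIONS, utility=UTILITY):
    """Return the action with the highest expected utility."""
    return max(actions, key=lambda action: expected_utility(belief, action, utility))


uncertain = {"wants_british": 0.6, "wants_other": 0.4}
confident = {"wants_british": 0.95, "wants_other": 0.05}
print(choose_action(uncertain))  # confirm
print(choose_action(confident))  # inform
```

With these made-up numbers, a 0.6 belief makes a confirmation worthwhile (expected utility -1 versus -2 for informing), while a 0.95 belief makes it better to inform straight away.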

Recorded Demo

Instructions: ensure that your browser supports Flash, and that you have headphones or speakers to hear the dialogue.

  1. Choose dialogue 1, 2, or 3 using the tabs.
  2. Hover your mouse over the video window, and press the arrow button.
    The video will begin to play, but will pause automatically after one turn of human-computer interaction. The visualizations below the video show the output for that turn.
  3. Press the Continue button (or the play button) to play and visualize subsequent turns of the dialogue.

Dialogue 1

Hello do you have any expensive British restaurants?

Speech Recognition

  • uh in any expensive british restaurant
  • i in any expensive british restaurant
  • uh don't any expensive british restaurant
  • i don't any expensive british restaurant
  • uh do any expensive british restaurant

Semantic Decoding

  • inform(food=british, pricerange=expensive, type=restaurant)
  • inform(pricerange=expensive, type=restaurant)

System Action

confirm(food=british)

Let me confirm, you are looking for a venue that serves British food?

Notes

The recognizer and semantic decoder recover the correct meaning but are not very certain, so the system chooses to confirm the cuisine.

Yeah that's correct; also is expensive.

Speech Recognition

  • yes the price of the is expensive
  • yes the price of the that is expensive
  • yeah the price of the is expensive

Semantic Decoding

  • affirm(pricerange=expensive)
  • affirm(name=venue, pricerange=expensive)
  • inform(pricerange=expensive)

System Action

describe(count=6, food=british, type=restaurant, pricerange=expensive)
request(area, option=centre, option=east)

There are 6 restaurants serving British food in the expensive price range!
What part of town do you have in mind? For example, something in the centre or the east.

Notes

The recognizer captures the correct meaning, and the semantic decoder does the right thing.

Um, the centre of town.

Speech Recognition

  • um the centre area
  • um in the centre area
  • um the centre
  • the centre area
  • um the centre area

Semantic Decoding

  • inform(area=centre)
  • null()

System Action

inform(name="Midsummer House Restaurant", type=restaurant, food=british, pricerange=expensive, area=centre)

Midsummer House Restaurant is an expensive British restaurant in the centre of town.

Notes

Again, the recognizer captures the correct meaning, and the semantic decoder does the right thing.

Are there any other ones?

Speech Recognition

  • are there any other one
  • uh there any other one
  • are is there any other one
  • uh is there any other one
  • hi there any other one

Semantic Decoding

  • request_alternatives()
  • null()

System Action

inform(name="Fitzbillies Restaurant", type=restaurant, food=british, pricerange=expensive, area=centre)

Fitzbillies Restaurant is an expensive British restaurant in the centre of town.

Notes

Very good performance.

Okay, what's the phone number of that one?

Speech Recognition

  • okay what's the phone number of that one
  • okay uh what's the phone number of that one
  • and what's the phone number of that one
  • okay what's their phone number of that one
  • okay what the phone number of that one

Semantic Decoding

  • request(phone)
  • request(phone, type=restaurant)
  • request(phone, area=south)

System Action

inform(name="Fitzbillies Restaurant", phone="01223 352500")

The phone number of Fitzbillies Restaurant is 01223 352500.

Notes

Perfect speech recognition, yielding correct semantics.

Okay that's all. Good bye.

Speech Recognition

  • okay sounds good bye
  • okay that sounds good bye
  • okay that's sounds good bye
  • okay sounds bye
  • okay that that's sounds good bye

Semantic Decoding

  • bye()
  • null()

System Action

bye()

Thank you good bye.

Notes

Easy.

Dialogue 2

Hi, um, I'm looking for some Indian cuisine please.

Speech Recognition

  • hi i am looking for something in the city
  • hi i am looking for something in city
  • hi i looking for something in the city
  • hi i'm am looking for something in the city
  • hi i looking for something in city

Semantic Decoding

  • hello()
  • hello(food=indian)

System Action

describe(count=22, type=restaurant, food=indian)
request(area, option=centre, option=east)

There are 22 restaurants serving Indian food!
What part of town do you have in mind? For example, something in the centre or the east?

Notes

The speech recognizer misrecognizes "Indian" as variants of "in the". The semantic decoder assigns only a low probability (about 0.20) to the hypothesis that restores the "Indian" meaning.

Uh what about the north of town?

Speech Recognition

  • oh what about the north of town
  • oh what about north of town
  • uh what about the north of town
  • what about the north of town
  • um what about the north of town

Semantic Decoding

  • inform(area=north)
  • inform(area=north, type=restaurant)

System Action

describe(count=11, type=restaurant, area=north)
request(pricerange, option=cheap, option=expensive)

There are 11 restaurants in the north of town! What pricerange would you like? For example, cheap or expensive.

Notes

The speech recognizer is accurate here, so this is easy.

Well something moderately priced.

Speech Recognition

  • oh something moderately priced
  • oh from moderately priced
  • oh something the moderately priced
  • oh something moderately price
  • oh something moderately the priced

Semantic Decoding

  • inform(pricerange=moderate)
  • inform(=dontcare, pricerange=moderate)
  • reqalts(pricerange=moderate)
  • affirm(pricerange=moderate)

System Action

describe(count=3, area=north, type=restaurant, pricerange=moderate)
request(food, option=chinese, option=indian)

There are 3 restaurants in the north of town and in the moderate price range!
What kind of food would you like? For example, Chinese or Indian?

Notes

The speech recognizer is accurate here, so this is easy. The system requests the cuisine because it is still unsure whether the user wants Indian food.

Uh Indian actually.

Speech Recognition

  • uh indian actually
  • ah indian actually
  • uh indian
  • uh international actually
  • uh actually

Semantic Decoding

  • inform(food=indian)
  • inform(=dontcare, food=indian)
  • inform(food=indian, pricerange=moderate)

System Action

inform(name="The Nirala", type=restaurant, food=indian, area=north, pricerange=moderate)

The Nirala is a moderate Indian restaurant in the north of town.

Notes

The system has narrowed down the set of suitable venues sufficiently, and now offers a restaurant.
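The narrowing-down in this dialogue can be pictured as filtering a venue table by the constraints collected so far. The sketch below is an assumption about how such a lookup might look, not the system's actual database code. It contains only the four venues named on this page, so its counts will not match the counts reported in the demo, and `VENUES` and `find_venues` are hypothetical names.

```python
# Only the venues mentioned on this page; the real database is larger.
VENUES = [
    {"name": "Midsummer House Restaurant", "food": "british",
     "pricerange": "expensive", "area": "centre"},
    {"name": "Fitzbillies Restaurant", "food": "british",
     "pricerange": "expensive", "area": "centre"},
    {"name": "The Nirala", "food": "indian",
     "pricerange": "moderate", "area": "north"},
    {"name": "Royal Spice", "food": "indian",
     "pricerange": "cheap", "area": "north"},
]


def find_venues(venues, **constraints):
    """Return the venues whose attributes match every given slot=value constraint."""
    return [
        venue for venue in venues
        if all(venue.get(slot) == value for slot, value in constraints.items())
    ]


# Constraints accumulate turn by turn, as in the dialogue above.
print(len(find_venues(VENUES, area="north")))
print(len(find_venues(VENUES, area="north", pricerange="moderate")))
matches = find_venues(VENUES, area="north", pricerange="moderate", food="indian")
print([venue["name"] for venue in matches])
```

Each added constraint can only shrink the match set, which is why the system keeps requesting slots until few enough venues remain to offer one by name.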

Okay. What's the address and phone?

Speech Recognition

  • okay um what's the address
  • okay um what's the address
  • okay um what's the address phone
  • okay um what's the address phone
  • okay um what's the address of

Semantic Decoding

  • request(phone, addr)
  • request(addr)

System Action

inform(name="The Nirala", phone="01223 360966", addr="7 Milton Road Chesterton")

The phone number of The Nirala is 01223 360966 and it is on 7 Milton Road Chesterton.

Notes

The speech recognizer hears the request for the address, but only weakly hears "phone". The semantic decoder recovers by hypothesizing that the user meant both.

Alright, good bye.

Speech Recognition

  • bye bye
  • bye
  • bye goodbye
  • goodbye
  • bye bye

Semantic Decoding

  • bye()
  • null()

System Action

bye()

Thank you, good bye.

Notes

The semantic decoding is quite certain.

Dialogue 3

I want inexpensive restaurant in the centre.

Speech Recognition

  • i want in expensive restaurant in the centre
  • i want in expensive restaurant in centre
  • i want expensive restaurant in the centre
  • i want to in expensive restaurant in the centre
  • i want the expensive restaurant in the centre

Semantic Decoding

  • inform(pricerange=expensive, type=restaurant)
  • inform(area=centre, pricerange=expensive, type=restaurant)
  • hello(pricerange=expensive, type=restaurant)

System Action

describe(count=57, type=restaurant, pricerange=expensive)
request(area, option=centre, option=east)

There are 57 restaurants in the expensive price range! What part of town do you have in mind? For example, something in the centre or the east.

Notes

The speech recognizer misrecognized 'inexpensive' as the two words 'in expensive'. (The spoken sentence is somewhat ungrammatical because the article 'an' is missing, which makes it less likely under the language model.) The semantic decoder interpreted this as a request for an expensive restaurant.

No I want something cheap.

Speech Recognition

  • uh no something cheap
  • ah no something cheap
  • i no something cheap
  • uh know something cheap
  • uh no cheap

Semantic Decoding

  • inform(pricerange=cheap)
  • inform(area=north, pricerange=cheap)
  • negate(pricerange=cheap)
  • negate(area=north, pricerange=cheap)

System Action

select(pricerange=expensive, pricerange=cheap)

Sorry would you like something in the expensive price range or in the cheap price range?

Notes

The recognizer misses the word 'want', but the top semantic hypothesis correctly interprets the price range as cheap. The system has now heard both 'expensive' and 'cheap', and therefore asks the user to choose between the two.
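Why two conflicting observations leave the system undecided can be seen in a toy Bayesian update over a single slot. This is a deliberately simplified sketch: the confidence scores (0.7 and 0.8) are made up, the name `update_belief` is hypothetical, and there is no model of the user changing their goal. The belief tracking in BUDS itself is considerably richer than this.

```python
def update_belief(belief, observation):
    """One Bayesian update of a belief over slot values.

    `observation` maps values to decoder confidences. Any confidence mass
    not assigned to a value is treated as uninformative and spread evenly.
    """
    values = list(belief)
    unexplained = 1.0 - sum(observation.values())
    posterior = {}
    for value in values:
        likelihood = observation.get(value, 0.0) + unexplained / len(values)
        posterior[value] = belief[value] * likelihood
    total = sum(posterior.values())
    return {value: prob / total for value, prob in posterior.items()}


belief = {"cheap": 1 / 3, "moderate": 1 / 3, "expensive": 1 / 3}
belief = update_belief(belief, {"expensive": 0.7})  # turn 1: "in expensive"
belief = update_belief(belief, {"cheap": 0.8})      # turn 2: "something cheap"
for value, prob in sorted(belief.items(), key=lambda item: -item[1]):
    print(f"{value}: {prob:.2f}")
```

With these assumed scores the belief ends at roughly 0.59 for cheap, 0.36 for expensive, and 0.05 for moderate. Two values remain plausible, which is the kind of situation in which a select() action is more useful than committing to either one.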

I'd like a cheap Indian restaurant.

Speech Recognition

  • i'd like a cheap indian restaurant
  • i'd i'd like a cheap indian restaurant
  • uh like a cheap indian restaurant
  • i'd like to cheap indian restaurant
  • i'd like a a cheap indian restaurant

Semantic Decoding

  • inform(food=indian, pricerange=cheap, type=restaurant)
  • inform(pricerange=cheap, type=restaurant)

System Action

describe(count=4, type=restaurant, food=indian, pricerange=cheap)
request(area, option=centre, option=north)

There are 4 restaurants serving Indian food in the cheap price range! What part of town do you have in mind? For example, something in the centre or the north.

Notes

Speech recognition and semantic decoding are both correct.

Uh the north would be good.

Speech Recognition

  • uh no it again
  • uh no would again
  • uh no again
  • uh no it good
  • uh no would good

Semantic Decoding

  • inform(area=north)
  • null()
  • negate(area=north)
  • negate()

System Action

inform(name="Royal Spice", type=restaurant, pricerange=cheap, food=indian, area=north)

Royal Spice is a cheap Indian restaurant in the north of town.

Notes

The speech recognizer mistakes the word 'north' for a combination of 'no' and another word, e.g. 'no it', 'no would'. However, the semantic decoder has learned that such combinations may actually mean north, and these are more than twice as probable as the semantic decodings for negation ('no').
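One common way to obtain a ranked list of meanings from an N-best list is to decode each hypothesis separately and sum the resulting probabilities per dialogue act. The sketch below illustrates that idea only; it is not a description of the BUDS semantic decoder. The recognition probabilities and the `toy_decoder` rules are invented for illustration.

```python
from collections import defaultdict


def combine_nbest(asr_nbest, decode_one):
    """Sum P(hypothesis) * P(act | hypothesis) over the N-best list."""
    scores = defaultdict(float)
    for words, p_words in asr_nbest:
        for act, p_act in decode_one(words).items():
            scores[act] += p_words * p_act
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def toy_decoder(words):
    """Hypothetical decoder: 'no' plus a following word often means 'north'."""
    if words.startswith("uh no") and len(words.split()) >= 4:
        return {"inform(area=north)": 0.6, "negate()": 0.25, "null()": 0.15}
    return {"negate()": 0.6, "inform(area=north)": 0.2, "null()": 0.2}


# Hypotheses from the turn above; the probabilities are made up.
asr_nbest = [
    ("uh no it again", 0.30),
    ("uh no would again", 0.25),
    ("uh no again", 0.20),
    ("uh no it good", 0.15),
    ("uh no would good", 0.10),
]
for act, score in combine_nbest(asr_nbest, toy_decoder):
    print(f"{act}: {score:.2f}")
```

Even though every hypothesis contains the word 'no', the summed evidence in this toy setup ranks inform(area=north) above negate(), mirroring the behaviour described in the note.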

What's the phone number, and the address?

Speech Recognition

  • uh what's the phone number and the address
  • ah what's the phone number and the address
  • uh what's the phone number and address
  • uh what's is the phone number and the address
  • uh what the phone number and the address

Semantic Decoding

  • request(phone, addr)
  • request(phone)

System Action

inform(name="Royal Spice", phone="01733 553355", addr="Victoria Avenue Chesterton")

The phone number of Royal Spice is 01733 553355 and it is on Victoria Avenue Chesterton.

Notes

The speech recognizer is pretty much correct, so the top semantic hypothesis is very likely too.

Uh great, thanks. Good bye.

Speech Recognition

  • okay thanks goodbye
  • okay thanks goodbye bye
  • okay thanks good
  • uh thanks goodbye
  • okay thanks good bye

Semantic Decoding

  • bye()
  • null()

System Action

bye()

Thank you good bye.

Notes

The speech recognizer is okay here. Although the correct recognition is only 8th in the list, the semantic decoder is very certain of the intended meaning.

Live demo

A system you can speak to, loaded with restaurants in Cambridge, UK. For NIPS, the system will be switched to restaurants in Tahoe. It requires a good Internet connection, a microphone, and headphones.

Visualization

The live demonstration (shown during the conference) additionally includes a visualization of the system state, in particular the three components (1, 2, 3) mentioned above. Below is a screenshot.

Screenshot of demo