Positions

  • 2017 - Present

    Research Intern

    Google Brain

  • 2016 - 2017

    Research Consultant

    IPSoft, Amelia team

  • 2013 - 2014

    Compulsory Second Lieutenant Chief Counselor

    Taiwan Army

  • 2011 - 2013

    Teaching Assistant

    National Taiwan University, Digital Speech Processing and Speech Special Project

  • 2011 - 2013

    Part-time Algorithm Developer

    StorySense Computing, Inc. (acquired by 电话帮 in 2014)

Education

  • Ph.D. Present

    Ph.D. student in Engineering

    University of Cambridge

  • M.A. 2013

    Master of Science in Engineering

    National Taiwan University

  • B.A. 2011

    Bachelor of Science in Engineering

    National Taiwan University

Honors, Awards and Grants

  • 2015
    Best Paper Award, EMNLP 2015
    One of 3 best paper awards among 312 accepted papers.
  • 2015
    Best Paper Award, SigDial 2015
    One of 3 best paper awards among around 100 accepted papers.
  • 2015
    Toshiba Research Studentship, Toshiba Research Europe Ltd
    3-year studentship funded by Toshiba Research Europe Ltd, Cambridge Research Laboratory, for developing wide domain statistical dialogue systems.
  • 2015
    Government Scholarship for Studying Overseas, MOE of Taiwan
    1 of 16 selected EECS students based on outstanding academic achievements.
  • Aug 2013
    InterSpeech 2013 Best Student Paper Nominee, ISCA
    One of 12 nominees among thousands of accepted papers.
  • Dec 2012
    InterSpeech 2012 Best Student Paper Nominee, ISCA
    One of 10 nominees among thousands of accepted papers.
  • 2010
    Sir Zong Education Foundation Student Grant, Sir Zong Foundation
    Scholarship for outstanding college and high school students.

Research Projects

  • Neural Network for Language Generation

    Stochastic Language Generation using Neural Networks

    To appear

  • Personalised Language Modeling

    Personalising language models using social network crowdsourcing

    Designed a crowdsourcing platform to collect personal corpora from social networks.

    Built personalized language models by adopting social properties.

    Compared personalization capabilities of N-gram and Recurrent Neural Network LMs.

  • Interactive Retrieval

    Interactive retrieval system for spoken content

    Cast interactive retrieval as a Markov Decision Process (MDP).

    Developed and compared various MDP models and reinforcement learning methods.

    Implemented a state (retrieval quality) estimator that maps retrieval indicators to MDP states.

Industry Invited Talks

  • 19 Jun 2017

    Samsung R&D, Warsaw, Poland

    Title: "Deep Learning for Natural Language Generation and End-to-End Dialogue Modeling" [slides]

  • 08 Mar 2017

    Apple Siri team, Cambridge, UK

    Title: "Task-oriented Neural Dialogue Systems" [slides]

  • 23 Jun 2016

    Google Deep Dialogue team, Mountain View, CA, USA

    Title: "A Network-based End-to-End Trainable Task-oriented Dialogue System" [slides]

  • 23 Feb 2016

    Xerox Research Centre Europe, Grenoble, France

    Title: "Scalable Neural Language Generation for Spoken Dialogue Systems" [slides]

  • 29 Jul 2015

    Baidu NLP group seminar, Beijing, China

    Title: "Scalable Neural Language Generation for Open Domain Dialogue Systems" [slides]

Academia Invited Talks

  • 05 Jan 2017

    National Taiwan University, Taipei, Taiwan

    Title: "Task-oriented Neural Dialogue Systems" [slides]

  • 09 Nov 2016

    Toyota Technological Institute at Chicago, Chicago, IL, USA

    Title: "Task-oriented Neural Dialogue Systems" [slides]

  • 06 Sep 2016

    Tutorial @ INLG, Edinburgh, UK

    Title: "Deep Learning for Natural Language Generation" [slides] [opensource]

  • 24 May 2016

    Heriot-Watt University, Edinburgh, UK

    Title: "Beyond Conditional LM: NN Language Generation for Dialogue Systems" [slides]

  • 19 Nov 2015

    University of Sheffield, United Kingdom

    Title: "Neural Language Generation for Spoken Dialogue Systems" [slides]

  • 11 Sep 2015

    University of Cambridge, United Kingdom

    Title: "Semantically Conditioned LSTM-based NLG for Spoken Dialogue Systems" [slides]

  • 18 Aug 2015

    Academia Sinica, Taipei, Taiwan

    Title: "Scalable Neural Language Generation for Open Domain Dialogue Systems" [slides]

Teaching

  • 25 Feb 2016

    MPhil course for Spoken Dialogue Systems, University of Cambridge, UK

    Title: "Statistical Natural Language Generation" [slides]

Latent Intention Dialogue Models

Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, Steve Young
Conference Papers: In Proceedings of ICML, Sydney, Australia, August 2017

Abstract

Developing a dialogue agent that is capable of making autonomous decisions and communicating by natural language is one of the long-term goals of machine learning research. Traditional approaches either rely on hand-crafting a small state-action set for applying reinforcement learning that is not scalable or constructing deterministic models for learning dialogue sentences that fail to capture natural conversational variability. In this paper, we propose a Latent Intention Dialogue Model (LIDM) that employs a discrete latent variable to learn underlying dialogue intentions in the framework of neural variational inference. In a goal-oriented dialogue scenario, these latent intentions can be interpreted as actions guiding the generation of machine responses, which can be further refined autonomously by reinforcement learning. The experimental evaluation of LIDM shows that the model out-performs published benchmarks for both corpus-based and human evaluation, demonstrating the effectiveness of discrete latent variable models for learning goal-oriented dialogues.

A Network-based End-to-End Trainable Task-oriented Dialogue System

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. R.-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young
Conference Papers: In Proceedings of EACL, Valencia, Spain, April 2017

Abstract

Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components and typically this involves either a large amount of handcrafting, or acquiring labelled datasets and solving a statistical learning problem for each component. In this work we introduce a neural network-based text-in, text-out end-to-end trainable dialogue system along with a new way of collecting task-oriented dialogue data based on a novel pipe-lined Wizard-of-Oz framework. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. The results show that the model can converse with human subjects naturally whilst helping them to accomplish tasks in a restaurant search domain.

Conditional Generation and Snapshot Learning in Neural Dialogue Systems

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. R.-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young
Conference Papers: In Proceedings of EMNLP, Austin, Texas, USA, November 2016

Abstract

Recently a variety of LSTM-based conditional language models (LM) have been applied across a range of language generation tasks. In this work we study various model architectures and different ways to represent and aggregate the source information in an end-to-end neural dialogue system framework. A method called snapshot learning is also proposed to facilitate learning from supervised sequential signals by applying a companion cross-entropy objective function to the conditioning vector. The experimental and analytical results demonstrate firstly that competition occurs between the conditioning vector and the LM, and the differing architectures provide different trade-offs between the two. Secondly, the discriminative power and transparency of the conditioning vector is key to providing both model interpretability and better performance. Thirdly, snapshot learning leads to consistent performance improvements independent of which architecture is used.

Multi-domain Neural Network Language Generation for Spoken Dialogue Systems

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. R.-Barahona, Pei-Hao Su, David Vandyke, and Steve Young
Conference Papers: In Proceedings of NAACL-HLT, San Diego, USA, June 2016

Abstract

Moving from limited-domain natural language generation (NLG) to open domain is difficult because the number of semantic input combinations grows exponentially with the number of domains. Therefore, it is important to leverage existing resources and exploit similarities between domains to facilitate domain adaptation. In this paper, we propose a procedure to train multi-domain, Recurrent Neural Network-based (RNN) language generators via multiple adaptation steps. In this procedure, a model is first trained on counterfeited data synthesised from an out-of-domain dataset, and then fine tuned on a small set of in-domain utterances with a discriminative objective function. Corpus-based evaluation results show that the proposed procedure can achieve competitive performance in terms of BLEU score and slot error rate while significantly reducing the data needed to train generators in new, unseen domains. In subjective testing, human judges confirm that the procedure greatly improves generator performance when only a small amount of data is available in the domain.

Toward Multi-domain Language Generation using Recurrent Neural Networks

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. R.-Barahona, Pei-Hao Su, David Vandyke, and Steve Young
Workshop Papers: NIPS Workshop on ML for SLU and Interaction, Montreal, Canada, December 2015

Abstract

In this paper we study the performance and domain scalability of two different Neural Network architectures for Natural Language Generation in Spoken Dialogue Systems. We found that by imposing a sigmoid gate on the dialogue act vector, the Semantically Conditioned Long Short-term Memory generator can prevent semantic repetitions and achieve better performance across all domains compared to an RNN Encoder-Decoder generator. However, in a domain adaptation experiment, the RNN Encoder-Decoder generator, with a separate slot and value parameterisation, is capable of learning faster by leveraging out-of-domain data. We conclude that the way to represent and integrate the semantic elements is of great importance to NN-based NLG systems. Further advances will therefore require a representation that is more scalable across domains without significantly compromising in-domain performance.

Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems [Best Paper Award]

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young
Conference Papers: In Proceedings of EMNLP, Lisbon, Portugal, September 2015

Abstract

Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact both on usability and perceived quality. Most NLG systems in common use employ rules and heuristics and tend to generate rigid and stylised responses without the natural variation of human language. They are also not easily scaled to systems covering multiple domains and languages. This paper presents a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. The LSTM generator can learn from unaligned data by jointly optimising sentence planning and surface realisation using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates. With fewer heuristics, an objective evaluation in two differing test domains showed the proposed method improved performance compared to previous methods. Human judges scored the LSTM system higher on informativeness and naturalness and overall preferred it to the other systems.

Stochastic Language Generation in Dialogue using Recurrent Neural Networks with Convolutional Sentence Reranking [Best Paper Award]

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young
Conference Papers: In Proceedings of SigDial, Prague, Czech Republic, September 2015

Abstract

The natural language generation (NLG) component of a spoken dialogue system (SDS) usually needs a substantial amount of handcrafting or a well-labeled dataset to be trained on. These limitations add significantly to development costs and make cross-domain, multi-lingual dialogue systems intractable. Moreover, human languages are context-aware. The most natural response should be directly learned from data rather than depending on predefined syntaxes or rules. This paper presents a statistical language generator based on a joint recurrent and convolutional neural network structure which can be trained on dialogue act-utterance pairs without any semantic alignments or predefined grammar trees. Objective metrics suggest that this new model outperforms previous methods under the same experimental conditions. Results of an evaluation by human judges indicate that it produces not only high quality but linguistically varied utterances which are preferred compared to n-gram and rule-based systems.

Recurrent Neural Network Based Language Model Personalization by Social Network Crowdsourcing [Best Paper Shortlist]

Tsung-Hsien Wen, Aaron Heidel, Hung-yi Lee, Yu Tsao and Lin-Shan Lee
Conference Papers: In Proceedings of InterSpeech, Lyon, France, August 2013

Abstract

Speech recognition has become an important feature in smartphones in recent years. Different from traditional automatic speech recognition, the speech recognition on smartphones can take advantage of personalized language models to model the linguistic patterns and wording habits of a particular smartphone owner better. Owing to the popularity of social networks in recent years, personal texts and messages are no longer inaccessible. However, data sparseness is still an unsolved problem. In this paper, we propose a three-step adaptation approach to personalize recurrent neural network language models (RNNLMs). We believe that its capability to model word histories as distributed representations of arbitrary length can help mitigate the data sparseness problem. Furthermore, we also propose additional user-oriented features to empower the RNNLMs with stronger capabilities for personalization. The experiments on a Facebook dataset showed that the proposed method not only drastically reduced the model perplexity in preliminary experiments, but also moderately reduced the word error rate in n-best rescoring tests.

Interactive Spoken Content Retrieval by Extended Query Model and Continuous State Space Markov Decision Process

Tsung-Hsien Wen, Hung-yi Lee, Pei-hao Su, and Lin-Shan Lee
Conference Papers: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013

Abstract

Interactive retrieval is important for spoken content because, in addition to speech recognition uncertainty, the retrieved spoken items are difficult to show on the screen and difficult for the user to scan and select. The user cannot play back and go through all the retrieved items to find out what he is looking for. A Markov Decision Process (MDP) was used in a previous work to help the system take different actions to interact with the user based on an estimated retrieval performance, but the MDP state was represented by the less precise quantized retrieval performance metric. In this paper, we consider the retrieval performance metric as a continuous state variable in the MDP and optimize the MDP by fitted value iteration (FVI). We also use query expansion with the language modeling retrieval framework to produce the next set of retrieval results. Improved performance was found in the preliminary experiments.

Personalized Language Modeling by Crowd Sourcing with Social Network Data for Voice Access of Cloud Applications

Tsung-Hsien Wen, Hung-yi Lee, Tai-Yuan Chen, and Lin-Shan Lee
Conference Papers: IEEE Workshop on Spoken Language Technology (SLT), Miami, Florida, December 2012

Abstract

Voice access of cloud applications via smartphones is very attractive today, specifically because a smartphone is used by a single user, so personalized acoustic/language models become feasible. In particular, because huge quantities of texts are available within the social networks over the Internet with known authors and given relationships, it is possible to train personalized language models: it is reasonable to assume users with those relationships may share some common subject topics, wording habits and linguistic patterns. In this paper, we propose an adaptation framework for building a robust personalized language model by incorporating the texts the target user and other users had posted on the social networks over the Internet to take care of the linguistic mismatch across different users. Experiments on a Facebook dataset showed encouraging improvements in terms of both model perplexity and recognition accuracy with the proposed approaches considering relationships among users, similarity based on latent topics, and random walk over a user graph.

Interactive Spoken Content Retrieval with Different Types of Action Optimized by a Markov Decision Process [Best Paper Shortlist]

Tsung-Hsien Wen, Hung-yi Lee, and Lin-Shan Lee
Conference Papers: In Proceedings of InterSpeech, Portland, OR, USA, September 2012

Abstract

Interaction with the user is especially important for spoken content retrieval, not only because of the recognition uncertainty, but because the retrieved spoken content items are difficult to show on the screen and difficult for the user to scan and select. The user cannot play back and go through all the retrieved items and then find out they are not what he is looking for. In this paper, we propose a new approach for interactive spoken content retrieval, in which the system can estimate the quality of the retrieved results, and take different types of actions to clarify the user’s intention based on an intrinsic policy. The policy is optimized by a Markov Decision Process (MDP) trained with Reinforcement Learning based on a set of pre-defined rewards considering the extra burden given to the user.

Voice Access of Cloud Applications: Language Model Personalization and Interactive Spoken Content Retrieval

Tsung-Hsien Wen
Thesis

Abstract

This thesis considers voice access of cloud applications in two parts: (1) personalized language modeling and (2) interactive spoken document retrieval. Model mismatch has been a major problem in speech recognition. With hand-held devices widely used today, personalized models become possible. As huge quantities of posts and comments with known owners have emerged on social network websites, personal corpora have become practically available, but the data sparseness problem remains unsolved. In the first part of this thesis, we proposed personalized language modeling approaches by estimating the language similarities between different social network users and integrating the corresponding personal corpora accordingly. We studied both N-gram language models as well as recurrent neural network language models, and the experimental results support the concept. In the second part of this thesis, we studied interactive spoken document retrieval. Interactive retrieval is helpful to spoken content retrieval because, in addition to the speech recognition uncertainty, retrieved spoken items are difficult to show on screen and difficult for the user to browse. We model the interaction process by a Markov Decision Process and train the policy with Reinforcement Learning. Experimental results demonstrate that the retrieval performance can be improved with the interactions.

Stochastic Language Generation in Dialogue using Recurrent Neural Networks with Convolutional Sentence Reranking

Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young
Technical Report: University of Cambridge Engineering Department

Abstract

The natural language generation (NLG) component of a spoken dialogue system (SDS) usually needs a substantial amount of handcrafting or a well-labeled dataset to be trained on. These limitations add significantly to development costs and make cross-domain, multi-lingual dialogue systems intractable. Moreover, human languages are context-aware. The most natural response should be directly learned from data rather than depending on predefined syntaxes or rules. This paper presents a statistical language generator based on a joint recurrent and convolutional neural network structure which can be trained on dialogue act-utterance pairs without any semantic alignments or predefined grammar trees. Objective metrics suggest that this new model outperforms previous methods under the same experimental conditions. Results of an evaluation by human judges indicate that it produces not only high quality but linguistically varied utterances which are preferred compared to n-gram and rule-based systems.