CUHTK Broadcast News English Recipe

Introduction

Speech Communication -- Broadcast News Special Issue

The commands given below assume that the standard ears environment variables are set up. bash users should execute:

bash$ . /home/widowb/ears/tools/ears_env.sh

tcsh users:

tcsh$ source /home/widowb/ears/tools/ears_env.csh
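Before running anything below, it can help to confirm that the environment script took effect. The following is a minimal sketch for bash, assuming the script defines a variable named ears (inferred from the $ears/... paths used in this recipe). check_ears_env is a hypothetical helper, not part of the ears tools.

```shell
# Sketch: confirm that the environment script defined $ears.
# Assumption: the variable is called "ears", as the $ears/... paths suggest.
# check_ears_env is a hypothetical helper name.
check_ears_env() {
    if [ -z "${ears:-}" ]; then
        echo "ears is not set; source ears_env.sh first" >&2
        return 1
    fi
    if [ ! -d "$ears" ]; then
        echo "ears points to a missing directory: $ears" >&2
        return 1
    fi
    echo "ears=$ears"
}
```

Calling check_ears_env after sourcing ears_env.sh prints the resolved directory, or returns a non-zero status with a message on stderr if the variable is missing.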

Training Data

The training data consists of two sets of data released by the LDC. We refer to them as the 1997 and 1998 training sets (the LDC calls them 1996 and 1997 respectively).

The audio files can be found in $ears/bn-e/audio

The transcriptions for both of these sets are provided by the LDC as well and are encoded in SGML. Two different formats are used for the two sets. See $ears/bn-e/lib/trans

The transcriptions were converted into STM format using tools provided by LDC/NIST (different ones for the 97 and 98 data!). See $ears/bn-e/lib/stm

From these STM files, HTK-format MLF files containing the transcriptions and SCP files listing the segments were generated: $ears/bn-e/lib/wlabs/bnetrain02-unsplit and $ears/bn-e/lib/flists/bnetrain02-unsplit
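The conversion scripts themselves are not reproduced in this recipe. As an illustration only, the sketch below turns STM lines into an HTK MLF using awk. It assumes the usual STM field order (file, channel, speaker, start, end, optional <...> label, then the words) and a hypothetical segment naming scheme of file_start_end with times in units of 10 ms. The labels in bnetrain02-unsplit may well be named differently, and the real conversion also has to deal with text normalisation, which is ignored here.

```shell
# Sketch: STM -> HTK MLF. Not the script used to build bnetrain02-unsplit.
# Assumptions: standard STM field order; the label name FILE_START_END
# (times in 10 ms units, zero-padded) is a hypothetical convention.
stm_to_mlf() {
    awk '
    BEGIN { print "#!MLF!#" }
    /^;;/ || NF < 6 { next }                 # skip STM comments and lines with no words
    {
        first = ($6 ~ /^<.*>$/) ? 7 : 6      # an optional <...> label precedes the words
        if (first > NF) next                 # label present but no words
        printf "\"*/%s_%07d_%07d.lab\"\n", $1, $4 * 100 + 0.5, $5 * 100 + 0.5
        for (i = first; i <= NF; i++) print $i   # one word per line
        print "."                            # end-of-label marker
    }' "$@"
}
```

The function reads STM from the named files or from standard input and writes the MLF to standard output, e.g. stm_to_mlf some.stm > some.mlf (both file names are placeholders).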

Test Data

to be completed

For testing with some decoders (e.g. xvx) the model set has to be converted into RMF format.

ge204@watch$ /home/solveb/hub5/htkbin/bin.i686/HRConvert.090302 /home/widowb/ears/bn-E/exp/eval98-test/hmms/hrconvert.cfg -a hmm164/MMF -b xwrd.clustered.mlist -c hmm164/RMF | tee hmm164/hrconvert.LOG

Training MLE Models

Clustering cross-word triphones

Preparing models

The first step in training is the creation of the decision trees that define the state tying for all cross-word triphones. The clustering is performed by HHEd, which needs as input untied triphones trained with HERest on all the training data. This training uses 2-model re-estimation, in which a set of well-trained (state-clustered, multi-mixture) models is used to perform the state alignment. The input model set must have the same topology as the alignment model set. Typically simple monophone HMMs (found in monoHMMs/) are cloned for all triphones in the training set (this list is in train.xwrd.mlist, see train.xwrd.mlist.LOG). The transition matrices of all allophones of the same centre phone are tied. The resulting MMF is stored in hmm0/.

ge204@widow$ /home/widowb/ears/bin/bin.linux/HHEd -A -D -V -T 1 -B -H monoHMMs/MMF-silsp -H monoHMMs/MMF.silsp -w hmm0/MMF \
  /home/widowb/ears/bn-E/lib/edScripts/tie_clone.hed /home/widowb/ears/bn-E/lib/mlists/mono.mlist

See tie_clone.LOG

Gathering statistics for clustering

The next step is the 2-model re-estimation, which yields the trained untied triphones in hmm1/ and the occupation statistics file in hmm1/stats.

2-model re-estimation
ge204@watch$ /home/widowb/ears/tools/herest.codine HTEfiles/HTE.2model hmm0 hmm1

The untied triphones and the occupation statistics provide all the data needed in the actual clustering.

Clustering

The clustering is performed by the HTK tool HHEd. As input it needs the trained untied triphones found in hmm1/MMF with the associated statistics in hmm1/stats. Two parameters control the number of states in the resulting tied-state system: the outlier threshold (argument to RO in HHEd command files) and the splitting threshold (argument to TB).

In preparation for the clustering a list of all triphones required in the resulting tied-state model set has to be provided. We always use the full list of all possible triphones at this point, to make sure that our model sets can be used with arbitrary dictionaries.

ge204@watch$ ln -s /home/widowb/ears/bn-E/lib/mlists/all.tri.list unseen
ge204@watch$ gunzip hmm1/stats.gz

To start the clustering a subdirectory clustering/ is created and a template of the HHEd command file is copied (or linked) into it. The standard template can be found in $ears/bn-E/lib/edScripts/cluster_ROVAL_TBVAL.hed. The actual clustering job is submitted by running

ge204@watch$ cluster.sh 1000 750
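cluster.sh is not reproduced in this recipe. Judging by the template name cluster_ROVAL_TBVAL.hed, it presumably fills in the two thresholds before calling HHEd. A minimal sketch of that substitution step follows, assuming the template contains the literal placeholders ROVAL and TBVAL (an assumption that is not confirmed here). instantiate_hed is a hypothetical helper name.

```shell
# Sketch: fill the RO and TB thresholds into an HHEd command-file template.
# Assumption: the template uses the literal placeholders ROVAL and TBVAL.
# $1 = template file, $2 = RO value, $3 = TB value; result goes to stdout.
instantiate_hed() {
    sed -e "s/ROVAL/$2/g" -e "s/TBVAL/$3/g" "$1"
}
```

For example, instantiate_hed cluster_ROVAL_TBVAL.hed 1000 750 > cluster_1000_750.hed would write a command file for the thresholds used in the command above; the output file name is only illustrative.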

For cluster.sh 1000 750, the results will be in clustering/hmm10_1000_750 and consist of the tied model file MMF, the associated model list newlist.1000_750 and the decision trees cluster_1000_750.trees. The number of states can be found in the final "TB: Stats" line of the LOG file.
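To pull the state count out of a clustering log, the last matching line can be extracted with standard tools. This is a small sketch that assumes only that the relevant lines contain the text "TB: Stats", as stated above; the exact layout of the rest of the line is not assumed. final_tb_stats is a hypothetical helper name.

```shell
# Sketch: print the last "TB: Stats" line of an HHEd clustering log.
# $1 = path to the LOG file. Only the "TB: Stats" marker is assumed.
final_tb_stats() {
    grep 'TB: Stats' "$1" | tail -n 1
}
```

Running it on the LOG file of each clustering directory gives a quick way to compare the state counts of different RO/TB settings.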

Typically a number of different RO and TB values are tried, and a particular combination is then chosen by creating two links pointing to the chosen set, e.g.:

ge204@watch$ ln -s clustering/hmm10_1500_1500 hmm10
ge204@watch$ ln -s clustering/hmm10_1500_1500/newlist.1500_1500 xwrd.clustered.mlist
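Trying several RO/TB combinations can be scripted with a small loop around cluster.sh. The sketch below assumes that cluster.sh takes the RO value and the TB value as its two arguments, as in the invocation shown earlier. sweep_clustering and the CLUSTER_CMD override (useful for a dry run with echo) are hypothetical names introduced here.

```shell
# Sketch: submit one clustering job per RO:TB pair.
# Assumption: cluster.sh takes RO and TB as its two positional arguments.
# CLUSTER_CMD is a hypothetical override, e.g. CLUSTER_CMD=echo for a dry run.
sweep_clustering() {
    for pair in "$@"; do
        ro=${pair%%:*}      # text before the colon
        tb=${pair##*:}      # text after the colon
        ${CLUSTER_CMD:-cluster.sh} "$ro" "$tb" || return 1
    done
}
```

For instance, sweep_clustering 1000:750 1500:1500 would submit the two settings that appear in this recipe; the links are then created by hand once a setting has been chosen.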

Re-estimation and Mixing-up

Now four iterations of embedded re-estimation are performed on the single-mixture system:

ge204@watch$ hbuild.codine HTEfiles/HTE 11 14

At this stage the variance floors of the models can be re-calculated:

ge204@watch$ mv hmm14/MMF hmm14/MMF.orig
ge204@watch$ gunzip hmm14/stats.gz

ge204@watch$ HHEd -A -D -V -B -C cfgs/ave_var.cfg -T 1 -H hmm14/MMF.orig -w hmm14/MMF \
               $ears/lib/edScripts/ave_var_0.1_hmm14.hed xwrd.clustered.mlist | \
             tee ave_var_0.1_hmm14.LOG

Now the resulting single-mixture models can be trained up by alternating mixing-up and re-estimation steps:

ge204@watch$ hconstruct.codine HTEfiles/HTE 1 16 high 4 2 ibm

Gender-dependent models

Gender-dependent models are created by performing one iteration of re-estimation on the gender-specific data only, updating just the means and mixture weights.

ge204@watch$ /home/widowb/ears/tools/herest.codine HTEfiles/HTE.m hmm164 hmm164.m
ge204@watch$ /home/widowb/ears/tools/herest.codine HTEfiles/HTE.f hmm164 hmm164.f

Gunnar Evermann

Last modified: Fri Oct 11 19:55:20 BST 2002