¹ Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK. Email: sej28@eng.cam.ac.uk
² MIT-Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420-9185, USA. Email: dar@ll.mit.edu
A particularly challenging domain for speaker labelling is broadcast news
shows. These programs contain an unpredictable number of speakers, who speak for
widely varying amounts of time and sometimes simultaneously, and also include
unwanted regions such as commercial (advert) breaks.
However, tracking speakers through current affairs debates, or being able to
search for information known to be spoken by the primary anchor or newsreader,
can be very beneficial - and so the task of identifying 'who spoke when'
in broadcast news audio is particularly interesting and challenging.
This paper describes systems developed at CUED and MIT-LL to perform automatic segmentation, clustering and labelling of speakers (and in some cases commercial breaks) in broadcast news data. The paper is arranged as follows: the 'diarisation' error rate used for scoring is explained in Section 2 and the data used for experiments defined in Section 3. The December 2003 CUED diarisation system is described in Section 4 which also introduces a new clustering procedure with new stopping criteria. The MIT diarisation system is described in Section 5 and a hybrid 'Plug and Play' system which combines stages of both the CUED and MIT-LL system is described in Section 6 along with comprehensive experimental results. Finally conclusions are offered in Section 7.
A system hypothesises a set of speaker segments, each of which consists of
a speaker-id label and the corresponding start and end time. This
is then scored against a reference 'ground-truth' speaker
segmentation. A one-to-one mapping of the reference
speaker IDs to the hypothesis speaker IDs is performed so as to maximise the
total overlap of the reference and (corresponding) mapped
hypothesis speakers.
Speaker detection performance is then expressed in terms of the miss
(speaker in reference but not in hypothesis), false alarm
(speaker in hypothesis but not in reference), and speaker-error
(mapped reference speaker is not the same as the hypothesised
speaker) rates.
The overall diarisation score is the sum of these three
components, and can be calculated using the following
formula:

DIA = [ Σ_s dur(s) · ( max(N_R(s), N_H(s)) − N_C(s) ) ] / [ Σ_s dur(s) · N_R(s) ]

where the sum is over the longest continuous pieces of audio, s, for which the
reference and hypothesised speakers do not change, dur(s) is the duration of s,
N_R(s) is the number of reference speakers in s, N_H(s) is the number of
hypothesised speakers in s, and N_C(s) is the number of mapped reference
speakers which match the hypothesised speakers.
Since the RT-03s diarisation score excluded from scoring areas where multiple
reference speakers were talking simultaneously, and we do not postulate any
regions of overlapping speech in the hypotheses, this formula becomes:

DIA = [ Σ_s dur(s) · ( H_miss(s) + H_fa(s) + H_spe(s) ) ] / [ Σ_s dur(s) · H_ref(s) ]

where each indicator H is zero except that H_miss(s) is 1 for a missed speech
segment, H_fa(s) is 1 for a false alarm speech segment, H_spe(s) is 1 for a
segment with a speaker error, and H_ref(s) is 1 for a segment containing a
reference speaker.
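As a concrete illustration, a minimal Python sketch of how this score can be computed from per-segment counts (the per-segment tuple structure is assumed for illustration, not the RT-03s scoring tools):

```python
# A minimal sketch of the diarisation score above; `segments` is an assumed list of
# (dur, n_ref, n_hyp, n_correct) tuples, one per piece of audio s over which the
# reference and hypothesised speakers do not change. With at most one reference and
# one hypothesised speaker per segment this reduces to the indicator form above.
def diarisation_score(segments):
    error = sum(dur * (max(n_ref, n_hyp) - n_cor)
                for dur, n_ref, n_hyp, n_cor in segments)
    total = sum(dur * n_ref for dur, n_ref, _, _ in segments)
    return 100.0 * error / total
```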
The references used for scoring were generated according to the rules specified in [1, 5]. Effectively, speaker turns were derived using word times generated by a word-level forced alignment from the Linguistic Data Consortium (LDC), with segment breaks when either a new speaker starts talking, or the speaker pauses for more than a certain critical length of time (here fixed at 0.3s as was used in the RT-03s diarisation evaluation). Speaker-attributable non-lexical events, such as {cough, breath, lipsmack, sneeze and laughter} were excluded from scoring along with their adjoining silences. Commercial breaks were not transcribed for the reference, and as a result were also excluded from scoring in the primary scoring metric, although we also consider a secondary metric which penalises systems for retaining adverts in their hypothesised output.
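For illustration, a simplified sketch of this turn derivation, assuming word-level (speaker, start, end) times are already available from the forced alignment (this is not the LDC tooling itself):

```python
PAUSE_GAP = 0.3  # seconds; the critical pause length used in the RT-03s evaluation

def words_to_turns(words, gap=PAUSE_GAP):
    """Derive reference speaker turns from word alignments: a new turn starts when
    the speaker changes or the same speaker pauses for more than `gap` seconds.
    `words` is a list of (speaker, start, end) tuples sorted by start time."""
    turns = []
    for spk, start, end in words:
        if turns and turns[-1][0] == spk and start - turns[-1][2] <= gap:
            turns[-1] = (spk, turns[-1][1], end)  # extend the current turn
        else:
            turns.append((spk, start, end))       # begin a new turn
    return turns
```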
Each data set consists of one 30 minute extract from 6 different US broadcast news shows. Two of these are radio shows, namely Voice of America English News (VOA_ENG) and PRI The World (PRI_TWD); and four are TV shows, namely NBC Nightly News (NBC_NNW), ABC World News Tonight (ABC_WNT), MSNBC News with Brian Williams (MNB_NBW) and CNN Headline News (CNN_HDL). Details of the exact composition of the data sets can be found in [7].
A library of broadcast news shows was made using the English TDT-4 training
data, excluding the shows from the RT-03s development sets. (The library used
in the advert detection was obtained automatically by windowing over whole
training shows. It is possible in theory to manually mark the training data to
define a true 'library of known adverts', but this was considered impractical
on these large data sets.) The library consisted of between 40 and 70 shows for
each of the 6 broadcasters, spanning October 2000 to January 2001, and is
denoted CU_TDT4. A further library was generated which, for each broadcaster,
excluded shows broadcast in the same calendar month as that broadcaster's
episode in the diarisation development data. This was to simulate conditions in
the RT-03s evaluation, where there was a temporal gap between the test audio
and the training shows. This library is denoted CU_EVAL.
The data for both the library and the evaluation shows is first coded at a
frame rate of 100Hz into 39-dimensional feature
vectors consisting of the normalised log-energy and 12 Mel-frequency PLP
cepstral parameters along with their first and second derivatives.
Overlapping windows are generated on the data; 5 seconds long with a 1 second
shift for the ABC, CNN, MNB and NBC shows, and 2.5 seconds long with a
0.5 second shift for the VOA and PRI shows. The difference in these values
reflects the nature of the shows, the radio shows in general having
fewer well-defined commercial breaks, but still including other
repeated material such as station jingles which could be removed automatically.
The windows are then represented by a diagonal correlation matrix.
(It was found that using the correlation matrix instead of the covariance
matrix gave better results due to the retention of the mean information.)
The Arithmetic Harmonic Sphericity (AHS) distance[10] is
then calculated for each evaluation window compared to each library window.
An evaluation window is marked as a repeat if this distance metric falls below
a small threshold.
For a perfect match the distance would be zero, but the granularity of the
windows means there may be an offset of up to half the window shift between
corresponding events in the two audio streams, causing a slight mismatch in the
data, so a small non-zero threshold is required. This threshold is set
conservatively so that no false matches fall below it.
To remove any false positives, and guard against the possibility of a
news-story being repeated on different shows, the evaluation window had
to match at least 2 different library windows to be marked as a repeat.
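A minimal sketch of this repeat detection using NumPy (the window representation and threshold value are assumed, and the broadcaster-specific tuning described below is omitted):

```python
import numpy as np

def ahs_distance(cx, cy):
    """Arithmetic Harmonic Sphericity distance [10] between two d x d correlation
    (or covariance) matrices; it is symmetric and zero for identical matrices."""
    d = cx.shape[0]
    return np.log(np.trace(cx @ np.linalg.inv(cy)) *
                  np.trace(cy @ np.linalg.inv(cx))) - 2.0 * np.log(d)

def is_repeat(eval_window, library_windows, threshold, min_matches=2):
    """Mark an evaluation window as repeated material only if it matches at least
    `min_matches` different library windows below the conservative threshold."""
    matches = sum(ahs_distance(eval_window, lib) < threshold
                  for lib in library_windows)
    return matches >= min_matches
```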
After finding the repeats, smoothing was
carried out between the areas labelled as repeats in order
to identify the commercial breaks. The smoothing relabelled any audio of less
than a certain duration which occurred between two repeats as part of the
adverts
unless this made the overall commercial break exceed a maximum duration.
These values were chosen on a broadcaster-specific basis to reflect the
overall properties of the broadcasts, but in general the maximum permitted
duration was around 3 minutes, and the smoothing for the TV shows was just
over 1 minute, with minimal smoothing for the radio shows.
CNN had less smoothing than the other TV sources due to the frequent occurrence
of 20 to 30s long sports reports between adverts and station
jingles.
Finally the boundaries of the postulated commercial breaks were refined to
take into account the granularity of the initial windowing.
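A sketch of this smoothing step, with illustrative parameter names (the actual values were broadcaster specific, roughly 60-70 s of smoothing for the TV shows and a maximum break length of around 3 minutes):

```python
def smooth_repeats(repeats, max_gap, max_break):
    """Merge detected repeats into commercial breaks: audio shorter than `max_gap`
    seconds lying between two repeats is relabelled as advert, unless absorbing it
    would make the overall break exceed `max_break` seconds.
    `repeats` is a time-sorted list of (start, end) regions marked as repeats."""
    breaks = []
    for start, end in repeats:
        if breaks and start - breaks[-1][1] < max_gap and end - breaks[-1][0] <= max_break:
            breaks[-1] = (breaks[-1][0], end)    # absorb the gap into the current break
        else:
            breaks.append((start, end))          # start a new putative commercial break
    return breaks
```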
Further details and analysis of the effectiveness of this technique can be found in [7]. The CU_TDT4 system removed 18.4% of the audio, consisting of 1783s (86.3% of all the adverts) and 198s (2.28% of all the news), whilst the CU_EVAL system removed 6.75% of the audio, consisting of 582s (28.2% of all the adverts) and 144s (1.66% of all the news) on the diarisation development data. The system removed 8.9% of the evaluation data, consisting of 867s (40.5% of all the adverts) and 70s (0.83% of all the news).
A phone recogniser, which has 45 context-independent phone models per gender
plus a silence/noise model, is then run with a null language model for each
bandwidth separately.
The output of the phone recogniser is a sequence of phones with male,
female or silence tags. The phone identifiers are ignored but the phone
sequences
with the same gender are merged and some heuristic smoothing rules applied
to produce a series of small segments, using the silence tags to help
define the boundary locations.
Finally clustering and merging of similar temporally adjacent segments is performed using the GMM classifier output to restrict the boundary locations, to produce the final segmentation with bandwidth and putative gender labels. The final gender labels are produced by aligning the output of the first-pass of the CUED RT-03 Broadcast News ASR system [11] with gender dependent models. The segments are then assigned to the gender which gives the highest likelihood.
Each segment is represented by a full correlation matrix of the 13-dimensional PLP vectors (without first or second derivatives) and the distance metric used is the Arithmetic Harmonic Sphericity (AHS) [10]. The clustering is performed top-down, splitting nodes until the chosen stopping criteria are met.
The stopping criteria are critical in determining the final clusters. The system allows several different criteria to be used which reflect the aim of the clustering. These include specifying a minimum occupancy for clusters (used in the ASR system where a certain amount of data is necessary for adaptation, but not for diarisation where speakers can talk for arbitrarily short portions of time) or using measures based on the 'cost' as defined by the average distance of the segments from the nodes. For this paper we also implemented a new stopping criterion based on the Bayesian Information Criterion.
Three parameters were used to control the cost-based
stopping criteria. The first was the most important,
and specified the ratio of the
gain in cost function from splitting to the global node cost.
The cost of a node is the sum of the distances from its segments to the node,
and the gain in splitting is the cost of the parent node minus the combined
cost of the child nodes. We call this ratio the 'h-parameter'.
Additionally the 'p-parameter' controlled the ratio of the inter:intra
child cost and the 'j-parameter' provided a multiplicative
component used to weight the scores for the special case of a node
containing only one segment since the distances are zero in this case.
This system gave a diarisation error rate of 33.29% on the development
(bndidev03) data and 32.30% on the evaluation (bneval03) data.
The CUED December 2003 system discussed in this paper uses a simpler
2-way splitting algorithm. Therefore, the 'p-parameter' is set arbitrarily high
and the 'j-parameter' is set to 1 since they are not as important
for the 2-way splitting procedure. The cost-based stopping criterion
is thus controlled solely by varying the 'h-parameter'.
The results from varying the h-parameter on the diarisation development (bndidev03) and evaluation (bneval03) data are illustrated in Figure 2, showing the method generalises reasonably well to the unseen evaluation data.
For a model with log likelihood L on the data, the Bayesian Information Criterion is

BIC = L − P,   with penalty term P = (α/2) · #M · log(N)

where #M is the number of free parameters, N the number of data
points and α the tuning parameter, usually set to 1.
The data is modelled using a full Gaussian of dimension d, N(μ, S), where μ is
the mean vector, S is the covariance matrix and |S| is the determinant of S.
The log likelihood term, L, for the N frames in a cluster is then

L = −(N/2) · log|S| + N · C

where C is a constant, −½ d (1 + log(2π)).
The number of free parameters for K such clusters is:

#M = K · ( d + d(d+1)/2 )
Thus when making a local decision as to whether a cluster Z, containing N_Z frames, should be split into 2 clusters, X and Y, the equations become:

ΔBIC = L_X + L_Y − L_Z − P,   P = (α/2) · ( d + d(d+1)/2 ) · log(N_Z)
and the split goes ahead if ΔBIC > 0 . We call this
formulation BIC-local as the decision about whether to split a
particular cluster is taken locally. Alternatively the whole cluster set can
be viewed as an entity, and the decision then becomes whether the K clusters
should be increased to K+1. In this case the formula for ΔBIC remains the same,
except that the N used in the penalty term, P, becomes the total number of
frames in all the clusters, N_f, rather than the number in the cluster being
split, N_Z. We call this formulation BIC-global.
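A minimal sketch of this ΔBIC split test (function names are illustrative; each cluster is modelled by a single maximum-likelihood full-covariance Gaussian):

```python
import numpy as np

def gauss_loglik(frames):
    """L = -(N/2) log|S| + N*C for N frames under an ML full-covariance Gaussian."""
    n, d = frames.shape
    cov = np.cov(frames, rowvar=False, bias=True)          # maximum-likelihood S
    const = -0.5 * d * (1.0 + np.log(2.0 * np.pi))         # the constant C
    return -0.5 * n * np.log(np.linalg.det(cov)) + n * const

def delta_bic_split(x, y, n_penalty, alpha=1.0):
    """DeltaBIC for splitting cluster Z (= X union Y) into X and Y; the split is
    accepted when the return value is positive. For BIC-local pass n_penalty = N_Z
    (the frames in Z); for BIC-global pass N_f, the total frames in all clusters."""
    d = x.shape[1]
    z = np.vstack([x, y])
    penalty = 0.5 * alpha * (d + d * (d + 1) / 2.0) * np.log(n_penalty)
    return gauss_loglik(x) + gauss_loglik(y) - gauss_loglik(z) - penalty
```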
In general the BIC formula is used in conjunction with agglomerative
clustering, so can be thought of as a decision as to whether to merge
the two clusters X and Y into Z (rather than splitting Z into X and Y).
In this case, the choice of which clusters to merge is usually made such as
to produce the most negative ΔBIC . If this is non-negative
the merge does not go ahead and all clustering is stopped.
The CUED implementation instead uses a divisive clustering scheme which
tries to
split each active node in turn and does not order the decisions. The
segment assignment for a given potential split is made as before using
the full correlation matrix and the AHS distance, but the decision
as to whether to split a node is now taken by testing if the ΔBIC
is >0. For this reason it was felt that the BIC-local formulation may be
more appropriate for this case.
The results from changing the α penalty using both the BIC-global and BIC-local implementations on the development (bndidev03) and evaluation (bneval03) data are illustrated in Figure 3. The BIC-global implementation seems to generalise slightly better, with the performance on the two data sets roughly tracking each other except for one point, and the same value of α producing the best performance in both cases. However the BIC-local implementation, although slightly noisier, does give slightly better performance.
| Stopping Criterion | Optimal Param (bndidev03) | Optimal Param (bneval03) | Diarisation Score (bndidev03) | Diarisation Score (bneval03) |
|---|---|---|---|---|
| RT-03s sys | - | - | 33.29 | 32.30 |
| Cost-based | 0.825 | - | 28.51 | 27.24 |
| Cost-based | - | 0.8 | 28.66 | 27.09 |
| BIC-global | 6.25 | 6.25 | 26.13 | 25.21 |
| BIC-local | 7.25 | - | 25.54 | 25.12 |
| BIC-local | - | 6.75 | 26.47 | 24.27 |
The new 2-way clustering strategy with the introduction of the BIC stopping criteria has reduced the diarisation error by 7-8% absolute compared to the CUED RT-03s evaluation system [7] on both the development and evaluation data.
This system gave a diarisation error of 24.46% on the development and 23.85% on the evaluation data.
ΔBIC(i) = log p(X_1 | λ_1) + log p(X_2 | λ_2) − log p(X | λ) − P

where λ, λ_1 and λ_2 are full covariance Gaussian models trained with X, X_1
and X_2 respectively, X is the data in the current search window, and X_1 and
X_2 are the data before and after candidate change point i. P and α are defined
as in Section 4.3.2. A change point is detected when ΔBIC(i) > 0.
If no change point is found in the
current window, the window length is increased and the
search is repeated. Once a maximum search window
length is reached and no change is found, a change
point is declared and the process is restarted. When a
change point is found, a new search window is begun
one vector after the detected change point.
To help minimise the cost of computing the BIC
statistics at every point, a faster Hotelling's T² test is
first used to identify the potential change point in a
search window[14]. The full BIC statistic is then
computed for the point with the maximum Hotelling's T²
value in the window.
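A small sketch of the two-sample Hotelling's T² statistic used for this pre-selection, assuming the candidate point splits the window's feature vectors into x1 and x2:

```python
import numpy as np

def hotelling_t2(x1, x2):
    """Two-sample Hotelling's T^2 statistic; larger values indicate a bigger mean
    shift between the frames before and after a candidate change point."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    s1 = np.cov(x1, rowvar=False, bias=True)
    s2 = np.cov(x2, rowvar=False, bias=True)
    pooled = (n1 * s1 + n2 * s2) / (n1 + n2 - 2)   # pooled covariance estimate
    diff = m1 - m2
    return (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(pooled, diff)
```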
After the above process is run on the entire audio
sequence, a second-pass BIC test is run on each
detected change point to determine if adjacent segments
should be merged. This second-pass mainly helps in
eliminating very short segments and artificial change
points due to reaching the maximum search window
length.
When advert detection is used (as discussed in
Section 4.1), detected advert regions
are skipped during the change point detection.
Based on experimentation, the following
settings are used for the change point detection algorithm: An
initial search window size of 100 frames, a search
window increment of 50 frames, a maximum search
window size of 1500 frames, and α=1.0.
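Putting these pieces together, a simplified sketch of the growing-window search (exhaustively scoring every candidate with ΔBIC rather than using the T² pre-selection, and omitting the advert skipping, second-pass merge and end-of-stream handling):

```python
import numpy as np

def gauss_loglik(x):
    """-(N/2) log|S| + N*C for N frames under an ML full-covariance Gaussian."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True)
    return -0.5 * n * np.log(np.linalg.det(cov)) - 0.5 * n * d * (1.0 + np.log(2.0 * np.pi))

def delta_bic(window, i, alpha=1.0):
    """DeltaBIC(i) for a change point after frame i of the current search window."""
    n, d = window.shape
    penalty = 0.5 * alpha * (d + d * (d + 1) / 2.0) * np.log(n)
    return (gauss_loglik(window[:i]) + gauss_loglik(window[i:])
            - gauss_loglik(window) - penalty)

def find_change_points(feats, init=100, step=50, max_len=1500, alpha=1.0):
    """Grow the search window until a change is found or max_len is reached."""
    changes, start, length = [], 0, init
    guard = feats.shape[1] + 1   # frames needed each side for a full covariance
    while start + length <= len(feats):
        window = feats[start:start + length]
        cands = range(guard, len(window) - guard)
        scores = [delta_bic(window, i, alpha) for i in cands]
        best = int(np.argmax(scores))
        if scores[best] > 0:
            changes.append(start + cands[best])            # change point detected
            start, length = start + cands[best] + 1, init  # restart just after it
        elif length >= max_len:
            changes.append(start + length)                 # forced change at max window
            start, length = start + length, init
        else:
            length += step                                 # no change: grow the window
    return changes
```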
The segments are then classified as speech or non-speech using a GMM based maximum likelihood classifier. Five 128 mixture diagonal covariance GMMs are built for Speech, Speech+Music, Speech+Other, Music and Other. Any segments labelled as Music or Other are discarded before clustering. Further details can be found in [7].
The distance between clusters is:

d(x, y) = log p(x | λ_x) + log p(y | λ_y) − log p(z | λ_z)

where x and y are the data from two different clusters, z is the union of x
and y, and p(x | λ_x) is the likelihood of data x given the pdf model λ_x
for data x. The pdf
model used is a tied-mixture model where the basis
densities are estimated from the entire set of speech
segments and the weights are estimated for each
segment. Advantages of this model are that the per-frame likelihoods of the
basis densities need only be computed once, and that the weights for merged
clusters can be computed by a simple averaging of counts.
The BIC criterion for this case is:

ΔBIC_TGMM = d(x, y) − (α/2) · M · log(N)

where M is the number of basis densities (and hence the number of free parameters) and N the total number of feature vectors. The clustering is stopped when ΔBIC_TGMM > 0. Again the penalty weight, α, was set to 1.0, whilst M was 128.
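A rough sketch of this tied-mixture distance and stopping test, assuming per-frame basis-density likelihoods `comp_lik` have been precomputed once for all speech frames, and approximating the weight estimation by a single posterior-count pass rather than full EM:

```python
import numpy as np

def cluster_counts(comp_lik, frames):
    """Soft occupation counts over the M basis densities for a cluster's frames.
    comp_lik[f, m] holds the likelihood of frame f under basis density m."""
    post = comp_lik[frames]
    post = post / post.sum(axis=1, keepdims=True)   # per-frame posteriors (flat priors)
    return post.sum(axis=0)

def cluster_loglik(comp_lik, frames, counts):
    """log p(frames | tied-mixture model with weights proportional to the counts)."""
    weights = counts / counts.sum()
    return float(np.sum(np.log(comp_lik[frames] @ weights)))

def tied_gmm_distance(comp_lik, frames_x, frames_y):
    """d(x, y) as above; merged-cluster weights come from simply summing the counts."""
    cx, cy = cluster_counts(comp_lik, frames_x), cluster_counts(comp_lik, frames_y)
    frames_z = np.concatenate([frames_x, frames_y])
    return (cluster_loglik(comp_lik, frames_x, cx)
            + cluster_loglik(comp_lik, frames_y, cy)
            - cluster_loglik(comp_lik, frames_z, cx + cy))

def stop_clustering(dist, n_total, n_basis=128, alpha=1.0):
    """Stop merging when DeltaBIC_TGMM = d(x, y) - (alpha/2) * M * log(N) > 0."""
    return dist - 0.5 * alpha * n_basis * np.log(n_total) > 0
```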
The hybrid 'Plug-and-Play' system divides diarisation into three stages, allowing the CUED and MIT-LL components to be interchanged at each stage:

1. Advert Removal
The output from this stage consisted of a list of portions of audio for each show which were left after the advert removal stage.
2. Segmentation
The output from this stage was a list of segments with bandwidth
and gender labels.
3. Clustering
The output from this stage was the final set of segments grouped into speaker-labelled clusters.
The results from running all combinations of this hybrid 'Plug-and-Play' system on the diarisation development (bndidev03) data are given in Table 2 and illustrated in Figure 5.
Table 2: Results from the 'Plug-and-Play' hybrid diarisation system on the development (bndidev03) data. The GE, MS, FA and DIA columns are scored with adverts excluded from scoring; the final two columns score retained adverts as false alarms (MS = missed speech, FA = false alarm speech, DIA = overall diarisation score).

| ADV | SEG | CLU | GE | MS | FA | DIA | FA (adv scored) | DIA (adv scored) |
|---|---|---|---|---|---|---|---|---|
| NONE | CU | CU | 1.9 | 0.2 | 9.1 | 25.54 | 29.7 | 46.14 |
| NONE | CU | MIT | 2.1 | 2.5 | 5.3 | 24.23 | 24.7 | 43.60 |
| NONE | CU | PER | 0.4 | 0.2 | 9.1 | 11.60 | 29.7 | 32.20 |
| NONE | MIT | CU | 2.5 | 0.4 | 9.3 | 27.67 | 31.5 | 49.91 |
| NONE | MIT | MIT | 2.2 | 2.7 | 5.6 | 24.46 | 26.8 | 45.68 |
| NONE | MIT | PER | 0.6 | 0.4 | 9.3 | 11.67 | 31.5 | 33.91 |
| CU-EVAL | CU | CU | 2.0 | 0.6 | 9.1 | 25.89 | 23.5 | 40.34 |
| CU-EVAL | CU | MIT | 1.7 | 2.9 | 5.3 | 24.92 | 18.7 | 38.38 |
| CU-EVAL | CU | PER | 0.5 | 0.6 | 9.1 | 11.65 | 23.5 | 26.11 |
| CU-EVAL | MIT | CU | 2.5 | 1.4 | 9.2 | 26.87 | 25.1 | 42.81 |
| CU-EVAL | MIT | MIT | 2.3 | 3.7 | 5.6 | 25.93 | 20.6 | 40.96 |
| CU-EVAL | MIT | PER | 0.6 | 1.4 | 9.2 | 12.54 | 25.1 | 28.49 |
| CU-TDT4 | CU | CU | 2.3 | 1.0 | 8.9 | 27.03 | 12.6 | 30.80 |
| CU-TDT4 | CU | MIT | 1.8 | 3.4 | 5.1 | 26.67 | 8.2 | 29.80 |
| CU-TDT4 | CU | PER | 0.8 | 1.0 | 8.9 | 12.69 | 12.6 | 16.46 |
| CU-TDT4 | MIT | CU | 2.4 | 1.8 | 8.9 | 28.37 | 12.7 | 32.18 |
| CU-TDT4 | MIT | MIT | 1.7 | 4.1 | 5.3 | 25.02 | 8.5 | 28.26 |
| CU-TDT4 | MIT | PER | 0.6 | 1.8 | 8.9 | 12.67 | 12.7 | 16.48 |
| PERF | CU | CU | 2.0 | 0.3 | 9.0 | 25.03 | 10.0 | 26.06 |
| PERF | CU | MIT | 2.4 | 2.7 | 5.2 | 27.18 | 5.8 | 27.73 |
| PERF | CU | PER | 0.6 | 0.3 | 9.0 | 11.93 | 10.0 | 12.96 |
| PERF | MIT | CU | 2.5 | 0.9 | 9.3 | 26.12 | 10.3 | 27.12 |
| PERF | MIT | MIT | 2.2 | 3.2 | 5.6 | 25.78 | 6.1 | 26.30 |
| PERF | MIT | PER | 0.6 | 0.9 | 9.3 | 12.12 | 10.3 | 13.12 |
| PERF | PER | CU | 0.0 | 0.0 | 0.0 | 18.71 | 0.0 | 18.73 |
| PERF | PER | MIT | 2.3 | 2.5 | 0.0 | 17.55 | 0.0 | 17.57 |
| PERF | PER | PER | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 0.00 |
Figure 5: Results from the 'Plug-and-Play' hybrid
diarisation system on the
development (bndidev03) data. The inset table shows the ordering
of the bars in each group of six.
The MIT-LL RT-03s diarisation system was also described, and a new hybrid
'Plug and Play' system was developed to allow the benefits of both
the CUED and MIT-LL systems to be exploited in a single system. Analysis
showed that on average the best performance came from using the CUED
advert detection (when adverts were not excluded from scoring)
and segmentation stages, whereas the
MIT-LL clustering generally performed best. The lowest diarisation error rate,
whether adverts were excluded from scoring or not, came from a hybrid system,
outperforming the individual systems from either site.
Future work will look at removing the 'Plug-and-Play' method's restriction on the diarisation systems having a common architecture, by combining the outputs from different diarisation systems directly using a cluster-voting scheme.[17] This could potentially allow information from many different systems (including those that do segmentation and clustering in a single stage) to be integrated to try to improve diarisation performance further.
Many Cambridge University Engineering Department publications are available
from
http://mi.eng.cam.ac.uk/reports
and those associated with the DARPA Effective, Affordable Reusable Speech-to-text (EARS) programme can be found via
http://mi.eng.cam.ac.uk/research/projects/EARS/references.html