As speech technology has matured, attention has shifted toward systems that can process conversational speech, reflecting the so-called “cocktail party problem.” This setting involves not only more challenging acoustic conditions, but also new problems, such as identifying who spoke when and processing multiple concurrent streams of speech. These problems have been approached primarily via corpora of business meetings and dinner parties, overlooking the broad range of conversational dynamics and speaker demographics that fall under the umbrella of multi-talker speech. To this end, we introduce the use of the Santa Barbara Corpus of Spoken American English (SBCSAE) for evaluating speech technology: we prepare the corpus and its annotations for automatic processing, demonstrate that state-of-the-art systems fail to withstand its heterogeneity of conditions, and highlight the situations where standard methods struggle to perform at all.
The Santa Barbara corpus serves as a diagnostic test set for asking the ultimate question of conversational speech technology: if it were possible to drop a microphone in a random place where people are talking, could a system process this speech?
SBCSAE not only spans a variety of conversational dynamics, but also includes non-social multi-talker interactions absent from many existing datasets, such as lectures and town hall meetings.
Having been collected with portable audio recorders, SBCSAE's acoustic conditions vary extensively, including large echoing halls, the outdoors, a noisy restaurant, and even the inside of a car.
With participants of varying ages, genders, and regional backgrounds, SBCSAE represents a breadth of accents, speaking styles, and vocal qualities.
Detailed transcriptions and additional metadata allow for precise analysis of which scenarios challenge speech processing systems the most.
Setup
We released the dataset as a Lhotse recipe and as a HuggingFace dataset. Alongside these, we also released the ground-truth annotations: RTTM files for diarization and STM files for ASR.
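RTTM is the standard NIST format for diarization references, with one space-separated `SPEAKER` record per segment. As a minimal sketch (not the release tooling; the example segment values are illustrative, not taken from the corpus), such files can be read like this:

```python
from typing import Iterable, List, NamedTuple

class Segment(NamedTuple):
    recording: str
    speaker: str
    start: float
    end: float

def parse_rttm(lines: Iterable[str]) -> List[Segment]:
    """Collect SPEAKER entries from the lines of an RTTM file.

    RTTM fields are space-separated:
    type file channel onset duration <NA> <NA> speaker <NA> <NA>
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker record types
        onset, duration = float(fields[3]), float(fields[4])
        segments.append(Segment(fields[1], fields[7], onset, onset + duration))
    return segments

# Illustrative RTTM line (not real corpus content):
segs = parse_rttm(["SPEAKER SBC001 1 0.44 3.10 <NA> <NA> SPK_A <NA> <NA>"])
```

In practice you would pass `open("SBC001.rttm")` directly, since a file handle iterates over its lines.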
Which Alignments to Use?
As described in our paper, the original segments sometimes lack tight boundaries, containing excessive silence padding, which is unsuitable for speaker diarization and may harm ASR performance.
We produced two sets of alignments:
Transcript
Note: The statistics are calculated from the diarization alignments. The time format used in the table is mm:ss. Numeric statistics in the per-corpus table represent mean ± standard deviation.
Field | Description |
---|---|
Duration | Length of the recording |
Speaking Time | Amount of time where at least one speaker is speaking |
Amount of Speech | Total speaking time with overlapped regions counted once per active speaker (e.g., if two speakers speak simultaneously for 5 seconds, that segment contributes 10 s to the total)
Overlap by Time (%) | Percentage of time where two or more speakers are speaking at the same time |
Overlap by Speech (%) | Amount of overlapped speech (overlaps count multiple times) divided by Amount of Speech |
Laughter (%) | Amount of laughter divided by Amount of Speech (overlaps count multiple times) |
Speaker Entropy | Entropy of the distribution of speech among speakers, normalized by the maximum possible entropy for the recording (the binary logarithm of the number of speakers)
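To make these definitions concrete, here is a small sketch (not the scripts used for the paper) that computes the statistics from a list of `(speaker, start, end)` segments:

```python
import math
from collections import defaultdict

def union_length(intervals):
    """Total time covered by at least one (start, end) interval."""
    total, prev_end = 0.0, float("-inf")
    for start, end in sorted(intervals):
        if start > prev_end:
            total += end - start
            prev_end = end
        elif end > prev_end:
            total += end - prev_end
            prev_end = end
    return total

def overlap_length(intervals):
    """Total time where two or more intervals are active at once."""
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    active, overlap, prev_t = 0, 0.0, 0.0
    for t, delta in events:
        if active >= 2:
            overlap += t - prev_t
        active += delta
        prev_t = t
    return overlap

def recording_stats(segments):
    """Compute the table's statistics from (speaker, start, end) segments."""
    per_spk = defaultdict(float)
    for spk, s, e in segments:
        per_spk[spk] += e - s
    intervals = [(s, e) for _, s, e in segments]
    speaking_time = union_length(intervals)    # overlaps counted once
    amount_of_speech = sum(per_spk.values())   # overlaps counted per speaker
    overlap_time = overlap_length(intervals)
    # Overlapped speech (counted per speaker) = total speech minus the
    # time where exactly one speaker is active.
    overlapped_speech = amount_of_speech - (speaking_time - overlap_time)
    shares = [t / amount_of_speech for t in per_spk.values()]
    entropy = -sum(p * math.log2(p) for p in shares if p > 0)
    max_entropy = math.log2(len(per_spk)) if len(per_spk) > 1 else 1.0
    return {
        "speaking_time": speaking_time,
        "amount_of_speech": amount_of_speech,
        "overlap_by_time": overlap_time / speaking_time,
        "overlap_by_speech": overlapped_speech / amount_of_speech,
        "speaker_entropy": entropy / max_entropy,
    }

# Two speakers, each speaking 10 s, overlapping for 5 s:
stats = recording_stats([("A", 0.0, 10.0), ("B", 5.0, 15.0)])
```

For this toy input, speaking time is 15 s, amount of speech is 20 s, overlap by time is 1/3, overlap by speech is 1/2, and speaker entropy is 1 (the two speakers contribute equally).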
Example recordings:

- SBC001: An "easy" two-person dialogue.
- SBC011: A conversation between three older women, challenging systems to handle elderly speech.
- SBC012: A classroom lecture, with a dominant speaker and many participants who contribute minimally.
- SBC013: A family birthday party, demonstrating heavy overlap among participants comfortable with one another.
- SBC024: A young couple playing a computer game, showing how radically task-oriented speech can differ from social interaction.
- SBC054: A storytelling session, which includes "chorus" speech as the audience speaks in unison.
- SBC057: A judo class, featuring challenging acoustic conditions, accented speech, and technical (non-English) terminology.
The playacted meetings of AMI are more orderly and easier to follow than this veterinary office interaction from SBC018, where the participants' familiarity and shared knowledge lead to disjointed conversations that omit context, include technical terminology, and are generally difficult to understand.
Similarly, although the CHiME parties are spontaneous and can be quite challenging, they are somewhat restrained, with a limited number of participants who at times show their awareness of the recording activity they have been asked to take part in. In contrast, the party recordings of SBCSAE, having occurred more naturally, can be considerably more lively.
@inproceedings{maciejewski24_interspeech,
  title={Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language},
  author={Matthew Maciejewski and Dominik Klement and Ruizhe Huang and Matthew Wiesner and Sanjeev Khudanpur},
  year={2024},
  booktitle={Interspeech 2024},
  pages={2155--2159},
  doi={10.21437/Interspeech.2024-2119},
  issn={2958-1796},
}
@misc{dubois_2005,
  author={Du Bois, John W. and Chafe, Wallace L. and Meyer, Charles and Thompson, Sandra A. and Englebretson, Robert and Martey, Nii},
  year={2000--2005},
  title={{S}anta {B}arbara corpus of spoken {A}merican {E}nglish, {P}arts 1--4},
  address={Philadelphia},
  organization={Linguistic Data Consortium},
}