Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language

Matthew Maciejewski1*, Dominik Klement2,3*, Ruizhe Huang2, Matthew Wiesner1, Sanjeev Khudanpur1,2

1HLTCOE and 2CLSP, Johns Hopkins University, USA
3Speech@FIT, Brno University of Technology, Czechia
* denotes equal contribution

Abstract

As speech technology has matured, there has been a push towards systems that can process conversational speech, reflecting the so-called “cocktail party problem,” which includes not only more challenging acoustic conditions, but also necessitates solutions to new problems, such as identifying who spoke when and processing multiple concurrent streams of speech. Such problems have been approached primarily via corpora comprising business meetings and dinner parties, overlooking the broad range of conversational dynamics and speaker demographics that fall under the category of multi-talker speech. To this end, we introduce the use of the Santa Barbara Corpus of Spoken American English for evaluation of speech technology—including preparing the corpus and annotations for automatic processing, demonstrating the failure of state-of-the-art systems to withstand the heterogeneity of conditions, and highlighting the situations where standard methods struggle to perform at all.

Why evaluate on SBCSAE?

The Santa Barbara corpus serves as a diagnostic test set for asking the ultimate question of conversational speech technology: if it were possible to drop a microphone in a random place where people are talking, could a system process this speech?

Diverse Spoken Interactions

SBCSAE not only spans a variety of conversational dynamics, but also includes non-social multi-talker interactions, such as lectures or town hall meetings, that many existing datasets lack.

Challenging Acoustic Conditions

Having been collected with portable audio recorders, SBCSAE's acoustic conditions vary extensively, including large echoing halls, the outdoors, a noisy restaurant, and even the inside of a car.

Broad Speaker Demographics

With participants of varying ages, genders, and regional backgrounds, SBCSAE represents a breadth of accents, speaking styles, and vocal qualities.

Rich Contextual Information

Detailed transcriptions and additional metadata allow for precise analysis of which scenarios challenge speech processing systems the most.

Details

Setup

We have released the dataset as both a Lhotse recipe and a HuggingFace dataset. Alongside these, we also released the ground-truth annotations, consisting of RTTM files for diarization and STM files for ASR.
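The RTTM files follow the standard NIST layout, where each `SPEAKER` line carries the recording ID, channel, onset, duration, and speaker label. As a minimal sketch, they can be read without any external dependencies (the speaker names below are illustrative, not taken from the actual annotations):

```python
from collections import namedtuple

Segment = namedtuple("Segment", "recording speaker start end")

def parse_rttm(lines):
    """Parse SPEAKER lines of an RTTM file into (recording, speaker, start, end) segments."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip comments and non-speaker lines
        recording, onset, duration, speaker = fields[1], float(fields[3]), float(fields[4]), fields[7]
        segments.append(Segment(recording, speaker, onset, onset + duration))
    return segments

# Illustrative lines in NIST RTTM format:
# SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
rttm = [
    "SPEAKER SBC001 1 0.50 2.25 <NA> <NA> SPK_A <NA> <NA>",
    "SPEAKER SBC001 1 2.40 1.10 <NA> <NA> SPK_B <NA> <NA>",
]
segs = parse_rttm(rttm)
# segs[0] → Segment(recording='SBC001', speaker='SPK_A', start=0.5, end=2.75)
```

In practice the Lhotse recipe and HuggingFace loader handle this parsing internally; the sketch is only meant to make the annotation format concrete.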

What Alignments to Use?

As described in our paper, the original segments sometimes lack tight boundaries, containing excessive silence padding, which is unsuitable for speaker diarization and may harm ASR performance.
We produced two sets of alignments:

  • For ASR: we realigned only the segments consisting mostly of silence, aiming for looser boundaries so that the spoken words are always fully contained within the new segmentation.
  • For diarization: we produced more precise automatic alignments, intended to be as close to the true speech activity labels as possible, with more aggressive silence removal at the cost of likely removing some speech as well.
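The diarization-style tightening can be illustrated with a minimal sketch: given frame-level speech activity over the recording (here a hypothetical VAD output at 10 ms resolution, not our actual alignment procedure), each original segment is trimmed to its first and last active frame.

```python
def tighten(segment, vad, frame_dur=0.01):
    """Trim a (start, end) segment to the first/last VAD-active frame inside it.

    `vad` is a list of booleans at `frame_dur` resolution over the whole recording.
    Returns None if the segment contains no detected speech at all.
    """
    start, end = segment
    first = int(start / frame_dur)
    last = int(end / frame_dur)
    active = [i for i in range(first, min(last, len(vad))) if vad[i]]
    if not active:
        return None
    return (active[0] * frame_dur, (active[-1] + 1) * frame_dur)

# Toy VAD: speech only between 0.5 s and 1.0 s of a 2 s recording.
vad = [False] * 50 + [True] * 50 + [False] * 100
tightened = tighten((0.0, 2.0), vad)  # ≈ (0.5, 1.0), up to float rounding
```

A real pipeline would use forced alignment or a trained VAD rather than boolean frames, but the trade-off is the same: tighter boundaries remove silence padding at the risk of clipping quiet speech.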

Transcript

  • The original authors of the SBCSAE corpus produced very detailed transcriptions, containing not only filler words like "uh" and "mhm" but also hesitations marked by a single dash "-".
  • The transcripts also contain special tokens: <L2 {text} L2> denoting non-English text, <LAUGH> and <YELL> denoting laughter and yelling (screaming) respectively, and <UNK> denoting unintelligible speech.
  • To anonymize spoken personal or business names, the authors applied a low-pass filter and replaced the names with fake ones, which are denoted by a tilde "~" at the beginning.
  • We automatically detected the anonymized segments (detector code) and manually checked whether they were anonymized in the transcript as well; if not, we added a dollar sign "$" at the beginning of such words. The anonymization process also sometimes rendered surrounding words unintelligible; we marked such words with a hash sign "#".
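Before scoring ASR output against these transcripts, the markup typically needs to be normalized away. The following is a minimal sketch of one plausible normalization (the exact choices, e.g. whether to keep non-English words, are up to the evaluation protocol, not prescribed by the corpus):

```python
import re

def normalize_sbcsae(text):
    """Strip SBCSAE markup for scoring: keep <L2 ...> content, drop event tags,
    remove anonymization markers (~, $), garbled-word markers (#), and hesitation dashes."""
    text = re.sub(r"<L2\s+(.*?)\s+L2>", r"\1", text)  # keep the non-English words themselves
    text = re.sub(r"<(LAUGH|YELL|UNK)>", "", text)    # drop event tokens
    text = re.sub(r"(?<!\S)[~$#](?=\w)", "", text)    # strip word-initial markers
    text = re.sub(r"(?<=\w)-(?!\w)", "", text)        # drop hesitation dashes (keeps hyphenated compounds)
    return " ".join(text.split())

normalize_sbcsae("so ~Alice said <LAUGH> we- we saw <L2 la mer L2> <UNK>")
# → "so Alice said we we saw la mer"
```

Note that the hesitation-dash rule only fires when the dash is not followed by a word character, so ordinary hyphenated compounds like "well-known" are left intact.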

Statistics

Note: The statistics are calculated from the diarization alignments. Times in the tables are given in mm:ss. Numeric statistics in the per-corpus table are reported as mean ± standard deviation.

  • Duration: Length of the recording
  • Speaking Time: Amount of time during which at least one speaker is speaking
  • Amount of Speech: Total speaking time with overlaps counted once per active speaker (i.e., if two speakers speak simultaneously for 5 seconds, that segment contributes 10 seconds to the total)
  • Overlap by Time (%): Percentage of time during which two or more speakers are speaking simultaneously
  • Overlap by Speech (%): Amount of overlapped speech (with overlaps counted multiple times) divided by Amount of Speech
  • Laughter (%): Amount of laughter divided by Amount of Speech (with overlaps counted multiple times)
  • Speaker Entropy: Entropy of the distribution of speech among speakers, normalized by the maximum entropy (the binary logarithm of the number of speakers) in the given recording
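The overlap and entropy statistics above can be computed from speaker segments with a boundary-event sweep. The following is a self-contained sketch using toy segments (not actual corpus values):

```python
import math

def overlap_stats(segments):
    """Compute the table's statistics from (speaker, start, end) segments in seconds.

    Sweeps over segment boundaries, tracking how many speakers are active at once.
    """
    events = []
    for _spk, start, end in segments:
        events.append((start, +1))
        events.append((end, -1))
    events.sort()  # ties put -1 before +1, so abutting segments don't count as overlap

    speaking_time = amount_of_speech = overlap_time = overlap_speech = 0.0
    active, prev = 0, None
    for t, delta in events:
        if prev is not None and active > 0:
            dt = t - prev
            speaking_time += dt            # at least one speaker active
            amount_of_speech += active * dt  # overlaps counted once per speaker
            if active >= 2:
                overlap_time += dt
                overlap_speech += active * dt
        active += delta
        prev = t

    # Speaker entropy: entropy of per-speaker speech shares, normalized by log2(#speakers).
    per_spk = {}
    for spk, start, end in segments:
        per_spk[spk] = per_spk.get(spk, 0.0) + (end - start)
    total = sum(per_spk.values())
    probs = [d / total for d in per_spk.values()]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    norm_entropy = entropy / math.log2(len(per_spk)) if len(per_spk) > 1 else 0.0

    return {
        "speaking_time": speaking_time,
        "amount_of_speech": amount_of_speech,
        "overlap_by_time": 100 * overlap_time / speaking_time,
        "overlap_by_speech": 100 * overlap_speech / amount_of_speech,
        "speaker_entropy": norm_entropy,
    }

# Toy example: A speaks 0–10 s, B speaks 5–10 s (5 s of two-speaker overlap).
stats = overlap_stats([("A", 0.0, 10.0), ("B", 5.0, 10.0)])
# speaking_time = 10 s, amount_of_speech = 15 s, overlap_by_time = 50 %
```

The sketch omits edge cases (e.g. empty input, a recording with no speech) and does not compute the laughter percentage, which additionally requires the <LAUGH> annotations from the transcripts.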

SBCSAE Recordings

  • SBC001: An "easy", two-person dialogue.
  • SBC011: A conversation between three older women, challenging systems to handle speech of the elderly.
  • SBC012: A classroom lecture, with a dominant speaker and many participants who contribute minimally.
  • SBC013: A family birthday party, demonstrating heavy amounts of overlap among comfortable participants.
  • SBC024: A young couple playing a computer game, showing how radically different task-oriented speech can be from social interactions.
  • SBC054: A storytelling session, which includes "chorus" speech as the audience speaks in unison.
  • SBC057: A judo class, displaying challenging acoustic conditions, accented speech, and technical (non-English) terminology.

Comparisons

AMI
SBC018

The playacted meetings from AMI are more orderly and easier to follow than this veterinary office interaction from SBC018, where the participants' familiarity and shared knowledge lead to somewhat disjointed conversations that lack context for the listener, include technical terminology, and are generally difficult to understand.

CHiME-6
SBC033

Similarly, although the CHiME parties are spontaneous and can be quite challenging, they are somewhat restrained, with a limited number of participants who at times show their awareness of the recording activity they have been asked to participate in. In contrast, the party recordings of SBCSAE can be more lively, having occurred more naturally.

BibTeX

@inproceedings{maciejewski24_interspeech,
    title={Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language},
    author={Matthew Maciejewski and Dominik Klement and Ruizhe Huang and Matthew Wiesner and Sanjeev Khudanpur},
    year={2024},
    booktitle={Interspeech 2024},
    pages={2155--2159},
    doi={10.21437/Interspeech.2024-2119},
    issn={2958-1796},
}
@misc{dubois_2005,
    author={Du Bois, John W. and Chafe, Wallace L. and Meyer, Charles and Thompson, Sandra A. and Englebretson, Robert and Martey, Nii},
    year={2000--2005},
    title={{S}anta {B}arbara corpus of spoken {A}merican {E}nglish, {P}arts 1--4},
    address={Philadelphia},
    organization={Linguistic Data Consortium},
}