As speech technology has matured, attention has shifted toward systems that can process conversational speech, reflecting the so-called “cocktail party problem.” This setting involves not only more challenging acoustic conditions, but also new problems, such as identifying who spoke when and processing multiple concurrent streams of speech. These problems have been approached primarily via corpora of business meetings and dinner parties, overlooking the broad range of conversational dynamics and speaker demographics that fall under the umbrella of multi-talker speech. To this end, we introduce the use of the Santa Barbara Corpus of Spoken American English (SBCSAE) for evaluating speech technology: we prepare the corpus and its annotations for automatic processing, demonstrate that state-of-the-art systems fail to withstand its heterogeneity of conditions, and highlight the situations where standard methods struggle to perform at all.
The Santa Barbara corpus serves as a diagnostic test set for asking the ultimate question of conversational speech technology: if it were possible to drop a microphone in a random place where people are talking, could a system process this speech?
SBCSAE not only spans a variety of conversational dynamics, but also includes non-social multi-talker interactions absent from many existing datasets, such as lectures and town hall meetings.
Having been collected with portable audio recorders, SBCSAE's acoustic conditions vary extensively, including large echoing halls, the outdoors, a noisy restaurant, and even the inside of a car.
With participants of varying ages, genders, and regional backgrounds, SBCSAE represents a breadth of accents, speaking styles, and vocal qualities.
Detailed transcriptions and additional metadata allow for precise analysis of which scenarios challenge speech processing systems the most.
Setup
We released the dataset as a Lhotse recipe and as a HuggingFace dataset. Alongside these, we also released the ground-truth annotations: RTTM files for diarization and STM files for ASR.
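RTTM is the standard NIST format for diarization references, with one space-separated `SPEAKER` record per segment. As a minimal sketch (not the release tooling; the example segment values are illustrative, not taken from the corpus), such files can be read like this:

```python
from typing import Iterable, List, NamedTuple

class Segment(NamedTuple):
    recording: str
    speaker: str
    start: float
    end: float

def parse_rttm(lines: Iterable[str]) -> List[Segment]:
    """Collect SPEAKER entries from the lines of an RTTM file.

    RTTM fields are space-separated:
    type file channel onset duration <NA> <NA> speaker <NA> <NA>
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker record types
        onset, duration = float(fields[3]), float(fields[4])
        segments.append(Segment(fields[1], fields[7], onset, onset + duration))
    return segments

# Illustrative RTTM line (not real corpus content):
segs = parse_rttm(["SPEAKER SBC001 1 0.44 3.10 <NA> <NA> SPK_A <NA> <NA>"])
```

In practice you would pass `open("SBC001.rttm")` directly, since a file handle iterates over its lines.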
Which Alignments to Use?
As described in our paper, the original segments sometimes lack tight boundaries, containing excessive silence padding, which is unsuitable for speaker diarization and may harm ASR performance.
We produced two sets of alignments:
Transcript
Note: The statistics are calculated from the diarization alignments. The time format used in the table is mm:ss. Numeric statistics in the per-corpus table represent mean ± standard deviation.
Field | Description |
---|---|
Duration | Length of the recording |
Speaking Time | Amount of time where at least one speaker is speaking |
Amount of Speech | Total speaking time with overlapped regions counted once per active speaker (e.g., if two speakers speak simultaneously for 5 seconds, that segment contributes 10 s to the total)
Overlap by Time (%) | Percentage of time where two or more speakers are speaking at the same time |
Overlap by Speech (%) | Amount of overlapped speech (overlaps count multiple times) divided by Amount of Speech |
Laughter (%) | Amount of laughter divided by Amount of Speech (overlaps count multiple times) |
Speaker Entropy | Entropy of the distribution of speech among speakers, normalized by the maximum possible entropy for the recording (the binary logarithm of the number of speakers)
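To make these definitions concrete, here is a small sketch (not the scripts used for the paper) that computes the statistics from a list of `(speaker, start, end)` segments:

```python
import math
from collections import defaultdict

def union_length(intervals):
    """Total time covered by at least one (start, end) interval."""
    total, prev_end = 0.0, float("-inf")
    for start, end in sorted(intervals):
        if start > prev_end:
            total += end - start
            prev_end = end
        elif end > prev_end:
            total += end - prev_end
            prev_end = end
    return total

def overlap_length(intervals):
    """Total time where two or more intervals are active at once."""
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    active, overlap, prev_t = 0, 0.0, 0.0
    for t, delta in events:
        if active >= 2:
            overlap += t - prev_t
        active += delta
        prev_t = t
    return overlap

def recording_stats(segments):
    """Compute the table's statistics from (speaker, start, end) segments."""
    per_spk = defaultdict(float)
    for spk, s, e in segments:
        per_spk[spk] += e - s
    intervals = [(s, e) for _, s, e in segments]
    speaking_time = union_length(intervals)    # overlaps counted once
    amount_of_speech = sum(per_spk.values())   # overlaps counted per speaker
    overlap_time = overlap_length(intervals)
    # Overlapped speech (counted per speaker) = total speech minus the
    # time where exactly one speaker is active.
    overlapped_speech = amount_of_speech - (speaking_time - overlap_time)
    shares = [t / amount_of_speech for t in per_spk.values()]
    entropy = -sum(p * math.log2(p) for p in shares if p > 0)
    max_entropy = math.log2(len(per_spk)) if len(per_spk) > 1 else 1.0
    return {
        "speaking_time": speaking_time,
        "amount_of_speech": amount_of_speech,
        "overlap_by_time": overlap_time / speaking_time,
        "overlap_by_speech": overlapped_speech / amount_of_speech,
        "speaker_entropy": entropy / max_entropy,
    }

# Two speakers, each speaking 10 s, overlapping for 5 s:
stats = recording_stats([("A", 0.0, 10.0), ("B", 5.0, 15.0)])
```

For this toy input, speaking time is 15 s, amount of speech is 20 s, overlap by time is 1/3, overlap by speech is 1/2, and speaker entropy is 1 (the two speakers contribute equally).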
Example recordings:

- SBC001: An "easy" two-person dialogue.
- SBC011: A conversation between three older women, challenging systems to handle elderly speech.
- SBC012: A classroom lecture, with a dominant speaker and many participants who contribute minimally.
- SBC013: A family birthday party, demonstrating heavy overlap among participants comfortable with one another.
- SBC024: A young couple playing a computer game, showing how radically task-oriented speech can differ from social interaction.
- SBC054: A storytelling session, which includes "chorus" speech as the audience speaks in unison.
- SBC057: A judo class, featuring challenging acoustic conditions, accented speech, and technical (non-English) terminology.
The playacted meetings of AMI are more orderly and easier to follow than this veterinary office interaction from SBC018, where the participants' familiarity and shared knowledge lead to disjointed conversations that omit context, include technical terminology, and are generally difficult to understand.
Similarly, although the CHiME parties are spontaneous and can be quite challenging, they are somewhat restrained, with a limited number of participants who at times show their awareness of the recording activity they have been asked to take part in. In contrast, the party recordings of SBCSAE, having occurred more naturally, can be considerably more lively.
@inproceedings{maciejewski24_interspeech,
  title={Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken Language},
  author={Matthew Maciejewski and Dominik Klement and Ruizhe Huang and Matthew Wiesner and Sanjeev Khudanpur},
  year={2024},
  booktitle={Interspeech 2024},
  pages={2155--2159},
  doi={10.21437/Interspeech.2024-2119},
  issn={2958-1796},
}
@misc{dubois_2005,
  author={Du Bois, John W. and Chafe, Wallace L. and Meyer, Charles and Thompson, Sandra A. and Englebretson, Robert and Martey, Nii},
  year={2000--2005},
  title={{S}anta {B}arbara corpus of spoken {A}merican {E}nglish, {P}arts 1--4},
  address={Philadelphia},
  organization={Linguistic Data Consortium},
}