Some of the first-of-their-kind large datasets that I helped create and for which I led the research

Data

Acoustic Echo Cancellation Dataset

The ICASSP 2021 Acoustic Echo Cancellation Challenge was an effort conducted at Microsoft Corporation. I was the lead researcher and devised the entire pipeline for data collection, pre-processing, annotation, and customer service. This is the first large-scale dataset created to stimulate research in speech enhancement, particularly in acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems.

Data

The MSP-Podcast Corpus

This is a database created by the Multimodal Signal Processing Laboratory at The University of Texas at Dallas. The principal investigator is Prof. Carlos Busso. We are building the largest naturalistic emotional speech dataset in the community. The MSP-Podcast corpus contains speech segments from podcast recordings, which are perceptually annotated using crowdsourcing. The collection of this corpus is an ongoing process. Version 1.8 of the corpus has 73,042 speaking turns (113 hours).