Computational speech group

08.10.2017 -

Research group

The main topic of our group is speaker recognition: the task of linking a speech sample to an identity (who is speaking?). We work on improving both the robustness of such systems (better accuracy under varied channels, noise and other disturbances) and their security (detection of presentation attacks, also known as spoofing attacks). Our research group not only takes part regularly in technology evaluations in our field but also contributes to open data science, both as a co-organizer of public evaluation benchmarks (including ASVspoof and the Voice Conversion Challenge) and through the collection of other data, such as the AVOID corpus.

One of the most significant areas of research in speech technology is speaker recognition, i.e. the recognition of persons based on their voice. We carry out basic methodology research (e.g. feature extraction, segmentation, classification and data fusion algorithms) and apply the developed techniques as parts of state-of-the-art speaker and language recognition systems that rely heavily on statistical machine learning from "big data". The research aims, for instance, at improving recognition in noisy environments and improving the security of voice biometric systems against voice spoofing attacks. We regularly take part in NIST technology evaluations and have also helped establish other benchmarking events (ASVspoof). The research group has received funding from the Academy of Finland and the H2020 programme.

Research group description

One of the most significant areas of research in speech technology is speaker recognition, i.e. the recognition of persons based on their voice. We carry out basic methodology research (e.g. feature extraction, segmentation, classification and data fusion algorithms) and apply the developed techniques as parts of state-of-the-art speaker and language recognition systems that rely heavily on statistical machine learning from big data. One of the central research questions is how to recognize a person robustly from highly uncertain speech data, which often originates from a low-quality telephone call made in an unknown environment and mixed with unknown background noise or reverberation. In addition, speakers may deliberately modify their own voice. These inherent data-related challenges call for intelligently combining digital signal processing models of human speech production (domain knowledge) with modern machine learning techniques that infer speaker- and language-specific representations from thousands of hours of unannotated speech material.

Another recent research direction, the recognition and characterization of voice spoofing attacks, is particularly relevant to information security: how can we detect whether a given audio recording was actually produced by a live human or fabricated, for instance, by a speech synthesizer? Such questions are already of key importance in an era of fake news and rapidly evolving speech synthesis technology, which can already be used to create speech that sounds like a particular targeted speaker. The research group has been one of the international forerunners in detecting such spoofing attacks. In the recently concluded H2020-funded OCTAVE project, such methods were implemented in a cloud-based voice biometric system.

Besides speaker recognition, other topics in the group include language and accent identification, voice conversion (imitation of a voice by a computer), and comparison of human listeners and automatic techniques in these tasks.

Time period

08.10.2017 -

Group members - UEF