Category Archives: Current Research Projects

A list of current projects going on in the lab.

Forced alignment on raw audio with deep neural networks

Linguists performing phonetic research often need to take measurements on the acoustic segments that make up spoken utterances. Segmenting an audio file by hand is difficult and time-intensive, however, so many researchers turn to computer programs to do it for them. These programs are called forced aligners, and they perform a process called forced alignment whereby they temporally align phonemes (the term used for these acoustic segments in the speech recognition literature) to their locations in an audio file. This process is intended to yield an alignment as close as possible to what an expert human aligner would produce, so that minimal editing of the boundary locations is needed before analyzing the segments.

Stacked waveform, spectrogram, and segments with segment boundaries

Sample segment boundaries for “phoneme”

Forced aligners traditionally have the user pass in the audio files they want aligned, accompanied by an orthographic transcription of the content in the audio files and a dictionary that converts the words in the transcription into phonemes. The program then steps through the audio, converts it to Mel-frequency cepstral coefficients, and processes those with a back end based on hidden Markov models to determine the temporal boundaries of each phoneme contained within the audio file.
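The boundary-finding step can be illustrated with a minimal sketch. It assumes that some acoustic model has already produced a log-probability for every phoneme class at every frame, and it leaves out the transition probabilities and feature extraction that a real hidden Markov model aligner would use. The function name `force_align` and the toy probabilities are hypothetical, chosen only for illustration.

```python
import math


def force_align(log_probs, phoneme_ids):
    """Align a known phoneme sequence to frames with a Viterbi search.

    log_probs[t][p] is the (assumed, precomputed) log-probability of
    phoneme class p at frame t. phoneme_ids is the ordered list of class
    indices obtained from the transcription and pronunciation dictionary.
    Returns a list of (phoneme_id, start_frame, end_frame_exclusive).
    """
    n_frames = len(log_probs)
    n_states = len(phoneme_ids)
    if n_frames < n_states:
        raise ValueError("need at least one frame per phoneme")

    neg_inf = float("-inf")
    # best[t][s]: score of the best monotonic path ending in state s at frame t
    best = [[neg_inf] * n_states for _ in range(n_frames)]
    # advanced[t][s]: True if that path entered state s at frame t from s - 1
    advanced = [[False] * n_states for _ in range(n_frames)]
    best[0][0] = log_probs[0][phoneme_ids[0]]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best[t][s] = move + log_probs[t][phoneme_ids[s]]
                advanced[t][s] = True
            else:
                best[t][s] = stay + log_probs[t][phoneme_ids[s]]

    # Trace back from the last phoneme at the last frame.
    state_at = [0] * n_frames
    s = n_states - 1
    for t in range(n_frames - 1, -1, -1):
        state_at[t] = s
        if advanced[t][s]:
            s -= 1

    # Collapse the frame-level path into one segment per phoneme.
    segments = []
    start = 0
    for t in range(1, n_frames + 1):
        if t == n_frames or state_at[t] != state_at[start]:
            segments.append((phoneme_ids[state_at[start]], start, t))
            start = t
    return segments


# Toy example: 6 frames, 3 phoneme classes, transcription = classes 0, 1, 2.
toy_probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.1, 0.8],
]
toy_log_probs = [[math.log(p) for p in row] for row in toy_probs]
segments = force_align(toy_log_probs, [0, 1, 2])
print(segments)
```

Frame indices can be converted to times by multiplying by the frame shift of whatever front end produced the probabilities.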

Recently, however, deep neural networks have been found to outperform the traditional hidden Markov model implementations in speech recognition tasks. Even so, few of the available forced-alignment programs use deep neural networks as their back end. Those that do still rely on analyzing hand-crafted Mel-frequency cepstral coefficients instead of the speech waveform itself, even though convolutional neural networks can learn the features needed to discriminate between classes in a classification task.

Our lab is working to develop a new forced alignment program that uses deep neural networks as the back end and takes in raw audio instead of the Mel-frequency cepstral coefficients. By having the network learn features from the audio itself rather than use features determined before ever running the network, only features that are useful for the classification task will be used. Additionally, the methodology of training the network will be more generalizable to other tasks because there will not be a need to develop hand-crafted features as the input to the network.

MALD: Massive Auditory Lexical Decision

How do humans recognize speech? How do factors such as native language, age, and dialect affect the way in which words are recognized? A common concern among people as they get older is age-related decline; in other words, does our cognitive ability decline with age? Ramscar et al. (2014) show that it may not be the case that older readers are slower due to cognitive decline. Will similar results be found for listeners when they hear language? Additionally, interactions with speakers of other dialects can be a relatively common occurrence. Why are some dialects easy to understand while others are more difficult? Are there aspects of these dialects that are more difficult to adapt to than others (Clarke & Garrett, 2004)? The present proposal seeks to investigate these and other questions regarding spoken language recognition. There are many ways in which answers to these questions can be found; one way is by creating and conducting large studies.

This megastudy contains over 26,000 words and 9,600 non-words from a male speaker of Western Canadian English. Participants (largely from Edmonton, AB) will span ages ranging from 20 to 70 years. Participants will also be expanded to include additional dialect regions (Arizona, USA; Nova Scotia; New Zealand).

This project will contribute to the ongoing investigation of language comprehension. Novel and theoretical contributions emerging from this research program include:
– testing and creation of models of spoken word recognition
– creation of an open source dataset which can be used by a wide range of researchers
– insight into how age-related anatomical changes in the voice affect spoken word recognition
– insight into how aging affects spoken word recognition
– insight into how dialect affects spoken word recognition

Assessment of vowel overlap metrics

An item of interest to linguists is vowel overlap, or how much two categories of vowels overlap in a language. Though there are a number of cues that help distinguish one vowel from another, the first two formants, F1 and F2, as well as the duration of the vowel, are the most prominent.

The question is how to use F1, F2, and, optionally, duration to calculate the overlap between two vowel categories. There have been a small number of metrics published that seek to quantify vowel overlap, such as Alicia Wassink’s Spectral Overlap Assessment Metric, Geoffrey Stewart Morrison’s a posteriori probability metric, and Erin F. Haynes and Michael Taylor’s Vowel Overlap Assessment with Convex Hulls metric. Despite these metrics having existed for some time, there has not been a robust comparison between them to determine which of them, if any, is the most accurate and precise.

Matthew C. Kelley, Geoffrey Stewart Morrison, and Benjamin V. Tucker are collaborating to prepare a robust comparison of these metrics, using Monte Carlo simulation to test how accurate and how precise they are, as well as whether there are situations in which one is preferable to the others. In the spirit of open science, we will also be releasing our implementations of each of these metrics in the R programming language so that researchers will have easy access to them. Each implementation will also include visualization capabilities appropriate to the metric.
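The general logic of testing an overlap metric by Monte Carlo simulation can be sketched as follows. This is an illustrative simplification, not the project's method: it uses a single cue (for instance F1) instead of F1, F2, and duration; it assumes both vowel categories are normally distributed with equal variance, so that the true overlap is known in closed form; and its estimator is a simple stand-in, not any of the three published metrics. All function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def true_overlap(mu_a, mu_b, sigma):
    """Overlap coefficient of two normal distributions with equal SD."""
    distance = abs(mu_a - mu_b) / sigma
    return 2.0 * normal_cdf(-distance / 2.0)


def estimate_overlap(sample_a, sample_b):
    """Stand-in estimator: plug sample means and the pooled SD into the formula."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(sample_a)
        + (n_b - 1) * statistics.variance(sample_b)
    ) / (n_a + n_b - 2)
    return true_overlap(
        statistics.fmean(sample_a), statistics.fmean(sample_b), math.sqrt(pooled_var)
    )


def monte_carlo(mu_a, mu_b, sigma, n_tokens, n_runs, seed=0):
    """Repeatedly sample two categories and score the estimator against the truth."""
    rng = random.Random(seed)
    truth = true_overlap(mu_a, mu_b, sigma)
    estimates = []
    for _ in range(n_runs):
        sample_a = [rng.gauss(mu_a, sigma) for _ in range(n_tokens)]
        sample_b = [rng.gauss(mu_b, sigma) for _ in range(n_tokens)]
        estimates.append(estimate_overlap(sample_a, sample_b))
    bias = statistics.fmean(estimates) - truth  # accuracy: average error
    spread = statistics.stdev(estimates)        # precision: run-to-run variability
    return truth, bias, spread


# Made-up category means and SD in Hz, 50 tokens per category, 500 runs.
truth, bias, spread = monte_carlo(300.0, 400.0, 50.0, n_tokens=50, n_runs=500)
print(f"true overlap={truth:.3f}  bias={bias:+.3f}  spread={spread:.3f}")
```

A bias near zero indicates an accurate metric, and a small spread indicates a precise one. Varying the number of tokens or the distance between the categories shows the conditions under which a given metric degrades.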

Having a vowel overlap metric that is accurate and precise will be a boon to a number of fields, such as dialectology and sociophonetics in studying vowel merger and variation, as well as in second language speech learning, to help language learners and users more closely match the vowel targets in their target language.

Sample visualizations of the overlap metrics run on Hillenbrand /i/ and /ɪ/ data can be seen below.

Sample spectral overlap assessment metric visualization

Spectral Overlap Assessment Metric on Hillenbrand vowel data. /i/ is blue, /ɪ/ is orange. Calculated overlap is 0.215.

Sample a posteriori visualization

A posteriori probability metric on Hillenbrand vowel data. /i/ is blue, /ɪ/ is orange. Calculated overlap is 0.21.

Sample visualization of Vowel Overlap Assessment with Convex Hulls

Vowel Overlap Assessment with Convex Hulls metric on Hillenbrand vowel data. /i/ is blue, /ɪ/ is orange. The overlapping vowel points are black. Calculated overlap is 0.34.

Corpus of Spontaneous Multimodal-Interactive Language

Drs. Jarvikivi and Tucker have received funding to begin the project Corpus of Spontaneous Multimodal-Interactive Language. This is an interdisciplinary collaborative initiative (with Drs. S. Rice, H. Colston, E. Nicoladis, S. Moore, A. Arppe, C. Boliek) to design, systematically collect and code, and publish a digital resource for the study of natural human spoken interaction in multimodal context. Thank you to the Kule Institute for Advanced Study for funding this project.

Online Phonetics Class

Dr. B. Tucker and Dr. K. Pollock (Speech Pathology), together with Dr. Tim Mills, are currently working to enhance introductory phonetics by developing online interactive laboratory activities and by developing and offering a fully online version of the course. This project is funded by the Teaching and Learning Enhancement Fund from the University of Alberta.