Category Archives: Current Research Projects

A list of current projects going on in the lab.

Forced alignment on raw audio with deep neural networks

Linguists performing phonetic research often need to take measurements on the acoustic segments that make up spoken utterances. Segmenting an audio file is a difficult and time-intensive task, however, so many researchers turn to computer programs to perform it for them. These programs are called forced aligners, and they perform a process called forced alignment, whereby they temporally align phonemes—the term used for acoustic segments in the speech recognition literature—to their locations in an audio file. This process is intended to yield an alignment as close as possible to what an expert human aligner would produce, so that minimal editing of the boundary locations is needed before analyzing the segments.

Sample segment boundaries for “phoneme”, shown as a stacked waveform, spectrogram, and segment tier.

Forced aligners traditionally have the user pass in the audio files to be aligned, accompanied by an orthographic transcription of their content and a pronunciation dictionary that converts the words in the transcription into phonemes. The program then steps through the audio, converts it to Mel-frequency cepstral coefficients, and processes those with a hidden Markov model back end to determine the temporal boundaries of each phoneme contained in the audio file.
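As a rough sketch of that feature-extraction step, here is what computing Mel-frequency cepstral coefficients might look like in Python with the librosa library (the filename and parameter values are illustrative, not taken from any particular aligner):

import librosa

# Load the audio at 16 kHz, a common sampling rate for speech work.
audio, sr = librosa.load("utterance.wav", sr=16000)

# Compute 13 MFCCs per frame with a 25 ms window and a 10 ms hop,
# typical settings in speech recognition systems.
mfccs = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

print(mfccs.shape)  # (13, n_frames)

The resulting matrix has one column of coefficients per 10 ms frame; the hidden Markov model back end then assigns each frame to a phoneme to recover the segment boundaries.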

Recently, however, deep neural networks have been found to outperform traditional hidden Markov model implementations on speech recognition tasks. Yet few forced-alignment programs use deep neural networks as their back end, and those that do still analyze the hand-crafted Mel-frequency cepstral coefficients rather than the speech waveform itself, even though convolutional neural networks can learn the features needed to discriminate between classes in a classification task.

Our lab is working to develop a new forced alignment program that uses deep neural networks as the back end and takes in raw audio instead of Mel-frequency cepstral coefficients. By having the network learn features from the audio itself rather than relying on features determined before the network is ever run, only features that are useful for the classification task will be used. Additionally, the training methodology will generalize better to other tasks because there will be no need to develop hand-crafted features as input to the network.
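As a minimal sketch of what a network that consumes raw audio could look like, here is an illustrative PyTorch model; the layer sizes, kernel widths, and class count are placeholders, not the architecture our lab is actually using:

import torch
import torch.nn as nn

class RawAudioFrameClassifier(nn.Module):
    """Toy 1-D CNN that maps a chunk of raw audio to phoneme scores."""

    def __init__(self, n_phonemes=40):
        super().__init__()
        self.features = nn.Sequential(
            # The first layer spans about 20 ms of 16 kHz audio, so the
            # network can learn its own filterbank-like features instead
            # of receiving fixed MFCCs as input.
            nn.Conv1d(1, 64, kernel_size=320, stride=160),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classify = nn.Linear(128, n_phonemes)

    def forward(self, waveform):
        # waveform: (batch, 1, n_samples) of raw audio samples
        h = self.features(waveform).squeeze(-1)
        return self.classify(h)

model = RawAudioFrameClassifier()
scores = model(torch.randn(8, 1, 4000))  # eight 250 ms chunks at 16 kHz
print(scores.shape)  # torch.Size([8, 40])

Sliding such a classifier across an utterance yields per-chunk phoneme scores that a decoder can turn into time-aligned boundaries, with the first convolutional layer playing the role that fixed cepstral features play in traditional aligners.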

MALD: Massive Auditory Lexical Decision

How do humans recognize speech? How do factors such as native language, age, and dialect affect the way words are recognized? A common concern among people as they get older is age-related decline; in other words, does our cognitive ability decline with age? Ramscar et al. (2014) show that it may not be the case that older readers are slower due to cognitive decline. Will similar results be found for listeners when they hear language? Additionally, interaction with speakers of other dialects is a relatively common occurrence. Why are some dialects easy to understand while others are more difficult? Are there aspects of these dialects that are more difficult to adapt to than others (Clarke & Garrett, 2004)? The present project seeks to investigate these and other questions regarding spoken language recognition. There are many ways to find answers to these questions; one is by creating and conducting large-scale studies.

This megastudy contains over 26,000 words and 9,600 non-words produced by a male speaker of Western Canadian English. Participants (largely from Edmonton, AB) will span ages from 20 to 70 years. The participant pool will also be expanded to include additional dialect regions (Arizona, USA; Nova Scotia; New Zealand).

This project will contribute to the ongoing investigation of language comprehension. Novel and theoretical contributions expected to emerge from this research program include:
– testing and creation of models of spoken word recognition
– creation of an open-source dataset that can be used by a wide range of researchers
– insight into how age-related anatomical changes in the voice affect spoken word recognition
– insight into how aging affects spoken word recognition
– insight into how dialect affects spoken word recognition

Assessment of vowel overlap metrics

An item of interest to linguists is vowel overlap: how much two vowel categories overlap in a language. Though a number of acoustic cues help distinguish one vowel from another, the first two formants, F1 and F2, along with the duration of the vowel, are the most prominent.

The question is how to use F1, F2, and, optionally, duration to calculate the overlap between two vowel categories. There have been a small number of metrics published that seek to quantify vowel overlap, such as Alicia Wassink’s Spectral Overlap Assessment Metric, Geoffrey Stewart Morrison’s a posteriori probability metric, and Erin F. Haynes and Michael Taylor’s Vowel Overlap Assessment with Convex Hulls metric. Despite these metrics having existed for some time, there has not been a robust comparison between them to determine which of them, if any, is the most accurate and precise.

Matthew C. Kelley, Geoffrey Stewart Morrison, and Benjamin V. Tucker are collaborating on a robust comparison of these metrics, using Monte Carlo simulation to test how accurate and how precise each one is, as well as whether there are situations in which one is preferable over another. In the spirit of open science, we will also release our implementations of each of these metrics in the R programming language so that researchers have easy access to them. Each implementation will also include visualization capabilities appropriate to the metric.
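As an illustration of the Monte Carlo approach, and not of any of the published metrics or of our forthcoming R implementations, the following Python sketch models two vowel categories as bivariate normal distributions in F1–F2 space (with made-up means and covariances) and estimates how often tokens of one category would be attributed to the other:

from scipy.stats import multivariate_normal

# Hypothetical F1/F2 means (Hz) and covariances for two vowel
# categories; real data would come from formant measurements.
vowel_a = multivariate_normal(mean=[300.0, 2300.0],
                              cov=[[900.0, 0.0], [0.0, 10000.0]])
vowel_b = multivariate_normal(mean=[400.0, 2000.0],
                              cov=[[900.0, 0.0], [0.0, 10000.0]])

# Sample tokens from category A and count how often a maximum-
# likelihood classifier would label them as category B. This is one
# simple way to operationalize "overlap"; the published metrics each
# define and compute it differently.
samples = vowel_a.rvs(size=100_000, random_state=0)
confused = vowel_b.pdf(samples) > vowel_a.pdf(samples)
print(f"Estimated overlap of A into B: {confused.mean():.3f}")

Because the distributions that generated the simulated tokens are known exactly, the true overlap is known too, which is what makes Monte Carlo simulation a suitable testbed for checking each metric's accuracy and precision.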

Having a vowel overlap metric that is accurate and precise will be a boon to a number of fields, such as dialectology and sociophonetics for studying vowel merger and variation, as well as second language speech learning, where it can help language learners and users more closely match the vowel targets of their target language.

Sample visualizations of the overlap metrics run on Hillenbrand /i/ and /ɪ/ data can be seen below.

Spectral Overlap Assessment Metric on Hillenbrand vowel data. /i/ is blue, /ɪ/ is orange. Calculated overlap is 76.3%.

A posteriori probability metric on Hillenbrand vowel data. /i/ is blue, /ɪ/ is orange. Calculated overlap is 21.0%.

Vowel Overlap Assessment with Convex Hulls metric on Hillenbrand vowel data. /i/ is blue, /ɪ/ is orange. The overlapping vowel points are black. Calculated overlap is 34.0%.

Forensic Science Research

Volunteers Wanted for Forensic Science Research:

We are conducting research on forensic voice comparison, and we are looking for volunteers to help us with this research.

The purpose of the research is to demonstrate how to perform forensic voice comparison under conditions reflecting those of an actual case. We are basing this research on the conditions of an element of the Saskatchewan robocall scandal in which the outgoing voicemail message of a speaker of known identity had to be compared with the outgoing voicemail message of a speaker of questioned identity.

http://www.ottawacitizen.com/news/Tories+admit+they+sent+Saskatchewan+robocall/7922470/story.html

Our approach is based on relevant data, quantitative measurements, and statistical models, and we test the validity and reliability of our system under conditions reflecting those of the case under investigation.

Volunteers will be asked to make five telephone calls over the course of a week and each time leave the message: “You’ve reached Haste Research. Please leave your name, phone number and reason for call, and I’ll return your call as soon as possible.”

Volunteers who complete the task will be given a $25 gift certificate.

If you are an adult male speaker of Canadian English, and are interested in participating, please send the following e-mail message to <apl@ualberta.ca> with subject line “potential volunteer”.

I am interested in participating in the forensic voicemail research study. Please contact me by e-mail and telephone to give me more information so I can decide whether I want to participate. My telephone number is _______

Alberta Phonetics Laboratory
University of Alberta
Edmonton AB
T6G 2E7
Canada
tel: (780) 248 1409, mailbox 0
fax: (780) 492 0806
e-mail: apl@ualberta.ca
website: http://aphl.artsrn.ualberta.ca/

Principal Investigator:
Dr Geoffrey Stewart Morrison
http://geoff-morrison.net/

Corpus of Spontaneous Multimodal-Interactive Language

Drs. Jarvikivi and Tucker have received funding to begin the project Corpus of Spontaneous Multimodal-Interactive Language. This is an interdisciplinary collaborative initiative (with Drs. S. Rice, H. Colston, E. Nicoladis, S. Moore, A. Arppe, and C. Boliek) to design, systematically collect and code, and publish a digital resource for the study of natural human spoken interaction in multimodal context. Thank you to the Kule Institute for Advanced Study for funding this project.

Articulography recordings during speech production

PARTICIPANTS WANTED 
 
Is English your native language? Are you 18 or older? Want to participate in a cool experiment and get an awesome Facebook/Twitter profile pic out of it?
 
Sign up for our study!
 
Using an electromagnetic articulograph, we will record the movement of your lips, tongue, and jaw while you read aloud or speak English words. The study investigates how your tongue, jaw, and lips work together while you produce speech. To track these movements, we will fit you with small metal sensors (just like Hollywood magic!).
 
Watch an entertaining YouTube video we made about the experiment process: http://www.youtube.com/watch?v=blFkVdj9Wbo

Pay rate is $15 per hour, and the experiment is about 1.75 hours long.
 
Time slots are scheduled between May 6th and 23rd in the Alberta Phonetics Laboratory at the University of Alberta. There are three different time slots available every day.
 
For more information about the experiment and to sign up, email Fabian at: tomasche@ualberta.ca
 
NOTE: If you have a pacemaker, you cannot participate in this experiment.

Online Phonetics Class

Dr. B. Tucker and Dr. K. Pollock (Speech Pathology), together with Dr. Tim Mills, are currently working to enhance introductory phonetics by developing online interactive laboratory activities and by developing and offering a fully online version of the course. This project is funded by the Teaching and Learning Enhancement Fund at the University of Alberta.