Person-specific automatic speaker recognition: understanding the behaviour of individual speakers for applications of ASR
2022-25 (ESRC: £1,012,570)
Automatic speaker recognition (ASR) software processes and analyses speech to make decisions about whether two voices belong to the same or different individuals. Such technology is becoming an increasingly important part of our lives: it is used as a security measure when gaining access to personal accounts (e.g. banks), and as a means of tailoring content to a specific person on smart devices. Around the world, ASR systems are commonly used for investigative and forensic purposes, to analyse recordings of criminal voices where the speaker's identity is unknown. The overarching aim of this project is to systematically analyse the factors that make individuals easy or difficult to recognise within ASR systems. By understanding these factors, we can better predict which speakers are likely to be problematic, tailor systems to those individuals, and ultimately improve overall accuracy and performance. We will use innovative methods and large-scale data, uniting expertise from linguistics, speech technology, and forensic speech analysis across the academic, professional, and commercial sectors.
Humans and machines: novel methods for testing speaker recognition performance
2021-23 (AHRC Early Career: £251,431)
Recognising an individual from his or her voice is something that we, as humans, do every day. However, machines are now increasingly being used to perform speaker recognition for a range of applications, including verifying identity to gain access to a bank account, searching through databases of persons of interest for intelligence purposes, or as a form of forensic evidence. But do machines recognise voices in the way that humans do? To answer this question, we are developing a bespoke and innovative computer game which will not only allow us to address this issue, but will also test the use of voice as a central component of a computer game. Our game allows us to extract conceptually equivalent data from human listeners, which we can compare and combine with the results of machines on the same speaker recognition task. We utilise large, existing corpora of speakers of different regional and social backgrounds, and recordings of diverse technical qualities, to assess the contexts in which humans and machines may perform better or worse. We also examine the potential cognitive biases which affect human judgements, to understand the effects of contextual information on speaker recognition, especially in the forensic context.
You can find out more at our project website.
From swallowing to speech to singing: investigating the vocal tract using electromagnetic articulography and ultrasound
2018-19 (University of York Priming Fund: £68,186)
This project seeks to purchase equipment for analysis of speech articulation that will put York at the cutting edge of voice science research, bringing together researchers from different disciplines across the University who all have the same aim: to understand the complexities of the human voice and vocal tract. The result will be a world-class laboratory producing novel interdisciplinary research with applications in technology, healthcare and security, and establishing national and international networks for researchers in voice science at York.
WikiDialects: Creating an Online Resource for Accent Descriptions
2019-20 (IAFPA: £1,300)
with Jessica Wormald (J P French and University of York)
This project will create an open-access, Wikipedia-style repository for descriptive information about the linguistic and phonetic patterns within language varieties (focussing initially on British English). We call it WikiDialects. The repository will contain summaries of literature from phonetics and sociolinguistics organised by regional and social groups, and by linguistic variable. It will also signpost users to sources of more detailed information and provide links to relevant academic papers. The primary aim of WikiDialects is to be a central, ‘go-to’ resource for forensic phoneticians when assessing the typicality of a given variable within the relevant population in casework which uses auditory and/or acoustic analysis. WikiDialects will also be a valuable resource for other disciplines that require baseline descriptions of language varieties, such as speech and language therapy, sociolinguistics, and language teaching and learning.
Voice and Identity: Source, Filter, Biometric
2015-18 (AHRC: £892,210)
The aim of the project is to compare the performance of different methods for forensic voice (or speaker) comparison – from linguistics and phonetics, acoustics, and automatic speaker recognition (ASR) – on the same set of recordings. We will explore the performance of the methods to assess their relative strengths, the consistency of their results and error patterns, and thus the potential for different methods to be integrated into a single framework. The ultimate aim is to improve methods in forensic voice comparison, taking a major step towards the development of a methodology that is more transparent, validated, and replicable. This outcome will benefit academics and forensic practitioners, the public, judicial systems, and investigative/security agencies.
Find out more here.
Modelling Features for Forensic Speaker Comparison
2013-15 (BA/Leverhulme: £9,848)
In forensic speaker comparison (FSC), experts compare speech patterns in criminal and suspect audio recordings to assess the evidence under competing prosecution and defence hypotheses, i.e. the criminal voice is that of the suspect versus that of someone else. There is a move toward expressing expert evidence in the form of Bayesian likelihood ratios. Speech presents considerable difficulties for this approach, as different types of data are analysed in forensic casework: linguistic data can be normally or non-normally distributed; variables can be continuous or discrete; and complex correlations exist between variables. It is imperative to develop statistical models that cater for these difficulties.
This project brings together leading forensic statisticians with forensic phoneticians. We will explore typical forensic phonetic data to assess the value of complex datasets for statistical modelling. We aim to develop new statistical models that incorporate a broader array of phonetic variables into FSC analyses and thus quantify forensic phonetic evidence more reliably.
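As an illustration only (this is not the project's model), the likelihood ratio described above can be sketched for a single continuous variable under a strong simplifying assumption: both the suspect's values and the relevant population's values are normally distributed. The function name, the variable and all of the numbers below are hypothetical. The sketch also ignores exactly the difficulties the project targets, namely non-normal distributions, discrete variables and correlations between variables.

```python
import math
from statistics import NormalDist


def likelihood_ratio(x, suspect_mean, suspect_sd, population_mean, population_sd):
    """Return p(x | prosecution hypothesis) / p(x | defence hypothesis).

    Assumes (for illustration) that both hypotheses can be modelled by a
    single Gaussian over one continuous variable.
    """
    # Probability density of the criminal-sample value under the suspect model.
    numerator = NormalDist(suspect_mean, suspect_sd).pdf(x)
    # Probability density of the same value under the population model.
    denominator = NormalDist(population_mean, population_sd).pdf(x)
    return numerator / denominator


# Hypothetical values for one variable (e.g. a mean pitch measurement in Hz).
lr = likelihood_ratio(
    x=118.0,
    suspect_mean=120.0,
    suspect_sd=5.0,
    population_mean=105.0,
    population_sd=15.0,
)
log10_lr = math.log10(lr)

# LR > 1 supports the prosecution hypothesis; LR < 1 supports the defence hypothesis.
print(f"LR = {lr:.2f}, log10 LR = {log10_lr:.2f}")
```

A value close to the suspect's mean and relatively atypical of the population yields an LR above 1. A value far from the suspect's mean yields an LR below 1.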
Identifying Correlations Between Speech Parameters for Forensic Speaker Comparisons
2013-14 (IAFPA: £1,400)
with Erica Gold (University of Huddersfield)
Building on a pilot study carried out by the applicants (Gold and Hughes 2012), the project sets out to investigate two aspects of correlation between speech parameters. The first involves empirical testing of data from a homogeneous group of speakers (DyViS: Nolan 2009) to reveal correlations that may exist between traditional acoustic-phonetic parameters commonly used in forensic speaker comparisons. The second addresses theoretical issues underlying the application of logistic regression fusion (Brümmer et al. 2007) in a likelihood ratio (LR) framework, by comparing the levels of correlation found in the data against the levels of correlation found for LRs computed by a given system.
The results have two sets of implications. Firstly, they will provide an empirically based starting point for making informed decisions about the combination of parameters in real forensic speaker comparisons. This applies both to experts working in an LR framework, who must account for the independence assumption of naïve Bayes, and to those working in other frameworks where the expert personally selects parameters to consider and combine in casework. Those making uninformed assumptions about correlations between the parameters used in casework could carry out analyses that misrepresent the strength of evidence, potentially leading to miscarriages of justice. Secondly, the results are relevant in addressing how appropriate fusion is as a method for combining dependent parameters using LRs. For those working within an LR framework, the results are intended to provide a basis for the development of a Bayesian network (Taroni et al. 2006) to create a ‘front-end’ mathematical model of interdependencies between speech parameters in order to combine parameters appropriately.
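The concern about correlated parameters can be illustrated with a minimal sketch. It is not the method of Gold and Hughes (2012) or of Brümmer et al. (2007). The sketch computes a Pearson correlation between two parameters, then contrasts a naïve Bayes combination (summing log LRs as if the parameters were independent) with a weighted linear combination of the general form used in logistic regression fusion. All function names, data values, weights and the offset are hypothetical; in practice fusion weights would be trained on calibration data, which is not shown here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def naive_bayes_log10_lr(log10_lrs):
    """Combine per-parameter log10 LRs assuming the parameters are independent."""
    return sum(log10_lrs)


def fused_log10_lr(log10_lrs, weights, offset):
    """Weighted linear combination of log10 LRs (weights are assumed, not trained)."""
    return offset + sum(w * s for w, s in zip(weights, log10_lrs))


# Hypothetical per-speaker means for two acoustic-phonetic parameters.
param_a = [5.1, 5.6, 6.0, 6.4, 7.2]
param_b = [1480.0, 1510.0, 1525.0, 1560.0, 1600.0]
r = pearson(param_a, param_b)

# Hypothetical log10 LRs obtained separately from each parameter.
per_parameter = [1.0, 1.0]
naive = naive_bayes_log10_lr(per_parameter)
fused = fused_log10_lr(per_parameter, weights=[0.6, 0.6], offset=0.0)

print(f"correlation r = {r:.3f}")
print(f"naive Bayes log10 LR = {naive:.2f}, fused log10 LR = {fused:.2f}")
```

When two parameters are strongly correlated, summing their log LRs counts overlapping evidence twice. Weights below 1 shrink the combined value, which is one way a fusion model can partly compensate for that dependence.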