Consonant and vowel articulation accuracy in younger and middle-aged healthy Spanish adults.
ABSTRACT: Children acquire vowels earlier than consonants, and the former are less vulnerable to speech disorders than the latter. This study explores the hypothesis that a similar contrast exists later in life, and that consonants are more vulnerable to ageing than vowels. Data were obtained from two experiments comparing the speech of Younger Adults (YAs) and Middle-aged Adults (MAs). In the first experiment, an Automatic Speech Recognition (ASR) system was trained on a balanced corpus of 29 YAs and 27 MAs. The productions of each speaker were obtained in a Spanish-language word (W) and non-word (NW) repetition task. The performance of the system was evaluated on the same corpus used for training, with a cross-validation approach. The ASR system recognized the Ws of both groups of speakers to a similar extent, but it was more successful with the NWs of the YAs than with those of the MAs. Detailed error analysis revealed that the MA speakers scored below the YA speakers for consonants, and also for the place and manner of articulation features; the results were almost identical in both groups of speakers for vowels and for the voicing feature. In the second experiment, a group of healthy native listeners was asked to recognize isolated syllables presented with background noise. The target speakers were one YA and one MA who had taken part in the first experiment. The results were consistent with those of the ASR experiment: manner and place of articulation were recognized better, and vowels and voicing worse, for the YA speaker than for the MA speaker. We conclude that consonant articulation is more vulnerable to ageing than vowel articulation. Future studies should explore whether these early and selective changes in articulation accuracy might be caused by changes in speech perception skills (e.g., in auditory temporal processing).
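As an illustration of the error analysis described in this abstract, the sketch below (not the study's code; the phoneme-feature table is a small, assumed subset of the Spanish consonant inventory) scores ASR output separately for place, manner, and voicing from (target, recognized) phoneme pairs.

```python
# Minimal sketch of feature-level ASR error scoring. The feature table is an
# illustrative subset of Spanish consonants, not the full inventory.
FEATURES = {
    # phoneme: (place, manner, voicing)
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "t": ("dental", "plosive", "voiceless"),
    "d": ("dental", "plosive", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "m": ("bilabial", "nasal", "voiced"),
    "n": ("alveolar", "nasal", "voiced"),
}

def feature_accuracy(pairs):
    """Per-feature accuracy over (target, recognized) consonant pairs."""
    names = ("place", "manner", "voicing")
    totals = {f: 0 for f in names}
    correct = {f: 0 for f in names}
    for target, recognized in pairs:
        t, r = FEATURES[target], FEATURES[recognized]
        for i, feat in enumerate(names):
            totals[feat] += 1
            correct[feat] += int(t[i] == r[i])
    return {f: correct[f] / totals[f] for f in names}

# Toy confusions: /t/ -> /d/ keeps place and manner but loses voicing, etc.
print(feature_accuracy([("t", "d"), ("p", "p"), ("s", "t"), ("m", "n")]))
```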
Project description: Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer that produces intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed by a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded from a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the positions of sensors glued to different speech articulators into acoustic parameters, which are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained, as assessed by a perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between the new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open the way to future speech BCI applications using such an articulatory-based speech synthesizer.
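The articulatory-to-acoustic mapping lends itself to a compact illustration. The following is a minimal PyTorch sketch of a frame-by-frame EMA-to-vocoder-parameter network; the input/output dimensions, layer sizes, and activations are illustrative assumptions, not the authors' architecture.

```python
# Sketch of an articulatory-to-acoustic DNN: EMA sensor coordinates in,
# vocoder parameters out. Dimensions are assumptions (6 sensors x 2
# coordinates = 12 inputs; 25 acoustic parameters per frame).
import torch
import torch.nn as nn

class ArticulatoryToAcoustic(nn.Module):
    def __init__(self, n_ema=12, n_acoustic=25, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_ema, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_acoustic),  # vocoder parameters per frame
        )

    def forward(self, ema_frames):          # (batch, n_ema)
        return self.net(ema_frames)

model = ArticulatoryToAcoustic()
ema = torch.randn(8, 12)                    # 8 frames of sensor positions
acoustic = model(ema)                       # would be fed to a vocoder
print(acoustic.shape)                       # torch.Size([8, 25])
```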
Project description: A central challenge for articulatory speech synthesis is the simulation of realistic articulatory movements, which is critical for the generation of highly natural and intelligible speech. This includes modeling coarticulation, i.e., the context-dependent variation of the articulatory and acoustic realization of phonemes, especially of consonants. Here we propose a method to simulate the context-sensitive articulation of consonants in consonant-vowel syllables. To achieve this, the vocal tract target shape of a consonant in the context of a given vowel is derived as the weighted average of three measured and acoustically optimized reference vocal tract shapes for that consonant in the context of the corner vowels /a/, /i/, and /u/. The weights are determined by mapping the target shape of the given context vowel into the vowel subspace spanned by the corner vowels. The model was applied to the synthesis of consonant-vowel syllables with the consonants /b/, /d/, /g/, /l/, /r/, /m/, /n/ in all combinations with the eight long German vowels. In a perception test, the mean recognition rate for the consonants in the isolated syllables was 82.4%. This demonstrates the potential of the approach for highly intelligible articulatory speech synthesis.
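The interpolation scheme can be made concrete with a short numerical sketch: the weights are the barycentric coordinates of the context vowel in the subspace spanned by /a/, /i/, and /u/, and the consonant's target shape is the correspondingly weighted average of the three reference shapes. All numerical values below are toy examples, not measured vocal tract data.

```python
# Sketch of corner-vowel interpolation for a consonant target shape.
import numpy as np

def barycentric_weights(v, a, i, u):
    """Weights (w_a, w_i, w_u) expressing v in the triangle spanned by a, i, u."""
    T = np.column_stack([a - u, i - u])   # 2x2 basis of the vowel subspace
    w_ai = np.linalg.solve(T, v - u)
    return np.array([w_ai[0], w_ai[1], 1.0 - w_ai.sum()])

# Toy corner-vowel positions (e.g., normalized articulatory coordinates)
# and toy consonant reference shapes measured in each corner-vowel context.
a, i, u = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.0, 0.0])
shape_a, shape_i, shape_u = (np.array([0.2, 0.8]),
                             np.array([0.6, 0.1]),
                             np.array([0.4, 0.5]))

v = np.array([0.3, 0.4])                  # some other context vowel
w = barycentric_weights(v, a, i, u)       # -> [0.3, 0.4, 0.3]
target_shape = w[0] * shape_a + w[1] * shape_i + w[2] * shape_u
print(w, target_shape)
```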
Project description: Vowel reduction is a prominent feature of American English, as well as of other stress-timed languages. As a phonological process, vowel reduction neutralizes multiple vowel quality contrasts in unstressed syllables. For bilinguals whose native language is not characterized by large spectral and durational differences between tonic and atonic vowels, systematically reducing unstressed vowels to the central vowel space can be problematic. Failure to maintain this pattern of stressed-unstressed syllables in American English is one key element that contributes to a "foreign accent" in second language speakers. Reduced vowels, or "schwas," have also been identified as particularly vulnerable to the coarticulatory effects of adjacent consonants. The current study examined the effects of adjacent sounds on the spectral and temporal qualities of schwa in word-final position. Three groups of English-speaking adults were tested: Miami-based monolingual English speakers, early Spanish-English bilinguals, and late Spanish-English bilinguals. Subjects performed a reading task to examine their schwa productions in fluent speech when schwas were preceded by consonants from various points of articulation. Results indicated that the monolingual English and late Spanish-English bilingual groups produced targeted vowel qualities for schwa, whereas early Spanish-English bilinguals lacked homogeneity in their vowel productions. This extends to highly proficient bilingual speakers the prior claim that schwa is targetless in F2 position for native speakers. Though their spectral qualities lacked homogeneity, early Spanish-English bilinguals produced schwas with near native-like vowel duration. In contrast, late bilinguals produced schwas with significantly longer durations than English monolinguals or early Spanish-English bilinguals. Our results suggest that the temporal properties of a language are better integrated into second language phonologies than spectral qualities. Finally, we examined the role of nonstructural variables (e.g., linguistic history measures) in predicting native-like vowel duration. These factors included: age of L2 learning, amount of L1 use, and self-reported bilingual dominance. Our results suggested that the sociolinguistic factors that predicted native-like reduced vowel duration differed from those that predicted native-like vowel qualities across multiple phonetic environments.
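By way of illustration, the sketch below shows the kind of group comparison of schwa durations reported here, using toy values rather than the study's measurements, with a one-way ANOVA via SciPy.

```python
# Toy comparison of schwa durations (ms) across the three speaker groups;
# values are illustrative, not the study's data.
from scipy import stats

mono  = [62, 58, 65, 60, 59]   # monolingual English speakers
early = [63, 61, 66, 64, 60]   # early Spanish-English bilinguals
late  = [82, 79, 88, 85, 80]   # late bilinguals: longer schwas

f, p = stats.f_oneway(mono, early, late)
print(f"F = {f:.2f}, p = {p:.4f}")
```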
Project description: Purpose. To investigate changes in vowel articulation with electrical deep brain stimulation (DBS) of the subthalamic nucleus (STN) in dysarthric speakers with Parkinson's disease (PD). Methods. Eight Quebec-French speakers diagnosed with idiopathic PD who had undergone STN DBS were evaluated ON-stimulation and OFF-stimulation (1 hour after DBS was turned off). Vowel articulation was compared ON-stimulation versus OFF-stimulation using acoustic vowel space and formant centralization ratio, calculated with the first (F1) and second (F2) formants of the vowels /i/, /u/, and /a/. The impact of the preceding consonant context on articulation, which represents a measure of coarticulation, was also analyzed as a function of the stimulation state. Results. Maximum vowel articulation increased during ON-stimulation. Analyses also indicate that vowel articulation was modulated by the consonant context, but this relationship did not change with STN DBS. Conclusions. Results suggest that STN DBS may improve articulation in dysarthric speakers with PD, in terms of range of movement. Optimization of the electrical parameters for each patient is important and may lead to improvement in speech fine motor control. However, the impact on overall speech intelligibility may still be small. Clinical considerations are discussed and new research avenues are suggested.
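Both outcome measures have standard closed-form definitions from the corner-vowel formants. The sketch below computes the commonly used triangular vowel space area (shoelace formula) and the formant centralization ratio from illustrative F1/F2 values; it is a sketch of the metrics, not the study's analysis code.

```python
# Articulation metrics from mean F1/F2 (Hz) of the corner vowels /i/, /u/, /a/.
def vowel_space_area(i, u, a):
    """Triangle area in Hz^2 from (F1, F2) tuples for /i/, /u/, /a/."""
    (f1i, f2i), (f1u, f2u), (f1a, f2a) = i, u, a
    return abs(f1i * (f2u - f2a) + f1u * (f2a - f2i) + f1a * (f2i - f2u)) / 2

def formant_centralization_ratio(i, u, a):
    """FCR: larger values indicate more centralized (less distinct) vowels."""
    (f1i, f2i), (f1u, f2u), (f1a, f2a) = i, u, a
    return (f2u + f2a + f1i + f1u) / (f2i + f1a)

i, u, a = (300, 2300), (320, 800), (750, 1300)   # illustrative formants
print(vowel_space_area(i, u, a))                  # 327500.0 Hz^2
print(formant_centralization_ratio(i, u, a))      # ~0.89
```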
Project description: When we observe someone else speaking, we tend to automatically activate the corresponding speech motor patterns; when listening, we therefore covertly imitate the observed speech. Simulation theories of speech perception propose that this covert imitation of speech motor patterns supports speech perception. Covert imitation of speech has been studied with interference paradigms, including the stimulus-response compatibility (SRC) paradigm. The SRC paradigm measures covert imitation by comparing articulation of a prompt following exposure to a distracter: responses tend to be faster for congruent than for incongruent distracters, thus showing evidence of covert imitation. However, covert imitation has thus far only been demonstrated for a select class of speech sounds, namely consonants, and it is unclear whether it extends to vowels. In two experiments, we aimed to demonstrate that covert imitation effects, as measured with the SRC paradigm, extend to vowels. We examined whether covert imitation occurs for vowels in a consonant-vowel-consonant context in visual, audio, and audiovisual modalities. We presented the prompt at four time points to examine how covert imitation varied over the distracter's duration. The results of both experiments clearly demonstrated covert imitation effects for vowels, thus supporting simulation theories of speech perception. Covert imitation was not affected by stimulus modality and was maximal for later time points.
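The congruency effect the SRC paradigm measures reduces to a simple contrast of response times, as in this toy sketch (illustrative values, not the experiments' data):

```python
# Toy response times (ms) for prompt articulation after congruent versus
# incongruent distracters; a positive difference indicates covert imitation.
congruent   = [412, 398, 405, 420, 401]
incongruent = [445, 452, 438, 460, 449]

effect = (sum(incongruent) / len(incongruent)
          - sum(congruent) / len(congruent))
print(f"Congruency (covert-imitation) effect: {effect:.1f} ms")
```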
Project description: The acoustic dimensions that distinguish speech sounds (like the vowel differences in "boot" and "boat") also differentiate speakers' voices. Therefore, listeners must normalize across speakers without losing linguistic information. Past behavioral work suggests an important role for auditory contrast enhancement in normalization: preceding context affects listeners' perception of subsequent speech sounds. Here, using intracranial electrocorticography in humans, we investigate whether and how such context effects arise in auditory cortex. Participants identified speech sounds that were preceded by phrases from two different speakers whose voices differed along the same acoustic dimension as target words (the lowest resonance of the vocal tract). In every participant, target vowels evoke a speaker-dependent neural response that is consistent with the listener's perception, and which follows from a contrast enhancement model. Auditory cortex processing thus displays a critical feature of normalization, allowing listeners to extract meaningful content from the voices of diverse speakers.
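A minimal sketch of what a contrast enhancement model predicts here: the same ambiguous target vowel takes on a higher effective F1 after a low-F1 voice and a lower effective F1 after a high-F1 voice. The gain parameter k and all values are illustrative assumptions, not the paper's fitted model.

```python
# Toy contrast-enhancement model: the effective value of a target vowel's F1
# is pushed away from the mean F1 of the preceding speaker context.
def contrast_enhanced(target_f1, context_f1_frames, k=0.5):
    context_mean = sum(context_f1_frames) / len(context_f1_frames)
    return target_f1 + k * (target_f1 - context_mean)

ambiguous_target = 550.0                   # Hz, identical token in both contexts
low_f1_speaker  = [420, 430, 440, 425]     # preceding phrase, low-F1 voice
high_f1_speaker = [660, 670, 655, 665]     # preceding phrase, high-F1 voice

print(contrast_enhanced(ambiguous_target, low_f1_speaker))   # pushed higher
print(contrast_enhanced(ambiguous_target, high_f1_speaker))  # pushed lower
```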
Project description: American English (AE) speakers' perceptual assimilation of 14 North German (NG) and 9 Parisian French (PF) vowels was examined in two studies using citation-form disyllables (study 1) and sentences with vowels surrounded by labial and alveolar consonants in multisyllabic nonsense words (study 2). Listeners categorized multiple tokens of each NG and PF vowel as most similar to selected AE vowels and rated their category "goodness" on a nine-point Likert scale. Front, rounded vowels were assimilated primarily to back AE vowels, despite their acoustic similarity to front AE vowels. In study 1, they were considered poorer exemplars of AE vowels than were NG and PF back, rounded vowels; in study 2, front and back, rounded vowels were perceived as similar to each other. Assimilation of some front, unrounded and back, rounded NG and PF vowels varied with language, speaking style, and consonantal context. Differences in perceived similarity often could not be predicted from context-specific cross-language spectral similarities. Results suggest that listeners can access context-specific, phonetic details when listening to citation-form materials, but assimilate non-native vowels on the basis of context-independent phonological equivalence categories when processing continuous speech. Results are interpreted within the Automatic Selective Perception model of speech perception.
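For concreteness, assimilation results of this kind are typically summarized by the modal response category and the mean goodness rating within it, as in this toy sketch (hypothetical responses, not the study's data):

```python
# Toy (chosen AE category, goodness 1-9) responses for one non-native vowel.
from collections import Counter

responses = [("u", 4), ("u", 3), ("i", 6), ("u", 4), ("u", 5), ("i", 5)]

modal, count = Counter(cat for cat, _ in responses).most_common(1)[0]
goodness = [g for cat, g in responses if cat == modal]
print(modal,                         # modal AE category
      count / len(responses),       # proportion of responses in it
      sum(goodness) / len(goodness))  # mean goodness within it
```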
Project description: The impact of clear speech, increased vocal intensity, and rate reduction on acoustic characteristics of vowels was compared in speakers with Parkinson's disease (PD), speakers with multiple sclerosis (MS), and healthy controls. Speakers read sentences in habitual, clear, loud, and slow conditions. Variations in clarity, intensity, and rate were stimulated using magnitude production. Formant frequency values for peripheral and nonperipheral vowels were obtained at 20%, 50%, and 80% of vowel duration to derive static and dynamic acoustic measures. Intensity and duration measures were obtained. Rate was maximally reduced in the slow condition, and vocal intensity was maximized in the loud condition. The clear condition also yielded a reduced articulatory rate and increased intensity, although less than for the slow or loud conditions. Overall, the clear condition had the most consistent impact on vowel spectral characteristics. Spectral and temporal distinctiveness for peripheral-nonperipheral vowel pairs was largely similar across conditions. Clear speech maximized peripheral and nonperipheral vowel space areas for speakers with PD and MS while also reducing rate and increasing vocal intensity. These results suggest that a speech style focused on increasing articulatory amplitude yields the most robust changes in vowel segmental articulation.
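The static and dynamic measures described here can be illustrated with a short sketch that samples a formant track at 20%, 50%, and 80% of vowel duration; the F2 track below is a toy array, not measured data.

```python
# Sample a per-frame F2 track (Hz) at proportional time points to get a
# static (midpoint) measure and a dynamic (onset-to-offset change) measure.
f2_track = [1200, 1260, 1340, 1420, 1500, 1560, 1600, 1620, 1630, 1635]

def at_fraction(track, frac):
    return track[min(int(frac * len(track)), len(track) - 1)]

f2_20, f2_50, f2_80 = (at_fraction(f2_track, f) for f in (0.2, 0.5, 0.8))
static_f2 = f2_50              # midpoint value
dynamic_f2 = f2_80 - f2_20     # spectral change across the vowel
print(f2_20, f2_50, f2_80, dynamic_f2)
```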
Project description: An algorithm that operates in real-time to enhance the salient features of speech is described and its efficacy is evaluated. The Contrast Enhancement (CE) algorithm implements dynamic compressive gain and lateral inhibitory sidebands across channels in a modified winner-take-all circuit, which together produce a form of suppression that sharpens the dynamic spectrum. Normal-hearing listeners identified spectrally smeared consonants (VCVs) and vowels (hVds) in quiet and in noise. Consonant and vowel identification, especially in noise, were improved by the processing. The amount of improvement did not depend on the degree of spectral smearing or talker characteristics. For consonants, when results were analyzed according to phonetic feature, the most consistent improvement was for place of articulation. This is encouraging for hearing aid applications because confusions between consonants differing in place are a persistent problem for listeners with sensorineural hearing loss.
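The core idea of lateral inhibitory sidebands can be sketched as each frequency channel being suppressed by a fraction of its neighbors' energy, which sharpens spectral peaks relative to their flanks. The kernel and inhibition strength below are illustrative, not the published CE algorithm.

```python
# Toy lateral inhibition across frequency channels: spectral peaks survive,
# flanks are suppressed, so the dynamic spectrum is sharpened.
import numpy as np

def lateral_inhibition(spectrum, inhibition=0.4):
    """Subtract a fraction of the mean neighboring channel energy."""
    padded = np.pad(spectrum, 1, mode="edge")
    neighbors = (padded[:-2] + padded[2:]) / 2   # mean of flanking channels
    return np.maximum(spectrum - inhibition * neighbors, 0.0)

smeared = np.array([1.0, 2.0, 3.5, 3.8, 3.5, 2.0, 1.0])  # broad spectral peak
print(lateral_inhibition(smeared))                         # sharper peak
```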
Project description: Automated speech recognition (ASR) systems, which use sophisticated machine-learning algorithms to convert spoken language to text, have become increasingly widespread, powering popular virtual assistants, facilitating automated closed captioning, and enabling digital dictation platforms for health care. Over the last several years, the quality of these systems has dramatically improved, due both to advances in deep learning and to the collection of large-scale datasets used to train the systems. There is concern, however, that these tools do not work equally well for all subgroups of the population. Here, we examine the ability of five state-of-the-art ASR systems-developed by Amazon, Apple, Google, IBM, and Microsoft-to transcribe structured interviews conducted with 42 white speakers and 73 black speakers. In total, this corpus spans five US cities and consists of 19.8 h of audio matched on the age and gender of the speaker. We found that all five ASR systems exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers. We trace these disparities to the underlying acoustic models used by the ASR systems as the race gap was equally large on a subset of identical phrases spoken by black and white individuals in our corpus. We conclude by proposing strategies-such as using more diverse training datasets that include African American Vernacular English-to reduce these performance differences and ensure speech recognition technology is inclusive.
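The reported metric is standard and easy to reproduce: word error rate is the word-level Levenshtein distance (substitutions, insertions, and deletions) divided by the number of reference words, as in this minimal sketch.

```python
# Word error rate via word-level edit distance (dynamic programming).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                     # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                     # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(ref)

print(wer("we went to the store", "we want to store"))  # 2 errors / 5 = 0.4
```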