Speech-like rhythm in a voiced and voiceless orangutan call.
ABSTRACT: The evolutionary origins of speech remain obscure. Recently, it was proposed that speech derived from monkey facial signals which exhibit a speech-like rhythm of ~5 open-close lip cycles per second. In monkeys, these signals may also be vocalized, offering a plausible evolutionary stepping stone towards speech. Three essential predictions remain, however, to be tested to assess this hypothesis' validity: (i) great apes, our closest relatives, should likewise produce 5 Hz-rhythm signals, (ii) speech-like rhythm should involve calls articulatorily similar to consonants and vowels, given that speech rhythm is the direct product of stringing together these two basic elements, and (iii) speech-like rhythm should be experience-based. Via cinematic analyses we demonstrate that an ex-entertainment orangutan produces two calls at a speech-like rhythm, coined "clicks" and "faux-speech." Like voiceless consonants, clicks required no vocal fold action, but did involve independent manoeuvring over lips and tongue. In parallel to vowels, faux-speech showed harmonic and formant modulations, implying vocal fold and supralaryngeal action. This rhythm was several times faster than orangutan chewing rates, as observed in monkeys and humans. Critically, this rhythm was seven-fold faster than, and contextually distinct from, any other known rhythmic calls described to date in the largest database of the orangutan repertoire ever assembled. The first two predictions advanced by this study are validated and, based on parsimony and exclusion of potential alternative explanations, initial support is given to the third prediction. Irrespective of the putative origins of these calls and underlying mechanisms, our findings demonstrate irrevocably that great apes are not respiratorily, articulatorily, or neurologically constrained for the production of consonant- and vowel-like calls at speech rhythm. Orangutan clicks and faux-speech confirm the importance of rhythmic speech antecedents within the primate lineage, and highlight potential articulatory homologies between great ape calls and human consonants and vowels.
Project description: A central challenge for articulatory speech synthesis is the simulation of realistic articulatory movements, which is critical for the generation of highly natural and intelligible speech. This includes modeling coarticulation, i.e., the context-dependent variation of the articulatory and acoustic realization of phonemes, especially of consonants. Here we propose a method to simulate the context-sensitive articulation of consonants in consonant-vowel syllables. To achieve this, the vocal tract target shape of a consonant in the context of a given vowel is derived as the weighted average of three measured and acoustically optimized reference vocal tract shapes for that consonant in the context of the corner vowels /a/, /i/, and /u/. The weights are determined by mapping the target shape of the given context vowel into the vowel subspace spanned by the corner vowels. The model was applied to the synthesis of consonant-vowel syllables with the consonants /b/, /d/, /g/, /l/, /r/, /m/, /n/ in all combinations with the eight long German vowels. In a perception test, the mean recognition rate for the consonants in the isolated syllables was 82.4%. This demonstrates the potential of the approach for highly intelligible articulatory speech synthesis.
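As a rough illustration of the weighting scheme described above, the following Python sketch computes least-squares weights that place a context vowel's vocal tract shape in the subspace spanned by /a/, /i/, /u/ and then forms the consonant target as the corresponding weighted average of the three reference shapes. Function names, the vector dimensionality, and the sum-to-one normalization are assumptions for illustration, not the authors' implementation.

import numpy as np

def subspace_weights(v_context, v_a, v_i, v_u):
    # Least-squares weights expressing a context vowel's vocal tract shape
    # in the subspace spanned by the corner vowels /a/, /i/, /u/.
    # Normalizing the weights to sum to one is a simplifying assumption.
    B = np.column_stack([v_a, v_i, v_u])
    w, *_ = np.linalg.lstsq(B, v_context, rcond=None)
    return w / w.sum()

def consonant_target(w, c_a, c_i, c_u):
    # Context-sensitive consonant target: weighted average of the three
    # measured reference shapes of that consonant in /a/, /i/, /u/ context.
    return w[0] * c_a + w[1] * c_i + w[2] * c_u

# Toy example with 4-dimensional shape vectors (real vocal tract models use many more parameters).
rng = np.random.default_rng(0)
v_a, v_i, v_u = rng.random(4), rng.random(4), rng.random(4)
v_e = 0.5 * v_i + 0.3 * v_a + 0.2 * v_u   # hypothetical context vowel, e.g. a long /e/
c_a, c_i, c_u = rng.random(4), rng.random(4), rng.random(4)
print(consonant_target(subspace_weights(v_e, v_a, v_i, v_u), c_a, c_i, c_u))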
Project description: PURPOSE: To develop and evaluate a technique for 3D dynamic MRI of the full vocal tract at high temporal resolution during natural speech. METHODS: We demonstrate 2.4 × 2.4 × 5.8 mm3 spatial resolution, 61-ms temporal resolution, and a 200 × 200 × 70 mm3 FOV. The proposed method uses 3D gradient-echo imaging with a custom upper-airway coil, a minimum-phase slab excitation, a stack-of-spirals readout, a pseudo golden-angle view order in kx-ky, a linear Cartesian order along kz, and a spatiotemporal finite-difference constrained reconstruction, with 13-fold acceleration. This technique is evaluated using in vivo vocal tract airway data from 2 healthy subjects acquired on a 1.5T scanner, 1 with synchronized audio, with 2 tasks during production of natural speech, and via comparison with interleaved multislice 2D dynamic MRI. RESULTS: This technique captured known dynamics of vocal tract articulators during natural speech tasks, including tongue gestures during the production of the consonants "s" and "l" and of consonant-vowel syllables, and was additionally consistent with 2D dynamic MRI. Coordination of lingual (tongue) movements for consonants is demonstrated via volume-of-interest analysis. Vocal tract area function dynamics revealed critical lingual constriction events along the length of the vocal tract for consonants and vowels. CONCLUSION: We demonstrate the feasibility of 3D dynamic MRI of the full vocal tract, with spatiotemporal resolution adequate to visualize lingual movements for consonants and vocal tract shaping during natural productions of consonant-vowel syllables, without requiring multiple repetitions.
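As a hedged sketch of what a spatiotemporal finite-difference constrained reconstruction typically minimizes, the snippet below evaluates a generic cost of the form data fidelity plus L1 penalties on temporal and spatial finite differences. The forward operator, penalty weights, and array shapes are assumptions for illustration, not the paper's exact formulation.

import numpy as np

def recon_cost(x, A, y, lam_t, lam_s):
    # Generic constrained-reconstruction cost: ||A(x) - y||^2 + lam_t*||D_t x||_1 + lam_s*||D_s x||_1.
    # x is an image time series of shape (nt, nz, ny, nx); A is a forward model
    # returning simulated k-space data of the same shape as the measurements y.
    data_fit = np.sum(np.abs(A(x) - y) ** 2)
    tv_time = np.sum(np.abs(np.diff(x, axis=0)))                       # temporal finite differences
    tv_space = sum(np.sum(np.abs(np.diff(x, axis=ax))) for ax in (1, 2, 3))  # spatial finite differences
    return data_fit + lam_t * tv_time + lam_s * tv_space

# Toy usage with a trivial "forward model" (identity) on a small image series.
x = np.zeros((5, 4, 8, 8))
y = np.ones_like(x)
print(recon_cost(x, lambda img: img, y, lam_t=0.1, lam_s=0.01))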
Project description: Virtually every human faculty engages with imitation. One of the most natural and unexplored objects for the study of the mimetic elements in language is the onomatopoeia, as it implies an imitative-driven transformation of a sound of nature into a word. Notably, simple sounds are transformed into complex strings of vowels and consonants, making it difficult to identify what is acoustically preserved in this operation. In this work we propose a definition for vocal imitation by which sounds are transformed into the speech elements that minimize their spectral difference within the constraints of the vocal system. In order to test this definition, we use a computational model that allows recovering anatomical features of the vocal system from experimental sound data. We explore the vocal configurations that best reproduce non-speech sounds, like striking blows on a door or the sharp sounds generated by pressing on light switches or computer mouse buttons. From the anatomical point of view, the configurations obtained are readily associated with co-articulated consonants, and we show perceptual evidence that these consonants are positively associated with the original sounds. Moreover, the vowel-consonant pairs that compose these co-articulations correspond to the most stable syllables found in the knock and click onomatopoeias across languages, suggesting a mechanism by which vocal imitation naturally embeds single sounds into more complex speech structures. Other mimetic forces have received extensive attention from the scientific community, such as cross-modal associations between speech and visual categories. The present approach helps build a global view of the mimetic forces acting on language and opens a new avenue for a quantitative study of word formation in terms of vocal imitation.
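A minimal sketch of the proposed definition of vocal imitation, assuming candidate spectra synthesized from admissible vocal configurations and a simple log-spectral distance; the paper's actual articulatory model and spectral metric may differ, and the names and toy data below are illustrative.

import numpy as np

def spectral_distance(spec_a, spec_b):
    # Euclidean distance between two log-magnitude spectra (one of many possible
    # spectral difference measures; the study's exact metric may differ).
    return np.linalg.norm(np.log(spec_a + 1e-12) - np.log(spec_b + 1e-12))

def best_imitation(target_spec, candidate_specs):
    # Pick the vocal configuration whose synthesized spectrum is closest to the
    # target non-speech sound, i.e. the imitation under vocal constraints.
    names = list(candidate_specs)
    dists = [spectral_distance(target_spec, candidate_specs[n]) for n in names]
    return names[int(np.argmin(dists))]

# Toy usage: hypothetical spectra for a door knock and two candidate articulations.
rng = np.random.default_rng(1)
knock = rng.random(64)
candidates = {"/ko/": knock + 0.05 * rng.random(64), "/si/": rng.random(64)}
print(best_imitation(knock, candidates))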
Project description: PURPOSE: Infectious agents, such as SARS-CoV-2, can be carried by droplets expelled during breathing. The spatial dissemination of droplets varies according to their initial velocity. After a short literature review, our goal was to determine the velocity of the exhaled air during vocal exercises. METHODS: A propylene glycol cloud produced by 2 e-cigarette users allowed visualization of the exhaled air emitted during vocal exercises. Airflow velocities were measured during the first 200 ms of a long exhalation, a sustained vowel /a/, and varied vocal exercises. For the long exhalation and the sustained vowel /a/, the decrease of airflow velocity was measured until 3 s. Results were compared with a Computational Fluid Dynamics (CFD) study using boundary conditions consistent with our experimental study. RESULTS: Regarding the production of vowels, higher velocities were found in loud and whispered voices than in normal voice. Voiced consonants like /?/ or /v/ generated higher velocities than vowels. Some voiceless consonants, e.g., /t/, generated high velocities, but long exhalation had the highest velocities. Semi-occluded vocal tract exercises generated faster airflow velocities than loud speech, with a decreased velocity during voicing. The initial velocity quickly decreased, as was shown during a long exhalation or a sustained vowel /a/. Velocities were consistent with the CFD data. CONCLUSION: The initial velocity of the exhaled air is a key factor influencing droplet trajectory. Our study revealed that vocal exercises produce a slower airflow than long exhalation. Speech therapy should, therefore, not be associated with an increased risk of contamination when implementing standard recommendations.
Project description: Speech sounds are traditionally divided into consonants and vowels. When only vowels or only consonants are replaced by noise, listeners are more accurate at understanding sentences in which consonants are replaced but vowels remain. From such data, vowels have been suggested to be more important for understanding sentences; however, such conclusions are tempered by the fact that the replaced consonant segments were roughly one-third shorter than the vowels. We report two experiments that demonstrate listener performance to be better predicted by simple psychoacoustic measures of cochlea-scaled spectral change across time. First, listeners identified sentences in which portions of consonants (C), vowels (V), CV transitions, or VC transitions were replaced by noise. Relative intelligibility was not well accounted for on the basis of Cs, Vs, or their transitions. In a second experiment, distinctions between Cs and Vs were abandoned. Instead, portions of sentences were replaced on the basis of cochlea-scaled spectral entropy (CSE). Sentence segments having relatively high, medium, or low entropy were replaced with noise. Intelligibility decreased linearly as the amount of replaced CSE increased. The duration of the signal replaced and the proportion of consonants/vowels replaced fail to account for the listener data. CSE corresponds closely with the linguistic construct of sonority (or vowel-likeness), which is useful for describing phonological systematicity, especially syllable composition. The results challenge traditional distinctions between consonants and vowels. Speech intelligibility is better predicted by nonlinguistic sensory measures of uncertainty (potential information) than by orthodox physical acoustic measures or linguistic constructs.
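A simplified sketch of how sentence segments could be ranked by cochlea-scaled spectral change: each frame is an auditory-filterbank spectral slice, the change measure is the Euclidean distance between successive slices, and fixed-length segments are ranked by the total change they carry so that high-, medium-, or low-entropy segments can be selected for noise replacement. The filterbank, frame representation, and segment length are illustrative assumptions, not the study's exact parameters.

import numpy as np

def spectral_change_profile(frames_db):
    # Frame-to-frame spectral change as Euclidean distance between successive
    # cochlea-scaled (e.g., ERB- or mel-spaced) spectral slices; a simplified
    # stand-in for cochlea-scaled entropy. frames_db has shape (n_frames, n_bands).
    return np.linalg.norm(np.diff(frames_db, axis=0), axis=1)

def rank_segments(change, seg_len):
    # Split the change profile into fixed-length segments and rank them by the
    # amount of spectral change they carry (highest change first).
    n_seg = len(change) // seg_len
    totals = change[:n_seg * seg_len].reshape(n_seg, seg_len).sum(axis=1)
    return np.argsort(totals)[::-1]

# Toy usage with random band energies (real input: auditory filterbank output in dB).
rng = np.random.default_rng(2)
frames = rng.random((200, 32))
print(rank_segments(spectral_change_profile(frames), seg_len=10)[:5])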
Project description: The speech signal contains many acoustic properties that may contribute differently to spoken word recognition. Previous studies have demonstrated that the importance of properties present during consonants or vowels is dependent upon the linguistic context (i.e., words versus sentences). The current study investigated three potentially informative acoustic properties that are present during consonants and vowels for monosyllabic words and sentences. Natural variations in fundamental frequency were either flattened or removed. The speech envelope and temporal fine structure were also investigated by limiting the availability of these cues via noisy signal extraction. Thus, this study investigated the contribution of these acoustic properties, present during either consonants or vowels, to overall word and sentence intelligibility. Results demonstrated that all processing conditions displayed better performance for vowel-only sentences. Greater performance with vowel-only sentences remained, despite removing dynamic cues of the fundamental frequency. Word and sentence comparisons suggest that the speech envelope may be at least partially responsible for additional vowel contributions in sentences. Results suggest that speech information transmitted by the envelope is responsible, in part, for greater vowel contributions in sentences, but is not predictive for isolated words.
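A minimal sketch of how the two cues under study, temporal envelope and temporal fine structure, can be separated from a band-limited signal via the Hilbert analytic signal. This is a common decomposition, not necessarily the study's exact extraction scheme, and the toy signal parameters below are assumptions.

import numpy as np
from scipy.signal import hilbert

def envelope_and_tfs(x):
    # Split a band-limited signal into its slow temporal envelope and its
    # unit-amplitude temporal fine structure carrier via the analytic signal.
    analytic = hilbert(x)
    env = np.abs(analytic)
    tfs = np.cos(np.angle(analytic))
    return env, tfs

# Toy usage: an amplitude-modulated 1 kHz tone; the envelope recovers the 4 Hz modulation.
fs = 16000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
env, tfs = envelope_and_tfs(x)
print(env.max(), env.min())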
Project description: Several stories in the popular media have speculated that it may be possible to infer from the brain which word a person is speaking or even thinking. While recent studies have demonstrated that brain signals can give detailed information about actual and imagined actions, such as different types of limb movements or spoken words, concrete experimental evidence for the possibility to 'read the mind', i.e. to interpret internally generated speech, has been scarce. In this study, we found that it is possible to use signals recorded from the surface of the brain (electrocorticography) to discriminate the vowels and consonants embedded in spoken and in imagined words, and we defined the cortical areas that held the most information about the discrimination of vowels and consonants. The results shed light on the distinct mechanisms associated with the production of vowels and consonants, and could provide the basis for brain-based communication using imagined speech.
Project description: The obstruent consonants (e.g., stops) are more susceptible to noise than vowels, raising the question of whether the degradation of speech intelligibility in noise can be attributed, at least partially, to the loss of information carried by obstruent consonants. Experiment 1 assesses the contribution of obstruent consonants to speech recognition in noise by presenting sentences containing clean obstruent consonants but noise-corrupted voiced sounds (e.g., vowels). Results indicated a substantial (threefold) improvement in speech recognition, particularly at low signal-to-noise ratio levels (-5 dB). Experiment 2 assessed the importance of providing partial information, within a frequency region, of the obstruent-consonant spectra while leaving the remaining spectral region unaltered (i.e., noise corrupted). Access to the low-frequency (0-1000 Hz) region of the clean obstruent-consonant spectra was found to be sufficient to realize significant improvements in performance, which was attributed to improved transmission of voicing information. The outcomes from the two experiments suggest that much of the improvement in performance must be due to enhanced access to acoustic landmarks, evident in spectral discontinuities signaling the onsets of obstruent consonants. These landmarks, often blurred in noisy conditions, are critically important for understanding speech in noise, as they enable better determination of syllable structure and word boundaries.
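A hedged sketch of the kind of stimulus construction described for Experiment 1: noise is mixed only into the voiced (e.g., vowel) samples at a target SNR while obstruent-consonant samples stay clean. The masking scheme and gain computation are illustrative assumptions; in practice the voiced/consonant boundaries would come from phonetic annotations.

import numpy as np

def mix_noise_selectively(x, noise, voiced_mask, snr_db):
    # Add noise only over voiced samples at the requested SNR, leaving
    # obstruent-consonant samples clean.
    voiced = x[voiced_mask]
    noise_voiced = noise[voiced_mask]
    gain = np.sqrt(np.mean(voiced ** 2) / (np.mean(noise_voiced ** 2) * 10 ** (snr_db / 10)))
    y = x.copy()
    y[voiced_mask] += gain * noise_voiced
    return y

# Toy usage: mark the second half of a signal as "voiced" and corrupt it at -5 dB SNR.
rng = np.random.default_rng(3)
sig, noise = rng.standard_normal(1000), rng.standard_normal(1000)
mask = np.zeros(1000, dtype=bool)
mask[500:] = True
out = mix_noise_selectively(sig, noise, mask, snr_db=-5)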
Project description: Recent evidence suggests that spectral change, as measured by cochlea-scaled entropy (CSE), predicts speech intelligibility better than the information carried by vowels or consonants in sentences. Motivated by this finding, the present study investigates whether intelligibility indices implemented to include segments marked with significant spectral change better predict speech intelligibility in noise than measures that include all phonetic segments without regard to vowels/consonants or spectral change. The prediction of two intelligibility measures [the normalized covariance measure (NCM) and the coherence-based speech intelligibility index (CSII)] is investigated using three sentence-segmentation methods: relative root-mean-square (RMS) levels, CSE, and traditional phonetic segmentation of obstruents and sonorants. While the CSE method makes no distinction between spectral changes occurring within vowels/consonants, the RMS-level segmentation method places more emphasis on the vowel-consonant boundaries, wherein the spectral change is often most prominent, and perhaps most robust, in the presence of noise. Higher correlation with intelligibility scores was obtained when including sentence segments containing a large number of consonant-vowel boundaries than when including segments with the highest entropy or segments based on obstruent/sonorant classification. These data suggest that, in the context of intelligibility measures, the type of spectral change captured by the measure is important.
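A simplified sketch of relative-RMS segmentation, one of the three segmentation methods above: frames are classified as high-, mid-, or low-level relative to the whole-utterance RMS. The frame length and dB thresholds are illustrative assumptions, not the study's parameters.

import numpy as np

def rms_level_segments(x, frame_len, high_db=0.0, low_db=-10.0):
    # Classify frames by their RMS level relative to the whole-utterance RMS,
    # so that intelligibility measures can be restricted to selected segments.
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    rel_db = 20 * np.log10(frame_rms / (np.sqrt(np.mean(x ** 2)) + 1e-12))
    return np.where(rel_db >= high_db, "high", np.where(rel_db >= low_db, "mid", "low"))

# Toy usage on a signal that alternates between loud and quiet portions.
rng = np.random.default_rng(4)
x = np.concatenate([rng.standard_normal(800), 0.1 * rng.standard_normal(800)])
print(rms_level_segments(x, frame_len=160))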
Project description: Weak consonants (e.g., stops) are more susceptible to noise than vowels, owing partially to their lower intensity. This raises the question of whether hearing-impaired (HI) listeners are able to perceive (and utilize effectively) the high-frequency cues present in consonants. To answer this question, HI listeners were presented with clean (noise-absent) weak consonants in otherwise noise-corrupted sentences. Results indicated that HI listeners received a significant benefit in intelligibility (a 4 dB decrease in speech reception threshold) when they had access to clean consonant information. At extremely low signal-to-noise ratio (SNR) levels, however, HI listeners received only 64% of the benefit obtained by normal-hearing listeners. This lack of equitable benefit was investigated in Experiment 2 by testing the hypothesis that the high-frequency cues present in consonants were not audible to HI listeners. This was tested by selectively amplifying the noisy consonants while leaving the noisy sonorant sounds (e.g., vowels) unaltered. Listening tests indicated small (∼10%), but statistically significant, improvements in intelligibility at low SNR conditions when the consonants were amplified in the high-frequency region. Selective consonant amplification provided reliable low-frequency acoustic landmarks that in turn facilitated a better lexical segmentation of the speech stream and contributed to the small improvement in intelligibility.
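A hedged sketch of the kind of selective amplification tested in Experiment 2: only the high-frequency portion of the (noisy) consonant segments is boosted, while sonorant segments are left untouched. The filter order, cutoff frequency, and gain are illustrative assumptions, and consonant boundaries are assumed to come from annotations.

import numpy as np
from scipy.signal import butter, sosfilt

def amplify_consonant_highs(x, consonant_mask, fs, cutoff_hz=1000, gain_db=10):
    # Boost the high-frequency band of consonant samples only, by adding a
    # scaled high-pass-filtered copy of the signal over those samples.
    sos = butter(4, cutoff_hz, btype="highpass", fs=fs, output="sos")
    highs = sosfilt(sos, x)
    y = x.copy()
    y[consonant_mask] += (10 ** (gain_db / 20) - 1) * highs[consonant_mask]
    return y

# Toy usage: mark the first 400 samples as consonant and boost their high frequencies.
rng = np.random.default_rng(5)
sig = rng.standard_normal(1600)
mask = np.zeros(1600, dtype=bool)
mask[:400] = True
out = amplify_consonant_highs(sig, mask, fs=16000)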