ABSTRACT: Visual information from a speaker's face can enhance or interfere with accurate auditory perception. This integration of information across auditory and visual streams has been observed in functional imaging studies, and has typically been attributed to the frequency and robustness with which perceivers jointly encounter event-specific information from these two modalities. Adding the tactile modality has long been considered a crucial next step in understanding multisensory integration. However, previous studies have found an influence of tactile input on speech perception only under limited circumstances, either where perceivers were aware of the task or where they had received training to establish a cross-modal mapping. Here we show that perceivers integrate naturalistic tactile information during auditory speech perception without previous training. Drawing on the observation that some speech sounds produce tiny bursts of aspiration (such as English 'p'), we applied slight, inaudible air puffs on participants' skin at one of two locations: the right hand or the neck. Syllables heard simultaneously with cutaneous air puffs were more likely to be heard as aspirated (for example, causing participants to mishear 'b' as 'p'). These results demonstrate that perceivers integrate event-relevant tactile information in auditory perception in much the same way as they do visual information.
Project description:Speech perception is thought to be linked to speech motor production. This linkage is considered to mediate multimodal aspects of speech perception, such as audio-visual and audio-tactile integration. However, direct coupling between articulatory movement and auditory perception has been little studied. The present study reveals a clear dissociation between the effects of a listener's own speech actions and the effects of viewing another's speech movements on the perception of auditory phonemes. We assessed the intelligibility of the syllables [pa], [ta], and [ka] while listeners silently and simultaneously articulated syllables that were congruent or incongruent with the syllables they heard. Intelligibility was compared with a condition in which the listeners simultaneously watched another person's mouth producing congruent or incongruent syllables but did not articulate. The intelligibility of [ta] and [ka] was degraded by articulating [ka] and [ta], respectively, which share the same primary articulator (the tongue) as the heard syllables, but was not affected by articulating [pa], which involves a different primary articulator (the lips). In contrast, the intelligibility of [ta] and [ka] was degraded by watching the production of [pa]. These results indicate that articulation-induced distortion of speech perception occurs in an articulator-specific manner, whereas visually induced distortion does not. The articulator-specific nature of the auditory-motor interaction in speech perception suggests that speech motor processing contributes directly to our ability to hear speech.
Project description:Speech perception is influenced by vision through a process of audiovisual integration. This is demonstrated by the McGurk illusion, where visual speech (for example /ga/) dubbed with incongruent auditory speech (such as /ba/) leads to a modified auditory percept (/da/). Recent studies have indicated that perception of the incongruent speech stimuli used in McGurk paradigms involves mechanisms of both general and audiovisual-speech-specific mismatch processing, and that general mismatch processing modulates induced theta-band (4-8 Hz) oscillations. Here, we investigated whether the theta modulation merely reflects mismatch processing or, alternatively, audiovisual integration of speech. We used electroencephalographic recordings from two previously published studies using audiovisual sine-wave speech (SWS), a spectrally degraded speech signal that sounds nonsensical to naïve perceivers but is perceived as speech by informed subjects. Earlier studies have shown that informed, but not naïve, subjects integrate SWS phonetically with visual speech. In an N1/P2 event-related potential paradigm, we found a significant difference in theta-band activity between informed and naïve perceivers of audiovisual speech, suggesting that audiovisual integration modulates induced theta-band oscillations. In a McGurk mismatch negativity (MMN) paradigm, where infrequent McGurk stimuli were embedded in a sequence of frequent audio-visually congruent stimuli, we found no difference between congruent and McGurk stimuli. The infrequent stimuli in this paradigm violate both the general prediction of stimulus content and that of audiovisual congruence. Hence, we found no support for the hypothesis that audiovisual mismatch modulates induced theta-band oscillations. We also did not find any effects of audiovisual integration in the MMN paradigm, possibly due to the experimental design.
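As an illustration of the induced-oscillation measure mentioned above, here is a minimal Python sketch of how induced (non-phase-locked) theta-band power could be estimated from single-trial EEG epochs. The array shapes, sampling rate, and variable names are illustrative assumptions, not the authors' pipeline.

# Minimal sketch (not the authors' pipeline): estimating induced theta-band
# (4-8 Hz) power from single-trial EEG epochs with NumPy/SciPy.
# Assumed input: epochs with shape (n_trials, n_samples) for one channel,
# sampled at fs Hz; all names here are illustrative.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def induced_theta_power(epochs, fs, band=(4.0, 8.0)):
    """Return the trial-averaged induced (non-phase-locked) theta power envelope."""
    # Remove the evoked (phase-locked) response so only induced activity remains.
    induced = epochs - epochs.mean(axis=0, keepdims=True)
    # Band-pass filter each trial in the theta band (zero-phase).
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, induced, axis=-1)
    # Instantaneous power from the Hilbert amplitude envelope, averaged over trials.
    power = np.abs(hilbert(filtered, axis=-1)) ** 2
    return power.mean(axis=0)

# Example: compare informed vs. naive perceivers (synthetic data).
fs = 250
rng = np.random.default_rng(0)
informed = rng.standard_normal((60, fs))   # 60 trials x 1 s
naive = rng.standard_normal((60, fs))
theta_difference = induced_theta_power(informed, fs) - induced_theta_power(naive, fs)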
Project description:Speech and emotion perception are dynamic processes in which it may be optimal to integrate synchronous signals emitted from different sources. Studies of audio-visual (AV) perception of neutrally expressed speech demonstrate supra-additive responses (i.e., AV > [unimodal auditory + unimodal visual]) in left STS to crossmodal speech stimuli. However, emotions are often conveyed simultaneously with speech: through the voice in the form of speech prosody and through the face in the form of facial expression. Previous studies of AV nonverbal emotion integration showed a role for the right (rather than left) STS. The current study therefore examined whether the integration of facial and prosodic signals of emotional speech is associated with supra-additive responses in left STS (cf. results for speech integration) or right STS (due to emotional content). As emotional displays are sometimes difficult to interpret, we also examined whether supra-additive responses were affected by emotional incongruence (i.e., ambiguity). Using magnetoencephalography, we continuously recorded neural activity from eighteen participants as they viewed and heard emotionally congruent and emotionally incongruent AV speech stimuli. Significant supra-additive responses were observed in right STS within the first 250 ms for both emotionally incongruent and emotionally congruent AV speech stimuli, further underscoring the role of right STS in processing crossmodal emotive signals.
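The supra-additivity criterion above (AV > [unimodal auditory + unimodal visual]) amounts to a simple contrast on condition-averaged responses. The sketch below, with hypothetical array shapes and names, shows one way it could be computed before any statistical testing; it is not the authors' analysis code.

# Minimal sketch (illustrative, not the authors' analysis): the
# supra-additivity contrast AV - (A + V) on trial-averaged responses.
# Assumed inputs: evoked arrays of shape (n_sensors, n_times) for the
# audiovisual (av), auditory-only (a), and visual-only (v) conditions.
import numpy as np

def supra_additive_contrast(av, a, v):
    """Return AV - (A + V); positive values exceed the sum of unimodal responses."""
    return av - (a + v)

# Example with synthetic data for 100 sensors and 250 time points.
rng = np.random.default_rng(1)
av, a, v = (rng.standard_normal((100, 250)) for _ in range(3))
contrast = supra_additive_contrast(av, a, v)
candidate_points = contrast > 0   # sensor/time points to submit to statistics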
Project description:Visual information about lip and facial movements plays a role in audiovisual (AV) speech perception. Although this has been widely confirmed, previous behavioural studies have shown interlanguage differences: native Japanese speakers do not integrate auditory and visual speech as closely as native English speakers. To elucidate the neural basis of such interlanguage differences, 22 native English speakers and 24 native Japanese speakers were examined in behavioural or functional magnetic resonance imaging (fMRI) experiments while monosyllabic speech was presented under AV, auditory-only, or visual-only conditions for speech identification. Behavioural results indicated that the English speakers identified visual speech more quickly than the Japanese speakers, and that the temporal facilitation effect of congruent visual speech was significant in the English speakers but not in the Japanese speakers. Using the fMRI data, we examined functional connectivity among brain regions important for auditory-visual interplay. The results indicated that the English speakers had significantly stronger connectivity between the visual motion area MT and Heschl's gyrus than the Japanese speakers, which may subserve lower-level visual influences on speech perception in English speakers in a multisensory environment. These results suggest that linguistic experience strongly affects the neural connectivity involved in AV speech integration.
Project description:The McGurk illusion is experienced to varying degrees among the general population. Previous studies have implicated the left superior temporal sulcus (STS) and auditory cortex (AC) as regions associated with this interindividual variability. We sought to further investigate the neurophysiology underlying this variability using a variant of the McGurk illusion design. Electroencephalography (EEG) was recorded while human subjects were presented with videos of a speaker uttering the consonant-vowel syllables (CVs) /ba/ and /fa/, which were mixed and matched with audio of /ba/ and /fa/ to produce congruent and incongruent conditions. Subjects were also presented with unimodal stimuli: silent videos and audio recordings of the CVs. They responded to whether they heard (or saw, in the silent condition) /ba/ or /fa/. The illusion on incongruent trials was deemed successful when individuals heard the syllable conveyed by the mouth movements. We hypothesized that individuals who experience the illusion more strongly should exhibit more robust desynchronization of alpha (7-12 Hz) at fronto-central and temporal sites, reflecting greater engagement of neural generators in the AC and STS. We found, however, that compared to weaker illusion perceivers, stronger illusion perceivers exhibited greater alpha synchronization at fronto-central and posterior temporal sites, which is consistent with inhibition of auditory representations. These findings suggest that stronger McGurk illusion perceivers possess more robust cross-modal sensory gating mechanisms whereby phonetic representations not conveyed by the visual system are inhibited, in turn reinforcing perception of the visually targeted phonemes.
Project description:Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally-leading visual speech information contributes to perception. Previous studies exploring audiovisual-speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others randomly across trials. Variability in participants' responses (~35 % identification of /apa/ compared to ~5 % in the absence of the masker) served as the basis for classification analysis. The outcome was a high resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally-leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content.
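The classification procedure described above can be illustrated with a classification-image style computation relating per-trial mask visibility to the /apa/ responses. The following sketch uses hypothetical inputs (visibility, said_apa) and is not the published analysis code.

# Minimal sketch (illustrative assumptions, not the published procedure):
# a classification-image analysis over video frames.
# Assumed inputs: visibility with shape (n_trials, n_frames), giving how much
# of the mouth region was revealed in each frame on each trial, and said_apa,
# a boolean vector of the /apa/ yes-no responses.
import numpy as np

def classification_image(visibility, said_apa):
    """Frames whose visibility covaries with hearing /apa/ receive large weights."""
    said_apa = np.asarray(said_apa, dtype=bool)
    # Difference of mean masks: frames revealed more often on /apa/ trials
    # than on non-/apa/ trials are the perceptually relevant ones.
    return visibility[said_apa].mean(axis=0) - visibility[~said_apa].mean(axis=0)

# Example with synthetic data: 500 trials, 45 video frames.
rng = np.random.default_rng(2)
visibility = rng.random((500, 45))
said_apa = rng.random(500) < 0.35          # roughly 35% /apa/ identifications
frame_weights = classification_image(visibility, said_apa)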
Project description:During natural speech perception, humans must parse temporally continuous auditory and visual speech signals into sequences of words. However, most studies of speech perception present only single words or syllables. We used electrocorticography (subdural electrodes implanted on the brains of epileptic patients) to investigate the neural mechanisms for processing continuous audiovisual speech signals consisting of individual sentences. Using partial correlation analysis, we found that posterior superior temporal gyrus (pSTG) and medial occipital cortex tracked both the auditory and the visual speech envelopes. These same regions, as well as inferior temporal cortex, responded more strongly to a dynamic video of a talking face compared to auditory speech paired with a static face. Occipital cortex and pSTG carry temporal information about both auditory and visual speech dynamics. Visual speech tracking in pSTG may be a mechanism for enhancing perception of degraded auditory speech.
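A partial correlation of the kind mentioned above, relating a neural time course to the auditory speech envelope while controlling for the visual envelope (and vice versa), can be computed by correlating regression residuals. The sketch below uses synthetic data and illustrative names only; it is not the authors' code.

# Minimal sketch (illustrative, not the authors' pipeline): partial correlation
# between a neural time course and one speech envelope, controlling for the other.
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation between x and y after regressing z out of both."""
    z = np.column_stack([np.ones_like(z), z])
    rx = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
    ry = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Example: a simulated electrode time course vs. the two (correlated) envelopes.
rng = np.random.default_rng(3)
auditory_env = rng.random(3000)
visual_env = 0.5 * auditory_env + 0.5 * rng.random(3000)   # envelopes co-vary
neural = auditory_env + 0.3 * rng.standard_normal(3000)
r_aud = partial_corr(neural, auditory_env, visual_env)      # auditory tracking
r_vis = partial_corr(neural, visual_env, auditory_env)      # visual tracking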
Project description:Speech, for most of us, is a bimodal percept whenever we both hear the voice and see the lip movements of a speaker. Children who are born deaf never have this bimodal experience. We tested children who had been deaf from birth and who subsequently received cochlear implants for their ability to fuse the auditory information provided by their implants with visual information about lip movements for speech perception. For most of the children with implants (92%), perception was dominated by vision when visual and auditory speech information conflicted. For some, bimodal fusion was strong and consistent, demonstrating a remarkable plasticity in their ability to form auditory-visual associations despite the atypical stimulation provided by implants. The likelihood of consistent auditory-visual fusion declined with age at implant beyond 2.5 years, suggesting a sensitive period for bimodal integration in speech perception.
Project description:Under delayed auditory feedback (DAF), a speaker hears their own voice after a delay, which leads to nonfluent speech. Previous studies suggested the involvement of attention to auditory feedback in speech disfluency. To date, no studies have examined the relationship between attention and nonfluent speech by controlling the attention allocated to the delayed own voice. This study examined these issues under three conditions: a single task where the subject was asked to read aloud under DAF (single DAF task), a dual task where the subject was asked to read aloud while reacting to a pure tone (auditory DAF task), and a dual task where the subject was asked to read aloud while reacting to a vibration applied to the finger (tactile DAF task). The subjects also performed the single and dual tasks (auditory/tactile) under nonaltered auditory feedback, where no delayed voice was involved. Results showed that the nonfluency rate under the auditory DAF task was significantly greater than that under the single DAF task. In contrast, the nonfluency rate under the tactile DAF task was significantly lower than that under the single DAF task. Speech became nonfluent when attention was captured by a stimulus in the same modality, i.e., the auditory tone, and became fluent when attention was allocated to a stimulus irrelevant to the auditory modality, i.e., the tactile vibration. This indicates that nonfluent speech under DAF reflects attention capture by the speaker's own delayed voice.
Project description:Human faces contain multiple sources of information. During speech perception, visual information from the talker's mouth is integrated with auditory information from the talker's voice. By directly recording neural responses from small populations of neurons in patients implanted with subdural electrodes, we found enhanced visual cortex responses to speech when auditory speech was absent (rendering visual speech especially relevant). Receptive field mapping demonstrated that this enhancement was specific to regions of the visual cortex with retinotopic representations of the mouth of the talker. Connectivity between frontal cortex and other brain regions was measured with trial-by-trial power correlations. Strong connectivity was observed between frontal cortex and mouth regions of visual cortex; connectivity was weaker between frontal cortex and non-mouth regions of visual cortex or auditory cortex. These results suggest that top-down selection of visual information from the talker's mouth by frontal cortex plays an important role in audiovisual speech perception.
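Trial-by-trial power correlation, as used above to measure connectivity, reduces to correlating per-trial band-limited power between pairs of electrodes. The sketch below uses hypothetical per-trial power values and names and is not the authors' pipeline.

# Minimal sketch (hypothetical data and names): connectivity as trial-by-trial
# power correlations between two electrodes. Assumed inputs: one band-limited
# power value per trial per electrode (e.g., mean gamma power in a trial window).
import numpy as np

def power_correlation(power_a, power_b):
    """Pearson correlation of per-trial power between two electrodes."""
    return np.corrcoef(power_a, power_b)[0, 1]

# Example with synthetic per-trial power for a frontal and two visual electrodes.
rng = np.random.default_rng(4)
frontal = rng.gamma(2.0, 1.0, 200)                             # 200 trials
mouth_region = 0.6 * frontal + 0.4 * rng.gamma(2.0, 1.0, 200)  # co-fluctuates with frontal
non_mouth = rng.gamma(2.0, 1.0, 200)                           # fluctuates independently
strong_connectivity = power_correlation(frontal, mouth_region)
weak_connectivity = power_correlation(frontal, non_mouth)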