Attentional Selection in a Cocktail Party Environment Can Be Decoded from Single-Trial EEG.
ABSTRACT: How humans solve the cocktail party problem remains unknown. However, progress has been made recently thanks to the realization that cortical activity tracks the amplitude envelope of speech. This has led to the development of regression methods for studying the neurophysiology of continuous speech. One such method, known as stimulus-reconstruction, has been successfully utilized with cortical surface recordings and magnetoencephalography (MEG). However, the former is invasive and gives a relatively restricted view of processing along the auditory hierarchy, whereas the latter is expensive and rare. Thus it would be extremely useful for research in many populations if stimulus-reconstruction were effective using electroencephalography (EEG), a widely available and inexpensive technology. Here we show that single-trial (~60 s) unaveraged EEG data can be decoded to determine attentional selection in a naturalistic multispeaker environment. Furthermore, we show a significant correlation between our EEG-based measure of attention and performance on a high-level attention task. In addition, by attempting to decode attention at individual latencies, we identify neural processing at ~200 ms as being critical for solving the cocktail party problem. These findings open up new avenues for studying the ongoing dynamics of cognition using EEG and for developing effective and natural brain-computer interfaces.
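The stimulus-reconstruction approach described above can be sketched as a ridge regression from time-lagged EEG channels back to the speech envelope; attention is then decoded by correlating the reconstructed envelope with each talker's actual envelope and picking the better match. The sketch below uses synthetic data, and all array shapes, lag counts, and the simulated channel mixing are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def lag_matrix(eeg, n_lags):
    """Stack 0..n_lags-1 sample delays of every EEG channel as columns."""
    n, c = eeg.shape
    X = np.zeros((n, c * n_lags))
    for k in range(n_lags):
        X[k:, k * c:(k + 1) * c] = eeg[:n - k]
    return X

def train_decoder(eeg, envelope, n_lags=16, lam=1.0):
    """Ridge solution w = (X'X + lam*I)^-1 X'y mapping lagged EEG to an envelope."""
    X = lag_matrix(eeg, n_lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ envelope)

def decode_attention(eeg, env_a, env_b, w, n_lags=16):
    """Reconstruct the envelope and pick the talker it correlates with more."""
    rec = lag_matrix(eeg, n_lags) @ w
    r_a = np.corrcoef(rec, env_a)[0, 1]
    r_b = np.corrcoef(rec, env_b)[0, 1]
    return ('A' if r_a > r_b else 'B'), r_a, r_b

# Synthetic demo: scalp channels linearly mix the attended envelope plus noise.
rng = np.random.default_rng(0)
n, c = 4000, 8
env_a = np.abs(rng.standard_normal(n))   # attended talker's envelope
env_b = np.abs(rng.standard_normal(n))   # ignored talker's envelope
mix = rng.standard_normal(c)
eeg = env_a[:, None] * mix + 0.5 * rng.standard_normal((n, c))
w = train_decoder(eeg[:3000], env_a[:3000])
choice, r_a, r_b = decode_attention(eeg[3000:], env_a[3000:], env_b[3000:], w)
print(choice, round(r_a, 2), round(r_b, 2))
```

Because the simulated EEG tracks only talker A, the held-out segment should be classified as attended-to-A, mirroring the single-trial attention decoding reported above.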
Project description:The human auditory system is highly skilled at extracting and processing information from speech in both single-speaker and multi-speaker situations. A commonly studied speech feature is the amplitude envelope, which can also be used to determine which speaker a listener is attending to in multi-speaker situations. Non-invasive brain imaging (electro-/magnetoencephalography [EEG/MEG]) has shown that the phase of neural activity below 16 Hz tracks the dynamics of speech, whereas invasive brain imaging (electrocorticography [ECoG]) has shown that such processing is strongly reflected in the power of high-frequency neural activity (around 70-150 Hz; known as high gamma). The first aim of this study was to determine whether high gamma power in scalp-recorded EEG carries useful stimulus-related information, despite its reputation for having a poor signal-to-noise ratio. Specifically, linear regression was used to investigate speech envelope and attention decoding in low-frequency EEG, high gamma power EEG, and in both EEG signals combined. The second aim was to assess whether the information reflected in high gamma power EEG is complementary to that reflected in well-established low-frequency EEG indices of speech processing. Exploratory analyses also examined how low-frequency and high gamma power EEG may be sensitive to different features of the speech envelope. While low-frequency speech tracking was evident for almost all subjects, as expected, high gamma power also showed robust speech tracking in some subjects. The same pattern held for attention decoding in a separate group of subjects who participated in a cocktail party attention experiment. For the subjects who showed speech tracking in high gamma power EEG, the spatiotemporal characteristics of that high gamma tracking differed from those of low-frequency EEG.
Furthermore, combining the two neural measures led to improved measures of speech tracking for several subjects. Our results indicated that high gamma power EEG can carry useful information regarding speech processing and attentional selection in some subjects. Combining high gamma power and low frequency EEG can improve the mapping between natural speech and the resulting neural responses.
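One way to obtain the two neural measures compared above is to band-pass the EEG into the low-frequency range and, separately, extract the high-gamma power envelope via the Hilbert transform, then concatenate both as features for the regression. A minimal sketch on stand-in data follows; the filter order, band edges, and sampling rate are illustrative assumptions rather than the study's actual settings:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_signal(eeg, fs, lo, hi, power=False):
    """Band-pass each channel; optionally return the band's power envelope."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype='band')
    x = filtfilt(b, a, eeg, axis=0)
    return np.abs(hilbert(x, axis=0)) if power else x

fs = 500                                        # Hz, assumed sampling rate
rng = np.random.default_rng(1)
eeg = rng.standard_normal((10 * fs, 4))         # 10 s of 4-channel noise as stand-in data
low = band_signal(eeg, fs, 1, 16)               # low-frequency band (phase tracking)
hg = band_signal(eeg, fs, 70, 150, power=True)  # high-gamma power envelope
features = np.hstack([low, hg])                 # combined feature set for the regression
print(features.shape)
```

Feeding `features` (rather than either band alone) into the envelope decoder corresponds to the combined low-frequency plus high-gamma analysis described above.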
Project description:Human listeners are able to selectively attend to target speech in a noisy environment with multiple people talking. Using recordings of the scalp electroencephalogram (EEG), this study investigated how selective attention facilitates the cortical representation of target speech under a simulated "cocktail-party" listening condition with speech-on-speech masking. The results show that the cortical representation of target-speech signals under the multi-talker condition was specifically improved by selective attention relative to the non-selective-attention listening condition, and that beta-band activity was most strongly modulated by selective attention. Moreover, as measured by Granger causality, selective attention to the single target speech in the mixed-speech complex enhanced the following four causal connectivities for the beta-band oscillation: (1) from site FT7 to the right motor area, (2) from the left frontal area to the right motor area, (3) from the central frontal area to the right motor area, and (4) from the central frontal area to the right frontal area. However, only the selective-attention-induced change in beta-band causal connectivity from the central frontal area to the right motor area was significantly correlated with the selective-attention-induced change in the cortical beta-band representation of target speech. These findings suggest that under "cocktail-party" listening conditions, the beta-band oscillatory EEG response to target speech is specifically facilitated by selective attention to the target speech embedded in the mixed-speech complex. The selective-attention-induced unmasking of target speech may be associated with improved beta-band functional connectivity from the central frontal area to the right motor area, suggesting top-down attentional modulation of the speech-motor process.
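Granger causality of the kind used here asks whether the past of one signal improves prediction of another beyond that signal's own past. Below is a minimal two-channel version on synthetic data; real analyses fit full VAR models with statistical tests, and the model order and coupling used here are illustrative assumptions:

```python
import numpy as np

def granger_strength(x, y, order=5):
    """Log variance ratio: how much x's past improves prediction of y."""
    n = len(y)
    # Restricted model: predict y(t) from its own past (lags 1..order).
    past_y = np.column_stack([y[order - k - 1:n - k - 1] for k in range(order)])
    # Full model: predict y(t) from the past of both y and x.
    past_xy = np.column_stack([past_y] + [x[order - k - 1:n - k - 1] for k in range(order)])
    target = y[order:]
    err_r = np.var(target - past_y @ np.linalg.lstsq(past_y, target, rcond=None)[0])
    err_f = np.var(target - past_xy @ np.linalg.lstsq(past_xy, target, rcond=None)[0])
    return np.log(err_r / err_f)   # > 0: x Granger-causes y

# Synthetic demo: x drives y at a 2-sample delay, not the reverse.
rng = np.random.default_rng(2)
n = 2000
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.8 * x[t - 2] + 0.2 * rng.standard_normal()
print(granger_strength(x, y), granger_strength(y, x))
```

The directed, asymmetric value is what allows statements such as "connectivity from the central frontal area to the right motor area" rather than mere correlation between the two sites.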
Project description:Perceiving speech-in-noise (SIN) demands precise neural coding between brainstem and cortical levels of the hearing system. Attentional processes can then select and prioritize task-relevant cues over competing background noise for successful speech perception. In animal models, brainstem-cortical interplay is achieved via descending corticofugal projections from cortex that shape midbrain responses to behaviorally-relevant sounds. Attentional engagement of corticofugal feedback may assist SIN understanding but has never been confirmed and remains highly controversial in humans. To resolve these issues, we recorded source-level, anatomically constrained brainstem frequency-following responses (FFRs) and cortical event-related potentials (ERPs) to speech via high-density EEG while listeners performed rapid SIN identification tasks. We varied attention with active vs. passive listening scenarios whereas task difficulty was manipulated with additive noise interference. Active listening (but not arousal-control tasks) exaggerated both ERPs and FFRs, confirming attentional gain extends to lower subcortical levels of speech processing. We used functional connectivity to measure the directed strength of coupling between levels and characterize "bottom-up" vs. "top-down" (corticofugal) signaling within the auditory brainstem-cortical pathway. While attention strengthened connectivity bidirectionally, corticofugal transmission disengaged under passive (but not active) SIN listening. Our findings (i) show attention enhances the brain's transcription of speech even prior to cortex and (ii) establish a direct role of the human corticofugal feedback system as an aid to cocktail party speech perception.
Project description:Audiovisual cross-modal training has been proposed as a tool to improve human spatial hearing. Here, we investigated training-induced modulations of event-related potential (ERP) components that have been associated with processes of auditory selective spatial attention when a speaker of interest has to be localized in a multiple speaker ("cocktail-party") scenario. Forty-five healthy participants were tested, including younger (19-29 years; <i>n</i> = 21) and older (66-76 years; <i>n</i> = 24) age groups. Three conditions of short-term training (duration 15 min) were compared, requiring localization of non-speech targets under "cocktail-party" conditions with either (1) synchronous presentation of co-localized auditory-target and visual stimuli (audiovisual-congruency training) or (2) immediate visual feedback on correct or incorrect localization responses (visual-feedback training), or (3) presentation of spatially incongruent auditory-target and visual stimuli presented at random positions with synchronous onset (control condition). Prior to and after training, participants were tested in an auditory spatial attention task (15 min), requiring localization of a predefined spoken word out of three distractor words, which were presented with synchronous stimulus onset from different positions. Peaks of ERP components were analyzed with a specific focus on the N2, which is known to be a correlate of auditory selective spatial attention. N2 amplitudes were significantly larger after audiovisual-congruency training compared with the remaining training conditions for younger, but not older, participants. Also, at the time of the N2, distributed source analysis revealed an enhancement of neural activity induced by audiovisual-congruency training in dorsolateral prefrontal cortex (Brodmann area 9) for the younger group. 
These findings suggest that, even at a short time scale, cross-modal processes induced by audiovisual-congruency training under "cocktail-party" conditions enhanced correlates of auditory selective spatial attention.
Project description:During the past decade, several studies have identified electroencephalographic (EEG) correlates of selective auditory attention to speech. In these studies, typically, listeners are instructed to focus on one of two concurrent speech streams (the "target"), while ignoring the other (the "masker"). EEG signals are recorded while participants are performing this task, and subsequently analyzed to recover the attended stream. An assumption often made in these studies is that the participant's attention can remain focused on the target throughout the test. To check this assumption, and assess when a participant's attention in a concurrent speech listening task was directed toward the target, the masker, or neither, we designed a behavioral listen-then-recall task (the Long-SWoRD test). After listening to two simultaneous short stories, participants had to identify keywords from the target story, randomly interspersed among words from the masker story and words from neither story, on a computer screen. To modulate task difficulty, and hence, the likelihood of attentional switches, masker stories were originally uttered by the same talker as the target stories. The masker voice parameters were then manipulated to parametrically control the similarity of the two streams, from clearly dissimilar to almost identical. While participants listened to the stories, EEG signals were measured and subsequently, analyzed using a temporal response function (TRF) model to reconstruct the speech stimuli. Responses in the behavioral recall task were used to infer, retrospectively, when attention was directed toward the target, the masker, or neither. 
During the model-training phase, the results of these behavioral-data-driven inferences were used as inputs to the model in addition to the EEG signals, to determine whether this additional information would improve stimulus-reconstruction accuracy relative to models trained under the assumption that the listener's attention was unwaveringly focused on the target. Results from 21 participants show that information regarding the actual, as opposed to assumed, attentional focus can be used advantageously during model training to enhance subsequent (test-phase) accuracy of auditory stimulus reconstruction from EEG signals. This is especially the case in challenging listening situations, where the participants' attention is less likely to remain focused entirely on the target talker. In situations where the two competing voices are clearly distinct and easily separated perceptually, the assumption that listeners are able to stay focused on the target is reasonable. The behavioral recall protocol introduced here provides experimenters with a means to behaviorally track fluctuations in auditory selective attention, including in combined behavioral/neurophysiological studies.
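The label-informed training idea can be sketched as follows: instead of regressing every EEG sample onto the target envelope, only the samples the behavioral inference marks as "on target" enter the fit. Everything below (feature dimensions, the mixing model, the fraction of on-target samples) is an invented illustration, not the study's actual TRF pipeline:

```python
import numpy as np

def fit_decoder(X, y, lam=1.0):
    """Ridge regression weights mapping EEG features to an envelope."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
n, d = 3000, 10
on_target = rng.random(n) < 0.7            # hypothetical behavioral inference
target_env = np.abs(rng.standard_normal(n))
masker_env = np.abs(rng.standard_normal(n))
mix = rng.standard_normal(d)
# Simulated EEG tracks whichever stream is attended at each moment, plus noise.
attended = np.where(on_target, target_env, masker_env)
eeg = attended[:, None] * mix + 0.5 * rng.standard_normal((n, d))

# Naive training assumes attention never left the target...
w_naive = fit_decoder(eeg, target_env)
# ...label-informed training keeps only samples where it actually stayed there.
w_informed = fit_decoder(eeg[on_target], target_env[on_target])

# Evaluate both decoders on a fresh, fully attended segment.
env_t = np.abs(rng.standard_normal(1000))
eeg_t = env_t[:, None] * mix + 0.5 * rng.standard_normal((1000, d))
r_naive = np.corrcoef(eeg_t @ w_naive, env_t)[0, 1]
r_informed = np.corrcoef(eeg_t @ w_informed, env_t)[0, 1]
print(round(r_naive, 2), round(r_informed, 2))
```

The informed fit avoids regressing masker-tracking samples onto the target envelope. In this easy synthetic setting both decoders perform well, which mirrors the study's observation that the benefit of true attention labels matters most in challenging listening conditions.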
Project description:The goal of this study was to investigate how cognitive factors influence performance in a multi-talker, "cocktail-party"-like environment in musicians and non-musicians. This was achieved by relating performance in a spatial hearing task to cognitive processing abilities assessed using measures of executive function (EF) and visual attention in musicians and non-musicians. For the spatial hearing task, a speech target was presented simultaneously with two intelligible speech maskers that were either colocated with the target (0° azimuth) or symmetrically separated from the target in azimuth (at ±15°). EF assessment included measures of cognitive flexibility, inhibitory control, and auditory working memory. Selective attention was assessed in the visual domain using a multiple object tracking (MOT) task, in which observers were required to track target dots (n = 1, 2, 3, 4, 5) in the presence of interfering distractor dots. Musicians performed significantly better than non-musicians on the spatial hearing task. For the EF measures, musicians showed better performance on measures of auditory working memory than non-musicians. Furthermore, across all individuals, a significant correlation was observed between performance on the spatial hearing task and measures of auditory working memory. This result suggests that individual differences in performance in a cocktail party-like environment may depend in part on cognitive factors such as auditory working memory. Performance on the MOT task did not differ between groups. However, across all individuals, a significant correlation was found between performance on the MOT and spatial hearing tasks. A stepwise multiple regression analysis revealed that musicianship and performance on the MOT task significantly predicted performance on the spatial hearing task.
Overall, these findings confirm the relationship between musicianship and cognitive factors including domain-general selective attention and working memory in solving the "cocktail party problem".
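The stepwise multiple regression reported above greedily adds whichever predictor most reduces the residual error and stops when the improvement becomes negligible. A toy forward-selection sketch follows; the predictor names, effect sizes, and stopping threshold are invented for illustration and do not reproduce the study's analysis:

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an intercept-plus-selected-columns fit."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def forward_stepwise(X, y, names, min_gain=0.05):
    """Greedy forward selection; stop when relative RSS reduction < min_gain."""
    chosen, remaining = [], list(range(X.shape[1]))
    current = rss(X, y, chosen)
    while remaining:
        gains = {j: current - rss(X, y, chosen + [j]) for j in remaining}
        j = max(gains, key=gains.get)
        if gains[j] < min_gain * current:
            break
        chosen.append(j)
        remaining.remove(j)
        current -= gains[j]
    return [names[c] for c in chosen]

# Toy data loosely mirroring the result: musicianship and MOT performance drive
# the spatial hearing score, a third variable does not (coefficients invented).
rng = np.random.default_rng(4)
n = 120
music = rng.integers(0, 2, n).astype(float)   # musician yes/no
mot = rng.standard_normal(n)                  # MOT tracking score
other = rng.standard_normal(n)                # unrelated predictor
score = 1.5 * music + 1.0 * mot + 0.3 * rng.standard_normal(n)
X = np.column_stack([music, mot, other])
print(forward_stepwise(X, score, ['musicianship', 'MOT', 'other']))
```

With this synthetic data, the two informative predictors are selected first, analogous to musicianship and MOT performance jointly predicting spatial hearing in the study.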
Project description:The human brain tracks amplitude fluctuations of both speech and music, which reflects acoustic processing in addition to the encoding of higher-order features and one's cognitive state. Comparing neural tracking of speech and music envelopes can elucidate stimulus-general mechanisms, but direct comparisons are confounded by differences in their envelope spectra. Here, we use a novel method of frequency-constrained reconstruction of stimulus envelopes using EEG recorded during passive listening. We expected to see music reconstruction match speech in a narrow range of frequencies, but instead we found that speech was reconstructed better than music for all frequencies we examined. Additionally, models trained on all stimulus types performed as well or better than the stimulus-specific models at higher modulation frequencies, suggesting a common neural mechanism for tracking speech and music. However, speech envelope tracking at low frequencies, below 1 Hz, was associated with increased weighting over parietal channels, which was not present for the other stimuli. Our results highlight the importance of low-frequency speech tracking and suggest an origin from speech-specific processing in the brain.
Project description:The ability to focus on and understand one talker in a noisy social environment is a critical social-cognitive capacity, whose underlying neuronal mechanisms are unclear. We investigated the manner in which speech streams are represented in brain activity and the way that selective attention governs the brain's representation of speech using a "Cocktail Party" paradigm, coupled with direct recordings from the cortical surface in surgical epilepsy patients. We find that brain activity dynamically tracks speech streams using both low-frequency phase and high-frequency amplitude fluctuations and that optimal encoding likely combines the two. In and near low-level auditory cortices, attention "modulates" the representation by enhancing cortical tracking of attended speech streams, but ignored speech remains represented. In higher-order regions, the representation appears to become more "selective," in that there is no detectable tracking of ignored speech. This selectivity itself seems to sharpen as a sentence unfolds.
Project description:Studies suggest that long-term music experience enhances the brain's ability to segregate speech from noise. However, evidence for musicians' "speech-in-noise (SIN) benefit" is based largely on simple figure-ground tasks rather than on competitive, multi-talker scenarios that offer realistic spatial cues for segregation and engage binaural processing. We aimed to investigate whether musicians show perceptual advantages in cocktail party speech segregation in a competitive, multi-talker environment. We used the coordinate response measure (CRM) paradigm to measure speech recognition and localization performance in musicians vs. non-musicians in a simulated 3D cocktail party environment in an anechoic chamber. Speech was delivered through a 16-channel speaker array distributed around the horizontal soundfield surrounding the listener. Participants recalled the color, number, and perceived location of target callsign sentences. We manipulated task difficulty by varying the number of additional maskers presented at other spatial locations in the horizontal soundfield (0, 1, 2, 3, 4, 6, or 8 competing talkers). Musicians obtained faster and better speech recognition amidst up to around eight simultaneous talkers and showed less noise-related decline in performance with increasing interferers than their non-musician peers. Correlations revealed associations between listeners' years of musical training and both CRM recognition and working memory. However, better working memory correlated with better speech streaming. Basic (QuickSIN) but not more complex (speech streaming) SIN processing was still predicted by music training after controlling for working memory. Our findings confirm a relationship between musicianship and naturalistic cocktail party speech streaming but also suggest that cognitive factors at least partially drive musicians' SIN advantage.
Project description:According to the Motor Theory of speech perception, the interaction between the auditory and motor systems plays an essential role in speech perception. Since the Motor Theory was proposed, it has received remarkable attention in the field. However, each of the theory's three hypotheses still needs further verification. In this review, we focus on how auditory-motor anatomical and functional associations contribute to speech perception and discuss why previous studies have not reached agreement, in particular on whether the motor system's involvement in speech perception is task-load dependent. Finally, we suggest that the auditory-motor link is particularly useful for speech perception under adverse listening conditions and that a further revised Motor Theory is a potential solution to the "cocktail-party" problem.