Envelope reconstruction of speech and music highlights stronger tracking of speech at low frequencies.
ABSTRACT: The human brain tracks amplitude fluctuations of both speech and music, which reflects acoustic processing in addition to the encoding of higher-order features and one's cognitive state. Comparing neural tracking of speech and music envelopes can elucidate stimulus-general mechanisms, but direct comparisons are confounded by differences in their envelope spectra. Here, we use a novel method of frequency-constrained reconstruction of stimulus envelopes using EEG recorded during passive listening. We expected to see music reconstruction match speech in a narrow range of frequencies, but instead we found that speech was reconstructed better than music for all frequencies we examined. Additionally, models trained on all stimulus types performed as well or better than the stimulus-specific models at higher modulation frequencies, suggesting a common neural mechanism for tracking speech and music. However, speech envelope tracking at low frequencies, below 1 Hz, was associated with increased weighting over parietal channels, which was not present for the other stimuli. Our results highlight the importance of low-frequency speech tracking and suggest an origin from speech-specific processing in the brain.
Project description:A growing body of research suggests that intrinsic neuronal slow (<10 Hz) oscillations in auditory cortex appear to track incoming speech and other spectro-temporally complex auditory signals. Within this framework, several recent studies have identified critical-band temporal envelopes as the specific acoustic feature being reflected by the phase of these oscillations. However, how this alignment between speech acoustics and neural oscillations might underpin intelligibility is unclear. Here we test the hypothesis that the 'sharpness' of temporal fluctuations in the critical band envelope acts as a temporal cue to speech syllabic rate, driving delta-theta rhythms to track the stimulus and facilitate intelligibility. We interpret our findings as evidence that sharp events in the stimulus cause cortical rhythms to re-align and parse the stimulus into syllable-sized chunks for further decoding. Using magnetoencephalographic recordings, we show that by removing temporal fluctuations that occur at the syllabic rate, envelope-tracking activity is reduced. By artificially reinstating these temporal fluctuations, envelope-tracking activity is regained. These changes in tracking correlate with intelligibility of the stimulus. Together, the results suggest that the sharpness of fluctuations in the stimulus, as reflected in the cochlear output, drive oscillatory activity to track and entrain to the stimulus, at its syllabic rate. This process likely facilitates parsing of the stimulus into meaningful chunks appropriate for subsequent decoding, enhancing perception and intelligibility.
Project description:This psychophysics study used musicians as a model to investigate whether musical expertise shapes the temporal integration window for audiovisual speech, sinewave speech, or music. Musicians and non-musicians judged the audiovisual synchrony of speech, sinewave analogs of speech, and music stimuli at 13 audiovisual stimulus onset asynchronies (±360, ±300 ±240, ±180, ±120, ±60, and 0 ms). Further, we manipulated the duration of the stimuli by presenting sentences/melodies or syllables/tones. Critically, musicians relative to non-musicians exhibited significantly narrower temporal integration windows for both music and sinewave speech. Further, the temporal integration window for music decreased with the amount of music practice, but not with age of acquisition. In other words, the more musicians practiced piano in the past 3 years, the more sensitive they became to the temporal misalignment of visual and auditory signals. Collectively, our findings demonstrate that music practicing fine-tunes the audiovisual temporal integration window to various extents depending on the stimulus class. While the effect of piano practicing was most pronounced for music, it also generalized to other stimulus classes such as sinewave speech and to a marginally significant degree to natural speech.
Project description:Numerous neuroimaging studies demonstrated that the auditory cortex tracks ongoing speech and that, in multi-speaker environments, tracking of the attended speaker is enhanced compared to the other irrelevant speakers. In contrast to speech, multi-instrument music can be appreciated by attending not only on its individual entities (i.e., segregation) but also on multiple instruments simultaneously (i.e., integration). We investigated the neural correlates of these two modes of music listening using electroencephalography (EEG) and sound envelope tracking. To this end, we presented uniquely composed music pieces played by two instruments, a bassoon and a cello, in combination with a previously validated music auditory scene analysis behavioral paradigm (Disbergen et al., 2018). Similar to results obtained through selective listening tasks for speech, relevant instruments could be reconstructed better than irrelevant ones during the segregation task. A delay-specific analysis showed higher reconstruction for the relevant instrument during a middle-latency window for both the bassoon and cello and during a late window for the bassoon. During the integration task, we did not observe significant attentional modulation when reconstructing the overall music envelope. Subsequent analyses indicated that this null result might be due to the heterogeneous strategies listeners employ during the integration task. Overall, our results suggest that subsequent to a common processing stage, top-down modulations consistently enhance the relevant instrument's representation during an instrument segregation task, whereas such an enhancement is not observed during an instrument integration task. These findings extend previous results from speech tracking to the tracking of multi-instrument music and, furthermore, inform current theories on polyphonic music perception.
Project description:Neural processing along the ascending auditory pathway is often associated with a progressive reduction in characteristic processing rates. For instance, the well-known frequency-following response (FFR) of the auditory midbrain, as measured with electroencephalography (EEG), is dominated by frequencies from ?100?Hz to several hundred Hz, phase-locking to the acoustic stimulus at those frequencies. In contrast, cortical responses, whether measured by EEG or magnetoencephalography (MEG), are typically characterized by frequencies of a few Hz to a few tens of Hz, time-locking to acoustic envelope features. In this study we investigated a crossover case, cortically generated responses time-locked to continuous speech features at FFR-like rates. Using MEG, we analyzed responses in the high gamma range of 70-200?Hz to continuous speech using neural source-localized reverse correlation and the corresponding temporal response functions (TRFs). Continuous speech stimuli were presented to 40 subjects (17 younger, 23 older adults) with clinically normal hearing and their MEG responses were analyzed in the 70-200?Hz band. Consistent with the relative insensitivity of MEG to many subcortical structures, the spatiotemporal profile of these response components indicated a cortical origin with ?40?ms peak latency and a right hemisphere bias. TRF analysis was performed using two separate aspects of the speech stimuli: a) the 70-200?Hz carrier of the speech, and b) the 70-200?Hz temporal modulations in the spectral envelope of the speech stimulus. The response was dominantly driven by the envelope modulation, with a much weaker contribution from the carrier. Age-related differences were also analyzed to investigate a reversal previously seen along the ascending auditory pathway, whereby older listeners show weaker midbrain FFR responses than younger listeners, but, paradoxically, have stronger cortical low frequency responses. In contrast to both these earlier results, this study did not find clear age-related differences in high gamma cortical responses to continuous speech. Cortical responses at FFR-like frequencies shared some properties with midbrain responses at the same frequencies and with cortical responses at much lower frequencies.
Project description:Auditory stimulus reconstruction is a technique that finds the best approximation of the acoustic stimulus from the population of evoked neural activity. Reconstructing speech from the human auditory cortex creates the possibility of a speech neuroprosthetic to establish a direct communication with the brain and has been shown to be possible in both overt and covert conditions. However, the low quality of the reconstructed speech has severely limited the utility of this method for brain-computer interface (BCI) applications. To advance the state-of-the-art in speech neuroprosthesis, we combined the recent advances in deep learning with the latest innovations in speech synthesis technologies to reconstruct closed-set intelligible speech from the human auditory cortex. We investigated the dependence of reconstruction accuracy on linear and nonlinear (deep neural network) regression methods and the acoustic representation that is used as the target of reconstruction, including auditory spectrogram and speech synthesis parameters. In addition, we compared the reconstruction accuracy from low and high neural frequency ranges. Our results show that a deep neural network model that directly estimates the parameters of a speech synthesizer from all neural frequencies achieves the highest subjective and objective scores on a digit recognition task, improving the intelligibility by 65% over the baseline method which used linear regression to reconstruct the auditory spectrogram. These results demonstrate the efficacy of deep learning and speech synthesis algorithms for designing the next generation of speech BCI systems, which not only can restore communications for paralyzed patients but also have the potential to transform human-computer interaction technologies.
Project description:Children's speech presents a challenging problem for formant frequency measurement. In part, this is because high fundamental frequencies, typical of a children's speech production, generate widely spaced harmonic components that may undersample the spectral shape of the vocal tract transfer function. In addition, there is often a weakening of upper harmonic energy and a noise component due to glottal turbulence. The purpose of this study was to develop a formant measurement technique based on cepstral analysis that does not require modification of the cepstrum itself or transformation back to the spectral domain. Instead, a narrow-band spectrum is low-pass filtered with a cutoff point (i.e., cutoff "quefrency" in the terminology of cepstral analysis) to preserve only the spectral envelope. To test the method, speech representative of a 2-3 year-old child was simulated with an airway modulation model of speech production. The model, which includes physiologically-scaled vocal folds and vocal tract, generates sound output analogous to a microphone signal. The vocal tract resonance frequencies can be calculated independently of the output signal and thus provide test cases that allow for assessing the accuracy of the formant tracking algorithm. When applied to the simulated child-like speech, the spectral filtering approach was shown to provide a clear spectrographic representation of formant change over the time course of the signal, and facilitates tracking formant frequencies for further analysis.
Project description:The human auditory system is highly skilled at extracting and processing information from speech in both single-speaker and multi-speaker situations. A commonly studied speech feature is the amplitude envelope which can also be used to determine which speaker a listener is attending to in those multi-speaker situations. Non-invasive brain imaging (electro-/magnetoencephalography [EEG/MEG]) has shown that the phase of neural activity below 16 Hz tracks the dynamics of speech, whereas invasive brain imaging (electrocorticography [ECoG]) has shown that such processing is strongly reflected in the power of high frequency neural activity (around 70-150 Hz; known as high gamma). The first aim of this study was to determine if high gamma power scalp recorded EEG carries useful stimulus-related information, despite its reputation for having a poor signal to noise ratio. Specifically, linear regression was used to investigate speech envelope and attention decoding in low frequency EEG, high gamma power EEG, and in both EEG signals combined. The second aim was to assess whether the information reflected in high gamma power EEG may be complementary to that reflected in well-established low frequency EEG indices of speech processing. Exploratory analyses were also completed to examine how low frequency and high gamma power EEG may be sensitive to different features of the speech envelope. While low frequency speech tracking was evident for almost all subjects as expected, high gamma power also showed robust speech tracking in some subjects. This same pattern was true for attention decoding using a separate group of subjects who participated in a cocktail party attention experiment. For the subjects who showed speech tracking in high gamma power EEG, the spatiotemporal characteristics of that high gamma tracking differed from that of low-frequency EEG. Furthermore, combining the two neural measures led to improved measures of speech tracking for several subjects. Our results indicated that high gamma power EEG can carry useful information regarding speech processing and attentional selection in some subjects. Combining high gamma power and low frequency EEG can improve the mapping between natural speech and the resulting neural responses.
Project description:Background noise reduces the depth of the low-frequency envelope modulations known to be important for speech intelligibility. The relative strength of the target and masker envelope modulations can be quantified using a modulation signal-to-noise ratio, (S/N)(mod), measure. Such a measure can be used in noise-suppression algorithms to extract target-relevant modulations from the corrupted (target + masker) envelopes for potential improvement in speech intelligibility. In the present study, envelopes are decomposed in the modulation spectral domain into a number of channels spanning the range of 0-30 Hz. Target-dominant modulations are identified and retained in each channel based on the (S/N)(mod) selection criterion, while modulations which potentially interfere with perception of the target (i.e., those dominated by the masker) are discarded. The impact of modulation-selective processing on the speech-reception threshold for sentences in noise is assessed with normal-hearing listeners. Results indicate that the intelligibility of noise-masked speech can be improved by as much as 13 dB when preserving target-dominant modulations, present up to a modulation frequency of 18 Hz, while discarding masker-dominant modulations from the mixture envelopes.
Project description:Speech perception often takes place in noisy environments, where multiple auditory signals compete with one another. The addition of visual cues such as talkers' faces or lip movements to an auditory signal can help improve the intelligibility of speech in those suboptimal listening environments. This is referred to as audiovisual benefits. The current study aimed to delineate the signal-to-noise ratio (SNR) conditions under which visual presentations of the acoustic amplitude envelopes have their most significant impact on speech perception. Seventeen adults with normal hearing were recruited. Participants were presented with spoken sentences in babble noise either in auditory-only or auditory-visual conditions with various SNRs at -7, -5, -3, -1, and 1 dB. The visual stimulus applied in this study was a sphere that varied in size syncing with the amplitude envelope of the target speech signals. Participants were asked to transcribe the sentences they heard. Results showed that a significant improvement in accuracy in the auditory-visual condition versus the audio-only condition was obtained at the SNRs of -3 and -1 dB, but no improvement was observed in other SNRs. These results showed that dynamic temporal visual information can benefit speech perception in noise, and the optimal facilitative effects of visual amplitude envelope can be observed under an intermediate SNR range.
Project description:<h4>Objectives</h4>Selectively attending to a target talker while ignoring multiple interferers (competing talkers and background noise) is more difficult for hearing-impaired (HI) individuals compared to normal-hearing (NH) listeners. Such tasks also become more difficult as background noise levels increase. To overcome these difficulties, hearing aids (HAs) offer noise reduction (NR) schemes. The objective of this study was to investigate the effect of NR processing (inactive, where the NR feature was switched off, <i>vs.</i> active, where the NR feature was switched on) on the neural representation of speech envelopes across two different background noise levels [+3 dB signal-to-noise ratio (SNR) and +8 dB SNR] by using a stimulus reconstruction (SR) method.<h4>Design</h4>To explore how NR processing supports the listeners' selective auditory attention, we recruited 22 HI participants fitted with HAs. To investigate the interplay between NR schemes, background noise, and neural representation of the speech envelopes, we used electroencephalography (EEG). The participants were instructed to listen to a target talker in front while ignoring a competing talker in front in the presence of multi-talker background babble noise.<h4>Results</h4>The results show that the neural representation of the attended speech envelope was enhanced by the active NR scheme for both background noise levels. The neural representation of the attended speech envelope at lower (+3 dB) SNR was shifted, approximately by 5 dB, toward the higher (+8 dB) SNR when the NR scheme was turned on. The neural representation of the ignored speech envelope was modulated by the NR scheme and was mostly enhanced in the conditions with more background noise. The neural representation of the background noise was modulated (i.e., reduced) by the NR scheme and was significantly reduced in the conditions with more background noise. The neural representation of the net sum of the ignored acoustic scene (ignored talker and background babble) was not modulated by the NR scheme but was significantly reduced in the conditions with a reduced level of background noise. Taken together, we showed that the active NR scheme enhanced the neural representation of both the attended and the ignored speakers and reduced the neural representation of background noise, while the net sum of the ignored acoustic scene was not enhanced.<h4>Conclusion</h4>Altogether our results support the hypothesis that the NR schemes in HAs serve to enhance the neural representation of speech and reduce the neural representation of background noise during a selective attention task. We contend that these results provide a neural index that could be useful for assessing the effects of HAs on auditory and cognitive processing in HI populations.