Musicians Show Improved Speech Segregation in Competitive, Multi-Talker Cocktail Party Scenarios.
ABSTRACT: Studies suggest that long-term music experience enhances the brain's ability to segregate speech from noise. However, evidence for musicians' "speech-in-noise (SIN) benefit" comes largely from simple figure-ground tasks rather than from competitive, multi-talker scenarios that offer realistic spatial cues for segregation and engage binaural processing. We investigated whether musicians show perceptual advantages in cocktail party speech segregation in a competitive, multi-talker environment. Using the coordinate response measure (CRM) paradigm, we measured speech recognition and localization performance in musicians vs. non-musicians in a simulated 3D cocktail party environment within an anechoic chamber. Speech was delivered through a 16-channel speaker array distributed around the horizontal soundfield surrounding the listener. Participants recalled the color, number, and perceived location of target callsign sentences. We manipulated task difficulty by varying the number of additional maskers presented at other spatial locations in the horizontal soundfield (0, 1, 2, 3, 4, 6, or 8 simultaneous talkers). Musicians achieved faster and more accurate speech recognition with up to eight simultaneous talkers and showed less decline in performance with increasing interferers than their non-musician peers. Correlations revealed associations between listeners' years of musical training and both CRM recognition and working memory; better working memory, in turn, correlated with better speech streaming. Basic (QuickSIN) but not more complex (speech streaming) SIN processing was still predicted by music training after controlling for working memory. Our findings confirm a relationship between musicianship and naturalistic cocktail party speech streaming but also suggest that cognitive factors at least partially drive musicians' SIN advantage.
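As an aside on the analysis logic: testing whether music training still predicts SIN performance after controlling for working memory amounts to a first-order partial correlation. Below is a minimal sketch in Python using synthetic data; the variable names, effect sizes, and sample size are illustrative assumptions, not the study's data.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, covariate):
    """Correlate x and y after regressing out a covariate from both.

    Residualizing both variables on the covariate and correlating the
    residuals is equivalent to the first-order partial correlation.
    """
    def residualize(v, c):
        slope, intercept, *_ = stats.linregress(c, v)
        return v - (slope * c + intercept)
    return stats.pearsonr(residualize(x, covariate), residualize(y, covariate))

rng = np.random.default_rng(0)
n = 40                                                     # hypothetical sample size
music_years = rng.uniform(0, 15, n)                        # years of musical training
wm_span = 4 + 0.2 * music_years + rng.normal(0, 1, n)      # working-memory score
quicksin = 2 - 0.1 * music_years + rng.normal(0, 0.5, n)   # lower = better SNR loss

# Zero-order vs. partial correlation (controlling for working memory)
print(stats.pearsonr(music_years, quicksin))
print(partial_corr(music_years, quicksin, wm_span))
```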
Project description: Are musicians better able to understand speech in noise than non-musicians? Recent findings have produced contradictory results. Here we addressed this question by asking musicians and non-musicians to understand target sentences masked by other sentences presented from different spatial locations, the classical 'cocktail party problem' in speech science. We found that musicians obtained a substantial benefit in this situation, with thresholds ~6 dB better than those of non-musicians. Large individual differences in performance were noted, particularly for the non-musically trained group. Furthermore, in different conditions we manipulated the spatial location and intelligibility of the masking sentences, thus changing the amount of 'informational masking' (IM) while keeping the amount of 'energetic masking' (EM) relatively constant. When the maskers were unintelligible and spatially separated from the target (low in IM), musicians and non-musicians performed comparably. These results suggest that the characteristics of speech maskers and the amount of IM can influence the magnitude of the differences found between musicians and non-musicians in multiple-talker "cocktail party" environments. Furthermore, considering the task in terms of the EM-IM distinction provides a conceptual framework for future behavioral and neuroscientific studies that explore the underlying sensory and cognitive mechanisms contributing to enhanced "speech-in-noise" perception by musicians.
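For intuition about the ~6 dB figure: on the decibel scale, a 6 dB threshold advantage means the target remains intelligible at roughly half the sound-pressure amplitude (about a quarter of the power) relative to the maskers. A quick check, using the standard dB conversions:

```python
# A 6 dB threshold advantage, expressed as amplitude and power ratios.
benefit_db = 6.0
amplitude_ratio = 10 ** (-benefit_db / 20)  # ~0.50: target at ~half the amplitude
power_ratio = 10 ** (-benefit_db / 10)      # ~0.25: ~a quarter of the power
print(amplitude_ratio, power_ratio)
```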
Project description: Speech sounds are perceived relative to the spectral properties of surrounding speech. For instance, target words that are ambiguous between /bɪt/ (with low F1) and /bɛt/ (with high F1) are more likely to be perceived as "bet" after a "low F1" sentence, but as "bit" after a "high F1" sentence. However, it is unclear how these spectral contrast effects (SCEs) operate in multi-talker listening conditions. Recently, Feng and Oxenham (J. Exp. Psychol. Hum. Percept. Perform. 44(9), 1447-1457, 2018b) reported that selective attention affected SCEs to a small degree, using two simultaneously presented sentences produced by a single talker. The present study assessed the role of selective attention in more naturalistic "cocktail party" settings, with 200 lexically unique sentences, 20 target words, and different talkers. Results indicate that selective attention to one talker in one ear (while ignoring another talker in the other ear) modulates SCEs in such a way that only the spectral properties of the attended talker influence target perception. However, SCEs were much smaller in multi-talker settings (Experiment 2) than in single-talker settings (Experiment 1). Therefore, the influence of SCEs on speech comprehension in more naturalistic settings (i.e., with competing talkers) may be smaller than previously estimated from studies without competing talkers.
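An SCE of this kind is commonly quantified as the shift in the proportion of high-F1 ("bet") responses between the two context conditions. A minimal sketch, with hypothetical response data rather than the study's materials:

```python
import numpy as np

# Hypothetical response data: 1 = "bet", 0 = "bit", for targets following
# low-F1 vs. high-F1 context sentences (names and values are illustrative).
resp_after_low_f1 = np.array([1, 1, 0, 1, 1, 0, 1, 1])
resp_after_high_f1 = np.array([0, 0, 1, 0, 0, 1, 0, 0])

# One common effect measure: difference in proportion of "bet" responses.
sce = resp_after_low_f1.mean() - resp_after_high_f1.mean()
print(f"SCE magnitude: {sce:+.2f} (positive = contrastive shift)")
```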
Project description: The goal of this study was to investigate how cognitive factors influence performance in a multi-talker, "cocktail-party"-like environment in musicians and non-musicians. This was achieved by relating performance on a spatial hearing task to cognitive processing abilities assessed using measures of executive function (EF) and visual attention in musicians and non-musicians. For the spatial hearing task, a speech target was presented simultaneously with two intelligible speech maskers that were either colocated with the target (0° azimuth) or symmetrically separated from the target in azimuth (at ±15°). EF assessment included measures of cognitive flexibility, inhibitory control, and auditory working memory. Selective attention was assessed in the visual domain using a multiple object tracking (MOT) task, in which observers were required to track target dots (n = 1-5) in the presence of interfering distractor dots. Musicians performed significantly better than non-musicians on the spatial hearing task. For the EF measures, musicians showed better performance on measures of auditory working memory than non-musicians. Furthermore, across all individuals, a significant correlation was observed between performance on the spatial hearing task and measures of auditory working memory, suggesting that individual differences in performance in a cocktail party-like environment may depend in part on cognitive factors such as auditory working memory. Performance on the MOT task did not differ between groups; however, across all individuals, a significant correlation was found between performance on the MOT and spatial hearing tasks. A stepwise multiple regression analysis revealed that musicianship and performance on the MOT task significantly predicted performance on the spatial hearing task (see the sketch below). Overall, these findings confirm the relationship between musicianship and cognitive factors, including domain-general selective attention and working memory, in solving the "cocktail party problem".
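The final step of such a regression can be sketched as follows. This is a hedged illustration with synthetic data and assumed effect directions (lower spatial-hearing thresholds = better), not the study's actual model or coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 30                                    # hypothetical sample
musician = rng.integers(0, 2, n)          # 0 = non-musician, 1 = musician
mot_score = rng.normal(0.7, 0.1, n)       # multiple-object-tracking accuracy
spatial_srt = (-2.0 * musician - 8.0 * mot_score
               + rng.normal(0, 1.5, n))   # spatial-hearing threshold (lower = better)

# Two-predictor model retained by the (hypothetical) stepwise procedure
X = sm.add_constant(np.column_stack([musician, mot_score]))
model = sm.OLS(spatial_srt, X).fit()
print(model.summary())
```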
Project description: In cocktail party situations, multiple talkers speak simultaneously, which makes listening perceptually and cognitively challenging. Such situations can be either static (fixed target talker) or dynamic, meaning the target talker switches occasionally and in a potentially unpredictable way. To shed light on the perceptual and cognitive mechanisms at work in static and dynamic cocktail party situations, we conducted an analysis of the error types that occur during a multi-talker speech recognition test. The error analysis distinguished between misunderstood or omitted words (random errors) and target-masker confusions. To investigate the effects of aging and hearing impairment, we compared data from three listener groups: young adults and older adults with and without hearing loss. In the static condition, error rates were generally very low, except for the older hearing-impaired listeners; consistent with the assumption of decreased audibility, they showed a notable number of random errors. In the dynamic condition, errors increased relative to the static condition, especially immediately following a target talker switch, and these increases were similar for random and confusion errors. The older hearing-impaired listeners showed greater difficulties than the younger adults in trials not preceded by a switch. These results suggest that the load associated with dynamic cocktail party listening affects both the ability to focus attention on the talker of interest and the retrieval of words from short-term memory, as indicated by the increased numbers of confusion and random errors. This was most pronounced in the older hearing-impaired listeners, suggesting an interplay of perceptual and cognitive mechanisms.
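The two-way error taxonomy lends itself to a simple per-trial classification rule. A minimal sketch; the example words are hypothetical, not the test's actual materials:

```python
def classify_error(response, target_word, masker_words):
    """Split errors into target-masker confusions vs. random errors,
    following the two-way error taxonomy described above."""
    if response == target_word:
        return "correct"
    if response in masker_words:
        return "confusion"      # listener reported a masker's word
    return "random"             # misunderstood or omitted word

# Illustrative trial with hypothetical words
print(classify_error("green", target_word="blue", masker_words={"green", "red"}))
```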
Project description: Previous studies have shown that listeners are better able to understand speech when they are familiar with the talker's voice. In most of these studies, talker familiarity was ensured by explicit voice training; that is, listeners learned to identify the familiar talkers. In the real world, however, the characteristics of familiar talkers are learned incidentally, through communication. The present study investigated whether speech comprehension benefits from implicit voice training; that is, from exposure to talkers' voices without listeners explicitly trying to identify them. During four training sessions, listeners heard short sentences containing a single verb (e.g., "he writes"), spoken by one talker. The sentences were mixed with noise, and listeners identified the verb within each sentence while their speech-reception thresholds (SRTs) were measured. In a final test session, listeners performed the same task, but this time they heard different sentences spoken by the familiar talker and by three unfamiliar talkers. Familiar and unfamiliar talkers were counterbalanced across listeners. Half of the listeners performed a test session in which the four talkers were presented in separate blocks (blocked paradigm); for the other half, talkers varied randomly from trial to trial (interleaved paradigm). The results showed that listeners had lower SRTs when the speech was produced by the familiar talker than by the unfamiliar talkers. The type of talker presentation (blocked vs. interleaved) had no effect on this familiarity benefit. These findings suggest that listeners implicitly learn talker-specific information during a speech-comprehension task and exploit this information to improve the comprehension of novel speech material from familiar talkers.
Project description: Successful speech communication often requires selective attention to a target stream amidst competing sounds, as well as the ability to switch attention among multiple interlocutors. However, auditory attention switching negatively affects both target detection accuracy and reaction time, suggesting that attention switches carry a cognitive cost. Pupillometry is one method of assessing mental effort or cognitive load. Two experiments were conducted to determine whether the effort associated with attention switches is detectable in the pupillary response. In both experiments, pupil dilation, target detection sensitivity, and reaction time were measured; the task required listeners to either maintain or switch attention between two concurrent speech streams. Secondary manipulations explored whether switch-related effort would increase when auditory streaming was harder. In experiment 1, spatially distinct stimuli were degraded by simulated reverberation (compromising across-time streaming cues), and target-masker talker gender match was also varied. In experiment 2, diotic streams separable by talker voice quality and pitch were degraded by noise vocoding, and the time allotted for mid-trial attention switching was varied. All trial manipulations had some effect on target detection sensitivity and/or reaction time; however, only the attention-switching manipulation affected the pupillary response: greater dilation was observed in trials requiring a switch of attention between talkers.
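Pupillometric effort measures of this kind typically baseline-correct each trial's trace before comparing conditions. A minimal sketch of one common pipeline, with simulated traces and an assumed 60 Hz sampling rate; this is not the study's actual analysis:

```python
import numpy as np
from scipy import stats

def evoked_dilation(trace, fs, baseline_s=1.0):
    """Baseline-corrected mean pupil dilation for one trial's trace.

    Subtracts the mean pupil size over the pre-stimulus baseline window,
    then averages the remainder of the trace.
    """
    n_base = int(baseline_s * fs)
    baseline = trace[:n_base].mean()
    return (trace[n_base:] - baseline).mean()

rng = np.random.default_rng(2)
fs = 60  # Hz, a typical eye-tracker sampling rate (assumed)
switch_trials = [rng.normal(0.15, 0.05, fs * 6) for _ in range(20)]
maintain_trials = [rng.normal(0.05, 0.05, fs * 6) for _ in range(20)]

d_switch = [evoked_dilation(t, fs) for t in switch_trials]
d_maintain = [evoked_dilation(t, fs) for t in maintain_trials]
print(stats.ttest_ind(d_switch, d_maintain))
```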
Project description: This investigation examined the effect of the accent of target talkers and background speech maskers on listeners' ability to use cues to separate speech from noise. Differences in accent may create a disparity in the relative timing between signal and background, and such timing cues may be used to separate the target talker from the background speech masker. However, the use of this cue could be reduced for older listeners with temporal processing deficits, especially those with hearing loss. Participants were younger and older listeners with normal hearing and older listeners with hearing loss. Stimuli were IEEE sentences recorded in English by male native speakers of English and Spanish. These sentences were presented in different maskers, including speech-modulated noise and background babbles varying in talker gender and accent. Signal-to-noise ratios corresponding to 50% correct performance were measured. Results indicate that a pronounced Spanish accent limits a listener's ability to take advantage of cues for speech segregation and that a difference in accentedness between the target talker and background masker may be a useful cue for speech segregation. Older hearing-impaired listeners performed poorly in all conditions with accented talkers.
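A 50%-correct SNR is commonly estimated with a 1-down/1-up adaptive track, which converges on the point where half the responses are correct. Below is a minimal simulation sketch assuming a logistic psychometric function; the parameters are illustrative, and the study's exact procedure may differ:

```python
import numpy as np

rng = np.random.default_rng(3)

def psychometric(snr_db, srt_db=-4.0, slope=0.5):
    """Probability of a correct response at a given SNR (logistic, assumed)."""
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - srt_db)))

# Simple 1-down/1-up adaptive track: step down after a correct response,
# up after an error. It converges on the 50%-correct point, i.e. the SNR
# at which half the sentences are repeated correctly.
snr, step, reversals = 10.0, 2.0, []
last_direction = None
while len(reversals) < 8:
    correct = rng.random() < psychometric(snr)
    direction = -1 if correct else +1
    if last_direction is not None and direction != last_direction:
        reversals.append(snr)
    last_direction = direction
    snr += direction * step
print("Estimated 50%-correct SNR:", np.mean(reversals[-6:]))
```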
Project description: Musical training is thought to be related to improved language skills, for example, understanding speech in background noise. Although studies have found that musicians and nonmusicians differ in the morphology of the bilateral arcuate fasciculus (AF), none has associated such white matter features with speech-in-noise (SIN) perception. Here, we tested both SIN and the diffusivity of bilateral AF segments in musicians and nonmusicians using diffusion tensor imaging. Compared with nonmusicians, musicians had higher fractional anisotropy (FA) in the right direct AF and lower radial diffusivity in the left anterior AF, both of which correlated with SIN performance. The FA-based laterality index showed stronger right lateralization of the direct AF and stronger left lateralization of the posterior AF in musicians than in nonmusicians, with the posterior AF laterality predicting SIN accuracy. Furthermore, hemodynamic activity in the right superior temporal gyrus, obtained during a SIN task, fully mediated the contribution of right direct AF diffusivity to SIN performance, thereby linking training-related white matter plasticity, brain hemodynamics, and speech perception ability. Our findings provide direct evidence that differential microstructural plasticity of bilateral AF segments may serve as a neural foundation for the cross-domain transfer of musical experience to speech perception amid competing noise.
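Full mediation of this kind is often tested with the product-of-coefficients approach: an a-path (predictor to mediator), a b-path (mediator to outcome with the predictor controlled), and a direct effect c' that should shrink toward zero. A minimal sketch with synthetic data; names and effect sizes are assumptions, not the study's values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 50                                        # hypothetical sample
fa = rng.normal(0.45, 0.05, n)                # right direct-AF fractional anisotropy
stg = 2.0 * fa + rng.normal(0, 0.05, n)       # right STG activity (mediator)
sin_acc = 0.5 * stg + rng.normal(0, 0.05, n)  # SIN accuracy (outcome)

# Product-of-coefficients mediation: a = FA -> STG, b = STG -> SIN (FA controlled)
a = sm.OLS(stg, sm.add_constant(fa)).fit().params[1]
fit_b = sm.OLS(sin_acc, sm.add_constant(np.column_stack([stg, fa]))).fit()
b = fit_b.params[1]
c_prime = fit_b.params[2]   # direct effect of FA, near zero under full mediation
print(f"indirect effect a*b = {a * b:.3f}, direct effect c' = {c_prime:.3f}")
```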
Project description: Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of the talker, which is of particular value to hearing-impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility; how these factors affect the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or the intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech the visual speech benefit was unaffected by the number of background talkers, whereas for vocoded speech a larger benefit was found when there was only one background talker. Experiment 2 showed that the visual speech benefit depended on the audio intelligibility of the talker and increased as intelligibility decreased; degrading the speech by vocoding yielded an even greater benefit from visual speech information. A single "independent noise" signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, as with audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.
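One common formalization of an "independent noise" model combines unimodal sensitivities in quadrature, d'_AV = sqrt(d'_A^2 + d'_V^2); whether this matches the paper's exact model is an assumption. A minimal sketch, treating scores as 2AFC proportions correct:

```python
import numpy as np
from scipy.stats import norm

def pc_to_dprime(pc):
    """Convert 2AFC proportion correct to d' (one standard mapping)."""
    return np.sqrt(2) * norm.ppf(pc)

def predicted_av(pc_a, pc_v):
    """Independent-noise prediction: channel d' values combine in quadrature."""
    d_av = np.hypot(pc_to_dprime(pc_a), pc_to_dprime(pc_v))
    return norm.cdf(d_av / np.sqrt(2))

# Hypothetical unimodal scores: audio-only 65% correct, visual-only 60% correct
print(f"predicted AV proportion correct: {predicted_av(0.65, 0.60):.3f}")
```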
Project description: Speech processing is slower and less accurate when listeners encounter speech from multiple talkers than from one continuous talker. However, interference from multiple talkers has been investigated only with immediate speech recognition or long-term memory recognition tasks. These tasks reveal opposite effects of speech processing time on speech recognition: while fast processing of multi-talker speech impedes immediate recognition, it also results in more abstract, less talker-specific long-term memories for speech. Here, we investigated whether and how processing multi-talker speech disrupts working memory maintenance, an intermediate stage between perceptual recognition and long-term memory. In a digit sequence recall task, listeners encoded seven-digit sequences and recalled them after a 5-s delay. Sequences were spoken by either a single talker or multiple talkers at one of three presentation rates (0-, 200-, and 500-ms inter-digit intervals). Listeners' recall was slower and less accurate for sequences spoken by multiple talkers than by a single talker, and, especially at the fastest presentation rate, listeners were less efficient when recalling sequences spoken by multiple talkers. Our results reveal that talker-specificity effects on speech working memory are most prominent when listeners must rapidly encode speech, suggesting that, like immediate speech recognition, working memory for speech is susceptible to interference from variability across talkers. While many studies ascribe effects of talker variability to the need to calibrate perception to talker-specific acoustics, these results are also consistent with the idea that a sudden change of talkers disrupts attentional focus, interfering with efficient working-memory processing.
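Recall "efficiency" claims of this kind are often operationalized with the inverse efficiency score (IES), which folds speed and accuracy into one number. A minimal sketch; the condition means are hypothetical, and this may not be the paper's exact measure:

```python
import numpy as np

def inverse_efficiency(rt_ms, accuracy):
    """Inverse efficiency score (IES): mean correct RT divided by accuracy.

    Higher IES = less efficient; one common way to combine speed and
    accuracy into a single recall-efficiency measure.
    """
    return np.mean(rt_ms) / accuracy

# Hypothetical condition means: multi-talker recall is slower AND less accurate
print(inverse_efficiency(rt_ms=[950, 1010, 980], accuracy=0.78))  # multi-talker
print(inverse_efficiency(rt_ms=[840, 870, 855], accuracy=0.90))   # single talker
```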