Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images.
ABSTRACT: Numerous factors have been reported to underlie the representation of complex images in high-level human visual cortex, including categories (e.g. faces, objects, scenes), animacy, and real-world size, but the extent to which this organization reflects behavioral judgments of real-world stimuli is unclear. Here, we compared representations derived from explicit behavioral similarity judgments and ultra-high field (7T) fMRI of human visual cortex for multiple exemplars of a diverse set of naturalistic images from 48 object and scene categories. While there was a significant correlation between similarity judgments and fMRI responses, there were striking differences between the two representational spaces. Behavioral judgements primarily revealed a coarse division between man-made (including humans) and natural (including animals) images, with clear groupings of conceptually-related categories (e.g. transportation, animals), while these conceptual groupings were largely absent in the fMRI representations. Instead, fMRI responses primarily seemed to reflect a separation of both human and non-human faces/bodies from all other categories. Further, comparison of the behavioral and fMRI representational spaces with those derived from the layers of a deep neural network (DNN) showed a strong correspondence with behavior in the top-most layer and with fMRI in the mid-level layers. These results suggest a complex relationship between localized responses in high-level visual cortex and behavioral similarity judgments - each domain reflects different properties of the images, and responses in high-level visual cortex may correspond to intermediate stages of processing between basic visual features and the conceptual categories that dominate the behavioral response.
Project description:The unique way in which each of us perceives the world must arise from our brain representations. If brain imaging could reveal an individual's unique mental representation, it could help us understand the biological substrate of our individual experiential worlds in mental health and disease. However, imaging studies of object vision have focused on commonalities between individuals rather than individual differences and on category averages rather than representations of particular objects. Here we investigate the individually unique component of brain representations of particular objects with functional MRI (fMRI). Subjects were presented with unfamiliar and personally meaningful object images while we measured their brain activity on two separate days. We characterized the representational geometry by the dissimilarity matrix of activity patterns elicited by particular object images. The representational geometry remained stable across scanning days and was unique in each individual in early visual cortex and human inferior temporal cortex (hIT). The hIT representation predicted perceived similarity as reflected in dissimilarity judgments. Importantly, hIT predicted the individually unique component of the judgments when the objects were personally meaningful. Our results suggest that hIT brain representational idiosyncrasies accessible to fMRI are expressed in an individual's perceptual judgments. The unique way each of us perceives the world thus might reflect the individually unique representation in high-level visual areas.
Project description:The degree to which we perceive real-world objects as similar or dissimilar structures our perception and guides categorization behavior. Here, we investigated the neural representations enabling perceived similarity using behavioral judgments, fMRI and MEG. As different object dimensions co-occur and partly correlate, to understand the relationship between perceived similarity and brain activity it is necessary to assess the unique role of multiple object dimensions. We thus behaviorally assessed perceived object similarity in relation to shape, function, color and background. We then used representational similarity analyses to relate these behavioral judgments to brain activity. We observed a link between each object dimension and representations in visual cortex. These representations emerged rapidly within 200?ms of stimulus onset. Assessing the unique role of each object dimension revealed partly overlapping and distributed representations: while color-related representations distinctly preceded shape-related representations both in the processing hierarchy of the ventral visual pathway and in time, several dimensions were linked to high-level ventral visual cortex. Further analysis singled out the shape dimension as neither fully accounted for by supra-category membership, nor a deep neural network trained on object categorization. Together our results comprehensively characterize the relationship between perceived similarity of key object dimensions and neural activity.
Project description:Object similarity, in brain representations and conscious perception, must reflect a combination of the visual appearance of the objects on the one hand and the categories the objects belong to on the other. Indeed, visual object features and category membership have each been shown to contribute to the object representation in human inferior temporal (IT) cortex, as well as to object-similarity judgments. However, the explanatory power of features and categories has not been directly compared. Here, we investigate whether the IT object representation and similarity judgments are best explained by a categorical or a feature-based model. We use rich models (>100 dimensions) generated by human observers for a set of 96 real-world object images. The categorical model consists of a hierarchically nested set of category labels (such as "human", "mammal", and "animal"). The feature-based model includes both object parts (such as "eye", "tail", and "handle") and other descriptive features (such as "circular", "green", and "stubbly"). We used non-negative least squares to fit the models to the brain representations (estimated from functional magnetic resonance imaging data) and to similarity judgments. Model performance was estimated on held-out images not used in fitting. Both models explained significant variance in IT and the amounts explained were not significantly different. The combined model did not explain significant additional IT variance, suggesting that it is the shared model variance (features correlated with categories, categories correlated with features) that best explains IT. The similarity judgments were almost fully explained by the categorical model, which explained significantly more variance than the feature-based model. The combined model did not explain significant additional variance in the similarity judgments. Our findings suggest that IT uses features that help to distinguish categories as stepping stones toward a semantic representation. Similarity judgments contain additional categorical variance that is not explained by visual features, reflecting a higher-level more purely semantic representation.
Project description:Recent advances in Deep convolutional Neural Networks (DNNs) have enabled unprecedentedly accurate computational models of brain representations, and present an exciting opportunity to model diverse cognitive functions. State-of-the-art DNNs achieve human-level performance on object categorisation, but it is unclear how well they capture human behavior on complex cognitive tasks. Recent reports suggest that DNNs can explain significant variance in one such task, judging object similarity. Here, we extend these findings by replicating them for a rich set of object images, comparing performance across layers within two DNNs of different depths, and examining how the DNNs' performance compares to that of non-computational "conceptual" models. Human observers performed similarity judgments for a set of 92 images of real-world objects. Representations of the same images were obtained in each of the layers of two DNNs of different depths (8-layer AlexNet and 16-layer VGG-16). To create conceptual models, other human observers generated visual-feature labels (e.g., "eye") and category labels (e.g., "animal") for the same image set. Feature labels were divided into parts, colors, textures and contours, while category labels were divided into subordinate, basic, and superordinate categories. We fitted models derived from the features, categories, and from each layer of each DNN to the similarity judgments, using representational similarity analysis to evaluate model performance. In both DNNs, similarity within the last layer explains most of the explainable variance in human similarity judgments. The last layer outperforms almost all feature-based models. Late and mid-level layers outperform some but not all feature-based models. Importantly, categorical models predict similarity judgments significantly better than any DNN layer. Our results provide further evidence for commonalities between DNNs and brain representations. Models derived from visual features other than object parts perform relatively poorly, perhaps because DNNs more comprehensively capture the colors, textures and contours which matter to human object perception. However, categorical models outperform DNNs, suggesting that further work may be needed to bring high-level semantic representations in DNNs closer to those extracted by humans. Modern DNNs explain similarity judgments remarkably well considering they were not trained on this task, and are promising models for many aspects of human cognition.
Project description:A comprehensive picture of object processing in the human brain requires combining both spatial and temporal information about brain activity. Here we acquired human magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) responses to 92 object images. Multivariate pattern classification applied to MEG revealed the time course of object processing: whereas individual images were discriminated by visual representations early, ordinate and superordinate category levels emerged relatively late. Using representational similarity analysis, we combined human fMRI and MEG to show content-specific correspondence between early MEG responses and primary visual cortex (V1), and later MEG responses and inferior temporal (IT) cortex. We identified transient and persistent neural activities during object processing with sources in V1 and IT. Finally, we correlated human MEG signals to single-unit responses in monkey IT. Together, our findings provide an integrated space- and time-resolved view of human object categorization during the first few hundred milliseconds of vision.
Project description:Deep neural networks provide the current best models of visual information processing in the primate brain. Drawing on work from computer vision, the most commonly used networks are pretrained on data from the ImageNet Large Scale Visual Recognition Challenge. This dataset comprises images from 1,000 categories, selected to provide a challenging testbed for automated visual object recognition systems. Moving beyond this common practice, we here introduce <i>ecoset</i>, a collection of >1.5 million images from 565 basic-level categories selected to better capture the distribution of objects relevant to humans. Ecoset categories were chosen to be both frequent in linguistic usage and concrete, thereby mirroring important physical objects in the world. We test the effects of training on this ecologically more valid dataset using multiple instances of two neural network architectures: AlexNet and vNet, a novel architecture designed to mimic the progressive increase in receptive field sizes along the human ventral stream. We show that training on ecoset leads to significant improvements in predicting representations in human higher-level visual cortex and perceptual judgments, surpassing the previous state of the art. Significant and highly consistent benefits are demonstrated for both architectures on two separate functional magnetic resonance imaging (fMRI) datasets and behavioral data, jointly covering responses to 1,292 visual stimuli from a wide variety of object categories. These results suggest that computational visual neuroscience may take better advantage of the deep learning framework by using image sets that reflect the human perceptual and cognitive experience. Ecoset and trained network models are openly available to the research community.
Project description:Intrinsic cortical dynamics are thought to underlie trial-to-trial variability of visually evoked responses in animal models. Understanding their function in the context of sensory processing and representation is a major current challenge. Here we report that intrinsic cortical dynamics strongly affect the representational geometry of a brain region, as reflected in response-pattern dissimilarities, and exaggerate the similarity of representations between brain regions. We characterized the representations in several human visual areas by representational dissimilarity matrices (RDMs) constructed from fMRI response-patterns for natural image stimuli. The RDMs of different visual areas were highly similar when the response-patterns were estimated on the basis of the same trials (sharing intrinsic cortical dynamics), and quite distinct when patterns were estimated on the basis of separate trials (sharing only the stimulus-driven component). We show that the greater similarity of the representational geometries can be explained by coherent fluctuations of regional-mean activation within visual cortex, reflecting intrinsic dynamics. Using separate trials to study stimulus-driven representations revealed clearer distinctions between the representational geometries: a Gabor wavelet pyramid model explained representational geometry in visual areas V1-3 and a categorical animate-inanimate model in the object-responsive lateral occipital cortex.
Project description:Studies of the primate visual system have begun to test a wide range of complex computational object-vision models. Realistic models have many parameters, which in practice cannot be fitted using the limited amounts of brain-activity data typically available. Task performance optimization (e.g. using backpropagation to train neural networks) provides major constraints for fitting parameters and discovering nonlinear representational features appropriate for the task (e.g. object classification). Model representations can be compared to brain representations in terms of the representational dissimilarities they predict for an image set. This method, called representational similarity analysis (RSA), enables us to test the representational feature space as is (fixed RSA) or to fit a linear transformation that mixes the nonlinear model features so as to best explain a cortical area's representational space (mixed RSA). Like voxel/population-receptive-field modelling, mixed RSA uses a training set (different stimuli) to fit one weight per model feature and response channel (voxels here), so as to best predict the response profile across images for each response channel. We analysed response patterns elicited by natural images, which were measured with functional magnetic resonance imaging (fMRI). We found that early visual areas were best accounted for by shallow models, such as a Gabor wavelet pyramid (GWP). The GWP model performed similarly with and without mixing, suggesting that the original features already approximated the representational space, obviating the need for mixing. However, a higher ventral-stream visual representation (lateral occipital region) was best explained by the higher layers of a deep convolutional network and mixing of its feature set was essential for this model to explain the representation. We suspect that mixing was essential because the convolutional network had been trained to discriminate a set of 1000 categories, whose frequencies in the training set did not match their frequencies in natural experience or their behavioural importance. The latter factors might determine the representational prominence of semantic dimensions in higher-level ventral-stream areas. Our results demonstrate the benefits of testing both the specific representational hypothesis expressed by a model's original feature space and the hypothesis space generated by linear transformations of that feature space.
Project description:People interact with entities in the environment in distinct and categorizable ways (e.g., <i>kicking</i> is <i>making contact with foot</i>). We can recognize these action categories across variations in actors, objects, and settings; moreover, we can recognize them from both dynamic and static visual input. However, the neural systems that support action recognition across these perceptual differences are unclear. Here, we used multivoxel pattern analysis of fMRI data to identify brain regions that support visual action categorization in a format-independent way. Human participants were scanned while viewing eight categories of interactions (e.g., <i>pulling</i>) depicted in two visual formats: (1) visually controlled videos of two interacting actors and (2) visually varied photographs selected from the internet involving different actors, objects, and settings. Action category was decodable across visual formats in bilateral inferior parietal, bilateral occipitotemporal, left premotor, and left middle frontal cortex. In most of these regions, the representational similarity of action categories was consistent across subjects and visual formats, a property that can contribute to a common understanding of actions among individuals. These results suggest that the identified brain regions support action category codes that are important for action recognition and action understanding.<b>SIGNIFICANCE STATEMENT</b> Humans tend to interpret the observed actions of others in terms of categories that are invariant to incidental features: whether a girl pushes a boy or a button and whether we see it in real-time or in a single snapshot, it is still <i>pushing</i> Here, we investigated the brain systems that facilitate the visual recognition of these action categories across such differences. Using fMRI, we identified several areas of parietal, occipitotemporal, and frontal cortex that exhibit action category codes that are similar across viewing of dynamic videos and still photographs. Our results provide strong evidence for the involvement of these brain regions in recognizing the way that people interact physically with objects and other people.
Project description:Every human cognitive function, such as visual object recognition, is realized in a complex spatio-temporal activity pattern in the brain. Current brain imaging techniques in isolation cannot resolve the brain's spatio-temporal dynamics, because they provide either high spatial or temporal resolution but not both. To overcome this limitation, we developed an integration approach that uses representational similarities to combine measurements of magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) to yield a spatially and temporally integrated characterization of neuronal activation. Applying this approach to 2 independent MEG-fMRI data sets, we observed that neural activity first emerged in the occipital pole at 50-80 ms, before spreading rapidly and progressively in the anterior direction along the ventral and dorsal visual streams. Further region-of-interest analyses established that dorsal and ventral regions showed MEG-fMRI correspondence in representations later than early visual cortex. Together, these results provide a novel and comprehensive, spatio-temporally resolved view of the rapid neural dynamics during the first few hundred milliseconds of object vision. They further demonstrate the feasibility of spatially unbiased representational similarity-based fusion of MEG and fMRI, promising new insights into how the brain computes complex cognitive functions.