Recognition of natural scenes from global properties: seeing the forest without representing the trees.
ABSTRACT: Human observers are able to rapidly and accurately categorize natural scenes, but the representation mediating this feat is still unknown. Here we propose a framework of rapid scene categorization that does not segment a scene into objects and instead uses a vocabulary of global, ecological properties that describe spatial and functional aspects of scene space (such as navigability or mean depth). In Experiment 1, we obtained ground truth rankings on global properties for use in Experiments 2-4. To what extent do human observers use global property information when rapidly categorizing natural scenes? In Experiment 2, we found that global property resemblance was a strong predictor of both false alarm rates and reaction times in a rapid scene categorization experiment. To what extent is global property information alone a sufficient predictor of rapid natural scene categorization? In Experiment 3, we found that the performance of a classifier representing only these properties is indistinguishable from human performance in a rapid scene categorization task in terms of both accuracy and false alarms. To what extent is this high predictability unique to a global property representation? In Experiment 4, we compared two models that represent scene object information to human categorization performance and found that these models had lower fidelity at representing the patterns of performance than the global property model. These results provide support for the hypothesis that rapid categorization of natural scenes may not be mediated primarily though objects and parts, but also through global properties of structure and affordance.
Project description:Observers can rapidly perform a variety of visual tasks such as categorizing a scene as open, as outdoor, or as a beach. Although we know that different tasks are typically associated with systematic differences in behavioral responses, to date, little is known about the underlying mechanisms. Here, we implemented a single integrated paradigm that links perceptual processes with categorization processes. Using a large image database of natural scenes, we trained machine-learning classifiers to derive quantitative measures of task-specific perceptual discriminability based on the distance between individual images and different categorization boundaries. We showed that the resulting discriminability measure accurately predicts variations in behavioral responses across categorization tasks and stimulus sets. We further used the model to design an experiment, which challenged previous interpretations of the so-called "superordinate advantage." Overall, our study suggests that observed differences in behavioral responses across rapid categorization tasks reflect natural variations in perceptual discriminability.
Project description:Humans can categorize complex natural scenes quickly and accurately. Which scene properties enable people to do this with such apparent ease? We extracted structural properties of contours (orientation, length, curvature) and contour junctions (types and angles) from line drawings of natural scenes. All of these properties contain information about scene categories that can be exploited computationally. However, when we compared error patterns from computational scene categorization with those from a six-alternative forced-choice scene-categorization experiment, we found that only junctions and curvature made significant contributions to human behavior. To further test the critical role of these properties, we perturbed junctions in line drawings by randomly shifting contours and found a significant decrease in human categorization accuracy. We conclude that scene categorization by humans relies on curvature as well as the same nonaccidental junction properties used for object recognition. These properties correspond to the visual features represented in area V2.
Project description:Categorization performance is a popular metric of scene recognition and understanding in behavioral and computational research. However, categorical constructs and their labels can be somewhat arbitrary. Derived from exhaustive vocabularies of place names (e.g., Deng et al., 2009), or the judgements of small groups of researchers (e.g., Fei-Fei, Iyer, Koch, & Perona, 2007), these categories may not correspond with human-preferred taxonomies. Here, we propose clustering by increasing the rand index via coordinate ascent (CIRCA): an unsupervised, data-driven clustering method for deriving ground-truth scene categories. In Experiment 1, human participants organized 80 stereoscopic images of outdoor scenes from the Southampton-York Natural Scenes (SYNS) dataset (Adams et al., 2016) into discrete categories. In separate tasks, images were grouped according to i) semantic content, ii) three-dimensional spatial structure, or iii) two-dimensional image appearance. Participants provided text labels for each group. Using the CIRCA method, we determined the most representative category structure and then derived category labels for each task/dimension. In Experiment 2, we found that these categories generalized well to a larger set of SYNS images, and new observers. In Experiment 3, we tested the relationship between our category systems and the spatial envelope model (Oliva & Torralba, 2001). Finally, in Experiment 4, we validated CIRCA on a larger, independent dataset of same-different category judgements. The derived category systems outperformed the SUN taxonomy (Xiao, Hays, Ehinger, Oliva, & Torralba, 2010) and an alternative clustering method (Greene, 2019). In summary, we believe this novel categorization method can be applied to a wide range of datasets to derive optimal categorical groupings and labels from psychophysical judgements of stimulus similarity.
Project description:The processes underlying object recognition are fundamental for the understanding of visual perception. Humans can recognize many objects rapidly even in complex scenes, a task that still presents major challenges for computer vision systems. A common experimental demonstration of this ability is the rapid animal detection protocol, where human participants earliest responses to report the presence/absence of animals in natural scenes are observed at 250-270 ms latencies. One of the hypotheses to account for such speed is that people would not actually recognize an animal per se, but rather base their decision on global scene statistics. These global statistics (also referred to as spatial envelope or gist) have been shown to be computationally easy to process and could thus be used as a proxy for coarse object recognition. Here, using a saccadic choice task, which allows us to investigate a previously inaccessible temporal window of visual processing, we showed that animal - but not vehicle - detection clearly precedes scene categorization. This asynchrony is in addition validated by a late contextual modulation of animal detection, starting simultaneously with the availability of scene category. Interestingly, the advantage for animal over scene categorization is in opposition to the results of simulations using standard computational models. Taken together, these results challenge the idea that rapid animal detection might be based on early access of global scene statistics, and rather suggests a process based on the extraction of specific local complex features that might be hardwired in the visual system.
Project description:Colour discrimination has been widely studied in red-green (R-G) dichromats but the extent to which their colour constancy is affected remains unclear. This work estimated the extent of colour constancy for four normal trichromatic observers and seven R-G dichromats when viewing natural scenes under simulated daylight illuminants. Hyperspectral imaging data from natural scenes were used to generate the stimuli on a calibrated CRT display. In experiment 1, observers viewed a reference scene illuminated by daylight with a correlated colour temperature (CCT) of 6700K; observers then viewed sequentially two versions of the same scene, one illuminated by either a higher or lower CCT (condition 1, pure CCT change with constant luminance) or a higher or lower average luminance (condition 2, pure luminance change with a constant CCT). The observers' task was to identify the version of the scene that looked different from the reference scene. Thresholds for detecting a pure CCT change or a pure luminance change were estimated, and it was found that those for R-G dichromats were marginally higher than for normal trichromats regarding CCT. In experiment 2, observers viewed sequentially a reference scene and a comparison scene with a CCT change or a luminance change above threshold for each observer. The observers' task was to identify whether or not the change was an intensity change. No significant differences were found between the responses of normal trichromats and dichromats. These data suggest robust colour constancy mechanisms along daylight locus in R-G dichromacy.
Project description:In four experiments, we investigated how attention to local and global levels of hierarchical Navon figures affected the selection of diagnostic spatial scale information used in scene categorization. We explored this issue by asking observers to classify hybrid images (i.e., images that contain low spatial frequency (LSF) content of one image, and high spatial frequency (HSF) content from a second image) immediately following global and local Navon tasks. Hybrid images can be classified according to either their LSF, or HSF content; thus, making them ideal for investigating diagnostic spatial scale preference. Although observers were sensitive to both spatial scales (Experiment 1), they overwhelmingly preferred to classify hybrids based on LSF content (Experiment 2). In Experiment 3, we demonstrated that LSF based hybrid categorization was faster following global Navon tasks, suggesting that LSF processing associated with global Navon tasks primed the selection of LSFs in hybrid images. In Experiment 4, replicating Experiment 3 but suppressing the LSF information in Navon letters by contrast balancing the stimuli examined this hypothesis. Similar to Experiment 3, observers preferred to classify hybrids based on LSF content; however and in contrast, LSF based hybrid categorization was slower following global than local Navon tasks.
Project description:The visual system has an extraordinary capability to extract categorical information from complex natural scenes. For example, subjects are able to rapidly detect the presence of object categories such as animals or vehicles in new scenes that are presented very briefly. This is even true when subjects do not pay attention to the scenes and simultaneously perform an unrelated attentionally demanding task, a stark contrast to the capacity limitations predicted by most theories of visual attention. Here we show a neural basis for rapid natural scene categorization in the visual cortex, using functional magnetic resonance imaging and an object categorization task in which subjects detected the presence of people or cars in briefly presented natural scenes. The multi-voxel pattern of neural activity in the object-selective cortex evoked by the natural scenes contained information about the presence of the target category, even when the scenes were task-irrelevant and presented outside the focus of spatial attention. These findings indicate that the rapid detection of categorical information in natural scenes is mediated by a category-specific biasing mechanism in object-selective cortex that operates in parallel across the visual field, and biases information processing in favour of objects belonging to the target object category.
Project description:Can people categorize complex visual scenes unconsciously? The possibility of unconscious perception remains controversial. Here, we addressed this question using psychophysical methods applied to unmasked visual stimuli presented for extremely short durations (in the ?sec range) by means of a custom-built modern tachistoscope. Our experiment was composed of two phases. In the first phase, natural or urban scenes were either absent or present (for varying durations) on the tachistoscope screen, and participants were simply asked to evaluate their subjective perception using a 3-points scale (absence of stimulus, stimulus detection or stimulus identification). Participants' responses were tracked by means of two staircases. The first psychometric function aimed at defining participants' proportion of subjective detection responses (i.e., not having seen anything vs. having seen something without being able to identify it), while the second staircase tracked the proportion of subjective identification rates (i.e., being unaware of the stimulus' category vs. being aware of it). In the second phase, the same participants performed an objective categorization task in which they had to decide, on each trial, whether the image was a natural vs. an urban scene. A third staircase was used in this phase so as to build a psychometric curve reflecting the objective categorization performance of each participant. In this second phase, participants also rated their subjective perception of each stimulus on every trial, exactly as in the first phase of the experiment. Our main result is that objective categorization performance, here assumed to reflect the contribution of both conscious and unconscious trials, cannot be explained based exclusively on conscious trials. This clearly suggests that the categorization of complex visual scenes is possible even when participants report being unable to consciously perceive the contents of the stimulus.
Project description:Looking for objects within complex natural environments is a task everybody performs multiple times each day. In this study, we explore how the brain uses the typical composition of real-world environments to efficiently solve this task. We recorded fMRI activity while participants performed two different categorization tasks on natural scenes. In the object task, they indicated whether the scene contained a person or a car, while in the scene task, they indicated whether the scene depicted an urban or a rural environment. Critically, each scene was presented in an "intact" way, preserving its coherent structure, or in a "jumbled" way, with information swapped across quadrants. In both tasks, participants' categorization was more accurate and faster for intact scenes. These behavioral benefits were accompanied by stronger responses to intact than to jumbled scenes across high-level visual cortex. To track the amount of object information in visual cortex, we correlated multi-voxel response patterns during the two categorization tasks with response patterns evoked by people and cars in isolation. We found that object information in object- and body-selective cortex was enhanced when the object was embedded in an intact, rather than a jumbled scene. However, this enhancement was only found in the object task: When participants instead categorized the scenes, object information did not differ between intact and jumbled scenes. Together, these results indicate that coherent scene structure facilitates the extraction of object information in a task-dependent way, suggesting that interactions between the object and scene processing pathways adaptively support behavioral goals.
Project description:Observers can store thousands of object images in visual long-term memory with high fidelity, but the fidelity of scene representations in long-term memory is not known. Here, we probed scene-representation fidelity by varying the number of studied exemplars in different scene categories and testing memory using exemplar-level foils. Observers viewed thousands of scenes over 5.5 hr and then completed a series of forced-choice tests. Memory performance was high, even with up to 64 scenes from the same category in memory. Moreover, there was only a 2% decrease in accuracy for each doubling of the number of studied scene exemplars. Surprisingly, this degree of categorical interference was similar to the degree previously demonstrated for object memory. Thus, although scenes have often been defined as a superset of objects, our results suggest that scenes and objects may be entities at a similar level of abstraction in visual long-term memory.