Project description: Background: We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues. Methodology: Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains. Conclusions: Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.
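The two computational steps described in the preceding abstract (keeping only the most frequent terms, then training a SOM on the reduced document vectors) can be sketched on toy data as follows. This is a minimal illustration, not the study's pipeline: the function names (`keep_top_terms`, `train_som`, `best_matching_unit`), the learning-rate and neighborhood schedules, and the random toy matrix are all assumptions, and a map with tens of thousands of neurons would need the parallelized training the abstract mentions.

```python
import numpy as np

def keep_top_terms(doc_term, fraction=0.10):
    """Keep the terms (columns) whose document frequency is in the top `fraction`."""
    doc_freq = (doc_term > 0).sum(axis=0)           # number of documents per term
    n_keep = max(1, int(round(fraction * doc_term.shape[1])))
    top = np.argsort(doc_freq)[::-1][:n_keep]       # indices of most frequent terms
    top = np.sort(top)                              # preserve original column order
    return doc_term[:, top], top

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Online SOM training on a rows x cols rectangular grid (hypothetical schedule)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            x = data[i]
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighborhood radius
            bmu = best_matching_unit(weights, x)
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)   # squared grid distance to BMU
            h = np.exp(-d2 / (2.0 * sigma ** 2))         # Gaussian neighborhood
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

# Toy demo: 200 "documents" over 50 "terms" (random binary indexing, illustrative only).
rng = np.random.default_rng(1)
toy = (rng.random((200, 50)) < 0.1).astype(float)
reduced, kept_terms = keep_top_terms(toy, fraction=0.10)
som = train_som(reduced, rows=4, cols=4, epochs=5)
print(reduced.shape, som.shape, best_matching_unit(som, reduced[0]))
```

After training, each document is assigned to its best-matching neuron, and it is this two-dimensional neuron grid that a study like the one above would hand to GIS tooling for cartographic rendering.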
Project description:As the basis of animals' natal homing behavior, path integration can continuously provide current position information relative to the initial position. Some neurons in freely moving animals' brains can encode current positions and surrounding environments by special firing patterns. Studies show that neurons such as grid cells (GCs) in the hippocampus of animals' brains are related to path integration. They might encode the coordinates of the animal's current position in the same way as the residue number system (RNS), which is based on the Chinese remainder theorem (CRT). Hence, in order to provide vehicles with a bionic position estimation method, we propose a model that decodes the GCs' encoded information using an improved version of the traditional self-organizing map (SOM) and makes full use of the GCs' firing characteristics. The details of the model are discussed in this paper. In addition, the model is realized by computer simulation, and its performance is analyzed under different conditions. Simulation results indicate that the proposed position estimation model is effective and stable.
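The residue number system idea mentioned in the preceding abstract can be illustrated with a one-dimensional toy example: a position is stored only as its remainders modulo several pairwise coprime "grid periods", and the Chinese remainder theorem recovers it uniquely within the product of the periods. This sketch is a simplification under stated assumptions: integer periods, noise-free residues, and the function names are all hypothetical, and it uses closed-form CRT decoding rather than the SOM-based decoder the abstract proposes.

```python
from math import gcd, prod

def encode_position(x, periods):
    """Represent integer position x by its residue modulo each grid period."""
    return [x % p for p in periods]

def decode_position(residues, periods):
    """Recover x in [0, prod(periods)) from its residues via the CRT."""
    for i in range(len(periods)):
        for j in range(i + 1, len(periods)):
            if gcd(periods[i], periods[j]) != 1:
                raise ValueError("periods must be pairwise coprime")
    total = prod(periods)
    x = 0
    for r, p in zip(residues, periods):
        partial = total // p
        # pow(partial, -1, p) is the modular inverse of partial modulo p (Python 3.8+)
        x += r * partial * pow(partial, -1, p)
    return x % total

periods = (3, 5, 7)                       # hypothetical grid periods
code = encode_position(52, periods)       # residues of position 52
print(code, decode_position(code, periods))
```

With periods 3, 5 and 7, three small residues suffice to represent any position from 0 to 104, which is the capacity argument usually made for RNS-like codes.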
Project description:Antimicrobial peptides (AMPs) show remarkable selectivity toward lipid membranes and possess promising antibiotic potential. Their modes of action are diverse and not fully understood, and innovative peptide design strategies are needed to generate AMPs with improved properties. We present a de novo peptide design approach that resulted in new AMPs possessing low-nanomolar membranolytic activities. Thermal analysis revealed an entropy-driven mechanism of action. The study demonstrates sustained potential of advanced computational methods for designing peptides with the desired activity.
Project description:The Korean consumer credit panel offers a well-organized set of microdata representing various characteristics of individual borrowers. To overcome the difficulty of fragmented microdata details, we construct a cluster of Korean consumers' credit, to develop a self-organizing map that visualizes individuals' characteristics along two dimensions. The result of cluster analysis reveals that most borrowers belong to one large cluster representing diligent borrowers who honor their loan payments. Conversely, several small clusters that represent borrowers with high default probability are identified, and we also found that these borrowers' characteristics vary. No significant change is found in the structure of the cluster, even when the aggregate amount of consumer credit is increased. Moreover, the expansionary monetary policy did not change the quantitative structure of household debt in Korea. Supplementary information: The online version contains supplementary material available at 10.1007/s00181-021-02120-5.
Project description:In the sea of data generated daily, unlabeled samples greatly outnumber labeled ones. This is due to the fact that, in many application areas, labels are scarce or hard to obtain. In addition, unlabeled samples might belong to new classes that are not available in the label set associated with data. In this context, we propose A3SOM, an abstained explainable semi-supervised neural network that associates a self-organizing map to dense layers in order to classify samples. Abstained classification enables the detection of new classes and class overlaps. The use of a self-organizing map in A3SOM allows integrated visualization and makes the model explainable. Along with describing our approach, this paper shows that the method is competitive with other classifiers and demonstrates the benefits of including abstention rules. A use case is presented on breast cancer subtype classification and discovery to show the relevance of our method in real-world medical problems.
Project description:Recovering a system's underlying structure from its historical records (also called structure mining) is essential to making valid inferences about that system's behavior. For example, making reliable predictions about system failures based on maintenance work-order data requires determining how concepts described within the work order are related. Obtaining such structural information is challenging, requiring system understanding, synthesis, and representation design. This is often either too difficult or too time-consuming to produce. Consequently, a common approach to quickly eliciting tacit structural knowledge from experts is to gather uncontrolled keywords as record labels, i.e., "tags." One can then map those tags to concepts within the structure and quantitatively infer relationships between them. Existing models of tag similarity tend to either depend on correlation strength (e.g., overall co-occurrence frequencies), or on conditional strength (e.g., tag sequence probabilities). A key difficulty in applying either model is understanding under what conditions one is better than the other for overall structure recovery. In this paper, we investigate the core assumptions and implications of these two classes of similarity measures on structure recovery tasks. Then, using lessons from this characterization, we borrow from recent psychology literature on semantic fluency tasks to construct a tag similarity measure that emulates how humans recall tags from memory. We show through empirical testing that this method combines strengths of both common modeling paradigms. We also demonstrate its potential as a pre-processor for structure mining tasks via a case study in semi-supervised learning on real excavator maintenance work-orders.
Project description:To reach a better understanding of the spatial variability of water quality in the Lower Mekong Basin (LMB), the Self-Organizing Map (SOM) was used to classify 117 monitoring sites, and hotspots of pollution within the basin were identified according to water quality indicators and US-EPA guidelines. Four different clusters were identified based on their similar physicochemical characteristics. The majority of sites in the upper (Laos and Thailand) and middle (Cambodia) parts of the basin were grouped in two clusters, considered as good quality water with high DO and low nutrient levels. The other two clusters were mostly composed of sites in the Mekong delta (Vietnam) and a few sites in upstream tributaries (i.e., northwestern Thailand, Tonle Sap Lake, and swamps close to Vientiane), known for moderate to poor quality of water and characterized by high nutrient and dissolved solid levels. Overall, we found that the water in the mainstream was less polluted than its tributaries; eutrophication and salinity could be key factors affecting water quality in the LMB. Moreover, the seasonal variation of water quality seemed to be less marked than the spatial variation occurring along the longitudinal gradient of the Mekong River. Significant degradations were mainly associated with human disturbance and were particularly apparent in sites distributed along the man-made canals in the Vietnam delta, where population growth and agricultural development are intensive.
Project description:In the present study, a global presence/absence dataset including 2486 scale insect species in 157 countries was extracted to assess the establishment risk of potential invasive species based on a self-organizing map (SOM). According to the similarities in species assemblages, a risk list of scale insects for each country was generated. Meanwhile, all countries in the dataset were divided into five clusters, each of which has high similarities of species assemblages. Countries in the same neuron of the SOM output may pose the greatest threats to each other as sources of potential invasive scale insect species and therefore require more attention from quarantine departments. In addition, normalized ζi values were used to measure the uncertainty of the SOM output. In total, 9 out of 63 neurons obtained high uncertainty with very low species counts, indicating that more investigation of scale insects should be undertaken in some parts of Africa, Asia and Northern Europe.
Project description:The evaluation of air pollution is a critical concern due to its potentially severe impacts on human health. Currently, vast quantities of data are collected at high frequencies, and researchers must navigate multiannual, multisite datasets to identify possible pollutant sources while addressing the presence of noise and sparse missing data. To address this challenge, multivariate data analysis is widely used, with an increasing interest in neural networks and deep learning networks along with well-established chemometrics methods and receptor models. Here, we report a combined approach involving the Self-Organizing Map (SOM) algorithm, Hierarchical Clustering Analysis (HCA), and Positive Matrix Factorization (PMF) to disentangle multiannual, multisite data in a single elaboration without previously separating the sites and years. The approach proved to be valid, allowing us to detect the site peculiarities in terms of pollutant sources, the variation in pollutant profiles over the years, and the outliers, affording a reliable interpretation.
Project description:A defective interfering particle (DIP) in the context of influenza A virus is a virion with a significantly shortened RNA segment substituting one of eight full-length parent RNA segments, such that it is preferentially amplified. Hence, a cell co-infected with DIPs will produce mainly DIPs, suppressing infectious virus yields and affecting infection kinetics. Unfortunately, the quantification of DIPs contained in a sample is difficult because they are indistinguishable from standard virus (STV). Using a mathematical model, we investigated the standard experimental method for counting DIPs based on the reduction in STV yield (Bellett & Cooper, 1959, Journal of General Microbiology 21, 498-509 (doi:10.1099/00221287-21-3-498)). We found the method is valid for counting DIPs provided that: (i) an STV-infected cell's co-infection window is approximately half its eclipse phase (it blocks infection by other virions before it begins producing progeny virions), (ii) a cell co-infected by STV and DIP produces less than 1 STV per 1000 DIPs and (iii) a high MOI of STV stock (more than 4 PFU per cell) is added to perform the assay. Prior work makes no mention of these criteria such that the method has been applied incorrectly in several publications discussed herein. We determined influenza A virus meets these criteria, making the method suitable for counting influenza A DIPs.
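The reasoning behind a yield-reduction count, as described in the preceding abstract, can be made concrete with a Poisson argument. The sketch below rests on assumptions that are not taken from the abstract's wording: DIPs are assumed to distribute over cells following a Poisson distribution, every cell is assumed to be STV-infected (high MOI), and a co-infected cell is assumed to produce no STV at all. Under those assumptions the surviving STV yield fraction equals the probability that a cell received no DIP, exp(-m), so the DIP multiplicity is m = -ln(fraction). The function names and numbers are hypothetical, and this is not claimed to be the exact procedure of the cited work.

```python
import math

def dip_moi_from_yield(stv_yield_with_dips, stv_yield_control):
    """Estimate DIPs per cell from the reduction in STV yield (Poisson assumption)."""
    fraction = stv_yield_with_dips / stv_yield_control
    if not 0.0 < fraction <= 1.0:
        raise ValueError("yield fraction must be in (0, 1]")
    # P(cell receives zero DIPs) = exp(-m), hence m = -ln(fraction)
    return -math.log(fraction)

def dip_titer(moi, n_cells, inoculum_volume_ml):
    """Convert DIPs per cell into DIPs per mL of the tested sample."""
    return moi * n_cells / inoculum_volume_ml

# Hypothetical assay: the control yields 1e8 PFU, the DIP-containing sample 2e7 PFU.
m = dip_moi_from_yield(2e7, 1e8)
print(m, dip_titer(m, n_cells=1e6, inoculum_volume_ml=0.5))
```

The three validity criteria listed in the abstract map onto these assumptions: criterion (ii) is what justifies treating co-infected cells as producing essentially no STV, and criterion (iii) is what justifies treating every cell as STV-infected.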