Project description:As newborn screening programs transition from paper-based data exchange toward automated, electronic methods, significant data exchange challenges must be overcome. This article outlines a data model that maps newborn screening data elements associated with patient demographic information, birthing facilities, laboratories, result reporting, and follow-up care to the LOINC, SNOMED CT, ICD-10-CM, and HL7 healthcare standards. The described framework lays the foundation for standardized electronic data exchange across newborn screening programs, leading to greater data interoperability. Use of this model can accelerate the adoption of electronic data exchange between healthcare providers and newborn screening programs, which would ultimately improve health outcomes for all newborns.
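To make the shape of such a mapping concrete, here is a minimal Python sketch of one way a mapped data element could be represented. The class, element names, field assignments, and code values are illustrative assumptions, not the article's actual model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScreeningDataElement:
    """One newborn screening data element mapped onto standard vocabularies."""
    name: str                        # local element name used by a program
    loinc: Optional[str] = None      # LOINC code for observations/results
    snomed: Optional[str] = None     # SNOMED CT code for clinical concepts
    icd10: Optional[str] = None      # ICD-10-CM code for diagnoses
    hl7_field: Optional[str] = None  # HL7 v2 field carrying the value

# Hypothetical entries for illustration only; the article's model defines
# the authoritative element list and code assignments.
mapping = [
    ScreeningDataElement("birth_weight", loinc="8339-4", hl7_field="OBX-5"),
    ScreeningDataElement("birthing_facility"),  # codes to be assigned
]
```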
Project description:Background: There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data are highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5, and Random Forest) to a variety of bioassay datasets. Results: Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow due to the lack of cross-referencing mentioned above. Across the six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance. Conclusions: Understandably, pharmaceutical data is hard to obtain. However, it would benefit both the pharmaceutical industry and academics for curated primary screening data and the corresponding confirmatory data to be provided. Two benefits could be gained by applying virtual screening techniques to bioassay data: first, reducing the search space of compounds to be physically screened; and second, improving the screening technology by analysing the false positives that occur in the primary process. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening at all. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing differing classifiers on the same dataset.
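Weka's cost-sensitive metaclassifier is Java-based; as a language-neutral sketch of the same idea, here is a minimal scikit-learn example that penalises misclassified Actives via class weights on a synthetic imbalanced dataset. The 100:1 weighting, feature counts, and data are assumptions for illustration; as the results above caution, the appropriate cost depends on the base classifier, not just the class ratio:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in for a bioassay dataset: ~1% Actives, ~99% Inactives.
X, y = make_classification(n_samples=5000, n_features=50,
                           weights=[0.99, 0.01], random_state=0)

# Penalise misclassifying the rare Active class (label 1) more heavily.
# The 100:1 setting is a starting point, not a rule derived from the ratio.
clf = RandomForestClassifier(class_weight={0: 1, 1: 100}, random_state=0)
print(cross_val_score(clf, X, y, scoring="balanced_accuracy", cv=5).mean())
```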
Project description:Colorectal neoplasia causes bleeding, enabling detection using Faecal Occult Blood tests (FOBt). The National Health Service (NHS) Bowel Cancer Screening Programme (BCSP) guaiac-based FOBt (gFOBt) kits contain six sample windows (or 'spots'), and each kit returns either a positive, unclear, or negative result. Test kits with five or six positive windows are termed 'abnormal' and the subject is referred for further investigation, usually colonoscopy. If 1-4 windows are positive, the result is initially 'unclear' and up to two further kits are submitted; further positivity leads to colonoscopy ('weak positive'). If no further blood is detected, the test is deemed 'normal' and subjects are tested again in 2 years' time. We studied the association between spot positivity % (SP%) and neoplasia. Subjects in the Southern Hub completing the first of two consecutive episodes between April 2009 and March 2011 were studied. Each episode included up to three kits and a maximum of 18 windows (spots). For each positivity combination, the percentage of positive spots out of the total number of spots completed by an individual in a single screening episode was derived and named 'SP%'. Fifty-five combinations of SP% can occur if the position of positive/negative spots on the same test card is ignored. The proportion of individuals for whom neoplasia was identified in Episode 2 was derived for each of the 55 spot combinations. In addition, the Episode 1 spot pattern was analysed for subjects with cancer detected in Episode 2. During Episode 2, 284,261 subjects completed gFOBt screening and colonoscopies were performed on 3891 (1.4%) subjects. At colonoscopy, cancer was detected in 7.4% (n=286) and a further 39.8% (n=1550) had adenomas. Cancer was detected in 21.3% of subjects with an abnormal first kit (five or six positive spots) and in 5.9% of those with a weak positive test result. The proportion of cancers detected was positively correlated with SP%, with a linear R² of 0.89. As the SP% increased from 11% to 100%, the colorectal cancer (CRC) detection rate increased from 4% to 25%. At the lower SP%s, from 11% to 25%, the CRC risk was relatively static at ~4%. Above an SP% of 25%, every 10-percentage-point increase in SP% was associated with an increase in cancer detection of 2.5%. This study demonstrated a strong correlation between SP% and cancer detection within the NHS BCSP. At the population level, subjects' cancer risk ranged from 4% to 25% and correlated with the gFOBt spot pattern. Some subjects with an SP% of 11% proceed to colonoscopy, whereas others with an SP% of 22% do not. Colonoscopy on patients with four positive spots in kit 1 (SP% 22%) would, we estimate, detect cancer in ~4% of cases and increase overall colonoscopy volume by 6%. This study also demonstrated how screening programme data can be used to guide its ongoing implementation and inform other programmes.
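The SP% derivation is simple enough to state in code. A minimal sketch, assuming the window counts given above (six per kit, up to three kits per episode):

```python
def sp_percent(positive_spots_per_kit):
    """Spot positivity % for one screening episode.

    positive_spots_per_kit: one entry per completed kit (up to three kits,
    six windows each), giving the number of positive windows on that kit.
    """
    kits = len(positive_spots_per_kit)
    assert 1 <= kits <= 3 and all(0 <= p <= 6 for p in positive_spots_per_kit)
    total_windows = 6 * kits
    return 100.0 * sum(positive_spots_per_kit) / total_windows

# Four positive windows on kit 1, none on kits 2 and 3: 4 of 18 windows,
# i.e. the SP% of ~22% discussed above.
print(sp_percent([4, 0, 0]))  # 22.2...
```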
Project description:In ultra-high dimensional data analysis, it is extremely challenging to identify important interaction effects, and a top concern in practice is computational feasibility. For a data set with n observations and p predictors, the augmented design matrix including all linear and order-2 terms is of size n × (p² + 3p)/2. When p is large, say in the thousands, the number of interactions is enormous and beyond the capacity of standard machines and software tools for storage and analysis. In theory, interaction selection consistency is hard to achieve in high dimensional settings. Interaction effects have heavier tails and more complex covariance structures than main effects in a random design, making theoretical analysis difficult. In this article, we propose to tackle these issues with forward-selection-based procedures called iFOR, which identify interaction effects in a greedy forward fashion while maintaining the natural hierarchical model structure. Two algorithms, iFORT and iFORM, are studied. Computationally, the iFOR procedures are designed to be simple and fast to implement. No complex optimization tools are needed, since only OLS-type calculations are involved; the iFOR algorithms avoid storing and manipulating the whole augmented matrix, so the memory and CPU requirements are minimal; and the computational complexity is linear in p for sparse models, hence feasible for p ≫ n. Theoretically, we prove that they possess the sure screening property in ultra-high dimensional settings. Numerical examples are used to demonstrate their finite sample performance.
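A simplified sketch of the greedy, hierarchy-respecting idea (not the authors' exact iFORT or iFORM procedures, which add screening conditions and stopping rules): at each step, every main effect plus every interaction between already-selected main effects is scored by OLS residual sum of squares, and the best term enters the model. Only one candidate column is materialized at a time, so the full augmented matrix is never stored:

```python
import numpy as np
from itertools import combinations

def ifor_sketch(X, y, max_terms=10):
    """Greedy forward selection over main effects plus interactions that
    respect the hierarchy: interaction (j, k) is a candidate only after
    both main effects j and k have entered the model."""
    n, p = X.shape
    selected = []                  # terms: (j,) for mains, (j, k) for pairs
    design = np.ones((n, 1))       # intercept-only start
    for _ in range(max_terms):
        mains = [(j,) for j in range(p) if (j,) not in selected]
        chosen = sorted(t[0] for t in selected if len(t) == 1)
        inters = [(j, k) for j, k in combinations(chosen, 2)
                  if (j, k) not in selected]
        best, best_rss = None, np.inf
        for term in mains + inters:
            col = (X[:, term[0]] if len(term) == 1
                   else X[:, term[0]] * X[:, term[1]])
            D = np.column_stack([design, col])
            beta, *_ = np.linalg.lstsq(D, y, rcond=None)
            rss = float(np.sum((y - D @ beta) ** 2))
            if rss < best_rss:
                best, best_rss = term, rss
        col = (X[:, best[0]] if len(best) == 1
               else X[:, best[0]] * X[:, best[1]])
        design = np.column_stack([design, col])
        selected.append(best)
    return selected
```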
Project description:Scope: Jurisdictional-based Early Hearing Detection and Intervention Information Systems (EHDI-IS) collect data on the hearing screening and follow-up status of infants across the United States. These systems serve as tools that assist EHDI programs' staff and partners in their tracking activities and provide a variety of data reports to help ensure that all children who are deaf/hard of hearing (DHH) are identified early and receive recommended intervention services. The quality and timeliness of the data collected with these systems are crucial to effectively meeting these goals. Methodology: Forty-eight EHDI programs, funded by the Centers for Disease Control and Prevention (CDC), successfully evaluated the accuracy, completeness, uniqueness, and timeliness of the hearing screening data as well as the acceptability (i.e., willingness to report) of the EHDI-IS among data reporters (2013-2016). This article describes the evaluations conducted and presents the findings from these evaluation activities. Conclusions: Most state EHDI programs are receiving newborn hearing screening results from hospitals and birthing facilities in a consistent way and data reporters are willing to report according to established protocols. However, additional efforts are needed to improve the accuracy and completeness of reported demographic data, results from infants transferred from other hospitals, and results from infants admitted to the Neonatal Intensive Care Unit.
Project description:Quantitative high throughput screening (qHTS) experiments can generate thousands of concentration-response profiles to screen compounds for potentially adverse effects. However, potency estimates for a single compound can vary considerably in study designs incorporating multiple concentration-response profiles per compound. We introduce an automated quality control procedure based on analysis of variance (ANOVA) to identify and filter out compounds with multiple-cluster response patterns and improve potency estimation in qHTS assays. Our approach, called Cluster Analysis by Subgroups using ANOVA (CASANOVA), clusters compound-specific response patterns into statistically supported subgroups. Applying CASANOVA to 43 publicly available qHTS data sets, we found that only about 20% of compounds with response values outside of the noise band have single-cluster responses. The error rates for incorrectly separating true clusters and incorrectly clumping disparate clusters were both less than 5% in extensive simulation studies. Simulation studies also showed that the bias and variance of the concentration at half-maximal response (AC50) estimates were usually within 10-fold when using a weighted-average approach for potency estimation. In short, CASANOVA effectively sorts out compounds with "inconsistent" response patterns and produces trustworthy AC50 values.
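A rough sketch of the filtering idea, as a simplification and not the published CASANOVA algorithm: summarize each replicate curve, propose a two-group split, and keep the compound only if a one-way ANOVA finds no significant separation. The two-means split, per-curve mean summary, and alpha level are assumptions:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

def looks_single_cluster(profiles, alpha=0.05):
    """profiles: (n_replicates, n_concentrations) responses for one compound.
    Return True if the replicate curves plausibly form one response cluster."""
    if profiles.shape[0] < 4:
        return True                      # too few replicates to test a split
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
    summary = profiles.mean(axis=1)      # crude per-curve summary statistic
    g0, g1 = summary[labels == 0], summary[labels == 1]
    if min(len(g0), len(g1)) < 2:
        return True
    _, pval = f_oneway(g0, g1)
    return pval >= alpha                 # significant split => multi-cluster
```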
Project description:Unbiased discovery approaches have the potential to uncover neurobiological insights into CNS disease and lead to the development of therapies. Here, we review lessons learned from imaging-based screening approaches and recent advances in these areas, including powerful new computational tools to synthesize complex data into more useful knowledge that can reliably guide future research and development.
Project description:Background: Functional genomics employs several experimental approaches to investigate gene functions. High-throughput techniques, such as loss-of-function screening and transcriptome profiling, make it possible to identify lists of genes potentially involved in biological processes of interest (so-called hit lists). Several computational methods exist to analyze and interpret such lists, the most widespread of which aim either at identifying significantly enriched biological processes or at extracting significantly represented subnetworks. Results: Here we propose a novel network analysis method, and corresponding computational software, that employs a shortest-path approach and centrality measures to discover members of molecular pathways leading to the studied phenotype, based on functional genomics screening data. The method works on integrated interactomes that consist of both directed and undirected networks: HIPPIE, SIGNOR, SignaLink, TFactS, KEGG, TransmiR, and miRTarBase. The method finds nodes and short simple paths with significantly high centrality in subnetworks induced by the hit genes and by so-called final implementers, the genes involved in the molecular events responsible for the final phenotypic realization of the biological processes of interest. We present the application of the method to data from an miRNA loss-of-function screen and transcriptome profiling of the terminal human muscle differentiation process, and to a gene loss-of-function screen exploring the genes that regulate human oxidative DNA damage recognition. The analysis highlighted the possible role of several known myogenesis regulatory miRNAs (miR-1, miR-125b, miR-216a) and their targets (AR, NR3C1, ARRB1, ITSN1, VAV3, TDGF1), linked two major regulatory molecules of skeletal myogenesis, MYOD and SMAD3, to their previously known muscle-related targets (TGFB1, CDC42, CTCF), and also implicated a number of proteins, such as C-KIT, that have not previously been studied in the context of muscle differentiation. The analysis also showed the role of the interaction between the H3 and SETDB1 proteins in oxidative DNA damage recognition. Conclusion: The current work provides a systematic methodology to discover members of molecular pathways in integrated networks using functional genomics screening data. It also offers a valuable instrument to explain the appearance of genes previously not associated with the process of interest in the hit list of a particular functional genomics screen.
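A minimal sketch of the core graph operations using networkx (hypothetical function and variable names, not the authors' released software; the directed-graph representation, one-step neighbourhood growth, and path cutoff are assumptions):

```python
import networkx as nx

def central_members(G, hits, implementers, cutoff=3, top=20):
    """G: integrated interactome as an nx.DiGraph; hits/implementers: node sets.
    Returns high-centrality nodes in the induced neighbourhood and the short
    simple paths connecting hit genes to final implementers."""
    seeds = (set(hits) | set(implementers)) & set(G)
    nodes = set(seeds)
    for v in seeds:                      # grow one neighbourhood step
        nodes |= set(G.successors(v)) | set(G.predecessors(v))
    H = G.subgraph(nodes)
    centrality = nx.betweenness_centrality(H)
    ranked = sorted(centrality, key=centrality.get, reverse=True)[:top]
    paths = [path
             for s in set(hits) & nodes for t in set(implementers) & nodes
             if s != t
             for path in nx.all_simple_paths(H, s, t, cutoff=cutoff)]
    return ranked, paths
```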
Project description:The National Toxicology Program is developing a high-throughput screening (HTS) program to set testing priorities for compounds of interest, to identify mechanisms of action, and potentially to develop predictive models for human toxicity. This program will generate extensive data on the activity of large numbers of chemicals in a wide variety of biochemical- and cell-based assays. The first step in relating patterns of response among batteries of HTS assays to in vivo toxicity is to distinguish between positive and negative compounds in individual assays. Here, the authors report on a statistical approach developed to identify compounds as positive or negative in an HTS cytotoxicity assay, based on data collected from screening 1353 compounds for concentration-response effects in 9 human and 4 rodent cell types. In this approach, the authors develop methods to normalize the data (removing bias due to the location of the compound on the 1536-well plates used in the assay) and to analyze for concentration-response relationships. Various statistical tests for identifying significant concentration-response relationships and for addressing reproducibility are developed and presented.
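As an illustration of the two steps described, here is a sketch that removes positional bias by row/column median-centering a plate and then tests for a concentration-response relationship by regressing response on log10 concentration. The specific normalization and the linear trend test are assumptions standing in for the authors' methods, and the example values are made up:

```python
import numpy as np
from scipy import stats

def normalize_plate(plate):
    """plate: 2-D array of raw well responses (e.g., 32 x 48 for 1536 wells).
    Subtract row and column medians to remove crude positional bias."""
    plate = plate - np.median(plate, axis=1, keepdims=True)
    plate = plate - np.median(plate, axis=0, keepdims=True)
    return plate

def trend_test(concentrations, responses):
    """Simple significance test for a concentration-response relationship."""
    slope, intercept, r, p, se = stats.linregress(
        np.log10(concentrations), responses)
    return slope, p

# Hypothetical compound tested at 7 concentrations (micromolar).
conc = np.array([0.001, 0.01, 0.1, 1, 10, 100, 1000])
resp = np.array([98, 97, 95, 80, 55, 30, 12])   # % viability
print(trend_test(conc, resp))
```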