Project description:As newborn screening programs transition from paper-based data exchange toward automated, electronic methods, significant data exchange challenges must be overcome. This article outlines a data model that maps newborn screening data elements associated with patient demographic information, birthing facilities, laboratories, result reporting, and follow-up care to the LOINC, SNOMED CT, ICD-10-CM, and HL7 healthcare standards. The described framework lays the foundation for standardized electronic data exchange across newborn screening programs, leading to greater data interoperability. Use of this model can accelerate the adoption of electronic data exchange between healthcare providers and newborn screening programs, which would ultimately improve health outcomes for all newborns.
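To make the shape of such a mapping concrete, here is a minimal Python sketch of one way a mapped data element could be represented. The class, element names, field assignments, and code values are illustrative assumptions, not the article's actual model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScreeningDataElement:
    """One newborn screening data element mapped onto standard vocabularies."""
    name: str                        # local element name used by a program
    loinc: Optional[str] = None      # LOINC code for observations/results
    snomed: Optional[str] = None     # SNOMED CT code for clinical concepts
    icd10: Optional[str] = None      # ICD-10-CM code for diagnoses
    hl7_field: Optional[str] = None  # HL7 v2 field carrying the value

# Hypothetical entries for illustration only; the article's model defines
# the authoritative element list and code assignments.
mapping = [
    ScreeningDataElement("birth_weight", loinc="8339-4", hl7_field="OBX-5"),
    ScreeningDataElement("birthing_facility"),  # codes to be assigned
]
```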
Project description:Background: There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data are highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5, and Random Forest) to a variety of bioassay datasets. Results: Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow due to the lack of cross-referencing mentioned above. Across the six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance. Conclusions: Understandably, pharmaceutical data is hard to obtain. However, it would benefit both the pharmaceutical industry and academics for curated primary screening data and the corresponding confirmatory data to be provided. Two benefits could be gained by applying virtual screening techniques to bioassay data: first, reducing the search space of compounds to be physically screened; and second, improving the screening technology by analysing the false positives that occur in the primary process. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening at all. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing differing classifiers on the same dataset.
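Weka's cost-sensitive metaclassifier is Java-based; as a language-neutral sketch of the same idea, here is a minimal scikit-learn example that penalises misclassified Actives via class weights on a synthetic imbalanced dataset. The 100:1 weighting, feature counts, and data are assumptions for illustration; as the results above caution, the appropriate cost depends on the base classifier, not just the class ratio:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in for a bioassay dataset: ~1% Actives, ~99% Inactives.
X, y = make_classification(n_samples=5000, n_features=50,
                           weights=[0.99, 0.01], random_state=0)

# Penalise misclassifying the rare Active class (label 1) more heavily.
# The 100:1 setting is a starting point, not a rule derived from the ratio.
clf = RandomForestClassifier(class_weight={0: 1, 1: 100}, random_state=0)
print(cross_val_score(clf, X, y, scoring="balanced_accuracy", cv=5).mean())
```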
Project description:Colorectal neoplasia causes bleeding, enabling detection using Faecal Occult Blood tests (FOBt). The National Health Service (NHS) Bowel Cancer Screening Programme (BCSP) guaiac-based FOBt (gFOBt) kits contain six sample windows (or 'spots'), and each kit returns either a positive, unclear, or negative result. Test kits with five or six positive windows are termed 'abnormal' and the subject is referred for further investigation, usually colonoscopy. If 1-4 windows are positive, the result is initially 'unclear' and up to two further kits are submitted; further positivity leads to colonoscopy ('weak positive'). If no further blood is detected, the test is deemed 'normal' and subjects are tested again in 2 years' time. We studied the association between spot positivity % (SP%) and neoplasia. Subjects in the Southern Hub completing the first of two consecutive episodes between April 2009 and March 2011 were studied. Each episode included up to three kits and a maximum of 18 windows (spots). For each positivity combination, the percentage of positive spots out of the total number of spots completed by an individual in a single screening episode was derived and named 'SP%'. Fifty-five combinations of SP% can occur if the position of positive/negative spots on the same test card is ignored. The proportion of individuals for whom neoplasia was identified in Episode 2 was derived for each of the 55 spot combinations. In addition, the Episode 1 spot pattern was analysed for subjects with cancer detected in Episode 2. During Episode 2, 284,261 subjects completed gFOBt screening and colonoscopies were performed on 3891 (1.4%) subjects. At colonoscopy, cancer was detected in 7.4% (n=286) and a further 39.8% (n=1550) had adenomas. Cancer was detected in 21.3% of subjects with an abnormal first kit (five or six positive spots) and in 5.9% of those with a weak positive test result. The proportion of cancers detected was positively correlated with SP%, with a linear R² of 0.89. As the SP% increased from 11% to 100%, the colorectal cancer (CRC) detection rate increased from 4% to 25%. At the lower SP%s, from 11% to 25%, the CRC risk was relatively static at ~4%. Above an SP% of 25%, every 10-percentage-point increase in SP% was associated with an increase in cancer detection of 2.5%. This study demonstrated a strong correlation between SP% and cancer detection within the NHS BCSP. At the population level, subjects' cancer risk ranged from 4% to 25% and correlated with the gFOBt spot pattern. Some subjects with an SP% of 11% proceed to colonoscopy, whereas others with an SP% of 22% do not. Colonoscopy on patients with four positive spots in kit 1 (SP% 22%) would, we estimate, detect cancer in ~4% of cases and increase overall colonoscopy volume by 6%. This study also demonstrated how screening programme data can be used to guide its ongoing implementation and inform other programmes.
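The SP% derivation is simple enough to state in code. A minimal sketch, assuming the window counts given above (six per kit, up to three kits per episode):

```python
def sp_percent(positive_spots_per_kit):
    """Spot positivity % for one screening episode.

    positive_spots_per_kit: one entry per completed kit (up to three kits,
    six windows each), giving the number of positive windows on that kit.
    """
    kits = len(positive_spots_per_kit)
    assert 1 <= kits <= 3 and all(0 <= p <= 6 for p in positive_spots_per_kit)
    total_windows = 6 * kits
    return 100.0 * sum(positive_spots_per_kit) / total_windows

# Four positive windows on kit 1, none on kits 2 and 3: 4 of 18 windows,
# i.e. the SP% of ~22% discussed above.
print(sp_percent([4, 0, 0]))  # 22.2...
```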
Project description:In ultra-high dimensional data analysis, it is extremely challenging to identify important interaction effects, and a top concern in practice is computational feasibility. For a data set with n observations and p predictors, the augmented design matrix including all linear and order-2 terms is of size n × (p² + 3p)/2. When p is large, say in the thousands, the number of interactions is enormous and beyond the capacity of standard machines and software tools for storage and analysis. In theory, interaction selection consistency is hard to achieve in high dimensional settings. Interaction effects have heavier tails and more complex covariance structures than main effects in a random design, making theoretical analysis difficult. In this article, we propose to tackle these issues with forward-selection-based procedures called iFOR, which identify interaction effects in a greedy forward fashion while maintaining the natural hierarchical model structure. Two algorithms, iFORT and iFORM, are studied. Computationally, the iFOR procedures are designed to be simple and fast to implement. No complex optimization tools are needed, since only OLS-type calculations are involved; the iFOR algorithms avoid storing and manipulating the whole augmented matrix, so the memory and CPU requirements are minimal; and the computational complexity is linear in p for sparse models, hence feasible for p ≫ n. Theoretically, we prove that they possess the sure screening property in ultra-high dimensional settings. Numerical examples are used to demonstrate their finite sample performance.
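A simplified sketch of the greedy, hierarchy-respecting idea (not the authors' exact iFORT or iFORM procedures, which add screening conditions and stopping rules): at each step, every main effect plus every interaction between already-selected main effects is scored by OLS residual sum of squares, and the best term enters the model. Only one candidate column is materialized at a time, so the full augmented matrix is never stored:

```python
import numpy as np
from itertools import combinations

def ifor_sketch(X, y, max_terms=10):
    """Greedy forward selection over main effects plus interactions that
    respect the hierarchy: interaction (j, k) is a candidate only after
    both main effects j and k have entered the model."""
    n, p = X.shape
    selected = []                  # terms: (j,) for mains, (j, k) for pairs
    design = np.ones((n, 1))       # intercept-only start
    for _ in range(max_terms):
        mains = [(j,) for j in range(p) if (j,) not in selected]
        chosen = sorted(t[0] for t in selected if len(t) == 1)
        inters = [(j, k) for j, k in combinations(chosen, 2)
                  if (j, k) not in selected]
        best, best_rss = None, np.inf
        for term in mains + inters:
            col = (X[:, term[0]] if len(term) == 1
                   else X[:, term[0]] * X[:, term[1]])
            D = np.column_stack([design, col])
            beta, *_ = np.linalg.lstsq(D, y, rcond=None)
            rss = float(np.sum((y - D @ beta) ** 2))
            if rss < best_rss:
                best, best_rss = term, rss
        col = (X[:, best[0]] if len(best) == 1
               else X[:, best[0]] * X[:, best[1]])
        design = np.column_stack([design, col])
        selected.append(best)
    return selected
```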
Project description:Scope: Jurisdictional-based Early Hearing Detection and Intervention Information Systems (EHDI-IS) collect data on the hearing screening and follow-up status of infants across the United States. These systems serve as tools that assist EHDI programs' staff and partners in their tracking activities and provide a variety of data reports to help ensure that all children who are deaf/hard of hearing (DHH) are identified early and receive recommended intervention services. The quality and timeliness of the data collected with these systems are crucial to effectively meeting these goals. Methodology: Forty-eight EHDI programs, funded by the Centers for Disease Control and Prevention (CDC), successfully evaluated the accuracy, completeness, uniqueness, and timeliness of the hearing screening data as well as the acceptability (i.e., willingness to report) of the EHDI-IS among data reporters (2013-2016). This article describes the evaluations conducted and presents the findings from these evaluation activities. Conclusions: Most state EHDI programs are receiving newborn hearing screening results from hospitals and birthing facilities in a consistent way and data reporters are willing to report according to established protocols. However, additional efforts are needed to improve the accuracy and completeness of reported demographic data, results from infants transferred from other hospitals, and results from infants admitted to the Neonatal Intensive Care Unit.
Project description:Quantitative high throughput screening (qHTS) experiments can generate thousands of concentration-response profiles to screen compounds for potentially adverse effects. However, potency estimates for a single compound can vary considerably in study designs incorporating multiple concentration-response profiles per compound. We introduce an automated quality control procedure based on analysis of variance (ANOVA) to identify and filter out compounds with multiple-cluster response patterns and improve potency estimation in qHTS assays. Our approach, called Cluster Analysis by Subgroups using ANOVA (CASANOVA), clusters compound-specific response patterns into statistically supported subgroups. Applying CASANOVA to 43 publicly available qHTS data sets, we found that only about 20% of compounds with response values outside of the noise band have single-cluster responses. The error rates for incorrectly separating true clusters and incorrectly clumping disparate clusters were both less than 5% in extensive simulation studies. Simulation studies also showed that the bias and variance of the concentration at half-maximal response (AC50) estimates were usually within 10-fold when using a weighted-average approach for potency estimation. In short, CASANOVA effectively sorts out compounds with "inconsistent" response patterns and produces trustworthy AC50 values.
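A rough sketch of the filtering idea, as a simplification and not the published CASANOVA algorithm: summarize each replicate curve, propose a two-group split, and keep the compound only if a one-way ANOVA finds no significant separation. The two-means split, per-curve mean summary, and alpha level are assumptions:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

def looks_single_cluster(profiles, alpha=0.05):
    """profiles: (n_replicates, n_concentrations) responses for one compound.
    Return True if the replicate curves plausibly form one response cluster."""
    if profiles.shape[0] < 4:
        return True                      # too few replicates to test a split
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
    summary = profiles.mean(axis=1)      # crude per-curve summary statistic
    g0, g1 = summary[labels == 0], summary[labels == 1]
    if min(len(g0), len(g1)) < 2:
        return True
    _, pval = f_oneway(g0, g1)
    return pval >= alpha                 # significant split => multi-cluster
```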
Project description:Unbiased discovery approaches have the potential to uncover neurobiological insights into CNS disease and lead to the development of therapies. Here, we review lessons learned from imaging-based screening approaches and recent advances in these areas, including powerful new computational tools to synthesize complex data into more useful knowledge that can reliably guide future research and development.
Project description:Background: Functional genomics employs several experimental approaches to investigate gene functions. High-throughput techniques, such as loss-of-function screening and transcriptome profiling, make it possible to identify lists of genes potentially involved in biological processes of interest (so-called hit lists). Several computational methods exist to analyze and interpret such lists, the most widespread of which aim either at identifying significantly enriched biological processes or at extracting significantly represented subnetworks. Results: Here we propose a novel network analysis method, and corresponding computational software, that employs a shortest-path approach and centrality measures to discover members of molecular pathways leading to the studied phenotype, based on functional genomics screening data. The method works on integrated interactomes that consist of both directed and undirected networks: HIPPIE, SIGNOR, SignaLink, TFactS, KEGG, TransmiR, and miRTarBase. The method finds nodes and short simple paths with significantly high centrality in subnetworks induced by the hit genes and by so-called final implementers, the genes involved in the molecular events responsible for the final phenotypic realization of the biological processes of interest. We present the application of the method to data from an miRNA loss-of-function screen and transcriptome profiling of the terminal human muscle differentiation process, and to a gene loss-of-function screen exploring the genes that regulate human oxidative DNA damage recognition. The analysis highlighted the possible role of several known myogenesis regulatory miRNAs (miR-1, miR-125b, miR-216a) and their targets (AR, NR3C1, ARRB1, ITSN1, VAV3, TDGF1), linked two major regulatory molecules of skeletal myogenesis, MYOD and SMAD3, to their previously known muscle-related targets (TGFB1, CDC42, CTCF), and also implicated a number of proteins, such as C-KIT, that have not previously been studied in the context of muscle differentiation. The analysis also showed the role of the interaction between the H3 and SETDB1 proteins in oxidative DNA damage recognition. Conclusion: The current work provides a systematic methodology to discover members of molecular pathways in integrated networks using functional genomics screening data. It also offers a valuable instrument to explain the appearance of genes previously not associated with the process of interest in the hit list of a particular functional genomics screen.
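A minimal sketch of the core graph operations using networkx (hypothetical function and variable names, not the authors' released software; the directed-graph representation, one-step neighbourhood growth, and path cutoff are assumptions):

```python
import networkx as nx

def central_members(G, hits, implementers, cutoff=3, top=20):
    """G: integrated interactome as an nx.DiGraph; hits/implementers: node sets.
    Returns high-centrality nodes in the induced neighbourhood and the short
    simple paths connecting hit genes to final implementers."""
    seeds = (set(hits) | set(implementers)) & set(G)
    nodes = set(seeds)
    for v in seeds:                      # grow one neighbourhood step
        nodes |= set(G.successors(v)) | set(G.predecessors(v))
    H = G.subgraph(nodes)
    centrality = nx.betweenness_centrality(H)
    ranked = sorted(centrality, key=centrality.get, reverse=True)[:top]
    paths = [path
             for s in set(hits) & nodes for t in set(implementers) & nodes
             if s != t
             for path in nx.all_simple_paths(H, s, t, cutoff=cutoff)]
    return ranked, paths
```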
Project description:The National Toxicology Program is developing a high-throughput screening (HTS) program to set testing priorities for compounds of interest, to identify mechanisms of action, and potentially to develop predictive models for human toxicity. This program will generate extensive data on the activity of large numbers of chemicals in a wide variety of biochemical- and cell-based assays. The first step in relating patterns of response among batteries of HTS assays to in vivo toxicity is to distinguish between positive and negative compounds in individual assays. Here, the authors report on a statistical approach developed to identify compounds as positive or negative in an HTS cytotoxicity assay, based on data collected from screening 1353 compounds for concentration-response effects in 9 human and 4 rodent cell types. In this approach, the authors develop methods to normalize the data (removing bias due to the location of the compound on the 1536-well plates used in the assay) and to analyze for concentration-response relationships. Various statistical tests for identifying significant concentration-response relationships and for addressing reproducibility are developed and presented.
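As an illustration of the two steps described, here is a sketch that removes positional bias by row/column median-centering a plate and then tests for a concentration-response relationship by regressing response on log10 concentration. The specific normalization and the linear trend test are assumptions standing in for the authors' methods, and the example values are made up:

```python
import numpy as np
from scipy import stats

def normalize_plate(plate):
    """plate: 2-D array of raw well responses (e.g., 32 x 48 for 1536 wells).
    Subtract row and column medians to remove crude positional bias."""
    plate = plate - np.median(plate, axis=1, keepdims=True)
    plate = plate - np.median(plate, axis=0, keepdims=True)
    return plate

def trend_test(concentrations, responses):
    """Simple significance test for a concentration-response relationship."""
    slope, intercept, r, p, se = stats.linregress(
        np.log10(concentrations), responses)
    return slope, p

# Hypothetical compound tested at 7 concentrations (micromolar).
conc = np.array([0.001, 0.01, 0.1, 1, 10, 100, 1000])
resp = np.array([98, 97, 95, 80, 55, 30, 12])   # % viability
print(trend_test(conc, resp))
```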