Project description:Transcriptional enhancers play critical roles in regulation of gene expression, but their identification has remained a challenge. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of histone modifications have previously been investigated for this purpose, leaving unanswered the question of whether there exists an optimal set of histone modifications that could improve enhancer prediction. Here, we address this issue by exploring a rich dataset produced by the human Epigenome Roadmap Project. Specifically, we examined genome-wide profiles of 24 histone modifications in human embryonic stem cells and fibroblasts, and developed a random forest-based algorithm to integrate histone modification profiles for identification of enhancers. As a training set, we used histone modification profiles at genome-wide binding sites of p300 in the two cell types, identified using ChIP-seq. We show that this algorithm not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify an optimal set of three chromatin marks for enhancer prediction.
Project description:The clinical course of prostate cancer (PCa) is highly variable, demanding an individualized approach to therapy and robust prognostic markers for treatment decisions. We here present a random forest-based classification model to predict aggressive behaviour of PCa. DNA methylation changes between PCa cases with good or poor prognosis (discovery cohort with n=78) were used as input. The model was validated with data from two independent PCa cohorts from ICGC and TCGA. Ranking of cancer progression-related DNA methylation changes allowed selection of candidate genes for additional validation by immunohistochemistry. We identified loss of ZIC2 protein expression as a promising novel prognostic biomarker for PCa in >12,000 tissue micro-array tumors. The prognostic value of ZIC2 proved to be independent from established clinico-pathological variables including Gleason, stage, nodal stage and PSA. In summary, we have developed a PCa classification model which either directly or via expression analyses of the identified top ranked candidate genes might help in decision making related to the treatment of prostate cancer patients.
Project description:A physical unclonable function (PUF) is a foundation of anti-counterfeiting processes due to its inherent uniqueness. However, the inherent limitations of conventional graphical/spectral PUFs in materials often make it difficult to achieve both high code flexibility and high environmental stability in practice. In this study, we propose a universal, fractal-guided film annealing strategy to realize random Au network-based PUFs whose complexity can be designed on demand, enabling the tags' intrinsic uniqueness and stability. A dynamic deep learning-based authentication system with an expandable database is built to identify and trace the PUFs, achieving efficient and reliable authentication with 0% "false positives". Based on the roughening-enabled plasmonic network platform, Raman-based chemical encoding is conceptually demonstrated, showing the potential for improvements in security. The configurable tags in mass production can serve as competitive PUF carriers for high-level anti-counterfeiting and data encryption.
Project description:Transcriptional enhancers play critical roles in regulation of gene expression, but their identification has remained a challenge. Recently, it was shown that enhancers in the mammalian genome are associated with characteristic histone modification patterns, which have been increasingly exploited for enhancer identification. However, only a limited number of histone modifications have previously been investigated for this purpose, leaving unanswered the question of whether there exists an optimal set of histone modifications that could improve enhancer prediction. Here, we address this issue by exploring a rich dataset produced by the human Epigenome Roadmap Project. Specifically, we examined genome-wide profiles of 24 histone modifications in human embryonic stem cells and fibroblasts, and developed a random forest-based algorithm to integrate histone modification profiles for identification of enhancers. As a training set, we used histone modification profiles at genome-wide binding sites of p300 in the two cell types, identified using ChIP-seq. We show that this algorithm not only leads to more accurate and precise prediction of enhancers than previous methods, but also helps identify an optimal set of three chromatin marks for enhancer prediction. ChIP-seq analysis of p300 in hESC H1 and IMR90 cells. Sequencing was done on the Illumina Genome Analyzer II platform for the H1 data and on the Illumina HiSeq for the IMR90 data. Data were mapped to hg18 using Bowtie.
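To illustrate the general idea of training a tree ensemble on mark signals and then ranking the marks, here is a minimal sketch in pure Python. It is not the authors' algorithm: it uses a bagged ensemble of depth-one trees (stumps) with random feature subsampling as a heavily simplified stand-in for a random forest, ranks features by how often they are chosen for a split (real implementations typically use impurity decrease or permutation importance), and runs on synthetic data. The four unnamed "marks", the labels (1 = p300-bound, 0 = background), and all function names are assumptions made for illustration.

```python
import random

def best_stump(X, y, features):
    """Return (errors, feature, threshold, left_label, right_label) with fewest errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # candidate split halfway between adjacent values
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if row[f] <= thr else right) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, thr, left, right)
    return best

def fit_forest(X, y, n_trees=50, n_sub=3, seed=0):
    """Bagged stumps: each tree sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    while len(forest) < n_trees:
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of rows
        feats = rng.sample(range(p), n_sub)          # random subset of features
        stump = best_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        if stump is not None:                        # None if no split was possible
            forest.append(stump[1:])
    return forest

def predict(forest, row):
    """Majority vote over all stumps (ties go to class 0)."""
    votes = sum(left if row[f] <= thr else right for f, thr, left, right in forest)
    return int(2 * votes > len(forest))

def split_counts(forest, n_features):
    """Crude feature ranking: how often each feature was chosen for a split."""
    counts = [0] * n_features
    for f, _thr, _left, _right in forest:
        counts[f] += 1
    return counts

# Synthetic toy data (hypothetical): feature 0 tracks the label, features 1-3 are noise.
data_rng = random.Random(1)
labels = [i % 2 for i in range(60)]
signals = [[lab + data_rng.uniform(-0.2, 0.2)] + [data_rng.random() for _ in range(3)]
           for lab in labels]

forest = fit_forest(signals, labels)
counts = split_counts(forest, 4)
accuracy = sum(predict(forest, r) == l for r, l in zip(signals, labels)) / len(labels)
```

In this toy setting the informative feature should dominate the split counts. Selecting a small subset of marks, as the description above reports, would amount to keeping the top-ranked features and retraining.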
Project description:tRNA fragments (tRFs) are a novel class of small RNAs comparable in size and function to miRNAs. We and others have shown that tRFs are generally Dicer independent, can be found in abundance in the miRNA effector protein Ago, and can repress expression of specific genes that have complementarity to their 5’ seed sequences. Given that this greatly expands the repertoire of small RNAs capable of post-transcriptional regulation of gene expression, it is important to predict tRF targets with confidence. Some attempts have been made to predict tRF targets, but they are limited in the scope of tRF classes used in prediction or limited in feature selection. We hypothesized that established miRNA target prediction features applied to tRFs through a random forest machine learning algorithm would immensely improve tRF target prediction. Using this approach, we show significant improvements in tRF target prediction for all classes of tRFs and validate our predictions in two independent cell lines. Finally, using Gene Ontology analysis, we provide evidence that tRF-3009a targets may be involved in neural development. These improvements to tRF target prediction further our understanding of tRF function broadly across species and tRF types, and provide avenues for testing novel roles for tRFs in biology.
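Seed complementarity, mentioned above, is one of the simplest miRNA-style features to compute. The sketch below assumes the common miRNA convention that the seed spans nucleotides 2-8 of the small RNA and that a candidate site is the reverse complement of that seed in the target transcript. Whether the described model uses exactly this definition is not stated here, and the sequences and function names are hypothetical.

```python
# Watson-Crick complements for RNA; G:U wobble pairs are deliberately ignored.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(small_rna, start=2, end=8):
    """Reverse complement of nucleotides start..end (1-based, inclusive) of the small RNA."""
    rna = small_rna.upper().replace("T", "U")  # accept DNA-style input as well
    seed = rna[start - 1:end]
    return "".join(COMPLEMENT[n] for n in reversed(seed))

def find_seed_matches(small_rna, target):
    """Return 0-based start positions of perfect seed-match sites in the target."""
    site = seed_site(small_rna)
    target = target.upper().replace("T", "U")
    return [i for i in range(len(target) - len(site) + 1)
            if target[i:i + len(site)] == site]

# Hypothetical sequences, for illustration only.
example_trf = "AGCUUACGGAUCCGUAAC"
example_target = "AAACGUAAGCUUUCGUAAGC"
positions = find_seed_matches(example_trf, example_target)
```

A count of such sites per transcript could serve as one input column for a classifier, alongside whatever other features the model uses.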
Project description:Degraded pasture is a major liability in Brazilian agriculture, but restoration and recovery efforts could turn this area into a new frontier for both agricultural yield expansion and forest restoration. Currently, rural properties with larger degraded pasture areas are associated with higher levels of technical inefficiency in Brazil. The recovery of 12 million ha of degraded pastures could generate an additional production of 17.7 million bovines while reducing the need for new agricultural land. Regional identification of degraded pastures would facilitate the targeting of agricultural extension and advisory services and rural credit efforts aimed at fostering pasture recovery. Since only 1% of Brazilian municipalities contain 25% of degraded pastures, focusing pasture recovery efforts on this small group of municipalities could generate considerable benefits. More efficient allocation of degraded and native pastures for meat production and forest restoration could provide enough land for Brazil to fully comply with its Forest Code requirements, while adding 9 million head to the cattle inventory. Degraded pasture recovery and restoration is a win-win strategy that could boost livestock husbandry and avoid deforestation in Brazil, and it should be the priority strategy of the agribusiness sector.
Project description:Premise: To improve forest conservation monitoring, we developed a protocol to automatically count and identify the seeds of plant species with minimal resource requirements, making the process more efficient and less dependent on human operators. Methods and results: Seeds from six North American conifer tree species were separated from leaf litter and imaged on a flatbed scanner. In the most successful species-classification approach, an ImageJ macro automatically extracted measurements for random forest classification in the software R. The method allows for good classification accuracy, and the same process can be used to train the model on other species. Conclusions: This protocol is an adaptable tool for efficient and consistent identification of seed species or potentially other objects. Automated seed classification is efficient and inexpensive, making it a practical solution that enhances the feasibility of large-scale monitoring projects in conservation biology.
Project description:Background: Clostridiales and Bacteroidales are uniquely adapted to the gut environment and have co-evolved with their hosts, resulting in convergent microbiome patterns within mammalian species. As a result, members of Clostridiales and Bacteroidales are particularly suitable for identifying sources of fecal contamination in environmental samples. However, a comprehensive evaluation of their predictive power and development of computational approaches is lacking. Given the global public health concern for waterborne disease, accurate identification of fecal pollution sources is essential for effective risk assessment and management. Here, we use a random forest algorithm and 16S rRNA gene amplicon sequences assigned to Clostridiales and Bacteroidales to identify common fecal pollution sources. We benchmarked the accuracy, consistency, and sensitivity of our classification approach using fecal, environmental, and artificial in silico generated samples. Results: Clostridiales and Bacteroidales classifiers were composed mainly of sequences that displayed differential distributions (host-preferred) among sewage, cow, deer, pig, cat, and dog sources. Each classifier correctly identified human and individual animal sources in approximately 90% of the fecal and environmental samples tested. Misclassifications resulted mostly from false-positive identification of cat and dog fecal signatures in host animals not used to build the classifiers, suggesting that characterization of additional animals would improve accuracy. Random forest predictions were highly reproducible, reflecting the consistency of the bacterial signatures within each of the animal and sewage sources. Using in silico generated samples, we could detect fecal bacterial signatures when the source dataset accounted for as little as ~0.5% of the assemblage, with ~0.04% of the sequences matching the classifiers. Finally, we developed a proxy to estimate proportions among sources, which allowed us to determine which sources contribute the most to observed fecal pollution. Conclusion: Random forest classification with 16S rRNA gene amplicons offers a rapid, sensitive, and accurate solution for identifying host microbial signatures to detect human and animal fecal contamination in environmental samples.
Project description:Random forest is a popular prediction approach for handling high-dimensional covariates. However, it often becomes infeasible to interpret the resulting high-dimensional, non-parametric model. Aiming for an interpretable predictive model, we develop a forward variable selection method using the continuous ranked probability score (CRPS) as the loss function. Our stepwise procedure selects at each step the variable that minimizes the CRPS risk, and a stopping criterion for selection is designed based on an estimate of the CRPS risk difference between two consecutive steps. We provide mathematical motivation for our method by proving that, in a population sense, the method attains the optimal set. In a simulation study, we compare the performance of our method with an existing variable selection method for different sample sizes and correlation strengths of the covariates. Our method is observed to have a much lower false positive rate. We also demonstrate an application of our method to statistical post-processing of daily maximum temperature forecasts in the Netherlands. Our method selects about 10% of the covariates while retaining the same predictive power.
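The stepwise procedure described above can be sketched as follows. The CRPS of an empirical (ensemble) forecast is computed with the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, and forward selection greedily adds the variable that lowers an estimated risk the most. The stopping rule here is a plain tolerance on the improvement between consecutive steps, which is an assumption standing in for the estimation-based criterion of the method. The `risk` callable, the `tol` parameter, and the toy risk table are hypothetical.

```python
def crps_ensemble(members, y):
    """CRPS of an empirical forecast distribution given by `members`, for outcome y."""
    m = len(members)
    mean_abs_error = sum(abs(x - y) for x in members) / m
    mean_abs_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return mean_abs_error - 0.5 * mean_abs_spread

def forward_select(candidates, risk, tol=0.0):
    """Greedy forward selection.

    `risk(variables)` must return an estimated CRPS risk for a model built on
    the given list of variables. Selection stops once the best achievable
    improvement is no larger than `tol`.
    """
    selected = []
    remaining = list(candidates)
    current = risk(selected)
    while remaining:
        best_risk, best_var = min((risk(selected + [v]), v) for v in remaining)
        if current - best_risk <= tol:
            break
        selected.append(best_var)
        remaining.remove(best_var)
        current = best_risk
    return selected

# Hypothetical risk table standing in for risks estimated from fitted models.
TOY_RISK = {
    frozenset(): 1.00,
    frozenset("a"): 0.60, frozenset("b"): 0.80, frozenset("c"): 0.99,
    frozenset("ab"): 0.50, frozenset("ac"): 0.60, frozenset("bc"): 0.79,
    frozenset("abc"): 0.50,
}

def toy_risk(variables):
    return TOY_RISK[frozenset(variables)]

chosen = forward_select(["a", "b", "c"], toy_risk, tol=0.01)
```

With the toy table, variable "c" is never added because it does not reduce the risk beyond the tolerance once "a" and "b" are in the model. In practice `risk` would average `crps_ensemble` over held-out observations, using predictive distributions from a model fitted on the selected covariates.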