
Dataset Information


Prediction of donor splice sites using random forest with a new sequence encoding approach.


ABSTRACT: BACKGROUND: Detection of splice sites plays a key role in predicting gene structure, so the development of efficient analytical methods for splice site prediction is vital. This paper presents a novel sequence encoding approach based on adjacent di-nucleotide dependencies, in which the donor splice site motifs are encoded into numeric vectors. The encoded vectors are then used as input to Random Forest (RF), Support Vector Machine (SVM), Artificial Neural Network (ANN), Bagging, Boosting, Logistic Regression, kNN and Naïve Bayes classifiers for prediction of donor splice sites. RESULTS: The performance of the proposed approach is evaluated on the donor splice site sequence data of Homo sapiens, collected from the Homo Sapiens Splice Sites Dataset (HS3D). The results showed that RF outperformed all the other classifiers considered. In addition, RF achieved higher prediction accuracy than the existing methods MEM, MDD, WMM, MM1, NNSplice and SpliceView when compared on an independent test dataset. CONCLUSION: Based on the proposed approach, we have developed an online prediction server (MaLDoSS) to help the biological community predict donor splice sites. The server is freely available at http://cabgrid.res.in:8080/maldoss. Owing to its computational feasibility and high prediction accuracy, the proposed approach is expected to help in predicting eukaryotic gene structure.
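
The abstract does not spell out the encoding itself. As a rough illustration of the general idea only (turning adjacent di-nucleotides in a splice site motif into a numeric vector), the sketch below one-hot encodes each adjacent pair of positions. This is a generic scheme chosen for illustration, not the encoding proposed in the paper; the function name `encode_adjacent_dinucleotides` and the handling of ambiguous bases are assumptions.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 16 ordered di-nucleotides: "AA", "AC", ..., "TT".
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
DINUC_INDEX = {d: i for i, d in enumerate(DINUCLEOTIDES)}


def encode_adjacent_dinucleotides(seq):
    """One-hot encode every adjacent pair of positions in seq.

    Illustrative scheme only (an assumption, not the paper's encoding):
    a motif of length n yields a vector of length 16 * (n - 1).
    """
    seq = seq.upper()
    vector = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        block = [0] * len(DINUCLEOTIDES)
        # Pairs containing an ambiguous base (e.g. N) stay all zeros.
        if pair in DINUC_INDEX:
            block[DINUC_INDEX[pair]] = 1
        vector.extend(block)
    return vector


if __name__ == "__main__":
    # Hypothetical 9-base window around a donor site (GT at the intron start).
    features = encode_adjacent_dinucleotides("CAGGTAAGT")
    print(len(features), sum(features))
```

Vectors of this kind have a fixed length for a fixed window size, so they could be passed as feature rows to any standard classifier, such as a random forest.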

SUBMITTER: Meher PK 

PROVIDER: S-EPMC4724119 | biostudies-literature | 2016

REPOSITORIES: biostudies-literature


Publications

Prediction of donor splice sites using random forest with a new sequence encoding approach.

Meher, Prabina Kumar; Sahu, Tanmaya Kumar; Rao, Atmakuri Ramakrishna

BioData Mining, 2016-01-22



Similar Datasets

| S-EPMC3530872 | biostudies-other
| S-EPMC5881105 | biostudies-other
| S-EPMC6924143 | biostudies-literature
| S-EPMC4494626 | biostudies-literature
2012-05-09 | E-GEOD-37858 | biostudies-arrayexpress
2012-05-10 | GSE37858 | GEO