A rapid method for characterization of protein relatedness using feature vectors.
ABSTRACT: We propose a feature vector approach to characterize the variation in large data sets of biological sequences. Each candidate sequence produces a single feature vector constructed with the number and location of amino acids or nucleic acids in the sequence. The feature vector characterizes the distance between the actual sequence and a model of a theoretical sequence based on the binomial and uniform distributions. This method is distinctive in that it does not rely on sequence alignment for determining protein relatedness, allowing the user to visualize the relationships within a set of proteins without making a priori assumptions about those proteins. We apply our method to two large families of proteins: protein kinase C, and globins, including hemoglobins and myoglobins. We interpret the high-dimensional feature vectors using principal components analysis and agglomerative hierarchical clustering. We find that the feature vector retains much of the information about the original sequence. By using principal component analysis to extract information from collections of feature vectors, we are able to quickly identify the nature of variation in a collection of proteins. Where collections are phylogenetically or functionally related, this is easily detected. Hierarchical agglomerative clustering provides a means of constructing cladograms from the feature vector output.
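As a rough illustration of the idea of scoring a sequence against binomial and uniform reference models, the following Python sketch builds a 40-dimensional vector for a protein. It is not the authors' construction, which the abstract does not specify. The function name `feature_vector`, the equal residue probability of 1/20, and the use of z-scores for counts and mean positions are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def feature_vector(seq, alphabet=AMINO_ACIDS):
    """Hypothetical alignment-free feature vector (illustrative only).

    First half: z-score of each residue count under a binomial model
    with equal residue probability (an assumption).
    Second half: deviation of each residue's mean position from the
    mean of a uniform distribution over positions 1..n, in units of
    the uniform standard deviation.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    p = 1.0 / len(alphabet)                    # assumed residue probability
    count_sd = math.sqrt(n * p * (1 - p))      # binomial standard deviation
    uniform_mean = (n + 1) / 2                 # mean of positions 1..n
    uniform_sd = math.sqrt((n * n - 1) / 12)   # sd of discrete uniform 1..n

    count_scores, position_scores = [], []
    for aa in alphabet:
        positions = [i + 1 for i, c in enumerate(seq) if c == aa]
        count_scores.append((len(positions) - n * p) / count_sd)
        if positions and uniform_sd > 0:
            mean_pos = sum(positions) / len(positions)
            position_scores.append((mean_pos - uniform_mean) / uniform_sd)
        else:
            position_scores.append(0.0)        # residue absent: no signal
    return count_scores + position_scores


# Two sequences can then be compared without alignment.
a = feature_vector("MKTAYIAKQRQISFVKSHFSRQ")
b = feature_vector("MKTAYIAKQRQISFVKAHFSRQ")
print(math.dist(a, b))
```

Vectors of this kind, one per sequence, could then be passed to principal component analysis or agglomerative clustering as the abstract describes.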
Project description:Proteins are diverse in their sequences, structures and functions, and it is important to study the relations among them. In this paper, we survey the relations between protein sequences and their structures. We use the natural vector (NV) and the averaged property factor (APF) features to represent protein sequences as feature vectors, and use the multi-class MSE and convex hull methods to separate proteins of different structural classes into different regions. We found that proteins from different structural classes are separable by hyperplanes and convex hulls in the natural vector feature space, where the feature vectors of different structural classes occupy disjoint regions or convex hulls in the high-dimensional feature space. The natural vector outperforms the averaged property factor method in identifying the structures, and the convex hull method outperforms the multi-class MSE in separating the feature points. These outcomes indicate strong connections between protein sequences and their structures, and may imply that amino acid composition and sequence arrangement, as represented by the natural vectors, have greater influence on structure than the averaged physical property factors of the amino acids.
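For readers unfamiliar with the natural vector, the following Python sketch shows how it is commonly defined for proteins: for each of the 20 amino acids, the count, the mean position, and a normalized second central moment of the positions, giving 60 dimensions. The abstract does not spell out the formula, so the 1-based position convention, the normalization by count times sequence length, and the function name `natural_vector` should be read as assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def natural_vector(seq, alphabet=AMINO_ACIDS):
    """Return [counts..., mean positions..., second moments...].

    For residue k with positions s_1..s_nk in a sequence of length n:
      n_k  = number of occurrences
      mu_k = mean of the positions (1-based, assumed convention)
      D2_k = sum((s_i - mu_k)^2) / (n_k * n)
    Residues that do not occur contribute zeros.
    """
    n = len(seq)
    counts, means, moments = [], [], []
    for aa in alphabet:
        positions = [i + 1 for i, c in enumerate(seq) if c == aa]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((s - mu_k) ** 2 for s in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


nv = natural_vector("MKTAYIAKQRQISFVKSHFSRQ")
print(len(nv))  # 60 entries: 20 counts, 20 means, 20 moments
```

Each protein becomes one point in this 60-dimensional space, which is the space in which separating hyperplanes or convex hulls would be computed.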
Project description:We describe a novel system, GRIFFIN (G-protein and Receptor Interaction Feature Finding INstrument), that predicts G-protein coupled receptor (GPCR) and G-protein coupling selectivity based on a support vector machine (SVM) and a hidden Markov model (HMM) with high sensitivity and specificity. Based on our assumption that whole structural segments of ligands, GPCRs and G-proteins are essential to determine GPCR and G-protein coupling, various quantitative features were selected for ligands, GPCRs and G-protein complex structures, and those parameters that are the most effective in selecting G-protein type were used as feature vectors in the SVM. The main part of GRIFFIN includes a hierarchical SVM classifier using the feature vectors, which is useful for Class A GPCRs, the major family. For the opsins and olfactory subfamilies of Class A and other minor families (Classes B, C, frizzled and smoothened), the binding G-protein is predicted with high accuracy using the HMM. Applying this system to known GPCR sequences, each binding G-protein is predicted with high sensitivity and specificity (>85% on average). GRIFFIN (http://griffin.cbrc.jp/) is freely available and allows users to easily execute this reliable prediction of G-proteins.
Project description:Standardized DNA assembly strategies facilitate the generation of multigene constructs from collections of building blocks in plant synthetic biology. A common syntax for hierarchical DNA assembly following the Golden Gate principle employing Type IIs restriction endonucleases was recently developed, and underlies the Modular Cloning and GoldenBraid systems. In these systems, transcriptional units and/or multigene constructs are assembled from libraries of standardized building blocks, also referred to as phytobricks, in several hierarchical levels and by iterative Golden Gate reactions. Here, a toolkit containing further modules for the novel DNA assembly standards was developed. Intended for use with Modular Cloning, most modules are also compatible with GoldenBraid. Firstly, a collection of approximately 80 additional phytobricks is provided, comprising e.g. modules for inducible expression systems, promoters or epitope tags. Furthermore, DNA modules were developed for connecting Modular Cloning and Gateway cloning, either for toggling between systems or for standardized Gateway destination vector assembly. Finally, first instances of a "peripheral infrastructure" around Modular Cloning are presented: While available toolkits are designed for the assembly of plant transformation constructs, vectors were created to also use coding sequence-containing phytobricks directly in yeast two hybrid interaction or bacterial infection assays. The presented material will further enhance versatility of hierarchical DNA assembly strategies.
Project description:BACKGROUND: Searching for similarities in a set of biological data is intrinsically difficult because some data points should not be clustered at all, or should group within several clusters. Under these hypotheses, hierarchical agglomerative clustering is not appropriate. Moreover, if the dataset is not sufficiently well characterized, as is often the case, supervised classification is not appropriate either. RESULTS: CLAG (for CLusters AGgregation) is an unsupervised non-hierarchical clustering algorithm designed to cluster a large variety of biological data and to provide a clustered matrix and numerical values indicating cluster strength. CLAG clusters correlation matrices for residues in protein families, gene-expression and miRNA data related to various cancer types, sets of species described by multidimensional vectors of characters, and binary matrices. It does not require all data points to cluster, and it converges, yielding the same result at each run. Its simplicity and speed allow it to run on reasonably large datasets. CONCLUSIONS: CLAG can be used to investigate the cluster structure present in biological datasets and to identify its underlying graph. It proved more informative and accurate than several known clustering methods, such as hierarchical agglomerative clustering, k-means, fuzzy c-means, model-based clustering and affinity propagation clustering, and it does not suffer from the convergence problem of the latter.
Project description:Plant viral expression vectors are advantageous for high-throughput functional characterization studies of genes due to their capability for rapid, high-level transient expression of proteins. We have constructed a series of tobacco mosaic virus (TMV) based vectors that are compatible with Gateway technology to enable rapid assembly of expression constructs and exploitation of ORFeome collections. In addition to the potential of producing recombinant protein at grams per kilogram FW of leaf tissue, these vectors facilitate either N- or C-terminal fusions to a broad series of epitope tag(s) and fluorescent proteins. We demonstrate the utility of these vectors in affinity purification, immunodetection and subcellular localisation studies. We also apply the vectors to characterize protein-protein interactions and demonstrate their utility in screening plant pathogen effectors. Given its broad utility in defining protein properties, this vector series will serve as a useful resource to expedite gene characterization efforts.
Project description:BACKGROUND: Despite progress in malaria control, malaria remains an important public health concern in Cambodia, mostly linked to forested areas. Large-scale vector control interventions in Cambodia are based on the free distribution of long-lasting insecticidal nets (LLINs), targeting indoor- and late-biting malaria vectors only. The present study evaluated the vector density, early biting activity and malaria transmission of outdoor-biting malaria vectors in two forested regions in Cambodia. METHODS: In 2005 two entomological surveys were conducted in 12 villages and their related forest plots in the east and west of Cambodia. Mosquitoes were collected outdoors by human landing collections and subjected to enzyme-linked immunosorbent assay (ELISA) to detect Plasmodium sporozoites after morphological identification. Blood samples were collected in the same villages for serological analyses. Collected data were analysed by the classification and regression tree (CART) method and linear regression analysis. RESULTS: A total of 11,826 anophelines were recorded landing in 787 man-night collections. The majority (82.9%) were the known primary and secondary vectors. Most of the variability in vector densities and early biting rates was explained by geographical factors, mainly at village level. Vector densities were similar between forest and village sites. Based on ELISA results, 29% out of 17 Plasmodium-positive bites occurred before sleeping time, and 65% in the forest plots. The entomological inoculation rates of survey 1 were important predictors of the respective seroconversion rates in survey 2, whereas the mosquito densities were not. DISCUSSION: In Cambodia, outdoor malaria transmission in villages and forest plots is important. In this context, deforestation might result in lower densities of the primary vectors, but also in higher densities of secondary vectors invading deforested areas. 
Moreover, higher accessibility of the forest could result in a higher man-vector contact. Therefore, additional vector control measures should be developed to target outdoor- and early-biting vectors.
Project description:Plasmodium knowlesi is found in macaques and is the only major zoonotic malaria to affect humans. Transmission of P. knowlesi between people and macaques depends on the host species preferences and feeding behaviour of mosquito vectors. However, these behaviours are difficult to measure due to the lack of standardized methods for sampling potential vectors attracted to different host species. This study evaluated electrocuting net traps as a safe, standardised method for sampling P. knowlesi vectors attracted to human and macaque hosts. Field experiments were conducted within a major focus of P. knowlesi transmission in Malaysian Borneo to compare the performance of human (HENET) or macaque (MENET) odour-baited electrocuting nets, human landing catches (HLC) and monkey-baited traps (MBT) for sampling mosquitoes. The abundance and diversity of Anopheles sampled by different methods were compared over 40 nights, with a focus on the P. knowlesi vector Anopheles balabacensis. HLC caught more An. balabacensis than any other method (3.6 per night). In contrast, no An. balabacensis were collected in MBT collections, which generally performed poorly for all mosquito taxa. Anopheles vector species including An. balabacensis were sampled in both HENET and MENET collections, but at a mean abundance of less than 1 per night. There was no difference between HENET and MENET in the overall abundance (P = 0.05) or proportion (P = 0.7) of An. balabacensis. The estimated diversity of Anopheles species was marginally higher in electrocuting net than HLC collections, and similar in collections made with human or monkey hosts. Host-baited electrocuting nets had moderate success for sampling known zoonotic malaria vectors. The primary vector An. balabacensis was collected with electrocuting nets baited both with humans and macaques, but at a considerably lower density than the HLC standard. 
However, electrocuting nets were considerably more successful than monkey-baited traps and representatively characterised anopheline species diversity. Consequently, their use allows inferences about relative mosquito attraction to be meaningfully interpreted while eliminating confounding factors due to trapping method. On this basis, electrocuting net traps should be considered as a useful standardised method for investigating vector contact with humans and wildlife reservoirs.
Project description:BACKGROUND: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into feature vectors that contain the occurrence, location and order relation of k-tuples in each DNA sequence. Afterwards, a hierarchical procedure is applied to cluster the DNA sequences based on the feature vectors. RESULTS: The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust, CD-HIT-EST and some others. The experimental results show that our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences. CONCLUSIONS: We introduced a novel clustering algorithm based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences.
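To make "occurrence and location of k-tuples" concrete, here is a small Python sketch that indexes every k-tuple in a DNA sequence by its start positions and summarizes each as a count and a mean position. It does not implement the DMk distance or the mBKM clustering procedure, whose details are not given in the abstract. The function names and the 1-based position convention are hypothetical.

```python
from collections import defaultdict


def ktuple_locations(seq, k):
    """Map each k-tuple to the list of its 1-based start positions."""
    locations = defaultdict(list)
    for i in range(len(seq) - k + 1):
        locations[seq[i:i + k]].append(i + 1)
    return dict(locations)


def ktuple_features(seq, k):
    """Summarize each k-tuple as (occurrence count, mean start position)."""
    return {
        tup: (len(pos), sum(pos) / len(pos))
        for tup, pos in ktuple_locations(seq, k).items()
    }


for tup, (count, mean_pos) in sorted(ktuple_features("ACGACGTT", 3).items()):
    print(tup, count, mean_pos)
```

A fixed-length vector can be obtained by listing these summaries in a fixed order over all 4^k possible k-tuples, with zeros for those that do not occur.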
Project description:We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
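The abstract does not describe the preprocessing, but the ProtVec approach is generally described as splitting each protein into 3-grams, training word embeddings on them, and representing a protein by combining the vectors of its 3-grams. The Python sketch below shows only the splitting and the combination step, using a toy lookup table in place of trained embeddings. The function names, the toy table, and the choice to sum over all overlapping 3-grams are assumptions for illustration.

```python
def shifted_ngrams(seq, n=3):
    """Split a sequence into n lists of non-overlapping n-grams,
    one list per starting offset 0..n-1."""
    return [
        [seq[i:i + n] for i in range(offset, len(seq) - n + 1, n)]
        for offset in range(n)
    ]


def embed_sequence(seq, table, dim, n=3):
    """Sum the embedding vectors of all overlapping n-grams in seq.

    `table` maps n-gram -> list of `dim` floats (hypothetical, would
    come from a trained model). Unknown n-grams are skipped.
    """
    total = [0.0] * dim
    for i in range(len(seq) - n + 1):
        vec = table.get(seq[i:i + n])
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total


# Toy example with a made-up 2-dimensional embedding table.
toy_table = {"MKT": [1.0, 0.0], "KTA": [0.0, 2.0]}
print(shifted_ngrams("MKTAYIAK"))
print(embed_sequence("MKTA", toy_table, dim=2))
```

In practice the lists produced by `shifted_ngrams` would serve as training "sentences" for a word-embedding model, and the summed vector would be the input to a downstream classifier such as a support vector machine.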
Project description:Adeno-associated viral (AAV) vectors show great promise because of their excellent safety profile; however, pre-existing immune responses have necessitated the administration of high titer AAV, posing a significant challenge to the advancement of gene therapy involving AAV vectors. Recombinant AAV vectors contain the minimum viral proteins necessary for their assembly and gene delivery functions. During the process of AAV assembly and production, AAV vectors acquire, inherently and submissively, various cellular proteins, but the identity of these proteins is poorly characterized. We reason that by identifying host cell proteins inherently associated with AAV vectors we may better understand the contribution of cellular components to AAV vector assembly and, ultimately, may improve the production of AAV vectors for gene therapy. In this study, three serotypes of recombinant AAV, namely AAV2, AAV5, and AAV8, were investigated. We used liquid chromatography-mass spectrometry/mass spectrometry (LC-MS/MS) methods to identify protein composition in purified AAV vectors, confirmed protein identities using western blotting, and explored the potential function of selected proteins in AAV vector production using small hairpin RNA (shRNA) methods. Using LC-MS/MS, we identified 44 AAV-associated cellular proteins including Y-box binding protein (YB1). We showed for the first time that the establishment of a novel producer cell line by introducing an shRNA sequence down-regulating YB1 resulted in up to 45- and 9-fold increases in physical vector genome titers of AAV2 and AAV8, respectively, and up to a 7-fold increase in AAV2 transduction vector genome titers. 
Our results revealed that YB1 gene knockdown promoted AAV2 rep expression and vector DNA production and reduced the number of empty particles in AAV2 products, suggesting that YB1 plays an important role in AAV vector assembly by competition with adenovirus E2A and AAV capsid proteins for binding to the inverted terminal repeat (ITR) sequence. The significance and implications of our findings in future improvement of AAV production are discussed.