Project description:Encapsulins are a class of microbial protein compartments defined by the viral HK97-fold of their capsid protein, self-assembly into icosahedral shells, and dedicated cargo loading mechanism for sequestering specific enzymes. Encapsulins are often misannotated and traditional sequence-based searches yield many false positive hits in the form of phage capsids. Here, we develop an integrated search strategy to carry out a large-scale computational analysis of prokaryotic genomes with the goal of discovering an exhaustive and curated set of all HK97-fold encapsulin-like systems. We find over 6,000 encapsulin-like systems in 31 bacterial and four archaeal phyla, including two novel encapsulin families. We formulate hypotheses about their potential biological functions and biomedical relevance, which range from natural product biosynthesis and stress resistance to carbon metabolism and anaerobic hydrogen production. An evolutionary analysis of encapsulins and related HK97-type virus families shows that they share a common ancestor, and we conclude that encapsulins likely evolved from HK97-type bacteriophages.
Project description:This article compares 32 bacterial genomes with respect to their potential for high transcription. The sigma70 promoter has been widely studied in the Escherichia coli model, and a consensus is known. Since transcriptional regulation is known to compensate for promoter weakness (i.e. when the promoter similarity to the consensus is rather low), predicting functional promoters is a hard task. Instead, the research work presented here investigates potentially high ORF expression in relation to three criteria: (i) high similarity to the sigma70 consensus (namely, the consensus variant appropriate for each genome), (ii) transcription strength reinforcement through a supplementary binding site--the upstream promoter (UP) element--and (iii) enhancement through an optimal Shine-Dalgarno (SD) sequence. We show that in the AT-rich Firmicutes genomes, frequencies of potentially strong sigma70-like promoters are exceptionally high. Moreover, some genomes contain a low number of strong promoters (SPs) yet show a high proportion of promoters harbouring an UP element. Putative SPs of lesser quality are more frequently associated with an UP element than putative SPs of better quality. A statistically significant difference is found when comparing bacterial genomes with similarly AT-rich genomes generated at random; the difference is highest for Firmicutes. Comparing some Firmicutes genomes with similarly AT-rich Proteobacteria genomes, we confirm the Firmicutes specificity. We show that this specificity is explained neither by AT bias nor by genome size bias; nor does it originate in the abundance of optimal SD sequences, a typical and significant feature of Firmicutes that is analysed more thoroughly in our study.
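Criterion (i) above, similarity to the sigma70 consensus, can be illustrated with a minimal sketch. It assumes the canonical E. coli hexamers (TTGACA at -35, TATAAT at -10), whereas the study uses a consensus variant per genome. The function names and the `min_score` threshold are hypothetical and are not taken from the article.

```python
# Canonical E. coli sigma70 consensus hexamers (assumption: the study uses
# a genome-specific consensus variant, which is not reproduced here).
CONSENSUS_35 = "TTGACA"
CONSENSUS_10 = "TATAAT"


def hexamer_matches(hexamer: str, consensus: str) -> int:
    """Count positions where the hexamer agrees with the consensus."""
    if len(hexamer) != len(consensus):
        raise ValueError("hexamer and consensus must have equal length")
    return sum(a == b for a, b in zip(hexamer.upper(), consensus))


def promoter_score(box35: str, box10: str) -> int:
    """Total number of matching positions over both boxes (0 to 12)."""
    return hexamer_matches(box35, CONSENSUS_35) + hexamer_matches(box10, CONSENSUS_10)


def is_putative_strong(box35: str, box10: str, min_score: int = 10) -> bool:
    """Flag a candidate as a putative strong promoter.

    min_score is an illustrative cutoff, not the criterion used in the study.
    """
    return promoter_score(box35, box10) >= min_score


if __name__ == "__main__":
    print(promoter_score("TTGACT", "TATAAT"))
    print(is_putative_strong("AAAAAA", "TATAAT"))
```

A fuller treatment would also account for the spacer length between the two boxes and for the UP element and SD sequence, which this sketch leaves out.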
Project description:The architecture of mouse and human antibody repertoires is defined by the sequence similarity networks of the clones that compose them. The major principles that define the architecture of antibody repertoires have remained largely unknown. Here, we establish a high-performance computing platform to construct large-scale networks from comprehensive human and murine antibody repertoire sequencing datasets (>100,000 unique sequences). Leveraging a network-based statistical framework, we identify three fundamental principles of antibody repertoire architecture: reproducibility, robustness and redundancy. Antibody repertoire networks are highly reproducible across individuals despite high antibody sequence dissimilarity. The architecture of antibody repertoires is robust to the removal of up to 50-90% of randomly selected clones, but fragile to the removal of public clones shared among individuals. Finally, repertoire architecture is intrinsically redundant. Our analysis provides guidelines for the large-scale network analysis of immune repertoires and may be used in the future to define disease-associated and synthetic repertoires.
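The idea of a sequence similarity network and its robustness to clone removal can be sketched as follows. This is a toy illustration, not the authors' platform: it assumes that two clones are linked when their sequences differ by at most one edit (`max_dist=1`), and it uses a quadratic all-pairs comparison that would not scale to the >100,000 sequences mentioned above.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def build_network(sequences, max_dist=1):
    """Adjacency sets linking unique sequences within max_dist edits.

    max_dist=1 is an assumption made for this sketch.
    """
    unique = sorted(set(sequences))
    adjacency = {s: set() for s in unique}
    for a, b in combinations(unique, 2):
        # A length difference larger than max_dist already rules out an edge.
        if abs(len(a) - len(b)) <= max_dist and edit_distance(a, b) <= max_dist:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def largest_component(adjacency, removed=frozenset()):
    """Size of the largest connected component after removing some nodes."""
    seen = set(removed)
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


if __name__ == "__main__":
    clones = ["CARDY", "CARDF", "CARDW", "CTRDW", "CASSLGY"]
    net = build_network(clones)
    print(largest_component(net))
    print(largest_component(net, removed={"CARDW"}))
```

Comparing `largest_component` before and after removing randomly chosen clones, versus removing clones shared among individuals, gives a rough feel for the robustness and fragility described above.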
Project description:A hallmark of rheumatoid arthritis (RA) is the production of autoantibodies, including anti-citrullinated protein antibodies (ACPAs). Nevertheless, the specific targets of these autoantibodies remain incompletely defined. During an immune response, B cells specific for the inciting antigen(s) are activated and differentiate into plasmablasts, which are released into the blood. We undertook this study to sequence the plasmablast antibody repertoire to define the targets of the active immune response in RA. We developed a novel DNA barcoding method to sequence the cognate heavy- and light-chain pairs of antibodies expressed by individual blood plasmablasts in RA. The method uses a universal 5' adapter that enables full-length sequencing of the antibodies' variable regions and recombinant expression of the paired antibody chains. The sequence data sets were bioinformatically analyzed to generate phylogenetic trees that identify clonal families of antibodies sharing heavy- and light-chain VJ sequences. Representative antibodies were expressed, and their binding properties were characterized using anti-cyclic citrullinated peptide 2 (anti-CCP-2) enzyme-linked immunosorbent assay (ELISA) and antigen microarrays. We used our sequencing method to generate phylogenetic trees representing the antibody repertoires of peripheral blood plasmablasts from 4 individuals with anti-CCP+ RA, and recombinantly expressed 14 antibodies that were either "singleton" antibodies or representative of clonal antibody families. Anti-CCP-2 ELISA identified 4 ACPAs, and antigen microarray analysis identified ACPAs that differentially targeted epitopes on α-enolase, citrullinated fibrinogen, and citrullinated histone H2B. Our data provide evidence that autoantibodies targeting α-enolase, citrullinated fibrinogen, and citrullinated histone H2B are produced by the ongoing activated B cell response in, and thus may contribute to the pathogenesis of, RA.
Project description:Virtual screening is receiving renewed attention in drug discovery, but progress is hampered by challenges on two fronts: handling the ever-increasing sizes of libraries of drug-like compounds and separating true positives from false positives. Here, we developed a machine learning-enabled pipeline for large-scale virtual screening that promises breakthroughs on both fronts. By clustering compounds according to molecular properties and limited docking against a drug target, the full library was trimmed by 10-fold; the remaining compounds were then screened individually by docking; and finally, a dense neural network was trained to classify the hits into true and false positives. As illustration, we screened for inhibitors against RPN11, the deubiquitinase subunit of the proteasome, and a drug target for breast cancer.
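The library-trimming step described above (cluster, dock a few members per cluster, keep only the promising clusters) might look roughly like the sketch below. Everything in it is an assumption for illustration: cluster labels are taken as given, `dock` is a user-supplied callable standing in for a real docking program, lower scores are treated as better, and `keep_fraction` applies to clusters rather than to compounds. The neural-network classification stage is not shown.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List, Tuple


def trim_library(
    compounds: Iterable[Tuple[str, Hashable]],
    dock: Callable[[str], float],
    reps_per_cluster: int = 2,
    keep_fraction: float = 0.1,
) -> List[str]:
    """Return compounds from the best-scoring clusters.

    compounds: (compound_id, cluster_label) pairs; the clustering itself is
    assumed to have been done beforehand.
    dock: hypothetical scoring function; lower is assumed to mean better.
    """
    clusters = defaultdict(list)
    for compound_id, label in compounds:
        clusters[label].append(compound_id)

    # Limited docking: score each cluster by the best of its first few members.
    cluster_scores = {
        label: min(dock(c) for c in members[:reps_per_cluster])
        for label, members in clusters.items()
    }

    # Keep the top fraction of clusters, and at least one.
    n_keep = max(1, round(len(clusters) * keep_fraction))
    best_labels = sorted(cluster_scores, key=cluster_scores.get)[:n_keep]
    return [c for label in best_labels for c in clusters[label]]


if __name__ == "__main__":
    toy_scores = {"a1": -5.0, "a2": -9.0, "b1": -6.0, "b2": -7.0, "c1": -4.0}
    library = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("b2", "B"), ("c1", "C")]
    print(trim_library(library, toy_scores.__getitem__, keep_fraction=0.34))
```

The surviving compounds would then be docked individually, and the resulting hits passed to a classifier trained to separate true from false positives.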
Project description:Cellulases are important glycosyl hydrolases (GHs) that hydrolyze cellulose polymers into smaller oligosaccharides by breaking the cellulose β-(1→4) bonds, and they are widely used to produce cellulosic ethanol from plant biomass. N-linked and O-linked glycosylations have been proposed to affect the catalytic efficiency, cellulose binding affinity and stability of cellulases, based on observations of individual cellulases. As far as we know, there has been no systematic analysis of the distributions of N-linked and O-linked glycosylated residues in cellulases, mainly due to the limited annotations of the relevant functional domains and the glycosylated residues. We have computationally annotated the functional domains and glycosylated residues in cellulases, and conducted a systematic analysis of the distributions of the N-linked and O-linked glycosylated residues in these enzymes. Many N-linked glycosylated residues are known to lie in the GH domains of cellulases, but they are probably there just by chance, since the GH domain usually occupies more than half of the sequence length of a cellulase. Our analysis indicates that the O-linked glycosylated residues are significantly enriched in the linker regions between the carbohydrate binding module (CBM) domains and GH domains of cellulases. Possible mechanisms are discussed.
Project description:Elucidating how antigen exposure and selection shape the human antibody repertoire is fundamental to our understanding of B-cell immunity. We sequenced the paired heavy- and light-chain variable regions (VH and VL, respectively) from large populations of single B cells combined with computational modeling of antibody structures to evaluate sequence and structural features of human antibody repertoires at unprecedented depth. Analysis of a dataset comprising 55,000 antibody clusters from CD19(+)CD20(+)CD27(-) IgM-naive B cells, >120,000 antibody clusters from CD19(+)CD20(+)CD27(+) antigen-experienced B cells, and >2,000 RosettaAntibody-predicted structural models across three healthy donors led to a number of key findings: (i) VH and VL gene sequences pair in a combinatorial fashion without detectable pairing restrictions at the population level; (ii) certain VH:VL gene pairs were significantly enriched or depleted in the antigen-experienced repertoire relative to the naive repertoire; (iii) antigen selection increased antibody paratope net charge and solvent-accessible surface area; and (iv) public heavy-chain third complementarity-determining region (CDR-H3) antibodies in the antigen-experienced repertoire showed signs of convergent paired light-chain genetic signatures, including shared light-chain third complementarity-determining region (CDR-L3) amino acid sequences and/or Vκ,λ-Jκ,λ genes. The data reported here address several longstanding questions regarding antibody repertoire selection and development and provide a benchmark for future repertoire-scale analyses of antibody responses to vaccination and disease.
Project description:Learning to read is foundational for literacy development, yet many children in primary school fail to become efficient readers despite normal intelligence and schooling. This condition, referred to as developmental dyslexia, has been hypothesized to occur because of deficits in vision, attention, auditory and temporal processes, and phonology and language. Here, we used a developmentally plausible computational model of reading acquisition to investigate how the core deficits of dyslexia determined individual learning outcomes for 622 children (388 with dyslexia). We found that individual learning trajectories could be simulated on the basis of three component skills related to orthography, phonology, and vocabulary. In contrast, single-deficit models captured the means but not the distribution of reading scores, and a model with noise added to all representations could not even capture the means. These results show that heterogeneity and individual differences in dyslexia profiles can be simulated only with a personalized computational model that allows for multiple deficits.
Project description:The degradation of glycosaminoglycans (GAGs) by intestinal bacteria is critical for their colonization of the human gut and the health of the host. Both Bacteroides and Firmicutes have been reported to degrade GAGs, while the enzymatic details of the latter remain largely unknown. In this study, we isolated a Firmicutes strain, Hungatella hathewayi N2-326, that can catabolize various GAGs. While H. hathewayi N2-326 was less efficient in utilizing chondroitin sulfate A (CSA) and dermatan sulfate (DS) than Bacteroides thetaiotaomicron, a characterized GAG degrader, it outperformed B. thetaiotaomicron in assimilating hyaluronic acid. Unlike B. thetaiotaomicron, H. hathewayi N2-326 could not utilize heparin. The chondroitin lyase activity of H. hathewayi N2-326 was found to be induced by CSA and displayed both cell-associated and extracellular distributions. We further identified and characterized the first chondroitin ABC lyase from Firmicutes. The recombinant H. hathewayi chondroitin ABC lyase was found to be predominantly an exolyase and exhibited higher specific activity than any other characterized chondroitin ABC lyase. Thus, the HH-chondroitin ABC lyase offers a viable commercial option for the production of chondroitin, dermatan, and hyaluronan oligosaccharides and for potential medical applications.
Project description:Various types of analyses performed over multi-omics data are driven today by next-generation sequencing (NGS) techniques that produce large volumes of DNA/RNA sequences. Although many tools allow for parallel processing of NGS data in a Big Data distributed environment, they do not make it easy to improve the quality of NGS data at large scale in a simple declarative manner. Meanwhile, large sequencing projects and routine DNA/RNA sequencing associated with molecular profiling of diseases for personalized treatment require both good-quality data and appropriate infrastructure for efficient storage and processing of the data. To solve these problems, we adapt the concept of the Data Lake for storing and processing big NGS data. We also propose a dedicated library for cleaning the DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is highly scalable on the Cloud and can rapidly and flexibly adjust to the amount of data to be processed. Moreover, to simplify the use of the data cleaning methods and the implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language, providing a set of capabilities for data extraction, processing, and storage. The results of our experiments show that the whole solution meets the requirements for ample storage and highly parallel, scalable processing that accompany NGS-based multi-omics data analyses.
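As a rough illustration of what cleaning sequencing reads can involve, the sketch below trims low-quality bases from the 3' end of each read and drops reads that are too short or contain ambiguous bases. It is written in Python rather than U-SQL and is not the library described above. The Phred+33 quality encoding, the thresholds, and all function names are assumptions.

```python
def phred_scores(quality: str, offset: int = 33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20):
    """Remove trailing bases whose quality is below min_q."""
    scores = phred_scores(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def clean_reads(reads, min_q: int = 20, min_len: int = 4):
    """Yield trimmed (name, sequence, quality) records that pass the filters.

    min_q and min_len are illustrative thresholds only.
    """
    for name, seq, quality in reads:
        if len(seq) != len(quality):
            continue  # malformed record
        trimmed_seq, trimmed_quality = trim_3prime(seq, quality, min_q)
        if len(trimmed_seq) >= min_len and "N" not in trimmed_seq:
            yield name, trimmed_seq, trimmed_quality


if __name__ == "__main__":
    raw = [
        ("r1", "ACGTAC", "IIII##"),  # low-quality tail is trimmed
        ("r2", "ACNTAC", "IIIIII"),  # contains an ambiguous base
        ("r3", "AC", "II"),          # too short
    ]
    for record in clean_reads(raw):
        print(record)
```

For paired-end data, a read and its mate would normally be kept or discarded together; that bookkeeping is omitted here.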