Project description:We have utilized version 2.0 of our fusion gene microarray (version 1.0 described in PMID: 19152679 and GEO, GPL8078) to screen cancer cell lines for known fusion genes. We assembled a comprehensive database of published fusion genes, including those which have been reported only in individual studies and samples, and fusion genes resulting from deep sequencing of cancer genomes and transcriptomes. From the total set of 548 fusion genes, we designed 197,846 unique oligonucleotides, targeting both chimeric transcript junctions and sequences internal to each of the fusion gene partners. We investigated the presence of fusion genes in a series of 67 cell lines originating from 15 different cancer types. Data from ten leukaemia cell lines with known fusion gene status were used to evaluate and calibrate an automated scoring algorithm. In five out of ten cell lines the correct fusion gene was the top-scoring hit, and in one it came second, with all six passing a cut-off of 50 percent of the theoretical maximum score. Among the remaining 57 cell lines, two additional fusion genes, BCAS4-BCAS3 from the MCF-7 breast cancer cell line and CCDC6-RET from the TPC-1 thyroid cancer cell line, were validated as true-positive fusion transcripts. A total of 67 cell lines from various cancer types were analysed for the presence of known fusion genes.
Project description:Phosphorylation is a ubiquitous post-translational modification that regulates protein function by promoting, inhibiting or modulating protein-protein interactions. Hundreds of thousands of phosphosites have been identified, but the vast majority have not been functionally characterised, and it remains a challenge to decipher which phosphorylation events modulate interactions. We generated a proteomic peptide-phage display library to screen for phosphosites that modulate short linear motif-based interactions. The peptidome covers ~13,500 phospho-serine/threonine sites found in the intrinsically disordered regions of the human proteome. Each phosphosite is represented as a wild-type and a phosphomimetic variant. We screened 71 protein domains and identified 248 phosphosites that modulate motif-mediated interactions. Affinity measurements confirmed the phospho-modulation of 14 out of 18 tested interactions. We performed a detailed follow-up on a phospho-dependent interaction between clathrin and the mitotic spindle protein hepatoma-upregulated protein (HURP), demonstrating that the phospho-dependency is essential to the mitotic function of HURP. Structural characterisation of the clathrin-HURP complex elucidated the molecular basis for the phospho-dependency. Our work showcases the power of phosphomimetic ProP-PD to discover novel phospho-modulated interactions required for cellular function.
Project description:Finding the global minimum energy conformation (GMEC) in a huge combinatorial search space is the key challenge in computational protein design (CPD). Traditional algorithms lack a scalable and efficient distributed design scheme, preventing researchers from taking full advantage of current cloud infrastructures. We design cloud OSPREY (cOSPREY), an extension to the widely used protein design software OSPREY, that allows the original design framework to scale to commercial cloud infrastructures. We propose several novel designs that integrate both algorithm and system optimizations, such as GMEC-specific pruning, state search partitioning, asynchronous algorithm state sharing, and fault tolerance. We evaluate cOSPREY on three different cloud platforms using different technologies and show that it can solve a number of large-scale protein design problems that were not tractable with previous approaches.
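To illustrate why the GMEC search is combinatorial, the following is a minimal brute-force sketch, not the cOSPREY or OSPREY implementation. It assumes the commonly used pairwise energy model, in which a conformation's energy is the sum of per-position self energies and pairwise interaction energies; the function name, data layout and toy energy values are hypothetical.

```python
from itertools import product


def brute_force_gmec(self_energy, pair_energy):
    """Exhaustively search for the minimum-energy rotamer assignment.

    self_energy[i][r]        : energy of rotamer r at design position i
    pair_energy[(i, j)][r][s]: interaction energy of rotamer r at position i
                               with rotamer s at position j (i < j)
    Both layouts are assumptions made for this illustration.
    """
    n = len(self_energy)
    best_conf, best_energy = None, float("inf")
    # The search space is the Cartesian product of the rotamer sets,
    # so its size grows exponentially with the number of positions.
    for conf in product(*(range(len(e)) for e in self_energy)):
        energy = sum(self_energy[i][conf[i]] for i in range(n))
        energy += sum(
            table[conf[i]][conf[j]] for (i, j), table in pair_energy.items()
        )
        if energy < best_energy:
            best_conf, best_energy = conf, energy
    return best_conf, best_energy


# Toy problem: two positions with two rotamers each (made-up energies).
toy_self = [[0.0, 1.0], [0.5, 0.0]]
toy_pair = {(0, 1): [[2.0, 0.0], [0.0, -3.0]]}
gmec, gmec_energy = brute_force_gmec(toy_self, toy_pair)
```

With n positions and k rotamers per position the loop visits k^n conformations, which is why pruning and partitioning of the search, as named in the abstract, matter at scale.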
Project description:We have utilized version 2.0 of our fusion gene microarray (version 1.0 described in PMID: 19152679 and GEO, GPL8078) to screen cancer cell lines for known fusion genes. We assembled a comprehensive database of published fusion genes, including those which have been reported only in individual studies and samples, and fusion genes resulting from deep sequencing of cancer genomes and transcriptomes. From the total set of 548 fusion genes, we designed 197,846 unique oligonucleotides, targeting both chimeric transcript junctions and sequences internal to each of the fusion gene partners. We investigated the presence of fusion genes in a series of 67 cell lines originating from 15 different cancer types. Data from ten leukaemia cell lines with known fusion gene status were used to evaluate and calibrate an automated scoring algorithm. In five out of ten cell lines the correct fusion gene was the top-scoring hit, and in one it came second, with all six passing a cut-off of 50 percent of the theoretical maximum score. Among the remaining 57 cell lines, two additional fusion genes, BCAS4-BCAS3 from the MCF-7 breast cancer cell line and CCDC6-RET from the TPC-1 thyroid cancer cell line, were validated as true-positive fusion transcripts.
Project description:Colorectal neoplasia causes bleeding, enabling detection using Faecal Occult Blood tests (FOBt). The National Health Service (NHS) Bowel Cancer Screening Programme (BCSP) guaiac-based FOBt (gFOBt) kits contain six sample windows (or 'spots') and each kit returns either a positive, unclear or negative result. Test kits with five or six positive windows are termed 'abnormal' and the subject is referred for further investigation, usually colonoscopy. If 1-4 windows are positive, the result is initially 'unclear' and up to two further kits are submitted; further positivity leads to colonoscopy ('weak positive'). If no further blood is detected, the test is deemed 'normal' and subjects are tested again in 2 years' time. We studied the association between spot positivity % (SP%) and neoplasia. Subjects in the Southern Hub completing the first of two consecutive episodes between April 2009 and March 2011 were studied. Each episode included up to three kits and a maximum of 18 windows (spots). For each positivity combination, the percentage of positive spots out of the total number of spots completed by an individual in a single screening episode was derived and named 'SP%'. Fifty-five combinations of SP can occur if the position of positive/negative spots on the same test card is ignored. The proportion of individuals for whom neoplasia was identified in Episode 2 was derived for each of the 55 spot combinations. In addition, the Episode 1 spot pattern was analysed for subjects with cancer detected in Episode 2. During Episode 2, 284,261 subjects completed gFOBt screening and colonoscopies were performed on 3891 (1.4%) subjects. At colonoscopy, cancer was detected in 7.4% (n=286) and a further 39.8% (n=1550) had adenomas. 
Cancer was detected in 21.3% of subjects with an abnormal first kit (five or six positive spots) and in 5.9% of those with a weak positive test result. The proportion of cancers detected was positively correlated with SP%, with a linear R² of 0.89. As the SP% increased from 11 to 100%, the colorectal cancer (CRC) detection rate increased from 4 to 25%. At the lower SP%s, from 11 to 25%, the CRC risk was relatively static at ~4%. Above an SP% of 25%, every 10-percentage-point increase in the SP% was associated with an increase in cancer detection of 2.5%. This study demonstrated a strong correlation between SP% and cancer detection within the NHS BCSP. At the population level, subjects' cancer risk ranged from 4 to 25% and correlated with the gFOBt spot pattern. Some subjects with an SP% of 11% proceed to colonoscopy, whereas others with an SP% of 22% do not. Colonoscopy on patients with four positive spots in kit 1 (SP% 22%) would, we estimate, detect cancer in ~4% of cases and increase overall colonoscopy volume by 6%. This study also demonstrated how screening programme data could be used to guide the programme's ongoing implementation and inform other programmes.
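The SP% definition and the count of 55 spot combinations can be reproduced with a short sketch. It assumes one reading of the kit protocol described above: an episode ends after kit 1 if zero, five or six spots are positive, ends after kit 2 if that kit shows any positive spot, and otherwise continues to a third kit. The function names are hypothetical and this is an illustration, not the study's analysis code.

```python
SPOTS_PER_KIT = 6


def sp_percent(kits):
    """Percentage of positive spots among all spots completed in an episode.

    kits: tuple with the number of positive spots in each completed kit.
    """
    return 100.0 * sum(kits) / (SPOTS_PER_KIT * len(kits))


def episode_patterns():
    """Enumerate spot-count patterns for one episode, ignoring spot position.

    Assumed protocol: 0 positives in kit 1 is 'normal', 5-6 is 'abnormal',
    1-4 is 'unclear' and triggers kit 2; any positive spot in kit 2 ends the
    episode ('weak positive'); otherwise kit 3 is completed.
    """
    patterns = []
    for k1 in range(SPOTS_PER_KIT + 1):
        if k1 == 0 or k1 >= 5:
            patterns.append((k1,))
            continue
        for k2 in range(SPOTS_PER_KIT + 1):
            if k2 > 0:
                patterns.append((k1, k2))
                continue
            for k3 in range(SPOTS_PER_KIT + 1):
                patterns.append((k1, 0, k3))
    return patterns


patterns = episode_patterns()
```

Under these assumptions `len(patterns)` is 55, matching the abstract. The pattern `(1, 0, 1)` gives an SP% of about 11% and leads to colonoscopy, whereas `(4, 0, 0)` gives about 22% and does not, which is the asymmetry the abstract points out.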
Project description:Identifying genomic regions that descended from a common ancestor helps us study gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs are used in function prediction, operon detection, etc. Many computational methods have been proposed for defining gene clusters in order to identify gene families and operons. However, most of those algorithms are only applicable to small data sets. We developed an efficient gene clustering algorithm that can be applied to hundreds of genomes at the same time. This approach allows for large-scale study of the evolutionary relationships of gene clusters and of operon formation and destruction. An analysis of proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.
Project description:Phosphorylation is a ubiquitous post-translational modification that regulates protein function by promoting, inhibiting or modulating protein-protein interactions. Hundreds of thousands of phosphosites have been identified, but the vast majority have not been functionally characterised, and it remains a challenge to decipher which phosphorylation events modulate interactions. We generated a phosphomimetic proteomic peptide-phage display library to screen for phosphosites that modulate short linear motif-based interactions. The peptidome covers ~13,500 phospho-serine/threonine sites found in the intrinsically disordered regions of the human proteome. Each phosphosite is represented as a wild-type and a phosphomimetic variant. We screened 71 protein domains and identified 248 phosphosites that modulate motif-mediated interactions. Affinity measurements confirmed the phospho-modulation of 14 out of 18 tested interactions. We performed a detailed follow-up on a phospho-dependent interaction between clathrin and the mitotic spindle protein hepatoma-upregulated protein (HURP), demonstrating that the phospho-dependency is essential to the mitotic function of HURP. Structural characterisation of the clathrin-HURP complex elucidated the molecular basis for the phospho-dependency. Our work showcases the power of phosphomimetic ProP-PD to discover novel phospho-modulated interactions required for cellular function.
Project description:Background: Network visualization would serve as a useful first step for analysis. However, current graph layout algorithms for biological pathways are insensitive to biologically important information, e.g. subcellular localization, biological node and graph attributes, and/or are not applicable to large-scale networks, e.g. those with more than 10,000 elements. Results: To overcome these problems, we propose the use of a biologically important graph metric, betweenness, a measure of network flow. This metric is highly correlated with many biological phenomena such as lethality and clusters. We devise a new fast parallel algorithm for calculating betweenness to minimize the preprocessing cost. Using this metric, we also devise a fast layout algorithm based on node and edge betweenness (BFL). BFL places the high-betweenness nodes at optimal positions and allows the low-betweenness nodes to reach suboptimal positions. Furthermore, BFL reduces the runtime by combining a sequential insertion algorithm with betweenness. For a graph with n nodes, this approach reduces the expected runtime of the algorithm to O(n^2) when considering edge crossings, and to O(n log n) when considering only density and edge lengths. Conclusion: Our BFL algorithm is compared against fast graph layout algorithms and approaches requiring intensive optimizations. For gene networks, we show that our algorithm is faster than all layout algorithms tested while providing readability on par with intensive optimization algorithms. We achieve a 1.4 second runtime for a graph with 4000 nodes and 12000 edges on a standard desktop computer.
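For readers unfamiliar with the metric, node betweenness can be computed with the standard sequential Brandes algorithm, sketched below for an unweighted, undirected graph. This is not the parallel algorithm proposed in the abstract, whose details are not given here; the adjacency-dictionary input format and the function name are assumptions made for illustration.

```python
from collections import deque


def node_betweenness(adj):
    """Unnormalized node betweenness (Brandes' algorithm).

    adj: dict mapping each node to an iterable of its neighbours
         (undirected, unweighted; every node must appear as a key).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2.0 for v, b in bc.items()}


# Toy star graph: the hub lies on every leaf-to-leaf shortest path.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = node_betweenness(star)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting nodes by descending score, as in `ranked`, gives one plausible order in which a betweenness-driven layout could place high-betweenness nodes first; how BFL actually orders and positions nodes is not specified in the abstract.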
Project description:As the size of networks increases, it is becoming important to analyze large-scale network data. Network clustering algorithms are useful for the analysis of such data. Conventional network clustering algorithms are actively being researched for single-machine rather than parallel environments. However, these algorithms cannot analyze large-scale network data because of memory limitations. As a solution, we propose a network clustering algorithm for large-scale network data analysis using Apache Spark, changing the paradigm of the conventional clustering algorithm to improve its efficiency in the Apache Spark environment. We also apply optimization approaches such as a Bloom filter and shuffle selection to reduce memory usage and execution time. By evaluating our proposed algorithm based on the average normalized cut, we confirmed that the algorithm can analyze diverse large-scale network datasets such as biological, co-authorship, internet topology and social networks. Experimental results show that the proposed algorithm produces more accurate clusters than the comparison algorithms while using less memory. Furthermore, we confirm the effectiveness of the proposed optimization approaches and the scalability of the proposed algorithm. In addition, we validate that clusters found by the proposed algorithm can represent biologically meaningful functions.
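The evaluation metric can be illustrated with a small sketch. It assumes a common definition of the normalized cut of a cluster, namely the number of edges leaving the cluster divided by the sum of degrees of its nodes, averaged over all clusters; the abstract does not state the exact formula used, and the function name and input format are hypothetical. Lower values indicate better-separated clusters.

```python
def average_normalized_cut(edges, labels):
    """Mean over clusters of cut(C) / vol(C) for an undirected graph.

    edges : iterable of (u, v) pairs, each undirected edge listed once
    labels: dict mapping each node to its cluster id
    Clusters whose nodes have no edges are skipped (assumption).
    """
    volume = {}  # sum of node degrees per cluster
    cut = {}     # number of edges leaving each cluster
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        volume[cu] = volume.get(cu, 0) + 1
        volume[cv] = volume.get(cv, 0) + 1
        if cu != cv:
            cut[cu] = cut.get(cu, 0) + 1
            cut[cv] = cut.get(cv, 0) + 1
    if not volume:
        return 0.0
    return sum(cut.get(c, 0) / vol for c, vol in volume.items()) / len(volume)


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
ancut = average_normalized_cut(toy_edges, toy_labels)
```

In the toy graph each cluster has one outgoing edge and a degree sum of 7, so the average normalized cut is 1/7.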