TMIC-52. RELATIONSHIP BETWEEN MACROPHAGE AND RADIOSENSITIVITY IN HUMAN PRIMARY AND RECURRENT GLIOBLASTOMA: IN SILICO ANALYSIS WITH PUBLICLY AVAILABLE DATASETS
Project description:The glioblastoma microenvironment predominantly contains tumor-associated macrophages that support tumor growth and invasion. We investigated the relationship between tumor radiosensitivity and infiltrating M1/M2 macrophage profiles in public datasets of primary and recurrent glioblastoma. We estimated the radiosensitivity index (RSI) score based on gene expression rankings. Macrophages were profiled using the deconvolution algorithm CIBERSORTx. Samples from The Cancer Genome Atlas (TCGA), Chinese Glioma Genome Atlas (CGGA), the Ivy Glioblastoma Atlas Project dataset, a single-cell RNA sequencing dataset (GSE84465), Glioma Longitudinal Analysis Consortium (GLASS), and an immunotherapy trial dataset (GSE121810) were included. RSI-high radioresistant tumors were associated with worse overall survival in TCGA and CGGA than RSI-low tumors. M1/M2 macrophage ratios and RSI scores were inversely associated, indicating that radioresistant glioblastoma tumor microenvironments contain more M2 than M1 macrophages. In the single-cell RNA sequencing dataset, the mean RSI of neoplastic cells was positively correlated with the proportion of M2 macrophages. A favorable response to programmed cell death protein 1 (PD-1) therapy was observed in recurrent glioblastomas with high M1/M2 macrophage ratios and low RSI scores. In patients with recurrent glioblastoma, fewer M2 macrophages and low RSI scores were associated with improved overall survival. High M2 macrophage proportions may be involved in radioresistant glioblastoma.
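The analysis steps named in this abstract (a rank-based expression score, an M1/M2 ratio from deconvolved fractions, and a test of their association) can be sketched as follows. This is a minimal illustration only: the gene names, weights, sample values, and function names are hypothetical, the weights are not the published RSI coefficients, and the ranking is assumed to be taken within the signature genes of each sample. The macrophage fractions stand in for CIBERSORTx output, which is not reproduced here.

```python
from statistics import mean


def rank_genes(expression):
    """Rank genes within one sample by expression (1 = lowest expressed)."""
    ordered = sorted(expression, key=expression.get)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def rank_based_score(expression, weights):
    """Weighted sum of within-sample ranks of the signature genes.

    `weights` maps gene -> coefficient. The values used in the demo below
    are placeholders, NOT the published RSI coefficients.
    """
    signature = {gene: expression[gene] for gene in weights}
    ranks = rank_genes(signature)
    return sum(weight * ranks[gene] for gene, weight in weights.items())


def m1_m2_ratio(fractions, eps=1e-6):
    """Ratio of M1 to M2 fractions; eps guards against division by zero."""
    return fractions["M1"] / (fractions["M2"] + eps)


def spearman(x, y):
    """Spearman rank correlation; this sketch assumes no tied values."""
    def to_ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            ranks[index] = rank
        return ranks

    rx, ry = to_ranks(x), to_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical three-gene signature and made-up samples.
    weights = {"GENE_A": 0.02, "GENE_B": -0.01, "GENE_C": 0.03}
    samples = [
        ({"GENE_A": 8.1, "GENE_B": 2.4, "GENE_C": 5.0}, {"M1": 0.05, "M2": 0.30}),
        ({"GENE_A": 3.2, "GENE_B": 6.7, "GENE_C": 4.1}, {"M1": 0.12, "M2": 0.10}),
        ({"GENE_A": 5.5, "GENE_B": 1.9, "GENE_C": 7.3}, {"M1": 0.04, "M2": 0.35}),
        ({"GENE_A": 6.0, "GENE_B": 6.5, "GENE_C": 2.2}, {"M1": 0.09, "M2": 0.15}),
    ]
    scores = [rank_based_score(expr, weights) for expr, _ in samples]
    ratios = [m1_m2_ratio(fractions) for _, fractions in samples]
    print("Spearman(score, M1/M2):", spearman(scores, ratios))
```

A negative correlation in such an analysis would correspond to the inverse association between M1/M2 ratio and RSI reported in the abstract; with real data, a tie-aware correlation routine would be preferable.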
Project description:Differential gene expression analysis is widely used to study changes in gene expression profiles between two or more groups of samples (e.g., physiological versus pathological conditions, pre-treatment versus post-treatment, and infected versus non-infected tissues). This protocol aims to identify gene expression changes in a pre-selected set of genes associated with severe acute respiratory syndrome coronavirus 2 viral infection and host cell antiviral response, as well as subsequent gene expression association with phenotypic features using samples deposited in public repositories. For complete details on the use and outcome of this informatics analysis, please refer to Bizzotto et al. (2020).
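As a rough illustration of comparing a pre-selected gene set between two sample groups, the sketch below computes a log2 fold change and Welch's t statistic per gene. It is not the pipeline of the cited protocol: the data layout, function names, and pseudocount are assumptions, and a real analysis would use a dedicated differential expression package with multiple-testing correction.

```python
from math import log2, sqrt
from statistics import mean, variance


def log2_fold_change(case, control, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids log of zero."""
    return log2((mean(case) + pseudocount) / (mean(control) + pseudocount))


def welch_t(case, control):
    """Welch's t statistic (unequal variances); needs >= 2 values per group."""
    standard_error = sqrt(
        variance(case) / len(case) + variance(control) / len(control)
    )
    return (mean(case) - mean(control)) / standard_error


def compare_gene_set(expression, case_ids, control_ids, genes):
    """Return {gene: (log2FC, t)} for the pre-selected genes only.

    `expression` is assumed to be {gene: {sample_id: value}}.
    """
    results = {}
    for gene in genes:
        case = [expression[gene][s] for s in case_ids]
        control = [expression[gene][s] for s in control_ids]
        results[gene] = (log2_fold_change(case, control), welch_t(case, control))
    return results


if __name__ == "__main__":
    # Made-up normalized expression values for two hypothetical genes.
    expression = {
        "GENE_X": {"s1": 12.0, "s2": 15.0, "s3": 4.0, "s4": 5.0},
        "GENE_Y": {"s1": 7.0, "s2": 8.0, "s3": 7.5, "s4": 9.0},
    }
    table = compare_gene_set(
        expression,
        case_ids=["s1", "s2"],
        control_ids=["s3", "s4"],
        genes=["GENE_X", "GENE_Y"],
    )
    for gene, (lfc, t) in table.items():
        print(f"{gene}: log2FC={lfc:.2f}, t={t:.2f}")
```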
Project description:The development of artificial intelligence (AI) in dentistry requires large and well-annotated datasets. However, the availability of public dental imaging datasets remains unclear. This study aimed to provide a comprehensive overview of all publicly available dental imaging datasets to address this gap and support AI development. This observational study searched all publicly available dataset resources (academic databases, preprints, and AI challenges), focusing on datasets/articles from 2020 to 2023, with PubMed searches extending back to 2011. We comprehensively searched for dental AI datasets containing images (intraoral photos, scans, radiographs, etc.) using relevant keywords. We included datasets of >50 images obtained from publicly available sources. We extracted dataset characteristics, patient demographics, country of origin, dataset size, ethical clearance, image details, FAIRness metrics, and metadata completeness. We screened 131,028 records and extracted 16 unique dental imaging datasets. The datasets were obtained from Kaggle (18.8%), GitHub, Google, Mendeley, PubMed, Zenodo (each 12.5%), Grand-Challenge, OSF, and arXiv (each 6.25%). The primary focus was tooth segmentation (62.5%) and labeling (56.2%). Panoramic radiography was the most common imaging modality (58.8%). Of the 13 countries, China contributed the most images (2,413). Of the datasets, 75% contained annotations, whereas the methods used to establish labels were often unclear and inconsistent. Only 31.2% of the datasets reported ethical approval, and 56.25% did not specify a license. Most data were obtained from dental clinics (50%). Intraoral radiographs had the highest findability score in the FAIR assessment, whereas cone-beam computed tomography datasets scored the lowest in all categories. These findings revealed a scarcity of publicly available dental imaging data and inconsistent metadata reporting. 
To promote the development of robust, equitable, and generalizable AI tools for dental diagnostics, treatment, and research, efforts are needed to address data scarcity, increase diversity, mandate metadata completeness, and ensure FAIRness in AI dental imaging research.
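The per-source shares reported in this abstract can be converted back into dataset counts as a quick consistency check. The sketch below assumes that each of the 16 datasets is attributed to exactly one source; the variable names are illustrative.

```python
# Shares (percent of the 16 extracted datasets) as reported in the abstract.
TOTAL_DATASETS = 16
reported_share = {
    "Kaggle": 18.8,
    "GitHub": 12.5,
    "Google": 12.5,
    "Mendeley": 12.5,
    "PubMed": 12.5,
    "Zenodo": 12.5,
    "Grand-Challenge": 6.25,
    "OSF": 6.25,
    "arXiv": 6.25,
}

# Convert each percentage to the nearest whole number of datasets.
counts = {
    source: round(share / 100 * TOTAL_DATASETS)
    for source, share in reported_share.items()
}

for source, count in counts.items():
    print(f"{source}: {count}")
print("Sum of counts:", sum(counts.values()))
```

Under that assumption the counts (3 for Kaggle, 2 for each of five sources, 1 for each of three sources) add up to the 16 datasets reported.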
Project description:Prostate cancer (PCa) is the second most common cancer in men, and the second leading cause of death from cancer in men. Many studies on PCa have been carried out, each taking much time before the data are collected and ready to be analyzed. However, a wide range of PCa datasets is already available on the internet. These could be used for data mining, predictive modelling or other purposes, reducing the need to set up new studies to collect data. In the current scientific climate, which is moving more and more toward the analysis of "big data" and large, international, multi-site projects using modern IT infrastructure, these datasets could prove extremely valuable. This review presents an overview of publicly available patient-centered PCa datasets, divided into three categories (clinical, genomics and imaging) and an "overall" section to enable researchers to select a suitable dataset for analysis, without having to go through days of work to find the right data. To acquire a list of human PCa databases, scientific literature databases and academic social network sites were searched. We also used the information from other reviews. All databases in the combined list were then checked for public availability. Only databases that were either directly publicly available or available after signing a research data agreement or retrieving a free login were selected for inclusion in this review. Data should be available to commercial parties as well. This paper focuses on patient-centered data, so the genomics data section does not include gene-centered databases or pathway-centered databases. We identified 42 publicly available, patient-centered PCa datasets. Some of these consist of different smaller datasets. Some of them contain combinations of datasets from the three data domains: clinical data, imaging data and genomics data. Only one dataset contains information from all three domains. 
This review presents all datasets and their characteristics: number of subjects, clinical fields, imaging modalities, expression data, mutation data, biomarker measurements, etc. Although we have tried to make this overview of publicly available databases as extensive as possible, it is very likely incomplete and will soon be outdated. However, this review might help many PCa researchers to find suitable datasets to answer their research questions without the need to start a new data collection project. In the coming era of big data analysis, overviews like this are becoming more and more useful.
Project description:In recent years, a growing number of researchers have begun to focus on how to establish associations between clinical and genomic data. However, there is still a lack of research that mines clinic-genomic associations by comprehensively analysing the available gene expression data for a single disease. Colorectal cancer is a malignant tumour, and a number of genetic syndromes have been proven to be associated with it. This paper presents our research on mining clinic-genomic associations for colorectal cancer in a biomedical big data environment. The proposed method combines multiple technologies, including extracting clinical concepts using the Unified Medical Language System (UMLS), extracting genes through literature mining, and mining clinic-genomic associations through statistical analysis. We applied this method to datasets extracted from both the Gene Expression Omnibus (GEO) and the Genetic Association Database (GAD). A total of 23,517 clinic-genomic associations between 139 clinical concepts and 7914 genes were obtained, of which 3474 associations between 31 clinical concepts and 1689 genes were identified as highly reliable ones. Evaluation and interpretation were performed using UMLS, KEGG, and Gephi, and potential new discoveries were explored. The proposed method is effective in mining valuable knowledge from available biomedical big data and performs well in bridging clinical data with genomic data for colorectal cancer.
Project description:Scientific and technological advances within the life sciences have enabled the generation of very large datasets that must be processed, stored, and managed computationally. Researchers increasingly require data science skills to work with these datasets at scale in order to convert information into actionable insights, and undergraduate educators have started to adapt pedagogies to fulfill this need. Course-based undergraduate research experiences (CUREs) have emerged as a leading model for providing large numbers of students with authentic research experiences including data science. Originally designed around wet-lab research experiences, CURE models have proliferated and diversified globally to accommodate a broad range of academic disciplines. Within microbiology, diversity metrics derived from microbiome sequence information have become standard data products in research. In some cases, researchers have deposited data in publicly accessible repositories, providing opportunities for reproducibility and comparative analysis. In 2020, with the onset of the COVID-19 pandemic and concomitant shift to remote learning, the University of British Columbia set out to develop an online data science CURE in microbiology. A team of faculty with collective domain expertise in microbiome research and CUREs developed and implemented a data science CURE in which teams of students learn to work with large publicly available datasets, develop and execute a novel scientific research project, and disseminate their findings in the online Undergraduate Journal of Experimental Microbiology and Immunology. Analysis of the resulting student-authored research articles, including comments from peer reviews conducted by subject matter experts, demonstrates high levels of learning effectiveness. Here, we describe core insights from course development and implementation based on a reverse course design model. 
Our approach to course design may be applicable to the development of other data science CUREs.
Project description:Background: Genetic data play a crucial role in diagnosing and treating various diseases, reflecting a growing imperative to integrate these data into clinical care. However, significant barriers such as the structure of electronic health records (EHRs), insurance costs for genetic testing, and the interpretability of genetic results impede this integration. Methods: This paper explores solutions to these challenges by combining recent technological advances with informatics and data science, focusing on the diagnostic potential of artificial intelligence (AI) in cancer research. AI has historically been applied in medical research with limited success, but recent developments have led to the emergence of large language models (LLMs). These transformer-based generative AI models, trained on vast datasets, offer significant potential for genetic and genomic analyses. However, their effectiveness is constrained by their training on predominantly human-written text rather than comprehensive, structured genetic datasets. Results: This study reevaluates the capabilities of LLMs, specifically GPT models, in performing supervised prediction tasks using structured gene expression data. By comparing GPT models with traditional machine learning approaches, we assess their effectiveness in predicting cancer subtypes, demonstrating the potential of AI models to analyze real-world genetic data for generating real-world evidence.
Project description:Background: Antibiotics are often prescribed empirically to treat infection syndromes before causative bacteria and their susceptibility to antibiotics are identified. Guidelines on empiric antibiotic prescribing are key to effective treatment of infection syndromes, and need to be informed by likely bacterial aetiology and antibiotic resistance patterns. We aimed to create a clinically-relevant composite index of antibiotic resistance for common infection syndromes to inform recommendations at the national level. Methods: To create our index, we used open-access antimicrobial resistance (AMR) surveillance datasets, including the ECDC Surveillance Atlas, CDDEP ResistanceMap, WHO GLASS and the newly-available Pfizer ATLAS dataset. We integrated these with data on aetiology of common infection syndromes, existing empiric prescribing guidelines, and pricing and availability of antibiotics. Results: The ATLAS dataset covered many more bacterial species (287) and antibiotics (52) than other datasets (ranges = 8-11 and 16-32 respectively), but had a similar number of samples per country per year. Using these data, we were able to make empiric prescribing recommendations for bloodstream infection, pneumonia and cellulitis/skin abscess in up to 44 countries. There was insufficient data to make national-level recommendations for the other six syndromes investigated. Results are presented in an interactive web app, where users can visualise underlying resistance proportions to first-line empiric antibiotics for infection syndromes and countries of interest. Conclusions: We found that whilst the creation of a composite resistance index for empiric antibiotic therapy was technically feasible, the ATLAS dataset in its current form can only inform on a limited number of infection syndromes. Other open-access AMR surveillance datasets are largely limited to bloodstream infection specimens and cannot directly inform treatment of other syndromes. 
With improving availability of international AMR data and better understanding of infection aetiology, this approach may prove useful for informing empiric prescribing decisions in settings with limited local AMR surveillance data.
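One way to picture a composite resistance index of this kind is as an aetiology-weighted average of resistance proportions. The abstract does not give the authors' formula, so the weighting scheme below is an assumption, and all organism shares, resistance proportions, and antibiotic names are made-up illustrative values rather than figures from any surveillance dataset.

```python
def composite_resistance(aetiology, resistance):
    """Aetiology-weighted resistance to one antibiotic for one syndrome.

    aetiology:  {organism: share of the syndrome's cases it causes}
    resistance: {organism: proportion of isolates resistant to the antibiotic}
    Shares are normalised, so they need not sum to exactly 1.
    """
    total_share = sum(aetiology.values())
    return sum(
        share / total_share * resistance[organism]
        for organism, share in aetiology.items()
    )


def rank_antibiotics(aetiology, resistance_by_antibiotic):
    """Order antibiotics from lowest to highest composite resistance."""
    index = {
        antibiotic: composite_resistance(aetiology, resistance)
        for antibiotic, resistance in resistance_by_antibiotic.items()
    }
    return sorted(index.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical aetiology of one syndrome in one country.
    aetiology = {"organism_1": 0.5, "organism_2": 0.3, "organism_3": 0.2}
    # Hypothetical resistance proportions per antibiotic and organism.
    resistance_by_antibiotic = {
        "antibiotic_A": {"organism_1": 0.40, "organism_2": 0.55, "organism_3": 0.10},
        "antibiotic_B": {"organism_1": 0.05, "organism_2": 0.20, "organism_3": 0.30},
    }
    for antibiotic, value in rank_antibiotics(aetiology, resistance_by_antibiotic):
        print(f"{antibiotic}: {value:.3f}")
```

The sketch raises a `KeyError` when resistance data are missing for a causative organism, which mirrors the practical limitation noted in the abstract: an index can only be computed for syndromes whose main pathogens are covered by the surveillance data.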
Project description:Advancements in digital pathology and computing resources have made a significant impact in the field of computational pathology for breast cancer diagnosis and treatment. However, access to high-quality labeled histopathological images of breast cancer is a big challenge that limits the development of accurate and robust deep learning models. In this scoping review, we identified the publicly available datasets of breast H&E-stained whole-slide images (WSIs) that can be used to develop deep learning algorithms. We systematically searched 9 scientific literature databases and 9 research data repositories and found 17 publicly available datasets containing 10 385 H&E WSIs of breast cancer. Moreover, we reported image metadata and characteristics for each dataset to assist researchers in selecting proper datasets for specific tasks in breast cancer computational pathology. In addition, we compiled 2 lists of breast H&E patches and private datasets as supplementary resources for researchers. Notably, only 28% of the included articles utilized multiple datasets, and only 14% used an external validation set, suggesting that the performance of other developed models may be susceptible to overestimation. The TCGA-BRCA was used in 52% of the selected studies. This dataset has a considerable selection bias that can impact the robustness and generalizability of the trained algorithms. There is also a lack of consistent metadata reporting of breast WSI datasets that can be an issue in developing accurate deep learning models, indicating the necessity of establishing explicit guidelines for documenting breast WSI dataset characteristics and metadata.
Project description:Introduction: Breast cancer is a complex heterogeneous disease for which a substantial resource of transcriptomic data is available. Gene expression data have facilitated the division of breast cancer into, at least, five molecular subtypes, namely luminal A, luminal B, HER2, normal-like and basal. Once identified, breast cancer subtypes can inform clinical decisions surrounding patient treatment and prognosis. Indeed, it is important to identify patients at risk of developing aggressive disease so as to tailor the level of clinical intervention. Methods: We have developed a user-friendly, web-based system to allow the evaluation of genes/microRNAs (miRNAs) that are significantly associated with survival in breast cancer and its molecular subtypes. The algorithm combines gene expression data from multiple microarray experiments which frequently also contain miRNA expression information, and detailed clinical data to correlate outcome with gene/miRNA expression levels. This algorithm integrates gene expression and survival data from 26 datasets on 12 different microarray platforms corresponding to approximately 17,000 genes in up to 4,738 samples. In addition, the prognostic potential of 341 miRNAs can be analysed. Results: We demonstrated the robustness of our approach in comparison to two commercially available prognostic tests, Oncotype DX and MammaPrint. Our algorithm complements these prognostic tests and is consistent with their findings. In addition, BreastMark can act as a powerful reductionist approach to these more complex gene signatures, eliminating superfluous genes, potentially reducing the cost and complexity of these multi-index assays. Known miRNA prognostic markers, mir-205 and mir-93, were used to confirm the prognostic value of this tool in a miRNA setting. 
We also applied the algorithm to examine expression of 58 receptor tyrosine kinases in the basal-like subtype, identifying six receptor tyrosine kinases associated with poor disease-free survival and/or overall survival (EPHA5, FGFR1, FGFR3, VEGFR1, PDGFRβ, and TIE1). A web application for using this algorithm is currently available. Conclusions: BreastMark is a powerful tool for examining putative gene/miRNA prognostic markers in breast cancer. The value of this tool will be in the preliminary assessment of putative biomarkers in breast cancer. It will be of particular use to research groups with limited bioinformatics facilities.
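Correlating outcome with expression of a single gene can be illustrated by splitting samples at the median expression and comparing the two groups with a log-rank test. The abstract does not state which cutoff or test BreastMark uses, so the median split and the log-rank statistic below are assumptions, and the function names and data are hypothetical.

```python
from statistics import median


def median_split(expression):
    """Label each sample True if its expression is above the cohort median."""
    cutoff = median(expression)
    return [value > cutoff for value in expression]


def logrank_chi2(times, events, high):
    """Two-group log-rank chi-square statistic (1 degree of freedom).

    times:  follow-up time per sample
    events: 1 if the event (e.g. death) was observed, 0 if censored
    high:   True for samples in the high-expression group
    """
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_high = sum(1 for i in at_risk if high[i])
        died = [i for i in at_risk if times[i] == t and events[i]]
        d = len(died)
        d_high = sum(1 for i in died if high[i])
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            # Hypergeometric variance of the high-group event count at time t.
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return (observed - expected) ** 2 / variance


if __name__ == "__main__":
    # Made-up expression values, follow-up times (months), and event flags.
    expression = [9.1, 8.7, 3.2, 2.8, 7.9, 3.5]
    times = [12, 20, 60, 55, 18, 48]
    events = [1, 1, 0, 1, 1, 0]
    high = median_split(expression)
    print("log-rank chi-square:", logrank_chi2(times, events, high))
```

The statistic would be compared against a chi-square distribution with one degree of freedom to obtain a p-value; a survival analysis library would normally handle that step, along with hazard ratio estimation.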