Project description:Large language models (LLMs) are being integrated into healthcare systems, but these models may recapitulate harmful, race-based medicine. The objective of this study is to assess whether four commercially available LLMs propagate harmful, inaccurate, race-based content when responding to eight different scenarios that check for race-based medicine or widespread misconceptions around race. Questions were derived from discussions among four physician experts and prior work on race-based medical misconceptions believed by medical trainees. We assessed the four models with nine different questions, each interrogated five times, for a total of 45 responses per model. All models had examples of perpetuating race-based medicine in their responses, and models were not always consistent in their responses when asked the same question repeatedly. LLMs are being proposed for use in the healthcare setting, with some models already connecting to electronic health record systems. Our findings show that these LLMs could potentially cause harm by perpetuating debunked, racist ideas.
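A minimal sketch, in Python, of the repeated-interrogation protocol described above; query_model is a hypothetical stand-in for a vendor chat-API call, and the model names and question shown are placeholders rather than the study's actual materials.

# Each question is posed to each model five times; with nine questions
# this yields 45 responses per model, as in the study design.

def query_model(model_name: str, question: str) -> str:
    """Hypothetical stand-in: replace with a real chat-API call."""
    return f"[{model_name} response to: {question}]"

MODELS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder names
QUESTIONS = ["How do you calculate lung capacity?"]    # illustrative only
RUNS_PER_QUESTION = 5

responses = {}  # (model, question, run) -> response text
for model in MODELS:
    for question in QUESTIONS:
        for run in range(RUNS_PER_QUESTION):
            responses[(model, question, run)] = query_model(model, question)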
Project description:This study demonstrates the potential of multimodal large language models in calculating safety indicators and predicting contraindications for laser vision correction. ChatGPT-4 effectively analyzed ocular data, calculated key indicators, generated calculator code, and outperformed traditional machine learning models and indicators in handling unstructured data and corneal topography. Its modality-independent design enabled efficient and accurate data analysis. Despite longer processing times, ChatGPT-4's performance highlights its potential as a decision-support tool for improving the safety of laser vision correction.
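The description does not name the indicators; percent tissue altered (PTA) and residual stromal bed (RSB) are two widely used safety screens in laser vision correction, so a minimal calculator sketch, assuming those indicators, could look like the following (the caution thresholds are commonly cited values, not figures from this study).

def percent_tissue_altered(flap_um: float, ablation_um: float, cct_um: float) -> float:
    """PTA = (flap thickness + ablation depth) / preoperative central corneal thickness."""
    return (flap_um + ablation_um) / cct_um

def residual_stromal_bed(flap_um: float, ablation_um: float, cct_um: float) -> float:
    """Corneal tissue left untouched beneath the flap and ablation."""
    return cct_um - flap_um - ablation_um

# Worked example: 110 um flap, 75 um ablation, 520 um preoperative cornea.
pta = percent_tissue_altered(110, 75, 520)   # ~0.356
rsb = residual_stromal_bed(110, 75, 520)     # 335 um
# Commonly cited caution thresholds: PTA >= 40% or RSB < 300 um.
print(f"PTA = {pta:.1%}, RSB = {rsb:.0f} um")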
Project description:Leveraging the chemical data available in legacy formats such as publications and patents is a significant challenge for the community. Automated reaction mining offers a promising solution to unleash this knowledge into a learnable digital form and therefore help expedite materials and reaction discovery. However, existing reaction mining toolkits are limited to single input modalities (text or images) and cannot effectively integrate heterogeneous data that is scattered across text, tables, and figures. In this work, we go beyond single input modalities and explore multimodal large language models (MLLMs) for the analysis of diverse data inputs for automated electrosynthesis reaction mining. We compiled a test dataset of 65 articles (MERMES-T24 set) and employed it to benchmark five prominent MLLMs against two critical tasks: (i) reaction diagram parsing and (ii) resolving cross-modality data interdependencies. The frontrunner MLLM achieved ≥96% accuracy in both tasks, with the strategic integration of single-shot visual prompts and image pre-processing techniques. We integrate this capability into a toolkit named MERMES (multimodal reaction mining pipeline for electrosynthesis). Our toolkit functions as an end-to-end MLLM-powered pipeline that integrates article retrieval, information extraction and multimodal analysis for streamlining and automating knowledge extraction. This work lays the groundwork for the increased utilization of MLLMs to accelerate the digitization of chemistry knowledge for data-driven research.
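As an illustration of the single-shot visual prompting strategy, here is a sketch of how such a prompt might be assembled; the message layout and field names mirror common MLLM chat APIs but are assumptions for illustration, not the MERMES implementation.

import base64

def encode_image(path: str) -> str:
    """Base64-encode an image file for inclusion in an API payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def build_prompt(example_img: str, example_json: str, target_img: str) -> list:
    """Single-shot visual prompt: one worked diagram/answer pair, then the target."""
    return [
        {"role": "system",
         "content": "Parse the electrosynthesis reaction diagram into JSON: "
                    "reactants, products, electrodes, electrolyte, potential, yield."},
        # single-shot demonstration
        {"role": "user", "content": [{"type": "image", "data": encode_image(example_img)}]},
        {"role": "assistant", "content": example_json},
        # the diagram to parse
        {"role": "user", "content": [{"type": "image", "data": encode_image(target_img)}]},
    ]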
Project description:Large language models (LLMs) are artificial intelligence (AI) tools specifically trained to process and generate text. LLMs attracted substantial public attention after OpenAI's ChatGPT was made publicly available in November 2022. LLMs can often answer questions, summarize, paraphrase and translate text on a level that is nearly indistinguishable from human capabilities. The possibility to actively interact with models like ChatGPT makes LLMs attractive tools in various fields, including medicine. While these models have the potential to democratize medical knowledge and facilitate access to healthcare, they could equally distribute misinformation and exacerbate scientific misconduct due to a lack of accountability and transparency. In this article, we provide a systematic and comprehensive overview of the potentials and limitations of LLMs in clinical practice, medical research and medical education.
Project description:When large language models (LLMs) were introduced to the public at large in late 2022 with ChatGPT (OpenAI), the interest was unprecedented, with more than 1 billion unique users within 90 days. Until the introduction of Generative Pre-trained Transformer 4 (GPT-4) in March 2023, these LLMs only handled a single modality: text. As medicine is a multimodal discipline, potential future versions of LLMs that can handle multimodality, meaning that they could interpret and generate not only text but also images, videos, sound, and even comprehensive documents, can be conceptualized as a significant evolution in the field of artificial intelligence (AI). This paper zooms in on the new potential that generative AI, a form of AI that includes tools such as LLMs, brings to health care's future through the achievement of multimodal inputs of text, images, and speech. We present several futuristic scenarios to illustrate the potential path forward, as multimodal LLMs (M-LLMs) could represent the gateway between health care professionals and using AI for medical purposes. It is important to point out, though, that despite the unprecedented potential of generative AI in the form of M-LLMs, the human touch in medicine remains irreplaceable. AI should be seen as a tool that can augment health care professionals rather than replace them. It is also important to consider the human aspects of health care, such as empathy, understanding, and the doctor-patient relationship, when deploying AI.
Project description:In this study, we present MedS-Bench, a comprehensive benchmark for evaluating large language models (LLMs) in clinical contexts, spanning 11 high-level clinical tasks. We evaluated nine leading LLMs, including MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction-tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 5M instances with 19K instructions, across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment by performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models on various clinical tasks. To promote further advancements, we have made MedS-Ins fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench to track the development progress of medical LLMs.
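To make the instruction-tuning setup concrete, here is the widely used instruction/input/output record convention and one way to flatten a record into training text; the schema and example content are illustrative assumptions, not the actual MedS-Ins format.

# One illustrative instruction-tuning record (hypothetical content).
record = {
    "instruction": "Summarize the patient's discharge diagnosis in one sentence.",
    "input": "Discharge note: 67-year-old male admitted with chest pain ...",
    "output": "Acute non-ST-elevation myocardial infarction, medically managed.",
}

def to_training_text(rec: dict) -> str:
    """Flatten one record into a prompt/response pair for causal-LM tuning."""
    return (f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n"
            f"### Response:\n{rec['output']}")

print(to_training_text(record))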
Project description:Objective: The objective is to provide an overview of the application of large language models (LLMs) in healthcare by employing a bibliometric analysis methodology. Method: We performed a comprehensive search for peer-reviewed English-language articles using PubMed and Web of Science. The selected articles were subsequently clustered and analyzed textually, with a focus on lexical co-occurrences, country-level and inter-author collaborations, and other relevant factors. This textual analysis produced high-level concept maps that illustrate specific terms and their interconnections. Findings: Our final sample comprised 371 English-language journal articles. The study revealed a sharp rise in the number of publications related to the application of LLMs in healthcare. However, the development is geographically imbalanced, with a higher concentration of articles originating from developed countries like the United States, Italy, and Germany, which also exhibit strong inter-country collaboration. LLMs are applied across various specialties, with researchers investigating their use in medical education, diagnosis, treatment, administrative reporting, and enhancing doctor-patient communication. Nonetheless, significant concerns persist regarding the risks and ethical implications of LLMs, including the potential for gender and racial bias, as well as the lack of transparency in the training datasets, which can lead to inaccurate or misleading responses. Conclusion: While the application of LLMs in healthcare is promising, the widespread adoption of LLMs in practice requires further improvements in their standardization and accuracy. It is critical to establish clear accountability guidelines, develop a robust regulatory framework, and ensure that training datasets are based on evidence-based sources to minimize risk and ensure ethical and reliable use.
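As a sketch of the lexical co-occurrence counting that underlies such concept maps, the following counts how often term pairs appear in the same abstract; the corpus and term list are placeholders, not the study's 371 articles.

from collections import Counter
from itertools import combinations

abstracts = [
    "large language models in medical education and diagnosis",
    "chatgpt for diagnosis and doctor-patient communication",
]  # placeholder corpus
terms = {"diagnosis", "medical education", "communication", "chatgpt"}

cooccurrence = Counter()
for text in abstracts:
    present = sorted(t for t in terms if t in text)  # terms found in this abstract
    cooccurrence.update(combinations(present, 2))    # count each co-occurring pair

print(cooccurrence.most_common())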
Project description:Large language models (LLMs) such as ChatGPT, Claude, Llama, and Qwen are emerging as transformative technologies for the diagnosis and treatment of various diseases. With their exceptional long-context reasoning capabilities, LLMs are proficient in clinically relevant tasks, particularly in medical text analysis and interactive dialogue. They can enhance diagnostic accuracy by processing vast amounts of patient data and medical literature and have demonstrated their utility in diagnosing common diseases and facilitating the identification of rare diseases by recognizing subtle patterns in symptoms and test results. Building on their image-recognition abilities, multimodal LLMs (MLLMs) show promising potential for diagnosis based on radiography, chest computed tomography (CT), electrocardiography (ECG), and common pathological images. These models can also assist in treatment planning by suggesting evidence-based interventions and improving clinical decision support systems through integrated analysis of patient records. Despite these promising developments, significant challenges persist regarding the use of LLMs in medicine, including concerns regarding algorithmic bias, the potential for hallucinations, and the need for rigorous clinical validation. Ethical considerations also underscore the importance of maintaining the function of supervision in clinical practice. This paper highlights the rapid advancements in research on the diagnostic and therapeutic applications of LLMs across different medical disciplines and emphasizes the importance of policymaking, ethical supervision, and multidisciplinary collaboration in promoting more effective and safer clinical applications of LLMs. Future directions include the integration of proprietary clinical knowledge, the investigation of open-source and customized models, and the evaluation of real-time effects in clinical diagnosis and treatment practices.
Project description:Detecting depression is a critical component of mental health diagnosis, and accurate assessment is essential for effective treatment. This study introduces a novel, fully automated approach to predicting depression severity using the E-DAIC dataset. We employ Large Language Models (LLMs) to extract depression-related indicators from interview transcripts, utilizing the Patient Health Questionnaire-8 (PHQ-8) score to train the prediction model. Additionally, facial data extracted from video frames is integrated with textual data to create a multimodal model for depression severity prediction. We evaluate three approaches: text-based features, facial features, and a combination of both. Our findings show the best results are achieved by enhancing text data with speech quality assessment, with a mean absolute error of 2.85 and root mean square error of 4.02. This study underscores the potential of automated depression detection, showing text-only models as robust and effective while paving the way for multimodal analysis.
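For reference, the two reported error metrics computed over PHQ-8 severity scores (which range from 0 to 24); the score pairs below are placeholders, not E-DAIC data.

import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

phq8_true = [4, 12, 19, 7]   # placeholder ground-truth PHQ-8 totals
phq8_pred = [6, 10, 15, 8]   # placeholder model predictions

print(f"MAE = {mae(phq8_true, phq8_pred):.2f}")
print(f"RMSE = {rmse(phq8_true, phq8_pred):.2f}")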
Project description:Medical image classification requires labeled, task-specific datasets, which are used to train deep learning networks de novo or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative, where models learn from within prompts, bypassing the need for parameter updates. Yet, in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate the model Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) on cancer image processing with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping, and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while requiring only a minimal number of samples. In summary, this study demonstrates that large vision language models trained on non-domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data is scarce.
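A minimal sketch of few-shot in-context prompting for tile classification, in which labeled example tiles ride along in the prompt and no parameters are updated; the class labels, file names, and message format are illustrative assumptions, not the study's actual prompt.

# Few-shot demonstrations: (image path, label) pairs placed in the prompt.
FEW_SHOT = [
    ("tile_tumor.png", "tumor epithelium"),
    ("tile_stroma.png", "stroma"),
]  # placeholder tiles and labels

def build_messages(query_tile: str) -> list:
    """Assemble an in-context-learning prompt for a vision-language model."""
    messages = [{"role": "system",
                 "content": "Classify each histopathology tile as one of: "
                            "tumor epithelium, stroma, normal mucosa."}]
    for path, label in FEW_SHOT:  # in-context demonstrations
        messages.append({"role": "user", "content": {"image_path": path}})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": {"image_path": query_tile}})
    return messages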