Using iCn3D and the World Wide Web for structure-based collaborative research: Analyzing molecular interactions at the root of COVID-19.
ABSTRACT: The COVID-19 pandemic caught us ill-prepared, and tackling the many challenges it poses in a timely manner requires world-wide collaboration. Our ability to study the SARS-CoV-2 virus and its interactions with its human host in molecular terms, efficiently and collaboratively, becomes indispensable and mission-critical in the race to develop vaccines, drugs, and neutralizing antibodies. There is already a significant corpus of 3D structures related to SARS and MERS coronaviruses, and the rapid generation of new structures demands efficient tools to expedite the sharing of structural analyses and molecular designs and to convey them in their native 3D context, in sync with sequence data and annotations. We developed iCn3D (pronounced "I see in 3D") [1] to take full advantage of web technologies and allow scientists of different backgrounds to perform and share sequence-structure analyses over the Internet and engage in collaborations through a simple mechanism of exchanging "lifelong" web links (URLs). This approach solves the very old problem of "sharing of molecular scenes" in a reliable and convenient manner. iCn3D links are sharable over the Internet and make data and entire analyses findable, accessible, and reproducible, with various levels of interoperability. Links and underlying data are FAIR [2] and can be embedded in preprints and papers, bringing a live, interactive 3D dimension to the world of text and static images used in current publications, while eliminating the need for arcane supplemental materials. This paper exemplifies iCn3D capabilities in visualization, analysis, and sharing of COVID-19 related structures, sequence variability, and molecular interactions.
Project description:BACKGROUND: Many online resources for the life sciences have been developed and introduced in peer-reviewed papers recently, ranging from databases and web applications to data-analysis software. Some have been introduced in special journal issues or on websites with a search function, but others remain scattered throughout the Internet and the published literature. The searchable resources on these sites are collected and maintained manually and are therefore of higher quality than those on automatically updated sites, but they also require more time and effort to maintain. DESCRIPTION: We developed an online resource search system called OReFiL to address these issues. We developed a crawler to gather all of the web pages whose URLs appear in MEDLINE abstracts and in full-text papers in the BioMed Central open-access journals. The URLs were extracted using regular expressions and rules based on our heuristic knowledge. We then indexed the online resources to facilitate their retrieval and comparison by researchers. Because every online resource has at least one PubMed ID, we can easily acquire its summary with Medical Subject Headings (MeSH) terms and confirm its credibility through reference to the corresponding PubMed entry. In addition, because OReFiL automatically extracts URLs and updates the index, minimal time and effort are needed to maintain the system. CONCLUSION: We developed OReFiL, a search system for online life science resources, which is freely available. The system's distinctive features include the ability to return up-to-date query-relevant online resources introduced in peer-reviewed papers; the ability to search using free words, MeSH terms, or author names; easy verification of each hit by following links to the corresponding PubMed entry or to papers citing the URL through the search systems of BioMed Central, Scirus, HighWire Press, or Google Scholar; and quick confirmation of the existence of an online resource web page.
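The URL-extraction step described above can be illustrated with a minimal Python sketch. This is not OReFiL's actual implementation: the pattern, the trailing-punctuation rule, and the function name `extract_urls` are assumptions chosen for illustration, and the abstract does not specify the authors' heuristic rules.

```python
import re

# Hypothetical pattern: a scheme, then everything up to whitespace or a
# character that usually closes a quotation or parenthesis in running text.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s<>"')\]]+""", re.IGNORECASE)


def extract_urls(text):
    """Return the distinct URLs found in text, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Heuristic rule (an assumption): punctuation at the very end of a
        # match belongs to the sentence, not to the URL.
        url = match.group(0).rstrip(".,;:")
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    abstract = "The tool (https://example.org/tool) is free; see https://example.org/tool."
    print(extract_urls(abstract))
```

A real crawler would need further rules, for example for URLs printed without a scheme or broken across lines, which this sketch does not attempt.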
Project description:Patients are increasingly using the Internet to inform themselves about health-related topics and procedures, including EGD. We analyzed the quality of information and the readability of websites returned by searches on 3 different search engines. We assessed website quality with a tool that we developed, alongside validated instruments: the Global Quality Score (GQS) and Health on the Net (HON) certification. Readability was assessed using Flesch-Kincaid Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKG). We analyzed 30 results for each of the search terms 'EGD' and 'Upper Endoscopy' from Google and 15 each from Bing and Yahoo. A total of 45 websites were included from 100 URLs after removing duplicates, video links, and journal articles. Only 3 websites were found to offer good-quality, comprehensive, and authentic information: https://www.healthline.com, https://www.uptodate.com, and https://www.emedicine.medscape.com. A further 13 sites offered information of moderate quality. The mean FRE score was 46.92 (range 6.5-81.6). The mean FKG was 11th grade, with a range of 6th grade to 12th grade and above, making the websites difficult to read. Our study shows that there are quite a few websites with moderate-quality content. We recommend 3 comprehensive and authentic websites out of the 45 analyzed for information on EGD on the Internet. The websites were consistently written at a higher reading level (11th grade on average) than the AMA recommends. We also identified 3 websites with moderate-quality content written at or below an 8th-grade reading level. We feel that gastroenterologists can help their patients better understand this procedure by directing them to these comprehensive websites.
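For readers unfamiliar with the two readability scores, the sketch below computes them from raw counts using the standard published Flesch formulas. The function names are illustrative, and the counts are assumed to come from a separate tokenizer and syllable counter; the abstract does not say which software the authors used.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Made-up counts for a sample passage: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both scores depend only on average sentence length and average syllables per word, so long sentences and polysyllabic medical terms push a page toward the higher grade levels reported in the study.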
Project description:BACKGROUND: As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically linked to via Internet URLs, which have been shown to decay in a time-dependent fashion. What is less clear is whether SDAR-producing group size or prior experience in SDAR production correlates with SDAR persistence, or whether certain institutions or regions account for a disproportionate number of peer-reviewed resources. METHODS: We first quantified the current availability of over 26,000 unique URLs published in MEDLINE abstracts/titles over the past 20 years, then extracted authorship, institutional and ZIP code data. We estimated which URLs were SDARs by using keyword proximity analysis. RESULTS: We identified 23,820 non-archival URLs produced between 1996 and 2013, out of which 11,977 were classified as SDARs. Production of SDARs, as measured with the Gini coefficient, is more widely distributed among institutions (0.62) and ZIP codes (0.65) than scientific research in general, which tends to be disproportionately clustered within elite institutions (0.91) and ZIP codes (0.96). An estimated 1% of institutions produced 68% of published research, whereas the top 1% accounted for only 16% of SDARs. Some labs produced many SDARs (maximum detected = 64), but 74% of SDAR-producing authors have published only one SDAR. Interestingly, decayed SDARs have significantly fewer authors on average (4.33 +/- 3.06) than available SDARs (4.88 +/- 3.59) (p < 8.32 × 10^-4). Approximately 3.4% of URLs, as published, contain errors in their entry/format, including DOIs and links to clinical trials registry numbers. CONCLUSION: SDAR production is less dependent upon institutional location and resources, and SDAR online persistence does not seem to be a function of infrastructure or expertise. Yet, SDAR team size correlates positively with SDAR accessibility, suggesting a possible sociological factor involved. While a detectable URL entry error rate of 3.4% is relatively low, it raises the question of whether this is a general error rate that extends to additional published entities.
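The Gini coefficient used above to compare how concentrated SDAR production is can be computed as in the following sketch. It uses the common sorted-rank formula for a finite sample; the abstract does not state which estimator the authors applied, and the function name and input data are illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted sum over the ascending-sorted values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Hypothetical per-institution resource counts, not data from the study.
    print(gini([1, 1, 1, 1]))   # every institution produces the same amount
    print(gini([0, 0, 0, 10]))  # one institution produces everything
```

With this estimator a sample of n units in which one unit holds everything scores (n - 1)/n rather than exactly 1, so values close to 1, such as the 0.91 and 0.96 reported for research in general, indicate output concentrated in very few institutions or ZIP codes.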
Project description:Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez's 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez's search engine provides three powerful features. (i) Sequence and structure neighbors; one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases; one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization; identifying a homolog with known structure, one may view molecular-graphic and alignment displays, to infer approximate 3D structure. In this article we focus on two features of Entrez's Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
Project description:Scientific Data Analysis Resources (SDARs) such as bioinformatics programs, web servers and databases are integral to modern science, but previous studies have shown that the Uniform Resource Locators (URLs) linking to them decay in a time-dependent manner, with ~27% decayed to date. Because SDARs are overrepresented among science's most cited papers over the past 20 years, loss of widely used SDARs could be particularly disruptive to scientific research. We identified URLs in MEDLINE abstracts and used crowdsourcing to identify which reported the creation of SDARs. We used the Internet Archive's Wayback Machine to approximate 'death dates' and calculate citations/year over each SDAR's lifespan. At first glance, decayed SDARs did not significantly differ from available SDARs in their average citations per year over their lifespan or in journal impact factor (JIF). But the most cited SDARs were 94% likely to be relocated to another URL, versus only 34% of uncited ones. Taking relocation into account, we find that citations are the strongest predictor of current online availability after time since publication, and that JIF is modestly predictive. This suggests that URL decay is a general, persistent phenomenon affecting all URLs, but the most useful/recognized SDARs are more likely to persist.
Project description:Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a worldwide crisis with profound effects on both public health and the economy. In order to combat the COVID-19 pandemic, research groups have shared viral genome sequence data through the Global Initiative on Sharing All Influenza Data (GISAID). Over the past year, ~290,000 full SARS-CoV-2 proteome sequences have been deposited in GISAID. Here, we used these sequences to assess the rate of nonsynonymous mutants over the entire viral proteome. Our analysis shows that SARS-CoV-2 proteins are mutating at substantially different rates, with most of the viral proteins exhibiting little mutational variability. As anticipated, our calculations capture previously reported mutations that arose in the first months of the pandemic, such as D614G (Spike), P323L (NSP12), and R203K/G204R (Nucleocapsid), but they also identify more recent mutations, such as A222V and L18F (Spike) and A220V (Nucleocapsid), among others. Our comprehensive temporal and geographical analyses show two distinct periods with different proteome mutation rates: December 2019 to July 2020 and August to December 2020. Notably, some mutation rates differ by geography, primarily during the latter half of 2020 in Europe. Furthermore, our structure-based molecular analysis provides an exhaustive assessment of SARS-CoV-2 mutation rates in the context of the current set of 3D structures available for SARS-CoV-2 proteins. This emerging sequence-to-structure insight is beginning to illuminate the site-specific mutational (in)tolerance of SARS-CoV-2 proteins as the virus continues to spread around the globe.
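As a rough illustration of how per-site amino acid substitution frequencies might be tallied, the sketch below compares protein sequences against a reference and labels each change in the same style as D614G. It is a simplified stand-in, not the authors' pipeline: it assumes the sequences are already aligned to the reference at equal length, treats '-' and 'X' as missing data, and uses a hypothetical function name and toy data.

```python
from collections import Counter


def mutation_frequencies(reference, sequences):
    """Return {mutation label: fraction of sequences carrying it}.

    Labels follow the reference-residue, 1-based position, new-residue
    convention, for example "D614G".
    """
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference length")
        for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            # Skip gaps and unknown residues (an assumption of this sketch).
            if aa != ref_aa and aa not in "-X":
                counts[f"{ref_aa}{pos}{aa}"] += 1
    n = len(sequences)
    return {label: count / n for label, count in counts.items()}


if __name__ == "__main__":
    # Toy three-residue "protein"; real inputs would be full-length alignments.
    print(mutation_frequencies("MDA", ["MDA", "MGA", "MGA", "MDV"]))
```

Grouping the input sequences by collection date or region before calling the function would give the kind of temporal and geographical breakdown the abstract describes.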
Project description:Neurophysiology requires an extensive workflow of information analysis routines, which often includes incompatible proprietary software, introducing limitations based on financial costs, transfer of data between platforms, and the ability to share. An ecosystem of free open-source software exists to fill these gaps, including thousands of analysis and plotting packages written in Python and R, which can be implemented in a sharable and reproducible format, such as the Jupyter electronic notebook. This tool chain can largely replace current routines by importing data, producing analyses, and generating publication-quality graphics. An electronic notebook like Jupyter allows these analyses, along with documentation of procedures, to display locally or remotely in an internet browser, which can be saved as an HTML, PDF, or other file format for sharing with team members and the scientific community. The present report illustrates these methods using data from electrophysiological recordings of the musk shrew vagus, a model system to investigate gut-brain communication, for example, in cancer chemotherapy-induced emesis. We show methods for spike sorting (including statistical validation), spike train analysis, and analysis of compound action potentials in notebooks. Raw data and code are available from notebooks in data supplements or from an executable online version, which replicates all analyses without installing software, an implementation of reproducible research. This demonstrates the promise of combining disparate analyses into one platform, along with the ease of sharing this work. In an age of diverse, high-throughput computational workflows, this methodology can increase efficiency, transparency, and the collaborative potential of neurophysiological research.
Project description:Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed "easy to install," and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.
Project description:A considerable amount of rapid-paced research is underway to combat the SARS-CoV-2 pandemic. In this work, we assess the 3D structure of the 5' untranslated region of its RNA, in the hopes that stable secondary structures can be targeted, interrupted, or otherwise measured. To this end, we have combined molecular dynamics simulations with previous Nuclear Magnetic Resonance measurements for stem loop 2 of SARS-CoV-1 to refine 3D structure predictions of that stem loop. We find that relatively short sampling times allow for loop rearrangement from predicted structures determined in absence of water or ions, to structures better aligned with experimental data. We then use molecular dynamics to predict the refined structure of the transcription regulatory leader sequence (TRS-L) region which includes stem loop 3, and show that arrangement of the loop around exchangeable monovalent potassium can interpret the conformational equilibrium determined by in-cell dimethyl sulfate (DMS) data.