Estimating the deep replicability of scientific findings using human and artificial intelligence.
ABSTRACT: Replicability tests of scientific papers show that the majority of papers fail replication. Moreover, failed papers circulate through the literature as quickly as replicating papers. This dynamic weakens the literature, raises research costs, and demonstrates the need for new approaches for estimating a study's replicability. Here, we trained an artificial intelligence model to estimate a paper's replicability using ground truth data on studies that had passed or failed manual replication tests, and then tested the model's generalizability on an extensive set of out-of-sample studies. The model predicts replicability better than the base rate of reviewers and about as well as prediction markets, the best present-day method for predicting replicability. In out-of-sample tests on manually replicated papers from diverse disciplines and methods, the model had strong accuracy levels of 0.65 to 0.78. Exploring the reasons behind the model's predictions, we found no evidence for bias based on topics, journals, disciplines, base rates of failure, persuasion words, or novelty words like "remarkable" or "unexpected." We did find that the model's accuracy is higher when trained on a paper's text rather than its reported statistics, and that n-grams, higher-order word combinations that humans have difficulty processing, correlate with replication. We discuss how combining human and machine intelligence can raise confidence in research, provide research self-assessment techniques, and create methods that are scalable and efficient enough to review the ever-growing numbers of publications, a task that entails extensive human resources to accomplish with prediction markets and manual replication alone.
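As a rough illustration of the text-based approach described in this abstract, the sketch below trains an n-gram classifier to predict a binary replication outcome from paper text. It is not the authors' model; the file name, column names, split, and hyperparameters are assumptions.

```python
# Minimal sketch (not the published model): predicting replication outcomes
# from paper text with TF-IDF n-gram features and logistic regression.
# The data file, column names, and hyperparameters are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

# Hypothetical ground-truth data: one row per manually replicated study,
# with the paper's full text and a binary replication outcome.
df = pd.read_csv("replication_ground_truth.csv")  # columns: text, replicated

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["replicated"],
    test_size=0.2, random_state=0, stratify=df["replicated"]
)

# Word n-grams up to length 3 stand in for the "higher-order word
# combinations" mentioned in the abstract.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=5, sublinear_tf=True)
clf = LogisticRegression(max_iter=1000, C=1.0)

clf.fit(vectorizer.fit_transform(X_train), y_train)
probs = clf.predict_proba(vectorizer.transform(X_test))[:, 1]
preds = (probs >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_test, preds))
print("AUC:", roc_auc_score(y_test, probs))
```

Held-out accuracy and AUC here play the role of the out-of-sample evaluation described above; the actual study's architecture, features, and validation scheme may differ.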
Project description: Most research in biology is empirical, yet empirical studies rely fundamentally on theoretical work for generating testable predictions and interpreting observations. Despite this interdependence, many empirical studies build largely on other empirical studies with little direct reference to relevant theory, suggesting a failure of communication that may hinder scientific progress. To investigate the extent of this problem, we analyzed how the use of mathematical equations affects the scientific impact of studies in ecology and evolution. The density of equations in an article has a significant negative impact on citation rates, with papers receiving 28% fewer citations overall for each additional equation per page in the main text. Long, equation-dense papers tend to be more frequently cited by other theoretical papers, but this increase is outweighed by a sharp drop in citations from nontheoretical papers (35% fewer citations for each additional equation per page in the main text). In contrast, equations presented in an accompanying appendix do not lessen a paper's impact. Our analysis suggests possible strategies for enhancing the presentation of mathematical models to facilitate progress in disciplines that rely on the tight integration of theoretical and empirical work.
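A per-equation percentage change in citations of this kind is naturally read off a multiplicative count model. The sketch below, which is not the original analysis, shows how such a coefficient could be estimated; the data file and column names are assumptions, and the original study also adjusted for covariates such as journal, year, and paper length.

```python
# Minimal sketch, assuming a simple negative binomial regression of
# citation counts on equation density. Data file and columns are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("ecology_evolution_papers.csv")  # columns: citations, eqs_per_page

model = smf.glm(
    "citations ~ eqs_per_page",
    data=df,
    family=sm.families.NegativeBinomial(),
).fit()

# In this multiplicative model, a coefficient of roughly log(0.72) ≈ -0.33
# on eqs_per_page corresponds to about 28% fewer citations per additional
# equation per page.
print(model.summary())
print("citation multiplier per extra equation/page:",
      np.exp(model.params["eqs_per_page"]))
```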
Project description: Replicability is an important feature of scientific research, but aspects of contemporary research culture, such as an emphasis on novelty, can make replicability seem less important than it should be. The Reproducibility Project: Cancer Biology was set up to provide evidence about the replicability of preclinical research in cancer biology by repeating selected experiments from high-impact papers. A total of 50 experiments from 23 papers were repeated, generating data about the replicability of a total of 158 effects. Most of the original effects were positive effects (136), with the rest being null effects (22). A majority of the original effect sizes were reported as numerical values (117), with the rest being reported as representative images (41). We employed seven methods to assess replicability, and some of these methods were not suitable for all the effects in our sample. One method compared effect sizes: for positive effects, the median effect size in the replications was 85% smaller than the median effect size in the original experiments, and 92% of replication effect sizes were smaller than the original. The other methods were binary (the replication was either a success or a failure), and five of these methods could be used to assess both positive and null effects when effect sizes were reported as numerical values. For positive effects, 40% of replications (39/97) succeeded according to three or more of these five methods, and for null effects 80% of replications (12/15) were successful on this basis; combining positive and null effects, the success rate was 46% (51/112). A successful replication does not definitively confirm an original finding or its theoretical interpretation. Equally, a failure to replicate does not disconfirm a finding, but it does suggest that additional investigation is needed to establish its reliability.
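To make the distinction between effect-size comparison and binary success criteria concrete, the sketch below implements two commonly used binary criteria and the size comparison on hypothetical numbers. These are illustrative criteria, not the project's actual seven-method protocol; the effect sizes and standard errors are invented.

```python
# Minimal sketch, assuming standardized effect sizes with standard errors.
# Not the Reproducibility Project's protocol; criteria and numbers are
# illustrative only.
import numpy as np
from scipy import stats

def significant_same_direction(rep_es, rep_se, orig_es, alpha=0.05):
    """Replication is significant (two-sided) and in the original's direction."""
    z = rep_es / rep_se
    p = 2 * stats.norm.sf(abs(z))
    return p < alpha and np.sign(rep_es) == np.sign(orig_es)

def inside_original_ci(rep_es, orig_es, orig_se, level=0.95):
    """Replication point estimate falls inside the original's confidence interval."""
    half_width = stats.norm.ppf(0.5 + level / 2) * orig_se
    return abs(rep_es - orig_es) <= half_width

# Hypothetical original and replication effects; the replication here is
# 85% smaller than the original, echoing the median shrinkage reported above.
orig_es, orig_se = 0.80, 0.25
rep_es, rep_se = 0.12, 0.10

print("smaller than original:", abs(rep_es) < abs(orig_es))
print("significant, same direction:", significant_same_direction(rep_es, rep_se, orig_es))
print("inside original 95% CI:", inside_original_ci(rep_es, orig_es, orig_se))
```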
Project description: As part of the Reproducibility Project: Cancer Biology, we published Registered Reports that described how we intended to replicate selected experiments from 29 high-impact preclinical cancer biology papers published between 2010 and 2012. Replication experiments were completed and Replication Studies reporting the results were submitted for 18 papers, of which 17 were accepted and published by eLife, with the rejected paper posted as a preprint. Here, we report the status and outcomes obtained for the remaining 11 papers. Four papers initiated experimental work but were stopped without any experimental outcomes. Two papers resulted in incomplete outcomes due to unanticipated challenges when conducting the experiments. For the remaining five papers, only some of the experiments were completed, with the other experiments left incomplete due to mundane technical or unanticipated methodological challenges. The experiments from these papers, along with the other experiments attempted as part of the Reproducibility Project: Cancer Biology, provide evidence about the challenges of repeating preclinical cancer biology experiments and the replicability of the completed experiments.
Project description: In our target article, we tested the replicability of 4 popular psychopathology network estimation methods that aim to reveal causal relationships among symptoms of mental illness. We started with the focal data set from the 2 foundational psychopathology network papers (i.e., the National Comorbidity Survey-Replication) and identified the National Survey of Mental Health and Wellbeing as a close methodological match for comparison. We compared the psychopathology networks estimated in each data set, as well as in 10 sets of random split-halves within each data set, with the goal of quantifying the replicability of the network parameters as they are interpreted in the extant psychopathology network literature. We concluded that current psychopathology network methods have limited replicability both within and between samples and thus have limited utility. Here we respond to the 2 commentaries on our target article, concluding that the findings of Steinley, Hoffman, Brusco, and Sher (2017), along with other recent developments in the literature, provide further conclusive evidence that psychopathology networks have poor replicability and utility.
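The split-half comparison described above can be illustrated with a toy computation. The sketch below estimates a simple partial-correlation network in two random halves of simulated symptom data and correlates their edge weights; it is not the target article's analysis, which used regularized network estimators and several other methods, and the data here are random placeholders.

```python
# Minimal sketch, assuming a plain partial-correlation network; not the
# published estimation methods. Data are simulated stand-ins.
import numpy as np

def partial_corr_network(data):
    """Edge weights as partial correlations derived from the inverse covariance."""
    prec = np.linalg.pinv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)
    np.fill_diagonal(pcorr, 0.0)
    return pcorr

rng = np.random.default_rng(0)
symptoms = rng.normal(size=(2000, 12))  # hypothetical: 2000 respondents, 12 symptoms

idx = rng.permutation(len(symptoms))
half_a, half_b = symptoms[idx[:1000]], symptoms[idx[1000:]]

net_a = partial_corr_network(half_a)
net_b = partial_corr_network(half_b)

# Replicability summary: correlation between corresponding edge weights
# across the two halves (one of many possible summaries).
triu = np.triu_indices_from(net_a, k=1)
print("edge-weight correlation between halves:",
      np.corrcoef(net_a[triu], net_b[triu])[0, 1])
```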
Project description: Background: Improving rigor and transparency measures should lead to improvements in reproducibility across the scientific literature; however, the assessment of measures of transparency tends to be very difficult if performed manually. Objective: This study addresses the enhancement of the Rigor and Transparency Index (RTI, version 2.0), which attempts to automatically assess the rigor and transparency of journals, institutions, and countries using manuscripts scored on criteria found in reproducibility guidelines (e.g., Materials Design, Analysis, and Reporting checklist criteria). Methods: The RTI tracks 27 entity types using natural language processing techniques such as Bidirectional Long Short-term Memory Conditional Random Field-based models and regular expressions; this allowed us to assess over 2 million papers accessed through PubMed Central. Results: Between 1997 and 2020 (where data were readily available in our data set), rigor and transparency measures showed general improvement (RTI 2.29 to 4.13), suggesting that authors are taking the need for improved reporting seriously. The top-scoring journals in 2020 were the Journal of Neurochemistry (6.23), British Journal of Pharmacology (6.07), and Nature Neuroscience (5.93). We extracted the institution and country of origin from the author affiliations to expand our analysis beyond journals. Among institutions publishing >1000 papers in 2020 (in the PubMed Central open access set), Capital Medical University (4.75), Yonsei University (4.58), and University of Copenhagen (4.53) were the top performers in terms of RTI. In country-level performance, we found that Ethiopia and Norway consistently topped the RTI charts of countries with 100 or more papers per year. In addition, we tested our assumption that the RTI may serve as a reliable proxy for scientific replicability (i.e., a high RTI represents papers containing sufficient information for replication efforts). Using work by the Reproducibility Project: Cancer Biology, we determined that replication papers (RTI 7.61, SD 0.78) scored significantly higher (P<.001) than the original papers (RTI 3.39, SD 1.12), which according to the project required additional information from authors to begin replication efforts. Conclusions: These results align with our view that the RTI may serve as a reliable proxy for scientific replicability. Unfortunately, RTI measures for journals, institutions, and countries fall short of the replicated-paper average. If we consider the RTI of these replication studies as a target for future manuscripts, more work will be needed to ensure that the average manuscript contains sufficient information for replication attempts.
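The regular-expression side of such a pipeline can be sketched simply. The example below flags a few rigor criteria in manuscript text; it is not the RTI implementation (which also uses BiLSTM-CRF entity models across 27 entity types), and the criteria and patterns are illustrative assumptions.

```python
# Minimal sketch, assuming a handful of illustrative rigor criteria detected
# by regular expressions. Not the RTI's actual patterns or scoring.
import re

CRITERIA = {
    "randomization": re.compile(r"\brandomi[sz]ed\b|\brandom(?:ly)? assign", re.I),
    "blinding": re.compile(r"\bblind(?:ed|ing)\b|\bmasked\b", re.I),
    "power_analysis": re.compile(
        r"\bpower (?:analysis|calculation)\b|\bsample size was (?:calculated|determined)", re.I),
    "sex_reported": re.compile(
        r"\b(?:male|female)\b.*\b(?:mice|rats|participants|patients)\b", re.I),
}

def score_manuscript(text):
    """Return which criteria are mentioned and a crude count of hits."""
    hits = {name: bool(pattern.search(text)) for name, pattern in CRITERIA.items()}
    return hits, sum(hits.values())

example = ("Animals were randomly assigned to treatment groups and "
           "investigators were blinded to group allocation.")
print(score_manuscript(example))
```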
Project description: We performed a registered replication of the Oberman and Ramachandran (Soc Neurosci 3(3-4):348-355, 2008) study on the 'kiki/bouba' effect in autism spectrum conditions (ASC). The aim of the study was to test the robustness of the diminished crossmodal correspondences effect in autism and to verify that this effect is not an artifact of differences in intelligence. We tested a Polish-speaking sample of 21 participants with ADOS-confirmed autism spectrum conditions (mean age 15.90) and 21 age- (mean age 15.86), sex- and IQ-matched neurotypical control participants. The procedure closely followed that of the original study. Participants' task was to match five pairs of unfamiliar words and shapes; matching words and shapes shared supramodal characteristics that made the pairing possible. We report a partial replication of the diminished 'kiki/bouba' effect in individuals with ASC compared to the neurotypical control group. However, we found that nonverbal intelligence also contributed significantly to task performance, but only in participants with autism, suggesting a compensatory role of intelligence. Finally, the effect of autism severity (measured by ADOS classification) was significant: crossmodal correspondences were weaker in individuals with autism than in those with an autism spectrum diagnosis.
Project description: Classification of the medical sciences into sub-branches is crucial for the optimal administration of healthcare and specialty training. Owing to the rapid and continuous evolution of the medical sciences, unbiased tools for monitoring the evolution of medical disciplines are needed. Network analysis was used to explore how the medical sciences have evolved between 1980 and 2015 based on the shared words contained in more than 9 million PubMed abstracts. The k-clique percolation method was used to extract local research communities within the network. Analysis of the shared vocabulary in research papers reflects the trends of collaboration and splintering among different disciplines in medicine. Our model identifies distinct communities within each discipline that preferentially collaborate with communities in other domains of specialty, and it overturns some common perceptions. Our analysis provides a tool to assess the growth, merging, splitting, and contraction of research communities and can thereby serve as a guide to inform policymakers about funding and training in healthcare.
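The core steps of this approach, building a word co-occurrence network and extracting communities by k-clique percolation, can be illustrated on toy data. The sketch below is not the published pipeline; the mini-abstracts, tokenization, and co-occurrence threshold are assumptions, whereas the real analysis spanned millions of PubMed abstracts.

```python
# Minimal sketch, assuming a toy word co-occurrence network and k-clique
# percolation via networkx; not the study's actual preprocessing or scale.
from itertools import combinations
from collections import Counter
import networkx as nx
from networkx.algorithms.community import k_clique_communities

abstracts = [
    "amygdala reactivity threat imaging genetics",
    "threat imaging genetics polymorphism amygdala",
    "tumor suppressor pathway replication cancer",
    "cancer pathway tumor suppressor biology",
]

# Count how often word pairs co-occur within the same abstract.
pair_counts = Counter()
for text in abstracts:
    words = sorted(set(text.split()))
    pair_counts.update(combinations(words, 2))

# Keep pairs that co-occur at least twice as edges of the network.
G = nx.Graph((u, v) for (u, v), n in pair_counts.items() if n >= 2)

# k-clique percolation: communities are unions of adjacent k-cliques.
for community in k_clique_communities(G, k=3):
    print(sorted(community))
```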
Project description: Scientific disciplines face concerns about replicability and statistical inference, and these concerns are also relevant in animal cognition research. This paper presents a first attempt to assess how researchers make and publish claims about animal physical cognition, and the statistical inferences they use to support them. We surveyed 116 published experiments from 63 papers on physical cognition, covering 43 different species. The most common tasks in our sample were trap-tube tasks (14 papers), other tool use tasks (13 papers), means-end understanding and string-pulling tasks (11 papers), object choice and object permanence tasks (9 papers) and access tasks (5 papers). This sample is not representative of the full scope of physical cognition research; however, it does provide data on the types of statistical design and publication decisions researchers have adopted. Across the 116 experiments, the median sample size was 7. Depending on the definitions we used, we estimated that between 44% and 59% of our sample of papers made positive claims about animals' physical cognitive abilities, between 24% and 46% made inconclusive claims, and between 10% and 17% made negative claims. Several failures of animals to pass physical cognition tasks were reported. Although our measures had low inter-observer reliability, these findings show that negative results can and have been published in the field. However, publication bias is still present, and consistent with this, we observed a drop in the frequency of p-values above .05. This suggests that some non-significant results have not been published. More promisingly, we found that researchers are likely making many correct statistical inferences at the individual-level. The strength of evidence of statistical effects at the group-level was weaker, and its p-value distribution was consistent with some effect sizes being overestimated. Studies such as ours can form part of a wider investigation into statistical reliability in comparative cognition. However, future work should focus on developing the validity and reliability of the measurements they use, and we offer some starting points.
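The "drop in the frequency of p-values above .05" is the kind of asymmetry a simple binning of reported p-values around the threshold can reveal. The sketch below is not the survey's analysis, and the p-values are hypothetical stand-ins for values extracted from papers.

```python
# Minimal sketch, assuming a hand-extracted list of reported p-values;
# the values below are invented for illustration.
import numpy as np

p_values = np.array([0.001, 0.012, 0.03, 0.041, 0.048, 0.049,
                     0.06, 0.21, 0.002, 0.044, 0.52, 0.047])

just_below = np.sum((p_values > 0.025) & (p_values <= 0.05))
just_above = np.sum((p_values > 0.05) & (p_values <= 0.075))
print(f"p in (0.025, 0.05]: {just_below}, p in (0.05, 0.075]: {just_above}")
# A marked asymmetry between adjacent bins straddling .05 is one informal
# sign that some non-significant results went unpublished.
```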
Project description: Background: Low replication rates are a concern in most, if not all, scientific disciplines. In psychiatric genetics specifically, targeting intermediate brain phenotypes, which are more closely associated with putative genetic effects, was touted as a strategy leading to increased power and replicability. In the current study, we attempted to replicate previously published associations between single nucleotide polymorphisms and threat-related amygdala reactivity, which represents a robust brain phenotype not only implicated in the pathophysiology of multiple disorders, but also used as a biomarker of future risk. Methods: We conducted a literature search for published associations between single nucleotide polymorphisms and threat-related amygdala reactivity and found 37 unique findings. Our replication sample consisted of 1117 young adult volunteers (629 women, mean age 19.72 ± 1.25 years) for whom both genetic and functional magnetic resonance imaging data were available. Results: Of the 37 unique associations identified, only three replicated as previously reported. When exploratory analyses were conducted with different model parameters compared to the original findings, significant associations were identified for 28 additional studies: eight of these were for a different contrast/laterality; five for a different gender and/or race/ethnicity; and 15 in the opposite direction and for a different contrast, laterality, gender, and/or race/ethnicity. No significant associations, regardless of model parameters, were detected for six studies. Notably, none of the significant associations survived correction for multiple comparisons. Conclusions: We discuss these patterns of poor replication with regard to the general strategy of targeting intermediate brain phenotypes in genetic association studies and the growing importance of advancing the replicability of imaging genetics findings.
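The final step reported above, checking whether nominally significant associations survive correction for multiple comparisons, can be sketched in a few lines. The p-values below are simulated placeholders, not the study's results, and the correction methods shown are generic choices rather than the authors' exact procedure.

```python
# Minimal sketch, assuming one p-value per tested SNP-amygdala association;
# the 37 values here are simulated, not the study's data.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p_values = rng.uniform(0.001, 0.9, size=37)  # hypothetical replication p-values

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("nominally significant:", int(np.sum(p_values < 0.05)))
print("surviving Bonferroni:", int(reject_bonf.sum()))
print("surviving Benjamini-Hochberg FDR:", int(reject_fdr.sum()))
```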
Project description: Vast numbers of scientific articles are published each year, some of which attract considerable attention, and some of which go almost unnoticed. Here, we investigate whether any of this variance can be explained by a simple metric of one aspect of the paper's presentation: the length of its title. Our analysis provides evidence that journals which publish papers with shorter titles receive more citations per paper. These results are consistent with the intriguing hypothesis that papers with shorter titles may be easier to understand, and hence attract more citations.
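A journal-level association of this kind can be computed in a few lines. The sketch below is not the published analysis (which also accounted for publication year); the data file and column names are assumptions.

```python
# Minimal sketch, assuming a table of papers with journal, title, and
# citation count; file and columns are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

papers = pd.read_csv("papers.csv")  # columns: journal, title, citations
papers["title_len"] = papers["title"].str.split().str.len()

# Aggregate to the journal level: average title length vs. citations per paper.
by_journal = papers.groupby("journal").agg(
    mean_title_len=("title_len", "mean"),
    citations_per_paper=("citations", "mean"),
)

rho, p = spearmanr(by_journal["mean_title_len"], by_journal["citations_per_paper"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```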