What Should Researchers Expect When They Replicate Studies? A Statistical View of Replicability in Psychological Science.
ABSTRACT: A recent study of the replicability of key psychological findings is a major contribution toward understanding the human side of the scientific process. Despite the careful and nuanced analysis reported, the simple narrative disseminated by the mass, social, and scientific media was that in only 36% of the studies were the original results replicated. In the current study, however, we showed that 77% of the replication effect sizes reported were within a 95% prediction interval calculated using the original effect size. Our analysis suggests two critical issues in understanding replication of psychological studies. First, researchers' intuitive expectations for what a replication should show do not always match statistical estimates of replication. Second, when the results of original studies are very imprecise, they create wide prediction intervals, and hence a broad range of replication effects that are consistent with the original estimates. This may lead to effects that replicate successfully, in that replication results are consistent with statistical expectations, but do not provide much information about the size (or existence) of the true effect. In this light, the results of the Reproducibility Project: Psychology can be viewed as statistically consistent with what one might expect when performing a large-scale replication experiment.
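To make the prediction-interval logic concrete, here is a minimal sketch for correlation effect sizes, assuming the standard Fisher z transformation; the function name and the example numbers are illustrative, not values taken from the project data.

    import numpy as np
    from scipy.stats import norm

    def replication_prediction_interval(r_orig, n_orig, n_rep, level=0.95):
        # Prediction interval for a replication correlation, given the original
        # correlation r_orig (from n_orig subjects) and the replication sample
        # size n_rep. The Fisher z transform has sampling variance ~ 1/(n - 3),
        # and the interval must account for the uncertainty of BOTH studies.
        z = np.arctanh(r_orig)
        se = np.sqrt(1 / (n_orig - 3) + 1 / (n_rep - 3))
        crit = norm.ppf(1 - (1 - level) / 2)  # 1.96 for a 95% interval
        return np.tanh(z - crit * se), np.tanh(z + crit * se)

    # Illustrative call: original r = .40 with n = 50, replication with n = 80.
    print(replication_prediction_interval(0.40, 50, 80))

A replication "succeeds" in this sense when its observed correlation falls inside the interval; because the original study's uncertainty enters the calculation, a small original sample widens the interval no matter how large the replication is.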
Project description: Mechanisms supporting human ultra-cooperativeness are very much subject to debate. One psychological feature likely to be relevant is the formation of expectations, particularly about receiving cooperative or generous behavior from others. Without such expectations, social life would be seriously impeded; in turn, expectations that lead to satisfactory interactions can become norms and institutionalize cooperation. In this paper, we assess people's expectations of generosity in a series of controlled experiments using the dictator game. Despite differences in the subjects' roles, their involvement in the game, the degree of social distance, and the size of the stakes, the results are conclusive: subjects seldom predict that dictators will behave selfishly (by choosing the Nash equilibrium action, namely giving nothing). The majority of subjects expect that dictators will choose the equal split. This implies that generous behavior is not only observed in the lab but also expected by subjects. In addition, expectations are accurate, closely matching the observed donations and showing that, as a society, we have a good grasp of how we interact. Finally, the correlation between expectations and actual behavior suggests that expectations can be an important ingredient of generous or cooperative behavior.
Project description: Replicability is an important feature of scientific research, but aspects of contemporary research culture, such as an emphasis on novelty, can make replicability seem less important than it should be. The Reproducibility Project: Cancer Biology was set up to provide evidence about the replicability of preclinical research in cancer biology by repeating selected experiments from high-impact papers. A total of 50 experiments from 23 papers were repeated, generating data about the replicability of a total of 158 effects. Most of the original effects were positive effects (136), with the rest being null effects (22). A majority of the original effect sizes were reported as numerical values (117), with the rest being reported as representative images (41). We employed seven methods to assess replicability, and some of these methods were not suitable for all the effects in our sample. One method compared effect sizes: for positive effects, the median effect size in the replications was 85% smaller than the median effect size in the original experiments, and 92% of replication effect sizes were smaller than the original. The other methods were binary (the replication was either a success or a failure), and five of these methods could be used to assess both positive and null effects when effect sizes were reported as numerical values. For positive effects, 40% of replications (39/97) succeeded according to three or more of these five methods, and for null effects 80% of replications (12/15) were successful on this basis; combining positive and null effects, the success rate was 46% (51/112). A successful replication does not definitively confirm an original finding or its theoretical interpretation. Equally, a failure to replicate does not disconfirm a finding, but it does suggest that additional investigation is needed to establish its reliability.
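The two effect-size comparisons reported above reduce to simple summary statistics. A sketch with made-up numbers (not the project's data) shows how both would be computed from matched original/replication pairs:

    import numpy as np

    # Hypothetical standardized effect sizes for matched pairs of
    # original and replication experiments (illustrative values only).
    orig = np.array([1.2, 0.8, 2.5, 0.6, 1.9])
    rep = np.array([0.3, 0.1, 0.9, 0.7, 0.2])

    shrinkage = 1 - np.median(rep) / np.median(orig)  # "median X% smaller"
    frac_smaller = np.mean(rep < orig)                # share below the original
    print(f"median replication effect {shrinkage:.0%} smaller; "
          f"{frac_smaller:.0%} of replication effects below the original")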
Project description: We measure how accurately the replication of experimental results can be predicted by black-box statistical models. With data from four large-scale replication projects in experimental psychology and economics, and techniques from machine learning, we train predictive models and study which variables drive predictable replication. The models predict binary replication with a cross-validated accuracy rate of 70% (AUC of 0.77) and estimates of relative effect sizes with a Spearman ρ of 0.38. The accuracy level is similar to market-aggregated beliefs of peer scientists [1, 2]. The predictive power is validated in a pre-registered out-of-sample test, where 71% (AUC of 0.73) of replications are predicted correctly and effect size correlations amount to ρ = 0.25. Basic features such as the sample and effect sizes in original papers, and whether reported effects are single-variable main effects or two-variable interactions, are predictive of successful replication. The models presented in this paper are simple tools for producing cheap, prognostic replicability metrics. These models could be useful in institutionalizing the evaluation of new findings and in guiding resources to those direct replications that are likely to be most informative.
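The modeling setup can be sketched with simulated data; the paper's actual models and features are richer, so treat the feature set, coefficients, and classifier choice below as assumptions made for illustration only:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n = 200
    # Basic features of the kind the paper reports as predictive: original
    # sample size, original effect size, and an interaction-effect indicator.
    X = np.column_stack([
        rng.integers(20, 400, n),    # original sample size
        rng.uniform(0.05, 0.8, n),   # original effect size
        rng.integers(0, 2, n),       # 1 = two-variable interaction effect
    ])
    # Simulated replication outcomes: bigger samples and effects help,
    # interaction effects hurt (coefficients are made up for the demo).
    logit = 0.004 * X[:, 0] + 2.0 * X[:, 1] - 1.0 * X[:, 2] - 1.0
    y = rng.random(n) < 1 / (1 + np.exp(-logit))

    model = make_pipeline(StandardScaler(), LogisticRegression())
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"cross-validated AUC: {auc.mean():.2f}")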
Project description: Given the importance of effective treatments for children with reading impairment, paired with growing concern about the lack of scientific replication in psychological science, the aim of this study was to replicate a quasi-randomised trial of sight word and phonics training using a randomised controlled trial (RCT) design. One group of poor readers (N = 41) did 8 weeks of phonics training (i.e., phonological decoding) and then 8 weeks of sight word training (i.e., whole-word recognition). A second group did the same training in the reverse order. Sight word and phonics training each had a large and significant valid treatment effect on trained irregular words and word reading fluency. In addition, combined sight word and phonics training had a moderate and significant valid treatment effect on nonword reading accuracy and fluency. These findings demonstrate the reliability of both phonics and sight word training in treating poor readers, at a time when the importance of scientific reliability is under close scrutiny.
Project description: Direct replication studies follow an original experiment's methods as closely as possible. They provide information about the reliability and validity of an original study's findings. The present paper asks what comparative cognition should expect if its studies were directly replicated, and how researchers can use this information to improve the reliability of future research. Because published effect sizes are likely overestimated, comparative cognition researchers should not expect findings with p-values just below the significance level to replicate consistently (the simulation sketch below illustrates this significance-filter inflation). Nevertheless, there are several statistical and design features that can help researchers identify reliable research. However, researchers should not simply aim for maximum replicability when planning studies; comparative cognition faces strong replicability-validity and replicability-resource trade-offs. Next, the paper argues that it may not even be possible to perform truly direct replication studies in comparative cognition because of: 1) a lack of access to the species of interest; 2) real differences in animal behavior across sites; and 3) sample size constraints producing very uncertain statistical estimates, meaning that it will often not be possible to detect statistical differences between original and replication studies. These three reasons suggest that many claims in the comparative cognition literature are practically unfalsifiable, and this presents a challenge for cumulative science in comparative cognition. To address this challenge, comparative cognition can begin to formally assess the replicability of its findings, improve its statistical thinking, and explore new infrastructures that help the field create and combine the data necessary to understand how cognition evolves.
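The claim that published effect sizes are likely overestimated follows from the significance filter alone, as a small simulation shows; the true effect size, sample size, and two-sample t-test below are illustrative choices, not values from the paper:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_d, n, sims = 0.3, 20, 10_000   # small true effect, 20 subjects per group

    # Simulate two-group experiments and keep only those reaching p < .05,
    # mimicking a literature filtered by statistical significance.
    published = []
    for _ in range(sims):
        a = rng.normal(true_d, 1, n)
        b = rng.normal(0, 1, n)
        t, p = stats.ttest_ind(a, b)
        if p < 0.05 and t > 0:
            published.append(a.mean() - b.mean())  # observed effect (SDs are 1)

    print(f"true d = {true_d}; mean published effect = {np.mean(published):.2f}")

Conditioning on p < .05 at low power selects exactly those experiments whose sampling error happened to inflate the effect, so a replication run at the original sample size should be expected to come in smaller.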
Project description: Psychological studies have demonstrated that expectations can have substantial effects on choice behavior, although the role of expectations in social decision making in particular has been relatively unexplored. To broaden our knowledge, we examined the role of expectations on decision making when interacting with new game partners, and then also in a subsequent interaction with the same partners. To do this, 38 participants played an Ultimatum Game (UG) in the role of responders and were primed to expect to play with two different groups of proposers: those that were relatively fair (with a tendency to propose an equal split; the high expectation condition) or unfair (with a history of offering unequal splits; the low expectation condition). After playing these 40 UG rounds, they then played 40 Dictator Games (DG) as allocator with the same set of partners. The results showed that expectations affect UG decisions, with a greater proportion of unfair offers rejected from the high expectation group as compared to the low expectation group, suggesting that players use specific expectations of social interaction as a behavioral reference point. Importantly, this effect was evident within subjects. Interestingly, we also demonstrated that these expectation effects carried over to the subsequent DG. Participants allocated more money to the recipients of the high expectation group, as well as to those who made equal offers, and in particular when the latter were expected to behave unfairly, suggesting that people tend to forgive negative violations and to appreciate and reward positive violations. Therefore, both the expectations of others' behavior and their violations play an important role in subsequent allocation decisions. Together, these two studies extend our knowledge of the role of expectations in social decision making.
Project description: The idea of replication is based on the premise that there are empirical regularities or universal laws to be replicated and verified, and that the scientific method is adequate for doing so. Scientific truth, however, is not absolute but relative to time, context, and the method used. Time and context are inextricably intertwined, in that time (e.g., Christmas Day vs. New Year’s Day) creates different contexts for behaviors, and contexts create different experiences of time, rendering psychological phenomena inherently variable. This means that internal and external conditions fluctuate and differ between a replication study and the original. Thus, a replication experiment is just another empirical investigation in an ongoing effort to establish scientific truth. Neither the original nor a replication is the final arbiter of whether or not something exists. Discovered patterns need not be permanent laws of human behavior proven by pinpoint statistical verification through replication. To move forward, phenomenon replications are needed to investigate phenomena in different ways, forms, contexts, and times. Such investigations look at phenomena not just in terms of the magnitude of their effects but also in terms of their frequency, duration, and intensity in labs and in real life. They will also shed light on the extent to which lab manipulations may make many phenomena subjectively conscious events and effects (e.g., causal attributions) when they are nonconsciously experienced in real life, or vice versa. As scientific knowledge in physics is temporary and incomplete, should it be any surprise that science can only provide “temporary winners” for psychological knowledge of human behavior?
Project description: There is broad agreement that psychology is facing a replication crisis. Even some seemingly well-established findings have failed to replicate. Numerous causes of the crisis have been identified, such as underpowered studies, publication bias, imprecise theories, and inadequate statistical procedures. The replication crisis is real, but it is less clear how it should be resolved. Here we examine potential solutions by modeling a scientific community under various replication regimes. In one regime, all findings are replicated before publication to guard against subsequent replication failures. In an alternative regime, individual studies are published and are replicated after publication, but only if they attract the community's interest. We find that the publication of potentially non-replicable studies minimizes cost and maximizes the efficiency of knowledge gain for the scientific community under a variety of assumptions. Our findings suggest that, provided it is properly managed, low replicability can support robust and efficient science.
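A deliberately crude back-of-the-envelope version of the cost comparison helps fix ideas; the 20% "interest" rate below is an arbitrary illustrative parameter, not a value from the authors' model:

    import numpy as np

    rng = np.random.default_rng(2)
    n_findings = 10_000                        # findings produced by the community
    interest = rng.random(n_findings) < 0.2    # ~20% ever attract follow-up interest

    # Regime 1: every finding is replicated before publication.
    cost_pre = 2 * n_findings

    # Regime 2: publish first; replicate only the findings that attract interest.
    cost_post = n_findings + interest.sum()

    print(f"replicate-before-publication: {cost_pre} studies")
    print(f"publish-then-replicate-if-interesting: {cost_post} studies")
    # Both regimes replicate every finding the community ends up caring about,
    # but the second avoids replicating the ~80% that attract no follow-up.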
Project description: BACKGROUND: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features associated with illness. We propose a new approach, called gene set bagging, for measuring the probability that a gene set replicates in future studies. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate in the bagged samples. RESULTS: Using both simulated and publicly available genomics data, we demonstrate that significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. We show that our method estimates the replication probability (R), the probability that a gene set will replicate as a significant result in future studies, and show in simulations that this method reflects replication better than each set's p-value. CONCLUSIONS: Our results suggest that gene lists based on p-values are not necessarily stable, and that additional steps like gene set bagging may therefore improve biological inference on gene sets.
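A minimal sketch of the bagging loop, assuming a simple competitive enrichment test (per-gene t statistics, then a rank test of the set against all other genes) stands in for whatever gene-set analysis is used in practice:

    import numpy as np
    from scipy import stats

    def gene_set_bagging(expr, labels, gene_set, n_boot=100, alpha=0.05, seed=0):
        # Resample subjects with replacement (within groups, so both groups
        # stay represented), re-run the enrichment test each time, and
        # estimate the replication probability R as the fraction of bootstrap
        # iterations in which the set is significant.
        # expr: genes x samples matrix; labels: 0/1 group label per sample;
        # gene_set: row indices of the set of interest.
        rng = np.random.default_rng(seed)
        in_set = np.zeros(expr.shape[0], dtype=bool)
        in_set[gene_set] = True
        hits = 0
        for _ in range(n_boot):
            idx = np.concatenate([
                rng.choice(np.flatnonzero(labels == g),
                           size=(labels == g).sum(), replace=True)
                for g in (0, 1)
            ])
            e, y = expr[:, idx], labels[idx]
            # Per-gene t statistics between groups, then test whether the
            # set's statistics are shifted relative to all other genes.
            t = stats.ttest_ind(e[:, y == 1], e[:, y == 0], axis=1).statistic
            p = stats.mannwhitneyu(t[in_set], t[~in_set]).pvalue
            hits += p < alpha
        return hits / n_boot

The estimate of R is simply the fraction of bootstrap iterations in which the set remains significant; a set with a small p-value in the original data but a low R exhibits exactly the instability the paper describes.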
Project description: In psychological science, there is increasing concern regarding the reproducibility of scientific findings. For instance, the Reproducibility Project: Psychology (Open Science Collaboration, 2015) found that the proportion of successful replications in psychology was 41%. This proportion was calculated based on Cumming and Maillardet's (2006) widely employed capture procedure (CPro) and capture percentage (CPer). Despite the popularity of CPro and CPer, we believe that using them may lead to an incorrect conclusion of (a) successful replication when the population effect sizes in the original and replication studies are different; and (b) unsuccessful replication when the population effect sizes in the original and replication studies are identical but their sample sizes are different. Our simulation results show that CPro and CPer become biased, such that researchers can easily draw the wrong conclusion of successful or unsuccessful replication. Implications of these findings are considered in the conclusion.
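The sample-size sensitivity of the capture percentage is easy to reproduce by simulation; the population mean, SD, and sample sizes below are arbitrary illustrative choices:

    import numpy as np
    from scipy import stats

    def capture_percentage(n_orig, n_rep, mu=0.5, sims=20_000, seed=0):
        # Fraction of simulations in which the replication's sample mean falls
        # inside the original study's 95% CI, when both studies sample from
        # the very same population (true mean mu, SD 1).
        rng = np.random.default_rng(seed)
        captured = 0
        for _ in range(sims):
            x = rng.normal(mu, 1, n_orig)
            half = stats.t.ppf(0.975, n_orig - 1) * x.std(ddof=1) / np.sqrt(n_orig)
            rep_mean = rng.normal(mu, 1, n_rep).mean()
            captured += abs(rep_mean - x.mean()) < half
        return captured / sims

    # Identical population effects, yet the "capture" rate shifts with sample size:
    print(capture_percentage(50, 50))    # ~.83, the equal-n baseline
    print(capture_percentage(50, 15))    # noisier replication: capture drops (~.65)
    print(capture_percentage(50, 500))   # precise replication: capture rises (~.94)

Even though both studies sample from an identical population effect, the capture rate ranges from roughly 65% to 94% depending only on the relative sample sizes, which is the bias described above.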