Error rates of human reviewers during abstract screening in systematic reviews.
ABSTRACT: BACKGROUND:Automated approaches to improve the efficiency of systematic reviews are greatly needed. When testing any of these approaches, the criterion standard of comparison (gold standard) is usually human reviewers. Yet, human reviewers make errors in inclusion and exclusion of references. OBJECTIVES:To determine citation false inclusion and false exclusion rates during abstract screening by pairs of independent reviewers. These rates can help in designing, testing and implementing automated approaches. METHODS:We identified all systematic reviews conducted between 2010 and 2017 by an evidence-based practice center in the United States. Eligible reviews had to follow standard systematic review procedures with dual independent screening of abstracts and full texts, in which citation inclusion by one reviewer prompted automatic inclusion through the next level of screening. Disagreements between reviewers during full text screening were reconciled via consensus or arbitration by a third reviewer. A false inclusion or exclusion was defined as a decision made by a single reviewer that was inconsistent with the final included list of studies. RESULTS:We analyzed a total of 139,467 citations that underwent 329,332 inclusion and exclusion decisions from 86 unique reviewers. The final systematic reviews included 5.48% of the potential references identified through bibliographic database search (95% confidence interval (CI): 2.38% to 8.58%). After abstract screening, the total error rate (false inclusion and false exclusion) was 10.76% (95% CI: 7.43% to 14.09%). CONCLUSIONS:This study suggests that human reviewers have appreciable false inclusion and false exclusion rates. When evaluating the validity of a future automated study selection algorithm, it is important to keep in mind that the gold standard is not perfect and that achieving error rates similar to those of human reviewers may be adequate and can save resources and time.
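The error definitions in the abstract above can be expressed compactly in code. This is an illustrative sketch only (the function and field names are hypothetical, not from the study): a false inclusion or exclusion is any single-reviewer decision that disagrees with the final list of included studies.

```python
# Hypothetical sketch of the study's error definitions: a false inclusion or
# exclusion is a single-reviewer decision inconsistent with the final list.

def screening_error_rates(decisions, final_included):
    """decisions: list of (citation_id, included: bool) single-reviewer calls.
    final_included: set of citation_ids in the final systematic review."""
    false_incl = sum(1 for cid, inc in decisions if inc and cid not in final_included)
    false_excl = sum(1 for cid, inc in decisions if not inc and cid in final_included)
    total = len(decisions)
    return {
        "false_inclusion_rate": false_incl / total,
        "false_exclusion_rate": false_excl / total,
        "total_error_rate": (false_incl + false_excl) / total,
    }
```

Note that the denominator here is decisions, not citations, matching the study's setup in which each citation receives multiple independent decisions.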
Project description:BACKGROUND:We investigated the feasibility of using a machine learning tool's relevance predictions to expedite title and abstract screening. METHODS:We subjected 11 systematic reviews and six rapid reviews to four retrospective screening simulations (automated and semi-automated approaches to single-reviewer and dual independent screening) in Abstrackr, a freely available machine learning tool. We calculated the proportion missed, workload savings, and time savings compared to single-reviewer and dual independent screening by human reviewers. We performed cited reference searches to determine if missed studies would be identified via reference list scanning. RESULTS:For systematic reviews, the semi-automated, dual independent screening approach provided the best balance of time savings (median (range) 20 (3-82) hours) and reliability (median (range) proportion missed records, 1 (0-14)%). The cited references search identified 59% (n = 10/17) of the records missed. For the rapid reviews, the fully and semi-automated approaches saved time (median (range) 9 (2-18) hours and 3 (1-10) hours, respectively), but less so than for the systematic reviews. The median (range) proportion missed records for both approaches was 6 (0-22)%. CONCLUSION:Using Abstrackr to assist one of two reviewers in systematic reviews saves time with little risk of missing relevant records. Many missed records would be identified via other means.
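The evaluation metrics used above (proportion missed, workload savings) follow from simple ratios; a minimal sketch with hypothetical variable names, assuming records are identified by ids:

```python
# Illustrative computation of the screening-simulation metrics (names assumed).

def proportion_missed(truly_relevant, included_by_approach):
    """Percent of relevant records (per the original review) the approach missed."""
    missed = set(truly_relevant) - set(included_by_approach)
    return 100.0 * len(missed) / len(truly_relevant)

def workload_savings(total_records, records_screened_manually):
    """Percent of records a human never had to screen."""
    return 100.0 * (total_records - records_screened_manually) / total_records
```

Time savings can then be estimated from workload savings multiplied by an assumed per-record screening time.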
Project description:BACKGROUND:Web applications that employ natural language processing technologies to support systematic reviewers during abstract screening have become more common. The goal of our project was to conduct a case study to explore a screening approach that temporarily replaces a human screener with a semi-automated screening tool. METHODS:We evaluated the accuracy of the approach using DistillerAI as a semi-automated screening tool. A published comparative effectiveness review served as the reference standard. Five teams of professional systematic reviewers screened the same 2472 abstracts in parallel. Each team trained DistillerAI with 300 randomly selected abstracts that the team screened dually. For all remaining abstracts, DistillerAI replaced one human screener and provided predictions about the relevance of records. A single reviewer also screened all remaining abstracts. A second human screener resolved conflicts between the single reviewer and DistillerAI. We compared the decisions of the machine-assisted approach, single-reviewer screening, and screening with DistillerAI alone against the reference standard. RESULTS:The combined sensitivity of the machine-assisted screening approach across the five screening teams was 78% (95% confidence interval [CI], 66 to 90%), and the combined specificity was 95% (95% CI, 92 to 97%). By comparison, the sensitivity of single-reviewer screening was similar (78%; 95% CI, 66 to 89%); however, the sensitivity of DistillerAI alone was substantially worse (14%; 95% CI, 0 to 31%) than that of the machine-assisted screening approach. Specificities for single-reviewer screening and DistillerAI were 94% (95% CI, 91 to 97%) and 98% (95% CI, 97 to 100%), respectively. Machine-assisted screening and single-reviewer screening had similar areas under the curve (0.87 and 0.86, respectively); by contrast, the area under the curve for DistillerAI alone was just slightly better than chance (0.56). 
The interrater agreement between human screeners and DistillerAI with a prevalence-adjusted kappa was 0.85 (95% CI, 0.84 to 0.86). CONCLUSIONS:The accuracy of DistillerAI is not yet adequate to replace a human screener temporarily during abstract screening for systematic reviews. Rapid reviews, which do not require detecting the totality of the relevant evidence, may find semi-automation tools to have greater utility than traditional systematic reviews do.
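The accuracy measures reported above follow standard definitions from the 2x2 confusion table. A minimal sketch, assuming "prevalence-adjusted kappa" refers to the prevalence-and-bias-adjusted kappa (PABAK = 2·Po − 1, where Po is the observed agreement):

```python
# Illustrative computation of sensitivity, specificity, and PABAK from paired
# include/exclude decisions. Assumption: the abstract's "prevalence-adjusted
# kappa" is PABAK, i.e. 2 * observed agreement - 1.

def screening_accuracy(reference, predictions):
    """reference, predictions: equal-length sequences of truthy include flags."""
    tp = sum(1 for r, p in zip(reference, predictions) if r and p)
    tn = sum(1 for r, p in zip(reference, predictions) if not r and not p)
    fp = sum(1 for r, p in zip(reference, predictions) if not r and p)
    fn = sum(1 for r, p in zip(reference, predictions) if r and not p)
    observed_agreement = (tp + tn) / len(reference)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "pabak": 2 * observed_agreement - 1,
    }
```

Because relevant records are rare in screening, specificity is dominated by the large "exclude" class, which is why a tool can show high specificity while missing most relevant records.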
Project description:BACKGROUND:Systematic reviews address a specific clinical question by assessing and analyzing the pertinent literature without bias. Citation screening is a time-consuming and critical step in systematic reviews. Typically, reviewers must evaluate thousands of citations to identify articles eligible for a given review. We explore the application of machine learning techniques to semi-automate citation screening, thereby reducing the reviewers' workload. RESULTS:We present a novel online classification strategy for citation screening to automatically discriminate "relevant" from "irrelevant" citations. We use an ensemble of Support Vector Machines (SVMs) built over different feature-spaces (e.g., abstract and title text), and trained interactively by the reviewer(s). Semi-automating the citation screening process is difficult because any such strategy must identify all citations eligible for the systematic review. This requirement is made harder still by class imbalance; there are far fewer "relevant" than "irrelevant" citations for any given systematic review. To address these challenges we employ a custom active-learning strategy developed specifically for imbalanced datasets. Further, we introduce a novel undersampling technique. We provide experimental results over three real-world systematic review datasets, and demonstrate that our algorithm is able to reduce the number of citations that must be screened manually by nearly half in two of these, and by around 40% in the third, without excluding any of the citations eligible for the systematic review. CONCLUSIONS:We have developed a semi-automated citation screening algorithm for systematic reviews that has the potential to substantially reduce the number of citations reviewers have to manually screen, without compromising the quality and comprehensiveness of the review.
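The abstract above describes an interactively trained SVM with active learning and undersampling, but not its implementation. The general loop can be sketched as follows; everything here is an assumption for illustration (synthetic data, scikit-learn's LinearSVC, margin-based uncertainty sampling, and a 3:1 undersampling ratio), not the authors' algorithm:

```python
# Generic active-learning loop with undersampling for imbalanced screening
# data. All parameters and data are illustrative, not from the paper.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # synthetic "citations"
y = (X[:, 0] + 0.5 * X[:, 1] > 1.6).astype(int)      # ~7% relevant: imbalanced

pos_idx, neg_idx = np.where(y == 1)[0], np.where(y == 0)[0]
labeled = list(pos_idx[:10]) + list(neg_idx[:40])    # seed set screened by hand
pool = [i for i in range(len(y)) if i not in set(labeled)]

for _ in range(5):  # interactive rounds: train, query most uncertain, repeat
    li = np.array(labeled)
    pos, neg = li[y[li] == 1], li[y[li] == 0]
    # undersample the "irrelevant" majority to at most 3x the minority class
    keep = rng.choice(neg, size=min(len(neg), 3 * len(pos)), replace=False)
    clf = LinearSVC().fit(X[np.concatenate([pos, keep])],
                          y[np.concatenate([pos, keep])])
    # uncertainty sampling: query records closest to the decision boundary
    margins = np.abs(clf.decision_function(X[pool]))
    query = [pool[j] for j in np.argsort(margins)[:20]]
    labeled += query                                  # reviewer labels these
    pool = [i for i in pool if i not in set(query)]
```

In a real workflow the queried labels would come from the reviewer rather than from `y`, and the paper's ensemble would train one such SVM per feature space (title, abstract, keywords) and combine their votes.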
Project description:BACKGROUND:Stroke secondary prevention guidelines recommend medication prescription and adherence, active education and behavioural counselling regarding lifestyle risk factors. To impact on recurrent vascular events, positive behaviour/s must be adopted and sustained as a lifestyle choice, requiring theoretically informed behaviour change and self-management interventions. A growing number of systematic reviews have addressed complex interventions in stroke secondary prevention. Differing terminology, inclusion criteria and overlap of studies between reviews make the mechanism/s that affect positive change difficult to identify or replicate clinically. Adopting a two-phase approach, this overview will firstly comprehensively summarise systematic reviews in this area and secondly identify and synthesise primary studies in these reviews which provide person-centred, theoretically informed interventions for stroke secondary prevention. METHODS:An overview of reviews will be conducted using a systematic search strategy across the Cochrane Database of Systematic Reviews, PubMed and Epistemonikos. INCLUSION CRITERIA:systematic reviews where the population comprises individuals post-stroke or TIA and where data relating to person-centred risk reduction are synthesised for evidence of efficacy when compared to standard care or no intervention. Primary outcomes of interest include mortality, recurrent stroke and other cardiovascular events. In phase 1, two reviewers will independently (1) assess the eligibility of identified reviews for inclusion; (2) rate the quality of included reviews using the ROBIS tool; (3) identify unique primary studies and overlap between reviews; (4) summarise the published evidence supporting person-centred behavioural change and self-management interventions in stroke secondary prevention and (5) identify evidence gaps in this field.
In phase 2, two independent reviewers will (1) examine person-centred, primary studies in each review using the Template for Intervention Description and Replication (TIDieR checklist), itemising, where present, theoretical frameworks underpinning interventions; (2) group studies employing theoretically informed interventions by the intervention delivered and by the outcomes reported; and (3) apply the GRADE approach to rate the quality of evidence for each intervention by outcome/s identified from theoretically informed primary studies. Disagreement between reviewers at each process stage will be discussed and a third reviewer consulted. DISCUSSION:This overview will comprehensively bring together the best available evidence supporting person-centred, stroke secondary prevention strategies in an accessible format, identifying current knowledge gaps.
Project description:PURPOSE:To inform recommendations by the Canadian Task Force on Preventive Health Care by systematically reviewing direct evidence on the effectiveness and acceptability of screening adults 40 years and older in primary care to reduce fragility fractures and related mortality and morbidity, and indirect evidence on the accuracy of fracture risk prediction tools. Evidence on the benefits and harms of pharmacological treatment will be reviewed, if needed to meaningfully influence the Task Force's decision-making. METHODS:A modified update of an existing systematic review will evaluate screening effectiveness, the accuracy of screening tools, and treatment benefits. For treatment harms, we will integrate studies from existing systematic reviews. A de novo review on acceptability will be conducted. Peer-reviewed searches (Medline, Embase, Cochrane Library, PsycINFO [acceptability only]), grey literature, and hand searches of reviews and included studies will update the literature. Based on pre-specified criteria, we will screen studies for inclusion following a liberal-accelerated approach. Final inclusion will be based on consensus. Data extraction for study results will be performed independently by two reviewers while other data will be verified by a second reviewer; there may be some reliance on extracted data from the existing reviews. The risk of bias assessments reported in the existing reviews will be verified and for new studies will be performed independently. When appropriate, results will be pooled using either pairwise random effects meta-analysis (screening and treatment) or restricted maximum likelihood estimation with Hartung-Knapp-Sidik-Jonkman correction (risk prediction model calibration). Subgroups of interest to explain heterogeneity are age, sex, and menopausal status.
Two independent reviewers will rate the certainty of evidence using the GRADE approach, with consensus reached for each outcome rated as critical or important by the Task Force. DISCUSSION:Since the publication of other guidance in Canada, new trials have been published that are likely to improve understanding of screening in primary care settings to prevent fragility fractures. A systematic review is required to inform updated recommendations that align with the current evidence base.
Project description:BACKGROUND:Systematic reviews are vital to the pursuit of evidence-based medicine within healthcare. Screening titles and abstracts (T&Ab) for inclusion in a systematic review is an intensive, and often collaborative, step. The use of appropriate tools is therefore important. In this study, we identified and evaluated the usability of software tools that support T&Ab screening for systematic reviews within healthcare research. METHODS:We identified software tools using three search methods: a web-based search; a search of the online "systematic review toolbox"; and screening of references in existing literature. We included tools that were accessible and available for testing at the time of the study (December 2018), do not require specific computing infrastructure and provide basic screening functionality for systematic reviews. Key properties of each software tool were identified using a feature analysis adapted for this purpose. This analysis included a weighting developed by a group of medical researchers, therefore prioritising the most relevant features. The highest scoring tools from the feature analysis were then included in a user survey, in which we further investigated the suitability of the tools for supporting T&Ab screening amongst systematic reviewers working in medical research. RESULTS:Fifteen tools met our inclusion criteria. They vary significantly in relation to cost, scope and intended user community. Six of the identified tools (Abstrackr, Colandr, Covidence, DRAGON, EPPI-Reviewer and Rayyan) scored higher than 75% in the feature analysis and were included in the user survey. Of these, Covidence and Rayyan were the most popular with the survey respondents. Their usability scored highly across a range of metrics, with all surveyed researchers (n = 6) stating that they would be likely (or very likely) to use these tools in the future.
CONCLUSIONS:Based on this study, we would recommend Covidence and Rayyan to systematic reviewers looking for suitable, easy-to-use tools to support T&Ab screening within healthcare research. These two tools consistently demonstrated good alignment with user requirements. We acknowledge, however, the role of some of the other tools we considered in providing more specialist features that may be of great importance to many researchers.
Project description:BACKGROUND:Stringent requirements exist regarding the transparency of the study selection process and the reliability of results. A 2-step selection process is generally recommended; this is conducted by 2 reviewers independently of each other (conventional double-screening). However, the approach is resource intensive, which can be a problem, as systematic reviews generally need to be completed within a defined period with a limited budget. The aim of the following methodological systematic review was to analyse the evidence available on whether single screening is equivalent to double screening in the screening process conducted in systematic reviews. METHODS:We searched Medline, PubMed and the Cochrane Methodology Register (last search 10/2018). We also used supplementary search techniques and sources ("similar articles" function in PubMed, conference abstracts and reference lists). We included all evaluations comparing single with double screening. Data were summarized in a structured, narrative way. RESULTS:The 4 included evaluations investigated a total of 23 single screenings (12 sets for screening involving 9 reviewers). The median proportion of missed studies was 5% (range 0 to 58%). The median proportion of missed studies was 3% for the 6 experienced reviewers (range: 0 to 21%) and 13% for the 3 reviewers with less experience (range: 0 to 58%). The impact of missing studies on the findings of meta-analyses had been reported in 2 evaluations for 7 single screenings including a total of 18,148 references. In 3 of these 7 single screenings - all conducted by the same reviewer (with less experience) - the findings would have changed substantially. The remaining 4 of these 7 screenings were conducted by experienced reviewers and the missing studies had no impact or only a negligible impact on the findings of the meta-analyses.
CONCLUSIONS:Single screening of the titles and abstracts of studies retrieved in bibliographic searches is not equivalent to double screening, as substantially more studies are missed. However, in our opinion such an approach could still represent an appropriate methodological shortcut in rapid reviews, as long as it is conducted by an experienced reviewer. Further research on single screening is required, for instance, regarding factors influencing the number of studies missed.
Project description:The aim of this systematic review is to look at the barriers to uptake and interventions to improve uptake of postnatal screening in women who have had gestational diabetes mellitus (GDM). Increasing postnatal screening rates could lead to timely interventions that could reduce the incidence of type 2 diabetes mellitus (T2DM), the associated long-term health complications, and the financial burden of T2DM. A systematic review of the literature was undertaken. PubMed, Embase, Medline, CINAHL and the Cochrane library databases were searched using well-defined search terms. Predefined inclusion and exclusion criteria were used to identify relevant manuscripts. Data extractions and quality assessments were performed by one reviewer and checked by a second reviewer. Eleven primary studies of various research designs and three systematic reviews were included. We identified seven themes within these studies and described them in two categories: barriers and interventions. There appeared to be no single intervention that would overcome all the identified barriers; however, reminders to women and healthcare professionals appear to be the most effective. Uptake rates of testing for T2DM are low in women with GDM. Interventions developed with consideration of the identified barriers to uptake could promote greater numbers of women attending for follow-up.
Project description:BACKGROUND:We explored the performance of three machine learning tools designed to facilitate title and abstract screening in systematic reviews (SRs) when used to (a) eliminate irrelevant records (automated simulation) and (b) complement the work of a single reviewer (semi-automated simulation). We evaluated user experiences for each tool. METHODS:We subjected three SRs to two retrospective screening simulations. In each tool (Abstrackr, DistillerSR, RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. We calculated the proportion missed and workload and time savings compared to dual independent screening. To test user experiences, eight research staff tried each tool and completed a survey. RESULTS:Using Abstrackr, DistillerSR, and RobotAnalyst, respectively, the median (range) proportion missed was 5 (0 to 28) percent, 97 (96 to 100) percent, and 70 (23 to 100) percent for the automated simulation and 1 (0 to 2) percent, 2 (0 to 7) percent, and 2 (0 to 4) percent for the semi-automated simulation. The median (range) workload savings was 90 (82 to 93) percent, 99 (98 to 99) percent, and 85 (85 to 88) percent for the automated simulation and 40 (32 to 43) percent, 49 (48 to 49) percent, and 35 (34 to 38) percent for the semi-automated simulation. The median (range) time savings was 154 (91 to 183), 185 (95 to 201), and 157 (86 to 172) hours for the automated simulation and 61 (42 to 82), 92 (46 to 100), and 64 (37 to 71) hours for the semi-automated simulation. Abstrackr identified 33-90% of records missed by a single reviewer. RobotAnalyst performed less well and DistillerSR provided no relative advantage. User experiences depended on user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s). 
CONCLUSIONS:The workload savings afforded in the automated simulation came with increased risk of missing relevant records. Supplementing a single reviewer's decisions with relevance predictions (semi-automated simulation) sometimes reduced the proportion missed, but performance varied by tool and SR. Designing tools based on reviewers' self-identified preferences may improve their compatibility with present workflows. SYSTEMATIC REVIEW REGISTRATION:Not applicable.
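The two simulation designs evaluated above reduce to simple decision rules; a hedged sketch (function names are hypothetical): in the automated simulation the tool's prediction alone decides, while in the semi-automated simulation a record advances if either the single human reviewer or the tool flags it as relevant.

```python
# Illustrative decision rules for the two retrospective simulations described
# above (names are assumptions, not taken from the tools evaluated).

def automated_decision(tool_predicts_relevant: bool) -> bool:
    """Automated simulation: the tool's relevance prediction alone decides."""
    return tool_predicts_relevant

def semi_automated_decision(reviewer_includes: bool,
                            tool_predicts_relevant: bool) -> bool:
    """Semi-automated simulation: the tool replaces the second reviewer,
    so a record advances if either the human or the tool includes it."""
    return reviewer_includes or tool_predicts_relevant
```

The OR rule explains why the semi-automated approach misses fewer records than the automated one, at the cost of smaller workload savings: the tool can only add records beyond the single reviewer's decisions, never remove them.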
Project description:The objective of this systematic review was to examine papers from the United States on current practices in privacy and security when telehealth technologies are used by healthcare providers. A literature search was conducted following the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P). PubMed, CINAHL and INSPEC from 2003 to 2016 were searched and returned 25,404 papers (after duplications were removed). Inclusion and exclusion criteria were strictly followed to examine title, abstract, and full text for 21 published papers which reported on privacy and security practices used by healthcare providers using telehealth. Data on confidentiality, integrity, privacy, informed consent, access control, availability, retention, encryption, and authentication were all searched and retrieved from the papers examined. Papers were selected by two independent reviewers, first per inclusion/exclusion criteria and, where there was disagreement, a third reviewer was consulted. The percentage of agreement and Cohen's kappa were 99.04% and 0.7331, respectively. The papers reviewed ranged from 2004 to 2016 and included several types of telehealth specialties. Sixty-seven percent were policy type studies, and 14 percent were survey/interview studies. There were no randomized controlled trials. Based upon the results, we conclude that it is necessary to have more studies with specific information about the use of privacy and security practices when using telehealth technologies as well as studies that examine patient and provider preferences on how data is kept private and secure during and after telehealth sessions.