Agreement between two raters' evaluations of a Traditional Prosthodontic Practical Exam integrated with Directly Observed Procedural Skills in Egypt.
ABSTRACT: PURPOSE:This study aimed to determine the agreement between two raters evaluating students in a prosthodontic clinical practical exam integrated with directly observed procedural skills (DOPS). METHODS:A sample of 76 students was monitored by two raters, who evaluated both the process and the final registered maxillomandibular relation for completely edentulous patients at Mansoura Dental School, Egypt, during the Bachelor students' practical exam from May 15 to June 28, 2017. Each registered relation was evaluated out of a total of 60 marks, subdivided into three score categories: occlusal plane orientation (OPO), vertical dimension registration (VDR), and centric relation registration (CRR). The marks of each category included the DOPS mark. The OPO and VDR marks from both raters were compared graphically to measure reliability using Bland-Altman analysis. The reliability of the CRR marks was evaluated with Krippendorff's alpha. RESULTS:The results revealed similar OPO means between raters (mean = 18.1) and close limits of agreement (0.73 and -0.78). For VDR, the means were close (17.4 and 17.1 for raters 1 and 2, respectively), with limits of agreement of 2.7 and -2.2. There was strong agreement between raters in evaluating CRR (Krippendorff's alpha = 0.92; 95% CI [0.79, 0.99]). CONCLUSION:The two raters' evaluations of the traditional clinical practical exam integrated with directly observed procedural skills did not differ when assessing candidates at the end of the clinical prosthodontic course. The limits of agreement between raters would be optimal if subjective evaluation parameters and complicated cases were excluded from the examination procedures.
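For readers unfamiliar with the graph method, the Bland-Altman limits of agreement reported above are simple to compute and plot. Below is a minimal sketch in Python using numpy and matplotlib; the paired marks are hypothetical stand-ins, not the study data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired marks from two raters for the same students (not study data).
rater1 = np.array([18.0, 17.5, 18.5, 19.0, 17.0, 18.2])
rater2 = np.array([18.2, 17.4, 18.3, 18.8, 17.3, 18.0])

mean_scores = (rater1 + rater2) / 2          # per-student average of the two raters
diffs = rater1 - rater2                      # per-student difference between raters

bias = diffs.mean()                          # mean difference (systematic bias)
loa_upper = bias + 1.96 * diffs.std(ddof=1)  # upper 95% limit of agreement
loa_lower = bias - 1.96 * diffs.std(ddof=1)  # lower 95% limit of agreement

plt.scatter(mean_scores, diffs)
plt.axhline(bias, linestyle="-")             # bias line
plt.axhline(loa_upper, linestyle="--")       # limits of agreement
plt.axhline(loa_lower, linestyle="--")
plt.xlabel("Mean of raters 1 and 2")
plt.ylabel("Difference (rater 1 - rater 2)")
plt.show()
```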
Project description:PURPOSE:This study aimed to assess the performance of the Ebel standard-setting method in the spring 2019 Royal College of Physicians and Surgeons of Canada internal medicine certification examination, which consisted of multiple-choice questions. Specifically, the following were examined: the inter-rater agreement; the correlation between Ebel scores and item facility indices; the impact of raters' knowledge of the correct answers on the Ebel score; and the effect of rater specialty on inter-rater agreement and Ebel scores. METHODS:Data were drawn from a Royal College of Physicians and Surgeons of Canada certification exam. Ebel's method was applied to 203 MCQs by 49 raters. Facility indices came from 194 candidates. We computed Fleiss' kappa and the Pearson correlation between Ebel scores and item facility indices. We investigated differences in the Ebel score (correct answers provided or not) and differences between internists and other specialists with t-tests. RESULTS:Kappa was below 0.15 for both facility and relevance. The correlation between Ebel scores and facility indices was low when correct answers were provided and negligible when they were not. The Ebel score was the same whether the correct answers were provided or not. Inter-rater agreement and Ebel scores did not differ between internists and other specialists. CONCLUSION:Inter-rater agreement and correlations between item Ebel scores and facility indices were consistently low; furthermore, raters' knowledge of the correct answers and rater specialty had no effect on Ebel scores in the present setting.
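Fleiss' kappa, used here for inter-rater agreement among many raters, can be computed directly from an items-by-categories table of rating counts. A minimal sketch follows, assuming every item was rated by the same number of raters; the toy table is illustrative, not the examination data.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa from an (items x categories) table of rating counts,
    assuming the same number of raters rated every item."""
    n_items = counts.shape[0]
    n_raters = int(counts[0].sum())
    p_j = counts.sum(axis=0) / (n_items * n_raters)        # category proportions
    # Per-item observed agreement among rater pairs:
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = P_i.mean()                                     # mean observed agreement
    P_e = (p_j ** 2).sum()                                 # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 items, 3 categories, 5 raters per item (illustrative only).
counts = np.array([[5, 0, 0], [3, 2, 0], [1, 3, 1], [0, 0, 5]])
print(round(fleiss_kappa(counts), 3))
```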
Project description:The quality of patient education materials is an important issue for health educators, clinicians, and community health workers. We describe a challenge in achieving reliable scores between coders when using the Patient Education Materials Assessment Tool (PEMAT) to evaluate farmworker health materials in spring 2020. Four coders were unable to achieve reliability after three attempts at coding calibration. Further investigation identified improvements to the PEMAT codebook and evidence of the difficulty of achieving traditional interrater reliability as measured by Krippendorff's alpha. Our solution was to use multiple raters and average their ratings, achieving an acceptable score with an intraclass correlation coefficient. Practitioners using the PEMAT to evaluate materials should consider averaging the scores of multiple raters, as PEMAT results otherwise may be highly sensitive to who is doing the rating. Not doing so may inadvertently result in the use of suboptimal patient education materials.
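The averaging strategy described above corresponds to an average-measures intraclass correlation. Below is a minimal sketch of ICC(2,k) (two-way random effects, absolute agreement, average of k raters, in the Shrout-Fleiss taxonomy); the ratings matrix is illustrative, not the PEMAT data.

```python
import numpy as np

def icc2k(ratings: np.ndarray) -> float:
    """ICC(2,k): two-way random effects, absolute agreement, average of k raters.
    ratings has shape (n_subjects, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
    resid = (ratings - ratings.mean(axis=1, keepdims=True)
             - ratings.mean(axis=0, keepdims=True) + grand)
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

# Illustrative PEMAT-style percentage scores from four raters (not study data).
ratings = np.array([[80, 75, 82, 78],
                    [60, 58, 65, 62],
                    [90, 88, 91, 85],
                    [70, 72, 68, 74]], dtype=float)
print(round(icc2k(ratings), 3))          # reliability of the averaged score
print(ratings.mean(axis=1))              # the averaged rating actually reported
```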
Project description:PURPOSE:This study investigated interobserver agreement in lung ultrasonography (LUS) in pregnant women performed by obstetricians with different levels of expertise, with confirmation by an expert radiologist. METHODS:This prospective study was conducted at a tertiary "Coronavirus Pandemic Hospital" in April 2020. Pregnant women suspected to have coronavirus disease 2019 (COVID-19) were included. Two blinded experienced obstetricians performed LUS on pregnant women separately and noted their scores for 14 lung zones. Following a theoretical and hands-on practical course, one experienced obstetrician, two novice obstetric residents, and an experienced radiologist blindly evaluated anonymized and randomized still images and videoclips retrospectively. Weighted Cohen's kappa and Krippendorff's alpha tests were used to assess the interobserver agreement. RESULTS:Fifty-two pregnant women were included, with a confirmed COVID-19 diagnosis rate of 82.7%. In total, 336 eligible still images and 115 videoclips were included in the final analysis. The overall weighted Cohen's kappa values ranged from 0.706 to 0.912 for the 14 lung zones. There were only seven instances of major disagreement (>1 point) in the evaluation of 14 lung zones of 52 patients (n=728). The overall agreement between the radiologist and obstetricians for the still images (Krippendorff's α=0.856; 95% confidence interval [CI], 0.797 to 0.915) and videoclips (Krippendorff's α=0.785; 95% CI, 0.709 to 0.861) was good. CONCLUSION:The interobserver agreement between obstetricians with different levels of experience on still images and videoclips of LUS was good. Following a brief theoretical course, obstetricians' performance of LUS in pregnant women and interpretation of pre-acquired LUS images can be considered consistent.
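Weighted Cohen's kappa for ordinal scores such as the LUS zone ratings can be computed with scikit-learn. A minimal sketch follows; the observer scores are hypothetical stand-ins, and the choice of quadratic weights is an assumption, as the abstract does not state the weighting scheme.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal LUS scores (0-3) from two observers for the same zones.
obs1 = [0, 1, 2, 3, 1, 0, 2, 2, 3, 1]
obs2 = [0, 1, 2, 2, 1, 0, 3, 2, 3, 0]

# Quadratic weights penalize large disagreements more than adjacent ones,
# which suits ordinal severity scores; 'linear' weights are the alternative.
kappa_w = cohen_kappa_score(obs1, obs2, weights="quadratic")
print(round(kappa_w, 3))
```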
Project description:We study the cohesion within and the coalitions between political groups in the Eighth European Parliament (2014-2019) by analyzing two entirely different aspects of the behavior of the Members of the European Parliament (MEPs) in the policy-making processes. On one hand, we analyze their co-voting patterns and, on the other, their retweeting behavior. We make use of two diverse datasets in the analysis. The first one is the roll-call vote dataset, where cohesion is regarded as the tendency to co-vote within a group, and a coalition is formed when the members of several groups exhibit a high degree of co-voting agreement on a subject. The second dataset comes from Twitter; it captures the retweeting (i.e., endorsing) behavior of the MEPs and implies cohesion (retweets within the same group) and coalitions (retweets between groups) from a completely different perspective. We employ two different methodologies to analyze the cohesion and coalitions. The first one is based on Krippendorff's Alpha reliability, used to measure the agreement between raters in data-analysis scenarios, and the second one is based on Exponential Random Graph Models, often used in social-network analysis. We give general insights into the cohesion of political groups in the European Parliament, explore whether coalitions are formed in the same way for different policy areas, and examine to what degree the retweeting behavior of MEPs corresponds to their co-voting patterns. A novel and interesting aspect of our work is the relationship between the co-voting and retweeting patterns.
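Krippendorff's alpha accommodates many raters and missing values, which suits roll-call data where not every MEP votes on every document. A minimal sketch follows, assuming the third-party krippendorff Python package (pip install krippendorff); the vote matrix is a toy illustration, not the roll-call dataset.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Toy reliability data: rows are MEPs (treated as raters), columns are
# roll-call votes (units); 1 = yes, 0 = no, np.nan = did not vote.
votes = np.array([
    [1, 0, 1, 1, np.nan],
    [1, 0, 1, 0, 1],
    [1, np.nan, 1, 1, 1],
])

alpha = krippendorff.alpha(reliability_data=votes,
                           level_of_measurement="nominal")
print(round(alpha, 3))
```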
Project description:The Bland-Altman plot is the most common method to analyze and visualize agreement between raters or methods for quantitative outcomes in health research. While very useful, the classical Bland-Altman plot is limited to studies with exactly two raters. We propose an extension of the Bland-Altman plot suitable for more than two raters and derive the approximate limits of agreement with 95% confidence intervals. We validated the suggested limits of agreement in a simulation study. Moreover, we offer suggestions on how to present bias, heterogeneity among raters, and the uncertainty of the limits of agreement. The resulting plot can be used to investigate and present agreement in studies with more than two raters.
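One natural way to generalize the plot, sketched below as an illustration of the general idea rather than the authors' exact derivation, is to plot each rating's deviation from its subject's mean across all raters and draw an approximate ±1.96 SD band; the data here are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurements: rows = subjects, columns = raters (not real data).
x = np.array([[10.1, 10.4,  9.8],
              [12.0, 11.7, 12.3],
              [ 8.9,  9.2,  9.0],
              [11.2, 11.5, 10.9]])
n, k = x.shape

subj_mean = x.mean(axis=1, keepdims=True)   # per-subject mean over all raters
dev = x - subj_mean                         # each rating's deviation from that mean

# Approximate 95% band for the deviations; the paper's approach additionally
# derives confidence intervals for these limits.
sd_dev = dev.ravel().std(ddof=1)
loa = 1.96 * sd_dev

plt.scatter(np.repeat(subj_mean.ravel(), k), dev.ravel())
plt.axhline(0.0, linestyle="-")
plt.axhline(loa, linestyle="--")
plt.axhline(-loa, linestyle="--")
plt.xlabel("Subject mean over all raters")
plt.ylabel("Deviation from subject mean")
plt.show()
```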
Project description:Objective:To explore the impact of Blackboard (Bb) formative assessment on the final score in the endocrine module and determine medical students' perception of the impact and effectiveness of Bb. Methods:This exploratory case study was carried out at King Abdulaziz University (KAU), Jeddah, Saudi Arabia (SA). Blackboard was used for course management and formative assessment of third-year medical students, and three years of data were collected (2016, 2017, 2019). In the last week of the module, before the final exam, a formative assessment test comprising 50 Multiple Choice Questions (MCQs) was posted on Bb each year. All the students filled out a questionnaire on their perception of the impact and effectiveness of Bb. Results:Overall, summative exam scores were significantly higher than formative assessment scores (p <0.001). A substantial positive correlation was observed between students' marks in the online (Bb) MCQ exam and their final exam marks (p <0.001). The Bb features most often used by the students were the course resources uploaded on Bb, assignments, online quizzes, and others. The majority of the students were satisfied with the use of Bb in this module. Conclusions:The majority of the students liked this blended learning (BL) method and acknowledged the impact and effectiveness of Bb. The formative online assessment on Bb improved the students' performance in the final exam, and a positive correlation was noted between students' marks in the online (Bb) exams and their final exam marks.
Project description:Objectives:Algorithm-based exposure assessments based on patterns in questionnaire responses and professional judgment can readily apply transparent exposure decision rules to thousands of jobs quickly. However, we need to better understand how algorithms compare to a one-by-one job review by an exposure assessor. We compared algorithm-based estimates of diesel exhaust exposure to those of three independent raters within the New England Bladder Cancer Study, a population-based case-control study, and identified conditions under which disparities occurred in the assessments of the algorithm and the raters. Methods:Occupational diesel exhaust exposure was assessed previously using an algorithm and a single rater for all 14 983 jobs reported by 2631 study participants during personal interviews conducted from 2001 to 2004. Two additional raters independently assessed a random subset of 324 jobs that were selected based on strata defined by the cross-tabulations of the algorithm and the first rater's probability assessments for each job, oversampling their disagreements. The algorithm and each rater assessed the probability, intensity and frequency of occupational diesel exhaust exposure, as well as a confidence rating for each metric. Agreement among the raters, their aggregate rating (the average of the three raters' ratings) and the algorithm was evaluated using the proportion of agreement, kappa and weighted kappa (κw). Agreement analyses on the subset used inverse probability weighting to extrapolate from the subset and estimate agreement for all jobs. Classification and Regression Tree (CART) models were used to identify patterns in questionnaire responses that predicted disparities in exposure status (i.e., unexposed versus exposed) between the first rater and the algorithm-based estimates. Results:For the probability, intensity and frequency exposure metrics, moderate to moderately high agreement was observed among raters (κw = 0.50-0.76) and between the algorithm and the individual raters (κw = 0.58-0.81). For these metrics, the algorithm estimates had consistently higher agreement with the aggregate rating (κw = 0.82) than with the individual raters. For all metrics, the agreement between the algorithm and the aggregate ratings was highest for the unexposed category (90-93%) and was poor to moderate for the exposed categories (9-64%). Lower agreement was observed for jobs with a start year <1965 versus ≥1965. For the confidence metrics, the agreement was poor to moderate among raters (κw = 0.17-0.45) and between the algorithm and the individual raters (κw = 0.24-0.61). CART models identified patterns in the questionnaire responses that predicted a fair-to-moderate (33-89%) proportion of the disagreements between the raters' and the algorithm estimates. Discussion:The agreement between any two raters was similar to the agreement between an algorithm-based approach and individual raters, providing additional support for using the more efficient and transparent algorithm-based approach. CART models identified some patterns in disagreements between the first rater and the algorithm. Given the absence of a gold standard for estimating exposure, these patterns can be reviewed by a team of exposure assessors to determine whether the algorithm should be revised for future studies.
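CART analyses of this kind can be reproduced with scikit-learn's decision tree classifier. A minimal sketch follows; the questionnaire-derived features and the disagreement outcome are hypothetical placeholders, not the study variables.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical questionnaire-derived features per job (placeholders):
# [job started after 1965 (0/1), self-reported diesel equipment use (0/1), duration].
X = rng.integers(0, 2, size=(200, 3)).astype(float)
X[:, 2] = rng.uniform(1, 30, size=200)          # job duration in years

# Hypothetical outcome: 1 = rater and algorithm disagreed on exposure status.
y = ((X[:, 0] == 0) & (X[:, 1] == 1)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
tree.fit(X, y)

# Inspect the learned splits, i.e., the response patterns that predict disagreement.
print(export_text(tree, feature_names=["post1965", "diesel_equipment", "duration"]))
```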
Project description:Gait disturbance is a major symptom of idiopathic normal-pressure hydrocephalus (iNPH) and is assessed by raters of different professions or with different degrees of experience. Agreement studies are usually done with two or more raters, and comparisons among multiple groups of raters are rare. In this study, we aimed to examine the agreement among multiple groups of raters on gait patterns and a grading scale through a video-assisted gait analysis in patients with iNPH. Fifteen participants with iNPH were enrolled. Gait was assessed according to seven patterns, including freezing and wide-based gaits. The levels of severity (evident, mild, none) were rated by three groups of raters (two neurosurgeons [DR2], three experienced physiotherapists [PTe3], and two less experienced physiotherapists [PTl2]) through a simultaneous video viewing session. Severity of gait disturbance (GSg) was rated using the Japanese iNPH grading scale (iNPHGS), and Krippendorff's alpha was computed to assess agreement, with alpha ≥0.667 indicating good agreement and alpha ≥0.8 indicating excellent agreement. For group comparisons, 84% rather than 95% confidence intervals were applied. Among the seven gait patterns in the first assessment, excellent agreement was observed for wide-based and short-stepped gaits only in DR2. Good agreement was observed for four patterns, but agreement in two groups was seen only for shuffling and wide-based gaits. No gait pattern showed good agreement across all three groups. In the second assessment, excellent agreement was observed for three patterns, but no gait pattern showed good agreement in two or more groups. A learning effect was observed only for standing difficulty in DR2. In contrast, good or nearly good agreement on GSg was observed among the three groups, with excellent agreement in two groups. Agreement on gait patterns among the three groups of raters was not high, but agreement on the iNPHGS was high, indicating the importance of precise descriptions facilitating differentiation between neighboring grades.
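The group comparisons via 84% confidence intervals (chosen because non-overlap of two 84% intervals roughly corresponds to a test at the 5% level) can be approximated by bootstrapping alpha over participants. A minimal sketch follows, again assuming the third-party krippendorff package; the ordinal ratings are toy values, not the study data.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

rng = np.random.default_rng(1)

# Toy ratings for one rater group: rows = raters, columns = participants;
# ordinal severity 0 = none, 1 = mild, 2 = evident (not study data).
ratings = np.array([[2, 1, 0, 2, 1, 1, 0, 2, 2, 1, 0, 1, 2, 0, 1],
                    [2, 1, 0, 1, 1, 1, 0, 2, 2, 1, 0, 1, 2, 1, 1]])

boot = []
for _ in range(2000):
    cols = rng.integers(0, ratings.shape[1], ratings.shape[1])  # resample participants
    boot.append(krippendorff.alpha(reliability_data=ratings[:, cols],
                                   level_of_measurement="ordinal"))
lo, hi = np.percentile(boot, [8, 92])            # central 84% interval
print(f"alpha 84% CI: [{lo:.3f}, {hi:.3f}]")
```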
Project description:Purpose:First, to evaluate inter-rater reliability when human raters estimate the reading performance of visually impaired individuals using the MNREAD acuity chart. Second, to evaluate the agreement between computer-based scoring algorithms and compare them with human rating. Methods:Reading performance was measured for 101 individuals with low vision, using the Portuguese version of the MNREAD test. Seven raters estimated the maximum reading speed (MRS) and critical print size (CPS) of each individual MNREAD curve. MRS and CPS were also calculated automatically for each curve using two different algorithms: the original standard deviation method (SDev) and a non-linear mixed effects (NLME) modeling. Intra-class correlation coefficients (ICC) were used to estimate absolute agreement between raters and/or algorithms. Results:Absolute agreement between raters was 'excellent' for MRS (ICC = 0.97; 95%CI [0.96, 0.98]) and 'moderate' to 'good' for CPS (ICC = 0.77; 95%CI [0.69, 0.83]). For CPS, inter-rater reliability was poorer among less experienced raters (ICC = 0.70; 95%CI [0.57, 0.80]) when compared to experienced ones (ICC = 0.82; 95%CI [0.76, 0.88]). Absolute agreement between the two algorithms was 'excellent' for MRS (ICC = 0.96; 95%CI [0.91, 0.98]). For CPS, the best possible agreement was found for CPS defined as the print size sustaining 80% of MRS (ICC = 0.77; 95%CI [0.68, 0.84]). Absolute agreement between raters and automated methods was 'excellent' for MRS (ICC = 0.96; 95% CI [0.88, 0.98] for SDev; ICC = 0.97; 95% CI [0.95, 0.98] for NLME). For CPS, absolute agreement between raters and SDev ranged from 'poor' to 'good' (ICC = 0.66; 95% CI [0.3, 0.80]), while agreement between raters and NLME was 'good' (ICC = 0.83; 95% CI [0.76, 0.88]). Conclusion:For MRS, inter-rater reliability is excellent, even considering the possibility of noisy and/or incomplete data collected in low-vision individuals. For CPS, inter-rater reliability is lower. This may be problematic, for instance in the context of multisite investigations or follow-up examinations. The NLME method showed better agreement with the raters than the SDev method for both reading parameters. Setting up consensual guidelines to deal with ambiguous curves may help improve reliability. While the exact definition of CPS should be chosen on a case-by-case basis depending on the clinician or researcher's motivations, evidence suggests that estimating CPS as the smallest print size sustaining about 80% of MRS would increase inter-rater reliability.
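The recommended operational definition of CPS (the smallest print size sustaining about 80% of MRS) is straightforward to automate. A minimal sketch follows; the curve is a toy MNREAD-style example, not study data, and MRS is taken here simply as the plateau maximum rather than via the SDev or NLME procedures.

```python
import numpy as np

# Toy MNREAD-style data: print size in logMAR (large to small) and
# reading speed in words per minute (illustrative, not study data).
print_size = np.array([1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
speed = np.array([150, 152, 149, 151, 148, 140, 120, 90, 55, 20])

mrs = speed.max()                    # crude plateau estimate of maximum reading speed
threshold = 0.8 * mrs                # 80%-of-MRS criterion

# CPS: smallest print size whose reading speed still meets the threshold.
sustained = print_size[speed >= threshold]
cps = sustained.min()
print(f"MRS = {mrs:.0f} wpm, CPS = {cps:.1f} logMAR")
```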