Robust Neural Automated Essay Scoring Using Item Response Theory
ABSTRACT: Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to human grading. Conventional AES methods typically rely on manually tuned features, which are laborious to develop effectively. To obviate the need for feature engineering, many deep neural network (DNN)-based AES models have been proposed and have achieved state-of-the-art accuracy. DNN-AES models require training on a large dataset of graded essays. However, the grades assigned in such datasets are known to be strongly biased by rater effects when each essay is graded by only a few raters from a larger rater pool. The performance of DNN models drops rapidly when such biased data are used for model training. In the fields of educational and psychological measurement, item response theory (IRT) models that can estimate essay scores while accounting for rater characteristics have recently been proposed. This study therefore proposes a new DNN-AES framework that integrates IRT models to deal with rater bias within training data. To our knowledge, this is the first attempt to address rater bias effects in training data, a crucial but overlooked problem.
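The rater-aware IRT component can be illustrated with a many-facet Rasch model, a common choice for rater-mediated scoring (the paper's exact model may differ). Below is a minimal pure-Python sketch; all parameter values and the function name are invented for illustration.

```python
# Sketch (assumption, not necessarily the paper's exact model) of a
# many-facet Rasch model (MFRM): probability that rater r assigns score
# category k to an essay with latent quality theta, given rater severity
# beta_r and step difficulties d_1..d_K between adjacent categories.
import math

def mfrm_probs(theta, beta_r, steps):
    """Return [P(score = 0), ..., P(score = K)] under an MFRM.

    theta  : latent essay quality
    beta_r : severity of rater r (higher -> harsher scoring)
    steps  : step difficulties d_1..d_K
    """
    # Cumulative adjacent-category logits; category 0 has logit 0.
    logits = [0.0]
    total = 0.0
    for d in steps:
        total += theta - beta_r - d
        logits.append(total)
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

# A severe rater (beta_r = 1.0) shifts probability mass toward lower
# scores than a lenient rater (beta_r = -1.0) for the same essay.
severe = mfrm_probs(theta=0.0, beta_r=1.0, steps=[-0.5, 0.0, 0.5])
lenient = mfrm_probs(theta=0.0, beta_r=-1.0, steps=[-0.5, 0.0, 0.5])
```

Estimating theta jointly with the rater parameters is what lets such a framework separate an essay's true quality from the severity of whichever raters happened to grade it.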
Project description: We propose a novel approach to modelling rater effects in scoring-based assessment. The approach is based on a Bayesian hierarchical model and simulations from the posterior distribution. We apply it to large-scale essay assessment data spanning a period of 5 years. Empirical results suggest that the model provides a good fit both for total scores and when applied to individual rubrics. We estimate the median impact of rater effects on the final grade to be ±2 points on a 50-point scale, while 10% of essays would receive a score differing by at least ±5 points from their actual quality. Most of the impact is due to rater unreliability, not rater bias.
Project description: Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable information about model performance and behavior. Traditionally, IRT models are learned using human response pattern (RP) data, presenting a significant bottleneck for large data sets like those required for training deep neural networks (DNNs). In this work we propose learning IRT models using RPs generated from artificial crowds of DNN models. We demonstrate the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. Parameters learned from human and machine RPs for natural language inference and sentiment analysis exhibit medium to large positive correlations. We demonstrate a use-case for latent difficulty item parameters, namely training set filtering, and show that using difficulty to sample training data outperforms baseline methods. Finally, we highlight cases where human expectation about item difficulty does not match difficulty as estimated from the machine RPs.
Project description: How to effectively evaluate students' essays against a series of relatively objective writing criteria has long been a topic of discussion. With the development of automatic essay scoring, a key question is whether writing quality can be evaluated systematically according to a scoring rubric. To address this issue, we used an innovative set of graph-based features to predict the quality of Chinese middle school students' essays. These features are divided into four sub-dimensions: basic characteristics, main idea, essay content, and essay development. The results show that the graph-based features were significantly better at predicting human essay scores than the baseline features. This indicates that graph-based features can be used to reliably and systematically evaluate essay quality against a scoring rubric, and that they can serve as an alternative tool to replace or supplement human evaluation.
Project description: In various assessment contexts, including entrance examinations, educational assessments, and personnel appraisal, performance assessment by raters has attracted much attention as a way to measure examinees' higher-order abilities. However, a persistent difficulty is that ability measurement accuracy depends strongly on rater and task characteristics. To resolve this shortcoming, various item response theory (IRT) models that incorporate rater and task characteristic parameters have been proposed. However, because many models with different rater and task parameters exist, it is difficult to understand each model's features. Therefore, this study presents empirical comparisons of IRT models. Specifically, after reviewing and summarizing the features of existing models, we compare their performance through simulation and actual-data experiments.
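A simulation comparison of this kind can be sketched as follows: generate dichotomous ratings under a Rasch model extended with a rater-severity parameter, P(X = 1) = sigmoid(theta - b - beta_r), and observe how severity alone shifts raw scores. This is an illustrative sketch, not the study's actual design; all names and values are invented.

```python
# Illustrative simulation (invented values, not the study's design):
# dichotomous ratings under a Rasch model with a rater-severity term.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_scores(thetas, b, beta_r, rng):
    """One 0/1 rating per examinee from a rater with severity beta_r."""
    return [1 if rng.random() < sigmoid(t - b - beta_r) else 0
            for t in thetas]

rng = random.Random(0)
# The SAME examinees are rated by a lenient and a severe rater.
thetas = [rng.gauss(0.0, 1.0) for _ in range(5000)]
lenient = simulate_scores(thetas, b=0.0, beta_r=-1.0, rng=rng)
severe = simulate_scores(thetas, b=0.0, beta_r=1.0, rng=rng)

# Raw pass rates diverge purely because of rater severity, which is the
# confound that rater-parameter IRT models are designed to absorb.
rate_lenient = sum(lenient) / len(lenient)
rate_severe = sum(severe) / len(severe)
```

A model that ignores the severity parameter would attribute this gap to examinee ability, biasing the ability estimates; models that include it can recover beta_r and correct the scores.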
Project description: Some large-scale tests require examinees to select and answer a fixed number of items from a given set (e.g., select one of three items). Usually, these are constructed-response items that are marked by human raters. In this examinee-selected item (ESI) design, some examinees may benefit more than others by choosing easier items to answer, so the missing data induced by the design become missing not at random (MNAR). Although item response theory (IRT) models have recently been developed to account for MNAR data in the ESI design, they do not consider the rater effect; thus, their utility is seriously restricted. In this study, two methods are developed: the first is a new IRT model that accounts for both MNAR data and rater severity simultaneously, and the second adapts conditional maximum likelihood estimation and pairwise estimation methods to the ESI design with the rater effect. A series of simulations was then conducted to compare their performance with that of conventional IRT models that ignore MNAR data or rater severity. The results indicated good parameter recovery for the new model. The conditional maximum likelihood and pairwise estimation methods were applicable when the Rasch models fit the data, but the conventional IRT models yielded biased parameter estimates. An empirical example illustrates these new methods.
Project description: INTRODUCTION: Social media is a novel medium for hosting reflective writing (RW) essays, yet its impact on the depth of students' reflection is unknown. Shifting reflection onto social platforms offers opportunities for students to engage with their community, yet may leave them feeling vulnerable and less willing to reflect deeply. Using sociomateriality as a conceptual framework, we aimed to compare the depth of reflection in RW samples submitted by medical students in a traditional private essay format to those posted on a secure social media platform. METHODS: Fourth-year medical students submitted an RW essay as part of their emergency medicine clerkship, either in a private essay format (academic year [AY] 2015) or on a closed, password-protected social media website (AY 2016). Five raters used the Reflection Evaluation for Learners' Enhanced Competencies Tool (REFLECT) to score 122 de-identified RW samples (55 private, 67 social media). Average scores on the two platforms were compared. Students were also surveyed regarding their comfort with the social media experience. RESULTS: There were no differences in average composite REFLECT scores between the private essay (14.1, 95% confidence interval [CI], 12.0-16.2) and social media (13.7, 95% CI, 11.4-16.0) submission formats (t [1,120] = 0.94, p = 0.35). Of the 73% of students who responded to the survey, 72% reported feeling comfortable sharing their personal reflections with peers, and 84% felt comfortable commenting on peers' writing. CONCLUSION: Students generally felt comfortable using social media for shared reflection. The depth of reflection in RW essays was similar between the private and social media submission formats.
Project description: Background The main objective of this study is the development of a short, reliable, easy-to-use assessment tool for providing feedback on the reflective writings of medical students and residents. Methods This study took place in a major tertiary academic medical center in Beirut, Lebanon. Seventy-seven reflective essays written by 18 residents in the department of Family Medicine at the American University of Beirut Medical Center (AUBMC) were graded by 3 raters using the newly developed scale to assess the scale's reliability. Following a comprehensive search and analysis of the literature, and based on their experience in reflective grading, the authors developed a concise 9-item scale to grade reflective essays through repeated cycles of development and analysis, as well as determination of the inter-rater reliability (IRR) using intra-class correlation coefficients (ICC) and Krippendorff's Alpha. Results The inter-rater reliability of the new scale ranges from moderate to substantial, with an ICC of 0.78 (95% CI 0.64–0.86, p < 0.01); Krippendorff's Alpha was 0.49. Conclusions The newly developed scale, GRE-9, is a short, concise, easy-to-use, reliable grading tool for reflective essays that has demonstrated moderate to substantial inter-rater reliability. This will enable raters to objectively grade reflective essays and provide informed feedback to residents and students.
Project description: We investigated some of the key features of effective active learning by comparing the outcomes of three different methods of implementing active-learning exercises in a majors introductory biology course. Students completed activities in one of three treatments: discussion, writing, and discussion + writing. Treatments were rotated weekly among three sections taught by three different instructors in a full factorial design. The data set was analyzed with generalized linear mixed-effect models with three independent variables (student aptitude, treatment, and instructor) and three dependent (assessment) variables: change in score on pre- and postactivity clicker questions, and coding scores on in-class writing and exam essays. All independent variables had significant effects on student performance for at least one of the dependent variables. Students with higher aptitude scored higher on all assessments. Student scores were higher on exam essay questions when the activity was implemented with a writing component compared with peer discussion only. There was a significant effect of instructor, with instructors showing different degrees of effectiveness with active-learning techniques. We suggest that individual writing should be implemented as part of active learning whenever possible and that instructors may need training and practice to become effective with active learning.
Project description: BACKGROUND: Skin fibrosis is the clinical hallmark of systemic sclerosis (SSc), where collagen deposition and remodeling of the dermis occur over time. The most widely used outcome measure in SSc clinical trials is the modified Rodnan skin score (mRSS), which is a semi-quantitative assessment of skin stiffness at seventeen body sites. However, the mRSS is confounded by obesity, edema, and high inter-rater variability. In order to develop a new histopathological outcome measure for SSc, we applied a computer vision technology called a deep neural network (DNN) to stained sections of SSc skin. We tested the hypotheses that DNN analysis could reliably assess mRSS and discriminate SSc from normal skin. METHODS: We analyzed biopsies from two independent (primary and secondary) cohorts. One investigator performed mRSS assessments and forearm biopsies, and trichrome-stained biopsy sections were photomicrographed. We used the AlexNet DNN to generate a numerical signature of 4096 quantitative image features (QIFs) for 100 randomly selected dermal image patches per biopsy. In the primary cohort, we used principal components analysis (PCA) to summarize the QIFs into a Biopsy Score for comparison with mRSS. In the secondary cohort, using QIF signatures as the input, we fit a logistic regression model to discriminate between SSc vs. control biopsies, and a linear regression model to estimate mRSS, yielding Diagnostic Scores and Fibrosis Scores, respectively. We determined the correlation between Fibrosis Scores and the published Scleroderma Skin Severity Score (4S), and between Fibrosis Scores and longitudinal changes in mRSS on a per-patient basis. RESULTS: In the primary cohort (n = 6, 26 SSc biopsies), Biopsy Scores significantly correlated with mRSS (R = 0.55, p = 0.01).
In the secondary cohort (n = 60 SSc and 16 controls, 164 biopsies; divided into 70% training and 30% test sets), the Diagnostic Score was significantly associated with SSc status (misclassification rate = 1.9% [training], 6.6% [test]), and the Fibrosis Score significantly correlated with mRSS (R = 0.70 [training], 0.55 [test]). The DNN-derived Fibrosis Score significantly correlated with 4S (R = 0.69, p = 3 × 10^-17). CONCLUSIONS: DNN analysis of SSc biopsies is an unbiased, quantitative, and reproducible outcome that is associated with validated SSc outcomes.
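The secondary-cohort classifier can be sketched as a logistic regression fit by gradient descent. This is a minimal illustration with invented two-dimensional toy features standing in for the 4096-dimensional QIF signatures; the function names and data are ours, not the paper's code.

```python
# Toy logistic-regression sketch (invented features, not the AlexNet QIFs):
# separate two groups of feature vectors, mirroring the Diagnostic Score.
import math
import random

def fit_logreg(X, y, lr=0.1, epochs=1000):
    """Plain batch gradient descent on the logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # p - y
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gwj / n for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(1)
# Toy "signatures": controls cluster near 0, cases cluster near 1.
X = ([[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(50)]
     + [[rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(50)])
y = [0] * 50 + [1] * 50
w, b = fit_logreg(X, y)
acc = sum((predict(w, b, xi) > 0.5) == bool(yi)
          for xi, yi in zip(X, y)) / len(y)
```

In the paper's setting the fitted probability plays the role of the Diagnostic Score, and accuracy would of course be reported on a held-out test split rather than the training data shown here.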
Project description: Long non-coding RNA (lncRNA) plays a key role in various disorders, but its role in keloid is still unclear. We explored differentially expressed (DE) lncRNAs and mRNAs between keloid tissues (KTs) and normal tissues (NTs), as well as between keloid fibroblasts (KFBs) and normal fibroblasts (NFBs). We used KTs and NTs from the chest of 5 patients, and 3 pairs of KFBs and NFBs, to perform microarray analyses. Gene ontology and pathway analyses were conducted with the online software DAVID (Database for Annotation, Visualization and Integrated Discovery). Validation of the targeted lncRNAs was conducted by qRT-PCR in an enlarged sample (79 KTs and 21 NTs). We identified 3680 DE-lncRNAs in the tissue assay and 1231 DE-lncRNAs in the cell assay. Furthermore, we found that many lncRNAs and their related mRNAs were regulated simultaneously in keloid. By comparing the lncRNA screening results between the tissue assay and the cell assay, we identified ENST00000439703 and uc003jox.1 as up-regulated in both assays; these results were confirmed by qRT-PCR in the enlarged sample. Our study demonstrates that numerous lncRNAs are involved in the pathogenesis and development of keloid.