Dataset Information

Reproducibility of radiomics quality score: an intra- and inter-rater reliability study.

ABSTRACT:

Objectives

To investigate the intra- and inter-rater reliability of the total radiomics quality score (RQS) and the reproducibility of individual RQS items' score in a large multireader study.

Methods

Nine raters with different backgrounds were randomly assigned to three groups based on their proficiency with RQS utilization: Groups 1 and 2 represented the inter-rater reliability groups with or without prior training in RQS, respectively; group 3 represented the intra-rater reliability group. Thirty-three original research papers on radiomics were evaluated by raters of groups 1 and 2. Of the 33 papers, 17 were evaluated twice with an interval of 1 month by raters of group 3. The intraclass correlation coefficient (ICC) was used for continuous variables, and Fleiss' and Cohen's kappa (κ) statistics were used for categorical variables.
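The two kappa statistics named in the Methods can be sketched in a few lines of Python. This is only an illustration of the standard textbook formulas, not the authors' analysis code; the function names and the example ratings are hypothetical, and the abstract does not say whether weighted variants were used.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(count_a[c] * count_b[c] for c in set(rater_a) | set(rater_b)) / (n * n)
    if p_exp == 1.0:  # both raters used a single identical category
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] lists the categories all raters gave item i.

    Every item must be scored by the same number of raters (at least two).
    """
    n_items, n_raters = len(ratings), len(ratings[0])
    categories = sorted({c for row in ratings for c in row})
    # counts[i][j] = number of raters who put item i in category j.
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Overall proportion of assignments falling in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(len(categories))]
    # Per-item agreement, then its mean over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical binary scores for one RQS item across five papers.
first_read = [1, 0, 1, 1, 0]
second_read = [1, 0, 0, 1, 0]
print(cohen_kappa(first_read, second_read))

# Hypothetical scores from three raters for the same item across four papers.
three_raters = [[1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 1]]
print(fleiss_kappa(three_raters))
```

Cohen's kappa fits the intra-rater setting (the same rater reading twice), while Fleiss' kappa extends the idea to more than two raters, as in the inter-rater groups.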

Results

The inter-rater reliability was poor to moderate for total RQS (ICC 0.30-0.55, p < 0.001) and very low to good for individual items' reproducibility (κ −0.12 to 0.75) within groups 1 and 2 for both inexperienced and experienced raters. The intra-rater reliability for total RQS was moderate for the less experienced rater (ICC 0.522, p = 0.009), whereas experienced raters showed excellent intra-rater reliability (ICC 0.91-0.99, p < 0.001) between the first and second read. Intra-rater reliability on RQS items' score reproducibility was higher and most of the items had moderate to good intra-rater reliability (κ −0.40 to 1).
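For readers who want to see how an ICC for total scores is obtained, the following is a minimal sketch. The abstract does not state which ICC form was used; this example assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name and the scores are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores[i][j] is the total score that rater j gave to paper i.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between papers
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical total RQS values: four papers, each scored by three raters.
demo_scores = [
    [12, 14, 11],
    [20, 18, 21],
    [5, 9, 6],
    [15, 15, 13],
]
print(icc_2_1(demo_scores))
```

Because this form penalizes systematic offsets between raters as well as random disagreement, a rater who consistently scores higher than the others lowers the coefficient.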

Conclusions

Reproducibility of the total RQS and the score of individual RQS items is low. There is a need for a robust and reproducible assessment method to assess the quality of radiomics research.

Clinical relevance statement

There is a need for reproducible scoring systems to improve the quality of radiomics research and consequently close the translational gap between research and clinical implementation.

Key points

• Radiomics quality score has been widely used for the evaluation of radiomics studies.
• Although the intra-rater reliability was moderate to excellent, intra- and inter-rater reliability of total score and point-by-point scores were low with radiomics quality score.
• A robust, easy-to-use scoring system is needed for the evaluation of radiomics research.

SUBMITTER: Akinci D'Antonoli T

PROVIDER: S-EPMC10957586 | biostudies-literature | 2024 Apr

REPOSITORIES: biostudies-literature


Publications

Reproducibility of radiomics quality score: an intra- and inter-rater reliability study.

Akinci D'Antonoli T, Cavallo AU, Vernuccio F, Stanzione A, Klontzas ME, Cannella R, Ugga L, Baran A, Fanni SC, Petrash E, Ambrosini I, Cappellini LA, van Ooijen P, Kotter E, Pinto Dos Santos D, Cuocolo R

European Radiology, 2023-09-21, issue 4


PMID: 37733025

Similar Datasets

Project description: Objectives: The objectives of this study were to assess (1) inter-rater and intrarater reliability of ultrasound imaging in patients with hip osteoarthritis, and (2) agreement between ultrasound and X-ray findings of hip osteoarthritis using validated Outcome Measures in Rheumatology ultrasound definitions for pathology. Design: An inter-rater and intrarater reliability study. Setting: A single-centre study conducted at a regional hospital. Participants: 50 patients >39 years of age referred for radiography due to hip pain and suspected hip osteoarthritis were included. Exclusion criteria were previous hip surgery in the painful hip, suspected fracture or malignant changes in the hip. Intervention: Bilateral ultrasound examinations (n=92) were performed continuously by two experienced operators blinded to clinical information and other imaging findings. After 4-6 weeks, one operator reassessed the images. X-rays were assessed by a third imaging specialist. Primary and secondary outcome measures: Inter-rater and intrarater reliability and agreement between ultrasound imaging and X-ray were assessed using Cohen's ordinal kappa statistics for binary categorical variables and weighted kappa for ordered categorical variables. Results: Kappa values (κ) for inter-rater reliability were 0.9 and 0.8 for hip effusion/synovitis and osteoarthritis grading, respectively. For acetabular and femoral osteophytes, femoral cartilage changes and labrum changes κ ranged from 0.4 to 0.7. Intrarater reliability had κ equal to or higher than inter-rater reliability. Agreement between ultrasound and X-ray findings ranged from κ=0.2 to κ=0.5. Conclusion: This study demonstrated substantial to almost perfect reliability on the most common ultrasound findings related to hip osteoarthritis and osteoarthritis grading. Agreement on the grade of osteoarthritis between ultrasound and X-ray was moderate. Overall, these results support ultrasound imaging as a reliable tool in the assessment of hip osteoarthritis.

Project description: BACKGROUND: There is a growing trend in the use of mobile health (mHealth) technologies in traditional Chinese medicine (TCM) and telemedicine, especially during the coronavirus disease (COVID-19) outbreak. Tongue diagnosis is an important component of TCM, but also plays a role in Western medicine, for example in dermatology. However, the procedure of obtaining tongue images has not been standardized and the reliability of tongue diagnosis by smartphone tongue images has yet to be evaluated. OBJECTIVE: The first objective of this study was to develop an operating classification scheme for tongue coating diagnosis. The second and main objective of this study was to determine the intra-rater and inter-rater reliability of tongue coating diagnosis using the operating classification scheme. METHODS: An operating classification scheme for tongue coating was developed using a stepwise approach and a quasi-Delphi method. First, tongue images (n=2023) were analyzed by 2 groups of assessors to develop the operating classification scheme for tongue coating diagnosis. Based on clinicians' (n=17) own interpretations as well as their use of the operating classification scheme, the results of tongue diagnosis on a representative tongue image set (n=24) were compared. After gathering consensus for the operating classification scheme, the clinicians were instructed to use the scheme to assess tongue features of their patients under direct visual inspection. At the same time, the clinicians took tongue images of the patients with smartphones and assessed tongue features observed in the smartphone image using the same classification scheme. The intra-rater agreements of these two assessments were calculated to determine which features of tongue coating were better retained by the image. Using the finalized operating classification scheme, clinicians in the study group assessed representative tongue images (n=24) that they had taken, and the intra-rater and inter-rater reliability of their assessments was evaluated. RESULTS: Intra-rater agreement between direct subject inspection and tongue image inspection was good to very good (Cohen κ range 0.69-1.0). Additionally, when comparing the assessment of tongue images on different days, intra-rater reliability was good to very good (κ range 0.7-1.0), except for the color of the tongue body (κ=0.22) and slippery tongue fur (κ=0.1). Inter-rater reliability was moderate for tongue coating (Gwet AC2 range 0.49-0.55), and fair for color and other features of the tongue body (Gwet AC2=0.34). CONCLUSIONS: Taken together, our study has shown that tongue images collected via smartphone contain some reliable features, including tongue coating, that can be used in mHealth analysis. Our findings thus support the use of smartphones in telemedicine for detecting changes in tongue coating.
