Abstract
Background: Accurate and unbiased assessment of tumor response to treatment is essential in cancer clinical trials. The Response Evaluation Criteria in Solid Tumors (RECIST) is the standard for evaluating tumor response, and assessments are commonly performed both by local site investigators and by blinded independent central reviewers. While RECIST has been widely adopted, questions remain about its reliability and the consistency of tumor response classification across raters. Site investigators in particular may exhibit bias in tumor measurement or identification compared with blinded central reviewers. Previous studies have examined inter-rater reliability individually or compared site investigator and central reviewer assessments, but gaps remain in meta-analyzing reliability, in assessing discrepancies between site investigators and central reviewers, and in understanding the impact of RECIST’s threshold definitions on classification stability.
Purpose: This study aims to address these gaps through three main analyses. First, we conduct a general meta-analysis of inter-rater reliability for RECIST to establish an overall estimate of agreement. Second, we analyze discrepancies between site investigator and central reviewer assessments of tumor response, building on existing literature and providing new data from three cancer clinical trials. Third, we perform a sensitivity analysis to evaluate how changes in the disease response and progression thresholds affect the classification of tumor response. Together, these analyses provide a comprehensive assessment of RECIST reliability and its implications for clinical trial outcomes.
Methods: First, a systematic review of the literature was conducted to assess the inter-rater reliability (IRR) of RECIST in terms of Cohen’s and Fleiss’ kappa statistics. Second, a retrospective analysis of three cancer clinical trials was performed, comparing site investigator and central reviewer assessments of tumor response. Specifically, the trial endpoints of time to progression, time to response, and duration of response were analyzed for differences in hazard ratios between site investigators and central reviewers, and these within-study differences were synthesized in meta-analyses. Both traditional null-hypothesis significance testing and equivalence testing on a hazard ratio range of \([0.80, 1.25]\) were performed to assess the significance of differences (or equivalence) in hazard ratios. Differences between raters in objective response rate were also analyzed using Cochran’s Q and McNemar’s tests. Third, a sensitivity analysis was conducted to evaluate how changes in RECIST’s disease response and progression thresholds affect the classification of tumor response; different threshold definitions were simulated and their impact on the aforementioned IRR and trial endpoint analyses was assessed.
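As a brief illustrative sketch (not part of the original methods text, and assuming the rater comparison is summarized as a difference \(\hat{\theta}\) in log hazard ratios with standard error \(\mathrm{SE}(\hat{\theta})\) under a normal approximation), the equivalence test on \([0.80, 1.25]\) can be read as two one-sided tests (TOST): equivalence is concluded at level \(\alpha\) if
\[
z_L = \frac{\hat{\theta} - \log(0.80)}{\mathrm{SE}(\hat{\theta})} > z_{1-\alpha}
\quad\text{and}\quad
z_U = \frac{\hat{\theta} - \log(1.25)}{\mathrm{SE}(\hat{\theta})} < -z_{1-\alpha},
\]
which is equivalent to the \(100(1-2\alpha)\%\) confidence interval for the between-rater hazard ratio ratio lying entirely within \((0.80, 1.25)\).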
Results: The meta-analysis of inter-rater reliability for RECIST revealed substantial agreement across studies, with a pooled kappa coefficient of 0.66. Four of the included studies were drawn directly from clinical trials, and these clustered around a lower kappa value of approximately 0.45. The analysis of discrepancies between site investigator and central reviewer assessments showed some significant differences in hazard ratios within individual studies, but no systematic differences overall. Moreover, equivalence between site investigators and central reviewers was demonstrated for the time to progression and time to response endpoints. The sensitivity analyses demonstrated that changes in RECIST’s threshold definitions do not significantly affect the differences between site investigators and central reviewers in tumor response classification.
Conclusion: The IRR analyses show substantial overall agreement between raters, although agreement may be lower in the real-world setting of clinical trials. The follow-up analyses of trial endpoints do not point to systematic differences between site investigators and central reviewers, although individual studies may still show differences. These findings are consistent with previous literature showing that RECIST is overall a reliable tool for assessing tumor response, but they also highlight that individual studies likely benefit from the inclusion of blinded central reviewers. The sensitivity analyses further confirm that RECIST’s arbitrary threshold definitions do not significantly impact classification stability, suggesting that RECIST remains a robust tool for evaluating tumor response in clinical trials.