5 Conclusion
The validity and reliability of clinical trial results in oncology fundamentally depend on accurate tumor measurement and consistent assessment of disease response and progression. Without standardized evaluation criteria, trial outcomes would be subject to substantial variability, potentially leading to erroneous conclusions about treatment efficacy. RECIST was developed precisely to address this need, providing a standardized framework for tumor assessment that has been widely adopted in clinical trials globally. Since its initial introduction in 2000 and subsequent refinement to RECIST 1.1 in 2009, this framework has become the cornerstone for evaluating therapeutic responses in solid tumor oncology trials (1).
Despite RECIST 1.1’s widespread adoption and critical importance in drug development, there has been surprisingly limited systematic research examining its fundamental reliability. Particularly notable is the absence of comprehensive meta-analyses synthesizing inter-rater reliability (IRR) data across multiple studies and contexts. This knowledge gap is significant because consistent measurement and interpretation between different raters, whether at the same institution or across different trial sites, is essential for ensuring that reported treatment effects reflect genuine biological responses rather than measurement inconsistencies or subjective interpretations.
That is not to say that no work has been done in this domain; indeed, previous research has made valuable contributions to our understanding of RECIST reliability, particularly within the context of clinical trials. For example, Zhang et al. (2) and, more recently, Jacobs et al. (3) conducted meta-analyses investigating differences between site investigators and central reviewers in several key clinical trial outcome measures. These studies provide important insights into potential discrepancies in tumor assessments, with the critical conclusion by Zhang et al. that “statistically inconsistent inferences could be made in many trials” depending on whether the site investigator or central reviewer assessments are used (2). However, these studies are limited in that they analyze only a constrained range of trial outcomes, and neither offers an analysis of equivalence between rater groups.
An additional critical gap in the literature concerns the empirical impact of RECIST’s defined threshold values. The 30% decrease threshold for partial response and the 20% increase threshold for progressive disease were established based on expert consensus rather than extensive empirical validation. To date, no comprehensive research has investigated whether these specific thresholds might systematically influence agreement between raters or whether alternative thresholds might yield more consistent assessments. This question is not merely academic, as even small changes in these threshold values could potentially affect trial outcomes and subsequent treatment approvals.
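For reference, these thresholds are defined on the percentage change in the sum of target-lesion diameters, with response measured against the baseline sum and progression against the smallest sum on study (the nadir). In the notation used here (introduced only for illustration), with \(S_t\) denoting the sum of target-lesion diameters at assessment \(t\), the thresholds correspond to
\[
\frac{S_t - S_{\text{baseline}}}{S_{\text{baseline}}} \le -0.30
\qquad \text{and} \qquad
\frac{S_t - S_{\text{nadir}}}{S_{\text{nadir}}} \ge +0.20,
\]
the latter additionally requiring an absolute increase of at least 5 mm under RECIST 1.1.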
This thesis addresses these important gaps through three complementary analytical approaches. First, it presents a comprehensive meta-analysis of IRR studies to establish an overall estimate of RECIST 1.1’s reliability across diverse contexts. Second, it performs a detailed comparison of site investigator versus central reviewer assessments across multiple clinical trial endpoints, including several not previously examined in the literature. Finally, it conducts novel sensitivity analyses exploring how variations in the disease response and progression threshold values might affect the classification consistency of tumor measurements between raters. Together, these analyses provide a more comprehensive evaluation of RECIST 1.1’s reliability than has previously been available, with important implications for the interpretation of clinical trial results and potential refinements to the RECIST framework.
Our meta-analysis yields a pooled Cohen’s and Fleiss’ \(\kappa\) estimate of 0.66 for the IRR of the RECIST 1.1 criteria. According to the widely accepted Landis and Koch interpretive scale (4), this value indicates substantial agreement, suggesting that RECIST 1.1 provides reasonably consistent tumor assessments across different evaluators. However, this level of agreement, while encouraging, still leaves room for improvement in measurement consistency, especially when considering how clearly defined the RECIST criteria are. Of particular concern is a notable pattern in our data: the four clinical trial studies included in our meta-analysis consistently show lower IRR values than studies conducted in non-clinical trial settings. This finding raises important questions about whether RECIST 1.1 reliability may be context-dependent, with potentially lower consistency in the high-stakes environment of clinical trials, where diverse specialists (oncologists and radiologists) with different training backgrounds may be evaluating the same images. With more data on RECIST reliability in clinical trial contexts, this observation could be investigated more deeply to determine whether specific factors, such as compounding discrepancies arising from serial assessments of the same patients, contribute to the observed variability.
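Both coefficients share the same chance-corrected form; for two raters, Cohen’s \(\kappa\) is
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]
where \(p_o\) is the observed proportion of assessments on which the raters agree and \(p_e\) is the proportion of agreement expected by chance from the raters’ marginal classification frequencies. Fleiss’ \(\kappa\) generalizes the same idea to more than two raters.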
From a methodological perspective, we also contribute to the statistical foundations of \(\kappa\) analysis by developing a more mathematically rigorous approach for transforming \(\kappa\) values onto the logit scale using the delta method. This methodological advance extends previous work that assumed \(\kappa\) values were bounded to the interval [0, 1], clarifying that the full theoretical range of \(\kappa\) is [-1, 1]. While most practical RECIST applications yield positive agreement values, our method provides a more statistically sound framework for meta-analytic comparisons that may encounter unusual cases of systematic disagreement.
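As a sketch of this approach, under one natural rescaling that respects the full [-1, 1] range (the exact parameterization used in the body of the thesis may differ in detail), mapping \(\kappa\) to \((\kappa + 1)/2 \in [0, 1]\) before applying the logit gives
\[
g(\kappa) = \operatorname{logit}\!\left(\frac{\kappa + 1}{2}\right) = \ln\frac{1 + \kappa}{1 - \kappa},
\qquad
\operatorname{Var}\!\left[g(\hat{\kappa})\right] \approx \left(\frac{2}{1 - \hat{\kappa}^{2}}\right)^{2} \operatorname{Var}(\hat{\kappa}),
\]
where the variance approximation follows from the delta method with \(g'(\kappa) = 2/(1 - \kappa^{2})\).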
To further examine RECIST reliability exclusively within clinical trial contexts, we conducted detailed comparisons between site investigator and central reviewer response assessments. Although there are detectable differences in ORR determinations in some cases, the direction and magnitude of these differences vary inconsistently across studies and may be an artefact of the small number of treatment responses observed within the trials. More tellingly, an analysis of pairwise IRR between site investigators and central reviewers reveals a broad range of \(\kappa\) values, indicating variability in agreement. Notably, in two of the three clinical trials we examined, the pairwise \(\kappa\) values between site investigators and central reviewers are lower than those between different central reviewers. This pattern suggests a systematic tendency for greater consistency among specialized central reviewers than between site investigators and central review teams, although the sample size is small and the results should be interpreted with caution. This finding highlights the potential value of central review processes in enhancing measurement reliability, particularly in complex clinical trial settings where multiple raters may interpret tumor images differently.
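As a minimal sketch of the kind of pairwise computation underlying these comparisons, the following assumes one categorical best-response label per patient for each rater; the rater names and data shown are illustrative rather than taken from the trials analyzed here.
\begin{verbatim}
# Pairwise Cohen's kappa between raters on categorical best-response
# labels (CR/PR/SD/PD), using scikit-learn's cohen_kappa_score.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Illustrative data: one best-response label per patient, per rater.
ratings = {
    "site_investigator": ["PR", "SD", "PD", "PR", "SD", "CR"],
    "central_reviewer_1": ["PR", "SD", "PD", "SD", "SD", "CR"],
    "central_reviewer_2": ["PR", "PD", "PD", "PR", "SD", "CR"],
}

# Compute kappa for every pair of raters.
for (name_a, labels_a), (name_b, labels_b) in combinations(ratings.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
\end{verbatim}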
To determine whether observed differences in tumor classification have meaningful clinical implications, we modeled time-to-event outcomes using Cox regression models. Our analysis of TTP, TTR, and DoR yields particularly informative results. No significant differences in hazard ratios are detected between site investigators and central reviewers across these critical endpoints. More definitively, using two one-sided tests (TOST) for equivalence, we demonstrate statistical equivalence between site investigators and central reviewers for both TTP and TTR outcomes. Although equivalence could not be established for DoR, the point estimates are similar, suggesting that the inability to confirm equivalence stems from limited sample sizes and wider confidence intervals rather than from a genuine clinical difference.
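In outline, the TOST procedure applied to a log hazard ratio tests the two one-sided null hypotheses
\[
H_{01}: \log(\mathrm{HR}) \le -\delta
\qquad \text{and} \qquad
H_{02}: \log(\mathrm{HR}) \ge \delta,
\]
and declares equivalence at level \(\alpha\) only if both are rejected, which is equivalent to the \((1 - 2\alpha)\) confidence interval for \(\log(\mathrm{HR})\) lying entirely within the equivalence margin \((-\delta, \delta)\); the specific margin \(\delta\) is study-dependent and is not restated in this summary.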
These time-to-event findings align with previous research by Zhang et al. (2) showing general concordance between rater groups for primary trial endpoints. However, our results also confirm an important nuance: while average agreement across multiple trials appears to be consistent, individual trials can exhibit meaningful discrepancies between site and central assessments. Such trial-specific variations, though statistically expected in some proportion of studies, can have profound implications for the interpretation of results and the determination of treatment efficacy (2). This observation lends support to the continued practice of BICR in clinical trials to ensure data quality.
Our sensitivity analysis of the RECIST 1.1 threshold values supports the stability of the current standards: the established disease response (30% decrease) and progression (20% increase) thresholds are robust across testing. That is, when we vary these thresholds within clinically reasonable ranges, we find no substantial effects on classification consistency or inter-rater agreement, and large deviations in agreement between raters appear only at the extremes of the classification thresholds. This indicates that the consensus-derived RECIST thresholds are appropriate and stable across different clinical contexts and rater groups. However, this analysis is limited by the available data, particularly for extreme threshold values, because the clinical trial data include patients only up to the point of disease progression, and we therefore cannot assess the full range of RECIST classifications. Future research could explore the effects of more extreme threshold values on classification consistency, particularly in trials involving treatments that may not produce the expected tumor shrinkage or growth patterns; this could potentially be accomplished through simulation studies or additional empirical data collection.
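As a minimal sketch of the form such a sensitivity analysis can take, the following assumes per-assessment percentage changes from baseline and from nadir are available for two raters; the simplified classification rule, the threshold grid, and the data are illustrative and omit several RECIST conditions (complete response, new lesions, the 5 mm absolute-increase requirement).
\begin{verbatim}
# Sweep alternative response/progression thresholds and recompute
# inter-rater agreement on the resulting classifications.
from sklearn.metrics import cohen_kappa_score

def classify(change_from_baseline, change_from_nadir, pr_cut, pd_cut):
    # Simplified RECIST-style label from percentage changes (illustrative only).
    if change_from_nadir >= pd_cut:
        return "PD"
    if change_from_baseline <= -pr_cut:
        return "PR"
    return "SD"

# Illustrative paired measurements: (% change from baseline, % change from nadir).
rater_a = [(-45, 5), (-10, 25), (-35, 2), (8, 30), (-20, 10)]
rater_b = [(-40, 8), (-12, 22), (-28, 4), (10, 28), (-18, 12)]

# Vary the thresholds around the RECIST 1.1 defaults (30% response, 20% progression).
for pr_cut in (20, 25, 30, 35, 40):
    for pd_cut in (10, 15, 20, 25, 30):
        labels_a = [classify(b, n, pr_cut, pd_cut) for b, n in rater_a]
        labels_b = [classify(b, n, pr_cut, pd_cut) for b, n in rater_b]
        kappa = cohen_kappa_score(labels_a, labels_b)
        print(f"PR cut {pr_cut}%, PD cut {pd_cut}%: kappa = {kappa:.2f}")
\end{verbatim}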
The collective evidence from our three analytical approaches supports several important conclusions about RECIST 1.1 reliability. First, RECIST 1.1 demonstrates substantial overall reliability as reflected in good \(\kappa\) values, though with room for improvement, particularly in clinical trial settings where we observed consistently lower agreement. Second, differences between site investigators and central reviewers, while present, generally do not translate into clinically meaningful differences in trial endpoints, with the important caveat that individual trials may still exhibit consequential discrepancies. Third, the current RECIST threshold values appear empirically justified, showing stability across various analytical conditions.
It is important to acknowledge a broader contextual issue that this study does not address: the type of treatment received and the relevance of the RECIST criteria to different treatment modalities. RECIST was originally developed when chemotherapy (cytotoxic treatment) and surgical intervention represented the primary available options, and the expected response pattern to chemotherapy (i.e., apoptosis and clearing of dead cells) meant that measuring tumor growth or decline could be assumed to be an adequate proxy for disease response (5–7). However, the contemporary therapeutic landscape has evolved dramatically, introducing treatment modalities that may not produce the tumor shrinkage that RECIST was designed to detect (8). The trials examined in this study involve cytotoxic, cytostatic, and immunotherapy treatment modalities (Table 2.1), representing only three of the seven “pillars of cancer therapy,” which also include surgery, radiotherapy, hormonal therapy, and cell therapy (7). Cytostatic agents, for example, work by halting tumor growth rather than causing regression, while immunotherapies may initially cause tumor swelling due to immune cell infiltration before any reduction occurs (7). One might therefore reasonably expect the utility of RECIST to differ by treatment type, and potentially to decrease as cytostatic agents and immunotherapies become more prevalent in clinical practice, given that these treatments may achieve therapeutic benefit without the measurable tumor shrinkage that the RECIST criteria prioritize. Future studies should explore how RECIST 1.1 performs across these diverse treatment modalities, particularly in the context of immunotherapies and other novel approaches that may not conform to traditional tumor response patterns.
Furthermore, RECIST may soon face competition from emerging technological approaches, including software that can automatically segment tumors (9,10) and three-dimensional methods that might better assess overall tumor burden (11). These advanced techniques could potentially replace the need for human raters in some contexts, thereby enhancing measurement consistency and reducing variability. However, these innovations also raise important questions about how to integrate new technologies with existing frameworks like RECIST 1.1, particularly regarding the interpretation of results and the establishment of new standards for tumor assessment. While we intentionally focused our analysis on the reliability of the current RECIST 1.1 criteria to avoid diluting the primary aims of this thesis, these broader considerations will require careful attention in future research as the tumor assessment landscape continues to evolve.
Looking forward, our findings suggest several promising directions for improving RECIST implementation. A particularly valuable approach would be to explore collaborative methodologies for tumor identification at baseline, as demonstrated by Oubel et al. (12), involving enhanced coordination between site investigators and central reviewers. This strategy could directly address the observed trend of lower inter-rater reliability in clinical trial settings by establishing shared understanding of target lesions from the outset of treatment. By developing standardized protocols for collaborative baseline assessments, the oncology research community could potentially enhance measurement consistency throughout the treatment course, further strengthening the reliability of this critical evaluation framework.
Such methodological innovations would complement RECIST’s solid foundation while addressing the specific contexts where our research identified opportunities for improvement. As cancer therapeutics continue to advance and diversify, maintaining and enhancing the reliability of response assessment tools like RECIST 1.1 remains essential for generating valid and generalizable clinical trial results. Despite the introduction of competing assessment frameworks, RECIST 1.1 remains the gold standard for tumor response evaluation in oncology trials, and its rigorous methodology and established reliability continue to underpin the development of new cancer treatments.