4  Discussion


Maintaining parallel structure with the previous chapters, this section discusses the results of the analyses in three largely self-contained parts: the IRR meta-analysis, the site investigator and central reviewer analyses, and the sensitivity analyses. Each part presents the results and their interpretation. Overall limitations of the study are addressed in a separate subsection.

4.1 Results interpretation

4.1.1 IRR Meta-analysis

The results of our meta-analysis demonstrate that the inter-rater reliability (IRR) of the RECIST 1.1 tumor measurement scale can be considered substantial based on Landis’ interpretation (1), with an overall pooled \(\kappa\) of 0.66, a pooled Cohen’s \(\kappa\) of 0.67, and a pooled Fleiss’ \(\kappa\) of 0.65. These values indicate that raters generally agree on the classification of tumor measurements, supporting the reliability of RECIST 1.1 as a standardized assessment tool across different evaluators.
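
For reference, the two agreement statistics reported above can be computed as follows. This is a minimal sketch using hypothetical ratings rather than the study data, with `cohen_kappa_score` and `fleiss_kappa` standing in for whatever implementation was actually used.

```python
# Minimal sketch of the agreement statistics referenced above, using
# hypothetical rater data (the actual study data are not reproduced here).
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# RECIST 1.1 categories coded as integers: 0=CR, 1=PR, 2=SD, 3=PD
rater_a = np.array([2, 2, 3, 1, 2, 3, 0, 2, 1, 3])
rater_b = np.array([2, 2, 3, 2, 2, 3, 1, 2, 1, 3])
rater_c = np.array([2, 1, 3, 2, 2, 3, 0, 2, 1, 2])

# Pairwise Cohen's kappa for two raters
print(cohen_kappa_score(rater_a, rater_b))

# Fleiss' kappa for all three raters: rows = subjects, columns = raters
ratings = np.column_stack([rater_a, rater_b, rater_c])
counts, _ = aggregate_raters(ratings)          # subjects x categories counts
print(fleiss_kappa(counts, method="fleiss"))
```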

However, our analysis also revealed potential limitations in the available evidence. Egger’s test for funnel plot asymmetry yielded a p-value less than 0.05, suggesting the presence of publication bias in the studies included in our meta-analysis. This finding indicates that some relevant studies might have been excluded from our analysis, potentially affecting the comprehensiveness of our results and warranting caution in their interpretation. Another, potentially more likely explanation for the observed funnel plot asymmetry is a simple dearth of studies analyzing IRR in the context of RECIST 1.1, particularly studies that report IRR values for both site investigators and central reviewers, as sponsors of clinical trials generally have no clear incentive to calculate or report these estimates. Such a lack of studies may lead to an overrepresentation of studies with higher IRR values, as these are more likely to be published, while studies with lower IRR values may remain underrepresented or unpublished.
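
Egger’s test amounts to a regression of the standardized effect on precision, with an intercept significantly different from zero indicating funnel plot asymmetry. A minimal sketch, using hypothetical per-study estimates and standard errors rather than the values from our meta-analysis:

```python
# Minimal sketch of Egger's regression test for funnel plot asymmetry,
# using hypothetical per-study kappa estimates and standard errors.
import numpy as np
import statsmodels.api as sm

kappa = np.array([0.55, 0.62, 0.70, 0.48, 0.74, 0.66, 0.59, 0.71])  # hypothetical
se    = np.array([0.08, 0.05, 0.06, 0.10, 0.04, 0.07, 0.09, 0.05])  # hypothetical

# Regress the standardized effect on precision; a non-zero intercept
# suggests small-study effects / asymmetry.
standardized_effect = kappa / se
precision = 1.0 / se
X = sm.add_constant(precision)
fit = sm.OLS(standardized_effect, X).fit()
print(fit.params[0], fit.pvalues[0])   # intercept and its p-value
```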

Of particular note, we observed a distinct clustering pattern in which data from clinical trials centered around a lower mean \(\kappa\) value than data from non-clinical trials. This pattern suggests that the IRR of the RECIST 1.1 tumor measurement scale may be lower in clinical trial settings than in other contexts, an important finding that raises questions about the contextual reliability of RECIST 1.1. This difference could simply be an artifact of serial disagreements between raters within the same patients, but it is also plausible that we are looking at a fundamentally different data context given the composition of raters: non-clinical trial studies generally involved exclusively radiologists, whereas clinical trials generally employ two radiologists as central reviewers and a mix of specialists as site investigators. This diversity in professional backgrounds and training could explain the observed variability in IRR values and highlights the need for further research to confirm these preliminary findings.

4.1.2 Site Investigator and Central Reviewer Analyses

Our pairwise IRR analyses reveal considerable variability in agreement between raters within and across studies, irrespective of whether they are site investigators or central reviewers. The wide range of pairwise Cohen’s \(\kappa\) values (0.286 to 0.803) across studies indicates some degree of inconsistency in how different raters interpret and apply the RECIST criteria. However, Cohen’s \(\kappa\) penalizes all disagreements equally, even when the differences have little clinical relevance, which can be the case for RECIST depending on the trial outcome being used (e.g., TTP uses only progression as an indicator, whereas ORR uses partial and complete response information). To account for the quasi-ordinal nature of the data, we conducted follow-up analyses using linear mixed effects models, which assign numerical values to outcomes based on their clinical favorability. These analyses identified significant differences between the site investigators and one central reviewer, as well as between the two central reviewers themselves, but only in one of the three studies examined. The other two studies showed no significant differences between site investigators and central reviewers, suggesting that quantifiable rater disagreements are the exception rather than the norm.
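
For illustration, such a model can be specified roughly as follows; the file name, column names, and numeric scoring are hypothetical stand-ins for the actual data layout and coding used in our analyses.

```python
# Minimal sketch of a linear mixed effects comparison of rater groups,
# assuming a long-format table with one row per patient/timepoint/rater.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("recist_assessments.csv")     # hypothetical file

# Map RECIST categories to a clinical-favorability score (lower = better)
score_map = {"CR": 0, "PR": 1, "SD": 2, "PD": 3}
df["response_score"] = df["overall_response"].map(score_map)

# Fixed effect of rater role (site investigator vs. central reviewer),
# random intercept per patient to account for repeated measurements.
model = smf.mixedlm("response_score ~ rater_role", data=df, groups=df["patient_id"])
result = model.fit()
print(result.summary())
```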

Overall, these preliminary analyses indicate an absence of systematic differences between site investigators and central reviewers regarding objective response rates. The variations that do exist appear attributable to random variation rather than systematic bias in the application of the RECIST criteria, and a high degree of variability could be expected after examining the contingency tables for response rates: the observed number of responses to treatment was generally quite low. However, it is important to note that such within-study differences were also observed by Zhang et al. (2) in their examination of other clinical trial endpoints. Their findings, alongside our analyses, underscore the importance of considering potential rater discrepancies within individual trials, as these can lead to meaningful differences in outcome assessments even when systematic biases are not present across multiple studies (2).

Extending our analysis to time-to-event outcomes and hazard ratios further strengthens these conclusions. Across individual studies, we identified only a single instance of a significant difference in hazard ratios between site investigators and central reviewers, with no consistent directional pattern of differences across studies. Pooling the data from all three studies, we found no statistically significant differences in hazard ratios for time-to-progression (TTP), time-to-response (TTR), or duration of response (DoR) between site investigators and central reviewers. This finding suggests that, despite the observed variability in the pairwise IRR analyses, the overall agreement between rater groups remains robust when considering time-to-event outcomes. However, the absence of differences based on NHST does not necessarily imply equivalence between the two rater groups, as differences may exist that fail to reach statistical significance due to limited sample sizes or other factors. To address this limitation and provide a more rigorous assessment, we employed formal equivalence testing using the TOST procedure, which yields particularly informative results.

Our equivalence analyses demonstrated statistical equivalence between site investigators and central reviewers for both TTP and TTR within a hazard ratio equivalence margin of \([0.80, 1.25]\). However, equivalence could not be definitively established for DoR, though the data suggested potential similarity. The inability to confirm equivalence for DoR likely stems from limited sample sizes and wider confidence intervals rather than true clinical differences, as the point estimates were similar but the data lacked sufficient power to establish formal equivalence within our pre-defined margins. It is also worth noting that the equivalence bounds we established could be interpreted as relatively liberal, which may have contributed to our ability to establish equivalence for TTP and TTR.
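
For illustration, the TOST procedure on a hazard ratio reduces to two one-sided tests on the log scale against the margin noted above; the estimate and standard error below are hypothetical placeholders rather than values from the trials.

```python
# Minimal sketch of a TOST equivalence test on a hazard ratio with a
# [0.80, 1.25] margin; estimate and standard error are hypothetical.
import numpy as np
from scipy import stats

log_hr, se_log_hr = np.log(0.97), 0.09          # hypothetical estimate and SE
lower, upper = np.log(0.80), np.log(1.25)       # equivalence bounds on log scale

# Two one-sided tests: estimate above the lower bound and below the upper bound
z_lower = (log_hr - lower) / se_log_hr
z_upper = (log_hr - upper) / se_log_hr
p_lower = 1 - stats.norm.cdf(z_lower)           # H0: log HR <= lower bound
p_upper = stats.norm.cdf(z_upper)               # H0: log HR >= upper bound

# Equivalence is claimed only if both one-sided tests reject at alpha = 0.05
print(max(p_lower, p_upper) < 0.05)
```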

Regardless, these findings again align with the work of Zhang et al. (2) on rater agreement in tumor measurement classification, although their approach lacked formal equivalence testing. Recent work by Jacobs et al. (3) focusing specifically on breast cancer trials similarly concluded that site investigators and central reviewers demonstrated good agreement in tumor measurement classification. While their methodological approach also paralleled ours in using meta-analysis to compare outcomes between rater groups, their study was more limited in scope, focusing exclusively on progression-free survival without conducting equivalence testing. Our use of formal equivalence testing and evaluation of multiple additional endpoints provides a broader understanding of RECIST reliability across different clinical contexts and allows for a more thorough assessment of IRR within the RECIST 1.1 framework.

4.1.3 Sensitivity Analyses

Our sensitivity analyses of Target Outcomes revealed important insights into the robustness of the RECIST 1.1 criteria. While we observed some decreases in IRR when varying the assessment thresholds, these reductions only manifested at extreme threshold values that would rarely, if ever, be adopted in clinical practice. Across all three clinical trials, we found no consistent patterns in the sensitivity analyses of Target Outcomes, which supports the conclusion that the IRR of the RECIST 1.1 tumor measurement scale remains stable across reasonable variations in the thresholds used to define target outcomes. This stability reinforces confidence in the reliability of RECIST 1.1 as a standardized assessment framework for oncology trials.
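
As a rough illustration of the threshold-varying procedure, the sketch below recomputes a simplified target-lesion classification under alternative progression thresholds and re-estimates agreement at each value. The file, column names, and classification rule are simplified assumptions rather than our full pipeline.

```python
# Sketch of a threshold-varying sensitivity analysis for Target Outcomes:
# the RECIST 1.1 defaults (+20% for progression, -30% for response) are
# replaced with alternative values and agreement is recomputed.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def target_response(pct_change_sld, pd_threshold=0.20, pr_threshold=-0.30):
    """Classify target-lesion response from percent change in SLD (simplified)."""
    if pct_change_sld >= pd_threshold:
        return "PD"
    if pct_change_sld <= pr_threshold:
        return "PR"   # CR (disappearance of all target lesions) omitted for brevity
    return "SD"

df = pd.read_csv("sld_measurements.csv")        # hypothetical file

for pd_thr in [0.10, 0.20, 0.30, 0.40]:         # alternative progression thresholds
    site = df["pct_change_site"].apply(target_response, pd_threshold=pd_thr)
    central = df["pct_change_central"].apply(target_response, pd_threshold=pd_thr)
    print(pd_thr, cohen_kappa_score(site, central))
```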

Perhaps the most significant finding from our IRR sensitivity analyses was the critical contribution of Non-Target Lesion and New Lesion measurements to the overall assessment reliability. The Overall Response classifications demonstrated markedly greater stability throughout our sensitivity analyses compared to Target Lesion measurements alone. This enhanced stability can be attributed to the additional contextual information provided by Non-Target Lesion and New Lesion measurements, which effectively compensate for potential variability in Target Lesion assessments. This finding underscores the importance of comprehensive tumor assessment in clinical trials and validates the RECIST 1.1 approach of integrating multiple types of lesion measurements into the final response determination.
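
The integration logic can be illustrated with a simplified sketch of the RECIST 1.1 overall response rules; special cases (e.g., not-evaluable assessments and the absolute-increase requirement for progression) are deliberately omitted here.

```python
# Simplified sketch of how RECIST 1.1 combines Target, Non-Target, and New
# Lesion information into an Overall Response.
def overall_response(target, non_target, new_lesions):
    if new_lesions or target == "PD" or non_target == "PD":
        return "PD"
    if target == "CR" and non_target == "CR":
        return "CR"
    if target == "CR" and non_target == "non-CR/non-PD":
        return "PR"
    if target == "PR":
        return "PR"
    return "SD"

# Non-Target and New Lesion information can override or refine the Target
# classification, which is why Overall Response is more stable under the
# threshold perturbations applied to Target Lesions alone.
print(overall_response("SD", "PD", new_lesions=False))   # -> "PD"
```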

With respect to trial endpoints, specifically ORR, TTP, TTR, and DoR, our sensitivity analyses further confirmed the robustness of the RECIST 1.1 framework. Across a range of alternative threshold values for disease progression and response, we observed that the RECIST 1.1 measurement scale maintained consistent performance characteristics. Notably, modifications to these thresholds produced no discernible impact on the degree of agreement (or disagreement) between site investigators and central reviewers across any of the three clinical trials examined. This consistency strongly suggests that RECIST 1.1 functions as a stable tool for tumor classification and that the currently established threshold values are appropriate for clinical trial applications. The framework’s resilience to threshold adjustments indicates that variations in measurement technique or interpretation within reasonable bounds are unlikely to substantially affect trial outcomes, lending further credibility to RECIST 1.1 as a reliable standard for oncology research.

4.2 Study Limitations

4.2.1 Data Availability and Quality

Our meta-analysis faces several important limitations related to the quantity and characteristics of the available data. The analysis includes only 14 studies, just above the commonly recommended minimum of 10 studies for meta-analytic approaches, which constrains the generalizability of our findings. This limitation is compounded by the considerable heterogeneity among the included studies in terms of study design, rater populations, and cancer contexts, potentially affecting the validity of our pooled estimates. The relatively small number of studies also prevents us from conducting meaningful subgroup analyses to explore the influence of potentially important confounding variables, such as cancer type, a factor that could be particularly relevant given that imaging techniques and tumor growth patterns vary substantially across different malignancies.
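
For context, between-study heterogeneity enters a random-effects pooling through the \(\tau^2\) term, which widens the pooled uncertainty when only a small number of heterogeneous studies is available. The sketch below shows a DerSimonian-Laird style calculation on hypothetical per-study values; it is an illustration of the general approach, not necessarily the exact estimator used in our analysis.

```python
# Sketch of random-effects pooling under between-study heterogeneity
# (DerSimonian-Laird), applied to hypothetical per-study kappa estimates.
import numpy as np

kappa = np.array([0.55, 0.62, 0.70, 0.48, 0.74, 0.66, 0.59, 0.71])            # hypothetical
var   = np.array([0.0064, 0.0025, 0.0036, 0.0100, 0.0016, 0.0049, 0.0081, 0.0025])

w_fixed = 1.0 / var
k = len(kappa)
pooled_fixed = np.sum(w_fixed * kappa) / np.sum(w_fixed)

# DerSimonian-Laird estimate of the between-study variance tau^2
q = np.sum(w_fixed * (kappa - pooled_fixed) ** 2)
c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
tau2 = max(0.0, (q - (k - 1)) / c)

# Random-effects weights, pooled estimate, and its standard error
w_re = 1.0 / (var + tau2)
pooled_re = np.sum(w_re * kappa) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
print(pooled_re, se_re)
```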

For our more in-depth analyses of site investigator and central reviewer agreement, we are further limited by access to only three clinical trials. This restricted sample size inevitably limits the generalizability of our findings regarding RECIST reliability in clinical trial settings. A more fundamental limitation of the trial data is the standard imaging schedule, typically performed at 4-6 week intervals. This relatively sparse temporal sampling limits our ability to characterize tumor growth and decay patterns with precision, potentially obscuring subtle differences in assessment timing between site investigators and central reviewers. Consequently, our time-to-event analyses may lack sufficient sensitivity to detect all meaningful differences between rater groups. The use of a tumor growth model that accounts for different growth and decay patterns could have provided a more nuanced understanding of tumor dynamics, but such models are inherently complex and difficult to develop due to the non-linear nature of tumor growth.

Arguably one of the largest limitations of the clinical trial data is that only control group data were available for our analyses. This limitation restricts our ability to draw conclusions about the IRR of RECIST 1.1 in the context of active treatment, where the dynamics of tumor response may differ significantly from those observed in control groups. Future research should aim to include both control and treatment arms to provide a more comprehensive understanding of RECIST 1.1’s reliability across different clinical scenarios.

An additional constraint is that our analyses focused exclusively on the RECIST 1.1 criteria, which may not be the optimal measurement scale for all tumor types. This focus limits the applicability of our findings to alternative response criteria such as iRECIST (for immunotherapy) or mRECIST (for hepatocellular carcinoma), which are increasingly used in specific therapeutic contexts. Our analytical approach could also have been expanded to include endpoints not addressed in this study. For instance, a meta-analysis similar to that conducted by Zhang et al. (2) could have been performed specifically for progression-free survival (PFS) and the disease control rate (DCR), potentially providing additional insights into the IRR of RECIST 1.1 assessments. This represents a valuable direction for future research that could complement and extend our current findings.

4.2.2 Analytical Approaches

Our methodological approach to measuring IRR had several inherent limitations. While Cohen’s \(\kappa\) and Fleiss’ \(\kappa\) are widely accepted metrics for assessing agreement between raters, they do not account for potentially relevant information such as the similarity of different levels of the outcome measure or the experience level of the raters. Additionally, for some studies, a continuous measure of IRR such as the intraclass correlation coefficient might have provided more nuanced insights into agreement patterns. However, we prioritized methodological consistency across studies, which necessitated using categorical measures of agreement that could be applied uniformly across the heterogeneous literature.
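
A weighted \(\kappa\), for instance, would partially address the first point by penalizing near-miss disagreements (e.g., PR versus SD) less than distant ones (e.g., CR versus PD); a minimal sketch with hypothetical ordinal codes:

```python
# Weighted Cohen's kappa on hypothetical ordinal codes (0=CR, 1=PR, 2=SD, 3=PD).
from sklearn.metrics import cohen_kappa_score

rater_a = [0, 1, 2, 2, 3, 1, 2, 3, 2, 1]
rater_b = [0, 2, 2, 2, 3, 1, 3, 3, 2, 2]

print(cohen_kappa_score(rater_a, rater_b))                       # unweighted
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))     # linear weights
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))  # quadratic weights
```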

A further significant methodological challenge in our survival analyses was the use of Cox regression modeling, which required us to address violations of the independence of observations assumption. Although we implemented statistical corrections by specifying clustering of individuals, this approach, which did enable us to compare hazard ratios between raters, remains methodologically debatable. Similarly, our analyses of objective response rates might have benefited from logistic regression modeling, which would have permitted the estimation of odds ratios and corresponding confidence intervals, potentially offering a more clinically interpretable metric of agreement. However, our primary focus was on detecting the presence of differences rather than precisely quantifying their magnitude, which our chosen approach adequately accomplished.
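
A minimal sketch of a clustered Cox model of this kind is shown below, assuming a long-format table with one row per patient per rater; the file and column names are hypothetical, and this illustrates the general approach rather than our exact specification.

```python
# Cox model comparing rater groups with cluster-robust (sandwich) errors to
# account for the same patients being assessed by multiple raters.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("time_to_progression_long.csv")   # hypothetical file
# Columns: time (weeks), event (1 = progression observed), rater_role
# (0 = site investigator, 1 = central reviewer), patient_id

cph = CoxPHFitter()
cph.fit(
    df[["time", "event", "rater_role", "patient_id"]],
    duration_col="time",
    event_col="event",
    cluster_col="patient_id",   # robust variance for repeated patients
)
cph.print_summary()             # hazard ratio for rater_role with robust CI
```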

For our analyses of ordinal RECIST data, we acknowledge that more sophisticated approaches such as ordinal logistic mixed effects models or proportional odds models might have better accounted for the inherent quasi-ordinality of response classifications. However, we deliberately employed basic linear mixed effects models as a pragmatic means to detect differences between rater groups without overcomplicating the analytical framework. This decision was justified by our subsequent time-to-event analyses, which provided more clinically relevant outcome measures. As noted earlier, tumor growth modeling that accounts for non-linear growth and decay patterns could have provided deeper insights, but the development of such models remains challenging due to the biological complexity and inter-individual variability of tumor dynamics.
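
For completeness, a (non-mixed) proportional odds model of the kind alluded to above could be specified as follows; the file and column names are hypothetical, and adding random effects for patients would require tooling beyond this sketch.

```python
# Sketch of a proportional odds (ordinal logistic) model for RECIST categories.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("recist_assessments.csv")          # hypothetical file
df["overall_response"] = pd.Categorical(
    df["overall_response"], categories=["CR", "PR", "SD", "PD"], ordered=True
)

# rater_role coded numerically (0 = site investigator, 1 = central reviewer)
model = OrderedModel(df["overall_response"], df[["rater_role"]], distr="logit")
result = model.fit(method="bfgs")
print(result.summary())
```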

A further limitation specific to our sensitivity analyses stems from the constraints inherent in clinical trial data, where information collection is bounded by patients’ actual clinical follow-up; that is, data are only available up until their participation in the study is discontinued. This fundamental constraint prevented us from observing the full spectrum of possible tumor growth and decay patterns as measured by SLD, limiting our ability to comprehensively characterize the IRR of the RECIST 1.1 tumor measurement scale across all theoretical threshold values. Particularly relevant to our threshold-modifying approach was the inability to observe how patients who progressed under the standard 20% threshold might have behaved at higher progression thresholds. Such patients may have required several additional weeks or months to reach the more extreme thresholds we examined in our sensitivity analyses. While a tumor growth modeling approach could theoretically address this limitation by simulating disease trajectories beyond observed timepoints, the considerable heterogeneity in tumor behavior, characterized by non-linear growth patterns and highly variable individual responses, renders such models exceptionally difficult to develop and validate. This limitation underscores the inherent challenge in fully exploring the theoretical boundaries of measurement criteria within the constraints of real-world clinical data.

With this breakdown of the overall results and limitations of our study, we can now turn to the implications of our findings for the future of tumor response assessment in clinical trials. The next section revisits the broader context of tumor response assessment and discusses how our results can inform future research and practice in this area, as well as potential avenues for further investigation.