Table 1.1: Determination of Overall Response in RECIST 1.1 from target, non-target, and new lesion assessments.

Target Lesions | Non-Target Lesions | New Lesions | Overall Response |
---|---|---|---|
CR | CR | No | CR |
CR | Non-CR/non-PD | No | PR |
CR | Not evaluated | No | PR |
PR | Non-PD or not all evaluated | No | PR |
SD | Non-PD or not all evaluated | No | SD |
Not all evaluated | Non-PD | No | NE |
PD | Any | Yes or No | PD |
Any | PD | Yes or No | PD |
Any | Any | Yes | PD |
1 Introduction
This thesis examines the reliability of the Response Evaluation Criteria in Solid Tumors, version 1.1 (RECIST 1.1) (1), a widely used tool for assessing tumor response in oncology clinical trials. Reliable assessment of tumor response is crucial for evaluating treatment efficacy, yet concerns exist regarding the consistency of RECIST interpretations across different raters. Beginning with a general overview of cancer biology, global cancer epidemiology, and the challenges of measuring treatment outcomes in oncological studies, this Introduction establishes the context for understanding the importance of standardized response criteria. It further explores the role of clinical trials in drug development, regulatory requirements for tumor response assessment, and the operational challenges of implementing RECIST in multi-center trials. Following this, the express purpose of this thesis is articulated.
The subsequent Methods chapter details the research approach used to address the reliability questions, including a systematic literature review on inter-rater reliability of RECIST, retrospective analysis of discrepancies between site investigator and central reviewer tumor assessments using RECIST in several clinical trials, and sensitivity analyses examining how threshold definitions affect classification differences in the context of these trials. Following the presentation of the statistical methodology, the Results section reports findings from the meta-analysis of inter-rater reliability, quantifies site vs. central reviewer assessment discrepancies, and evaluates how varying threshold definitions impact response categorization.
The Discussion chapter contextualizes these findings within existing literature and explores their implications for clinical trials and research methodology while also highlighting some of the limitations of this thesis. Finally, the Conclusion synthesizes the key insights, readdresses the primary research questions, and offers recommendations for enhancing the reliability of tumor response assessment in future oncology trials.
1.1 A Brief Primer on Cancer
Before diving into the specifics of tumor response assessment and the reliability of RECIST, it is necessary to review several aspects of cancer as a disease and the value in developing treatments against cancers. This section thus provides an elementary overview of cancer biology, its global burden, and the challenges associated with treatment and response assessment.
1.1.1 Cancer 101
Cancer is fundamentally a disease of dysregulated cellular behavior, rooted in alterations at the molecular and cellular levels (2). Under normal physiological conditions, cells adhere to regulatory pathways that govern their growth, division, differentiation, and death. In contrast, cancer arises when these regulatory systems are disrupted, leading to unrestrained cell proliferation and, ultimately, tumor formation (2). The transformation from a normal to a malignant cell is typically driven by the accumulation of genetic and epigenetic mutations that interfere with key cellular processes (2,3). These mutations often activate oncogenes, which promote cell division, and inactivate tumor suppressor genes, which ordinarily function to restrain growth or induce apoptosis1 in damaged cells (3,4). These cells thus possess the ability to proliferate indefinitely, resist apoptosis, and ignore signals that would normally inhibit growth.
Metastasis, the process by which cancer cells spread beyond the primary tumor site, represents one of the most clinically significant and challenging aspects of cancer (5,6). Once metastasized, tumors are often less responsive to localized treatments such as surgery or radiation, necessitating systemic therapies that are typically less targeted and more toxic (6). Metastasis also correlates strongly with poorer prognosis, and its presence at diagnosis is a key determinant of clinical outcomes (5). Complicating the diagnosis and treatment of cancer is its remarkable heterogeneity (6). Cancer is not a singular disease but rather a collection of disorders characterized by diverse genetic profiles and clinical behaviors (5,6). Even within a single type of cancer, significant variation can exist between patients and even within the different tumors of a single patient (5,7).
It is also important to distinguish between two broad categories of cancer: solid tumors (i.e. carcinomas and sarcomas) and hematological malignancies (2). Solid tumors arise in tissues and organs such as the breast, colon, and lungs, or in connective tissue such as muscle and bone, although this latter grouping, the so-called sarcomas, is far less common (2). In either case, the cancer from these solid tumors tends to form localized masses (2). In contrast, hematological malignancies, such as leukemia and lymphoma, originate in the blood or bone marrow and typically disseminate early in the disease process (2). Given their distinct biological behaviors, diagnostic criteria, and therapeutic strategies, this thesis will focus exclusively on cancers that present as solid tumors. This focus allows for a more coherent investigation into the cellular and molecular dynamics specific to solid tumor biology and their implications for diagnosis, treatment, and patient outcomes.
1.1.2 Cancer in Context: Global Burden
While the biological bases of cancer are generally well understood, the disease continues to present a substantial global public health challenge. Cancer represents one of the leading causes of morbidity and mortality worldwide (8), with millions of new cases diagnosed annually. Current estimates indicate approximately 20 million incident cases of cancer per year globally, alongside an estimated 9 million annual deaths (9). Of these, the five most common types are lung, breast, colorectal, prostate, and stomach cancers, which together account for nearly half of all new cases and generally present as solid tumors (9).
These figures, though significant, represent only point estimates in an evolving epidemiological landscape. Global trends in cancer incidence and mortality have shown concerning increases over recent decades, with projections suggesting this burden may rise by up to 75% to 35 million incident cases by 2050 (9). However, these statistics exhibit considerable variation across cancer types, geographic regions, and healthcare access levels (9). Improved diagnostic capabilities and expanded access to healthcare systems might contribute to higher reported incidence rates in some regions (10), though the overall trend toward increasing cancer burden appears consistent (9).
Given its current and future profound impact on global health and an economic burden in the trillions (11), cancer remains a critical priority for healthcare systems, researchers, and policymakers worldwide with nearly 4 billion USD in funding in 2024 from the US government, UK government, and European Commission alone (12). The United States’ National Institutes of Health dominates this figure with over 3 billion USD allocated to cancer research per year (12), accounting for approximately 10% of the total NIH budget (12,13). The scale of these investments underscores the urgency of developing more effective approaches to cancer diagnosis, treatment, and monitoring including improved methods for assessing treatment response in clinical settings.
1.1.3 Treatment Challenges
Despite remarkable advances in cancer research and substantial ongoing investment (14), cancer remains one of medicine’s most formidable challenges due to its inherent complexity and heterogeneity (14,15). Each tumor possesses a unique genomic profile with distinct mutations and cellular characteristics that can vary both between patients with the same cancer type and even within different regions of a single tumor (3,15). This molecular diversity means treatments effective for one patient may fail in another, even if their clinical presentations are similar (15).
Treatment options for cancer are correspondingly diverse, reflecting the diversity of tumor characteristics, and can generally be divided into the so-called “pillars” of cancer treatment. Different authors define differing numbers and categories of pillars (16,17), but surgery, radiation therapy, chemotherapy, and immunotherapy appear to be agreed-upon pillars, with targeted therapy, hormone therapy, and cell therapy also included at times (17). Regardless of the classification system used, each of these treatment modalities has its own strengths and limitations; the choice of treatment often depends on the specific characteristics of the tumor, its stage, and the patient’s overall health (17). Surgical interventions, for example, can potentially be curative for localized disease, but cannot address metastases (18). Radiation therapy risks damaging adjacent healthy tissues (18), while chemotherapy often lacks specificity, causing substantial systemic toxicity (16). Without detailing the other modalities, it suffices to say that each treatment comes with its own set of positives and negatives that must be weighed in the context of the cancer being treated and the needs of the individual patient receiving treatment.
Given the careful balancing act required to provide therapeutic benefit without excessive downsides, a central concern in cancer treatment is determining the effectiveness of a given therapy. This is particularly challenging in oncology, where treatment responses can be complex and multifaceted (19,20). Unlike many other medical conditions, cancer treatments often do not yield immediate or straightforward outcomes; one example is pseudoprogression, wherein tumor sizes on imaging may swell as a direct result of the treatment rather than due to true progression (21). Moreover, tumors may shrink, stabilize, or even grow despite treatment, and these changes can occur at different rates depending on the individual patient and the specific therapy used (4,21). Current evaluation methods, particularly imaging-based assessments of tumor burden, introduce additional variables related to technique and interpretation, emphasizing the need for standardized and validated criteria like RECIST while acknowledging their inherent limitations.
1.2 Measuring Tumor Response in Solid Tumors: RECIST
As the previous section highlighted, determining the effectiveness of cancer therapies presents unique challenges due to the complex and variable nature of tumor responses. Given these challenges, standardized assessment tools become essential for consistent evaluation across different clinical settings. While the discussions of RECIST criteria and the broader challenges of measuring tumor response in clinical trials are inherently intertwined, we begin by examining RECIST itself as a framework, assuming the reader has at least a passing familiarity with clinical trials, and we will address specific challenges of clinical trials later in this Introduction.
This section explores the historical context that necessitated the development of standardized response criteria, followed by a detailed overview of RECIST’s technical specifications and algorithmic structure. The discussion then extends to the practical implementation of RECIST in clinical settings, focusing particularly on inter-rater reliability challenges, discrepancies between site investigator and central reviewer assessments, and the complexities of applying these criteria consistently in clinical trials. Understanding both the technical aspects of the RECIST algorithm and its real-world reliability challenges is essential for interpreting the findings presented in this thesis and for appreciating their broader implications for clinical trial methodology, regulatory decision-making, and ultimately, patient care in oncology.
1.2.1 Historical Context: The Need for Standardization
The development of standardized tumor response assessment criteria emerged over the course of the early and mid 20th century, in large part due to inconsistent and often dangerous evaluation methods (22). One of the earliest attempts to standardize the general process of a clinical trial in the modern sense (i.e., with a proper control group, clearly defined evaluation procedures, and objective endpoints) was published in 1960 by Zubrod et al. (22,23). However, it wasn’t until 1979 that the World Health Organization (WHO) introduced the first internationally recognized tumor response assessment standard, coinciding roughly with the wider availability of computed tomography and magnetic resonance imaging (24,25). Prior to this standardization, assessment of tumor response in oncology trials was highly variable and often relied on subjective evaluations by local investigators (22), leading to inconsistencies in how treatment effects were measured and reported across different institutions and studies.
The establishment of the World Health Organization (WHO) tumor assessment criteria in 1979 represented a pivotal advancement in oncology research. The WHO guidelines introduced a bidimensional approach to measuring and assessing tumor burden, wherein “the sum of the products of the two longest diameters in the perpendicular dimensions of all tumors” was calculated (26) and four key response categories were defined: Complete Response (CR), Partial Response (PR: ≥50% reduction), Stable Disease (SD), and Progressive Disease (PD: ≥25% increase or new lesions). Widely adopted and validated across numerous tumor types, the WHO criteria provided a consistent framework that enabled meaningful comparisons of treatment efficacy across trials for the first time (24).
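To make the arithmetic of the WHO criteria concrete, the sketch below (in Python, with hypothetical function names and invented measurements) computes the sum of products of perpendicular diameters and maps its change onto the four categories. It is a minimal illustration of the thresholds quoted above, not a reproduction of the full WHO guidelines, which contain additional rules.

```python
# Illustrative sketch of the 1979 WHO bidimensional response categories.
# Names and measurements are hypothetical; the full guidelines contain
# additional rules (e.g. for individual lesion behavior) not shown here.

def who_spd(lesions):
    """Sum of products of the two longest perpendicular diameters (mm^2)."""
    return sum(longest * perpendicular for longest, perpendicular in lesions)

def who_response(baseline_spd, current_spd, new_lesions=False):
    """Map a change in SPD onto the four WHO response categories."""
    if new_lesions or current_spd >= 1.25 * baseline_spd:  # >=25% increase
        return "PD"
    if current_spd == 0:                                   # all lesions gone
        return "CR"
    if current_spd <= 0.50 * baseline_spd:                 # >=50% reduction
        return "PR"
    return "SD"

baseline = who_spd([(20, 15), (30, 25)])  # 20*15 + 30*25 = 1050 mm^2
followup = who_spd([(10, 8), (15, 12)])   # 10*8 + 15*12 = 260 mm^2
print(who_response(baseline, followup))   # ~75% reduction -> "PR"
```

The product calculation for every lesion, repeated at every visit, is precisely the computational burden that RECIST later eliminated by moving to unidimensional measurement.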
Defining these cut-offs and categories thus provides the measurement basis for calculating key trial endpoints such as Objective Response Rate (ORR), Progression-Free Survival (PFS), and Overall Survival (OS). These endpoints are critical for evaluating the effectiveness of new therapies and for regulatory approval processes, as they provide standardized measures of treatment benefit that can be compared across different studies and patient populations. Of course, there is an implicit assumption in these WHO criteria that tumor burden accurately reflects disease status and is predictive of outcomes; we discuss this topic in detail in Section 1.3.2. Despite this significant step forward, the WHO system was not without limitations. Its approach to measuring tumor response required tracking an indeterminate number of lesions, was vulnerable to manual error, and imposed a substantial time burden on clinicians (26,27). The practical demands of measuring multiple diameters per lesion made assessments time-consuming, while variability in how different observers selected and measured lesions further undermined reproducibility (27).
1.2.2 Improving on WHO Criteria: RECIST
In response to the limitations of the WHO criteria, RECIST was developed as a simplified and standardized framework for tumor response assessment with the first version, RECIST 1.0, introduced in 2000 (27). The most fundamental innovation of RECIST over the WHO guidelines was the adoption of unidimensional measurements, where only the longest diameter of target lesions would be measured, rather than the more complex bidimensional approach required by WHO criteria (27). This change alone substantially reduced computational complexity and measurement error, making the assessment process more efficient and reproducible across different investigators and trial sites.
RECIST 1.0 also introduced a more structured approach to lesion selection and categorization. The criteria allowed for up to 10 target lesions (with a maximum of 5 per organ), which would be measured and tracked throughout treatment (27). Additionally, the framework formalized the concept of non-target lesions (lesions not directly measured but still assessed qualitatively for response or progression) and established the identification of new lesions as an absolute marker of disease progression (27). The calculation process was dramatically simplified by introducing the sum of longest diameters (SLD) as the primary metric, eliminating the need for the complex product calculations required by WHO criteria (27). RECIST 1.0 also established new threshold values for progression (a ≥20% increase) and response (a ≥30% decrease) in target-lesion SLD (27). This new framework was rapidly adopted across the oncology community due to its practicality, efficiency, and capacity to enhance consistency in multi-center trials.
Building on nearly a decade of implementation experience, RECIST 1.1 was introduced in 2009 to address certain limitations identified in the original criteria (1). This update further streamlined the assessment process by reducing the maximum number of target lesions from 10 to 5 (with no more than 2 per organ), which research had shown was sufficient for accurate response assessment while further reducing measurement burden (1). RECIST 1.1 also provided more detailed guidance on lesion selection, emphasizing the importance of choosing measurable lesions and excluding non-measurable abnormalities from target designation (1). RECIST 1.1 also established specific criteria for lymph node assessment, specifying minimum size requirements and measurement approaches (1). Additionally, it introduced a default value of 5 mm for target lesions too small to be measured accurately, and required an absolute SLD increase of at least 5 mm, alongside the 20% relative increase, to determine progression of disease (1). These refinements, alongside enhanced guidance on imaging techniques and assessment timing, further improved the standardization and reliability of tumor response evaluation in clinical trials.
Having traced the evolution of RECIST from its origins to the current 1.1 version, we can now examine its technical framework in detail. The historical development of these criteria contextualizes why certain specifications were adopted and how they address previously identified limitations in tumor assessment methodology. The following section provides a comprehensive overview of RECIST 1.1’s operational structure. This detailed understanding is essential for interpreting the reliability analyses presented later in this thesis, as the technical nuances of RECIST-based lesion classification and response categorization are of direct relevance to the consistency of assessments across different raters.
1.2.3 Technical Specifications of RECIST 1.1
As alluded to, lesions are classified into three distinct categories in RECIST 1.1, each with specific roles in determining treatment response. These categories are target lesions, non-target lesions, and new lesions. Target lesions and non-target lesions (if any) are evaluated at baseline and subsequently measured at each imaging assessment, while new lesions are identified based on their appearance during the course of treatment. The classification of lesions into these categories is critical for the overall response assessment, as it determines how changes in tumor burden are interpreted and categorized. The RECIST 1.1 framework provides clear definitions and measurement guidelines for each lesion type, ensuring that assessments are both standardized and reproducible across different clinical settings. The paragraphs below examine each category in turn, and the algorithmic approach used to assess treatment response based on these classifications is presented in Table 1.1.
Target lesions form the quantitative foundation of the RECIST assessment. These lesions must be measurable in at least one dimension and meet specific size criteria: a minimum of 10 mm in the longest diameter for non-nodal lesions and 15 mm in the short axis for lymph nodes (which, unlike other lesions, are measured along their short axis). RECIST 1.1 limits the selection to a maximum of five target lesions with no more than two per organ. These constraints ensure focus on the most clearly measurable disease manifestations while maintaining a manageable assessment burden. For these target lesions, the SLD is calculated at each imaging assessment, and a Target Response is determined based on specific threshold changes: CR requires disappearance of all target lesions; PR is defined as at least a 30% decrease in SLD compared to baseline; PD occurs with at least a 20% increase in SLD compared to the smallest SLD recorded since treatment initiation (i.e. since the point of nadir2); and SD applies when neither PR nor PD criteria are met.
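The Target Response thresholds just described lend themselves to a compact decision rule. The sketch below (in Python, with hypothetical names and example values) applies them to a single assessment, including the 5 mm absolute-increase requirement that RECIST 1.1 added to the 20% relative threshold; it is a simplified illustration and omits special cases such as lymph node normalization.

```python
# Hedged sketch of the RECIST 1.1 Target Response thresholds: CR on
# disappearance, PR at >=30% SLD decrease from baseline, PD at >=20% SLD
# increase from nadir with a >=5 mm absolute increase, SD otherwise.
# Function and parameter names are illustrative, not from the guidelines.

def target_response(baseline_sld, nadir_sld, current_sld):
    """Classify target-lesion response from SLD values in millimetres."""
    if current_sld == 0:
        return "CR"  # all target lesions have disappeared
    increase = current_sld - nadir_sld
    if nadir_sld > 0 and increase / nadir_sld >= 0.20 and increase >= 5:
        return "PD"  # relative AND absolute increase thresholds both met
    if (baseline_sld - current_sld) / baseline_sld >= 0.30:
        return "PR"
    return "SD"

# 35% below baseline, but only ~8% above nadir: still a Partial Response.
print(target_response(baseline_sld=100, nadir_sld=60, current_sld=65))
```

Note that PR is judged against baseline while PD is judged against nadir, so the two comparisons can point in different directions at the same visit, as in the example above.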
Non-target lesions complement the quantitative assessment of target lesions. These are abnormalities that, while identified at baseline, are not selected for measurement due to their small size, irregular shape, or poor delineation. Although not measured directly, non-target lesions are still evaluated qualitatively for response or progression. Their assessment contributes significantly to the overall response determination, particularly when “unequivocal progression” is observed, which can indicate PD regardless of target lesion measurements.
New lesions constitute the third category and are defined as any lesions that were not present at baseline but appear during treatment. The identification of new lesions serves as an absolute marker of disease progression, overriding any positive changes observed in target or non-target lesions. This categorical classification reflects the biological significance of disease spread to new sites, which fundamentally indicates treatment failure regardless of responses elsewhere.
The overall response assessment in RECIST 1.1 integrates findings from all three lesion categories according to a predetermined algorithm. Table 1.1 summarizes the possible combinations of target, non-target, and new lesion assessments and their corresponding overall response classifications.
It is worth emphasizing certain critical aspects of the RECIST Overall Response calculation that have particular relevance for the reliability analyses presented in this thesis. The presence of new lesions or unequivocal progression in non-target lesions serves as an absolute indicator of disease progression, regardless of favorable changes observed in target lesions. This means that even if target lesions demonstrate CR, PR, or SD, the detection of new lesions or significant worsening of non-target lesions will override these positive findings, resulting in an overall classification of PD. This hierarchical decision structure reflects the biological understanding that cancer spread to new sites fundamentally represents treatment failure, regardless of responses in previously identified disease locations.
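The hierarchical logic of Table 1.1 can be sketched as a short lookup function. The Python below (with illustrative category strings and a hypothetical function name) encodes the rows of the table, making the override behavior explicit: progression in any category dominates all other findings.

```python
# Minimal sketch of the Table 1.1 decision hierarchy. Category strings
# ("Non-CR/non-PD", "NE", etc.) are shorthand for this illustration.

def overall_response(target, non_target, new_lesions):
    """Combine per-category findings into an overall response (Table 1.1)."""
    if new_lesions or target == "PD" or non_target == "PD":
        return "PD"  # progression anywhere overrides favorable findings
    if target == "CR" and non_target == "CR":
        return "CR"
    if target == "CR" and non_target in ("Non-CR/non-PD", "NE"):
        return "PR"  # residual non-target disease precludes overall CR
    if target == "PR":
        return "PR"
    if target == "SD":
        return "SD"
    return "NE"  # e.g. target lesions not all evaluated

# Complete target response, but residual non-target disease: overall PR.
print(overall_response("CR", "Non-CR/non-PD", new_lesions=False))
```

This encoding also makes clear why rater disagreement about a single input, such as whether a small new abnormality constitutes a new lesion, can flip the overall classification from CR all the way to PD.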
An important aspect of RECIST evaluations to keep in mind is that the Overall Response can take any of the values CR, PR, or SD up until the point at which a patient is classified as PD, at which point they would generally be removed from study participation. This is of practical importance for endpoints that include positive responses such as ORR, time to response (TTR), and duration of response (DoR), as these endpoints are calculated using only patients who achieved CR or PR at any point during treatment. Moreover, because nadir can be a shifting point, the SLD used to determine progression can change over the course of treatment, meaning that a patient who was previously classified as PR or SD could later be reclassified as PD if their SLD increases by 20% from the lowest point recorded since treatment initiation. This dynamic nature of RECIST assessments means that RECIST ratings are only quasi-ordinal in nature, as the numeric basis for the ratings can change over time. Figure 1.1 below illustrates four possible patterns of RECIST 1.1 Target Lesion assessments over time, demonstrating how an individual's classification is a moving target that changes based on the SLD of target lesions.
Each of the four rows in this figure can be thought of as the trajectory of a patient’s SLD over the course of a baseline visit and 5 follow-up visits, with the y-axis normalized to 100% of the baseline SLD. Greyed-out regions indicate timepoints not yet observed; i.e., the rows should be read left to right. The red horizontal lines indicate the thresholds for progression (20% increase from nadir), while the green horizontal lines indicate the thresholds for response (30% decrease from baseline). In the first row, the patient neither achieves a response nor experiences progression, remaining classified as SD throughout the treatment course. In the second row, the patient’s SLD decreases slightly, lowering the nadir value and thus the threshold for progression in the process. However, their disease course begins to worsen, and they are ultimately classified as PD at the final visit as they have crossed the red line indicating a 20% increase in SLD from the nadir. The third row illustrates a patient who achieves a PR at their third follow-up visit as they cross under the green line, and who simply continues to show response over the following visits. The fourth row shows a patient who achieves a PR at their second follow-up visit, but whose disease worsens over time, ultimately being classified as PD at the final visit. This illustrates how RECIST classifications can change over time based on the SLD of target lesions and the appearance of new lesions.
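The shifting-nadir mechanism behind these trajectories can be demonstrated in a few lines of code. The Python sketch below (hypothetical function name, invented SLD values) classifies each follow-up visit in sequence, updating the nadir as it goes; it reproduces the qualitative shape of a response-then-relapse trajectory like the fourth row described above.

```python
# Illustrative sketch of visit-by-visit target-lesion classification with a
# shifting nadir. Values are invented; in practice a patient classified as
# PD would generally exit the study rather than continue to be assessed.

def classify_visits(slds):
    """Classify each follow-up visit; slds[0] is the baseline SLD (mm)."""
    baseline, nadir, labels = slds[0], slds[0], []
    for sld in slds[1:]:
        nadir = min(nadir, sld)  # the nadir can only move downward
        if sld == 0:
            labels.append("CR")
        elif sld - nadir >= 5 and (sld - nadir) / nadir >= 0.20:
            labels.append("PD")  # >=20% and >=5 mm above the running nadir
        elif (baseline - sld) / baseline >= 0.30:
            labels.append("PR")  # >=30% below baseline
        else:
            labels.append("SD")
    return labels

# Baseline 100 mm: shrinkage to a 60 mm nadir (PR), then regrowth to 75 mm,
# which is 25% above nadir and therefore PD despite remaining below baseline.
print(classify_visits([100, 80, 60, 65, 75]))
```

The final visit illustrates the quasi-ordinal point made above: at 75 mm the patient is still 25% below baseline, yet is classified as PD because the comparison point for progression has shifted down to the nadir.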
While the overview provided here covers the core elements of RECIST 1.1, it necessarily omits some nuances addressed in the complete guidelines (1), such as detailed criteria for lymph node assessment, guidance for handling non-measurable or non-evaluable lesions, and special considerations for particular imaging modalities. Nevertheless, the framework described above establishes the essential foundation required for understanding RECIST’s application in clinical trials and, more specifically, for interpreting the reliability analyses that form the central focus of this thesis.
1.3 Towards Treatment: Clinical Trials
With some background on cancer and knowledge of the RECIST criteria in mind, we can now turn to the specific context of clinical trials and the role of tumor response assessment in evaluating cancer treatments. This section provides an overview of the importance of clinical trials, how success is measured in those trials, and the regulatory requirements for and difficulties in tumor response assessment.
1.3.1 Clinical Trials as the Gateway to Therapies
Clinical trials serve as the critical gateway through which new cancer treatments must pass before reaching patients. These systematic studies provide the controlled environment necessary to evaluate safety, efficacy, and optimal dosing of novel therapies (22). Regulatory bodies including the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have established rigorous frameworks governing the conduct of these trials, emphasizing not only scientific validity but also ethical considerations and patient safety (22,28). These regulatory requirements underscore the fundamental role of clinical trials as the sole legitimate pathway for new cancer therapies to gain approval and enter standard clinical practice.
The development of cancer treatments represents a substantial investment of time, expertise, and financial resources. A new therapeutic agent typically undergoes many years of laboratory research, preclinical testing, and multiple phases of clinical trials before receiving regulatory approval (29). This extensive development timeline makes the accuracy and reliability of trial endpoints particularly crucial; inefficiencies or errors in measuring treatment effects can significantly delay or even derail the development of potentially beneficial therapies. Phase II and Phase III trials are especially critical in this process, as they provide the primary evidence for efficacy and safety that supports regulatory decision-making (30). These later-phase trials rely heavily on standardized assessment criteria like RECIST to reliably measure treatment responses consistently across diverse clinical settings.
Given that regulatory approval hinges on demonstrating meaningful clinical benefit, measurement tools must be both precise and reliable to distinguish genuine therapeutic effects from artifacts or natural disease variation (21,31). Standardized frameworks like RECIST provide this essential foundation by enabling consistent assessment across different investigators, trial sites, and patient populations while facilitating regulatory review through a common evaluation language.
1.3.2 Measuring Treatment Success & Surrogate Endpoints
The primary objective of any cancer treatment is to improve patient outcomes, with overall survival (i.e. the length of time a patient lives after starting treatment) representing the most definitive measure of therapeutic benefit (32,33). Despite its unequivocal clinical relevance, overall survival presents significant methodological challenges as a primary endpoint in clinical trials. Studies using overall survival as their primary endpoint often require extended follow-up periods as they necessarily aim to capture the long-term effects of treatment, which substantially increases trial duration, cost, and complexity (32). Such extended timelines are frequently incompatible with the urgent need to bring effective therapies to patients and the economic realities of drug development (32). Furthermore, the impact of subsequent treatments after disease progression can confound the interpretation of overall survival data, potentially obscuring the true effect of the investigational therapy.
Given these limitations, surrogate endpoints have become essential tools in oncology research, providing earlier signals of efficacy that can accelerate therapeutic development and regulatory decision-making (34). These surrogate measures serve as proxies for clinical benefit that can be assessed more rapidly than overall survival. Common surrogate endpoints include outcomes reliant on measurements like tumor response and patient-reported quality-of-life (34,35). Importantly, several RECIST-based metrics have emerged as particularly valuable surrogate endpoints in oncology trials. Progression-Free Survival (PFS) measures the time patients live without disease worsening; Objective Response Rate (ORR) quantifies the proportion of patients achieving complete or partial tumor shrinkage; Disease Control Rate (DCR) captures those with complete response, partial response, or stable disease; Time to Progression (TTP) documents the interval from treatment initiation to disease advancement; and Duration of Response (DoR) records how long responses are maintained before progression or death (34). These endpoints are often employed as primary or secondary endpoints in solid tumor clinical trials depending on what patient outcomes the trial hopes to optimize (33).
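Two of these endpoints, ORR and DCR, reduce to simple proportions over patients' best overall responses, which the short Python sketch below illustrates. The data are invented for illustration, and real endpoint analyses involve confirmation rules and censoring conventions not shown here.

```python
# Hedged sketch of tallying ORR and DCR from best overall responses.
# The patient data are invented; function names are illustrative.

best_responses = ["CR", "PR", "SD", "PD", "PR", "SD", "PD", "PR"]

def objective_response_rate(responses):
    """ORR: proportion of patients whose best response is CR or PR."""
    return sum(r in ("CR", "PR") for r in responses) / len(responses)

def disease_control_rate(responses):
    """DCR: proportion of patients with CR, PR, or SD as best response."""
    return sum(r in ("CR", "PR", "SD") for r in responses) / len(responses)

print(f"ORR = {objective_response_rate(best_responses):.0%}")  # ORR = 50%
print(f"DCR = {disease_control_rate(best_responses):.0%}")     # DCR = 75%
```

Because both rates are computed directly from RECIST classifications, any rater disagreement over individual response categories propagates straight into these trial-level endpoints, a point central to the reliability analyses in this thesis.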
While surrogate endpoints facilitate more efficient drug development, their use requires careful interpretation (33). The correlation between surrogate endpoints and meaningful clinical outcomes like overall survival varies across cancer types, treatment modalities, and patient populations (33). A treatment may demonstrate impressive tumor shrinkage yet fail to extend survival or improve quality of life. Such a disconnect underscores the need for cautious assessment of surrogate measures. Nevertheless, regulatory agencies including the FDA and EMA generally accept well-validated surrogate endpoints as components of the evidence package supporting approval, particularly in Phase II and Phase III trials. These later-stage trials, involving hundreds to thousands of participants, rely heavily on standardized assessment frameworks to ensure consistent measurement and interpretation of key endpoints.
Tumor response as measured by RECIST has been validated as a key indicator of treatment efficacy and predictor of survival in clinical practice. This validation provides the foundation for RECIST’s central role in defining and measuring the surrogate endpoints described above. However, the reliability of these RECIST-derived endpoints depends critically on accurate and consistent tumor assessment. Measurement challenges such as technical issues with imaging, observer-related variability, and patient-related factors can compromise endpoint validity. Inaccurate response assessment can lead to misclassification of treatment efficacy, inappropriate treatment decisions, and ultimately impact patient outcomes. Given that regulatory approval and clinical adoption of new therapies often hinge on these surrogate measures, ensuring their reliability through standardized assessment criteria becomes a matter of paramount importance.
1.3.3 RECIST as the Regulatory Standard
Regulatory authorities worldwide have embraced RECIST 1.1 as a preferred methodology for tumor response assessment in clinical trials. Both the FDA and EMA explicitly recognize RECIST in their guidance documents, with the FDA publishing non-binding recommendations that specifically reference RECIST as an established approach for evaluating tumor response in solid tumors (36,37). While these agencies do not mandate RECIST’s use, they strongly encourage standardized and validated methodologies that ensure consistency across trial sites and facilitate regulatory review (36). RECIST has become the de facto standard precisely because it provides a framework that satisfies regulatory expectations for objective, reproducible, and clinically meaningful endpoints (38).
The widespread regulatory acceptance of RECIST stems from multiple factors. First, standardized criteria like RECIST enable efficient regulatory review by providing a common framework for comparing results across different studies and therapeutic agents (38,39). Second, RECIST facilitates international harmonization, ensuring consistency across multinational trials and regulatory submissions to different authorities (1). Finally, regulatory agencies expect robust quality assurance in clinical trials, and the FDA itself helped shape RECIST, having been consulted during its development (1).
While RECIST enjoys broad regulatory support and validation, important questions persist regarding its validity in certain contexts. The framework’s reliance on unidimensional measurements may not fully capture complex response patterns such as volumetric changes, particularly with novel therapies that alter tumor density or vascularity without necessarily affecting tumor size. Furthermore, the threshold values used to define response categories, namely the 30% reduction for partial response and the 20% increase for progressive disease, are arbitrary and lack explicit biological rationale (31). These inherent limitations underscore the need for ongoing critical evaluation of RECIST’s performance across different treatment modalities and cancer types, even as it remains the regulatory standard for response assessment in oncology trials.
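To make the role of these thresholds concrete, the target-lesion classification rules can be sketched as follows. This is a deliberately simplified illustration: it ignores lymph-node short-axis conventions and the non-target and new-lesion findings that also feed into the overall response (see the table above), and the function name and inputs are our own.

```python
def target_lesion_response(baseline_sld, nadir_sld, current_sld):
    """Classify target-lesion response from the sum of longest diameters (SLD).

    Simplified sketch of the RECIST 1.1 rules; SLD values are in millimetres.
    """
    if current_sld == 0:
        return "CR"  # disappearance of all target lesions
    if nadir_sld == 0:
        return "PD"  # reappearance after complete response counts as progression
    # PD: >=20% increase over the nadir AND an absolute increase of >=5 mm.
    increase = current_sld - nadir_sld
    if increase >= 5 and increase / nadir_sld >= 0.20:
        return "PD"
    # PR: >=30% decrease from baseline.
    if (baseline_sld - current_sld) / baseline_sld >= 0.30:
        return "PR"
    return "SD"

print(target_lesion_response(100, 100, 65))  # PR: 35% decrease from baseline
print(target_lesion_response(100, 50, 62))   # PD: 24% (and 12 mm) above nadir
```

Note how a measurement landing just either side of a cutoff (e.g. a 29% vs. 31% decrease) flips the category, which is precisely why inter-rater variability near these thresholds matters.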
1.3.4 Operational Challenges of RECIST in Clinical Trials
Despite RECIST’s standardized framework, significant challenges persist in its practical implementation across clinical trials (39). Even when adhering to RECIST guidelines, rater variability remains a substantial concern, with multiple factors contributing to inconsistent assessments (40). These challenges can be broadly categorized into technical, observer-related, and lesion-specific factors. Technical factors include variations in imaging protocols, slice thickness, and contrast timing, which can significantly affect how tumors appear and are measured (1). Observer-related factors encompass differences in experience level, training background, and specialty expertise among evaluators (40,41). Lesion characteristics such as size, location, morphology, and enhancement patterns further complicate consistent assessment. Additionally, differences in measurement methodology, including the use of manual versus automated tools and variations in software systems, can introduce another layer of potential inconsistency in tumor response evaluation (42). A particularly significant source of variability lies in the selection of target lesions at baseline (43,44). Since RECIST allows for the designation of up to five target lesions from potentially numerous eligible lesions, different evaluators may select different subsets of tumors for measurement. This initial divergence cascades throughout the assessment process, as subsequent measurements and response classifications are directly tied to these baseline selections. These challenges highlight the need for standardized imaging protocols, consistent training programs for investigators, and ongoing quality control processes to ensure assessment reliability.
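The scale of this baseline-selection problem is easy to quantify. Assuming a hypothetical patient with twelve measurable lesions, and ignoring RECIST's additional per-organ cap for simplicity, the number of distinct five-lesion target sets two evaluators could choose between is:

```python
from math import comb

# Hypothetical patient with 12 measurable lesions; RECIST 1.1 caps target
# lesions at five (and at two per organ, ignored here for simplicity).
n_lesions, n_targets = 12, 5
print(comb(n_lesions, n_targets))  # 792 distinct baseline selections
```

Even a modest lesion burden thus admits hundreds of valid baseline selections, each yielding a different SLD trajectory downstream.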
In the context of clinical trials, a critical distinction exists between assessments performed by site investigators (also called “local” or “enrolling” investigators) and those conducted through blinded independent central review (BICR) (45). Site investigators are typically the clinicians directly involved in patient care and treatment decisions, while BICR involves independent radiologists or oncologists who review imaging data without knowledge of treatment allocation. (Of note, a single study may have many site investigators, each of whom may assess the same patient independently; we will generally refer to them in the singular form “site investigator” in our Methods and Results section for simplicity.) This separation is intended to minimize bias and ensure objective evaluation of treatment response (39).
Beaumont et al. (46) examined this distinction in a Phase II study, finding notable differences in the determination of progressive disease between site investigators and central reviewers. These differences were not merely random variations but reflected systematic disparities in how the two groups applied RECIST criteria. Ford et al. had previously observed that BICR workflow processes are “specifically intended to produce greater consistency in image interpretation” compared to site investigator assessments, typically employing two independent BICR raters with an adjudicator to resolve discrepancies, an arrangement that may yield more consistent results than site investigator review (45). Ford’s comprehensive analysis identified multiple sources of variability between site and central assessments, including differences in training protocols, potential treatment bias among site investigators who may have clinical knowledge of patients, varying experience with different tumor types, and disparate interpretations of RECIST guidelines (see Table 1 from Ford for a detailed overview of these variability sources) (45).
While BICR is generally considered the gold standard for response assessment due to its structured approach and reduced potential for bias, it is not always feasible due to substantial cost and time constraints (47). Consequently, some trials rely exclusively or partly on site investigator assessments, which might introduce variability and bias into response evaluations (48). This disparity is particularly pronounced in target lesion selection, where systematic differences may emerge between specialized radiologists (i.e. central reviewers) and clinicians with less imaging experience (i.e. site investigators) (45). Furthermore, site investigators may be influenced by treatment knowledge in open-label studies, potentially affecting their assessments in ways that blinded central reviewers would not experience (48).
The extent of such differences between site and central reviewers has been empirically investigated in several key studies. Zhang et al. (48) conducted a meta-analysis of 76 phase III randomized clinical trials (RCTs) of anticancer agents for solid tumors that included assessments from both site investigators and central reviewers. While their analyses found no systematic differences between site and central reviewers across the four trial endpoints (ORR, DCR, PFS, and TTP) they examined, they observed that statistically inconsistent inferences could be made in nearly a quarter of the trials depending on which assessment source was used. This suggests that although differences may not be systematic across the entire landscape of oncology trials, they can nevertheless have significant implications for specific studies and endpoints. Corroborating these findings, Jacobs et al. (49) examined 24 phase II and III RCTs of anticancer agents, in this case restricted to breast cancer. Their investigation reached essentially the same conclusion as Zhang et al., though with the notable limitation that Jacobs’ work focused exclusively on PFS as the endpoint of interest, leaving questions about other important RECIST-derived endpoints unaddressed.
1.4 Research Gaps Restated and Aims of the Thesis
The preceding sections have established that while RECIST 1.1 provides a standardized framework vital for tumor response assessment in clinical trials, significant reliability challenges persist in its implementation. Despite its widespread regulatory acceptance and the empirical reliability analyses of Zhang et al. and Jacobs et al., questions remain about the consistency of RECIST interpretations across different raters, particularly between site investigators and central reviewers (48,49).
The literature covered in this introduction thus reveals several specific knowledge gaps regarding RECIST reliability. First, despite numerous individual studies examining inter-rater reliability in RECIST assessments, no comprehensive meta-analysis has synthesized these findings to establish baseline expectations for agreement levels across raters in any context. Second, while some studies have noted differences between site investigator and central reviewer assessments, systematic analyses of how these differences affect trial endpoints have examined only a limited set of endpoints, notably omitting TTR and DoR. Third, the impact of RECIST’s arbitrary threshold values, namely the 30% reduction defining partial response and the 20% increase indicating progression, on inter-rater reliability has not been comprehensively evaluated. These thresholds, though widely accepted, lack explicit biological rationale and may contribute to assessment variability when measurements fall near these cutoff points.
Addressing these knowledge gaps has significant clinical relevance for oncology trials. Improved understanding of RECIST reliability can enhance the accuracy of tumor response assessment, potentially reducing measurement error and increasing confidence in trial outcomes. Additionally, quantifying the extent and patterns of disagreement between site investigators and central reviewers could inform more efficient trial designs and monitoring practices. Finally, evaluating the impact of threshold values on response classification may guide future refinements to RECIST criteria, particularly as novel treatment modalities with atypical response patterns become more prevalent in oncology.
This thesis therefore aims to systematically address these gaps through three primary research objectives. First, we will conduct a comprehensive meta-analysis of existing inter-rater reliability studies using RECIST criteria to establish baseline agreement expectations. Second, we will analyze several trial endpoints to compare site investigator assessments with central review outcomes, examining whether systematic differences exist in response classification. As a corollary, in an ideal scenario site investigators and central reviewers would not differ at all; we therefore also conduct equivalence testing on the same endpoints to formally assess this hypothesis. Finally, we will evaluate how varying the threshold values in RECIST criteria affects inter-rater reliability and classification stability, potentially identifying cutpoints at which rater agreement is maximized or at which disagreement concentrates.
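The equivalence-testing idea behind the second objective can be sketched with a two one-sided tests (TOST) procedure on a difference in proportions, such as site versus central ORR. The counts and the 10-percentage-point margin below are hypothetical illustrations under a normal approximation, not values or choices from this thesis:

```python
import math
from statistics import NormalDist

def tost_two_proportions(x1, n1, x2, n2, margin):
    """TOST for equivalence of two proportions (normal approximation).

    Tests H0: |p1 - p2| >= margin against H1: |p1 - p2| < margin and
    returns the larger of the two one-sided p-values; equivalence is
    concluded when that value falls below the chosen alpha.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    nd = NormalDist()
    p_lower = 1 - nd.cdf((diff + margin) / se)  # one-sided: diff > -margin
    p_upper = nd.cdf((diff - margin) / se)      # one-sided: diff < +margin
    return max(p_lower, p_upper)

# Hypothetical ORRs: site investigators 45/150, central review 42/150,
# equivalence margin of 10 percentage points.
p = tost_two_proportions(45, 150, 42, 150, 0.10)
print(f"TOST p-value: {p:.3f}")  # conclude equivalence if below alpha
```

Note that failing to reject non-equivalence is not evidence of a difference; it simply means the data cannot rule out a difference as large as the margin.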
With this background established, we proceed to the Methods chapter, which outlines the research design and methodology employed to address these questions.
Apoptosis is a form of programmed cell death that occurs in multicellular organisms, allowing for the removal of damaged or unwanted cells.
Nadir is the lowest point reached by a variable, in this case the SLD of target lesions, during treatment. It is used as a reference point for determining progression.