Legacy Study Harmonization and Patient-Level Integration

This case study demonstrates how six heterogeneous, synthetically generated legacy breast cancer studies can be harmonized into a common patient-level database.

Jul 10, 2025

Introduction

Many pharmaceutical and clinical research organizations have accumulated substantial quantities of historical study data over time. These datasets often originate from studies conducted across different periods, geographies, and operational environments.

Although studies may investigate the same disease area, important differences frequently exist in:

Endpoint definitions
Data structures
Coding conventions
Variable naming
Follow-up duration
Time units
Treatment coding

As a result, direct integration of legacy studies is rarely possible without a dedicated harmonization effort.

The conventional solution is study-level meta-analysis, where each study is analyzed separately and summary treatment effects are subsequently combined.

While this approach is statistically rigorous and widely accepted, it limits the scope of possible analyses because patient-level information is not preserved within the combined evidence structure.

This project explores an alternative approach based on patient-level integration and harmonization.

Simulated Legacy Breast Cancer Portfolio

To illustrate the concept, six synthetic legacy breast cancer studies were generated from a common source population.

Each study was intentionally modified to resemble common real-world legacy data challenges.

The studies differed in several dimensions:

Recurrence endpoint definitions
Follow-up horizons
Treatment coding conventions
Time measurement units
Covariate representations

Classical Study-Level Meta-Analysis

Each study was first analyzed independently using a Cox proportional hazards model.

Treatment effect estimates were expressed as hazard ratios (HRs) with corresponding 95% confidence intervals.

Substantial variability was observed across studies. Some studies suggested a strong treatment benefit, whereas others produced more uncertain estimates due to limited sample size and event counts.

These study-specific estimates were then combined using both fixed-effect and random-effects meta-analysis models.

The pooled study-level estimate suggested a beneficial treatment effect with a hazard ratio of approximately 0.80.

The fixed-effect model produced:

HR = 0.80 (95% CI 0.66–0.97)

while the random-effects model produced:

HR = 0.78 (95% CI 0.57–1.08)

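These findings demonstrate the ability of meta-analysis to synthesize fragmented evidence across multiple studies. Note, however, that the random-effects interval includes 1, so the evidence for a treatment benefit is less conclusive once between-study heterogeneity is taken into account.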
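To make the pooling step concrete, here is a minimal sketch of inverse-variance fixed-effect pooling and DerSimonian-Laird random-effects pooling on the log hazard ratio scale. The case study does not state which random-effects estimator was used, so DerSimonian-Laird is an assumption. The function name `pool_hazard_ratios` and the example inputs are hypothetical and are not the estimates from the six studies.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def pool_hazard_ratios(hrs, ci_lowers, ci_uppers):
    """Pool study-level hazard ratios; returns (HR, lower, upper) per model."""
    if len(hrs) < 2:
        raise ValueError("At least two studies are needed for pooling.")

    log_hr = [math.log(h) for h in hrs]
    # Recover each standard error from the width of the 95% CI on the log scale.
    se = [(math.log(u) - math.log(l)) / (2 * Z_95)
          for l, u in zip(ci_lowers, ci_uppers)]

    def summarize(weights):
        total = sum(weights)
        est = sum(w * y for w, y in zip(weights, log_hr)) / total
        half_width = Z_95 / math.sqrt(total)
        return (math.exp(est),
                math.exp(est - half_width),
                math.exp(est + half_width))

    # Fixed effect: weights are the inverse within-study variances.
    w_fixed = [1.0 / s ** 2 for s in se]
    fixed = summarize(w_fixed)

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    mu = math.log(fixed[0])
    q = sum(w * (y - mu) ** 2 for w, y in zip(w_fixed, log_hr))
    c = sum(w_fixed) - sum(w ** 2 for w in w_fixed) / sum(w_fixed)
    tau2 = max(0.0, (q - (len(hrs) - 1)) / c)

    # Random effects: tau^2 is added to every within-study variance.
    w_random = [1.0 / (s ** 2 + tau2) for s in se]
    return {"fixed": fixed, "random": summarize(w_random), "tau2": tau2}


# Hypothetical study-level inputs (NOT the results reported in this case study).
example = pool_hazard_ratios(
    hrs=[0.62, 0.95, 0.71, 1.10, 0.80, 0.66],
    ci_lowers=[0.40, 0.70, 0.45, 0.65, 0.55, 0.38],
    ci_uppers=[0.96, 1.29, 1.12, 1.86, 1.16, 1.15],
)
```

Whenever the estimated between-study variance is positive, the random-effects interval is wider than the fixed-effect one, which matches the pattern in the two intervals reported above.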

Patient-Level Harmonization

Rather than stopping at study-level synthesis, all six datasets were subsequently harmonized into a common patient-level structure.

The harmonization process included:

Standardization of treatment variables
Alignment of time units
Reconciliation of endpoint definitions
Mapping of covariates to a common data model
Quality control and consistency verification
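The steps listed above can be sketched as a small mapping function. This is an illustration under stated assumptions, not the pipeline used in the project: the source column names, the treatment codes, the `spec` dictionary layout, and the choice of months as the common time unit are all hypothetical.

```python
# All column names, codes, and conversion rules here are hypothetical.
DAYS_PER_MONTH = 365.25 / 12

# Study-specific treatment codes mapped to a common 0/1 indicator.
TREATMENT_MAP = {
    "a": 1, "trt": 1, "active": 1, "1": 1,
    "b": 0, "ctrl": 0, "control": 0, "0": 0,
}

# Multipliers that convert each source time unit to months.
TIME_TO_MONTHS = {"days": 1 / DAYS_PER_MONTH, "months": 1.0, "years": 12.0}


def harmonize_record(raw, study_id, spec):
    """Map one raw study record onto the common patient-level schema."""
    treatment_key = str(raw[spec["treatment_col"]]).strip().lower()
    if treatment_key not in TREATMENT_MAP:
        # Quality control: fail loudly rather than guess an unknown code.
        raise ValueError(f"Unmapped treatment code {treatment_key!r} in {study_id}")

    time_months = float(raw[spec["time_col"]]) * TIME_TO_MONTHS[spec["time_unit"]]
    if time_months < 0:
        raise ValueError(f"Negative follow-up time in {study_id}")

    return {
        "study_id": study_id,  # provenance is kept on every record
        "patient_id": f"{study_id}-{raw[spec['id_col']]}",
        "treated": TREATMENT_MAP[treatment_key],
        "time_months": round(time_months, 2),
        "event": int(raw[spec["event_col"]]),
    }


# Two hypothetical studies with different conventions.
spec_a = {"id_col": "pid", "treatment_col": "arm", "time_col": "t_days",
          "time_unit": "days", "event_col": "recur"}
spec_b = {"id_col": "id", "treatment_col": "group", "time_col": "fu_years",
          "time_unit": "years", "event_col": "event"}

integrated = [
    harmonize_record({"pid": 7, "arm": "TRT", "t_days": 730.5, "recur": "1"},
                     "STUDY_A", spec_a),
    harmonize_record({"id": "003", "group": 0, "fu_years": 2.5, "event": 0},
                     "STUDY_B", spec_b),
]
```

Keeping one specification per study, rather than one hard-coded function per study, means that adding a seventh study would only require a new specification and any new treatment codes.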

The resulting integrated dataset preserved individual patient records while maintaining study provenance information.

This patient-level architecture creates opportunities that extend beyond traditional meta-analysis.

The key advantage is that common endpoint definitions can be reconstructed consistently across all studies after integration.

Endpoint Re-Derivation

One of the most common challenges in legacy clinical trials is that clinical endpoints were not defined using fully consistent criteria.

Some studies used 3-year recurrence, others used 5-year recurrence, and different event definitions were often applied.

In study-level meta-analyses, these differences typically remain unresolved.

Patient-level harmonization, however, makes it possible to apply a common set of endpoint definitions and analysis rules across all studies.
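As an illustration, a common fixed-horizon recurrence endpoint can be re-derived from harmonized follow-up time and event status. The rule below is one simple convention chosen for this sketch, not necessarily the definition used in the project, and the function name and arguments are hypothetical.

```python
def derive_recurrence_endpoint(time_months, event, horizon_months):
    """Derive a binary recurrence endpoint at a fixed horizon.

    Returns 1 if recurrence occurred by the horizon, 0 if the patient was
    followed to the horizon without recurrence, and None if the patient was
    censored before the horizon (status at the horizon is unknown).
    """
    if event == 1 and time_months <= horizon_months:
        return 1
    if time_months >= horizon_months:
        # Covers recurrence-free follow-up and recurrences after the horizon.
        return 0
    return None


# The same harmonized record evaluated under both horizons.
record = {"time_months": 50.0, "event": 1}
endpoint_3y = derive_recurrence_endpoint(record["time_months"], record["event"], 36)
endpoint_5y = derive_recurrence_endpoint(record["time_months"], record["event"], 60)
```

A patient with a recurrence at 50 months is recurrence-free under the 3-year definition and an event under the 5-year definition, which is why the horizon has to be applied uniformly before studies are compared. Patients returned as `None` need explicit handling. Simply dropping them can bias the estimated risk, and time-to-event methods are the usual alternative.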

Predictive Modelling on the Integrated Dataset

A study-level meta-analysis is limited to estimating overall treatment effects and cannot support patient-level prediction.

In contrast, an integrated patient-level database enables the development of prognostic models tailored to individual patient characteristics.

Using the harmonized dataset, separate predictive models were developed for 3-year and 5-year outcomes.

This approach transforms historical trial data from a source of average treatment estimates into a platform for individualized risk prediction.

The models demonstrated acceptable discriminative performance:

3-year endpoint: AUC ≈ 0.84
5-year endpoint: AUC ≈ 0.78

Calibration analyses further showed that the predicted risks closely matched the observed event rates across risk groups.
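For readers who want to see what these two checks involve, the sketch below computes the AUC as the probability that a randomly chosen patient with the event receives a higher predicted risk than a randomly chosen patient without it, and builds a simple calibration table by risk group. The function names and the toy data are hypothetical, and the predictions are not from the models described here.

```python
def auc(y_true, y_score):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_by_group(y_true, y_score, n_groups=3):
    """Return (mean predicted risk, observed event rate, n) per risk group."""
    pairs = sorted(zip(y_score, y_true))  # order patients by predicted risk
    size = len(pairs) / n_groups
    table = []
    for g in range(n_groups):
        chunk = pairs[int(g * size):int((g + 1) * size)]
        if chunk:
            mean_pred = sum(s for s, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            table.append((mean_pred, observed, len(chunk)))
    return table


# Toy data for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(y_true, y_score)  # 3 of 4 pairs ranked correctly -> 0.75
toy_calibration = calibration_by_group(y_true, y_score, n_groups=2)
```

A well-calibrated model shows mean predicted risk close to the observed event rate in every group. The pairwise AUC loop is quadratic in the number of patients, so a rank-based implementation would be preferable for large datasets.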

These findings demonstrate that an integrated patient-level database is valuable not only for retrospective evidence synthesis but also as a foundation for predictive analytics and individualized risk assessment.

What Does Patient-Level Harmonization Add Beyond Meta-Analysis?

Meta-analysis and patient-level harmonization serve different scientific objectives.

The primary purpose of a meta-analysis is to summarize treatment effects across studies.

The primary purpose of a harmonized patient-level database is to maximize the long-term value and reusability of clinical evidence.

A study-level meta-analysis provides:

Pooled treatment effect estimates
Forest plot–based evidence synthesis

An integrated patient-level database provides all of the above and additionally enables:

Endpoint re-derivation using unified definitions
Application of common clinical criteria across studies
Patient-level regression modeling
Predictive analytics and risk stratification
Reuse of the database for future scientific questions

In this sense, patient-level harmonization transforms legacy clinical trial data from a static evidence source into a reusable analytical asset that continues to generate value long after the original studies have been completed.