Fairness Assessment in Clinical Prediction Models

Fairness diagnostics provide a structured framework for examining whether model performance varies across protected or clinically important patient groups.

Jul 10, 2025

Introduction

Modern clinical prediction models are increasingly used to support medical decision-making, risk stratification, patient screening and resource allocation. Model development traditionally focuses on predictive performance measures such as discrimination and calibration. Metrics such as AUC, sensitivity, specificity or predictive values are routinely reported and often serve as the primary criteria for model evaluation.

However, acceptable overall performance does not necessarily imply that a model behaves similarly across different patient groups. Two models with nearly identical overall accuracy may produce substantially different results when evaluated separately in clinically relevant subpopulations.

Fairness assessment aims to identify and quantify these differences. Rather than assuming that a model is unbiased, fairness diagnostics provide a structured framework for examining whether model performance varies across protected or clinically important patient groups.

Objectives

The objectives of this analysis are:

Develop predictive models for cardiovascular risk.
Compare overall and subgroup-specific performance.
Evaluate differences between male and female patient populations.
Apply fairness diagnostics using the R ecosystem.
Demonstrate how fairness metrics complement traditional model evaluation.

The analysis follows the methodology presented in the fairmodels package documentation and extends it using a real clinical teaching dataset.

The analysis uses the Framingham cardiovascular dataset distributed with the R package riskCommunicator.

The objective is not to develop a production-ready clinical model but to illustrate fairness assessment concepts using realistic clinical data.

Predictive Models

Two independent prediction models were developed:

Logistic Regression: A conventional generalized linear model representing a transparent and interpretable baseline approach.
Random Forest: A machine learning model capable of capturing non-linear relationships and complex interactions among predictors.

Both models were trained on identical training data and evaluated on a separate test dataset.
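
As a minimal sketch of how two models can be given identical training data, the row indices can be split once and reused for every model. The example below is plain Python rather than the R code used for the analysis; the function name `split_indices`, the 30% test fraction, the seed and the row count are all illustrative assumptions, since the article does not state how the split was made.

```python
import random


def split_indices(n_rows, test_fraction=0.3, seed=2025):
    """Return (train_idx, test_idx) from one reproducible shuffle.

    test_fraction and seed are illustrative assumptions, not values
    taken from the article.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Split once, then pass the same index lists to every model being compared.
train_idx, test_idx = split_indices(n_rows=1000)
print(len(train_idx), len(test_idx))
```

Because the split is computed once and keyed by a seed, any difference between the two models on the test rows cannot be attributed to different training or evaluation samples.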

Traditional Performance Evaluation

The results demonstrate that model behaviour is not identical across patient groups.

Several metrics show meaningful differences between male and female patients despite comparable overall model performance.

This observation highlights an important limitation of relying solely on aggregate performance measures.
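
To make this limitation concrete, the following sketch computes confusion-matrix metrics separately for each group. It is written in plain Python for illustration and is not the code behind the article's results; the function name `group_confusion_metrics` and the toy labels are invented so that both groups share the same accuracy while differing in sensitivity.

```python
from collections import defaultdict


def group_confusion_metrics(y_true, y_pred, groups):
    """Compute sensitivity, specificity, PPV and accuracy per group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[group][key] += 1

    def ratio(numerator, denominator):
        # Undefined metrics (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    metrics = {}
    for group, c in counts.items():
        metrics[group] = {
            "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
            "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),
            "accuracy": ratio(c["tp"] + c["tn"], sum(c.values())),
        }
    return metrics


# Invented toy data: 8 male and 8 female patients, 4 events in each group.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] * 2
y_pred = [1, 1, 1, 0, 0, 0, 0, 1] + [1, 1, 0, 0, 0, 0, 0, 0]
sex = ["male"] * 8 + ["female"] * 8

metrics_by_sex = group_confusion_metrics(y_true, y_pred, sex)
for group, values in metrics_by_sex.items():
    print(group, values)
```

In this toy data both groups have an accuracy of 0.75, yet sensitivity is 0.75 for male patients and 0.50 for female patients, which is the kind of difference an aggregate figure hides.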

Why Fairness Assessment?

Traditional performance metrics answer questions such as:

How accurate is the model?
How well does it discriminate?
How many events are correctly identified?

Fairness diagnostics ask a different question: Does the model perform similarly across different patient groups?

This question matters because clinical prediction models may be deployed across diverse populations where unequal performance could have practical consequences. Fairness assessment therefore focuses on relative differences between groups rather than on absolute model performance alone.

Fairness Diagnostics

The analysis uses the R package fairmodels, which builds upon the DALEX explainability framework.

Rather than introducing entirely new performance measures, fairness diagnostics reorganize familiar quantities (true positive rate, false positive rate, predictive values and overall accuracy) into parity metrics that compare protected and privileged groups.

Each parity metric is expressed as a ratio between groups, so values close to 1 indicate similar behaviour across groups, whereas values well below or above 1 suggest subgroup differences.
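
The ratio idea can be sketched in a few lines of plain Python. This is an illustration of the arithmetic, not the fairmodels implementation; the helper names `parity_ratios` and `outside_band`, the metric values and the choice of "male" as the reference group are invented for the example. The acceptance band of 0.8 to 1.25 follows the commonly used four-fifths rule and is an assumption here, not a threshold stated in the article.

```python
def parity_ratios(group_metrics, privileged):
    """Divide each group's metrics by the privileged group's metrics."""
    reference = group_metrics[privileged]
    ratios = {}
    for group, metrics in group_metrics.items():
        if group == privileged:
            continue
        ratios[group] = {
            # A zero reference value makes the ratio undefined (NaN).
            name: value / reference[name] if reference[name] else float("nan")
            for name, value in metrics.items()
        }
    return ratios


def outside_band(ratios, epsilon=0.8):
    """List metrics whose ratio falls outside [epsilon, 1 / epsilon].

    NaN ratios fail the comparison and are therefore listed as well.
    """
    lower, upper = epsilon, 1 / epsilon
    return {
        group: sorted(n for n, r in values.items() if not lower <= r <= upper)
        for group, values in ratios.items()
    }


# Invented metric values, for illustration only.
group_metrics = {
    "male": {"tpr": 0.80, "fpr": 0.20, "ppv": 0.50, "acc": 0.80},
    "female": {"tpr": 0.60, "fpr": 0.18, "ppv": 0.55, "acc": 0.82},
}

ratios = parity_ratios(group_metrics, privileged="male")
flagged = outside_band(ratios)
print(ratios)
print(flagged)
```

With these invented values only the true positive rate ratio (about 0.75) falls outside the band, while the false positive rate, predictive value and accuracy ratios stay close to 1.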

Comparing Alternative Models

Fairness assessment is particularly useful when comparing multiple candidate models.

In this study, logistic regression and random forest models produced similar overall conclusions. The fairness heatmap provides a compact overview of subgroup behaviour across multiple fairness metrics simultaneously.

The heatmap demonstrates that fairness evaluation can be incorporated into routine model comparison workflows rather than being treated as a separate exercise.
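
A text-only sketch of such a side-by-side comparison is shown below. It is plain Python and does not reproduce the heatmap itself; the parity ratios are invented, and the distance from parity is measured as the absolute log ratio, an illustrative choice that treats ratios of 0.8 and 1.25 as equally far from 1. The names `parity_loss` and `comparison_table` are hypothetical helpers.

```python
import math


def parity_loss(ratio):
    """Symmetric distance from parity: 0 at ratio 1, equal for r and 1/r."""
    return abs(math.log(ratio))


def comparison_table(model_ratios):
    """Format one row per model and one column per parity metric."""
    metrics = sorted({name for r in model_ratios.values() for name in r})
    rows = ["model".ljust(22) + "".join(name.rjust(8) for name in metrics)]
    for model, ratios in model_ratios.items():
        cells = "".join(f"{parity_loss(ratios[name]):8.3f}" for name in metrics)
        rows.append(model.ljust(22) + cells)
    return "\n".join(rows)


# Invented female-to-male parity ratios, for illustration only.
model_ratios = {
    "logistic_regression": {"tpr": 0.75, "fpr": 0.90, "ppv": 1.10, "acc": 1.02},
    "random_forest": {"tpr": 0.85, "fpr": 1.30, "ppv": 0.95, "acc": 1.01},
}

table = comparison_table(model_ratios)
print(table)
```

Larger cell values mark the metric and model combinations that sit furthest from parity, so the table can be read alongside the usual overall performance summary.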

Conclusions

This example demonstrates how fairness assessment can complement traditional clinical prediction model evaluation.

Key observations include:

Overall model performance may conceal subgroup-specific behaviour.
Standard metrics such as sensitivity, specificity and predictive values can vary substantially between patient groups.
Fairness diagnostics provide a structured framework for identifying and quantifying these differences.
Fairness assessment can be integrated into existing model development workflows using open-source R tools.

As predictive models become increasingly common in healthcare and life sciences, subgroup-level performance evaluation will likely become an important component of model governance, validation and regulatory review.

References

Framingham Dataset
Framingham Heart Study teaching dataset distributed in the riskCommunicator R package.

fairmodels Package
Basic Tutorial: https://cran.r-project.org/web/packages/fairmodels/vignettes/Basic_tutorial.html
Advanced Tutorial: https://cran.r-project.org/web/packages/fairmodels/vignettes/Advanced_tutorial.html

DALEX
Biecek P. et al. Explainable Machine Learning and Model Auditing in R.