Handling missing data in cardiovascular prediction models in real life

Improving prevention of cardiovascular disease: from proof of principle to implementation readiness of a live cardiovascular risk management dashboard in the electronic health record in routine clinical practice.

In this project, we investigated traditional statistical and modern machine learning (ML) methods for handling missing predictor data when applying prediction models for cardiovascular diseases in real-time medical settings, and evaluated how well ML-based prediction model studies follow the recommendations on missing data in existing reporting guidelines. We show that a majority of clinical prediction model studies that use ML techniques do not report sufficient information on the presence and handling of missing data, even though missing values are highly common in the routine healthcare data that often form the basis of ML prediction model studies. We also show that ML-based prediction model studies adhered poorly to the current guideline, Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD).

We further developed novel real-time imputation methods for missing predictor values using either conditional modelling imputation (CMI, where a multivariable imputation model is derived for each predictor from a population) or joint modelling imputation (JMI, where we use a multivariate normal approximation to generate patient-specific imputations). JMI, especially with auxiliary variables (i.e., variables not part of the prediction model), was found to be preferable for real-time imputation of missing predictor values. We also compared various ML modelling techniques that deal with missing predictor values. Surrogate splits were found to perform poorly, whilst pattern submodels showed good performance only when paired with a specific modelling technique. Overall, JMI remains the preferred approach, provided multiple imputations are used. We also describe why internal-external cross-validation (IECV) is the preferred way to assess the generalizability of prediction models during their development and to identify whether complex modelling strategies offer any advantages. Briefly, IECV evaluates model performance in non-random hold-out samples of individuals from different settings or populations.
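To illustrate the idea behind JMI, the sketch below draws multiple patient-specific imputations from a multivariate normal distribution conditioned on the values that were observed for that patient. It is a minimal illustration of the general technique, not the project's implementation: the function name `impute_mvn`, the three predictors, and all numbers in `mu` and `sigma` are hypothetical, and the population mean and covariance are assumed to have been estimated beforehand. Auxiliary variables would simply be additional entries in `mu` and `sigma`.

```python
import numpy as np


def impute_mvn(x, mu, sigma, n_draws=5, rng=None):
    """Draw imputations for the NaN entries of one patient's vector x.

    mu and sigma are the (assumed, pre-estimated) population mean vector
    and covariance matrix. Returns an array of shape (n_draws, len(x)).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    miss = np.isnan(x)
    obs = ~miss
    draws = np.tile(x, (n_draws, 1))
    if not miss.any():
        return draws  # nothing to impute

    s_mm = sigma[np.ix_(miss, miss)]
    if obs.any():
        s_oo = sigma[np.ix_(obs, obs)]
        s_mo = sigma[np.ix_(miss, obs)]
        # w = Sigma_mo * Sigma_oo^{-1}, computed with a solve instead of an inverse
        w = np.linalg.solve(s_oo, s_mo.T).T
        cond_mean = mu[miss] + w @ (x[obs] - mu[obs])
        cond_cov = s_mm - w @ s_mo.T
    else:
        # No observed values: fall back to the marginal distribution
        cond_mean, cond_cov = mu[miss], s_mm

    cond_cov = (cond_cov + cond_cov.T) / 2  # remove numerical asymmetry
    draws[:, miss] = rng.multivariate_normal(cond_mean, cond_cov, size=n_draws)
    return draws


# Hypothetical population parameters: age, systolic blood pressure, total cholesterol
mu = np.array([60.0, 135.0, 5.2])
sigma = np.array([
    [100.0, 60.0, 1.0],
    [60.0, 400.0, 2.0],
    [1.0, 2.0, 1.0],
])

# A patient whose blood pressure is missing at the time of prediction
patient = np.array([72.0, np.nan, 6.1])
imputations = impute_mvn(patient, mu, sigma, n_draws=5, rng=np.random.default_rng(0))
print(imputations)
```

Each row of `imputations` is one completed copy of the patient's predictors. In a multiple-imputation workflow, the prediction model would be applied to every row and the resulting risk estimates combined, for example by averaging.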

Lastly, we present a pilot study showing that real-time imputation of missing predictor values was found to be acceptable to potential users of CVD prediction models. The findings are reported in the PhD thesis of Steven Nijman, to be defended in 2022.

Summary
We set out to explore innovative big data analytics for developing flexible, improved risk prediction algorithms that estimate cardiovascular risk and can be applied ‘live’ in routine clinical practice to improve cardiovascular prevention. We show solutions for achieving precise prediction with (flexible) prediction algorithms and work out potential ways of remediating methodological issues such as data quality and missing data. Furthermore, we provide a framework for integrating prediction models into clinical practice, including methods that solve the problem of missing relevant patient data.
Technology Readiness Level (TRL)
3 - 7
Time period
38.5 months
Partners