A BMJ study has concluded that further scrutiny is needed before clinical decisions for patients are made by machine learning.

Last year, Prime Minister Boris Johnson pledged £250m to expand the use of artificial intelligence in the NHS, a move announced as an innovative bid to spot early signs of cancer and dementia. However, researchers have found that different machine learning models perform similarly to traditional statistical models and are similarly uncertain in their risk predictions for individual patients.

How useful is machine learning in predicting risk?

Researchers in the UK, China, and the Netherlands assessed the reliability of twelve machine learning models and seven statistical techniques in predicting individual and population-level risks of cardiovascular disease, as well as the effects of censoring on risk predictions.

Previous research found that some AI models are adequate at predicting population-level risk but considerably less successful at predicting individual risk. Other studies have presented conflicting evidence, either supporting the use of machine learning models in clinical settings or arguing that they could lead to inappropriate actions.

When tested against data collected from general practices, hospital admissions, and mortality records, all of the models analysed showed similar population-level performance. However, different models predicted very different risks for the same patients, despite performing similarly overall.

For example, patients whose cardiovascular disease risk was predicted at 9.5-10.5% by the QRISK3 model were given risks of 2.9-9.2% and 2.4-7.2% by other models. A researcher wrote: 'Consequently, different treatment decisions could be made by arbitrarily selecting another modelling technique.'
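To see how the choice of model alone could change a decision, the short Python sketch below compares the risk ranges quoted above against a treatment threshold. The 10% threshold and the labels "other model A" and "other model B" are assumptions made purely for illustration; the article does not name the other models or state a threshold.

```python
# Hypothetical treatment threshold in percent (an assumption, not from the study).
TREATMENT_THRESHOLD = 10.0

# Risk ranges (percent) quoted in the article for the same group of patients.
# "other model A" and "other model B" are placeholder labels.
predicted_ranges = {
    "QRISK3": (9.5, 10.5),
    "other model A": (2.9, 9.2),
    "other model B": (2.4, 7.2),
}


def could_reach_threshold(risk_range, threshold=TREATMENT_THRESHOLD):
    """Return True if the upper end of the predicted range reaches the threshold."""
    _, high = risk_range
    return high >= threshold


decisions = {
    name: could_reach_threshold(risk_range)
    for name, risk_range in predicted_ranges.items()
}

for name, decision in decisions.items():
    print(f"{name}: could reach treatment threshold -> {decision}")
```

Under this assumed threshold, only the QRISK3 range could lead to treatment, so the same patients would be handled differently depending on which model happened to be chosen.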

These discrepancies were attributed to some models ignoring censoring (when patients are lost to follow-up during a study and the model assumes that they remain disease-free), which led to substantial underestimation of the risk of cardiovascular disease in a population.
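The effect of ignoring censoring can be illustrated with a small simulation. The Python sketch below is not the study's method: it generates a synthetic cohort using assumed event and dropout rates, then compares a naive estimate that counts every patient without an observed event as disease-free with a Kaplan-Meier estimate that accounts for patients leaving the study early. All function names and parameter values are hypothetical.

```python
import math
import random


def simulate_cohort(n, hazard, censor_rate, horizon, seed=0):
    """Return (observed_time, event_observed) pairs for n synthetic patients.

    Event and dropout times are drawn from exponential distributions
    (an assumption for illustration); follow-up stops at `horizon` years.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        event_time = rng.expovariate(hazard)
        dropout_time = rng.expovariate(censor_rate)
        observed_time = min(event_time, dropout_time, horizon)
        event_observed = event_time <= min(dropout_time, horizon)
        cohort.append((observed_time, event_observed))
    return cohort


def naive_risk(cohort):
    """Ignore censoring: anyone without an observed event counts as disease-free."""
    return sum(1 for _, event in cohort if event) / len(cohort)


def kaplan_meier_risk(cohort):
    """Account for censoring: 1 minus the Kaplan-Meier survival at end of follow-up."""
    survival = 1.0
    at_risk = len(cohort)
    # Process patients in time order; at tied times, events come before censorings.
    for _, event in sorted(cohort, key=lambda p: (p[0], not p[1])):
        if event:
            survival *= 1 - 1 / at_risk
        at_risk -= 1  # the patient leaves the risk set either way
    return 1 - survival


# Assumed parameters: 1% yearly event rate, 5% yearly dropout, 10-year horizon.
HAZARD, CENSOR_RATE, HORIZON = 0.01, 0.05, 10.0
cohort = simulate_cohort(20000, HAZARD, CENSOR_RATE, HORIZON)

true_risk = 1 - math.exp(-HAZARD * HORIZON)
naive = naive_risk(cohort)
km = kaplan_meier_risk(cohort)

print(f"true 10-year risk:      {true_risk:.3f}")
print(f"naive (ignores censor): {naive:.3f}")
print(f"Kaplan-Meier estimate:  {km:.3f}")
```

In this synthetic setting, the naive estimate comes out lower than the Kaplan-Meier estimate, because patients who dropped out before their event are silently counted as healthy. The size of the gap depends entirely on the assumed dropout rate.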

The researchers therefore concluded that consistency between models should be assessed before they are used to make clinical decisions for patients. In particular, models should not be used to determine long-term risk unless they account for censoring.