Tools and Training

Open‑source machine‑learning tools and flexible data‑science training opportunities.

Training in methodological skills for Prediction Modelling

Clinical prediction models can help healthcare professionals and patients make clinical decisions, with the aim of improving patient outcomes and quality of care. Developing and implementing an accurate, generalisable, and robust prediction model requires modern statistical and data science skills, in addition to clinical expertise and the involvement of clinicians, patients, and other stakeholders.

Virtual Data Training Centre

Funded by UK Research and Innovation, the NIHR Maudsley BRC prediction modelling group supports the development of an innovative, interactive online learning programme that provides a comprehensive and integrated understanding of the requirements of modern data science in health research. This virtual training centre will offer flexible online courses in R and Python programming, prediction modelling, machine learning/AI, and natural language processing. The online modules will be hosted on the NIHR Maudsley BRC CogStack servers. Modules will introduce a variety of large-scale data analysis techniques and train researchers to use them, so that they can process health record data for both research and real-world applications. The first modules will be launched in the summer of 2022. More information about the centre, which is part of the King's “Innovation Scholars programme: Big Data Skills Training”, can be found at Innovation Scholars: Big Data Training (Pillar 1) and here: Innovation Scholars Training

Applied Statistical Modelling and Health Informatics

Members of the BRC Prediction Modelling group also support the MSc, PG Dip, and PG Cert “Applied Statistical Modelling and Health Informatics”, which provides training in the core applied statistical methodology, machine learning, and computational methodology needed to develop clinical prediction models successfully. More information about the MSc can be found here.

Analytical tools for machine learning

Pipelines and tools to analyse big data sets using machine learning methods

Large healthcare datasets have a complex structure and distinctive characteristics. We develop open tools and pipelines based on modern machine learning and prediction modelling methods to facilitate their analysis.

A pipeline based on topological machine learning to identify homogeneous patients and relevant features

Dr Raquel Iniesta and Dr Ewan Carr developed a novel pipeline, built on recent advances in topological data analysis (TDA), to identify clusters of patients who are homogeneous with respect to a characteristic of interest. TDA is a growing field that provides tools to infer, analyse, and exploit the shape of data, and it has seen increasing adoption in recent years. It holds particular promise for precision medicine, where we often want to identify groups of patients with similar treatment or prognostic outcomes.

The pipeline focuses on Mapper, a clustering algorithm that identifies topological features in complex data and has shown great potential for uncovering homogeneous subgroups that share common characteristics. The tool combines and extends existing software implementations of the Mapper algorithm to provide several unique strengths, such as the integration of prior knowledge to inform the clustering process, the restriction of the cluster search to significant topological features, the use of the multivariable machine learning method XGBoost to describe cluster composition, and the ability to incorporate mixed data types. Details of the methodology and implementation, together with an application to clustering patients with major depression according to their chances of remission, are published in this paper (2021).
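To give a feel for what Mapper does, here is a minimal, self-contained Python sketch of the basic algorithm only. It is not the authors' pipeline and uses none of its extensions (prior knowledge, significance testing of topological features, XGBoost, mixed data types). All function names, the single-linkage clustering step, and the toy data are assumptions made for illustration. The sketch projects the data with a filter function, covers the filter range with overlapping intervals, clusters the points within each interval, and links clusters that share points.

```python
from itertools import combinations
from math import dist


def overlapping_intervals(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals sharing a fraction `overlap`."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(points, indices, eps):
    """Split `indices` into groups whose points are chained by distances <= eps."""
    clusters, unvisited = [], set(indices)
    while unvisited:
        seed = unvisited.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_fn, n_intervals, overlap, eps):
    """Return Mapper nodes (sets of point indices) and edges (pairs of node ids)."""
    values = [filter_fn(p) for p in points]
    tol = 1e-9  # guards against rounding at interval end points
    nodes = []
    for a, b in overlapping_intervals(min(values), max(values), n_intervals, overlap):
        members = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        nodes.extend(single_linkage(points, members, eps))
    # Two nodes are linked when they share at least one point (patient).
    edges = {(i, j) for i, j in combinations(range(len(nodes)), 2) if nodes[i] & nodes[j]}
    return nodes, edges


# Hypothetical toy data: two well-separated groups of "patients" on one feature.
points = [(float(x), 0.0) for x in range(6)] + [(float(x), 0.0) for x in range(20, 26)]
nodes, edges = mapper(points, filter_fn=lambda p: p[0], n_intervals=6, overlap=0.5, eps=1.5)
print(len(nodes), sorted(edges))
```

On this toy data the resulting graph has four nodes joined into two separate pairs, one pair per group, which is the kind of disconnected structure that suggests distinct subgroups. Real applications depend heavily on the choice of filter function, cover, and clustering method.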

Two videos introducing TDA and explaining the tool are on our BRC Prediction Modelling Presentation page and on YouTube at Introduction to TDA and Mapper pipeline presentation.

The pipeline can be downloaded at: https://github.com/kcl-bhi/mapper-pipeline

“dCVnet”: a user-friendly tool to develop regularized regression prediction models

Dr Andrew Lawrence developed “dCVnet”, an R wrapper for the glmnet package that implements regularized logistic regression with double (nested) cross-validation for internal validation, and made this easy-to-use tool available to the scientific and clinical community as an R package.

In contrast to traditional statistical methods, regularized regression allows the analysis of a large number of predictors relative to sample size. Regularization reduces overfitting by adding a penalty that constrains the magnitude of the regression coefficients. dCVnet provides a documented and standardized implementation of this particular machine learning pipeline, making it accessible to researchers who lack the programming experience required for more general machine learning software environments. Details about the methodology and an application to predicting recurrence of depression are published in Lawrence, A., Stahl, D. et al. (2022).
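The idea of double (nested) cross-validation can be illustrated with a short Python sketch. This is not dCVnet, which is an R package built on glmnet. It is a simplified stand-in that assumes a ridge (L2) penalty only, plain gradient descent, accuracy as the performance measure, and a hypothetical one-predictor dataset. All function names are illustrative. The inner loop chooses the penalty strength using the training data only, and the outer loop estimates how well the whole tuning-plus-fitting procedure performs on unseen data.

```python
import math


def predict_proba(model, x):
    w, b = model
    return 1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, x)))))


def fit_ridge_logistic(X, y, lam, steps=300, lr=0.1):
    """Gradient descent on mean log-loss + (lam / 2) * sum(w_j ** 2); intercept unpenalised."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba((w, b), xi) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        # The penalty term lam * w_j shrinks each coefficient towards zero.
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    return sum((predict_proba(model, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(y)


def folds(n, k):
    """Yield (train, test) index lists for k interleaved folds."""
    for f in range(k):
        test = list(range(f, n, k))
        held_out = set(test)
        yield [i for i in range(n) if i not in held_out], test


def cv_accuracy(X, y, lam, k):
    scores = []
    for train, test in folds(len(X), k):
        model = fit_ridge_logistic([X[i] for i in train], [y[i] for i in train], lam)
        scores.append(accuracy(model, [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / k


def nested_cv(X, y, lambdas, outer_k=5, inner_k=3):
    """Outer loop estimates performance; inner loop tunes lam on training data only."""
    outer_scores, chosen = [], []
    for train, test in folds(len(X), outer_k):
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        best = max(lambdas, key=lambda lam: cv_accuracy(X_tr, y_tr, lam, inner_k))
        model = fit_ridge_logistic(X_tr, y_tr, best)
        outer_scores.append(accuracy(model, [X[i] for i in test], [y[i] for i in test]))
        chosen.append(best)
    return sum(outer_scores) / outer_k, chosen


# Hypothetical data: one predictor, outcome is 1 when the predictor is positive.
X = [[(i - 19.5) / 10] for i in range(40)]
y = [1 if x[0] > 0 else 0 for x in X]
mean_accuracy, chosen_lambdas = nested_cv(X, y, lambdas=[0.01, 0.1, 1.0])
print(round(mean_accuracy, 3), chosen_lambdas)
```

Because the held-out outer fold is never used to choose the penalty, the outer accuracy is not biased by the tuning step. This is the reason for nesting the two loops.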

A video explaining the tool is on our BRC Prediction Modelling Presentation page and on YouTube.

The toolbox can be downloaded at: github.com/AndrewLawrence/dCVnet

“survcompare”: do I need a simple Cox Proportional Hazards model or a more flexible (but less transparent) machine learning method? An R package to investigate complexity of survival data.

The primary goal of the package is to assist researchers in making informed decisions regarding whether they should choose a flexible yet less transparent machine learning approach or employ a traditional linear method.

The package performs repeated nested cross-validation to validate the predictive performance of the Cox Proportional Hazards model (or its LASSO regularised extension) and of the Survival Random Forest, and tests whether the ensemble method outperforms the baseline Cox model. If there is no outperformance, the result can justify using the Cox model and indicates a negligible advantage from a more flexible model such as the Survival Random Forest. If there is outperformance, a researcher can 1) decide to use the more complex model, 2) look for interaction and non-linear terms that could be added to the baseline Cox model and re-run the test, or 3) consider still using the Cox model if the difference is not large in the context of the task, or not large enough to justify sacrificing model interpretability.
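The decision logic described above can be sketched in a few lines of Python. This is not the survcompare implementation, which is an R package. The function name, the thresholds, the normal-approximation test, and the scores are all hypothetical and serve only to illustrate the reasoning. The sketch assumes that each repetition of the cross-validation yields one performance score per model, with higher being better, and that the scores are paired by repetition.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def compare_models(cox_scores, ml_scores, alpha=0.05, min_gain=0.01):
    """Paired one-sided comparison of per-repetition validation scores.

    Assumes at least two repetitions and non-identical differences.
    Uses a normal approximation, which is an assumption of this sketch.
    """
    diffs = [m - c for c, m in zip(cox_scores, ml_scores)]
    gain = mean(diffs)
    z = gain / (stdev(diffs) / sqrt(len(diffs)))
    p_value = 1 - NormalDist().cdf(z)  # H1: the flexible model scores higher
    if p_value >= alpha:
        advice = "no evidence of outperformance: the Cox model is justified"
    elif gain < min_gain:
        advice = "small gain: the Cox model may still be preferable for interpretability"
    else:
        advice = "flexible model outperforms: use it, or add interaction/non-linear terms to Cox and re-test"
    return gain, p_value, advice


# Hypothetical validation scores from five repetitions.
cox = [0.70, 0.71, 0.69, 0.72, 0.70]
weak = compare_models(cox, [0.70, 0.72, 0.68, 0.72, 0.71])
strong = compare_models(cox, [0.76, 0.78, 0.75, 0.77, 0.77])
print(weak[2])
print(strong[2])
```

The `min_gain` threshold stands in for the judgement call in option 3: whether a statistically detectable improvement is large enough, for the task at hand, to outweigh the loss of interpretability.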

The package was developed by Dr Diana Shamsutdinova and Professor Daniel Stahl and is based on a collaboration with Dr Daniel Stamate and Dr Angus Roberts (see the Conference paper). It can be installed in RStudio with install.packages("survcompare") or downloaded from github.com/dianashams/survcompare.