Eugene Tan
Machine Learning: Diagnosis of COVID-19 based on Lab Tests
Introduction
COVID-19 needs no introduction. It goes without saying that it has impacted every aspect of our daily lives at a micro and macro level.
Kaggle users have published a myriad of COVID-19 datasets in response to the pandemic to help the medical community answer high-priority scientific questions. The dataset used for this experiment contains anonymized data from patients admitted to the Hospital Israelita Albert Einstein in São Paulo, Brazil. The samples in the dataset were collected to perform the SARS-CoV-2 RT-PCR and additional laboratory tests.
Purpose
The objective is to use our AugustAi machine learning platform to build a predictive model that detects COVID-19 positive cases among patients based on their laboratory tests, and to generate insights and patient profiles using model explainability.
Summary of Findings
A patient’s age, Leukocytes, Monocytes, Eosinophils, and ward admission are some of the important features that contribute to the model predicting a positive COVID-19 test.
The ML model built has a train/cross-validation/test AUC of 0.72/0.66/0.66.
Model predictions, top percentile: in a test dataset of 2274 patients (236 COVID-19 positive), a patient in the top 1% of the model’s predictions is ~5 times more likely to be COVID-19 positive than a patient picked at random. That amounts to ~12 positive patients among the top 1% of predictions (23 patients), or a capture rate of 5% (12/236 confirmed cases).
Based on model explainability, a typical COVID-19 positive patient (compared to a COVID-19 negative patient) has been admitted to a regular ward, is older, and has a higher Monocytes value and a lower Eosinophils value.
Data
The data was split 60/40 into training and test sets.
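A 60/40 split of this kind can be sketched as follows. The article does not say whether the split was stratified or which tool performed it, so the stratification by label, the fixed seed, and the helper name `stratified_split` are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_idx, test_idx), splitting each class by train_fraction.

    Hypothetical helper: stratifying keeps the share of positive cases
    roughly equal in both sets, which matters when positives are rare.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with 10% positives (illustrative, not the hospital data).
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With the toy labels above, both sets keep the same 10% share of positives, which is the property a stratified split is meant to preserve.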

Correlation Analysis
Using correlation analysis with a threshold of 0.9, the highly correlated features in the dataset are shown below. The results can be found here.

Highly correlated features
Only one feature from each set of highly correlated features was retained in the dataset. The features below were removed.
po2_venous_blood_gas_analysis
hco3_arterial_blood_gas_analysis
indirect_bilirubin
ph_arterial_blood_gas_analysis
mean_corpuscular_volume_mcv
base_excess_venous_blood_gas_analysis
hemoglobin
hco3_venous_blood_gas_analysis
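The filtering step described above can be illustrated with a small sketch. It assumes Pearson correlation, compares absolute values against the 0.9 threshold, and keeps whichever feature of a correlated group appears first; none of these choices is stated in the article. The feature names and values are invented, and missing-value handling is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep the first column of each highly correlated group, drop the rest.

    `columns` maps feature name -> list of values. Which member of a
    correlated group survives is an assumption here (first seen wins).
    """
    kept, dropped = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Toy columns (illustrative names and values, not the hospital data).
columns = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [1.1, 2.0, 3.2, 3.9, 5.1],  # nearly a copy of feature_a
    "feature_c": [0.3, -1.2, 0.8, 0.1, -0.5],
}
kept, dropped = drop_correlated(columns)
print("kept:", kept, "dropped:", dropped)
```

Here `feature_b` is dropped because it tracks `feature_a` almost exactly, while `feature_c` is only weakly correlated with it and is retained.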
Feature Selection
Feature selection yielded the following results:


Model Grid Search
Details of the results can be found here.
The model grid search parameters:
GBM was the algorithm of choice for the model.
The COVID-19 negative class was under-sampled by 50% (3048 -> 1524), as the classes were highly imbalanced, with the COVID-19 positive class making up approximately 10.5% of the data.
27 of the top features were used for the model build.
A model grid search was carried out with 3-fold cross-validation.
AUC was the metric of choice.
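The under-sampling step in the list above can be sketched as random sampling without replacement from the negative class. The article gives only the negative counts (3048 -> 1524); the positive count of 358 used below is an assumption derived from the ~10.5% positive rate, and the helper `undersample` and its seed are hypothetical.

```python
import random


def undersample(rows, labels, majority_label=0, keep_fraction=0.5, seed=0):
    """Randomly keep `keep_fraction` of the majority class and all other rows."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    others = [i for i, y in enumerate(labels) if y != majority_label]
    keep_n = int(len(majority) * keep_fraction)
    kept = sorted(rng.sample(majority, keep_n) + others)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# 3048 negatives as reported above; 358 positives is an assumed count
# consistent with a ~10.5% positive rate, not a figure from the article.
labels = [0] * 3048 + [1] * 358
rows = list(range(len(labels)))  # stand-in for the feature rows
rows_u, labels_u = undersample(rows, labels)

negatives = labels_u.count(0)
positive_rate = labels_u.count(1) / len(labels_u)
print(negatives, round(positive_rate, 3))
```

Under these assumed counts, halving the negative class raises the positive share from about 10.5% to about 19%, which gives the model more signal per training row at the cost of discarding some negative examples.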
AUC
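AUC can be read as the probability that a randomly chosen positive patient receives a higher model score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; the scores and labels are made up, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def auc(scores, labels):
    """AUC via pairwise comparison; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative scores: 3 positives and 3 negatives.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.1]
labels = [1, 0, 1, 1, 0, 0]
print(auc(scores, labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported test AUC of 0.66 indicates a modest but real ranking signal.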

Test Data Confusion Matrix

Test Data Lift and Capture Rate
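The lift and capture rate quoted in the Summary of Findings follow from simple ratios, reproduced in the sketch below. The function name `lift_and_capture` is hypothetical; the four input figures are the ones reported in the summary.

```python
def lift_and_capture(n_total, n_positive, n_selected, n_positive_selected):
    """Return (precision in the selected group, lift, capture rate)."""
    base_rate = n_positive / n_total                  # share of positives overall
    precision = n_positive_selected / n_selected      # share of positives in the top group
    lift = precision / base_rate                      # improvement over random selection
    capture_rate = n_positive_selected / n_positive   # share of all positives found
    return precision, lift, capture_rate


# Figures from the Summary of Findings: 2274 test patients, 236 positive,
# 23 patients in the top 1% of predictions, about 12 of them positive.
precision, lift, capture = lift_and_capture(2274, 236, 23, 12)
print(f"precision={precision:.2f} lift={lift:.1f} capture={capture:.1%}")
```

With these inputs, roughly half of the top 1% of patients are positive against a base rate of about 10%, which is where the ~5x lift comes from.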


Model Explainability
The COVID-19 negative and positive patients with the highest model probability for their respective class were selected to build patient profiles. Details of the analysis can be found here.
COVID-19 Negative Patient Profile
Normally not admitted to a hospital ward
In an age group between 1 and 2 (group range of 0–19), estimated at 5–10 years old. This patient’s age is ~10 years.
Rhinovirus/Enterovirus is detected
Monocytes value of less than or equal to 0 (patient had a value of 0)
Eosinophils value of between -0.02 and 0 (patient had a value of 0)


COVID-19 Positive Patient Profile
Normally admitted to a hospital ward
In an age group above 9.25 (group range of 0–19), estimated at 100 years / 20 groups × 9.25 ≈ 46 years. (This patient’s age is ~60 years.)
Rhinovirus/Enterovirus is not detected
Monocytes value of more than 0 (patient had a value of 0.49)
Eosinophils value of less than or equal to -0.02 (patient had a value of -0.46)


How
AugustAi is a fully automated ML platform that embodies the principles of DevOps and CI/CD (MLOps) for Machine Learning and AI in the structured data space. It simplifies and automates the end-to-end ML model build process (data preparation -> model training -> model deployment) by way of standardisation, consistency, versioning, speed and scale. All users need to do is provide the data and define the problem.