Machine Learning: Diagnosis of COVID-19 based on Lab Tests

Introduction


COVID-19 needs no introduction. It has affected every aspect of our daily lives, at both the micro and the macro level.


Kaggle users have published a myriad of COVID-19 datasets in response to the pandemic, to help the medical community answer high-priority scientific questions. The dataset used for this experiment contains anonymized data from patients admitted to the Hospital Israelita Albert Einstein in São Paulo, Brazil. The samples behind the dataset were collected to perform the SARS-CoV-2 RT-PCR test and additional laboratory tests.


Purpose


The objective is to use our AugustAi machine learning platform to build a predictive model that detects COVID-19 positive cases among patients based on their laboratory tests, and to generate insights and patient profiles using model explainability.


Summary of Findings

  • A patient’s age, Leukocytes, Monocytes, Eosinophils, and ward admission are among the most important features for predicting whether a patient tests positive for COVID-19.

  • The ML model has a train/cross-validation/test AUC of 0.72/0.66/0.66.

  • Top-percentile model predictions: in a test dataset of 2274 patients (236 COVID-19 positive), the top 1% of the model’s predictions (23 patients) contains ~12 positive patients. A patient in that top percentile is therefore ~5 times more likely to be COVID-19 positive than a patient picked at random from the test set. This corresponds to a capture rate of about 5% (12/236 confirmed cases).

  • By way of model explainability, a typical COVID-19 positive patient (compared to a COVID-19 negative patient) has been admitted to a regular ward, is older, and has a higher Monocytes value and a lower Eosinophils value.

Data


The data was split 60/40 into training and test sets.
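As a rough illustration of a 60/40 split, the sketch below shuffles rows with a fixed seed and cuts the list at 60%. The function name `split_train_test`, the seed, and the placeholder patient identifiers are assumptions made for this example; the post does not say whether the platform splits randomly, stratifies by class, or uses another scheme.

```python
import random

def split_train_test(rows, train_fraction=0.6, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists.

    Assumption: a simple random split; the original platform may stratify.
    """
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers standing in for patient records.
patients = list(range(100))
train, test = split_train_test(patients)
print(len(train), len(test))  # 60 40
```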




Correlation Analysis


Correlation analysis with a threshold of 0.9 identified the highly correlated features in the dataset, shown below. The results can be found here.




Highly correlated features


Only one feature from each set of highly correlated features was retained in the dataset. The features below were removed.

  • po2_venous_blood_gas_analysis

  • hco3_arterial_blood_gas_analysis

  • indirect_bilirubin

  • ph_arterial_blood_gas_analysis

  • mean_corpuscular_volume_mcv

  • base_excess_venous_blood_gas_analysis

  • hemoglobin

  • hco3_venous_blood_gas_analysis
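The correlation filter described in this section can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: it assumes Pearson correlation, compares the absolute value against the 0.9 threshold, and keeps whichever feature of a correlated pair comes first. The feature names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)

def correlated_features_to_drop(columns, threshold=0.9):
    """Return the set of column names to drop.

    Assumption: for each pair with |r| > threshold, the earlier column
    is kept and the later one is dropped.
    """
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(b)
    return dropped

# Hypothetical toy data: feature_b closely tracks feature_a, feature_c does not.
columns = {
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [1.1, 2.0, 3.2, 3.9, 5.1],
    "feature_c": [5, 1, 4, 2, 3],
}
print(correlated_features_to_drop(columns))  # {'feature_b'}
```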

Feature Selection


Feature selection yielded the following results:




Model Grid Search


Details of the results can be found here.

  • The model grid search parameters:

  • GBM was the algorithm of choice for the model.

  • The COVID-19 negative class was under-sampled by 50% (3048 -> 1524) because the classes were highly imbalanced, with the COVID-19 positive class making up approximately 10.5% of the data.

  • The top 27 features were used for the model build.

  • A model grid search was carried out with 3-fold cross-validation.

  • AUC was the metric of choice.
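The under-sampling step in the list above can be illustrated with a short sketch. It keeps every positive row and a random half of the negative rows. The negative count of 3048 comes from the post; the positive count of 358 is an assumption chosen only so that positives make up roughly 10.5% of the rows, and the function name and seed are likewise illustrative.

```python
import random

def undersample_negatives(labels, keep_fraction=0.5, seed=0):
    """Return sorted row indices: all positives plus a random share of negatives."""
    rng = random.Random(seed)
    negatives = [i for i, y in enumerate(labels) if y == 0]
    positives = [i for i, y in enumerate(labels) if y == 1]
    n_keep = round(len(negatives) * keep_fraction)
    kept_negatives = rng.sample(negatives, n_keep)  # sampling without replacement
    return sorted(positives + kept_negatives)

# 3048 negatives as reported; 358 positives is an illustrative assumption.
labels = [0] * 3048 + [1] * 358
kept = undersample_negatives(labels)
n_neg = sum(1 for i in kept if labels[i] == 0)
n_pos = sum(1 for i in kept if labels[i] == 1)
print(n_neg, n_pos)  # 1524 358
```

Under that assumed positive count, the positive share of the training rows rises from about 10.5% to roughly 19% after under-sampling.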


AUC
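For readers less familiar with the metric, AUC can be read as the probability that the model scores a randomly chosen positive patient higher than a randomly chosen negative one. The sketch below computes it directly from that definition, with ties counted as half. It is an illustration on made-up labels and scores, not the platform's code, and the pairwise loop is only practical for small inputs.

```python
def auc(labels, scores):
    """AUC as P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 of the 4 positive/negative pairs are ranked correctly.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(example_labels, example_scores))  # 0.75
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported test AUC of 0.66 in context.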




Test Data Confusion Matrix


Test Data Lift and Capture Rate
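The lift and capture rate quoted in the summary follow from four numbers: the test set size (2274), the number of positives (236), the size of the top percentile (23), and the approximate number of positives in it (~12). The sketch below reproduces that arithmetic; the function name is an assumption, and because the post gives the count of 12 as approximate, the results are approximate too.

```python
def lift_and_capture(total, total_positive, top_n, top_positive):
    """Return (lift, capture_rate) for the top_n highest-scored patients."""
    base_rate = total_positive / total        # positive rate in the whole test set
    top_rate = top_positive / top_n           # positive rate among the top-scored
    lift = top_rate / base_rate               # how much denser positives are at the top
    capture_rate = top_positive / total_positive  # share of all positives found
    return lift, capture_rate

lift, capture_rate = lift_and_capture(
    total=2274, total_positive=236, top_n=23, top_positive=12
)
print(round(lift, 1), round(capture_rate * 100, 1))  # 5.0 5.1
```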




Model Explainability


The COVID-19 negative and positive patients with the highest model probability for their respective class were selected to build patient profiles. Details of the analysis can be found here.


COVID-19 Negative Patient Profile

  • Normally not admitted to a hospital ward

  • In an age group between 1 and 2 (group range of 0–19), estimated at 5–10 years old. (This patient’s age is ~10 years old)

  • Rhinovirus/Enterovirus is detected

  • Monocytes value of less than or equal to 0 (patient had a value of 0)

  • Eosinophils value of between -0.02 and 0 (patient had a value of 0)




COVID-19 Positive Patient Profile

  • Normally admitted to a hospital ward

  • In an age group larger than 9.25 (group range of 0–19), estimated at 100 years / 20 groups × 9.25 ≈ 46 years. (This patient’s age is ~60 years old)

  • Rhinovirus/Enterovirus is not detected

  • Monocytes value of more than 0 (patient had a value of 0.49)

  • Eosinophils value of less than or equal to -0.02 (patient had a value of -0.46)
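The age estimates in both profiles use the same conversion: the anonymized age groups run from 0 to 19, and the post treats them as 20 equal bins covering roughly 100 years. A small sketch of that arithmetic follows; the function name is illustrative, and the 100-year span is the post's own approximation rather than a documented property of the dataset.

```python
def age_group_to_years(group, n_groups=20, max_age=100):
    """Convert an age-group value (0-19 scale) to approximate years.

    Assumption, taken from the post: 20 equal-width groups spanning ~100 years.
    """
    return group * max_age / n_groups

print(age_group_to_years(9.25))                      # 46.25
print(age_group_to_years(1), age_group_to_years(2))  # 5.0 10.0
```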




How


AugustAi is a fully automated ML platform that embodies the principles of DevOps and CI/CD (MLOps) for Machine Learning and AI in the structured data space. It simplifies and automates the end-to-end ML model build process (data preparation -> model training -> model deployment) by way of standardisation, consistency, versioning, speed and scale. All users need to do is provide the data and define the problem.
