Machine Learning: Diagnosis of COVID-19 based on Lab Tests


COVID-19 needs no introduction. It goes without saying that it has impacted every aspect of our daily lives at a micro and macro level.

Kaggle users have published a myriad of COVID-19 datasets as a response to the pandemic to help the medical community develop answers to high priority scientific questions. The dataset used for this experiment contains anonymized data from patients admitted at the Hospital Israelita Albert Einstein, in São Paulo, Brazil. The samples related to the dataset were collected to perform the SARS-CoV-2 RT-PCR and additional laboratory tests.


The objective is to use our AugustAi machine learning platform to build a predictive model that detects COVID-19 positive cases among patients based on their laboratory tests, and to generate insights and patient profiles using model explainability.

Summary of Findings

  • A patient’s age, Leukocytes, Monocytes, Eosinophils, and ward admission status are among the most important features for predicting whether a patient tests positive for COVID-19.

  • The ML model has a train/cross-validation/test AUC of 0.72/0.66/0.66.

  • Model predictions, top percentile: in a test dataset of 2274 patients (236 COVID-19 positive), a patient in the top 1% of the model’s predictions is ~5 times more likely to be COVID-19 positive than a patient picked at random. That amounts to ~12 positive patients among the top 1% of the model’s predictions (23 patients), or a capture rate of 5% (12/236 confirmed cases).

  • By way of model explainability, a typical COVID-19 positive patient (compared to a COVID-19 negative patient) is one who has been admitted to a regular ward, is older, has a higher Monocytes value, and has a lower Eosinophils value.
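The top-percentile figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name and its parameters are hypothetical, and the inputs are the counts quoted in the summary (2274 test patients, 236 positive, ~12 positives in the top 1%).

```python
def top_percentile_stats(n_total, n_positive, top_fraction, positives_in_top):
    """Return (top_n, precision, lift, capture_rate) for the top slice of predictions."""
    top_n = round(n_total * top_fraction)         # patients in the top slice
    base_rate = n_positive / n_total              # prevalence in the whole test set
    precision = positives_in_top / top_n          # share of positives inside the slice
    lift = precision / base_rate                  # improvement over random selection
    capture_rate = positives_in_top / n_positive  # share of all positives in the slice
    return top_n, precision, lift, capture_rate


# Counts quoted in the summary above
top_n, precision, lift, capture = top_percentile_stats(2274, 236, 0.01, 12)
print(top_n, round(precision, 2), round(lift, 1), round(capture, 3))
# prints: 23 0.52 5.0 0.051
```

Lift compares the positive rate inside the top slice with the overall positive rate, while the capture rate measures how many of all confirmed cases fall inside that slice.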


The data was split 60/40 into training and test sets.

Correlation Analysis

Correlation analysis with a threshold of 0.9 identified the highly correlated features in the dataset, shown below. The results can be found here.

Highly correlated features

Only one feature from each set of highly correlated features was retained in the dataset. The features below were removed.

  • po2_venous_blood_gas_analysis

  • hco3_arterial_blood_gas_analysis

  • indirect_bilirubin

  • ph_arterial_blood_gas_analysis

  • mean_corpuscular_volume_mcv

  • base_excess_venous_blood_gas_analysis

  • hemoglobin

  • hco3_venous_blood_gas_analysis
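As an illustration of the idea, a correlation filter with a 0.9 threshold might look like the sketch below. This is not the AugustAi implementation: the helper names, the toy columns, and the rule of keeping the first feature seen in each correlated group are all assumptions, and the sketch assumes complete numeric columns with no missing values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Keep the first feature of each highly correlated group; return (kept, dropped)."""
    kept, dropped = [], []
    for name, values in columns.items():
        # Drop the feature if it is highly correlated with any feature already kept
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Toy data with hypothetical column names and values
toy = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * feature_a
    "feature_c": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related to feature_a
}
kept, dropped = drop_correlated(toy)
print(kept, dropped)
```

Which member of a correlated group is kept depends on column order in this sketch; a real pipeline might instead keep the feature with fewer missing values or higher importance.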

Feature Selection

Feature selection yielded the following results:

Model Grid Search

Details of the results can be found here.

  • The model grid search parameters:

  • GBM was the algorithm of choice for the model.

  • The COVID-19 negative class was under-sampled by 50% (3048 -> 1524), as the classes were highly imbalanced, with the COVID-19 positive class at approximately 10.5%.

  • 27 of the top features were used for the model build.

  • A model grid search was carried out with 3-fold cross-validation.

  • AUC was the metric of choice.
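The under-sampling step in the list above can be sketched as follows. This is a minimal illustration, not the platform's own code: the function name, the `label` key, and the seed are hypothetical, and the positive count of 358 is only an estimate derived from the quoted figures (3048 negatives at roughly 10.5% positives).

```python
import random


def undersample_negatives(rows, label_key="label", keep_fraction=0.5, seed=0):
    """Randomly keep a fraction of the negative rows and every positive row."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n_keep = round(len(negatives) * keep_fraction)
    sampled = rng.sample(negatives, n_keep)  # sampling without replacement
    balanced = positives + sampled
    rng.shuffle(balanced)                    # avoid ordering rows by class
    return balanced


# Synthetic rows: 3048 negatives as in the text, positive count estimated
rows = [{"label": 0} for _ in range(3048)] + [{"label": 1} for _ in range(358)]
balanced = undersample_negatives(rows)
n_neg = sum(1 for r in balanced if r["label"] == 0)
n_pos = len(balanced) - n_neg
print(n_neg, n_pos, round(n_pos / len(balanced), 3))
```

Under these assumptions, halving the negative class raises the positive share of the training data from about 10.5% to roughly 19%. Under-sampling is normally applied to the training data only, so that test metrics reflect the original class balance.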


Test Data Confusion Matrix

Test Data Lift and Capture Rate

Model Explainability

The COVID-19 negative and positive patients with the highest model probability for their respective class were selected to build patient profiles. Details of the analysis can be found here.

COVID-19 Negative Patient Profile

  • Normally not admitted to a hospital ward

  • In an age group between 1 and 2 (group range of 0–19), estimated at 5–10 years old. This patient’s age is approximately 10 years.

  • Rhinovirus/Enterovirus is detected

  • Monocytes value of less than or equal to 0 (patient had a value of 0)

  • Eosinophils value of between -0.02 and 0 (patient had a value of 0)

COVID-19 Positive Patient Profile

  • Normally admitted to a hospital ward

  • In an age group above 9.25 (group range of 0–19), estimated at 100 years / 20 groups × 9.25 ≈ 46 years. (This patient’s age is ~60 years.)

  • Rhinovirus/Enterovirus is not detected

  • Monocytes value of more than 0 (patient had a value of 0.49)

  • Eosinophils value of less than or equal to -0.02 (patient had a value of -0.46)
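The age estimates in both profiles follow from the same conversion: 20 age groups (0–19) assumed to span about 100 years, or 5 years per group. A small sketch of that arithmetic, with hypothetical names and the 100-year span taken from the estimate above:

```python
YEARS_SPANNED = 100  # assumption taken from the article's own estimate
N_GROUPS = 20        # age groups 0-19


def group_to_years(group_value):
    """Convert an age-group value to an approximate age in years."""
    return group_value * YEARS_SPANNED / N_GROUPS


print(group_to_years(9.25))                  # positive profile threshold: 46.25
print(group_to_years(1), group_to_years(2))  # negative profile range: 5.0 10.0
```

The true width of each age group is not published with the anonymized data, so these ages are rough estimates only.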


AugustAi is a fully automated ML platform that embodies the principles of DevOps and CI/CD (MLOps) for Machine Learning and AI in the structured data space. It simplifies and automates the end-to-end ML model build process (data preparation -> model training -> model deployment) by way of standardisation, consistency, versioning, speed and scale. All users need to do is provide the data and define the problem.
