Disease Prediction Based On Medical Side Symptoms

In this article, we discuss one of DOCTOR-Y's machine learning models. This model predicts a patient's current medical conditions from the symptoms associated with the previous diagnoses in the patient's medical history.

We used a dataset that lists diseases and their symptoms in a checker format (one Boolean column per symptom) and classified it using five different machine learning classifiers.

If you don't know what DOCTOR-Y is, check this post.

Idea

Physicians spend a lot of time reviewing a patient's previous e-prescriptions on DOCTOR-Y to learn about the patient's past medical conditions and diseases.

That's why DOCTOR-Y provides a summary chart showing, as percentages, how likely the patient is to suffer from each of a group of diseases, based on the symptoms associated with the previous diagnoses. The model is trained on a dataset so that it can classify these symptoms.

The model takes the symptoms from previous prescriptions as input and outputs the disease predicted from these symptoms.

The snippet below shows how the model works.

python symptoms_disease.py continous_sneezing, shivering, chills
['Allergy']
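As a rough illustration of the input handling, the sketch below shows one way the comma-separated symptoms on the command line could be turned into a clean list. The helper name `parse_symptoms` is hypothetical and is not taken from `symptoms_disease.py`.

```python
import sys


def parse_symptoms(argv):
    """Turn shell arguments such as ['continous_sneezing,', 'shivering,', 'chills']
    into a list of symptom names (hypothetical helper, not the original script)."""
    # The shell splits the arguments on spaces, so re-join them and split on commas.
    joined = " ".join(argv)
    return [name.strip() for name in joined.split(",") if name.strip()]


if __name__ == "__main__":
    print(parse_symptoms(sys.argv[1:]))
```

Called with the arguments from the snippet above, this helper would return the three symptom names without the trailing commas.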

Dataset

In this model, we used the Disease Symptom Prediction Dataset. The dataset is balanced; however, its feature vectors (samples) contained redundant, repeated entries.

We kept only the unique feature vectors (unique samples) and reconstructed the data in Boolean form. This refactored dataset, which we fed to the machine learning algorithms, made training easier and gave better results.
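A minimal sketch of the de-duplication step, assuming each sample is a row of Boolean symptom flags followed by the disease label. The toy rows and the helper name `unique_samples` are illustrative only and are not part of the original code.

```python
def unique_samples(rows):
    """Drop repeated feature vectors, keeping the first occurrence of each."""
    # dict.fromkeys preserves insertion order, so the first copy of each row survives.
    return list(dict.fromkeys(tuple(row) for row in rows))


# Toy data: three symptom flags followed by the label (not the real dataset).
rows = [
    (True, False, True, "Allergy"),
    (True, False, True, "Allergy"),  # exact duplicate of the first row
    (False, True, False, "GERD"),
]
deduplicated = unique_samples(rows)
print(len(rows), "->", len(deduplicated))  # 3 -> 2
```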

The data is in a checker format with 133 columns: the last column holds the disease, and the other 132 are the symptoms. There are 309 entries in total and 41 unique diseases, an average of about 8 entries per disease.
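To make the checker format concrete, the sketch below encodes a list of reported symptoms as one Boolean flag per symptom column. It uses a five-symptom stand-in for the 132 real columns, and the function name `encode_symptoms` is an assumption made for illustration.

```python
def encode_symptoms(reported, all_symptoms):
    """Return one Boolean flag per known symptom, in column order."""
    reported_set = set(reported)
    unknown = reported_set - set(all_symptoms)
    if unknown:
        raise ValueError(f"unknown symptoms: {sorted(unknown)}")
    return [symptom in reported_set for symptom in all_symptoms]


# A five-symptom stand-in for the 132 symptom columns of the real dataset.
all_symptoms = ["itching", "skin_rash", "continous_sneezing", "shivering", "chills"]
row = encode_symptoms(["continous_sneezing", "shivering", "chills"], all_symptoms)
print(row)  # [False, False, True, True, True]
```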

The table below is a sample of the symptoms, and you can find the full list here.

Symptoms	Symptoms	Symptoms	Symptoms	Symptoms
itching	skin rash	nodal skin eruptions	continuous sneezing	shivering
visual disturbances	receiving blood transfusion	receiving unsterile injections	coma	stomach bleeding
irregular sugar level	cough	high fever	sunken eyes	breathlessness
swelling of stomach	swelled lymph nodes	malaise	blurred and distorted vision	phlegm

The table below shows the diseases in the full dataset.

Prognosis	Prognosis	Prognosis	Prognosis
Fungal infection	Migraine	hepatitis A	Heart attack
Allergy	Cervical spondylosis	Hepatitis B	Varicose veins
GERD	Paralysis(brain hemorrhage)	Hepatitis C	Hypothyroidism
Chronic cholestasis	Jaundice	Hepatitis D	Hyperthyroidism
Drug Reaction	Malaria	Hepatitis E	Hypoglycemia
Peptic ulcer diseae	Chicken pox	Alcoholic hepatitis	Osteoarthristis
AIDS	Dengue	Tuberculosis	Arthritis
Diabetes	Typhoid	Common Cold	(vertigo) Paroymsal Positional Vertigo
Gastroenteritis	Psoriasis	Pneumonia	Acne
Bronchial Asthma	Impetigo	Dimorphic hemmorhoids(piles)	Urinary tract infection
Hypertension

Implementation

Data Preparation

For the decision tree algorithm, we used PCA to normalize our data and reduce our features from 132 to 70. We fitted the PCA and then projected both the training and the testing data onto the components it produced.
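The preparation step could be sketched with scikit-learn as follows. This is an assumption about tooling, since the article does not name the library, and the data is a random stand-in with the same shape as the real dataset, so any score it produces is meaningless. Note that scikit-learn's PCA centers each feature before projecting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 309 rows of 132 Boolean symptom flags and 41 disease labels.
X = rng.integers(0, 2, size=(309, 132)).astype(float)
y = rng.integers(0, 41, size=309)

# Simple holdout: first 207 rows for training, remaining 102 for testing.
X_train, X_test = X[:207], X[207:]
y_train, y_test = y[:207], y[207:]

# PCA is fitted on the training rows only; the pipeline then applies the same
# projection to any data passed to predict().
model = make_pipeline(
    PCA(n_components=70),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X_train, y_train)

reduced = model.named_steps["pca"].transform(X_test)
print(reduced.shape)  # (102, 70)
```

Wrapping both steps in one pipeline keeps the test data out of the PCA fit, which avoids leaking information from the test set into the projection.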

Model Definition

The model is trained on the dataset discussed above.

The Model Input: the symptoms.

The Model Output: the possible diseases the patient may suffer from.

Model Training

We used five classification algorithms to process the data.

Decision Tree.

Random Forest.

Naïve Bayes.

K-Nearest Neighbor (KNN).

Artificial Neural Network (ANN). The figure below illustrates the network: one input layer with 132 neurons (one per symptom), one hidden layer, and one output layer with 41 neurons (one per disease label). It was trained with a batch size of 16 for 20 epochs.

Figure: ANN format
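A network of this shape can be sketched with scikit-learn's `MLPClassifier`. This is a swapped-in stand-in, since the article does not say which framework was used. The hidden width of 132 is an assumption (it makes the total 132 + 132 + 41 = 305 neurons, but the article does not say how the units are distributed), and the data is synthetic.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: one random 132-flag prototype per disease, repeated 8 times.
prototypes = rng.integers(0, 2, size=(41, 132))
X = np.repeat(prototypes, 8, axis=0).astype(float)
y = np.repeat(np.arange(41), 8)

ann = MLPClassifier(
    hidden_layer_sizes=(132,),  # assumed width; the article only says "one hidden layer"
    batch_size=16,
    max_iter=20,  # with the default Adam solver this caps the number of epochs
    random_state=0,
)

# Stopping after 20 epochs may raise a ConvergenceWarning; silence it here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    ann.fit(X, y)

# Weight matrices: input -> hidden and hidden -> output.
print([w.shape for w in ann.coefs_])  # [(132, 132), (132, 41)]
```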

Evaluation & Results

The dataset was split 66/33 (training/testing) for all the classifiers.

The table below shows the accuracy of each classification technique used for predicting diseases based on symptoms.

Algorithm	Accuracy
Decision Tree	90%
Random Forest	97%
KNN	98%
Naïve Bayes	100%
ANN	100%
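The evaluation procedure could look roughly like the sketch below, which applies a 66/33 split and reports test accuracy for the four classical classifiers. It assumes scikit-learn, uses synthetic data, and picks `BernoulliNB` as the Naïve Bayes variant because the features are Boolean; the article does not state which variant was used. For brevity the decision tree here runs without the PCA step. The printed numbers will not match the table above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 41 disease prototypes, 8 noisy copies of each.
prototypes = rng.integers(0, 2, size=(41, 132))
X = np.repeat(prototypes, 8, axis=0)
noise = rng.random(X.shape) < 0.02  # flip roughly 2% of the flags
X = np.logical_xor(X, noise).astype(int)
y = np.repeat(np.arange(41), 8)

# 66/33 split, stratified so every disease appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "Naive Bayes": BernoulliNB(),  # assumed variant for Boolean features
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {accuracies[name]:.2%}")
```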

Discussion

All of the models performed well, with very high accuracy. The best results were provided by the ANN, while Naïve Bayes, KNN, and Random Forest provided comparable results.

However, when working with real, unseen data, Random Forest showed the best results of all the classifiers.

The table below shows the details of each model.

Algorithm	Review
Decision Tree	Working on the original data resulted in low accuracy. Feature reduction, used to normalize the data and reduce its dimensions, led to better results.
Random Forest	This model showed promising results, and we observed that when increasing the number of estimators, the results improved significantly.
KNN	We observed that the results improved as K was reduced. After trial-and-error experimentation we chose K = 7; more experimentation may lead to better accuracy.
Naïve Bayes	This model showed good performance on the original data achieving very high accuracy.
ANN	Unfortunately, papers did not provide guidelines on configuring the network for this model, so we used trial and error and settled on the following hyperparameters: number of layers: 3; number of neuron units: 305; number of epochs: 20; batch size: 16. The neural network was not difficult to train: after 20 epochs, both the training accuracy and the test accuracy were good.
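To illustrate why the choice of K matters for KNN, here is a small pure-Python sketch over Boolean symptom vectors. The Hamming distance, the toy training set, and the function names are assumptions made for illustration; the article does not describe the distance metric used.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two Boolean vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k):
    """Majority vote among the k training samples closest to `query`.

    `train` is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: hamming(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set with four symptom flags per sample (not the real dataset).
train = [
    ([1, 1, 0, 0], "Allergy"),
    ([1, 0, 0, 0], "Allergy"),
    ([0, 0, 1, 1], "GERD"),
    ([0, 1, 1, 1], "GERD"),
    ([0, 0, 0, 1], "GERD"),
]
query = [1, 1, 0, 1]
for k in (1, 3, 5):
    print(k, knn_predict(train, query, k))
```

In this toy case, K = 1 and K = 3 follow the closest samples and predict Allergy, while K = 5 lets the more numerous but more distant GERD samples win the vote.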

Integration With DOCTOR-Y

We combined the results of the Diseases Symptoms Prediction Model with those of the Diseases Diagnoses Prediction Model to calculate, as a percentage, how likely the patient is to suffer from each of a group of diseases, based on the previous diagnoses and their associated symptoms.
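The article does not give the formula used to merge the two models' outputs. Purely as an illustration, the sketch below averages two per-disease score mappings with equal weight and rescales the result to percentages. The function name, the equal weighting, and the sample scores are all assumptions.

```python
def combine_predictions(symptom_scores, diagnosis_scores):
    """Average two {disease: score} mappings and rescale to percentages.

    Equal weighting is an assumption; the article does not state the formula.
    """
    diseases = set(symptom_scores) | set(diagnosis_scores)
    averaged = {
        d: (symptom_scores.get(d, 0.0) + diagnosis_scores.get(d, 0.0)) / 2
        for d in diseases
    }
    total = sum(averaged.values())
    if total == 0:
        return {d: 0.0 for d in diseases}
    return {d: round(100 * score / total, 1) for d, score in averaged.items()}


# Made-up scores from the two models (not real outputs).
symptom_scores = {"Allergy": 0.6, "Common Cold": 0.4}
diagnosis_scores = {"Allergy": 0.2, "Bronchial Asthma": 0.8}
combined = combine_predictions(symptom_scores, diagnosis_scores)
print(combined)
```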

The final diseases and their percentages are sent to the system server, which forwards them to the client side to be displayed on a chart, as shown in the figure below.