Disease Prediction Based On Medical Symptoms

In this article, we will discuss one of DOCTOR-Y's machine learning models. This model predicts a patient's current medical condition based on the symptoms associated with the previous diagnoses in the patient's medical history.

We used a dataset containing diseases and their symptoms in a checklist (Boolean) format and classified it using 5 different machine learning classifiers.

If you don't know what DOCTOR-Y is, check this post.


Physicians spend a lot of time reviewing a patient's previous e-prescriptions provided on DOCTOR-Y to learn their past medical conditions and previous diseases.

That's why DOCTOR-Y provides a summarized chart representing the percentages of suffering from a group of diseases, based on the symptoms associated with the previous diagnoses. The model is trained on a dataset to classify these symptoms.

The model takes the symptoms from previous prescriptions as input and outputs the predicted disease based on those symptoms.

The snippet below shows how the model is invoked.

```shell
python symptoms_disease.py continous_sneezing, shivering, chills
```
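As a sketch of what happens behind that command, the script might parse its comma-separated arguments and encode them as a Boolean feature vector. The function names and the truncated symptom list below are illustrative assumptions, not the actual DOCTOR-Y code:

```python
# Hypothetical sketch of how symptoms_disease.py might handle its CLI input.

def parse_symptoms(argv):
    """Join the CLI arguments and split on commas, trimming whitespace."""
    raw = " ".join(argv)
    return [s.strip() for s in raw.split(",") if s.strip()]

def symptoms_to_vector(symptoms, all_symptoms):
    """Encode the reported symptoms as a Boolean vector in the dataset's
    fixed column order, as expected by the trained classifier."""
    present = set(symptoms)
    return [1 if s in present else 0 for s in all_symptoms]

# Example mirroring the snippet above (symptom list truncated; the
# full dataset has 132 symptom columns):
ALL_SYMPTOMS = ["itching", "skin_rash", "continous_sneezing",
                "shivering", "chills"]
vec = symptoms_to_vector(
    parse_symptoms(["continous_sneezing,", "shivering,", "chills"]),
    ALL_SYMPTOMS,
)
```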


In this model, we used the Disease Symptom Prediction dataset. The dataset is balanced; however, its feature vectors (samples) contained many duplicates.

We kept only the unique feature vectors (unique samples), reconstructed the data in a Boolean form to simplify training and improve results, and fed the refactored dataset to the machine learning algorithms.
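The deduplication and Boolean re-encoding can be sketched in plain Python. The rows below are a toy example, not the real dataset:

```python
# Toy rows in the original layout: each row lists a disease and its
# symptoms (hypothetical sample, not the real dataset).
rows = [
    ("Fungal infection", ["itching", "skin_rash", "nodal_skin_eruptions"]),
    ("Fungal infection", ["itching", "skin_rash", "nodal_skin_eruptions"]),  # duplicate
    ("Allergy", ["continuous_sneezing", "shivering", "chills"]),
]

# 1) Drop duplicate samples.
unique_rows, seen = [], set()
for disease, symptoms in rows:
    key = (disease, tuple(sorted(symptoms)))
    if key not in seen:
        seen.add(key)
        unique_rows.append((disease, symptoms))

# 2) Re-encode each sample as a Boolean vector over the symptom vocabulary.
vocab = sorted({s for _, symptoms in unique_rows for s in symptoms})
encoded = [
    (disease, [1 if s in set(symptoms) else 0 for s in vocab])
    for disease, symptoms in unique_rows
]
```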

The data is in a checklist format with 133 columns: the last column is the disease, and the other 132 are the symptoms. In total there are 309 entries covering 41 unique diseases, averaging about 8 entries per disease.

The table below is a sample of the symptoms, and you can find the full list here.

| Symptoms | Symptoms | Symptoms | Symptoms | Symptoms |
| --- | --- | --- | --- | --- |
| itching | skin rash | nodal skin eruptions | continuous sneezing | shivering |
| visual disturbances | receiving blood transfusion | receiving unsterile injections | coma | stomach bleeding |
| irregular sugar level | cough | high fever | sunken eyes | breathlessness |
| swelling of stomach | swelled lymph nodes | malaise | blurred and distorted vision | phlegm |

The table below shows the diseases in the full dataset.

| Prognosis | Prognosis | Prognosis | Prognosis |
| --- | --- | --- | --- |
| Fungal infection | Migraine | hepatitis A | Heart attack |
| Allergy | Cervical spondylosis | Hepatitis B | Varicose veins |
| GERD | Paralysis(brain hemorrhage) | Hepatitis C | Hypothyroidism |
| Chronic cholestasis | Jaundice | Hepatitis D | Hyperthyroidism |
| Drug Reaction | Malaria | Hepatitis E | Hypoglycemia |
| Peptic ulcer diseae | Chicken pox | Alcoholic hepatitis | Osteoarthristis |
| AIDS | Dengue | Tuberculosis | Arthritis |
| Diabetes | Typhoid | Common Cold | (vertigo) Paroymsal Positional Vertigo |
| Gastroenteritis | Psoriasis | Pneumonia | Acne |
| Bronchial Asthma | Impetigo | Dimorphic hemmorhoids(piles) | Urinary tract infection |


Data Preparation

For the decision tree algorithm, we applied PCA to reduce the dimensionality of our data from 132 features to 70 components, and we transformed both the training and testing data using the components produced by the PCA fit.
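A minimal sketch of this step, using an SVD-based PCA in NumPy on toy Boolean data (the real pipeline may well use a library implementation such as scikit-learn's `PCA`; the shapes match the article's 132 → 70 reduction):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy Boolean symptom matrices standing in for the real train/test splits.
X_train = rng.integers(0, 2, size=(200, 132)).astype(float)
X_test = rng.integers(0, 2, size=(100, 132)).astype(float)

# Fit PCA on the training data only: center it, then take the top-k
# right singular vectors as the projection basis.
k = 70
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:k]

# Transform both splits with the basis fitted on the training data.
X_train_pca = (X_train - mean) @ components.T
X_test_pca = (X_test - mean) @ components.T
```

Fitting the PCA on the training split and reusing the same mean and components for the test split avoids leaking test information into the transform.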

Model Definition

  • The model is trained on the dataset discussed above.
  • Model input: the patient's symptoms.
  • Model output: the possible diseases the patient may suffer from.

Model Training

We used five classification algorithms to process the data.

  1. Decision Tree.
  2. Random Forest.
  3. Naïve Bayes.
  4. K-Nearest Neighbor (KNN).
  5. Artificial Neural Networks (ANN). The architecture is illustrated in the figure below: one input layer with 132 neurons (since we have 132 symptoms), one hidden layer, and one output layer with 41 neurons (since we have 41 labels as outputs), trained with a batch size of 16 for 20 epochs.
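To make the ANN's layer sizes concrete, here is a NumPy sketch of a single forward pass through that architecture. The weights are randomly initialized stand-ins for the trained parameters, and the 305-unit hidden layer follows the hyperparameters reported later in this article:

```python
import numpy as np

rng = np.random.default_rng(42)

# Layer sizes from the article: 132 input symptoms, one hidden layer
# (305 units), and 41 disease classes at the output.
n_in, n_hidden, n_out = 132, 305, 41

# Random weights stand in for the trained parameters.
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out)); b2 = np.zeros(n_out)

def forward(x):
    """One forward pass: ReLU hidden layer, softmax over the 41 diseases."""
    h = np.maximum(0.0, x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())  # stabilized softmax
    return e / e.sum()

# A Boolean symptom vector with three symptoms present.
x = np.zeros(n_in); x[[2, 5, 7]] = 1.0
probs = forward(x)  # one probability per disease
```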

Evaluation & Results

  • The dataset was split 66/33 (train/test) for all the classifiers.
  • The table below shows the accuracy of each classification technique used for predicting diseases based on symptoms.
| Algorithm | Accuracy |
| --- | --- |
| Decision Tree | 90% |
| Random Forest | 97% |
| KNN | 98% |
| Naïve Bayes | 100% |
| ANN | 100% |
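For reference, the 66/33 split and the accuracy metric amount to the following (illustrative labels only, not the real evaluation code):

```python
import random

random.seed(0)
samples = list(range(309))          # 309 entries, as in the dataset
random.shuffle(samples)
cut = int(len(samples) * 0.66)      # 66/33 train/test split
train, test = samples[:cut], samples[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```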


The models showed decent performance and very high accuracy. The best results were provided by ANN and Naïve Bayes, while KNN and Random Forest provided comparable results.

However, while working with real and unseen data, the Random Forest showed the best results out of all the classifiers.

The table below shows the details of each model.

Algorithm Review
Decision Tree
  • Working on the original data resulted in low accuracy.
  • Feature reduction to normalize the data and reduce its dimensions led to better results.
Random Forest
  • This model showed promising results, and we observed that increasing the number of estimators improved the results significantly.
KNN
  • We observed that reducing K improved the results; through trial-and-error experimentation we chose K = 7, and further experimentation may lead to better accuracy.
Naïve Bayes
  • This model showed good performance on the original data, achieving very high accuracy.
ANN
  • Unfortunately, papers did not provide guidelines on configuring this model's network, so we used trial and error and settled on the following hyperparameters:
    • Number of layers: 3
    • Number of hidden neurons: 305
    • Number of epochs: 20
    • Batch size: 16
  • The network was not challenging to train; after 20 epochs, both the training accuracy and the test accuracy were good.

Integration With DOCTOR-Y

We combined the Diseases Symptoms Prediction Model's results with the Diseases Diagnoses Prediction Model's results to calculate the percentage of suffering from a group of diseases based on the previous diagnoses plus the associated symptoms.
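One simple way to combine the two models' outputs is to average their per-disease probability distributions; the article does not specify DOCTOR-Y's actual combination rule, so the sketch below, with made-up numbers, is only one plausible option:

```python
# Hypothetical per-disease probabilities from the two models.
symptom_model = {"Allergy": 0.6, "Common Cold": 0.3, "Migraine": 0.1}
diagnosis_model = {"Allergy": 0.4, "Common Cold": 0.5, "Migraine": 0.1}

# Average the two distributions, then express the result as percentages
# for the chart on the client side.
combined = {
    d: (symptom_model[d] + diagnosis_model[d]) / 2
    for d in symptom_model
}
percentages = {d: round(p * 100, 1) for d, p in combined.items()}
```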

The final diseases and their percentages are sent to the system server, which forwards them to the client side to be represented on a chart, as shown in the figure below.