Disease Prediction Based On Medical Side Symptoms
In this article, we discuss one of DOCTOR-Y's machine learning models. The model predicts a patient's current medical conditions from the symptoms associated with the previous diagnoses in the patient's medical history.
We used a dataset that lists diseases and their symptoms in a checker format and classified it with five different machine learning classifiers.
If you don't know what DOCTOR-Y is, check this post.
Without a summary, physicians would spend a lot of time reviewing a patient's previous e-prescriptions on DOCTOR-Y to learn about their past medical conditions and previous diseases.
That's why DOCTOR-Y provides a summarized chart showing, as percentages, how likely the patient is to suffer from each of a group of diseases, based on the symptoms associated with the previous diagnoses. The model is trained on a dataset so that it can classify these symptoms.
The model takes the symptoms from previous prescriptions as input and outputs the disease predicted from these symptoms.
The snippet below shows how the model works.
python symptoms_disease.py continous_sneezing, shivering, chills
['Allergy']
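The input step of the command above can be sketched roughly as follows. This is a minimal illustration, not the actual `symptoms_disease.py`: the `encode_symptoms` helper and the shortened `SYMPTOMS` list are hypothetical, and the symptom names are spelled as in the command above. It only shows how a comma-separated list of symptoms could be turned into the 0/1 vector a classifier expects.

```python
# Hypothetical, shortened vocabulary; the real dataset has 132 symptom columns.
SYMPTOMS = ["itching", "skin_rash", "continous_sneezing", "shivering", "chills", "cough"]


def encode_symptoms(raw: str, vocabulary=SYMPTOMS) -> list[int]:
    """Turn 'a, b, c' into a 0/1 vector aligned with the vocabulary order."""
    names = {part.strip() for part in raw.split(",") if part.strip()}
    unknown = names - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown symptoms: {sorted(unknown)}")
    return [1 if symptom in names else 0 for symptom in vocabulary]


vector = encode_symptoms("continous_sneezing, shivering, chills")
print(vector)  # [0, 0, 1, 1, 1, 0]
```

A trained classifier would then receive this vector (wrapped in a list, since most libraries expect a batch of samples) and return the predicted disease label.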
In this model, we used the Disease Symptom Prediction Dataset. The dataset is balanced; however, its feature vectors (samples) had a redundancy problem, with many samples repeated.
We kept only the unique feature vectors (unique samples) and reconstructed the data in Boolean form, which makes training easier and gives better results. The outcome is a refactored dataset, which we fed to the machine learning algorithms.
The data is in a checker format with 133 columns: the last column holds the disease, and the other 132 are the symptoms. In total there are 309 entries and 41 unique diseases, averaging about 8 entries per disease.
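As a rough sketch of the cleanup described above, duplicate rows can be dropped and the symptom columns cast to Boolean with pandas. The tiny frame below is made-up sample data, and the `prognosis` column name is an assumption; the real file has 132 symptom columns.

```python
import pandas as pd

# Tiny hypothetical sample in checker format: one 0/1 column per symptom, label last.
df = pd.DataFrame(
    {
        "itching": [1, 1, 0, 0],
        "skin_rash": [1, 1, 0, 0],
        "continous_sneezing": [0, 0, 1, 1],
        "shivering": [0, 0, 1, 0],
        "prognosis": ["Fungal infection", "Fungal infection", "Allergy", "Allergy"],
    }
)

# Keep one copy of every distinct (symptoms, disease) row.
unique_rows = df.drop_duplicates().reset_index(drop=True)

# Split into Boolean features and the label column.
features = unique_rows.drop(columns="prognosis").astype(bool)
labels = unique_rows["prognosis"]

print(len(df), "->", len(unique_rows))  # 4 -> 3
```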
The table below is a sample of the symptoms, and you can find the full list here.
Symptoms | Symptoms | Symptoms | Symptoms | Symptoms |
---|---|---|---|---|
itching | skin rash | nodal skin eruptions | continuous sneezing | shivering |
visual disturbances | receiving blood transfusion | receiving unsterile injections | coma | stomach bleeding |
irregular sugar level | cough | high fever | sunken eyes | breathlessness |
swelling of stomach | swelled lymph nodes | malaise | blurred and distorted vision | phlegm |
The table below shows the diseases in the full dataset.
Prognosis | Prognosis | Prognosis | Prognosis |
---|---|---|---|
Fungal infection | Migraine | hepatitis A | Heart attack |
Allergy | Cervical spondylosis | Hepatitis B | Varicose veins |
GERD | Paralysis(brain hemorrhage) | Hepatitis C | Hypothyroidism |
Chronic cholestasis | Jaundice | Hepatitis D | Hyperthyroidism |
Drug Reaction | Malaria | Hepatitis E | Hypoglycemia |
Peptic ulcer diseae | Chicken pox | Alcoholic hepatitis | Osteoarthristis |
AIDS | Dengue | Tuberculosis | Arthritis |
Diabetes | Typhoid | Common Cold | (vertigo) Paroymsal Positional Vertigo |
Gastroenteritis | Psoriasis | Pneumonia | Acne |
Bronchial Asthma | Impetigo | Dimorphic hemmorhoids(piles) | Urinary tract infection |
Hypertension | | | |
For the decision tree algorithm, we used PCA to transform our data and reduce our features from 132 to 70 components. We then projected both the training and the testing data onto the components produced by the PCA.
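The PCA step can be sketched with scikit-learn as below. The data is a synthetic stand-in with the same shape characteristics (132 Boolean symptoms, 41 diseases), not the real dataset. Fitting the PCA on the training split only and the 33% test size are assumptions made for this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (NOT the real data): each disease gets a random symptom
# pattern, and each sample is that pattern with a few bits flipped.
rng = np.random.default_rng(0)
n_symptoms, n_diseases, per_disease = 132, 41, 8
prototypes = rng.random((n_diseases, n_symptoms)) < 0.08
y = np.repeat(np.arange(n_diseases), per_disease)
noise = rng.random((y.size, n_symptoms)) < 0.01
X = np.logical_xor(prototypes[y], noise).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

# Fit PCA on the training data, then project both splits onto the same 70 components.
pca = PCA(n_components=70)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

tree = DecisionTreeClassifier(random_state=0).fit(X_train_pca, y_train)
print(X_train_pca.shape[1], "components, test accuracy:", tree.score(X_test_pca, y_test))
```

Reusing the fitted `pca` object for the test split matters: fitting a second PCA on the test data would produce a different projection, and the tree's splits would no longer line up with the features it was trained on.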
- The Model is trained on the discussed dataset.
- The Model Input: the symptoms.
- The Model Output: the possible diseases the patient may suffer from.
We used five classification algorithms to process the data.
- Decision Tree.
- Random Forest.
- Naïve Bayes.
- K-Nearest Neighbor (KNN).
- Artificial Neural Network (ANN). As illustrated in the figure below, the model has one input layer with 132 neurons (one per symptom), one hidden layer, and one output layer with 41 neurons (one per disease label). It was trained with a batch size of 16 for 20 epochs.
- The dataset was split 66/33 for all the classifiers.
- The table below shows the accuracy of each classification technique when predicting diseases from symptoms.
Algorithm | Accuracy |
---|---|
Decision Tree | 90% |
Random Forest | 97% |
KNN | 98% |
Naïve Bayes | 100% |
ANN | 100% |
All the models performed well, with very high accuracy. ANN and Naïve Bayes scored highest (100%), with KNN (98%) and Random Forest (97%) close behind and the Decision Tree at 90%.
However, when working with real, unseen data, the Random Forest showed the best results out of all the classifiers.
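A comparison like the one above could be set up along these lines with scikit-learn. Several choices here are assumptions rather than the article's actual setup: the data is synthetic, `BernoulliNB` stands in for the unspecified Naïve Bayes variant, `MLPClassifier` stands in for the ANN (its hidden layer size of 64 is a guess), and the PCA step used for the decision tree is omitted for brevity. The accuracies it prints will not match the table.

```python
import warnings

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (NOT the real data): 41 diseases, 132 Boolean symptoms.
rng = np.random.default_rng(0)
n_symptoms, n_diseases, per_disease = 132, 41, 8
prototypes = rng.random((n_diseases, n_symptoms)) < 0.08
y = np.repeat(np.arange(n_diseases), per_disease)
noise = rng.random((y.size, n_symptoms)) < 0.01
X = np.logical_xor(prototypes[y], noise).astype(int)

# 66/33 split, shared by every classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naïve Bayes": BernoulliNB(),
    # One hidden layer (size 64 is a guess), batch size 16, 20 epochs.
    "ANN": MLPClassifier(
        hidden_layer_sizes=(64,), batch_size=16, max_iter=20, random_state=0
    ),
}

# 20 epochs may stop before convergence; silence that expected warning.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracy[name]:.0%}")
```

With only about 8 entries per disease, a single held-out split gives a noisy estimate, which may be one reason a model scoring 100% here can still lose to Random Forest on real, unseen data.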
The table below shows the details of each model.
Algorithm | Review |
---|---|
Decision Tree | |
Random Forest | |
KNN | |
Naïve Bayes | |
ANN | |
We combined the results of the Diseases Symptoms Prediction Model with the results of the Diseases Diagnoses Prediction Model to calculate the percentage likelihood of suffering from each of a group of diseases, based on the previous diagnoses and their associated symptoms.
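The article does not spell out how the two models' results are merged, so the following is only one plausible sketch: a weighted average of the two probability maps, rescaled to percentages. The `combine_predictions` function, its `symptom_weight` parameter, and the sample probabilities are all hypothetical.

```python
def combine_predictions(symptom_probs, diagnosis_probs, symptom_weight=0.5):
    """Blend two {disease: probability} dicts into {disease: percentage}."""
    diseases = set(symptom_probs) | set(diagnosis_probs)
    blended = {
        d: symptom_weight * symptom_probs.get(d, 0.0)
        + (1.0 - symptom_weight) * diagnosis_probs.get(d, 0.0)
        for d in diseases
    }
    total = sum(blended.values())
    if total == 0:
        return {}
    # Sort so the most likely disease comes first in the chart payload.
    ranked = sorted(blended.items(), key=lambda item: item[1], reverse=True)
    return {d: round(100.0 * p / total, 1) for d, p in ranked}


chart = combine_predictions(
    {"Allergy": 0.7, "Common Cold": 0.3},  # made-up symptoms-model output
    {"Allergy": 0.5, "GERD": 0.5},         # made-up diagnoses-model output
)
print(chart)  # {'Allergy': 60.0, 'GERD': 25.0, 'Common Cold': 15.0}
```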
The final diseases and their percentages are sent to the system server, which sends them to the client-side to be represented on a chart as shown in the figure below.