Analysis of Diabetes data set of Pima Indians using Neural Network and NN Ensemble
Data Set Description:
The data set can be downloaded from the UCI Machine Learning Repository. It consists of female patients (Pima Indians) at least 21 years of age. It has 768 instances, each with the following 8 numeric attributes plus a binary class variable:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
This data set contains the diagnostic data to investigate whether the patient shows signs of diabetes according to World Health Organization criteria such as the 2-hour post-load plasma glucose.
Exploratory Data Analysis:
The graph below (obtained from Weka) shows the histograms of all the attributes.
The above histograms provide the following insights:
- Class 0, with 500 instances, represents patients who tested negative, and class 1, with 268 instances, represents patients who tested positive. The data set is small and imbalanced, with about 65 percent of patients testing negative. This could act as a limitation in the study.
- The attributes 2-Hour serum insulin, Diabetes pedigree function, Age and Number of times pregnant are highly skewed to the right, while Plasma glucose concentration, Diastolic blood pressure and Body mass index appear to be approximately normally distributed.
- Removal of the outliers: As seen in the histogram below, there are 49 outliers (red bar), which have been removed as part of data pre-processing. This leaves 719 instances.
- Reviewing the scatter plots below of all attributes did not show strong relationships amongst most of the attributes; however, there is a sizable correlation between Body mass index and Triceps skin fold thickness. The latter was later removed from the analysis to improve model accuracy.
- Larger values of plasma glucose, combined with larger values of age and other factors, tend to show a greater likelihood of testing positive for diabetes (see the red dots below).
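The outlier removal described above can be illustrated with a small interquartile-range rule. This is only a simplified sketch, not Weka's InterquartileRange filter itself: the function name `iqr_outlier_flags` and the sample readings are hypothetical, the factor of 3.0 is taken from the `-O3.0` option visible in the relation string, and Python's quartile computation may differ slightly from Weka's.

```python
import statistics

def iqr_outlier_flags(values, factor=3.0):
    """Flag values lying more than `factor` * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v < low or v > high for v in values]

# Hypothetical insulin-like readings, NOT taken from the actual data set.
sample = [0, 0, 36, 54, 88, 94, 105, 130, 168, 846]
flags = iqr_outlier_flags(sample)

# Keep only the rows that were not flagged, mirroring the removal step.
kept = [v for v, is_outlier in zip(sample, flags) if not is_outlier]
print(kept)
```

In the actual workflow the rule is applied per attribute, and an instance is dropped if any of its attributes is flagged.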
Algorithms Used:
- ZeroR Classifier: This was used to determine the base accuracy for the model. A base accuracy of 66 percent was found using this algorithm applied only to the training data. Below is the screenshot of the result obtained:
- Classifier 1: Multilayer Perceptron (MLP) - MLP was first applied with a 75% training split of the data set, with the remaining 25% used for testing. A separate validation split has not been considered owing to the small size of the data set.
MLP with all fields on a 75% split of the data, configured with a learning rate of 0.3 and a momentum of 0.2, gave an accuracy of 69%, as shown in the figure below.
MLP with 7 fields (excluding the Triceps skin fold thickness field, since it was collinear with Body mass index) on a 75% split of the data, configured with a learning rate of 0.3 and a momentum of 0.45, gave an accuracy of 71%, as shown in the figure below.
- Classifier 2: Radial Basis Function (RBF) - RBF was applied with its default configuration, again removing the Triceps skin fold thickness field to eliminate collinearity. Here an accuracy of 73 percent was achieved, which is higher than that of MLP.
- Using Ensemble Voting Technique: The results of the classifiers were combined by averaging the output probabilities of the MLP and RBF classifiers.
Run Information:
Scheme: weka.classifiers.meta.Vote -S 1 -B "weka.classifiers.functions.RBFClassifier -N 2 -R 0.01 -L 1.0E-6 -C 2 -P 1 -E 1 -S 1" -B "weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a" -R AVG
Relation: Diabetes_Header-weka.filters.unsupervised.attribute.NumericToNominal-R9-weka.filters.unsupervised.attribute.InterquartileRange-Rfirst-last-O3.0-E6.0-weka.filters.unsupervised.instance.RemoveWithValues-S0.0-C10-Llast-weka.filters.unsupervised.attribute.Remove-R10-11-weka.filters.unsupervised.attribute.Remove-R4
Instances: 719
Attributes: 8
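To make the "average of probabilities" rule (`-R AVG`) concrete, here is a minimal sketch of how two classifiers' class probabilities could be averaged for a single patient. The function name `average_vote` and the probability values are hypothetical and only illustrate the idea; they are not outputs from the Weka run.

```python
def average_vote(prob_lists):
    """Average class-probability vectors and return (averages, predicted class index)."""
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / len(prob_lists) for c in range(n_classes)]
    predicted = max(range(n_classes), key=avg.__getitem__)
    return avg, predicted

# Hypothetical outputs for one patient, as [P(class 0), P(class 1)].
mlp_probs = [0.60, 0.40]  # MLP alone would predict class 0
rbf_probs = [0.30, 0.70]  # RBF alone would predict class 1

avg, predicted = average_vote([mlp_probs, rbf_probs])
print(avg, predicted)
```

In this made-up case the more confident RBF output outweighs the MLP, so the ensemble predicts class 1. This shows how averaging can change individual predictions even when the overall accuracy stays close to that of the single classifiers.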
Results:
- Eliminating the Triceps skin fold thickness field gives a better result.
- The RBF classifier performs better than MLP on this data set.
- A result of 73%, though not very high, is a notable improvement over the base accuracy of 66%.
- The ensemble technique averages the outputs of the MLP and RBF classifiers and reaches an accuracy of 72%, but it correctly classifies 44 true positives for class 1, considerably more than the RBF classifier alone.
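The base accuracy used for comparison comes from ZeroR, which always predicts the majority class. As a rough illustration, the sketch below applies that rule to the class counts quoted earlier for the full data set (500 negative, 268 positive). The helper name `zero_r` is hypothetical, and the class counts after outlier removal are not given here, so the figure need not match the reported 66% exactly.

```python
from collections import Counter

def zero_r(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority, n_majority = counts.most_common(1)[0]
    return majority, n_majority / len(labels)

# Class counts quoted for the full data set, before outlier removal.
labels = [0] * 500 + [1] * 268

majority, baseline_accuracy = zero_r(labels)
print(majority, round(baseline_accuracy, 3))  # 0 0.651
```

Any classifier has to beat this majority-class figure to be considered useful, which is why the 73% from RBF is read against it rather than against 50%.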
Disclaimer: The analysis has been done using Weka 3.8 as part of my master's studies at NUS. The content reflects my understanding of the subject, gathered during my studies and from following journals on the subject. The cover image has been downloaded from the internet.