Analysis of Diabetes data set of Pima Indians using Neural Network and NN Ensemble

Kriti Srivastava

Compliance Innovation & Analytics @ Bank of Singapore

Published: May 17, 2017

Data Set Description:

The data set can be downloaded from the UCI Machine Learning Repository. It contains records of female patients (Pima Indians) who are at least 21 years of age. It has 768 instances, each with the following 8 numeric-valued attributes plus a binary class variable:

1. Number of times pregnant

2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3. Diastolic blood pressure (mm Hg)

4. Triceps skin fold thickness (mm)

5. 2-Hour serum insulin (mu U/ml)

6. Body mass index (weight in kg/(height in m)^2)

7. Diabetes pedigree function

8. Age (years)

9. Class variable (0 or 1)

This data set contains the diagnostic data to investigate whether the patient shows signs of diabetes according to World Health Organization criteria such as the 2-hour post-load plasma glucose.

Exploratory Data Analysis:

The graph below (obtained from Weka) shows the histograms of all the attributes.

The above histograms provide the following insights:

Class 0, with 500 instances, represents patients who tested negative, and class 1, with 268 instances, represents patients who tested positive. The data set is small and imbalanced, with about 65 percent of patients testing negative. This could be a limitation of the study.
The attributes 2-Hour serum insulin, Diabetes pedigree function, Age and Number of times pregnant are highly skewed to the right, while Plasma glucose concentration, Diastolic blood pressure and Body mass index appear to be approximately normally distributed.
Removal of outliers: As seen in the histogram below, there are 49 outliers (red bar), which were removed as part of data pre-processing.

Reviewing the scatter plots below did not show clear relationships among most of the attributes; however, there is a sizable correlation between Body mass index and Triceps skin fold thickness. The latter was later removed from the analysis to improve model accuracy.
Larger values of plasma glucose, combined with larger values of age and other factors, tend to show a greater likelihood of testing positive for diabetes (see the red dots below).
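
The outlier-removal step mentioned above can be illustrated with a small sketch. The Relation string in the run information lists Weka's InterquartileRange filter with an outlier factor of 3.0; the Python below is only a simplified approximation of that idea, flagging values that lie more than 3 times the interquartile range beyond the quartiles. The function name, the quartile method and the sample values are assumptions made for illustration, and the sketch does not reproduce Weka's separate handling of extreme values.

```python
from statistics import quantiles


def iqr_outlier_flags(values, factor=3.0):
    """Return one boolean per value: True if it lies outside the IQR fences.

    Hypothetical helper, not Weka's implementation. The fences are
    Q1 - factor * IQR and Q3 + factor * IQR.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v < lower or v > upper for v in values]


# Made-up readings for a single attribute, with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flags = iqr_outlier_flags(sample)
kept = [v for v, is_outlier in zip(sample, flags) if not is_outlier]
print(kept)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In a full pipeline the same check would be applied per attribute, and an instance would be dropped if any of its attributes is flagged.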

Algorithms Used:

The ZeroR classifier was used to determine the baseline accuracy for the model. Applied to the training data only, it gave a baseline accuracy of 66 percent. Below is a screenshot of the result obtained:
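
As a rough illustration of what a majority-class baseline such as ZeroR computes, the sketch below always predicts the most frequent class and reports how often that would be correct. It uses the class counts quoted earlier for the full data set (500 negative, 268 positive), so its result need not match the 66 percent reported by Weka on the pre-processed data. The function name `zero_r` is a hypothetical helper, not part of Weka.

```python
from collections import Counter


def zero_r(labels):
    """Return (majority_class, accuracy of always predicting that class)."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Class labels rebuilt from the counts given in the text: 500 x 0, 268 x 1.
labels = [0] * 500 + [1] * 268
majority_class, baseline_accuracy = zero_r(labels)
print(majority_class, round(baseline_accuracy, 3))  # 0 0.651
```

Any real classifier should beat this figure to be considered useful on this data.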

Classifier 1: Multilayer Perceptron (MLP) - MLP was first applied with a 75% split of the data set for training, the remainder being used for testing. A validation split was not used owing to the small size of the data set.

MLP with all fields on a 75% split of the data, set with a learning rate of 0.3 and a momentum of 0.2, gave an accuracy of 69%, as shown in the figure below.

MLP with 7 fields (excluding the Triceps skin fold thickness field, since it was collinear with Body mass index) on a 75% split of the data, set with a learning rate of 0.3 and a momentum of 0.45, gave an accuracy of 71%, as shown in the figure below.
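
The 75% split used above can be sketched in a few lines. This is a minimal illustration, assuming a shuffle followed by a single cut; the function name, the seed and the rounding rule are assumptions and are not taken from Weka.

```python
import random


def percentage_split(rows, train_fraction=0.75, seed=1):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 719 is the number of instances left after outlier removal.
train, test = percentage_split(range(719))
print(len(train), len(test))  # 539 180
```

With only about 180 test instances, a single misclassified patient shifts the reported accuracy by roughly half a percentage point, which is worth keeping in mind when comparing the figures.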

Classifier 2: Radial Basis Function (RBF) - RBF was run with the default configuration, with the Triceps skin fold thickness field removed to eliminate collinearity. It achieved an accuracy of 73 percent, which is higher than that of the MLP.

Using the Ensemble Voting Technique: The outputs of the MLP and RBF classifiers were combined by averaging their predicted class probabilities.
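
The averaging rule can be illustrated with a short sketch. It assumes each classifier returns one probability per class for a given patient; the function name and the probability values are made up for illustration and do not come from the experiment.

```python
def average_of_probabilities(distributions):
    """Average several class-probability lists for one instance.

    distributions: one list of class probabilities per classifier.
    Returns (index of the winning class, averaged probabilities).
    """
    n_classes = len(distributions[0])
    averaged = [
        sum(dist[c] for dist in distributions) / len(distributions)
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Hypothetical outputs for one patient: [P(class 0), P(class 1)].
mlp_output = [0.60, 0.40]
rbf_output = [0.30, 0.70]
predicted, averaged = average_of_probabilities([mlp_output, rbf_output])
print(predicted, averaged)
```

In this made-up case the MLP alone would predict class 0, but the more confident RBF output pulls the average towards class 1, which is how an ensemble can change individual predictions without changing either base model.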

Run Information:
Scheme:       weka.classifiers.meta.Vote -S 1 -B "weka.classifiers.functions.RBFClassifier -N 2 -R 0.01 -L 1.0E-6 -C 2 -P 1 -E 1 -S 1" -B "weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a" -R AVG

Relation:     Diabetes_Header-weka.filters.unsupervised.attribute.NumericToNominal-R9-weka.filters.unsupervised.attribute.InterquartileRange-Rfirst-last-O3.0-E6.0-weka.filters.unsupervised.instance.RemoveWithValues-S0.0-C10-Llast-weka.filters.unsupervised.attribute.Remove-R10-11-weka.filters.unsupervised.attribute.Remove-R4

Instances:    719

Attributes:   8

Results:

Eliminating the Triceps skin fold thickness field gives a better result.
The RBF classifier performs better than the MLP on this data set.
The result of 73%, though not very high, is a notable improvement over the baseline accuracy of 66%.
The ensemble technique averages the outputs of the MLP and RBF classifiers and gives an accuracy of 72%, but it correctly classifies 44 true positives for class 1, considerably more than the RBF classifier does.

Disclaimer: The analysis was done using Weka 3.8 as part of my master's studies at NUS. The content reflects my understanding of the subject, gathered during my studies and from following journals on the subject. The cover image has been downloaded from the internet.