Classification Model using ML

PhD Coursework: minor project on Classification ML: Python 3.5

The academic world certainly moves slowly relative to industry demand, but the key for leaders and product managers is to get the foundational building blocks right so they can lead the development and implementation of production-ready, scalable capabilities. ML models are now a commodity, but applying them across multiple use cases and monetizing them is still immature in many organisations outside tech giants such as FB, Google, Apple, Amazon and MS. Below is an attempt to share most of the critical steps in the process, starting with why one should build a model in the first place.

1.    Executive Summary

This study focuses on building machine learning models to predict, at an early stage, whether a person has Parkinson’s disease based on multiple symptoms in speech production. Research shows that there is a relationship between Parkinson’s disease and speech production. The dataset was sourced from the UCI ML Repository. The report is organized into three broad sections: Data Cleaning/Preprocessing, Development of ML Models, and Testing Model Accuracy on test data. Several supervised ML techniques are used: Decision Tree, Naïve Bayes and Random Forest. Further techniques are used to improve results, such as Feature Engineering, Cross Validation and Impurity Identification. After applying the different ML algorithms and tuning hyperparameters, Random Forest with Cross Validation produced the best results, with an accuracy of 79.1%.

2.    Introduction & Motivation

“Parkinson's disease is a progressive nervous system disorder that affects movement. Symptoms start gradually, sometimes starting with a barely noticeable tremor in just one hand. Tremors are common, but the disorder also commonly causes stiffness or slowing of movement. In the early stages of Parkinson's disease, your face may show little or no expression. Your arms may not swing when you walk. Your speech may become soft or slurred. Parkinson's disease symptoms worsen as your condition progresses over time. Although Parkinson's disease can't be cured, medications might significantly improve your symptoms. Occasionally, your doctor may suggest surgery to regulate certain regions of your brain and improve your symptoms.” – Mayo Clinic definition.

Research shows that there is a relationship between Parkinson's disease and speech production [1]. The goal of this project is to design a machine learning model that can predict Parkinson's disease from speech measurements. The dataset is from [2].

3.    Related Work in the Industry

There are multiple parallel efforts in the healthcare industry applying AI to detect symptoms of life-threatening diseases at an early stage. One of the key symptoms that patients with Parkinson's disease (PD) may experience is a change in speech. Multiple studies suggest that the voice may get softer, speech may be slurred, the tone may become monotonous, and patients may not be able to converse at a fast pace. Not everyone with PD experiences the same symptoms. For those who are affected, however, it can be a significant problem, causing difficulties in communication and possibly leading to reduced social interaction [3].

IBM Research Healthcare and Life Sciences announced a partnership with the Michael J. Fox Foundation, which includes a grant for an undisclosed amount from the New York City-based foundation, as well as access to data the Foundation has collected for years. The key to progress lies in analyzing this data, known as the Parkinson’s Progression Markers Initiative (PPMI). That capability dovetails with work IBM has been doing in the area of neurodegenerative diseases [4].

A study [5] published in Movement Disorders in 2013 indicates that the national economic burden of Parkinson's disease exceeded $14.4 billion in 2010, or approximately $22,800 per patient [4].

In another data challenge, The Michael J. Fox Foundation (MJFF) offers large datasets to problem-solve big scientific questions in PD. Launched in spring 2017, the DREAM challenge asked experts to analyze data from two studies, including MJFF's Levodopa Response Trial, which evaluated the use of a smartwatch to monitor dyskinesia (uncontrolled, involuntary movements) and "off" periods (times when symptoms return because medication is not working optimally). Funding for the challenge was provided by MJFF and the Robert Wood Johnson Foundation [6].

4.    Data

The data used for this study covers 80 patients, each identified by a patient ID. The overall dataset contains 240 records because each patient has three sets of measurements (recordings). The attributes I used are:

1. Recording: Number of the recording.

2. Status: 0=Healthy; 1=PD

3. Gender: 0=Man; 1=Woman

4. Pitch local perturbation measures: relative jitter (Jitter_rel), absolute jitter (Jitter_abs), relative average perturbation (Jitter_RAP), and pitch perturbation quotient (Jitter_PPQ).

5. Amplitude perturbation measures: local shimmer (Shim_loc), shimmer in dB (Shim_dB), 3-point amplitude perturbation quotient (Shim_APQ3), 5-point amplitude perturbation quotient (Shim_APQ5), and 11-point amplitude perturbation quotient (Shim_APQ11).

6. Harmonic-to-noise ratio measures: harmonic-to-noise ratio in the frequency band 0-500 Hz (HNR05), in 0-1500 Hz (HNR15), in 0-2500 Hz (HNR25), in 0-3500 Hz (HNR35), and in 0-3800 Hz (HNR38).

7. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (MFCC0, MFCC1,...,MFCC12) and their derivatives (Delta0, Delta1,..., Delta12).

8. Recurrence period density entropy (RPDE).

9. Detrended fluctuation analysis (DFA).

10. Pitch period entropy (PPE).

11. Glottal-to-noise excitation ratio (GNE).
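A quick structural check of the "80 patients, three recordings each" layout can be sketched as below. This is a minimal sketch on a synthetic stand-in table: the column names `ID`, `Recording` and `Status` and the helper `check_replications` are assumptions for illustration, not taken from the article's code.

```python
import pandas as pd

def check_replications(df, id_col="ID", expected_per_subject=3):
    """Return (number of subjects, number of records, True if every
    subject has exactly `expected_per_subject` rows)."""
    counts = df.groupby(id_col).size()
    complete = bool((counts == expected_per_subject).all())
    return counts.shape[0], len(df), complete

# Synthetic stand-in (hypothetical values): 80 subjects, 3 recordings each.
demo = pd.DataFrame({
    "ID": ["S{:02d}".format(i) for i in range(80) for _ in range(3)],
    "Recording": [r for _ in range(80) for r in (1, 2, 3)],
    "Status": [i % 2 for i in range(80) for _ in range(3)],
})

n_subjects, n_records, complete = check_replications(demo)
print(n_subjects, n_records, complete)
```

Running the same check on the real file would confirm that the 240 rows really are 80 subjects with three rows each before any modelling starts.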

5.    EDA

I started with basic visualizations of the relationships between the target and the independent variables. Some of them were:

1.     Jitter_rel vs Status: Patients with high Jitter_rel mostly have PD.

2.     Shim_loc vs Status: Higher shimmer is positively associated with having PD.

3.     HNR15 vs Status: Patients with a high harmonic-to-noise ratio in the 0-1500 Hz band (HNR15) are mostly healthy.
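The three comparisons above can also be summarized numerically. The sketch below groups a table by `Status` and compares the mean and median of each feature. The data is synthetic and generated only to mimic the direction of the reported relationships, and the helper name `compare_by_status` is hypothetical.

```python
import numpy as np
import pandas as pd

def compare_by_status(df, features, status_col="Status"):
    """Mean and median of each feature for healthy (0) and PD (1) rows."""
    return df.groupby(status_col)[features].agg(["mean", "median"])

# Synthetic stand-in data (not the real measurements).
rng = np.random.default_rng(0)
n = 120
status = np.repeat([0, 1], n)
demo = pd.DataFrame({
    "Status": status,
    "Jitter_rel": rng.normal(0.4, 0.1, 2 * n) + 0.3 * status,
    "Shim_loc": rng.normal(0.03, 0.01, 2 * n) + 0.02 * status,
    "HNR15": rng.normal(60.0, 5.0, 2 * n) - 8.0 * status,
})

summary = compare_by_status(demo, ["Jitter_rel", "Shim_loc", "HNR15"])
print(summary.round(3))
```

A box plot per feature split by `Status` would show the same information graphically.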

6.    Pre-processing & Data Cleaning

Checking Missing Values:

I ran a missing-value check on the dataset and plotted a heatmap to inspect the result visually. There are only two missing values, which is not a big concern, so we can simply replace them with zeroes.
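The missing-value check and the zero fill described above might look like the following in pandas. This is a sketch on a tiny made-up table. The helper `report_and_fill_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

def report_and_fill_missing(df, fill_value=0):
    """Return the per-column missing counts (non-zero only) and a copy
    of the table with missing values replaced by `fill_value`."""
    missing_per_column = df.isnull().sum()
    filled = df.fillna(fill_value)
    return missing_per_column[missing_per_column > 0], filled

# Made-up table with two missing cells, mirroring the situation described.
demo = pd.DataFrame({
    "Jitter_rel": [0.25, np.nan, 0.31, 0.40],
    "Shim_loc": [0.03, 0.05, 0.04, np.nan],
    "HNR15": [61.2, 58.7, 63.0, 59.9],
})

missing, cleaned = report_and_fill_missing(demo)
print(missing)
print(cleaned.isnull().sum().sum())
```

For the visual check, `seaborn.heatmap(df.isnull(), cbar=False)` is one common way to draw such a heatmap, assuming seaborn is installed.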

Checking correlated variables: Next, I checked for collinearity to see whether any redundant features could be removed based on the strength of the Pearson coefficient.
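One common way to drop redundant features from a Pearson correlation matrix is sketched below. It assumes the cutoff is applied to the absolute correlation and that, for each highly correlated pair, the later column is the one dropped. The article does not say which member of a pair was removed, so both choices are assumptions, as is the helper name `drop_correlated`.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.8):
    """Drop every column whose absolute Pearson correlation with an
    earlier column exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is seen once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic stand-in: MFCC2 is almost a copy of MFCC1, HNR15 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "MFCC1": base,
    "MFCC2": base + 0.05 * rng.normal(size=200),
    "HNR15": rng.normal(size=200),
})

reduced, dropped = drop_correlated(demo, threshold=0.8)
print(dropped)
print(list(reduced.columns))
```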

Outlier Detection:

I used the z-score method with a threshold of 4. A threshold of 3 is the more common choice, but I used 4 because I am only looking for extreme observations.
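A z-score check with a threshold of 4 can be sketched as follows. The table is synthetic with one injected extreme value, and the helper `zscore_outliers` is a hypothetical name. The z-scores use the population standard deviation (`ddof=0`), which is an assumption about how the scores were computed.

```python
import numpy as np
import pandas as pd

def zscore_outliers(df, threshold=4.0):
    """Boolean table marking cells whose absolute z-score exceeds the threshold."""
    z = (df - df.mean()) / df.std(ddof=0)
    return z.abs() > threshold

# Synthetic stand-in: evenly spread values plus one injected extreme value.
demo = pd.DataFrame({
    "Jitter_rel": np.linspace(-1.0, 1.0, 200),
    "Shim_loc": np.linspace(1.0, -1.0, 200),
})
demo.loc[5, "Jitter_rel"] = 25.0

flags = zscore_outliers(demo, threshold=4.0)
outlier_rows = flags.index[flags.any(axis=1)].tolist()  # rows with any extreme cell
columns_flagged = flags.sum(axis=1)                     # extreme cells per row
print(outlier_rows)
```

Counting flagged columns per row makes it easy to spot records that are extreme in several measurements at once.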

Output:

Certain records, such as 222 and 169, have multiple columns missing. Many of the MFCC and HNR features are correlated with each other, with high positive Pearson coefficients. I used 0.8 as the cutoff for correlation strength and removed anything above it, so the new dataset has 20 columns.

7.    Approach

(Figure: overview of the approach.)

8.    Models Development, Feature Importance & Evaluation

Decision Tree Model:

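The Decision Tree Classifier uses a tree-based approach (CART): it recursively splits the data on feature thresholds to separate the classes. Without dropping the ID column, we were getting 100% accuracy with the Decision Tree model. Each patient has three recordings with the same status, so the ID most likely let the tree memorize patients rather than learn from the speech features. The ID carries no predictive meaning, so it was dropped before building the model.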

Creating Training and Test Data:

I used 30% of the data for testing and 70% for training. A random state of 52 is held constant across all the models. Changing it could affect the results, but I have kept it consistent for now.
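The 70/30 split with a fixed random state can be sketched with scikit-learn's `train_test_split`. The table here is synthetic, and the column names `ID` and `Status` and the helper `make_split` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def make_split(df, target="Status", drop_cols=("ID",), test_size=0.3, random_state=52):
    """Drop the identifier column(s), separate the target, and split 70/30."""
    X = df.drop(columns=[target] + list(drop_cols))
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Synthetic stand-in with 240 rows (80 subjects x 3 recordings).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "ID": ["S{:02d}".format(i) for i in range(80) for _ in range(3)],
    "Status": [i % 2 for i in range(80) for _ in range(3)],
    "Jitter_rel": rng.normal(size=240),
    "HNR15": rng.normal(size=240),
})

X_train, X_test, y_train, y_test = make_split(demo)
print(X_train.shape, X_test.shape)
```

Because each subject contributes three recordings, a plain random split can place recordings from the same person on both sides. A group-aware splitter such as scikit-learn's `GroupShuffleSplit` is one alternative to consider. It is not what the article used.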

Fitting the Decision Tree model: Accuracy and Confusion Matrix

After the first run, the accuracy is 76.8%, with 29 True Positives and 26 True Negatives. I will check Precision and Recall after 10-fold cross-validation.
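Fitting a decision tree and reading accuracy and the confusion matrix off the test set might look like this. The data comes from `make_classification` as a stand-in for the 20 retained features, so the numbers it prints are unrelated to the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned dataset (240 rows, 20 features).
X, y = make_classification(n_samples=240, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=52)

tree = DecisionTreeClassifier(random_state=52)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes, in label order [0, 1].
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(accuracy, tp, tn, fp, fn)
```

Accuracy is simply `(tp + tn)` divided by the number of test rows, which is a useful cross-check against the confusion matrix.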

Running Grid Search on Decision Tree Model:

I tried multiple combinations of the following parameters:

max_depth: 5, 8, 15

min_samples_leaf: 1, 2, 5, 7

min_samples_split: 2

min_impurity_decrease: 0, 0.5 and 1

After running Grid Search, the best parameters were:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0, min_impurity_split=None, min_samples_leaf=7, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')

I got the same accuracy even after running 10-fold cross-validation. This suggests that, on this test split, the tuned parameters offer no improvement over the defaults of the Decision Tree Classifier.
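The grid search over these values with 10-fold cross-validation can be sketched with `GridSearchCV`. The data is again a synthetic stand-in, and the scoring metric (`accuracy`) is an assumption, since the article does not state which one was used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned dataset.
X, y = make_classification(n_samples=240, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=52)

# Parameter values listed in the article.
param_grid = {
    "max_depth": [5, 8, 15],
    "min_samples_leaf": [1, 2, 5, 7],
    "min_samples_split": [2],
    "min_impurity_decrease": [0.0, 0.5, 1.0],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=52),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)                    # mean cross-validated accuracy
print(search.score(X_test, y_test))          # accuracy on the held-out 30%
```

For a two-class problem the Gini impurity never exceeds 0.5, so `min_impurity_decrease` values of 0.5 and 1 block essentially every split. In practice only the 0 setting lets the tree grow, which matches the best estimator shown above.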

Random Forest Model:

Random Forest uses a bagging approach: it trains many decision trees on bootstrap samples of the data, considers a random subset of features at each split, and combines the trees' votes to classify instances. The first iteration of the model gave the same accuracy as the Decision Tree:

Accuracy is 76.8%. Precision is 76.4%. Recall is 75.8%.
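A first Random Forest fit and the three metrics could be computed along these lines. The data is a synthetic stand-in, `n_estimators=100` is an assumed setting, and precision and recall are computed for the positive class (1), since the article does not say which averaging it used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset.
X, y = make_classification(n_samples=240, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=52)

forest = RandomForestClassifier(n_estimators=100, random_state=52)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),  # of predicted PD, share truly PD
    "recall": recall_score(y_test, y_pred),        # of true PD, share detected
}
print(metrics)
```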

After running Grid Search with 10-fold cross-validation, I fit a new Random Forest model with the best estimator found.

The accuracy has clearly improved, to 79.1%:

Precision is 79%, better than the Decision Tree model

Recall is 79%, better than the Decision Tree model

Feature Importance:

Based on the feature importance, the top 5 features are:

MFCC4, MFCC5, MFCC3, Delta2, MFCC1

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. The mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up this representation. So we can say that the spectral characteristics of a patient's voice could be among the most important features for detecting Parkinson’s disease.
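Ranking features by importance from a fitted Random Forest can be sketched as below. The feature names are placeholders attached to synthetic data, so the ranking printed here has no relation to the top 5 reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names for 20 synthetic columns (hypothetical, for illustration).
feature_names = (["MFCC{}".format(i) for i in range(13)]
                 + ["Delta{}".format(i) for i in range(7)])

X, y = make_classification(n_samples=240, n_features=20, n_informative=5,
                           random_state=0)
X = pd.DataFrame(X, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=52).fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
top5 = importances.head(5)
print(top5)
```

Impurity-based importances tend to be spread across correlated features, so the ranking is best read as a rough guide rather than a precise ordering.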

9.    Conclusion & Future Improvements

Random Forest was the best of the three ML models. Cross Validation using Grid Search found the best parameters and improved the accuracy to 79.1%. That is, with the given dataset and attributes, the model correctly classifies about 79% of the records in the held-out test data.

I could further optimize the Decision Tree using pre-pruning or post-pruning techniques to reduce over-fitting. I could also try other ML techniques such as XGBoost and LightGBM, which may give more accurate results because they build trees sequentially, with each new tree correcting the errors of the previous ones at a controlled learning rate. In further iterations of the model, I could check skewness and re-examine outliers to filter the dataset further and get more accurate results.
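As one possible form of the post-pruning mentioned above, scikit-learn offers cost-complexity pruning through the `ccp_alpha` parameter (available from scikit-learn 0.22 onward). The sketch below picks the pruning strength by 10-fold cross-validation on synthetic stand-in data. It is an illustration of the idea, not something the article ran.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned dataset.
X, y = make_classification(n_samples=240, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=52)

# Candidate pruning strengths derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=52).cost_complexity_pruning_path(
    X_train, y_train)
# Drop the last alpha (it prunes down to the root) and guard against
# tiny negative values caused by floating-point error.
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Mean 10-fold cross-validated accuracy for each candidate alpha.
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=52, ccp_alpha=a),
                    X_train, y_train, cv=10).mean()
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(scores))])

unpruned = DecisionTreeClassifier(random_state=52).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=52, ccp_alpha=best_alpha).fit(
    X_train, y_train)
print(best_alpha, unpruned.tree_.node_count, pruned.tree_.node_count)
```

Pre-pruning, by contrast, is what parameters such as `max_depth` and `min_samples_leaf` already do: they stop the tree from growing instead of trimming it afterwards.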

10. Lessons Learned

Labeled Data:

Having enough samples is key to any ML algorithm, but the healthcare industry in particular faces challenges with patient populations and patient data. Even when the data is available, it is sometimes protected by laws and regulations. In this project the models were built with around 240 records, which is too few for making predictions and deploying them for practical purposes. To overcome the patient data challenge, the National Institutes of Health is running multiple programs through CTSA grants to build patient population databases and to support research through informatics and AI.

Domain/Functional Knowledge:

Any data science project depends on the domain knowledge of the data scientist. Even with millions of records and thousands of attributes, domain knowledge is critical because it helps in establishing hypotheses, for example that high blood pressure drives the risk of heart attack. Lack of domain knowledge was another challenge in this project, which I realized about 30% of the way into the work.

11. References

[1] Robbins, J.A., Logemann, J., Kirshner, H.S.: Swallowing and speech production in Parkinson's disease. Annals of Neurology 19 (1986).

[2] Naranjo, L., Pérez, C.J., Campos-Roca, Y., Martín, J.: Addressing voice recording replications for Parkinson’s disease detection. Expert Systems With Applications 46, 286-292 (2016).

[3] https://parkinsonsdisease.net/symptoms/speech-difficulties-changes/

[4] https://www.healthleadersmedia.com/innovation/when-parkinsons-meets-ai-models-disease-progression-expected

[5] https://onlinelibrary.wiley.com/doi/abs/10.1002/mds.25292

[6] https://www.michaeljfox.org/news/data-challenges?navid=data-science-challenge
