Erythemato-squamous Diseases Prediction

Background of the Data

The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very few differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis, but unfortunately these diseases share many histopathological features as well. A further difficulty for the differential diagnosis is that a disease may show the features of another disease at its beginning stage and only develop its characteristic features at later stages. Patients were first evaluated clinically on 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features, whose values were determined by analysing the samples under a microscope. This database contains 34 attributes, 33 of which are linear valued and one of which is nominal.

In the dataset constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. The age feature simply represents the age of the patient. Every other feature (clinical and histopathological) was given a degree in the range of 0 to 3. Here, 0 indicates that the feature was not present, 3 indicates the largest amount possible, and 1, 2 indicate the relative intermediate values.

Data Dictionary


A brief explanation of the machine learning models I applied

  • K-neighbors classifier: The k-nearest neighbors (KNN) algorithm classifies a new sample by finding its k closest samples in the existing training set, measured by a distance metric, and assigning the majority class among those neighbors.
  • Support Vector Classification (SVC): Support vector classification predicts the outputs of new data based on existing data. It performs classification by finding the separating hyperplane with the widest margin between classes.
  • Logistic regression: Logistic regression builds a discriminative model according to the number of classes in the data, and new samples are classified with this model. Its purpose is to establish the best-fitting relationship between the dependent variable and the independent variables using as few variables as possible.
  • Gaussian Naïve Bayes: Gaussian Naïve Bayes applies a classification algorithm based on Bayes' theorem, assuming the features within each class follow a Gaussian distribution.
  • Decision tree classifier: The decision tree classifier consists of three components: nodes, branches, and leaves. Questions are asked of the attributes in the training data to build a tree structure, and this process continues until leaf nodes with no further branches are reached.
  • Random forest classifier: Instead of relying on a single decision tree, the random forest classifier combines the decisions of many trees trained on different subsets of the training data.
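All six classifiers share scikit-learn's common fit/score interface, so they can be trained and compared in a single loop. A minimal sketch, using a synthetic six-class, 34-feature dataset as a stand-in for the dermatology data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dermatology data: 6 classes, 34 features.
X, y = make_classification(n_samples=360, n_features=34, n_informative=10,
                           n_classes=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(kernel="rbf"),
    "LogReg": LogisticRegression(max_iter=1000),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and record its test-set accuracy.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
print(scores)
```

On the real dataset, `X` and `y` would instead come from the dermatology attribute columns and the class column.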

The outputs generated by the models

  • SVC: The confusion matrix shows that the classifier correctly predicted 31 instances of class 1, 9 of class 2, 13 of class 3, 7 of class 4, 10 of class 5, and 3 of class 6. The accuracy of the support vector classifier (SVC) is 98.65%, meaning the model correctly classified 73 of the 74 test instances. High accuracy, precision, recall, and F1-score indicate that the model classifies the data well.
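Reports like the one above come from scikit-learn's metrics functions. A sketch using illustrative labels chosen to reproduce the counts quoted (the real values come from the SVC's test-set predictions):

```python
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Illustrative labels only (not the actual test split): 74 samples,
# one class-3 sample misclassified as class 5.
y_true = [1] * 31 + [2] * 9 + [3] * 13 + [4] * 7 + [5] * 10 + [6] * 3 + [3]
y_pred = [1] * 31 + [2] * 9 + [3] * 13 + [4] * 7 + [5] * 10 + [6] * 3 + [5]

cm = confusion_matrix(y_true, y_pred)   # rows = true class, cols = predicted
acc = accuracy_score(y_true, y_pred)
print(cm)
print(f"accuracy = {acc:.4%}")          # 73 of 74 correct
print(classification_report(y_true, y_pred))
```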



  • Random Forest Classifier: The model correctly classified all instances in the dataset, as shown by the matrix having all non-zero elements on the diagonal and zero off-diagonal elements. The accuracy of the model is 100%. The precision, recall, and F1-score are also 1.0 for every class, meaning the model produced no false positives or false negatives.


  • Decision Tree Classifier: The classifier performed well for most classes, with perfect precision and recall for classes 1, 2, 5, and 6. However, it had a lower recall for class 3, indicating that some samples were incorrectly classified as other classes. The accuracy of the decision tree classifier is 98.65%, which indicates that the classifier is performing well.


  • K-Neighbors Classifier: The accuracy of the classifier is 98.64%, which is the percentage of correctly classified samples out of the total number of samples. The rows of the matrix represent the true classes, while the columns represent the predicted classes. Each entry of the matrix shows the number of samples that belong to a certain true class and are predicted to belong to a certain predicted class.


  • Naive Bayes: Looking at the off-diagonal elements, we can see where the model is making mistakes. For example, there were 2 instances with true label 2 that were incorrectly classified as 4, and 1 instance with true label 4 that was incorrectly classified as 3.
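Reading mistakes off the off-diagonal entries can also be automated. A sketch with a hypothetical 6x6 matrix containing the two error patterns described above:

```python
import numpy as np

# Illustrative confusion matrix (rows = true class, cols = predicted),
# built to contain the two errors described in the text.
cm = np.diag([31, 7, 13, 6, 10, 3])
cm[1, 3] = 2   # true class 2 predicted as class 4
cm[3, 2] = 1   # true class 4 predicted as class 3

# Collect every non-zero off-diagonal entry as (true, predicted, count),
# using 1-based class labels to match the report.
errors = [(t + 1, p + 1, int(cm[t, p]))
          for t in range(cm.shape[0]) for p in range(cm.shape[1])
          if t != p and cm[t, p] > 0]
print(errors)  # [(2, 4, 2), (4, 3, 1)]
```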


  • Correlation Matrix

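A correlation matrix like the one shown is typically computed with pandas and rendered as a heatmap. A sketch on a few synthetic stand-in columns (the real analysis would use all 34 attributes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in columns mimicking the 0-3 degree encoding plus age.
df = pd.DataFrame({
    "erythema": rng.integers(0, 4, 100),
    "scaling": rng.integers(0, 4, 100),
    "itching": rng.integers(0, 4, 100),
    "age": rng.integers(10, 80, 100),
})

corr = df.corr()   # pairwise Pearson correlations
print(corr.round(2))

# Heatmap rendering (optional, requires seaborn):
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```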


  • Multiclass ROC Curve: To visualize the performance of a multiclass classification model, we can plot a ROC curve for each class separately. For each class, we can calculate the true positive rate (TPR) and false positive rate (FPR) by considering that class as the positive class and all the other classes as the negative class. We can then plot the TPR against the FPR for different classification thresholds to create a ROC curve for that class.
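The one-vs-rest procedure described above can be sketched with scikit-learn's `label_binarize` and `roc_curve`, again on a synthetic six-class stand-in for the dermatology data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc

# Synthetic six-class stand-in for the dermatology data.
X, y = make_classification(n_samples=360, n_features=34, n_informative=10,
                           n_classes=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)                      # (n_samples, 6)
y_bin = label_binarize(y_test, classes=list(range(6)))   # one column per class

# One-vs-rest: treat each class as positive, the rest as negative.
roc_auc = {}
for i in range(6):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr, tpr)
print({i: round(a, 3) for i, a in roc_auc.items()})
```

Plotting each `(fpr, tpr)` pair with matplotlib then yields one ROC curve per class, as in the figure.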



Showcasing a prediction for a new patient (data similar to my dataset)

To predict the class of a new patient using a machine learning model trained on the dermatology dataset, we need to extract the same features from the patient as the training dataset, i.e., the 34 attributes. Then, we can feed these features into the trained model, which will predict the class of the new patient.

For example, suppose we have a new patient with the following features:

  • erythema: 2
  • scaling: 1
  • definite borders: 2
  • itching: 3
  • koebner phenomenon: 1
  • polygonal papules: 2
  • follicular papules: 1
  • oral mucosal involvement: 0
  • knee and elbow involvement: 2
  • scalp involvement: 2
  • family history: 1
  • melanin incontinence: 1
  • eosinophils in the infiltrate: 0
  • PNL infiltrate: 3
  • fibrosis of the papillary dermis: 1
  • exocytosis: 1
  • acanthosis: 2
  • hyperkeratosis: 2
  • parakeratosis: 1
  • clubbing of the rete ridges: 2
  • elongation of the rete ridges: 2
  • thinning of the suprapapillary epidermis: 1
  • spongiform pustule: 0
  • munro microabcess: 0
  • focal hypergranulosis: 2
  • disappearance of the granular layer: 0
  • vacuolisation and damage of basal layer: 3
  • spongiosis: 2
  • saw-tooth appearance of retes: 1
  • follicular horn plug: 0
  • perifollicular parakeratosis: 0
  • inflammatory mononuclear infiltrate: 3
  • band-like infiltrate: 0
  • Age: 42

We can use the trained machine learning model to predict the class of the new patient based on these features. For instance, if we have used a Random Forest classifier to train the model, we can use the `predict()` method of the classifier to get the predicted class for the new patient. Suppose the classifier predicts that the new patient belongs to class 1, which corresponds to psoriasis. Then we can conclude that the patient most likely has psoriasis based on the clinical and histopathological features.
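Concretely, the `predict()` call looks like the sketch below. The feature vector lists the 34 values above in dataset column order; the classifier here is fitted on synthetic stand-in data, whereas in practice it would be the Random Forest trained on the real dermatology training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the trained model; in practice `clf` is the
# Random Forest fitted on the dermatology training set.
X, y = make_classification(n_samples=360, n_features=34, n_informative=10,
                           n_classes=6, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# The 34 feature values listed above, in dataset column order
# (33 clinical/histopathological degrees plus age).
new_patient = np.array([[2, 1, 2, 3, 1, 2, 1, 0, 2, 2, 1, 1, 0, 3, 1, 1, 2,
                         2, 1, 2, 2, 1, 0, 0, 2, 0, 3, 2, 1, 0, 0, 3, 0, 42]])

predicted_class = clf.predict(new_patient)[0]
print(predicted_class)
```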

Conclusion:

The attributes are a mix of clinical and histopathological attributes that describe different aspects of skin diseases. The class distribution of the dataset is imbalanced across its 6 classes, namely psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris.

Overall, this dataset can be a valuable resource for developing diagnostic tools and improving our understanding of skin diseases. The development of such models has significant implications for the early detection and prevention of ESD. With early detection, individuals can receive timely treatment, reducing the severity of the disease and the associated costs of treatment. Additionally, preventative measures such as sun protection and regular skin screenings can be implemented for high-risk individuals.

Factors such as changes in the environment, genetics, and individual behavior can influence the development of ESD and are not captured by the models.


  • Dataset link


  • Github Link:
