Titanic with ML and Evaluation of Classification Models

Firstly, this is one of the projects I finished when I started my journey with ML. I'll upload my projects one by one, and with each one explain a trick, an algorithm, or a special technique.

It is therefore the "hello world" project of ML, built on the Titanic dataset.

The process of working on the data is divided into the following steps:

1- Import the dataset

2- Clean the data and handle missing values

3- Visualize the distributions

4- Check the bias of the target

5- Encode the data so it can be used in the ML process

6- Feature extraction

7- Split the data

8- Select the classification model

9- Evaluate the model and interpret what its accuracy really means

  • Using pandas, start by reading the data and showing its head.

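In code, the first step looks like this. A minimal sketch: the original notebook reads Kaggle's train.csv, for which a tiny inline sample stands in here.

```python
import pandas as pd

# The original notebook reads Kaggle's file:
# df = pd.read_csv("train.csv")
# A tiny inline sample stands in for it here.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived":    [0, 1, 1],
    "Pclass":      [3, 1, 3],
    "Name":        ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Sex":         ["male", "female", "female"],
    "Age":         [22.0, 38.0, None],
})
print(df.head())  # first rows of the dataset
```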

Checking for empty values is a very important step, because machine learning models can't accept them, and if they are converted to values incorrectly the model will effectively ignore that data in its predictions (underfitting).

  • We have 177 nulls in Age, which is not a small number relative to the size of the data.
  • In Cabin we have 687 missing values. We can drop this whole column: it will not affect the relation to the target, and with 687 of 891 values missing it is hard to recover them with any sensible technique.

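The null check itself is one line in pandas (a sketch on sample data with the two problematic columns):

```python
import pandas as pd

df = pd.DataFrame({
    "Age":   [22.0, None, 30.0, None],
    "Cabin": [None, None, "C85", None],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})
missing = df.isnull().sum()  # null count per column
print(missing)
```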

  • Drop the non-useful columns. We decide this logically, based on the relation between each feature and the target. For example, a passenger named 'ZAZA' doesn't have a higher probability of dying than one named 'ZIZI'.

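A sketch of the drop, assuming the usual Titanic column names:

```python
import pandas as pd

df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name":        ["Braund, Mr. Owen", "Cumings, Mrs. John"],
    "Ticket":      ["A/5 21171", "PC 17599"],
    "Cabin":       [None, "C85"],
    "Sex":         ["male", "female"],
    "Survived":    [0, 1],
})
# Identifiers and names carry no signal about survival; Cabin is mostly missing.
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
print(df.columns.tolist())
```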

  • New data after dropping the non-useful columns


  • Show the distribution of each feature and its relation to the target.

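The article shows this step with plots; the same information can also be pulled non-graphically with summary tables, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "female"],
    "Age":      [22, 38, 26, 35, 27],
    "Survived": [0, 1, 1, 0, 1],
})
print(df["Age"].describe())                  # distribution of a numeric feature
rate = df.groupby("Sex")["Survived"].mean()  # relation of a feature to the target
print(rate)
```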

  • Check the class counts of the target. You should care about this because an imbalance in the quantity of any class will bias the model towards the class with the larger quantity of training data.
  • Here, though, I think the classes are reasonably balanced.

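A quick way to check the balance (a sketch on a dummy target):

```python
import pandas as pd

target = pd.Series([0, 1, 1, 0, 0, 1, 0, 1])
counts = target.value_counts(normalize=True)  # class proportions
print(counts)
```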

  • Columns of type object, or holding string values, must be converted, because a model is, as you know, a set of statistical formulas; so they must be converted with techniques such as one-hot encoding or label encoding.
  • OneHotEncoder encodes categorical integer features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True, else a 2-D array. You can't cast a 2-D array (or sparse matrix) into a pandas Series; you must create a pandas Series (a column in a pandas DataFrame) for each category.
  • The get_dummies function does this easily: just pass it your dataset.

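A sketch of get_dummies on the two categorical Titanic columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex":      ["male", "female"],
    "Embarked": ["S", "C"],
    "Age":      [22.0, 38.0],
})
# Each category becomes its own 0/1 column; numeric columns pass through.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())
```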

  • And now we drop the rows that still contain nulls.

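In pandas this is a single call (a sketch):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, None, 30.0], "Survived": [0, 1, 1]})
df = df.dropna()  # drop any row that still has a null
print(len(df))
```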

  • Backward elimination is a feature selection technique used while building a machine learning model. It removes features that do not have a significant effect on the dependent variable or the prediction of the output. There are various ways to build a model in machine learning:

  1. All-in
  2. Backward Elimination
  3. Forward Selection
  4. Bidirectional Elimination
  5. Score Comparison

Steps of Backward Elimination

Step-1: Firstly, We need to select a significance level to stay in the model. (SL=0.05)

Step-2: Fit the complete model with all possible predictors/independent variables.

Step-3: Choose the predictor with the highest P-value. Then:

  1. If P-value > SL, go to step 4.
  2. Else finish: our model is ready.

Step-4: Remove that predictor.

Step-5: Rebuild and fit the model with the remaining variables.
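The steps above can be sketched in plain NumPy/SciPy. This is a hand-rolled illustration on toy data (the article's notebook presumably uses a library OLS summary for the p-values); the function names are mine, not from the original.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS and return a two-sided p-value for each coefficient."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the estimates
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), df=n - k)

def backward_elimination(X, y, sl=0.05):
    """Repeatedly drop the predictor with the highest p-value above sl."""
    cols = list(range(X.shape[1]))
    while cols:
        pvals = ols_pvalues(X[:, cols], y)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= sl:                  # everything left is significant
            break
        cols.pop(worst)                         # Step-4: remove that predictor
    return cols

# Toy data: column 1 drives y; columns 0 (intercept) and 2 are noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100), rng.normal(size=100)])
y = 3 * X[:, 1] + rng.normal(scale=0.1, size=100)
kept = backward_elimination(X, y)
print(kept)
```

The significant predictor (column 1) survives the elimination loop, while pure-noise columns tend to be dropped.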


The features selected by backward elimination are:


  • The train-test split procedure is used to estimate the performance of machine learning algorithms when they make predictions on data not used to train the model. It is a fast and easy procedure to perform, and its results let you compare the performance of different algorithms on your predictive modeling problem. Although simple to use and interpret, there are times when it should not be used, such as when you have a small dataset, or when additional configuration is required, for example classification on an imbalanced dataset.

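A minimal sketch with scikit-learn (dummy arrays stand in for the Titanic features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# stratify=y keeps the class proportions equal in both splits,
# which matters when the target is imbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```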

  • Logistic regression is another powerful supervised ML algorithm used for binary classification problems (when the target is categorical). The best way to think about logistic regression is that it is a linear regression, but for classification problems. Logistic regression essentially uses the logistic function defined below to model a binary output variable (Tolles & Meurer, 2016). The primary difference between linear regression and logistic regression is that logistic regression's range is bounded between 0 and 1. In addition, as opposed to linear regression, logistic regression does not require a linear relationship between inputs and output variables. This is due to applying a nonlinear log transformation to the odds ratio.
  • Logistic function: σ(x) = 1 / (1 + e^(−x)), the sigmoid function.

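A sketch of the sigmoid and a logistic regression fit on toy one-dimensional data (not the article's actual features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    """The logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Toy separable data: class flips from 0 to 1 as x grows.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
preds = clf.predict([[0.5], [4.5]])
print(sigmoid(0.0), preds)
```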

  • In classification we can't rely on accuracy alone; it can be misleading.

But why?

  • You must be wondering: 'Can't we just use the model's accuracy as the holy grail metric?'
  • Accuracy is very important, but it might not be the best metric all the time. Let's look at why with an example:

Let's say we are building a model which predicts whether a bank loan will default or not.

  • (The S&P/Experian Consumer Credit Default Composite Index reported a default rate of 0.91%)
  • Let’s have a dummy model that always predicts that a loan will not default. Guess what would be the accuracy of this model? ===> 99.10%

Impressive, right? Well, the probability of a bank buying this model is absolutely zero.

While our model has a stunning accuracy, this is an apt example where accuracy is definitely not the right metric.
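The arithmetic behind the example, as a quick sketch (1000 loans with a ~0.9% default rate):

```python
import numpy as np

y_true = np.zeros(1000, dtype=int)
y_true[:9] = 1                      # 9 actual defaults out of 1000 loans
y_pred = np.zeros(1000, dtype=int)  # dummy model: never predicts a default

accuracy = (y_true == y_pred).mean()
recall = y_pred[y_true == 1].mean()  # fraction of real defaults it catches
print(accuracy, recall)              # stunning accuracy, useless recall
```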

If not accuracy, what else?

Along with accuracy, there are a bunch of other methods to evaluate the performance of a classification model

  • Confusion?matrix,
  • Precision, Recall
  • ROC and AUC

Confusion Matrix

  • Now that we are familiar with TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives), it will be very easy to understand what a confusion matrix is.
  • It is a summary table showing how good our model is at predicting examples of the various classes. Its axes are predicted labels vs. actual labels.

Precision and Recall

Precision — also called positive predictive value: the ratio of correct positive predictions to the total predicted positives.

Recall — also called sensitivity, probability of detection, or true positive rate: the ratio of correct positive predictions to the total actual positives.

Accuracy

Accuracy is defined as the ratio of correctly predicted examples to the total number of examples.
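All three metrics, plus the confusion matrix, are one call each in scikit-learn (a sketch on dummy labels):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
print(cm)
print(precision_score(y_true, y_pred),  # TP / (TP + FP)
      recall_score(y_true, y_pred),     # TP / (TP + FN)
      accuracy_score(y_true, y_pred))   # (TP + TN) / total
```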


The END
