Machine Learning: Introduction and Practical Example

I have always loved studying and believe in the transformative power of knowledge. Therefore, it is equally important to share knowledge whenever possible.

In this article, I'd like to talk a little about a topic that I have been reading a lot about lately: Machine Learning.

The intention of this article is not to exhaust the subject or delve deeper into its concepts. Instead, it aims to provide a concise overview so that you, as the reader, have a starting point to explore further.

Introduction to Machine Learning

Machine Learning, in simplified terms, can be understood as the use of recipes (mathematical algorithms) that, when exposed to certain situations, can predict, replicate, or identify patterns in these situations.

This is analogous to human learning. We are not born knowing the names of the seasons, or that we will get wet if we are caught in the rain; exposure to such situations is what gives us the ability to identify them, replicate them (through images or drawings, for example) and even predict them.

Just as different individuals have varied learning abilities – some excel in calculations, others in spatial perception and still others in writing – there are different algorithms in Machine Learning, each with its performance varying in different scenarios. One of the tasks of Machine Learning is to select the most suitable algorithm for a given scenario.

Practical example

To illustrate how Machine Learning works, let's walk through a practical example. We will use the Python language (with the PyCaret module) and a dataset containing information about breast cancer to build a model capable of determining whether a patient's tumor is malignant or benign based on measurements of its cell nuclei.

If you want to try the code below, I recommend using Google Colab (New Notebook).

# Installation of the PyCaret library, which simplifies the experimentation process with various ML models and pipelines.
# (%pip is a notebook magic; in a terminal, run "pip install pycaret" instead.)
%pip install pycaret

# Importing functions and classes from the PyCaret library for classification tasks.
from pycaret.classification import *

# Importing the pandas library and giving it the alias pd.
import pandas as pd

# Importing the train_test_split function for later data division between train and test
from sklearn.model_selection import train_test_split

# Importing the breast cancer dataset from scikit-learn library
from sklearn.datasets import load_breast_cancer

# Loading the breast cancer dataset into the “data” variable
data = load_breast_cancer()

# Printing the feature names of the dataset, describing the different attributes of the patients.
print(data.feature_names)

# Printing the target labels of the dataset: 0 means the tumor is malignant, 1 means it is benign.
print(data.target)

# Creating a DataFrame called df with the dataset's data, using the feature names as column names.
df = pd.DataFrame(data.data, columns=data.feature_names)

# Printing the DataFrame df, which contains the dataset's data.
print(df)

# Adding a column named 'target' to the DataFrame df, containing the target labels.
df['target'] = data.target

# Splitting the data into training and testing sets.
x_train, x_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=45)

# Setting up the modeling environment for the classification task, including the training data, target column name, and session ID.
model_setup = setup(data=pd.concat([x_train, y_train], axis=1), target='target', session_id=123)

# Comparing the performance of various classification models using the training dataset.
model_performance = compare_models()

The code above loads a breast cancer dataset, sets aside part of the data for training the models and another part for testing them, and finally compares different algorithms to check which ones achieve the best accuracy. The result is shown in the image below:

The rows are ordered from the best to the worst algorithm. Note that the Extra Trees Classifier obtained an accuracy close to 97%, which suggests that it has the potential to be used as a complementary screening method within the scenario presented.
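To make the idea of "comparing algorithms" more concrete, the ranking can be approximated by hand with scikit-learn alone. This is only a minimal sketch: the three candidate models, the 10 folds, and the random_state values are my own illustrative choices, not a description of what PyCaret does internally, so the numbers will not match the table exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset and train/test split as in the article.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45
)

# A small, hand-picked set of candidates (illustrative, not PyCaret's full list).
candidates = {
    "Extra Trees": ExtraTreesClassifier(random_state=123),
    "Random Forest": RandomForestClassifier(random_state=123),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Mean 10-fold cross-validated accuracy, computed on the training data only.
scores = {
    name: cross_val_score(model, x_train, y_train, cv=10, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank the candidates from best to worst, like the rows of the comparison table.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The test data is deliberately left untouched here, so it can still serve as an honest final check of whichever model comes out on top.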

If you want to explore the code above a little more, I recommend that you evaluate the outputs of the code below:

# Printing the best-performing model returned by compare_models().
print(model_performance)

# Making predictions on the test data using the best-performing model identified during performance comparison.
predictions = predict_model(model_performance, data=x_test)

# Converting the predictions into a pandas DataFrame for analysis.
prediction_df = pd.DataFrame(predictions)

# Printing the DataFrame containing the model's predictions.
print(prediction_df)
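A table of predictions is easier to judge once it is compared with the true labels of the test data. The sketch below shows that step using scikit-learn only. It assumes an Extra Trees model as a stand-in for whichever model came out best in your run, and the random_state of 123 is just an illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Same dataset and train/test split as in the article.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45
)

# Stand-in for the best model (an assumption, not necessarily your result).
model = ExtraTreesClassifier(random_state=123)
model.fit(x_train, y_train)

# Predict labels for data the model has never seen.
y_pred = model.predict(x_test)

# Share of test cases that were predicted correctly.
test_accuracy = accuracy_score(y_test, y_pred)

# Rows are the true classes, columns the predicted ones (0 = malignant, 1 = benign).
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])

print(f"Accuracy on the held-out test set: {test_accuracy:.3f}")
print(matrix)
```

The confusion matrix is worth a look in a medical setting because it separates the two kinds of mistakes: malignant tumors predicted as benign, and benign tumors predicted as malignant.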

Conclusion

Although it is a topic that has only recently gained the attention of the public, Machine Learning presents a fascinating ability to imitate human learning to make predictions, identify patterns or generate information.

This article demonstrated a practical example of how Machine Learning can be applied to real situations, such as cancer diagnosis.

The volume of content available on the internet on this topic is currently immense, and I hope that this article has managed to spark your interest, so that you can seek out more information on the subject and continue learning about it.

Questions and answers

1. The article talked about Accuracy. What is that?

Note that the image displayed some columns in addition to Accuracy, such as AUC, Recall, F1, Kappa, MCC, and TT. We evaluate the best algorithm based on accuracy, but these other factors may also be helpful in your evaluation. Let me try to explain them in an uncomplicated way:

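Accuracy: The percentage of correct predictions made by the model out of all predictions. Remember that we divided the data into a part for training and another for testing? The figures in the comparison table come from cross-validation on the training part: PyCaret repeatedly trains each model on most of the training data, scores it on the remainder, and averages the results. The part we reserved for testing is only used later, when we call predict_model with x_test.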

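AUC (Area under the curve): What is the model's ability to distinguish between different classes? The curve in question is the ROC curve, which plots the true positive rate against the false positive rate as the decision threshold varies. The AUC is the area under that curve: a value of 1 means the model separates the classes perfectly, while a value of 0.5 is no better than random guessing.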

Recall: Recall measures the model's ability to find the positive instances. It is the proportion of actual positive cases that the model correctly identified as positive.

F1 score: In some situations, we have unbalanced classes (imagine for example that you have a large amount of data on patients with cancer and a low amount of data on patients without cancer). The F1 score is useful in these cases. It is the harmonic mean of Recall and Precision, where Precision is the proportion of patients who actually had cancer among all the patients that the model said had cancer.

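Kappa (Cohen's Kappa): Imagine the model's predictions and the true labels as two raters assessing the same cases. How much do they agree beyond what would be expected by chance alone? This is the Kappa measurement. A value of 1 means perfect agreement, and a value of 0 means agreement no better than chance.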

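MCC (Matthews Correlation Coefficient): MCC performs a calculation based on true positives, true negatives, false positives and false negatives. Values range from -1 to 1. A value close to 1 indicates near-perfect predictions, a value close to 0 suggests that the predictions are no better than chance, and a value close to -1 indicates that the predictions are systematically wrong.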

TT (training time in seconds): TT measures the time it takes the model to train on the given data. If time is a critical factor for you, you will want the lowest amount of training time.
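The definitions above are easier to follow with numbers. The sketch below computes each metric with scikit-learn for ten made-up cases. The labels and scores are hypothetical values invented for illustration, not results from the breast cancer model, and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true labels for 10 cases: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical model scores (estimated probability of the positive class).
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

# Turn the scores into hard predictions using a 0.5 threshold.
y_pred = [1 if score >= 0.5 else 0 for score in y_score]

# This gives 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_score),  # uses the scores, not the hard labels
    "Recall": recall_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "F1": f1_score(y_true, y_pred),
    "Kappa": cohen_kappa_score(y_true, y_pred),
    "MCC": matthews_corrcoef(y_true, y_pred),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the accuracy is 0.8 (8 of 10 correct), while Recall, Precision and F1 are all 0.75 (3 of the 4 actual positives were found, and 3 of the 4 predicted positives were right). Kappa and MCC are both about 0.58, lower than the accuracy because they discount the agreement that chance alone would produce.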
