TRAIN AND TEST A DOCUMENT CLASSIFIER IN 4 EASY STEPS USING PYTHON
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
Document Classification: Today we will tackle a very common task in Machine Learning: training a classifier and using it to predict the category of new input text, using the well-known Python package Scikit-learn. Basically, there are four simple steps:
1. LOADING THE DATA AND CONSTRUCTING THE FEATURE VECTORS
2. TRAIN THE CLASSIFIER
3. PREDICT THE TESTING SET
4. EVALUATE THE CLASSIFIER
Finally, we can save the trained model and the vectorizer for later use, and try some arbitrary examples.
The "20 Newsgroups" English corpus is used as annotated corpus for training and evaluating the classifier.
- Training Set number of Samples: 11,314 documents
- Testing Set number of Samples: 7,532 documents
For detailed evaluation and analysis, Accuracy, Confusion Matrix, Precision, Recall and F1 score are used. (see details below)
- 1) Loading the data and constructing the feature vector
- We will use Scikit-learn, a very popular Python Machine Learning toolkit that provides both the learning algorithms and the training data. We load the training data from the well-known 20 Newsgroups English corpus (after removing headers, footers, and quotes). Specific categories can also be selected at load time. A custom dataset can be created manually and loaded in almost the same way described here. The data is divided into training and testing sets.
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test')  # note: headers/footers/quotes are kept in the test set here
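To verify what was loaded, we can inspect the returned objects; data, target, and target_names are the standard fields of the bunch returned by fetch_20newsgroups:

print(len(newsgroups_train.data))     # 11314 training documents
print(len(newsgroups_test.data))      # 7532 testing documents
print(newsgroups_train.target_names)  # the 20 category names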
- Now we construct the feature vector for each sample (document). Features are based on a bag of words (BOW), where each feature represents the presence of a word; in other words, features are binary, with value 1 if the word exists in the document and 0 otherwise. To make the features more powerful, we can weight them by term frequency * inverse document frequency (TF-IDF). This helps suppress non-prominent terms: stop words that have very high frequency, such as "the, is, a, ...", and noisy words that have very low frequency. Moreover, in addition to single words (uni-grams), we can use patterns of two words (bi-grams) and three words (tri-grams).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, min_df=3, ngram_range=(1, 2))
vectors = vectorizer.fit_transform(newsgroups_train.data)
- max_df: when building the vocabulary, ignore terms that have a document frequency higher than the given threshold (a float is interpreted as a proportion of documents, so 0.8 means 80% of documents).
- min_df: when building the vocabulary, ignore terms that have a document frequency lower than the given threshold (an integer, as used here, is an absolute document count, so min_df=3 keeps only terms appearing in at least 3 documents).
With uni-grams and bi-grams (ngram_range=(1, 2)), the resulting vocabulary contains 123,072 features (terms).
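As a quick sanity check, the shape of the training matrix confirms the number of documents and extracted terms (the exact feature count can vary slightly across scikit-learn versions):

print(vectors.shape)  # (11314, 123072): documents x uni-/bi-gram features
print(vectorizer.get_feature_names_out()[:10])  # a sample of the extracted terms (get_feature_names() on older versions)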
- 2) Train the classifier
The second step is training the classifier on the training data only. Here we use a Multinomial Naive Bayes classifier with a small smoothing parameter (alpha=0.01).
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
- 3) Predict the testing set
That’s pretty much it. Now we are able to predict using the trained model. The trained classifier will try to predict the correct categories of the input documents.
vectors_test = vectorizer.transform(newsgroups_test.data)  # transform (not fit_transform): reuse the vocabulary learned from training
pred = clf.predict(vectors_test)
- 4) Evaluate the classifier
We can now test how well the classifier will perform. We compare the predicted categories with the actual ones.
from sklearn.metrics import accuracy_score

print('Accuracy:', accuracy_score(newsgroups_test.target, pred))
>> Accuracy: 0.77
For a detailed analysis of the classifier's performance, we usually use the following measures, which can show us the strengths and weaknesses of the classification and the opportunities for improvement.
1. Accuracy is the percentage of correctly classified test samples, measured over all the classes.
2. Confusion Matrix is a powerful tool that enables deep analysis of how misclassification happens for each class: each row of the matrix represents the instances of an actual class while each column represents the instances of a predicted class (or vice versa, depending on convention). The larger the values on the diagonal, the better the classification (less confusion).
3. Precision, Recall, and F1 score. Generally, precision is a measure of result relevancy, while recall is a measure of result coverage. A system with high recall but low precision for a certain class returns many results, but most of its predicted labels are incorrect. A system with high precision but low recall for a certain class is just the opposite: it returns very few, mostly correct results, but suffers from low coverage of the correct labels.
The F1 score is defined as the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). A good system is one that balances precision and recall and has a high F1 score; however, some applications prefer high precision, while others prefer high recall.
These measures help in comparing system performance both overall and per class: some classes perform better than others, some classes are very similar and get misclassified as each other, while other classes are distinctive enough to be easily distinguished. Careful study of these measures tells a lot about the detailed performance across classes, what goes wrong, and even how to enhance the overall and per-class accuracy.
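All of these measures are available in scikit-learn's metrics module; a minimal sketch for the classifier trained above:

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision, recall, and F1, plus overall averages
print(classification_report(newsgroups_test.target, pred, target_names=newsgroups_test.target_names))
# rows are actual classes, columns are predicted classes (scikit-learn's convention)
print(confusion_matrix(newsgroups_test.target, pred))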
Save and Load
One final note, we can save the trained model and the vectorizer for later use.
import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

joblib.dump(vectorizer, "my_vectorizer.pkl")
joblib.dump(clf, "my_model.pkl")
Then, at any time, we can just load the model and the vectorizer and classify new text as usual:
clf = joblib.load("my_model.pkl")
vectorizer = joblib.load("my_vectorizer.pkl")
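As a design note, scikit-learn's Pipeline can bundle the vectorizer and the classifier into a single object, so only one file needs to be saved and loaded. A minimal sketch under the same settings as above (the filename my_pipeline.pkl is just an example):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

pipeline = make_pipeline(TfidfVectorizer(max_df=0.8, min_df=3, ngram_range=(1, 2)), MultinomialNB(alpha=.01))
pipeline.fit(newsgroups_train.data, newsgroups_train.target)  # raw text in: no manual vectorization step
joblib.dump(pipeline, "my_pipeline.pkl")  # example filename: one artifact instead of two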
Let’s try some arbitrary examples
Example 1:
- new_text = ['space is cold']
- vectors_test = vectorizer.transform(new_text)
- pred = clf.predict(vectors_test)
- print(newsgroups_train.target_names[pred[0]])  # pred is an array; index its first element to look up the category name
- Result: sci.space
Example 2:
- new_text = ['God loves all people']
- vectors_test = vectorizer.transform(new_text)
- pred = clf.predict(vectors_test)
- print(newsgroups_train.target_names[pred[0]])
- Result: soc.religion.christian
Example 3:
- new_text = ['Image processing is very interesting']
- vectors_test = vectorizer.transform(new_text)
- pred = clf.predict(vectors_test)
- print(newsgroups_train.target_names[pred[0]])
- Result: comp.graphics
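As a follow-up to these examples, MultinomialNB also exposes class probabilities through predict_proba, which can be used to inspect how confident each prediction is; a minimal sketch continuing from the objects above:

probs = clf.predict_proba(vectorizer.transform(['space is cold']))
top = probs[0].argmax()  # index of the most probable class
print(newsgroups_train.target_names[top], probs[0][top])  # predicted category and its probability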
And that’s how we build a simple document classifier!
Best Regards
Ibrahim Sobh