TRAIN AND TEST A DOCUMENT CLASSIFIER IN 4 EASY STEPS USING PYTHON
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
Document Classification: Today we will tackle a very common task in Machine Learning: training a classifier and using it to predict the category of new input text, using the well-known Python package Scikit-learn. Basically, there are four simple steps:
1. LOADING THE DATA AND CONSTRUCTING THE FEATURE VECTORS
2. TRAIN THE CLASSIFIER
3. PREDICT THE TESTING SET
4. EVALUATE THE CLASSIFIER
Finally, we can save the trained model and the vectorizer for later use, and try some arbitrary examples.
The "20 Newsgroups" English corpus is used as annotated corpus for training and evaluating the classifier.
- Training Set number of Samples: 11,314 documents
- Testing Set number of Samples: 7,532 documents
For detailed evaluation and analysis, Accuracy, Confusion Matrix, Precision, Recall and F1 score are used. (see details below)
- 1) Loading the data and constructing the feature vector
- We will use Scikit-learn, a very popular Python Machine Learning toolkit that provides both the learning algorithms and the training data. We load the training data from the well-known 20 Newsgroups English corpus (after removing headers, footers, and quotes). Specific categories can also be selected at load time. A custom dataset can be created manually and loaded in almost the same way described here. The data is divided into training and testing sets.
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test')  # note: headers/footers/quotes are kept in the test set here
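To verify what was loaded, we can inspect the returned objects; data, target, and target_names are the standard fields of the bunch returned by fetch_20newsgroups:

print(len(newsgroups_train.data))     # 11314 training documents
print(len(newsgroups_test.data))      # 7532 testing documents
print(newsgroups_train.target_names)  # the 20 category names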
- Now we construct the feature vector for each sample (document). Features are based on a bag of words (BOW), where each feature represents the presence of a word; in other words, features are binary, with value 1 if the word exists in the document and 0 otherwise. To make the features more powerful, we can weight them by term frequency * inverse document frequency (TF-IDF). This helps suppress non-prominent terms: stop words that have very high frequency, such as "the, is, a, ...", and noisy words that have very low frequency. Moreover, in addition to single words (uni-grams), we can use patterns of two words (bi-grams) and three words (tri-grams).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, min_df=3, ngram_range=(1, 2))
vectors = vectorizer.fit_transform(newsgroups_train.data)
- max_df: when building the vocabulary, ignore terms that have a document frequency higher than the given threshold (a float is interpreted as a proportion of documents, so 0.8 means 80% of documents).
- min_df: when building the vocabulary, ignore terms that have a document frequency lower than the given threshold (an integer, as used here, is an absolute document count, so min_df=3 keeps only terms appearing in at least 3 documents).
With uni-grams and bi-grams (ngram_range=(1, 2)), the resulting vocabulary contains 123,072 features (terms).
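As a quick sanity check, the shape of the training matrix confirms the number of documents and extracted terms (the exact feature count can vary slightly across scikit-learn versions):

print(vectors.shape)  # (11314, 123072): documents x uni-/bi-gram features
print(vectorizer.get_feature_names_out()[:10])  # a sample of the extracted terms (get_feature_names() on older versions)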
- 2) Train the classifier
The second step is training the classifier on the training data only. Here we use a Multinomial Naive Bayes classifier with a small smoothing parameter (alpha=0.01).
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
- 3) Predict the testing set
That’s pretty much it. Now we are able to predict using the trained model. The trained classifier will try to predict the correct categories of the input documents.
vectors_test = vectorizer.transform(newsgroups_test.data)  # transform (not fit_transform): reuse the vocabulary learned from training
pred = clf.predict(vectors_test)
- 4) Evaluate the classifier
We can now test how well the classifier will perform. We compare the predicted categories with the actual ones.
from sklearn.metrics import accuracy_score

print('Accuracy:', accuracy_score(newsgroups_test.target, pred))
>> Accuracy: 0.77
For a detailed analysis of the classifier's performance, we usually use the following measures, which can show us the strengths and weaknesses of the classification and the opportunities for improvement.
1. Accuracy is the percentage of correctly classified test samples, measured over all the classes.
2. Confusion Matrix is a powerful tool that enables deep analysis of how misclassification happens for each class: each row of the matrix represents the instances of an actual class while each column represents the instances of a predicted class (or vice versa, depending on convention). The larger the values on the diagonal, the better the classification (less confusion).
3. Precision, Recall, and F1 score. Generally, precision is a measure of result relevancy, while recall is a measure of result coverage. A system with high recall but low precision for a certain class returns many results, but most of its predicted labels are incorrect. A system with high precision but low recall for a certain class is just the opposite: it returns very few, mostly correct results, but suffers from low coverage of the correct labels.
The F1 score is defined as the harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). A good system is one that balances precision and recall and has a high F1 score; however, some applications prefer high precision, while others prefer high recall.
These measures help in comparing system performance both overall and per class: some classes perform better than others, some classes are very similar and get misclassified as each other, while other classes are distinctive enough to be easily distinguished. Careful study of these measures tells a lot about the detailed performance across classes, what goes wrong, and even how to enhance the overall and per-class accuracy.
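All of these measures are available in scikit-learn's metrics module; a minimal sketch for the classifier trained above:

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision, recall, and F1, plus overall averages
print(classification_report(newsgroups_test.target, pred, target_names=newsgroups_test.target_names))
# rows are actual classes, columns are predicted classes (scikit-learn's convention)
print(confusion_matrix(newsgroups_test.target, pred))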
Save and Load
One final note, we can save the trained model and the vectorizer for later use.
import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

joblib.dump(vectorizer, "my_vectorizer.pkl")
joblib.dump(clf, "my_model.pkl")
Then, at any time, we can just load the model and the vectorizer and classify new text as usual:
clf = joblib.load("my_model.pkl")
vectorizer = joblib.load("my_vectorizer.pkl")
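As a design note, scikit-learn's Pipeline can bundle the vectorizer and the classifier into a single object, so only one file needs to be saved and loaded. A minimal sketch under the same settings as above (the filename my_pipeline.pkl is just an example):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

pipeline = make_pipeline(TfidfVectorizer(max_df=0.8, min_df=3, ngram_range=(1, 2)), MultinomialNB(alpha=.01))
pipeline.fit(newsgroups_train.data, newsgroups_train.target)  # raw text in: no manual vectorization step
joblib.dump(pipeline, "my_pipeline.pkl")  # example filename: one artifact instead of two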
Let’s try some arbitrary examples
Example 1:
- new_text = ['space is cold']
- vectors_test = vectorizer.transform(new_text)
- pred = clf.predict(vectors_test)
- print(newsgroups_train.target_names[pred[0]])  # pred is an array; index its first element to look up the category name
- Result: sci.space
Example 2:
- new_text = ['God loves all people']
- vectors_test = vectorizer.transform(new_text)
- pred = clf.predict(vectors_test)
- print(newsgroups_train.target_names[pred[0]])
- Result: soc.religion.christian
Example 3:
- new_text = ['Image processing is very interesting']
- vectors_test = vectorizer.transform(new_text)
- pred = clf.predict(vectors_test)
- print(newsgroups_train.target_names[pred[0]])
- Result: comp.graphics
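As a follow-up to these examples, MultinomialNB also exposes class probabilities through predict_proba, which can be used to inspect how confident each prediction is; a minimal sketch continuing from the objects above:

probs = clf.predict_proba(vectorizer.transform(['space is cold']))
top = probs[0].argmax()  # index of the most probable class
print(newsgroups_train.target_names[top], probs[0][top])  # predicted category and its probability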
And that’s how we build a simple document classifier!
Best Regards
Ibrahim Sobh