TRAIN AND TEST A DOCUMENT CLASSIFIER IN 4 EASY STEPS USING PYTHON

Document Classification: Today we will tackle a very common Machine Learning task: training a classifier and using it to predict the category of new input text, with the well-known Python package scikit-learn. Basically, there are four simple steps:

1. LOADING THE DATA AND CONSTRUCTING THE FEATURE VECTORS

2. TRAIN THE CLASSIFIER

3. PREDICT THE TESTING SET

4. EVALUATE THE CLASSIFIER
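
The four steps above can be sketched end-to-end on a tiny in-memory corpus (an illustrative sketch with made-up toy documents and labels; the article itself uses the 20 Newsgroups dataset with the settings shown below):

```python
# Minimal sketch of the four steps on a tiny in-memory toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_docs = ["the rocket reached orbit", "launch the space shuttle",
              "the team scored a goal", "football match next week"]
train_labels = [0, 0, 1, 1]                  # 0 = space, 1 = sport
test_docs = ["orbit launch today", "the goal of the match"]
test_labels = [0, 1]

# 1) load the data and construct the feature vectors
toy_vectorizer = TfidfVectorizer()
X_train = toy_vectorizer.fit_transform(train_docs)

# 2) train the classifier
toy_clf = MultinomialNB(alpha=.01)
toy_clf.fit(X_train, train_labels)

# 3) predict the testing set
X_test = toy_vectorizer.transform(test_docs)
toy_pred = toy_clf.predict(X_test)

# 4) evaluate the classifier
print(accuracy_score(test_labels, toy_pred))
```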

Finally, we can save the trained model and the vectorizer for later use, and try some arbitrary examples.

The "20 Newsgroups" English corpus is used as annotated corpus for training and evaluating the classifier.

  • Training Set number of Samples: 11,314 documents
  • Testing Set number of Samples: 7,532 documents

For detailed evaluation and analysis, Accuracy, Confusion Matrix, Precision, Recall and F1 score are used. (see details below)

1) Loading the data and constructing the feature vector

We will use scikit-learn, a very popular Python machine learning toolkit that provides both the learning algorithms and the training data. We load the training data from the well-known 20 Newsgroups English corpus (after removing headers, footers, and quotes). We can then select the desired categories to load. A custom dataset can be created manually and loaded in almost the same way described here. The data is divided into training and testing sets.
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test')
Now we construct the feature vector for each sample (document). Features are based on a bag of words (BOW), where each feature represents the presence of a word; in the simplest case features are binary, with value 1 if the word exists in the document and 0 otherwise. To make the features more powerful, we can weight them by term frequency * inverse document frequency (TF-IDF). This helps to down-weight non-prominent terms: stop words with very high frequency, such as "the, is, a, ...", and noisy words with very low frequency. Moreover, in addition to single words, we can use patterns of two words (bi-grams) and three words (tri-grams).
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, min_df=3, ngram_range=(1, 2))
vectors = vectorizer.fit_transform(newsgroups_train.data)
  • max_df: when building the vocabulary, ignore terms that have a document frequency higher than the given threshold (a float such as 0.8 means a proportion of documents).
  • min_df: when building the vocabulary, ignore terms that have a document frequency lower than the given threshold (an integer such as 3 means an absolute document count; a float means a proportion of documents).

Here we used uni-grams and bi-grams. Number of features (uni-gram and bi-gram terms): 123,072

2) Train the classifier

The second step is training the classifier based on the training data only.

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
3) Predict the testing set

That’s pretty much it. Now we are able to predict using the trained model. The trained classifier will try to predict the correct categories of input documents.

vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)  


4) Evaluate the classifier

We can now test how well the classifier will perform. We compare the predicted categories with the actual ones.

from sklearn.metrics import accuracy_score

print('Accuracy:', accuracy_score(newsgroups_test.target, pred))
>> Accuracy: 0.77

For a detailed analysis of the classifier's performance, we usually use these important measures, which can reveal the strengths and weaknesses of the classification and the opportunities for improvement.

1. Accuracy is a measure of the percentage of correctly classified test samples. This measure is used over all the classes.

2. Confusion Matrix is a powerful tool that enables deep analysis of how misclassification happens for each class: each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class (this is scikit-learn's convention; some texts use the transpose). The more values on the diagonal, the better the classification (less confusion).

3. Precision, Recall and F1 score. Generally, precision is a measure of result relevancy, while recall is a measure of result coverage. A system with high recall but low precision for a certain class returns many results, but most of its predicted labels are incorrect. A system with high precision but low recall for a certain class is just the opposite: it returns very few, mostly correct results, but suffers from low coverage of the correct labels. The F1 score is defined as the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

A good system is one that balances precision and recall and has a high F1 score. However, some applications prefer high precision, while others prefer high recall. These measures help in comparing system performance both overall and per class: some classes perform better than others; some classes are very similar and get misclassified as each other, while other classes are unique enough to be easily distinguished. Careful study of these measures tells a lot about the detailed performance across classes, what goes wrong, and even how to enhance the overall and per-class accuracy.
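
These measures can be computed with `sklearn.metrics`. A toy illustration with hypothetical labels for three classes:

```python
# Toy illustration of the evaluation measures (made-up labels);
# scikit-learn puts actual classes on rows and predicted on columns.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

actual    = [0, 0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 1, 1, 2, 2]

cm = confusion_matrix(actual, predicted)
print(cm)
# [[2 1 0]    <- one actual-0 sample was confused with class 1
#  [0 2 0]
#  [0 0 2]]

precision, recall, f1, support = precision_recall_fscore_support(actual, predicted)
# class 0: precision 1.0 (both predicted 0s are correct),
#          recall 2/3 (one of the three actual 0s was missed),
#          F1 = 2 * (1.0 * 2/3) / (1.0 + 2/3) = 0.8
print(precision, recall, f1)
```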

Save and Load

One final note, we can save the trained model and the vectorizer for later use.

import joblib

joblib.dump(vectorizer, "my_vectorizer.pkl")
joblib.dump(clf, "my_model.pkl")

Then, at any time, we can just load the model and the vectorizer and classify new text as usual:

clf = joblib.load("my_model.pkl")
vectorizer = joblib.load("my_vectorizer.pkl")
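
The save/load round trip can be sketched on a toy model (a self-contained sketch using a temporary directory; the documents, labels and file names are illustrative):

```python
# Round trip: persist a trained vectorizer and model with joblib,
# reload them, and classify new text (toy data for illustration).
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_docs = ["stars and planets", "goals and matches"]
toy_labels = ["space", "sport"]

toy_vectorizer = TfidfVectorizer()
toy_clf = MultinomialNB(alpha=.01).fit(toy_vectorizer.fit_transform(toy_docs),
                                       toy_labels)

with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(toy_vectorizer, os.path.join(tmp, "my_vectorizer.pkl"))
    joblib.dump(toy_clf, os.path.join(tmp, "my_model.pkl"))

    # later: reload and classify normally
    clf2 = joblib.load(os.path.join(tmp, "my_model.pkl"))
    vectorizer2 = joblib.load(os.path.join(tmp, "my_vectorizer.pkl"))

toy_pred = clf2.predict(vectorizer2.transform(["planets are far"]))
print(toy_pred[0])
```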


Let’s try some arbitrary examples

Example 1:

new_text = ['space is cold']
vectors_test = vectorizer.transform(new_text)
pred = clf.predict(vectors_test)
print(newsgroups_train.target_names[pred[0]])

Result: sci.space


Example 2:

new_text = ['God loves all people']
vectors_test = vectorizer.transform(new_text)
pred = clf.predict(vectors_test)
print(newsgroups_train.target_names[pred[0]])

Result: soc.religion.christian


Example 3:

new_text = ['Image processing is very interesting']
vectors_test = vectorizer.transform(new_text)
pred = clf.predict(vectors_test)
print(newsgroups_train.target_names[pred[0]])

Result: comp.graphics


 And that’s how we build a simple document classifier!


Best Regards

Ibrahim Sobh
