Cross-Lingual Text Classification

Introduction

Sport news has become a part of everyday life. Besides local games, there are many world-level competitions and championships, and they are widely covered in the press. Such news naturally appears in different languages, while the content (meaning) stays the same. From the product viewpoint, we would like to build AI solutions that “understand” the meaning of a text regardless of its language. Thus, we created a method for building cross-lingual models and prepared several ML models to confirm in practice that the method works well.

In this post we show the details of the method as well as the details of the models we created. We also touch on the data sets we used, optimization details, and the nuances of our metrics. Finally, we share our experimental results and quality reports.

Problem Statement

InHabit is going to be multilingual. It will support articles about different sports (as well as non-sport content) in the 5 most popular languages: English, French, Italian, German, and Spanish (EFIGS). Our recent ML models worked with texts in English only.

But there are several constraints. We have a pretty small training set (about 3-5K articles per sport), and it is in English only. Obviously, having so few articles, and only in English, we cannot train our own word embeddings - a model built on them would be heavily biased towards the distribution we have. The main restriction is that the vocabulary would be in English only (which actually contradicts our initial goal). One more limitation of training on English data: all texts/words in FIGS must have vector values close to their English counterparts. Only in this case can we train a model on English data alone and hope that, receiving a text in German, it will classify it properly.

Related works

Cross-lingual Text Classification (CLTC) is the task of classifying documents written in different languages into the same taxonomy of categories. The most powerful approach in modern Natural Language Processing (NLP), introduced by Mikolov et al., opened a new epoch: the use of word embeddings [1, 2, 3]. Its rethinking led to such techniques as doc2vec or paragraph2vec [4]. But all of this was still limited to mono-lingual embeddings. Cross-lingual machine learning, in its turn, emerged in the area of machine translation [5]. It can also be divided into pure word embeddings (this time, multilingual vector representations) and sentence/paragraph/document embeddings [6, 7, 8, 9].

The most interesting results in the field of cross-lingual word embeddings came from investigating how vectors of parallel words from different languages relate to each other. Mikolov et al. showed that this relation is linear, so it is possible to find a transformation (translation) matrix from one word embedding representation (say, English) to another (French) [10, 11, 12, 13]. This leads to the notion of word vector alignment: a transformation of several word embedding representations (e.g. belonging to different languages) into a single common space. As a result, it is possible to create a kind of linguistically invariant word embeddings: the vector space captures the actual meanings of words (hence, it is truly a semantic space), regardless of the text language.

Figure 1: Cross-lingual word embeddings across two languages (from [13]).

Method

Single multilingual vocabulary 

Thus, we have 5 aligned word2vec vocabularies (EFIGS). How do we create a model that determines which sport a text is about in any of these 5 languages? The quick answer is language detection. A possible flow is to detect the text language first, pick the corresponding vocabulary, and then use it as word embeddings in a classifier. But is language detection a cheap task? It turned out that the simple approaches are rather heuristic (e.g. matching stop-words found in the text against each vocabulary), while a more precise language detection model (yes, it is a separate model!) is more complex and stretches the run time of the overall pipeline.

But another question is: do we need to detect the language at all? All we need is to figure out the topic of the text, in whatever language it is! What if we simply concatenate all 5 vocabularies into a single one? Then every possible EFIGS word would be in this vocabulary. And, since our 5 initial word2vec vocabularies are already aligned, it would be a comprehensive, true word2vec representation: all words with close meanings correspond to close vectors.

Figure 2: Multilingual word-set in a single vector space.

But what if two vocabularies/languages contain words with the same spelling? Since these are word embeddings, and, moreover, all vocabularies are aligned, we can apply the following trick. We order the languages by the impact of a classification error on the business (starting with the biggest drawback). Merging the vocabularies step by step in this priority order, we skip a word if the final vocabulary already contains it. One more question is: what exact set of stop-words should we use? Our answer is to concatenate the stop-word sets for all languages too.
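For illustration, here is a minimal sketch of such a priority merge, assuming each aligned vocabulary is available as a Python dict mapping a word to its NumPy vector (the variable names and the exact priority order below are only an example):

```python
import numpy as np

def merge_aligned_vocabularies(vocabs_by_priority):
    """Merge aligned word2vec vocabularies into one multilingual vocabulary.

    vocabs_by_priority: list of dicts {word: np.ndarray}, ordered so that the
    language whose classification errors hurt the business most comes first.
    If the same spelling occurs in several languages, the vector from the
    higher-priority language is kept.
    """
    merged = {}
    for vocab in vocabs_by_priority:
        for word, vector in vocab.items():
            if word not in merged:  # keep the higher-priority entry
                merged[word] = vector
    return merged

# Hypothetical usage: English first, then the remaining FIGS vocabularies.
# merged_vocab = merge_aligned_vocabularies([en_vocab, fr_vocab, it_vocab, de_vocab, es_vocab])
# Stop-words are handled analogously, as a union of the per-language sets:
# stop_words = en_stops | fr_stops | it_stops | de_stops | es_stops
```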

Thus, using such a multilingual vocab, we have the following opportunities:

  • We can apply it to a text without knowing which exact (EFIGS) language it uses.
  • The text can even be multilingual itself.
  • Word embeddings can be considered in 2 senses: there is vector similarity for similar words within one language as well as across different languages (including parallel words).
  • As a result, we can use this vocabulary to train a multilingual model, using a monolingual corpus only.

Doc2vec representation

Since we use a word2vec representation, each text can be presented as a matrix (2D array) of float values, where rows represent words and columns are hidden (latent) factors. But what can we do with that matrix? We need to classify a whole document. That means we first need to transform the matrix into a vector (1D array) - this is the conversion of a text into a sort of doc2vec [4] representation - and then apply to this 1D array such models as Logistic Regression, Decision Tree (or one of the tree ensembles), MLP, etc. To get such a 1D array we can use:

  • Convolutional (2D) Neural Network. We described the approach in the previous post.
  • Bag-of-Features. It is a primitive, but sometimes workable, approach. We average the initial 2D array along the word axis. As a result, we get a vector with the same dimension as a single word2vec word and with the same hidden factors (in the same order). So, we average the factors contributed by every word of the input text. Unfortunately, such an approach is usually considered sub-optimal: word ordering is not taken into account, so the phrases "I am happy" and "I am not happy" can end up with very similar final vectors.
  • Since a text is a sequence of words, we can use a Recurrent Neural Network (RNN). In this case the final hidden state of the RNN (the value of the output vector of the RNN layer after passing the last word) plays the role of the doc2vec representation.

We have chosen the last approach.
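To make the first two steps concrete, here is a small NumPy sketch of the text-to-matrix conversion and of the Bag-of-Features baseline; the vocabulary dict and the 300-dimensional vectors are assumptions, and the RNN variant we actually chose is sketched in the architecture section below.

```python
import numpy as np

def text_to_matrix(text, vocab, dim=300):
    """Map a text to a 2D array: one row per known word, columns are latent factors."""
    rows = [vocab[w] for w in text.lower().split() if w in vocab]
    return np.vstack(rows) if rows else np.zeros((1, dim))

def bag_of_features(matrix):
    """Baseline doc2vec: average the word vectors over the word axis."""
    return matrix.mean(axis=0)
```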

Direct translation into FIGS

For the purpose of data augmentation we translated a portion of randomly selected texts into the FIGS languages. Despite the initial goal of using only an English corpus, this procedure extended our data, increased its variation, and probably helped the model "prepare" for texts in FIGS.

Adding special noise layers

Thus, the training corpus was mainly in English. The only thing that allowed us to expect that texts in FIGS would be classified well by a model trained on English texts was the alignment of the word2vec vocabularies. But there were several problems:

  • The order of words in parallel sentences varies.
  • Parallel sentences can have different numbers of words.
  • Two parallel words still differ somewhat in their word2vec representations.

Based on these observations, we built 3 auxiliary layers:

  • Drop Words layer. With a probability of 10%, every word in a training text is dropped during the current iteration.
  • Permute Words layer. Words within a window of size 6 are shuffled a bit.
  • Gaussian Noise layer. White Gaussian noise (SNR = 20 dB) is added to every word vector.

The main intuition for creating such noise layers was taken from [5].

One important detail about these layers is that they modify a mini-batch on-the-fly. So, with the predefined probabilities, the mini-batch the model is trained on always differs a bit. On one hand, this is a kind of dynamic data augmentation; on the other hand, it is a well-known regularization approach (adding noise to the input). But we tried to align this approach with our intuition about the possible differences between two parallel texts.

And, of course, all these layers are used only during the training phase.
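Below is a minimal NumPy sketch of one possible interpretation of the three noise layers, applied to a padded mini-batch of shape (batch, time, dim); in the real model they were implemented as training-only network layers, so the details here are assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def drop_words(batch, p=0.1):
    """Zero out whole word vectors with probability p.

    Zeroed words are later skipped by the Masking layer, which effectively drops them.
    """
    keep = rng.random(batch.shape[:2]) >= p          # shape (batch, time)
    return batch * keep[..., None]

def permute_words(batch, window=6):
    """Shuffle words slightly: permute positions inside non-overlapping windows."""
    out = batch.copy()
    steps = batch.shape[1]
    for start in range(0, steps, window):
        idx = rng.permutation(np.arange(start, min(start + window, steps)))
        out[:, start:start + window] = batch[:, idx]
    return out

def gaussian_noise(batch, snr_db=20.0):
    """Add white Gaussian noise at the given signal-to-noise ratio (in dB).

    Signal power is estimated over the whole padded batch, which is a simplification.
    """
    signal_power = np.mean(batch ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return batch + rng.normal(0.0, np.sqrt(noise_power), size=batch.shape)
```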

Network architecture

The noise layers are followed by an LSTM layer, preceded by a Masking layer (to ignore "empty" padded words). The more complicated models (see below) use a bidirectional LSTM (Bi-LSTM) instead of a unidirectional one. The final state produced by the LSTM is passed into a fully-connected block which contains one or more dense layers (with ReLU activation for the hidden layers). Dropout layers are used for regularization, and Batch Normalization layers are used to accelerate learning. The output has the appropriate activation: sigmoid for the binary models or softmax for the multiclass models.

Figure 3: Architecture of Neural Network for cross-lingual text classification.
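The following Keras sketch shows the overall structure (Masking, then (Bi-)LSTM, then the fully-connected block with Dropout and Batch Normalization, then the sigmoid/softmax output). The layer sizes are illustrative, and the training-only noise layers are omitted for brevity.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(time_steps=500, dim=300, lstm_units=24,
                     n_classes=4, bidirectional=False):
    """Masking -> (Bi-)LSTM -> dense block -> sigmoid/softmax output."""
    inputs = keras.Input(shape=(time_steps, dim))
    # Zero-padded "empty" word vectors are skipped by the recurrent layer.
    x = layers.Masking(mask_value=0.0)(inputs)
    lstm = layers.LSTM(lstm_units, dropout=0.3)
    x = layers.Bidirectional(lstm)(x) if bidirectional else lstm(x)
    # Fully-connected block: dense layer(s) with ReLU,
    # plus Batch Normalization (faster learning) and Dropout (regularization).
    x = layers.Dense(30, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    if n_classes == 2:
        outputs = layers.Dense(1, activation="sigmoid")(x)            # binary models
    else:
        outputs = layers.Dense(n_classes, activation="softmax")(x)    # multiclass models
    return keras.Model(inputs, outputs)

# E.g. the Brand safety model described below roughly corresponds to
# build_classifier(lstm_units=100, n_classes=2, bidirectional=True).
```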

Model details

Sport vs. Non-sport

The first model in the series, intended to prove the method described above, is the one that determines whether a text is about sport. The model takes a 500x300 input matrix, passes it through the noise layers (mentioned above) and one (unidirectional) LSTM layer (with a dropout of 0.1) with only 11 units in the hidden state, followed by a single dense layer with sigmoid activation (the output layer). The data was a small (~5K) set of InHabit historical URLs, labeled manually.

The final stage of creating a binary model is tuning the binarization threshold. And, since this is a real business, the pure F1 score is inappropriate, because the "sport" and "non-sport" classes do not have similar importance; compare this to the Basketball model, which predicts 2 classes (Pro vs. College) that are equally important. For "sport" vs. "non-sport", it is better to try to show InHabit on a non-sport text (there is a big chance it will not be displayed anyway due to the lack of sport data in the text) than to lose an article about sport [fig. 4]. So, we use the Recall metric. The same approach was discussed in one of the previous posts.
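A possible sketch of this threshold tuning, assuming a validation set with 0/1 labels (1 = "sport") and the model's predicted probabilities for that class:

```python
import numpy as np

def recalls_vs_threshold(y_true, y_score, thresholds=np.linspace(0.01, 0.99, 99)):
    """Sweep binarization thresholds and report the recall of each class."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rows = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        sport_recall = (y_pred[y_true == 1] == 1).mean()
        non_sport_recall = (y_pred[y_true == 0] == 0).mean()
        rows.append((t, sport_recall, non_sport_recall))
    return rows

# The threshold is then chosen to keep the "sport" recall high,
# even at the cost of a lower precision on non-sport texts.
```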

Figure 4: Distribution of sport and non-sport texts.

All sports

The second model is "All sports". It separates texts into high-level classes: baseball, basketball, football, and soccer. The model structure is similar to the previous one: an input of size 500x300, noise layers, one (unidirectional) LSTM layer (with a dropout of 0.3) with 24 units in the hidden state, followed by a single dense layer with softmax activation over the 4 classes (the output layer). The data was a set of ~10K articles from InHabit historical URLs as well as articles crawled from the Internet. All data was labeled manually.

This time it is a multi-class problem. Hence, we use softmax, and the way from a prediction (e.g. [0.059, 0.4984, 0.1951, 0.2475]) to a classification is a piece of cake: we just take the argmax (the output label is the class with the maximum score). Given that, the first impression may be that the F1 score is the appropriate metric, and it is almost the best choice. But, since all 4 classes are equally important, we would like to have a similar level of errors for all of them. Besides minimizing the error rate averaged over all classes, we also minimize the maximum difference between per-class recalls. The problem we solve in this way is a confusion matrix with biased recalls: e.g. the matrix [[9, 0, 1], [1, 9, 0], [0, 1, 9]] is more appropriate than [[10, 0, 0], [1, 7, 2], [0, 0, 10]].
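The post does not spell out the exact formula of the Recall Diff score reported in the results section, but one plausible formulation consistent with this idea is:

```python
import numpy as np

def recall_diff_score(conf_matrix):
    """1 minus the largest gap between per-class recalls (1.0 = perfectly balanced)."""
    cm = np.asarray(conf_matrix, dtype=float)
    recalls = np.diag(cm) / cm.sum(axis=1)   # rows are true classes
    return 1.0 - (recalls.max() - recalls.min())

print(recall_diff_score([[9, 0, 1], [1, 9, 0], [0, 1, 9]]))    # 1.0 - recalls are balanced
print(recall_diff_score([[10, 0, 0], [1, 7, 2], [0, 0, 10]]))  # 0.7 - recalls are biased
```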

Brand safety model

For the purpose of building the brand safety solution (to show InHabit on brand-safe content only) we continue to improve our Blocking model. Although this model shows very good results in production and passes only legitimate content, it is mono-lingual. So, it is time to create a cross-lingual version of the model.

In general, the architecture of the new Brand safety model is pretty much the same as the one we described for the two models above. But, since brand safety modeling is much harder than a pure topical classification of texts, we need a slightly more complicated model.

The first additional idea is to use a Bi-LSTM. It can help, because the status of each token (in terms of brand safety modeling) can depend not only on previous tokens (the unidirectional case), but also on subsequent ones.

Another change, also based on the intuition that brand safety modeling is complex, is a much larger hidden state (100 units in our case). A context of 5-10 units is probably not enough to see the difference between "blocked" and "unblocked" texts. Another interpretation of such a big hidden state is the lack of training data: this issue is usually addressed by combining a large capacity with strong regularization (to avoid over-fitting). So, we used pretty aggressive dropouts [0.4, 0.2, 0.1] for the inference layers respectively. Such regularization is a very elegant way of making a sort of neural-network ensemble, and it is known that ensembles (e.g. decision tree ensembles) help to find very complicated decision boundaries. The fact that this approach turned out to be so useful emphasizes once again the complexity of the task.

And, finally, there is one more addition: the fully-connected block is extended with an additional dense layer of 30 units.

Similarly to the "Sport vs. Non-sport" model, the "Brand safety" model has two classes of different importance. It is obvious that false positives and false negatives have very different impacts on the business: it is better to skip a brand-safe article than to show InHabit on illegitimate content. That is why the main metrics for this model are the recalls of both classes, and tuning the binarization threshold means finding a balance between blocking some brand-safe content and accepting some brand-unsafe content.

Figure 5: Distribution of brand-unsafe vs. brand-safe texts.

Dataset Details

The data set mainly consisted of sports articles crawled from the Internet, plus a portion of historical data (articles processed by InHabit). All data was labeled automatically, based on a mapping between URL patterns and labels. This labeling was quickly validated by a human expert. A random portion of all this data was translated into the FIGS languages.

The whole corpus was split into train and test sets. The train set was used for learning with 5-fold Cross-Validation (CV), and the final model was picked based on the results of validation on the test set.

The final pre-production testing used directly translated (FIGS) texts as well as a small set of completely fresh (EFIGS) data. The idea of the first check was to verify that the initial goal - training a cross-lingual model on mono-lingual data - was achieved; the second was to measure the error rate of the classifier.

The first deployment into production, as usual, had the goal of collecting telemetry, so the new model's results did not participate in the decision engine. Thus, the very final check (as well as the binarization threshold tuning) was done on actual production data.

Loss optimization

It is very important to design experiments properly. An algorithm may converge in principle, but in a badly prepared experiment it can require far too many (impossibly many) epochs. So, we tried to do our best to decrease the training time.

RNNs have such a thing as the "time-step size" - the length of a sequence, in our case the number of words in a text. Obviously, it varies from text to text. One could use an approach like fit_generator() - a function which prepares a mini-batch on-the-fly - and use just one data object (one article) per mini-batch. But this way is extremely slow. A better idea is to pad our texts. We chose 500 as the most appropriate number of words: if a text had fewer than 500 words, all the remaining vectors were zero vectors; if a text had more than 500 words, the redundant words were ignored. This approach allowed us to use mini-batches of a sufficiently large size, which, in its turn, sped up training. And we do not worry about zero vectors, because we use masking, and the training function simply ignores such empty words.
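A minimal sketch of this padding/truncation step (500 and 300 are the sizes quoted above):

```python
import numpy as np

def pad_or_truncate(word_vectors, max_len=500, dim=300):
    """Bring a text matrix to the fixed 500 x 300 shape.

    Shorter texts are padded with zero vectors (ignored later thanks to masking);
    longer texts are cut after the first 500 words.
    """
    out = np.zeros((max_len, dim), dtype=np.float32)
    n = min(len(word_vectors), max_len)
    out[:n] = word_vectors[:n]
    return out
```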

We used mini-batches of size 256 and 64; the particular value depended on the memory size of the GPU used for training. And, which is even more important, all mini-batches (including the last one) were fully sized. One way to achieve this is to ignore the tail of the training set that does not fit into a fully-sized mini-batch. But, for the sake of keeping all the data, we instead randomly over-sampled the data to achieve the same goal.
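A sketch of this over-sampling trick, assuming the training examples are addressed by an index array (the implementation details are illustrative):

```python
import numpy as np

def oversample_to_full_batches(indices, batch_size, rng=None):
    """Randomly duplicate a few examples so every mini-batch is fully sized."""
    rng = rng or np.random.default_rng()
    indices = np.asarray(indices)
    remainder = len(indices) % batch_size
    if remainder:
        extra = rng.choice(indices, size=batch_size - remainder, replace=False)
        indices = np.concatenate([indices, extra])
    return rng.permutation(indices)
```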

We used RMSProp as the optimization algorithm, and learning-rate decay turned out to be very useful too. The rho parameter, which is similar to momentum in SGD, helps to smooth the oscillation of gradients. Such oscillation shows up as "waving" in the error rate plot as well [fig. 6, 7]. That is why for the GPU used with mini-batches of size 64 we used a rho of 0.8, while for mini-batches of 256 a rho of 0.5 was better suited.

Figure 6: Error plot without oscillation (proper rho).

Figure 7: Error plot with oscillation (inappropriate rho).

We used cross-entropy (categorical or binary) as the loss function. Simple mean squared error showed the same result but, unfortunately, was not faster.

The number of epochs varied from 30 to 200, depending on the model and the experiment. The initial learning rate was usually 0.1, rho was 0.5 (or 0.8), and the decay was 0.01. The dropout of the RNN layer was 0.4, while the following dense layers had dropouts of 0.2 and 0.1 respectively.
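As an illustration, a training setup matching these hyperparameters might look as follows in Keras; it relies on the build_classifier sketch from the architecture section, uses dummy data of the shapes quoted above, and the exact optimizer arguments depend on the Keras version (in older versions the 0.01 decay was passed as a decay argument).

```python
import numpy as np
from tensorflow import keras

model = build_classifier(n_classes=2)  # hypothetical helper from the architecture sketch
optimizer = keras.optimizers.RMSprop(learning_rate=0.1, rho=0.5)
# The 0.01 learning-rate decay was a `decay=0.01` argument in the older Keras API;
# in newer versions a LearningRateSchedule plays the same role.
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data with the shapes used in the post (500 words x 300 factors).
X_train = np.random.rand(256, 500, 300).astype(np.float32)
y_train = np.random.randint(0, 2, size=(256, 1)).astype(np.float32)
model.fit(X_train, y_train, batch_size=256, epochs=1)
```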

Experimental Results and Quality Reports

The quality of the Brand safety model turned out to be even better than that of the previous version.

AUC-ROC: 0.9865
Threshold: 0.2
F1: 0.815
Confusion matrix: [[116, 4], [118, 2360]]
Recalls: [0.97, 0.95]
Precisions: [0.50, 0.9983]

Figure 8: Brand safety model threshold dynamics.

Sport vs. Non-sport:

AUC-ROC: 0.977
Threshold: 0.02
F1: 0.9124
Confusion matrix: [[532, 55], [93, 1320]]
Recalls: [0.90, 0.934]
Precisions: [0.85, 0.96]

Figure 9: Sport vs. Non-sport model threshold dynamics.

All Sports:

Recall Diff score: 0.9935
F1: 0.995

Figure 10: All Sports model confusion matrix.

Operationalize Models

The new version of Azure Machine Learning Service (AMLS) was released recently, and we started to use several of its improvements. A new SDK was provided (previously we had to use the CLI only). We can now see the results of all experiments directly in the portal [fig. 11]. And AKS is now the default option for deploying ML models into production.

Figure 11: Experiments in Azure Portal.

Conclusions

Sport news can be presented in different languages. So, we have been working on Cross-lingual Text Classification. Currently, we support articles in English, French, Italian, German, and Spanish.

3 new ML models were prepared:

  • A cross-lingual ML model for blocking brand-unsafe content
  • A cross-lingual ML model for separating sport texts from non-sport ones
  • A cross-lingual ML model for classifying sport texts as baseball, basketball, football, or soccer.

But most importantly, we have created a methodology for training multi-lingual ML models using an English corpus only. We hope this information will help you to solve similar problems.

References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3.
  2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546v1.
  3. Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., & Grave, E. (2018). Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion. arXiv:1804.07745v3.
  4. Le, Q., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. arXiv:1405.4053v2.
  5. Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. A. (2018). Unsupervised Machine Translation Using Monolingual Corpora Only. arXiv:1711.00043v2.
  6. Andrade, D., Tamura, A., Tsuchida, M., & Sadamasa, K. (2015). Cross-lingual Text Classification Using Topic-Dependent Word Probabilities. In Proceedings of NAACL-HLT, pp. 1466–1471.
  7. Pham, H., Luong, M.-T., & Manning, C. D. (2015). Learning distributed representations for multilingual text sequences. In Proceedings of the Workshop on Vector Modeling for NLP, pp. 88–94.
  8. Gerard, M. (2017). Multilingual Vector Representations of Words, Sentences, and Documents. In Proceedings of the 8th International Joint Conference on Natural Language Processing, pp. 3–5.
  9. Hermann, K. M., & Blunsom, P. (2014). Multilingual models for compositional distributed semantics. In Proceedings of ACL, pp. 58–68.
  10. Mikolov, T., Le, Q. V., & Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
  11. Conneau, A., Lample, G., Ranzato, M. A., Denoyer, L., & Jégou, H. (2018). Word Translation without Parallel Data. arXiv:1710.04087v3.
  12. Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., & Grave, E. (2018). Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion. arXiv:1804.07745v3.
  13. Gouws, S., Bengio, Y., & Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of ICML, pp. 748–756.
