The world of classification in Machine Learning
Sachin S Panicker
Chief AI Scientist | Keynote Speaker at International Conferences | Tech Soothsayer | Researcher in Generative AI, Singularity, Web 3.0, Metaverse, Blockchain, IoT, Quantum Computing, Robotics & Design Thinking | Artist
Classification in Machine Learning has many applications: product recommendations in e-commerce, open-domain question answering, document tagging, and dynamic search advertising, to name a few. But consider a scenario where there are millions upon millions of applicable labels, and one must predict a subset of the most relevant ones. Most traditional approaches to multi-label text classification fall short in such cases. Let's compare the most prevalent and newer approaches out there.
First off, there is AttentionXML, built on a Bi-LSTM base model. This was one of the earliest methods to combine an attention-based deep encoder with a label-tree-based shortlisting step. It adapts attention maps to each resolution, enabling prediction at full label resolution, and uses label representations to build a multi-level Hierarchical Label Tree [HLT]. Earlier tree-based methods employed the entire HLT for extreme classification, while more recent methods use label clusters at only a certain level of the HLT as meta-labels, which in turn shortlist candidate labels for the extreme task.
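To make the idea of a label tree concrete, here is a minimal sketch of how such a hierarchy could be built by recursively clustering label feature vectors. The branching factor, leaf size, and use of plain k-means are illustrative assumptions on my part, not the exact recipe of AttentionXML.

```python
# Illustrative sketch: build a label hierarchy by recursively clustering
# label feature vectors. Branching factor and leaf size are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_hlt(label_features, branch=8, max_leaf=100):
    """Recursively cluster labels into a nested tree of label-index groups."""
    return _split(label_features, np.arange(label_features.shape[0]), branch, max_leaf)

def _split(feats, indices, branch, max_leaf):
    if len(indices) <= max_leaf:
        return indices.tolist()                      # leaf: a small cluster of labels
    km = KMeans(n_clusters=branch, n_init=4).fit(feats[indices])
    return [_split(feats, indices[km.labels_ == c], branch, max_leaf)
            for c in range(branch) if (km.labels_ == c).any()]   # internal node: subtrees

# Example: 10,000 labels, each represented by a 64-dim feature vector
# (e.g. averaged TF-IDF of the documents tagged with that label -- an assumption here).
tree = build_hlt(np.random.randn(10_000, 64).astype(np.float32))
```

The nested lists at a chosen depth of this tree are what the newer methods treat as meta-labels for shortlisting.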
Now come the modern approaches, which replace this architecture with a more powerful transformer model and fine-tune a pre-trained instance such as BERT. But one has to be careful with such models, as they are computationally very expensive, and so far many methods have not been able to effectively leverage transformers for both computation and performance on such extreme multi-label classification tasks.
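In code, the basic setup these methods build on looks roughly like a pre-trained encoder feeding a very large label head. The model name, pooling choice, and label count below are placeholders; in practice the full head is far too big to score naively, which is exactly why shortlisting matters.

```python
# Sketch of "pre-trained transformer + huge label head" for extreme classification.
# Model name, [CLS] pooling, and the label count are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class XMCTransformer(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.label_head = nn.Linear(hidden, num_labels)   # enormous when num_labels is in the millions

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                  # [CLS] token embedding
        return self.label_head(cls)                        # one logit per label

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = XMCTransformer(num_labels=50_000)
batch = tokenizer(["an example product description"], return_tensors="pt",
                  truncation=True, padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```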
A couple of such approaches are XR-Transformer and LightXML.
The latter, LightXML, employs dynamic negative sampling, which replaces pre-computed label shortlists with a dynamically calculated shortlist that changes as the model's weights get updated. This enables end-to-end training with a single model, using the final feature representation of the transformer encoder for both the meta- and the extreme classification task. The downside, however, is that these two tasks interfere with one another, possibly because the meta task needs the attention maps to focus on different tokens than the extreme task. LightXML also uses only a single-level tree, which prevents it from scaling to the largest datasets.
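Here is a small, self-contained sketch of what dynamic negative sampling looks like in practice: the shortlist of label clusters is recomputed from the meta-classifier's current scores at every step, so it moves with the model's weights. The cluster layout and all sizes below are made up for illustration.

```python
# Sketch of dynamic negative sampling (LightXML-style): the shortlist is derived
# from the meta-classifier's *current* scores, so it changes as weights update.
# Cluster counts, sizes, and the random features are illustrative placeholders.
import torch
import torch.nn as nn

hidden, num_clusters, labels_per_cluster, k = 768, 100, 50, 10
meta_head = nn.Linear(hidden, num_clusters)                              # scores label clusters (meta task)
extreme_emb = nn.Embedding(num_clusters * labels_per_cluster, hidden)    # per-label weight vectors

def shortlist_and_score(doc_feat, true_clusters):
    meta_logits = meta_head(doc_feat)                                    # (B, num_clusters)
    topk = meta_logits.topk(k, dim=-1).indices                           # dynamic shortlist of clusters
    topk = torch.cat([topk, true_clusters], dim=-1)                      # keep positives in the shortlist
    offsets = torch.arange(labels_per_cluster)
    cand_labels = (topk.unsqueeze(-1) * labels_per_cluster + offsets).flatten(1)
    cand_w = extreme_emb(cand_labels)                                    # (B, candidates, hidden)
    extreme_logits = torch.einsum("bh,bch->bc", doc_feat, cand_w)        # score only the candidates
    return meta_logits, extreme_logits, cand_labels

doc_feat = torch.randn(4, hidden)                   # stand-in for the transformer's pooled features
true_clusters = torch.randint(0, num_clusters, (4, 2))
meta_logits, extreme_logits, cands = shortlist_and_score(doc_feat, true_clusters)
```

Note that both heads read the same document features, which is where the interference between the meta and extreme tasks mentioned above comes from.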
The former, XR-Transformer, is derived from multi-resolution approaches in computer vision, such as super-resolution and the progressive growing of Generative Adversarial Networks [GAN], and enables multiple resolutions through iterative training. However, unlike progressively grown GANs, which predict only at the highest resolution, XR-Transformer needs predictions across all resolutions for its progressive shortlisting pipeline, yet uses representations trained at a single resolution. In practice, this leads to a complex multi-stage pipeline in which the transformer model is iteratively trained up to a certain resolution and then frozen. This is followed by re-clustering and re-training multiple classifiers, working at different resolutions, on the same fixed transformer features. Unlike AttentionXML, using multiple instances of the transformer model is undesirable due to its computational overhead. This forces LightXML and XR-Transformer to make different trade-offs from AttentionXML when leveraging a single transformer model for Extreme Multi-Label Text Classification.
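A toy, runnable outline of that multi-stage pattern: features are frozen, labels are re-clustered at each resolution, and a separate linear classifier is fitted per resolution on the same fixed features. Everything here, including the synthetic data and cluster counts, is only meant to show the shape of the pipeline, not XR-Transformer's actual implementation.

```python
# Toy illustration of the freeze-then-recluster-then-refit pattern described above.
# The "frozen transformer features" are just random vectors here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
docs = rng.standard_normal((500, 64))            # stand-in for frozen transformer features
label_vecs = rng.standard_normal((2000, 64))     # one feature vector per label
doc_label = rng.integers(0, 2000, size=500)      # one positive label per document (toy setup)

classifiers = []
for num_clusters in (16, 128):                   # two coarser "resolutions"
    km = KMeans(n_clusters=num_clusters, n_init=4).fit(label_vecs)   # re-cluster labels
    meta_target = km.labels_[doc_label]          # map each document's label to its cluster
    clf = LogisticRegression(max_iter=200).fit(docs, meta_target)    # classifier on fixed features
    classifiers.append((km, clf))                # later used for progressive shortlisting
```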
And finally, there is the newly proposed approach called CascadeXML, which combines the strengths of all these approaches, creating an end-to-end trainable multi-resolution learning pipeline that trains a single transformer model across multiple resolutions in a way that allows the creation of label-resolution-specific attention maps and feature embeddings.
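To give a feel for that multi-resolution idea, here is a minimal sketch: one shared encoder trained end-to-end, with a separate pooled representation and classifier head per label resolution, so each resolution can develop its own attention and feature view. The tiny encoder and the attention-pooling scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of a single encoder trained jointly across several label resolutions,
# each with its own pooled representation and classifier head. The small model
# sizes and the pooling mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class MultiResolutionClassifier(nn.Module):
    def __init__(self, vocab=30_000, hidden=256, resolutions=(64, 1024, 16_384)):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # one learned query and one classifier head per resolution
        self.queries = nn.Parameter(torch.randn(len(resolutions), hidden))
        self.heads = nn.ModuleList(nn.Linear(hidden, r) for r in resolutions)

    def forward(self, input_ids):
        tokens = self.encoder(self.embed(input_ids))              # (B, T, H)
        attn = torch.softmax(tokens @ self.queries.T, dim=1)      # per-resolution attention over tokens
        pooled = torch.einsum("btr,bth->brh", attn, tokens)       # (B, resolutions, H)
        return [head(pooled[:, i]) for i, head in enumerate(self.heads)]

model = MultiResolutionClassifier()
logits_per_resolution = model(torch.randint(0, 30_000, (2, 32)))
# each element: logits for one resolution's (meta-)labels, all trained jointly
```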
CascadeXML optimises its training objective using Binary Cross-Entropy as the loss function and AdamW as the optimiser.
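A single illustrative training step with that objective and optimiser might look like this; the model, batch, and label sizes are placeholders rather than CascadeXML's actual configuration.

```python
# Illustrative training step: binary cross-entropy over multi-hot label targets,
# optimised with AdamW. The linear model is a stand-in for encoder + label head.
import torch
import torch.nn as nn

model = nn.Linear(768, 2048)                       # placeholder for encoder + label head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
criterion = nn.BCEWithLogitsLoss()                 # binary cross-entropy on raw logits

features = torch.randn(8, 768)                     # pooled document features
targets = torch.zeros(8, 2048)                     # multi-hot ground-truth labels
targets[torch.arange(8), torch.randint(0, 2048, (8,))] = 1.0

optimizer.zero_grad()
loss = criterion(model(features), targets)
loss.backward()
optimizer.step()
```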
It's an exciting world of research and development happening out there in the field of classification, and we are just getting started, so to speak, on unraveling the full potential of Transformer models!