Google Research's CodecLM - Aligning Language Models with Tailored Synthetic Data & Overview of Multilingual Large Language Models
Aditi Khare
AWS & AI Research [LLMs & Vision]-Principal Machine Learning Scientist & AI Architect | IIM-A | Author | Inference Optimization | Hyperspectral Imaging | Open-Source Dev | Build Production-Grade AI Products from Scratch
Google Research's CodecLM - Aligning Language Models with Tailored Synthetic Data -
CodecLM - A general framework for adaptively generating high-quality synthetic data for LLM alignment, tailored to different downstream instruction distributions and target LLMs.
Following encode-decode principles, it uses LLMs as codecs to guide the data generation process. It first encodes seed instructions into metadata, concise keywords generated on-the-fly that capture the target instruction distribution, and then decodes the metadata to create tailored instructions.
During decoding, it introduces Self-Rubrics and Contrastive Filtering to tailor data-efficient samples. Extensive experiments on four open-domain instruction-following benchmarks validate the effectiveness of CodecLM over the current state of the art.
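The encode-decode flow can be sketched as below. Note that `llm` is a hypothetical stand-in for the strong LLM call, and the prompt wordings are illustrative, not the paper's exact templates.

```python
def encode_to_metadata(seed_instruction: str, llm) -> dict:
    """Encode a seed instruction into concise metadata keywords:
    the use case it serves and the skills needed to answer it.
    `llm` is any callable prompt -> text (a placeholder here)."""
    use_case = llm(f"In a few keywords, what is the use case of: {seed_instruction}")
    skills = llm(f"List the skills needed to answer: {seed_instruction}")
    return {"use_case": use_case, "skills": skills}

def decode_to_instruction(metadata: dict, llm) -> str:
    """Decode metadata back into a new, tailored synthetic instruction."""
    return llm(
        "Write one challenging instruction for the use case "
        f"'{metadata['use_case']}' that exercises these skills: {metadata['skills']}"
    )
```

In practice the same strong LLM plays both roles, which is what makes it a "codec" for instructions.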
CodecLM - A general framework for generating high-quality instruction-response pairs tailored to different downstream tasks and LLMs, eliminating the need for human annotation.
LLM as Codec for Instructions - The concept of using a strong LLM as a codec, acting as both encoder and decoder, for instruction generation.
CodecLM effectively captures the underlying instruction distribution via instruction metadata, and further tailors the most effective instruction-response pairs through Self-Rubrics and Contrastive Filtering.
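Contrastive Filtering can be sketched as follows: keep an instruction when the strong LLM's response clearly outscores the target LLM's, since a large gap signals the target model still has something to learn there. The scorer and threshold below are illustrative placeholders, not the paper's exact scoring setup.

```python
def contrastive_filter(pairs, score, gap_threshold=1.0):
    """Keep instructions where the strong model clearly beats the target.
    `pairs` holds (instruction, strong_response, target_response) tuples;
    `score` is an LLM-based quality scorer (instruction, response) -> float
    (here any callable works)."""
    kept = []
    for instruction, strong_resp, target_resp in pairs:
        gap = score(instruction, strong_resp) - score(instruction, target_resp)
        if gap >= gap_threshold:
            # Train the target model on the strong model's response.
            kept.append((instruction, strong_resp))
    return kept
```

Instructions the target model already answers as well as the strong model are filtered out, which is what makes the retained samples data-efficient.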
CodecLM provides a potent solution for adapting LLMs to customized uses without the need for human annotation.
CodecLM serves as a general framework for targeted LLM alignment and opens the door to multiple promising research directions within the framework, such as richer metadata definitions, better prompt design, and more reliable LLM-based scorers.
CodecLM can also benefit from orthogonal research fields.
References -
Paper Link - https://arxiv.org/abs/2404.05875
Multilingual Large Language Models - An Overview
Multilingual Large Language Models use powerful LLMs to handle and respond to queries in multiple languages, and have achieved remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, there remains a lack of comprehensive surveys summarizing existing approaches.
Monolingual Large Language Models - Can only process one language at a time, for example English or Chinese.
Multilingual Large Language Models - Unlike a monolingual LLM, a multilingual LLM, as shown in the above diagram, is capable of handling and producing content in multiple languages simultaneously, for example English and Chinese.
Parameter-Tuning Alignment - Indicates that MLLMs tune their parameters for better cross-lingual alignment. The diagram below outlines four approaches -
1.a From-Scratch Pretraining Alignment - A series of approaches achieve alignment across languages by tuning the initially random parameters of MLLMs during pretraining. Adding even a small amount of multilingual data during from-scratch pretraining, sometimes unintentionally, can significantly boost multilingual performance.
1.b Continual Pretraining Alignment - To address the high computational cost of from-scratch pretraining, continual pretraining alignment builds on already pretrained MLLMs, adding more target-language data during continual pretraining to improve general performance. Several works further emphasize extending the MLLMs' vocabularies to adapt to new languages.
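The vocabulary-extension step mentioned above can be sketched as appending embedding rows for the new target-language tokens. Initializing the new rows near the mean of existing embeddings is a common heuristic, assumed here for illustration rather than taken from the survey.

```python
import numpy as np

def extend_vocabulary(embeddings: np.ndarray, num_new_tokens: int,
                      rng=None) -> np.ndarray:
    """Append rows for new target-language tokens to an embedding
    matrix of shape (vocab_size, hidden_dim). New rows are placed
    near the mean of the existing embeddings (a common heuristic)."""
    rng = rng or np.random.default_rng(0)
    mean = embeddings.mean(axis=0)
    noise = rng.normal(scale=0.02, size=(num_new_tokens, embeddings.shape[1]))
    return np.vstack([embeddings, mean + noise])
```

The output embedding (and tied LM head, if any) must be resized the same way before continual pretraining on the new language.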
2. PTA in SFT Stage - PTA in the SFT stage leverages multilingual task data in instruction format for tuning parameters. In particular, models such as Flan-PaLM, BLOOMz, PolyLM, and Tk-Instruct directly incorporate multilingual data in the SFT stage to achieve implicit multilingual alignment across languages.
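Casting multilingual task data into instruction format, as done in the SFT stage above, can be sketched as follows for a translation example. The field names and prompt wording are illustrative assumptions, not a specific dataset's schema.

```python
def to_instruction_format(example: dict) -> dict:
    """Turn a raw multilingual task example into an
    instruction-response pair for supervised finetuning."""
    instruction = (
        f"Translate the following {example['src_lang']} text "
        f"to {example['tgt_lang']}:\n{example['src_text']}"
    )
    return {"instruction": instruction, "response": example["tgt_text"]}
```

Mixing many such task templates across languages is what yields the implicit cross-lingual alignment described above.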
3. PTA in RLHF Stage - Achieves alignment in the reinforcement learning from human feedback (RLHF) stage.
The SALMON framework enhances multilingual RLHF by self-generating rewards for better alignment.
4. PTA in Downstream Finetuning Stage -
Full-Parameter Finetuning Alignment - Tunes all parameters of the MLLM on downstream tasks.
Parameter-Efficient Finetuning Alignment - Covers approaches that reduce the cost of full-parameter finetuning, such as tuning only a minimal soft prompt prefix for better alignment, or methods based on Low-Rank Adaptation (LoRA) to achieve PEFT alignment.
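The LoRA idea mentioned above can be sketched in a few lines: the frozen weight W is adapted by a low-rank update scaled by alpha / r, so only the small matrices A and B are trained. This is a minimal numerical sketch of the math, not a training recipe.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA forward pass: y = (W + (alpha / r) * B @ A) @ x.
    Shapes: W (d_out, d_in) frozen, A (r, d_in), B (d_out, r)
    trainable, x (d_in,). The rank r is tiny compared to d_in."""
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)  # low-rank weight update
    return (W + delta) @ x
```

B is typically initialized to zeros so that training starts exactly from the pretrained model's behavior.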
LangBridge introduces a model that bridges a multilingual encoder to a monolingual LLM, effectively achieving promising performance.
Key Pointers -
PTA in the pretraining stage establishes the essential multilingual capabilities of MLLMs, and the effectiveness of alignment in MLLMs is greatly influenced by the previous alignment stage, e.g. pretraining significantly influences SFT.
Parameter-Frozen Alignment - In contrast to traditional parameter-tuning approaches, parameter-frozen alignment methods aim to perform alignment without any parameter tuning.
The most popular approaches employ prompting strategies to elicit the alignment potential of MLLMs.
Described below are the Prompting Strategies for alignment without Parameter Tuning -
a. Direct Prompting - Issues the request directly, without any additional instruction, relying on implicit alignment through the MLLM itself.
b. Code-Switching Prompting - Integrates multilingual words into a single-language utterance, a common linguistic phenomenon, for effective language alignment. Studies have specifically shown the effectiveness of MLLMs in cross-lingual alignment through model-generated code-switching texts.
c. Translation Alignment Prompting - Translates the query into other languages for better alignment; approaches can be divided into Key Information Translation, Direct Translation, Step-by-Step Translation, and Restatement.
d. Retrieval-Augmented Alignment - Incorporates external retrieval during prompting to inject more knowledge into MLLMs, specifically retrieving cultural or professional knowledge to enrich prompts.
Translation alignment prompting is more effective for cross-lingual alignment.
Retrieval augmented alignment mitigates knowledge gaps in LLMs.
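The prompting strategies above can be sketched as simple prompt builders. The wordings and the `retrieve` callable are illustrative assumptions, not the survey's exact prompts.

```python
def direct_prompt(query: str) -> str:
    """a. Direct prompting: pass the query through unchanged."""
    return query

def translation_prompt(query: str, pivot_lang: str = "English") -> str:
    """c. Translation alignment prompting (direct-translation variant):
    ask the model to translate into a pivot language, then answer."""
    return (f"First translate the question into {pivot_lang}, "
            f"then answer it:\n{query}")

def retrieval_augmented_prompt(query: str, retrieve) -> str:
    """d. Retrieval-augmented alignment: prepend retrieved cultural or
    professional knowledge. `retrieve` is a hypothetical retriever
    returning a list of passages."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Since the model's parameters stay frozen, all of the alignment work happens in how the prompt is constructed.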
Hallucination in MLLMs -
1. Multilingual Hallucination Detection - Effectively detecting hallucinations of MLLMs across different languages is the primary problem to be solved in this field.
2. Multilingual Hallucination Alleviation - Current strategies for hallucination alleviation still focus on incorporating extensive factual data or utilizing external systems, which pose significant challenges for multiple languages, especially low-resource languages.
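One simple detection heuristic, offered purely as an illustration rather than a method from the survey, is to flag answers that disagree when the same question is asked in different languages.

```python
def cross_lingual_inconsistency(answers: dict, agree) -> bool:
    """Flag a potential hallucination when a model's answers to the
    same question disagree across languages. `answers` maps
    language code -> answer; `agree` is a semantic-equivalence
    check (any pairwise callable works here)."""
    texts = list(answers.values())
    return any(not agree(texts[0], other) for other in texts[1:])
```

A real system would use a translation-aware equivalence check; the exact-match stand-in used below is only for demonstration.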
References -
Paper Link - https://arxiv.org/pdf/2404.04925.pdf
For more information on AI Research Papers, you can visit my GitHub Profile -
For receiving the latest updates on advancements in AI Research, Gen-AI, Quantum AI & Computer Vision, you can subscribe to my AI Research Paper Summaries Newsletter using the link below -
Thank you & Happy Reading !