Top AI/ML Papers of the Week [19/08 - 25/08]
Bruno Lopes e Silva
Artificial Intelligence | National Award-Winning Engineer | Professor | Speaker | PhD Candidate in AI | Podcast Host
Last week, I picked out eight scientific articles I found noteworthy to share with you. Each is presented with a short synopsis and a link for further reading. At the end, I reflect on how these advances may impact your projects or companies in the future!
[1] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
This report presents xGen-MM (BLIP-3), a framework for developing Large Multimodal Models (LMMs) that includes curated datasets, a training recipe, model architectures, and a suite of LMMs. Expanding the Salesforce xGen initiative, xGen-MM models are rigorously evaluated on various tasks, showcasing strong in-context learning and competitive performance among open-source models. Additionally, a safety-tuned model using DPO is introduced to reduce harmful behaviors such as hallucinations. All models, datasets, and the fine-tuning codebase are open-sourced to support further LMM research. [Link]
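To make the safety-tuning step more concrete, here is a minimal sketch of the DPO preference loss, assuming per-response log-probabilities from the policy and a frozen reference model have already been computed; the tensor names, batch size, and beta value are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Log-ratios of policy vs. reference for preferred and dispreferred responses.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```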
[2] JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
Recent advancements in image and video generation have leveraged autoregressive LLM architectures for their adaptability and ease of integration into multi-modal systems. This approach hinges on discretization, transforming continuous data into discrete tokens. Traditional methods, such as raw pixel modeling or vector quantization, either produce overly long sequences or require complex pre-training. This work introduces a novel method using canonical codecs (e.g., JPEG, AVC/H.264) to model images and videos directly as compressed files. JPEG-LM and AVC-LM, based on the Llama architecture, generate images and videos by outputting compressed file bytes. Evaluations show that JPEG-LM outperforms pixel-based models and vector quantization, reducing FID by 31% and excelling in generating long-tail visual elements. This method lowers the barriers between language and visual generation, promoting research on multi-modal LLMs. [Link]
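The core idea is easy to picture: the model's "tokens" are simply the bytes of a compressed file. The sketch below shows only that byte-level view; the file paths and the placeholder byte string are illustrative, and the actual JPEG-LM training pipeline is not shown.

```python
import os
import tempfile
from pathlib import Path

def jpeg_to_tokens(path: str) -> list[int]:
    """Read a JPEG file and return its raw bytes as integer tokens in [0, 255]."""
    return list(Path(path).read_bytes())

def tokens_to_jpeg(tokens: list[int], path: str) -> None:
    """Write generated byte tokens back to disk as a JPEG file."""
    Path(path).write_bytes(bytes(tokens))

# Round-trip demo with placeholder bytes standing in for a real JPEG payload
# (0xFFD8 and 0xFFD9 are the JPEG start/end-of-image markers).
fake_jpeg = bytes([0xFF, 0xD8]) + os.urandom(64) + bytes([0xFF, 0xD9])
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
    f.write(fake_jpeg)
tokens = jpeg_to_tokens(f.name)   # 68 tokens, vocabulary size 256
tokens_to_jpeg(tokens, f.name)    # a Llama-style decoder would emit bytes like these
assert jpeg_to_tokens(f.name) == tokens
```

In training, such byte sequences would feed a standard next-token objective; at inference, the sampled bytes are written to disk and opened by any ordinary JPEG decoder.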
[3] LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Long-context capability is essential for multi-modal foundation models, particularly for understanding long videos. The study introduces LongVILA, a comprehensive solution for visual-language models, designed through a co-development of algorithms and systems. LongVILA enhances existing models to handle long video contexts by adding two stages: long context extension and supervised fine-tuning. Given the high computational demands of training on long videos, a new Multi-Modal Sequence Parallelism (MM-SP) system is proposed to efficiently parallelize this process, allowing extensive training on multiple GPUs without gradient checkpointing. LongVILA significantly increases video frame processing capability, improves captioning accuracy, and integrates seamlessly with Hugging Face Transformers. [Link]
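Sequence parallelism is easiest to picture as sharding one very long token sequence across devices so that no single GPU holds the full context. The toy function below shows only that sharding step; it is a conceptual sketch, not the paper's MM-SP system, and the token count and world size are made up.

```python
def shard_sequence(tokens: list, world_size: int) -> list[list]:
    """Split a token sequence into `world_size` contiguous, near-equal shards."""
    shard_len = -(-len(tokens) // world_size)  # ceiling division
    return [tokens[i:i + shard_len] for i in range(0, len(tokens), shard_len)]

# Example: one million video/text tokens split across 8 GPUs.
shards = shard_sequence(list(range(1_000_000)), world_size=8)
print([len(s) for s in shards])  # each rank processes ~125k tokens
```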
[4] TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Recent advances in LLMs have greatly improved the ability to interpret and process tabular data, enabling new capabilities. However, LLMs still face challenges in industrial applications due to the complexity of reasoning required for real-world data, highlighting a gap between academic benchmarks and practical use. To bridge this gap, a comprehensive benchmark called TableBench is proposed, covering 18 fields across four major categories of table question answering. Additionally, TableLLM is introduced, trained on a carefully curated dataset, TableInstruct, showing performance comparable to GPT-3.5. Extensive experiments reveal that both open-source and proprietary LLMs, including GPT-4, still have significant room for improvement to match human performance in real-world scenarios. [Link]
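As a rough picture of what table question answering asks of an LLM, here is one minimal way to serialize a table and a question into a prompt; the markdown layout and wording are my own illustration, not the TableBench or TableInstruct format.

```python
def table_to_prompt(headers: list[str], rows: list[list], question: str) -> str:
    """Serialize a small table plus a question into a single LLM prompt."""
    header_line = "| " + " | ".join(headers) + " |"
    sep_line = "| " + " | ".join("---" for _ in headers) + " |"
    row_lines = ["| " + " | ".join(map(str, r)) + " |" for r in rows]
    table = "\n".join([header_line, sep_line, *row_lines])
    return f"Answer the question using only the table below.\n\n{table}\n\nQuestion: {question}\nAnswer:"

print(table_to_prompt(
    ["Quarter", "Revenue ($M)"],
    [["Q1", 120], ["Q2", 135]],
    "Which quarter had higher revenue?",
))
```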
[5] TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
This work addresses the need for fair and robust evaluation of video foundation models, which often differ in parameters such as sampling rate and number of frames, complicating comparisons. A new evaluation framework is proposed to measure two key video comprehension capabilities: appearance and motion understanding. The study reveals that current models, whether text-supervised or self-supervised, have limitations in at least one of these areas. To overcome these issues, the new model TWLV-I is introduced, offering robust visual representations for both appearance-centric and motion-centric videos. TWLV-I outperforms existing models across five action recognition benchmarks, showing significant accuracy improvements over V-JEPA, UMT, and even larger models like DFN. Embedding vectors and evaluation code for TWLV-I are provided for further research. [Link]
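One common protocol for comparing frozen video backbones on action recognition, and a way to read results like these, is a linear probe over clip embeddings. The sketch below uses random features as stand-ins for two models' embeddings, so the setup is purely illustrative and not the paper's exact evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=500)        # 500 clips, 10 action classes
emb_model_a = rng.normal(size=(500, 768))     # placeholder clip embeddings
emb_model_b = rng.normal(size=(500, 768))

def probe_accuracy(embeddings, labels, split=400):
    """Train a linear classifier on frozen embeddings and report held-out accuracy."""
    clf = LogisticRegression(max_iter=1000).fit(embeddings[:split], labels[:split])
    return clf.score(embeddings[split:], labels[split:])

print("model A:", probe_accuracy(emb_model_a, labels))
print("model B:", probe_accuracy(emb_model_b, labels))
```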
[6] LLM Pruning and Distillation in Practice: The Minitron Approach
This report details the compression of the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation techniques. Two pruning strategies are explored: depth pruning and joint hidden/attention/MLP (width) pruning, with results evaluated on standard benchmarks. The compressed models are aligned with NeMo Aligner and tested in instruct-tuned versions, resulting in a competitive 4B model from Llama 3.1 8B and the advanced MN-Minitron-8B model from Mistral NeMo 12B. Fine-tuning teacher models on the distillation dataset, even without access to the original data, proves beneficial. The base model weights are open-sourced on Hugging Face with a permissive license. [Link]
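For intuition on the distillation half of the recipe, here is a minimal logit-distillation loss in which the pruned student matches the teacher's softened token distribution; the temperature, shapes, and vocabulary size are illustrative, and the actual Minitron recipe combines this with pruning and further alignment steps.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's next-token distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy usage: 2 sequences of 8 positions each, vocabulary of 32k tokens.
student = torch.randn(2, 8, 32_000)
teacher = torch.randn(2, 8, 32_000)
print(distillation_loss(student.view(-1, 32_000), teacher.view(-1, 32_000)).item())
```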
[7] Sapiens: Foundation for Human Vision Models
Sapiens is a family of models designed for four key human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. The models natively support high-resolution 1K inference and are easy to fine-tune for individual tasks after being pretrained on over 300 million in-the-wild human images. This self-supervised pretraining on a curated dataset of human images significantly enhances performance across tasks, even with limited labeled data. The models demonstrate strong generalization capabilities and improve as parameters scale from 0.3 to 2 billion. Sapiens outperforms existing baselines, achieving notable improvements on several benchmarks, including Humans-5K, Humans-2K, Hi4D, and THuman2. [Link]
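The adaptation pattern described here, a shared pretrained encoder plus a small head per downstream task, can be sketched roughly as below; the backbone is a stand-in module rather than the real Sapiens weights, and the class names, dimensions, and number of body parts are placeholders.

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Placeholder for a pretrained human-centric encoder producing feature maps."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify the image
    def forward(self, x):
        return self.conv(x)

class SegmentationHead(nn.Module):
    """Lightweight task head mapping backbone features to per-part logits."""
    def __init__(self, dim=256, num_parts=20):
        super().__init__()
        self.proj = nn.Conv2d(dim, num_parts, kernel_size=1)
    def forward(self, feats):
        return self.proj(feats)

backbone, head = DummyBackbone(), SegmentationHead()
for p in backbone.parameters():      # optionally freeze the pretrained encoder
    p.requires_grad = False
logits = head(backbone(torch.randn(1, 3, 1024, 1024)))  # high-resolution input
print(logits.shape)  # per-patch body-part logits
```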
[8] Controllable Text Generation for Large Language Models: A Survey
Large Language Models in NLP have shown high-quality text generation capabilities but face complex demands in real-world applications. Beyond avoiding misleading or inappropriate content, LLMs must meet specific user needs, such as mimicking writing styles or generating poetic text. Controllable Text Generation (CTG) techniques have emerged to ensure outputs meet predefined conditions like safety, sentiment, and style while maintaining quality. This paper reviews recent CTG advancements, categorizing tasks into content and attribute control, and discusses methods such as fine-tuning, reinforcement learning, and prompt engineering. It also evaluates CTG methods, explores applications, identifies challenges, and suggests future research directions to enhance practicality and fluency. [Link]
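Among the techniques the survey covers, prompt-based control is the lightest-weight to illustrate: desired attributes are stated as constraints around the user's request. The helper below is a hypothetical example of that pattern; the attribute names and wording are my own, not a method from the paper.

```python
def controlled_prompt(task: str, sentiment: str = "neutral",
                      style: str = "formal", max_words: int = 100) -> str:
    """Prepend attribute constraints (sentiment, style, length) to a generation task."""
    constraints = (
        f"Write in a {style} style with a {sentiment} sentiment, "
        f"and keep the response under {max_words} words."
    )
    return f"{constraints}\n\nTask: {task}"

print(controlled_prompt("Summarize our Q3 product launch for customers.",
                        sentiment="positive", style="friendly", max_words=80))
```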
How might these advances impact the future?
xGen-MM (BLIP-3) presents a comprehensive framework for developing Large Multimodal Models (LMMs) that are capable of strong in-context learning and competitive performance across various tasks, setting a new standard for open-source LMM development and expanding research in multimodal AI.
JPEG-LM introduces a novel method for image and video generation using canonical codecs like JPEG, allowing for efficient integration of visual data generation within LLM architectures, potentially transforming how visual content is created and enhancing multi-modal applications.
LongVILA significantly enhances the capability of visual-language models to handle long video contexts by introducing advanced training techniques and parallelism systems, paving the way for more efficient and scalable processing of video data in AI applications.
TableBench provides a comprehensive benchmark for table question answering, highlighting the challenges and progress in applying LLMs to real-world tabular data, which could drive further innovations in data interpretation and decision-making in industries reliant on complex datasets.
TWLV-I introduces a new evaluation framework and model for video foundation models, improving robustness in visual representation and accuracy in video comprehension tasks, which could lead to advancements in video analysis and surveillance technologies.
LLM Pruning and Distillation in Practice details techniques for reducing model size while maintaining performance, showcasing a path to more efficient and accessible AI models, which could democratize access to powerful LLMs and reduce computational costs.
Sapiens advances human-centric vision models for tasks like pose estimation and depth prediction, significantly improving performance and scalability, which could impact fields such as robotics, healthcare, and human-computer interaction.
Controllable Text Generation for Large Language Models explores methods for fine-tuning LLM outputs to meet specific conditions like sentiment and style, enhancing the customization of AI-generated text for applications ranging from customer service to creative writing.
In conclusion, by leveraging these innovations, the scientific community and various industries can unlock new levels of creativity, efficiency, and engagement in AI-driven solutions, significantly impacting how we interact with technology and each other in the digital age.
If you found value in these insights and reflections, please don't forget to share and interact. Your participation helps spread the information and contributes to a more engaged and informed community.