Top AI/ML Papers of the Week [19/08 - 25/08]

Last week, I picked out eight scientific articles that I found noteworthy to share with you. Each is showcased with a short synopsis and a link to explore the subject further. At the end, I reflect on how these advances may impact your projects or companies in the future!


[1] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

This report presents xGen-MM (BLIP-3), a framework for developing Large Multimodal Models (LMMs) that includes curated datasets, a training recipe, model architectures, and a suite of LMMs. Expanding the Salesforce xGen initiative, xGen-MM models are rigorously evaluated on a range of tasks, showing strong in-context learning and competitive performance among open-source models. Additionally, a safety-tuned model using DPO is introduced to reduce harmful behaviors such as hallucinations. All models, datasets, and the fine-tuning codebase are open-sourced to support further LMM research. [Link]
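
To make the safety-tuning step concrete, below is a minimal sketch of the standard DPO objective: the policy model is nudged to prefer a safe ("chosen") response over a harmful ("rejected") one, relative to a frozen reference model. This is the generic DPO loss, not necessarily the exact recipe or hyperparameters used in BLIP-3; all names and values are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the 'chosen'
    (safe) response over the 'rejected' (harmful) one, relative to a
    frozen reference model."""
    # Log-ratio of policy vs. reference for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with per-sequence log-probabilities (batch of 2); values are made up
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -10.5]))
print(loss.item())
```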

[2] JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Recent advancements in image and video generation have leveraged autoregressive LLM architectures for their adaptability and ease of integration into multi-modal systems. This approach hinges on discretization, transforming continuous data into discrete tokens. Traditional methods, such as raw pixel modeling or vector quantization, either produce overly long sequences or require complex pre-training. This work introduces a novel method using canonical codecs (e.g., JPEG, AVC/H.264) to model images and videos directly as compressed files. JPEG-LM and AVC-LM, based on the Llama architecture, generate images and videos by outputting compressed file bytes. Evaluations show that JPEG-LM outperforms pixel-based and vector-quantization baselines, reducing FID by 31% and excelling at generating long-tail visual elements. This method lowers the barriers between language and visual generation, promoting research on multi-modal LLMs. [Link]
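
The core trick is that the model's vocabulary is simply the 256 possible byte values of a compressed file, so generating an image means emitting a valid JPEG byte stream. Below is a minimal sketch of that representation using Pillow; the file path, resolution, and quality setting are illustrative rather than the paper's exact preprocessing.

```python
from io import BytesIO
from PIL import Image

def image_to_byte_tokens(path, quality=25, size=(256, 256)):
    """Encode an image as a JPEG file and return its raw bytes as a
    token sequence over a 256-symbol vocabulary (one token per byte).
    An autoregressive LM trained on such sequences can then generate
    images by emitting a valid JPEG byte stream."""
    img = Image.open(path).convert("RGB").resize(size)
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return list(buf.getvalue())  # e.g. [255, 216, 255, 224, ...] (JPEG header bytes)

def byte_tokens_to_image(tokens):
    """Decode a generated byte-token sequence back into an image."""
    return Image.open(BytesIO(bytes(tokens)))

# Illustrative usage; "example.jpg" is a placeholder path
tokens = image_to_byte_tokens("example.jpg")
print(len(tokens), tokens[:4])  # sequence length and the JPEG header bytes
```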


[3] LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Long-context capability is essential for multi-modal foundation models, particularly for understanding long videos. The study introduces LongVILA, a comprehensive solution for visual-language models, designed through a co-development of algorithms and systems. LongVILA enhances existing models to handle long video contexts by adding two stages: long context extension and supervised fine-tuning. Given the high computational demands of training on long videos, a new Multi-Modal Sequence Parallelism (MM-SP) system is proposed to efficiently parallelize this process, allowing extensive training on multiple GPUs without gradient checkpointing. LongVILA significantly increases video frame processing capability, improves captioning accuracy, and integrates seamlessly with Hugging Face Transformers. [Link]
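
To illustrate the sequence-parallel idea behind MM-SP, the sketch below splits one very long multimodal token sequence into contiguous shards, one per GPU rank; the real system also handles attention communication between ranks, which is omitted here. The frame and token counts are made up for the example.

```python
def shard_sequence(tokens, world_size):
    """Split one long multimodal token sequence into contiguous,
    roughly equal shards, one per GPU rank. Each rank then runs the
    attention layers on its shard and exchanges key/value blocks with
    the other ranks (that communication step is not shown here)."""
    shard_len = (len(tokens) + world_size - 1) // world_size
    return [tokens[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

# Toy example: ~600 video frames at 196 visual tokens each, over 8 GPUs
long_sequence = list(range(600 * 196))
shards = shard_sequence(long_sequence, world_size=8)
print([len(s) for s in shards])  # each rank holds ~14,700 tokens
```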

[4] TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Recent advances in LLMs have greatly improved the ability to interpret and process tabular data, enabling new capabilities. However, LLMs still face challenges in industrial applications due to the complexity of reasoning required for real-world data, highlighting a gap between academic benchmarks and practical use. To bridge this gap, a comprehensive benchmark called TableBench is proposed, covering 18 fields across four major categories of table question answering. Additionally, TableLLM is introduced, trained on a carefully curated dataset, TableInstruct, showing performance comparable to GPT-3.5. Extensive experiments reveal that both open-source and proprietary LLMs, including GPT-4, still have significant room for improvement to match human performance in real-world scenarios. [Link]
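
As a concrete picture of what table question answering asks of a model, here is a minimal sketch that serializes a table into a prompt and appends a question; the table, delimiter format, and wording are illustrative, not TableBench's exact prompt template.

```python
def table_to_prompt(header, rows, question):
    """Serialize a table in a simple pipe-delimited format and append
    the question, producing a prompt for any instruction-tuned LLM."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(c) for c in row) for row in rows]
    table_text = "\n".join(lines)
    return (
        "You are given a table. Answer the question using only the table.\n\n"
        f"{table_text}\n\nQuestion: {question}\nAnswer:"
    )

# Illustrative table and question
prompt = table_to_prompt(
    header=["Quarter", "Revenue ($M)", "Growth (%)"],
    rows=[["Q1", 120, 4.0], ["Q2", 135, 12.5], ["Q3", 128, -5.2]],
    question="Which quarter had the highest revenue growth?",
)
print(prompt)  # send this to the LLM of your choice
```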

[5] TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

This work addresses the need for fair and robust evaluation of video foundation models, which often differ in parameters such as sampling rate and number of frames, complicating comparisons. A new evaluation framework is proposed to measure two key video comprehension capabilities: appearance and motion understanding. The study reveals that current models, whether text-supervised or self-supervised, have limitations in at least one of these areas. To overcome these issues, the new model TWLV-I is introduced, offering robust visual representations for both appearance-centric and motion-centric videos. TWLV-I outperforms existing models across five action recognition benchmarks, showing significant accuracy improvements over V-JEPA, UMT, and even larger models like DFN. Embedding vectors and evaluation code for TWLV-I are provided for further research. [Link]
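
One simple way to see the appearance-versus-motion gap the authors describe is to run a cheap linear probe on frozen clip embeddings from an appearance-centric benchmark and from a motion-centric one, and compare accuracies. The sketch below uses a toy nearest-centroid probe on synthetic embeddings purely for illustration; it is not the paper's evaluation protocol.

```python
import numpy as np

def nearest_centroid_probe(train_emb, train_labels, test_emb, test_labels):
    """Toy probe: classify frozen video-clip embeddings by the nearest
    class centroid and report accuracy. Running this separately on an
    appearance-centric and a motion-centric benchmark reveals which
    capability an embedding model is missing."""
    classes = np.unique(train_labels)
    centroids = np.stack([train_emb[train_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_emb[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return (preds == test_labels).mean()

# Synthetic stand-in for frozen 512-d embeddings and 5 action classes
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 512))
labels = rng.integers(0, 5, size=200)
acc = nearest_centroid_probe(emb[:150], labels[:150], emb[150:], labels[150:])
print(f"probe accuracy: {acc:.2f}")
```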

[6] LLM Pruning and Distillation in Practice: The Minitron Approach

This report details the compression of the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation techniques. Two pruning strategies are explored: depth pruning and joint hidden/attention/MLP (width) pruning, with results evaluated on standard benchmarks. The compressed models are aligned with NeMo Aligner and tested in instruct-tuned versions, resulting in a competitive 4B model from Llama 3.1 8B and the advanced MN-Minitron-8B model from Mistral NeMo 12B. Fine-tuning teacher models on the distillation dataset, even without access to the original data, proves beneficial. The base model weights are open-sourced on Hugging Face with a permissive license. [Link]
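
The distillation side of this recipe boils down to matching the pruned student's next-token distribution to the teacher's. Below is a minimal sketch of a standard logit-distillation loss (temperature-scaled KL divergence); the tensor shapes and temperature are illustrative, not the exact Minitron configuration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the (pruned) student's and the teacher's
    next-token distributions, averaged over all positions. This is the
    core signal used when distilling a pruned model back toward the
    original model's behavior."""
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy tensors: batch=2, sequence length=8, vocabulary=32k
student = torch.randn(2, 8, 32000)
teacher = torch.randn(2, 8, 32000)
print(distillation_loss(student, teacher, temperature=2.0).item())
```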

[7] Sapiens: Foundation for Human Vision Models

Sapiens is a family of models designed for four key human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. These models natively support high-resolution 1K inference and can be easily adapted to individual tasks by fine-tuning models pretrained on over 300 million in-the-wild human images. Self-supervised pretraining on this curated dataset of human images significantly enhances performance across tasks, even with limited labeled data. The models generalize strongly and improve as parameters scale from 0.3 to 2 billion. Sapiens outperforms existing baselines, achieving notable improvements on several benchmarks, including Humans-5K, Humans-2K, Hi4D, and THuman2. [Link]
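
A rough picture of the architecture pattern, one shared pretrained backbone feeding lightweight task-specific heads, is sketched below in PyTorch. The channel counts, output dimensions, and head design are placeholders for illustration, not Sapiens' actual decoders.

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """Lightweight per-task decoder on top of a shared, pretrained
    vision backbone: one head per human-centric task (keypoints,
    body-part classes, depth, surface normals). All sizes here are
    illustrative, not the paper's exact configuration."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(256, out_channels, kernel_size=1),
        )

    def forward(self, features):
        return self.decode(features)

backbone_features = torch.randn(1, 1024, 64, 64)  # stand-in for backbone output
heads = {
    "pose": TaskHead(1024, 17),          # e.g. 17 keypoint heatmaps
    "segmentation": TaskHead(1024, 28),  # e.g. 28 body-part classes
    "depth": TaskHead(1024, 1),
    "normals": TaskHead(1024, 3),
}
outputs = {name: head(backbone_features) for name, head in heads.items()}
print({name: tuple(o.shape) for name, o in outputs.items()})
```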

[8] Controllable Text Generation for Large Language Models: A Survey

Large Language Models in NLP have shown high-quality text generation capabilities but face complex demands in real-world applications. Beyond avoiding misleading or inappropriate content, LLMs must meet specific user needs, such as mimicking writing styles or generating poetic text. Controllable Text Generation (CTG) techniques have emerged to ensure outputs meet predefined conditions like safety, sentiment, and style while maintaining quality. This paper reviews recent CTG advancements, categorizing tasks into content and attribute control, and discusses methods such as fine-tuning, reinforcement learning, and prompt engineering. It also evaluates CTG methods, explores applications, identifies challenges, and suggests future research directions to enhance practicality and fluency. [Link]
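
Of the families the survey covers, prompt-based control is the easiest to picture: constraints on style, sentiment, and length are stated explicitly in the prompt. The sketch below simply assembles such a prompt; the wording and constraint names are illustrative, and the result would be passed to any instruction-tuned LLM.

```python
def controlled_prompt(task, constraints):
    """Prompt-based attribute control: prepend explicit constraints
    (style, sentiment, length, banned topics) to the generation request.
    This is the lightest-weight CTG family discussed, alongside
    fine-tuning and reinforcement-learning-based approaches."""
    rules = "\n".join(f"- {name}: {value}" for name, value in constraints.items())
    return f"Follow these constraints strictly:\n{rules}\n\nTask: {task}"

# Illustrative usage
prompt = controlled_prompt(
    task="Write a short product announcement for a new e-bike.",
    constraints={
        "style": "formal, no slang",
        "sentiment": "positive but not exaggerated",
        "length": "at most 80 words",
        "avoid": "pricing claims and competitor names",
    },
)
print(prompt)  # pass to any instruction-tuned LLM
```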

How might these advances impact the future?

xGen-MM (BLIP-3) presents a comprehensive framework for developing Large Multimodal Models (LMMs) that are capable of strong in-context learning and competitive performance across various tasks, setting a new standard for open-source LMM development and expanding research in multimodal AI.

JPEG-LM introduces a novel method for image and video generation using canonical codecs like JPEG, allowing for efficient integration of visual data generation within LLM architectures, potentially transforming how visual content is created and enhancing multi-modal applications.

LongVILA significantly enhances the capability of visual-language models to handle long video contexts by introducing advanced training techniques and parallelism systems, paving the way for more efficient and scalable processing of video data in AI applications.

TableBench provides a comprehensive benchmark for table question answering, highlighting the challenges and progress in applying LLMs to real-world tabular data, which could drive further innovations in data interpretation and decision-making in industries reliant on complex datasets.

TWLV-I introduces a new evaluation framework and model for video foundation models, improving robustness in visual representation and accuracy in video comprehension tasks, which could lead to advancements in video analysis and surveillance technologies.

LLM Pruning and Distillation in Practice details techniques for reducing model size while maintaining performance, showcasing a path to more efficient and accessible AI models, which could democratize access to powerful LLMs and reduce computational costs.

Sapiens advances human-centric vision models for tasks like pose estimation and depth prediction, significantly improving performance and scalability, which could impact fields such as robotics, healthcare, and human-computer interaction.

Controllable Text Generation for Large Language Models explores methods for fine-tuning LLM outputs to meet specific conditions like sentiment and style, enhancing the customization of AI-generated text for applications ranging from customer service to creative writing.


In conclusion, these advancements set the stage for:

  • Enhanced development of open-source, multimodal AI models;
  • Novel approaches to image and video generation within LLM frameworks;
  • Improved capabilities in processing and understanding long-context video data;
  • Progress in applying LLMs to complex real-world datasets, particularly in table question answering;
  • More robust and accurate video foundation models for various applications;
  • Increased efficiency and accessibility of LLMs through pruning and distillation;
  • Advanced human-centric vision models for a wide range of applications;
  • Greater control and customization in AI-generated text, enhancing user-specific applications.


By leveraging these innovations, the scientific community and various industries can unlock new levels of creativity, efficiency, and engagement in AI-driven solutions, significantly impacting how we interact with technology and each other in the digital age.

If you found value in these insights and reflections, please don't forget to share and interact. Your participation helps spread the information and contributes to a more engaged and informed community.
