AI Newsletter
Another week, another batch of cool updates in the world of AI!
Llama 3.1 Release
Mistral Large 2 Release
Gemini 1.5 Flash Upgrade
AI Breakthrough in Math: AlphaProof and AlphaGeometry 2 by Google
SearchGPT Release
ChatGPT Fine-tuning for Free
Grok 2 and Grok 3 Announcements
Bing AI Search Update
Stable Video 4D Release
Luma Video Loop Feature
Kling AI Video Widely Available
Runway Scraping YouTube Videos
Most Powerful AI Training Cluster
New Noteworthy Papers
Authors: Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi
Institutions: Visual Geometry Group, University of Oxford
Abstract: Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. Popularized by DensePose for the analysis of humans, attempts to apply the concept to other categories have been limited due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision, achieving better results than supervised methods for most categories. Our approach leverages foundation computer vision models such as DINO and Stable Diffusion, which provide excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from these models. The method matches images of the object to non-photorealistic renders of the template, emulating manual annotation processes. These correspondences supervise high-quality canonical maps for any object of interest. Additionally, image generators can further improve the realism of the template views, providing another source of supervision for the model.
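To make the core trick concrete, here is a minimal sketch (not the authors' code) of reducing image-to-template correspondence to image-to-image feature matching; the random tensors stand in for DINO patch features of a photo and of a template render:

```python
# Minimal sketch of the SHIC idea: match each image patch to the most similar
# patch of a non-photorealistic template render in foundation-model feature space.
# The random tensors below are stand-ins for real DINO patch embeddings.
import torch

def dense_correspondences(feats_img: torch.Tensor, feats_render: torch.Tensor) -> torch.Tensor:
    """For every patch feature of the image, return the index of the most
    similar patch of the rendered template (cosine similarity, argmax)."""
    img = torch.nn.functional.normalize(feats_img, dim=-1)     # (N_img, D)
    ren = torch.nn.functional.normalize(feats_render, dim=-1)  # (N_render, D)
    sim = img @ ren.T                                          # (N_img, N_render)
    return sim.argmax(dim=-1)

# Toy usage with 196 patches of 384-dim features per view.
matches = dense_correspondences(torch.randn(196, 384), torch.randn(196, 384))
print(matches.shape)  # torch.Size([196]): one template patch per image patch
```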
Authors: Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin
Institutions: NVIDIA (Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Song Han, Hongxu Yin), UT Austin (Jang Hyun Cho), MIT (Marco Pavone)
Abstract: Visual language models (VLMs) have advanced rapidly due to the success of large language models (LLMs). While model architectures and training infrastructures progress quickly, data curation remains under-explored. When data quantity and quality become bottlenecks, existing methods either crawl more raw data from the Internet, which lacks quality guarantees, or distill from commercial models, limiting performance to those models. We propose a novel approach with a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its pretraining data to enhance quality and retrains from scratch using this refined dataset. This process iterates until saturation. Subsequently, specialist VLMs fine-tuned from the self-augmented VLM add domain-specific expertise through task-oriented recaptioning and retraining. This combined approach introduces VILA2 (VILA-augmented-VILA), a VLM family that consistently improves accuracy across various tasks and achieves new state-of-the-art results on the MMMU leaderboard among open-sourced models.
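A hedged sketch of the self-augment loop described above; train_vlm, recaption, and score are toy stand-ins (defined as stubs), not the VILA2 codebase:

```python
# Toy sketch of the self-augment step: the VLM recaptions its own pretraining
# data, is retrained from scratch on the refined captions, and the loop stops
# once improvement saturates. All functions here are illustrative stubs.
import random

def train_vlm(images, captions):              # stub: "model" is just the caption set
    return {"data": list(captions)}

def recaption(model, images):                 # stub: pretend the VLM rewrites captions
    return [c + "." for c in model["data"]]

def score(model):                             # stub: pretend held-out benchmark accuracy
    return random.random()

def self_augment(images, captions, rounds=3, tol=1e-3):
    model, prev = train_vlm(images, captions), float("-inf")
    for _ in range(rounds):
        captions = recaption(model, images)   # VLM recaptions its pretraining data
        model = train_vlm(images, captions)   # retrain from scratch on the refined set
        cur = score(model)
        if cur - prev < tol:                  # iterate until quality gains saturate
            break
        prev = cur
    return model
```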
Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
Institutions: Meta FAIR (Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer)
Core Contributors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R?dle, Christoph Feichtenhofer
Project Lead: Piotr Dollár
Abstract: We present Segment Anything Model 2 (SAM 2), a foundation model aimed at solving promptable visual segmentation in images and videos. SAM 2 introduces a data engine that enhances model and data through user interaction, creating the largest video segmentation dataset to date. Our model utilizes a simple transformer architecture with streaming memory for real-time video processing. Trained on our dataset, SAM 2 shows strong performance across various tasks, achieving better accuracy in video segmentation with 3× fewer interactions compared to previous methods. In image segmentation, SAM 2 is more accurate and 6× faster than its predecessor, the Segment Anything Model (SAM). This release includes the model, dataset, and an interactive demo, all under the Apache 2.0 license.
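For readers who want to try it, here is a minimal usage sketch for promptable image segmentation with the released sam2 package; the config and checkpoint names are assumptions based on the public repository and may differ in your checkout:

```python
# Minimal promptable-segmentation sketch with the released SAM 2 package.
# Config/checkpoint paths are assumptions; adjust to your local installation
# of https://github.com/facebookresearch/sam2.
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
)
image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real RGB frame
with torch.inference_mode():
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(         # one positive point prompt
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),
    )
print(masks.shape)                                # (num_masks, H, W)
```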
Authors: Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, Janet B. Pierrehumbert
Institutions: 1. University of Leeds, 2. University of Oxford, 3. University of Edinburgh, 4. University of Cambridge, 5. University of Illinois Urbana-Champaign
Abstract: Planning is a fundamental aspect of human intelligence, and reasoning about asynchronous plans requires both sequential and parallel planning to optimize time costs. This study presents a large-scale investigation into whether large language models (LLMs) can handle such tasks. We find that prominent LLMs, including GPT-4 and LLaMA-2, perform poorly without detailed illustrations of the task-solving process. We introduce a novel technique, Plan Like a Graph (PLaG), which integrates graphs with natural language prompts to achieve state-of-the-art results. Despite improvements in performance with PLaG, LLMs still experience significant degradation as task complexity increases. This research represents a significant step towards using LLMs as effective autonomous agents.
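A toy example of why "plan like a graph" helps: the optimal completion time of an asynchronous plan is the length of the critical path of its dependency DAG, not the sum of step durations. This is an illustration, not the paper's prompting code:

```python
# Toy asynchronous plan: tasks with durations (minutes) and prerequisites.
# The minimum completion time is the critical path, not the sequential sum.
from functools import lru_cache

durations = {"boil water": 5, "chop veg": 8, "cook soup": 15, "set table": 3}
deps = {"boil water": [], "chop veg": [], "cook soup": ["boil water", "chop veg"],
        "set table": []}

@lru_cache(maxsize=None)
def finish(task: str) -> int:
    # earliest finish = own duration + latest finish among prerequisites
    return durations[task] + max((finish(d) for d in deps[task]), default=0)

print(max(finish(t) for t in durations))  # 23 minutes, vs. 31 if run strictly sequentially
```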
Audio:
Authors: Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang
Institutions: 1. Taiwan AI Labs, 2. National Taiwan University
Abstract: Existing text-to-music models produce high-quality audio with considerable diversity. However, they struggle to precisely control temporal musical features such as chords and rhythm. We present MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds on the pretrained MusicGen framework. Our approach includes an efficient finetuning mechanism for consumer-grade GPUs, incorporating automatically-extracted rhythm and chords as condition signals. During inference, conditions can be derived from musical features of a reference audio signal or user-defined symbolic chord sequences, BPM, and textual prompts. Evaluations on two datasets—one based on extracted features and the other on user-created inputs—show that MusiConGen effectively generates realistic backing track music that adheres to the specified conditions.
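A hedged sketch of how such conditioning inputs might be assembled; musicongen_generate is a hypothetical placeholder rather than the released API, and tempo extraction with librosa is an assumed choice:

```python
# Assemble the conditioning signals described in the abstract: a text prompt,
# a BPM extracted from a reference recording, and user-defined symbolic chords.
import librosa

y, sr = librosa.load("reference.wav", sr=None)      # reference audio signal
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)      # automatically-extracted BPM

conditions = {
    "text": "laid-back funk backing track with electric piano",
    "bpm": float(tempo),
    "chords": ["Em7", "A7", "Dmaj7", "Gmaj7"],      # user-defined symbolic chord sequence
}
# audio = musicongen_generate(**conditions)         # hypothetical generation call
```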
Authors: Fred Bruford, Frederik Blang, Shahan Nercessian
Institutions: Native Instruments (London, United Kingdom; Berlin, Germany; Boston, MA, USA)
Abstract: Systems for synthesizer sound matching, which automatically set synthesizer parameters to emulate an input sound, aim to make synthesizer programming faster and easier for both novice and experienced musicians. Given the vast variety of synthesizers and their complexities, general-purpose sound matching systems that require minimal prior knowledge about the synthesis architecture are highly desirable. This paper introduces a synthesizer sound matching model based on the Audio Spectrogram Transformer. The model is trained on a large synthetic dataset of samples from the popular Massive synthesizer. It effectively reconstructs parameters from a set of 16 parameters, showing improved fidelity compared to multi-layer perceptron and convolutional neural network baselines. The paper also includes audio examples of the model’s performance in emulating vocal imitations and sounds from other synthesizers and musical instruments.
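As a rough illustration of the sound-matching setup, the toy model below regresses 16 normalized synthesizer parameters from a spectrogram; a small CNN stands in for the Audio Spectrogram Transformer used in the paper, and all shapes are illustrative assumptions:

```python
# Toy sound-matching regressor: spectrogram in, 16 normalized synth parameters out.
import torch
import torch.nn as nn

class SoundMatcher(nn.Module):
    def __init__(self, n_params: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32, n_params), nn.Sigmoid())  # params in [0, 1]

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(spec))

model = SoundMatcher()
spec = torch.randn(4, 1, 128, 256)                              # batch of log-mel spectrograms
loss = nn.functional.mse_loss(model(spec), torch.rand(4, 16))   # target = ground-truth parameters
```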
Authors: Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter
Institutions: RWTH Aachen University, Germany; AppTek GmbH, Germany
Abstract: This work evaluates the utility of synthetic data for training automatic speech recognition (ASR) systems. A text-to-speech (TTS) system similar to FastSpeech-2 is used to reproduce original training data, enabling training ASR systems solely on synthetic data. The study compares three ASR architectures: attention-based encoder-decoder, hybrid deep neural network hidden Markov model, and Gaussian mixture hidden Markov model, examining their sensitivity to synthetic data generation. It extends previous work with ablation studies on the effectiveness of synthetic versus real training data, focusing on variations in speaker embedding and model size. Results indicate that TTS models generalize well even when training scores suggest overfitting.
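A hedged outline of the synthetic-data pipeline described above; tts and train_asr are hypothetical placeholders for a FastSpeech-2-like synthesizer and an ASR trainer, not the authors' toolchain:

```python
# Re-synthesize the original training transcripts with a TTS model, keeping the
# transcripts as labels, so an ASR system can be trained on purely synthetic audio.
def build_synthetic_corpus(transcripts, speaker_embeddings, tts):
    corpus = []
    for text, spk in zip(transcripts, speaker_embeddings):
        audio = tts(text, speaker=spk)     # synthetic speech for the original text
        corpus.append((audio, text))       # original transcript stays the label
    return corpus

# asr_model = train_asr(build_synthetic_corpus(transcripts, speakers, tts))  # hypothetical
```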
Authors: Jarod Duret, Yannick Estève, Titouan Parcollet
Institutions: LIA - Avignon Université, France; University of Cambridge, United Kingdom
Abstract: Recent advancements in textless speech-to-speech translation systems have been driven by the adoption of self-supervised learning techniques. While most state-of-the-art systems use similar architectures to transform source language speech into sequences of discrete representations in the target language, the criteria for selecting these target speech units remain unclear. This work investigates the selection process by studying downstream tasks such as automatic speech recognition, speech synthesis, speaker recognition, and emotion recognition. Findings reveal a discrepancy in the optimization of discrete speech units: units that perform well in resynthesis do not always correlate with enhanced translation efficacy. This discrepancy highlights the complex nature of target feature selection and its impact on speech-to-speech translation performance.
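For context, discrete target speech units are commonly obtained by k-means over self-supervised features; the sketch below illustrates that step, with random frames standing in for real encoder outputs (this is not the paper's code):

```python
# Discretize self-supervised speech features (e.g. HuBERT frames) into units
# via k-means; a translation model would then predict these unit sequences.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(10_000, 768).astype(np.float32)    # stand-in SSL features
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)                               # one discrete unit per frame
print(units[:20])                                            # example unit sequence
```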
Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Institution: Stability AI
Abstract: Audio-based generative models for music have made significant progress recently but have struggled to produce full-length music tracks with coherent musical structure from text prompts. This work demonstrates that training a generative model on long temporal contexts enables the production of long-form music up to 4 minutes and 45 seconds. The proposed model is a diffusion-transformer that operates on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It achieves state-of-the-art performance in terms of audio quality and prompt alignment, and subjective evaluations confirm that it generates full-length music with a coherent structure.
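A quick back-of-the-envelope check of what the reported latent rate implies for sequence length:

```python
# Sequence length implied by a 21.5 Hz latent rate over 4 min 45 s of audio.
duration_s = 4 * 60 + 45            # 285 seconds
latent_rate_hz = 21.5
print(duration_s * latent_rate_hz)  # ~6128 latent frames for a full-length track
```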
Thank you for your attention. Subscribe now to stay informed and join the conversation!
About us:
We have an amazing team of AI engineers, and we are here to help you maximize efficiency with your available resources.
Have doubts or questions about AI in your business? Get in touch!