AI Newsletter

Another week, another batch of cool updates in the world of AI!

Llama 3.1 Release

  • Meta has launched Llama 3.1, an enhanced version of its language model, available in three sizes: 8 billion, 70 billion, and 405 billion parameters. The upgrade brings significant improvements in complex reasoning, coding, and multilingual tasks. According to Meta, the 405B model outperforms state-of-the-art models such as GPT-4o and Claude 3.5 Sonnet on various benchmarks. Impressively, Llama 3.1 is open-source, allowing developers to fully customize and fine-tune the models for their own needs. The models can be tried across various platforms, including Meta's messaging apps and services such as Groq and Perplexity; a quick local-usage sketch follows below.

Credit: Llama
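
Since the weights are openly released, one quick way to try the smallest variant is through Hugging Face transformers. This is a minimal sketch, assuming access to the gated meta-llama/Meta-Llama-3.1-8B-Instruct checkpoint has been granted to your account and a suitable GPU is available; the prompt and generation settings are purely illustrative.

```python
# Minimal sketch: chat with Llama 3.1 8B Instruct via transformers.
# Assumes the gated checkpoint has been approved for your Hugging Face account.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize this week's AI news in one sentence."}]
output = generator(messages, max_new_tokens=64)
# With chat-style input, generated_text holds the conversation including the new turn.
print(output[0]["generated_text"][-1]["content"])
```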

Mistral Large 2 Release

  • Mistral has just launched its newest open-source model, Mistral Large 2, boasting 123 billion parameters. The model is already making waves by outperforming Meta’s Llama 3.1 (70B) in math and various code-generation tasks, such as Python, C++, and Java. Notably, it competes closely with state-of-the-art models like GPT-4o, offering a robust alternative for developers seeking high performance and flexibility.

Credit: Mistral

Gemini 1.5 Flash Upgrade

  • Google is upgrading the free tier of its Gemini AI to Gemini 1.5 Flash, promising noticeable improvements in quality, latency, reasoning, and image understanding. It has also expanded the context window to 32,000 tokens. Soon, users will be able to upload files via Google Drive or directly from their devices, a feature currently available only in Gemini Advanced. Additionally, Google is enhancing fact-checking by displaying links to related content for fact-seeking prompts in Gemini.

Credit: Google

AI Breakthrough in Math: AlphaProof and AlphaGeometry 2 by Google

  • AlphaProof and AlphaGeometry 2 have made a significant splash by solving four out of six problems from this year’s International Mathematical Olympiad (IMO). This impressive feat means these AI models are now performing at the level of a silver medalist. AlphaProof tackled tough algebra and number theory problems, while AlphaGeometry 2 nailed the geometry challenge.

Credit: DeepMind

SearchGPT Release

  • OpenAI has introduced a new prototype called SearchGPT, aiming to revolutionize the search experience. This AI-powered search tool promises enhanced search capabilities by providing detailed answers with sources, images, and even weather data, similar to what we see with Google and Bing's AI searches. Users can ask questions and receive comprehensive responses, including links to original sources. Currently, SearchGPT is being tested by a select group, but you can join the waitlist for a chance to try it out early.

Credit: OpenAI

Fine-tune GPT-4o mini for Free

  • OpenAI is offering a limited-time opportunity to fine-tune GPT-4o mini for free, allowing up to 2 million training tokens per day until September 23rd. This initiative lets businesses and researchers customize the model with their own data, whether for specialized applications in health and biology or for internal documentation; a sketch of starting such a job is shown below.

Credit: OpenAI
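
For reference, kicking off such a job through the OpenAI Python SDK looks roughly like the sketch below. The training-file name and the exact model snapshot ID are assumptions; check OpenAI's fine-tuning docs for the model names actually covered by the free-token offer.

```python
# Hedged sketch of creating a fine-tuning job with the OpenAI Python SDK.
# File name and model snapshot are placeholders, not values from the announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-formatted training examples.
training_file = client.files.create(
    file=open("train_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the job; any free daily training-token quota is applied on the account side.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed snapshot name
)
print(job.id, job.status)
```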

Grok 2 and Grok 3 Announcements

  • Elon Musk recently revealed exciting updates for xAI's Grok models. Grok 2.0 is set to launch soon and is anticipated to rival GPT-4 and Claude 3.5. Musk also mentioned that Grok 3.0, expected by December, aims to be the most powerful AI in the world. With the Memphis supercluster, the most potent AI training cluster globally, at their disposal, xAI is positioned to achieve remarkable advancements. However, given Musk's history with timelines, the actual release dates remain to be seen.

Credit: X.AI

Bing AI Search Update

  • Bing is rolling out a new AI-driven search experience that prioritizes answering your questions directly on the left side of the screen, with traditional search results moved to the right sidebar. This update aims to streamline how users access information, presenting a more detailed response upfront. While some users are already seeing these changes, others, like myself, are still experiencing the traditional layout.

Credit: Bing

Stable Video 4D Release

  • Stable Video 4D introduces an exciting capability that transforms a video of a single object into multiple novel perspectives, simulating different camera angles. For instance, starting with a simple video of a camel, the model generates views from eight different angles, enhancing the depth of video content.

Credit: Stability AI

Luma Video Loop Feature

  • Luma AI has introduced an exciting new feature called "Loop" in their Dream Machine tool. This addition allows users to create endlessly looping animations from images, such as a spinning top or a spaceship flying through space. While Dream Machine's text-to-video capabilities are still developing, its ability to animate static images has been notably enhanced with this feature.

Credit: Luma

Kling AI Video Widely Available

  • Kling AI, a standout in text-to-video technology, is now more accessible than ever. Known for its advanced capabilities and flexibility, Kling AI allows users to generate high-quality videos from text prompts, including some impressive scenes like samurais with flaming swords and futuristic robots. The platform is now available without the need for phone number verification, making it easier to use with just an email registration and daily free credits.

Credit: AI Perceiver

Runway Scraping YouTube Videos

  • Recent reports have surfaced suggesting that Runway may have used YouTube videos to train its AI models without explicit permission. While this information comes from anonymous sources and remains unconfirmed, the leaked spreadsheet lists prominent YouTube channels such as Marques Brownlee's and MrBeast's as potential sources.

Credit: Runway

Most Powerful AI Training Cluster

  • Elon Musk has announced via X that xAI has launched the most powerful AI training cluster globally, featuring 100,000 Nvidia H100 GPUs with advanced liquid cooling. This impressive array is designed to accelerate the training of sophisticated AI models, surpassing existing systems such as AMD’s Frontier with 27,888 GPUs, Intel’s Aurora with 60,000 GPUs, and Microsoft’s Eagle with 14,400 H100 GPUs. The cluster is pivotal for training Grok 3, the next-generation AI model anticipated for release by the end of the year.

Credit: xAI

New Noteworthy Papers

SHIC: Shape-Image Correspondences with No Keypoint Supervision

Authors: Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi

Institutions: Visual Geometry Group, University of Oxford

Abstract: Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. Popularized by DensePose for the analysis of humans, attempts to apply the concept to other categories have been limited due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision, achieving better results than supervised methods for most categories. Our approach leverages foundation computer vision models such as DINO and Stable Diffusion, which provide excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from these models. The method matches images of the object to non-photorealistic renders of the template, emulating manual annotation processes. These correspondences supervise high-quality canonical maps for any object of interest. Additionally, image generators can further improve the realism of the template views, providing another source of supervision for the model.

VILA2: Augmented VILA

Authors: Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin

Institutions: NVIDIA (Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Song Han, Hongxu Yin), UT Austin (Jang Hyun Cho), MIT (Marco Pavone)

Abstract: Visual language models (VLMs) have advanced rapidly due to the success of large language models (LLMs). While model architectures and training infrastructures progress quickly, data curation remains under-explored. When data quantity and quality become bottlenecks, existing methods either crawl more raw data from the Internet, which lacks quality guarantees, or distill from commercial models, limiting performance to those models. We propose a novel approach with a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its pretraining data to enhance quality and retrains from scratch using this refined dataset. This process iterates until saturation. Subsequently, specialist VLMs fine-tuned from the self-augmented VLM add domain-specific expertise through task-oriented recaptioning and retraining. This combined approach introduces VILA2 (VILA-augmented-VILA), a VLM family that consistently improves accuracy across various tasks and achieves new state-of-the-art results on the MMMU leaderboard among open-sourced models.

SAM 2: Segment Anything in Images and Videos

Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer

Institutions: Meta FAIR (all authors)

Core Contributors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Christoph Feichtenhofer

Project Lead: Piotr Dollár

Abstract: We present Segment Anything Model 2 (SAM 2), a foundation model aimed at solving promptable visual segmentation in images and videos. SAM 2 introduces a data engine that enhances model and data through user interaction, creating the largest video segmentation dataset to date. Our model utilizes a simple transformer architecture with streaming memory for real-time video processing. Trained on our dataset, SAM 2 shows strong performance across various tasks, achieving better accuracy in video segmentation with 3× fewer interactions compared to previous methods. In image segmentation, SAM 2 is more accurate and 6× faster than its predecessor, the Segment Anything Model (SAM). This release includes the model, dataset, and an interactive demo, all under the Apache 2.0 license.

Demo & Code
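
For a rough feel of the released image API, here is a hedged usage sketch based on the conventions in the open-sourced repository; the config and checkpoint paths below are assumptions taken from its README rather than verified values.

```python
# Hedged sketch: single-point prompt on one image with the released SAM 2 code.
# Config and checkpoint paths are assumptions; see the repository's README.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
predictor = SAM2ImagePredictor(model)

image = np.array(Image.open("frame.jpg").convert("RGB"))
predictor.set_image(image)

# One foreground click; the predictor returns candidate masks with quality scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
)
print(masks.shape, scores)
```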

Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

Authors: Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, Janet B. Pierrehumbert

Institutions: 1. University of Leeds, 2. University of Oxford, 3. University of Edinburgh, 4. University of Cambridge, 5. University of Illinois Urbana-Champaign

Abstract: Planning is a fundamental aspect of human intelligence, and reasoning about asynchronous plans requires both sequential and parallel planning to optimize time costs. This study presents a large-scale investigation into whether large language models (LLMs) can handle such tasks. We find that prominent LLMs, including GPT-4 and LLaMA-2, perform poorly without detailed illustrations of the task-solving process. We introduce a novel technique, Plan Like a Graph (PLaG), which integrates graphs with natural language prompts to achieve state-of-the-art results. Despite improvements in performance with PLaG, LLMs still experience significant degradation as task complexity increases. This research represents a significant step towards using LLMs as effective autonomous agents.
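
To make the idea concrete, here is a small illustrative sketch (not the authors' benchmark or exact prompt template) of turning an asynchronous plan into an explicit dependency graph inside a prompt, so the model can reason about which steps run in parallel.

```python
# Illustrative sketch of graph-style prompting for asynchronous planning.
# The tasks and wording are made up; they are not taken from the PLaG benchmark.
tasks = {  # task -> (duration in minutes, prerequisites)
    "boil water": (10, []),
    "chop vegetables": (15, []),
    "cook soup": (20, ["boil water", "chop vegetables"]),
}

edges = [(dep, task) for task, (_, deps) in tasks.items() for dep in deps]
graph_text = "\n".join(
    [f"- {t}: {d} min" for t, (d, _) in tasks.items()]
    + [f"- '{a}' must finish before '{b}' starts" for a, b in edges]
)

prompt = (
    "You are given a plan as a dependency graph.\n"
    f"{graph_text}\n"
    "Independent tasks can run in parallel. "
    "What is the minimum total time to finish everything?"
)
print(prompt)  # the correct answer here is 35 minutes: max(10, 15) + 20
```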


MUSICONGEN: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Authors: Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang

Institutions: 1. Taiwan AI Labs, 2. National Taiwan University

Abstract: Existing text-to-music models produce high-quality audio with considerable diversity. However, they struggle to precisely control temporal musical features such as chords and rhythm. We present MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds on the pretrained MusicGen framework. Our approach includes an efficient finetuning mechanism for consumer-grade GPUs, incorporating automatically-extracted rhythm and chords as condition signals. During inference, conditions can be derived from musical features of a reference audio signal or user-defined symbolic chord sequences, BPM, and textual prompts. Evaluations on two datasets—one based on extracted features and the other on user-created inputs—show that MusiConGen effectively generates realistic backing track music that adheres to the specified conditions.

SYNTHESIZER SOUND MATCHING USING AUDIO SPECTROGRAM TRANSFORMERS

Authors: Fred Bruford, Frederik Blang, Shahan Nercessian

Institutions: Native Instruments (London, United Kingdom; Berlin, Germany; Boston, MA, USA)

Abstract: Systems for synthesizer sound matching, which automatically set synthesizer parameters to emulate an input sound, aim to make synthesizer programming faster and easier for both novice and experienced musicians. Given the vast variety of synthesizers and their complexity, general-purpose sound-matching systems that require minimal prior knowledge about the synthesis architecture are highly desirable. This paper introduces a synthesizer sound-matching model based on the Audio Spectrogram Transformer. The model is trained on a large synthetic dataset of samples from the popular Massive synthesizer and reconstructs the parameters of sounds generated from a set of 16 parameters with higher fidelity than multi-layer perceptron and convolutional neural network baselines. The paper also includes audio examples of the model’s performance in emulating vocal imitations and sounds from other synthesizers and musical instruments.


On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures

Authors: Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

Institutions: RWTH Aachen University, Germany; AppTek GmbH, Germany

Abstract: This work evaluates the utility of synthetic data for training automatic speech recognition (ASR) systems. A text-to-speech (TTS) system similar to FastSpeech-2 is used to reproduce original training data, enabling training ASR systems solely on synthetic data. The study compares three ASR architectures: attention-based encoder-decoder, hybrid deep neural network hidden Markov model, and Gaussian mixture hidden Markov model, examining their sensitivity to synthetic data generation. It extends previous work with ablation studies on the effectiveness of synthetic versus real training data, focusing on variations in speaker embedding and model size. Results indicate that TTS models generalize well even when training scores suggest overfitting.

Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation

Authors: Jarod Duret, Yannick Estève, Titouan Parcollet

Institutions: LIA - Avignon Université, France; University of Cambridge, United Kingdom

Abstract: Recent advancements in textless speech-to-speech translation systems have been driven by the adoption of self-supervised learning techniques. While most state-of-the-art systems use similar architectures to transform source language speech into sequences of discrete representations in the target language, the criteria for selecting these target speech units remain unclear. This work investigates the selection process by studying downstream tasks such as automatic speech recognition, speech synthesis, speaker recognition, and emotion recognition. Findings reveal a discrepancy in the optimization of discrete speech units: units that perform well in resynthesis do not always correlate with enhanced translation efficacy. This discrepancy highlights the complex nature of target feature selection and its impact on speech-to-speech translation performance.

LONG-FORM MUSIC GENERATION WITH LATENT DIFFUSION

Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

Institution: Stability AI

Abstract: Audio-based generative models for music have made significant progress recently but have struggled to produce full-length music tracks with coherent musical structure from text prompts. This work demonstrates that training a generative model on long temporal contexts enables the production of long-form music up to 4 minutes and 45 seconds. The proposed model is a diffusion-transformer that operates on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It achieves state-of-the-art performance in terms of audio quality and prompt alignment, and subjective evaluations confirm that it generates full-length music with a coherent structure.
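
As a rough back-of-the-envelope check (our own arithmetic, not a figure from the paper), the quoted latent rate implies the diffusion-transformer has to handle on the order of six thousand latent frames for a full-length track:

```python
# 4 min 45 s at a 21.5 Hz latent rate -> roughly 6,100 latent frames to model.
duration_s = 4 * 60 + 45      # 285 seconds
latent_rate_hz = 21.5
print(duration_s * latent_rate_hz)  # 6127.5
```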


Thank you for your attention. Subscribe now to stay informed and join the conversation!

About us:

We have an amazing team of AI engineers with:

  • A blend of industrial experience and a strong academic track record
  • 300+ research publications and 150+ commercial projects
  • Millions of dollars saved through our ML/DL solutions
  • An exceptional work culture, ensuring satisfaction with both the process and results

We are here to help you maximize efficiency with your available resources.

Reach out when:

  • You want to identify which daily tasks can be automated
  • You need to understand the benefits of AI and how to avoid excessive cloud costs while maintaining data privacy
  • You’d like to optimize current pipelines and computational resource distribution
  • You’re unsure how to choose the best DL model for your use case
  • You know how, but struggle to hit specific performance and cost-efficiency targets

Have doubts or questions about AI in your business? Get in touch!

Manuel Kistner

Expanding Businesses into New Markets | Strategic Growth & Innovation | Sharing Insights and Experiences from Dubai

2 months ago

Exciting times. Love to hear diverse perspectives on recent AI breakthroughs. Ievgen Gorovyi

Godwin Josh

Co-Founder of Altrosyn and Director at CDTECH | Inventor | Manufacturer

2 months ago

The rapid evolution of AI models is striking. With the release of Meta's Llama 3.1 and Google's Gemini 1.5 Flash upgrade, what do you think are the key improvements in handling complex data sets? How do these advancements enhance practical applications?
