AI Newsletter

Another week, another batch of cool updates in the world of AI!

AI Images Are So Real Now!

AI-generated images have reached a new level of realism, as demonstrated by a recent example from the Flux model. The images are so lifelike that even minor details like facial features and clothing look convincingly real. The only telltale signs are small errors, such as gibberish text or slightly off details on objects like microphones.

Zico Kolter Joins OpenAI’s Board

OpenAI has appointed Zico Kolter, a leading expert in AI safety and alignment, to its Board of Directors. Kolter, who heads the Machine Learning Department at Carnegie Mellon University, brings extensive expertise in developing robust machine learning models and innovative AI safety methods. He will also serve on the Safety and Security Committee, contributing to critical decisions on OpenAI's projects.

GPT-4o's New Safety Measures

OpenAI released a detailed system card for GPT-4o, outlining the safety measures implemented before its release. The report highlights extensive safety work, including external red teaming, risk evaluations, and various mitigations. The evaluation scorecard rates cybersecurity, biological threats, and model autonomy as low risk, and persuasion as medium risk. OpenAI also addressed concerns about emotional attachment to the GPT-4o voice mode, noting that some users may form strong connections with the AI.

OpenAI's Failure to Deliver

OpenAI has faced criticism for frequently announcing exciting developments but failing to deliver them to the public. Despite promises, features like Sora access and GPT-4o's advanced voice capabilities remain largely unavailable. Similarly, its new ChatGPT search tool, intended to compete with Google, has yet to be released. As Dev Day approaches on October 1st in San Francisco, OpenAI has tempered expectations, indicating that no major announcements or new features for ChatGPT are expected.


Credit: Justin Sullivan / Getty Images

Elon Sues OpenAI (Again)

Elon Musk has filed a new lawsuit against OpenAI, claiming that the organization misled him into co-founding a nonprofit by promising it would prioritize safety and transparency over profits. This follows an earlier lawsuit, which Musk later dropped, that accused OpenAI of straying from its original mission. The new lawsuit alleges that OpenAI manipulated Musk into investing in a venture that later shifted towards profit-making, contrary to its initial nonprofit ethos.

Nvidia Scraping YouTube

Nvidia has found itself in controversy this week after a leaked document revealed that the company has been scraping vast amounts of YouTube videos—equivalent to a human lifetime’s worth each day—to train its AI models. The internal communications and documents suggest that this data collection is part of Nvidia's efforts to develop a new video foundation model.

Credit: Nvidia

Character AI Shakeup

Character AI is undergoing a significant transition as the company partners with Google, resulting in the departure of its co-founders. Noam Shazeer, the co-founder and CEO of Character AI, is returning to Google to work with DeepMind after originally leaving the company in 2021 to start Character AI. While this move raises questions about the future direction of Character AI, the company isn’t closing shop. Character AI's General Counsel will take over as interim CEO, and the majority of the staff will remain.

Credit: Winni Wintermeyer/Getty Images

Introducing Qwen2-Math

The Qwen Team has released a series of math-specific large language models under the Qwen2 series, designed to excel at solving arithmetic and mathematical problems. The Qwen2-Math and Qwen2-Math-Instruct models, available in sizes ranging from 1.5B to 72B parameters, have shown remarkable performance, surpassing both open-source and closed-source models like GPT-4o on various mathematical benchmarks. The models were fine-tuned on a specialized mathematical corpus and evaluated across both English and Chinese math benchmarks, achieving state-of-the-art results.

Credit: Qwen
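
For readers who want to try the models, here is a minimal sketch of querying an instruct checkpoint through Hugging Face transformers. The model id Qwen/Qwen2-Math-7B-Instruct and the chat-template call follow the usual Qwen2 conventions; verify the details against the official model card before relying on them.

```python
# Minimal sketch: querying a Qwen2-Math-Instruct checkpoint with Hugging Face
# transformers. Model id and chat-template usage follow standard Qwen2
# conventions; check the official model card for the recommended setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-Math-7B-Instruct"  # 1.5B and 72B variants also exist

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful math assistant."},
    {"role": "user", "content": "Solve x^2 - 5x + 6 = 0 step by step."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))
```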

OpusClip's New Feature: Clip Anything

OpusClip has just rolled out an exciting new feature called "Clip Anything," designed to enhance video editing with AI-driven precision. This update allows users to create short-form clips from any moment in a long-form video by leveraging advanced visual, audio, and sentiment analysis. The tool analyzes each frame to identify objects, scenes, actions, emotions, and text, enabling you to generate clips based on natural language prompts.

Reddit to Test AI-Powered Search Results

Reddit is preparing to roll out a new feature that will integrate AI into its search functionality. According to Reddit CEO Steve Huffman, the platform will soon test AI-generated summaries at the top of search result pages. This feature aims to provide users with concise content summaries and recommendations, helping them dig deeper into topics and discover new communities on Reddit. The AI-powered search results will be driven by both first-party and third-party technologies, with the experiment set to begin later this year.

Credit: Mashable

Google Announces Gemini AI-Powered Streaming Device

Google is set to launch a new streaming device powered by Gemini AI, designed to replace the Chromecast and compete with Roku, Amazon Fire Stick, and Apple TV. The device aims to enhance content discovery by leveraging Google AI to provide personalized content recommendations tailored to user preferences. While current platforms like Hulu, Amazon Prime, Netflix, and YouTube already use recommendation algorithms to suggest content, Google's use of Gemini AI may offer an improvement: generative AI could provide deeper insights and more nuanced recommendations than traditional algorithms.

Credit: 9to5Google

New Robot: Figure 02

Figure unveiled its Figure 02 humanoid robot, featuring a design similar to Tesla's Optimus. Equipped with six cameras for advanced AI-driven vision and speech-to-speech reasoning, the Figure 02 can interact with its environment and respond to commands. It is already being used on BMW production lines.

Credit: Figure


New Noteworthy Papers

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Authors: Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, David Ha (equal contribution: Chris Lu, Cong Lu, Robert Tjarko Lange; equal advising: Jeff Clune, David Ha)

Affiliations: Sakana AI, FLAIR - University of Oxford, University of British Columbia, Vector Institute, Canada CIFAR AI Chair

Abstract: One of the major challenges in artificial general intelligence is creating agents capable of conducting scientific research independently. While current models assist with tasks like brainstorming, coding, and prediction, they only support part of the scientific process. This paper introduces "The AI Scientist," a framework for fully automated scientific discovery using frontier large language models (LLMs). This system autonomously generates research ideas, writes code, executes experiments, visualizes results, drafts scientific papers, and conducts a simulated review process for evaluation. The approach allows for iterative development of ideas and contributes to an expanding knowledge archive, mimicking the human scientific community. The framework is demonstrated across three machine learning subfields: diffusion modeling, transformer-based language modeling, and learning dynamics. The cost of generating each paper is under $15, showcasing the potential for democratizing research and accelerating scientific progress. The framework includes an automated reviewer that performs near-human quality evaluations, with generated papers meeting acceptance criteria at a top machine learning conference. This work represents a significant step toward fully automated scientific discovery and innovation. The code is available at https://github.com/SakanaAI/AI-Scientist.
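
As a rough illustration of the loop the abstract describes (idea generation, experimentation, write-up, automated review), here is a hypothetical skeleton. The llm() and run_experiment() helpers are stand-ins, not the authors' code; the real implementation lives in the repository linked above.

```python
# Hypothetical skeleton of an "AI Scientist"-style loop. llm() and
# run_experiment() are stand-ins, not the authors' implementation.
from dataclasses import dataclass, field


@dataclass
class Idea:
    description: str
    results: list = field(default_factory=list)
    score: float = 0.0


def llm(prompt: str) -> str:
    """Stand-in for a frontier-LLM call (e.g., via an API client)."""
    raise NotImplementedError


def run_experiment(code: str) -> dict:
    """Stand-in: execute generated code in a sandbox and collect metrics."""
    raise NotImplementedError


def ai_scientist_round(archive: list[Idea]) -> Idea:
    # 1. Propose a new idea, conditioned on the archive of prior work.
    prior = "\n".join(i.description for i in archive)
    idea = Idea(llm(f"Given prior ideas:\n{prior}\nPropose a novel experiment."))
    # 2. Generate and run the experiment code.
    code = llm(f"Write runnable training code for: {idea.description}")
    idea.results.append(run_experiment(code))
    # 3. Draft a paper and score it with an automated reviewer.
    paper = llm(f"Write a paper for: {idea.description}\nResults: {idea.results}")
    idea.score = float(llm(f"Review this paper; reply with a 1-10 score:\n{paper}"))
    archive.append(idea)  # grow the knowledge archive for later rounds
    return idea
```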

Matryoshka Diffusion Models

Authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly

Affiliation: Apple

Abstract: Diffusion models have become a leading approach for generating high-quality images and videos, but training these high-dimensional models is challenging due to computational and optimization constraints. Existing techniques typically involve either cascaded models in pixel space or downsampled latent spaces using auto-encoders. This paper introduces Matryoshka Diffusion Models (MDM), an end-to-end framework designed for high-resolution image and video synthesis. MDM employs a diffusion process that denoises inputs across multiple resolutions simultaneously, utilizing a NestedUNet architecture where smaller-scale features are nested within larger-scale ones. The framework supports a progressive training schedule from lower to higher resolutions, improving optimization for high-resolution outputs. MDM's effectiveness is demonstrated across several benchmarks, including class-conditioned image generation, high-resolution text-to-image synthesis, and text-to-video generation. Notably, MDM can train a single pixel-space model at resolutions up to 1024 × 1024 pixels, showcasing strong zero-shot generalization with the CC12M dataset, which includes 12 million images.
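
To make the multi-resolution idea concrete, below is an illustrative PyTorch sketch of a network that denoises a nested pair of resolutions jointly, with low-resolution features fused into the high-resolution path. Module sizes and shapes are toy values, not the paper's NestedUNet.

```python
# Illustrative sketch of the core MDM idea: one network denoises a
# "matryoshka" of resolutions at once, with the low-res branch nested
# inside the high-res one. Toy sizes, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyNestedUNet(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.low_net = nn.Conv2d(3, ch, 3, padding=1)   # inner (low-res) branch
        self.high_net = nn.Conv2d(3, ch, 3, padding=1)  # outer (high-res) branch
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.out_low = nn.Conv2d(ch, 3, 3, padding=1)
        self.out_high = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x_high, x_low):
        h_low = F.silu(self.low_net(x_low))
        h_high = F.silu(self.high_net(x_high))
        # Nest low-res features inside the high-res path by upsampling.
        h_up = F.interpolate(h_low, size=x_high.shape[-2:], mode="nearest")
        h_high = F.silu(self.fuse(torch.cat([h_high, h_up], dim=1)))
        # Predict noise at every resolution simultaneously.
        return self.out_high(h_high), self.out_low(h_low)


net = ToyNestedUNet()
x_high = torch.randn(2, 3, 64, 64)      # noisy high-res input
x_low = F.avg_pool2d(x_high, 4)         # nested low-res view of the same image
eps_high, eps_low = net(x_high, x_low)  # joint denoising targets
```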

What Matters When Building Vision-Language Models?

Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

Affiliation: Hugging Face, Sorbonne Université

Abstract: The field of vision-language models (VLMs) has gained significant momentum due to advancements in large language models and vision transformers. Despite extensive research, many critical design decisions in VLM development are made without thorough justification, which can hinder progress by obscuring which design choices actually enhance model performance. To address this, the authors conduct comprehensive experiments focusing on pre-trained models, architecture choices, data, and training methods. They introduce Idefics2, an efficient foundational VLM with 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category on various multimodal benchmarks, often rivaling models that are four times its size. The authors release the Idefics2 model (base, instructed, and chat versions) along with the datasets used for its training.
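
Since the Idefics2 checkpoints are publicly released, a minimal usage sketch with Hugging Face transformers looks roughly like the following; the image URL is a placeholder, and the exact processor usage should be double-checked against the model card.

```python
# Minimal sketch of querying the released Idefics2 checkpoint via Hugging
# Face transformers. The image URL is a placeholder; verify processor
# details against the model card before relying on this.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

# Placeholder URL: substitute any real image.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What is in this image?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```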

TAPTR: Tracking Any Point with Transformers as Detection

Authors: Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang

Affiliations: South China University of Technology, International Digital Economy Academy (IDEA), The Hong Kong University of Science and Technology, Dept. of CST., BNRist Center, Institute for AI, Tsinghua University

Project Page

Abstract: This paper introduces TAPTR, a framework for Tracking Any Point with Transformers. TAPTR is inspired by object detection and tracking frameworks, particularly DETR-like algorithms. In this approach, each tracking point is represented as a point query, which includes positional and content features. These queries are updated layer by layer in each video frame, with visibility predicted based on their content features. The model leverages self-attention across the temporal dimension to exchange information between queries of the same tracking point. TAPTR incorporates elements from optical flow models, such as cost volume, and uses straightforward techniques to maintain long-term temporal information while reducing feature drifting. The framework achieves state-of-the-art performance on various TAP datasets and demonstrates faster inference speeds.
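
A toy sketch of the central abstraction, a point query with position and content features refined by temporal self-attention, might look like this; dimensions and update rules are illustrative only, not the TAPTR architecture.

```python
# Toy sketch of TAPTR's abstraction: each tracked point is a query with a
# position and a content feature, refined per layer and exchanged across
# time via self-attention. Illustrative only.
import torch
import torch.nn as nn


class ToyPointQueryLayer(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                   batch_first=True)
        self.to_offset = nn.Linear(dim, 2)      # positional refinement per layer
        self.to_visibility = nn.Linear(dim, 1)  # visibility from content feature

    def forward(self, content, positions):
        # content: (P, T, dim) features for P point queries over T frames
        # positions: (P, T, 2) current point location estimates
        # Exchange information along the temporal axis, per point query.
        attended, _ = self.temporal_attn(content, content, content)
        content = content + attended
        positions = positions + self.to_offset(content)           # refine
        visibility = torch.sigmoid(self.to_visibility(content))   # (P, T, 1)
        return content, positions, visibility


layer = ToyPointQueryLayer()
content, pos, vis = layer(torch.randn(8, 16, 128), torch.rand(8, 16, 2))
```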

MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images

Authors: Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, Jianfei Cai

Affiliations: Monash University, ETH Zurich, University of Tübingen (Tübingen AI Center), VGG (University of Oxford), Microsoft, Nanyang Technological University

Project Page

Abstract: This paper presents MVSplat, a model designed for efficient 3D Gaussian splatting from sparse multi-view images. MVSplat predicts clean, feed-forward 3D Gaussians by using a cost volume representation built via plane sweeping. This cost volume captures cross-view feature similarities, providing crucial geometric information for accurate Gaussian center localization. The model jointly learns parameters for Gaussian primitives along with their centers, relying solely on photometric supervision. MVSplat demonstrates superior performance on large-scale benchmarks like RealEstate10K and ACID, achieving state-of-the-art results with 10× fewer parameters and over 2× faster inference speed compared to the latest method, pixelSplat. Additionally, MVSplat offers enhanced appearance and geometry quality and improved cross-dataset generalization.
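
The plane-sweep cost volume at the heart of this pipeline can be sketched as follows; warp_to_reference() is a stand-in for a real homography warp that would use the camera parameters.

```python
# Schematic sketch of a plane-sweep cost volume, the geometric cue MVSplat
# uses to localize Gaussian centers. warp_to_reference() is a stand-in.
import torch
import torch.nn.functional as F


def warp_to_reference(src_feat, depth):
    """Stand-in: warp source-view features onto the reference view assuming
    a fronto-parallel plane at `depth`. A real implementation applies the
    plane-induced homography from the camera intrinsics/extrinsics."""
    return src_feat  # identity placeholder


def plane_sweep_cost_volume(ref_feat, src_feat, depths):
    # ref_feat, src_feat: (B, C, H, W) features from two views
    costs = []
    for d in depths:
        warped = warp_to_reference(src_feat, d)
        # Cross-view similarity at this depth hypothesis (dot product over C).
        costs.append((ref_feat * warped).sum(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)  # (B, num_depths, H, W)


ref = torch.randn(1, 32, 64, 64)
src = torch.randn(1, 32, 64, 64)
volume = plane_sweep_cost_volume(ref, src, depths=torch.linspace(1.0, 10.0, 32))
depth_prob = F.softmax(volume, dim=1)  # per-pixel depth distribution
```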

Achieving Human-Level Competitive Robot Table Tennis

Authors: David B. D’Ambrosio, Saminda Abeyruwan, Laura Graesser, Atil Iscen, Heni Ben Amor, Alex Bewley, Barney J. Reed, Krista Reymann, Leila Takayama, Yuval Tassa, Krzysztof Choromanski, Erwin Coumans, Deepali Jain, Navdeep Jaitly, Natasha Jaques, Satoshi Kataoka, Yuheng Kuang, Nevena Lazic, Reza Mahjourian, Sherry Moore, Kenneth Oslund, Anish Shankar, Vikas Sindhwani, Vincent Vanhoucke, Grace Vesom, Peng Xu, Pannag R. Sanketi

Affiliation: Google DeepMind

Abstract: This paper presents a significant advancement in robotics by developing a robot agent that achieves amateur human-level performance in competitive table tennis. The robot features a 6-DoF ABB IRB 1100 arm mounted on two Festo linear gantries, providing extensive movement capabilities. Key contributions of the work include:

  1. Hierarchical and Modular Policy Architecture: This comprises low-level controllers with detailed skill descriptors to bridge the simulation-to-real gap and a high-level controller for skill selection (a schematic sketch follows this list).
  2. Zero-Shot Sim-to-Real Techniques: An iterative approach for defining task distributions grounded in real-world scenarios and automatic curriculum development.
  3. Real-Time Adaptation: Techniques for adapting to unseen opponents in real-time.
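
Below is a schematic sketch of such a hierarchy: a high-level controller consults the low-level skills' descriptors and delegates to the chosen policy. All names and the selection rule are illustrative stand-ins, not Google DeepMind's implementation.

```python
# Schematic sketch of a hierarchical policy: a high-level controller picks
# among low-level skills via their descriptors. Illustrative stand-ins only.
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Skill:
    name: str
    descriptor: dict  # e.g., estimated return rates per incoming-ball style
    policy: Callable[[np.ndarray], np.ndarray]  # observation -> joint command


def placeholder_policy(obs: np.ndarray) -> np.ndarray:
    return np.zeros(8)  # stand-in low-level controller output


skills = [
    Skill("forehand_topspin", {"topspin": 0.9, "underspin": 0.3}, placeholder_policy),
    Skill("backhand_push", {"topspin": 0.4, "underspin": 0.8}, placeholder_policy),
]


def high_level_controller(obs: np.ndarray, ball_style: str) -> np.ndarray:
    # Pick the skill whose descriptor predicts the best outcome for the
    # incoming ball, then delegate control to its low-level policy.
    best = max(skills, key=lambda s: s.descriptor.get(ball_style, 0.0))
    return best.policy(obs)


command = high_level_controller(np.zeros(16), ball_style="topspin")
```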

Performance was evaluated through 29 matches between the robot and human players of varying skill levels. The robot won 45% of the matches, demonstrating strong performance against beginner and intermediate players. The robot was less successful against the most advanced players but still showcased impressive amateur-level capabilities.

Videos of the matches: Available here

Searching for Best Practices in Retrieval-Augmented Generation

Authors: Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang

Affiliation: School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing

Abstract: Retrieval-augmented generation (RAG) techniques are effective in integrating up-to-date information, reducing hallucinations, and improving response quality, especially in specialized fields. Despite their advantages, existing RAG approaches often face challenges related to implementation complexity and extended response times. This paper investigates various RAG methods and their combinations to determine optimal practices (a minimal pipeline sketch follows the list below). Key findings include:

  • Optimal Strategies: Several strategies for deploying RAG are proposed to balance performance with efficiency.
  • Multimodal Retrieval: Techniques for integrating multimodal retrieval can significantly improve question-answering capabilities concerning visual inputs and accelerate the generation of multimodal content using a “retrieval as generation” approach.
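
As noted above, here is a minimal, self-contained sketch of the retrieve-then-generate pattern the paper studies. The hashed bag-of-words embedding and the prompt-returning generate() are deliberate simplifications; a real deployment would use a trained embedding model, a vector index, and an LLM call.

```python
# Minimal, self-contained RAG sketch: hashed bag-of-words retrieval plus
# prompt assembly. Real pipelines would swap in a trained embedder, a
# vector index, and an LLM call for generate().
import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size count vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v


docs = [
    "RAG retrieves documents and conditions generation on them.",
    "Plane sweeping builds cost volumes for multi-view stereo.",
    "Guitar tablature maps notes to strings and frets.",
]
doc_vecs = np.stack([embed(d) for d in docs])


def retrieve(query: str, k: int = 2) -> list:
    sims = doc_vecs @ embed(query)  # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-sims)[:k]]


def generate(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return prompt  # stand-in: pass this prompt to an LLM in a real pipeline


print(generate("How does retrieval-augmented generation work?"))
```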

Compositional Generative Modeling: A Single Model is Not All You Need

Authors: Yilun Du, Leslie Kaelbling

Abstract: Large, monolithic generative models, trained on vast amounts of data, have become prevalent in AI research. This paper proposes an alternative approach—constructing generative systems by composing smaller, specialized models. Key points include:

  • Data Efficiency: Compositional generative models can learn distributions more efficiently, improving generalization to parts of the data distribution not seen during training.
  • Flexibility: This approach allows for the creation and programming of new generative models for tasks that were not anticipated at training time.
  • Discovery of Components: Compositional models can reveal and utilize generative components discovered from data.

The study highlights the advantages of a compositional approach in terms of data efficiency, task adaptability, and the ability to uncover useful model components.
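
A tiny worked example of composition: score functions of independently trained models add, and Langevin dynamics on the summed score samples from the product distribution. Two Gaussians stand in for specialist models here.

```python
# Toy illustration of composing generative models: sum two score functions
# and run Langevin dynamics to sample from the product distribution. The
# two Gaussians stand in for independently trained specialist models.
import numpy as np

rng = np.random.default_rng(0)


def score_gaussian(x, mu, sigma):
    """Score (gradient of log-density) of an isotropic Gaussian."""
    return -(x - mu) / sigma**2


def sample_composition(steps=500, eps=0.01):
    x = rng.normal(size=2)
    for _ in range(steps):
        # Composition: scores add, so samples concentrate where BOTH
        # component models assign high probability.
        s = (score_gaussian(x, mu=np.array([1.0, 0.0]), sigma=1.0)
             + score_gaussian(x, mu=np.array([-1.0, 0.0]), sigma=1.0))
        x = x + eps * s + np.sqrt(2 * eps) * rng.normal(size=2)
    return x


print(sample_composition())  # samples concentrate near (0, 0), the product's mode
```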

Audio Papers

MIDI-TO-TAB: Guitar Tablature Inference via Masked Language Modeling

Authors: Drew Edwards, Xavier Riley, Pedro Sarmento, Simon Dixon

Affiliation: Centre for Digital Music, Queen Mary University of London, UK

Abstract: Generating guitar tablatures involves assigning each musical note to a specific string and fret on a guitar, which can be complex due to multiple possible assignments per pitch. Traditional methods often use constraint-based dynamic programming to minimize hand movement costs. This paper introduces a novel deep learning approach for this task:

  • Model: An encoder-decoder Transformer model is employed within a masked language modeling framework to infer guitar string and fret assignments.
  • Training: The model is pre-trained on DadaGP, a dataset with over 25,000 tablatures, and fine-tuned on professionally transcribed guitar performances.
  • Evaluation: A user study with guitarists evaluates the playability of tablatures produced by the system. The study shows that this new approach significantly outperforms existing algorithms in terms of generating high-quality, playable tablature.

This work highlights the effectiveness of using advanced deep learning techniques for symbolic guitar tablature estimation.
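
To see why this task needs a model at all, the helper below enumerates the multiple (string, fret) candidates a single MIDI pitch admits under standard tuning; the paper's masked-LM decoding itself is not reproduced here.

```python
# Illustrative helper for the combinatorial fact above: one MIDI pitch maps
# to several (string, fret) positions, which is why tablature inference
# needs a model rather than a lookup table.
STANDARD_TUNING = {6: 40, 5: 45, 4: 50, 3: 55, 2: 59, 1: 64}  # string -> open MIDI


def candidate_positions(midi_pitch: int, max_fret: int = 20):
    """All playable (string, fret) pairs for one pitch; the model's job is
    to choose among these given the musical context."""
    return [
        (string, midi_pitch - open_pitch)
        for string, open_pitch in STANDARD_TUNING.items()
        if 0 <= midi_pitch - open_pitch <= max_fret
    ]


print(candidate_positions(64))  # E4 is playable on five of the six strings
```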


TF-LOCOFORMER: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Authors: Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

Affiliation: Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA; Waseda University, Japan

Abstract: Time-frequency (TF) domain models have achieved significant success in speech separation. Traditional models often rely on RNNs, which can limit parallelization and scalability. This paper introduces TF-Locoformer, a novel Transformer-based model designed to overcome these limitations:

  • Model Architecture: TF-Locoformer employs feedforward networks (FFNs) with convolutional layers to capture local information, while self-attention mechanisms focus on global patterns. Two FFNs are positioned before and after the self-attention layers to enhance local modeling.
  • Normalization: A new normalization technique is proposed for TF-domain dual-path models.
  • Performance: Experiments demonstrate that TF-Locoformer meets or surpasses state-of-the-art performance in various benchmarks for speech separation and enhancement, all without relying on RNNs.

This approach combines the strengths of Transformer architectures with effective local modeling techniques to achieve high-fidelity speech processing.
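
A schematic PyTorch sketch of the block structure described above, with convolutional feedforward modules around self-attention, might look like this; sizes and the normalization choice are illustrative, not the paper's exact configuration.

```python
# Schematic sketch of a TF-Locoformer-style block: convolutional FFNs before
# and after self-attention, so convolutions capture local structure and
# attention captures global structure. Illustrative sizes only.
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    """Feedforward module with a depthwise convolution for local modeling."""
    def __init__(self, dim: int, hidden: int = 256, kernel: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, 1),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2, groups=hidden),
            nn.SiLU(),
            nn.Conv1d(hidden, dim, 1),
        )

    def forward(self, x):  # x: (B, T, dim)
        return x + self.net(x.transpose(1, 2)).transpose(1, 2)


class LocoformerBlock(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.ffn_pre = ConvFFN(dim)
        self.norm = nn.LayerNorm(dim)  # stand-in for the paper's normalization
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_post = ConvFFN(dim)

    def forward(self, x):  # x: (B, frames, dim), one dual-path direction
        x = self.ffn_pre(x)
        a, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        x = x + a          # global modeling via self-attention
        return self.ffn_post(x)


block = LocoformerBlock()
out = block(torch.randn(2, 100, 128))
```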

GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

Authors: Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji

Affiliations: Department of Intelligence and Information, Seoul National University, South Korea; Sony AI, Tokyo, Japan; Sony Europe B.V., Stuttgart, Germany; Sony Group Corporation, Tokyo, Japan

Abstract: GRAFX is an open-source library developed for handling audio processing graphs within the PyTorch framework. Key features and contributions of GRAFX include:

  • Efficient Parallel Computation: The library supports efficient parallel computation of input graphs, signals, and processor parameters on GPUs.
  • Example Use Case: Demonstrates its utility in a music mixing scenario where the parameters of every differentiable processor in a large audio graph are optimized via gradient descent.
  • Availability: The code for GRAFX is available at https://github.com/sh-lee97/grafx.

GRAFX aims to streamline the development and optimization of complex audio processing systems by leveraging PyTorch's capabilities.
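
The workflow GRAFX targets can be illustrated generically: treat each processor's parameters as tensors and fit a small processor chain to a reference signal by gradient descent. This sketch deliberately uses plain PyTorch rather than the GRAFX API; see the repository above for the library's actual graph abstractions.

```python
# Generic sketch of the workflow GRAFX targets: differentiable audio
# processors whose parameters are fit by gradient descent. This does NOT
# use the GRAFX API.
import torch


def gain(x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Differentiable gain 'processor'."""
    return g * x


def one_pole_lowpass(x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Crude differentiable smoothing filter (exponential moving average)."""
    alpha = torch.sigmoid(a)  # keep the coefficient in (0, 1)
    ys, y = [], torch.zeros(())
    for t in range(x.shape[0]):
        y = alpha * x[t] + (1 - alpha) * y
        ys.append(y)
    return torch.stack(ys)


x = torch.randn(256)       # dry input signal
target = 0.5 * x           # reference "mix" to match

g = torch.tensor(1.0, requires_grad=True)
a = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([g, a], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    y = one_pole_lowpass(gain(x, g), a)  # two-node processor chain
    loss = torch.mean((y - target) ** 2)
    loss.backward()
    opt.step()

print(float(g), float(loss))  # gain approaches ~0.5 as the loss falls
```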



Thank you for your attention. Subscribe now to stay informed and join the conversation!

About us:

We have an amazing team of AI engineers with:

  • A blend of industrial experience and a strong academic track record
  • 300+ research publications and 150+ commercial projects
  • Millions of dollars saved through our ML/DL solutions
  • An exceptional work culture, ensuring satisfaction with both the process and results

We are here to help you maximize efficiency with your available resources.

Reach out when:

  • You want to identify which daily tasks can be automated
  • You need to understand the benefits of AI and how to avoid excessive cloud costs while maintaining data privacy
  • You’d like to optimize current pipelines and computational resource distribution
  • You’re unsure how to choose the best DL model for your use case
  • You know how, but struggle to achieve the required performance and cost efficiency

Have doubts or questions about AI in your business? Get in touch!
