AI Newsletter
Ievgen Gorovyi
Founder & CEO @ It-Jim | AI Expert | PhD, Computer Vision | GenAI | AI Consulting
Another week, another batch of cool updates in the world of AI!
AI Images Are So Real Now!
AI-generated images have reached a new level of realism, as demonstrated by a recent example from the Flux model. The images are so lifelike that even minor details like facial features and clothing look convincingly real. The only telltale signs are small errors, such as gibberish text or slightly off details on objects like microphones.
Zico Kolter Joins OpenAI’s Board
OpenAI has appointed Zico Kolter, a leading expert in AI safety and alignment, to its Board of Directors. Kolter, who heads the Machine Learning Department at Carnegie Mellon University, brings extensive expertise in developing robust machine learning models and innovative AI safety methods. He will also serve on the Safety and Security Committee, contributing to critical decisions on OpenAI's projects.
GPT-4o's New Safety Measures
OpenAI released a detailed system card for GPT-4o, outlining the safety measures implemented before its release. The report highlights extensive safety work, including external red teaming, risk evaluations, and various mitigations. The evaluation scorecard rates cybersecurity, biological threats, and model autonomy as low risk, and persuasion as medium risk. OpenAI also addressed concerns about emotional attachment to GPT-4o's voice mode, noting that some users may form strong connections with the AI.
OpenAI's Failure to Deliver
OpenAI has faced criticism for frequently announcing exciting developments but failing to deliver them to the public. Despite promises, features like Sora access and GPT-4o's advanced voice mode remain largely unavailable. Similarly, their new ChatGPT search tool, intended to compete with Google, has yet to be released. As Dev Day approaches on October 1st in San Francisco, OpenAI has tempered expectations, indicating that no major announcements or new features for ChatGPT are expected.
Elon Sues OpenAI (Again)
Elon Musk has filed a new lawsuit against OpenAI, claiming that the organization misled him into co-founding a nonprofit by promising it would prioritize safety and transparency over profits. This follows an earlier lawsuit, which Musk later dropped, that accused OpenAI of straying from its original mission. The new lawsuit alleges that OpenAI manipulated Musk into investing in a venture that later shifted towards profit-making, contrary to its initial nonprofit ethos.
Nvidia Scraping YouTube
Nvidia has found itself in controversy this week after a leaked document revealed that the company has been scraping vast amounts of YouTube videos—equivalent to a human lifetime’s worth each day—to train its AI models. The internal communications and documents suggest that this data collection is part of Nvidia's efforts to develop a new video foundation model.
Character AI Shakeup
Character AI is undergoing a significant transition as the company partners with Google, resulting in the departure of its co-founders. Noam Shazeer, the co-founder and CEO of Character AI, is returning to Google to work with DeepMind after originally leaving Google in 2021 to start Character AI. While this move raises questions about the future direction of Character AI, the company isn’t closing shop. The General Counsel of Character AI will take over as interim CEO, and the majority of the staff will remain.
Introducing Qwen2-Math
The Qwen Team has released a series of math-specific large language models under the Qwen2 series, designed to excel at solving arithmetic and mathematical problems. The Qwen2-Math and Qwen2-Math-Instruct models, available in sizes ranging from 1.5B to 72B parameters, have shown remarkable performance, surpassing both open-source and closed-source models like GPT-4o on various mathematical benchmarks. These models were fine-tuned on a specialized mathematical corpus and evaluated across both English and Chinese math benchmarks, achieving state-of-the-art results.
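As a rough illustration of how accuracy on such math benchmarks is typically computed (extract the model's final numeric answer, then exact-match it against the reference), here is a minimal scoring sketch. The regex and answer format are assumptions for illustration, not Qwen2-Math's actual evaluation harness.

```python
import re

def extract_final_answer(completion):
    """Take the last number-like token in the completion as the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def accuracy(completions, references):
    """Exact-match accuracy between extracted answers and reference answers."""
    correct = sum(
        extract_final_answer(c) == r for c, r in zip(completions, references)
    )
    return correct / len(references)

# Two made-up GSM8K-style completions: one correct, one wrong.
outs = ["Step 1: 3 * 4 = 12. The answer is 12.", "So the result is 8."]
refs = ["12", "7"]
print(accuracy(outs, refs))  # → 0.5
```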
OpusClip's New Feature: Clip Anything
OpusClip has just rolled out an exciting new feature called "Clip Anything," designed to enhance video editing with AI-driven precision. This update allows users to create short-form clips from any moment in a long-form video by leveraging advanced visual, audio, and sentiment analysis. The tool analyzes each frame to identify objects, scenes, actions, emotions, and text, enabling you to generate clips based on natural language prompts.
Reddit to Test AI-Powered Search Results
Reddit is preparing to roll out a new feature that will integrate AI into its search functionality. According to Reddit CEO Steve Huffman, the platform will soon test AI-generated summaries at the top of search result pages. This feature aims to provide users with concise content summaries and recommendations, helping them explore topics in more depth and discover new communities on Reddit. The AI-powered search results will be driven by both first-party and third-party technologies, with the experiment set to begin later this year.
Google Announces Gemini AI-Powered Streaming Device
Google is set to launch a new streaming device powered by Gemini AI, designed to replace the Chromecast and compete with Roku, Amazon Fire Stick, and Apple TV. The device aims to enhance content discovery by leveraging Google AI to provide personalized content recommendations tailored to user preferences. While current platforms like Hulu, Amazon Prime, Netflix, and YouTube already use recommendation algorithms to suggest content, Google’s use of Gemini AI may offer improvements through more advanced generative techniques, potentially providing deeper insights and more nuanced recommendations than traditional algorithms.
New Robot: Figure 02
Figure Robotics revealed their Figure 02 humanoid robot, featuring a design similar to Tesla's Optimus. Equipped with six cameras for advanced AI-driven vision and speech-to-speech reasoning, the Figure 02 can interact with its environment and respond to commands. It is already being used on BMW production lines.
New Noteworthy Papers
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery Authors: Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, David Ha (equal contribution: Chris Lu, Cong Lu, Robert Tjarko Lange; equal advising: Jeff Clune, David Ha)
Affiliations: Sakana AI, FLAIR - University of Oxford, University of British Columbia, Vector Institute, Canada CIFAR AI Chair
Abstract: One of the major challenges in artificial general intelligence is creating agents capable of conducting scientific research independently. While current models assist with tasks like brainstorming, coding, and prediction, they only support part of the scientific process. This paper introduces "The AI Scientist," a framework for fully automated scientific discovery using frontier large language models (LLMs). This system autonomously generates research ideas, writes code, executes experiments, visualizes results, drafts scientific papers, and conducts a simulated review process for evaluation. The approach allows for iterative development of ideas and contributes to an expanding knowledge archive, mimicking the human scientific community. The framework is demonstrated across three machine learning subfields: diffusion modeling, transformer-based language modeling, and learning dynamics. The cost of generating each paper is under $15, showcasing the potential for democratizing research and accelerating scientific progress. The framework includes an automated reviewer that performs near-human quality evaluations, with generated papers meeting acceptance criteria at a top machine learning conference. This work represents a significant step toward fully automated scientific discovery and innovation. The code is available at: GitHub - Sakana AI/AI-Scientist.
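The idea-to-paper loop the abstract describes can be sketched as below. Every llm_* function is a stub standing in for a frontier-LLM call; the stubs, scores, and threshold are invented for illustration and are not the actual AI Scientist code.

```python
# Hypothetical sketch of the AI Scientist's autonomous research loop.

def llm_generate_idea(archive):
    return f"idea-{len(archive) + 1}"          # stub: LLM proposes a new idea

def llm_run_experiment(idea):
    return {"idea": idea, "score": 0.8}        # stub: LLM writes & runs code

def llm_write_paper(result):
    return f"Draft paper on {result['idea']}"  # stub: LLM drafts the manuscript

def llm_review(paper):
    return 6.0                                 # stub: simulated reviewer score

def ai_scientist_loop(n_iterations=3, accept_threshold=5.0):
    archive, accepted = [], []
    for _ in range(n_iterations):
        idea = llm_generate_idea(archive)      # propose a research idea
        result = llm_run_experiment(idea)      # execute experiments
        paper = llm_write_paper(result)        # draft the paper
        score = llm_review(paper)              # simulated peer review
        archive.append(idea)                   # grow the knowledge archive
        if score >= accept_threshold:
            accepted.append(paper)
    return accepted

print(len(ai_scientist_loop()))  # → 3
```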
Matryoshka Diffusion Models Authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly
Affiliation: Apple
Abstract: Diffusion models have become a leading approach for generating high-quality images and videos, but training these high-dimensional models is challenging due to computational and optimization constraints. Existing techniques typically involve either cascaded models in pixel space or downsampled latent spaces using auto-encoders. This paper introduces Matryoshka Diffusion Models (MDM), an end-to-end framework designed for high-resolution image and video synthesis. MDM employs a diffusion process that denoises inputs across multiple resolutions simultaneously, utilizing a NestedUNet architecture where smaller-scale features are nested within larger-scale ones. The framework supports a progressive training schedule from lower to higher resolutions, improving optimization for high-resolution outputs. MDM's effectiveness is demonstrated across several benchmarks, including class-conditioned image generation, high-resolution text-to-image synthesis, and text-to-video generation. Notably, MDM can train a single pixel-space model at resolutions up to 1024 × 1024 pixels, showcasing strong zero-shot generalization with the CC12M dataset, which includes 12 million images.
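A toy sketch of the multi-resolution idea — denoising a coarse version of the signal and feeding its upsampled estimate back into the fine branch. The 1-D signals, the averaging "denoiser," and the blending weight are all stand-ins for the NestedUNet, invented for illustration.

```python
import numpy as np

def downsample(x):
    return x.reshape(-1, 2).mean(axis=1)   # 2x average pooling

def upsample(x):
    return np.repeat(x, 2)                  # nearest-neighbour 2x upsampling

def joint_denoise_step(x_fine, smoothing=0.5):
    """One nested step: denoise the coarse scale, blend it back into the fine scale."""
    x_coarse = downsample(x_fine)
    coarse_est = np.convolve(x_coarse, np.ones(3) / 3, mode="same")
    # The small-scale (coarse) estimate guides the large-scale (fine) signal,
    # mimicking how smaller-scale features are nested inside larger-scale ones.
    return (1 - smoothing) * x_fine + smoothing * upsample(coarse_est)

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 2 * np.pi, 16))
noisy = clean + 0.3 * rng.standard_normal(16)
denoised = joint_denoise_step(noisy)
print(noisy.shape, denoised.shape)  # → (16,) (16,)
```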
What matters when building vision-language models? Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
Affiliation: Hugging Face, Sorbonne Université
Abstract: The field of vision-language models (VLMs) has gained significant momentum due to advancements in large language models and vision transformers. Despite extensive research, many critical design decisions in VLM development are made without thorough justification, which can hinder progress by obscuring which design choices actually enhance model performance. To address this, the authors conduct comprehensive experiments focusing on pre-trained models, architecture choices, data, and training methods. They introduce Idefics2, an efficient foundational VLM with 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category on various multimodal benchmarks, often rivaling models that are four times its size. The authors release the Idefics2 model (base, instructed, and chat versions) along with the datasets used for its training.
TAPTR: Tracking Any Point with Transformers Authors: Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang
Affiliations: South China University of Technology, International Digital Economy Academy (IDEA), The Hong Kong University of Science and Technology, Dept. of CST., BNRist Center, Institute for AI, Tsinghua University
Abstract: This paper introduces TAPTR, a framework for Tracking Any Point with Transformers. TAPTR is inspired by object detection and tracking frameworks, particularly DETR-like algorithms. In this approach, each tracking point is represented as a point query, which includes positional and content features. These queries are updated layer by layer in each video frame, with visibility predicted based on their content features. The model leverages self-attention across the temporal dimension to exchange information between queries of the same tracking point. TAPTR incorporates elements from optical flow models, such as cost volume, and uses straightforward techniques to maintain long-term temporal information while reducing feature drifting. The framework achieves state-of-the-art performance on various TAP datasets and demonstrates faster inference speeds.
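The point-query idea can be caricatured as follows: each tracked point carries a content feature, we match it against dense frame features, move the query to the best match, and mark it invisible when the match score is too low. The dot-product similarity, feature sizes, and 0.5 threshold are simplifying assumptions, not TAPTR's actual architecture.

```python
import numpy as np

def track_point(frames, query_feat, vis_threshold=0.5):
    """frames: list of (H, W, C) feature maps; query_feat: (C,) content feature."""
    track = []
    for feat_map in frames:
        _h, w, c = feat_map.shape
        sims = feat_map.reshape(-1, c) @ query_feat   # similarity at every location
        best = int(sims.argmax())                     # update the query's position
        visible = bool(sims[best] > vis_threshold)    # visibility from match score
        track.append(((best // w, best % w), visible))
    return track

# Two 4x4 toy frames with a distinctive feature moving from (0, 0) to (1, 2).
rng = np.random.default_rng(0)
q = np.array([1.0, 0.0, 0.0])
f1 = 0.01 * rng.standard_normal((4, 4, 3)); f1[0, 0] = q
f2 = 0.01 * rng.standard_normal((4, 4, 3)); f2[1, 2] = q
result = track_point([f1, f2], q)
print(result)  # → [((0, 0), True), ((1, 2), True)]
```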
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images Authors: Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, Jianfei Cai
Affiliations: Monash University, ETH Zurich, University of Tübingen (Tübingen AI Center), VGG (University of Oxford), Microsoft, Nanyang Technological University
Abstract: This paper presents MVSplat, a model designed for efficient 3D Gaussian splatting from sparse multi-view images. MVSplat predicts clean, feed-forward 3D Gaussians by using a cost volume representation built via plane sweeping. This cost volume captures cross-view feature similarities, providing crucial geometric information for accurate Gaussian center localization. The model jointly learns parameters for Gaussian primitives along with their centers, relying solely on photometric supervision. MVSplat demonstrates superior performance on large-scale benchmarks like RealEstate10K and ACID, achieving state-of-the-art results with 10× fewer parameters and over 2× faster inference speed compared to the latest method, pixelSplat. Additionally, MVSplat offers enhanced appearance and geometry quality and improved cross-dataset generalization.
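A plane-sweep cost volume can be illustrated in one dimension: for each candidate depth plane, shift the source view's features by the corresponding disparity and correlate them with the reference view. The 1-D "views" and integer disparities are a deliberate simplification of the real multi-view geometry, invented for illustration.

```python
import numpy as np

def cost_volume_1d(ref_feats, src_feats, disparities):
    """ref/src_feats: (W, C) per-pixel features; returns a (D, W) cost volume."""
    volume = []
    for d in disparities:
        shifted = np.roll(src_feats, d, axis=0)            # "warp" to a candidate plane
        volume.append((ref_feats * shifted).sum(axis=1))   # cross-view correlation
    return np.stack(volume)

rng = np.random.default_rng(1)
ref = rng.standard_normal((8, 4))
src = np.roll(ref, -2, axis=0)   # the source view is the reference shifted by 2 pixels
vol = cost_volume_1d(ref, src, disparities=[0, 1, 2, 3])
best_plane = int(vol.sum(axis=1).argmax())
print(vol.shape, best_plane)  # → (4, 8) 2
```

The plane with the strongest aggregate correlation recovers the true shift, which is how the cost volume supplies geometry cues for localizing Gaussian centers.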
Achieving Human Level Competitive Robot Table Tennis Authors: David B. D’Ambrosio, Saminda Abeyruwan, Laura Graesser, Atil Iscen, Heni Ben Amor, Alex Bewley, Barney J. Reed, Krista Reymann, Leila Takayama, Yuval Tassa, Krzysztof Choromanski, Erwin Coumans, Deepali Jain, Navdeep Jaitly, Natasha Jaques, Satoshi Kataoka, Yuheng Kuang, Nevena Lazic, Reza Mahjourian, Sherry Moore, Kenneth Oslund, Anish Shankar, Vikas Sindhwani, Vincent Vanhoucke, Grace Vesom, Peng Xu, Pannag R. Sanketi
Affiliation: Google DeepMind
Abstract: This paper presents a significant advancement in robotics by developing a robot agent that achieves amateur human-level performance in competitive table tennis. The robot features a 6 DoF ABB 1100 arm mounted on two Festo linear gantries, providing extensive movement capabilities. Key contributions of the work include:
Performance was evaluated through 29 matches between the robot and human players of varying skill levels. The robot won 45% of the matches, demonstrating strong performance against beginner and intermediate players. The robot was less successful against the most advanced players but still showcased impressive amateur-level capabilities.
Videos of the matches: Available here
Searching for Best Practices in Retrieval-Augmented Generation Authors: Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Affiliation: School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing
Abstract: Retrieval-augmented generation (RAG) techniques are effective in integrating up-to-date information, reducing hallucinations, and improving response quality, especially in specialized fields. Despite their advantages, existing RAG approaches often face challenges related to implementation complexity and extended response times. This paper investigates various RAG methods and their combinations to determine optimal practices. Key findings include:
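The retrieve-then-generate pattern the paper studies can be sketched minimally. Retrieval here is a bag-of-words overlap score and "generation" is a template; both are illustrative stand-ins for the real embedding models and LLMs the paper benchmarks.

```python
def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query; return the top k."""
    def score(doc):
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / (len(q) or 1)
    return sorted(documents, key=score, reverse=True)[:k]

def generate(query, context):
    # Stand-in for an LLM call conditioned on the retrieved context.
    return f"Answer to '{query}' based on: {context[0]}"

docs = [
    "RAG combines retrieval with generation.",
    "Transformers use self-attention.",
]
ctx = retrieve("what is retrieval augmented generation", docs)
print(generate("what is retrieval augmented generation", ctx))
```

The paper's design space (retriever choice, reranking, chunking, etc.) amounts to swapping out and tuning each of these stages.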
Compositional Generative Modeling: A Single Model Is Not All You Need Authors: Yilun Du, Leslie Kaelbling
Abstract: Large, monolithic generative models, trained on vast amounts of data, have become prevalent in AI research. This paper proposes an alternative approach—constructing generative systems by composing smaller, specialized models. Key points include:
The study highlights the advantages of a compositional approach in terms of data efficiency, task adaptability, and the ability to uncover useful model components.
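A closed-form toy case of composing two specialist models: the product of two Gaussian "experts" is itself Gaussian, with a precision-weighted mean and a tighter variance than either expert alone. This is only a 1-D illustration of the compositional principle, not the paper's method.

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """Mean and variance of the (renormalized) product of two 1-D Gaussian densities."""
    prec = 1.0 / var1 + 1.0 / var2               # precisions add
    mean = (mu1 / var1 + mu2 / var2) / prec      # precision-weighted mean
    return mean, 1.0 / prec

# Expert A believes x ≈ 0, expert B believes x ≈ 4, with equal confidence:
# the composed model concentrates between them, tighter than either expert.
print(product_of_gaussians(0.0, 1.0, 4.0, 1.0))  # → (2.0, 0.5)
```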
Audio papers:
Authors: Drew Edwards, Xavier Riley, Pedro Sarmento, Simon Dixon
Affiliation: Centre for Digital Music, Queen Mary University of London, UK
Abstract: Generating guitar tablatures involves assigning each musical note to a specific string and fret on a guitar, which can be complex due to multiple possible assignments per pitch. Traditional methods often use constraint-based dynamic programming to minimize hand movement costs. This paper introduces a novel deep learning approach for this task:
This work highlights the effectiveness of using advanced deep learning techniques for symbolic guitar tablature estimation.
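The constraint-based dynamic programming baseline the abstract mentions can be sketched as a Viterbi-style search: each pitch has several (string, fret) candidates, and we pick the sequence minimizing total fret-hand movement. The candidate lists and the |Δfret| cost are simplifying assumptions for illustration.

```python
def best_tablature(candidates):
    """candidates: list (one per note) of (string, fret) options for that pitch."""
    # dp maps each current option to (total movement cost, path so far).
    dp = {opt: (0, [opt]) for opt in candidates[0]}
    for options in candidates[1:]:
        new_dp = {}
        for opt in options:
            cost, path = min(
                (prev_cost + abs(opt[1] - prev[1]), prev_path)
                for prev, (prev_cost, prev_path) in dp.items()
            )
            new_dp[opt] = (cost, path + [opt])
        dp = new_dp
    return min(dp.values())  # (total movement, chosen (string, fret) sequence)

# The same pitch can sit on different strings; the DP keeps frets close together.
notes = [
    [(6, 5), (5, 0)],   # two candidate positions for note 1
    [(5, 7), (4, 2)],   # note 2
    [(5, 9), (4, 4)],   # note 3
]
cost, tab = best_tablature(notes)
print(cost, tab)
```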
TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement Authors: Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux
Affiliation: Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA; Waseda University, Japan
Abstract: Time-frequency (TF) domain models have achieved significant success in speech separation. Traditional models often rely on RNNs, which can limit parallelization and scalability. This paper introduces TF-Locoformer, a novel Transformer-based model designed to overcome these limitations:
This approach combines the strengths of Transformer architectures with effective local modeling techniques to achieve high-fidelity speech processing.
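The core idea — pairing global attention with convolutional local modeling — can be caricatured in numpy: attention mixes all time-frequency frames, then a small convolution (instead of a plain position-wise MLP) mixes each frame with its neighbours. The single-layer structure, kernel size, and unnormalized attention are simplifications invented for illustration, not the model's actual architecture.

```python
import numpy as np

def conv_feedforward(x, kernel):
    """x: (T, C); kernel: (K,) depthwise kernel shared across channels."""
    out = np.empty_like(x)
    for c in range(x.shape[1]):
        out[:, c] = np.convolve(x[:, c], kernel, mode="same")  # local mixing
    return np.maximum(out, 0.0)  # ReLU-style nonlinearity

def locoformer_block(frames, kernel):
    # Global mixing via softmax self-attention over frames...
    attn = frames @ frames.T
    weights = np.exp(attn - attn.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    mixed = weights @ frames
    # ...followed by convolutional local modeling instead of a plain MLP.
    return frames + conv_feedforward(mixed, kernel)

x = np.random.default_rng(2).standard_normal((10, 4))
y = locoformer_block(x, kernel=np.array([0.25, 0.5, 0.25]))
print(y.shape)  # → (10, 4)
```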
GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch Authors: Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji
Affiliations: Department of Intelligence and Information, Seoul National University, South Korea; Sony AI, Tokyo, Japan; Sony Europe B.V., Stuttgart, Germany; Sony Group Corporation, Tokyo, Japan
Abstract: GRAFX is an open-source library developed for handling audio processing graphs within the PyTorch framework. Key features and contributions of GRAFX include:
GRAFX aims to streamline the development and optimization of complex audio processing systems by leveraging PyTorch's capabilities.
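Conceptually, an audio processing graph is a DAG of processors evaluated in topological order, with each node summing its incoming signals. The sketch below illustrates that concept only; the API is invented here, and GRAFX's actual interface is PyTorch-based and batched.

```python
def mix(signals):
    """Elementwise sum of the equal-length signals feeding into a node."""
    return [sum(samples) for samples in zip(*signals)]

def run_graph(nodes, edges, source_signal):
    """nodes: {name: fn}, given in topological order; edges: (src, dst) pairs;
    the reserved name 'in' carries the source signal."""
    upstream = {name: [] for name in nodes}
    for src, dst in edges:
        upstream[dst].append(src)
    outputs = {"in": source_signal}
    last = None
    for name in nodes:
        outputs[name] = nodes[name](mix(outputs[up] for up in upstream[name]))
        last = name
    return outputs[last]

gain = lambda g: (lambda signal: [g * s for s in signal])
out = run_graph(
    nodes={"eq": gain(0.5), "comp": gain(2.0), "master": gain(1.0)},
    edges=[("in", "eq"), ("eq", "comp"), ("comp", "master")],
    source_signal=[1.0, -1.0],
)
print(out)  # → [1.0, -1.0]
```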
Thank you for your attention. Subscribe now to stay informed and join the conversation!
About us:
We also have an amazing team of AI engineers with:
We are here to help you maximize efficiency with your available resources.
Reach out when:
Have doubts or questions about AI in your business? Get in touch!