AI Newsletter
Another week, another round of cool updates in the world of AI!
ChatGPT Voice Tech
New ChatGPT-4o
Microsoft Adds OpenAI as Competitor
OpenAI Endorses Senate Bills
Google's New AI Model: Gemma 2B
New AI Chrome Features
Meta's New AI Chatbot Feature
Perplexity Publishers’ Program Launch
MidJourney 6.1 Update
Stable Fast 3D
A New A-Player in the LLM World
New Model from Runway
Suno Lawsuit Response
Paris Olympics AI Integration
New Noteworthy Papers
Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
Institutions:
Abstract: Research on scaling large language models (LLMs) has primarily concentrated on increasing model parameters and training data size, often neglecting the impact of vocabulary size. This study examines how vocabulary size affects LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We introduce three methods for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. All methods indicate that the optimal vocabulary size correlates with the available compute budget, suggesting that larger models benefit from larger vocabularies. For instance, our analysis predicts that the optimal vocabulary size for Llama2-70B should be at least 216K—seven times larger than its current size of 32K. We empirically validate this by training 3B parameter models across different compute budgets. Using the predicted optimal vocabulary size consistently enhances downstream performance compared to conventional sizes. Increasing the vocabulary size from 32K to 43K improves performance on the ARC-Challenge from 29.1 to 32.0 with the same compute budget. This research highlights the importance of considering both model parameters and vocabulary size for effective scaling.
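To make the "parametric fit of the loss function" method concrete, here is a tiny, self-contained sketch: fit a curve to loss measurements taken at one compute budget across several vocabulary sizes, then take its minimizer as the predicted compute-optimal vocabulary. All numbers and the quadratic-in-log form below are illustrative assumptions, not the paper's actual data or fit.

```python
# Toy sketch of a "parametric fit" for compute-optimal vocabulary size.
# The loss values and the quadratic-in-log(vocab) form are made up for
# illustration; they are not the paper's measurements or functional form.
import numpy as np

vocab_sizes = np.array([8_000, 16_000, 32_000, 64_000, 128_000, 256_000])
losses = np.array([2.31, 2.24, 2.20, 2.18, 2.19, 2.23])  # hypothetical, one fixed FLOPs budget

log_v = np.log(vocab_sizes)
a, b, c = np.polyfit(log_v, losses, deg=2)   # fit loss ~ a*log(V)^2 + b*log(V) + c

optimal_vocab = int(np.exp(-b / (2 * a)))    # minimizer of the fitted curve
print(f"predicted compute-optimal vocabulary size ~ {optimal_vocab:,}")
```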
Authors: Anonymous
Abstract: As language models advance in complexity, ensuring their outputs faithfully reflect the input and maintaining consistency in reasoning are crucial challenges. To address scalability issues in monitoring these aspects, this paper introduces a novel approach using information-theoretic measures to detect manipulated or unfaithful reasoning. We propose a Difference of Entropies (DoE) estimator to quantify the mutual information difference between outputs, offering a principled method to identify low-quality or inconsistent content. Theoretically, we analyze the DoE estimator, proving its incentive-compatibility and deriving scaling laws for f-mutual information based on sample size. Implementing the estimator with an LLM on tasks such as machine translation and peer review datasets from ICLR 2023, we find that the DoE estimator consistently assigns higher scores to unmodified reviews compared to manipulated ones and correlates with BLEU scores. These findings demonstrate the effectiveness of information-theoretic approaches in ensuring the reliability of language model reasoning and highlight their potential for scalable oversight of advanced AI systems.
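As a rough illustration of the difference-of-entropies idea, the sketch below scores an output by the gap between its unconditional and source-conditioned per-token negative log-likelihood under a language model. The scoring callable and the plug-in entropy estimates are assumptions for illustration; the paper's estimator and analysis are more involved.

```python
from typing import Callable

# Sketch of a Difference-of-Entropies (DoE) style score:
# I(X;Y) ~ H(Y) - H(Y|X), with both entropies approximated by the mean
# per-token negative log-likelihood under a language model of your choice.
NLL = Callable[[str, str], float]  # (text, context) -> mean per-token NLL

def doe_score(source: str, output: str, nll: NLL) -> float:
    """Higher scores suggest `output` carries more information about `source`
    (e.g., an unmodified review vs. a manipulated one)."""
    h_y = nll(output, "")              # plug-in estimate of H(Y)
    h_y_given_x = nll(output, source)  # plug-in estimate of H(Y|X)
    return h_y - h_y_given_x           # difference of entropies
```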
Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin
Abstract: The wide-ranging applications of large language models (LLMs) have led to the development of extensive benchmarks to thoroughly test their capabilities. However, these benchmarks often require tens of thousands of examples, making the evaluation process costly and time-consuming. This paper explores methods to reduce the number of evaluations needed to assess LLM performance on key benchmarks. We demonstrate that, for instance, evaluating an LLM on only 100 curated examples can provide an accurate estimate of its performance on MMLU, a widely used multiple-choice QA benchmark with 14K examples. We release tools and smaller versions of popular benchmarks, including Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis shows that these tools and smaller benchmarks are effective in reliably and efficiently reproducing the original evaluation results.
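The core idea, evaluating on a small curated subset and extrapolating, can be sketched in a few lines: score the model on roughly 100 anchor examples and weight each anchor by the number of benchmark items it represents. The released tools use more sophisticated estimators (e.g., IRT-based), so treat this only as a cartoon of the approach; all numbers below are made up.

```python
# Simplified sketch of estimating full-benchmark accuracy from a small
# curated subset: each anchor stands in for a cluster of similar items,
# and the estimate is a cluster-size-weighted average of per-anchor correctness.
import numpy as np

def estimate_accuracy(anchor_correct: np.ndarray, cluster_sizes: np.ndarray) -> float:
    """anchor_correct: 0/1 outcomes on the curated anchor examples.
    cluster_sizes: number of original benchmark items each anchor represents."""
    weights = cluster_sizes / cluster_sizes.sum()
    return float(np.dot(weights, anchor_correct))

# Example with made-up numbers: 100 anchors summarizing a 14K-example benchmark.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=100)   # hypothetical per-anchor model outcomes
sizes = rng.integers(50, 250, size=100)  # hypothetical cluster sizes
print(f"estimated benchmark accuracy ~ {estimate_accuracy(correct, sizes):.3f}")
```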
Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
Abstract: We introduce VideoPoet, a model designed for generating high-quality videos from diverse conditioning signals. VideoPoet uses a decoder-only transformer architecture to handle multimodal inputs, including images, videos, text, and audio. Its training involves two stages: pretraining with a mixture of multimodal generative objectives within an autoregressive Transformer framework, followed by task-specific adaptation. Results showcase VideoPoet’s state-of-the-art performance in zero-shot video generation, particularly in creating high-fidelity motions. For more information, visit the project page.
Authors: Gyeongsik Moon (DGIST; Codec Avatars Lab, Meta), Takaaki Shiratori (Codec Avatars Lab, Meta), Shunsuke Saito (Codec Avatars Lab, Meta)
Abstract: Facial expressions and hand motions are crucial for conveying emotions and interacting effectively. However, many 3D human avatars created from casually captured videos only support body motions, lacking detailed facial expressions and hand movements. We introduce ExAvatar, a system for creating expressive whole-body 3D avatars from short monocular videos. ExAvatar combines a parametric mesh model (SMPL-X) with 3D Gaussian Splatting (3DGS) to address challenges related to limited facial and pose diversity in videos and the absence of 3D observations. Our hybrid approach uses 3D Gaussians as vertices on a mesh with predefined connectivity, enabling animations with novel facial expressions and reducing artifacts in new motions.
Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Authors: Zhenghao Zhang, Junchao Liao, Menghao Li, Long Qin, Weizhi Wang
Abstract: Recent advancements in Diffusion Transformers (DiTs) have shown impressive results in high-quality video generation. However, the control of motion within generated videos has seen limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework designed to integrate textual, visual, and trajectory conditions for video generation. Tora comprises a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes trajectories into hierarchical spacetime motion patches using a 3D video compression network. The MGF incorporates these patches into DiT blocks to ensure video consistency along trajectories. Tora’s architecture aligns with DiT’s scalability, enabling precise control over video dynamics with varying durations, aspect ratios, and resolutions. Extensive experiments showcase Tora’s capability in achieving high motion fidelity and accurately simulating physical movements.
Authors: Prafull Sharma, Varun Jampani, Yuanzhen Li, Dmitry Lagun, Fredo Durand, Bill Freeman, Mark Matthews
Institutions: MIT CSAIL
Abstract: This paper introduces Alchemist, a method for controlling material properties—such as roughness, metallicity, albedo, and transparency—in real images using diffusion models. Leveraging the generative capabilities of text-to-image models known for their photorealism, Alchemist employs scalar values and instructions to adjust these low-level material attributes. To address the scarcity of datasets with controlled material properties, the authors created a synthetic dataset featuring physically-based materials. By fine-tuning a modified pre-trained text-to-image model on this dataset, Alchemist enables precise editing of material properties in real-world images while maintaining other attributes. The model’s potential applications include material editing in Neural Radiance Fields (NeRFs).
Audio:
Authors: Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen
Abstract: Dialogue is a natural mode of human-computer interaction (HCI), but traditional speech language models (SLMs) are constrained by turn-based conversation, lacking real-time interaction capabilities. This paper introduces a novel approach to address this limitation through full duplex modeling (FDM) in interactive speech language models (iSLM). The proposed listening-while-speaking language model (LSLM) integrates both listening and speaking channels in an end-to-end system. LSLM employs a token-based decoder-only text-to-speech (TTS) system for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. The model supports autoregressive generation and real-time turn-taking detection. Three fusion strategies—early fusion, middle fusion, and late fusion—are evaluated, with middle fusion providing the best balance between speech generation and real-time interaction. Experimental results in command-based and voice-based FDM settings demonstrate LSLM’s robustness to noise and responsiveness to diverse instructions. This advancement aims to improve interactive speech dialogue systems, enhancing real-time conversational capabilities.
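A rough sketch of the middle-fusion idea: features from the streaming listening channel are injected into each decoder layer of the speaking channel, rather than only at the input (early fusion) or output (late fusion). Layer sizes, the additive merge, and the assumption of aligned frame rates are illustrative choices, not the paper's exact architecture.

```python
# Sketch of a single decoder layer with "middle fusion" of a listening channel.
import torch
import torch.nn as nn

class MiddleFusionLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.listen_proj = nn.Linear(d_model, d_model)  # projects listening-channel features
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, speak_h: torch.Tensor, listen_h: torch.Tensor) -> torch.Tensor:
        # Causal self-attention over the speaking channel's hidden states.
        T = speak_h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=speak_h.device), diagonal=1)
        attn_out, _ = self.self_attn(speak_h, speak_h, speak_h,
                                     attn_mask=causal, need_weights=False)
        h = self.norm1(speak_h + attn_out)
        # Middle fusion: merge projected listening-channel features at this layer.
        h = h + self.listen_proj(listen_h)
        return self.norm2(h + self.ffn(h))

# Toy usage: batch of 2, 10 frames per channel.
layer = MiddleFusionLayer()
speaking = torch.randn(2, 10, 512)   # hidden states of the speech-generation channel
listening = torch.randn(2, 10, 512)  # streaming SSL features of the incoming audio
print(layer(speaking, listening).shape)  # torch.Size([2, 10, 512])
```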
Authors: Jiwoo Ryu, Hao-Wen Dong, Jongmin Jung, Dasaem Jeong
Abstract: Representing symbolic music with compound tokens, which bundle multiple sub-tokens each representing a specific musical feature, can effectively reduce sequence length. However, predicting all sub-tokens simultaneously often fails to capture their interdependencies. This paper introduces the Nested Music Transformer (NMT), an architecture designed to decode compound tokens autoregressively while managing memory usage efficiently. The NMT features two transformers: a main decoder for the sequence of compound tokens and a sub-decoder for the sub-tokens within each compound token. Experiments demonstrate that the NMT improves performance in terms of perplexity when processing various symbolic music datasets and discrete audio tokens from the MAESTRO dataset, showcasing its effectiveness in music sequence modeling.
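A condensed sketch of the nested decoding scheme: a main decoder runs over the sequence of compound tokens, and a small sub-decoder predicts each compound token's sub-tokens one at a time, conditioned on the main decoder's hidden state. The GRU stand-in sub-decoder, dimensions, and vocabulary sizes are assumptions for illustration, not the paper's configuration.

```python
# Sketch of nested decoding: main decoder over compound tokens,
# sub-decoder over the sub-tokens inside each compound token.
import torch
import torch.nn as nn

class NestedDecoderSketch(nn.Module):
    def __init__(self, n_sub: int = 4, vocab: int = 128, d_model: int = 256):
        super().__init__()
        self.n_sub = n_sub
        # One embedding table per sub-token type (e.g., pitch, duration, ...).
        self.sub_embed = nn.ModuleList([nn.Embedding(vocab, d_model) for _ in range(n_sub)])
        main_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.main_decoder = nn.TransformerEncoder(main_layer, num_layers=2)
        self.sub_decoder = nn.GRU(d_model, d_model, batch_first=True)  # tiny stand-in sub-decoder
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_sub)])

    def forward(self, compound: torch.Tensor) -> list[torch.Tensor]:
        # compound: (batch, seq, n_sub) integer sub-token ids.
        # Each compound token is embedded as the sum of its sub-token embeddings.
        x = sum(emb(compound[..., i]) for i, emb in enumerate(self.sub_embed))
        T = compound.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.main_decoder(x, mask=causal)        # (batch, seq, d_model)
        # Sub-decoder: predict sub-tokens one by one, conditioned on h and on the
        # sub-tokens already decoded at the same position (teacher forcing here).
        b, t, d = h.shape
        state = h.reshape(1, b * t, d).contiguous()  # initial hidden state of the GRU
        inp = h.reshape(b * t, 1, d)                 # first-step input: main hidden state
        logits = []
        for i in range(self.n_sub):
            out, state = self.sub_decoder(inp, state)
            logits.append(self.heads[i](out[:, -1]).view(b, t, -1))
            inp = self.sub_embed[i](compound[..., i]).reshape(b * t, 1, d)
        return logits  # list of (batch, seq, vocab) logits, one per sub-token type

# Toy usage: batch=2, 16 compound tokens, 4 sub-tokens each.
model = NestedDecoderSketch()
compound = torch.randint(0, 128, (2, 16, 4))
print([logit.shape for logit in model(compound)])
```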
Authors: Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
Abstract: Conversational Speech Synthesis (CSS) aims to produce speech that conveys the appropriate speaking style for user-agent interactions. While existing methods use multi-modal context modeling to understand and express empathy, they often require complex network architectures and struggle with limited, scripted datasets that do not fully capture natural conversational styles. To overcome these challenges, we propose GPT-Talker, a novel generative expressive CSS system. GPT-Talker converts multi-turn dialogue history into discrete token sequences and integrates them to create a comprehensive dialogue context. Using GPT, the system predicts a token sequence that encompasses both semantic and stylistic information for the agent's response. The synthesized speech is then produced by an enhanced VITS model. We also introduce the Natural CSS Dataset (NCSSD), a large-scale dataset with 236 hours of naturally recorded conversational speech and dialogues from TV shows in both Chinese and English. Extensive experiments show that GPT-Talker significantly outperforms existing CSS systems in naturalness and expressiveness, as validated by both subjective and objective evaluations.
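A minimal sketch of the first step, flattening multi-turn dialogue history into one discrete token sequence that a GPT-style predictor can consume. The turn structure, special tokens, and interleaving are illustrative assumptions; the paper's actual tokenization of semantic and stylistic information differs.

```python
# Toy serialization of dialogue history into a single discrete token sequence.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str               # "user" or "agent"
    text_tokens: list[int]     # discrete semantic tokens for this turn (hypothetical)
    speech_tokens: list[int]   # discrete style/acoustic tokens for this turn (hypothetical)

SPECIALS = {"<user>": 0, "<agent>": 1, "<text>": 2, "<speech>": 3}  # toy reserved ids

def serialize_dialogue(history: list[Turn]) -> list[int]:
    """Flatten turns into one sequence: speaker tag, text tokens, speech tokens."""
    seq: list[int] = []
    for turn in history:
        seq.append(SPECIALS[f"<{turn.speaker}>"])
        seq.append(SPECIALS["<text>"])
        seq.extend(turn.text_tokens)
        seq.append(SPECIALS["<speech>"])
        seq.extend(turn.speech_tokens)
    return seq

history = [Turn("user", [12, 7, 99], [301, 305]),
           Turn("agent", [42, 8], [310, 311, 312])]
print(serialize_dialogue(history))
```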
Authors: Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon
Abstract: Multi-instrument music transcription converts polyphonic recordings into musical scores, assigning each note to the correct instrument. This task is challenging due to the need for simultaneous identification of multiple instruments and precise transcription of pitch and timing. Additionally, the scarcity of fully annotated data complicates training. This paper presents YourMT3+, an advanced model suite for multi-instrument music transcription, building on the MT3 language token decoding approach. Enhancements include a hierarchical attention transformer in the time-frequency domain and the integration of a mixture of experts. To tackle data limitations, we introduce a new multi-channel decoding method that works with incomplete annotations and propose intra- and cross-stem augmentation techniques for dataset expansion. Experiments show that YourMT3+ can perform direct vocal transcription without needing separate voice separation processes. Benchmark tests across ten public datasets demonstrate the model's competitive or superior performance compared to existing transcription models, with additional testing revealing limitations in current models for pop music recordings.
Thank you for your attention. Subscribe now to stay informed and join the conversation!
About us:
We have an amazing team of experienced AI engineers.
We are here to help you maximize efficiency with your available resources.
Reach out when you have doubts or questions about AI in your business. Get in touch!