AI Newsletter

Another week, another batch of cool updates in the world of AI!

ChatGPT Voice Tech

  • OpenAI is starting to roll out its new Advanced Voice Mode for ChatGPT, which has been compared to Scarlett Johansson's lifelike AI voice in the movie "Her." The feature is currently available to a select group of users, who can experience the new capabilities first-hand: lifelike interactions that support mid-reply interruptions and nuanced speech patterns, such as taking a breath while counting quickly.

Credit: Newschain

New GPT-4o 64K Output Alpha

  • OpenAI is introducing an experimental version of GPT-4o that allows outputs of up to 64K tokens per request, aiming to unlock use cases that benefit from longer, more comprehensive completions. Alpha participants can access the feature by using the model name "gpt-4o-64k-output-alpha." Because long completions are costlier to serve, pricing is set at $6.00 per 1M input tokens and $18.00 per 1M output tokens. A minimal API sketch follows the credit below.

Credit: ChatGPT
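For alpha participants, switching to the long-output model is just a model-name change in a standard chat-completions call. Here is a minimal sketch using the official openai Python SDK, assuming your account has alpha access; the prompt and the max_tokens value are illustrative:

```python
# Minimal sketch: requesting a long completion from the 64K-output alpha model.
# Assumes alpha access; the model name "gpt-4o-64k-output-alpha" comes from
# OpenAI's announcement above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-64k-output-alpha",
    max_tokens=64000,  # up to 64K output tokens per request
    messages=[
        {"role": "user", "content": "Write a chapter-by-chapter summary of a 400-page novel."},
    ],
)
print(response.choices[0].message.content)

# Rough cost estimate at the announced pricing:
# $6.00 per 1M input tokens, $18.00 per 1M output tokens.
usage = response.usage
cost = usage.prompt_tokens / 1e6 * 6.00 + usage.completion_tokens / 1e6 * 18.00
print(f"Approximate cost: ${cost:.4f}")
```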

Microsoft Adds OpenAI as Competitor

  • Microsoft's relationship with OpenAI has become more complicated as it now lists OpenAI as a competitor in its latest annual report. Despite a long-term partnership and significant investment, Microsoft and OpenAI are moving into each other's domains, particularly in AI offerings and search and news advertising. This shift highlights the evolving dynamics between the tech giant and the AI startup, even as they continue to collaborate closely.

Credit: Getty Images

OpenAI Endorses Senate Bills

  • OpenAI has endorsed three Senate bills that could significantly shape America’s AI policy. The Future of AI Innovation Act aims to establish the United States AI Safety Institute to set standards for AI models. OpenAI also supports the NSF AI Education Act and the CREATE AI Act, which provide scholarships for AI research and establish educational resources in colleges and K-12 settings.

Credit: Getty Images

Google's New AI Model: Gemma 2 2B

  • Google has introduced Gemma 2 2B, a 2-billion-parameter model designed for speed and efficiency, especially suited to mobile and on-device use. It arrives alongside ShieldGemma, an open-source suite of content-safety classifiers that filter AI inputs and outputs. Remarkably, despite its small size, Google reports that Gemma 2 2B outperforms much larger models such as Mixtral 8x7B, GPT-3.5 Turbo, and LLaMA 2 70B. A minimal local-inference sketch follows the credit below.

Credit: Gemma 2
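If you want to try the model locally, the weights are published on Hugging Face. A minimal sketch using the transformers library; the checkpoint name google/gemma-2-2b-it (the instruction-tuned variant) is an assumption based on the release, and you need to accept Google's license on the model page before downloading:

```python
# Minimal local-inference sketch for Gemma 2 2B (instruction-tuned).
# Checkpoint name is an assumption based on the Hugging Face release;
# accepting the license on the model page is required before download.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

messages = [{"role": "user", "content": "Explain what makes small models efficient."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```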

New AI Chrome Features

  • Google has introduced exciting new AI features in Chrome. The updated Google Lens now allows users to select specific areas of images for more precise searches. For instance, you can highlight a suitcase or plant, and Chrome will find similar items or identify the plant using AI. Another addition is the Compare feature, which enables side-by-side comparisons of products from different tabs. Lastly, the new natural language search history feature lets you query your past searches in plain language, making it easier to track down that ice cream shop you visited last week.

Credit: Google

Meta's New AI Chatbot Feature

  • Meta has retired its previous AI chatbot features, like those featuring digital renditions of celebrities, and replaced them with a more versatile AI Studio. This new tool allows users to create custom AI characters tailored to their specific interests and needs, such as a personal tutor, creative designer, or even a pet AI. The AI Studio offers extensive customization options, including character design, capabilities, and dialogues, enabling users to build unique AI experiences.

Credit: Meta

Perplexity Publishers’ Program Launch

  • Perplexity has released its new Publishers’ Program to bolster the role of media organizations and content creators. The initiative features revenue sharing, where publishers earn from advertising linked to their content, and access to Perplexity’s APIs for custom answer engines on their sites. Partners like TIME and Der Spiegel will benefit from these tools and Enterprise Pro access, supporting enhanced engagement and analytics. The program aims to foster collaboration and adapt to evolving internet ecosystems, potentially including future bundled subscription options.

Credit: Perplexity

Midjourney 6.1 Update

  • Midjourney has released version 6.1, bringing significant enhancements to its AI image generator. This update improves image quality, coherence, and text accuracy, introducing a new upscaling and personalization model. The advancements are evident in the generated images, with notably better realism and detail. Users have already shared impressive results, showcasing how far AI-generated visuals have come in just a few years.

Credit: Midjourney

Stable Fast 3D

  • Stability AI has introduced Stable Fast 3D, a new way to generate 3D assets quickly from single images. The tool, available via API and through Stable Assistant (which requires a subscription), can create a 3D model from an image in under a second. For a more accessible option, you can try it on Hugging Face, which also offers rapid 3D generation. While it is impressively fast, the quality of the 3D models may vary depending on the image's angle and detail. A hedged API sketch follows the credit below.

Credit: Stability
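Calling the hosted service is a single image-in, mesh-out request. A hedged sketch using Python requests; the endpoint path and the response format are assumptions based on Stability AI's v2beta API conventions, so check the official docs before relying on this:

```python
# Hedged sketch: single image -> 3D asset via Stability AI's hosted API.
# The endpoint path and response format are assumptions based on the
# v2beta API conventions; consult the official docs for the exact contract.
import os
import requests

resp = requests.post(
    "https://api.stability.ai/v2beta/3d/stable-fast-3d",
    headers={"Authorization": f"Bearer {os.environ['STABILITY_API_KEY']}"},
    files={"image": open("chair.png", "rb")},  # a single product-style photo
    timeout=60,
)
resp.raise_for_status()

# Assumed: the service returns the generated mesh as binary glTF (.glb).
with open("chair.glb", "wb") as f:
    f.write(resp.content)
```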

New A-Player in the Text-to-Image World

  • Black Forest Labs has introduced FLUX.1, its latest suite of text-to-image models. The release offers impressive advances in image detail, prompt adherence, and scene complexity. Available in three variants—FLUX.1 [pro], [dev], and [schnell]—the models cater to different needs, from top-tier performance to rapid, local development. Notably, FLUX.1 [pro] surpasses models like Midjourney v6.0 and DALL·E 3 in visual quality and output diversity. A minimal sketch for running the open [schnell] variant follows the credit below.

Credit: Black Forest Labs
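The [schnell] variant is openly available and runs through the diffusers library. A minimal sketch, assuming the black-forest-labs/FLUX.1-schnell checkpoint on Hugging Face and a GPU with enough memory; the prompt and resolution are illustrative:

```python
# Minimal text-to-image sketch with FLUX.1 [schnell] via diffusers.
# Checkpoint name assumed from the Hugging Face release; schnell is the
# fast, openly licensed variant tuned to need very few denoising steps.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.to("cuda")  # or pipe.enable_model_cpu_offload() on smaller GPUs

image = pipe(
    prompt="a cozy reading nook in a treehouse, golden hour, detailed",
    num_inference_steps=4,  # schnell is designed for very few steps
    guidance_scale=0.0,     # schnell runs without classifier-free guidance
    height=768,
    width=1024,
).images[0]
image.save("flux_schnell.png")
```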

New Model from Runway

  • Runway has expanded its Gen-3 Alpha capabilities with a new image-to-video model. The update lets users upload an image and generate corresponding video content, and early demonstrations show impressive results, such as transforming static images into dynamic scenes. Additionally, Runway introduced Gen-3 Alpha Turbo, which significantly speeds up video generation, completing outputs in as little as 11 seconds.

Credit: Runway

Suno Lawsuit Response

  • Suno has responded to a lawsuit alleging that it trained its AI model on copyrighted material. The company asserts that its training data came from publicly available sources and that any copyrighted content swept up along the way was incidental rather than targeted. Suno compares its learning process to a human's: absorbing general musical styles without replicating specific artists.

Credit: Suno

Paris Olympics AI Integration

  • The 2024 Paris Olympics are set to showcase AI like never before. This year, the International Olympic Committee (IOC) is diving into AI to boost athletes' performance, ensure fair play, and enhance the spectator experience, with applications ranging from AI-driven talent scouting and personalized training plans to real-time injury prevention and more transparent judging systems.

Credit: Shutterstock

New Noteworthy Papers

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

Institutions:

  1. The University of Hong Kong
  2. Sea AI Lab
  3. Contextual AI
  4. The Ohio State University

Abstract: Research on scaling large language models (LLMs) has primarily concentrated on increasing model parameters and training data size, often neglecting the impact of vocabulary size. This study examines how vocabulary size affects LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We introduce three methods for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. All methods indicate that the optimal vocabulary size correlates with the available compute budget, suggesting that larger models benefit from larger vocabularies. For instance, our analysis predicts that the optimal vocabulary size for Llama2-70B should be at least 216K—seven times larger than its current size of 32K. We empirically validate this by training 3B parameter models across different compute budgets. Using the predicted optimal vocabulary size consistently enhances downstream performance compared to conventional sizes. Increasing the vocabulary size from 32K to 43K improves performance on the ARC-Challenge from 29.1 to 32.0 with the same compute budget. This research highlights the importance of considering both model parameters and vocabulary size for effective scaling.
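To make the headline relation concrete, here is a toy sketch of the kind of power-law fit the paper's methods produce: fit log V_opt against log compute on a few (compute, best-vocabulary) points from IsoFLOPs-style sweeps, then extrapolate. The numbers below are invented for illustration; only the fitting procedure is the point:

```python
# Toy sketch of extrapolating a compute-optimal vocabulary size.
# The (FLOPs, best-vocab) pairs below are invented for illustration;
# the paper derives real ones via IsoFLOPs analysis, derivative
# estimation, and a parametric fit of the loss function.
import numpy as np

flops = np.array([1e19, 1e20, 1e21, 1e22])        # hypothetical compute budgets
best_vocab = np.array([16e3, 32e3, 64e3, 128e3])  # hypothetical vocab optima

# Fit log V_opt = gamma * log C + b, i.e. a power law V_opt ∝ C^gamma.
gamma, b = np.polyfit(np.log(flops), np.log(best_vocab), deg=1)

def optimal_vocab(compute_flops: float) -> float:
    """Extrapolate the fitted power law to a new compute budget."""
    return float(np.exp(gamma * np.log(compute_flops) + b))

print(f"gamma = {gamma:.3f}")
print(f"predicted optimal vocab at 1e24 FLOPs: {optimal_vocab(1e24):,.0f}")
```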

Implementability of Information Elicitation Mechanisms with Pre-Trained Language Models

Authors: Anonymous

Abstract: As language models advance in complexity, ensuring their outputs faithfully reflect the input and maintaining consistency in reasoning are crucial challenges. To address scalability issues in monitoring these aspects, this paper introduces a novel approach using information-theoretic measures to detect manipulated or unfaithful reasoning. We propose a Difference of Entropies (DoE) estimator to quantify the mutual information difference between outputs, offering a principled method to identify low-quality or inconsistent content. Theoretically, we analyze the DoE estimator, proving its incentive-compatibility and deriving scaling laws for f-mutual information based on sample size. Implementing the estimator with an LLM on tasks such as machine translation and peer review datasets from ICLR 2023, we find that the DoE estimator consistently assigns higher scores to unmodified reviews compared to manipulated ones and correlates with BLEU scores. These findings demonstrate the effectiveness of information-theoretic approaches in ensuring the reliability of language model reasoning and highlight their potential for scalable oversight of advanced AI systems.
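The core quantity here is the information-theoretic identity I(X;Y) = H(Y) − H(Y|X): mutual information expressed as a difference of entropies. A toy numpy sketch of that difference computed from a discrete joint distribution; it illustrates the quantity a DoE-style score targets, not the paper's actual estimator:

```python
# Toy illustration of the "difference of entropies" behind an MI estimate:
# I(X;Y) = H(Y) - H(Y|X). This is the identity a DoE-style score builds on;
# it is NOT the paper's estimator, and the joint distribution is invented.
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy in nats of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Joint distribution over (X, Y): rows index x, columns index y.
joint = np.array([[0.30, 0.05],
                  [0.05, 0.60]])

p_y = joint.sum(axis=0)  # marginal over Y
p_x = joint.sum(axis=1)  # marginal over X
h_y = entropy(p_y)       # H(Y)
h_y_given_x = sum(       # H(Y|X) = sum_x p(x) * H(Y | X=x)
    p_x[i] * entropy(joint[i] / p_x[i]) for i in range(len(p_x))
)

print(f"H(Y)         = {h_y:.4f} nats")
print(f"H(Y|X)       = {h_y_given_x:.4f} nats")
print(f"I(X;Y) = DoE = {h_y - h_y_given_x:.4f} nats")
```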

tinyBenchmarks: Evaluating LLMs with Fewer Examples

Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

Abstract: The wide-ranging applications of large language models (LLMs) have led to the development of extensive benchmarks to thoroughly test their capabilities. However, these benchmarks often require tens of thousands of examples, making the evaluation process costly and time-consuming. This paper explores methods to reduce the number of evaluations needed to assess LLM performance on key benchmarks. We demonstrate that, for instance, evaluating an LLM on only 100 curated examples can provide an accurate estimate of its performance on MMLU, a widely used multiple-choice QA benchmark with 14K examples. We release tools and smaller versions of popular benchmarks, including Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis shows that these tools and smaller benchmarks are effective in reliably and efficiently reproducing the original evaluation results.
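The practical idea is simple: score the model on a small, well-chosen subset and treat the subset accuracy as an estimate of full-benchmark accuracy. A minimal sketch of the naive random-subsample baseline with a standard-error bar; the paper's curated subsets and estimators are more sophisticated than this:

```python
# Naive baseline for few-example evaluation: random subsample + standard error.
# tinyBenchmarks uses curated examples and smarter estimators; this sketch only
# shows why ~100 examples can already give a fairly tight accuracy estimate.
import math
import random

def estimate_accuracy(per_item_correct: list[bool], k: int = 100, seed: int = 0):
    """Estimate benchmark accuracy from k randomly sampled items."""
    rng = random.Random(seed)
    sample = rng.sample(per_item_correct, k)
    acc = sum(sample) / k
    stderr = math.sqrt(acc * (1 - acc) / k)  # binomial standard error
    return acc, stderr

# Pretend full-benchmark results for 14K items (invented; true accuracy ~0.62).
rng = random.Random(1)
full_results = [rng.random() < 0.62 for _ in range(14_000)]

acc, stderr = estimate_accuracy(full_results, k=100)
print(f"estimated accuracy: {acc:.3f} ± {1.96 * stderr:.3f} (95% CI)")
```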

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang

Abstract: We introduce VideoPoet, a model designed for generating high-quality videos from diverse conditioning signals. VideoPoet uses a decoder-only transformer architecture to handle multimodal inputs, including images, videos, text, and audio. Its training involves two stages: pretraining with a mixture of multimodal generative objectives within an autoregressive Transformer framework, followed by task-specific adaptation. Results showcase VideoPoet’s state-of-the-art performance in zero-shot video generation, particularly in creating high-fidelity motions. For more information, visit the project page.

Expressive Whole-Body 3D Gaussian Avatar

Authors: Gyeongsik Moon, Takaaki Shiratori, Shunsuke Saito

Institutions: DGIST; Codec Avatars Lab, Meta


Abstract: Facial expressions and hand motions are crucial for conveying emotions and interacting effectively. However, many 3D human avatars created from casually captured videos only support body motions, lacking detailed facial expressions and hand movements. We introduce ExAvatar, a system for creating expressive whole-body 3D avatars from short monocular videos. ExAvatar combines a parametric mesh model (SMPL-X) with 3D Gaussian Splatting (3DGS) to address challenges related to limited facial and pose diversity in videos and the absence of 3D observations. Our hybrid approach uses 3D Gaussians as vertices on a mesh with predefined connectivity, enabling animations with novel facial expressions and reducing artifacts in new motions.



Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Authors: Zhenghao Zhang*, Junchao Liao*, Menghao Li, Long Qin, Weizhi Wang

Abstract: Recent advancements in Diffusion Transformers (DiTs) have shown impressive results in high-quality video generation. However, the control of motion within generated videos has seen limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework designed to integrate textual, visual, and trajectory conditions for video generation. Tora comprises a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes trajectories into hierarchical spacetime motion patches using a 3D video compression network. The MGF incorporates these patches into DiT blocks to ensure video consistency along trajectories. Tora’s architecture aligns with DiT’s scalability, enabling precise control over video dynamics with varying durations, aspect ratios, and resolutions. Extensive experiments showcase Tora’s capability in achieving high motion fidelity and accurately simulating physical movements.

Alchemist: Parametric Control of Material Properties with Diffusion Models

Authors: Prafull Sharma, Varun Jampani, Yuanzhen Li, Dmitry Lagun, Fredo Durand, Bill Freeman, Mark Matthews

Institutions: MIT CSAIL

Abstract: This paper introduces Alchemist, a method for controlling material properties—such as roughness, metallicity, albedo, and transparency—in real images using diffusion models. Leveraging the generative capabilities of text-to-image models known for their photorealism, Alchemist employs scalar values and instructions to adjust these low-level material attributes. To address the scarcity of datasets with controlled material properties, the authors created a synthetic dataset featuring physically-based materials. By fine-tuning a modified pre-trained text-to-image model on this dataset, Alchemist enables precise editing of material properties in real-world images while maintaining other attributes. The model’s potential applications include material editing in Neural Radiance Fields (NeRFs).


Audio:

Language Model Can Listen While Speaking

Authors: Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Abstract: Dialogue is a natural mode of human-computer interaction (HCI), but traditional speech language models (SLMs) are constrained by turn-based conversation, lacking real-time interaction capabilities. This paper introduces a novel approach to address this limitation through full duplex modeling (FDM) in interactive speech language models (iSLM). The proposed listening-while-speaking language model (LSLM) integrates both listening and speaking channels in an end-to-end system. LSLM employs a token-based decoder-only text-to-speech (TTS) system for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. The model supports autoregressive generation and real-time turn-taking detection. Three fusion strategies—early fusion, middle fusion, and late fusion—are evaluated, with middle fusion providing the best balance between speech generation and real-time interaction. Experimental results in command-based and voice-based FDM settings demonstrate LSLM’s robustness to noise and responsiveness to diverse instructions. This advancement aims to improve interactive speech dialogue systems, enhancing real-time conversational capabilities.
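To give a feel for what "middle fusion" means here, a toy PyTorch sketch: listening-channel features are projected and added to the speaking decoder's intermediate hidden states between layers, rather than only at the input (early fusion) or output (late fusion). This is an illustrative pattern under invented dimensions, not the authors' architecture:

```python
# Toy "middle fusion" pattern: inject listening-channel features into the
# speaking decoder's intermediate hidden states. Illustrative only; the
# LSLM paper's actual architecture, dimensions, and masking differ
# (causal masking is omitted here for brevity).
import torch
import torch.nn as nn

class MiddleFusionDecoder(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.listen_proj = nn.Linear(d_model, d_model)  # project listening features
        self.fuse_at = n_layers // 2                    # fuse halfway through

    def forward(self, speak_h: torch.Tensor, listen_h: torch.Tensor) -> torch.Tensor:
        # speak_h:  (batch, T, d_model) speaking-channel hidden states
        # listen_h: (batch, T, d_model) streaming listening-channel features
        for i, layer in enumerate(self.layers):
            if i == self.fuse_at:
                speak_h = speak_h + self.listen_proj(listen_h)  # middle fusion
            speak_h = layer(speak_h)
        return speak_h

decoder = MiddleFusionDecoder()
out = decoder(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```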


Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation

Authors: Jiwoo Ryu, Hao-Wen Dong, Jongmin Jung, Dasaem Jeong

Abstract: Representing symbolic music with compound tokens, which bundle multiple sub-tokens each representing a specific musical feature, can effectively reduce sequence length. However, predicting all sub-tokens simultaneously often fails to capture their interdependencies. This paper introduces the Nested Music Transformer (NMT), an architecture designed to decode compound tokens autoregressively while managing memory usage efficiently. The NMT features two transformers: a main decoder for the sequence of compound tokens and a sub-decoder for the sub-tokens within each compound token. Experiments demonstrate that the NMT improves performance in terms of perplexity when processing various symbolic music datasets and discrete audio tokens from the MAESTRO dataset, showcasing its effectiveness in music sequence modeling.
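The nested decoding idea can be shown compactly: a main decoder produces one hidden vector per compound token, and a small sub-decoder autoregressively predicts that token's sub-tokens (pitch, duration, and so on) conditioned on it. A toy PyTorch sketch of that control flow with stand-in recurrent modules, not the NMT itself:

```python
# Toy nested decoding: a main decoder emits one context vector per compound
# token; a sub-decoder autoregressively predicts the sub-tokens inside it.
# Illustrative control flow only, with GRUs standing in for the paper's
# transformers; all dimensions and vocabularies are invented.
import torch
import torch.nn as nn

d, n_sub, vocab = 128, 4, 512  # hidden size, sub-tokens per compound token, sub-token vocab

main_decoder = nn.GRU(d, d, batch_first=True)  # stand-in for the main decoder
sub_decoder = nn.GRUCell(d, d)                 # stand-in for the sub-decoder
sub_embed = nn.Embedding(vocab, d)
sub_head = nn.Linear(d, vocab)

compound_inputs = torch.randn(1, 10, d)   # 10 compound tokens (already embedded)
ctx, _ = main_decoder(compound_inputs)    # (1, 10, d): one context per compound token

for t in range(ctx.size(1)):
    h = ctx[:, t, :]                                     # init sub-decoder from main context
    prev = sub_embed(torch.zeros(1, dtype=torch.long))   # <start> sub-token
    sub_tokens = []
    for _ in range(n_sub):                               # decode sub-tokens one by one
        h = sub_decoder(prev, h)
        nxt = sub_head(h).argmax(dim=-1)                 # greedy pick for the sketch
        sub_tokens.append(int(nxt))
        prev = sub_embed(nxt)
    print(f"compound token {t}: sub-tokens {sub_tokens}")
```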


Generative Expressive Conversational Speech Synthesis

Authors: Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li

Abstract: Conversational Speech Synthesis (CSS) aims to produce speech that conveys the appropriate speaking style for user-agent interactions. While existing methods use multi-modal context modeling to understand and express empathy, they often require complex network architectures and struggle with limited, scripted datasets that do not fully capture natural conversational styles. To overcome these challenges, we propose GPT-Talker, a novel generative expressive CSS system. GPT-Talker converts multi-turn dialogue history into discrete token sequences and integrates them to create a comprehensive dialogue context. Using GPT, the system predicts a token sequence that encompasses both semantic and stylistic information for the agent's response. The synthesized speech is then produced by an enhanced VITS model. We also introduce the Natural CSS Dataset (NCSSD), a large-scale dataset with 236 hours of naturally recorded conversational speech and dialogues from TV shows in both Chinese and English. Extensive experiments show that GPT-Talker significantly outperforms existing CSS systems in naturalness and expressiveness, as validated by both subjective and objective evaluations.

YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Authors: Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon

Abstract: Multi-instrument music transcription converts polyphonic recordings into musical scores, assigning each note to the correct instrument. This task is challenging due to the need for simultaneous identification of multiple instruments and precise transcription of pitch and timing. Additionally, the scarcity of fully annotated data complicates training. This paper presents YourMT3+, an advanced model suite for multi-instrument music transcription, building on the MT3 language token decoding approach. Enhancements include a hierarchical attention transformer in the time-frequency domain and the integration of a mixture of experts. To tackle data limitations, we introduce a new multi-channel decoding method that works with incomplete annotations and propose intra- and cross-stem augmentation techniques for dataset expansion. Experiments show that YourMT3+ can perform direct vocal transcription without needing separate voice separation processes. Benchmark tests across ten public datasets demonstrate the model's competitive or superior performance compared to existing transcription models, with additional testing revealing limitations in current models for pop music recordings.

Thank you for your attention. Subscribe now to stay informed and join the conversation!

About us:

We have an amazing team of AI engineers with:

  • A blend of industrial experience and a strong academic track record
  • 300+ research publications and 150+ commercial projects
  • Millions of dollars saved through our ML/DL solutions
  • An exceptional work culture, ensuring satisfaction with both the process and results

We are here to help you maximize efficiency with your available resources.

Reach out when:

  • You want to identify which daily tasks can be automated
  • You need to understand the benefits of AI and how to avoid excessive cloud costs while maintaining data privacy
  • You’d like to optimize current pipelines and computational resource distribution
  • You’re unsure how to choose the best DL model for your use case
  • You know how, but struggle to hit specific performance and cost-efficiency targets

Have doubts or questions about AI in your business? Get in touch!
