TAI #123: Strong Upgrade to Anthropic’s Sonnet and Haiku 3.5, but Where’s Opus?
Towards AI
Making AI accessible to all with our courses, blogs, tutorials, books & community.
Also: Computer Use, Perplexity upgrades, Stable Diffusion 3.5, OmniParser and more.
What happened this week in AI by Louie
This week, Anthropic released significant upgrades to its Claude model family: an improved Sonnet 3.5, the first smaller Haiku model in the 3.5 series, and a new feature allowing “computer use”. The updated Sonnet model, imaginatively called “3.5 (new)”, comes with big jumps in agentic coding benchmarks, achieving 49.0% on SWE-bench Verified. Even the new, smaller Haiku 3.5 (40.6%) beats the prior version of Sonnet 3.5 (33.4%). SWE-bench is a dataset of 2,294 issue-pull-request pairs designed to evaluate systems’ ability to automatically resolve GitHub issues. Sonnet 3.5 (new) beats GPT-4o on most benchmarks but falls short of OpenAI’s slower and more expensive o1 reasoning model family on several capabilities.
The new 3.5 Sonnet model continues Anthropic’s rapid and inference-cost-efficient progress across many benchmarks this year. For example, the new model scores 65.0% on the GPQA Diamond graduate-level reasoning benchmark, up from 59.4% for the 3.5 model released in June and 40.4% for Sonnet 3.0, released in March. The new Claude 3.5 Haiku matches the cost and speed of its predecessor while surpassing the performance of the previous largest model, Claude 3 Opus, on coding and other key benchmarks. The model will be available across multiple platforms later this month.
Additionally, Claude 3.5 Sonnet (new) introduces an experimental feature called "computer use," enabling the AI to interact with standard computer software by navigating screens, clicking buttons, and typing text, similar to how humans operate computers. This capability is currently in public beta and aims to automate tasks across various applications. While the feature is still developing and has limitations, Anthropic has implemented safety measures to address possible misuse, such as prompt injection attacks and unauthorized actions.
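For the curious, here is a minimal sketch of what a "computer use" request might look like via Anthropic's Python SDK. The tool type "computer_20241022" and the beta flag are taken from Anthropic's beta announcement, but treat the exact names and parameters as assumptions; this builds the request parameters without sending them.

```python
# Sketch: composing a "computer use" request for Anthropic's beta API.
# Tool type and beta flag match the October 2024 announcement; treat
# exact identifiers as assumptions subject to change during the beta.

def build_computer_use_request(task: str, width: int = 1024, height: int = 768) -> dict:
    """Build keyword arguments for a computer-use beta call."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [{
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": width,    # resolution of the screen the
            "display_height_px": height,  # model will see in screenshots
        }],
        "messages": [{"role": "user", "content": task}],
        # Sent as the "anthropic-beta" header when calling the API:
        "betas": ["computer-use-2024-10-22"],
    }

params = build_computer_use_request("Open the spreadsheet and sum column B")
print(params["tools"][0]["type"])  # computer_20241022
```

In the beta, the model responds with tool-use actions (mouse moves, clicks, keystrokes) that your own code must execute and screenshot back to it in a loop; the API does not control the computer itself.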
Why should you care?
Many people have found Claude Sonnet 3.5 to be an incredibly valuable coding assistant this year, either used directly or integrated with agents or tools such as Cursor. It is becoming increasingly important for developers to learn how to code “LLM natively” and to figure out how to boost productivity and reduce coding mistakes by using these tools. This does, however, require a willingness to adapt your workflows, and there is still a learning curve to using these tools diligently and effectively without introducing new bugs or vulnerabilities. The continued pace of improvement in Claude Sonnet this year, within the fast and affordable mid-size model category, suggests coding agents and assistants will soon become even more powerful.
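As one concrete example of an "LLM-native" workflow, a simple pre-commit review helper can be built on the Anthropic Python SDK. This is a sketch only: the model ID below matches the Sonnet 3.5 (new) release but should be treated as an assumption, and the review prompt is illustrative.

```python
# Sketch: a minimal code-review helper for the Anthropic Messages API.
# Builds the request parameters; the actual send (commented below)
# requires the `anthropic` package and an API key.

REVIEW_SYSTEM = "You are a careful code reviewer. Flag bugs, vulnerabilities, and risky changes."

def build_review_request(diff: str) -> dict:
    """Return keyword arguments for client.messages.create()."""
    return {
        "model": "claude-3-5-sonnet-20241022",  # assumed model ID
        "max_tokens": 1024,
        "system": REVIEW_SYSTEM,
        "messages": [{"role": "user", "content": f"Review this diff:\n\n{diff}"}],
    }

# With the SDK installed and ANTHROPIC_API_KEY set:
#   import anthropic
#   reply = anthropic.Anthropic().messages.create(**build_review_request(my_diff))
#   print(reply.content[0].text)
```

The point of the discipline mentioned above is that the model's suggestions still need human review; a helper like this surfaces issues early, it does not replace the reviewer.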
One question mark raised during the Sonnet and Haiku 3.5 release was the absence of a larger-tier Opus 3.5 model. While Opus 3.0 has fallen behind on benchmarks, it is still a popular model, and many people still think there is some hard-to-pin-down “magic” in interactions with the largest-tier models. The absence of Opus 3.5 raises questions about whether Anthropic is discontinuing this largest tier of models and whether capability gains from scaling training compute are beginning to plateau. It is not just Opus 3.5: we are also yet to see GPT-5, Gemini Ultra 1.5 or 2.0, or anything clearly in a larger model parameter tier than GPT-4. We think part of the reason for this is that the largest models now make great “teacher” models when using model distillation to improve the capabilities of smaller and more affordable “student” models. We wouldn’t be surprised if Sonnet 3.5 benefits from being taught by a larger internal teacher model. The reasons AI labs are choosing not to release their largest-tier models could include:
1: A poor capability premium relative to the lower cost and latency of smaller student models that already integrate some of the benefits of the larger model’s training run. Fear of public disappointment also plays a role here.
2: Risks of distillation by competitors. If a model is released publicly, competitors may try to use its responses to distill its intelligence and teach their own models. This is against terms and conditions, and distilling via API and model responses is much less robust than having access to the full model weights, but it is still a competitive risk.
3: Constrained inference capacity at the AI labs for serving models to customers. It is better for Anthropic to serve more customers using Sonnet 3.5 reliably and accelerate LLM adoption than to have a smaller number of LLM early movers spending a larger amount on Opus 3.5.
These factors are all tied together when weighing the trade-offs and making the release decision that creates optimal long-term value for the company relative to risk. The new “inference-time scaling laws” pursued by OpenAI’s o1 model family are also another path for scaling the intelligence of smaller models instead of releasing larger ones. Overall, however, while releasing their best models is no longer a straightforward decision, we still expect to see larger-tier models released by the large AI labs. For many tasks, enterprise customers will pay up for more capability, particularly when using a model router in the pipeline that only calls the most expensive model when necessary.
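The model-router idea can be sketched in a few lines. This toy router escalates from a cheap tier to a pricier one only when a heuristic flags a request as hard; the model names (including the hypothetical "opus-next") and the keyword heuristic are purely illustrative, not any real product's routing logic.

```python
# Sketch: a toy model router that defaults to a cheap model and
# escalates only when a heuristic suggests the task is hard.
# All model names and thresholds here are illustrative assumptions.

CHEAP_MODEL = "haiku-3.5"
MID_MODEL = "sonnet-3.5"
FRONTIER_MODEL = "opus-next"  # hypothetical larger-tier model

HARD_KEYWORDS = ("prove", "multi-step", "refactor the whole", "legal analysis")

def route(prompt: str) -> str:
    """Pick the cheapest model tier expected to handle the prompt."""
    if any(k in prompt.lower() for k in HARD_KEYWORDS):
        return FRONTIER_MODEL          # hard task: pay for the top tier
    if len(prompt.split()) > 200:      # long request: use the mid tier
        return MID_MODEL
    return CHEAP_MODEL                 # everything else stays cheap

print(route("Summarize this paragraph"))         # haiku-3.5
print(route("Prove this theorem step by step"))  # opus-next
```

Production routers typically replace the keyword heuristic with a small classifier model or confidence signals from the cheap model itself, but the cost logic is the same: the expensive tier is invoked only when the premium is justified.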
Hottest News
Anthropic has launched a “computer use” feature in public beta, enabling Claude to interact with computers like humans, showing promise for automating complex tasks. They also released an updated version of Claude 3.5 Sonnet and a new model, Claude 3.5 Haiku.
NotebookLM, built on Gemini 1.5, has upgraded its audio overviews to let users customize the AI host’s focus and expertise level while ensuring data privacy. They also introduced a NotebookLM Business pilot program.
Perplexity has launched Internal Knowledge Search and Spaces in its Pro and Enterprise Pro versions, enabling users to efficiently integrate web and internal file searches for enhanced research capabilities.
Stability AI has launched Stable Diffusion 3.5, introducing customizable models such as the 8 billion parameter Large and the faster Large Turbo. These models, available under the Stability AI Community License, are suitable for both commercial and non-commercial applications.
Meta FAIR has introduced new AI research advancements, including updates to the Segment Anything Model (SAM 2.1), the Meta Spirit LM for speech-text integration, Layer Skip for efficient large language models, Salsa for post-quantum cryptography, and Meta Open Materials 2024 for faster materials discovery.
Ideogram.ai’s new platform, Ideogram Canvas, offers advanced image organization, generation, editing, and merging capabilities. With tools like Magic Fill (inpainting) and Extend (outpainting) available on any paid plan, it supports AI image editing and expansion. The platform excels at high-resolution detail generation and precise text rendering.
OpenAI is reportedly preparing to launch a new AI model, Orion, by December, which could surpass GPT-4 in capabilities. Initially, Orion would be available to select partner companies for integration, with Microsoft potentially hosting it on Azure, though CEO Sam Altman has denied the reported release plans.
Five 5-minute reads/videos to keep you learning
The article highlights a transformative shift in AI with OpenAI’s o1 series, marking a move from generative to reasoning models. This evolution signals the decline of chatbots and the rise of AI capable of real-time reasoning and complex problem-solving, potentially creating a divide between AI-rich and AI-poor users and challenging existing AI paradigms.
Cloudflare CEO Matthew Prince highlights the shift towards local data processing in Internet infrastructure to reduce latency, particularly for AI inference on edge devices.
Transformers.js v3 offers major updates with WebGPU support for up to 100x faster processing, new quantization formats, and compatibility with Node.js, Deno, and Bun. It supports 120 architectures and over 1200 pre-converted models on the Hugging Face Hub, enabling advanced machine learning computations directly in browsers.
The Anthropic Alignment Science team has created new evaluations to assess sabotage risks from advanced AI models, including human decision sabotage and code sabotage. Initial tests with models like Claude 3 Opus and Claude 3.5 Sonnet show low-level sabotage capabilities, indicating minimal current risks but emphasizing the need for ongoing vigilance and improved evaluations as AI technology progresses.
The article presents sCM, a novel approach for continuous-time consistency models that enhances training by simplifying, stabilizing, and scaling the process. sCM achieves high-quality samples in just two steps, offering a ~50x speed increase over leading diffusion models, enabling real-time image generation in various AI applications with the potential for further speed and quality improvements.
Repositories & Tools
Top Papers of The Week
Large Language Models frequently produce hallucinations or non-factual content. Knowledge editing aims to correct these errors without full retraining, but its effectiveness is uncertain due to inadequate evaluation datasets. The study introduces HalluEditBench, a benchmark with a dataset spanning 9 domains to assess knowledge editing methods on Efficacy, Generalization, Portability, Locality, and Robustness, providing insights into their capabilities and limitations.
OmniParser is a method that enhances vision-based GUI agents, such as GPT-4V, by improving their ability to parse screens. It focuses on identifying interactable icons and understanding UI element semantics. By using curated datasets to train detection and caption models, OmniParser significantly boosts GPT-4V’s performance on benchmarks like ScreenSpot, Mind2Web, and AITW, outperforming baselines that need additional information beyond screenshots.
Mini-Omni2 is an open-source model designed to emulate GPT-4o’s ability to process visual, auditory, and textual inputs. It distinguishes itself by integrating pre-trained encoders and employing a novel three-stage training process for efficient multi-modal data handling.
CompassJudger-1 is an open-source judge model for evaluating LLMs, offering scoring, comparisons, and critiques. Alongside, JudgerBench provides a benchmark for judging models on subjective tasks. Both tools, aimed at advancing LLM evaluation, are publicly available on GitHub.
SAM2Long enhances SAM 2 for long video segmentation by tackling "error accumulation" with a training-free memory tree that optimally selects segmentation pathways. This method improves accuracy and robustness in complex videos, showing notable benchmark gains. The code is publicly accessible.
Quick Links
Who’s Hiring in AI
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.