AI/ML news summary: week 32
Marco van Hurne
Architect of AI solutions that improve business efficiency and client engagement.
Here are the articles, guides, and news about AI for week 32. I read tons of RSS feeds and blogs, so you won't have to scour the internet yourself for the latest AI news this week.
This week was a game-changer for LLM inference.
LLM inference is the process of using a pre-trained LLM to make predictions or generate text based on new input.
So, what is the big news all about?
"Cutting costs for reused input tokens with context caching".
I know, it sounds technical and utterly useless for us normal folk.
But the takeaway is simple: costs have plummeted.
With DeepSeek V2, reused input token inference is now 4,300 times cheaper than GPT-3 (davinci-002) from just two years ago. And performance? It's soaring. The MMLU benchmark score jumped from 60% to 79%, and the maximum context window grew 60-fold. To put this in perspective, at the height of Moore's Law, the cost per transistor dropped 4,000-fold in 14 years. But transistors didn't get smarter.
This kind of progress? It means a big global impact is just around the corner.
DeepMind was busy too. They followed Meta's lead with a burst of activity. They launched the experimental Gemini 1.5 Pro model, which finally put DeepMind at the top of the LMSYS Chatbot Arena. They've caught up in the LLM race, though they're still behind on the LiveBench and ZeroEval benchmarks. They also announced that the Flash model is set to become five times cheaper next week, at half the cost of GPT-4o-mini. This drop likely reflects advances in distillation and competitive pressure from Llama 3.1. Finally, they introduced a small but impressive 2B Gemma model that benefits from model distillation; it joins the LLM builder's toolkit right after Llama 3.1.
Within 24 hours of Gemini Flash's price announcement, China-based DeepSeek made waves. They launched a new Context Caching on Disk API. This technology slashes the cost of handling reused input tokens by 90%, down to $0.014 per million tokens, ten times cheaper than GPT-4o-mini. The caching mechanism stores input content that is likely to be reused in a distributed disk array; when the same input comes up again, it is retrieved from the cache instead of being recomputed. That not only cuts API costs but also reduces latency, from 13 seconds to just 500 milliseconds for large 128k prompts. This makes LLMs more viable for tasks like multi-step data analysis, repeated code base queries, and multi-turn conversations. DeepMind's Gemini also offers context caching, but its price cut isn't as steep, and it isn't automatic. However, the imminent 5x reduction in Gemini Flash pricing will help.
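To make that concrete, here is a minimal sketch of how a client would lean on automatic prefix caching, assuming DeepSeek's OpenAI-compatible chat endpoint. The file name, the prompts, and what exactly shows up in `usage` are illustrative assumptions, not taken from DeepSeek's docs.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint; the cache itself lives on the server side.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

long_context = open("codebase_dump.txt").read()  # large prefix reused across requests

for question in ["Where is the retry logic implemented?",
                 "Which module parses the config files?"]:
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            # Identical leading content across calls -> cache hit from the second call on
            {"role": "system", "content": long_context},
            # Only this suffix changes, so only it needs to be recomputed
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
    print(resp.usage)  # on a cache hit, most prompt tokens should be billed at the cached rate
```

The design point is simple: keep the stable, reused material at the front of the prompt and the changing question at the end, so successive calls share as large an identical prefix as possible.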
On another front, new research on inference-time scaling laws suggests we can boost LLM performance. How? By increasing the number of inference steps. This approach, called repeated sampling, lets weaker models outperform stronger ones in some tasks. For instance, DeepSeek-Coder-V2-Instruct, with 250 attempts, achieves a 56% success rate on SWE-bench Lite. It beats the 43% success rate of a single attempt using more capable models like GPT-4o. The effectiveness hinges on two factors: coverage (solving more problems across attempts) and precision (finding the correct solution among many).
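As a rough illustration of what repeated sampling looks like in code, here is a sketch; `generate` and `verify` are hypothetical callables standing in for the model API and an automatic checker such as the benchmark's unit tests.

```python
def solve_with_repeated_sampling(problem, generate, verify, k=250):
    """Draw up to k independent samples from a (possibly weaker) model and
    return the first candidate that passes verification."""
    for attempt in range(1, k + 1):
        candidate = generate(problem, temperature=0.8)  # sample diversely for coverage
        if verify(problem, candidate):                  # precision: pick out a correct one
            return candidate, attempt
    return None, k  # no sample passed; the problem stays unsolved

# Coverage is the fraction of problems solved by at least one of the k samples;
# precision is how reliably a correct sample can be identified, here via the verifier.
```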
The pace of progress in AI is friggin' accelerating, and it's exhilarating.
Each breakthrough not only advances the field but also sets the stage for the next leap. The developments we're seeing in context caching, cost reduction, and inference-time scaling laws are just the beginning. The future of AI looks more promising—and more economically viable—than ever.
Some more news
Let's break down the latest news in the AI world. My Lord, it has been a busy week!
Musical Chairs at the Leading AI Labs?
OpenAI is facing a talent drain. Co-founder John Schulman is leaving to join Anthropic. He's returning to hands-on technical work with a focus on AI alignment. Greg Brockman announced he's taking a sabbatical from OpenAI until the end of the year. Meanwhile, Google DeepMind has acquired a 30-person model training team from Character AI. This brings transformer co-inventor Noam Shazeer back to Google. The $2.5 billion deal looks like a strategic move. It rewards remaining Character employees and investors as Character shifts to building on external foundation models. It highlights the challenge of raising enough capital to train competitive foundation models, even for niche applications. And it underscores the value of key AI talent.
DeepSeek API Launches Context Caching on Disk
DeepSeek has launched Context Caching on Disk technology. This new approach automatically stores frequently referenced contexts on distributed storage. It slashes API costs for reused inputs by up to 90%. For a 128K prompt whose content is heavily reused, first-token latency drops from 13 seconds to just 500 milliseconds.
Google Released Gemma 2 2B
DeepMind introduced Gemma 2 2B, an impressive small model. It fits on a smartphone yet offers GPT-3.5 levels of performance. It outperforms both GPT-3.5 and Mixtral in the Chatbot Arena, and it scored 51.3 on 5-shot MMLU. Gemma achieves this via model distillation from a larger model.
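For anyone who wants to kick the tires locally, a minimal sketch with Hugging Face transformers looks like this; the model id `google/gemma-2-2b-it` (the instruction-tuned variant) and the generation settings are my assumptions, so check the model card before relying on them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # assumed Hub id; the weights are gated, so accept the license first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # requires `accelerate`; falls back to CPU without a GPU
)

prompt = "Explain context caching for LLM APIs in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```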
European Artificial Intelligence Act Comes Into Force
The European Artificial Intelligence Act (AI Act), the world's first comprehensive AI regulation, is now in force. The AI Act aims to ensure that AI developed and used in the EU is trustworthy, with safeguards to protect fundamental rights. Most rules will apply from August 2, 2026, but the rules for general-purpose AI models kick in earlier, 12 months after the Act entered into force.
Stable Diffusion Creators Launch Black Forest Labs, Secure $31M for FLUX.1 AI Image Generator
Black Forest Labs, founded by the creators of Stable Diffusion, launched the FLUX.1 text-to-image model suite. FLUX.1 has 12 billion parameters. It uses a hybrid architecture of multimodal and parallel diffusion transformer blocks. It comes in three versions: the closed-source FLUX.1 [pro] via API, the open-weight FLUX.1 [dev] for non-commercial use, and FLUX.1 [schnell], a faster version under the Apache 2.0 license for personal and local development.
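If you want to poke at the open-weight variants, a minimal sketch with Hugging Face diffusers would look roughly like this; the pipeline class, the model id, and the few-step/no-guidance settings for [schnell] are assumptions on my part, so verify them against the official repository.

```python
import torch
from diffusers import FluxPipeline  # assumed to exist in a recent diffusers release

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",  # assumed Hub id for the Apache 2.0 variant
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps fit the 12B model on a single consumer GPU

image = pipe(
    "a black forest cuckoo clock rendered as a circuit board, product photo",
    num_inference_steps=4,  # [schnell] is distilled to work in very few steps
    guidance_scale=0.0,     # distilled model; classifier-free guidance is effectively off
).images[0]
image.save("flux_schnell_demo.png")
```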
Zyphra Released Zamba2-2.7B: An Efficient and Faster Small Language Model
Zyphra introduced Zamba2-2.7B, a hybrid model combining Mamba2 (a state space model) and transformer technology. It shows significant efficiency and performance improvements. It was trained on a proprietary dataset of about 3 trillion tokens and matches the performance of larger models like Zamba1-7B. This release follows several other models demonstrating that hybrid Mamba models can compete with transformers at large training compute budgets, in the range of 2–4 billion parameters and roughly 3 trillion training tokens. We're eager to see if these can scale to hundreds of billions of parameters!
Cohere Released Cohere Prompt Tuner For Prompt Optimization
Cohere launched Prompt Tuner, a tool for optimizing and evaluating prompts for generative language use cases. The evaluation criteria are customizable and guide an instruction-generation model to propose new prompts. It's available in beta on the Cohere Dashboard.
AI Investment Driving Capex Growth at US Cloud Leaders
Major US cloud platforms are ramping up capital expenditures, driven by GPU purchases and data center investments. Amazon's capex grew to $17.6 billion (+54% year-on-year), Microsoft's to $13.9 billion (+55%), Google's to $13.2 billion (+91%), and Meta's to $8.5 billion (+36%). This surge is likely due to investments in inference data centers and training clusters for LLMs, as well as other GPU-heavy AI workloads like recommender systems.
Short stuff to keep you learning
This post addresses 11 critical questions about Llama 3.1 for managers and leaders, such as why the open-source nature of Llama 3.1 is beneficial compared to closed-source models, the available integrations with public cloud providers, deployment infrastructure, the advantages of Llama 3.1 in terms of performance, cost, and potential cost savings, and more.
The latest and highest-scoring frontier models are almost indistinguishable from the general user’s point of view. At the same time, the cost of training frontier AI models keeps rising while improvements are marginal. This blog observes various frontier AI models to solidify the trend that some customers may only need small models, some will need big models, and many will want to combine both in various ways.
In AI, we have newer and improved models, plenty of investment, improving infrastructure, and falling costs, but we haven't seen proportionate revenue and productivity gains. This essay highlights the major AI developments between March and June to underline how AI engineers can bridge the gap from capabilities to product.
The article highlights how present-day models can help you learn and automate boring tasks. The author, a programmer and research scientist studying ML, provides a list of 50 conversations he has had with different large language models to improve his ability to perform research and work on coding side projects.
This article focuses on the current problems with prompting. It covers the basics, some "advanced" techniques, and, most importantly, busts the myths around prompting. As the authors put it, "Despite all the hype around 'advanced' prompting techniques, it's really just about telling the model what you want in plain language."
Tools
Research papers
This official paper introduces Gemma 2, a new addition to the Gemma family of lightweight open models, ranging in scale from 2 billion to 27 billion parameters. This version makes several technical modifications to the Transformer architecture, such as interleaving local-global attentions and grouped-query attention.
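To picture what interleaving local and global attention means, here is a toy sketch of the attention masks involved; the alternation pattern and the window size are illustrative choices, not necessarily Gemma 2's exact configuration.

```python
import torch

def global_causal_mask(seq_len: int) -> torch.Tensor:
    # Standard causal attention: each token attends to all earlier tokens.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    # Local attention: each token only sees the most recent `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

def mask_for_layer(layer_idx: int, seq_len: int, window: int = 4096) -> torch.Tensor:
    # Interleaving: alternate layers switch between cheap local and full global attention.
    return sliding_window_mask(seq_len, window) if layer_idx % 2 == 0 else global_causal_mask(seq_len)
```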
This paper introduces MoMa, a modality-aware mixture-of-experts (MoE) architecture for pre-training mixed-modal, early-fusion language models. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing.
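The core idea of a modality-aware MoE is that text tokens and image tokens are routed within separate expert pools instead of competing for the same experts. A toy sketch follows (top-1 routing; the layer shapes and dimensions are made up for illustration, not MoMa's actual design):

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Toy modality-aware MoE layer: text and image tokens each get their own expert pool."""

    def __init__(self, d_model: int = 512, n_text_experts: int = 4, n_image_experts: int = 4):
        super().__init__()
        self.text_experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_text_experts))
        self.image_experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_image_experts))
        self.text_router = nn.Linear(d_model, n_text_experts)
        self.image_router = nn.Linear(d_model, n_image_experts)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]; is_image: [num_tokens] bool marking image tokens
        out = torch.empty_like(x)
        for mask, experts, router in (
            (~is_image, self.text_experts, self.text_router),
            (is_image, self.image_experts, self.image_router),
        ):
            tokens = x[mask]
            if tokens.shape[0] == 0:
                continue
            choice = router(tokens).argmax(dim=-1)  # top-1 expert per token, within its modality
            routed = torch.stack([experts[int(e)](t) for t, e in zip(tokens, choice)])
            out[mask] = routed
        return out
```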
This paper introduces MindSearch, which mimics human minds in web information seeking and integration and can be instantiated with a simple LLM-based multi-agent framework. MindSearch demonstrates significant improvement in response quality in terms of depth and breadth on both closed-set and open-set QA problems.
This research evaluated LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs in medical aspects and comparing their performance to humans. It tested Claude 3 and GPT-4 against six medical experts and found that both LLMs excelled more in semantic than numerical QAs, with Claude 3 surpassing GPT-4 in numerical QAs.
This paper proposes RialTo, a system for robustifying real-world imitation learning policies using reinforcement learning in “digital twin” simulation environments constructed from small amounts of real-world data. RialTo quickly scans and constructs digital twins of real-world environments and implements an “inverse distillation” procedure for bringing real-world demonstrations into simulated environments, with minimal human intervention and engineering.
Links
Well, that's a wrap for today. Tomorrow, I'll have a fresh episode of TechTonic Shifts for you. If you enjoy my writing and want to support my work, feel free to buy me a coffee.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation. LinkedIn appreciates your likes by making my articles available to more readers.
Signing off - Marco