AI/ML news summary: week 32
Marco van Hurne
Architect of AI solutions that improve business efficiency and client engagement.
Here are the articles, guides, and news about AI for week 32. I read tons of RSS feeds and blogs, so you won't have to scour the internet yourself for the latest AI news this week.
This week was a game-changer for LLM inference.
LLM inference is the process of using a pre-trained LLM to make predictions or generate text based on new input.
So, what is the big news all about?
"Cutting costs for reused input tokens with context caching".
I know, it sounds technical and utterly useless for us normal folk.
But the takeaway is simple: costs have plummeted.
With DeepSeek V2, reused input token inference is now 4,300 times cheaper than GPT-3 (davinci-002) from just two years ago. And performance? It's soaring. The MMLU benchmark score jumped from 60% to 79%, and the maximum context window grew 60-fold. To put this in perspective, at the height of Moore's Law, the cost per transistor dropped 4,000-fold in 14 years. But transistors didn't get smarter.
This kind of progress? It means a big global impact is just around the corner.
DeepMind was busy too. They followed Meta's lead with a burst of activity. They launched the experimental Gemini 1.5 Pro model, which finally put DeepMind at the top of the LMSYS Chatbot Arena. They've caught up in the LLM race, though they're still behind on the LiveBench and ZeroEval benchmarks. They also announced that the Flash model is set to become five times cheaper next week, at half the cost of GPT-4o-mini. This drop likely reflects advances in distillation and competitive pressure from Llama 3.1. Finally, they introduced a small but impressive 2B Gemma model that benefits from model distillation; it joins the LLM builder's toolkit right after Llama 3.1.
Within 24 hours of Gemini Flash's price announcement, China-based DeepSeek made waves. They launched a new Context Caching on Disk API. This technology slashes the cost of handling reused input tokens by 90%, down to $0.014 per million tokens, ten times cheaper than GPT-4o-mini. The caching mechanism stores input content that is likely to be reused in a distributed disk array; when the same input comes up again, it is retrieved from the cache instead of being recomputed. That not only cuts API costs but also reduces latency, from 13 seconds to just 500 milliseconds for large 128k prompts. This makes LLMs more viable for tasks like multi-step data analysis, repeated code base queries, and multi-turn conversations. DeepMind's Gemini also offers context caching, but its price cut isn't as steep, and it isn't automatic. However, the imminent 5x reduction in Gemini Flash pricing will help.
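To make that concrete, here is a minimal sketch of how a client would lean on automatic prefix caching, assuming DeepSeek's OpenAI-compatible chat endpoint. The file name, the prompts, and what exactly shows up in `usage` are illustrative assumptions, not taken from DeepSeek's docs.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint; the cache itself lives on the server side.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

long_context = open("codebase_dump.txt").read()  # large prefix reused across requests

for question in ["Where is the retry logic implemented?",
                 "Which module parses the config files?"]:
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            # Identical leading content across calls -> cache hit from the second call on
            {"role": "system", "content": long_context},
            # Only this suffix changes, so only it needs to be recomputed
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
    print(resp.usage)  # on a cache hit, most prompt tokens should be billed at the cached rate
```

The design point is simple: keep the stable, reused material at the front of the prompt and the changing question at the end, so successive calls share as large an identical prefix as possible.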
On another front, new research on inference-time scaling laws suggests we can boost LLM performance. How? By increasing the number of inference steps. This approach, called repeated sampling, lets weaker models outperform stronger ones in some tasks. For instance, DeepSeek-Coder-V2-Instruct, with 250 attempts, achieves a 56% success rate on SWE-bench Lite. It beats the 43% success rate of a single attempt using more capable models like GPT-4o. The effectiveness hinges on two factors: coverage (solving more problems across attempts) and precision (finding the correct solution among many).
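As a rough illustration of what repeated sampling looks like in code, here is a sketch; `generate` and `verify` are hypothetical callables standing in for the model API and an automatic checker such as the benchmark's unit tests.

```python
def solve_with_repeated_sampling(problem, generate, verify, k=250):
    """Draw up to k independent samples from a (possibly weaker) model and
    return the first candidate that passes verification."""
    for attempt in range(1, k + 1):
        candidate = generate(problem, temperature=0.8)  # sample diversely for coverage
        if verify(problem, candidate):                  # precision: pick out a correct one
            return candidate, attempt
    return None, k  # no sample passed; the problem stays unsolved

# Coverage is the fraction of problems solved by at least one of the k samples;
# precision is how reliably a correct sample can be identified, here via the verifier.
```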
The pace of progress in AI is friggin' accelerating, and it's exhilarating.
Each breakthrough not only advances the field but also sets the stage for the next leap. The developments we're seeing in context caching, cost reduction, and inference-time scaling laws are just the beginning. The future of AI looks more promising—and more economically viable—than ever.
Some more news
Let's break down the latest news in the AI world. My Lord, it has been a busy week!
Musical Chairs at the Leading AI Labs?
OpenAI is facing a talent drain. Co-founder John Schulman is leaving to join Anthropic. He's returning to hands-on technical work with a focus on AI alignment. Greg Brockman announced he's taking a sabbatical from OpenAI until the end of the year. Meanwhile, Google DeepMind has acquired a 30-person model training team from Character AI. This brings transformer co-inventor Noam Shazeer back to Google. The $2.5 billion deal looks like a strategic move. It rewards remaining Character employees and investors as Character shifts to building on external foundation models. It highlights the challenge of raising enough capital to train competitive foundation models, even for niche applications. And it underscores the value of key AI talent.
DeepSeek API Launches Context Caching on Disk
DeepSeek has launched Context Caching on Disk technology. This new approach automatically stores frequently referenced contexts on distributed storage. It slashes API costs for reused inputs by up to 90%. For a 128K prompt whose content is heavily reused, first-token latency drops from 13 seconds to just 500 milliseconds.
Google Released Gemma 2 2B
DeepMind introduced Gemma 2 2B, an impressive small model. It fits on a smartphone yet offers GPT-3.5 levels of performance. It outperforms both GPT-3.5 and Mixtral in the Chatbot Arena, and it scored 51.3 on 5-shot MMLU. Gemma achieves this via model distillation from a larger model.
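For anyone who wants to kick the tires locally, a minimal sketch with Hugging Face transformers looks like this; the model id `google/gemma-2-2b-it` (the instruction-tuned variant) and the generation settings are my assumptions, so check the model card before relying on them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # assumed Hub id; the weights are gated, so accept the license first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # requires `accelerate`; falls back to CPU without a GPU
)

prompt = "Explain context caching for LLM APIs in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```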
European Artificial Intelligence Act Comes Into Force
The European Artificial Intelligence Act (AI Act), the world's first comprehensive AI regulation, is now in force. The AI Act aims to ensure that AI developed and used in the EU is trustworthy, with safeguards to protect fundamental rights. Most rules will apply from August 2, 2026, but the rules for general-purpose AI models kick in earlier, 12 months after the Act entered into force.
Stable Diffusion Creators Launch Black Forest Labs, Secure $31M for FLUX.1 AI Image Generator
Black Forest Labs, founded by the creators of Stable Diffusion, launched the FLUX.1 text-to-image model suite. FLUX.1 has 12 billion parameters. It uses a hybrid architecture of multimodal and parallel diffusion transformer blocks. It comes in three versions: the closed-source FLUX.1 [pro] via API, the open-weight FLUX.1 [dev] for non-commercial use, and FLUX.1 [schnell], a faster version under the Apache 2.0 license for personal and local development.
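If you want to poke at the open-weight variants, a minimal sketch with Hugging Face diffusers would look roughly like this; the pipeline class, the model id, and the few-step/no-guidance settings for [schnell] are assumptions on my part, so verify them against the official repository.

```python
import torch
from diffusers import FluxPipeline  # assumed to exist in a recent diffusers release

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",  # assumed Hub id for the Apache 2.0 variant
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps fit the 12B model on a single consumer GPU

image = pipe(
    "a black forest cuckoo clock rendered as a circuit board, product photo",
    num_inference_steps=4,  # [schnell] is distilled to work in very few steps
    guidance_scale=0.0,     # distilled model; classifier-free guidance is effectively off
).images[0]
image.save("flux_schnell_demo.png")
```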
Zyphra Released Zamba2-2.7B: An Efficient and Faster Small Language Model
Zyphra introduced Zamba2-2.7B, a hybrid model combining Mamba2 (a state space model) and transformer technology. It shows significant efficiency and performance improvements. It was trained on a proprietary dataset of about 3 trillion tokens and matches the performance of larger models like Zamba1-7B. This release follows several other models demonstrating that hybrid Mamba models can compete with transformers at large training compute budgets, in the range of 2–4 billion parameters and roughly 3 trillion training tokens. We're eager to see if these can scale to hundreds of billions of parameters!
Cohere Released Cohere Prompt Tuner For Prompt Optimization
Cohere launched Prompt Tuner, a tool for optimizing and evaluating prompts for generative language use cases. The evaluation criteria are customizable and guide an instruction-generation model to propose new prompts. It's available in beta on the Cohere Dashboard.
AI Investment Driving Capex Growth at US Cloud Leaders
Major US cloud platforms are ramping up capital expenditures, driven by GPU purchases and data center investments. Amazon's capex grew to $17.6 billion (+54% year-on-year), Microsoft's to $13.9 billion (+55%), Google's to $13.2 billion (+91%), and Meta's to $8.5 billion (+36%). This surge is likely due to investments in inference data centers and training clusters for LLMs, as well as other GPU-heavy AI workloads like recommender systems.
Short stuff to keep you learning
This post addresses 11 critical questions about Llama 3.1 for managers and leaders, such as why the open-source nature of Llama 3.1 is beneficial compared to closed-source models, the available integrations with public cloud providers, deployment infrastructure, the advantages of Llama 3.1 in terms of performance, cost, and potential cost savings, and more.
The latest and highest-scoring frontier models are almost indistinguishable from the general user’s point of view. At the same time, the cost of training frontier AI models keeps rising while improvements are marginal. This blog observes various frontier AI models to solidify the trend that some customers may only need small models, some will need big models, and many will want to combine both in various ways.
In AI, we have newer and improved models, plenty of investment, improving infrastructure, and falling costs, but we haven't seen proportionate revenue and productivity gains. This essay highlights the major AI developments between March and June to underline how AI engineers can bridge the gap from capabilities to product.
The article highlights how present-day models can help you learn and automate boring tasks. The author, a programmer and research scientist studying ML, provides a list of 50 conversations he has had with different large language models to improve his ability to perform research and work on coding side projects.
This article focuses on the current problems with prompting. It covers the basics, some "advanced" techniques, and, most importantly, busts the myths around prompting. As the authors put it, "Despite all the hype around 'advanced' prompting techniques, it's really just about telling the model what you want in plain language."
Tools
Research papers
This official paper introduces Gemma 2, a new addition to the Gemma family of lightweight open models, ranging in scale from 2 billion to 27 billion parameters. This version makes several technical modifications to the Transformer architecture, such as interleaving local-global attentions and grouped-query attention.
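To picture what interleaving local and global attention means, here is a toy sketch of the attention masks involved; the alternation pattern and the window size are illustrative choices, not necessarily Gemma 2's exact configuration.

```python
import torch

def global_causal_mask(seq_len: int) -> torch.Tensor:
    # Standard causal attention: each token attends to all earlier tokens.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    # Local attention: each token only sees the most recent `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

def mask_for_layer(layer_idx: int, seq_len: int, window: int = 4096) -> torch.Tensor:
    # Interleaving: alternate layers switch between cheap local and full global attention.
    return sliding_window_mask(seq_len, window) if layer_idx % 2 == 0 else global_causal_mask(seq_len)
```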
This paper introduces MoMa, a modality-aware mixture-of-experts (MoE) architecture for pre-training mixed-modal, early-fusion language models. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing.
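The core idea of a modality-aware MoE is that text tokens and image tokens are routed within separate expert pools instead of competing for the same experts. A toy sketch follows (top-1 routing; the layer shapes and dimensions are made up for illustration, not MoMa's actual design):

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Toy modality-aware MoE layer: text and image tokens each get their own expert pool."""

    def __init__(self, d_model: int = 512, n_text_experts: int = 4, n_image_experts: int = 4):
        super().__init__()
        self.text_experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_text_experts))
        self.image_experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_image_experts))
        self.text_router = nn.Linear(d_model, n_text_experts)
        self.image_router = nn.Linear(d_model, n_image_experts)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]; is_image: [num_tokens] bool marking image tokens
        out = torch.empty_like(x)
        for mask, experts, router in (
            (~is_image, self.text_experts, self.text_router),
            (is_image, self.image_experts, self.image_router),
        ):
            tokens = x[mask]
            if tokens.shape[0] == 0:
                continue
            choice = router(tokens).argmax(dim=-1)  # top-1 expert per token, within its modality
            routed = torch.stack([experts[int(e)](t) for t, e in zip(tokens, choice)])
            out[mask] = routed
        return out
```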
This paper introduces MindSearch, which mimics human minds in web information seeking and integration and can be instantiated with a simple LLM-based multi-agent framework. MindSearch demonstrates significant improvement in response quality in terms of depth and breadth on both closed-set and open-set QA problems.
This research evaluated LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities), while examining differences within and between LLMs in medical aspects and comparing their performance to humans. It tested Claude 3 and GPT-4 against six medical experts and found that both LLMs excelled more in semantic than numerical QAs, with Claude 3 surpassing GPT-4 in numerical QAs.
This paper proposes RialTo, a system for robustifying real-world imitation learning policies using reinforcement learning in “digital twin” simulation environments constructed from small amounts of real-world data. RialTo quickly scans and constructs digital twins of real-world environments and implements an “inverse distillation” procedure for bringing real-world demonstrations into simulated environments, with minimal human intervention and engineering.
Links
Well, that's a wrap for today. Tomorrow, I'll have a fresh episode of TechTonic Shifts for you. If you enjoy my writing and want to support my work, feel free to buy me a coffee.
Think a friend would enjoy this too? Share the newsletter and let them join the conversation. LinkedIn appreciates your likes by making my articles available to more readers.
Signing off - Marco