I/O takeaways: Google closed some of the gap. But only some. (AI Current #21)

The Current Wave

Google packed over two dozen AI announcements into this year's I/O conference. Some of them are helpful features that are available today or very soon, while others are a bit more… aspirational, let's say. Google is a big and slow company with the inertia of an incumbent, but once they get going, few competitors can stop them. And they got going…

To make sense of all the announcements, I’ve grouped them into three categories: foundation models, productivity AI, and aspirational products. There’s also Search and the Search Generative Experience, which isn’t new but will soon roll out US-wide, ads included.

Foundation models

Gemini: better, faster, cheaper

Gemini 1.5 Pro received an update with an even longer, 2M token context window and better multimodal performance, bumping the MMLU benchmark score to 85.9%, on par with the very best models.

But the real news is the release of Gemini 1.5 Flash, a faster and more compact model that outperforms Mixtral 8x22B and performs roughly at the level of Anthropic's Claude 3 Sonnet. So why is Flash a big deal? First, despite being positioned as a small-but-fast model, it still offers a 1M token context window, which greatly helps with many use cases. Second, it is incredibly cheap: exactly 10 percent of the price of Gemini 1.5 Pro and around 80 percent cheaper than Claude 3 Sonnet (again, a comparable model).

Google can offer inference cheaper than anyone else on the market thanks to its infrastructure advantage and the fact that it’s much less reliant on Nvidia. Indeed, this pricing is so aggressive that it might drive some companies out of the market completely.
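
If you want to try Flash yourself, it is already exposed through the Gemini API. Here's a minimal sketch using Google's google-generativeai Python SDK; the API key and the input file are placeholder assumptions on my part, but the model name and calls match the SDK as of this writing.

```python
# Minimal sketch: calling Gemini 1.5 Flash via the google-generativeai SDK.
# Install with: pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; create a key in Google AI Studio

model = genai.GenerativeModel("gemini-1.5-flash")

# The 1M-token context window means entire documents fit in a single prompt.
document = open("refinancing_offer.txt").read()  # hypothetical input file
response = model.generate_content(
    "Summarize this offer in three bullet points:\n\n" + document
)
print(response.text)
```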

Google I/O also saw a Gemini Nano announcement, although it got a bit lost in the noise. The smallest Gemini will receive an update later this year that will make it multimodal, a huge deal for visually impaired people. The model will be able to describe unlabeled images on the phone's screen, and since inference runs locally, it should work with minimal latency. Historically, the iPhone has been the preferred choice for accessibility, but this might change soon. Gemini Nano will also help with screening spam calls.

Ah, and there are the Gemini custom chatbots ("Gems"). Interestingly, this product category hasn't been very successful so far. OpenAI tried it with GPTs, Poe with bots, and the results are underwhelming and overwhelming at the same time: they don't add much to the experience, yet there are so many of them that it's impossible to even pick one.

Open source models

In a few weeks, we will get an update to Gemma, Google's open source model. Only preliminary evaluation scores are available so far, but they look very promising: we are talking about a 27B parameter model with performance on par with Llama 3 70B!

Gemma is not multimodal, but it gets a companion, PaliGemma, the first open source vision-language model from Google. And this one is available today (here's the Hugging Face demo page).
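
Since PaliGemma ships with day-one Hugging Face transformers support, trying it locally takes only a few lines of Python. A minimal sketch, assuming you have accepted the model license on Hugging Face; the image URL is a placeholder, and "caption en" is one of the task prefixes the mix checkpoints understand.

```python
# Minimal sketch: image captioning with PaliGemma via Hugging Face transformers.
# Install with: pip install transformers pillow requests
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # one of the released checkpoints
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder URL; any RGB image works.
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# "caption en" asks the mix checkpoint for an English caption.
inputs = processor(text="caption en", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)

generated = output[0][inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.decode(generated, skip_special_tokens=True))
```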

It is fantastic to see Meta and Google seriously competing in the open weights LLM space. It is time for Microsoft to release a bigger version of Phi-3: the current largest, a 14B model, was released just three weeks ago (covered in AI Current #17) but already looks outdated.

Images and videos

Imagen 3 is a big jump over v2, with much better text rendering and superb photorealism. It looks like a worthy competitor to Midjourney. There is a waitlist for now, but it seems ready for general release in the not-too-distant future.

Veo will be a competitor to OpenAI’s Sora, generating longer, 60+ second videos in 1080p from text prompts. The problem is that both Veo and Sora are in private beta, accessible only to a handful of selected creators. The examples on the Veo site undoubtedly look great (maybe a bit less great than the Sora examples?), but we have no idea how cherry-picked they are.

We know, however, what Google's long-term strategy is with text-to-video. They say it themselves: "In the future, we'll also bring some of Veo's capabilities to YouTube Shorts and other products." Even if Veo turns out not to be as good as Sora, Google's reach in video will be hard to beat…

Productivity AI

If you live in a Google universe using Gmail, Google Docs, Drive, and so on, you are in for a treat. Yes, Google knows everything about you, so it's time to make use of it! The new productivity features announced at I/O are not mind-blowing or even unique to Google. But they work beautifully together to create something that's much greater than the sum of its parts.

Any half-decent LLM can summarize a document or draft an email. But it's not the same as telling Gemini: "Hey, please collect all the house refinancing offers from my emails; some are in attachments. Summarize each of them in a few bullet points, then put the whole thing in an email and send it to my wife." It's night and day when it comes to user experience.

The main way to interact with AI in Workspace is the new side panel, where users can chat with Gemini to summarize, analyze, and generate content, directly within Gmail, Docs, Sheets, and other Workspace apps. The panel offers contextually relevant prompts based on the content, helping users get started quickly. Gemini can also help with organizing and tracking information: for example, it can recognize attachments like receipts in emails and prompt users to organize and track them in Drive and Sheets automatically.

Again, none of this is magical from an AI perspective — even GPT-3.5 can do all the different bits, although the 1M token context window definitely helps. The differentiator is the user experience, and it's great to see Google managed to create something that flows this naturally. UX hasn't been one of the company's strengths recently, to put it mildly…

The other differentiator is scale. I can't emphasize enough how important infrastructure is: no company other than Google can run these long-context inference tasks at billion-user scale. Maybe Microsoft. Maybe. Bing and Copilot have been down for hours as I write this…

Project Astra: a disappointment

Oh well, now we know why OpenAI held their event (which I covered in AI Current #20) just the day before Google I/O.

Google’s Project Astra is essentially the same as the next generation of ChatGPT, powered by GPT-4o. It can interpret the real world using a camera and it can converse in natural, spoken language.

There is one huge difference, though.

OpenAI did a live demo of the new ChatGPT app. Sure, it came with some caveats, like the wired connection for the iPhone on stage (latency can ruin the experience), and it will take a couple more weeks until it goes live. Google showed a two-minute video. They did the same with the original Gemini launch, and let’s just say the model didn’t live up to the hype. At all.

Project Astra is a disappointment not because of what we see in the promo video. It looks great! The problem is that OpenAI is still months ahead of Google when it comes to bleeding-edge AI. Which means Microsoft is still months ahead…

Google has closed some of that gap with these announcements. Gemini is a great model, Gemma looks like a strong open source contender, Veo is exciting, and the Workspace AI features are very well thought-out.

This should be enough to keep their user base, but to expand? Microsoft just made Windows great again (more on that in the next issue!), and Apple is still a big unknown until WWDC in June…


Thanks for reading the AI Current!

I publish this newsletter twice a week on Tuesdays and Fridays at the same time on Substack and on LinkedIn.

Don’t be a stranger! Hit me with a LinkedIn DM or say hi in the Substack chat.

I work at Appen, one of the world’s best AI training data companies. However, I don’t speak for Appen, and nothing in this newsletter represents the views of the company.
