GenAI Weekly — Edition 13

Your Weekly Dose of Gen AI: News, Trends, and Breakthroughs

Stay at the forefront of the Gen AI revolution with Gen AI Weekly! Each week, we curate the most noteworthy news, insights, and breakthroughs in the field, equipping you with the knowledge you need to stay ahead of the curve.

Click subscribe to be notified of future editions


OpenAI announces GPT-4o, a new flagship model that can reason across audio, vision, and text in real time.

From OpenAI:

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
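
For a concrete sense of what that three-stage pipeline looks like, here is a minimal sketch assembled from the public OpenAI API (speech-to-text, a text-only model, then text-to-speech). It is only an illustration of the architecture described above, not ChatGPT's actual Voice Mode code, and the model choices are assumptions.

# Illustrative three-stage voice pipeline: transcribe -> reason over text -> speak.
# Not ChatGPT's Voice Mode implementation; the model names here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voice_turn(audio_path: str, out_path: str = "reply.mp3") -> str:
    # Stage 1: transcribe audio to text (tone, multiple speakers, and
    # background noise are lost at this step).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Stage 2: a text-only model produces the reply from the transcript alone.
    chat = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = chat.choices[0].message.content

    # Stage 3: convert the reply text back to audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    with open(out_path, "wb") as out:
        out.write(speech.content)  # binary audio response
    return reply_text

Each hop adds latency and drops information, which is exactly the bottleneck GPT-4o removes by handling audio natively.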

It’s “omni” because that three-stage pipeline is now gone: the model directly ingests and generates audio and images alongside text, which is closer to how humans process information. On availability:

GPT-4o is our latest step in pushing the boundaries of deep learning, this time in the direction of practical usability. We spent a lot of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, we’re able to make a GPT-4 level model available much more broadly. GPT-4o’s capabilities will be rolled out iteratively (with extended red team access starting today).
GPT-4o’s text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.
Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.
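
If you want to kick the tires as a developer, here is a minimal sketch of a GPT-4o call through the Chat Completions API with mixed text and image input (audio and video are not yet exposed in the API, per the quote above). The prompt and image URL are placeholders.

# Minimal sketch: GPT-4o as a text-and-vision model in the Chat Completions API.
# The prompt and image URL are placeholders, not from the announcement.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)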

Watch the launch video here if you haven’t already.


Google announces Gemini 1.5 Pro improvements and a new 1.5 Flash model

From the Google blog:

  • Gemini 1.5 Pro: We made a series of quality improvements across key use cases, such as translation, coding, reasoning and more. You’ll see these updates in the model starting today, which should help you tackle even broader and more complex tasks.
  • Gemini 1.5 Flash: This smaller Gemini model is optimized for narrower or high-frequency tasks where the speed of the model’s response time matters the most.
  • Availability: Both models are available today in more than 200 countries and territories in preview and will be generally available in June.
  • Natively multimodal with long context: Both 1.5 Pro and 1.5 Flash come with our 1 million token context window and allow you to interleave text, images, audio and video as inputs. To get access to 1.5 Pro with a 2 million token context window, join the waitlist in Google AI Studio or in Vertex AI for Google Cloud customers.
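
For the curious, here is a minimal sketch of interleaving text and an image in a single Gemini 1.5 request using the google-generativeai Python SDK; the API key, file name, and prompt are placeholders, and longer audio or video inputs are attached the same way after uploading them through the SDK's File API.

# Minimal sketch: one multimodal Gemini 1.5 request mixing text and an image.
# The API key, file name, and prompt are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

# "gemini-1.5-flash" targets fast, high-frequency tasks; swap in
# "gemini-1.5-pro" for the larger model with the same call pattern.
model = genai.GenerativeModel("gemini-1.5-flash")

image = Image.open("chart.png")
response = model.generate_content(
    ["What trend does this chart show, in two sentences?", image]
)
print(response.text)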

See also: 100 things announced at I/O 2024


Google announces Veo, a video generation model

From the DeepMind website:

Veo is our most capable video generation model to date. It generates high-quality, 1080p resolution videos that can go beyond a minute, in a wide range of cinematic and visual styles.
It accurately captures the nuance and tone of a prompt, and provides an unprecedented level of creative control — understanding prompts for all kinds of cinematic effects, like time lapses or aerial shots of a landscape.
Our video generation model will help create tools that make video production accessible to everyone. Whether you're a seasoned filmmaker, aspiring creator, or educator looking to share knowledge, Veo unlocks new possibilities for storytelling, education and more.
Over the coming weeks some of these features will be available to select creators through VideoFX, a new experimental tool at labs.google. You can join the waitlist now.
In the future, we’ll also bring some of Veo’s capabilities to YouTube Shorts and other products.

Sora is no longer the only game in town, but the fact that Veo was announced after Sora should be a PR headache for Google, which now looks like it is playing catch-up.


UAE’s Technology Innovation Institute announces Falcon 2

From TII:

The Technology Innovation Institute (TII), a leading global scientific research center and the applied research pillar of Abu Dhabi’s Advanced Technology Research Council (ATRC), today launched a second iteration of its renowned large language model (LLM) – Falcon 2. Within this series, it has unveiled two groundbreaking versions: Falcon 2 11B, a more efficient and accessible LLM trained on 5.5 trillion tokens with 11 billion parameters, and Falcon 2 11B VLM, distinguished by its vision-to-language model (VLM) capabilities, which enable seamless conversion of visual inputs into textual outputs. While both models are multilingual, notably, Falcon 2 11B VLM stands out as TII's first multimodal model – and the only one currently in the top tier market that has this image-to-text conversion capability, marking a significant advancement in AI innovation.
Tested against several prominent AI models in its class among pre-trained models, Falcon 2 11B surpasses the performance of Meta’s newly launched Llama 3 with 8 billion parameters (8B), and performs on par with Google’s Gemma 7B at first place (Falcon 2 11B: 64.28 vs Gemma 7B: 64.29), as independently verified by Hugging Face, a US-based platform hosting an objective evaluation tool and global leaderboard for open LLMs. More importantly, Falcon 2 11B and 11B VLM are both open-source, empowering developers worldwide with unrestricted access. In the near future, there are plans to broaden the Falcon 2 next-generation models, introducing a range of sizes. These models will be further enhanced with advanced machine learning capabilities like 'Mixture of Experts' (MoE), aimed at pushing their performance to even more sophisticated levels.
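
Since the weights are released openly, trying the text model locally is straightforward. Here is a minimal sketch with Hugging Face transformers; the repository id tiiuae/falcon-11B is assumed from TII's Hugging Face organization, and an 11B-parameter model needs roughly 22 GB of GPU memory at bf16 (or quantization for less).

# Minimal sketch: generating text with Falcon 2 11B via Hugging Face transformers.
# The repo id "tiiuae/falcon-11B" is assumed from TII's Hugging Face organization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-11B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "The three most promising uses of open large language models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
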

Mozilla compares text embedding models

From the Mozilla blog:

Best overall: Salesforce/SFR-Embedding-Mistral (llamafile link). Why does it work so well? They carried out additional, multi-task finetuning on top of intfloat/e5-mistral-7b-instruct using the training datasets of several tasks in the Massive Text Embedding Benchmark (MTEB). For more details, see their blog post.
Best overall, commercial-friendly license: intfloat/e5-mistral-7b-instruct (llamafile link). Why does it work so well? Synthetic data generation. The authors finetune mistral-7b-instruct on various synthetic text embedding datasets generated by another LLM. For more information, see their paper.
Best small: mixedbread-ai/mxbai-embed-large-v1 (llamafile link). Why does it work so well? Data building and curation. The authors scraped and curated 700 million text pairs and trained a BERT model using contrastive training. Then, they finetuned the model on an additional 30 million text triplets using AnglE loss. For more information, see their blog post.
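
To make the comparison concrete, here is a minimal sketch of using the "best small" pick above for semantic search with the sentence-transformers library; the documents and query are placeholders.

# Minimal sketch: semantic search with one of the models Mozilla evaluated
# (the "best small" pick). The documents and query are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

docs = [
    "GPT-4o accepts audio, vision, and text inputs.",
    "Veo generates 1080p video from text prompts.",
]
query = "Which model can handle speech input?"

doc_embeddings = model.encode(docs)      # one vector per document
query_embedding = model.encode(query)    # one vector for the query

# Cosine similarity between the query and each document; higher means closer.
print(util.cos_sim(query_embedding, doc_embeddings))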

Check out their interesting llamafile project on GitHub, which lets you ship LLM apps as a single file.


For the extra curious

