GenAI Weekly — Edition 30
Your Weekly Dose of Gen AI: News, Trends, and Breakthroughs
Stay at the forefront of the Gen AI revolution with Gen AI Weekly! Each week, we curate the most noteworthy news, insights, and breakthroughs in the field, equipping you with the knowledge you need to stay ahead of the curve.
Best OCR Software in 2024 — A Tool Comparison & Evaluation Guide
OCR technology is essential in today's digital world, transforming scanned papers, PDFs, and images into editable, searchable text. This boosts productivity, especially in industries like finance, healthcare, legal, and education, where document processing is vital. The effectiveness of OCR directly affects workflows, data accuracy, and operational efficiency. As businesses embrace digital transformation, choosing the right OCR tool is crucial. This article reviews the top OCR software available in 2024.
We will compare:
1. Tesseract
2. PaddleOCR
3. Azure Document Intelligence
4. Amazon Textract
5. LLMWhisperer from Unstract
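Before diving into individual tools, it helps to have a way to score them against each other. A common approach in OCR evaluation guides is word-level accuracy against a ground-truth transcript. Here is a minimal, library-agnostic sketch (the engine outputs below are invented sample data, not results from the tools above):

```python
from difflib import SequenceMatcher

def word_accuracy(predicted: str, truth: str) -> float:
    """Ratio of ground-truth words correctly reproduced by an OCR engine."""
    pred_words, true_words = predicted.split(), truth.split()
    matcher = SequenceMatcher(None, pred_words, true_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(true_words), 1)

# Hypothetical outputs from two engines on the same scanned invoice.
truth = "Invoice 1042 due on 2024-09-30 total 1,250.00 USD"
ocr_outputs = {
    "engine_a": "Invoice 1042 due on 2024-09-30 total 1,250.00 USD",
    "engine_b": "lnvoice 1O42 due on 2024-09-30 total 1,250.00 USD",  # l/I and O/0 confusions
}
for name, text in ocr_outputs.items():
    print(name, round(word_accuracy(text, truth), 2))
```

Running each candidate engine over the same labeled test set and comparing scores like this is a reasonable first filter before weighing price, layout handling, and API features.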
OpenAI releases o1-preview and o1-mini
We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.
In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions. You can read more about this in our technical research post.
As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT-4o will be more capable in the near term.
But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.
[…]
Whom it’s for
These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.
[…]
OpenAI o1-mini
The o1 series excels at accurately generating and debugging complex code. To offer a more efficient solution for developers, we’re also releasing OpenAI o1-mini, a faster, cheaper reasoning model that is particularly effective at coding. As a smaller model, o1-mini is 80% cheaper than o1-preview, making it a powerful, cost-effective model for applications that require reasoning but not broad world knowledge.
Also see Learning to Reason with LLMs.
Notes on OpenAI’s new o1 chain-of-thought models
OpenAI released two major new preview models today: o1-preview and o1-mini (that mini one is not a preview)—previously rumored as having the codename “strawberry”. There’s a lot to understand about these models—they’re not as simple as the next step up from GPT-4o, instead introducing some major trade-offs in terms of cost and performance in exchange for improved “reasoning” capabilities.
[…] the models can better handle significantly more complicated prompts where a good result requires backtracking and “thinking” beyond just next token prediction.
I don’t really like the term “reasoning” because I don’t think it has a robust definition in the context of LLMs, but OpenAI have committed to using it here and I think it does an adequate job of conveying the problem these new models are trying to solve.
[…]
Most interesting is the introduction of “reasoning tokens”—tokens that are not visible in the API response but are still billed and counted as output tokens. These tokens are where the new magic happens.
Thanks to the importance of reasoning tokens—OpenAI suggests allocating a budget of around 25,000 of these for prompts that benefit from the new models—the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini! Both are increases over the gpt-4o and gpt-4o-mini models, which currently have a 16,384 output token limit.
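Since reasoning tokens are billed and counted as output tokens but never returned, the visible answer has to fit in whatever remains of the completion limit after the reasoning budget. A back-of-envelope helper using the limits quoted above (the function and dict names are illustrative, not part of OpenAI's API):

```python
# Output-token limits quoted for the o1 preview models.
O1_OUTPUT_LIMITS = {"o1-preview": 32_768, "o1-mini": 65_536}

def visible_output_budget(model: str, reasoning_reserve: int = 25_000) -> int:
    """Tokens left for the visible answer after reserving the suggested
    ~25,000-token budget for invisible reasoning tokens."""
    return max(O1_OUTPUT_LIMITS[model] - reasoning_reserve, 0)

print(visible_output_budget("o1-preview"))  # 7768
print(visible_output_budget("o1-mini"))     # 40536
```

The arithmetic makes the trade-off concrete: after the suggested reasoning reserve, o1-preview leaves room for a fairly short visible answer, which is presumably why the limits were raised so aggressively.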
One last interesting tip from that API documentation:
Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.
This is a big change from how RAG is usually implemented, where the advice is often to cram as many potentially relevant documents as possible into the prompt.
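In practice, "include only the most relevant information" means ranking candidate chunks and keeping just the top few, rather than concatenating everything the retriever returns. A minimal sketch, using keyword overlap as a crude stand-in for whatever relevance score your retriever actually produces (the sample documents are invented):

```python
def keyword_overlap(query: str, doc: str) -> float:
    """Crude relevance proxy: fraction of query words appearing in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def trim_context(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Keep only the k most relevant chunks instead of cramming in everything."""
    ranked = sorted(docs, key=lambda d: keyword_overlap(query, d), reverse=True)
    return ranked[:k]

docs = [
    "The o1 models bill hidden reasoning tokens as output tokens.",
    "Paris is the capital of France.",
    "Reasoning token budgets of about 25,000 are suggested for o1 prompts.",
]
print(trim_context("how are reasoning tokens billed", docs, k=2))
```

With embedding-based retrieval the scoring function changes, but the principle is the same: cap `k` low for these models instead of maximizing context.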
My take on this: The best summary one can read anywhere.
Mistral releases Pixtral 12B, its first multimodal model
French AI startup Mistral has released its first model that can process images as well as text.
Called Pixtral 12B, the 12-billion-parameter model is about 24GB in size. Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.
Built on one of Mistral’s text models, Nemo 12B, the new model can answer questions about an arbitrary number of images of an arbitrary size given either URLs or images encoded using base64, the binary-to-text encoding scheme. Similar to other multimodal models such as Anthropic’s Claude family and OpenAI’s GPT-4o, Pixtral 12B should — at least in theory — be able to perform tasks like captioning images and counting the number of objects in a photo.
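Supplying an image as base64 simply means encoding the raw bytes as text and embedding them in the request payload. A minimal sketch of that encoding step — the message structure shown follows the data-URL convention common to multimodal chat APIs, not a confirmed Pixtral endpoint, and the image bytes are a placeholder:

```python
import base64

def image_to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Placeholder bytes (the 8-byte PNG signature), not a real image.
fake_png = b"\x89PNG\r\n\x1a\n"

# Hypothetical message payload mixing text with the encoded image.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "How many objects are in this photo?"},
        {"type": "image_url", "image_url": image_to_data_url(fake_png)},
    ],
}
print(message["content"][1]["image_url"][:30])
```

Base64 inflates payload size by roughly a third versus the raw bytes, which is why URL-based image input is usually preferred for large files.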
Available via a torrent link on GitHub and AI and machine learning development platform Hugging Face, Pixtral 12B can be downloaded, fine-tuned and used under an Apache 2.0 license without restrictions. (A Mistral spokesperson confirmed the license being applied to Pixtral 12B via email.)
This writer wasn’t able to take Pixtral 12B for a spin, unfortunately — there weren’t any working web demos at the time of publication. In a post on X, Sophia Yang, head of Mistral developer relations, said Pixtral 12B will be available for testing on Mistral’s chatbot and API-serving platforms, Le Chat and La Plateforme, soon. It’s unclear which image data Mistral might have used to develop Pixtral 12B.
My take on this: Good to see multi-modal open weight models!
Google Illuminate: Transform your content into engaging AI-generated audio discussions
My take on this: Google Illuminate is just insanely good. I’d rate the generated conversations at nearly podcast quality.
How few-shot learning with Google’s Prompt Poet can supercharge your LLMs
Prompt engineering, the discipline of crafting just the right input to a large language model (LLM) to get the desired response, is a critical new skill for the age of AI. It’s helpful for even casual users of conversational AI, but essential for builders of the next generation of AI-powered applications.
Enter Prompt Poet, the brainchild of Character.ai, a conversational LLM startup recently acquired by Google. Prompt Poet simplifies advanced prompt engineering by offering a user-friendly, low-code template system that manages context effectively and seamlessly integrates external data. This allows you to ground LLM-generated responses to a real-world data context, opening up a new horizon of AI interactions.
Prompt Poet shines for its seamless integration of “few-shot learning,” a powerful technique for rapid customization of LLMs without requiring complex and expensive model fine-tuning. This article explores how few-shot learning with Prompt Poet can be leveraged to deliver bespoke AI-driven interactions with ease and efficiency.
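Stripped of the templating layer, few-shot learning is just prepending worked examples to the prompt so the model imitates the pattern; Prompt Poet packages this into YAML/Jinja templates with data injection. A library-agnostic sketch of the underlying message construction (the sentiment-classification examples are invented):

```python
def build_few_shot_messages(examples, query,
                            system="Classify the sentiment as positive or negative."):
    """Assemble a chat prompt: system rule, worked examples, then the real query."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

examples = [
    ("The battery lasts all day!", "positive"),
    ("It broke after a week.", "negative"),
]
msgs = build_few_shot_messages(examples, "Setup was painless.")
print(len(msgs))  # 6: one system message, two example pairs, one query
```

A template system like Prompt Poet earns its keep once the example set is stored externally and selected per request, rather than hard-coded as here.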
My take on this: Prompt engineering isn’t really all that widespread. This will help.
Tell Replit's AI Agent Your App Idea, and It'll Code and Deploy It for You
Replit has launched an AI agent capable of building entire applications from scratch. This isn't just another copilot coding assistant – it's much closer to an intern software developer that can understand your vision and help bring it to life.
But what exactly is an AI agent, and why is this such a big deal?
An AI agent is a more autonomous and proactive system compared to current AI assistants like ChatGPT or Claude. While today's AI assistants respond to specific queries or tasks, AI agents operate with a higher degree of independence, making decisions and executing complex tasks without constant user input. They can learn and adapt over time, improving their actions based on feedback and new information.
Replit's AI agent takes this concept and applies it to the world of software development. It can reason through a task and create its own steps to complete it—such as writing code, setting up environments, and managing deployments.
"We've crossed a threshold," says Replit CEO Amjad Masad. "This isn't about AI replacing developers. It's about supercharging human creativity and making software creation accessible to everyone."
My take on this: I would wait for more in-depth reviews from industry experts.
Sebastian Raschka’s Build a Large Language Model (From Scratch) is now available
Learn how to create, train, and tweak large language models (LLMs) by building one from the ground up! In Build a Large Language Model (from Scratch) bestselling author Sebastian Raschka guides you step by step through creating your own LLM. Each stage is explained with clear text, diagrams, and examples. You’ll go from the initial design and creation, to pretraining on a general corpus, and on to fine-tuning for specific tasks.
Build a Large Language Model (from Scratch) takes you inside the AI black box to tinker with the internal systems that power generative AI. As you work through each key stage of LLM creation, you’ll develop an in-depth understanding of how LLMs work, their limitations, and their customization methods. Your LLM can be developed on an ordinary laptop, and used as your own personal assistant.
My take on this: Highly recommended. It’s a great resource to gain key fundamental insights.
If you've made it this far and follow my newsletter, please consider exploring the platform we're currently building: Unstract, a no-code LLM platform that automates unstructured data workflows.