GenAI Weekly — Edition 23
Your Weekly Dose of Gen AI: News, Trends, and Breakthroughs
Stay at the forefront of the Gen AI revolution with Gen AI Weekly! Each week, we curate the most noteworthy news, insights, and breakthroughs in the field, equipping you with the knowledge you need to stay ahead of the curve.
Python Libraries to Extract Table from PDF
Extracting tables from PDF files is a complex task due to the diverse structures and layouts of PDFs. Unlike simple text files, PDFs are designed for consistent visual presentation, making it challenging to automatically extract structured data like tables.
The varied appearance, construction, and formatting of tables within PDFs further complicate accurate extraction. Python, with its robust libraries, offers powerful solutions for this challenge. Leveraging Python for table extraction from PDFs provides several advantages, including flexibility, automation capabilities, and support for multiple PDF formats.
Several Python libraries have been developed specifically to facilitate table extraction, each with its own features and trade-offs.
In our latest article, Nuno Bispo dives deep into four of these libraries, comparing what each can do, how well it works, and how to extract a table from a PDF accurately.
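To give a flavor of what this kind of extraction looks like in practice, here is a minimal sketch using pdfplumber, one commonly used option (it may or may not be among the four libraries covered in the article); the file name is a placeholder:

```python
# Minimal sketch of table extraction with pdfplumber (illustrative only;
# "sample.pdf" is a placeholder, and the article may cover other libraries).
import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        # extract_tables() returns each detected table as a list of rows,
        # where each row is a list of cell strings (None for empty cells).
        for table in page.extract_tables():
            print(f"Table found on page {page_number}:")
            for row in table:
                print(row)
```

Each library exposes a different API and handles ruled, borderless, and multi-page tables with varying success, which is the kind of difference the article's comparison focuses on.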
Zuckerberg: Open Source AI Is the Path Forward
In the early days of high-performance computing, the major tech companies of the day each invested heavily in developing their own closed source versions of Unix. It was hard to imagine at the time that any other approach could develop such advanced software. Eventually though, open source Linux gained popularity – initially because it allowed developers to modify its code however they wanted and was more affordable, and over time because it became more advanced, more secure, and had a broader ecosystem supporting more capabilities than any closed Unix. Today, Linux is the industry standard foundation for both cloud computing and the operating systems that run most mobile devices – and we all benefit from superior products because of it.
I believe that AI will develop in a similar way. Today, several tech companies are developing leading closed models. But open source is quickly closing the gap. Last year, Llama 2 was only comparable to an older generation of models behind the frontier. This year, Llama 3 is competitive with the most advanced models and leading in some areas. Starting next year, we expect future Llama models to become the most advanced in the industry. But even before that, Llama is already leading on openness, modifiability, and cost efficiency.
Today we’re taking the next steps towards open source AI becoming the industry standard. We’re releasing Llama 3.1 405B, the first frontier-level open source AI model, as well as new and improved Llama 3.1 70B and 8B models. In addition to having significantly better cost/performance relative to closed models, the fact that the 405B model is open will make it the best choice for fine-tuning and distilling smaller models.
[…]
At some point in the future, individual bad actors may be able to use the intelligence of AI models to fabricate entirely new harms from the information available on the internet. At this point, the balance of power will be critical to AI safety. I think it will be better to live in a world where AI is widely deployed so that larger actors can check the power of smaller bad actors. This is how we’ve managed security on our social networks – our more robust AI systems identify and stop threats from less sophisticated actors who often use smaller scale AI systems. More broadly, larger institutions deploying AI at scale will promote security and stability across society. As long as everyone has access to similar generations of models – which open source promotes – then governments and institutions with more compute resources will be able to check bad actors with less compute.
My take on this: The Linux analogy he uses is a good one. In open source, the saying goes that more eyeballs make bugs shallow, and that holds for “open source” AI as well. It’s also worth noting that, as far as Meta’s models go, only the weights are open; the training data and other supporting scripts aren’t.
OpenAI announces SearchGPT Prototype
We’re testing SearchGPT, a prototype of new search features designed to combine the strength of our AI models with information from the web to give you fast and timely answers with clear and relevant sources. We’re launching to a small group of users and publishers to get feedback. While this prototype is temporary, we plan to integrate the best of these features directly into ChatGPT in the future. If you’re interested in trying the prototype, sign up for the waitlist.
Getting answers on the web can take a lot of effort, often requiring multiple attempts to get relevant results. We believe that by enhancing the conversational capabilities of our models with real-time information from the web, finding what you’re looking for can be faster and easier.
[…]
SearchGPT is designed to help users connect with publishers by prominently citing and linking to them in searches. Responses have clear, in-line, named attribution and links so users know where information is coming from and can quickly engage with even more results in a sidebar with source links.
My take on this: We’ve all been waiting for an answer engine that just gives us the answer rather than showing us a bunch of links. Everyone, including Google, should be headed in this direction.
Meta announces Llama 3.1 models
Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. With the release of the 405B model, we’re poised to supercharge innovation—with unprecedented opportunities for growth and exploration. We believe the latest generation of Llama will ignite new applications and modeling paradigms, including synthetic data generation to enable the improvement and training of smaller models, as well as model distillation—a capability that has never been achieved at this scale in open source.
As part of this latest release, we’re introducing upgraded versions of the 8B and 70B models. These are multilingual and have a significantly longer context length of 128K, state-of-the-art tool use, and overall stronger reasoning capabilities. This enables our latest models to support advanced use cases, such as long-form text summarization, multilingual conversational agents, and coding assistants. We’ve also made changes to our license, allowing developers to use the outputs from Llama models—including the 405B—to improve other models. True to our commitment to open source, starting today, we’re making these models available to the community for download on llama.meta.com and Hugging Face and available for immediate development on our broad ecosystem of partner platforms.
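For readers wondering what “available on Hugging Face” looks like in practice, here is a rough sketch of loading the 8B Instruct variant with the transformers library. The model ID, prompt, and generation settings are illustrative, and access to the weights requires accepting Meta’s license on Hugging Face first.

```python
# Rough sketch: running the Llama 3.1 8B Instruct model via Hugging Face transformers.
# Assumes the Llama 3.1 license has been accepted on Hugging Face and that
# transformers, accelerate, and a GPU with enough memory are available.
# The prompt and generation settings below are illustrative.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
    device_map="auto",            # let accelerate place the model on available devices
)

messages = [
    {"role": "user", "content": "Summarize this week's AI news in three bullet points."}
]
outputs = generator(messages, max_new_tokens=200)
# The pipeline returns the conversation with the assistant's reply appended last.
print(outputs[0]["generated_text"][-1]["content"])
```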
My take on this: Meta continues to impress with their open weight models, but their pricing seems expensive compared to what’s generally available from the competition—at least for now.
The Data That Powers A.I. Is Disappearing Fast
For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now, that data is drying up. Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.
The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
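For context on how that mechanism works: robots.txt is a plain-text file at a site’s root that tells crawlers which user agents may fetch which paths, and a well-behaved crawler checks it before scraping. Here is a minimal sketch using Python’s standard library; the bot name and URLs are made-up examples.

```python
# Minimal sketch: honoring the Robots Exclusion Protocol with Python's standard library.
# The user-agent string and URLs below are made-up examples.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

user_agent = "ExampleAIBot"
target_url = "https://example.com/articles/some-page"

if parser.can_fetch(user_agent, target_url):
    print(f"{user_agent} may crawl {target_url}")
else:
    print(f"robots.txt disallows {user_agent} from crawling {target_url}")
```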
The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
[…]
As the backlash has grown, some publishers have set up paywalls or changed their terms of service to limit the use of their data for A.I. training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic and Google.
Sites like Reddit and StackOverflow have begun charging A.I. companies for access to data, and a few publishers have taken legal action — including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used news articles to train their models without permission.
My take on this: With a lot of content already being created by AI, it’s an open question what all this inbreeding will lead to.
Reddit is now blocking major search engines and AI bots — except the ones that pay
Reddit is ramping up its crackdown on web crawlers. Over the past few weeks, Reddit has started blocking search engines from surfacing recent posts and comments unless the search engine pays up, according to a report from 404 Media.
Right now, Google is the only mainstream search engine that shows recent results when you search for posts on Reddit using the “site:reddit.com” trick, 404 Media reports. This leaves out Bing, DuckDuckGo, and other alternatives — likely because Google has struck a $60 million deal that lets the company train its AI models on content from Reddit.
“This is not at all related to our recent partnership with Google,” Reddit spokesperson Tim Rathschmidt says in a statement to The Verge. “We have been in discussions with multiple search engines. We have been unable to reach agreements with all of them, since some are unable or unwilling to make enforceable promises regarding their use of Reddit content, including their use for AI.”
Last month, to enforce its policy against scraping, Reddit updated the site’s robots.txt file, which tells web crawlers whether they can access a site. “It’s a signal to those who don’t have an agreement with us that they shouldn’t be accessing Reddit data,” Ben Lee, Reddit’s chief legal officer, told my colleague Alex Heath in Command Line.
My take on this: More evidence that the internet as we’ve known it is done.
If you've made it this far and follow my newsletter, please consider exploring the platform we're currently building: Unstract—a no-code LLM platform that automates unstructured data workflows.
For the extra curious