GenAI Weekly — Edition 23
Your Weekly Dose of Gen AI: News, Trends, and Breakthroughs
Stay at the forefront of the Gen AI revolution with Gen AI Weekly! Each week, we curate the most noteworthy news, insights, and breakthroughs in the field, equipping you with the knowledge you need to stay ahead of the curve.
Python Libraries to Extract Table from PDF
Extracting tables from PDF files is a complex task due to the diverse structures and layouts of PDFs. Unlike simple text files, PDFs are designed for consistent visual presentation, making it challenging to automatically extract structured data like tables.
The varied appearance, construction, and formatting of tables within PDFs further complicate accurate extraction. Python, with its robust libraries, offers powerful solutions for this challenge. Leveraging Python for table extraction from PDFs provides several advantages, including flexibility, automation capabilities, and support for multiple PDF formats.
Several Python libraries have been developed specifically to facilitate table extraction, each with its own features and trade-offs.
In our latest article, Nuno Bispo dives deep into four of these libraries, comparing what each can do, how well it works, and how to extract a table from a PDF accurately.
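To give a flavor of what this kind of extraction looks like in practice, here is a minimal sketch using pdfplumber, one commonly used option (it may or may not be among the four libraries covered in the article); the file name is a placeholder:

```python
# Minimal sketch of table extraction with pdfplumber (illustrative only;
# "sample.pdf" is a placeholder, and the article may cover other libraries).
import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        # extract_tables() returns each detected table as a list of rows,
        # where each row is a list of cell strings (None for empty cells).
        for table in page.extract_tables():
            print(f"Table found on page {page_number}:")
            for row in table:
                print(row)
```

Each library exposes a different API and handles ruled, borderless, and multi-page tables with varying success, which is the kind of difference the article's comparison focuses on.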
Zuckerberg: Open Source AI Is the Path Forward
In the early days of high-performance computing, the major tech companies of the day each invested heavily in developing their own closed source versions of Unix. It was hard to imagine at the time that any other approach could develop such advanced software. Eventually though, open source Linux gained popularity – initially because it allowed developers to modify its code however they wanted and was more affordable, and over time because it became more advanced, more secure, and had a broader ecosystem supporting more capabilities than any closed Unix. Today, Linux is the industry standard foundation for both cloud computing and the operating systems that run most mobile devices – and we all benefit from superior products because of it.
I believe that AI will develop in a similar way. Today, several tech companies are developing leading closed models. But open source is quickly closing the gap. Last year, Llama 2 was only comparable to an older generation of models behind the frontier. This year, Llama 3 is competitive with the most advanced models and leading in some areas. Starting next year, we expect future Llama models to become the most advanced in the industry. But even before that, Llama is already leading on openness, modifiability, and cost efficiency.
Today we’re taking the next steps towards open source AI becoming the industry standard. We’re releasing Llama 3.1 405B, the first frontier-level open source AI model, as well as new and improved Llama 3.1 70B and 8B models. In addition to having significantly better cost/performance relative to closed models, the fact that the 405B model is open will make it the best choice for fine-tuning and distilling smaller models.
[…]
At some point in the future, individual bad actors may be able to use the intelligence of AI models to fabricate entirely new harms from the information available on the internet. At this point, the balance of power will be critical to AI safety. I think it will be better to live in a world where AI is widely deployed so that larger actors can check the power of smaller bad actors. This is how we’ve managed security on our social networks – our more robust AI systems identify and stop threats from less sophisticated actors who often use smaller scale AI systems. More broadly, larger institutions deploying AI at scale will promote security and stability across society. As long as everyone has access to similar generations of models – which open source promotes – then governments and institutions with more compute resources will be able to check bad actors with less compute.
My take on this: The Linux analogy he uses is a good one. In open source, the saying goes that more eyeballs make bugs shallow, and that holds for “open source” AI as well. It’s also worth noting that, as far as Meta’s models go, only the weights are open; the training data and other supporting scripts aren’t.
OpenAI announces SearchGPT Prototype
We’re testing SearchGPT, a prototype of new search features designed to combine the strength of our AI models with information from the web to give you fast and timely answers with clear and relevant sources. We’re launching to a small group of users and publishers to get feedback. While this prototype is temporary, we plan to integrate the best of these features directly into ChatGPT in the future. If you’re interested in trying the prototype, sign up for the waitlist.
Getting answers on the web can take a lot of effort, often requiring multiple attempts to get relevant results. We believe that by enhancing the conversational capabilities of our models with real-time information from the web, finding what you’re looking for can be faster and easier.
[…]
SearchGPT is designed to help users connect with publishers by prominently citing and linking to them in searches. Responses have clear, in-line, named attribution and links so users know where information is coming from and can quickly engage with even more results in a sidebar with source links.
My take on this: We’ve all been waiting for an answer engine that just gives us the answer rather than showing us a bunch of links. Everyone, including Google, should be headed in this direction.
Meta announces Llama 3.1 models
Llama 3.1 405B is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. With the release of the 405B model, we’re poised to supercharge innovation—with unprecedented opportunities for growth and exploration. We believe the latest generation of Llama will ignite new applications and modeling paradigms, including synthetic data generation to enable the improvement and training of smaller models, as well as model distillation—a capability that has never been achieved at this scale in open source.
As part of this latest release, we’re introducing upgraded versions of the 8B and 70B models. These are multilingual and have a significantly longer context length of 128K, state-of-the-art tool use, and overall stronger reasoning capabilities. This enables our latest models to support advanced use cases, such as long-form text summarization, multilingual conversational agents, and coding assistants. We’ve also made changes to our license, allowing developers to use the outputs from Llama models—including the 405B—to improve other models. True to our commitment to open source, starting today, we’re making these models available to the community for download on llama.meta.com and Hugging Face and available for immediate development on our broad ecosystem of partner platforms.
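For readers wondering what “available on Hugging Face” looks like in practice, here is a rough sketch of loading the 8B Instruct variant with the transformers library. The model ID, prompt, and generation settings are illustrative, and access to the weights requires accepting Meta’s license on Hugging Face first.

```python
# Rough sketch: running the Llama 3.1 8B Instruct model via Hugging Face transformers.
# Assumes the Llama 3.1 license has been accepted on Hugging Face and that
# transformers, accelerate, and a GPU with enough memory are available.
# The prompt and generation settings below are illustrative.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
    device_map="auto",            # let accelerate place the model on available devices
)

messages = [
    {"role": "user", "content": "Summarize this week's AI news in three bullet points."}
]
outputs = generator(messages, max_new_tokens=200)
# The pipeline returns the conversation with the assistant's reply appended last.
print(outputs[0]["generated_text"][-1]["content"])
```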
My take on this: Meta continues to impress with their open weight models, but their pricing seems expensive compared to what’s generally available from the competition—at least for now.
The Data That Powers A.I. Is Disappearing Fast
For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now, that data is drying up. Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.
The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
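For context on how that mechanism works: robots.txt is a plain-text file at a site’s root that tells crawlers which user agents may fetch which paths, and a well-behaved crawler checks it before scraping. Here is a minimal sketch using Python’s standard library; the bot name and URLs are made-up examples.

```python
# Minimal sketch: honoring the Robots Exclusion Protocol with Python's standard library.
# The user-agent string and URLs below are made-up examples.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

user_agent = "ExampleAIBot"
target_url = "https://example.com/articles/some-page"

if parser.can_fetch(user_agent, target_url):
    print(f"{user_agent} may crawl {target_url}")
else:
    print(f"robots.txt disallows {user_agent} from crawling {target_url}")
```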
The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
[…]
As the backlash has grown, some publishers have set up paywalls or changed their terms of service to limit the use of their data for A.I. training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic and Google.
Sites like Reddit and StackOverflow have begun charging A.I. companies for access to data, and a few publishers have taken legal action — including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used news articles to train their models without permission.
My take on this: With a lot of content already being created by AI, it’s an open question what all this inbreeding will lead to.
Reddit is now blocking major search engines and AI bots — except the ones that pay
Reddit is ramping up its crackdown on web crawlers. Over the past few weeks, Reddit has started blocking search engines from surfacing recent posts and comments unless the search engine pays up, according to a report from 404 Media.
Right now, Google is the only mainstream search engine that shows recent results when you search for posts on Reddit using the “site:reddit.com” trick, 404 Media reports. This leaves out Bing, DuckDuckGo, and other alternatives — likely because Google has struck a $60 million deal that lets the company train its AI models on content from Reddit.
“This is not at all related to our recent partnership with Google,” Reddit spokesperson Tim Rathschmidt says in a statement to The Verge. “We have been in discussions with multiple search engines. We have been unable to reach agreements with all of them, since some are unable or unwilling to make enforceable promises regarding their use of Reddit content, including their use for AI.”
Last month, to enforce its policy against scraping, Reddit updated the site’s robots.txt file, which tells web crawlers whether they can access a site. “It’s a signal to those who don’t have an agreement with us that they shouldn’t be accessing Reddit data,” Ben Lee, Reddit’s chief legal officer, told my colleague Alex Heath in Command Line.
My take on this: More evidence that the internet as we’ve known it is done.
If you've made it this far and follow my newsletter, please consider exploring the platform we're currently building: Unstract—a no-code LLM platform that automates unstructured data workflows.
For the extra curious