Building a Search Engine Using Elastic and Recursive Webscraping - Developer's Guide #1
Han Xiang Choong
Senior Customer Architect - APJ @ Elastic | Applied ML | Use-case Delivery | Search Experiences
Hello friends and colleagues, I'm shifting gears a little bit to discuss a topic that is rapidly becoming near and dear to me: search engines. For this, we'll be scraping data from Wikipedia and using Elastic for indexing and search. For those unfamiliar, it would not be an exaggeration to say that Elastic powers the internet's search bars, including Wikipedia's and GitHub's.
The End-to-End flow is straightforward:
1) Start with a URL
2) Scrape that URL and collect article content
3) Collect all the other URLs on the page
4) Recursively repeat 2) and 3) for every URL
5) Turn the article content into a list of objects and save it as a .csv
6) Upload the csv to Elastic, where each row is a document and the columns are the fields
7) Make a search query
That's it. Let's get started.
The notebook with the full code can be found here:
Table of Contents:
0) Pre-Requisites
1) Recursively Scraping Wikipedia for Article Content
2) Processing Article Content into a Semi-Structured Form
3) Uploading to Elastic
4) Plain Text Search
Pre-Requisites
First step, head to Elastic Cloud, register for a free trial, and deploy a cluster. Make sure to save your Cloud ID, username, and password. You'll need these to access the index later on.
1 - Recursively Scraping Wikipedia for Article Content
Webscraper
First thing to do is define a simple webscraper. This is the same class from the previous articles but stripped down:
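Since the actual code lives in the notebook, here's a minimal sketch of what that stripped-down scraper boils down to; the class and method names here are my shorthand, not necessarily the notebook's:

```python
import requests

class WebScraper:
    """Fetches a page and returns its raw HTML alongside the URL it came from."""

    def __init__(self, timeout: int = 10):
        self.timeout = timeout
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        }

    def scrape(self, url: str) -> dict | None:
        """Return {'content': <html>, 'url': <url>}, or None if the request fails."""
        try:
            response = requests.get(url, headers=self.headers, timeout=self.timeout)
            response.raise_for_status()
            return {"content": response.text, "url": url}
        except requests.RequestException:
            return None
```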
Its only purpose in life is to return little objects like this that contain html:
{'content': '<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-...
'url': 'https://en.wikipedia.org/wiki/Spaghetti_all%27assassina'}
An enhancement would be generating many variations of headers, maybe preloading them somewhere in the script, then using RNG to randomly select a header for each request. This reduces your risk of getting blocked.
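A rough sketch of that idea, assuming a small hard-coded pool of User-Agent strings (in practice you'd want a much larger pool):

```python
import random
import requests

# A few plausible User-Agent strings; the more variety, the better.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0",
]

def fetch_with_random_header(url: str) -> requests.Response:
    """Pick a random User-Agent for each request so traffic looks a little less bot-like."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```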
Wikipedia Data Processor
The next piece is a bit more interesting. We need a data processor to handle Wikipedia pages specifically.
First, Wikipedia articles often come with an infobox. I'd like to extract that infobox and turn it into neatly formatted text, which will enrich the search engine's contents. Let's define a class called WikipediaDataProcessor.
The infobox is a table that follows some standard formatting, so it's straightforward to extract it. The return datatype will be text, to keep things simple in our field mapping later on.
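In the notebook this is a method on WikipediaDataProcessor; sketched here as a standalone function (and the name extract_infobox is my placeholder), the gist is just walking the infobox table rows and joining label/value pairs into plain text:

```python
from bs4 import BeautifulSoup

def extract_infobox(soup: BeautifulSoup) -> str:
    """Flatten Wikipedia's infobox table into 'label: value' lines of plain text."""
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return ""
    lines = []
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            label = header.get_text(" ", strip=True)
            content = value.get_text(" ", strip=True)
            lines.append(f"{label}: {content}")
    return "\n".join(lines)
```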
I also want something to extract URLs from the page for future scraping. extract_urls_structure takes a bs4 soup object and looks for hrefs in the article HTML. It uses re to filter out most of the URLs that don't correspond to an article on Wikipedia (i.e. talk pages, help pages, templates, etc.), then returns a list of filtered URLs.
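Roughly like this, assuming we keep only /wiki/ links with no namespace colon in them:

```python
import re
from bs4 import BeautifulSoup

# Matches /wiki/Foo but not /wiki/Talk:Foo, /wiki/Help:Foo, /wiki/Template:Foo, ...
ARTICLE_LINK = re.compile(r"^/wiki/[^:]+$")

def extract_urls_structure(soup: BeautifulSoup) -> list[str]:
    """Collect hrefs that look like plain Wikipedia articles and return absolute URLs."""
    urls = set()
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if ARTICLE_LINK.match(href):
            urls.add("https://en.wikipedia.org" + href)
    return list(urls)
```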
Next, we need to get nicely formatted article contents. extract_article_content is going to look for headers and paragraphs corresponding to article sections, and return them in a semi-structured way. It's going to reject any section that is not strictly article content. clean_text is going to filter out special characters and citations.
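A sketch of both, again as standalone functions; the reject list of section titles is my guess at what "not strictly article content" covers:

```python
import re
from bs4 import BeautifulSoup

# Sections that aren't article prose; the exact reject list is a judgement call.
REJECTED_SECTIONS = {"References", "External links", "See also", "Notes",
                     "Further reading", "Bibliography"}

def clean_text(text: str) -> str:
    """Strip citation markers like [12], special characters, and stray whitespace."""
    text = re.sub(r"\[\d+\]", "", text)        # citations such as [1], [23]
    text = re.sub(r"[^\x20-\x7E]", " ", text)  # special / non-printable characters
    return re.sub(r"\s+", " ", text).strip()

def extract_article_content(soup: BeautifulSoup) -> str:
    """Walk headers and paragraphs in order, keeping only genuine article sections."""
    parts = []
    current_section = "Introduction"
    for element in soup.find_all(["h2", "h3", "p"]):
        if element.name in ("h2", "h3"):
            # Drop any trailing "[edit]" link text from the heading.
            current_section = element.get_text(" ", strip=True).split("[")[0].strip()
        elif current_section not in REJECTED_SECTIONS:
            paragraph = clean_text(element.get_text(" ", strip=True))
            if paragraph:
                parts.append(f"{current_section}: {paragraph}")
    return "\n".join(parts)
```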
Document Controller
Next, we have the DocumentController class, which controls document formatting and aggregates the results of all the functions we previously defined. The output of create_document_object is what gets uploaded to Elastic.
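A minimal sketch, reusing the processing helpers from above; the exact field set (title, infobox, content, url) is my assumption about what the document object contains:

```python
from bs4 import BeautifulSoup

class DocumentController:
    """Aggregates the processing helpers into one flat document per page."""

    def create_document_object(self, page: dict) -> dict:
        """Turn a raw {'content': html, 'url': url} object into an indexable document."""
        soup = BeautifulSoup(page["content"], "html.parser")
        title_tag = soup.find("h1")
        return {
            "url": page["url"],
            "title": title_tag.get_text(strip=True) if title_tag else "",
            "infobox": extract_infobox(soup),           # from the processor sketch above
            "content": extract_article_content(soup),   # from the processor sketch above
        }
```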
Scraper
Finally, we have the Scraper class. It's going to initialize everything we've defined and call the functions in the right order to scrape a single webpage, and return the urls and document objects we expect.
This is the more interesting part. The recursive scrape is going to visit as many urls as it can find, grab the contents, aggregate them in a list called documents, taking care not to duplicate or revisit anything it has already seen, and then return that list.
Note that recursive functions are notorious for running forever and never stopping. We have an end condition: if the number of documents grows beyond the limit defined in max_docs, stop and return. We also constrain the range of the search via max_depth. Essentially, this says that you may visit any webpage at most two hops away from your starting point. This should help to control relevance and ensure that we get broadly similar things to what we are interested in, though of course that is far from guaranteed.
Note, we also want to make sure that we only visit novel urls. If there is no more novelty to be explored, the scraper should stop and return everything it has collected.
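Here's a compact sketch of that recursion, wiring together the WebScraper, DocumentController, and extract_urls_structure sketches from above; max_docs and max_depth are the stopping conditions described, and a visited set handles the novelty check:

```python
from bs4 import BeautifulSoup

class Scraper:
    """Wires the web scraper, processor helpers, and document controller together."""

    def __init__(self, max_docs: int = 1000, max_depth: int = 2):
        self.web_scraper = WebScraper()
        self.doc_controller = DocumentController()
        self.max_docs = max_docs
        self.max_depth = max_depth
        self.visited = set()
        self.documents = []

    def scrape_page(self, url: str):
        """Scrape one page; return its document object and the article URLs it links to."""
        page = self.web_scraper.scrape(url)
        if page is None:
            return None, []
        soup = BeautifulSoup(page["content"], "html.parser")
        return self.doc_controller.create_document_object(page), extract_urls_structure(soup)

    def recursive_scrape(self, url: str, depth: int = 0) -> list[dict]:
        """Depth-first crawl that stops at max_docs, max_depth, or when no new URLs remain."""
        if depth > self.max_depth or len(self.documents) >= self.max_docs or url in self.visited:
            return self.documents
        self.visited.add(url)
        document, urls = self.scrape_page(url)
        if document is not None:
            self.documents.append(document)
        for next_url in urls:
            if len(self.documents) >= self.max_docs:
                break
            if next_url not in self.visited:
                self.recursive_scrape(next_url, depth + 1)
        return self.documents
```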
Running Everything
Let's set the scraping a-going.
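Something along these lines, with the starting URL and limits as illustrative values rather than the notebook's actual settings:

```python
# Kick off the crawl from the article that started all this.
scraper = Scraper(max_docs=1000, max_depth=2)
documents = scraper.recursive_scrape("https://en.wikipedia.org/wiki/Spaghetti_all%27assassina")
print(f"Collected {len(documents)} documents")
```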
Okay, 1 minute 24 seconds, not too shabby. We could easily do this asynchronously at scale to save time, but then we're getting into "unintentionally DDoSing a website and attracting angry attention towards our un-VPNed IP address" territory, so maybe let's not.
Let's save what we found to a csv and take a look at the catch:
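For example (the filename here is just a placeholder):

```python
import pandas as pd

# One row per document, one column per field.
df = pd.DataFrame(documents)
df.to_csv("wikipedia_spaghetti.csv", index=False)
df.head()
```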
Sweet Gordon Ramsay, what we have now is a corpus of Italian cuisine. The Tibet article coming up is quite weird, but other than that, heck yeah, we're in biz.
2 - Processing Article Content into a Semi-Structured Form
Okay, confession: the processing work for this section was basically covered in section 1 while we were building the WikipediaDataProcessor, so there's nothing new to add here. Let's move straight on to the upload.
3 - Uploading to Elastic
Okay let's upload the catch to Elastic. We're going to need a few things.
We're going to modify the Elastic Search Connector from previous articles. Now it's going to use an ES Cloud ID as well as your username and password to establish a connection. Hope you saved all that earlier.
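A minimal version of that connector, assuming the 8.x Python client (which accepts cloud_id and basic_auth directly):

```python
from elasticsearch import Elasticsearch

class ESConnector:
    """Holds the cluster credentials and hands back a connected client."""

    def __init__(self, cloud_id: str, username: str, password: str):
        self.cloud_id = cloud_id
        self.username = username
        self.password = password

    def connect(self) -> Elasticsearch:
        es = Elasticsearch(cloud_id=self.cloud_id, basic_auth=(self.username, self.password))
        es.info()  # raises if the credentials or Cloud ID are wrong
        return es
```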
We're also going to need the ESBulkIndexer, which uploads a batch of documents and is far, far faster than uploading one by one. The document _id field will be the url, because I think it makes sense to define unique documents by their url.
bulk_upload_documents is going to take a list of document_objs, create a set of upsert actions, then execute them all with bulk. Simple, right?
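Sketched out, assuming the client's bulk helper; the upsert-by-URL behaviour is the part that matters:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

class ESBulkIndexer:
    """Batches document uploads instead of indexing them one at a time."""

    def __init__(self, es: Elasticsearch):
        self.es = es

    def bulk_upload_documents(self, index_name: str, document_objs: list[dict]) -> int:
        """Upsert each document, using its URL as the _id so re-runs don't create duplicates."""
        actions = [
            {
                "_op_type": "update",
                "_index": index_name,
                "_id": doc["url"],
                "doc": doc,
                "doc_as_upsert": True,
            }
            for doc in document_objs
        ]
        success, _ = bulk(self.es, actions)
        return success
```

Using the URL as _id means re-running the pipeline updates existing documents rather than duplicating them.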
We'll also introduce a new member of the family - ESQueryMaker. This guy is going to make a simple Elastic search query, which will match the query against the specified fields, retrieve the results, and print them out nicely.
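Something like this, using a plain multi_match query; the printed summary format is my own choice:

```python
from elasticsearch import Elasticsearch

class ESQueryMaker:
    """Runs a plain multi_match query and prints the hits in a readable way."""

    def __init__(self, es: Elasticsearch):
        self.es = es

    def search(self, index_name: str, query: str, fields: list[str], size: int = 5):
        response = self.es.search(
            index=index_name,
            query={"multi_match": {"query": query, "fields": fields}},
            size=size,
        )
        for hit in response["hits"]["hits"]:
            print(f"{hit['_score']:.2f}  {hit['_source'].get('title', '')}  {hit['_id']}")
        return response
```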
Uploading
Let's establish a connection:
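Using the connector sketch from above, with placeholder variables for the credentials you saved earlier:

```python
# ES_CLOUD_ID, ES_USERNAME, ES_PASSWORD come from your Elastic Cloud deployment.
es_connector = ESConnector(cloud_id=ES_CLOUD_ID, username=ES_USERNAME, password=ES_PASSWORD)
es = es_connector.connect()
```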
We'll now load the CSV we previously generated into a pandas dataframe and create a new index called "wikipedia_spaghetti". This index is tailor-made for our CSV and will use all the columns. Make sure to drop the NaNs, because we specified that the fields are text, and if ES sees a NaN where it was expecting text, it's going to kick and scream and cause a scene.
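For example, assuming the CSV filename from earlier:

```python
import pandas as pd

df = pd.read_csv("wikipedia_spaghetti.csv")
df = df.dropna()  # every field is mapped as text, so NaNs have to go

# One text field per CSV column.
mappings = {"properties": {column: {"type": "text"} for column in df.columns}}
```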
Create the index:
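A sketch, using the mappings built above:

```python
INDEX_NAME = "wikipedia_spaghetti"

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, mappings=mappings)
```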
Upload the documents:
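Again as a sketch, handing the dataframe rows to the ESBulkIndexer from earlier:

```python
indexer = ESBulkIndexer(es)
uploaded = indexer.bulk_upload_documents(INDEX_NAME, df.to_dict(orient="records"))
print(f"Indexed {uploaded} documents")
```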
Check Elastic Cloud:
Aw yeah we're in business now. Looks good, and we got 927 unique documents out of it too.
4 - Plain Text Search
Now for the big payoff - Let's search Elastic for "Spaghetti" and see what we get:
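The call looks roughly like this, with the field list matching the document sketch from earlier:

```python
query_maker = ESQueryMaker(es)
query_maker.search(
    index_name="wikipedia_spaghetti",
    query="Spaghetti",
    fields=["title", "infobox", "content"],
)
```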
Oh yeah, talk to me about them spaghettis breh
What if we search for something a bit vaguer? Like "Strongest pizza in the world" or something?
Okay well Pizza al Taglio is big and sold by the kilogram so... I'm counting it as a win.
Okay, search engine done. To do classic RAG, you would hook this up to an LLM and maybe make use of embeddings. You would also spend a lot more time enriching the documents with NLP techniques like keyword extraction and summarization. You could also do some more advanced stuff like linking documents using an additional graph layer, allowing you to traverse content by semantics rather than URL-based linkages.
But all of that is out of scope. We'll come back and look at things you can do with search engines at a later date. See you!