Building a Search Engine Using Elastic and Recursive Webscraping - Developer's Guide #1
Han Xiang Choong
Senior Customer Architect - APJ @ Elastic | Applied ML | Use-case Delivery | Search Experiences
Hello friends and colleagues, I'm shifting gears a little bit to discuss a topic that is rapidly becoming near and dear to me: search engines. For this, we'll be scraping data from Wikipedia and using Elastic for indexing and search. For those unfamiliar, it would not be an exaggeration to say that Elastic powers the internet's search bars, including Wikipedia's and GitHub's.
The End-to-End flow is straightforward:
1) Start with a URL
2) Scrape that URL and collect article content
3) Collect all the other URLs on the page
4) Recursively repeat 2) and 3) for every URL
5) Turn the article content into a list of objects and save it as a .csv
6) Upload the csv to Elastic, where each row is a document and the columns are the fields
7) Make a search query
That's it. Let's get started.
The notebook with the full code can be found here:
Table of Contents:
0) Pre-Requisites
1) Recursively Scraping Wikipedia for Article Content
2) Processing Article Content into a Semi-Structured Form
3) Uploading to Elastic
4) Plain Text Search
Pre-Requisites
First step, head to Elastic Cloud, register for a free trial, and deploy a cluster. Make sure to save your Cloud ID, username, and password. You'll need these to access the index later on.
1 - Recursively Scraping Wikipedia for Article Content
Webscraper
First thing to do is define a simple webscraper. This is the same class from the previous articles but stripped down:
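Since the actual code lives in the notebook, here's a minimal sketch of what that stripped-down scraper boils down to; the class and method names here are my shorthand, not necessarily the notebook's:

```python
import requests

class WebScraper:
    """Fetches a page and returns its raw HTML alongside the URL it came from."""

    def __init__(self, timeout: int = 10):
        self.timeout = timeout
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        }

    def scrape(self, url: str) -> dict | None:
        """Return {'content': <html>, 'url': <url>}, or None if the request fails."""
        try:
            response = requests.get(url, headers=self.headers, timeout=self.timeout)
            response.raise_for_status()
            return {"content": response.text, "url": url}
        except requests.RequestException:
            return None
```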
Its only purpose in life is to return little objects like this that contain html:
{'content': '<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-...
'url': 'https://en.wikipedia.org/wiki/Spaghetti_all%27assassina'}
An enhancement would be generating many variations of headers, maybe preloading them somewhere in the script, then using RNG to randomly select a header for each request. This reduces your risk of getting blocked.
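A rough sketch of that idea, assuming a small hard-coded pool of User-Agent strings (in practice you'd want a much larger pool):

```python
import random
import requests

# A few plausible User-Agent strings; the more variety, the better.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0",
]

def fetch_with_random_header(url: str) -> requests.Response:
    """Pick a random User-Agent for each request so traffic looks a little less bot-like."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)
```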
Wikipedia Data Processor
The next piece is a bit more interesting. We need a data processor to handle Wikipedia pages specifically.
First, Wikipedia articles often come with an infobox. I'd like to extract that infobox and turn it into neatly formatted text, which will enrich the search engine's contents. Let's define a class called WikipediaDataProcessor.
The infobox is a table that follows some standard formatting, so it's straightforward to extract it. The return datatype will be text, to keep things simple in our field mapping later on.
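In the notebook this is a method on WikipediaDataProcessor; sketched here as a standalone function (and the name extract_infobox is my placeholder), the gist is just walking the infobox table rows and joining label/value pairs into plain text:

```python
from bs4 import BeautifulSoup

def extract_infobox(soup: BeautifulSoup) -> str:
    """Flatten Wikipedia's infobox table into 'label: value' lines of plain text."""
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return ""
    lines = []
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            label = header.get_text(" ", strip=True)
            content = value.get_text(" ", strip=True)
            lines.append(f"{label}: {content}")
    return "\n".join(lines)
```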
I also want something to extract URLs from the page for future scraping. extract_urls_structure takes a bs4 soup object and looks for hrefs in the article HTML. It uses re to filter out most of the URLs that don't correspond to an article on Wikipedia (i.e. talk pages, help pages, templates, etc.), then returns a list of filtered URLs.
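Roughly like this, assuming we keep only /wiki/ links with no namespace colon in them:

```python
import re
from bs4 import BeautifulSoup

# Matches /wiki/Foo but not /wiki/Talk:Foo, /wiki/Help:Foo, /wiki/Template:Foo, ...
ARTICLE_LINK = re.compile(r"^/wiki/[^:]+$")

def extract_urls_structure(soup: BeautifulSoup) -> list[str]:
    """Collect hrefs that look like plain Wikipedia articles and return absolute URLs."""
    urls = set()
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if ARTICLE_LINK.match(href):
            urls.add("https://en.wikipedia.org" + href)
    return list(urls)
```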
Next, we need to get nicely formatted article contents. extract_article_content is going to look for headers and paragraphs corresponding to article sections, and return them in a semi-structured way. It's going to reject any section that is not strictly article content. clean_text is going to filter out special characters and citations.
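A sketch of both, again as standalone functions; the reject list of section titles is my guess at what "not strictly article content" covers:

```python
import re
from bs4 import BeautifulSoup

# Sections that aren't article prose; the exact reject list is a judgement call.
REJECTED_SECTIONS = {"References", "External links", "See also", "Notes",
                     "Further reading", "Bibliography"}

def clean_text(text: str) -> str:
    """Strip citation markers like [12], special characters, and stray whitespace."""
    text = re.sub(r"\[\d+\]", "", text)        # citations such as [1], [23]
    text = re.sub(r"[^\x20-\x7E]", " ", text)  # special / non-printable characters
    return re.sub(r"\s+", " ", text).strip()

def extract_article_content(soup: BeautifulSoup) -> str:
    """Walk headers and paragraphs in order, keeping only genuine article sections."""
    parts = []
    current_section = "Introduction"
    for element in soup.find_all(["h2", "h3", "p"]):
        if element.name in ("h2", "h3"):
            # Drop any trailing "[edit]" link text from the heading.
            current_section = element.get_text(" ", strip=True).split("[")[0].strip()
        elif current_section not in REJECTED_SECTIONS:
            paragraph = clean_text(element.get_text(" ", strip=True))
            if paragraph:
                parts.append(f"{current_section}: {paragraph}")
    return "\n".join(parts)
```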
Document Controller
Next, we have the DocumentController class, which controls document formatting and aggregates the results of all the functions we previously defined. The output of create_document_object is what gets uploaded to Elastic.
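A minimal sketch, reusing the processing helpers from above; the exact field set (title, infobox, content, url) is my assumption about what the document object contains:

```python
from bs4 import BeautifulSoup

class DocumentController:
    """Aggregates the processing helpers into one flat document per page."""

    def create_document_object(self, page: dict) -> dict:
        """Turn a raw {'content': html, 'url': url} object into an indexable document."""
        soup = BeautifulSoup(page["content"], "html.parser")
        title_tag = soup.find("h1")
        return {
            "url": page["url"],
            "title": title_tag.get_text(strip=True) if title_tag else "",
            "infobox": extract_infobox(soup),           # from the processor sketch above
            "content": extract_article_content(soup),   # from the processor sketch above
        }
```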
Scraper
Finally, we have the Scraper class. It's going to initialize everything we've defined and call the functions in the right order to scrape a single webpage, and return the urls and document objects we expect.
This is the more interesting part. The recursive scrape is going to visit as many urls as it can find, grab the contents, aggregate them in a list called documents, taking care not to duplicate or revisit anything it has already seen, and then return that list.
Note that recursive functions are notorious for running forever and never stopping. We have an end condition: if the number of documents grows beyond the limit defined in max_docs, stop and return. We also constrain the range of the search via max_depth. Essentially, this says that you may visit any webpage at most two hops away from your starting point. This should help to control relevance and ensure that we get broadly similar things to what we are interested in, though of course that is far from guaranteed.
Note, we also want to make sure that we only visit novel urls. If there is no more novelty to be explored, the scraper should stop and return everything it has collected.
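Here's a compact sketch of that recursion, wiring together the WebScraper, DocumentController, and extract_urls_structure sketches from above; max_docs and max_depth are the stopping conditions described, and a visited set handles the novelty check:

```python
from bs4 import BeautifulSoup

class Scraper:
    """Wires the web scraper, processor helpers, and document controller together."""

    def __init__(self, max_docs: int = 1000, max_depth: int = 2):
        self.web_scraper = WebScraper()
        self.doc_controller = DocumentController()
        self.max_docs = max_docs
        self.max_depth = max_depth
        self.visited = set()
        self.documents = []

    def scrape_page(self, url: str):
        """Scrape one page; return its document object and the article URLs it links to."""
        page = self.web_scraper.scrape(url)
        if page is None:
            return None, []
        soup = BeautifulSoup(page["content"], "html.parser")
        return self.doc_controller.create_document_object(page), extract_urls_structure(soup)

    def recursive_scrape(self, url: str, depth: int = 0) -> list[dict]:
        """Depth-first crawl that stops at max_docs, max_depth, or when no new URLs remain."""
        if depth > self.max_depth or len(self.documents) >= self.max_docs or url in self.visited:
            return self.documents
        self.visited.add(url)
        document, urls = self.scrape_page(url)
        if document is not None:
            self.documents.append(document)
        for next_url in urls:
            if len(self.documents) >= self.max_docs:
                break
            if next_url not in self.visited:
                self.recursive_scrape(next_url, depth + 1)
        return self.documents
```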
Running Everything
Let's set the scraping a-going.
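Something along these lines, with the starting URL and limits as illustrative values rather than the notebook's actual settings:

```python
# Kick off the crawl from the article that started all this.
scraper = Scraper(max_docs=1000, max_depth=2)
documents = scraper.recursive_scrape("https://en.wikipedia.org/wiki/Spaghetti_all%27assassina")
print(f"Collected {len(documents)} documents")
```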
Okay, 1 minute 24 seconds, not too shabby. We could easily do this asynchronously at scale to save time, but then we're getting into "unintentionally DDoSing a website and attracting angry attention towards our un-VPNed IP address" territory, so maybe let's not.
Let's save what we found to a csv and take a look at the catch:
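For example (the filename here is just a placeholder):

```python
import pandas as pd

# One row per document, one column per field.
df = pd.DataFrame(documents)
df.to_csv("wikipedia_spaghetti.csv", index=False)
df.head()
```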
Sweet Gordon Ramsay, what we have now is a corpus of Italian cuisine. The Tibet article coming up is quite weird, but other than that, heck yeah, we're in biz.
2 - Processing Article Content into a Semi-Structured Form
Okay, confession: the processing work for this section was basically covered in section 1 while we were building the WikipediaDataProcessor, so there's nothing new to add here. Let's move straight on to the upload.
3 - Uploading to Elastic
Okay let's upload the catch to Elastic. We're going to need a few things.
We're going to modify the Elastic Search Connector from previous articles. Now it's going to use an ES Cloud ID as well as your username and password to establish a connection. Hope you saved all that earlier.
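A minimal version of that connector, assuming the 8.x Python client (which accepts cloud_id and basic_auth directly):

```python
from elasticsearch import Elasticsearch

class ESConnector:
    """Holds the cluster credentials and hands back a connected client."""

    def __init__(self, cloud_id: str, username: str, password: str):
        self.cloud_id = cloud_id
        self.username = username
        self.password = password

    def connect(self) -> Elasticsearch:
        es = Elasticsearch(cloud_id=self.cloud_id, basic_auth=(self.username, self.password))
        es.info()  # raises if the credentials or Cloud ID are wrong
        return es
```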
We're also going to need the ESBulkIndexer, which uploads a batch of documents and is far, far faster than uploading one by one. The document _id field will be the url, because I think it makes sense to define unique documents by their url.
bulk_upload_documents is going to take a list of document_objs, create a set of upsert actions, then execute them all with bulk. Simple, right?
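Sketched out, assuming the client's bulk helper; the upsert-by-URL behaviour is the part that matters:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

class ESBulkIndexer:
    """Batches document uploads instead of indexing them one at a time."""

    def __init__(self, es: Elasticsearch):
        self.es = es

    def bulk_upload_documents(self, index_name: str, document_objs: list[dict]) -> int:
        """Upsert each document, using its URL as the _id so re-runs don't create duplicates."""
        actions = [
            {
                "_op_type": "update",
                "_index": index_name,
                "_id": doc["url"],
                "doc": doc,
                "doc_as_upsert": True,
            }
            for doc in document_objs
        ]
        success, _ = bulk(self.es, actions)
        return success
```

Using the URL as _id means re-running the pipeline updates existing documents rather than duplicating them.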
We'll also introduce a new member of the family - ESQueryMaker. This guy is going to make a simple Elastic search query, which will match the query against the specified fields, retrieve the results, and print them out nicely.
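Something like this, using a plain multi_match query; the printed summary format is my own choice:

```python
from elasticsearch import Elasticsearch

class ESQueryMaker:
    """Runs a plain multi_match query and prints the hits in a readable way."""

    def __init__(self, es: Elasticsearch):
        self.es = es

    def search(self, index_name: str, query: str, fields: list[str], size: int = 5):
        response = self.es.search(
            index=index_name,
            query={"multi_match": {"query": query, "fields": fields}},
            size=size,
        )
        for hit in response["hits"]["hits"]:
            print(f"{hit['_score']:.2f}  {hit['_source'].get('title', '')}  {hit['_id']}")
        return response
```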
Uploading
Let's establish a connection:
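Using the connector sketch from above, with placeholder variables for the credentials you saved earlier:

```python
# ES_CLOUD_ID, ES_USERNAME, ES_PASSWORD come from your Elastic Cloud deployment.
es_connector = ESConnector(cloud_id=ES_CLOUD_ID, username=ES_USERNAME, password=ES_PASSWORD)
es = es_connector.connect()
```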
We'll now load the CSV we previously generated into a pandas dataframe and create a new index called "wikipedia_spaghetti". This index is tailor-made for our CSV and will use all the columns. Make sure to drop the NaNs, because we specified that the fields are text, and if ES sees a NaN where it was expecting text, it's going to kick and scream and cause a scene.
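For example, assuming the CSV filename from earlier:

```python
import pandas as pd

df = pd.read_csv("wikipedia_spaghetti.csv")
df = df.dropna()  # every field is mapped as text, so NaNs have to go

# One text field per CSV column.
mappings = {"properties": {column: {"type": "text"} for column in df.columns}}
```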
Create the index:
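A sketch, using the mappings built above:

```python
INDEX_NAME = "wikipedia_spaghetti"

if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, mappings=mappings)
```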
Upload the documents:
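Again as a sketch, handing the dataframe rows to the ESBulkIndexer from earlier:

```python
indexer = ESBulkIndexer(es)
uploaded = indexer.bulk_upload_documents(INDEX_NAME, df.to_dict(orient="records"))
print(f"Indexed {uploaded} documents")
```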
Check Elastic Cloud:
Aw yeah we're in business now. Looks good, and we got 927 unique documents out of it too.
4 - Plain Text Search
Now for the big payoff - Let's search Elastic for "Spaghetti" and see what we get:
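The call looks roughly like this, with the field list matching the document sketch from earlier:

```python
query_maker = ESQueryMaker(es)
query_maker.search(
    index_name="wikipedia_spaghetti",
    query="Spaghetti",
    fields=["title", "infobox", "content"],
)
```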
Oh yeah, talk to me about them spaghettis breh
What if we search for something a bit vaguer? Like "Strongest pizza in the world" or something?
Okay well Pizza al Taglio is big and sold by the kilogram so... I'm counting it as a win.
Okay, search engine done. To do classic RAG, you would hook this up to an LLM and maybe make use of embeddings. You would also spend a lot more time enriching the documents with NLP techniques like keyword extraction and summarization. You could also do some more advanced stuff like linking documents using an additional graph layer, allowing you to traverse content by semantics rather than URL-based linkages.
But all of that is out of scope. We'll come back and look at things you can do with search engines at a later date. See you!