Simple STEPS to Scrape a Website
Parth Shah
BIE @ Amazon Robotics | Data Scientist | Data Engineer | MLOps | Analytics Engineer | Product Analyst |
Disclaimer: Web scraping can be illegal, depending on the website's terms. Please read the "robots.txt" file of the website carefully.
Don't know what robots.txt is? Don't worry, you will find out soon.
What is Web Scraping?
It is really simple. Let's read a story. Rahul wants to buy a laptop, but he is really confused about which one to buy: there are so many options on the website, and he doesn't have enough time to compare so many laptops one by one; it would take ages. Rahul wonders: what if he could pull the data from the website into an Excel sheet and compare the laptops easily there? (Good idea, Rahul. Rahul is smart. Be like this Rahul, not the Gandhi one!)
Rahul comes to know about web scraping, which can easily pull attributes from any website into an Excel file. Rahul decides to implement this; let's look at how it works.
STEP 1 :
Select the website you want to scrape. Rahul decides to scrape Flipkart.com, as it has a wide range of laptops. So first he checks whether Flipkart gives permission to pull data from its website. Every website has a "robots.txt" file that states which pages crawlers may and may not access. To view this file, type the website name followed by /robots.txt in your browser's address bar.
In Flipkart's robots.txt file we can see which pages are allowed and which are disallowed. (All the pages are disallowed, but Rahul is a curious guy, so he still scrapes Flipkart. Stunts like this are carried out under a lawyer's supervision; please don't try this at home.)
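As a simplified illustration (an assumed example, not Flipkart's actual file, which is much longer), a robots.txt that disallows everything for every crawler looks like this:

User-agent: *
Disallow: /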
STEP 2 :
After checking the permissions, let's open our Jupyter notebook, import some important libraries, and install a driver.
We will need 3 libraries:
1) Selenium (webdriver): used to connect our Jupyter notebook to the Chrome browser.
2) bs4 (BeautifulSoup): to parse and traverse the HTML content we get from the website.
3) Pandas (everyone's favourite): "If you don't know this one, go back and learn the basics first."
We will also need 1 driver. (Rahul prefers using Chrome for browsing, as Internet Explorer would give us answers after 2 years.)
1) Go to https://sites.google.com/a/chromium.org/chromedriver/home and download the Chrome driver. (P.S.: Check the version of your Chrome and download the matching driver.) With everything installed, the imports look like the snippet below.
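A minimal import sketch, assuming the three libraries were installed with pip (pip install selenium beautifulsoup4 pandas):

# The three libraries listed above
from selenium import webdriver   # drives the Chrome browser
from bs4 import BeautifulSoup    # parses the HTML we pull down
import pandas as pd              # stores and exports the scraped data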
Bravo, the second step is done!! Let's steal some data now!
STEP 3 :
Connect the Jupyter notebook to Chrome using the driver.
Provide the path where you saved the driver and store the connection in a "driver" variable. Now that we are successfully connected to Chrome, let's test it!
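A minimal sketch of that connection; the chromedriver path and the search URL are placeholder assumptions, and this uses Selenium 3 syntax (newer Selenium versions pass the path through a Service object instead):

# Point Selenium at the chromedriver binary downloaded in STEP 2 (path is a placeholder)
driver = webdriver.Chrome(executable_path="C:/chromedriver/chromedriver.exe")
# Open the Flipkart search results page for laptops (example URL)
driver.get("https://www.flipkart.com/search?q=laptop")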
driver.get() takes us to the webpage we want to scrape. (After running the above code, a new Chrome tab will open with the mentioned website. If it does not open, maybe Chrome is not happy with you!)
STEP 4 :
Cool, so we can access the webpage easily. It's now time to bring the content from the webpage into our Python notebook.
"page_source" will fetch the raw HTML content of the webpage for us.
I know that's really creepy-looking data (less creepy than your ex, though :P), but don't worry, we do not need to read it and pull the data out manually; Python will do that for us.
For that, we use "BeautifulSoup" (the second line in the snippet below), which helps Python traverse the creepy data and get us some valuable content.
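A minimal sketch; 'html.parser' is Python's built-in parser (lxml also works if you have it installed):

# Grab the raw HTML of the page Selenium currently has open
content = driver.page_source
# Parse it so we can search by tags and classes instead of reading raw text
soup = BeautifulSoup(content, 'html.parser')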
STEP 5 :
Once we are done with beautifying things, we will declare 3 lists. From the website we will steal:
1) Name of the Laptop 2) Price of the Laptop 3) Rating of the Laptop
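Declaring them is one line each:

products = []   # laptop names
prices = []     # laptop prices
ratings = []    # laptop ratings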
STEP 6 :
The Bahubali Step!! This step is really important and annoying (Just like your Girlfriend/Boyfriend).
Here Rahul writes a "for" loop to iterate over the laptops on that page.
a) soup.findAll: this helps us find all the tags on the webpage that have the class id '_31qSD5'. The best part of a website is that all the product cards start with the same class id. To get the class id for a particular website, we need to inspect it.
Open the webpage, right-click on the page, and select the "Inspect" option; a beautiful panel will open on the right side of the screen.
When you hover over the tags in the Inspect window, you will see a blue box appear on the website. This tells us which tag is used for which part of the page. In the above image, I hovered over <a class="_31qSD5">; this card starts with an <a> tag. All the products on this page start with the same class, which lets us iterate over every product on the webpage.
b) name = a.find('div', attrs={'class': '_3wU53n'}): this piece of code gives us the name of the product. As I told you, everything on a webpage has a class, so the name of the product also has one: '_3wU53n'.
As we can see, when I hover over <div class="_3wU53n">, the blue box appears on the name of the product.
c) price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
d) rating = a.find('div', attrs={'class': 'hGSR34'})
By the same logic, Rahul has also found the classes for price and rating. You can extract other parameters by finding their respective classes in the Inspect window. ("Do I have to spell out everything myself?") Putting a) through d) together, the loop looks like the snippet below.
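A rough sketch of the full loop; the class IDs are the ones Rahul found on Flipkart at the time of writing and will almost certainly have changed since, so re-inspect the page and swap in the current ones:

# Every laptop card on the results page is an <a> tag with class '_31qSD5'
for a in soup.findAll('a', attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    if name is None:
        continue  # skip cards that are not laptop listings
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34'})
    products.append(name.text)
    # Some cards have no price or rating listed, so guard against None
    prices.append(price.text if price else None)
    ratings.append(rating.text if rating else None)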
STEP 7 :
Bravo, we have got everything. Now let's use our favourite pandas to store the data in a DataFrame and output it as a CSV.
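A minimal sketch; the file name laptops.csv is just a placeholder:

# Collect the three lists into a table and write it out as a CSV
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('laptops.csv', index=False)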
Rahul has extracted a total of 24 product names, prices, and ratings in just 2 minutes.
Web scraping is really easy to implement. After seeing the csv file in your downloads, you will really feel accomplished.
We are going through tough times; let's upskill ourselves for a better tomorrow. Let me know if you get stuck anywhere. You can also find the entire code on my GitHub.
STAY HOME || WASH YOUR HANDS || STAY SAFE