Starting On Data Scraping?

A few weeks ago a friend approached me to scrape data for a class project we were working on. It was a week to the scheduled class, and we were planning to do a demand forecast for the services of a particular telecom company in Kenya using its Twitter data. To me it was an easy request, because I had done a bit of Twitter scraping before and had learnt how to put together a scraping script within minutes and request the data. The data I needed had to contain a customer's tweet asking the telecom's customer service for assistance and the response they received.

Scraping is basically taking data from an online platform and saving it to a file, most often a CSV depending on the developer's choice (others go for JSON or XML). There are basically two forms of data scraping that I have worked with (and that I believe exist):

  1. Website source code
  2. API based

Website Source Code


Technically, this is just inspecting the code of a target website, as shown in a browser, to get the data you are looking for. It involves sending a request to the website's URL, and the request returns the whole source code that you would otherwise see in a web browser. Simply put, your code pretends to be a browser in order to get a response from the servers (a layman's description).

With this form of scraping there are two ways to get the data.

  1. Directly request the source code - This is just as simple as described above, and it is the most common way. You request the page and assign the returned source code to a variable (see the sketch after this list).
  2. Browser simulation - This is a little more complex: your code simulates the operations of a browser. It is mainly used when the website you need to scrape requires browser actions such as scrolling and clicking. It launches a browser and extracts the data straight from that browser.
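A minimal sketch of the first approach, assuming the requests library; the URL here is made up and the page you would actually request depends on your target site:

```python
import requests

# Hypothetical target page; replace with the site you actually want to scrape.
url = "https://example.com/products"

# Pretend to be a browser so the server returns the normal HTML page.
headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/0.1)"}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

# The whole source code of the page, assigned to a variable as plain text.
source_code = response.text
print(source_code[:500])  # peek at the first 500 characters
```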

Now, with this form you need a bit of HTML knowledge, because you have to know which tags hold your data. Not all websites have the same HTML structure unless they are copies of each other or were built by the same developer (and most developers switch their approach as they learn new things with each project). This means you have to inspect the source code of the website or app in a browser, which is one disadvantage. Another disadvantage comes when you need to manipulate the page you want to scrape. Many web apps generate content as you scroll down, meaning that to get all the data on a page you have to keep scrolling; these are called infinite pages.
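Here is a rough sketch of pulling data out of those tags with BeautifulSoup once you have inspected the page; the div, h2 and span selectors below are made up for illustration, and you would find the real ones by inspecting the page in your browser:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical page
html = requests.get(url, timeout=30).text

soup = BeautifulSoup(html, "html.parser")

# The tag and class names below are placeholders; inspect the real page
# (right-click -> Inspect) to find the ones that actually hold your data.
rows = []
for card in soup.find_all("div", class_="product-card"):
    name_tag = card.find("h2")
    price_tag = card.find("span", class_="price")
    if name_tag and price_tag:
        rows.append({
            "name": name_tag.get_text(strip=True),
            "price": price_tag.get_text(strip=True),
        })

print(rows[:5])
```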

That isn't all. There are sites where you have to move through a number of pages. These are easy when each page has its own link that differs from the base link by just a number: you can simply create a loop that increments the number automatically. But what happens when you find a website where JavaScript controls the switching between pages? Most such sites require you to click a button, so you have to script your simulation to make those clicks and scrape as much data as possible with the least effort. That is where simulation comes in handy over simply requesting the source code.
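Below is a sketch of both situations: a plain loop over numbered page links, and a Selenium simulation that scrolls and clicks a next-page button on a JavaScript-controlled site. The URL pattern and the button selector are hypothetical.

```python
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

# 1. Numbered pages: the page number is the only thing that changes in the URL.
base_url = "https://example.com/reviews?page={}"  # hypothetical pattern
pages = []
for page in range(1, 11):                 # scrape the first 10 pages
    html = requests.get(base_url.format(page), timeout=30).text
    pages.append(html)

# 2. JavaScript-controlled pages: simulate a browser, scroll, and click "Next".
driver = webdriver.Chrome()               # assumes a Chrome driver is available
driver.get("https://example.com/reviews")

for _ in range(10):
    # Scroll to the bottom so any lazily loaded content appears.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)                         # give the page time to load
    # Hypothetical selector; inspect the real page to find the actual button.
    driver.find_element(By.CSS_SELECTOR, "button.next-page").click()

rendered_html = driver.page_source        # the source straight from the browser
driver.quit()
```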

My worst experience has been working with websites that have almost no sizable data: you spend an hour writing a script for work you could simply have copy-pasted. I have worked with several packages, such as urllib (both the 2 and 3 variants, I think; I have worked with both), requests, Selenium (helpful when it comes to simulating a browser) and BeautifulSoup (I was introduced to it at a hackathon where we were doing sentiment analysis, and I have since used it in almost all my scraping and even NLP work).

API Based


Now, compared with the above, I would say this is the more exciting and easier form of data scraping, though as my story will tell you, it has some hidden landmines. Many companies have started creating developer sections in their web apps where developers can access data from the app, e.g. Twitter. The main reason is to encourage developers to come up with new solutions that can increase value for the companies' customers. Think of it as a supermarket allowing KFC (or any other big franchise) to open a stand inside it (or vice versa).

So what the company actually does is create a base link that, together with some parameters, lets you get the particular data you require. Most of these APIs are CRUD (Create, Read, Update and Delete) APIs that can both fetch and manipulate the company's data. Because of that, most if not all are authenticated, and you have to go through a process to obtain access tokens.

To scrape data with an API, all you need to do is authenticate yourself using the provided access tokens and supply parameters that specify the kind of data you are looking for. This gives you access to the data you require, but within limits set by the API provider, such as a cap on the number of records and data types per request. You really need to understand the different parameters and how to use them with the base URL.
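As a rough illustration (the endpoint, token and parameter names here are placeholders, not any particular provider's API), an authenticated API call usually looks something like this:

```python
import requests

# Hypothetical API endpoint and token; every provider documents its own
# base URL, parameters, and authentication flow in its developer section.
BASE_URL = "https://api.example.com/v1/search"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"          # issued by the provider's developer portal

headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
params = {
    "query": "customer care",               # what you are looking for
    "limit": 100,                            # provider-imposed cap per request
}

response = requests.get(BASE_URL, headers=headers, params=params, timeout=30)
response.raise_for_status()

data = response.json()                       # most APIs return JSON rather than HTML
print(len(data.get("results", [])), "records returned")
```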

Personally, I can't say I have extensively worked with many different APIs for scraping, but I'm confident I'm capable of working with any API and getting the data I require. My previous experience has mainly been with the Twitter APIs. Working with the Twitter APIs gives you exposure to a broad spectrum of API work, and the best thing about Twitter scraping is the number of available libraries, especially when working with Python.

Some of the libraries simply make it easy to build an API query by constructing it for you, while others use the web app's own operations to acquire data according to the parameters you provide. The latter, in effect, jump over the API's restrictions like the fox did the fence.

Hence, armed with the requirements for the data, I used this form of scraping and worked with both GetOldTweets3 and Tweepy. GOT3 can get data as old as Twitter itself (I haven't tested that) and return a huge number of tweets, as many as you want. That isn't the case with Tweepy, which limits the number of tweets per 15-minute window, and you can't get tweets older than a week unless you have the ID of the particular tweet you want.
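For reference, here is a Tweepy sketch along the lines of what I ended up with. The keys and the support handle are placeholders, and depending on your Tweepy version the search method is api.search or api.search_tweets; the point is that wait_on_rate_limit handles the 15-minute windows for you.

```python
import tweepy

# Keys come from the Twitter developer portal; the values here are placeholders.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# wait_on_rate_limit makes Tweepy sleep through the 15-minute rate-limit
# windows instead of silently returning nothing.
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search recent tweets mentioning the telecom's support handle
# (hypothetical handle; older Tweepy calls this api.search, newer api.search_tweets).
tweets = []
for tweet in tweepy.Cursor(api.search,
                           q="@TelecomCare",
                           lang="en",
                           tweet_mode="extended").items(500):
    tweets.append({
        "user": tweet.user.screen_name,
        "text": tweet.full_text,
        "created_at": tweet.created_at,
    })

print(len(tweets), "tweets collected")
```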

So, I decided I would scrape over the weekend. My initial thought was that I would write the script on Friday evening and leave it running until morning. The code ran as expected, except that the required output was never produced. The life of a developer is sleepless nights debugging code, and that is exactly what happened to me that whole weekend. On Friday I kept editing the code every hour depending on the crash, but when it still didn't work I had to dig in and find out what the matter was.

Something I had expected to take the weekend at worst took the whole week and interfered with my concentration at work; as any developer can tell you, a bug bugs you (pun intended) until you trace and solve it. In the end, the problem was the limitations of Tweepy, which made the script return blanks. I finally figured out how to use them to my advantage. By the end of the week I had written close to 600 lines of code, including those I deleted.

Think of this as a glimpse of the scraping journey; it just gives you an insight into data scraping. Even though some of the failures I have faced in scraping look like low points, I see them as the most joyous points of my development journey, because they gave me the opportunity to learn something new and be creative. That is why I enjoy it when my friends reach out to me for help. It is not because I know everything or because it buffs up my ego, but because I get to learn something new with someone.
