Web Scraping: Techniques and Ethics

In today's data-driven world, information is gold. According to widely cited estimates, more than half of the world's population uses the internet, with roughly 1.7 MB of data created every second for every person on Earth. Websites hold vast troves of data, but manually extracting it can be tedious and time-consuming. Have you ever wondered how organizations gather data about their competitors online or predict market trends? Web scraping, also known as screen scraping, web data mining, web harvesting, or web data extraction, serves exactly this purpose: collecting data from websites automatically. Data scientists can leverage such a humongous amount of data for tasks like real-time analytics, training predictive machine learning models, and improving natural language processing capabilities to make accurate predictions in their respective business fields. I felt compelled to write about this under-discussed topic, exploring its ethical boundaries alongside the technical aspects of execution and prevention. Understanding web scraping is also crucial for system design interview rounds. So, let's delve deeper.

If you find it insightful and appreciate my writing, consider following me for updates on future content. I'm committed to sharing my knowledge and contributing to the coding community. Join me in spreading the word and helping others to learn.

Purpose of Web Scraping

Web scraping is used for various reasons, but it all boils down to one core benefit: efficiently collecting large amounts of data from websites. This data can then be put to many uses, as explained below.


  1. Business Intelligence: Companies use web scraping to track competitor pricing, gather market research on trends and demographics, and identify potential leads or customers.
  2. Price Monitoring: Businesses, especially e-commerce companies, can scrape product details from competitor websites to track price fluctuations and ensure they remain competitive.
  3. Market Research: Researchers can scrape data from various sources to understand market trends, customer sentiment, and demographics to inform product development and marketing strategies.
  4. Data Analysis: Web scraping allows researchers and data analysts to collect large datasets for studies, social analysis, and other projects.
  5. Personal Use: Individuals can use web scraping for tasks like building project lists, tracking movie releases, or automating repetitive tasks on websites.

In practice, web scraping is used in virtually every industry.

If you want to understand the business behind this web-scraping practice, you may consider the excellent article I recently came across on the topic.

Web Scraping System Architecture

Web scraping uses two components: a crawler and a scraper. The crawler is an automated program (often enhanced with AI/ML heuristics) that browses the web, following links to find pages containing the data of interest. The scraper is the tool that extracts data from those pages. Based on the collected data, analysis and reporting are done. I have drawn a system design diagram with the fundamental building blocks of the web scraping process. I assume this diagram is self-explanatory, but if you come across any doubts or queries, feel free to leave a comment.

Let's take a deep dive into the various scraping techniques by which information can be gathered from websites.

Web Scraping Techniques

Now, let's delve into various scraping techniques, each of which works differently. Please note that there is a significant overlap between these different scraping approaches, as there's no single, straightforward method to follow.

1. Crawler/Spider

Spider/crawler tools behave like a DFS (Depth-First Search) over the web graph (in reality, production crawlers are complex systems, often integrated with AI/ML components) that recursively follows links to other pages to extract data. Web crawlers like Googlebot or website copiers like HTTrack are prime examples of this technique. The technique is sometimes used for targeted scraping to acquire specific data, often in combination with an HTML parser to extract the desired information from each page.
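
To make the idea concrete, here is a minimal, depth-limited crawler sketch in Python. It assumes the requests and beautifulsoup4 packages are installed; the starting URL is a placeholder. Production crawlers add politeness delays, robots.txt checks, deduplication at scale, retries, and parallelism.

# A minimal recursive (DFS-style) crawler sketch.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(url, max_depth=2, visited=None):
    visited = visited if visited is not None else set()
    if max_depth < 0 or url in visited:
        return
    visited.add(url)
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return
    soup = BeautifulSoup(response.text, "html.parser")
    title = (soup.title.string or "").strip() if soup.title else "no title"
    print(url, "->", title)
    for anchor in soup.find_all("a", href=True):
        link = urljoin(url, anchor["href"])
        # Stay on the same domain to keep the crawl bounded.
        if urlparse(link).netloc == urlparse(url).netloc:
            crawl(link, max_depth - 1, visited)

# crawl("https://example.com")  # hypothetical starting URL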

2. Shell Scripts

Sometimes, common Unix tools are used for scraping. Tools like Wget or Curl can be used to download web pages, while Grep (using regular expressions) can be used for data extraction. However, this approach is not for real-time data collection; it's more suited for offline analysis.
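
Since shell one-liners vary by environment, here is the same wget-plus-grep idea sketched with Python's standard library only: download a page, then "grep" it with a regular expression. The URL and the email-matching pattern are purely illustrative.

# Download a page and extract matches with a regex (offline-style, not real time).
import re
import urllib.request

url = "https://example.com"  # hypothetical target page
html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", errors="replace")

# Crude regex "grep" for email-like strings; regexes over HTML are brittle,
# which is exactly why dedicated HTML parsers (next section) are usually preferred.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)
print(emails)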

3. HTML Parser

This technique resembles the regex-based shell approach, but instead of matching raw text it extracts data based on patterns within the HTML structure, ignoring irrelevant content. Libraries like Jsoup and Scrapy are commonly used for this type of scraping. For example, consider a website with a search feature: the scraper could submit a search query and then extract all the resulting links and their titles from the results page's HTML. This allows for targeted acquisition of specific data points, and the approach is widely used in industry.
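
Below is a minimal HTML-parser sketch using BeautifulSoup (assumed installed). The search endpoint, query parameter, and CSS class names are assumptions for illustration; Jsoup (Java) and Scrapy (Python) apply the same idea with more machinery.

# Submit a search query and extract result links and titles from the HTML structure.
import requests
from bs4 import BeautifulSoup

# Hypothetical search endpoint and query parameter.
response = requests.get("https://example.com/search", params={"q": "laptops"}, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract every result link and its title based on the page's markup.
for result in soup.select("div.search-result a"):
    print(result.get_text(strip=True), "->", result.get("href"))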

4. Screen Scrapers

This technique involves opening a website in a real browser, allowing JavaScript, AJAX, and other dynamic functionality to run. The desired text is then extracted from the rendered page using one of two primary approaches (a minimal sketch follows the list):

  • DOM Manipulation: After the page fully loads and JavaScript executes, the HTML can be retrieved from the browser, and a DOM parser can be used to extract specific data points. This method is prevalent, and many anti-scraping countermeasures specifically target it.
  • Screenshot and OCR (Optical Character Recognition): In rare cases, a screenshot is captured of the rendered webpage, and then OCR technology is used to extract text from the image. However, this is a more complex method typically employed only when conventional scraping techniques fail. Common tools for screen scraping include Selenium and PhantomJS.
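
Here is the minimal DOM-manipulation sketch promised above, using Selenium 4 with Chrome (Selenium Manager is assumed to resolve the driver). The target page and selector are placeholders.

# Drive a real browser, let JavaScript render the page, then read the DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(5)  # give dynamic content a moment to appear
try:
    driver.get("https://example.com")  # hypothetical JavaScript-heavy page
    # At this point the browser has executed JavaScript/AJAX, so the DOM is fully rendered.
    for heading in driver.find_elements(By.CSS_SELECTOR, "h1, h2"):
        print(heading.text)
finally:
    driver.quit()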

5. Third-Party Scraping Service Providers

If you'd rather not develop your own scraping mechanism, you may consider using third-party scraping services like Zyte or Kimono. These services often combine automated and manual approaches to extract the required data. Additionally, AI technology is increasingly being used to automate the manual interpretation of data and expedite the scraping process. You can find more tools in this article.

Is it an Ethical or Legal Practice?

Web scraping can be misused, but it can also be an ethical practice for collecting data for market analysis, provided some guidelines and professional courtesies are followed. Let's explore some of the core ethical principles practiced in the industry.

  1. Respect robots.txt: This file specifies which pages are off-limits to scraping. If certain pages are disallowed in robots.txt, the scraper should respect that (see the sketch after this list).
  2. Avoid overloading servers: Scrapers should not send excessive requests that can slow down or crash websites.
  3. Use data responsibly: Don't violate privacy or scrape for illegal purposes.
  4. Be transparent: If you're scraping for commercial use, disclose it.
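
As a quick illustration of the first point, Python's standard library ships a robots.txt parser. The user agent string and URLs below are placeholders.

# Check robots.txt before fetching a URL, using only the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot", "https://example.com/private/report"):
    print("Allowed to fetch this path")
else:
    print("robots.txt disallows this path; skip it")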

While scouring the internet, I found an insightful article with a decision-tree image that makes it easy to determine whether a given scraping practice is legal.

Web Scraper Prevention Techniques

There's a fine line between using public data ethically and breaching data privacy, and businesses sometimes cross it. This is why preventing website scraping can be important: such operations can damage a website's performance as well as its reputation. However, designing a completely scrape-proof application is nearly impossible; the realistic goal is to make scraping operations harder to execute. This is also an important aspect of web security.

1. Monitor Logs & Traffic Patterns to Limit Unusual Activity

Checking the logs regularly is paramount. If you detect unusual activity suggesting automated access (scrapers), such as a high volume of similar actions from a single IP address, the mitigation plan is to implement blocking or access limitations (rate limiting). However, don't rely solely on IP blocking; consider other indicators and methods to identify specific users or scrapers. Here are some helpful signals:

  • User behavior: Analyze how quickly users fill out forms and where buttons are clicked. This user behavior data can be indicative of automated scraping.
  • User data collection: You can gather information like screen resolution, time zone, and installed fonts by executing JavaScript snippets. This data can help identify users or scrapers.
  • HTTP headers: The order and content of HTTP headers, especially the User-Agent header, can provide clues about the type of client accessing your website.
  • Identifying distributed scraping: Similar requests can be identified even if they originate from different IP addresses by analyzing request patterns. This might indicate scraping using a botnet or proxy network. If you encounter a sudden surge of identical requests from any IP address, consider blocking them. However, be cautious to avoid inadvertently blocking legitimate users.

This approach can be effective against screen scrapers that execute JavaScript, as it allows you to collect a large amount of client-side information. If you wish to learn more, there is plenty of material available on IP address blocking and rate limiting.
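
As a rough illustration of rate limiting, here is a toy per-IP sliding-window counter in Python. The window size and request threshold are assumptions you would tune from your own traffic logs; production systems usually enforce this at a reverse proxy, API gateway, or WAF rather than in application code.

# A toy in-memory per-IP rate limiter (sliding window).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100  # assumed threshold; tune from real traffic patterns

_requests = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    now = time.time()
    window = _requests[ip]
    # Drop timestamps that fell out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False  # candidate for throttling, a CAPTCHA, or a temporary block
    window.append(now)
    return True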

2. Introduce Registration & Login with Captchas

CAPTCHAs, those "I'm not a robot" tests, offer a simple and affordable way to deter web scraping bots. While they can effectively discourage automated attacks, they can also be frustrating for legitimate users, hindering their experience. If your website requires account creation, CAPTCHAs during registration and login can further thwart bots and allow you to track user activity, identifying and banning scraping accounts. However, the potential drawback of user friction requires careful consideration. Ultimately, the decision to implement CAPTCHAs comes down to weighing the need for security against the importance of a smooth user experience. Note, too, that mandatory login is not a workable approach for most e-commerce businesses, where users expect to browse product catalogs without signing in.

3. Serve Text Content as an Image

Rendering text into an image on the server side and serving that image for display can hinder simple scrapers from extracting the text. However, this approach has several drawbacks:

  • Accessibility: It hinders screen readers and makes content inaccessible to visually impaired users.
  • Search Engine Optimization (SEO): Search engines cannot index the text within the image, harming your website's searchability.
  • Performance: This method can negatively impact website performance due to increased image loading times.
  • Legality: In some regions, it might violate accessibility laws like the Americans with Disabilities Act (ADA).

While it might deter simple scrapers, the text within the image can still be extracted using Optical Character Recognition (OCR) tools. Therefore, avoid this approach if you need to prioritize accessibility, search engine optimization, and website performance, or are subject to relevant accessibility regulations.
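
For completeness, here is what the server-side rendering step might look like with the Pillow library (assumed installed). The text and output path are placeholders, and the drawbacks listed above still apply: OCR can recover the text, and accessibility, SEO, and performance all suffer.

# Render a sensitive text snippet as an image instead of serving it as HTML text.
from PIL import Image, ImageDraw

def render_text_as_image(text: str, path: str) -> None:
    image = Image.new("RGB", (320, 40), color="white")
    draw = ImageDraw.Draw(image)
    draw.text((10, 12), text, fill="black")  # uses Pillow's default bitmap font
    image.save(path)

# render_text_as_image("Price: $19.99", "price.png")  # serve this image in place of the text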

4. Limit Data Access and Implement Pagination

Do not provide a way for a script or bot to download your entire dataset at once. Instead, implement well-authorized, paginated access to your data. This allows for controlled retrieval and easier monitoring to identify potential misuse by scrapers. Pagination is a recommended practice not only for deterring scraping but also as a rule of thumb in web development, since it improves performance and user experience by delivering data in manageable chunks.
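
A minimal sketch of a capped, paginated endpoint is shown below, using Flask (assumed installed). The dataset, route, and page-size limit are illustrative; the point is simply that no single request can return the whole dataset.

# Paginated API with a hard per-request cap.
from flask import Flask, jsonify, request

app = Flask(__name__)
ITEMS = [{"id": i, "name": f"product-{i}"} for i in range(10_000)]  # stand-in data
MAX_PAGE_SIZE = 50

@app.route("/api/products")
def list_products():
    page = max(int(request.args.get("page", 1)), 1)
    size = min(int(request.args.get("size", 20)), MAX_PAGE_SIZE)  # cap per request
    start = (page - 1) * size
    return jsonify(ITEMS[start:start + size])

# app.run()  # combine with authentication and rate limiting to make bulk extraction easy to spot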

5. Dynamic XPath

This approach isn't effective at completely stopping scraping attempts, but it can disrupt scraper functionality and deter recurring attempts. It involves using a CSS pre-processor library to automatically generate random prefixes or suffixes for your CSS classes and HTML element IDs. These prefixes or suffixes would change after every website deployment.

For example, the CSS class .article-content might become something like .a4c36dda13eaf0. It's also important to change the length of your IDs and classes regularly. Otherwise, scrapers could still use XPath expressions like div.[any-14-characters] to target the desired elements.

While this method can disrupt scrapers that rely on static element identifiers, it will only be effective temporarily. The scraper might work initially but will likely break after repeated attempts as the element IDs and classes change. Tools like PostCSS can be useful for implementing this approach.
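
As a toy illustration of the idea, the following Python build-step script appends a per-deployment suffix to a chosen set of class names in the build output. In practice you would do this inside a CSS pre/post-processing pipeline such as PostCSS; the class list and output directory here are assumptions.

# Append a random per-deployment suffix to selected class names in HTML/CSS/JS files.
import re
import secrets
from pathlib import Path

SUFFIX = secrets.token_hex(4)           # changes on every deployment
CLASSES = ["article-content", "title"]  # classes you want to obfuscate (illustrative)

def rewrite(path: Path) -> None:
    text = path.read_text(encoding="utf-8")
    for cls in CLASSES:
        text = re.sub(rf"\b{re.escape(cls)}\b", f"{cls}-{SUFFIX}", text)
    path.write_text(text, encoding="utf-8")

# for file in Path("dist").rglob("*"):            # hypothetical build output directory
#     if file.suffix in {".html", ".css", ".js"}:
#         rewrite(file)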

However, this technique has a serious drawback: Caching Issues! Changing element IDs or classes frequently hinders browser caching. Since CSS and JavaScript files might also need to be updated to reflect these changes, they will be re-downloaded by the browser, increasing page load times for repeat visitors and server load. If you only change the identifiers once a week, the caching impact might be minimal. But it's still a consideration.

Overall, dynamically changing element IDs and classes disrupts scrapers but is not a foolproof solution. It can introduce maintenance challenges and negatively impact website performance.

6. Geolocation-Based Content Delivery

This approach shares some similarities with the previous approach. By serving different HTML content based on the user's location (determined by IP address), you can potentially disrupt scrapers designed for specific regions.

For instance, a scraper might function initially but will eventually break when the HTML content served to users differs by geolocation, since the scraper wouldn't be programmed to handle this variation.

An advantage of this method is that it has a minimal impact on caching. Caching can still be implemented at the network load balancer level.
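
A rough sketch of location-based content delivery is shown below, using Flask. The lookup_country helper is hypothetical and stands in for a real GeoIP lookup (for example, a MaxMind-style database), and the template names are placeholders.

# Serve structurally different templates depending on the client's region.
from flask import Flask, render_template, request

app = Flask(__name__)

def lookup_country(ip: str) -> str:
    # Placeholder: a real implementation would query a GeoIP database here.
    return "US"

@app.route("/catalog")
def catalog():
    country = lookup_country(request.remote_addr or "")
    # Scrapers tuned to one layout break when the markup varies by region.
    template = "catalog_us.html" if country == "US" else "catalog_intl.html"
    return render_template(template)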

7. Honeypotting

Screw with the scraper: insert fake, invisible honeypot items into your HTML to catch scrapers. Let's examine this snippet.

<div class="product-search-result" style="display:none">
  <h3 class="title">This is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scrape, please click the link below.</p>
  <a class"search-result-link" href="/scraper-trap">I'm a scraper, block me for 24 hours!</a>
</div>        

Honeypots can reveal scraping activity. These invisible data fields are embedded within your HTML code.

A scraper programmed to collect all search results will capture the honeypot link just like any real search result on the page. It will then mistakenly visit this link searching for the desired content. However, a real human user wouldn't see the honeypot link because it's hidden with CSS. Similarly, legitimate web crawlers like Googlebot won't visit the link because your robots.txt file should disallow access to URLs containing "/scraper-trap". This approach helps distinguish between scrapers and genuine users or search engine crawlers.
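
The server-side half of the honeypot might look like the following Flask sketch: any client that requests the trap URL is treated as a scraper and blocked for 24 hours. The in-memory block list is a simplification; a real deployment would persist this in something like a shared cache.

# Block any IP that follows the hidden honeypot link.
import time
from flask import Flask, abort, request

app = Flask(__name__)
BLOCKED = {}               # ip -> unblock timestamp (in-memory for brevity)
BLOCK_SECONDS = 24 * 3600

@app.before_request
def reject_blocked_ips():
    until = BLOCKED.get(request.remote_addr)
    if until and time.time() < until:
        abort(403)

@app.route("/scraper-trap")
def scraper_trap():
    # Only bots following the hidden link land here; humans never see it.
    BLOCKED[request.remote_addr] = time.time() + BLOCK_SECONDS
    abort(403)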

Disruption by AI/ML Technology

Web scraping is becoming more sophisticated with the help of Artificial Intelligence (AI) and Machine Learning (ML) technologies. AI/ML can handle websites that frequently change their layout by learning to adapt and identify the data you need regardless of presentation. It can also ensure the quality of the scraped data by checking for accuracy and consistency, and even categorize the information based on your needs; for example, it can sort through social media reviews and separate positive from negative ones. At the same time, AI-assisted scraping still needs to remain ethical: follow website guidelines and avoid overwhelming servers. You can read through this article for more insight.

Pet Project Ideas

Congratulations! If you've reached this point, you've grasped the fundamentals of website scraping. This topic frequently appears in system design interviews, so here is an exercise to hone your skills or consider for a personal project: Write algorithms (both recursive and non-recursive) to fetch all paragraphs from a given URL by following anchor tags in every child page. Please leave a comment if you have written an algorithm with better time and space complexity.

Are you interested in exploring more system design concepts? Then please check out my older articles on this topic for further insights. However, I have a plan to write about these pet projects in detail. So, stay tuned!

https://www.dhirubhai.net/posts/hello-amit-pal_frontenddevelopment-webdev-javascript-activity-7164846000623247360-bxRi/ and https://www.dhirubhai.net/posts/hello-amit-pal_systemdesign-highleveldesign-lowleveldesign-activity-7165570780615888897-I_7r/

Website owners may implement more sophisticated defenses as AI continues enhancing scraping techniques. Ultimately, responsible scraping practices that prioritize data accuracy and respect for websites will be key to maintaining a healthy ecosystem for online data collection.

