
The hidden costs of web scraping

Ivan Vokhmin

Lead Engineer Frontend @ moebel.de Einrichten & Wohnen GmbH | AWS, Team Leadership, Software Architecture, AI

Published: July 26, 2024

Over my long career as a developer, I have encountered multiple cases of companies taking data directly from websites. What these cases had in common was a lack of understanding on the business side of how scraping differs from "normal" IT projects, which led to unpredictable effort and exploding costs during execution. Here I summarize the key takeaways for business stakeholders that I have seen neglected.

What is web scraping

Web scraping is the process of extracting data from a web page using bots that parse HTML and process the human-readable content. It is used for search indexing, but also for connecting to data that cannot be fetched through an API (for example, showing a user's benefits from a specific web page in another app).

What are the main "hidden costs" of web scraping?

Here are the main ones:

1) No predictable lifecycle of a connector

If a website offers an API, it usually comes with lifecycle guarantees: APIs have support periods and lifecycle management, so you can predict roughly when a connector will need to be updated. Web scraping offers no such guarantee. Any design update, even a minor one, can affect scraper functionality, and large updates may break scrapers entirely. These changes are unpredictable: some websites change once a year, others every month or even every week.
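Since there is no changelog to watch, one way to notice a redesign early is to check that the markers the scraper depends on are still present before trusting the extracted data. The following is a minimal sketch using only the Python standard library; the class names `product-title` and `product-price` and both HTML snippets are hypothetical and stand in for whatever a real target page uses.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in the document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_markers(html, required_classes):
    """Return the required class names that the page no longer contains."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(set(required_classes) - collector.classes)


# Hypothetical markers this imaginary scraper relies on.
REQUIRED = ["product-title", "product-price"]

# Hypothetical page before and after a small redesign.
old_layout = '<div class="product-title">Sofa</div><span class="product-price">499</span>'
new_layout = '<h2 class="title-v2">Sofa</h2><span class="product-price">499</span>'

print(missing_markers(old_layout, REQUIRED))
print(missing_markers(new_layout, REQUIRED))
```

A check like this does not prevent breakage. It only turns a silent failure into an alert, so the maintenance effort still arrives on the website's schedule, not yours.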

2) Too many edge cases

Want to scrape data on behalf of a user? Good. Why do scrapers fail on a user's birthday? Because a congratulation certificate is shown instead of the normal menu, and the scraper does not expect that (unless you have already encountered this case). What about menu differences between gold, silver and platinum users? Does the dev team have test accounts for every possible case?
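A defensive scraper can at least classify what it is looking at before trying to parse it, so that an unexpected page is reported as such instead of crashing the run. The sketch below assumes the page states can be told apart by simple markers; the ids `account-menu`, `birthday-banner` and `promo-overlay` are hypothetical.

```python
from enum import Enum


class PageKind(Enum):
    MENU = "menu"                  # the page the scraper was written for
    INTERSTITIAL = "interstitial"  # a known page that replaces the menu
    UNKNOWN = "unknown"            # an edge case nobody has seen yet


# Hypothetical markers; real ones depend on the target website.
MENU_MARKER = 'id="account-menu"'
INTERSTITIAL_MARKERS = ('id="birthday-banner"', 'id="promo-overlay"')


def classify_page(html: str) -> PageKind:
    """Decide which kind of page was returned before any data is extracted."""
    if any(marker in html for marker in INTERSTITIAL_MARKERS):
        return PageKind.INTERSTITIAL
    if MENU_MARKER in html:
        return PageKind.MENU
    return PageKind.UNKNOWN


print(classify_page('<div id="birthday-banner">Happy birthday!</div>'))
print(classify_page('<nav id="account-menu">Orders</nav>'))
print(classify_page('<p>Something new</p>'))
```

The list of known interstitials only grows as cases are discovered in production, which is exactly the cost this section describes.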

3) Sudden termination

Suddenly, the company that owns the website takes a dislike to your bot and starts sending it to a CAPTCHA. While some challenges can be bypassed, the effort and failure rate may not be worth it.
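To judge whether continuing is worth it, it helps to measure how often the bot is being challenged. Below is a minimal sketch that tracks the challenge rate over a sliding window and signals when to stop. The substring heuristic in `looks_like_challenge`, the window size and the threshold are all assumptions chosen for illustration, not values from a real deployment.

```python
from collections import deque


def looks_like_challenge(html: str) -> bool:
    """Crude, assumed heuristic: the page mentions a CAPTCHA."""
    return "captcha" in html.lower()


class ChallengeMonitor:
    """Tracks how many of the most recent responses were challenges."""

    def __init__(self, window: int = 20, max_challenge_rate: float = 0.5):
        self.results = deque(maxlen=window)
        self.max_challenge_rate = max_challenge_rate

    def record(self, html: str) -> None:
        self.results.append(looks_like_challenge(html))

    def challenge_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_stop(self) -> bool:
        # Only decide once the window is full, to avoid reacting to one bad page.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.challenge_rate() > self.max_challenge_rate


monitor = ChallengeMonitor(window=4, max_challenge_rate=0.5)
for page in ["<p>data</p>", "<div>CAPTCHA</div>", "<div>captcha</div>", "<div>Captcha</div>"]:
    monitor.record(page)
print(monitor.challenge_rate(), monitor.should_stop())
```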

4) Big compute cost and long execution time

Classic server-rendered apps are slowly dying out. This is the era of SPAs, where executing client-side JavaScript is pivotal for obtaining the data. That means running headless Chrome or Firefox via Puppeteer or a Selenium driver; a good old plain-HTML bot will not suffice anymore. The computational cost of running multiple Chrome scrapers is immense: you will need one or more big, expensive servers. Also, because pages are slow (they fetch data in order to render), expect every interaction to take time. Minutes of time. Dozens of minutes if you need to scrape paginated data.
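A back-of-the-envelope estimate makes this trade-off concrete. The sketch below is plain arithmetic. The 30 seconds per page and the 500 MB per headless browser are assumed placeholder figures, not measurements; replace them with numbers observed on your own target site and hardware.

```python
import math


def estimate_run_minutes(pages: int, seconds_per_page: float, parallel_browsers: int = 1) -> float:
    """Wall-clock minutes to visit all pages when browsers work in parallel batches."""
    batches = math.ceil(pages / parallel_browsers)
    return batches * seconds_per_page / 60


def estimate_memory_mb(parallel_browsers: int, mb_per_browser: int = 500) -> int:
    """Rough memory needed; mb_per_browser is an assumed placeholder value."""
    return parallel_browsers * mb_per_browser


# Assumed example: 40 paginated pages at 30 seconds each.
print(estimate_run_minutes(40, 30))                       # one browser
print(estimate_run_minutes(40, 30, parallel_browsers=4))  # four browsers
print(estimate_memory_mb(4))                              # memory for four browsers
```

Under these assumptions, parallel browsers shorten the run but multiply the memory bill, so time and server cost pull in opposite directions.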

Conclusion

Where web scraping is used, an API is missing. Resort to web scraping only as a last option. But if you do go down that path, remember: creating a scraper for a website is only a small fraction of the total (and unpredictable) maintenance effort.
