The hidden costs of web scraping
Ivan Vokhmin
Lead Engineer Frontend @ moebel.de Einrichten & Wohnen GmbH | AWS, Team Leadership, Software Architecture, AI
Over my long career as a developer, I have encountered multiple cases where companies took data directly from websites. What these cases had in common was a lack of understanding on the business side of how scraping differs from "normal" IT projects, which led to unpredictable effort and cost explosions during execution. Here I summarize the key takeaways for business that I have seen neglected.
What is web scraping
Web scraping is the process of extracting data from a web page using bots that parse HTML and process data intended for human readers. It is used for search indexing, but also for connecting to data that cannot be fetched through an API (for example, showing a user's benefits from a specific web page in another app).
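To make the idea concrete, here is a minimal sketch of a plain-HTML scraper using only the Python standard library. The page markup, the "benefit" class name, and the function names are assumptions made up for illustration; a real site will look different.

```python
from html.parser import HTMLParser


class BenefitParser(HTMLParser):
    """Collects the text of every <li class="benefit"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.benefits = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "benefit" in classes:
            self._inside = True
            self.benefits.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        # Only keep text that sits inside a matching <li>.
        if self._inside:
            self.benefits[-1] += data


def extract_benefits(html):
    parser = BenefitParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.benefits]


# Hypothetical page content; in practice this would come from an HTTP response.
SAMPLE_PAGE = """
<html><body>
  <ul>
    <li class="benefit">Free shipping</li>
    <li class="benefit">10% discount</li>
    <li class="other">Contact us</li>
  </ul>
</body></html>
"""

print(extract_benefits(SAMPLE_PAGE))
```

Note that the scraper depends entirely on markup details (tag and class names) that the website owner never promised to keep stable.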
What are the main "hidden costs" of web scraping?
Here are the main ones:
1) No predictable lifecycle of a connector
If a website offers an API, it usually comes with some lifecycle guarantees: APIs have support periods and lifecycle management, so you can roughly predict when the connector will need to be updated. Web scraping has no such guarantees. Any design update, even a minor one, can affect scraper functionality, and large updates may break scrapers entirely. These changes are unpredictable: some websites change once a year, some every month or even every week.
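Since breakage cannot be predicted, one common mitigation is to make the scraper fail loudly instead of silently returning wrong or empty data. The sketch below assumes a hypothetical page where the price sits in a span with class "price"; the pattern, the exception name, and both layouts are invented for illustration.

```python
import re


class LayoutChangedError(Exception):
    """Raised when the page no longer matches what the scraper expects."""


# Hypothetical pattern: the price is expected in <span class="price">…</span>.
PRICE_PATTERN = re.compile(
    r'<span class="price">\s*([0-9]+(?:\.[0-9]{2})?)\s*</span>'
)


def extract_price(html):
    match = PRICE_PATTERN.search(html)
    if match is None:
        # Do not guess: report the breakage so that someone can fix the scraper.
        raise LayoutChangedError(
            "price element not found; the page layout may have changed"
        )
    return float(match.group(1))


OLD_LAYOUT = '<div><span class="price">19.99</span></div>'
# A minor redesign renamed the CSS class; the visible page looks the same.
NEW_LAYOUT = '<div><span class="product-price">19.99</span></div>'

print(extract_price(OLD_LAYOUT))
try:
    extract_price(NEW_LAYOUT)
except LayoutChangedError as error:
    print("scraper needs an update:", error)
```

A renamed class is invisible to a human visitor, yet it is enough to stop the scraper, which is why such errors should feed into monitoring rather than be swallowed.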
2) Too many edge cases
Want to scrape data on behalf of a user? Good. Why do scrapers fail on the user's birthday? Because a congratulation certificate is shown instead of the normal menu, and the scraper does not expect that (unless you have already encountered this case). What about menu differences between gold, silver, and platinum users? Does the dev team have test accounts for every possible case?
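One way to keep track of known edge cases is to save an HTML snapshot for each page variant you have seen and run the extractor against all of them. The sketch below assumes hypothetical snapshots and a made-up "menu-item" class; it only covers variants you already know about, which is exactly the limitation described above.

```python
import re

# Hypothetical markup: menu entries are links with class "menu-item".
MENU_ITEM = re.compile(r'<a class="menu-item"[^>]*>([^<]+)</a>')


def extract_menu(html):
    items = [text.strip() for text in MENU_ITEM.findall(html)]
    if not items:
        raise ValueError("no menu items found")
    return items


# Saved snapshots, one per known page variant (invented examples).
FIXTURES = {
    "silver": '<nav><a class="menu-item" href="/orders">Orders</a></nav>',
    "gold": (
        '<nav><a class="menu-item" href="/orders">Orders</a>'
        '<a class="menu-item" href="/lounge">Lounge</a></nav>'
    ),
    "birthday": '<div class="certificate">Happy birthday!</div>',
}


def find_unhandled_variants(fixtures, extractor):
    """Returns the names of the variants the extractor cannot handle."""
    failing = []
    for name, html in sorted(fixtures.items()):
        try:
            extractor(html)
        except ValueError:
            failing.append(name)
    return failing


print(find_unhandled_variants(FIXTURES, extract_menu))
```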
3) Sudden termination
Suddenly, the company that owns the website takes a dislike to your bot and starts sending it to a CAPTCHA. While some of these challenges can be bypassed, the effort and failure rate may not be worth it.
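At the very least, a scraper should notice that it has been blocked rather than treat the challenge page as data. The check below is a rough heuristic sketch: the status codes and text markers are assumptions, real sites signal blocking in many different ways, and a page that merely mentions the word "captcha" would be a false positive.

```python
# Assumed markers; adjust them to what the target site actually returns.
BLOCK_STATUS_CODES = (403, 429)
BLOCK_MARKERS = ("captcha", "verify you are human")


def looks_blocked(status_code, html):
    """Heuristically decides whether a response is a block or challenge page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def handle_response(status_code, html):
    if looks_blocked(status_code, html):
        # Stop and alert instead of retrying in a tight loop.
        return "blocked"
    return "ok"


print(handle_response(200, "<h1>Your orders</h1>"))
print(handle_response(200, '<div class="g-recaptcha"></div>'))
print(handle_response(429, ""))
```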
4) Big compute cost and long execution time
Classic server-rendered apps are slowly dying out. This is the era of SPAs (single-page applications), where executing client-side JavaScript is pivotal for obtaining the data. That means running headless Chrome or Firefox via Puppeteer or a Selenium driver; a good old plain-HTML bot will not suffice anymore. The computational cost of running multiple Chrome scrapers is immense - you will need a big and expensive server (or several). Also, because pages are slow (they fetch data before rendering), expect every interaction to take time. Minutes of time. Dozens of minutes if you need to scrape paginated data.
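A back-of-the-envelope estimate helps when explaining this cost to the business. The sketch below uses made-up numbers (5 seconds per page, 500 MB of memory per headless browser); measure your own target site before relying on any of them.

```python
import math


def estimate_scrape_minutes(pages, seconds_per_page, parallel_browsers=1):
    """Estimates wall-clock minutes to scrape `pages` paginated pages."""
    # Each browser handles its share of pages one after another.
    rounds = math.ceil(pages / parallel_browsers)
    return rounds * seconds_per_page / 60


def estimate_memory_mb(parallel_browsers, mb_per_browser=500):
    """Estimates memory use; 500 MB per browser is an assumed figure."""
    return parallel_browsers * mb_per_browser


# Assumed workload: 200 result pages, 5 seconds each to load and render.
print(round(estimate_scrape_minutes(200, 5.0, parallel_browsers=1), 1), "minutes")
print(round(estimate_scrape_minutes(200, 5.0, parallel_browsers=4), 1), "minutes")
print(estimate_memory_mb(4), "MB")
```

Under these assumptions, 200 pages on a single browser take roughly 17 minutes. Running four browsers in parallel cuts that to about 4 minutes, but quadruples the memory needed and makes the bot more noticeable to the website.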
Conclusion
Where web scraping is used, an API is missing. Treat web scraping as a last resort. But if you do go down that path, remember: creating a scraper for a website is only a small fraction of the total (unpredictable) maintenance effort.