DataTalk #2: JS rendering and web scraping with Ian Williams

I know, I know. We took our sweet time with the second DataTalk, sorry. As the industry changes rapidly, we've been busy improving the performance and developer experience of ScraperAPI's tools.

However, we're ready to start working on our newsletter again, and hopefully you'll find insights here that can help your team access data faster, more easily, and more consistently.

For today’s entry, we invited Ian Williams, head of engineering at ScraperAPI and seasoned tech entrepreneur, to talk about how JS-based websites present a unique challenge to web scraping and how to overcome it.


LEO: Thanks a lot for the time, Ian! To get started, why don’t you tell us more about you and your background?

IAN: Thank you for inviting me. It’s a real pleasure to be here. My name is Ian Williams, and I’m the head of engineering at ScraperAPI.

I've been working in software engineering for more years than I care to count, from one of the big six consulting firms to my own small start-ups, across a variety of industries, from finance to psychometrics.

My real love is the deep complexity of high-volume systems, which is what I get to play with at ScraperAPI. As head of engineering, I'm responsible for keeping our customers happy by keeping our success rates high, and for making sure we stay at the forefront of scraping.

LEO: I think it’s fair to say you’ve seen many trends come and go, and that’s great because – and to get into today’s topic – I’d like to know if you've seen a trend of more pages using JS to inject content.

IAN: Definitely! Over the years I've been with ScraperAPI, we've seen more and more sites and pages using JS to inject content. This mirrors the technologies used to build websites: as those become more complex, so do the sites built with them, and we see more and more JavaScript injection of data onto those pages.

Of course, it also means our tools and strategies have to adapt to the rising popularity of scraping dynamic content, which often needs to be rendered before any valuable insights can be extracted.

LEO: Sorry to interrupt, and maybe you were heading exactly in this direction, but can you tell us more about how JS rendering can be a challenge for web scraping?

IAN: Great question!

Historically, websites were fully "rendered" on the server side, which means the browser would request a web page from the server, and that page would contain all the data displayed on it, in the format it would be displayed in.

As websites and web programming languages have become more complex, we find that sites often respond with a skeleton of the page and use JavaScript to load data from the server, then render that data on the page.

This can be a real challenge for scrapers: without a browser to request and render the data, a single scrape simply doesn't get you what you need.
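To make this concrete, here's a rough, standard-library-only heuristic for spotting such skeleton pages. The tag counting and the 200-character threshold are illustrative assumptions of mine, not a ScraperAPI technique:

```python
from html.parser import HTMLParser

class TextVsScript(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.script_tags += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.text_chars += len(data.strip())

def looks_like_js_skeleton(html, min_text=200):
    """Heuristic: plenty of <script> tags but almost no visible text
    usually means the real content is injected client-side."""
    parser = TextVsScript()
    parser.feed(html)
    return parser.script_tags > 0 and parser.text_chars < min_text

# A typical SPA shell: one root div plus a script bundle, no content yet.
shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(looks_like_js_skeleton(shell))  # True
```

A check like this is handy when triaging a large list of target URLs: pages flagged as skeletons need rendering, the rest can go through a cheap plain-HTML pipeline.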

Add to that complex cookies, authorisation, and bot blockers, and getting data from some sites can be next to impossible.

LEO: So, if that’s the case, what strategies can we use to overcome this challenge?

IAN: There are a number of strategies that need to be brought together to overcome these challenges.

At the simplest level, using rotating proxies can help to avoid bot request detection, but that's only the starting point.
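As a sketch of that starting point (the proxy addresses below are placeholders, and a production rotator would also handle failures, retries, and User-Agent rotation), round-robin proxy rotation needs nothing beyond the standard library:

```python
import itertools
import urllib.request

# Placeholder proxy pool -- substitute real proxy endpoints.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def next_opener():
    """Build a urllib opener that routes the next request through
    the next proxy in the pool (simple round-robin rotation)."""
    proxy = next(proxy_pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

def fetch(url):
    """Fetch a URL, each call going out through a different proxy,
    which blunts simple per-IP rate limiting."""
    proxy, opener = next_opener()
    with opener.open(url, timeout=10) as resp:
        return resp.read()

# Rotation order: the fourth call cycles back to the first proxy.
print([next_opener()[0] for _ in range(4)])
```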

You also have to layer on bot-bypass techniques like TLS and browser-signature management and blocker-specific bypasses, and, at the final level, use an actual browser to render the page and retrieve the data you need (a common way to do this is using Selenium to control a headless browser) before extracting the page's HTML and transforming it into something more useful.
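The Selenium approach Ian mentions can be sketched like this. It assumes `pip install selenium` and a local Chrome/chromedriver install, and the extraction step is deliberately minimal:

```python
import re

def render_page(url):
    """Render a JS-heavy page in headless Chrome and return its final HTML.
    Requires the selenium package and a matching chromedriver."""
    from selenium import webdriver  # imported here: optional dependency
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # waits for the initial page load
        return driver.page_source  # HTML *after* JavaScript has run
    finally:
        driver.quit()

def extract_title(html):
    """Transform the rendered HTML into something more useful --
    here, just the <title> text."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else None
```

In practice you'd replace `extract_title` with a proper parser and add explicit waits for the elements you care about, since "page loaded" and "data injected" are not the same moment on dynamic sites.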

LEO: On that last note, from what I've read, headless browsers can be more resource-intensive for web scraping, but I'm not sure why. Could you explain this?

IAN: Using a headless browser to render a page can definitely make scraping more resource-intensive, on multiple levels.

Scraping a single web page takes time to retrieve the HTML and the data used on the page, as well as time to render the page itself. Each individual scrape therefore takes longer to complete than a single, simple HTML request, which means it takes longer to get through your volume of scrapes.

If you're working in a high-volume environment, this additional time can be significant. There is also the added complexity of controlling interaction with the page: sometimes, to get the data you need, you may have to click a link or a button (Amazon offer data is a prime example).

LEO: You mentioned tools need to adapt to these challenges. Could you share with us how ScraperAPI is adapting to this trend?

IAN: We are in a cycle of continual improvement to our headless-browser/JavaScript-rendering offering. Just this year, we rolled out a major new release of our rendering systems, improving success rates and significantly reducing rendering time.

To be more specific, ScraperAPI can help you handle dynamic content in two main ways:

  • First, you can outsource the page rendering to us by adding a simple render=true parameter to your request. This will tell our API that the page needs to be rendered before returning the HTML.
  • In one of our most recent updates, we also launched a render instruction set feature, letting you provide specific instructions to the API to interact with the site. So, instead of using a headless browser, you can just tell ScraperAPI to click on a button, submit a form, or scroll the page in a particular way before returning the HTML of the site.

These two features combined are designed to make scraping dynamic content easier and faster.
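Putting the first option into code, a minimal sketch might look like this. `YOUR_API_KEY` is a placeholder, and the render instruction set has its own request format that I won't reproduce here; check ScraperAPI's documentation for both:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scraperapi.com/"

def scraperapi_url(api_key, target_url, render=False):
    """Build a ScraperAPI request URL. With render=true, ScraperAPI
    renders the page in a browser before returning the HTML."""
    params = {"api_key": api_key, "url": target_url}
    if render:
        params["render"] = "true"
    return API_ENDPOINT + "?" + urlencode(params)

# Fetching this URL returns the target page's HTML after rendering.
url = scraperapi_url("YOUR_API_KEY", "https://example.com/product", render=True)
print(url)
```

The point of the URL-builder shape is that rendering becomes a one-flag decision per request, so you can render only the pages that actually need it.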

We are also continually working on bot-blocker prevention, integrated with rendering, to keep ahead of bot-blocking trends and keep our success rates high.

LEO: Before we call it today, are there any other tips you could give dev teams working with dynamic content?

IAN: When I am looking to scrape a new site and run up against JavaScript injection, I always try to get familiar with the requests and responses the site's pages make.

Chrome DevTools is invaluable here. Inspecting the page at the network-request level and becoming familiar with the way it operates goes a long way toward understanding how best to scrape it.

For example, you could find the endpoint the page pulls its data from. Then, by mimicking that request, you can get the response with the data you're looking for. If I recall correctly, you used a similar approach to scrape LinkedIn in one of our tutorials.
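This mimic-the-request approach can be sketched as follows. The endpoint and header values are hypothetical stand-ins for whatever you actually see in DevTools' Network tab:

```python
import json
import urllib.request

def build_api_request(endpoint, referer):
    """Recreate the request the page's own JavaScript makes.
    Copy the headers you see in the Network tab so the server
    treats you like the browser that normally calls it."""
    req = urllib.request.Request(endpoint)
    req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")
    req.add_header("Accept", "application/json")
    req.add_header("Referer", referer)
    return req

def fetch_json(endpoint, referer):
    """Call the data endpoint directly -- no rendering needed,
    because the response is already structured JSON."""
    req = build_api_request(endpoint, referer)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Hypothetical endpoint spotted in the Network tab:
req = build_api_request(
    "https://example.com/api/products?page=1",
    referer="https://example.com/products",
)
```

When it works, this is the cheapest strategy of all: you skip rendering entirely and get machine-readable data straight from the source.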

Of course, you can always just use ScraperAPI to render the page for you and get the data you need.

LEO: Awesome, thanks so much for your time and great answers, Ian!

IAN: Real pleasure!


We hope you enjoyed this DataTalk interview! We have many more exciting conversations lined up for 2024, so stay tuned for more ^^

Want to learn more about dynamic content scraping? Check out our latest tutorials and guides.


Like what you see? Keep subscribing for the latest insights and tips.

Until next time, happy scraping!

Your ScraperAPI Team!

