DataTalk #2: JS rendering and web scraping with Ian Williams

I know, I know. We took our sweet time with the second DataTalk, sorry. As the industry changes rapidly, we've been busy improving the performance and developer experience of ScraperAPI's tools.

However, we're ready to start working on our newsletter again, and hopefully you'll find insights here that can help your team access data faster, more easily, and more consistently.

For today’s entry, we invited Ian Williams, head of engineering at ScraperAPI and seasoned tech entrepreneur, to talk about how JS-based websites present a unique challenge to web scraping and how to overcome it.


LEO: Thanks a lot for the time, Ian! To get started, why don’t you tell us more about you and your background?

IAN: Thank you for inviting me. It’s a real pleasure to be here. My name is Ian Williams, and I’m the head of engineering at ScraperAPI.

I've been working in software engineering for more years than I care to count, from one of the big six consulting firms to my own small start-ups, across a variety of industries, from finance to psychometrics.

My real love is the deep complexity of high-volume systems, which is what I get to play with at ScraperAPI. As head of engineering, I'm responsible for keeping our customers happy by keeping our success rates high, and for making sure we stay at the forefront of scraping.

LEO: I think it’s fair to say you’ve seen many trends come and go, and that’s great because – and to get into today’s topic – I’d like to know if you've seen a trend of more pages using JS to inject content.

IAN: Definitely! Over the years I've been with ScraperAPI, we've seen more and more sites and pages using JS to inject content. This mirrors the technologies used to build websites: as those become more complex, so do the sites built with them, and we see more and more JavaScript injection of data onto those pages.

Of course, it also means our tools and strategies have to adapt to the rising popularity of scraping dynamic content, which often needs to be rendered before any valuable insights can be extracted.

LEO: Sorry to interrupt, and maybe you were heading exactly in this direction, but can you tell us more about how JS rendering can be a challenge for web scraping?

IAN: Great question!

Historically, websites were fully "rendered" on the server side, which means the browser would request a web page from the server, and that page would contain all the data displayed on it, in the format it would be displayed in.

As websites and web programming languages have become more complex, we find that sites often respond with a skeleton of the page and use JavaScript to load data from the server, then render that data on the page.

This can be a real challenge for scrapers: without a browser to request and render the data, a single scrape simply doesn't get you what you need.
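To make this concrete, here's a rough, standard-library-only heuristic for spotting such skeleton pages. The tag counting and the 200-character threshold are illustrative assumptions of mine, not a ScraperAPI technique:

```python
from html.parser import HTMLParser

class TextVsScript(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.script_tags += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.text_chars += len(data.strip())

def looks_like_js_skeleton(html, min_text=200):
    """Heuristic: plenty of <script> tags but almost no visible text
    usually means the real content is injected client-side."""
    parser = TextVsScript()
    parser.feed(html)
    return parser.script_tags > 0 and parser.text_chars < min_text

# A typical SPA shell: one root div plus a script bundle, no content yet.
shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(looks_like_js_skeleton(shell))  # True
```

A check like this is handy when triaging a large list of target URLs: pages flagged as skeletons need rendering, the rest can go through a cheap plain-HTML pipeline.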

Add to that complex cookies, authorisation, and bot blockers, and getting data from some sites can be next to impossible.

LEO: So, if that’s the case, what strategies can we use to overcome this challenge?

IAN: There are a number of strategies that need to be brought together to overcome these challenges.

At the simplest level, using rotating proxies can help to avoid bot request detection, but that's only the starting point.
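As a sketch of that starting point (the proxy addresses below are placeholders, and a production rotator would also handle failures, retries, and User-Agent rotation), round-robin proxy rotation needs nothing beyond the standard library:

```python
import itertools
import urllib.request

# Placeholder proxy pool -- substitute real proxy endpoints.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def next_opener():
    """Build a urllib opener that routes the next request through
    the next proxy in the pool (simple round-robin rotation)."""
    proxy = next(proxy_pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

def fetch(url):
    """Fetch a URL, each call going out through a different proxy,
    which blunts simple per-IP rate limiting."""
    proxy, opener = next_opener()
    with opener.open(url, timeout=10) as resp:
        return resp.read()

# Rotation order: the fourth call cycles back to the first proxy.
print([next_opener()[0] for _ in range(4)])
```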

You also have to layer on bot-bypass techniques like TLS and browser-signature management and blocker-specific bypasses, and, at the final level, use an actual browser to render the page and retrieve the data you need (a common way to do this is using Selenium to control a headless browser) before extracting the page's HTML and transforming it into something more useful.
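The Selenium approach Ian mentions can be sketched like this. It assumes `pip install selenium` and a local Chrome/chromedriver install, and the extraction step is deliberately minimal:

```python
import re

def render_page(url):
    """Render a JS-heavy page in headless Chrome and return its final HTML.
    Requires the selenium package and a matching chromedriver."""
    from selenium import webdriver  # imported here: optional dependency
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # waits for the initial page load
        return driver.page_source  # HTML *after* JavaScript has run
    finally:
        driver.quit()

def extract_title(html):
    """Transform the rendered HTML into something more useful --
    here, just the <title> text."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else None
```

In practice you'd replace `extract_title` with a proper parser and add explicit waits for the elements you care about, since "page loaded" and "data injected" are not the same moment on dynamic sites.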

LEO: On that last note, from what I've read, headless browsers can be more resource-intensive for web scraping, but I'm not sure why. Could you explain this?

IAN: Using a headless browser to render a page can definitely make scraping more resource-intensive, on multiple levels.

Scraping a single web page takes time to retrieve the HTML and the data used on the page, as well as time to render the page itself. Each individual scrape therefore takes longer to complete than a single, simple HTML request, which means it takes longer to get through your volume of scrapes.

If you're working in a high-volume environment, this additional time can be significant. There is also the added complexity of controlling interaction with the page: sometimes, to get the data you need, you may have to click a link or a button (Amazon offer data is a prime example).

LEO: You mentioned tools need to adapt to these challenges. Could you share with us how ScraperAPI is adapting to this trend?

IAN: We are in a cycle of continual improvement to our headless-browser/JavaScript-rendering offering. Just this year, we rolled out a major new release of our rendering systems, improving success rates and significantly reducing rendering time.

To be more specific, ScraperAPI can help you handle dynamic content in two main ways:

  • First, you can outsource the page rendering to us by adding a simple render=true parameter to your request. This will tell our API that the page needs to be rendered before returning the HTML.
  • In one of our most recent updates, we also launched a render instruction set feature, letting you provide specific instructions to the API to interact with the site. So, instead of using a headless browser, you can just tell ScraperAPI to click on a button, submit a form, or scroll the page in a particular way before returning the HTML of the site.

These two features combined are designed to make scraping dynamic content easier and faster.
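Putting the first option into code, a minimal sketch might look like this. `YOUR_API_KEY` is a placeholder, and the render instruction set has its own request format that I won't reproduce here; check ScraperAPI's documentation for both:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scraperapi.com/"

def scraperapi_url(api_key, target_url, render=False):
    """Build a ScraperAPI request URL. With render=true, ScraperAPI
    renders the page in a browser before returning the HTML."""
    params = {"api_key": api_key, "url": target_url}
    if render:
        params["render"] = "true"
    return API_ENDPOINT + "?" + urlencode(params)

# Fetching this URL returns the target page's HTML after rendering.
url = scraperapi_url("YOUR_API_KEY", "https://example.com/product", render=True)
print(url)
```

The point of the URL-builder shape is that rendering becomes a one-flag decision per request, so you can render only the pages that actually need it.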

We are also continually working on bot-blocker prevention, integrated with rendering, to keep ahead of bot-blocking trends and keep our success rates high.

LEO: Before we call it today, are there any other tips you could give dev teams working with dynamic content?

IAN: When I am looking to scrape a new site and run up against JavaScript injection, I always try to get familiar with the requests and responses the site's pages make.

Chrome DevTools is invaluable here. Inspecting the page at the network-request level and becoming familiar with the way it operates goes a long way toward understanding how best to scrape it.

For example, you could find the endpoint the page pulls its data from. Then, by mimicking that request, you can get the response with the data you're looking for. If I recall correctly, you used a similar approach to scrape LinkedIn in one of our tutorials.
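This mimic-the-request approach can be sketched as follows. The endpoint and header values are hypothetical stand-ins for whatever you actually see in DevTools' Network tab:

```python
import json
import urllib.request

def build_api_request(endpoint, referer):
    """Recreate the request the page's own JavaScript makes.
    Copy the headers you see in the Network tab so the server
    treats you like the browser that normally calls it."""
    req = urllib.request.Request(endpoint)
    req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")
    req.add_header("Accept", "application/json")
    req.add_header("Referer", referer)
    return req

def fetch_json(endpoint, referer):
    """Call the data endpoint directly -- no rendering needed,
    because the response is already structured JSON."""
    req = build_api_request(endpoint, referer)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# Hypothetical endpoint spotted in the Network tab:
req = build_api_request(
    "https://example.com/api/products?page=1",
    referer="https://example.com/products",
)
```

When it works, this is the cheapest strategy of all: you skip rendering entirely and get machine-readable data straight from the source.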

Of course, you can always just use ScraperAPI to render the page for you and get the data you need.

LEO: Awesome, thanks so much for your time and great answers, Ian!

IAN: Real pleasure!


We hope you enjoyed this DataTalk interview! We have many more exciting conversations lined up for 2024, so stay tuned for more ^^

Want to learn more about dynamic content scraping? Check out our latest tutorials and guides.


Like what you see? Keep subscribing for the latest insights and tips.

Until next time, happy scraping!

Your ScraperAPI Team!

