
Bypass Cloudflare with these web scraping tools

Pierluigi Vinciguerra

Co Founder and CTO at Databoutique.com | Writing on The Web Scraping Club

Published: February 14, 2023

In this article from The Web Scraping Club, we look at the Python tools we can use to get past Cloudflare protection when scraping a website.

What is Cloudflare

Cloudflare is one of the best-known anti-bot solutions, and bypassing it can be a difficult challenge for any web scraping project.

[Image: Cloudflare "I'm not a robot" check]

In past articles I've written several times about bypassing Cloudflare with different approaches, and this post summarizes them.

Undetected Chromedriver

We saw in the Anti-Detect Anti-Bot matrix post that a good solution against Cloudflare can be the Undetected Chromedriver Python package.

Basically, it is a version of Chromedriver modified specifically for use in web scraping projects.

Combined with Selenium, it lets you automate most Chromium-based browsers: not only Chrome, but also Brave and GoLogin.

The setup for this solution comes in two easy steps.

First, set up the connection to the chromedriver. After importing the package, a few lines are enough to load a page and, if needed, take a screenshot, as in this example.

[Image: Undetected Chromedriver test for Cloudflare]
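Since the original example survives only as an image, here is a minimal sketch of what this first step might look like. The function name `screenshot_with_uc` and the default output path are assumptions made for illustration; `uc.Chrome()`, `get`, `save_screenshot` and `quit` are the package's and Selenium's regular calls.

```python
def screenshot_with_uc(url: str, out_path: str = "screenshot.png") -> str:
    """Load `url` in a patched Chrome, save a screenshot, return the page title.

    Hypothetical helper, not the author's original code.
    """
    # Imported lazily so this module can be imported even when the
    # package is missing (pip install undetected-chromedriver).
    import undetected_chromedriver as uc

    driver = uc.Chrome()  # starts Chrome through the patched chromedriver
    try:
        driver.get(url)
        driver.save_screenshot(out_path)
        return driver.title
    finally:
        # Always release the browser, even if loading the page fails.
        driver.quit()
```

Calling `screenshot_with_uc("https://example.com")` should open a visible Chrome window, write `screenshot.png` to the working directory and return the page title. Whether a given Cloudflare configuration lets the request through still depends on the target website.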

Second, using Selenium, we can extract data from the page's nodes with its usual syntax, as described in the documentation.
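As a rough illustration of this second step, the sketch below collects the text of every node matching a CSS selector. The helper name `extract_texts` is an assumption; the locator string `"css selector"` is the value behind Selenium's `By.CSS_SELECTOR` constant, written out here so the snippet has no import of its own.

```python
# Same value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def extract_texts(driver, css_selector: str) -> list[str]:
    """Return the stripped text of every element matching `css_selector`.

    `driver` is any Selenium-compatible driver, for example the one
    created by undetected_chromedriver. Hypothetical helper.
    """
    elements = driver.find_elements(CSS_SELECTOR, css_selector)
    return [element.text.strip() for element in elements]
```

With a live driver, `extract_texts(driver, "h1")` would return the visible text of all `h1` nodes on the current page. XPath works the same way with the `"xpath"` locator string.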


Playwright

On this blog we often write about Playwright because of its flexibility and ease of use. For example, in one of the latest "The Lab" posts we saw how to use it together with GoLogin, but that is not the only possible setup. It natively supports all modern rendering engines, including Chromium, WebKit, and Firefox, so we can have a lot of fun testing which solution best fits our needs.

Over the last 4-5 months I've noticed that the best way to bypass most Cloudflare-protected websites is to use Firefox together with Playwright, as also stated in our Anti-Detect Anti-Bot matrix.

Even in this case, the setup is quite easy.

After importing the package, we set up a Playwright session (sync or async, depending on your needs for multi-threading and parallelism) and then we load the page in the browser we prefer.
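The setup described above could be sketched as follows with Playwright's sync API. The function name `fetch_html_with_firefox` and the 200 ms default are assumptions chosen for illustration, not values from the original post.

```python
def fetch_html_with_firefox(url: str, slow_mo_ms: int = 200) -> str:
    """Open `url` in a headed Firefox and return the rendered HTML.

    Hypothetical helper. Requires `pip install playwright` followed by
    `playwright install firefox`.
    """
    # Imported lazily so the module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # slow_mo delays every Playwright operation by this many milliseconds.
        browser = p.firefox.launch(headless=False, slow_mo=slow_mo_ms)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()
```

Swapping `p.firefox` for `p.chromium` or `p.webkit` is enough to try another engine, which makes it easy to compare how each one fares against a given website.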

The possible options and setups are countless, but here are a few I've found useful.

slow_mo: I add this option when creating the Firefox instance. It slows down the execution of each command by the given number of milliseconds. I've had scrapers that worked when launched from my laptop but failed from VMs in data centers. Changing only this option made them work, probably because the servers' connection or hardware was much faster than an average desktop machine and the anti-bot treated the execution speed as a red flag.
launch_persistent_context: with this method, Playwright creates a directory (if it does not already exist) where it stores all the files a browser usually keeps while running (cookies, history files, and so on). This is useful when you need to show the anti-bot that you're using a profile with some history rather than a brand-new one. You can use it as follows:

browser = p.chromium.launch_persistent_context(user_dir, headless=False)
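For context, a fuller sketch of the same idea follows. Note that `launch_persistent_context` returns a browser context rather than a browser object, so the variable is named `context` here. The helper names and the `profiles` base directory are assumptions made for this example.

```python
from pathlib import Path


def profile_dir(name: str, base: Path = Path("profiles")) -> Path:
    """Return the directory for a named browser profile, creating it if needed."""
    path = base / name
    path.mkdir(parents=True, exist_ok=True)
    return path


def open_with_profile(url: str, name: str = "default") -> str:
    """Load `url` with a reusable profile and return the page title.

    Hypothetical helper. Requires Playwright and its Firefox build.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Cookies, history and cache are stored in the profile directory,
        # so they are still there the next time the same name is used.
        context = p.firefox.launch_persistent_context(
            str(profile_dir(name)), headless=False, slow_mo=200
        )
        try:
            # A persistent context usually starts with one page already open.
            page = context.pages[0] if context.pages else context.new_page()
            page.goto(url)
            return page.title()
        finally:
            context.close()
```

Running `open_with_profile` twice with the same `name` reuses the stored cookies and history, which is the "profile with some history" effect described above.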

Conclusions

Yes, Cloudflare can be a pain to bypass, especially on websites where the bot-detection rules are strict, but such configurations usually hurt real users as well.

We have the means to bypass this anti-bot solution, but what works varies from case to case and there's no silver bullet. And if there were, it would work only until the next release. It's always a cat-and-mouse game, and techniques need constant updating. The only permanent rule is to treat target websites respectfully and ethically, so as not to cause harm or malfunctions.

If you liked this post and want to receive every article from the Club in your email, consider subscribing for free at this link: https://substack.thewebscraping.club/

Oleg Ruscinski

Internet Technology Specialist

8 months ago

The only working solution nowadays: https://rapidapi.com/ruspacenet-6x8jdbfLxk8/api/cloudflaresolver/

Bob Miles

Founder & CEO | Salad Technologies

1 year ago

Pierluigi Vinciguerra I recently signed up to the web scraping club and appreciate your work! At Salad we've been looking into this use case for our network: we can distribute 10 to 10,000 containerised workloads and compute at the edge with unique residential IPs. We're seeing some customers run headless browsers, and I'd be very curious to get your thoughts on use cases for our infrastructure.

Gologin

1 year ago

Cheers, keep up the good work!

Eugene Stepnov

Head of Marketing at GoLogin

1 year ago

Thanks for mentioning GoLogin!

Giuseppe Barletta

Software Engineer

1 year ago

It's common for protected websites to set up Cloudflare without changing the origin's IP address, which is very likely still visible in older DNS records. That would not be a problem in itself, but sometimes they don't deny connections from outside Cloudflare's IP ranges, which leaves a door open for a very efficient way to bypass Cloudflare's protections and caching.
