THE LAB #11: The Anti-Detect Anti-Bot matrix

Excerpt

An incomplete but still useful list of interesting resources on web scraping. Testing the most well-known Python web scraping tools against Cloudflare, Kasada, PerimeterX, Datadome and Shape.


This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free, and they are a way to give some value back to the readers.

[Oxylabs](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4efae9bc-f330-46d6-98bd-8ce3b96804d6_960x368.png)

For all The Web Scraping Club readers, the discount code WSC25 saves 25% on residential proxy purchases.

You can find all our partners and the discounts they apply to our readers on our Discord Server.

Join our Discord

The web scraping landscape

[Captcha](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217e5fc5-b985-41c8-808b-899310bede7b_480x268.gif)

In these pages we have said over and over that web scraping is getting more complex. Anti-bot software requires more advanced solutions, leading to higher computing and memory costs. There’s also a less visible cost, which is the complexity of the web scraping infrastructure: since there’s no silver bullet or magic solution that fits every case, the modern web scraper needs a full array of tools in their belt to tackle different cases.

In this episode of The Lab we’ll see some of the tools I use daily to tackle the most common anti-bot solutions and, through a quick test, how they behave against them.

The chosen tools

[Army of bots](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b149127-5f37-463c-8ad8-1ff1fd20cc5c_480x201.gif)

As a Python developer, my potential toolset for driving headful browsers is basically restricted to:

  • Selenium

  • Undetected Chromedriver

  • Playwright (in different sauces)

  • Pyppeteer + stealth

and that’s the reason why you won’t find Puppeteer or Cheerio in the following tests.

Selenium

That said, you won’t find Selenium in the tests either: in my opinion, it has lost some appeal in recent years, specifically since Playwright was released. It relies on standard webdrivers, which are not meant for web scraping and are easily detected by anti-bot software.

Undetected Chromedriver

On the other hand, you can get a better result at a fraction of the complexity using the undetected_chromedriver Python package. You’re still using a webdriver, but one that has been modified and compiled specifically for use in web scraping projects.

Playwright

Playwright was released in 2020 and at the moment it’s my favorite tool because of its flexibility and ease of use. After the installation (via pip), you can start right away with 3 different browsers, in either headful or headless mode. And if you need more, you can install other clients like Chrome (instead of the bundled Chromium) or a compatible anti-detect browser like GoLogin, to get more options for your scrapers. Like Selenium, Playwright is not meant for web scraping. There was a plugin for customizing the bundled browsers, but it has not been updated for too long and is no longer effective. Still, with the right combination of browser and settings, it’s my first choice.
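To make the "3 different browsers, headful or headless" point concrete, here is a minimal sketch using Playwright’s sync API. It is not taken from the repository: the names `BROWSERS` and `load_title_playwright` are made up for this example, and it assumes the package was installed with `pip install playwright` and the browsers were downloaded with `playwright install`.

```python
BROWSERS = ("chromium", "firefox", "webkit")  # engines bundled with Playwright


def load_title_playwright(url: str, browser_name: str = "firefox",
                          headless: bool = False) -> str:
    """Open `url` in the chosen bundled browser and return the page title."""
    if browser_name not in BROWSERS:
        raise ValueError(f"unknown browser: {browser_name!r}")

    # Imported here so the argument check above works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # p.chromium, p.firefox and p.webkit all expose the same launch() call.
        browser = getattr(p, browser_name).launch(headless=headless)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.title()
        finally:
            browser.close()
```

Calling `load_title_playwright("https://example.com", "webkit", headless=True)` would run the same page load in headless WebKit, so switching engine or mode is a one-argument change.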

Pyppeteer

It’s an unofficial Python port of Puppeteer, the original browser automation project from which Playwright took “inspiration”. I don’t see any real reason to prefer it to Playwright, but it’s another option worth mentioning. It has a stealth module but, at least in these tests, it didn’t work as expected.

And what’s your favorite tool for web scraping in Python?


The tested antibots

[Not a robot](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9278b668-f9b1-4e03-9635-f1afab047ee9_500x365.gif)

In this post, we’ll see how the tools mentioned before perform against the most well-known anti-bot solutions.

We will perform a generic page load test on 5 different websites, one per solution. It cannot be an exhaustive test, since every website can have a different setup and different rules for deciding whether to block suspicious traffic. On top of that, by loading only one page, we cannot test whether the behavior of a spider written with one of these tools would be flagged as a bot. And last but not least, some sections of a website (like the login pages) may be protected by stricter rules than the home page.
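The procedure described above can be sketched as a small harness that loads one page per anti-bot solution and records the outcome. This is only an illustration, not the code from the repository: the post does not name the tested websites, so the `TARGETS` URLs are placeholders, and `run_matrix` and the `loader` callable are hypothetical names.

```python
from typing import Callable, Dict

# Placeholder targets: one website per anti-bot solution.
# The real URLs are not given in the post, so these are assumptions.
TARGETS: Dict[str, str] = {
    "Cloudflare": "https://cloudflare-protected.example/",
    "PerimeterX": "https://perimeterx-protected.example/",
    "Datadome": "https://datadome-protected.example/",
    "Kasada": "https://kasada-protected.example/",
    "F5": "https://f5-protected.example/",
}


def run_matrix(loader: Callable[[str], bool],
               targets: Dict[str, str] = TARGETS) -> Dict[str, str]:
    """Load each target once and record PASS, FAIL, or ERROR.

    `loader` is any function that opens the URL with one tool and
    returns True when the real page (not a block page) was rendered.
    """
    results = {}
    for antibot, url in targets.items():
        try:
            results[antibot] = "PASS" if loader(url) else "FAIL"
        except Exception:
            # A crash or timeout is recorded separately from a block.
            results[antibot] = "ERROR"
    return results
```

Running `run_matrix` once per tool, each time with a different `loader`, gives one row of a tool-versus-anti-bot matrix. How a loader decides that the "real page" was rendered is left open here, since it depends on the target website.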

With those caveats in mind, our test can still be a good starting point for understanding which tool is the most convenient to start with.

Cloudflare

Cloudflare’s Bot Management solution uses a combination of signature-based detection and machine learning algorithms to identify and classify bot traffic. The solution also offers rate-limiting, CAPTCHA challenges, and JavaScript challenges to mitigate the impact of bots on a website’s security, performance, and user experience.

It’s one of the most widely used solutions and one of the hardest to bypass when configured strictly.

PerimeterX

PerimeterX uses real-time behavior analysis and machine learning to detect and block bots while allowing legitimate traffic to pass through.

Like Cloudflare Bot Management, it leans on machine learning, with the added ability to detect advanced bots that use techniques like IP hopping, browser fingerprinting, and headless browsing.

In our tests, it turned out to be the easiest to bypass, at least for the website we considered.

Datadome

Datadome is another anti-bot solution with all the features mentioned before. In our tests it was the hardest to bypass consistently: typically we could load the target page the first time with every tool, but a second attempt from the same IP would fail.
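Because the first load can succeed and the second one fail, a single attempt is not enough to judge a tool against Datadome. A minimal sketch of a repeated check follows; `repeat_load`, `is_consistent` and the `loader` callable are hypothetical names introduced for this example, not part of the original test scripts.

```python
import time
from typing import Callable, List


def repeat_load(loader: Callable[[str], bool], url: str,
                attempts: int = 2, pause_s: float = 0.0) -> List[bool]:
    """Call `loader(url)` several times from the same machine (same IP)."""
    outcomes = []
    for i in range(attempts):
        if i > 0 and pause_s > 0:
            time.sleep(pause_s)  # optional wait between attempts
        outcomes.append(bool(loader(url)))
    return outcomes


def is_consistent(outcomes: List[bool]) -> bool:
    """A tool only counts as passing if every attempt succeeded."""
    return len(outcomes) > 0 and all(outcomes)
```

With the behavior described above, the outcomes would look like `[True, False]`, which `is_consistent` reports as a failure.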

Kasada

I think it’s the youngest solution on the market among the ones tested here, and it’s the most recognizable. When you load a Kasada-protected website in your browser for the first time, you should notice a 429 error in the network tab of the developer tools. This is the “challenge” that Kasada sends to the browser and, if it is solved, you get redirected to the target website. This approach is called a zero-trust security policy.
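The same 429 can be observed from a script instead of the developer tools. The sketch below records the status of every response during a page load with Playwright’s `page.on("response", ...)` listener. The function names are made up for this example, and the choice of headful Firefox is arbitrary.

```python
from typing import List, Tuple


def saw_429_challenge(responses: List[Tuple[str, int]]) -> bool:
    """Return True if any recorded response had status 429."""
    return any(status == 429 for _url, status in responses)


def record_statuses(url: str) -> List[Tuple[str, int]]:
    """Load `url` in headful Firefox and record (url, status) pairs."""
    # Imported lazily so saw_429_challenge works without Playwright.
    from playwright.sync_api import sync_playwright

    seen: List[Tuple[str, int]] = []
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=False)
        page = browser.new_page()
        # Every response the page receives is appended to `seen`.
        page.on("response", lambda r: seen.append((r.url, r.status)))
        page.goto(url)
        browser.close()
    return seen
```

Keep in mind that a 429 on its own does not prove a website runs Kasada, since ordinary rate limiting returns the same status code. It is only a hint worth checking.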

F5

In my filter bubble (fashion e-commerce) I don’t see many F5-protected websites, but when configured strictly it’s not that easy to bypass. It seems to rely heavily on AI to detect strange user behavior, and even loading a single page of our test website was not simple.

What’s the hardest anti-bot to bypass for you?


The test results

[The Anti Detect Anti Bot Matrix](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff178b49a-6646-43f6-abe4-b09e3341f844_1178x225.png)

The image above says almost everything, but I want to add my two cents before going into details.

I have the feeling that the anti-bot industry has focused mostly on tackling the Chromium webdriver, probably because it is one of the most used and the one that leaks the most information to the target server. Pyppeteer and Playwright with Chrome are the worst performers, while Undetected Chromedriver gets much better results thanks to its customization.

Playwright with a headful Firefox, instead, probably looks more genuine to the target server or, at least, is harder to distinguish from genuine traffic.

Before having a brief look at the code, remember that you can find the full code in our GitHub repository for paying users. If you are a paying subscriber and don’t have access to the repository, feel free to write to me at pier@thewebscraping.club to get access (unfortunately there’s no automation for it).

The Lab repository code

Undetected Chromedriver

[Undetected Chromedriver tests](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af22201-077f-4e5a-a653-a38d69c45ebf_454x425.png)

The script I used is quite straightforward, without any customization or particular options, and it already got these good results.
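For readers who prefer text to screenshots, a plain page load with undetected_chromedriver looks roughly like the sketch below. It is not the exact script from the repository: `load_title_uc` is a made-up name, and the `make_driver` parameter exists only so the flow can be dry-run with a stand-in driver. The real run assumes `pip install undetected-chromedriver` and a local Chrome installation.

```python
def load_title_uc(url: str, make_driver=None) -> str:
    """Open `url` with undetected_chromedriver and return the page title."""
    if make_driver is None:
        # Third-party package, imported lazily: undetected-chromedriver
        import undetected_chromedriver as uc
        make_driver = uc.Chrome  # default options, no customization

    driver = make_driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always close the browser, even if the page load raised.
        driver.quit()
```

Since `uc.Chrome` follows the Selenium driver interface, the rest of a scraper can use the usual Selenium calls on the returned driver.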

[Undetected Chromedriver results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f599c80-d473-4ac8-b7ff-820dd45c555a_702x452.png)

Playwright + Chrome

[Playwright and Chrome tests](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae88cbb0-5761-47cd-afa8-c703b6f92cf1_2198x706.png)

In this configuration, I’ve opted for one of the most successful setups I used in production until a few months ago. We have a Chrome client (not a webdriver) with a persistent context, in order to store all the session data gathered during navigation, like a real browser.
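A setup of this kind can be sketched with Playwright’s `launch_persistent_context`, using `channel="chrome"` to pick the installed Google Chrome instead of the bundled Chromium. The helper names and the `./chrome-profile` directory are assumptions made for this example, and the script in the repository may use different options.

```python
from pathlib import Path


def chrome_context_options(profile_dir: str) -> dict:
    """Keyword arguments for a headful, persistent Chrome session."""
    return {
        # Cookies and other session data are stored in this folder.
        "user_data_dir": str(Path(profile_dir).expanduser()),
        "channel": "chrome",  # installed Google Chrome, not bundled Chromium
        "headless": False,
    }


def load_title_chrome(url: str, profile_dir: str = "./chrome-profile") -> str:
    """Open `url` in Chrome with a persistent context; return the page title."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(
            **chrome_context_options(profile_dir)
        )
        try:
            # A persistent context usually starts with one page already open.
            page = context.pages[0] if context.pages else context.new_page()
            page.goto(url)
            return page.title()
        finally:
            context.close()
```

Reusing the same `profile_dir` across runs keeps the session data between executions, which is the point of the persistent context.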

The awful results of the test confirmed what I saw a few months ago: the anti-bot industry seems focused on telling automated sessions apart from real Chrome browser traffic.

[Playwright and Chrome results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F919e948b-128b-405d-882e-3c8ded5e0064_648x452.png)

Playwright + Firefox

[Playwright and Firefox tests](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbda71b8-38f1-43eb-a3c6-e4c08b78483e_1118x666.png)

By contrast, a plain script using Firefox can pass almost every anti-bot solution, with mixed results on Datadome because of the behavior of the test scraper.

[Playwright and Firefox results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9aac7f9-3088-4a87-af6e-e86334a3af1f_654x468.png)

Playwright + GoLogin

[Playwright and GoLogin tests](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0817811-0fbd-415c-ad47-13c3b7b71e5b_1196x906.png)

I started using this setup last month because of a website protected by Cloudflare in a particularly strict way. You need to buy a GoLogin plan and install their browser client on the machine, but it worked like a charm and also performed quite well in these tests.
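The text does not spell out how the GoLogin browser is attached to Playwright, so the following is only a sketch under one assumption: that the locally installed GoLogin browser can be started like any other Chromium-based binary through Playwright’s `executable_path` launch option. The function names are hypothetical and the path passed in is a placeholder that depends on your installation. The repository script may well do this differently.

```python
def custom_browser_options(executable_path: str) -> dict:
    """Keyword arguments for launching a custom Chromium-based binary."""
    return {
        # Placeholder: wherever the anti-detect browser is installed.
        "executable_path": executable_path,
        "headless": False,
    }


def load_title_custom_browser(url: str, executable_path: str) -> str:
    """Open `url` in the given browser binary and return the page title."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(**custom_browser_options(executable_path))
        try:
            page = browser.new_page()
            page.goto(url)
            return page.title()
        finally:
            browser.close()
```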

[Playwright and GoLogin results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe686c402-9438-4906-8c26-d3b78aad6847_640x458.png)

Pyppeteer

[Pyppeteer tests](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa85aa3d-5783-48ba-bd6a-efe6c151b8ef_1104x694.png)

This out-of-the-box setup performed badly in these tests, but some fine-tuning would probably improve things.
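An out-of-the-box Pyppeteer run with the stealth module applied looks roughly like the sketch below. It assumes the `pyppeteer` and `pyppeteer-stealth` packages and is not the script from the repository. The `launch` and `stealth` parameters are there only so the flow can be dry-run with stand-ins, and the function names are made up for this example.

```python
import asyncio


async def load_title_pyppeteer(url: str, launch=None, stealth=None) -> str:
    """Open `url` with Pyppeteer plus the stealth patches; return the title."""
    if launch is None:
        from pyppeteer import launch            # pip install pyppeteer
    if stealth is None:
        from pyppeteer_stealth import stealth   # pip install pyppeteer-stealth

    browser = await launch(headless=False)
    try:
        page = await browser.newPage()
        await stealth(page)  # apply the evasions before navigating
        await page.goto(url)
        return await page.title()
    finally:
        await browser.close()


def run_pyppeteer(url: str) -> str:
    """Synchronous wrapper around the coroutine above."""
    return asyncio.run(load_title_pyppeteer(url))
```

Any fine-tuning would start from the arguments passed to `launch` and from which evasions the stealth module applies.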

[Pyppeteer results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17084e35-b9ca-4a56-bc59-5d59d1eba7dc_630x448.png)

That’s all for today. Remember that you can find the full code in our GitHub repository. If you are a paying subscriber and can’t access the repository, feel free to write to me at pier@thewebscraping.club.


The Lab - premium content with real-world cases

Our Discord server is the place where we can share our experiences interactively, have a chat, find deals from our partners, and much more. I’d be glad to see you all there.

Join our Discord Server