In the past articles of The Lab, and especially when we talked about bypassing Cloudflare, we have seen commercial solutions like anti-detect browsers for generating a consistent fingerprint able to bypass anti-bots solutions.
I didn’t find any OSS solution to achieve similar results (in Python) unless I encountered this repository, Browserforge, which is a reimplementation of Apify’s fingerprint suite in Python.
Of course, I needed to test it!
Anti-ban solutions: Buy versus Make
While until some years ago, a good setup for your Scrapy or Selenium scraper was good enough to overcome anti-bot solutions, I think you already know that today this is not true anymore.
We have companies pouring money and time into anti-bot solutions more and more hard to overcome and raising the bar for having a 100% success rate when scraping.
There’s a list of challenges today that a professional web scraper needs to face:
-
IP ban
-
Rate limiting on requests
-
Browser fingerprinting
-
WebRTC leak
-
TLS Fingerprinting
and for sure I’ve forgotten something else.
This means that building your scrapers to overcome anti-bots requires adding more and more tools:
-
Proxies (from residential to mobile)
-
Rotating IPs
-
Anti-detect browsers
-
Browser automation tools
In the Discord Server of The Web Scraping Club, we often talk about Make vs Buy and it’s an interesting debate that has a right and wrong answer.
My two cents on this topic are that when competing against anti-bot vendors, you’re against companies with R&D labs and many talented engineers working on detecting bots. Is this a race you want to participate in it?
If you have a passion for anti-bot evasion techniques, deobfuscation, and maybe you’re focused on a few websites, it could be worth it for you. I love to test new tools and find new creative solutions for these challenges, so I completely understand this. You learn a lot and grow as a scraping professional, by tackling challenges in first person.
On the other hand, from a business perspective, if you run a company based on web data or you’re a data vendor, you surely have deadlines to meet and these rarely give you the time to dig deep on reverse engineering anti-bots. Your mission is to deliver data (or services based on that), not to bypass CAPTCHAs. In this case, you would prefer a commercial solution that is easy to implement and reliable, which hopefully you could charge to your final customers.
My experience leads me to a different approach: Buy AND Make. To deliver data timely while preserving the balance sheet of the company by avoiding overspending on commercial solutions, I’ve adopted a mind map of challenges I could tackle with OSS tools and others with commercial ones.
It’s a matter of finding the balance between budget and time and this is variable for every company.
All this intro is just to say that one of the challenges I delegate to commercial tools is browser fingerprinting: whenever I encounter a website that heavily relies on browser fingerprinting to block my scrapers, I use commercial tools like unblockers or anti-detect browsers to overcome them.
Today, since I’ve discovered Browserforge, which can be used together with Playwright, I’m testing this OSS solution to check if I can get the same results as a commercial one like Kameleo against three well-known anti-bot solutions.
Before continuing, a quick reminder about ScrapeCon 2024 by Bright Data, is on the 2nd of April.
Expect hands-on sessions, live coding, expert Q&As, and insights from the industry’s top practitioners and tech leaders.
You cannot miss this conference if you want to know more about the world of web data collection and learn new skills and strategies to optimize your scraping operations.
[

Setting the baseline with Playwright
The test consists of loading a first check-up page of your browser fingerprint and then two pages from the Harrods.com website, protected by Cloudflare, the Footlocker website protected by Datadome, and the StockX website protected by PerimeterX. We’re running these tests from an AWS server and using residential proxies to avoid blocks because of the IP.
All the code of the tests can be found in The Lab GitHub repository, available for paying users, under folder 46.Fingerprint-Injection.
If you already subscribed but don’t have access to the repository, please write me at pier@thewebscraping.club since I need to add you manually.
First run: test with Chromium with default parameters
The first run we’ll do is using a basic Chromium Browser window to take the Browsernet test and check its results. In the following series of screenshots, we’ll see the issues detected.
[

[

First of all, a proxy is detected because of the WebRTC IP address leak. Every browser has embedded a WebRTC client and this technology allows real-time communication between browsers without requiring an intermediate server
WebRTC leaks take place when you’re trying to establish video or audio communication with another person via a browser that uses WebRTC technology, revealing your real IP even if you’re behind a proxy.
[

Another issue here is the timestamp obtained by querying the browser’s API and the one deducted by the timezone of the IP. If they’re different, it’s another sign you’re using a proxy.
[

Last but not least, we have the SwiftShader renderer. As mentioned in the documentation, it is a high-performance CPU-based implementation of the Vulkan 1.3 graphics API. This means that it’s used mainly by devices with no GPU available, like servers in a data center. This is one of the main red flags used for discriminating between bots and humans, as we’ve seen in previous articles of The Lab.
Second run: test with Chromium with Firefox on Windows fingerprint
Now we’ll inject a fingerprint using Browserforge and repeat the tests. I decided to use a Chromium browser but inject a fingerprint of Firefox on a Windows machine, let’s see what happens.
[

[

As I suspected, using a different browser for generating the fingerprint created some issues. The headers of the browser (Chromium) did not match the browser in the fingerprint (Firefox), and the test recognized it, while it correctly noted that we were faking a Windows 10 machine.
[

The timezone was wrong because the system time did not match the timezone of the IP.
[

A good news, instead was the detection of the Nvidia GeForce graphic card.
Third run: test with Firefox Browser with Firefox on Windows fingerprint
I’ve made some adjustments for the third test: since the AWS machine is in the Berlin data center, I’ve restricted the IPs only to Germany. I’ve also made a change on the system timezone of the machine, to set it correctly to the German time.
Using a Firefox browser with a Firefox Fingerprint injected, together with these adjustments, the test improved.
[

We still get the proxy detected because of WebRTC, but at least the IP timezone is correct.
[

The only thing is that we cannot select the version of Firefox to be fingerprinted, and this creates a discrepancy between the Firefox version used in Playwright and the one used for the fingerprint.
A comparison run with Kameleo
We obtained a good fingerprint spoofing but not perfect. But how far we’re from a commercial solution like Kameleo?
On the Kameleo profile, after disabling manually the WebRTC (which is something that can’t be done programmatically in traditional browsers) and setting the right timezone, here’s the result.
[

A much more coherent fingerprint with only a small issue of the version of the browser.
Fingerprint injection implementation and tests against anti-bots
Given the fact that with a commercial anti-detect browser we have a better and more coherent browser fingerprint, is our homemade solution enough to bypass anti-bots?
Let’s start by implementing it in our scraper.
Fingerprint injection implementation
From what we have seen on the official repository of Browserforge, the usage is quite straightforward.
def run_mocked_with_location(playwright):
CHROMIUM_ARGS= [
'--no-first-run',
'--disable-blink-features=AutomationControlled',
]
proxy={
"server": "http://OURGERMANIP",
"username": "USER",
"password": "PWD"
}
browser = playwright.firefox.launch(proxy=proxy,headless=False,slow_mo=200, args=CHROMIUM_ARGS,ignore_default_args=["--enable-automation"])
# Create a new context with the injected fingerprint
context = NewContext(browser,fingerprint_options ={'browser':'firefox', 'os':'windows'})
page = context.new_page()
After creating a browser with Playwright, we add a new context using the Browserforge NewContext function, which takes as arguments the browser instance and the options for the fingerprint to be created, in this case, a Windows OS and Firefox as a browser. Unluckily you cannot select the version of Firefox to match the one used by Playwright, but the repository is quite new and I hope this feature will be added soon.
Now let’s repeat the three tests we made on Browsernet on websites protected by anti-bots, so we can understand if this solution could work against them.
You can find all the code of the tests on the GitHub repository, directory 46.
Run 1: Cloudflare
We got blocked by Cloudflare turnstile on the first test, with the plain vanilla Playwright Chromium.
[

But as soon as we inject a fingerprint, even if not coherent with the main browser like in the second test, we can bypass the turnstile and start scraping our data.
We can assume that the WebGL renderer is the key parameter Cloudflare checks, in this case, for blocking bots.
Run 2: Datadome
With Datadome, we got blocked both in the first and in the second tests. While we could load the home page in the first case and then we got blocked in the product list page by a CAPTCHA, in the second case we got banned immediately. This means that all the incongruences in the timezone and on the browser headers are red flags.
In fact, on the third test, after we solved them, we could load both pages.
[

Run 3: PerimeterX
With PerimeterX we can see another different behaviour from the anti-bot solution.
[

The plain vanilla Playwright test number 1 is able to bypass the protection, while the second test fails. It means that in this case, the anti-bot protection does not care too much about the browser fingerprint but it’s more sensitive about the coherence of the headers with the browser.
In fact, as we run the third test, we can bypass it again.
Final remarks
The Browserforge repository was a great discovery: I’ve been looking for this kind of solution in Python some months ago with no luck, but finally, it’s here.
We’ve been able to bypass all three anti-bot solutions on these three test sites, but as we’ve seen, the camouflage applied is not as good as the one from commercial solutions like Kameleo. So, even if it worked on these websites, it could not work on others with more strict rules.
This is a great example of the Make AND Buy approach: you can try a DIY solution to bypass easier websites and for the hardest one you could delegate to technology providers and their products the fingerprinting.
As with everything in life, it’s a matter of balance.