THE LAB #19: How to mask the device fingerprint
Excerpt
Beating Cloudflare's fingerprinting is possible: here are the tools we use to bypass the challenge and scrape Cloudflare-protected websites.
We’ve already seen in previous posts what a device fingerprint is and how it is created: from the IP address and the information derived from it, the ciphers used by the HTTPS connection and, most importantly, the information exposed by your browser.
When we tackle modern anti-bot software we mostly need a headful browser: this makes our scraper seem more human-like but also exposes a great variety of data from the environment where the scraper is running, and sometimes this approach is counterproductive.
Device fingerprinting at work
This can happen, for example, when Cloudflare is configured at its highest level of paranoia.
A well-known example of this situation is the harrods.com website. In the GitHub repository of The Lab, reserved for our paying readers, you will find a file called tests_pl_chrome.py.
It’s basically a Playwright browser session with Chrome and some flags to make it more human-like. Within this browser session, we open several pages of Harrods’ website.
If we run this script from a local machine, it bypasses the Cloudflare challenge and we can browse the website. If the same script runs from a virtual machine in a datacenter, it won’t pass the test.
Harrods website protected by Cloudflare
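To compare runs from a local machine and from a datacenter VM, it helps to decide programmatically whether a page load ended on the challenge. The following is a minimal sketch, not the code in tests_pl_chrome.py: the function name is hypothetical, and the status codes and title markers are assumptions based on what Cloudflare challenge and block pages commonly return, so adjust them to what you actually observe.

```python
def looks_like_cloudflare_challenge(status: int, title: str, html: str) -> bool:
    """Heuristic check: does this response look like a Cloudflare challenge or block page?"""
    # Assumed markers: typical titles of the challenge and block pages.
    markers = ("just a moment", "attention required")
    haystack = f"{title}\n{html}".lower()
    # Assumption: these pages are usually served with a 403 or 503 status.
    blocked_status = status in (403, 503)
    return blocked_status and any(marker in haystack for marker in markers)


# Possible use inside a Playwright session (page.goto returns the main response):
#   response = page.goto('https://www.harrods.com/en-it/', timeout=0)
#   blocked = looks_like_cloudflare_challenge(response.status, page.title(), page.content())
```

Since this is only a heuristic, a real product page that happens to contain one of the markers could be misclassified, which is why the status code is checked as well.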
And the IP is not the problem: even if we use Playwright with a residential proxy, as in the following snippet, the result is the same.
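from playwright.sync_api import sync_playwright

# CHROMIUM_ARGS is the list of Chromium flags defined elsewhere in the script
with sync_playwright() as p:
    # Persistent Chrome profile, headful, routed through a residential proxy
    browser = p.chromium.launch_persistent_context(
        user_data_dir='./userdata/', channel="chrome", headless=False, slow_mo=200,
        args=CHROMIUM_ARGS, ignore_default_args=["--enable-automation"],
        proxy={
            'server': 'proxyserver',
            'username': 'userl',
            'password': 'pwd'
        },
    )
    page = browser.new_page()
    page.goto('https://www.harrods.com/en-it/', timeout=0)  # timeout=0 disables the navigation timeout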
This means that the browser leaks some pieces of information that are seen as red flags by Cloudflare. Fundamentally, we’re getting blocked by the fingerprint of our server.
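One way to see what differs between the two environments is to dump a few of the values a page can read and compare them. The sketch below is an illustration only: the list of probed properties is my own selection of standard browser APIs, not the set Cloudflare actually inspects, and the names PROBE_JS, collect_fingerprint and diff_fingerprints are hypothetical.

```python
# JavaScript run inside the page; every property read here is a standard browser API.
PROBE_JS = """() => {
  const canvas = document.createElement('canvas');
  const gl = canvas.getContext('webgl');
  const ext = gl && gl.getExtension('WEBGL_debug_renderer_info');
  return {
    userAgent: navigator.userAgent,
    platform: navigator.platform,
    webdriver: navigator.webdriver,
    hardwareConcurrency: navigator.hardwareConcurrency,
    languages: navigator.languages,
    screen: [screen.width, screen.height, screen.colorDepth],
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    webglRenderer: ext ? gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) : null,
  };
}"""


def collect_fingerprint(url="about:blank"):
    """Open a headful Chrome with Playwright and return the probed values as a dict."""
    from playwright.sync_api import sync_playwright  # imported lazily, only needed here

    with sync_playwright() as p:
        browser = p.chromium.launch(channel="chrome", headless=False)
        page = browser.new_page()
        page.goto(url)
        data = page.evaluate(PROBE_JS)
        browser.close()
    return data


def diff_fingerprints(local, server):
    """Return {key: (local_value, server_value)} for every key whose values differ."""
    keys = sorted(set(local) | set(server))
    return {k: (local.get(k), server.get(k)) for k in keys if local.get(k) != server.get(k)}
```

Running collect_fingerprint() once on the local machine and once on the VM, then passing both results to diff_fingerprints, gives a short list of candidates for the red flags, for instance a different platform or WebGL renderer.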
Trying to hide from fingerprinting
Since fingerprinting is a well-known issue, not least for its privacy implications, several browsers with stronger privacy features have appeared on the market in recent years. Chrome is known for leaking a lot of data, while Firefox and Safari started to differentiate themselves from it by putting more focus on the user’s privacy.
Let’s first try using Firefox instead of Chrome. This is also my general advice when using Playwright: since Firefox leaks less information, it usually performs better against anti-bots.
But not in this case. Even our tests_pl_firefox.py script fails if we run it from a virtual machine on a server.
Let’s try another browser, Brave. I actually love it and use it daily for browsing: it has a built-in ad blocker and spoofs some types of fingerprints, as you can see from the results of the deviceinfo.me test.
Spoofing fingerprints
Unfortunately, the tests_pl_brave.py script fails too. I also tried launching an incognito window with Playwright, but I noticed some strange behavior.
from playwright.sync_api import sync_playwright

CHROMIUM_ARGS = [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--no-first-run',
    '--disable-blink-features=AutomationControlled',
    '--incognito'
]

with sync_playwright() as p:
    # Launch the locally installed Brave binary (macOS path) instead of the bundled Chromium
    browser = p.chromium.launch(
        executable_path='/Applications/Brave Browser.app/Contents/MacOS/Brave Browser',
        headless=False, slow_mo=200,
        args=CHROMIUM_ARGS, ignore_default_args=["--enable-automation"],
        proxy={
            'server': 'server',
            'username': 'user',
            'password': 'pwd'
        },
    )
    page = browser.new_page()
    page.goto('https://www.harrods.com/en-it/', timeout=0)
Using this script, a first window is opened in ‘incognito mode’, but the new_page() command opens a new window that is not in the same context, making the --incognito flag useless. I’ve seen there’s an open issue on the Brave GitHub, but has anyone found a solution or workaround for it? Please let me know in the comments below.
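A possible explanation for the behavior above: according to the Playwright documentation, browser.new_page() creates the page in a new browser context, and contexts created with browser.new_context() are already isolated, incognito-like sessions. The sketch below therefore drops the flag and creates the context explicitly. It is untested against Harrods, and whether Brave applies its private-window protections to such a context is an open assumption; the helper names launch_options and open_isolated_page are hypothetical.

```python
BRAVE_PATH = "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"


def launch_options(executable_path, args):
    """Build the keyword arguments for chromium.launch(), without the --incognito flag."""
    return {
        "executable_path": executable_path,
        "headless": False,
        # The flag is dropped because the context created later is already isolated.
        "args": [arg for arg in args if arg != "--incognito"],
        "ignore_default_args": ["--enable-automation"],
    }


def open_isolated_page(url, args):
    """Open url in an explicitly created, non-persistent browser context and return its title."""
    from playwright.sync_api import sync_playwright  # imported lazily, only needed here

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_options(BRAVE_PATH, args))
        context = browser.new_context()  # isolated session, nothing is written to disk
        page = context.new_page()        # the page lives in the context created above
        page.goto(url, timeout=0)
        title = page.title()
        context.close()
        browser.close()
    return title
```

Creating the context explicitly at least makes it clear which session every page belongs to, instead of relying on a command-line flag that only affects the first window.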
A working solution, finally
There’s another class of browsers, called anti-detect browsers, which I discovered a few months ago. They are built specifically to spoof the fingerprints they send.
If you’ve been following this newsletter for a while, you’ve noticed I often mention GoLogin. At Databoutique.com we’re using this solution in some cases since it integrates with Playwright, which is in our tech stack, and requires zero maintenance.
After you log in to your account, you can create a number of profiles that depends on your plan. Each profile is described by an operating system and a set of advanced options to create your fake fingerprint.
For this test, I’ve created a profile based on a Windows machine, with a speaker and some noise in the Canvas and WebGL fingerprint.
You can also add your proxy provider (or use the free GoLogin proxies) directly via the web interface, just like I did.
In our tests_pl_gologin.py, Playwright will open an Orbita browser session, the browser created by GoLogin, which uses all the settings we’ve seen before to forge a set of fake parameters to pass to the server. In this way, it will create a fingerprint for a common Windows machine instead of a Linux server, and the Cloudflare check is passed.
In fact, by launching the script from the server, I could finally see the Harrods product page.
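For readers without access to the repository, here is a rough sketch of how Playwright can attach to an already running Chromium-based browser such as Orbita. It is not the content of tests_pl_gologin.py: it assumes the profile has been started separately and exposes a Chrome DevTools Protocol address (how you obtain that address from GoLogin is outside this sketch), and the helper names cdp_endpoint and open_with_running_profile are hypothetical. Only connect_over_cdp is an actual Playwright call.

```python
def cdp_endpoint(debugger_address: str) -> str:
    """Turn 'host:port' into a URL that connect_over_cdp accepts; leave full URLs untouched."""
    if debugger_address.startswith(("http://", "https://", "ws://", "wss://")):
        return debugger_address
    return f"http://{debugger_address}"


def open_with_running_profile(debugger_address: str, url: str) -> str:
    """Attach to a running browser over CDP, open url and return the page title."""
    from playwright.sync_api import sync_playwright  # imported lazily, only needed here

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_endpoint(debugger_address))
        # Reuse the profile's existing context so its spoofed settings stay in effect.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        page.goto(url, timeout=0)
        title = page.title()
        browser.close()
    return title
```

The point of attaching instead of launching is that the fingerprint is decided by the profile, not by the Playwright launch arguments.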
What we use in production
Even if the GoLogin solution is a great one, that’s not what we use in our production environment.
Our goal is to scrape the whole Harrods website. After doing the math and splitting the execution across several machines, the cost of running VMs capable of driving a fully headed browser, plus the cost of proxies, turned out to be greater than the cost of using the Bright Data Unlocker API. As we have seen in our Hands On article, it performs great against any anti-bot, Cloudflare included.
By doing so, we’re able to use smaller machines (t3.micro on AWS instead of t3.medium) and even if the Unlocker API costs more than an average residential proxy, we still save some money.
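The comparison behind this choice is simple arithmetic: compute cost plus unblocking cost for each setup. The sketch below shows only the shape of the calculation. Every number in it is a placeholder, not a real AWS, proxy or Bright Data price, and the function name setup_cost is hypothetical, so plug in your own figures.

```python
def setup_cost(machines: int, vm_hourly_usd: float, hours: float, unblocking_usd: float) -> float:
    """Total cost of one scraping run: compute time plus proxy or API spend."""
    return machines * vm_hourly_usd * hours + unblocking_usd


# Placeholder figures only: replace them with your real VM, proxy and API costs.
headful = setup_cost(machines=4, vm_hourly_usd=0.04, hours=100, unblocking_usd=50.0)
api_based = setup_cost(machines=4, vm_hourly_usd=0.01, hours=100, unblocking_usd=60.0)
print(f"headful browser: {headful:.2f} USD, API-based: {api_based:.2f} USD")
```

The API-based setup wins whenever the savings on machines outweigh the extra spend on unblocking, which is the trade-off described above.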
I hope you found this test interesting. I’d like to know how you solved the same issue, if you did: maybe there’s a smarter way I still don’t know.