The Lab #37: Bypassing Cloudflare with anti-detect browsers - Part 2
Using Kameleo to bypass Cloudflare bot detection: with this anti-detect browser we can scrape data from Cloudflare protected websites in Python
In the latest article of The Web Scraping Club, we saw how to configure GoLogin to bypass Cloudflare Bot Protection. We also saw how device fingerprinting comes into play, since our scraper worked from our local machine but not from a server in an AWS datacenter, even when using residential proxies.
Thanks to the Kameleo team, I had the opportunity to test their anti-detect browser, so we can try a new tool to add to our toolbelt.
If you’d like to access all The Lab articles like this, consider becoming a paid reader and supporting The Web Scraping Club project.
What is Kameleo and how does it work?
Kameleo is an anti-detect tool that allows you to create different profiles you can use in your scrapers.
Each profile can be seen as a device, with its own OS, canvas fingerprint, WebGL, and so on.
You can choose between two custom-built browsers, one to simulate Chrome profiles and one for Firefox. You can also install an app on an Android device to use and automate your browser on it.
All these profiles, created via the web interface or the API, can be used inside the most popular web automation frameworks like Selenium, Puppeteer, or Playwright.
First of all, you need to download and run the Kameleo installer. Unfortunately, it’s only available for Windows, which is a significant limitation. This will be the machine where the browser is opened and loads the requested pages.
On the machine where you launch the scraper, instead, you need to install the API client, which will need to connect to the Kameleo server.
Explaining how Kameleo works is quite easy: its team has collected a database of fingerprints from real devices (called base profiles) that are used as templates when we create the profiles we’re going to use for scraping (virtual profiles). Since base profiles come from real devices, we can modify only a few options when creating virtual profiles, to avoid breaking the overall coherence of the fingerprint. It seems to me a smart move, especially if the database of base profiles is big enough to create many different devices.
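To make the base/virtual split more concrete, here is a small sketch in plain Python. It is not Kameleo's data model: the class names, the field names, and the choice of which settings are tunable are all assumptions made for illustration. It only shows the idea of locking hardware-bound values while allowing a few safe overrides.

```python
from dataclasses import dataclass

# Hypothetical field names, for illustration only; not Kameleo's data model.
@dataclass(frozen=True)
class BaseProfile:
    os_family: str
    browser: str
    user_agent: str
    webgl_renderer: str

@dataclass(frozen=True)
class VirtualProfile:
    base: BaseProfile          # fingerprint captured from a real device, never edited
    language: str = "en-US"    # assumed tunable setting
    timezone: str = "UTC"      # assumed tunable setting

# Only these settings may be changed; everything in the base profile stays as captured.
TUNABLE = {"language", "timezone"}

def make_virtual_profile(base: BaseProfile, **overrides: str) -> VirtualProfile:
    """Derive a virtual profile, rejecting overrides that would break coherence."""
    locked = set(overrides) - TUNABLE
    if locked:
        raise ValueError(f"cannot override locked fields: {sorted(locked)}")
    return VirtualProfile(base=base, **overrides)
```

Calling make_virtual_profile(base, language="it-IT") works, while trying to override webgl_renderer raises an error, which is the kind of guardrail that keeps a fingerprint plausible.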
But let’s play with different profiles and see how they perform against the traditional fingerprint tests.
Profile 1: A Windows profile
First of all, we need to install the Kameleo API client, in our case the Python one.
python3.10 -m pip install kameleo.local-api-client
After that, following the examples in the documentation, we’ll create our first profile mimicking a desktop Windows machine.
from kameleo.local_api_client import KameleoLocalApiClient
from kameleo.local_api_client.builder_for_create_profile import BuilderForCreateProfile
from playwright.sync_api import sync_playwright
# This is the port Kameleo.CLI is listening on. Default value is 5050, but can be overridden in appsettings.json file
kameleo_port = 5050
# Replace YOURSERVERIP with the address of the Windows machine running Kameleo
client = KameleoLocalApiClient(
    endpoint=f'http://YOURSERVERIP:{kameleo_port}',
    retry_total=0
)
# Search Chrome Base Profiles
base_profiles = client.search_base_profiles(
    device_type='desktop',
    browser_product='chrome',
    os_family='windows'
)
# Create a new profile with recommended settings
# Choose one of the Base Profiles
create_profile_request = BuilderForCreateProfile \
    .for_base_profile(base_profiles[0].id) \
    .set_recommended_defaults() \
    .build()
profile = client.create_profile(body=create_profile_request)
I only selected the browser, the machine type, and the OS family, allowing Kameleo to create the profile with its default configurations.
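The script above always takes the first base profile returned by the search. If you want some variety between runs, a small helper like the one below could pick one at random and fail clearly when the search returns nothing. This is my own sketch, not something from the Kameleo documentation: the function name is hypothetical and it only assumes that the search result is a sequence of objects exposing an id attribute, as used in the script.

```python
import random
from typing import Optional, Sequence

def pick_base_profile(base_profiles: Sequence, rng: Optional[random.Random] = None):
    """Return a random base profile, or raise if the search came back empty."""
    if not base_profiles:
        raise LookupError("no base profile matched the search filters")
    # Pass a seeded random.Random to get reproducible picks
    return (rng or random).choice(base_profiles)
```

In the script, base_profiles[0].id would then become pick_base_profile(base_profiles).id.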
In the second part of the script, which you can always find in the GitHub repository available for paying subscribers, I’ve used the profile to load the Sannysoft test collection.
import time
from random import randrange
# Start the browser profile
client.start_profile(profile.id)
print(profile.id)
# Connect to the browser with Playwright through CDP
browser_ws_endpoint = f'ws://YOURSERVERIP:{kameleo_port}/playwright/{profile.id}'
with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(endpoint_url=browser_ws_endpoint)
    context = browser.contexts[0]
    page = context.new_page()
    # Use any Playwright command to drive the browser
    # and enjoy full protection from bot detection products
    page.goto('https://browserleaks.com/javascript')
    # Pause for a random interval between 10 and 29 seconds
    interval = randrange(10, 30)
    time.sleep(interval)
# Wait for 5 seconds
time.sleep(5)
# Stop the browser by stopping the Kameleo profile
client.stop_profile(profile.id)
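One weakness of the script as written is that, if anything fails between start_profile and stop_profile, the browser stays open on the Windows machine. A minimal way to guard against that, sketched here under the assumption that the client exposes only the start_profile and stop_profile calls used above, is a small context manager (the name running_profile is mine):

```python
from contextlib import contextmanager

@contextmanager
def running_profile(client, profile_id):
    """Start a profile and always stop it, even if the scraping code raises."""
    client.start_profile(profile_id)
    try:
        yield profile_id
    finally:
        # Runs on success and on failure, so no browser is left open
        client.stop_profile(profile_id)
```

The Playwright part of the script would then sit inside a with running_profile(client, profile.id): block.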
The profile looks perfectly legit, even though the browser was launched on a data center machine on AWS running Windows Server 2022.
Profile 2: A Mac OS profile running on a Windows Server
Let’s test the consistency of the profile by creating a Mac OS profile while keeping Kameleo running on the same Windows machine, then reload the Sannysoft test collection to look for any glaring inconsistency.
We can see that not only has the user agent changed, but also the navigator.platform attribute and the WebGL renderer.
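The kind of check we just did by eye can also be roughly automated. The sketch below compares the three values mentioned above and flags obvious mismatches. The dictionary keys and the function name are hypothetical, and the rules are simple heuristics of mine (Chrome on macOS reports MacIntel as navigator.platform, and Direct3D renderers belong to Windows), not the logic of any real anti-bot product.

```python
def find_inconsistencies(fp: dict) -> list:
    """Flag obvious OS mismatches between user agent, platform and WebGL renderer."""
    ua = fp.get("user_agent", "")
    platform = fp.get("platform", "")
    renderer = fp.get("webgl_renderer", "")
    issues = []
    claims_mac = "Macintosh" in ua
    claims_windows = "Windows NT" in ua
    if claims_mac and platform != "MacIntel":
        issues.append(f"macOS user agent but navigator.platform is {platform!r}")
    if claims_windows and not platform.startswith("Win"):
        issues.append(f"Windows user agent but navigator.platform is {platform!r}")
    if claims_mac and "Direct3D" in renderer:
        issues.append("macOS user agent but a Direct3D WebGL renderer")
    return issues
```

A profile that only swapped the user agent on a Windows host would typically trip at least one of these rules, which is exactly the inconsistency we were looking for in this test.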
The interesting part of this process of profile creation is that if we create different profiles on the same OS, they all differ in some details like the screen size or sound devices, creating different but plausible hardware configurations.
Can we finally bypass Cloudflare Anti-Bot?
Now that we have seen how Kameleo works, let’s continue with the main topic of the post and see if we can bypass Cloudflare with it.
I’ve created the script playwright_kameleo_harrods.py in The Lab GitHub repository, available for paying subscribers. If you’re one of them but don’t have access to it, please write to me at pier@thewebscraping.club to be added to the repository.
This script is quite simple and very similar to the one in the examples in the Kameleo documentation. Instead of creating a new profile at every launch, I printed the profile ID in the script kameleo_API.py and reused that value, since I could not find it anywhere else.
The result is great, with the Harrods website loading on the first try (after adding residential proxies to the profiles via the web interface), on both Mac OS and Windows profiles.
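Copying the profile ID by hand from one script to the other works, but it's easy to get wrong. A simple alternative, sketched here with hypothetical helper names and a file path of your choice, is to have the first script write the ID to a small text file and the second one read it back:

```python
from pathlib import Path
from typing import Optional

def save_profile_id(profile_id: str, path: Path) -> None:
    """Store the ID printed by the profile-creation script for later runs."""
    path.write_text(profile_id.strip() + "\n", encoding="utf-8")

def load_profile_id(path: Path) -> Optional[str]:
    """Return the stored ID, or None if no profile has been saved yet."""
    if not path.exists():
        return None
    return path.read_text(encoding="utf-8").strip() or None
```

The creation script would call save_profile_id(profile.id, Path("profile_id.txt")), and the scraper would call load_profile_id on the same path before starting the profile.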
Pros and cons of this solution
Kameleo, in this case, has been a more effective solution than GoLogin. I must confess that in both cases I didn’t have time to play around much with the tools, but the fact that I could bypass Cloudflare protection without effort is great. Probably, with more tweaks to the code and to the GoLogin profiles, I could have obtained the same result, so I’d be happy to host a fully working solution on these pages.
As we have seen, we needed two machines to run this scraper: a Windows one, where Kameleo is installed and the browsers are launched, and the scraper one, which can be a micro instance since it only runs the scraper code. I could run the scraper on the same Windows machine, but for scalability and fail-safe scraping operations I would prefer to keep it on a separate machine. Either way, to optimize costs, multiple scrapers should connect to the same Windows machine, which creates a performance bottleneck and a single point of failure for numerous scrapers.
The costs and the pricing model of Kameleo are the real pain points of the solution: to have access to both the APIs and the web interface you need to spend 199 USD per month per user (with some discounts if you commit for 6 months). This means that if you need to run, let’s say, 50 scrapers concurrently and you route them to 5 different Windows machines, each large enough to handle 10 headful scrapers, you need to purchase licenses for 5 users.
Without discounts, that’s almost 1K USD per month, which becomes a heavy burden even for a company.
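The arithmetic is easy to redo for your own setup. The snippet below assumes, as in the example above, one license per Windows machine and the undiscounted price of 199 USD per user per month; the function name is just for illustration.

```python
import math

def monthly_license_cost(scrapers: int, scrapers_per_machine: int, price_per_user_usd: int):
    """Return (machines needed, total monthly cost), assuming one license per machine."""
    machines = math.ceil(scrapers / scrapers_per_machine)
    return machines, machines * price_per_user_usd

machines, cost = monthly_license_cost(50, 10, 199)
print(machines, cost)  # 5 995
```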
On the other hand, the product is great and I love it: it’s simple to use (I’d appreciate a bit more documentation, but that’s ok) and very effective. In cases like sneaker bots, where you have few target websites but high protection, it’s probably the best choice, while for broader scopes the costs are too high.