THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting


In other words: fake it until you scrape it


Here’s another post in “THE LAB”: in this series, we cover real-world use cases, with code and an explanation of the methodology used.

Being a paying user gives:

  • Access to paid content, like the post series called “The LAB”, where we dive deep into real-world cases with code (see an example here).

  • Access to the GitHub repository with the code seen in “The LAB”

  • Access to private channels on our Discord server

But in case you want to read this newsletter for free, you will always get a post per week about:

  • News about web scraping

  • Anti-bot software and techniques insights

  • Interviews with key people in the industry

And you can always join the Web Scraping Club Discord server

Enough housekeeping for now, let’s start.

As you surely know, the most advanced anti-bot solutions act on different levels:

  • at a behavioral level, they check how the scraper acts and try to distinguish a bot from a human.

  • at a browser level, they try to distinguish a genuine browser from an automated one, looking for inconsistencies in the setup.

  • at an HTTP level, they try to identify the device configuration to detect suspicious setups.

On our Discord server the discussion focused on this last case, so today we’ll try to explain how this can be achieved via TLS fingerprinting and what we can do as a countermeasure in our scrapers.

Understanding TLS Fingerprinting

TLS fingerprinting is a passive, server-side technique that servers use to identify the configuration of the clients connecting to them.

The fingerprints are built from the parameters, cipher suites first of all, that the client offers when it sets up the TLS connection with the server.

To better understand how this technique works, let’s borrow the image from this Cloudflare blog post.

[HTTP protocol](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F60b85cbc-4552-45af-9078-c7001220e4d9_2918x1667.webp)

When a client connects to a server, the first interaction happens at the TCP level. It’s called the three-way handshake, and it is where the client and server confirm their willingness and availability to connect.

  • The client sends a SYN packet to ask the server whether it is available for a new connection.

  • If the server is available, it replies to the client with a SYN/ACK packet.

  • The client then replies with an ACK packet and the connection is established. From now on, the two can exchange data.
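In practice the operating system performs this exchange for us every time a socket connects. Here is a minimal sketch on the loopback interface, using only Python’s standard socket module; the variable names are just illustrative, and nothing leaves the machine.

```python
import socket

# Listening socket on the loopback interface; port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

# create_connection() calls connect(), and the kernel runs the
# SYN -> SYN/ACK -> ACK exchange before the call returns.
client = socket.create_connection((host, port), timeout=5)

# accept() hands us the server side of the already established connection.
conn, peer = server.accept()

# From now on, the two ends can exchange data.
client.sendall(b"hello")
received = conn.recv(5)
print(f"server received {received!r} from {peer[0]}")

for sock in (client, conn, server):
    sock.close()
```

The handshake itself never shows up in the code: it is hidden inside the connect call, which is why we normally only notice it in a packet capture.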

Without going into too much detail about the full TLS protocol, we’ll now focus on what happens once the TCP connection is established.

The “Client Hello”, the first message the client sends once the TCP connection is up, carries the data used for fingerprinting. It includes the TLS versions the client supports, its list of supported cipher suites, a string of random bytes known as the “client random”, and a set of extensions.

The point is that this list differs from client to client: a connection from Chrome offers a different set of cipher suites than one from Safari or Scrapy, even when they are all sent from the same machine.
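If you want a rough idea of what your own Python installation would offer, the standard ssl module can list the cipher suites enabled on a default context. This is a sketch, not a packet capture: the helper name offered_ciphers is made up for this example, the names printed are OpenSSL’s (which differ from the IANA names used in the lists in this post), and the exact output depends on the OpenSSL build behind your interpreter.

```python
import ssl


def offered_ciphers(context=None):
    """Return (hex code, OpenSSL name) pairs for the suites enabled on a context."""
    context = context or ssl.create_default_context()
    suites = []
    for cipher in context.get_ciphers():
        # OpenSSL reports a 32-bit id; its low 16 bits are the IANA code point.
        code = cipher["id"] & 0xFFFF
        suites.append((f"{code:04X}", cipher["name"]))
    return suites


for code, name in offered_ciphers():
    print(f"[{code}]{name}")
```

The hex codes are the useful part for comparisons, since they match the bracketed values shown for each client.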

Here are the ciphers of a connection made to google.com with Chrome from a Mac laptop.

[8A8A]Unrecognized cipher - See https://www.iana.org/assignments/tls-parameters/
[1301]TLS_AES_128_GCM_SHA256
[1302]TLS_AES_256_GCM_SHA384
[1303]TLS_CHACHA20_POLY1305_SHA256
[C02B]TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[C02F]TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[C02C]TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C030]TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[CCA9]TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[CCA8]TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C013]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[C014]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[009C]TLS_RSA_WITH_AES_128_GCM_SHA256
[009D]TLS_RSA_WITH_AES_256_GCM_SHA384
[002F]TLS_RSA_WITH_AES_128_CBC_SHA
[0035]TLS_RSA_WITH_AES_256_CBC_SHA

Safari:

[2A2A]Unrecognized cipher - See https://www.iana.org/assignments/tls-parameters/
[1301]TLS_AES_128_GCM_SHA256
[1302]TLS_AES_256_GCM_SHA384
[1303]TLS_CHACHA20_POLY1305_SHA256
[C02C]TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C02B]TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[CCA9]TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[C030]TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[C02F]TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[CCA8]TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C00A]TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
[C009]TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
[C014]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[C013]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[009D]TLS_RSA_WITH_AES_256_GCM_SHA384
[009C]TLS_RSA_WITH_AES_128_GCM_SHA256
[0035]TLS_RSA_WITH_AES_256_CBC_SHA
[002F]TLS_RSA_WITH_AES_128_CBC_SHA
[C008]TLS_ECDHE_ECDSA_WITH_3DES_EDE_CBC_SHA
[C012]TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA
[000A]SSL_RSA_WITH_3DES_EDE_SHA

Scrapy:

[1302]TLS_AES_256_GCM_SHA384
[1303]TLS_CHACHA20_POLY1305_SHA256
[1301]TLS_AES_128_GCM_SHA256
[C02C]TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C030]TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[009F]TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
[CCA9]TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[CCA8]TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[CCAA]TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C02B]TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[C02F]TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[009E]TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
[C024]TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
[C028]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[006B]TLS_DHE_RSA_WITH_AES_256_CBC_SHA256
[C023]TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
[C027]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
[0067]TLS_DHE_RSA_WITH_AES_128_CBC_SHA256
[C00A]TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
[C014]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[0039]TLS_DHE_RSA_WITH_AES_256_CBC_SHA
[C009]TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
[C013]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[0033]TLS_DHE_RSA_WITH_AES_128_CBC_SHA
[009D]TLS_RSA_WITH_AES_256_GCM_SHA384
[009C]TLS_RSA_WITH_AES_128_GCM_SHA256
[003D]TLS_RSA_WITH_AES_256_CBC_SHA256
[003C]TLS_RSA_WITH_AES_128_CBC_SHA256
[0035]TLS_RSA_WITH_AES_256_CBC_SHA
[002F]TLS_RSA_WITH_AES_128_CBC_SHA
[00FF]TLS_EMPTY_RENEGOTIATION_INFO_SCSV

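The three lists differ in both the order and the number of ciphers. This means that the server, using these ciphers and some other parameters sent in the same message, gets an idea of my client’s architecture as soon as I try to connect, and can use this data to create fingerprints and block the suspicious ones. The first entry in the Chrome and Safari lists, reported as unrecognized, is a GREASE value (RFC 8701): a reserved dummy code point that these browsers insert on purpose.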
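To make the differences concrete, here is a small sketch that compares the three lists above by their hex codes. The codes are copied from this post; the helper names (is_grease, without_grease) are made up for the example.

```python
# Cipher code points copied from the Chrome, Safari and Scrapy lists above.
CHROME = ("8A8A 1301 1302 1303 C02B C02F C02C C030 CCA9 CCA8 "
          "C013 C014 009C 009D 002F 0035").split()
SAFARI = ("2A2A 1301 1302 1303 C02C C02B CCA9 C030 C02F CCA8 C00A C009 "
          "C014 C013 009D 009C 0035 002F C008 C012 000A").split()
SCRAPY = ("1302 1303 1301 C02C C030 009F CCA9 CCA8 CCAA C02B C02F 009E "
          "C024 C028 006B C023 C027 0067 C00A C014 0039 C009 C013 0033 "
          "009D 009C 003D 003C 0035 002F 00FF").split()


def is_grease(code):
    """GREASE values (RFC 8701) look like 0A0A, 1A1A, ..., FAFA."""
    return len(code) == 4 and code[:2] == code[2:] and code[1] == "A"


def without_grease(codes):
    """Drop GREASE entries, which are placeholders rather than real suites."""
    return [code for code in codes if not is_grease(code)]


chrome = without_grease(CHROME)
safari = without_grease(SAFARI)
scrapy = without_grease(SCRAPY)

for name, codes in (("Chrome", chrome), ("Safari", safari), ("Scrapy", scrapy)):
    print(f"{name}: {len(codes)} suites, starting with {codes[:3]}")

# Suites that only Scrapy offers are an easy tell for a server.
only_scrapy = [c for c in scrapy if c not in chrome and c not in safari]
print("Only in Scrapy:", only_scrapy)
```

Even this naive comparison is enough to separate the three clients: the suites that only Scrapy offers include the whole DHE family and the 00FF signalling value, none of which appear in the two browser lists.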

[Table from the LWT Hiker blog](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3ba78b-fd98-493e-9dc4-48aca714ead1_768x291.png)

Credits to LWT Hiker Blog

This great LWT Hiker blog post, where the previous table comes from, goes into more detail and also presents two of the best-known fingerprinting formats in use today, JA3 and TS1.
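As a rough illustration of how a JA3 fingerprint is put together: the method joins five fields of the Client Hello (TLS version, cipher suites, extensions, elliptic curves, point formats) as decimal numbers, skips GREASE values, and takes the MD5 hash of the resulting string. The sketch below follows that recipe. The function names are made up for the example, and the sample extension, curve and point format values are illustrative assumptions, not captured from a real browser; only the cipher codes come from the start of the Chrome list above.

```python
import hashlib


def is_grease(value):
    """True for GREASE values (RFC 8701): 0x0A0A, 0x1A1A, ..., 0xFAFA."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 string from Client Hello fields."""
    fields = [
        [version],
        [c for c in ciphers if not is_grease(c)],
        [e for e in extensions if not is_grease(e)],
        [g for g in curves if not is_grease(g)],
        list(point_formats),
    ]
    # Values inside a field are joined with "-", fields with ",".
    return ",".join("-".join(str(v) for v in field) for field in fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Illustrative input: 771 is the decimal value of 0x0303 (TLS 1.2).
sample = dict(
    version=771,
    ciphers=[0x8A8A, 0x1301, 0x1302, 0x1303, 0xC02B, 0xC02F],
    extensions=[0, 23, 65281, 10, 11],   # made-up example values
    curves=[29, 23, 24],                 # made-up example values
    point_formats=[0],
)
print(ja3_string(**sample))
print(ja3_hash(**sample))
```

Because the order of the values is part of the string, two clients offering exactly the same suites in a different order still end up with different hashes.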

Countermeasures

The easiest way to get a TLS fingerprint that is not on any blacklist is to scrape with a real browser, automated with a tool like Selenium or Playwright, but at a large scale this can be expensive and time-consuming. Luckily, we have some other options in Python.

The Python requests library allows sending custom ciphers, as described in this 2017 post.
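By default, requests ends up relying on Python’s ssl module for its TLS connections, so the underlying mechanism can be sketched with the standard library alone. The cipher string below is an illustrative assumption in OpenSSL’s cipher list format, not a recommended value, and make_context is a name made up for this example. One caveat: in Python’s ssl module, set_ciphers() cannot disable the TLS 1.3 suites, so those stay in the list whatever string you pass.

```python
import ssl

# Illustrative OpenSSL cipher string (an assumption, not taken from the post).
CIPHER_STRING = "ECDHE+AESGCM:ECDHE+CHACHA20"


def make_context(cipher_string=CIPHER_STRING):
    """Create a client-side SSL context restricted to the given cipher string."""
    context = ssl.create_default_context()
    context.set_ciphers(cipher_string)
    return context


context = make_context()
for cipher in context.get_ciphers():
    print(cipher["protocol"], cipher["name"])
```

A context like this can be passed to anything that accepts an ssl.SSLContext, for example urllib.request.urlopen(url, context=context).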

Scrapy, instead, allows some tweaks, like modifying the list of ciphers sent to the server, through options in the settings.py file.

With the default settings, these are the ciphers sent from my Mac:

Ciphers: 
[1302]TLS_AES_256_GCM_SHA384
[1303]TLS_CHACHA20_POLY1305_SHA256
[1301]TLS_AES_128_GCM_SHA256
[C02C]TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C030]TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[009F]TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
[CCA9]TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[CCA8]TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[CCAA]TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C02B]TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[C02F]TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[009E]TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
[C024]TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
[C028]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[006B]TLS_DHE_RSA_WITH_AES_256_CBC_SHA256
[C023]TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
[C027]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
[0067]TLS_DHE_RSA_WITH_AES_128_CBC_SHA256
[C00A]TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
[C014]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[0039]TLS_DHE_RSA_WITH_AES_256_CBC_SHA
[C009]TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
[C013]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[0033]TLS_DHE_RSA_WITH_AES_128_CBC_SHA
[009D]TLS_RSA_WITH_AES_256_GCM_SHA384
[009C]TLS_RSA_WITH_AES_128_GCM_SHA256
[003D]TLS_RSA_WITH_AES_256_CBC_SHA256
[003C]TLS_RSA_WITH_AES_128_CBC_SHA256
[0035]TLS_RSA_WITH_AES_256_CBC_SHA
[002F]TLS_RSA_WITH_AES_128_CBC_SHA
[00FF]TLS_EMPTY_RENEGOTIATION_INFO_SCSV

But with the

DOWNLOADER_CLIENT_TLS_CIPHERS='HIGH'

option, which enables only the “High” encryption cipher suites (those with key lengths larger than 128 bits, plus some cipher suites with 128-bit keys), the result is the following:

Ciphers: 
[1302]TLS_AES_256_GCM_SHA384
[1303]TLS_CHACHA20_POLY1305_SHA256
[1301]TLS_AES_128_GCM_SHA256
[C02C]TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C030]TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[00A3]TLS_DHE_DSS_WITH_AES_256_GCM_SHA384
[009F]TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
[CCA9]TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[CCA8]TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[CCAA]TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C0AF]TLS_ECDHE_ECDSA_WITH_AES_256_CCM_8
[C0AD]TLS_ECDHE_ECDSA_WITH_AES_256_CCM
[C0A3]TLS_DHE_RSA_WITH_AES_256_CCM_8
[C09F]TLS_DHE_RSA_WITH_AES_256_CCM
[C05D]TLS_ECDHE_ECDSA_WITH_ARIA_256_GCM_SHA384
[C061]TLS_ECDHE_RSA_WITH_ARIA_256_GCM_SHA384
[C057]TLS_DHE_DSS_WITH_ARIA_256_GCM_SHA384
[C053]TLS_DHE_RSA_WITH_ARIA_256_GCM_SHA384
[C02B]TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[C02F]TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[00A2]TLS_DHE_DSS_WITH_AES_128_GCM_SHA256
[009E]TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
[C0AE]TLS_ECDHE_ECDSA_WITH_AES_128_CCM_8
[C0AC]TLS_ECDHE_ECDSA_WITH_AES_128_CCM
[C0A2]TLS_DHE_RSA_WITH_AES_128_CCM_8
[C09E]TLS_DHE_RSA_WITH_AES_128_CCM
[C05C]TLS_ECDHE_ECDSA_WITH_ARIA_128_GCM_SHA256
[C060]TLS_ECDHE_RSA_WITH_ARIA_128_GCM_SHA256
[C056]TLS_DHE_DSS_WITH_ARIA_128_GCM_SHA256
[C052]TLS_DHE_RSA_WITH_ARIA_128_GCM_SHA256
[C024]TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
[C028]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[006B]TLS_DHE_RSA_WITH_AES_256_CBC_SHA256
[006A]TLS_DHE_DSS_WITH_AES_256_CBC_SHA256
[C073]TLS_ECDHE_ECDSA_WITH_CAMELLIA_256_CBC_SHA384
[C077]TLS_ECDHE_RSA_WITH_CAMELLIA_256_CBC_SHA384
[00C4]TLS_DHE_RSA_WITH_CAMELLIA_256_CBC_SHA256
[00C3]TLS_DHE_DSS_WITH_CAMELLIA_256_CBC_SHA256
[C023]TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
[C027]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
[0067]TLS_DHE_RSA_WITH_AES_128_CBC_SHA256
[0040]TLS_DHE_DSS_WITH_AES_128_CBC_SHA256
[C072]TLS_ECDHE_ECDSA_WITH_CAMELLIA_128_CBC_SHA256
[C076]TLS_ECDHE_RSA_WITH_CAMELLIA_128_CBC_SHA256
[00BE]TLS_DHE_RSA_WITH_CAMELLIA_128_CBC_SHA256
[00BD]TLS_DHE_DSS_WITH_CAMELLIA_128_CBC_SHA256
[C00A]TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
[C014]TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[0039]TLS_DHE_RSA_WITH_AES_256_CBC_SHA
[0038]TLS_DHE_DSS_WITH_AES_256_CBC_SHA
[0088]TLS_DHE_RSA_WITH_CAMELLIA_256_CBC_SHA
[0087]TLS_DHE_DSS_WITH_CAMELLIA_256_CBC_SHA
[C009]TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
[C013]TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[0033]TLS_DHE_RSA_WITH_AES_128_CBC_SHA
[0032]TLS_DHE_DSS_WITH_AES_128_CBC_SHA
[0045]TLS_DHE_RSA_WITH_CAMELLIA_128_CBC_SHA
[0044]TLS_DHE_DSS_WITH_CAMELLIA_128_CBC_SHA
[009D]TLS_RSA_WITH_AES_256_GCM_SHA384
[C0A1]TLS_RSA_WITH_AES_256_CCM_8
[C09D]TLS_RSA_WITH_AES_256_CCM
[C051]TLS_RSA_WITH_ARIA_256_GCM_SHA384
[009C]TLS_RSA_WITH_AES_128_GCM_SHA256
[C0A0]TLS_RSA_WITH_AES_128_CCM_8
[C09C]TLS_RSA_WITH_AES_128_CCM
[C050]TLS_RSA_WITH_ARIA_128_GCM_SHA256
[003D]TLS_RSA_WITH_AES_256_CBC_SHA256
[00C0]TLS_RSA_WITH_CAMELLIA_256_CBC_SHA256
[003C]TLS_RSA_WITH_AES_128_CBC_SHA256
[00BA]TLS_RSA_WITH_CAMELLIA_128_CBC_SHA256
[0035]TLS_RSA_WITH_AES_256_CBC_SHA
[0084]TLS_RSA_WITH_CAMELLIA_256_CBC_SHA
[002F]TLS_RSA_WITH_AES_128_CBC_SHA
[0041]TLS_RSA_WITH_CAMELLIA_128_CBC_SHA
[00FF]TLS_EMPTY_RENEGOTIATION_INFO_SCSV

The list is quite different, and this could be enough to stay off the blacklists. The setting accepts any string in the OpenSSL cipher list format, so for the usable values you can have a look at the OpenSSL documentation.
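Since the setting takes an OpenSSL cipher string, we are not limited to keywords like HIGH: we can also spell out the suites one by one. The fragment below is a sketch for settings.py that lists, in Chrome’s order, the OpenSSL names of the TLS 1.2 suites from the Chrome capture earlier in this post. The list name CHROME_LIKE_TLS12_SUITES is made up for the example, and I haven’t verified this exact string against any anti-bot solution. Keep in mind that in OpenSSL the TLS 1.3 suites (the 13xx codes) are configured separately from this string, and that the fingerprint also depends on extensions and other Client Hello fields that this setting doesn’t touch.

```python
# settings.py (fragment)

# OpenSSL names of the TLS 1.2 suites in the Chrome list above, in the same order.
CHROME_LIKE_TLS12_SUITES = [
    "ECDHE-ECDSA-AES128-GCM-SHA256",   # C02B
    "ECDHE-RSA-AES128-GCM-SHA256",     # C02F
    "ECDHE-ECDSA-AES256-GCM-SHA384",   # C02C
    "ECDHE-RSA-AES256-GCM-SHA384",     # C030
    "ECDHE-ECDSA-CHACHA20-POLY1305",   # CCA9
    "ECDHE-RSA-CHACHA20-POLY1305",     # CCA8
    "ECDHE-RSA-AES128-SHA",            # C013
    "ECDHE-RSA-AES256-SHA",            # C014
    "AES128-GCM-SHA256",               # 009C
    "AES256-GCM-SHA384",               # 009D
    "AES128-SHA",                      # 002F
    "AES256-SHA",                      # 0035
]

# Scrapy expects a single string in the OpenSSL cipher list format.
DOWNLOADER_CLIENT_TLS_CIPHERS = ":".join(CHROME_LIKE_TLS12_SUITES)
```

After changing the setting, it’s worth checking the ciphers actually sent, the same way the lists in this post were collected, because the OpenSSL build on your machine may drop suites it considers too weak.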

The last option, which you’ll also find in our GitHub repository, is using a proxy that masks your ciphers. I’ve found this project and it did its job.

The repository seems to be discontinued, though, and the author is looking for help to maintain it.

I’m leaving some links in case you want to dig deeper into this topic. There are some great posts around the web:

  1. LWT Hiker blog post, mentioned above

  2. Cory Benfield’s blog post (2017)

  3. Restricting TLS in Python requests (2018) by Hussain Ali Akbar

  4. List of TLS extensions


If you liked this post, please share it with your friends and colleagues and spread the word about The Web Scraping Club
