Welcome back to another episode of The Lab, our series of articles where we write some code to solve some common issues in web scraping. And what’s more common than being blocked by an anti-bot?
In the past weeks, we have seen how to bypass the most common anti-bots we can find.
All these articles give some hints on how to bypass a certain anti-bot in a real-world case, but unfortunately there's no silver bullet available.
Different websites running the same anti-bot can configure different rules and protection levels, and even within the same website the countermeasures can vary from page to page.
Let's take the famous online travel site Booking.com as an example: the website is protected by PerimeterX, and it uses some internal APIs to show the results of your travel queries. If you try to browse the website, you need some tool to bypass PerimeterX, but the internal APIs are not protected, probably intentionally.
There could be different reasons for that:
- getting data from the APIs is less resource-intensive, both for the scraper and for the website, so they prefer that people use them to get data
- they know that some external applications are using them
- web scraping, if done responsibly, is nothing more than a minor nuisance.
The last point, in my opinion, reflects a common pattern: yes, web scraping can be a nuisance, but the most important use case for anti-bots is fraud detection, like blocking the automated purchase of an item as soon as it's published (think of the sneaker market). This explains why websites selling "rare" items, like the latest sneaker or streetwear drops, or even Hermès, are the most protected: maybe not on the home page, but certainly when you try to purchase something.
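As an aside on those unprotected internal APIs: in practice, scraping them just means replaying the JSON calls you can see in your browser's network tab. Here's a minimal sketch of that pattern (the endpoint, parameters, and JSON shape below are purely hypothetical placeholders, not Booking.com's real API):

# Purely illustrative sketch: the endpoint, parameters, and JSON shape are
# hypothetical placeholders, not Booking.com's real internal API. It only
# shows the general pattern of replaying a JSON call found in dev tools.
import requests

HYPOTHETICAL_ENDPOINT = "https://www.example.com/internal/search"  # placeholder

response = requests.get(
    HYPOTHETICAL_ENDPOINT,
    params={"destination": "Rome", "checkin": "2024-06-01", "checkout": "2024-06-05"},
    headers={"accept": "application/json"},
    timeout=30,
)
response.raise_for_status()
for hotel in response.json().get("results", []):  # hypothetical JSON shape
    print(hotel.get("name"), hotel.get("price"))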
But let’s go back to the 100% legal web scraping of public information, which is the main content of this newsletter.
We just mentioned PerimeterX, which is a widespread anti-bot solution we already covered in the past.
Two months ago we wrote about how to bypass it using Playwright, since almost every anti-bot requires a JS rendering engine to solve its challenges, and in most cases scrapy_splash is not enough.
But since there are a lot of OSS tools available on GitHub for web scraping, are we sure we cannot use any of them to avoid launching Playwright and make scraping faster and less resource-intensive?
Spoiler: yes, I’ve found one ✅
How to detect a website using PerimeterX?
PerimeterX (now part of HUMAN Security) is known for throwing its Press and Hold CAPTCHA, but there are other ways to detect it before even triggering that challenge.
As a first check, we can use the free Wappalyzer Chrome extension: it's easy to use and quite accurate.
Other signals of its presence can be found in the cookies, as mentioned in the great Web Scraping Wiki created by Maurice-Michel Didelot, a cybersecurity expert and one of the members of our amazing community. In this wiki you can find information about deobfuscating and reverse engineering anti-bots, which is essential for creating new tools and understanding what happens under the hood of our browser.
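To give an idea of what this looks like in code, here's a quick heuristic check based on commonly reported PerimeterX signals; the exact cookie names and markers can vary by deployment, so treat the list below as an assumption to refine:

# Quick heuristic check for PerimeterX, based on commonly reported signals
# (cookie names like _px3/_pxvid and "px" markers in the page); the exact
# names can vary by deployment, so this list is an assumption to refine.
import requests

PX_MARKERS = ["_px2", "_px3", "_pxvid", "_pxhd", "px-captcha", "px-cdn"]

def looks_like_perimeterx(url: str) -> bool:
    response = requests.get(url, timeout=30)
    haystack = "; ".join(response.cookies.keys()) + response.text
    return any(marker in haystack for marker in PX_MARKERS)

print(looks_like_perimeterx("https://www.neimanmarcus.com/"))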
All the techniques and tools we see in this article are for testing purposes and should not be used to harm any website or its business. Please use web scraping techniques in an ethical way. If you have any doubts about your web scraping operations, please ask for legal advice.
How to bypass PerimeterX with Scrapy?
The challenge of this article is to find a way to bypass PerimeterX and scrape public data from websites protected by it.
I'll save you time and jump straight to the conclusion, sparing you all the trial and error it took to find the right solution, which is Scrapy Impersonate.
Let me share with you two real-world examples where this solution fits particularly well.
You can find the code of the scrapers on The Web Scraping Club GitHub repository, available for paying readers. If you’re one of them but don’t have access, write me at pier@thewebscraping.club with your GH username.
Booking.com
We mentioned this website before as proof that a bot protection sometimes covers only parts of a website; in this case, it leaves out the APIs.
But let's try to scrape data from the HTML, which should be the most protected part of the website.
Well, I soon found out that this is not quite true.
A Scrapy spider with some good request headers is enough to avoid being detected, at least on a small scale.
DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.8",
    "cache-control": "max-age=0",
    "sec-ch-ua": "\"Not A(Brand\";v=\"99\", \"Brave\";v=\"121\", \"Chromium\";v=\"121\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"macOS\"",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "sec-gpc": "1",
    "upgrade-insecure-requests": "1",
}
You'll probably need some proxies at a larger scale, but it surprised me that it even works from an AWS virtual machine.
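To show how these headers plug into a spider, here's a minimal sketch; it assumes the DEFAULT_REQUEST_HEADERS dict above is defined in the same module, and the XPath is a hypothetical placeholder to adapt to the data you actually need:

# Minimal sketch of a spider relying on the headers above; assumes the
# DEFAULT_REQUEST_HEADERS dict shown earlier is defined in this module.
import scrapy

class BookingSpider(scrapy.Spider):
    name = "booking"
    custom_settings = {"DEFAULT_REQUEST_HEADERS": DEFAULT_REQUEST_HEADERS}

    def start_requests(self):
        yield scrapy.Request("https://www.booking.com/", callback=self.parse)

    def parse(self, response):
        # Hypothetical extraction: replace with the fields you actually need
        for title in response.xpath("//h3/text()").getall():
            yield {"title": title.strip()}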
NeimanMarcus.com
This is the website we usually use for testing solutions against PerimeterX, since it's one of the few well-known websites where PerimeterX is the only anti-bot installed.
The thing I like most about scrapy-impersonate is that it takes a minute to implement in your Scrapy spider and then works like a charm.
Once you've installed the package (pip install scrapy-impersonate), you just need to add these custom settings at the very beginning of your scraper:
custom_settings = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_impersonate.ImpersonateDownloadHandler",
        "https": "scrapy_impersonate.ImpersonateDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}
and add your preferred browser to impersonate in the meta parameter of your Scrapy Requests:
meta={'impersonate': 'chrome110'}
That’s it!
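A nice side effect is that the impersonation target can be set per request, so you can rotate fingerprints across requests. A small sketch (the target names come from the underlying curl_cffi library, and the ones listed here are assumed to be valid; check its docs for the full list):

# Sketch: rotating impersonation targets per request. The target names come
# from the underlying curl_cffi library and are assumed valid here.
import random
from scrapy import Request

BROWSERS = ["chrome107", "chrome110", "chrome116"]  # assumed curl_cffi targets

def make_request(url, callback):
    return Request(url, callback=callback,
                   meta={"impersonate": random.choice(BROWSERS)})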
The scraper itself is quite simple and reads the JSON embedded in the HTML instead of using the internal APIs, just like the Booking.com one.
# These methods belong inside a Scrapy spider class; the imports below are
# needed at the module level.
import csv
import json
from datetime import datetime

from scrapy import Request

def start_requests(self):
    yield Request('https://www.neimanmarcus.com/en-it/', callback=self.parse_cat_list, meta={'impersonate': 'chrome110'})

def parse_cat_list(self, response):
    # Collect the category URLs and follow only the women's clothing one
    category_urls = response.xpath('//a[contains(@href,"/c/")]/@href').extract()
    for category in category_urls:
        if '/c/womens-clothing-cat58290731' in category:
            yield Request('https://www.neimanmarcus.com' + category, callback=self.parse_products, meta={'impersonate': 'chrome110'})

def parse_products(self, response):
    DEFAULT_VALUE = 'n.a.'  # written to a field when the JSON key is missing
    # The product list is embedded as JSON inside a <script id="state"> tag
    json_data = json.loads(response.xpath('//script[@id="state"]/text()').extract()[0])
    for pdt in json_data['productListPage']['products']['list']:
        product_code = pdt.get('id', DEFAULT_VALUE)
        gender = 'n.a.'
        # 'oprc' is the full price; fall back to 'rprc' when it's missing
        fullprice = pdt.get('oprc', pdt.get('rprc', 0))
        price = pdt.get('rprc', 0)
        currency = 'EUR'
        country = 'ITA'
        try:
            product_url = 'https://www.neimanmarcus.com/en-it/' + pdt['canonical']
        except KeyError:
            product_url = DEFAULT_VALUE
        brand = pdt.get('designer', DEFAULT_VALUE)
        website = 'NEIMANMARCUS'
        date = datetime.now().strftime("%Y%m%d")
        pricemax = pdt.get('xleft', 0)
        # Append each product as a pipe-delimited row; the 'with' block
        # already closes the file, so no explicit close() is needed
        with open("file_output_tmp.txt", "a") as file:
            csv_file = csv.writer(file, delimiter="|")
            csv_file.writerow([product_code, 'N', fullprice, price, currency, country, product_url, brand, website, date, pricemax])
Last-minute edit before publishing the article
I don't really know what happened, but it seems that PerimeterX has generally lowered its protection against scraping.
I've just found out that today there's no need even for Scrapy-Impersonate to scrape Neiman Marcus, and the same can be said for Ticketmaster.com, Booking.com, and other websites where it's the only anti-bot protection.
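If you want to check this yourself, a plain HTTP request is enough to see whether you get the real page or the block screen. A minimal sketch (the "px-captcha" marker is an assumption based on what the block page typically contains):

# Quick check: fetch the page with plain requests and look for signs of a
# PerimeterX block (403 status or the px-captcha marker in the HTML).
import requests

response = requests.get(
    "https://www.neimanmarcus.com/",
    headers={"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"},
    timeout=30,
)
blocked = response.status_code == 403 or "px-captcha" in response.text
print(response.status_code, "blocked" if blocked else "ok")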
Some months ago this wasn't the case, since we tried other tools like Undetected-Chromedriver to bypass it and they failed.
I also have some scrapers in production built with Playwright for Neiman Marcus, since it was not possible to bypass PerimeterX in any other way, so this is really strange. I don't know if it's a bug or a feature, but that's it for the moment. 🤷🏻‍♂️