When is human-like mouse movement important in web scraping?

As we approach a web scraping project, we may find anti-bot protection software installed on the target website. How common this is depends on the industry in which the website operates: in my experience with 200+ fashion e-commerce websites, where I scrape public data like prices and products, around 20% of them have some sort of bot protection. This is especially true of large players with the budget to spend on these solutions, while lesser-known websites usually prefer to focus on the security of the purchasing process.

If there’s no anti-bot protection, probably a Scrapy spider will do the job and you don’t have to care about mouse movements, since there’s no one on the server side who cares about it.

Even if you see there’s an anti-bot that requires you to provide a plausible device fingerprint to bypass, in many cases this is enough and you don’t need to worry about movements. Several anti-bot solutions prefer to rely primarily on signals about the browser and the device configuration rather than tracking every event sent by the browser, like the mouse movement, at least when they’re set at a low “aggressivity” level.

Other anti-bot solutions, like Datadome, when set to a high security level, are very sensitive to the scraper’s behavior: anything that goes off the rails is marked as suspect and could lead to a block, even when a human is browsing. Try it yourself on the Hermes.com website, which is one of the most protected websites I’ve encountered. Enter a random product category and start browsing quickly through all the products, opening images and links in new tabs: you will get blocked after a few minutes.

[Image](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfd197c-5b63-4093-85b9-9905b2e99c36_530x395.webp)

This happens because you’re not browsing the website the way someone genuinely interested in buying would. It’s the result of the so-called “behavioral analysis” that Datadome advertises on its website.

[Image](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c69d926-69de-4ecb-8aec-ba21cd47d7e1_1302x566.png)

Datadome will analyze not only the sequence of requests you’re making to the target website but also how your scraper transitions from one page to another: is it reaching even the internal pages via direct links? Is it clicking around with the mouse? And if so, how does the mouse move on the page?

The standard Playwright mouse movement

Playwright, since it is a browser automation tool for app testing, has no native function to emulate a human-like mouse movement. Its core functionalities are built for creating and executing tests on web apps: it provides a set of functions to move the mouse inside a Playwright browser instance but the movements are quite unnatural.

In the GitHub repository reserved for paying subscribers, you’ll find a simple program (draw_play.py) that draws a line between two points in the way Playwright would move between them: fast and straight.

[Image](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6b081-9aaf-4346-8a8c-2f776c6c0fe4_556x360.png)

It’s difficult to believe this is a mouse movement between two points generated by a human. Every time we tell Playwright to click on a point on the screen, it immediately goes there, following a straight line.
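To make the “fast and straight” idea concrete, here is a minimal sketch in plain Python that generates the kind of evenly spaced, perfectly aligned points you get from a linear interpolation between two positions. As far as I understand, this is roughly what happens when you call Playwright’s `page.mouse.move(x, y, steps=n)`, but the helper name `straight_line_path` is my own and is not part of Playwright or of the draw_play.py program.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def straight_line_path(start: Point, end: Point, steps: int = 10) -> List[Point]:
    """Return `steps` evenly spaced points on the segment from start to end.

    The start point itself is excluded and the end point is included,
    so the last element is always the target position.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    (x0, y0), (x1, y1) = start, end
    return [
        (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
        for i in range(1, steps + 1)
    ]


if __name__ == "__main__":
    # Every point lies on the same line and the spacing never changes:
    # no curve, no acceleration, no final correction near the target.
    for point in straight_line_path((0, 0), (100, 50), steps=5):
        print(point)
```

With the default of a single step, the whole movement collapses into one jump to the target, which is about as far from a human hand as it gets.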

Advanced anti-bots listen to the mousemove event, which is triggered, as you can imagine, when the mouse moves on a page. As the W3C specification puts it:

The frequency rate of events while the pointing device is moved is implementation-, device-, and platform-specific, but multiple consecutive mousemove events SHOULD be fired for sustained pointer-device movement, rather than a single event for each instance of mouse movement.

This means that every page we browse manually triggers, on average, hundreds of mousemove events, while a scraper that jumps from selector to selector generates only a few. Even if we force Playwright to move the mouse around, its movements are too fast and too straight to pass for human ones.
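As a purely hypothetical illustration of how “too straight” could be measured from a list of recorded mousemove coordinates, the sketch below computes the ratio between the direct distance and the distance actually travelled. This is my own toy metric, not Datadome’s logic (which is not public): the function names and the sample coordinates are assumptions made for the example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def path_length(points: List[Point]) -> float:
    """Total distance travelled along consecutive recorded positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def straightness(points: List[Point]) -> float:
    """Direct distance divided by travelled distance.

    1.0 means a perfectly straight path; lower values mean more curvature.
    A path that never moves is treated as straight.
    """
    travelled = path_length(points)
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


if __name__ == "__main__":
    robotic = [(0, 0), (50, 50), (100, 100)]  # hypothetical straight jump
    wobbly = [(0, 0), (50, 80), (100, 100)]   # hypothetical curved path
    print(len(robotic), straightness(robotic))
    print(len(wobbly), straightness(wobbly))
```

A real system would likely combine many more signals (event count, timing between events, pauses), so take this only as a way to reason about what a straight, sparse trail looks like in numbers.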

The python_ghost_cursor mouse movement implementation

We already talked about Bezier curves in the past: they’re a way to draw curved lines programmatically. Multiple Python packages use them to mimic human-like interactions with web pages, together with some adjustments to the speed of the mouse.
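As a quick refresher, here is a minimal sketch of a cubic Bezier path between two points, with two randomly displaced control points so that every run produces a slightly different curve. The function names, the `spread` parameter and the way the control points are chosen are my own assumptions for illustration, not the implementation used by any specific package.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at t in [0, 1]."""
    u = 1 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)


def bezier_path(
    start: Point,
    end: Point,
    steps: int = 50,
    spread: float = 100.0,
    rng: Optional[random.Random] = None,
) -> List[Point]:
    """Return steps + 1 points on a randomly bent curve from start to end."""
    rng = rng or random.Random()

    def control(fraction: float) -> Point:
        # Take a point along the straight segment, then push it off the
        # line by a random offset of at most `spread` pixels per axis.
        return (
            start[0] + (end[0] - start[0]) * fraction + rng.uniform(-spread, spread),
            start[1] + (end[1] - start[1]) * fraction + rng.uniform(-spread, spread),
        )

    c1, c2 = control(1 / 3), control(2 / 3)
    return [cubic_bezier(start, c1, c2, end, i / steps) for i in range(steps + 1)]


if __name__ == "__main__":
    for point in bezier_path((0, 0), (300, 200), steps=10, rng=random.Random(42)):
        print(point)
```

The curve always starts and ends exactly on the two given points; only the route between them changes from one call to the next.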

Ghost_cursor is a JS library that uses Bezier curves to calculate the trajectory of the mouse between two points and also applies Fitts’s law to calculate its speed; python_ghost_cursor is its Python port.
From Wikipedia:

A movement during a single Fitts’s law task can be split into two phases:

  • initial movement. A fast but imprecise movement towards the target

  • final movement. Slower but more precise movement in order to acquire the target

The first phase is defined by the distance to the target. In this phase the distance can be closed quickly while still being imprecise. The second movement tries to perform a slow and controlled precise movement to actually hit the target. The task duration scales linearly in regards to difficulty. But as different tasks can have the same difficulty, it is derived that distance has a greater impact on the overall task completion time than target size.

Basically, the mouse starts moving faster and then slows down when it’s approaching the target button or selector, simulating human behavior. Is this enough to bypass the anti-bots?
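To put some numbers on this, here is a minimal sketch based on the common Shannon formulation of Fitts’s law, MT = a + b · log2(D / W + 1), where D is the distance to the target and W is its width. The constants `a` and `b` are normally fitted from experiments, so the defaults below are placeholders I picked for the example, and the cubic ease-out curve is just one simple way to get a “fast start, slow finish” profile, not the formula used inside ghost-cursor.

```python
import math
from typing import List


def fitts_movement_time(
    distance: float, width: float, a: float = 0.1, b: float = 0.15
) -> float:
    """Estimated movement time in seconds: a + b * log2(distance / width + 1)."""
    if distance < 0 or width <= 0:
        raise ValueError("distance must be >= 0 and width must be > 0")
    return a + b * math.log2(distance / width + 1)


def ease_out(t: float) -> float:
    """Map linear progress t in [0, 1] to fast-then-slow progress."""
    return 1 - (1 - t) ** 3


def progress_schedule(steps: int) -> List[float]:
    """Fraction of the path covered after each of `steps` equal time slices."""
    return [ease_out(i / steps) for i in range(steps + 1)]


if __name__ == "__main__":
    # A far, small target takes longer to reach than a near, large one.
    print(fitts_movement_time(distance=800, width=20))
    print(fitts_movement_time(distance=100, width=200))
    # Most of the distance is covered in the first time slices.
    print(progress_schedule(4))
```

Feeding each value of the schedule into a curve as its `t` parameter gives points that are far apart at the beginning of the movement and packed together near the target.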

After all this theory, let’s go back to the initial example of the Hermes website, protected by Datadome.

As mentioned, it’s one of the toughest websites to bypass and we want to see if what we’ve learned today can help fool its anti-bot.

I wrote two small scripts, one using Playwright without Ghost cursor and one with it. Both try to open the home page of the website and iterate over its dropdown menu without getting blocked.

Playwright with no Ghost cursor ❌

Opening a session with Playwright and Brave Browser but no Ghost cursor, we can load the home page but cannot even click on the menu. I don’t know whether there’s an issue with my script, but since the burger menu never opens, the selectors used to iterate through the categories cannot be found and the scraper returns an error.

You can find the full code in the file test_play.py in the repository for paying subscribers. If you’re one of them but don’t have access, please write to me at pier@thewebscraping.club with your GitHub username, since I need to add you manually.

Playwright with Ghost cursor ✅

Let’s add the ghost-cursor package to the program and create a cursor to move the mouse around the page and click on selectors.

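```python
# `page` is an already opened Playwright page;
# create_cursor comes from the ghost-cursor package.
cursor = create_cursor(page)
page.goto('https://www.hermes.com')
# Wait for the menu button to appear, then click it with a human-like movement.
selector = page.locator('xpath=//button[@id="collection-button"]')
selector.wait_for()
cursor.click(selector)
```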

With this simple add-on, we can iterate over the whole menu without getting blocked. You can find the full program in the same repository, in the file test_ghost.py.

It seems like a good way to proceed with our scraping project. Of course, this is only a small piece of the puzzle for successfully scraping the website: you also need a good device fingerprint, a quality IP address, and a scraper designed to mimic the path a human user would follow when browsing the website.

Thanks for reading this issue, see you next Thursday with another “The Lab” article and the next Sunday with our new lesson of “Web Scraping from 0 to hero” course.