In past articles, we’ve looked at scraping techniques for websites whose data is updated frequently, but not every second.

But how can we scrape websites whose data is updated at a very high frequency, like the trade view of Bitstamp or sports betting sites?

Well, first we should understand how these websites work, and in most cases that means understanding what a WebSocket is and how it works.

What a WebSocket is and how it works

A WebSocket is a communication protocol that provides a full-duplex communication channel over a single, long-lasting connection between a client and a server on the web. It is designed to be implemented in web browsers and web servers but can be used by any client or server application. The WebSocket protocol facilitates real-time data transfer and interaction, making it an essential technology for modern web applications that require live content updates without the need to reload the web page, such as chat applications, live sports updates, and interactive games.

The operation of WebSockets is initiated through a handshake mechanism, which is performed over the HTTP protocol. This handshake starts when the client sends a WebSocket handshake request to the server, expressing its desire to establish a WebSocket connection. The request includes a specific upgrade header that signals the server to switch from the HTTP protocol to the WebSocket protocol. If the server supports WebSockets and accepts the connection request, it responds with a handshake response, confirming the protocol switch. Once this handshake is successfully completed, the initial HTTP connection is upgraded to a WebSocket connection, allowing for full-duplex communication.
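To make the handshake a bit more concrete, here is a minimal sketch of the check behind it, as described in RFC 6455: the client sends a random Sec-WebSocket-Key, and the server proves it understood the upgrade request by answering with a Sec-WebSocket-Accept value derived from that key. The function name expected_accept is my own, and the sample key is the one used in the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455; the server appends it to the client's key.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value a server should answer with."""
    raw = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    return base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")


# Simplified view of the headers the client sends with its upgrade request.
request_headers = {
    "Upgrade": "websocket",
    "Connection": "Upgrade",
    "Sec-WebSocket-Key": "dGhlIHNhbXBsZSBub25jZQ==",  # sample key from RFC 6455
    "Sec-WebSocket-Version": "13",
}

# Value expected in the server's "101 Switching Protocols" response.
print(expected_accept(request_headers["Sec-WebSocket-Key"]))
# prints s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

In practice a WebSocket library does all of this for us, so we never compute these values by hand.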

Unlike the traditional HTTP request-response model, where the client has to initiate every exchange and the server can only answer, the WebSocket protocol maintains an open connection, enabling both the client and the server to send data independently and at any time. This persistent connection is maintained until explicitly closed by either the client or the server. The ability to send data in both directions without the overhead of repeated HTTP requests significantly reduces latency and increases the efficiency of data transfer, making WebSockets particularly suitable for real-time applications.

After the handshake, WebSocket traffic no longer follows HTTP's request-response semantics: it is a thin framing layer on top of the same TCP connection, so each message carries only a few bytes of overhead instead of a full set of HTTP headers. Furthermore, WebSockets are message-based: data travels as discrete messages, which can be text or binary.
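For the curious, this is roughly what one of those discrete messages looks like on the wire. The sketch below decodes a single, complete frame following the layout in RFC 6455; the function name decode_frame is my own, and the sample bytes are the "Hello" examples from the RFC. It ignores fragmentation and control frames, so take it as an illustration rather than a full parser.

```python
import struct


def decode_frame(frame: bytes):
    """Decode one complete WebSocket frame into (fin, opcode, payload)."""
    first, second = frame[0], frame[1]
    fin = bool(first & 0x80)       # is this the last fragment of the message?
    opcode = first & 0x0F          # 0x1 = text, 0x2 = binary
    masked = bool(second & 0x80)   # frames sent by a client are masked
    length = second & 0x7F
    offset = 2
    if length == 126:              # 16-bit extended payload length
        (length,) = struct.unpack("!H", frame[offset:offset + 2])
        offset += 2
    elif length == 127:            # 64-bit extended payload length
        (length,) = struct.unpack("!Q", frame[offset:offset + 8])
        offset += 8
    mask = b""
    if masked:
        mask = frame[offset:offset + 4]
        offset += 4
    payload = frame[offset:offset + length]
    if masked:
        # Unmask by XOR-ing every byte with the 4-byte masking key.
        payload = bytes(b ^ mask[i % 4] for i, b in enumerate(payload))
    return fin, opcode, payload


# Unmasked text frame containing "Hello" (server to client).
print(decode_frame(bytes([0x81, 0x05, 0x48, 0x65, 0x6C, 0x6C, 0x6F])))
```

Again, libraries hide all of this: in the rest of the article we only deal with the decoded messages.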

WebSocket in action: Bitstamp

Now that we understand how WebSockets work, let’s see them in action and use them to get a continuous stream of data.

All the code for these tests can be found in The Lab GitHub repository, available to paying subscribers, in the folder 47.REAL-TIME-SCRAPING.

If you’ve already subscribed but don’t have access to the repository, please write to me at pier@thewebscraping.club since I need to add you manually.

Step one: let’s find a WebSocket

In this step, we’re looking for a website that uses WebSockets and the first one that came to my mind as a potential target is Bitstamp, with its tradeview section.

[Screenshot: the Bitstamp tradeview page](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eb44ba-cf50-4020-93bc-acf82b7dc540_2204x1275.png)

We can see very dynamic content: the current price of Bitcoin (or other cryptos), the order book, and the trades, all of which get updated several times per second.

In fact, when loading the page, we can see a WebSocket connection in the Network tab of the browser’s developer tools.

[Screenshot: the WebSocket connection in the Network tab, with the WS filter selected](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83b361ab-4ced-49f8-bdf9-c6f5e5dc25b7_987x359.png)

It’s easier to find it if you click on the WS filter, as shown in the above image.

Step two: what’s going on?

Just like with any other request, we can click on the connection to see how it is established between the client and the server, and which messages are exchanged between the two over the WebSocket.

[Screenshot: the messages exchanged over the WebSocket connection](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5f255c-40c8-42e3-a65d-e7ac2d29b954_807x369.png)

In this case, we can see that the first five messages that the client (our browser) sends to the server are five subscriptions to different channels, which correspond to the dynamic content of the page.

We have channels for the live trades between BTC and USD, the order book for the pair, and the movement of the BTC ticker.

Once we subscribe to these channels, we get flooded by incoming messages for every trade and price change happening.

Let’s see how to collect them in the next step. I’m no expert in real-time scraping, so the examples you’ll see in the next chapter will be very basic, but if you have more experience than me, please reach out since I’d like to know more about this fascinating world.

Step three: collect data in Python

Luckily, Python makes everything look easy to implement.

We start by importing the necessary modules:

import websocket
import json

The websocket module, provided by the websocket-client package, is essential for WebSocket communication in Python: it provides the functionality required to create and manage WebSocket connections.

The core of the program is the creation of a WebSocket client app, like the following:

   
    ws = websocket.WebSocketApp("ws://example.com/websocket",
                                on_open=on_open,
                                on_message=on_message,
                                on_error=on_error,
                                on_close=on_close)

    ws.run_forever()

where each callback handles one of the four events that can occur.

We can add instructions on what happens when the connection is open, when we receive a message back, when we receive an error, or when the connection is closed by one end of the line.

The method run_forever initiates the WebSocket connection and enters a loop that keeps the connection open, allowing the client to send and receive messages continuously. The loop runs indefinitely until the connection is closed.

Let’s apply this to create a scraper for the live trades on Bitstamp.

import json

import websocket  # provided by the websocket-client package

# Headers copied from the browser's Network tab. The handshake headers
# (Sec-WebSocket-Key, Sec-WebSocket-Version) are left out on purpose:
# websocket-client generates them itself for every new connection.
HEADERS = {
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "pragma": "no-cache",
    "sec-gpc": "1",
}


def on_message(ws, message):
    # Print the raw message, then the parsed JSON payload.
    print(message)
    data = json.loads(message)
    print(f"Real-time data: {data}")


def on_error(ws, error):
    print(error)


def on_close(ws, close_status_code, close_msg):
    print("### closed ###")


def on_open(ws):
    # Subscribe to the live trades channel as soon as the connection is open.
    print("Connection established")
    subscription = json.dumps({"event": "bts:subscribe", "data": {"channel": "live_trades_btcusd"}})
    print("Sending message")
    print(subscription)
    ws.send(subscription)
    print("Message sent")


websocket_url = "wss://ws.bitstamp.net/"

ws = websocket.WebSocketApp(websocket_url,
                            on_open=on_open,
                            on_message=on_message,
                            on_error=on_error,
                            on_close=on_close,
                            header=HEADERS)

# Blocks until the connection is closed by either side.
ws.run_forever()

As soon as the connection is open, we subscribe to the live trades channel by sending the same message we’ve seen in the Network tab.
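Since the page itself subscribes to several channels, the same idea extends to more than one subscription. Here is a small sketch of how that could look: the helper subscription_message and the CHANNELS list are my own names, and the second channel is only a guess based on the naming pattern, so check the real channel names in the Network tab before relying on it.

```python
import json


def subscription_message(channel: str) -> str:
    """Build the JSON text of a subscription request for one channel."""
    return json.dumps({"event": "bts:subscribe", "data": {"channel": channel}})


CHANNELS = [
    "live_trades_btcusd",  # the channel seen in the Network tab
    "live_trades_ethusd",  # assumption: same pattern, different pair
]


def on_open(ws):
    # Send one subscription per channel as soon as the connection is open.
    for channel in CHANNELS:
        ws.send(subscription_message(channel))
```

This on_open can be passed to WebSocketApp exactly like the one in the scraper above.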

When creating the WebSocket app, I’ve also added some custom headers, again taken from the Network tab, and that’s it! So damn easy!

If we run the program Bitstamp.py available in The Lab repository, we’ll see the live trades on our monitor.

Inside the on_message function, instead of printing the messages, we could save them to a file or load them directly into a database.
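As a rough idea of the "save to a file" option, the sketch below builds an on_message callback that appends every incoming message to a JSON Lines file. The names make_jsonl_writer and trades.jsonl are hypothetical, and I store the whole parsed message so that no assumption about its fields is needed.

```python
import json
import time


def make_jsonl_writer(path):
    """Return an on_message-style callback that appends messages to `path`."""
    def on_message(ws, message):
        record = {"received_at": time.time(), "message": json.loads(message)}
        # Append mode keeps earlier records; one JSON document per line.
        with open(path, "a", encoding="utf-8") as output_file:
            output_file.write(json.dumps(record) + "\n")
    return on_message


# Hypothetical usage with the scraper:
# ws = websocket.WebSocketApp(websocket_url, on_message=make_jsonl_writer("trades.jsonl"), ...)
```

Opening the file for every message keeps the example short. For a busier stream it would probably be better to keep the file open or to write in batches.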

[Screenshot: live trades printed by the scraper](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ece6d31-6b20-45e7-924d-14ceb608fccf_1973x955.png)

Bonus Example: Sofascore

Other websites that change frequently are the ones linked to betting. The odds and the results of sports events change constantly, so these websites need to refresh their data as fast as possible. Imagine a delay in updating the odds after a goal: what could that mean for a betting agency?

I’m not a great fan of betting, but I’ve heard that Sofascore is one of the most important aggregators of bets around, so I wanted to test whether my WebSocket scraper also worked in this case.

Unfortunately, it doesn’t!

The reason is that while the Bitstamp server and my client exchange JSON messages, Sofascore uses NATS, a different messaging protocol, on top of the WebSocket connection.

In fact, the messages we see from the network tab are not JSON but strings similar to

‘CONNECT {"no_responders":true,"protocol":1,"verbose":false,"pedantic":false,"user":"none","pass":"none","lang":"nats.ws","version":"1.19.1","headers":true}’

But with some modifications to the scraper, we’re able to gather live data from there too, at least for a few minutes.

def on_open(ws):
    print("Connection established")
    # NATS commands copied from the Network tab; each one must end with \r\n.
    messages = [
        'CONNECT {"no_responders":true,"protocol":1,"verbose":false,"pedantic":false,"user":"none","pass":"none","lang":"nats.ws","version":"1.19.1","headers":true}\r\n',
        'SUB sport.football 1\r\n',
        'SUB event.11475137 2\r\n',
        'SUB useraccount.646209b5b4fa7c4eee29dc62 3\r\n',
        'SUB odds.1758455111 4\r\n',
    ]
    for message in messages:
        print("Sending message")
        print(message)
        ws.send(message)
        print("Message sent")
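The commands sent above all have the same shape: a verb, its arguments, and a closing \r\n. As a small sketch, they could also be built programmatically instead of being pasted as strings. The helper names nats_connect and nats_sub are my own, and the options and subjects are simply the ones seen in the Network tab.

```python
import json


def nats_connect(options: dict) -> str:
    """CONNECT command: the verb, a JSON object of options, then CRLF."""
    return "CONNECT " + json.dumps(options, separators=(",", ":")) + "\r\n"


def nats_sub(subject: str, sid: int) -> str:
    """SUB command: the subject to follow and a client-chosen subscription id."""
    return f"SUB {subject} {sid}\r\n"


connect_options = {
    "no_responders": True, "protocol": 1, "verbose": False, "pedantic": False,
    "user": "none", "pass": "none", "lang": "nats.ws", "version": "1.19.1",
    "headers": True,
}
subjects = ["sport.football", "event.11475137"]

# One CONNECT followed by one SUB per subject, numbered from 1.
commands = [nats_connect(connect_options)]
commands += [nats_sub(subject, sid) for sid, subject in enumerate(subjects, start=1)]
```

This makes it easier to follow a different event: only the subjects list changes.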

The issue is that the NATS protocol relies on a periodic exchange of PING and PONG messages to keep the connection alive: the server sends a PING and closes the connection if the client doesn’t answer with a PONG. Our example doesn’t handle this, but we’ll see in the next episodes if I can do a deep dive on it!
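A first step in that direction might look like the sketch below, which I haven't tested against Sofascore: an on_message callback that answers every PING with a PONG and prints everything else. The function name answer_pings is my own, and the line-by-line check is deliberately naive (a payload line that happens to be exactly PING would be misread).

```python
def answer_pings(ws, message):
    """Reply PONG to every PING line; return the other non-empty lines."""
    if isinstance(message, bytes):  # binary frames arrive as bytes
        message = message.decode("utf-8", errors="replace")
    other_lines = []
    for line in message.split("\r\n"):
        if line == "PING":
            ws.send("PONG\r\n")
        elif line:
            other_lines.append(line)
    return other_lines


def on_message(ws, message):
    # Keep the connection alive, then show whatever else arrived.
    for line in answer_pings(ws, message):
        print(line)
```

Passing this on_message to WebSocketApp should, in principle, keep the connection open longer than a few minutes.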