THE LAB #1: Scraping data from an app


How to inspect the network traffic of an app with Fiddler Everywhere and scrape the data from its servers.


This is the first post of “THE LAB”: in this series, we’ll cover real-world use cases, with code and an explanation of the methodology used.

Why scrape data from an app?

In this newsletter I usually write about extracting data from websites, but what if our target is an app with no web interface? Or what if the website is too complex to scrape and, knowing the target has an app, we want to test whether there's another door to the data?

Since we cannot see the app's source code, we must intercept the requests it makes to its servers and replicate them in our scraper.

To do so, we need a network traffic analysis tool. In this post we'll use Fiddler Everywhere, but there's plenty of similar software on the market, such as Wireshark.

How Fiddler Everywhere works

Fiddler works like a “man-in-the-middle” between your target app and its servers.

![How Fiddler works](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1689fd0e-c090-4abb-a8f2-ae363ae5fca9_825x424.png)

First of all, we need to install Fiddler on our computer and set it up to decrypt HTTPS traffic, as described in the Fiddler documentation.

The second step is to find the IP address of our computer and configure the network settings on the mobile phone where the app is installed to use our computer as a proxy, routing traffic through the port Fiddler opened.
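Before touching the phone, you can sanity-check the proxy from the computer itself. This is a minimal sketch assuming Fiddler Everywhere's default port 8866 and a hypothetical LAN address; if everything is wired up, the request appears in Fiddler's capture list:

```python
import requests

# Route a request through Fiddler and confirm it shows up in the capture.
# 192.168.1.10 is a placeholder: use your computer's actual LAN IP address.
proxies = {
    "http": "http://192.168.1.10:8866",
    "https": "http://192.168.1.10:8866",
}

# verify=False because Fiddler re-signs HTTPS traffic with its own certificate
response = requests.get("https://example.com", proxies=proxies, verify=False)
print(response.status_code)
```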

From now on, every request made by our phone to external servers will show up in Fiddler. This is extremely useful for understanding what happens after installing a new app and for spotting hidden calls to external services we weren't aware of.

Start scraping from an app: a real case

Going back to our web scraping project, let’s see a real-world case of how to use Fiddler.

We'll use the app of the well-known streetwear brand Off-White as an example, even though they also have a website to scrape data from.

Before starting our test, let's close all the background apps on the phone to reduce the noise in the captured requests.

As soon as we open the app, we'll see a stream of requests to Facebook and to the Off-White internal API.

![List of requests from the app](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6571316-5044-48c2-85ca-420dfdcea7f2_1905x998.png)

Let’s try to open a catalog page to see what happens.

I browsed to men's ready-to-wear and, after entering the section, noticed this request.

![API requests made by the app](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48c6ef70-7f73-4965-a864-fc18025e1294_1912x1021.png)

Bingo!

There’s an API that retrieves the product information from the server.

Let's try to replicate the same request, with the same headers, in a Scrapy project to see if it works.

I'll start with a basic scraper that makes only this API call, so after starting a new Scrapy project, my spider looks like the following.

![Scrapy first lines](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7db3bffa-c736-4f10-94ee-21739044a408_624x338.png)
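Since the spider only exists as a screenshot here, this is a minimal sketch of what it can look like; the class name is arbitrary and the endpoint is a placeholder for the URL captured in Fiddler:

```python
import scrapy


class OffwhiteSpider(scrapy.Spider):
    name = "offwhite"
    # placeholder endpoint: the real URL is the one intercepted with Fiddler
    start_urls = ["https://api.example.com/catalog/products?page=1"]

    def parse(self, response):
        # first test: just dump the raw JSON returned by the API
        print(response.text)
```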

The settings.py file, meanwhile, sets the DEFAULT_REQUEST_HEADERS variable with almost all the headers we read in Fiddler.

![Scrapy default headers](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52662bd-146b-4fef-aa3c-97608526e69d_603x262.png)
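In text form, that settings block looks roughly like this; the header names and values are illustrative, since in practice you copy them from the Fiddler capture:

```python
# settings.py (illustrative values, copied from Fiddler in a real run)
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "okhttp/4.9.1",  # hypothetical app user agent
    "Authorization": "Bearer <token-copied-from-fiddler>",
    "Content-Type": "application/json",
    # "Accept", "Accept-Encoding" and "Cookie" are deliberately left out, see below
}
```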

As you may have noticed, I removed the "Accept" and "Accept-Encoding" entries; otherwise, Scrapy would mangle the response output format.

I also removed the Cookie header, to be sure this configuration will keep working in the future and not just now, while we happen to hold a valid cookie.

For the moment we'll use the bearer token read from Fiddler, even though it will probably stop working within a few hours; we'll see later how to fetch it dynamically.

The scraper should print the resulting JSON on the terminal.

![Scrapy results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae0ee9a-24f2-4e54-a519-d2161cff9d76_1902x985.png)

It works!

Now let's define the Item properties and fields and read the JSON. I suggest copying the whole JSON output into any online formatter tool, like https://jsonformatter.curiousconcept.com/; it helps a lot in understanding the structure.
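If you prefer to stay in the terminal, the standard library can pretty-print the payload too, for example from inside the spider's parse method:

```python
import json

# pretty-print the raw response body to make its structure readable
print(json.dumps(json.loads(response.text), indent=2))
```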

![Item class declared in Scrapy](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4a0c11e8-920d-4205-ba37-b08f344f3e38_723x483.png)
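As a rough sketch, assuming typical product-feed fields (the real names are the ones in the screenshot), the Item class can look like this:

```python
import scrapy


class OffwhiteItem(scrapy.Item):
    # assumed fields for a product feed; adapt them to the actual JSON
    product_id = scrapy.Field()
    name = scrapy.Field()
    category = scrapy.Field()
    price = scrapy.Field()
    currency = scrapy.Field()
```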

![JSON formatted result](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6614ca-d8eb-4910-aec7-3cb6adc397fd_670x719.png)

Let's use the JSON structure to map the response values onto the Item fields.
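A minimal sketch of that mapping, inside the spider, assuming hypothetical key names like "products" and "price":

```python
import json

def parse(self, response):
    data = json.loads(response.text)
    # "products" and the inner keys are assumptions about the payload structure
    for product in data.get("products", []):
        item = OffwhiteItem()
        item["product_id"] = product.get("id")
        item["name"] = product.get("name")
        item["price"] = product.get("price")
        item["currency"] = product.get("currency")
        yield item
```

With the mapping in place, the run looks like the following.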

![Scrapy first 60 results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c9617-8bc5-47dc-85aa-8d8b373fa197_1140x926.png)

60 items scraped: perfect, the same number of items as in the JSON!

A step further: scraping a whole category from the app

As we can see from the JSON, this category has 337 items in total, spread over six pages, while we are requesting only page one.

The problem is that even the app doesn't let you see more than sixty products per category, at least with this request configuration.

Roaming around the app some more, I found a new call I hadn't noticed before that could help us scroll through a whole category.

Now we are ready to edit the scraper to download all the items in this category: we'll iterate the API call, advancing through the pages until the end of the product category.

![Scraping the whole product category](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0995e74-2004-422d-9965-d383649d0c3e_1217x732.png)
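In text form, the pagination logic is roughly the following; the "page" and "totalPages" key names, as well as the api_url attribute, are assumptions:

```python
def parse(self, response):
    data = json.loads(response.text)
    for product in data.get("products", []):
        yield self.build_item(product)  # hypothetical helper wrapping the field mapping

    # keep requesting the next page until the category is exhausted
    page = data.get("page", 1)
    total_pages = data.get("totalPages", 1)
    if page < total_pages:
        next_url = f"{self.api_url}?page={page + 1}"
        yield scrapy.Request(next_url, callback=self.parse)
```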

Good, 337 items were scraped! That’s exactly the number of items listed in the API.

![Scrapy whole category results](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F17e890b4-d8fa-4ebe-9708-a0e9587e38e3_1394x938.png)

The next step is to store them in a CSV file.

In all my projects, and at Databoutique, we use a custom CSV exporter that lets us change the field delimiter, the file extension, and the column order via the settings.py file.

![Scrapy custom CSV exporter](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0909b986-44aa-44a7-9f24-3d2064d073dd_690x336.png)
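A sketch of such an exporter, assuming custom setting names like CSV_DELIMITER and CSV_EXPORT_FIELDS (these are not standard Scrapy settings):

```python
# exporters.py
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class CustomCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        settings = get_project_settings()
        # delimiter and column order come from settings.py
        kwargs["delimiter"] = settings.get("CSV_DELIMITER", ";")
        kwargs["fields_to_export"] = settings.get("CSV_EXPORT_FIELDS")
        super().__init__(*args, **kwargs)
```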

In the settings.py file, we set the delimiter and the order of the columns.

![Scrapy settings output fields](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9e6bd71a-cd3a-4768-a291-acfe16336170_1177x84.png)
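Wired together, the relevant settings might look like this; the module path and field names are assumptions:

```python
# settings.py — register the exporter and configure the output
FEED_EXPORTERS = {"csv": "offwhite.exporters.CustomCsvItemExporter"}
CSV_DELIMITER = ";"
CSV_EXPORT_FIELDS = ["product_id", "name", "category", "price", "currency"]
```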

This way, when we launch the scraper with the command `scrapy crawl offwhite -o offwhite.csv -t csv`, the results are saved to offwhite.csv with the format we configured.

Scraping the whole Italian catalog from the app

Now we'd like to scrape all the categories available in the app, so let's go back to Fiddler to see where we can find a list of them.

After a bit of back and forth in the app, trying to understand when the category menu is generated, I finally spotted in Fiddler the following API call, which seems to return the product hierarchy.

![API to get all the categories from Fiddler](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F5478d0a4-e083-4089-b276-2269cd51b92a_1153x180.png)

By including the parsing of the product-categories JSON in the scraper code, we can query the API for every category. All we need is the category name, which we derive from the "link" property, and the category id.

![JSON with all the categories](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F37e535a7-fb46-4acc-a7c2-eabe3c25ab96_812x1065.png)

Let’s rework the scraper to integrate this API call and explore every category we find in the hierarchy.
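A sketch of that integration: a callback that walks the hierarchy and queries each category, where "categories", "link" and "id" reflect the JSON above and everything else is assumed:

```python
def parse_categories(self, response):
    data = json.loads(response.text)
    for category in data.get("categories", []):
        # derive the category name from the "link" property
        name = category.get("link", "").strip("/")
        category_id = category.get("id")
        # placeholder URL shape: the real one comes from the Fiddler capture
        url = f"{self.api_url}?category={category_id}&page=1"
        # cb_kwargs forwards the name to parse(), which must accept it
        yield scrapy.Request(url, callback=self.parse,
                             cb_kwargs={"category_name": name})
```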

![Results for the whole website with Scrapy](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F71687031-c2ad-4644-9a20-3e0c5a2457ba_1405x789.png)

With the latest run, we scraped 1163 items! Great!

Retrieving the bearer token dynamically from the app

During the previous tests, I had to change the bearer token in the header manually, since it lasts only one hour and is then no longer valid for API calls.

![API call to get the Bearer Token](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F62ee3d25-514a-47ea-b3c1-7b1d67b85fdf_1796x185.png)

Once again Fiddler helps us understand how the app works and how to extend our scraper, adding the token request as the first step of the process.

![Implementing the call in Scrapy](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7b96868c-44f8-4bd5-aefe-259a9c332f3d_1352x596.png)

As we can see, it's a POST request with a guest user ID passed as a parameter. Since it uses different headers than the other requests, I rewrote the scraper to pass them explicitly.
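A sketch of how that first step can chain into the rest of the crawl; the endpoint, payload shape, and response key are assumptions based on the captured call:

```python
import json

import scrapy


class OffwhiteSpider(scrapy.Spider):
    name = "offwhite"
    categories_api_url = "https://api.example.com/categories"  # placeholder

    def start_requests(self):
        # placeholder endpoint and payload: the real ones come from Fiddler
        yield scrapy.Request(
            "https://api.example.com/auth/token",
            method="POST",
            headers={"Content-Type": "application/json"},
            body=json.dumps({"guestUserId": "<value-read-from-fiddler>"}),
            callback=self.parse_token,
        )

    def parse_token(self, response):
        token = json.loads(response.text).get("access_token")  # assumed key
        headers = {"Authorization": f"Bearer {token}"}
        # continue with the category listing, now carrying a fresh token
        yield scrapy.Request(self.categories_api_url, headers=headers,
                             callback=self.parse_categories)
```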

After the latest run we still got 1169 items, so the scraper keeps working, and it should keep doing so in the coming days, at least until something changes in the app.

Recap of the topics covered

  • Using Fiddler Everywhere to track the traffic from the apps on your phone to their servers

  • Using an internal API to scrape a website

  • Scraping an API protected by bearer authorization tokens