Web Scraping: Intercepting XHR Requests

Have you ever tried scraping AJAX websites? Sites full of Javascript and XHR calls? Decipher tons of nested CSS selectors? Or worse, daily changing selector? Maybe you won't need that ever again. Keep on reading, XHR scraping might prove your ultimate solution!

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit.

pip install playwright
playwright install

Intercept Responses

As we saw in a previous blog post about blocking resources, headless browsers allow request and response inspection. We will use Playwright in python for the demo, but it can be done in Javascript or using Puppeteer.

We can quickly inspect all the responses on a page. As we can see below, the response parameter contains the status, URL, and content itself. And that's what we'll be using instead of directly scraping content in the HTML using CSS selectors.

page.on("response", lambda response: print(
    "<<", response.status, response.url))

Use case: auction.com

Our first example will be auction.com. You might need proxies or a VPN Since it blocks outside of the countries they operate in. Anyway, it might be a problem trying to scrape from your IP since they will ban it eventually. Check out how to avoid blocking if you find any issues.

Here is a basic example of loading the page using Playwright while logging all the responses.

from playwright.sync_api import sync_playwright

url = "https://www.auction.com/residential/ca/"

with sync_playwright() as p:
    browser = p.firefox.launch()
    page = browser.new_page()

    page.on("response", lambda response: print(
        "<<", response.status, response.url))
    page.goto(url, wait_until="networkidle", timeout=90000)

    print(page.content())

    page.context.close()
    browser.close()

auction.com will load an HTML skeleton without the content we are after (house prices or auction dates). They will then load several resources such as images, CSS, fonts, and Javascript. If we wanted to save some bandwidth, we could filter out some of those. For now, we're going to focus on the attractive parts.

As we can see in the network tab, almost all relevant content comes from an XHR call to an assets endpoint. Ignoring the rest, we can inspect that call by checking that the response URL contains this string: if ("v1/search/assets?" in response.url).

There is a size and time problem: the page will load tracking and map, which will amount to more than a minute in loading (using proxies) and 130 requests :O. We could do better by blocking certain domains and resources. We were able to do it in under 20 seconds with only 7 loaded resources in our tests. We will leave that as an exercise for you ;)

<< 407 https://www.auction.com/residential/ca/ 
<< 200 https://www.auction.com/residential/ca/ 
<< 200 https://cdn.auction.com/residential/page-assets/styles.d5079a39f6.prod.css 
<< 200 https://cdn.auction.com/residential/page-assets/framework.b3b944740c.prod.js 
<< 200 https://cdn.cookielaw.org/scripttemplates/otSDKStub.js 
<< 200 https://static.hotjar.com/c/hotjar-45084.js?sv=5 
<< 200 https://adc-tenbox-prod.imgix.net/resi/propertyImages/no_image_available.v1.jpg 
<< 200 https://cdn.mlhdocs.com/rcp_files/auctions/E-19200/photos/thumbnails/2985798-1-G_bigThumb.jpg 
# ...

For a more straightforward solution, we decided to change to the wait_for_selector function. It is not the ideal solution, but we noticed that sometimes the script stops altogether before loading the content. To avoid those cases, we change the waiting method.

While inspecting the results, we saw that the wrapper was there from the skeleton. But each houses' content is not. So we will wait for one of those: "h4[data-elm-id]".

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are insterested in 
        if ("v1/search/assets?" in response.url):
            print(response.json()['result']['assets']['asset'])
    # ...
    page.on("response", handle_response)
    # really long timeout since it gets stuck sometimes
    page.goto(url, timeout=120000)
    page.wait_for_selector("h4[data-elm-id]", timeout=120000)

Here we have the output, with even more info than the interface offers! Everything is clean and nicely formatted 😎

[
  {
    "item_id": "E192003",
    "global_property_id": 2981226,
    "property_id": 5444765,
    "property_address": "13841 COBBLESTONE CT",
    "property_city": "FONTANA",
    "property_county": "San Bernardino",
    "property_state": "CA",
    "property_zip": "92335",
    "property_type": "SFR",
    "seller_code": "FSH",
    "beds": 4,
    "baths": 3,
    "sqft": 1704,
    "lot_size": 0.2,
    "latitude": 34.10391,
    "longitude": -117.50212,
    ...

We could go a step further and use the pagination to get the whole list, but we'll leave that to you.

Use case: twitter.com

Another typical case where there is no initial content is Twitter. To be able to scrape Twitter, you will undoubtedly need Javascript Rendering. As in the previous case, you could use CSS selectors once the entire content is loaded. But beware, since Twitter classes are dynamic and they will change frequently.

What will most probably remain the same is the API endpoint they use internally to get the main content: TweetDetail. In cases like this one, the easiest path is to check the XHR calls in the network tab in devTools and look for some content in each request. It is an excellent example because Twitter can make 20 to 30 JSON or XHR requests per page view.

Once we identify the calls and the responses we are interested in, the process will be similar.

import json
from playwright.sync_api import sync_playwright

url = "https://twitter.com/playwrightweb/status/1396888644019884033"

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are insterested in
        if ("/TweetDetail?" in response.url):
            print(json.dumps(response.json()))

    browser = p.firefox.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()

The output will be a considerable JSON (80kb) with more content than we asked for. More than ten nested structures until we arrive at the tweet content. The good news is that we can now access favorite, retweet, or reply counts, images, dates, reply tweets with their content, and many more.

Use case: nseindia.com

Stock markets are an ever-changing source of essential data. Some sites offering this info, such as the National Stock Exchange of India, will start with an empty skeleton. After browsing for a few minutes on the site, we see that the market data loads via XHR.

Another common clue is to view the page source and check for content there. If it's not there, it usually means that it will load later, which probably requires XHR requests. And we can intercept those!

Since we are parsing a list, we will loop over it a print only part of the data in a structured way: symbol and price for each entry.

from playwright.sync_api import sync_playwright

url = "https://www.nseindia.com/market-data/live-equity-market"

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are insterested in
        if ("equity-stockIndices?" in response.url):
            item = response.json()['data'][1]
            print(item['symbol'], item['lastPrice'])

    browser = p.firefox.launch()
    page = browser.new_page()

    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")

    page.context.close()
    browser.close()

# Output:
# NIFTY 50 18125.4
# ICICIBANK 846.75
# AXISBANK 845
# ...

As in the previous examples, this is a simplified example. Printing is not the solution to a real-world problem. Instead, each page structure should have a content extractor and a method to store it. And the system should also handle the crawling part independently.

Conclusion

We'd like you to go with three main points:

  1. Inspect the page looking for clean data
  2. API endpoints change less often than CSS selectors, and HTML structure
  3. Playwright offers more than just Javascript rendering

Even if the extracted data is the same, fail-tolerance and effort in writing the scraper are fundamental factors. The less you have to change them manually, the better.

Apart from XHR requests, there are many other ways to scrape data beyond selectors. Not every one of them will work on a given website, but adding them to your toolbelt might help you often.

Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

18