The seven scraping commandments

  1. Don't Lie about your User Agent
  2. Don't scrape too aggressively
  3. Pick a scraping/crawling architecture that matches your needs
  4. Learn from the Mistakes of Professional Scraping Services
  5. Don't disregard Fingerprinting
  6. Be Aware of Behavioral UI Analysis
  7. Side Channel Attacks can Reveal that you are a Bot

Introduction

I have been creating scrapers for years. Most of them were quite rubbish, but in that painful process you learn from some common mistakes. In this blog post, I share some of the most frequent mistakes that I spot in the wild and that I have made myself. Furthermore, I give general advice on how to remain (somewhat) undetected when scraping.

Mandatory note: Many large websites actually don't sanction scraping that much. I would count Google and Amazon among the sites that only moderately prevent scraping. The reason is obvious: those platforms massively profit when other people and companies use their data in their products. Google and Amazon are heavy players when it comes to E-Commerce and Online Marketing, so they both have an incentive to allow third-party tools to access their platforms to a certain degree.

Other sites such as Instagram and LinkedIn are much more aggressive when it comes to blocking scrapers. They'll ban ill-behaved user agents on the first suspicious activity. Websites such as LinkedIn are practically impossible to use without an account.

Therefore, the advice in this blog post might be too strict or too lax for your specific use case. Keep that in mind.

Furthermore, this advice does not apply exclusively to scraping or crawling. Scraping often has a very negative connotation.

But an increasingly large percentage of people use frameworks such as puppeteer or playwright to automate otherwise mundane tasks. Therefore, there are many legitimate reasons why you would want to teach your software to navigate websites like a real human being.

7 Common Mistakes in Professional Scraping

1. Don't Lie about your User Agent

If you are scraping with puppeteer and headless Chrome on an Amazon EC2 instance and you set your user agent to that of an iPhone, such as

Mozilla/5.0 (iPhone; CPU iPhone OS 14_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1

websites have a million ways to find out that you are lying.

For example, what happens when you forget to adjust the user agent accordingly in the navigator.userAgent and navigator.appVersion properties? Or what if you forget to spoof the navigator.platform property to the correct iPhone platform?

Those static strings are easy to fix, but the browser exposes such an extremely vast API to websites that it is impossible to fix it in its entirety.
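To give you an idea, here is a rough sketch of the kind of consistency check a website could run against a browser that claims to be an iPhone. The function name and the exact checks are just illustrative; real detection scripts look at far more signals:

// Hypothetical consistency check a website might run (only a tiny subset of possible checks)
function looksLikeSpoofedIphone() {
  const claimsIphone = /iPhone/.test(navigator.userAgent);

  // A real iPhone reports an iPhone platform and supports touch
  const platformMismatch = claimsIphone && navigator.platform !== 'iPhone';
  const noTouchSupport = claimsIphone && navigator.maxTouchPoints === 0;

  // Safari on iOS never exposes window.chrome
  const chromeLeak = claimsIphone && typeof window.chrome !== 'undefined';

  return platformMismatch || noTouchSupport || chromeLeak;
}

console.log('suspicious:', looksLikeSpoofedIphone());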

It is much harder to fix the following things to behave like a true iPhone device:

  • WebGL rendering and audio fingerprints
  • The Permissions API: iOS devices have very particular permission restrictions
  • Correct screen dimensions. And I mean ALL OF THEM.
  • Mobile touch event emulation
  • The Battery Status API
  • deviceorientation and devicemotion events

And there are many other APIs that behave differently on an iPhone compared to other mobile devices.

The following websites are quite good at detecting such inconsistencies:

  1. creepjs
  2. pixelscan.net
  3. bot.incolumitas.com (yeah I know)

It is just extremely hard to convincingly pretend you are an iPhone when in reality you are a headless chrome browser in some cloud infrastructure.

For example, when you run the following code:

const { webkit, devices } = require('playwright');

// Device descriptor for a Pixel 2 XL (an Android phone that ships with a Blink-based browser)
const androidDevice = devices['Pixel 2 XL'];

(async () => {
  // Note: emulating an Android device on top of the WebKit engine is already a strong giveaway
  const browser = await webkit.launch({ headless: false });
  const context = await browser.newContext({
    ...androidDevice,
    locale: 'en-US',
  });
  const page = await context.newPage();
  await page.goto('https://bot.incolumitas.com/');
  await page.screenshot({ path: 'botOrNot.png' });
  await browser.close();
})();

you will find so many cues why the user agent just can't possibly be a real Pixel smartphone.

What to do instead?

I'd suggest not lying about the user agent you truly are. If your automated browser is running on a Linux system in the cloud, don't alter your user agent. Although Linux systems are quite rare in the wild, they are still legitimate user agents that websites should not block.

2. Don't scrape too aggressively

This is another common mistake I see people making: don't scrape too aggressively. After all, you are interested in public data; you don't want to launch a Denial-of-Service attack against a website. So please be considerate.

Furthermore, if your scraping becomes a major pain for the website's administrators, they will be extra careful to block all illegitimate traffic.

So it's better to throttle your scraping and to stay below the radar.
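What throttling could look like in practice is shown in the following minimal Node.js sketch. The delay values and the fetchPage callback are placeholders that you would adapt to your target site:

// Minimal throttling sketch: random delays between requests (the delay values are made up)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeScrape(urls, fetchPage) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
    // Wait 5 to 15 seconds before the next request instead of hammering the server
    await sleep(5000 + Math.random() * 10000);
  }
  return results;
}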

3. Pick a scraping/crawling architecture that matches your needs

In the past, I used AWS Lambda for many larger scraping projects. I managed to scrape millions of Google SERPs within days on AWS Lambda (even without any external proxies; just by using the AWS Lambda public IP address pool, I was good to go).

There exist mature Node.js modules that ship chromium binaries specifically compiled for the AWS Lambda runtime.
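chrome-aws-lambda is one such module. Based on its documented usage, a minimal Lambda handler looks roughly like this; the event shape and the target URL are placeholders:

const chromium = require('chrome-aws-lambda');

exports.handler = async (event) => {
  let browser = null;
  try {
    browser = await chromium.puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless,
    });
    const page = await browser.newPage();
    // The event shape is an assumption; pass whatever your invocation provides
    await page.goto(event.url || 'https://example.com');
    return await page.title();
  } finally {
    if (browser !== null) {
      await browser.close();
    }
  }
};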

But in order to run Chrome on AWS Lambda, you need to launch the browser with many special command line arguments:

/**
  * Returns a list of recommended additional Chromium flags.
  */
static get args() {
  const result = [
    '--autoplay-policy=user-gesture-required',
    '--disable-background-networking',
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-breakpad',
    '--disable-client-side-phishing-detection',
    '--disable-component-update',
    '--disable-default-apps',
    '--disable-dev-shm-usage',
    '--disable-domain-reliability',
    '--disable-extensions',
    '--disable-features=AudioServiceOutOfProcess',
    '--disable-hang-monitor',
    '--disable-ipc-flooding-protection',
    '--disable-notifications',
    '--disable-offer-store-unmasked-wallet-cards',
    '--disable-popup-blocking',
    '--disable-print-preview',
    '--disable-prompt-on-repost',
    '--disable-renderer-backgrounding',
    '--disable-setuid-sandbox',
    '--disable-speech-api',
    '--disable-sync',
    '--disk-cache-size=33554432',
    '--hide-scrollbars',
    '--ignore-gpu-blocklist',
    '--metrics-recording-only',
    '--mute-audio',
    '--no-default-browser-check',
    '--no-first-run',
    '--no-pings',
    '--no-sandbox',
    '--no-zygote',
    '--password-store=basic',
    '--use-gl=swiftshader',
    '--use-mock-keychain',
  ];

  if (Chromium.headless === true) {
    result.push('--single-process');
  } else {
    result.push('--start-maximized');
  }

  return result;
}

I am quite positive that normal Chrome browsers are not launched with those command line arguments, and that at least some of those settings are detectable by the websites being visited.

Furthermore, there are just too many disadvantages when scraping with AWS Lambda:

  • The Lambda runtime is very restrictive
  • You have to test every feature two or three times before it works on AWS Lambda
  • It is more expensive than self-managed VPS servers
  • It is a major pain in the ass for debugging
  • Deployments eat up quite some time

In the long run, those disadvantages kill the two or three plus points that Lambda brings to the table:

  • AWS Lambda caches state between invocations and avoids many cold starts
  • You only pay for what you use
  • You do not have to manage servers yourself

How should I set up my scraping/crawling infrastructure then?

It depends on how strong the anti-bot defenses of the site you are trying to scrape are.

Nowadays, I would suggest using something like a proper Google Chrome Docker image together with a cluster management toolchain such as Docker Swarm or Kubernetes, and renting VPS servers as scraping demand increases. Maybe you can use Rancher and its many VPS vendor integrations to speed up cluster deployments.

Some propositions for possible scraping/crawling architectures, in order of increasingly strong anti-scraping measures:

  1. No Scraping Defenses - Use curl and switch User-Agents and other HTTP Headers once in a while...

  2. Easy Scraping Target - If the website is relatively easy to scrape, use AWS Lambda + chrome-aws-lambda (with the shipped chromium compiled for the AWS runtime) + plugin-stealth + some proxy provider

  3. Some Anti Scraping Defense - If you need a real Google Chrome browser for more stealth, use a Docker image such as the one from browserless.io and rent VPS servers from a provider such as DigitalOcean, AWS EC2 or Hetzner. You can use the raw Chrome DevTools Protocol (CDP) for automation (see the sketch after this list); a good collection of tools that use the CDP can be found here. Furthermore, you can simulate mouse movements and keyboard events with a UI automation library such as PyAutoGUI. Your Docker image should also launch a virtual frame buffer such as Xvfb to simulate a graphical user interface.

  4. Advanced Anti Scraping Defense - If the above setup is still not enough, you can rent physical devices and conduct your scraping there. You can rent real devices for browser testing on sites such as browserstack.com or AWS Device Farm. It's best to also rent proxies, since the default IP addresses are probably already flagged.

  5. Brutal Anti Scraping Defense - As a final measure, if all of the above fails, you could of course buy your own collection of Android mobile devices and get a cheap mobile data plan. Then there is nothing to block, because you are a real device after all, without pretending to be something else ;) You can buy cheap Android devices starting at around 69 USD, and a simple 10GB data plan should not cost more than $10 a month. You can then install a lightweight Android distribution such as Android Go and you are good to go. With a 4G connection, your IP address changes automatically whenever you toggle Airplane mode on and off. I currently do not have an exact idea how to automate the scraping there, but it looks like appium.io might be a solution. Anyhow, it is important to humanize all interactions with the browser/apps there as well.
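As a rough sketch for option 3, this is what driving a real Chrome instance over the raw CDP could look like with the chrome-remote-interface module, assuming Chrome was started inside the container with --remote-debugging-port=9222:

const CDP = require('chrome-remote-interface');

(async () => {
  let client;
  try {
    // Connect to a Chrome instance started with --remote-debugging-port=9222
    client = await CDP({ host: 'localhost', port: 9222 });
    const { Page, Runtime } = client;
    await Page.enable();
    await Page.navigate({ url: 'https://bot.incolumitas.com/' });
    await Page.loadEventFired();
    const title = await Runtime.evaluate({ expression: 'document.title' });
    console.log(title.result.value);
  } catch (err) {
    console.error(err);
  } finally {
    if (client) {
      await client.close();
    }
  }
})();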

4. Learn from the Mistakes of Professional Scraping Services

There are many professional scraping services that you can research. They have learned over the years, and you can see how they camouflage their scrapers. Sometimes even the professional services make mistakes though.

ScrapingBee

For example, they set the user agent to either

  1. Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
  2. Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36

but they forget to spoof the navigator.platform property accordingly. It is still set to "platform": "Linux x86_64". Obviously, that is very suspicious and would get flagged by a lot of bot detection software.

scraperapi.com

This is another commercial scraping service. Their scrapers also exhibit some weird behavior.

For example, their scraping browsers have the timezone set to Etc/Unknown, which is a very obscure value.

You can obtain the timezone with the JavaScript snippet (new window.Intl.DateTimeFormat).resolvedOptions().timeZone.

Furthermore, the scraperapi.com scrapers have no WebGL support. It is not possible to obtain the video card settings.

Usually, normal browsers have a video card such as:

"videoCard": [
  "Intel Inc.",
  "Intel Iris OpenGL Engine"
]

You can obtain your video card brand names with the following script:

(function getVideoCardInfo() {
  const gl = document.createElement('canvas').getContext('webgl');

  if (!gl) {
    return {
      error: "no webgl",
    };
  }

  const debugInfo = gl.getExtension('WEBGL_debug_renderer_info');

  if(debugInfo){
    return {
      vendor: gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL),
      renderer:  gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL),
    };
  }

  return {
    error: "no WEBGL_debug_renderer_info",
  };
})()

scrapingrobot.com

Some issues with scrapingrobot.com are the following:

The scrapers of scrapingrobot.com are using the following screen size properties:

"dimensions": {
  "window.outerWidth": 800,
  "window.outerHeight": 600,
  "window.innerWidth": 2470,
  "window.innerHeight": 1340,
  "window.screen.width": 2560,
  "window.screen.height": 1440
}

The window.outerWidth and window.outerHeight properties should never be smaller than the window.innerWidth and window.innerHeight dimensions. This is a very strong indication that the browser was messed with.
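A detection script only needs a few lines to catch such a mismatch. A sketch of such a plausibility check (the function name is made up):

// Sketch of a plausibility check on the reported window geometry
function windowGeometryLooksFake() {
  const outerSmallerThanInner =
    window.outerWidth < window.innerWidth ||
    window.outerHeight < window.innerHeight;

  const innerLargerThanScreen =
    window.innerWidth > window.screen.width ||
    window.innerHeight > window.screen.height;

  return outerSmallerThanInner || innerLargerThanScreen;
}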

Furthermore, their scrapers don't have any plugin information (navigator.plugins) associated with them. This is very uncommon for legitimate Chrome browsers. Usually, every Chrome browser has standard plugin information such as

"plugins": [
  {
    "name": "Chrome PDF Plugin",
    "description": "Portable Document Format",
    "mimeType": {
      "type": "application/x-google-chrome-pdf",
      "suffixes": "pdf"
    }
  },
  {
    "name": "Chrome PDF Viewer",
    "description": "",
    "mimeType": {
      "type": "application/pdf",
      "suffixes": "pdf"
    }
  },
  {
    "name": "Native Client",
    "description": "",
    "mimeType": {
      "type": "application/x-nacl",
      "suffixes": ""
    }
  }
]

With scrapingrobot.com, this information looks like this:

"plugins": [
  {
    "mimeType": null
  },
  {
    "mimeType": null
  }
]

Another issue with the scrapers of scrapingrobot.com is that they don't have any multimedia devices (navigator.mediaDevices) associated with the browser. Usually, a normal browser has at least one multimedia device associated with it (such as speakers, microphones or webcams); a sketch of how such counts can be collected follows after the two examples below.

Normal:

"multimediaDevices": {
  "speakers": 1,
  "micros": 1,
  "webcams": 1
}

scrapingrobot.com:

"multimediaDevices": {
  "speakers": 0,
  "micros": 0,
  "webcams": 0
}
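For reference, such counts can be collected with the standard navigator.mediaDevices.enumerateDevices() API. A minimal sketch:

// Counting media devices by kind; device labels stay hidden without permission, but the kinds are visible
async function countMediaDevices() {
  if (!navigator.mediaDevices || !navigator.mediaDevices.enumerateDevices) {
    return { speakers: 0, micros: 0, webcams: 0 };
  }
  const devices = await navigator.mediaDevices.enumerateDevices();
  return {
    speakers: devices.filter((d) => d.kind === 'audiooutput').length,
    micros: devices.filter((d) => d.kind === 'audioinput').length,
    webcams: devices.filter((d) => d.kind === 'videoinput').length,
  };
}

countMediaDevices().then(console.log);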

5. Don't disregard Fingerprinting

Scraping is so hard because there are endless ways to fingerprint a device. You can fingerprint devices on the following levels:

  1. TCP/IP fingerprinting (for example with p0f)
  2. TLS fingerprinting (for example with ja3)
  3. Browser/JavaScript fingerprinting (for example with fingerprintjs2)
  4. HTTP header fingerprinting
  5. ... and probably many other stacks where fingerprinting is applicable

But why is fingerprinting so relevant for scraping?

Remember, when you want to create a scraper that is able to request a certain website many thousands of times in a short period, your overall goal is for your scraping traffic to appear organic and human-like.

Put differently, your goal is to make it as hard as possible to cluster and correlate your scraping user agents into groups.

That is also the reason why you are using proxies. Each scraper instance changes its IP address after a couple of requests (sometimes after as few as 20).

But what happens if you request a certain website 1,000,000 times and you change the IP address on every 500th request? Then the website cannot reasonably block you on the IP address level, but it can still infer that you are the very same device, because your browser fingerprint most likely does not change when the IP address changes.

See?

So what exactly constitutes a good fingerprint?

A good fingerprint needs to have as much entropy as possible, while at the same time aiming to be resilient against minor changes!

Note that those two optimization goals contradict each other! You can't maximize both at the same time. It's a typical optimization problem.

For example, a good browser fingerprint should not change when the browser is updated. Therefore, the User-Agent is not a good entropy source for a fingerprint. Furthermore, a good fingerprint should not change when the user switches into incognito mode. Therefore, cookies and other server-side-set data are no good either!

The much-liked open source project fingerprintjs2 uses the following entropy sources to build its fingerprint:

export const sources = {
  osCpu: getOsCpu, // navigator.oscpu
  languages: getLanguages, // navigator.language and navigator.languages
  colorDepth: getColorDepth, // window.screen.colorDepth
  deviceMemory: getDeviceMemory, // navigator.deviceMemory
  screenResolution: getScreenResolution, // screen.width and screen.height
  availableScreenResolution: getAvailableScreenResolution, // screen.availWidth and screen.availHeight
  hardwareConcurrency: getHardwareConcurrency, // navigator.hardwareConcurrency
  timezoneOffset: getTimezoneOffset, //
  timezone: getTimezone, // (new window.Intl.DateTimeFormat).resolvedOptions().timeZone
  sessionStorage: getSessionStorage, // !!window.sessionStorage
  localStorage: getLocalStorage, // !!window.localStorage
  indexedDB: getIndexedDB, // !!window.indexedDB
  openDatabase: getOpenDatabase, // !!window.openDatabase
  cpuClass: getCpuClass, // navigator.cpuClass
  platform: getPlatform, // navigator.platform
  plugins: getPlugins, // navigator.plugins
  canvas: getCanvasFingerprint, // 
  touchSupport: getTouchSupport,// navigator.maxTouchPoints
  fonts: getFonts, //
  audio: getAudioFingerprint, //
  pluginsSupport: getPluginsSupport, // !!navigator.plugins
  productSub: getProductSub, // navigator.productSub
  emptyEvalLength: getEmptyEvalLength, // eval.toString().length
  errorFF: getErrorFF, //
  vendor: getVendor, // navigator.vendor
  chrome: getChrome, // window.chrome !== undefined
  cookiesEnabled: areCookiesEnabled, // check document.cookie writable
}

As you can see, these entropy sources are more or less stable on their own. But concatenated and hashed with SHA-2, they have enough entropy to be quite unique.
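To make this concrete, here is a toy sketch that combines a handful of these sources and hashes them with SHA-256 (a member of the SHA-2 family) via the Web Crypto API. This is of course much simpler than what fingerprintjs2 actually does, and it only works in a secure context (https):

// Toy fingerprint: concatenate a few entropy sources and hash them with SHA-256
async function simpleFingerprint() {
  const components = [
    navigator.platform,
    navigator.hardwareConcurrency,
    navigator.deviceMemory,
    window.screen.width + 'x' + window.screen.height,
    window.screen.colorDepth,
    (new Intl.DateTimeFormat()).resolvedOptions().timeZone,
    (navigator.languages || []).join(','),
  ].join('||');

  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(components));
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}

simpleFingerprint().then(console.log);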

6. Be Aware of Behavioral UI Analysis

Some anti-bot companies such as PerimeterX and Shape Security record mouse movements and other user-generated UI data such as scroll events or key presses.

The idea is to record user-generated JavaScript events such as mousemove, scroll, click and keydown from any visiting browser and to attach a relative timestamp with performance.now() to each such event. The data is transmitted with navigator.sendBeacon(), an image pixel or in real time with WebSockets.

This will result in a time series of user generated events for the time spent on the recorded website.

With this data (or the lack of the data), you can derive several conclusions about the behavior of the visiting user.
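A sketch of how such a recorder could look on the website's side; the /collect endpoint is a placeholder:

// Sketch of a behavioral recorder
const recordedEvents = [];

['mousemove', 'scroll', 'keydown', 'click', 'touchstart'].forEach((type) => {
  window.addEventListener(type, (e) => {
    // Store the event type, coordinates (if any) and a relative timestamp
    recordedEvents.push([type, e.clientX ?? null, e.clientY ?? null, performance.now()]);
  }, { passive: true });
});

window.addEventListener('pagehide', () => {
  // sendBeacon() still delivers the payload while the page is being unloaded
  navigator.sendBeacon('/collect', JSON.stringify(recordedEvents));
});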

For example, all of the above researched scraping companies

  1. ScrapingBee
  2. scrapingrobot.com
  3. scraperapi.com

do not produce such UI events for their headless scrapers. All they produce is the following event trace (scrapingbee.com was used as an example):

[
  ["DOMContentLoaded", 2716.16],
  ["load", 3363.74],
  ["pagehide",false, 4728.83],
  ["visibilitychange","hidden", 4729.195],
  ["unload", 4729.485],
]

As you can see, their scraper stays on the website for only about 1.4 seconds (the time difference between the load and pagehide events).

Compare this to an event trace of a real human visitor:

[
  ["DOMContentLoaded",373.36],
  ["load",428.55],
  ["mousemove",948,30,749.33],
  ["mousemove",969,119,765.54],
  ["mousemove",1000,176,781.81],
  ["mousemove",1053,218,798.47],
  ["mousemove",1133,227,,815.335],
  ["mousemove",1222,218,831.815],
  ["mousemove",1300,185,848.86],
  ...
  ["beforeunload",1632.67],
  ["pagehide",false,1635.35],
  ["visibilitychange","hidden",1635.57],
  ["unload",1635.85],
  ["visibilitychange","hidden",1639.87],
]

This event trace was produced by moving the mouse quickly and then leaving the page.

So the real question is: Isn't it quite common that some people just open a page, but never choose to interact with the page?

This is not an easy question to answer, but in reality, most humans that open a website and quickly navigate away at least emit a few mousemove and scroll events. Even if they navigate the page with the keyboard, they usually emit some keyboard combination that switches the tab or closes the page.

It is quite rare that real human beings just open a page and never interact with it. On desktop systems, this could for example happen if you are browsing a page and open a link with the middle mouse button: this opens the page in a new tab without switching to it.

Nevertheless, a pattern such as the above is quite rare for real human user agents. If the above event trace appears a lot on a website, it is relatively safe to assume that the user agent is a robot.

How can you equip your scraper with realistic, human-like synthetic behavioral data?

There is a module called ghost-cursor which simulates human mouse movements with the help of Bezier curves and Fitts's law. According to Wikipedia, Fitts's law suggests:

A movement during a single Fitts's law task can be split into two phases:

  1. Initial movement. A fast but imprecise movement towards the target. The first phase is defined by the distance to the target. In this phase the distance can be closed quickly while still being imprecise.
  2. Final movement. Slower but more precise movement in order to acquire the target. The second movement tries to perform a slow and controlled precise movement to actually hit the target.

It's best to use ghost-cursor in combination with plugin-humanize according to the authors.

For example, one big problem with ghost-cursor: mouse movement always starts at the exact origin (0, 0). Real humans start their mouse movement somewhere random on the page or somewhere on the top/left side of the window.
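Setting that caveat aside, basic usage of ghost-cursor with puppeteer looks roughly like this. The selector is made up; check the ghost-cursor README for the options your version supports:

const puppeteer = require('puppeteer');
const { createCursor } = require('ghost-cursor');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  const cursor = createCursor(page);

  await page.goto('https://bot.incolumitas.com/');
  // Moves the mouse along a curved, human-like path before clicking
  await cursor.click('#some-button'); // the selector is made up
  await browser.close();
})();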

Another alternative is to use the fairly new scraping browser project secret-agent. The documentation of its HumanEmulators promises:

HumanEmulators are plugins that sit between your script and SecretAgent's mouse/keyboard movements. They translate your clicks and moves into randomized human-like patterns that can pass the bot-blocker checks.

I haven't tested secret-agent extensively, but it appears to be a very ambitious project. Could be too ambitious after all. Sometimes, simplicity is key.

7. Side Channel Attacks can Reveal that you are a Bot

Side channel attacks against browsers form an endlessly large group. When you are using a real browser for your scraping projects, you are exposing an extremely vast side channel attack surface to the website you are visiting.

To give you a quick overview, here are some examples of things that leak information about the environment where your browser (and thus your scraper) is running:

  1. Browser Red Pills - Timing JavaScript code with performance.now() in order to make educated guesses whether the browser is running in a virtual machine (a toy sketch follows after this list)
  2. Proof of Work Captchas such as friendlycaptcha
  3. DNS Leaks - They occur when the DNS name is not resolved through the anonymizing tunnel that you are using to hide your browser traffic
  4. WebRTC Leaks
  5. Browser Based Port Scanning - Port scanning the internal network where your scraper is running might reveal a lot about your network. Or for example that you are running a browser on a VPS with open ports 9222 or 22.
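To illustrate the first point, here is a toy red pill sketch. The iteration count and the interpretation are made up; real checks compare the timings against calibrated baselines:

// Toy "red pill": time a tight loop several times with performance.now()
function timeTightLoop(iterations) {
  const samples = [];
  for (let run = 0; run < 10; run++) {
    const start = performance.now();
    let x = 0;
    for (let i = 0; i < iterations; i++) {
      x += Math.sqrt(i);
    }
    samples.push(performance.now() - start);
  }
  // Unusually slow or highly variable timings can hint at virtualized or overloaded hardware
  return samples;
}

console.log(timeTightLoop(1000000));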

Conclusion

To be done.