Please Note: This blog post is not finished yet. Research is still ongoing.

Introduction

Nowadays, advanced web bots are becoming more powerful with each passing day. With the help of browser automation frameworks such as puppeteer and playwright, fully-fledged Chrome browsers are deployed to the cloud and programmed to automate many different workflows that are too tedious and repetitive for humans.

These advanced web bots are employed for many different use cases:

  • Scraping SERP data from search engines such as Google, Bing or Baidu
  • Scraping price data from sites such as eBay or Amazon in order to gain a competitive edge
  • Advertisement fraud: making bots click on advertisement links and cashing in the ad impression payout
  • Sneaker bots: automatically buying highly sought-after goods such as limited edition Nike sneakers
  • Social media bots (such as Twitter bots): spreading misleading and false information with the aim of manipulating the consuming masses

The general tendency is very clear: a lot of behavior on the Internet is completely automated. This trend will most likely continue, and there is a constant battle between bot programmers and anti-bot companies.

In my humble opinion, as long as bot traffic does not cause Denial of Service issues or lead to any direct damage, it should not be legally sanctioned. After all, public data is public for a reason. For example: scraping public data from a site without using excessive bandwidth and without causing traffic bursts is okay. On the other hand, influencing people by creating misleading comments is not okay.

Therefore, there is a constant demand for detecting bad bot behavior. The range of available techniques is vast, but most detection techniques can be grouped into the following categories:

  • IP Address reputation techniques
  • Browser Fingerprinting
  • Browsing Behavioral Analysis

The Idea

For economic reasons, most bot programmers do not use their own personal computer to run their bots. The reason is obvious: a personal computer does not scale horizontally, and it is impractical to keep it running constantly.

Therefore, bot owners usually rent cloud computing resources to host their bots. A popular solution is to automatically spawn AWS EC2 instances in a Docker swarm and to assign a certain amount of resources to each container.

Another popular approach is to use serverless computing infrastructure such as AWS Lambda or Microsoft Azure Functions.

The exact method does not matter. What is common to all approaches: each crawler is assigned as few resources as possible in order to save infrastructure costs.

This is in stark contrast to most human website visitors: real humans use the browser on their own computer, and rendering a website usually takes only a fraction of the available computational resources.

Motivation

The motivation to write this blog post originates from a paper by Stanford Professor Dan Boneh and several other researchers. The paper's title is "Tick Tock: Building Browser Red Pills from Timing Side Channels". The paper is a really great read and I highly suggest you read it.

In the paper, Boneh et al. try to find browser-based techniques using only JavaScript to show that the browser is running in a virtualized environment such as VirtualBox or VMware.

They propose JavaScript functions of two different classes:

  1. Baseline Operations: These algorithms are assumed to take the same execution time on bare metal environments as on virtual machines
  2. Differential Operations: These algorithms have significantly different execution times in virtual machines compared to bare metal machines.

The authors propose several techniques for the above two classes of algorithms and successfully demonstrate that it is possible to recognize, with high statistical confidence, that a browser is running in a virtual machine.

The purpose of this blog post is to take the concept from the above paper and to find algorithms for the two algorithm classes that are able to tell a normal computing device (such as a laptop, smartphone or tablet) apart from generic cloud infrastructure.

Put differently: due to the restricted computational resources allocated to cloud based web bots, I suspect that it is possible to find certain algorithms that have significantly different execution times in the cloud compared to the devices usually used by humans.

Other Researchers Thinking in the Same Direction

Antoine Vastel wrote in his PhD thesis submitted on 24th October 2019:

I propose to investigate red pills, sequences of instructions that aim at detecting if a browser is running in a virtualized environment, for bot detection. Indeed, since crawling at scale requires a large infrastructure, a significant fraction of crawlers likely runs on virtual machines from public cloud providers. Ho et al. proposed several red pills capable of detecting if the host system is a virtual machine from within the browser.

Nevertheless, the paper has been published in 2014 and has not been evaluated on popular public cloud providers. Moreover, the underlying implementations of some of the APIs used in the red pills may have evolved, which can impact the accuracy of the red pills. Thus, I argue there is a need for evaluating these red pills on the main public cloud providers and developing new red pills techniques.

The Targeted Environment

The goal of this blog article is to detect that a browser is running in serverless cloud infrastructure. It is assumed that the serverless environment is assigned 1500MB of memory.

For simplicity's sake, we will try to detect that a browser is running from within AWS Lambda. It should be possible to distinguish AWS Lambda from at least the following devices:

  1. Normal Laptops
  2. Tablets
  3. Smartphones

The above devices must be reasonably modern, let's say not older than 6 years.

So what environmental restrictions does AWS Lambda have?

A good start is to look at the limits of the AWS Lambda Environment.

The essential question we need to ask ourselves: What is the best way to trigger the limits imposed by AWS Lambda with JavaScript, without reaching the computational limits of the standard devices commonly used to browse the web (laptops, tablets, smartphones)?

It appears that AWS Lambda does not have good GPU support.

As of 1 December 2020, it is possible to allocate up to 10GB of RAM and 6 vCPU cores for Lambda functions. vCPU cores are allocated proportionally to the amount of RAM (between 128 MB and 10,240 MB).
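
Besides timing, the browser itself exposes a rough view of the available hardware. The following sketch is not a timing red pill and its signals can easily be spoofed; it simply reads navigator.hardwareConcurrency and navigator.deviceMemory, which a browser hosted in a small Lambda function may report with conspicuously low values.

function resourceHints() {
  // hardwareConcurrency reports the number of logical CPU cores.
  // deviceMemory (Chromium only) reports an approximate amount of RAM in GB.
  // Both values can be overridden by a bot, so this is only a weak signal.
  var cores = navigator.hardwareConcurrency || -1;
  var memory = navigator.deviceMemory || -1;
  console.log('cores: ' + cores + ', memory (GB): ' + memory);
  return { cores: cores, memory: memory };
}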

Other Detection Methods

Our goal is to detect that a browser is running from within a serverless cloud infrastructure.

Several fingerprinting sites such as pixelscan.net and browserleaks.com give rise to new promising ideas.

There are several other detection methods that come to mind quickly:

  • See if the browser in the cloud leaks DNS info that is not configured to run through the proxy
  • Check if the browser leaks the real IP address via WebRTC (see the sketch below)
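
As a rough sketch of the WebRTC idea (the STUN server URL below is just a common public example, and the whole snippet is an illustration rather than a hardened implementation): the browser is asked to gather ICE candidates, and the candidate strings may contain IP addresses that never go through the configured proxy.

function webrtcLeakCheck() {
  // Gathering ICE candidates may expose local and public IP addresses
  // that bypass an HTTP or SOCKS proxy configured for regular traffic.
  var pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
  });
  pc.createDataChannel('leak-test');
  pc.onicecandidate = function (event) {
    if (event.candidate) {
      console.log('candidate: ' + event.candidate.candidate);
    }
  };
  pc.createOffer().then(function (offer) {
    return pc.setLocalDescription(offer);
  });
}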

Implementation Idea

This is the basic algorithm that will be used in this blog post:

function redpill() {
  // Time the baseline operation, which should run roughly equally fast
  // on end-user devices and on cloud machines.
  var baseStart = performance.now();
  baselineOperation();
  var baseTime = performance.now() - baseStart;
  console.log('baseTime: ' + baseTime);

  // Time the differential operation, which should be noticeably slower
  // in a resource-constrained cloud environment.
  var diffStart = performance.now();
  differentialOperation();
  var diffTime = performance.now() - diffStart;
  console.log('diffTime: ' + diffTime);

  // The ratio normalizes for the overall speed of the machine.
  return diffTime / baseTime;
}
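
A hypothetical usage of redpill() could look as follows. The threshold below is a made-up value and would have to be calibrated against measurements collected from real end-user devices and from cloud instances:

// SUSPICIOUS_RATIO is an assumed cutoff, not a measured value.
var SUSPICIOUS_RATIO = 10;

var ratio = redpill();
if (ratio > SUSPICIOUS_RATIO) {
  console.log('Environment looks like resource-constrained cloud infrastructure');
} else {
  console.log('Environment looks like a normal end-user device');
}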

Baseline Operations

The text baseline writes random text into the DOM.

function textBaseline() {
  var addString = "Writes lots of text:";
  addString += Math.floor(Math.random() * 10000);

  var pNode = document.createElement("p");
  document.body.appendChild(pNode);

  // Reading textContent and re-assigning innerHTML forces the browser to
  // re-serialize and re-parse the growing paragraph on every iteration.
  for (var i = 0; i < 500; i++) {
    pNode.innerHTML = pNode.textContent + addString;
  }
}

var baseStart = performance.now();
textBaseline();
var baseTime = performance.now() - baseStart;
console.log('baseTime: ' + baseTime);

The memory baseline: this algorithm writes a random number into memory 40,000 times and reads it back.

function memoryBaseline() {
  var RANDOM = Math.floor(Math.random() * 1000000);

  // new Number() forces a heap allocation for every element instead of
  // storing a primitive, which stresses the allocator and garbage collector.
  var array = new Array();
  for (var i = 0; i < 40000; i++) {
    array.push(new Number(RANDOM));
  }

  for (var i = 0; i < 40000; i++) {
    array.pop();
  }
}

var baseStart = performance.now();
memoryBaseline();
var baseTime = performance.now() - baseStart;
console.log('baseTime: ' + baseTime);

Differential Operations

The console write differential operation writes a random error string to the console 2000 times.

function consoleOperation() {
  var error_str = "Error: Writing to Console: " + Math.floor(Math.random() * 10000);
  for (var i = 0; i < 2000; i++) {
    console.log(error_str);
  }
}

var diffStart = performance.now();
consoleOperation();
var diffTime = performance.now() - diffStart;
console.log('diffTime: ' + diffTime);

Local Storage Differential Operation

This operation writes a 500-byte random string to local storage 1000 times and reads the values back into an array. It is assumed that this operation has noticeably different running times on different machines.

function localStorageOperation() {
  var randomStr = "x".repeat(495) + Math.random().toString().slice(2, 7);
  var data = new Array();

  // localStorage access is synchronous and typically backed by disk,
  // so storage I/O latency shows up directly in the timing.
  for (var i = 0; i < 1000; i++) {
    localStorage.setItem('lsDO' + i, randomStr);
  }

  for (var i = 0; i < 1000; i++) {
    data.push(localStorage.getItem('lsDO' + i));
  }
}

var diffStart = performance.now();
localStorageOperation();
var diffTime = performance.now() - diffStart;
console.log('diffTime: ' + diffTime);
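
Note that this operation leaves 1000 keys behind in local storage. If the measurement is run repeatedly, it is probably a good idea to clean up afterwards, for example:

// Remove the keys written by localStorageOperation().
for (var i = 0; i < 1000; i++) {
  localStorage.removeItem('lsDO' + i);
}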

ReadPixels Differential Operation: CPU-GPU Communication

This algorithm tests the communication latency between the CPU and the graphics card. It is assumed that generic cloud infrastructure does not provide a GPU to each of its users, whereas most normal users have a GPU in their device. Therefore, it should be possible to observe different run times.

Due to the size of the full algorithm, only a link will be provided here.
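
In the meantime, a minimal sketch of the idea could look like the following (the canvas size and the number of iterations are arbitrary choices): gl.readPixels() forces the rendering pipeline to flush and copies the framebuffer from the GPU back to the CPU, which should be considerably slower when rendering falls back to a software rasterizer, as is common on GPU-less cloud machines.

function readPixelsOperation() {
  var canvas = document.createElement('canvas');
  canvas.width = 256;
  canvas.height = 256;

  var gl = canvas.getContext('webgl');
  if (!gl) {
    // No WebGL context at all is already a strong signal by itself.
    return -1;
  }

  var pixels = new Uint8Array(canvas.width * canvas.height * 4);
  for (var i = 0; i < 50; i++) {
    // Render something trivial so there is work to synchronize on.
    gl.clearColor(Math.random(), Math.random(), Math.random(), 1.0);
    gl.clear(gl.COLOR_BUFFER_BIT);
    // readPixels blocks until the GPU has finished and the data has been
    // copied back to the CPU, exposing the CPU-GPU round-trip latency.
    gl.readPixels(0, 0, canvas.width, canvas.height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
  }
}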

Visualization of Results

Memory Baseline Operation

Text Baseline Operation

Console Differential Operation

Local Storage Differential Operation

WebGL Differential Operation

Other Sources

Hacker News Discussion of other Red Pill techniques