Intro

Let's do some thinking, shall we?

When I used to run a scraping service, I managed to scrape at most a couple of million Google SERPs per week. But I never ever purchased proxies from proxy providers such as Brightdata, Packetstream or Oxylabs.

Why?

Because I could not fully trust the other customers with whom I shared the proxy bandwidth. What if I shared proxy servers with criminals doing far more malicious stuff than my somewhat innocent SERP scraping?

Full disclosure: non-DoS scraping of public information is okay with me. Ad fraud, social media spam and web attacks such as automated SQL injection or XSS are not.

Furthermore, those proxy services are quite pricey, and me being a stingy German, I simply couldn't see a reasonable way for that combination to work out.


So how did I manage to scrape millions of Google SERPs?

I used AWS Lambda: I put headless Chrome into an AWS Lambda function and used puppeteer-extra and chrome-aws-lambda to build a function that launches a browser for up to 300 seconds, which I could use solely for scraping.
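For illustration, here is a minimal sketch of such a Lambda handler, assuming the stealth plugin for puppeteer-extra and a simplified event shape (the 300-second limit itself is just the function timeout configured in Lambda):

```typescript
import chromium from 'chrome-aws-lambda';
import { addExtra } from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Wrap the puppeteer-core bundled with chrome-aws-lambda so that
// puppeteer-extra plugins (here: stealth) can hook into it.
const puppeteer = addExtra(chromium.puppeteer);
puppeteer.use(StealthPlugin());

export const handler = async (event: { url: string }) => {
  const browser = await puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath, // Chromium binary shipped with chrome-aws-lambda
    headless: chromium.headless,
  });

  try {
    const page = await browser.newPage();
    await page.goto(event.url, { waitUntil: 'networkidle2', timeout: 30_000 });
    return { statusCode: 200, body: await page.content() };
  } finally {
    await browser.close();
  }
};
```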

Actually, I could have probably achieved the same with plain curl, because Google really doesn't put too much effort into blocking bots from their own search engine (they mostly rate limit by IP). But I needed a full browser for other projects, so there was that.

Anyhow, AWS gives you access to 16 regions around the world (do they offer even more regions by now?), and after three AWS Lambda function invocations your function obtains a new public IP address. If you concurrently invoke 1000 Lambda functions, you will bottom out at around 250 public IP addresses per region. With 16 regions, that gives you around 16 * 250 = 4000 public IP addresses at any time when using AWS Lambda. This was enough to scrape millions of Google SERPs per week, even when sharing public datacenter IP addresses.
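To make the fan-out concrete, here is a rough sketch of how invocations could be spread across regions with the AWS SDK. The function name and region list are placeholders, and the function has to be deployed in every region beforehand:

```typescript
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

// One client per region; the scraper function must exist in each of them.
const regions = ['us-east-1', 'eu-central-1', 'ap-southeast-1' /* , ... */];
const clients = regions.map((region) => new LambdaClient({ region }));

// Fan out asynchronous invocations per region. Concurrent execution
// environments draw from a shared pool of public IPs (roughly the ~250
// per region mentioned above).
async function fanOut(urlsPerRegion: string[][], functionName = 'serp-scraper') {
  await Promise.all(
    clients.flatMap((client, i) =>
      urlsPerRegion[i].map((url) =>
        client.send(
          new InvokeCommand({
            FunctionName: functionName,
            InvocationType: 'Event', // fire-and-forget, maximises concurrency
            Payload: Buffer.from(JSON.stringify({ url })),
          })
        )
      )
    )
  );
}
```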

I tried the same with Google Cloud Platform, but funnily enough, Google blocks traffic from its own cloud infrastructure much more aggressively than traffic from AWS.

(This was all in 2019 and 2020; things have possibly changed since.)

But I digress.

The above setup is not good. It will work for scraping Google / Bing / Amazon, because they want to be scraped to a certain extent.

But it will never work against well-protected websites that employ protection from anti bot companies such as DataDome, Akamai or Imperva (there are more anti bot companies; don't be salty if I didn't name you, okay?).

Those companies employ ill-adjusted individuals who do nothing else than look for the most recent techniques to fingerprint browsers and to find out whether a browser lies about its own configuration or exhibits artifacts that don't belong to a humanly controlled browser. When normal people are out drinking beers in the pub on a Friday night, these individuals invent increasingly bizarre ways to fingerprint browsers and detect bots ;)

  1. Browser Red Pills - Dan Boneh - Awesome Paper
  2. Browser Based Port Scanning
  3. Google Picasso
  4. Font Fingerprinting
  5. TCP/IP Fingerprinting - zardaxt.py
  6. Browser based Crypto Challenges - Proof of Work
  7. Generic Browser Fingerprinting
  8. TLS Fingerprinting
  9. WebGL Fingerprinting
  10. WebRTC real IP detection
  11. Behavioral Classification
  12. Gyroscope API querying (device movement / rotation detection)
  13. Fingerprinting without JavaScript using HTTP headers, CSS feature queries and Fonts.
  14. ...

I kid you not, there are millions of different ways to detect if a browser is being controlled by a bot or not. It's insanely complex and almost all bot architectures are to a degree vulnerable to detection.
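To make "Generic Browser Fingerprinting" (item 7 above) a bit more tangible, here is a toy collector of the kind of signals involved. Real vendors gather hundreds of these and, more importantly, cross-check them for consistency with each other; the selection below is illustrative only:

```typescript
// Toy fingerprint collector in the spirit of item 7 above.
async function collectFingerprint(): Promise<Record<string, unknown>> {
  // Canvas fingerprint: the same drawing call renders slightly differently
  // depending on GPU, driver and font stack.
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d')!;
  ctx.font = '14px Arial';
  ctx.fillText('fingerprint me 😊', 2, 20);
  const canvasHash = await sha256(canvas.toDataURL());

  // WebGL exposes the GPU vendor/renderer string (SwiftShader etc. stand out).
  const gl = document.createElement('canvas').getContext('webgl');
  const dbg = gl?.getExtension('WEBGL_debug_renderer_info');
  const gpu = gl && dbg ? gl.getParameter(dbg.UNMASKED_RENDERER_WEBGL) : null;

  return {
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    hardwareConcurrency: navigator.hardwareConcurrency,
    screen: [screen.width, screen.height, screen.colorDepth],
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    webdriver: navigator.webdriver, // true in naive headless setups
    gpu,
    canvasHash,
  };
}

async function sha256(s: string): Promise<string> {
  const buf = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(s));
  return [...new Uint8Array(buf)].map((b) => b.toString(16).padStart(2, '0')).join('');
}
```

None of these values is damning on its own; it is the combination (a "Linux x86_64" user agent claiming a mobile GPU, a timezone that contradicts the IP geolocation, and so on) that gives bots away.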

Maybe I am just not a good enough bot developer myself, but I think it's harder to create a good bot than to detect one. The real problem for anti bot companies is not detecting most bots, it is keeping the false positive rate low.

The main reason bots are prone to detection is simple economics: in order to scrape millions of pages, bot programmers put their browsers into Docker containers and orchestrate them with Docker Swarm. Others use Kubernetes to orchestrate scraping clusters. And of course they will use cloud providers such as Hetzner, AWS or DigitalOcean to host their bots. Nobody uses their MacBook Pro to run 20 Chrome Docker images overnight.

The architecture described above is highly non-humanlike. What sane human being browses Instagram from within a Docker container on a Hetzner VPS?!

Let's propose a scraping architecture that is not that easily detectable.

An undetectable and scalable scraping infrastructure

First let's proclaim the two laws of successful scraping:

  1. The second most important rule about evading anti bot companies is: You shall not lie about your browser configuration.
  2. And the most important rule is: You shall only lie about your browser configuration if nobody catches you.

Because I am not that good at reverse engineering those heavily obfuscated fingerprinting libraries from anti bot companies, my suggestion is to just use real devices for scraping.

[Image: a device farm (Source: https://github.com/DeviceFarmer/stf)]

If I were to create an undetectable scraping service, I would probably buy 500 cheap Android devices (starting at $58 per device), maybe from 5 different manufacturers. We want diversity, after all, for fingerprinting reasons! You can also buy old (but more powerful) Android devices. If you buy 100 devices at once, you'll get a massive discount.

Then I would buy cheap data plans for the devices, control them with DeviceFarmer/stf, and rent some cheap storage space (with a cell tower close by) in five major cities of the world, such as London, Paris, Boston, Frankfurt and Los Angeles, putting 100 phones in each.

Then I would install the lightweight Android Go on each device, throw out everything unnecessary that bloats the device, and plug it into a power source. Every 5 minutes, I would toggle airplane mode on and off so the phone gets another IP address from the 4G carrier-grade NAT.
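The airplane-mode toggling can be scripted over adb. A rough sketch, assuming rooted devices (broadcasting AIRPLANE_MODE needs elevated privileges on modern Android) and with the device serial as a placeholder:

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const sh = promisify(execFile);

// Toggle airplane mode on a rooted device via adb so the modem re-attaches
// and gets a new IP from the carrier-grade NAT.
async function rotateIp(serial: string): Promise<void> {
  const adb = (cmd: string) => sh('adb', ['-s', serial, 'shell', cmd]);

  await adb(`su -c 'settings put global airplane_mode_on 1'`);
  await adb(`su -c 'am broadcast -a android.intent.action.AIRPLANE_MODE --ez state true'`);
  await new Promise((r) => setTimeout(r, 5_000)); // let the modem detach

  await adb(`su -c 'settings put global airplane_mode_on 0'`);
  await adb(`su -c 'am broadcast -a android.intent.action.AIRPLANE_MODE --ez state false'`);
}

// e.g. every 5 minutes, for every attached device serial:
// setInterval(() => rotateIp('<device-serial>'), 5 * 60 * 1000);
```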

Mobile IP addresses (4G, 5G, LTE) are practically un-bannable, because they are shared by up to hundreds of thousands of legitimate users in major cities. Instagram will never dare to ban 200,000 people in LA just because some pesky spammers use the same IP! When carrier-grade NATs were designed, the designers knew about this issue:

In the event that an IPv4 address is blocked or blacklisted as a source of spam, the impact on a CGN would be greater, potentially affecting an entire subscriber base. This would increase cost and support load for the ISP, and, as we have seen earlier, damage its IP reputation.

Do you think IPv6 comes to the rescue? Think again! Most anti bot companies attach little to no IP reputation to individual IPv6 addresses, because the address space is so insanely vast.

One problem with the setup described above is that I would need to spoof those pesky deviceorientation and devicemotion JavaScript events at the kernel level, because no real device lies on the ground without any rotation or movement all day long. Every website can read rotation and acceleration data on Android without asking for permission. So we have to spoof that.
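Just to illustrate what the spoofed signal needs to look like (slow, noisy, human-like drift instead of constant zeros), here is a page-level version that dispatches synthetic sensor events, for example injected via puppeteer's page.evaluateOnNewDocument(). To be clear, this is illustration only; as said above, it really belongs at the kernel level, because injected JavaScript overrides are themselves fingerprintable. All values are made up:

```typescript
// Illustration only: synthesize deviceorientation / devicemotion events
// with a little jitter, because humans never hold a phone perfectly still.
function fakeSensors() {
  let alpha = 120, beta = 35, gamma = -10; // some plausible resting pose

  setInterval(() => {
    alpha += (Math.random() - 0.5) * 0.4;
    beta  += (Math.random() - 0.5) * 0.2;
    gamma += (Math.random() - 0.5) * 0.2;

    window.dispatchEvent(
      new DeviceOrientationEvent('deviceorientation', { alpha, beta, gamma, absolute: false })
    );
    window.dispatchEvent(new DeviceMotionEvent('devicemotion', {
      accelerationIncludingGravity: { x: 0.1, y: 9.7, z: 0.3 }, // roughly gravity on y
      rotationRate: { alpha: 0.01, beta: 0.02, gamma: 0.01 },
      interval: 16,
    }));
  }, 16); // ~60 Hz, roughly what real sensors report
}
```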

But apart from that, I cannot see how bot detection systems would block this scraping infrastructure.

Of course the downsides are apparent:

  1. I have to buy 500 Android devices. I already own three of those things; I would go ballistic with 500 of them.
  2. I need to rent storage space in major cities. That's expensive.
  3. I need people in 5 cities to fix problems in the device farms.
  4. I have to deal with hardware. I hate that. It causes problems nonstop.

So that would be a larger project, probably costing thousands of dollars in maintenance.

Improvement: Emulate Android

Instead of buying real Android devices, it would be better to emulate them with Android emulators.

Obviously, here we play with the devil again because we want to cut costs!
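As a rough sketch of the emulation side, assuming the stock Android SDK tooling (system images installed via sdkmanager, one AVD created per instance with avdmanager; the AVD names and ports below are illustrative):

```typescript
import { spawn } from 'node:child_process';

// Spin up N headless Android emulators. Each AVD (scraper-0, scraper-1, ...)
// is assumed to have been created beforehand with avdmanager.
function launchEmulators(count: number) {
  for (let i = 0; i < count; i++) {
    spawn('emulator', [
      '-avd', `scraper-${i}`,
      '-no-window', '-no-audio',       // headless: no UI, no sound
      '-port', String(5554 + i * 2),   // console/adb ports come in even pairs
    ], { stdio: 'ignore', detached: true }).unref();
  }
}

launchEmulators(50);
```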

How are those pesky anti bot companies going to find out that we are emulating Android devices?

  1. One idea is to use browser-based red pills that reveal that the browser is running in an emulated environment
  2. Maybe they will launch browser-based port scans against well-known ports that are only open on emulated Android devices (such as the adb service)?
  3. Maybe Google sets some device-wide advertising IDs on each mobile device? If this ID is missing or always stays the same, that is suspicious.
  4. Every website can find out whether you are logged into a Gmail or YouTube account with Social Media Login Detection. No logged-in Google account on Android equals suspicion!
  5. There are probably 1000 more techniques that can be used to detect emulated Android devices

Most likely, the Android emulators are imperfect, and this imperfection shows through the massive JavaScript API surface that each mobile browser exposes to every website.
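As a taste of item 2 from the list above, here is a rough sketch of a browser-based probe that times a WebSocket connection attempt. 10.0.2.2 is the Android emulator's alias for the host machine's loopback interface; on a real phone it is usually just an unreachable address. The threshold and interpretation are illustrative, not a production detector:

```typescript
// Time how quickly a WebSocket connection attempt to host:port fails.
// A fast error suggests the host answered (connection refused/reset),
// while an unreachable host typically runs into the timeout.
function probePort(host: string, port: number, timeoutMs = 2000): Promise<number> {
  return new Promise((resolve) => {
    const start = performance.now();
    const done = () => resolve(performance.now() - start);
    const ws = new WebSocket(`ws://${host}:${port}`);
    ws.onerror = done;
    ws.onopen = () => { ws.close(); done(); };
    setTimeout(() => { ws.close(); resolve(timeoutMs); }, timeoutMs);
  });
}

// A quick failure on 10.0.2.2:5555 (the adb port) is a strong emulator hint.
probePort('10.0.2.2', 5555).then((ms) => console.log('probe took', ms, 'ms'));
```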

I am absolutely in favour of the emulation approach. It would mean that we only have to own a few powerful servers, plug 4G dongles into them, and we are ready to go. It could look like this (the image is taken from proxidize.com):

[Image: a couple of 4G dongles that I use for my personal e-mail checking and writing WhatsApp messages (Source: https://proxidize.com/gallery/)]

What proxidize.com offers is 4G mobile proxies. I don't want proxies, because proxies are detectable in themselves. I want to use the 4G dongle directly from the Android emulator! No added latency from the geographical distance between the Android emulator and a proxy.
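One way to pin each emulated device to "its" dongle is to give every emulator a dedicated Linux network namespace whose only uplink is that dongle. A sketch, assuming root and a dongle that shows up as a regular network interface (interface and AVD names are placeholders):

```typescript
import { execFileSync, spawn } from 'node:child_process';

// Put one 4G dongle and one emulator into their own network namespace,
// so all emulator traffic can only leave through that dongle.
function startEmulatorOnDongle(index: number, dongleIface: string) {
  const ns = `emu${index}`;
  const run = (cmd: string, args: string[]) => execFileSync(cmd, args, { stdio: 'inherit' });

  run('ip', ['netns', 'add', ns]);                           // isolated network stack
  run('ip', ['link', 'set', dongleIface, 'netns', ns]);      // move the dongle into it
  run('ip', ['netns', 'exec', ns, 'ip', 'link', 'set', 'lo', 'up']);
  run('ip', ['netns', 'exec', ns, 'dhclient', dongleIface]); // obtain the carrier IP

  // Launch the emulator inside the namespace; it inherits the dongle's routing.
  spawn('ip', ['netns', 'exec', ns,
    'emulator', '-avd', `scraper-${index}`, '-no-window', '-no-audio',
  ], { stdio: 'ignore', detached: true }).unref();
}

startEmulatorOnDongle(0, 'usb0');
```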

So in the end, the scraping infrastructure could look like this:

  1. Install one powerful scraping server with 50 4G dongles connected to it in one geographical location
  2. For each scraping server, run 50-100 emulated Android devices.
  3. Put one such scraping station in each of 5 major cities.
  4. A simple command & control server orchestrates the 5 scraping stations (a minimal sketch follows below).
  5. Profit.
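And a minimal sketch of the command & control piece from step 4: a round-robin dispatcher that hands scrape jobs to the stations over HTTP. The station endpoints and the /scrape API are hypothetical:

```typescript
// Round-robin job dispatcher for the five scraping stations.
// Requires Node 18+ for the global fetch API.
const stations = [
  'https://station-london.example.com',
  'https://station-paris.example.com',
  'https://station-boston.example.com',
  'https://station-frankfurt.example.com',
  'https://station-la.example.com',
];

let next = 0;

async function dispatch(url: string): Promise<unknown> {
  const station = stations[next++ % stations.length];
  const res = await fetch(`${station}/scrape`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) throw new Error(`station ${station} failed: ${res.status}`);
  return res.json(); // the station replies with the scraped page / extracted data
}
```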