Without ever having developed a fully functional anti-bot system myself, I want to investigate some of the obstacles that would have to be overcome if I were to start such a project.
First of all, I have to define the functional specification that such an anti-bot system must meet:
A bot detection system should be able to detect passively (1) that a website visitor (2) is in fact not a human but an automated program (3)
I deliberately keep my definition very broad.
- An automated program is software not controlled by a human user. Some examples:
  - The Google Chrome browser automated with a framework such as Playwright or Puppeteer
  - `curl` commands orchestrated with shell scripts
  - Real physical mobile phones running Android, automated over `adb` (Android Debug Bridge) and (optionally) controlled with frameworks such as appium.io or stf (DeviceFarmer)
- And what does "website visitor" mean? There is a fundamental difference between a website visitor that simply opens a single page before leaving again and a user that performs a complex action such as logging into an online bank and transferring money. In the latter case, a bot detection system has much more time to detect a bot. Furthermore, in the latter case there is also behavioral and intent data, which is completely lacking in the former case.
Our attacker model is as follows:
Some explanatory words regarding the graphic above:
A fundamental property of client/server security is: no input from clients can be trusted from the standpoint of the server. This will be important in the remainder of this blog post, which is why I repeat it so vehemently here.
So what are some methods to passively detect that a browser is not controlled by an organic human?
In reality, there rarely exist signals that fall into the exact binary category of: this visitor is a bot, or not.
Rather, anti-bot vendors rephrase the question to: on what level can we uniquely identify a browser user and then rate limit her? Because that's the overall goal of anti-bot systems: rate limiting unique users.
Some techniques to identify users:
- By IP address
- By TLS or TCP/IP fingerprint
- By HTTP headers and their order/case sensitivity
- By Cookies/Session ids
It should be obvious that the examples from the list above fall into two categories:
- Signals that are unspoofable due to the design of the Internet, such as IP addresses.
  Of course an attacker can alter her IP address by setting up a proxy server. Technically, spoofing IP addresses on the IP level will work, but the first router outside your home network will most likely drop the packet if it detects that the source IP does not belong to the correct network. And even if that does not happen: how are you ever going to receive a reply to your spoofed IP packet?
- Signals that can be spoofed on the client side. With TLS and TCP/IP fingerprinting, it's way easier to forge those signals on the client side. Other spoofable data includes:
  - Fingerprinting data, mostly collected from the browser
  - Behavioral data such as mouse movements, touch events or key presses
  - WebGL fingerprints or audio fingerprints
Architecture of Bot Detection Services
Let's think about how bot detection services are implemented by following a typical browsing session of a user visiting a website. As discussed above, bot detection systems collect data from both the client side and the server side. In the following section, I will follow a typical browsing session on a very low level and I will explain the different checks that a bot detection system performs at each point along the way.
The very first event relevant for bot detection is the DNS lookup of the hostname to which the browser establishes a connection. The browser uses the operating system's local DNS resolver to look up the `A` or `AAAA` DNS resource record of the hostname. Worded differently, it asks for the IPv4 or IPv6 address that corresponds to the hostname in the URL that was entered in the browser's address bar. The DNS request will be answered by the responsible name server.
During such a lookup, the DNS server can check whether the client IP address seen by the resolver is the same as, or belongs to the same ISP as, the IP address that communicates with the web server. Put differently, the DNS server can check that no DNS leak happens. A DNS leak occurs if the DNS traffic is not routed through the proxy/VPN configured in the browser.
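As a sketch, such a leak check boils down to comparing the network that sent the DNS query with the network that later opened the HTTP connection. The `asnOf` lookup table below is a made-up stand-in for a real IP-to-ASN database (e.g. a MaxMind dataset); everything here is purely illustrative:

```javascript
// Toy IP-to-ASN lookup. A real system would query an ASN database
// instead of this hardcoded table (IPs are documentation addresses).
function asnOf(ip) {
  const table = {
    '203.0.113.7': 'AS3320',   // hypothetical ISP network
    '203.0.113.9': 'AS3320',
    '198.51.100.4': 'AS13335', // hypothetical datacenter network
  };
  return table[ip] || 'unknown';
}

// Compare the IP that queried the authoritative name server with the
// IP that connected to the web server. Different networks suggest the
// DNS request bypassed the proxy/VPN (a DNS leak).
function looksLikeDnsLeak(resolverClientIp, webClientIp) {
  return asnOf(resolverClientIp) !== asnOf(webClientIp);
}
```

In practice the comparison would be fuzzier (large ISPs resolve and route through different prefixes), so this signal can only raise a score, not decide on its own.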
After the IP address of the domain name has been obtained, the browser is ready to establish a connection with the web server.
Before a browser is able to display anything, a TCP and a TLS handshake have to occur to establish a connection to the web server (in case we are using an HTTPS connection, which is almost always the case nowadays).
As soon as the first incoming SYN packet arrives at the web server, an anti-bot system is capable of performing the following lookups on the server side:
Obtain the source IP address of the client. As soon as we have the source IP address, we can do a wide range of IP reputation checks:
- IP address counter: Check if we already have received too many requests from this specific IP address in a certain time frame. Abort the TCP handshake by sending a RST packet if that's the case.
- Also increase a counter for the (assumed) subnet to which this IP address belongs. Many attackers are in possession of whole IPv4 and IPv6 subnets, and this fact needs to be addressed.
- Lookup the IP address on spam abuse databases such as spamhaus.org. Are there any databases that indicate that this IP address was used for spamming/botting purposes?
- Conduct a datacenter IP lookup for the IP address. If the IP address belongs to a datacenter, it's more likely that it's a bot compared to a residential IP address. Similarly, you can also check if an IP address belongs to a large residential ISP such as Comcast, AT&T or Deutsche Telekom.
- Look up the geographical location for this IP address. Is it a tier 1 country (rich countries in the west such as the United Kingdom or Switzerland)? Or is the country known for spamming/botting (Ukraine, Vietnam, Russia - nothing personal here)?
- Look up the ASN, organization or registry to which the IP address belongs. This can be done using the `whois` command.
- Use IP address metadata services such as ipinfo.io or https://ip-api.com/ to infer more metadata for this IP address.
- Make a reverse DNS lookup for the IP address with `host 220.127.116.11`. If the hostname belongs to a trustworthy company, give it a better reputation. If it belongs to an untrustworthy organization, block it.
Generate a TCP/IP fingerprint from the SYN packet. A TCP/IP fingerprint gives us information about the assumed operating system of the client, based on certain TCP and IP header fields such as the TCP options field. The TCP/IP fingerprint alone does not have enough entropy to reasonably exclude a client. However, when the inferred OS seems to be a Linux system, it's valid to raise the bot score, since most legit users do not use Linux. Keep in mind that the TCP/IP fingerprint can be altered and spoofed by the client!
- Measure TCP/IP latencies and the RTTs of the exchanged packets. What's the throughput? Can we infer that the client used a WiFi network?
After the initial TCP/IP handshake is completed, the TLS handshake comes next. After the TLS handshake is completed, the server is able to compute a TLS handshake fingerprint.
I am not a specialist in TLS fingerprinting, but my guess is that a TLS fingerprint has slightly more entropy than a TCP/IP fingerprint and allows differentiating between TLS implementations and maybe operating systems. If that is the case, then at this point it would already be possible to see whether the OS inferred from the TCP/IP fingerprint mismatches the OS inferred from the TLS fingerprint. However, keep in mind that TLS and TCP/IP fingerprints can easily be forged on the client side.
After the TLS handshake has been established, the client sends an initial HTTP GET request to fetch the requested URL. Let's assume the client requests the `index.html` document. Based on this very first GET request, we can do a couple of things:
- Compute an HTTP fingerprint. What headers are sent by the client? In what order are they sent? Are the headers case sensitive?
- Do the HTTP headers contain typical proxy headers such as

```javascript
var proxy_headers = ['Forwarded', 'Proxy-Authorization',
  'X-Real-Ip', 'Via', 'True-Client-Ip', 'Proxy_Connection'];
```
- Does the HTTP Referer look suspicious?
- Is the HTTP version one used by a modern browser, or does it point to a non-browser library? (Modern browsers negotiate HTTP/2, while many HTTP libraries still default to HTTP/1.1.)
If all checks pass up to this point, the web server is going to serve the contents of the `index.html` document.
- Encryption and encoding of payloads sent back to the server (only makes sense in combination with obfuscation/virtual machines)
There are some reasons for that:
For example, when I look for apartments on the German real estate search engine immobilienscout24.de, I am sometimes presented with a bot detection challenge that performs a check passively in the background.
`p` is a 30 KB long base64-encoded binary blob.
The next steps in reverse engineering would be to find out how exactly `p` becomes encrypted/encoded and to dump the clear-text contents. Then I know what data is sent and can therefore spoof it.
This is also the reason why many bot detection companies decide upfront to interrupt the browsing session of the user and display a message that an active bot detection check is happening. Cloudflare Bot Management does this for example.
However, in this blog article I focus on passive detection without interrupting the user's workflow.
- There was a network outage
- The system / browser crashed
- The user blocked the execution of the script deliberately to evade bot detection
Only the last point indicates malicious behavior by the client. Therefore, every bot detection system needs to give each client some free attempts; let's say the IP address gets blocked once a threshold of `N=20` failed attempts has been surpassed.
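A minimal sketch of such a free-attempt counter; the threshold and the sliding window are arbitrary choices for illustration:

```javascript
const N = 20;                      // free attempts before a block
const WINDOW_MS = 60 * 60 * 1000;  // 1-hour sliding window

const failures = new Map();        // ip -> timestamps of failed checks

// Record one failed check (e.g. the detection script never phoned home).
function recordFailure(ip, now = Date.now()) {
  const recent = (failures.get(ip) || []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  failures.set(ip, recent);
  return recent.length;
}

// Block only once the client has used up its free attempts.
function isBlocked(ip, now = Date.now()) {
  const recent = (failures.get(ip) || []).filter(t => now - t < WINDOW_MS);
  return recent.length > N;
}
```

The sliding window matters: without it, a flaky home connection would eventually get every household banned.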
But what if the user switches his IP address between requests? For example, by using a mobile 4G or 5G proxy that sits behind a carrier-grade NAT?
You can't just block a whole mobile address range because you feel like it. Wikipedia says:
> In cases of banning traffic based on IP addresses, the system might block the traffic of a spamming user by banning the user's IP address. If that user happens to be behind carrier-grade NAT, other users sharing the same public address with the spammer will be mistakenly blocked. This can create serious problems for forum and wiki administrators attempting to address disruptive actions from a single user sharing an IP address with legitimate users.
And there is another very important point here: Bot detection systems can only reliably ban clients by IP address! This is so important to understand. Every other signal from the client can be spoofed (if the client understands what he executes on his machine)!
If a bot detection system bans based on other ways of identification (Browser fingerprint, WebGL fingerprint, font fingerprint, TLS Fingerprint, TCP/IP fingerprint, ...), a client can change those fingerprints and evade a ban!
Having said that, let's look at some techniques to detect bots on the client side:
- Classic browser fingerprinting: collect as many attributes such as `navigator.platform` as possible, while simultaneously trying to look for attributes that are static and don't change on browser/OS updates.
- Proof-of-work challenges, such as solving cryptographic puzzles in the browser, as for example friendlycaptcha.com is doing. The rough idea: make the browser find an input for a hash function such that the first `K` bits of the output are all zeroes. This takes some time.
- Another common bot detection method is to produce unique fingerprints with the browser's built-in WebGL rendering system, which accesses the client's graphics hardware. For example, Google Picasso detects the client's operating-system class by leveraging the unique outputs of the WebGL renderer. They deliberately only want to reliably detect the device/operating system, simply because most spammers/botters use cheap cloud/hosting infrastructure, which mostly runs some Linux flavor.
- Port scanning the client from the browser, e.g. checking for an open `adb` (Android Debug Bridge) port (5037). Furthermore, this also allows checking whether the client has an accessible router on the local network.
- Recording behavioral data by listening to DOM events such as `mousemove`, `touchstart` or `keydown`.
- Lastly, the technique that is still most used is the good old CAPTCHA. Google's reCAPTCHA or the newer hCAPTCHA are well known solutions.
Proof-of-work challenges cannot be bypassed by clients: they have to be solved somewhere. However, the solving of the challenge does not have to happen on the browser/client that received it.
There are two fundamental approaches how to defeat bot detection:
- Don't participate in the cat and mouse race between bot detection companies and botters and use real devices with legit IP addresses (such as mobile device farms) to conduct your attacks. This is very costly in terms of hardware and data plans.
Only one thing is for sure: all that bot detection companies are doing is raising the transaction costs for automation on the Internet. But we live in times where platforms are becoming more monopolized and powerful with each passing day.
At the same time, mobile phones are becoming cheaper and cheaper. In our modern times, a mobile phone basically constitutes a digital identity. But all you need to acquire such an identity is 100 USD to buy a cheap phone and a cheap data plan for 7.99 USD a month. If bot detection companies do their job too well, spammers will simply create mobile device farms and conduct their botting/spamming attacks with those device farms.
Detecting real automated mobile devices is much, much harder than detecting Headless Chrome on AWS... As of now, I know of only two ways to detect such a mobile device farm:
- Port scan the client for an open `adb` port (5037)
- Maybe some statistical analysis that a website suddenly has a surge in traffic from cheap or old Android smartphones from the same CGNAT which might indicate a mobile device farm as source of the traffic...