Last Update: October 2022
I used the following libraries and frameworks in the past:
Actively Maintained Projects
The following projects will be updated in the coming years and they are safe to use in production:
- IP Address API - I maintain a public IP Address API that gives ASN, geolocation and company information for each IP address.
- TLS Fingerprinting - In this project, I investigate several approaches to fingerprinting TLS 1.2 and TLS 1.3 connections. I build on existing research, take a deep dive into TLS fingerprinting techniques, and try to go one step further than existing solutions such as Salesforce's JA3 and JA3S TLS fingerprints and the Cisco TLS fingerprint.
- TCP/IP Fingerprinting - As with TLS fingerprinting, this technique relies on implementation differences: different operating systems use different TCP/IP stack configurations, which are advertised in the initial SYN packet of the TCP three-way handshake. By correlating the HTTP User-Agent with OS-specific TCP/IP header fields, it is possible to deduce the operating system by looking exclusively at the TCP/IP layer. The research has been released as zardaxt.py.
- Proxy/VPN Detection Test Site - In this project, I use several different methods to detect proxies and VPNs. The sum of the independent detection tests gives a good heuristic score for whether a website visitor is using a proxy to hide their true IP address.
- Bot Detection Test - This is a Puppeteer/Playwright bot detection page. It implements widely known bot detection tests. The page is under constant development and tries to incorporate the most recent bot detection techniques.
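To make the TCP/IP fingerprinting idea above more concrete, here is a minimal sketch in Python. It is not the method used in zardaxt.py. It looks at a single header field, the IP TTL, and assumes the commonly seen default initial TTLs (64 for Linux/macOS/Android/iOS, 128 for Windows, 255 for some network devices). The names `guess_initial_ttl` and `is_user_agent_consistent` and the OS labels are hypothetical and chosen only for illustration. A real fingerprint would combine many more SYN fields, such as window size and TCP options.

```python
# Assumption: typical default initial TTLs. Administrators can change these,
# so a mismatch is only a hint, never proof.
INITIAL_TTL_TO_OS = {
    64: {"Linux", "macOS", "Android", "iOS"},
    128: {"Windows"},
    255: {"Other"},
}


def guess_initial_ttl(observed_ttl: int) -> int:
    """Round the observed TTL up to the nearest common initial TTL.

    Every router on the path decrements the TTL by one, so the value seen
    by the server is at most the initial value set by the client.
    """
    if not 1 <= observed_ttl <= 255:
        raise ValueError(f"TTL out of range: {observed_ttl}")
    for initial in sorted(INITIAL_TTL_TO_OS):
        if observed_ttl <= initial:
            return initial
    raise AssertionError("unreachable: 255 is the maximum TTL")


def is_user_agent_consistent(observed_ttl: int, user_agent_os: str) -> bool:
    """Check whether the OS claimed by the User-Agent fits the TTL guess."""
    candidates = INITIAL_TTL_TO_OS[guess_initial_ttl(observed_ttl)]
    return user_agent_os in candidates


if __name__ == "__main__":
    # A SYN packet arrives with TTL 116 while the User-Agent claims Linux.
    print(is_user_agent_consistent(116, "Linux"))  # False
```

A mismatch like the one in the usage example does not identify the client by itself, but it is the kind of signal that can feed into a heuristic score alongside other tests.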
I no longer guarantee maintenance of the following older projects:
- Breaking Google's Audio ReCaptcha - In this project, I use a method from 2019 that demonstrates how to solve the audio ReCaptcha with Google's own Speech to Text API. This method still works, which is quite astonishing.
- Distributed crawling infrastructure - Distributed crawling infrastructure running on top of serverless computation, cloud storage (such as S3) and sophisticated queues. This software allows you to crawl and scrape the Internet at scale. It supports basic crawling via HTTP as well as sophisticated crawling with the help of a heavily customized headless Chrome browser controlled via Puppeteer.
- struktur.js - A way to extract structured information from any visually rendered HTML page. This project aims to make scraping websites with CSS selectors and XPath queries obsolete. I don't have enough time to push it forward, but I like the idea a lot and I think it has tremendous potential.
- scrapeulous - A scraping platform that aims to solve many annoying tasks in developing scrapers/crawlers. Currently, scrapeulous focuses on search engine scraping. In the near future, scraping of any website will be possible. Discontinued.
- se-scraper - The successor of GoogleScraper, written in JS and built on top of Puppeteer. Discontinued.
- GoogleScraper - Deprecated since 2018. It gathered 2200 stars on GitHub and taught me that there is demand for data extraction and scraping.
- SVG Captcha - A captcha implementation with SVG.
- Lichess Bot - A Lichess bot based on the Stockfish engine.