I know a bit of web design too, but tend to avoid it in favor of CSS frameworks such as bulma.io, Bootstrap or ant.design.
Currently, I am interested in building reliable, distributed, queue-based scraping infrastructure, because I need it for scrapeulous.
My most recent projects are:
- Bot Detection Test - A bot detection page for puppeteer/playwright that implements widely known bot detection tests. The page is under constant development and tries to incorporate the most recent bot detection techniques.
- Breaking Google's Audio ReCaptcha - In this project, I make use of a method from 2019 that demonstrates how to solve the audio ReCaptcha with Google's own Speech-to-Text API. This method still works, which is quite astonishing.
- Distributed crawling infrastructure - A crawling infrastructure running on top of serverless computation, cloud storage (such as S3) and sophisticated queues. This software allows you to crawl and scrape the Internet at scale. It supports basic crawling via HTTP as well as sophisticated crawling with the help of a heavily customized headless Chrome browser controlled via puppeteer.
- struktur.js - A way to extract structured information from any visually rendered HTML page. This project aims to make scraping websites with CSS selectors and XPath queries obsolete. I don't have enough time to push it forward, but I like the idea a lot and I think it has a tremendous amount of potential.
- scrapeulous - A scraping platform that aims to take care of many annoying tasks involved in developing scrapers/crawlers. Currently, scrapeulous focuses on search engine scraping. In the near future, scraping of any website will be possible.
- se-scraper - The successor of GoogleScraper that builds on top of puppeteer, written in JS.
Some old projects of mine: