In this blog article, I will introduce my most recent project: a distributed crawling infrastructure that allows any website to be crawled, either with a low-level HTTP library or with a fully-fledged Chrome browser configured to evade bot detection attempts.

This introduction is split across three blog articles, because a single article would be too long to cover the entire topic.

  1. (This article) The first part of the series motivates the development of the crawling infrastructure, introduces the architecture of the software, and demonstrates at a high level how the crawling backend works.
  2. The second part covers the installation of the distributed crawling infrastructure on AWS and tests the freshly deployed stack with a sample crawl task.
  3. In the third part of this tutorial series, a crawl task covering the top 10,000 websites of the world is created, and the downloaded HTML documents are stored in S3. For the top 10,000 websites, we use the scientific Tranco list: a research-oriented top sites ranking hardened against manipulation. As a concluding task, we run business logic on the stored HTML files, for example extracting all URLs from the HTML documents or running analytics on the `<meta>` tags (see the sketch after this list).
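
To give a concrete idea of the kind of business logic mentioned in the third part, here is a minimal sketch that extracts all URLs and `<meta>` tags from a stored HTML document. It assumes the document has already been downloaded (e.g. from S3 to a local file) and uses the cheerio library for parsing; the file path and the library choice are illustrative and not part of the actual infrastructure.

```typescript
import { readFileSync } from "fs";
import * as cheerio from "cheerio";

// Illustrative only: assumes an HTML document was already downloaded
// (e.g. fetched from S3) and saved to this local path.
const html = readFileSync("downloads/example.com.html", "utf-8");
const $ = cheerio.load(html);

// Extract all URLs found in <a href="..."> elements.
const urls: string[] = [];
$("a[href]").each((_, el) => {
  const href = $(el).attr("href");
  if (href) {
    urls.push(href);
  }
});

// Collect simple analytics on <meta> tags: name/property -> content.
const metaTags: Record<string, string> = {};
$("meta").each((_, el) => {
  const key = $(el).attr("name") ?? $(el).attr("property");
  const content = $(el).attr("content");
  if (key && content) {
    metaTags[key] = content;
  }
});

console.log(`Found ${urls.length} URLs`);
console.log(metaTags);
```

The same pattern scales to the full crawl: instead of a single local file, the post-processing step would iterate over all HTML objects stored in the S3 bucket and aggregate the extracted data.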
