Crawling Infrastructure - Introduction

Posted on in Crawling • Tagged with Crawling, Distributed Computing, Cloud, Web Bots • 6 min read

In this blog article I introduce my most recent project: a distributed crawling infrastructure that can crawl any website with either a low-level HTTP library or a fully fledged Chrome browser configured to evade bot detection.

This introduction is split into three blog articles, because a single article would be too long to cover the whole topic.

  1. (This article) The first part of the series motivates the development of the crawling infrastructure, introduces the architecture of the software and demonstrates how the crawling backend works at a high level.
  2. The second part covers the installation of the distributed crawling infrastructure within the AWS cloud infrastructure and tests the freshly deployed stack with a test crawl task.
  3. In the third part of this tutorial series, a crawl task for the top 10,000 websites of the world is created. The downloaded HTML documents are stored in S3. For the top 10,000 websites, we use the scientific Tranco list: A Research-Oriented Top Sites Ranking Hardened Against Manipulation. As a concluding task, we run business logic on the stored HTML files. For example, we extract all URLs from the HTML documents or we run analytics on the <meta> tags …
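
The concluding task mentioned in the third part (extracting URLs and looking at <meta> tags) could look roughly like the following. This is only a minimal sketch using Python's standard library, not the code from the series; the names LinkAndMetaParser and analyze_html are made up for illustration, and fetching the stored documents from S3 is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndMetaParser(HTMLParser):
    """Collects href/src URLs and <meta> name/content pairs from one document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Resolve relative links against the URL the page was crawled from.
        for key in ("href", "src"):
            if attrs.get(key):
                self.urls.append(urljoin(self.base_url, attrs[key]))
        # Record <meta name="..."> and <meta property="..."> tags.
        if tag == "meta":
            name = attrs.get("name") or attrs.get("property")
            if name and attrs.get("content") is not None:
                self.meta[name.lower()] = attrs["content"]


def analyze_html(html, base_url):
    """Return (list of absolute URLs, dict of meta tags) for one HTML string."""
    parser = LinkAndMetaParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls, parser.meta
```

In a real run, analyze_html would be called once per stored document, with base_url set to the URL the document was downloaded from.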

Continue reading

Dynamic creation of S3 buckets in many regions

Posted on in Scripts • Tagged with AWS, S3, Buckets • 1 min read

A quick script that demonstrates how to create S3 buckets in many regions.
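
The script itself is not part of this excerpt. As a rough idea of what such a script may involve, here is a minimal sketch that assumes Python with boto3 and a hypothetical naming scheme of "<prefix>-<region>"; the function names are illustrative only. The client is injected through client_factory so the loop can be tried without AWS credentials.

```python
def default_client_factory(region):
    # boto3 is imported lazily so the rest of the sketch has no hard dependency.
    import boto3
    return boto3.client("s3", region_name=region)


def create_buckets(prefix, regions, client_factory=default_client_factory):
    """Create one bucket named '<prefix>-<region>' in every given region."""
    created = {}
    for region in regions:
        client = client_factory(region)
        bucket = f"{prefix}-{region}"  # hypothetical naming scheme
        kwargs = {"Bucket": bucket}
        # us-east-1 is the default location and must not be passed as a
        # LocationConstraint; every other region has to be named explicitly.
        if region != "us-east-1":
            kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
        client.create_bucket(**kwargs)
        created[region] = bucket
    return created
```

Bucket names are globally unique across all AWS accounts, so the prefix has to be something nobody else is using.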

Continue reading

The value of work in the coming decades

Posted on in Society • Tagged with Digitalization, Society, automation, Second-Machine-Age, Work • 12 min read

This article attempts to understand and predict the consequences of rapid automation and computerization in the realm of human work. I base these predictions on my experience as a software engineer, while being fully aware that programming is unlikely to be eradicated by automation in the next decades. For that reason, I realize that I hold a privileged position.

In the second part of the article, I argue why work in general should be voluntary within a welfare state and why the government's main task should be to create a situation where humans don't have to work in order to obtain the bare essentials such as housing, food and healthcare.

What is work?

So what exactly is work? Work is something that sucks, right? After all, if I could spend my time without having to worry about finances, I'd rather spend my days at a tropical beach drinking some beers and doing exactly nothing.

But for how long? When would my natural urge to be productive kick in? I assume that after a couple of days, I would start a project that consumes my time. But since I don't have to worry about its financial …

Continue reading

How to dynamically change http/s proxy servers in puppeteer?

Posted on in Scraping • Tagged with Puppeteer, Proxies, Issues, Proxy-Authentication • 5 min read

A Chrome browser controlled by Puppeteer has a couple of annoying issues when used with http/s and socks proxies. The most pressing issues are the following:

  1. Dynamically changing proxy servers: Once the Chrome browser is started, it is no longer possible to change the proxy configuration. A restart is required to switch the proxy configuration.
  2. user/pass proxy authentication: The chrome browser does not support username/password proxy authentication for socks proxies. Puppeteer supports the http proxy authentication via the page.authenticate() function, but it does not have an equivalent for socks proxies.
  3. Per page proxies: Per page proxies are not supported with the chrome browser. The global proxy configuration applies to all pages and windows of a launched chrome process. It seems like the new module tries to solve this issue.

For my purposes, I don't really care about problem 3). I don't need per page proxies anyway, since the crawling software I write runs with one browser tab at a time. However, solving issue 1) is a mandatory requirement for me.

The reason is, I don't want to restart the browser each time I need to change …

Continue reading

Using http/s and socks4/5 proxies with puppeteer and chrome with squid and danted

Posted on in Scraping • Tagged with Puppeteer, Proxy, Danted, Squid, socks4, socks5 • 5 min read

This blog post demonstrates how puppeteer and the Chrome browser can be used with http/s and socks4/5 proxies. For that purpose, a proxy server is set up on Ubuntu 18.04 with squid3 and dante.
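
Chrome takes its proxy configuration from the --proxy-server command line switch, which puppeteer can forward through the args option of puppeteer.launch(). The following is a minimal sketch of how that switch could be assembled, written in Python only to keep it dependency-free; the helper name proxy_launch_args is hypothetical and not taken from the post.

```python
SUPPORTED_SCHEMES = ("http", "https", "socks4", "socks5")


def proxy_launch_args(scheme, host, port):
    """Build the Chrome command line switch for a single proxy server."""
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {scheme}")
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    # Chrome expects the form --proxy-server=<scheme>://<host>:<port>
    return [f"--proxy-server={scheme}://{host}:{port}"]


if __name__ == "__main__":
    # 3128 is squid's default port; 1080 is the conventional SOCKS port and
    # only applies if danted is configured to listen on it (an assumption).
    print(proxy_launch_args("http", "127.0.0.1", 3128))
    print(proxy_launch_args("socks5", "127.0.0.1", 1080))
```

The returned list is what would be handed to the browser at launch; the switch carries no credentials, only scheme, host and port.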

Continue reading

5 crucial tips how to survive riding a motorbike/scooter in Thailand (2019)

Posted on in Traveling • Tagged with Thailand, Scooter, Motorbike, Koh Phangan, Ko Tao, Krabi, Koh Lanta • 3 min read

Thailand is the second most deadly country when it comes to traffic accidents. 80% of all deaths involve people riding motorbikes. In this blog post, I try to share my experiences in the form of 5 survival tips in an honest way. I rode a scooter in 4 different tourist destinations in Thailand without a proper license and wasn't caught at a police checkpoint a single time.

Continue reading