Behavioral Analysis: Recording Mouse Movements and other User Interactions with JavaScript

Posted on December 24, 2020 in Programming • Tagged with Behavioral Analysis, JavaScript, Analytics, Mouse, Touch Events, Mobile, visibilitychange • 10 min read

In this blog post, I introduce a JavaScript library that tracks various interactions of website visitors. I also discuss and solve several key problems that arise when creating a JavaScript analytics application.


Continue reading

Dynamically changing proxies with puppeteer

Posted on December 20, 2020 in Security • Tagged with puppeteer, dynamic proxies, Express API • 3 min read

The Chrome browser controlled via puppeteer does not support changing proxies dynamically without restarting the browser. In this tutorial, I demonstrate how to implement this functionality with the help of a third-party npm module named proxy-chain. This module acts as an intermediate proxy.
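
A minimal sketch of the idea, not necessarily the article's implementation: the browser is launched once against a local intermediate proxy, and only the upstream proxy behind it is swapped. The helper name createUpstreamSwitch and all proxy URLs are hypothetical. The commented wiring assumes proxy-chain's Server class with its prepareRequestFunction option and Chrome's --proxy-server launch argument.

```javascript
// Hypothetical helper: holds the current upstream proxy and exposes a
// function suitable for proxy-chain's `prepareRequestFunction` option.
function createUpstreamSwitch(initialProxyUrl) {
  let current = initialProxyUrl;
  return {
    // Change the upstream proxy at runtime; no browser restart involved.
    setProxy(proxyUrl) {
      current = proxyUrl;
    },
    // Called by the intermediate proxy for each incoming request.
    // Uses the closure variable, so it can be passed around unbound.
    prepareRequestFunction() {
      return { upstreamProxyUrl: current };
    },
  };
}

// Wiring (assumes `npm install proxy-chain puppeteer`; not executed here):
//
//   const ProxyChain = require('proxy-chain');
//   const puppeteer = require('puppeteer');
//   const upstream = createUpstreamSwitch('http://user:pass@proxy-a.example:8080');
//   const server = new ProxyChain.Server({
//     port: 8000,
//     prepareRequestFunction: upstream.prepareRequestFunction,
//   });
//   server.listen(() => console.log('intermediate proxy on port 8000'));
//   const browser = await puppeteer.launch({
//     args: ['--proxy-server=http://127.0.0.1:8000'],
//   });
//   // Later: new connections are forwarded through proxy B.
//   upstream.setProxy('http://user:pass@proxy-b.example:8080');
```

The browser only ever sees the local proxy address, which is why the switch does not require a relaunch.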


Continue reading

Remove YouTube Ads from your Android Phone

Posted on December 16, 2020 in Tutorials • Tagged with YouTube, ads, adblock, ublockorigin • 3 min read

I am a heavy user of YouTube. I use it to listen to podcasts while cooking or to watch the latest documentaries before going to sleep. But lately, YouTube's extremely aggressive advertising has given me enough motivation to remove YouTube ads for good. Google overdid it. I have had enough.


Continue reading

Abusing image tags for cross domain requests

Posted on December 15, 2020 in Security • Tagged with cross-domain-requests, cors, browser • 3 min read

Cross-domain requests made with <img> tags are not bound by the same-origin policy. I will shed light on several ways in which malicious website owners can abuse cross-domain requests made with <img> and <script> tags created with JavaScript.
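
To illustrate the underlying mechanism, here is a small, benign sketch: data is encoded into the query string of an image URL, and assigning that URL to an image element makes the browser send a GET request to the other origin. The function names and the collector URL are hypothetical, and this is not taken from the article itself.

```javascript
// Hypothetical helper: encode key/value data into the query string of a URL.
function buildPixelUrl(baseUrl, params) {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Browser-only part: setting `src` triggers a cross-origin GET request.
// The page cannot read the response body, but the request itself is sent.
function sendViaImage(baseUrl, params) {
  const img = new Image();
  img.src = buildPixelUrl(baseUrl, params);
  return img;
}

// Usage in a browser (collector URL is an assumption):
// sendViaImage('https://collector.example/p.gif', { page: location.pathname });
```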


Continue reading

Reliable Cross Domain Requests when the User leaves the Page

Posted on December 10, 2020 in JavaScript • Tagged with navigator.sendBeacon(), visibilitychange, onbeforeunload, cross domain request • 7 min read

In this article, I demonstrate how to reliably send JSON data to a cross-domain server when the user is about to end or interrupt the browsing session by:

  • switching the focus to another page
  • switching from the browser to another application
  • closing the tab
  • closing the browser

or any other means of terminating or interrupting the current browsing session. Mobile devices and desktop devices should be equally supported.

Why do I have this very specific requirement?

I am in the process of developing a JavaScript analytics application and I need to record user interactions and send those user interactions from any website to my remote server.

Put differently: I need to record user interactions up until the point where the user leaves the browsing session. The natural approach for this scenario is to attach an event listener to the visibilitychange event:

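document.addEventListener("visibilitychange", function() {
  // Fires on every visibility change; act only when the page becomes hidden.
  if (document.visibilityState === 'hidden') {
    // Store an explicit string; setItem would otherwise coerce the Date implicitly.
    localStorage.setItem('triggeredOnPageClose', new Date().toISOString());
  }
});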

This event fires whenever the visibility of the page changes, for example when the user switches to another tab and the page becomes hidden.
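
As a hedged sketch of how this listener could be combined with navigator.sendBeacon() (which the post's tags mention) to ship JSON to a remote server: the endpoint URL, the helper name installFlushOnHide and the payload shape are assumptions for illustration only. The document and navigator objects are passed in as parameters so that the function can also be exercised outside a browser.

```javascript
// Assumed collector endpoint; replace with a real cross-domain URL.
const ENDPOINT = 'https://analytics.example.com/collect';

// Hypothetical helper: send the recorded data once the page becomes hidden.
function installFlushOnHide(doc, nav, getPayload, endpoint = ENDPOINT) {
  function onVisibilityChange() {
    if (doc.visibilityState !== 'hidden') return false;
    // A string body is sent as text/plain, which avoids a CORS preflight.
    const body = JSON.stringify(getPayload());
    // sendBeacon queues the request so it can complete after the page is
    // gone; it returns false if the browser could not queue it.
    return nav.sendBeacon(endpoint, body);
  }
  doc.addEventListener('visibilitychange', onVisibilityChange);
  return onVisibilityChange;
}

// Usage in a browser (recordedEvents is a hypothetical buffer):
// installFlushOnHide(document, navigator, () => ({ events: recordedEvents }));
```

Because the request is sent as text/plain, the receiving server has to parse the body as JSON itself.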

However, is the above event also triggered when the user closes the page or closes the entire browser? In order …


Continue reading

Crawling Infrastructure - Introduction

Posted on May 18, 2020 in Crawling • Tagged with Crawling, Distributed Computing, Cloud, Web Bots • 6 min read

In this blog article, I will introduce my most recent project: a distributed crawling infrastructure that can crawl any website with either a low-level HTTP library or a fully fledged Chrome browser configured to evade bot detection attempts.

This introduction is divided into three distinct blog articles, because the topic is too large to cover in a single one.

  1. (This article) The first part of the series motivates the development of the crawling infrastructure, introduces the architecture of the software and demonstrates how the crawling backend works at a high level.
  2. The second part covers the installation of the distributed crawling infrastructure within the AWS cloud infrastructure and tests the freshly deployed stack with a test crawl task.
  3. In the third part of this tutorial series, a crawl task with the top 10,000 websites of the world is created. The downloaded HTML documents are stored in S3. For the top 10,000 websites, we use the scientific Tranco list: A Research-Oriented Top Sites Ranking Hardened Against Manipulation. As a concluding task, we run business logic on the stored HTML files. For example, we extract all URLs from the HTML documents or we run analytics on the <meta> tags …

Continue reading