Avoid Puppeteer or Playwright for Web Scraping

Posted on May 20, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright, CDP • 10 min read

In this blog post I explain why it is best to avoid puppeteer and playwright for web scraping.


Continue reading

7 Common Mistakes in Professional Scraping

Posted on March 01, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright • 13 min read

In this blog post, I talk about my several years of experience with web scraping and the common mistakes I made along the way. The more I dive into web scraping, the more I realize how easy it is to make wrong decisions when scraping a site. For that reason, I compiled a list of seven common web scraping mistakes.


Continue reading

Browser Red Pills: Why are you browsing my website from AWS Lambda?

Posted on January 17, 2021 in Security • Tagged with red pill, Bot, Advanced Bots, JavaScript, Puppeteer, Playwright • 6 min read

Advanced bots use modern browsers and automation frameworks such as puppeteer and playwright. It is becoming increasingly hard to distinguish bots from real human traffic, so new detection methods are required.


Continue reading

Dynamically changing proxies with puppeteer

Posted on December 20, 2020 in Security • Tagged with puppeteer, dynamic proxies, Express API • 3 min read

The chrome browser controlled via puppeteer doesn't support changing proxies dynamically without restarting the browser. In this tutorial, I demonstrate how to implement this functionality with the help of a third-party npm module named proxy-chain. This module acts as an intermediate proxy.


Continue reading

How to dynamically change http/s proxy servers in puppeteer?

Posted on February 14, 2020 in Scraping • Tagged with Puppeteer, Proxies, Issues, Proxy-Authentication • 5 min read

Find the updated blog post here.

Using http/s and socks proxies with the chrome browser controlled by puppeteer comes with several annoying issues. The most pressing ones are the following:

  1. Dynamically changing proxy servers: Once the chrome browser is started, it is not possible to change the proxy configuration any longer. A restart is required to switch proxy configuration.
  2. user/pass proxy authentication: The chrome browser does not support username/password proxy authentication for socks proxies. Puppeteer supports the http proxy authentication via the page.authenticate() function, but it does not have an equivalent for socks proxies.
  3. Per page proxies: Per page proxies are not supported with the chrome browser. The global proxy configuration applies to all pages and windows of a launched chrome process. It seems like the new module tries to solve this issue.
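Issues 1 and 2 both follow from the proxy being fixed when the browser process starts. As a minimal sketch (the helper name `proxyLaunchConfig` and its return shape are hypothetical and not taken from the post), the following turns a proxy URL into Chrome's `--proxy-server` command-line flag and keeps any credentials separate, rejecting credentials for socks proxies since the browser has no way to use them:

```typescript
// Hypothetical helper for illustration only; names and shapes are assumptions.
interface ProxyLaunchConfig {
  // Extra command-line flags for the chrome process.
  args: string[];
  // Only set for http/s proxies; meant to be handed to page.authenticate().
  credentials?: { username: string; password: string };
}

function proxyLaunchConfig(proxyUrl: string): ProxyLaunchConfig {
  const url = new URL(proxyUrl);
  const scheme = url.protocol.slice(0, -1); // "http:" -> "http"

  // The flag carries scheme, host and port only; credentials are kept out of it.
  const args = [`--proxy-server=${scheme}://${url.host}`];

  if (url.username === "") {
    return { args };
  }
  if (scheme.startsWith("socks")) {
    // Matches issue 2 above: no username/password authentication for socks proxies.
    throw new Error("username/password authentication is not available for socks proxies");
  }
  return {
    args,
    credentials: {
      username: decodeURIComponent(url.username),
      password: decodeURIComponent(url.password),
    },
  };
}
```

The returned `args` would go into `puppeteer.launch({ args })`, and `credentials`, when present, into `page.authenticate()`. Because the flag is read at launch, switching to another proxy this way still means starting a new browser process.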

For my purposes, I don't really care about problem 3). I don't need per page proxies anyway, since the crawling software I write runs with one browser tab at a time. However, solving issue 1) is mandatory for me.

The reason is, I don't want to restart the browser …


Continue reading

Using http/s and socks4/5 proxies with puppeteer and chrome with squid and danted

Posted on February 12, 2020 in Scraping • Tagged with Puppeteer, Proxy, Danted, Squid, socks4, socks5 • 5 min read

This blog post demonstrates how puppeteer and the chrome browser can be used with http/s and socks4/5 proxies. For that purpose, a proxy server is set up on Ubuntu 18.04 with squid3 and dante.


Continue reading