Detecting scraping services

Posted on March 11, 2021 in Scraping • Tagged with detecting, scraping, security, fingerprint • 13 min read

In this blog post I will demonstrate how it is possible to detect several scraping services: luminati.io, ScrapingBee, scraperapi.com, scrapingrobot.com, scrapfly.io.


Continue reading

7 Common Mistakes in Professional Scraping

Posted on March 01, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright • 13 min read

In this blog post, I talk about my several years of experience with web scraping and the common mistakes I made along the way. The more I dive into web scraping, the more I realize how easy it is to make wrong decisions when scraping a site. For that reason, I compiled a list of seven common mistakes in web scraping.


Continue reading

How to dynamically change http/s proxy servers in puppeteer?

Posted on February 14, 2020 in Scraping • Tagged with Puppeteer, Proxies, Issues, Proxy-Authentication • 5 min read

Find the updated blog post here.

Using http/s and socks proxies with a chrome browser controlled by puppeteer comes with a couple of annoying issues. The most pressing ones are the following:

  1. Dynamically changing proxy servers: Once the chrome browser is started, it is not possible to change the proxy configuration any longer. A restart is required to switch proxy configuration.
  2. user/pass proxy authentication: The chrome browser does not support username/password proxy authentication for socks proxies. Puppeteer supports the http proxy authentication via the page.authenticate() function, but it does not have an equivalent for socks proxies.
  3. Per page proxies: Per page proxies are not supported with the chrome browser. The global proxy configuration applies to all pages and windows of a launched chrome process. It seems like the new module tries to solve this issue.
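Issues 1 and 2 can be illustrated with a minimal sketch. It assumes the proxy is passed to chrome through the `--proxy-server` launch flag and that http proxy credentials are supplied afterwards via `page.authenticate()`. The helper names `proxyLaunchOptions` and `launchWithProxy` and the host `proxy.example` are made up for illustration and are not part of puppeteer.

```javascript
// Sketch only: helper names and proxy hosts are hypothetical.
// Chrome takes its proxy from a launch flag, so the value is fixed for the
// lifetime of the browser process (issue 1).
function proxyLaunchOptions(proxyUrl) {
  const u = new URL(proxyUrl);
  const isSocks = u.protocol.startsWith('socks');
  const hasCredentials = u.username !== '';

  // Issue 2: there is no page.authenticate() equivalent for socks proxies.
  if (isSocks && hasCredentials) {
    throw new Error('user/pass authentication is not supported for socks proxies');
  }

  return {
    // The flag carries scheme://host:port only, never the credentials.
    args: [`--proxy-server=${u.protocol}//${u.host}`],
    credentials: hasCredentials
      ? {
          username: decodeURIComponent(u.username),
          password: decodeURIComponent(u.password),
        }
      : null,
  };
}

// `puppeteer` is passed in by the caller, e.g. require('puppeteer').
async function launchWithProxy(puppeteer, proxyUrl) {
  const { args, credentials } = proxyLaunchOptions(proxyUrl);
  const browser = await puppeteer.launch({ args });
  const page = await browser.newPage();
  if (credentials) {
    // Answers the http proxy's authentication challenge for this page.
    await page.authenticate(credentials);
  }
  return { browser, page };
}

// Switching to another proxy means closing the browser and launching again:
//   const first = await launchWithProxy(puppeteer, 'http://user:pass@proxy.example:3128');
//   await first.browser.close();
//   const second = await launchWithProxy(puppeteer, 'socks5://127.0.0.1:1080');
```

The restart in the last three commented lines is exactly the cost that issue 1 describes.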

For my purposes, I don't really care about problem 3). I don't need per page proxies anyway, since the crawling software I write runs with one browser tab at a time. However, solving issue 1) is mandatory for me.

The reason is, I don't want to restart the browser …


Continue reading

Using http/s and socks4/5 proxies with puppeteer and chrome with squid and danted

Posted on February 12, 2020 in Scraping • Tagged with Puppeteer, Proxy, Danted, Squid, socks4, socks5 • 5 min read

This blog post demonstrates how puppeteer and the chrome browser can be used with http/s and socks4/5 proxies. For that purpose, a proxy server is set up on Ubuntu 18.04 with squid3 and dante.


Continue reading

Scraping 1 million keywords on the Google Search Engine

Posted on September 17, 2019 in Scraping • Tagged with puppeteer, web scraping, headless chrome, 1 million, queue, architecture • 5 min read

Scraping one million keywords is not an easy task. There are proxy problems, big data problems and reliability issues. This blog post shares the most valuable insights.


Continue reading

Scraping with puppeteer and headless chrome deployed to AWS Lambda

Posted on August 31, 2019 in Scraping • Tagged with puppeteer, web scraping, AWS lambda, headless chrome • 4 min read

In this blog post, we demonstrate how a web scraping function is deployed to the AWS cloud with puppeteer and headless chrome.


Continue reading