incolumitas.com

Avoid Puppeteer or Playwright for Web Scraping

Posted on May 20, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright, CDP • 10 min read

In this blog post I explain why it is best to avoid puppeteer and playwright for web scraping.

Detecting scraping services

Posted on March 11, 2021 in Scraping • Tagged with detecting, scraping, security, fingerprint • 13 min read

In this blog post I will demonstrate how it is possible to detect several scraping services: luminati.io, ScrapingBee, scraperapi.com, scrapingrobot.com, scrapfly.io.

7 Common Mistakes in Professional Scraping

Posted on March 01, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright • 13 min read

In this blog post, I talk about my several years of experience with web scraping and the common mistakes I made along the way. The more I dive into web scraping, the more I realize how easy it is to make wrong decisions when scraping a site. For that reason, I compiled a list of seven common web scraping mistakes.

How to dynamically change http/s proxy servers in puppeteer?

Posted on February 14, 2020 in Scraping • Tagged with Puppeteer, Proxies, Issues, Proxy-Authentication • 5 min read

Find the updated blog post here.

The chrome browser controlled by puppeteer has a couple of annoying issues when used with http/s proxies and socks proxies. The most pressing issues are the following:

1) Dynamically changing proxy servers: Once the chrome browser is started, it is not possible to change the proxy configuration any longer. A restart is required to switch proxy configuration.
2) user/pass proxy authentication: The chrome browser does not support username/password proxy authentication for socks proxies. Puppeteer supports the http proxy authentication via the page.authenticate() function, but it does not have an equivalent for socks proxies.
3) Per page proxies: Per page proxies are not supported with the chrome browser. The global proxy configuration applies to all pages and windows of a launched chrome process. It seems like the new module tries to solve this issue.
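To make the constraints above concrete, here is a minimal sketch of how a proxy is typically wired into a puppeteer launch. The names ProxyConfig, proxyLaunchArgs and proxyCredentials are hypothetical helpers made up for this illustration; only chrome's --proxy-server flag and puppeteer's page.authenticate() are real. The puppeteer calls are shown as comments only, so the snippet runs without a browser.

```typescript
// Hypothetical helper types and names for illustration. Only the
// --proxy-server flag (chrome) and page.authenticate() (puppeteer) are real.
type ProxyProtocol = "http" | "https" | "socks4" | "socks5";

interface ProxyConfig {
  protocol: ProxyProtocol;
  host: string;
  port: number;
  username?: string;
  password?: string;
}

// Build the chrome command-line arguments for a single proxy.
// The proxy is fixed at launch time and applies to every page of the process.
function proxyLaunchArgs(proxy: ProxyConfig): string[] {
  return [`--proxy-server=${proxy.protocol}://${proxy.host}:${proxy.port}`];
}

// Return credentials only for http/s proxies, because page.authenticate()
// has no equivalent for socks proxies.
function proxyCredentials(
  proxy: ProxyConfig
): { username: string; password: string } | null {
  const isHttp = proxy.protocol === "http" || proxy.protocol === "https";
  if (!isHttp || proxy.username === undefined || proxy.password === undefined) {
    return null;
  }
  return { username: proxy.username, password: proxy.password };
}

// Intended usage with puppeteer (not executed here):
//   const browser = await puppeteer.launch({ args: proxyLaunchArgs(proxy) });
//   const page = await browser.newPage();
//   const creds = proxyCredentials(proxy);
//   if (creds) await page.authenticate(creds);
```

Because the flag is passed once at launch, switching to another proxy with this approach means launching a new browser process.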

For my purposes, I don't really care about problem 3). I don't need per page proxies anyway, since the crawling software I write runs with one browser tab at a time. However, solving issue 1) is a mandatory requirement for me.

The reason is, I don't want to restart the browser …

Using http/s and socks4/5 proxies with puppeteer and chrome with squid and danted

Posted on February 12, 2020 in Scraping • Tagged with Puppeteer, Proxy, Danted, Squid, socks4, socks5 • 5 min read

This blog post demonstrates how puppeteer and the chrome browser can be used with http/s and socks4/5 proxies. For that purpose, a proxy server is set up on Ubuntu 18.04 with squid3 and dante.

Scraping 1 million keywords on the Google Search Engine

Posted on September 17, 2019 in Scraping • Tagged with puppeteer, web scraping, headless chrome, 1 million, queue, architecture • 5 min read

Scraping one million keywords is not an easy task. There are proxy problems, big data problems and reliability issues. In this blog post, I share the most valuable insights.
