In this blog post I explain why it is best to avoid puppeteer and playwright for web scraping.
Continue reading
Posted on May 20, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright, CDP • 10 min read
In this blog post I explain why it is best to avoid puppeteer and playwright for web scraping.
Posted on March 01, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright • 13 min read
In this blog post, I am talking about my several year long experience with web scraping and common mistakes I made along the road. The more I dive into web scraping, the more I realize how easy it is to take wrong decisions when scraping a site. For that reason, I compiled a list of seven common mistakes in regard to web scraping.
Posted on January 17, 2021 in Security • Tagged with red pill, Bot, Advanced Bots, JavaScript, Puppeteer, Playwright • 6 min read
Advanced bots use modern browsers and automation frameworks such as puppeteer and playwright. It becomes increasingly hard to distinguish bots from real human traffic, therefore, new methods are required.
Posted on December 20, 2020 in Security • Tagged with puppeteer, dynamic proxies, Express API • 3 min read
The chrome browser controlled via puppeteer doesn't support the dynamic change of proxies without restarting the browser. In this tutorial, I demonstrate how to implement this functionality with the help of a third party npm module named proxy-chain
. This module acts as an intermediate proxy.
Posted on February 14, 2020 in Scraping • Tagged with Puppeteer, Proxies, Issues, Proxy-Authentication • 5 min read
Find the updated blog post here.
Chrome/Puppeteer has a couple of annoying issues when trying to use http/s proxies and socks proxies with the chrome browser controlled by puppeteer. The most pressing issues are the following:
page.authenticate()
function, but it does not have an equivalent for socks proxies.For my purposes, I don't really care about problem 3). I don't need per page proxies anyway, since the crawling software I write runs with one browser tab at the time. However, issue 1) is a mandatory requirement for me and thus needs to be solved.
The reason is, I don't want to restart the browser …
Posted on February 12, 2020 in Scraping • Tagged with Puppeteer, Proxy, Danted, Squid, socks4, socks5 • 5 min read
This blogs post demonstrates how puppeteer and the chrome browser can be used with http/s and socks4/5 proxies. For that reason, a proxy server is setup on Ubuntu 18.04 with squid3 and dante.