Posted on May 20, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright, CDP • 10 min read
In this blog post I explain why it is best to avoid puppeteer and playwright for web scraping.
Posted on March 11, 2021 in Scraping • Tagged with detecting, scraping, security, fingerprint • 13 min read
In this blog post I will demonstrate how it is possible to detect several scraping services: luminati.io, ScrapingBee, scraperapi.com, scrapingrobot.com, scrapfly.io.
Posted on March 01, 2021 in Scraping • Tagged with web scraping, crawling, puppeteer, playwright • 13 min read
In this blog post, I talk about my several years of experience with web scraping and the common mistakes I made along the way. The more I dive into web scraping, the more I realize how easy it is to make wrong decisions when scraping a site. For that reason, I compiled a list of seven common web scraping mistakes.
Posted on February 14, 2020 in Scraping • Tagged with Puppeteer, Proxies, Issues, Proxy-Authentication • 5 min read
Find the updated blog post here.
Chrome/Puppeteer has a couple of annoying issues when trying to use http/s proxies and socks proxies with the chrome browser controlled by puppeteer. The most pressing issues are the following:
page.authenticate() function, but it does not have an equivalent for socks proxies.
For my purposes, I don't really care about problem 3). I don't need per-page proxies anyway, since the crawling software I write runs with one browser tab at a time. However, solving issue 1) is a mandatory requirement for me.
The reason is, I don't want to restart the browser …
Posted on February 12, 2020 in Scraping • Tagged with Puppeteer, Proxy, Danted, Squid, socks4, socks5 • 5 min read
This blog post demonstrates how puppeteer and the chrome browser can be used with http/s and socks4/5 proxies. For that purpose, a proxy server is set up on Ubuntu 18.04 with squid3 and dante.
Posted on September 17, 2019 in Scraping • Tagged with puppeteer, web scraping, headless chrome, 1 million, queue, architecture • 5 min read
Scraping one million keywords is not an easy task. There are proxy problems, big data problems and reliability issues. In this blog post, I share the most valuable insights.