For what reason are proxies used when crawling/scraping with puppeteer?
When we are crawling different websites, it's usually a good idea to change the browsing fingerprint by re-routing the TCP traffic through multiple distinct hops.
By doing so, the crawled website cannot rate limit the clients requests by IP address blacklisting. Put differently, by switching proxy servers, the detection rate can be reduced by a wide margin.
But changing your IP address is usually not enough. It's also smart to alter your browser fingerprint in other ways, such as changing the user agent, cleaning session data such as cookies and cached objects and modifying accept-language headers or changing the browser viewpoint.
Scraping and crawling in the year 2020 is usually done with a fully functional browser in order to prevent blocking attempts that build on the requirement to be able to execute javascript. Puppeteer in combination with headless chromium is often used for that matter.
In this blog post, we will learn how Puppeteer/Chromium can be used with
- Http/Https proxy support with and without
username:password
authentication. The proxy server is build with squid3. - Socks proxy support with and without
username:password
authentication. The proxy server is setup with dante.
For this reason, we will also show instructions how to create your own http/s proxy server and socks proxy server on Ubuntu 18.04.
Proxy Server Setup
All proxy server software is going to be installed on a Ubuntu 18.04 server. The only requirement is that the server should have a static, public IP address.
- Client IP address (the computer that uses the proxy) = 1.1.1.1
- Proxy Server IP address (the server that is the proxy) = 100.100.100.100
Creating a http/s proxy server with squid3 on Ubuntu 18.04
Squid is a powerful and mature http/s proxy server and caching software. For our purposes, we are solely interested in the proxying functionality. The configuration is based on a stackoverflow answer that explains in depth how to setup squid to work as an anonymous http/s proxy.
First of all, we need to install the required software packages for squid3
to work:
sudo apt-get update
sudo apt-get install squid3
sudo apt-get install apache2-utils
Then we create the password file that squid3
is going to use for authentication. This command requires you to enter a password of your choice. We will use the credentials proxyuser:proxypass
.
sudo htpasswd -c /etc/squid/.squid_users proxyuser
You can verify that the password works by issuing the following command and enter proxyuser proxypass
and pressing enter. You should see OK
as output.
/usr/lib/squid3/basic_ncsa_auth /etc/squid/.squid_users
Then we configure the configuration file /etc/squid/squid.conf
as follows. Please replace the dummy IP address 100.100.100.100 with the ip address of your own server.
# http_port: specifies the proxy listen port. This is required
http_port 3128
# dns_v4_first on: effectively turns off IPv6 DNS. Without this your proxy may run very slowly.
dns_v4_first on
# cache deny all: stops the proxy caching pages
cache deny all
# forwarded_for delete: remove the forwarded_for http header which would expose your source to the destination
forwarded_for delete
# tcp_outgoing_address: Set this to the address of your server. You can find the address with the command "ip a"
tcp_outgoing_address 100.100.100.100
# via off: removes more headers which would expose your source
via off
# auth_param: defines your the location of your basic_ncsa_auth and password file you created. Note you may need to check the location of basic_ncsa_auth.
auth_param basic program /usr/lib/squid3/basic_ncsa_auth /etc/squid/.squid_users
auth_param basic realm proxy
# acl authenticated: creates an access control list for user authenticated by the password store
acl authenticated proxy_auth REQUIRED
# http_access allow authenticated: allow user to access the proxy if they have been authenticated by password
http_access allow authenticated
# http_access deny all: if you have not been authenticated by password, you're not coming in
http_access deny all
How we need to open the port on the standard firewall of Ubuntu with the command:
ufw allow 3128
After having saved the file, restart the squid3
service with service squid restart
.
Test that everything works fine with curl:
curl --proxy http://proxyuser:proxypass@100.100.100.100:3128 http://ipinfo.io/json
The above command should show you the IP details of the proxy server.
Configuring a socks proxy server with danted
In order to create a socks4/socks5 server, we will use dante. dante is the name of the software, danted
stands for dante daemon.
First we show how to configure danted
to create a socks4 proxy server, then how to setup socks5. The difference between socks4 and socks5 is essentially that socks5 supports username:password
authentication.
In order to begin, you need to install a recent danted
server. This has been explained countless times: Here are decent installation instructions.
After you installed danted
, add this configuration to /etc/danted.conf
:
# /etc/danted.conf
logoutput: syslog
user.privileged: root
user.unprivileged: nobody
# The listening network interface or address.
internal: 0.0.0.0 port=53425
# The proxying network interface or address.
external: eth0
# socks-rules determine what is proxied through the external interface.
# The default of "none" permits anonymous access.
socksmethod: username
# client-rules determine who can connect to the internal interface.
# The default of "none" permits anonymous access.
clientmethod: none
client pass {
from: 1.1.1.1/16 to: 0.0.0.0/0
log: connect disconnect error
}
socks pass {
from: 0.0.0.0/0 to: 0.0.0.0/0
log: connect disconnect error
}
In my case, I needed to change the external network interface to ens3. You can look it up with the ifconfig
command.
danted
uses the credentials username and password of your linux users. Therefore, we create a user with:
# add user
useradd -r -s /bin/false proxyuser
# set password
passwd proxyuser
Now open the port in the firewall with:
ufw allow 53425
Finally, we restart danted
with the command service danted restart
.
Confirm that the socks server is working by testing it with curl:
curl --proxy socks5://proxyuser:proxypass@100.100.100.100:53425 http://ipinfo.io/json
If we do not want to use socks authentication, we edit /etc/danted.conf
to alter the line from socksmethod: username
to socksmethod: none
and issue the command service danted restart
.
Then we can test the socks server without auth (which is socks4) with:
curl --proxy socks4://100.100.100.100:53425 http://ipinfo.io/json
Puppeteer with proxies
Now we can see how to use the http/s and socks proxy server that we configured in the previous steps with a fully functional browser controlled via puppeteer.
Puppeteer with http/s proxy
You can test puppeteer with a http proxy by launching the following node script. The output should show IP address details of the proxy.
const puppeteer = require('puppeteer');
(async() => {
const proxyUrl = 'http://100.100.100.100:3128';
const browser = await puppeteer.launch({
args: [`--proxy-server=${proxyUrl}`],
});
const page = await browser.newPage();
await page.authenticate({
username: 'proxyuser',
password: 'proxypass'
});
await page.goto('https://ipinfo.io/json');
await page.screenshot({ path: 'ipinfo.png', fullPage: true });
console.log(await page.content());
await browser.close();
})();
Puppeteer with socks4 proxy
You can test puppeteer in combination with a socks proxy server by launching the following node program. The output should show IP address details of the proxy.
const puppeteer = require('puppeteer');
(async() => {
const proxyUrl = 'socks://100.100.100.100:53425';
const browser = await puppeteer.launch({
args: [`--proxy-server=${proxyUrl}`],
});
const page = await browser.newPage();
await page.goto('https://ipinfo.io/json');
await page.screenshot({ path: 'ipinfo.png', fullPage: true });
console.log(await page.content());
await browser.close();
})();
Tear down the proxy servers
To stop the proxy servers, execute the following commands on your servers:
systemctl stop danted
systemctl stop squid
Known Limitations
Unfortunately, it is not possible to use puppeteer/chromium with a socks5 proxy. The chrome browser does not support socks with authentication.
How to address the problem of switching proxies on the fly without restarting puppeteer/chromium will be handled in the next blog post.