This approach has the benefit of simplicity: handling login functionality and complex browsing actions is much easier when you program a real web browser. The library that controls headless Chrome is called puppeteer.
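To illustrate what such a browser-driven search looks like, here is a minimal sketch. The function name, the keyword handling, and the link selector are illustrative assumptions, not code taken from the repository, and it assumes the full puppeteer package (which bundles Chromium) is installed:

```javascript
// Minimal sketch of a browser-driven Google search with puppeteer.
// searchGoogle and the 'a' selector are illustrative, not from the repo.

// Building the search URL is plain string work and needs no browser.
function buildSearchUrl(keyword) {
  return 'https://www.google.com/search?q=' + encodeURIComponent(keyword);
}

async function searchGoogle(keyword) {
  // puppeteer is required lazily so buildSearchUrl works without it.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(buildSearchUrl(keyword));
  // Collect all link targets from the result page.
  const links = await page.$$eval('a', anchors => anchors.map(a => a.href));
  await browser.close();
  return links;
}

module.exports = { buildSearchUrl, searchGoogle };
```

On AWS Lambda, the same logic runs with puppeteer-core pointed at a Chromium binary bundled for the Lambda environment instead of the full puppeteer download.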
All code can be found in the respective GitHub repository.
Setting up the project
The project depends on several npm packages, among them puppeteer-core. You can install them with the npm node package manager. Alternatively, you can clone the following git repository, where I have already set everything up. You also need to install the serverless package globally. You can read the instructions here on how to do so.
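For orientation, the dependency section of the project's package.json might look roughly like this. The version numbers are illustrative assumptions, not copied from the repository; projects of this kind typically also use chrome-aws-lambda, which ships a Chromium build sized for the Lambda environment:

```json
{
  "name": "aws-scraper-example",
  "dependencies": {
    "puppeteer-core": "^2.0.0",
    "chrome-aws-lambda": "^2.0.0"
  }
}
```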
After you have cloned the repository with the command
git clone firstname.lastname@example.org:NikolaiT/aws-scraper-example.git, it is time to set up your AWS credentials. Enter them into the
.env file, which has the following boilerplate structure:
export AWS_ACCESS_KEY=
export AWS_SECRET_KEY=
export AWS_REGION=us-east-1
export AWS_PROFILE=
export AWS_FUNCTION_URN=
Here you need to enter the access key and secret key credentials that you received when you created your AWS account. After that, you are all set to deploy the scraper to the AWS Lambda cloud:
source .env
serverless deploy
After a successful deployment, serverless outputs a message such as the following:
$ serverless deploy
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service google-aws-scraper.zip file to S3 (39.4 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
...............
Serverless: Stack update finished...
Service Information
service: google-aws-scraper
stage: dev
region: us-east-1
stack: google-aws-scraper-dev
resources: 5
api keys:
  None
endpoints:
  None
functions:
  google-aws-scraper: google-aws-scraper-dev-google-aws-scraper
layers:
  None
Serverless: Run the "serverless" command to setup monitoring, troubleshooting and testing
Now you need to update the .env file with the correct function name of your deployed scraper. You can look up the function name in the AWS console, in the Lambda tab of the correct region.
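For reference, the deployed function name follows the serverless naming pattern service-stage-function, which matches the deployment output above (google-aws-scraper-dev-google-aws-scraper). A sketch of what the corresponding serverless.yml might contain follows; the handler path, memory size, and timeout are assumptions, not values copied from the repository:

```yaml
service: google-aws-scraper

provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1
  stage: dev

functions:
  google-aws-scraper:
    handler: handler.scraper   # hypothetical handler module and function
    memorySize: 1536           # headless Chrome needs generous memory
    timeout: 120
```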
Testing the cloud scraper
After the successful Lambda function deployment, we can test whether it is possible to search Google with our headless Chrome function living in the AWS cloud. For this task, we create a
test.js script that invokes the AWS function.
For the test script to work properly, you first need to source the
.env file with the correct parameters. After that, execute the test script with node test.js, and the four keywords will be searched on Google via the AWS Lambda cloud.
This command should output four arrays of URLs obtained from the Google SERPs.
In this tutorial, we learned how to deploy a scraping function to the AWS Lambda cloud. One big advantage of the cloud is that you only pay for the computing time that your function actually uses. The free tier includes a large computing volume, so you can develop whatever your scraping heart desires.
If you want to focus only on the scraping logic and don't want to hassle with scalability issues, intelligent request retries, and infrastructure problems, you can directly use the web scraper service on scrapeulous.com. It comes with many practical examples of how to quickly create a scraper or crawler for any website on the Internet.