In this tutorial we are going to show users how to use GoogleScraper.
The best way to learn a new tool is to use it in a real world case study. And because GoogleScraper allows you to query search engines automatically, we are going to scrape 1000 keywords with GoogleScraper.
Let's assume that we want to create a business in the USA. We do not know in which industry yet and we do not know in which city. Therefore we want to scrape some industries. I will take:
- coffee shop
- pizza place
- burger place
- sea food restaurant
- pastry shop
- shoes repair
- jeans repair
- smartphone repair
- wine shop
- tea shop
Because we do not now in which city we want to open our shop, we are going to combine the above shop places with the 100 largest cities in the US. Here I found a list of the Largest 1000 Cities in America. This will yield 1000 keyword combinations. You can see the final keyword file here: keyword file.
Installation of GoogleScraper
First of all you need to install Python3. I personally use the Anaconda Python distribution, because it ships with many precompiled scientific packages that I need to use in my everyday work. But you can install Python3 also directly from the python website.
Now I assume that you have
python3 installed. In my case I have:
$ python3 --version Python 3.7.0
Now you need
pip, the package manager of python. It usually comes installed with python already. When you have pip, you can install
pip install virtualenv.
Now that you have virtualenv, go to your project directory and create a virtual environment with
In my case it looks like this:
nikolai@nikolai:~/projects/work/google-scraper-tutorial$ virtualenv env Using base prefix '/home/nikolai/anaconda3' New python executable in /home/nikolai/projects/work/google-scraper-tutorial/env/bin/python copying /home/nikolai/anaconda3/bin/python => /home/nikolai/projects/work/google-scraper-tutorial/env/bin/python Installing setuptools, pip, wheel...done.
Now we can activate the virtual environment and install GoogleScraper from the github repository:
# Activate environment nikolai@nikolai:~/projects/work/google-scraper-tutorial$ source env/bin/activate (env) nikolai@nikolai:~/projects/work/google-scraper-tutorial$ # install GoogleScraper pip install --ignore-installed git+git://github.com/NikolaiT/GoogleScraper/
Now you should be all set. When everything worked smoothely, you should have a similar output to this:
$ GoogleScraper --version 0.2.2
Preparation and Scraping Options
We will use the Google search engine soley. We will request 10 results per page and only 1 page for each query.
We are going to use 10 simultaenous browser instances in selenium mode. Therefore each browser needs to scrape 100 keywords.
We are going to use one IP address to test how far we can reach with GoogleScraper with a single IP address.
As output we want a
We enable caching such that we don't have to start the scraping process from scratch if something fails.
We will pass all configuration via a configuration file to GoogleScraper. We can create such a configuration file with the following command:
GoogleScraper --view-config > config.py
Now the file
config.py is our configuration file.
In this file set the following variables:
google_selenium_search_settings = False google_selenium_manual_settings = False do_caching = True do_sleep = True
Now you are ready to scrape. Enter the following command in your the terminal:
GoogleScraper --config-file config.py -m selenium --sel-browser chrome --browser-mode normal --keyword-file list.txt -o results.json -z10
This will start 10 browser windows that begin to scrape the keywords in the provided file.
After about 22 minutes of scraping, I got the following results in a json file. As you can see there are 1000 results with 10 results per page and all links and snippets and stuff. You can now analize this data and make marketing decision based on it.
Here is a short video of how the scraping looks like: Video of scraping.