incolumitas.com

Scraping Amazon Reviews using Headless Chrome Browser and Python3

Posted on October 03, 2018 in Scraping • Tagged with Amazon, Reviews, Scraping • 2 min read

Tutorial that teaches how to scrape Amazon reviews

GoogleScraper Tutorial - How to scrape 1000 keywords with Google

Posted on September 05, 2018 in GoogleScraper • Tagged with tutorial, GoogleScraper, scraping • 3 min read

Tutorial that teaches how to use GoogleScraper to scrape 1000 keywords with 10 Selenium browsers.

Scraping and Extracting Links from any major Search Engine like Google, Yandex, Baidu, Bing and Duckduckgo

Posted on November 12, 2014 in Meta • Tagged with Scraping, Baidu, Extracting, Google, Programming, Python, Searchengine, Bing, Meta • 7 min read

Prelude

It's been quite a while since I last worked on my projects. But recently I had some motivation and energy left, which is quite nice considering my full-time university week and a programming job on the side.

I have a little project on GitHub that I have worked on every now and again over the last year or so. Recently it got a little bit bigger (I have 115 GitHub stars now, I would never have imagined that I'd achieve this) and I receive up to 2 mails with job offers every week (sorry if I cannot accept every request :( ).

But unfortunately my progress with this project is not as good as I want it to be (that's probably quite a common feeling among us programmers). It's not a problem of missing ideas and features that I want to implement; the hard part is extending the project without blowing up legacy code. GoogleScraper has grown organically and I am wasting a lot of time trying to understand my old code. Mostly it's much better to just erase whole modules and reimplement things from scratch. This is essentially what I did with the parsing module.

Parsing SERP pages with many search engines

So I …

GoogleScraper.py - A simple python module to parse google search results.

Posted on January 06, 2013 in Programming • Tagged with Google, Scraping, Programming, Security • 14 min read

UPDATE on 18th February 2014:

This Python module now has its own GitHub repository!

The plugin can extract

All links
Link titles
The description/caption below the links

and has the following features:

Advanced proxy support for SOCKS4/4a/5 and HTTP PROXY
Multithreading
XPATH parsing
Supports almost all search parameters
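To give a rough idea of what XPath-based extraction of links, titles and captions looks like, here is a minimal sketch. It is not the module's actual code: the markup, the class names and the parse_results helper are made up for illustration, and it uses the limited XPath support of the standard library's xml.etree.ElementTree on a well-formed snippet. Real result pages are messy HTML and need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a results page.
# Real search engine pages use different (and frequently changing) markup.
SAMPLE = """
<div id="results">
  <div class="result">
    <a href="https://example.com/a">First title</a>
    <span class="snippet">First caption.</span>
  </div>
  <div class="result">
    <a href="https://example.com/b">Second title</a>
    <span class="snippet">Second caption.</span>
  </div>
</div>
"""


def parse_results(markup):
    """Return one dict per result with its link, title and snippet."""
    root = ET.fromstring(markup.strip())
    results = []
    # ElementTree understands a subset of XPath, including attribute predicates.
    for node in root.findall(".//div[@class='result']"):
        link = node.find("./a")
        if link is None:
            continue  # skip result blocks without a link
        snippet = node.find("./span[@class='snippet']")
        results.append({
            "link": link.get("href"),
            "title": link.text,
            "snippet": snippet.text if snippet is not None else None,
        })
    return results


if __name__ == "__main__":
    for result in parse_results(SAMPLE):
        print(result["link"], "|", result["title"], "|", result["snippet"])
```

The point of going through XPath is that the selectors live in one place, so when a search engine changes its markup only the expressions need to be touched.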

Please note that this is by no means a final version! Heavy structural changes will follow in the near future (I'll experiment with asynchronous networking, for instance). But I will always host a working version on this site, with instructions on how to use it, so that visitors can always use the script!

1. Edit (07.01.2013):

Using requests instead of urllib
Added random User Agents for every new search.
Cleaned the code
Implemented foundation to combine with proxychains
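The idea behind the random User Agents can be sketched roughly like this. It is only an illustration under assumptions, not the code from the module: the USER_AGENTS strings and the random_headers and fetch helpers are invented for this example. The only third-party call used is requests.get with its headers and timeout arguments, and it is imported lazily so the header logic can be tried without requests installed.

```python
import random

# Illustrative User-Agent strings (assumed for this sketch, not the module's list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20100101 Firefox/17.0",
]


def random_headers(rng=random):
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def fetch(url, timeout=10):
    """Fetch a URL with a fresh random User-Agent on every call."""
    import requests  # third-party; imported here so the rest runs without it

    response = requests.get(url, headers=random_headers(), timeout=timeout)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Each call picks a new User-Agent, so consecutive searches look different.
    print(random_headers())
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which is handy when debugging.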

Original Blog Post

Sample output after searching for 'cats are not cute' (sorry) with 100 results per page on 3 ascending pages: results.txt

I have always needed a fast and reliable Python module to query the Google search engine. The Google API is rubbish, because it gives you at most 36 results. This is completely unacceptable!

So, I looked further and found http://code.google …
