Scraping and Extracting Links from any major Search Engine like Google, Yandex, Baidu, Bing and Duckduckgo

Posted on in Meta • Tagged with Scraping, Baidu, Extracting, Google, Programming, Python, Searchengine, Bing, Meta

Prelude

It's been quite a while since I worked on my projects. But recently I had some motivation and energy left, which is quite nice considering my full time university week and a programming job besides.

I have a little project on GitHub that I worked on every now and again in the last year or so. Recently it got a little bit bigger (I have 115 github stars now, would've never imagined that I ever achieve this) and I receive up to 2 mails with job offers every week (Sorry if I cannot accept any request :( ).

But unfortunately my progress with this project is not as good as I want it to be (that's probably a quite common feeling under us programmers). It's not a problem of missing ideas and features that I want to implement, the hard part is to extend the project without blowing legacy code up. GoogleScraper has grown evolutionary and I am waisting a lot of time to understand my old code. Mostly it's much better to just erease whole modules and reimplement things completely anew. This is essentially what I made with the parsing module.

Parsing SERP pages with many search engines

So I …


Continue reading

GoogleScraper.py - A simple python module to parse google search results.

Posted on in Programming • Tagged with Google, Scraping, Programming, Security

UPDATE on 18th February 2014:

This python module has now its own github repository!

The plugin can extract

  • All links
  • Link titles
  • The description/caption below the links

and has the following features:

  • Advanced proxy support for SOCKS4/4a/5 and HTTP PROXY
  • Multithreading
  • XPATH parsing
  • Supports almost all search parameters

Please note that this is by no means a permanent version! Heavy structural changes will be implemented in the near future (I'll experiment with asynchronous networking for instance). But on this site, I will always host a working version with instructions how to use it, such that visitors can always use the script!

1. Edit (07.01.2013):

  • Using requests instead of urllib
  • Added random User Agents for every new search.
  • Cleaned the code
  • Implemented foundation to combine with proxychains

Original Blog Post

Sample output after searching for 'cats are not cute' (sorry) with 100 results per page on 3 ascending pages: results.txt

I always was in need of a fast and reliable working python module to query the google search engine. The google API is rubbish, because they just give you maximally 36 results. This is completly inacceptable!

So, I looked further and found http://code.google …


Continue reading