GoogleScraper - A simple module to scrape and extract links from Google.
What does GoogleScraper do?
GoogleScraper parses Google search engine results easily and efficiently. It lets you extract all found links, link titles, and link descriptions, as well as the total number of results for your query, programmatically, so your application can do whatever it wants with them (probably some SEO-related research).
Possible use cases include:
- Quickly harvest masses of Google dorks.
- Use it as an SEO tool.
- Discover trends.
- Compile lists of sites to feed your own database.
- Many more use cases...
GoogleScraper is implemented with the following techniques/software:
- Written in Python 3.3
- Uses multithreading/asynchronous I/O (via Twisted).
- Supports parallel Google scraping with multiple IP addresses.
- Provides proxy support using
- Supports additional Google search features.
- Includes exhaustive research of similar projects!
    import GoogleScraper
    import urllib.parse

    if __name__ == '__main__':
        results = GoogleScraper.scrape('HOly shit', number_pages=1)
        for link_title, link_snippet, link_url in results['results']:
            # You can access all parts of the search results like this:
            # link_url.scheme => URL scheme specifier (Ex: 'http')
            # link_url.netloc => Network location part (Ex: 'www.python.org')
            # link_url.path => Hierarchical path (Ex: '/help/Python.html')
            # link_url.params => Parameters for last path element
            # link_url.query => Query component
            try:
                # geturl() reassembles the parts of the URL into the whole thing
                print(urllib.parse.unquote(link_url.geturl()))
            except Exception:
                # Skip links that cannot be reassembled or decoded
                pass

        # How many URLs did we get?
        print(len(results['results']))

        # How many hits has Google found for our keyword?
        print(results['num_results_for_kw'])
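The attribute names in the comments above are the same ones that the standard library's urllib.parse.urlparse() returns. As a standalone sketch, assuming link_url behaves like such a parsed result (the URL below is made up for illustration), the components can be explored without running a search:

```python
from urllib.parse import urlparse, unquote

# Hypothetical URL, chosen only to exercise every component.
url = "http://www.python.org/help/Python.html;type=a?q=google%20scraper"
parts = urlparse(url)

print(parts.scheme)   # URL scheme specifier
print(parts.netloc)   # network location part
print(parts.path)     # hierarchical path
print(parts.params)   # parameters for the last path element
print(parts.query)    # query component, still percent-encoded

# geturl() reassembles the components; unquote() decodes percent-escapes.
decoded = unquote(parts.geturl())
print(decoded)
```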
This is an example output of the above use.py:
Direct command line usage
If you want to use GoogleScraper.py as a CLI tool, run it like this:
python GoogleScraper.py -p 1 -n 25 -q 'inurl:".php?id=555"'
But be aware that Google might quickly flag you as an abuser if you use such Google dorks.
In that case, try a SOCKS proxy (but don't bet on Tor). [This is just an example; this proxy will probably no longer work by the time you read this.]
python GoogleScraper.py -p 1 -n 25 -q 'i hate google' --proxy="184.108.40.206:2214"
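GoogleScraper's own argument handling is not shown here. As a rough sketch only, this is how a script could accept the flags used in the commands above with argparse. The flag names come from those commands; the meanings given to -p and -n, the destination names, the defaults, and the split_proxy helper are all assumptions made for illustration.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="GoogleScraper.py")
    # Flag names are taken from the README commands; everything else
    # (dest names, defaults, help text) is a guess, not the real CLI.
    parser.add_argument("-q", dest="query", required=True, help="search query")
    parser.add_argument("-p", dest="num_pages", type=int, default=1,
                        help="assumed: number of result pages")
    parser.add_argument("-n", dest="num_results", type=int, default=10,
                        help="assumed: results per page")
    parser.add_argument("--proxy", default=None, help="proxy as host:port")
    return parser

def split_proxy(proxy):
    """Hypothetical helper: split 'host:port' into (host, int(port))."""
    host, _, port = proxy.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected host:port, got %r" % proxy)
    return host, int(port)

# Same arguments as the proxy command above, passed as a list.
args = build_parser().parse_args(
    ["-p", "1", "-n", "25", "-q", "i hate google", "--proxy=184.108.40.206:2214"]
)
print(args.query, args.num_pages, args.num_results)
print(split_proxy(args.proxy))
```

Validating the proxy string up front gives a clear error before any request is attempted.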
If you feel like contacting me, send me an email. You can find my contact information on my blog.
To-do list (As of 25.12.2013)
- Figure out whether to use threads or asynchronous I/O for multiple connections.
- Determine if it is possible to use one Google search session with multiple independent connections (each with a different IP address).
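To make the threads-versus-asynchronous-I/O question concrete, here is a minimal sketch of both approaches using only the standard library. It is not GoogleScraper code: it uses asyncio rather than Twisted, needs Python 3.7+ for asyncio.run, and replaces the network request with hypothetical stubs (fake_fetch, fake_fetch_async) that only sleep.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(keyword):
    """Stand-in for one blocking search request (no network involved)."""
    time.sleep(0.05)
    return keyword, len(keyword)

async def fake_fetch_async(keyword):
    """Stand-in for one non-blocking search request."""
    await asyncio.sleep(0.05)
    return keyword, len(keyword)

def run_threaded(keywords, workers=4):
    # One blocking call per worker thread; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, keywords))

async def run_async(keywords):
    # All coroutines share a single thread; gather() preserves input order.
    return await asyncio.gather(*(fake_fetch_async(k) for k in keywords))

keywords = ["python", "twisted", "proxy"]
threaded = run_threaded(keywords)
asynced = asyncio.run(run_async(keywords))
print(threaded)
print(asynced)
```

Both variants return the same results here. The practical difference is that threads can wrap existing blocking code unchanged, while the asynchronous version requires a non-blocking client throughout.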
This is a development repository, but you can always find a working GoogleScraper.py script here.