Thursday, December 27, 2012

Pholcidae, a Python web crawler library

bbrodriges/pholcidae · GitHub

Pholcidae (appropriately named after a _family_ of spiders) (*) is a Python library that can be used to write custom web crawlers (a.k.a. spiders).

It has a handful of attributes that can be set when creating an instance of the spider to customize how it works. These include a list of valid links, a list of links to exclude, the domain to crawl, the page at which to start crawling the domain, whether to crawl links pointing out of the domain, and the cookies and headers to set for the page requests.

Your program gets access to each page that it crawls, both the URL and the raw content, to process as you wish.
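To make the idea concrete, here is a minimal sketch of a crawler library of this general shape, written with only the Python standard library. It is not Pholcidae's actual API: the class name `MiniSpider`, its attribute names, and the `crawl(url, content)` hook are all hypothetical, chosen only to mirror the kinds of settings described above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def _fetch(url):
    """Default fetcher: download a page and return its text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class MiniSpider:
    """Hypothetical base class (not Pholcidae): subclass it, override crawl()."""

    domain = "example.com"      # host to crawl
    start_page = "/"            # path at which crawling starts
    valid_links = [r".*"]       # a path must match one of these regexes
    exclude_links = []          # a path matching any of these is skipped
    stay_in_domain = True       # if False, links to other hosts are followed
    fetch = staticmethod(_fetch)

    def crawl(self, url, content):
        """Hook called once per fetched page; does nothing by default."""

    def _allowed(self, url):
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            return False
        if self.stay_in_domain and parts.netloc != self.domain:
            return False
        path = parts.path or "/"
        if any(re.search(p, path) for p in self.exclude_links):
            return False
        return any(re.search(p, path) for p in self.valid_links)

    def start(self):
        """Breadth-first crawl beginning at the start page."""
        queue = ["http://" + self.domain + self.start_page]
        seen = set(queue)
        while queue:
            url = queue.pop(0)
            content = self.fetch(url)
            self.crawl(url, content)
            parser = _LinkParser()
            parser.feed(content)
            for href in parser.links:
                link = urljoin(url, href).split("#")[0]  # drop fragments
                if link not in seen and self._allowed(link):
                    seen.add(link)
                    queue.append(link)
```

A subclass would set the attributes it cares about and override `crawl()` to do whatever it likes with each URL and its raw content. The sketch leaves out cookies, headers, error handling and politeness delays, all of which a real crawler would need.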

It looks like a useful tool for experimenting with custom web crawlers, since it is not a standalone crawler program but a crawler library.

As I've said many times in the past on this blog and elsewhere, making the bulk of your code into a library or libraries (and then writing a thinnish main wrapper over it to make a complete runnable program) enhances its applicability manyfold.
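As a small illustration of that library-plus-thin-wrapper structure, here is a sketch in which the real work lives in an importable function and the command-line part only parses arguments and prints. The names `extract_links` and `main`, and the `links.py` script name in the usage message, are made up for this example.

```python
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin


class _Anchors(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url=""):
    """Library part: return the link URLs found in an HTML string,
    resolved against base_url."""
    parser = _Anchors()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def main(argv):
    """Thin wrapper: read HTML from a file and print one link per line."""
    if len(argv) < 2:
        print("usage: links.py FILE [BASE_URL]", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        html = handle.read()
    base_url = argv[2] if len(argv) > 2 else ""
    for link in extract_links(html, base_url):
        print(link)
    return 0
```

A script would finish with the usual `if __name__ == "__main__":` guard that calls `sys.exit(main(sys.argv))`. Any other program (a crawler, a link checker, a test suite) can import `extract_links` directly and never touch the command-line part.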

This point applies even if you are not going to release the code as open source, because even then, you, or others on your team, can reuse those libraries to create other useful programs for the same area, at less cost and time.

Though this idea is not new or original (it's almost as old as computing, in fact), a lot of people still don't seem to know it or apply it, which is the cause of tons of wasted effort and rework (a.k.a. waste of money) in the software industry.

(*) Appropriate because, being a library, it can be used to create a family of spiders (or many different spiders), rather than just one, as would be the case if it were a standalone program - which neatly illustrates the point I just made above.

- Vasudev Ram
www.dancingbison.com
