About This Site

Over a year ago, while looking for a Python-based web crawler, I discovered Harvestman. Harvestman did everything I wanted and more, but it was just too complicated for me and didn't quite work the way I wanted. Revisiting the code more recently, I noticed that version 2 of Harvestman still did everything I wanted and more, and it now did it the way I wanted too.

This site is an experiment that will hopefully grow out of discussion between the creator of Harvestman, Anand B Pillai, and me, Tom Smith. Anand doesn't have a lot of time to support this product, so hopefully I will be doing what I can, namely helping to flesh out the documentation and creating a site that makes writing the documentation as easy as it can be. I also had a go at a logo :-)

I wondered about creating a WordPress or wiki site, but I have decided to use a Trac site so that my "live" experiments with Harvestman can be distributed as examples, helping others get started with what I call "Personal Data Mining". There are still a few glitches to iron out, so bear with me. I have managed to get a WYSIWYG editor installed, but I need a little help working with .egg files on this hosted server. Still, I like the fact that Trac supports Python syntax colouring, like this...

class DataCrawler(HarvestMan):
    """ A crawler which fetches pages by looking for matching data """

    # This is an extreme case of using events. This combines
    # three events to create a fine-grained filter that downloads
    # only pages that contain the string 'database'.

    # This is a rather simple filtered crawler, but by overriding
    # the handlers below with more powerful processing which can
    # scan a page and look for regular expressions by using
    # complex grammars, it is possible to build a topic focussed
    # crawler.
    
    def __init__(self, keyword):
        self.keyword = keyword
        super(DataCrawler, self).__init__()

... Given that both Harvestman and Trac are Python-powered, this may well be the most important feature of Trac for any documentation effort.
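The filtering idea described in the comments of that snippet (keep a page only if it contains a keyword, or more generally matches a regular expression) can be sketched in plain Python. This is only a rough illustration and does not use Harvestman's API: the function name `page_matches`, the example URLs and the sample page text are all made up for the purpose.

```python
import re


def page_matches(text, keyword=None, pattern=None):
    """Return True if the page text contains the keyword (case-insensitive)
    or matches the regular expression pattern.

    Hypothetical helper for illustration; not part of Harvestman.
    """
    if keyword is not None and keyword.lower() in text.lower():
        return True
    if pattern is not None and re.search(pattern, text):
        return True
    return False


# Made-up stand-ins for pages a crawler might have fetched.
pages = {
    "http://example.com/a": "An introduction to database design",
    "http://example.com/b": "Notes on gardening",
}

# Keep only the URLs whose text mentions the keyword.
wanted = [url for url, text in pages.items()
          if page_matches(text, keyword="database")]
```

In a real crawler a check like this would sit inside whatever hook decides whether a page is saved; here it simply filters a dictionary.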

I am no Python expert, so the content will be aimed at someone like me, who knows a little Python (or is willing to learn) but likes things to be very simple. Let's see how we go...


Tom

http://www.theotherblog.com