1. I would like to write a web crawler that starts at a URL, follows the links on that page, and then follows the links on the pages it finds. As it crawls, it should sniff each page for RSS feeds. What is the best way to build this with HarvestMan? Can I do it by editing the Config.xml file, or should I write a plugin or create an event?

2. Can I ask HarvestMan to get every page on a web site? How would it know when it has found them all?

3. If I want to get regular, paged information, for example where the URL looks like this...

http://www.thesite.com/articles/article_number/

... would I be better off using Python with, say, cURL or urllib? I tried creating a crawler and it seemed to wander about a lot. Is there a way to give the crawler some focus or urgency?
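
To make question 3 concrete, here is a minimal sketch of the plain-Python alternative using only the standard library's urllib. It assumes the article numbers are consecutive integers, which the example URL does not confirm. The names `article_urls` and `fetch_articles` are hypothetical, and the optional `fetch` parameter exists only so the loop can be tried without touching the network.

```python
import urllib.request

# URL pattern taken from the example above; consecutive integer
# article numbers are an assumption, not something the site guarantees.
BASE_URL = "http://www.thesite.com/articles/{}/"


def article_urls(first, last, template=BASE_URL):
    """Build the URLs for article numbers first..last (inclusive)."""
    return [template.format(n) for n in range(first, last + 1)]


def default_fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def fetch_articles(urls, fetch=default_fetch):
    """Fetch each URL in order; pages that fail to load are skipped."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError:
            # urllib.error.URLError and HTTPError are OSError subclasses.
            continue
    return pages
```

Something like `fetch_articles(article_urls(1, 20))` would then visit exactly those twenty pages and nothing else, so there is no wandering because no links are followed at all.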

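For questions 1 and 2, the behaviour I am after can be sketched in plain Python, without HarvestMan. This is only an illustration of the idea and not HarvestMan's API: it assumes feeds are advertised through `<link>` tags with an RSS or Atom `type` attribute, it treats "the web site" as "same host as the start URL", and the names `LinkAndFeedParser`, `crawl`, and `max_pages` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import urllib.request

# MIME types commonly used for feed autodiscovery (an assumption about the pages).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class LinkAndFeedParser(HTMLParser):
    """Collect <a href> links and <link type=rss/atom> feed URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Resolve relative URLs and drop any #fragment.
        url = urldefrag(urljoin(self.base_url, href))[0]
        if tag == "a":
            self.links.append(url)
        elif tag == "link" and (attrs.get("type") or "").lower() in FEED_TYPES:
            self.feeds.append(url)


def fetch_html(url):
    """Download one page and decode it leniently as UTF-8."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl limited to the start URL's host.

    Returns (feeds, seen): the feed URLs found and every same-host URL queued.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    feeds = set()
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable page: skip it and carry on
        visited += 1
        parser = LinkAndFeedParser(url)
        parser.feed(html)
        feeds.update(parser.feeds)
        for link in parser.links:
            # Staying on one host is what gives the crawl its focus.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return feeds, seen
```

In this sketch the crawler "knows" it has every page when the queue runs empty before `max_pages` is reached, which only covers pages reachable by links from the start URL. The same-host check and the page cap are the two knobs that stop it wandering.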

See also:

http://bulba.sdsu.edu/docwiki/HarvestMan

http://www.harvestmanontheweb.com/faq.html