OrangeSpider WebCrawler

OrangeSpider is an experimental web crawler designed to collect a large snapshot of the World Wide Web. I have designed it to be as gentle as possible, so that its HTTP requests do not disturb any services or systems. It obeys the robots.txt standard as well as some non-standard extensions. Although I have been as careful as possible and have analyzed hundreds of unusual robots.txt files, I cannot guarantee that a robots.txt file that does not conform to the standard will always be parsed and interpreted correctly. If you find that the crawler has requested a forbidden page, please contact me so that I can fix the bug.
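
For site operators who want to check how a rule set is likely to be read, here is a minimal sketch using Python's standard urllib.robotparser module. It is not the crawler's own parser, and it does not cover the non-standard extensions mentioned above. The user-agent token "OrangeSpider", the sample rules, and the helper name is_allowed are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical): OrangeSpider may fetch everything
# except /private/, while all other robots are excluded from the whole site.
ROBOTS_TXT = """\
User-agent: OrangeSpider
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html"):
        url = "http://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "OrangeSpider", url))
```

A rule group that names the crawler takes precedence over the catch-all "*" group, which is why the sample above lets OrangeSpider in while keeping other robots out.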

The crawler should not contact any site (*.domain.tld, or the equivalent under country-specific suffixes such as co.uk or co.jp) more than once every two minutes, and it primarily collects the most prominent pages. It does not do a deep crawl. The goal is to scale up to 100,000,000 pages to ensure the robustness of the software. In the long run, the search engine Orangeslicer (currently down) will allow queries on the indexed pages.
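
The per-site limit described above can be sketched roughly as follows. This is an illustration of the idea, not the crawler's actual code: the names site_key and PolitenessGate are hypothetical, and the short suffix set stands in for a complete list of country-specific suffixes, which a real implementation would need.

```python
from urllib.parse import urlsplit

MIN_INTERVAL_SECONDS = 120.0  # "not more than once in two minutes"

# Assumption: only the two suffixes named in the text; a real list is far longer.
TWO_LEVEL_SUFFIXES = {"co.uk", "co.jp"}


def site_key(url: str) -> str:
    """Collapse all hosts under one registered domain into a single key."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Keep three labels for suffixes like co.uk, otherwise two (domain.tld).
    keep = 3 if ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


class PolitenessGate:
    """Tracks the last request time per site and refuses early repeats."""

    def __init__(self, min_interval: float = MIN_INTERVAL_SECONDS) -> None:
        self.min_interval = min_interval
        self._last_request = {}  # site key -> time of the last allowed request

    def try_acquire(self, url: str, now: float) -> bool:
        """Return True and record the request if the site may be contacted."""
        key = site_key(url)
        last = self._last_request.get(key)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_request[key] = now
        return True


if __name__ == "__main__":
    gate = PolitenessGate()
    print(gate.try_acquire("http://www.example.co.uk/", now=0.0))
    print(gate.try_acquire("http://shop.example.co.uk/item", now=60.0))
    print(gate.try_acquire("http://shop.example.co.uk/item", now=120.0))
```

Passing the current time in as an argument keeps the sketch easy to test; a crawler would typically supply a monotonic clock reading such as time.monotonic().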

Thank you in advance for your feedback!

Marco Pesarese - Nußloch, Germany, 12. June 2007

Electronic mail contact: orangespider -at- orangebase -dot- org