OrangeSpider WebCrawler
The OrangeSpider is an experimental web crawler designed to collect a large snapshot of the World Wide Web. I have designed
the crawler to be as gentle as possible so that its HTTP requests do not disturb any services or systems. It obeys the robots.txt
protocol standard as well as some non-standard extensions. Although I have been as careful as possible and have analyzed hundreds of unusual
robots.txt files, I cannot guarantee that a non-standard robots.txt file will always be parsed and interpreted correctly. If you find
that the crawler has requested a forbidden page, please contact me so that I can fix the bug.
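For site operators who want to check how a standard-conforming parser reads their rules, here is a minimal sketch using Python's standard library. It is not the OrangeSpider parser and does not cover the non-standard extensions mentioned above. The user-agent token "OrangeSpider", the example rules, and the helper name is_allowed are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for this example.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    # "OrangeSpider" as the user-agent token is an assumption.
    print(is_allowed(EXAMPLE_ROBOTS, "OrangeSpider", "http://example.com/index.html"))
    print(is_allowed(EXAMPLE_ROBOTS, "OrangeSpider", "http://example.com/private/page.html"))
```

If a page under a Disallow rule like the one above was still requested by the crawler, that is the kind of bug I would like to hear about.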
The crawler should not contact any site (*.domain.tld, including country-specific
suffixes such as co.uk or co.jp) more than once every two minutes, and it primarily collects the most prominent pages. It does not do a deep crawl. The goal
is to scale up to 100,000,000 pages to verify the robustness of the software. In the long run, the search engine
Orangeslicer (currently down) will allow queries on the indexed pages.
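The per-site limit described above can be sketched roughly as follows. This is an illustration only, not the actual OrangeSpider code. The names site_key and PolitenessGate are hypothetical, and the small suffix set is an assumption standing in for a complete list of country-specific suffixes (a real crawler would more likely consult the Public Suffix List).

```python
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for a full list of two-level country suffixes.
TWO_LEVEL_SUFFIXES = {"co.uk", "co.jp"}

# Two minutes between requests to the same site, as stated in the text.
MIN_INTERVAL_SECONDS = 120.0


def site_key(url: str) -> str:
    """Map a URL to its site, so that all hosts under *.domain.tld share one key."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # For suffixes like co.uk, keep three labels (example.co.uk), otherwise two.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


class PolitenessGate:
    """Tracks the last request time per site and refuses requests that come too soon."""

    def __init__(self, min_interval: float = MIN_INTERVAL_SECONDS) -> None:
        self.min_interval = min_interval
        self._last_request = {}  # site key -> timestamp of last permitted request

    def try_acquire(self, url: str, now: float) -> bool:
        """Return True and record the request if the site may be contacted at time now."""
        key = site_key(url)
        last = self._last_request.get(key)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_request[key] = now
        return True


if __name__ == "__main__":
    gate = PolitenessGate()
    print(gate.try_acquire("http://www.example.co.uk/", 0.0))
    print(gate.try_acquire("http://shop.example.co.uk/items", 60.0))
    print(gate.try_acquire("http://shop.example.co.uk/items", 120.0))
```

Passing the current time in as an argument keeps the sketch easy to test. In a running crawler the value would come from a monotonic clock such as time.monotonic().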
Thank you in advance for your feedback!
Marco Pesarese - Nußloch, Germany, 12 June 2007
Electronic mail contact: orangespider -at- orangebase -dot- org