If you're reading this, chances are you've noticed a Tokenizer robot visiting your site while looking through your server logs. Our software obeys robots.txt files and robots META tags in HTML. These are the standard mechanisms webmasters use to tell web robots which portions of a site they are welcome to access.
We'd like to hear about any bad behavior. We can be reached at agent@tokenizer.org.
Our software obeys the robots.txt exclusion standard, described at http://www.robotstxt.org/wc/exclusion.html#robotstxt. To ban Tokenizer-crawler from your site, place the following in your robots.txt file:
User-agent: Tokenizer
Disallow: /
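If you would like to check the rule before deploying it, the following is a minimal sketch using Python's standard `urllib.robotparser` module. It only shows how a standards-following parser reads the two lines above; it is not the Tokenizer crawler's own code. The helper name `is_allowed`, the version string in the user agent, and the `example.com` URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from above, written as it should appear in robots.txt:
# one directive per line.
ROBOTS_TXT = """\
User-agent: Tokenizer
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    Hypothetical helper for illustration; it wraps the standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder; substitute a URL from your own site.
    for agent in ("Tokenizer/1.1.10", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "http://example.com/page.html")
        print(agent, "allowed" if allowed else "blocked")
```

With this rule in place, the parser should report the Tokenizer user agent as blocked, while agents that no rule names remain allowed.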
Tokenizer/1.1.9 did not understand robots META instructions; we fixed this in version 1.1.10 of the robot. We apologize for the inconvenience.
If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag, as described at http://www.robotstxt.org/wc/meta-user.html.
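As a rough illustration of how a robot might read that tag, here is a minimal sketch built on Python's standard `html.parser` module. It is an assumption-laden example, not the Tokenizer crawler's implementation: the names `RobotsMetaParser`, `robots_meta_directives`, `may_index`, and `may_follow` are hypothetical, and only the `noindex` and `nofollow` values are handled. The sample page shows the kind of META tag you would place in the head of your HTML.

```python
from html.parser import HTMLParser

# Sample page carrying a robots META tag (illustrative content only).
PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Private page</title>
</head><body>...</body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated, case-insensitive list.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_directives(html: str) -> set:
    """Return the set of lowercase robots META directives in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def may_index(directives: set) -> bool:
    return "noindex" not in directives


def may_follow(directives: set) -> bool:
    return "nofollow" not in directives


if __name__ == "__main__":
    found = robots_meta_directives(PAGE)
    print("index allowed:", may_index(found))
    print("follow allowed:", may_follow(found))
```

A page with no robots META tag yields an empty set, which this sketch treats as permission to index and follow.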
If you experience problems with the Tokenizer crawler on your site, or have questions about it, please email the Tokenizer team at agent@tokenizer.org.
If you have any technology-related questions, contact Fuad at www.efendi.ca, an independent consultant specializing in enterprise software development, data mining, natural language processing, and search. For instance, you may wish to implement your own SOLR-based faceted browsing on your website. SOLR is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.