robots.js
robots.js is a parser for robots.txt files for Node.js.

Installation

It's recommended to install via npm:
$ npm install -g robots
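
If you plan to use it as a project dependency (as in the examples below), a plain local install also works:
$ npm install robots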

Usage

Here's an example of using robots.js:
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access) {
      if (access) {
        // parse url
      }
    });
  }
});

The default crawler user-agent is:
Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0
Here's an example of using a different user-agent and a more detailed callback:
var robots = require('robots')
  , parser = new robots.RobotsParser(
                'http://nodeguide.ru/robots.txt',
                'Mozilla/5.0 (compatible; RobotTxtBot/1.0)',
                after_parse
            );
            
function after_parse(parser, success) {
  if(success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access, url, reason) {
      if (access) {
        console.log(' url: '+url+', access: '+access);
        // parse url ...
      }
    });
  }
}

Here's an example of getting the list of sitemaps:
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    parser.getSitemaps(function(sitemaps) {
      // sitemaps is an array of the sitemap URLs listed in robots.txt
    });
  }
});

Here's an example of getCrawlDelay usage:
var robots = require('robots')
  , parser = new robots.RobotsParser();

// for example:
//
// $ curl -s http://nodeguide.ru/robots.txt
//
// User-agent: Google-bot
// Disallow: / 
// Crawl-delay: 2
//
// User-agent: *
// Disallow: /
// Crawl-delay: 2

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    var GoogleBotDelay = parser.getCrawlDelay("Google-bot");
    // ...
  }
});
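
Crawl-delay values are conventionally given in seconds, so a minimal sketch of throttling requests with the returned value might look like this (fetchNextPage is a hypothetical placeholder for your own crawl step, not part of this library):
parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    // delay in seconds, per the robots.txt shown above
    var delay = parser.getCrawlDelay('Google-bot');
    // wait the requested delay before issuing the next request;
    // fetchNextPage stands in for your own crawl logic
    setTimeout(fetchNextPage, (delay || 0) * 1000);
  }
});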

An example of passing options to the HTTP request:
var options = {
  headers: {
    Authorization: "Basic " + new Buffer("username:password").toString("base64")
  }
};

var robots = require('robots')
  , parser = new robots.RobotsParser(null, options);

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  // ...
});
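
On newer Node.js versions, where new Buffer() is deprecated, the same Authorization header can be built with Buffer.from:
var options = {
  headers: {
    Authorization: "Basic " + Buffer.from("username:password").toString("base64")
  }
};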

API

RobotsParser — main class. This class provides a set of methods to read, parse and answer questions about a single robots.txt file.
* setUrl(url, read) — sets the URL referring to a robots.txt file. By default it invokes the read() method. If read is a function, it is called once the remote file is downloaded and parsed, and it takes two arguments: the first is the parser itself, and the second is a boolean that is true if the remote file was successfully parsed.
* read(afterparse) — reads the robots.txt URL and feeds it to the parser.
* parse(lines) — parses the input lines from a robots.txt file.
* canFetch(userAgent, url, callback) — using the parsed robots.txt, decides whether userAgent can fetch url. The callback has the form:
  function callback(access, url, reason) { ... }
  where:
  * access — whether this url may be fetched. true/false.
  * url — the target url.
  * reason — the reason for access. An object with:
    * type — valid values: 'statusCode', 'entry', 'defaultEntry', 'noRule'
    * entry — an instance of lib/Entry.js. Only for types 'entry' and 'defaultEntry'.
    * statusCode — the HTTP response status code for url. Only for type 'statusCode'.
* canFetchSync(userAgent, url) — using the parsed robots.txt, decides whether userAgent can fetch url. Returns true/false (see the sketch after this list).
* getCrawlDelay(userAgent) — returns the Crawl-delay for the given userAgent.
* getSitemaps(callback) — passes the Sitemaps from the parsed robots.txt to callback.
* getDisallowedPaths(userAgent) — gets the paths explicitly disallowed for the specified user agent and for *.
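
A minimal sketch of the synchronous helpers, assuming the robots.txt has already been fetched and parsed via setUrl; treating the result of getDisallowedPaths as an array of path strings is an assumption here:
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    // synchronous check: no callback needed once parsing has finished
    var allowed = parser.canFetchSync('*', '/doc/dailyjs-nodepad/');
    console.log('allowed: ' + allowed);

    // paths disallowed for this user agent (assumed to be an array of paths)
    var disallowed = parser.getDisallowedPaths('*');
    console.log('disallowed paths:', disallowed);
  }
});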

License

See the LICENSE file.

Resources

* Robots.txt Specifications by Google
* Robots.txt parser for Python
* A Standard for Robot Exclusion