robots-txt-parser
Installing
Via NPM:
```shell
npm install robots-txt-parser --save
```
Getting Started
After installing robots-txt-parser, it needs to be required and initialised:
```js
const robotsParser = require('robots-txt-parser');
const robots = robotsParser(
  {
    userAgent: 'Googlebot', // The default user agent to use when looking for allow/disallow rules; if this agent isn't listed in the active robots.txt, * is used instead.
    allowOnNeutral: false, // The value to return when the robots.txt rules for allow and disallow are balanced on whether a link can be crawled.
  },
);
```
Example Usage:
```js
const robotsParser = require('robots-txt-parser');
const robots = robotsParser(
  {
    userAgent: 'Googlebot', // The default user agent to use when looking for allow/disallow rules; if this agent isn't listed in the active robots.txt, * is used instead.
    allowOnNeutral: false, // The value to return when the robots.txt rules for allow and disallow are balanced on whether a link can be crawled (illustrated below).
  },
);
robots.useRobotsFor('http://example.com')
  .then(() => {
    robots.canCrawlSync('http://example.com/news'); // Returns true if the link can be crawled, false if not.
    robots.canCrawl('http://example.com/news', (value) => {
      console.log('Crawlable: ', value);
    }); // Calls the callback with true if the link is crawlable, false if not.
    robots.canCrawl('http://example.com/news') // If no callback is provided, returns a promise which resolves with true if the link is crawlable, false if not.
      .then((value) => {
        console.log('Crawlable: ', value);
      });
  });
```
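To see what a "balanced" (neutral) case looks like, here is a small sketch using methods documented below; it assumes that an Allow rule and a Disallow rule of equal specificity matching the same path is what counts as neutral, which is the situation the allowOnNeutral option decides.

```js
// Hypothetical neutral case: both rules match '/shared/page' with equal
// specificity, so the outcome is decided by the allowOnNeutral setting.
robots.parseRobots('http://example.com', `User-agent: *
Allow: /shared/
Disallow: /shared/
`);
robots.useRobotsFor('http://example.com')
  .then(() => {
    robots.canCrawlSync('http://example.com/shared/page'); // true if allowOnNeutral is true, false otherwise.
  });
```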
Condensed Documentation
Below is a condensed form of the documentation; each entry is a method available on the robotsParser object.

| Method | Parameters | Return |
| ------ | ---------- | ------ |
| parseRobots(key, string) | key:String, string:String | None. |
| isCached(domain) | domain:String | Boolean for whether the robots.txt for the URL is cached. |
| fetch(url) | url:String | Promise, resolved when the robots.txt is retrieved. |
| useRobotsFor(url) | url:String | Promise, resolved when the robots.txt is fetched. |
| canCrawl(url) | url:String, callback:Func (Opt) | Promise, resolves with a Boolean. |
| getSitemaps() | callback:Func (Opt) | Promise if no callback provided, resolves with [String]. |
| getCrawlDelay() | callback:Func (Opt) | Promise if no callback provided, resolves with a Number. |
| getCrawlableLinks(links) | links:[String], callback:Func (Opt) | Promise if no callback provided, resolves with [String]. |
| getPreferredHost() | callback:Func (Opt) | Promise if no callback provided, resolves with a String. |
| setUserAgent(userAgent) | userAgent:String | None. |
| setAllowOnNeutral(allow) | allow:Boolean | None. |
| clearCache() | None | None. |
Full Documentation
---
parseRobots
robots.parseRobots(key, string)
Parses a string representation of a robots.txt file and caches it under the given key.
Parameters
- key -> Can be any URL.
- string -> String representation of a robots.txt file.
Returns
None.
Example
```js
robots.parseRobots('https://example.com', `User-agent: *
Allow: /*.php$
Disallow: /
`);
```
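As a rough follow-up sketch (not from the library's own examples), the parsed rules are cached under the given key, so later checks against that domain can run from the cache; the expected results assume the usual longest-match reading of the rules above.

```js
// Sketch: reuse the rules parsed above from the cache; the expected values
// assume longest-match semantics for Allow/Disallow.
robots.useRobotsFor('https://example.com')
  .then(() => {
    console.log(robots.canCrawlSync('https://example.com/index.php')); // Expected: true (matches Allow: /*.php$).
    console.log(robots.canCrawlSync('https://example.com/about')); // Expected: false (matches Disallow: /).
  });
```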
---
isCached
robots.isCached(domain)
A method used to check if a robots.txt has already been fetched and parsed.
Parameters
- domain -> Can be any URL.
Returns
Returns true if a robots.txt has already been fetched and cached by robots-txt-parser.
Example
```js
robots.isCached('https://example.com'); // true or false.
robots.isCached('example.com'); // Attempts to check the cache for only http:// and returns true or false.
```
---
fetch
robots.fetch(url)
Attempts to fetch and parse the robots.txt file located at the given URL. This method bypasses the built-in cache and always retrieves a fresh copy of the robots.txt.
Parameters
- url -> Any URL.
Returns
Returns a Promise which resolves with the parsed robots.txt once it has been fetched.
Example
```js
robots.fetch('https://example.com/robots.txt').then((tree) => {
  console.log(Object.keys(tree)); // Will log sitemap and any user agents.
});
```
---
useRobotsFor
robots.useRobotsFor(url)
Attempts to download and use the robots.txt at the given URL; if it has already been downloaded, the cached copy is used instead.
Parameters
- url -> Any URL.
Returns
Returns a Promise that resolves once the robots.txt is fetched and parsed.
Example
```js
robots.useRobotsFor('https://example.com/news').then(() => {
  // Logic to check if links are crawlable.
});
```
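As a sketch of what that placeholder logic might look like, using only methods documented below (the candidate link list is hypothetical):

```js
// Hypothetical candidate links to filter against the fetched robots.txt.
const candidates = ['https://example.com/news', 'https://example.com/private'];
robots.useRobotsFor('https://example.com/news')
  .then(() => robots.getCrawlableLinks(candidates))
  .then((crawlable) => {
    console.log(crawlable); // Only the links allowed for the active user agent.
  });
```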
---
canCrawl
robots.canCrawl(url, callback)
Tests whether a URL can be crawled against the currently active robots.txt and user agent. If a robots.txt isn't cached for the URL's domain, it is fetched and parsed before the result is returned.
Parameters
- url -> Any URL.
- callback -> An optional callback; if omitted, a Promise is returned.
Returns
Returns a Promise which resolves with a boolean value.
Example
```js
robots.canCrawl('https://example.com/news').then((crawlable) => {
  console.log(crawlable); // Will log a boolean value.
});
```
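Because the callback argument is optional, the same check can also be written in callback style, as in the Getting Started example:

```js
// Callback form: the result is passed to the callback instead of a promise.
robots.canCrawl('https://example.com/news', (crawlable) => {
  console.log(crawlable); // true or false.
});
```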
---
getSitemaps
robots.getSitemaps(callback)
Returns a list of sitemaps present in the active robots.txt.
Parameters
- callback -> An optional callback; if omitted, a Promise is returned.
Returns
Returns a Promise which resolves with an array of strings.
Example
```js
robots.getSitemaps().then((sitemaps) => {
  console.log(sitemaps); // Will log a list of strings.
});
```
---
getCrawlDelay
robots.getCrawlDelay(callback)
Returns the crawl delay specified in the currently active robots.txt for the active user agent.
Parameters
- callback -> An optional callback; if omitted, a Promise is returned.
Returns
Returns a Promise which resolves with an Integer.
Example
```js
robots.getCrawlDelay().then((crawlDelay) => {
  console.log(crawlDelay); // Will be an Integer greater than or equal to 0.
});
```
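As a small sketch of how the value might be used to pace requests (this assumes the delay is in seconds, as with the Crawl-delay directive; setTimeout is plain Node.js, not part of this library):

```js
robots.getCrawlDelay().then((crawlDelay) => {
  // Wait crawlDelay seconds before making the next request.
  setTimeout(() => {
    // Fetch the next page here.
  }, crawlDelay * 1000);
});
```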
---
getCrawlableLinks
robots.getCrawlableLinks(links, callback)
Takes an array of links and returns an array of links which are crawlable
for the current active robots.txt.
Parameters
- links -> An array of links to check for crawlability.
- callback -> An optional callback; if omitted, a Promise is returned.
Returns
A Promise that resolves with an Array of all the crawlable links.
Example
```js
robots.getCrawlableLinks(['https://example.com/news', 'https://example.com/admin']).then((links) => {
  console.log(links); // Will log the links from the array that can be crawled.
});
```
---
getPreferredHost
robots.getPreferredHost(callback)
Returns the preferred host name specified in the active robots.txt's host: directive, or undefined if there isn't one.
Parameters
- callback -> An optional callback; if omitted, a Promise is returned.
Returns
A String if the host is defined, undefined otherwise.
Example
```js
robots.getPreferredHost().then((host) => {
  console.log(host);
});
```
---
setUserAgent
robots.setUserAgent(userAgent)
Sets the current user agent to use when checking if a link can be crawled.
Parameters
- userAgent -> A string.
Returns
undefined
Example
```js
robots.setUserAgent('exampleBot'); // When interacting with the robots.txt we now look for records for 'exampleBot'.
robots.setUserAgent('testBot'); // When interacting with the robots.txt we now look for records for 'testBot'.
```
---
setAllowOnNeutral
robots.setAllowOnNeutral(allow)
Sets the canCrawl behaviour to return true or false when the robots.txt rules are balanced on whether a link should be crawled or not.
Parameters
- allow -> A boolean value.
Returns
undefined
Example
```js
robots.setAllowOnNeutral(true); // If the allow/disallow rules are balanced, canCrawl returns true.
robots.setAllowOnNeutral(false); // If the allow/disallow rules are balanced, canCrawl returns false.
```
---
clearCache
robots.clearCache()
The cache can grow very large over extended crawling; this method resets it.
Parameters
None
Returns
None
Example
```js
robots.clearCache();
```
---
Synchronous API
Synchronous variants of the API; these will be deprecated in a future version.
canCrawlSync
robots.canCrawlSync(url)
Tests whether a URL can be crawled against the currently active robots.txt and user agent. This won't attempt to fetch the robots.txt if it is not cached.
Parameters
- url -> Any URL.
Returns
Returns a boolean value depending on whether the URL is crawlable. If there is no cached robots.txt for this URL, it will always return true.
Example
```js
robots.canCrawlSync('https://example.com/news'); // true or false.
```
---
getSitemapsSync
robots.getSitemapsSync()
Returns a list of sitemaps present in the active robots.txt.
Parameters
None
Returns
An Array of Strings.
Example
```js
robots.getSitemapsSync(); // Will be an array, e.g. ['http://example.com/sitemap1.xml', 'http://example.com/sitemap2.xml'].
```
---
getCrawlDelaySync
robots.getCrawlDelaySync()
Returns the crawl delay specified in the active robots.txt for the active user agent.
Parameters
None
Returns
An Integer greater than or equal to 0.
Example
```js
robots.getCrawlDelaySync(); // Will be an Integer.
```
---
getCrawlableLinksSync
robots.getCrawlableLinksSync(links)
Takes an array of links and returns an array of links which are crawlable
for the current active robots.txt.
Parameters
- links -> An array of links to check for crawlability.
Returns
An Array of all the links that are crawlable.
Example
```js
robots.getCrawlableLinksSync(['example.com/test/news', 'example.com/test/news/article']); // Will return an array of the links that can be crawled.
```
---
getPreferredHostSync
robots.getPreferredHostSync()
Returns the preferred host name specified in the active robots.txt's host: directive or undefined if there isn't one.
Parameters
None
Returns
A String if the host is defined, undefined otherwise.
Example
```js
robots.getPreferredHostSync(); // Will be a string if the host directive is defined.
```
License
See LICENSE file.
Resources
Robots.txt Specifications by Google
A Standard for Robot Exclusion
Robots.txt Specifications by Yandex