Robot Exclusion Guide

The robots.txt file and robot META tags are methods for allowing or disallowing robots (web robots, spiders) to crawl portions of your site. Website administrators and content providers can define which parts of the site a robot should and should not visit.

Spiderline honors the robot exclusion protocol and robot META tags. Our spider will not index directories or follow links that have been disallowed in the robots.txt file on your server or by META tags designating "noindex" and/or "nofollow". If you use these methods to control spidering, you do not need to specify NOINDEX and NOFOLLOW for your account in the URL Configuration fields.

If you do not have a Spiderline account or you want to disallow other robots from crawling your website, the following document provides general information regarding Robot Exclusion. Disallowing robots from your website or part(s) of your website can be accomplished by two methods:

  • Robot Exclusion Protocol (robots.txt)
  • Robot META tags

Robot Exclusion Protocol - robots.txt

The robots.txt file is a plain TEXT file (not HTML). When a compliant robot visits a site, it first checks for a "/robots.txt" URL at the web root. If this file exists, the robot parses its contents for directives that instruct the robot to visit or not visit certain parts of the site.

Each Directive has a user-agent line which names the robot to be controlled and has a list of "disallows". The disallows are scanned in order, with the last match encountered determining whether a document is allowed to be visited or not. If there are no matches at all then the document may be crawled.
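
As a rough illustration of this check, Python's standard-library urllib.robotparser can evaluate the contents of a robots.txt file against a URL. The file contents, the robot name ExampleBot, and the example.com URLs below are made up for the example, and the standard-library parser may differ in detail from Spiderline's own crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; on a live site this text is served at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text without any network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/page.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.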

A Directive in the robots.txt consists of the following fields:
  User-agent:
  Disallow:

The User-agent Field
  • The name of the robot that should follow the specified access policy.
  • Acceptable values include a robot's name or an asterisk (*) to indicate all robots.
  • Each Disallow field must be preceded by a User-agent field.
  • More than one User-agent field can be present per directive.
The Disallow Field
  • A URL path or pattern that should not be visited (crawled).
  • Acceptable values include a full path, partial path, or empty set.
  • An empty set (the value is left blank) indicates that all paths can be visited.
  • Each User-agent field must be accompanied by a Disallow field.
  • More than one Disallow field can be present per directive.

Examples:

Exclude all robots from the entire server:
  User-agent: *
  Disallow: /

Allow all robots complete access:
  User-agent: *
  Disallow:
Or create an empty "/robots.txt" file.

Exclude all robots from part of the server:
  User-agent: *
  Disallow: /private/
  Disallow: /tmp/
  Disallow: /cgi/

Exclude a single robot:
  User-agent: Badbot
  Disallow: /

Exclude more than one robot:
  User-agent: Badbot_1
  User-agent: Badbot_2
  User-agent: Badbot_3
  Disallow: /

Allow a single robot:
  User-agent: Spiderline
  Disallow:

Directives can be combined for more specific instructions and control. A blank line separates one directive from the next:
  User-agent: Spiderline
  Disallow:

  User-agent: *
  Disallow: /

Exclude all files or directory paths except one:
This is difficult as there is no "Allow" field. The easiest way to accomplish this task is to place the files or directories you do not want crawled in a directory, for example 'norobots'. Put the file(s) and directories you do want robots to crawl in a level above the norobots directory.
  User-agent: *
  Disallow: /norobots/

Alternatively, you can explicitly disallow all pages that should not be visited by robots:
  User-agent: *
  Disallow: /dir/private.html
  Disallow: /dir/tmp.html
  Disallow: /dir/
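
If robots.txt is generated by a script rather than written by hand, the layout used in the examples above can be produced mechanically. The following is a minimal sketch; the function name build_robots_txt and its input shape (a list of user-agent and path pairs) are assumptions made for this illustration.

```python
def build_robots_txt(records):
    """Render (user_agents, disallowed_paths) pairs as robots.txt text."""
    blocks = []
    for user_agents, disallowed in records:
        lines = [f"User-agent: {agent}" for agent in user_agents]
        # An empty Disallow value means every path may be visited.
        paths = disallowed or [""]
        # One Disallow line per path; rstrip() tidies the empty-value case.
        lines += [f"Disallow: {path}".rstrip() for path in paths]
        blocks.append("\n".join(lines))
    # A blank line separates one directive (record) from the next.
    return "\n\n".join(blocks) + "\n"


# Allow Spiderline everywhere and exclude every other robot.
robots_txt = build_robots_txt([
    (["Spiderline"], []),
    (["*"], ["/"]),
])

if __name__ == "__main__":
    print(robots_txt)
```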

IMPORTANT NOTES!

There is a difference between the following:
  User-agent: *
  Disallow: /docs
and
  User-agent: *
  Disallow: /docs/

In the first example, compliant robots will not visit any document whose path begins with '/docs'. This covers /docs.html, /docs.pdf, and /docs.jpg, as well as everything under the /docs/ directory, such as /docs/webpage.html, /docs/tmp/page.pdf, and /docs/dir/tmp/image.gif.
In the second example, compliant robots will not visit anything under the /docs/ directory, but /docs.html, /docs.pdf, and /docs.jpg remain open to crawling because their paths do not begin with '/docs/'.
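
The prefix rule can be sketched in a few lines of Python. The helper name is_disallowed is hypothetical, and this covers only the matching step, not a full robots.txt parser.

```python
def is_disallowed(path, disallow_values):
    """Return True if path begins with any non-empty Disallow value."""
    # An empty value disallows nothing, so it is skipped.
    return any(value and path.startswith(value) for value in disallow_values)


if __name__ == "__main__":
    for rule in ("/docs", "/docs/"):
        for path in ("/docs.html", "/docs/webpage.html"):
            print(f"Disallow: {rule:7} {path:20} blocked={is_disallowed(path, [rule])}")
```

With this helper, "Disallow: /docs" matches both /docs.html and /docs/webpage.html, while "Disallow: /docs/" matches only /docs/webpage.html.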

Regular expressions are not supported in the User-agent or Disallow fields. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have entries such as "Disallow: /tmp/*" or "Disallow: *.gif".

You need a separate "Disallow" line for every URL prefix you want to exclude. You cannot enter "Disallow: /cgi-bin/ /tmp/" on one line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
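
These rules lend themselves to a simple automated check. The sketch below flags the mistakes described in this section; the function name lint_robots_txt and the wording of its messages are invented for the example, and it is not a complete validator.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for common robots.txt mistakes."""
    problems = []
    in_record = False  # True once a User-agent line has opened a record
    for number, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue  # comment-only lines do not end a record
        line = stripped.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            in_record = False  # a blank line ends the current record
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_record = True
        elif field == "disallow":
            if not in_record:
                problems.append((number, "Disallow without a preceding User-agent"))
            if len(value.split()) > 1:
                problems.append((number, "more than one path on a Disallow line"))
            if "*" in value:
                problems.append((number, "wildcards are not supported in Disallow"))
    return problems


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /cgi-bin/ /tmp/\nDisallow: *.gif\n\nDisallow: /private/\n"
    for number, message in lint_robots_txt(sample):
        print(f"line {number}: {message}")
```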


Robots META Tags

The Robots META tag is another method that may be used to indicate to visiting robots whether a page should be indexed (crawled), or links on the page should be followed. It differs from the Protocol for Robots Exclusion in that it requires no effort or permission from your Web Server Administrator, because the tag is placed in the page itself.

The content of the robots META tag contains directives separated by commas. You can define [no]index, [no]follow, all, or none. The INDEX directive specifies if an indexing robot should index the page. While a robot crawls around your web site, it collects information about the words and links on each page; this is the process of indexing. The FOLLOW directive specifies if a robot is to follow links on the page. The defaults are INDEX and FOLLOW. The values ALL and NONE set all directives on or off: all=index,follow and none=noindex,nofollow. NOTE: The "robots" name of the tag and the content are case insensitive.
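
One way a robot might interpret the content value is sketched below. The function name parse_robots_content is hypothetical, and treating conflicting values (such as "index,noindex") restrictively is an assumption of this sketch rather than documented behavior.

```python
def parse_robots_content(content):
    """Return (index, follow) flags for a robots META content string."""
    # The content is case insensitive and its directives are comma separated.
    tokens = {token.strip() for token in content.lower().split(",")}
    # INDEX and FOLLOW are the defaults, so only the negative values change anything.
    index = not (tokens & {"noindex", "none"})
    follow = not (tokens & {"nofollow", "none"})
    return index, follow


if __name__ == "__main__":
    for value in ("all", "noindex", "NOFOLLOW", "none", "noindex, nofollow"):
        print(repr(value), parse_robots_content(value))
```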

Like any META tag, it should be placed between the <head></head> tags of an HTML page:

  <html>
  <head>
  <meta name="robots" content="none">
  <meta name="description" content="This page ....">
  <title>...</title>
  </head>
  <body>
  ...


Examples:

HTML page you do not want crawled/indexed:
  <meta name="robots" content="noindex">
HTML page you want crawled, but do not want the robot to follow the links on that page:
  <meta name="robots" content="nofollow">
HTML page you do not want crawled AND do not want the robot to follow the links on that page:
  <meta name="robots" content="none">
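
To see which robots META values a page actually carries, the tag can be pulled out with Python's standard html.parser module. This is a minimal sketch: the class and function names are made up for the example, and it does not check that the tag sits inside the <head> section.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the content of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The "robots" name and the content value are case insensitive.
        if (attributes.get("name") or "").lower() == "robots":
            self.contents.append((attributes.get("content") or "").lower())


def find_robots_meta(html):
    """Return the lowercased content values of all robots META tags in html."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.contents


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="none"><title>t</title></head><body></body></html>'
    print(find_robots_meta(page))
```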

Excluding/Including Sections of a Page

This help topic describes how to prevent sections of a document from being indexed. To prevent an entire document from being indexed, see the topics above.

Spiderline supports the proprietary "robots" comment tag. This tag allows a web author to apply robots exclusion rules to arbitrary sections of a document. The tag has one attribute, content, with the following possible values:

  • noindex - the text enclosed in the tag is not saved in the index
  • nofollow - links are not extracted from the text enclosed
  • none - enclosed text is not indexed nor searched for links

Values "index", "follow", and "all" are also valid. In practice they are ignored since they are the unspoken defaults.

This feature is intended for cases where certain parts of a document, such as a navigational sidebar, should be kept out of the search.

Example:

<HTML>
<BODY>

This text will be indexed.
    <A HREF="foo.html"> this link will be followed </A>

<!-- robots content="none" -->

This text will NOT be indexed.
        <A HREF="bar.html"> this link will NOT be followed </A>

<!-- /robots -->

<!-- robots content="noindex" -->

This text will NOT be indexed.
<A HREF="bar1.html"> this link WILL be followed </A>

<!-- /robots -->

<!-- robots content="nofollow" -->

This text WILL be indexed.
<A HREF="bar1.html"> this link will NOT be followed </A>

<!-- /robots -->

la la la

</BODY>
</HTML>

For the example of a navigational sidebar, the "noindex" value would be the best choice.
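
To preview which text such sections would leave available for indexing, the markers can be processed with a regular expression. This is only a sketch and not Spiderline's implementation: the function name is hypothetical, it assumes sections are not nested and that values may be comma separated as in the META tag, and it does not handle the link-following side of "nofollow".

```python
import re

# Matches <!-- robots content="..." --> ... <!-- /robots --> (non-nested).
SECTION = re.compile(
    r'<!--\s*robots\s+content="([^"]*)"\s*-->(.*?)<!--\s*/robots\s*-->',
    re.IGNORECASE | re.DOTALL,
)


def strip_unindexed_sections(html):
    """Remove sections whose robots comment tag says noindex or none."""
    def replace(match):
        directives = {d.strip().lower() for d in match.group(1).split(",")}
        if directives & {"noindex", "none"}:
            return ""  # drop the enclosed text from the indexable copy
        return match.group(2)  # keep the text, discard only the markers
    return SECTION.sub(replace, html)


if __name__ == "__main__":
    sample = ('Indexed. <!-- robots content="noindex" -->Sidebar<!-- /robots --> '
              '<!-- robots content="nofollow" -->Also indexed.<!-- /robots -->')
    print(strip_unindexed_sections(sample))
```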

This syntax was designed to match the robots META tag.

For documents which have both the "robots" META tag and the "robots" comment tag, the most restrictive interpretation will be made, always erring on the side of not indexing or not following.
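
Expressed as code, "most restrictive" means a permission holds only when both sources grant it. The sketch below assumes each source has already been reduced to a pair of (index, follow) flags; the function name is hypothetical.

```python
def most_restrictive(meta_flags, section_flags):
    """Combine page-level and section-level (index, follow) flags restrictively."""
    meta_index, meta_follow = meta_flags
    section_index, section_follow = section_flags
    # Text is indexed, or links followed, only if neither source forbids it.
    return (meta_index and section_index, meta_follow and section_follow)


if __name__ == "__main__":
    # Page-level META tag says "nofollow"; the section comment tag says "noindex".
    print(most_restrictive((True, False), (False, True)))
```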

