"Crawling" is the process of finding content on your web site. Finding web
pages is similar to a web user browsing through a site and clicking on links.
Spiderline spiders browse your site in the same way, gathering and following the
hyperlinks on every page. While Spiderline crawls your web site, it collects
information about the words and links on each page. This information is indexed
into a searchable database. The index is much like the index found in the back
of a book: you scan the alphabetical index to find a word and where
it is located in the book. The index of your web site content is stored in a database
on Spiderline servers and is accessed whenever a query is submitted.
The NOINDEX and NOFOLLOW options in the Patterns field allow you to specify which documents
are indexed and which links are followed without having to use a robots.txt file or robot meta tags.*
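The book-index analogy can be sketched in a few lines of Python. This is only an illustration of the general idea of a word index, not Spiderline's actual implementation; the function name `build_index` and the page contents are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the sorted list of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical page contents, for illustration only.
pages = {
    "http://www.domain.com/docs/a.html": "crawling finds pages",
    "http://www.domain.com/docs/b.html": "indexing stores words from pages",
}

index = build_index(pages)
print(index["pages"])     # both pages contain the word "pages"
print(index["crawling"])  # only a.html contains the word "crawling"
```

Looking up a word returns the pages where it occurs, just as a book index returns page numbers.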
NOINDEX tells our spiders not to collect information about the
content (words) of a document. A document that is not indexed is not searchable and will not appear
in any search results. You can apply this option to a single document, a type of document, or a document
path when you enter Patterns on the URL configuration page.
NOFOLLOW tells our spiders not to follow the hyperlinks in a document. Suppose that
only document A links to document B. If the links in document A are not followed (or, as we say, 'harvested'),
our spiders will never see document B, and as a result
the content of document B will not be searchable. You can apply this option to a single document, a type of
document, or a document path when you enter Patterns on the URL configuration page.
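The document A and document B scenario can be simulated with a small breadth-first crawl. This is a minimal sketch of the general idea, not Spiderline's crawler; the function `crawl`, the link graph, and the document names are assumptions made for illustration.

```python
from collections import deque


def crawl(start, links, nofollow=()):
    """Return the set of documents a spider reaches from `start`.

    links:    dict mapping each document to the documents it links to.
    nofollow: documents whose hyperlinks are not harvested.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        doc = queue.popleft()
        if doc in nofollow:
            continue  # the document is visited, but its links are ignored
        for target in links.get(doc, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: only document A links to document B.
links = {"home": ["A"], "A": ["B"], "B": []}

reached_default = crawl("home", links)
reached_nofollow = crawl("home", links, nofollow={"A"})
print(sorted(reached_default))
print(sorted(reached_nofollow))
```

With NOFOLLOW applied to A, document B is never reached, so it could not be indexed or searched.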
If you do not want to index the text on the pages within a particular directory,
but do want to follow those pages' links, simply add the keyword "NOINDEX" after
the appropriate entry in the Patterns field.
If you want to index the text on a page, but do not want to follow any of that
page's links, add the keyword "NOFOLLOW" after the entry in the Patterns field.
Examples:
- Index all text and follow all links
- http://www.domain.com/docs
- Index all text, but do not follow any links
- http://www.domain.com/links INDEX NOFOLLOW
- Follow all links, but do not index any text
- http://www.domain.com/sitemap NOINDEX FOLLOW
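The entry format in these examples (a pattern followed by optional keywords) can be read programmatically. The sketch below is hypothetical and not part of Spiderline; `parse_entry` is an illustrative name, and it assumes that an entry with no keywords means "index and follow", as the first example suggests.

```python
def parse_entry(entry):
    """Split a Patterns entry into (pattern, index, follow)."""
    parts = entry.split()
    if not parts:
        raise ValueError("empty Patterns entry")
    pattern = parts[0]
    index, follow = True, True  # assumed defaults when no keyword is given
    for keyword in (part.upper() for part in parts[1:]):
        if keyword == "INDEX":
            index = True
        elif keyword == "NOINDEX":
            index = False
        elif keyword == "FOLLOW":
            follow = True
        elif keyword == "NOFOLLOW":
            follow = False
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return pattern, index, follow


docs_rule = parse_entry("http://www.domain.com/docs")
links_rule = parse_entry("http://www.domain.com/links INDEX NOFOLLOW")
sitemap_rule = parse_entry("http://www.domain.com/sitemap NOINDEX FOLLOW")
print(docs_rule, links_rule, sitemap_rule, sep="\n")
```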
More example entries:
Enter "http://www.domain.com/sitemap.html NOINDEX FOLLOW" in the Patterns field.
The text of this document will not be indexed, but all hyperlinks found within it will be followed.
Enter ".swf NOINDEX FOLLOW" in the Patterns field. The text of .swf files will not be
indexed.
Enter "/tmp NOINDEX FOLLOW" in the Patterns field. No document whose path contains the pattern '/tmp'
will be indexed. Documents such as 'http://www.domain.com/tmp/index.html',
'http://www.domain.com/dir/tmp/file.txt', and 'http://www.domain.com/tmp.pdf' will not be
indexed.
Enter "http://www.domain.com/dir/document.html INDEX NOFOLLOW" in the Patterns field.
All text will be indexed, but the hyperlinks found within this document will not be followed.
Enter ".pdf INDEX NOFOLLOW" in the Patterns field. The text of all PDF files will be indexed,
but the hyperlinks found within them will not be followed.
Enter "/resources INDEX NOFOLLOW" in the Patterns field. Links within documents whose path contains
this pattern will not be followed. Hyperlinks within 'http://www.domain.com/resources/index.html',
'http://www.domain.com/dir/A/resources/index.html', and 'http://www.domain.com/resources.txt' will
not be followed.
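The path examples above suggest that a pattern applies wherever it appears in a document's URL. The following sketch treats matching as a plain substring test. Both that reading and the "first matching entry wins" rule are assumptions for illustration, since the actual precedence Spiderline uses is not described here. The name `rule_for` is hypothetical.

```python
def rule_for(url, entries):
    """Return (index, follow) for `url` from the first entry whose pattern it contains.

    entries: list of (pattern, index, follow) tuples.
    A URL that matches no pattern is assumed to be indexed and followed.
    """
    for pattern, index, follow in entries:
        if pattern in url:  # assumed: plain substring match on the full URL
            return index, follow
    return True, True


entries = [
    ("/tmp", False, True),  # "/tmp NOINDEX FOLLOW"
    (".pdf", True, False),  # ".pdf INDEX NOFOLLOW"
]

for url in (
    "http://www.domain.com/tmp/index.html",
    "http://www.domain.com/dir/tmp/file.txt",
    "http://www.domain.com/tmp.pdf",
    "http://www.domain.com/docs/guide.pdf",
    "http://www.domain.com/docs/index.html",
):
    print(url, rule_for(url, entries))
```

Note that 'http://www.domain.com/tmp.pdf' contains both patterns; under the assumed first-match rule, the '/tmp' entry decides.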
*Spiderline honors the robots exclusion protocol and META robots tags. Our spiders will
not index directories or follow links that have been disallowed in the robots.txt
file located on your server, or by META tags designating "noindex", "nofollow", and/or "none".
If you already use these methods to control the spider, you do not need to specify
NOINDEX and NOFOLLOW in the Patterns field.
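If you want to check what an existing robots.txt file already disallows before adding Patterns entries, Python's standard `urllib.robotparser` module can evaluate it. This is a sketch only: the robots.txt rules and the user-agent name "ExampleSpider" are made up for illustration, and the user-agent string Spiderline's spiders actually send is not given here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_lines = [
    "User-agent: *",
    "Disallow: /tmp/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "ExampleSpider" is a placeholder user-agent name.
tmp_allowed = parser.can_fetch("ExampleSpider", "http://www.domain.com/tmp/index.html")
docs_allowed = parser.can_fetch("ExampleSpider", "http://www.domain.com/docs/index.html")
print(tmp_allowed, docs_allowed)
```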