Spiderline

Site Search Knowledge Base


Learn How to Configure URLs

Configuring URLs can be as simple or detailed as needed for your website. The Starting URL and Pattern fields in combination with the INDEX and FOLLOW options allow you to control exactly what portions of your website are crawled and indexed by Spiderline.

Starting URLs
Pattern Basics
Order of Precedence
Noindex & Nofollow Options
Excluding Documents
Ending with Slashes or Not
Searching Other Websites

Starting URLs



Enter the full URL of a web page for each Starting URL. A Starting URL should not be a pattern; it must be a complete URL that points to an actual web page. Each Starting URL tells our spiders where on your website to begin crawling.
  • Type your Starting URL(s), one per line. You may use the NOINDEX option in this field. Do not use the NOFOLLOW option here: it would stop the spider from following links on that page, which defeats the purpose of a Starting URL.

  • The Starting URL typically matches the homepage of the web site you want to index and search. All other pages are linked to either directly or indirectly from the homepage URL.

  • If your site has multiple domains or subdomains, or if portions of your site are not linked, directly or indirectly, from one main Starting URL, you may enter additional Starting URLs so that those pages are crawled as well.

  • Regardless of the Starting URL, Spiderline will still honor any INDEX / FOLLOW options, robot META tags, and the standard robot exclusion protocol.
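
As a rough illustration of the "complete URL, not a pattern" rule above, the following Python sketch checks whether a line looks like a usable Starting URL. The function name and the restriction to http/https are assumptions made for this example; Spiderline's own validation is not documented here.

```python
from urllib.parse import urlsplit


def looks_like_starting_url(line: str) -> bool:
    """Return True if the first token on the line is a complete http(s) URL.

    Hypothetical helper: anything after the URL (such as NOINDEX) is ignored.
    """
    parts = line.split()
    if not parts:
        return False
    url = urlsplit(parts[0])
    # A complete URL has both a scheme and a host; a bare pattern has neither.
    return url.scheme in ("http", "https") and bool(url.netloc)


for candidate in [
    "http://www.yourdomain.com/",
    "http://www.yourdomain.com/sitemap.html NOINDEX",
    "/dir",
    "www.yourdomain.com",
]:
    print(candidate, "->", looks_like_starting_url(candidate))
```

The last two candidates are patterns rather than complete URLs, so they belong in the Patterns field instead.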

Patterns

The Patterns field specifies which of the documents and directories linked to, directly or indirectly, from the Starting URLs should be crawled and indexed, and which should not.

  • Enter document paths, patterns, or regular expressions in the Patterns field. Type each pattern, one per line, with any desired index or follow options. Learn more about index and follow options.

  • Entries in the Patterns field should follow the format below. The default is INDEX and FOLLOW, unless otherwise specified.
    Format:
           Pattern  Index_Option  Follow_Option

  • If you want everything on your website searchable, enter your domain name in the Patterns field. CAUTION! Entering just a "/" rather than your full domain name allows our spiders to crawl and index all documents on your website and on every website you link to, because every URL, for example "http://www.anydomain.com/", has a "/" somewhere in its path. Spiderline does not necessarily crawl the documents on your own domain first, so with a "/" entry you could end up crawling every website on the internet. Fortunately, Spiderline has document limits in place to guard against such human errors.

  • If you have only a few specific paths you want crawled, enter those paths in the Patterns field. Searchable documents should match a pattern that carries the INDEX option. To exclude a certain document or directory, either the matching pattern must specify NOINDEX, or the document must be linked to only from documents that carry the NOFOLLOW option.
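
The entry format and the "/" caution above can be sketched in Python. This is only an illustration: the names PatternEntry, parse_entry, and matches are hypothetical, and the sketch assumes a pattern matches when it appears anywhere in the URL (plain substring matching, with no regular expression support).

```python
from typing import NamedTuple


class PatternEntry(NamedTuple):
    pattern: str
    index: bool   # True for INDEX, False for NOINDEX
    follow: bool  # True for FOLLOW, False for NOFOLLOW


def parse_entry(line: str) -> PatternEntry:
    """Parse one 'Pattern  Index_Option  Follow_Option' line."""
    parts = line.split()
    if not parts:
        raise ValueError("empty Patterns entry")
    options = {p.upper() for p in parts[1:]}
    # The default is INDEX and FOLLOW unless the entry says otherwise.
    return PatternEntry(parts[0], "NOINDEX" not in options, "NOFOLLOW" not in options)


def matches(entry: PatternEntry, url: str) -> bool:
    # Assumption: substring matching only.
    return entry.pattern in url


other_site = "http://www.some-other-site.example/page.html"
print(parse_entry("/tmp NOINDEX NOFOLLOW"))
print(matches(parse_entry("/"), other_site))                  # "/" matches any URL
print(matches(parse_entry("www.anydomain.com"), other_site))  # the domain does not
```

Under that assumption, a lone "/" entry matches pages on any website, while a full domain name matches only URLs that contain that domain.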

Order of URLs in the Patterns Field

The precedence of entries in the Patterns field is the inverse of their order. Entries are read from top to bottom, meaning that the last entry's options will override any previous conflicting entries. If the Spiderline robot encounters a page that matches more than one entry in the Patterns field, the entry that is listed later will take precedence. This allows you to be even more specific in what you do and do not want crawled.

Example:
By entering the following two lines in the patterns field, all documents that contain "/dir" will be crawled, unless the document path also contains "/tmp". A document located at "/tmp/x.html", "/dir/tmp/here.html", or "/dir/tmp.html" will not be crawled and links within those documents will not be followed.

   /dir
   /tmp NOINDEX NOFOLLOW
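
The "last matching entry wins" rule can be sketched in a few lines of Python. The resolve function is a hypothetical helper, and the sketch assumes substring matching and returns None when no entry matches; it is not Spiderline's actual implementation.

```python
def resolve(entries, url):
    """Return (index, follow) for url, letting later entries override earlier ones."""
    result = None  # None means no entry matched
    for pattern, index, follow in entries:
        if pattern in url:  # assumption: substring matching
            result = (index, follow)
    return result


# The two entries from the example: "/dir" and "/tmp NOINDEX NOFOLLOW".
entries = [
    ("/dir", True, True),
    ("/tmp", False, False),
]

for url in ["/dir/a.html", "/tmp/x.html", "/dir/tmp/here.html", "/dir/tmp.html"]:
    print(url, resolve(entries, url))
```

Only "/dir/a.html" keeps the INDEX and FOLLOW defaults; the other three paths also contain "/tmp", so the later entry overrides the first.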

Preventing Documents from being Searchable



You can use the Patterns field to prevent documents on your website from being crawled or indexed, so that they are not searchable.
  • If you want to prevent a particular type of file from being indexed, enter the file extension followed by NOINDEX.
    For example:
    • .cgi  NOINDEX
    • .pdf  NOINDEX

  • If you want to prevent just a few specific documents from being searchable, enter a path to each page followed by NOINDEX.
    For example:
    • /some/path/private.html  NOINDEX
    • /tmp/finances.pdf  NOINDEX

  • If you have many documents you do not want searchable, enter the path to the directory (or the common path prefix) where the documents are located, followed by NOINDEX.
    For example:
    • /some/path  NOINDEX
    • /tmp/  NOINDEX

  • If the pages you do not want searchable link to other documents that you also do not want searched, add NOFOLLOW to the entry.
    For example:
    • .pdf  NOINDEX  NOFOLLOW
    • /some/path/private.html  NOINDEX  NOFOLLOW
    • /tmp/  NOINDEX  NOFOLLOW
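
To preview which documents a set of exclusion entries would leave searchable, a small Python sketch like the one below can help. The is_searchable function and the sample URLs are hypothetical; the sketch assumes substring matching, that the last matching entry wins, and that a URL matching no entry is not indexed.

```python
def is_searchable(pattern_lines, url):
    """Return True if the last matching entry leaves the URL indexed."""
    indexed = False  # assumption: a URL that matches nothing is not indexed
    for line in pattern_lines:
        parts = line.split()
        if not parts:
            continue
        pattern, options = parts[0], {p.upper() for p in parts[1:]}
        if pattern in url:  # assumption: substring matching
            indexed = "NOINDEX" not in options
    return indexed


patterns = [
    "www.yourdomain.com",
    ".pdf  NOINDEX",
    "/tmp/  NOINDEX",
]

for url in [
    "http://www.yourdomain.com/index.html",
    "http://www.yourdomain.com/report.pdf",
    "http://www.yourdomain.com/tmp/notes.html",
]:
    print(url, "->", is_searchable(patterns, url))
```

Under these assumptions only index.html stays searchable. Note that a short entry such as ".pdf" would also match a URL that merely contains ".pdf" somewhere in its path, so it is worth checking a few sample URLs before saving the entries.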

The Difference between Ending a Pattern entry with "/" versus no slash

Yes, there is a difference between the following two entries for the Patterns field:
  • /path/x/  NOINDEX  NOFOLLOW
  • /path/x  NOINDEX  NOFOLLOW

    The first entry will not index or follow links from documents whose paths begin with '/path/x/'. This would cover /path/x/a.html, /path/x/b.html, and /path/x/etc.html.
    The second entry has the same effect on the example documents a.html, b.html, and etc.html, but will also not index or follow links from documents such as /path/x.html, /path/x.pdf, and /path/x_file.html.
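
The difference can be checked with a short Python sketch. The matched helper is hypothetical and assumes a pattern matches when it appears anywhere in the path.

```python
paths = [
    "/path/x/a.html",
    "/path/x/b.html",
    "/path/x.html",
    "/path/x.pdf",
    "/path/x_file.html",
]


def matched(pattern, candidates):
    """Return the paths that contain the pattern (assumed substring matching)."""
    return [p for p in candidates if pattern in p]


with_slash = matched("/path/x/", paths)
without_slash = matched("/path/x", paths)

print("with slash:   ", with_slash)
print("without slash:", without_slash)
```

With the trailing slash only the two documents inside the /path/x/ directory match; without it, all five paths match.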

Linking to Documents Outside your Domain



    To make documents on other websites searchable, but only the documents you link to and not the entire other website, enter "/  INDEX  NOFOLLOW" on the first line of the Patterns field and "www.yourdomain.com  INDEX  FOLLOW" on the second line. You can still configure which parts of your own website you do and do not want searchable on subsequent lines in the Patterns field.

         /  INDEX  NOFOLLOW
         www.yourdomain.com   INDEX  FOLLOW

    If you link to only a few websites, you can instead specify each domain name followed by INDEX  NOFOLLOW.

         www.referencesite_a.com  INDEX  NOFOLLOW
         www.referencesite_b.com  INDEX  NOFOLLOW
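
The effect of the two-line setup can be simulated with a toy crawl in Python. Everything here is a sketch under stated assumptions: the link graph, the page names, and the crawl and options_for functions are invented for illustration, matching is by substring, and the last matching entry wins.

```python
from collections import deque

# "/  INDEX  NOFOLLOW" followed by "www.yourdomain.com  INDEX  FOLLOW"
PATTERNS = [("/", True, False), ("www.yourdomain.com", True, True)]

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "http://www.yourdomain.com/": [
        "http://www.yourdomain.com/about.html",
        "http://www.referencesite_a.com/article.html",
    ],
    "http://www.yourdomain.com/about.html": [],
    "http://www.referencesite_a.com/article.html": [
        "http://www.referencesite_a.com/other.html",
    ],
    "http://www.referencesite_a.com/other.html": [],
}


def options_for(url):
    """Return (index, follow) from the last entry whose pattern occurs in url."""
    result = (False, False)
    for pattern, index, follow in PATTERNS:
        if pattern in url:
            result = (index, follow)
    return result


def crawl(start):
    """Breadth-first crawl that indexes pages and follows links as permitted."""
    indexed, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        index, follow = options_for(url)
        if index:
            indexed.append(url)
        if follow:
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return indexed


print(crawl("http://www.yourdomain.com/"))
```

In this toy graph the external article is indexed because your site links to it, but other.html on the same external site is never reached, since links on external pages are not followed.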




