Search Engine Spiders Lost Without Guidance - Post This Sign!

The robots.txt file implements the robots exclusion standard, which compliant web crawlers/robots consult to learn which files and directories you want them to stay OUT of on your site. Not all crawlers/bots follow the exclusion standard; some will continue crawling your site anyway. I like to call them "Bad Bots" or trespassers. We block them by IP exclusion, which is another story entirely.

This is a very simple overview of robots.txt basics for webmasters. For a complete and thorough lesson, visit http://www.robotstxt.org/

The proper format for a fairly standard robots.txt file appears directly below. That file must sit at the root of the domain, because that is where the crawlers expect it to be, not in some secondary directory.

Below is the proper format for a robots.txt file ----->

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

--------> End of robots.txt file

This tiny text file is saved as a plain text document, ALWAYS with the name "robots.txt", in the root of your domain.

A quick review of the listed information from the robots.txt file above follows. The "User-agent: msnbot" record addresses MSN's crawler, Slurp is from Yahoo and Teoma is from AskJeeves. The others listed are "Bad" bots that crawl very fast and to nobody's benefit but their own, so we ask them to stay out entirely. The * asterisk is a wild card that means "All" crawlers/spiders/bots should stay out of the files or directories listed in that group.

Bots given the instruction "Disallow: /" should stay out entirely. Those given "Crawl-delay: 10" crawled our site too quickly, bogging it down and overusing server resources. Google crawls more slowly than the others and doesn't require that instruction, so it is not specifically listed in the above robots.txt file. The Crawl-delay instruction is only needed on very large sites with hundreds or thousands of pages. The wildcard asterisk * record applies to every crawler, bot and spider that has no record of its own, including Googlebot. Under the exclusion standard, a crawler that finds a record naming it follows that record instead of the * record, so to keep msnbot, Teoma and Slurp out of the listed directories as well, repeat those Disallow lines in their records.
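One way to check how a rule-following crawler would read the file above is to feed it to the robots.txt parser in Python's standard library. This is only a sketch: the helper name `report` and the test paths are made up for illustration, and real crawlers may interpret the file somewhat differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file shown above, pasted in as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def report(agent, path):
    """Hypothetical helper: return (may_fetch, crawl_delay) for one crawler."""
    return parser.can_fetch(agent, path), parser.crawl_delay(agent)


for agent in ("Googlebot", "msnbot", "psbot"):
    allowed, delay = report(agent, "/images/logo.gif")
    print(agent, "allowed:", allowed, "crawl-delay:", delay)
```

With this parser, Googlebot falls under the * record and is kept out of /images/, psbot is kept out of everything, and msnbot gets its 10 second delay. Because msnbot has a record of its own with no Disallow lines, this parser does not apply the * record's Disallow lines to it.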

The crawlers we gave the "Crawl-delay: 10" instruction to were requesting as many as 7 pages every second, so we asked them to slow down. The number is in seconds, and you can change it to suit your server capacity, based on their crawling rate. Ten seconds between page requests is far more leisurely and stops them from asking for more pages than your server can dish up.
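To put those two rates side by side, here is a small arithmetic sketch. The function names are invented for illustration, and it assumes a crawler that keeps up a steady pace for a full day, which real crawlers rarely do.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day_at_rate(pages_per_second):
    """Requests in one day for a crawler fetching at a steady rate."""
    return pages_per_second * SECONDS_PER_DAY


def requests_per_day_with_delay(delay_seconds):
    """Upper bound on requests in one day when each request waits delay_seconds."""
    return SECONDS_PER_DAY // delay_seconds


unthrottled = requests_per_day_at_rate(7)    # 7 pages every second
throttled = requests_per_day_with_delay(10)  # Crawl-delay: 10

print(unthrottled, throttled, unthrottled // throttled)
```

At a sustained 7 pages per second a crawler could ask for 604,800 pages in a day; with a ten second delay the ceiling drops to 8,640, a seventyfold reduction.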

(You can discover how fast robots and spiders are crawling by looking at your raw server logs, which show pages requested by precise times to within a hundredth of a second. They are available from your web host, or ask your web or IT person. If you have server access, the logs can be found in the root directory, and you can usually download compressed server log files by calendar day right off your server. You'll need a utility that can expand compressed files to open and read those plain text raw server log files.)
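If you would rather not read the logs by eye, a short script can count requests for you. The sketch below assumes the log uses the common Apache "combined" format with one-second timestamps; your host's format may differ, in which case the pattern needs adjusting. The function name, the sample lines and the "ExampleBot" user agent are all invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: Apache "combined" log format. Adjust to match your own logs.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def peak_requests_per_second(lines):
    """Return {user_agent: most requests seen within any single second}."""
    per_second = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        stamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        per_second[(match.group("agent"), stamp)] += 1
    peaks = {}
    for (agent, _stamp), count in per_second.items():
        peaks[agent] = max(peaks.get(agent, 0), count)
    return peaks


# Made-up sample lines standing in for a real log file.
sample = [
    '192.0.2.10 - - [17/Aug/2005:10:15:01 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [17/Aug/2005:10:15:01 -0700] "GET /b.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [17/Aug/2005:10:15:01 -0700] "GET /c.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [17/Aug/2005:10:15:02 -0700] "GET /d.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [17/Aug/2005:10:15:02 -0700] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
peaks = peak_requests_per_second(sample)
print(peaks)
```

For a compressed daily log you can skip the separate expanding step and pass the file straight in, for example `peak_requests_per_second(gzip.open("access.log.gz", "rt"))` after `import gzip`, where the file name is a placeholder for whatever your host provides.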

To see the contents of any robots.txt file, just type /robots.txt after any domain name. If the site has that file up, you will see it displayed as a text file in your web browser. Click on the link below to see that file for Amazon.com:

http://www.Amazon.com/robots.txt

You can see the contents of any website robots.txt file that way.
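The same lookup can be scripted. The sketch below builds the robots.txt address from any page address on a site and, optionally, downloads the file. Both function names are invented for illustration, and the download step assumes the site answers a plain request with readable text.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_txt_url(page_url):
    """Return the robots.txt address at the root of page_url's domain."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots_txt(page_url, timeout=10):
    """Download the site's robots.txt and return it as text (needs network access)."""
    with urlopen(robots_txt_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(robots_txt_url("http://www.Amazon.com/some/page.html?id=1"))
```

Calling `fetch_robots_txt("http://www.Amazon.com/")` should return the same text your browser displays, provided the site serves the file to scripts as it does to browsers.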

The robots.txt shown above is what we currently use at Publish101 Web Content Distributor, just launched in May of 2005. We did an extensive case study and published a series of articles on crawler behavior and indexing delays known as the Google Sandbox. That Google Sandbox Case Study is highly instructive on many levels for webmasters everywhere about the importance of this often ignored little text file.

Our research into indexing delays (known as the Google Sandbox) taught us two things we didn't expect. The first was the importance of robots.txt files to quick and efficient crawling by the spiders from the major search engines. The second was the number of heavy crawls from bots that will do no earthly good to the site owner, yet crawl most sites extensively, straining servers to the breaking point with requests coming as fast as 7 pages per second.

We discovered in our launch of the new site that Google and Yahoo will crawl the site whether or not you use a robots.txt file, but MSN seems to REQUIRE it before they will begin crawling at all. All of the search engine robots seem to request the file on a regular basis to verify that it hasn't changed.

Then when you DO change it, they will stop crawling for brief periods and repeatedly ask for that robots.txt file during that time without crawling any additional pages. (Perhaps they had a list of pages to visit that included the directory or files you have instructed them to stay out of and must now adjust their crawling schedule to eliminate those files from their list.)

Most webmasters instruct the bots to stay out of "image" directories and the "cgi-bin" directory as well as any directories containing private or proprietary files intended only for users of an intranet or password protected sections of your site. Clearly, you should direct the bots to stay out of any private areas that you don't want indexed by the search engines.

The importance of robots.txt is rarely discussed by average webmasters. I've even had some of my clients' webmasters ask me what it is and how to implement it when I tell them how important it is to both site security and efficient crawling by the search engines. This should be standard knowledge for webmasters at substantial companies, which illustrates how little attention is paid to the use of robots.txt.

The search engine spiders really do want your guidance, and this tiny text file is the best way to give crawlers and bots a clear signpost: one that warns off trespassers and protects private property, and warmly welcomes invited guests such as the big three search engines while asking them nicely to stay out of private areas.

Copyright © August 17, 2005 by Mike Banks Valentine

Google Sandbox Case Study http://publish101.com/Sandbox2 Mike Banks Valentine operates http://Publish101.com Free Web Content Distribution for Article Marketers and Provides content aggregation, press release optimization and custom web content for Search Engine Positioning http://www.seoptimism.com/SEO_Contact.htm
