Sunday, June 7, 2009

About robots.txt and resources

A basic introduction for beginners.
The robots.txt file tells search engine crawlers which parts of a website they are allowed to crawl and which parts they should stay out of.
The robots.txt file should be placed at the root of the website.
If your site is www.abc.com, then the robots.txt file should be placed at www.abc.com/robots.txt.

Basic syntax for the robots.txt file:
Specifying the user agent
User-agent: [the crawler the rules apply to; use * to match all user agents]
Disallowing URLs
Disallow: [the path you want to block; leaving it empty allows everything]
Each record starts with a User-agent line, followed by one or more Disallow lines.
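As a simple illustration, here is what a complete robots.txt could look like (the /private/ directory is just a made-up example path):

# The rules below apply to every crawler
User-agent: *
# Do not crawl anything under /private/ (hypothetical directory)
Disallow: /private/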

Adding Disallow rules to robots.txt doesn't mean that all crawlers will stay away from those URLs; some web crawlers simply don't respect the robots.txt file, so don't rely on it to protect sensitive content.

Enable user-agent logging on your server so that you can see which robots have crawled your site.

Google's SEO guide says that if you have a subdomain and want some of its pages not to be crawled, you need to create a separate robots.txt file for that subdomain.
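For example (the subdomain name here is just for illustration), rules in www.abc.com/robots.txt do not apply to blog.abc.com; the subdomain needs its own file at blog.abc.com/robots.txt, such as:

# blog.abc.com/robots.txt (hypothetical subdomain, illustrative rule only)
User-agent: *
Disallow: /drafts/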
Google Webmaster Tools link for the robots.txt generator:
http://googlewebmastercentral.blogspot.com/2008/03/speaking-language-of-robots.html

You can also find more information and an FAQ about robots.txt at http://www.robotstxt.org/.