Introduction

The robots.txt file is a plain-text file that web crawlers (a.k.a. bots) look for at the root of a website before they start visiting its pages. It contains instructions telling them which parts of the site they may or may not visit.

One of the simplest robots.txt examples is:

  user-agent: *
  disallow:

  sitemap: http://[www.mysite.com]/sitemap.xml

The first two instructions tell all user agents (web crawlers) that nothing is disallowed, i.e. everything is allowed. Therefore, they can request any content on the website. If the disallow instruction were followed by a /, it would mean nothing can be requested (a.k.a. disallow everything).

The allow instruction can also be used to authorize access to specific parts of the website, for example to re-allow a path inside a disallowed directory (see the sketch below). These directives are policies that well-behaved web crawlers ought to follow, but nothing prevents malicious bots from ignoring them. Secure access control cannot be achieved with a robots.txt file.
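
For instance, a minimal sketch (the paths here are hypothetical) combining both instructions to disallow an entire directory while re-allowing one of its subdirectories:

  user-agent: *
  disallow: /private/
  allow: /private/public-docs/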

The last line, known as the sitemap directive, tells web crawlers where to find the site's sitemap file. It is possible to declare multiple sitemaps:

  sitemap: http://[www.mysite.com]/sitemap1.xml
  sitemap: http://[www.mysite.com]/sitemap2.xml
  sitemap: http://[www.mysite.com]/sitemap3.xml

If a website does not provide a robots.txt file, Google interprets this as full access being allowed, provided that fetching the robots.txt file returns an HTTP status code of 404 or 410.

Best Practices

1. Every website should have a robots.txt file at its root, and preferably a sitemap.xml file too. The sitemap file can be empty at first.
2. All sitemaps should be declared in the robots.txt file, since not all web crawlers look for them automatically. However, all web crawlers look for a robots.txt file.
3. One can use the simple example provided above as a start.
4. Unless you really know what you are doing and have a good reason to block parts of your website, let web crawlers access it fully. In particular, don't block CSS and JavaScript files, since some search engines may refuse to rank pages referencing these files if they cannot access them (see the sketch after this list).
5. Preferably, keep your sitemaps in the root directory. If a sitemap is located in a subdirectory, it can (in theory) only include URLs of pages located in that subdirectory.
6. It is recommended to encode both your robots.txt file and your sitemaps in UTF-8.
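
As an illustration of point 4 (the directory name is hypothetical), a robots.txt following these practices blocks only a non-public section and leaves CSS, JavaScript and everything else fully crawlable:

  user-agent: *
  disallow: /admin/

  sitemap: http://[www.mysite.com]/sitemap.xml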

Once you have created (or modified) your robots.txt file, it is always good to double-check it with an online checker.
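
If you prefer to check rules locally, the following sketch uses Python's standard urllib.robotparser module (the rules and URLs are made-up examples). Note that this parser applies rules in order rather than by specificity, so allow lines should precede the broader disallow lines they refine:

  from urllib.robotparser import RobotFileParser

  # Hypothetical rules to validate before publishing them.
  rules = [
      "user-agent: *",
      "allow: /private/public-docs/",
      "disallow: /private/",
  ]

  parser = RobotFileParser()
  parser.parse(rules)

  # Ask whether a generic crawler ("*") may fetch a few sample URLs.
  for url in [
      "http://www.mysite.com/index.html",
      "http://www.mysite.com/private/reports.html",
      "http://www.mysite.com/private/public-docs/guide.html",
  ]:
      verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
      print(url, "->", verdict)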

How To Block All Crawlers?

The simplest way to prevent web crawlers from accessing a website is to modify the disallow instruction as follows:

  user-agent: *
  disallow: /

If you want to remove web pages from search engine indexes, this is not the recommended method: a page blocked by robots.txt can no longer be crawled, but it can still appear in search results if it is already indexed or linked from other sites. Return a 410 HTTP status code or mark your pages as NOINDEX instead.
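
As a brief illustration (not specific to any particular server setup), a page can be marked as non-indexable either with a robots meta tag in its HTML head or with an HTTP response header:

  <meta name="robots" content="noindex">

  X-Robots-Tag: noindex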

Other Facts

- The order of rules in the robots.txt file does not matter for Google.
- The most specific rule prevails over the less specific one: the rule matching the most characters of the URL wins (see the example after this list).
- Google bots search for the robots.txt file on each website at least once per day.
- You cannot use escaped fragment URLs in a robots.txt file.
- Unknown robots.txt directives are ignored.
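
As a hedged illustration of the specificity rule (the paths are hypothetical), with the rules below a request for /folder/page.html is allowed for Google, because the allow rule matches more characters than the disallow rule, regardless of their order:

  user-agent: *
  disallow: /folder/
  allow: /folder/page.html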