The robots.txt file is a plain-text file that web crawlers (a.k.a. bots) request at the root of a website before they start visiting its pages. It contains directives telling them which parts of the site they may or may not crawl.
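A minimal, permissive robots.txt might look like this (the domain is a placeholder; the paragraphs below walk through each line):

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```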
The first two instructions tell all user-agents (web crawlers) that nothing is disallowed (in other words, allow everything). They can therefore request any content on the website. If the Disallow instruction were followed by a /, it would mean nothing may be requested (i.e., disallow everything).
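By contrast, a robots.txt that disallows everything would look like:

```
User-agent: *
Disallow: /
```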
The Allow instruction can also be used to authorize access to parts of the website. These instructions are policies that web crawlers ought to follow, but nothing prevents malicious bots from ignoring them. Secure access control cannot be achieved with a robots.txt file.
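For instance, an Allow rule can carve out an exception within a disallowed directory (the paths here are hypothetical):

```
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
```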
The last line tells web crawlers where to find the site's sitemap file. It is known as a sitemap directive. It is possible to declare multiple sitemaps:
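For example (with placeholder URLs):

```
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
```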
If a website does not implement a robots.txt file, Google interprets this as allowing all access (if and only if fetching the robots.txt file returns a 404 or 410 HTTP status code).
Once you have created (or modified) your robots.txt file, it is always good to double-check it with an online checker.
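If you prefer to check rules locally, Python's standard-library `urllib.robotparser` can parse a robots.txt and answer fetch queries. This is a minimal sketch; the rules and URLs are made up for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt content to validate
rules = """\
User-agent: *
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Ask whether a given user-agent may fetch a given URL
print(rp.can_fetch("*", "https://example.com/admin/login"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```

For a live site, `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` fetches the real file instead.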
How To Block All Crawlers?
The simplest way to prevent web crawlers from accessing a website is to modify the Disallow instruction as follows:
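```
User-agent: *
Disallow: /
```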
If you want to remove some web pages from search engine indexes, this is not the recommended method. Return a 410 HTTP status code or mark your pages as NOINDEX instead.
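The NOINDEX approach is typically done with a robots meta tag in the page's head:

```html
<meta name="robots" content="noindex">
```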
- The order of rules in the robots.txt file does not matter for Google.
- The most specific rule prevails over the less specific one (the number of matching characters is used).
- Google's bots fetch the robots.txt file of each website at least once per day.
- You cannot use escaped-fragment URLs in a robots.txt file.
- Unknown robots.txt directives are ignored.