Robots.txt
A text file placed in a website's root directory that tells search engine crawlers which pages or sections of the site should or should not be crawled.
What is Robots.txt?
Robots.txt is a plain text file that follows the Robots Exclusion Protocol, a standard used by websites to communicate with search engine crawlers and other web robots. Located at the root of a domain (e.g., example.com/robots.txt), it provides instructions about which parts of the site crawlers are allowed or disallowed from accessing. Major search engines such as Google and Bing honor robots.txt directives, but the protocol is advisory: compliant crawlers choose to obey it, while poorly behaved or malicious bots can ignore it entirely.
The file uses a simple syntax with User-agent directives to specify which crawlers the rules apply to, and Allow/Disallow directives to permit or block access to specific URL paths. You can also include a Sitemap directive to point crawlers to your XML sitemap. For example, you might block crawlers from accessing admin pages, search results pages, or staging environments while ensuring all important content pages are accessible.
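Putting those directives together, a minimal robots.txt for a hypothetical site might look like this (the paths and sitemap URL are illustrative, not prescriptive):

```
# Rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /search

# Additional rule for Google's crawler only
User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
```

Blank lines separate groups of rules; each group starts with one or more User-agent lines, and a crawler follows the most specific group that matches it.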
It is critical to understand that robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in search results if other pages link to it: Google may index the URL with a limited description even though it cannot crawl the page's content. To prevent indexing, use the noindex meta robots tag or the X-Robots-Tag HTTP header instead. Common robots.txt mistakes include accidentally blocking important pages or entire sections of the site, blocking CSS and JavaScript files that Googlebot needs to render pages, and writing overly broad Disallow rules.
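To check how a given robots.txt file applies to a URL, you can use Python's standard-library parser. This is a minimal sketch using a hypothetical example.com rule set fed in as a string rather than fetched over HTTP:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for example.com
rules = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths under /admin/ are disallowed for all user agents
print(parser.can_fetch("*", "https://example.com/admin/settings"))  # False

# Anything not matched by a Disallow rule is allowed by default
print(parser.can_fetch("*", "https://example.com/blog/post"))  # True
```

One caveat: urllib.robotparser applies the first matching rule in file order, whereas Google resolves Allow/Disallow conflicts by the most specific (longest) matching path, so results can differ on files that mix overlapping Allow and Disallow rules.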