The robots.txt file tells search engine crawlers which parts of your site they can and cannot access. A properly configured robots.txt optimizes crawl budget, prevents indexing of sensitive pages, and ensures search engines spend their time crawling your most important content.
What Is robots.txt?
robots.txt is a plain text file placed at the root of your domain (example.com/robots.txt) that provides directives to web crawlers about which URLs they are allowed to crawl.
- Location: Must be at the exact path /robots.txt — no subdirectories
- Format: Plain text with specific syntax for user-agent and allow/disallow directives
- Advisory, not enforced: robots.txt is a suggestion — well-behaved crawlers follow it, but malicious bots may ignore it
- Not for security: Do not use robots.txt to hide sensitive content — it is publicly readable
Basic robots.txt Syntax
User-Agent
Specifies which crawler the rules apply to.
- User-agent: * — applies to all crawlers
- User-agent: Googlebot — applies only to Google’s crawler
- User-agent: Bingbot — applies only to Bing’s crawler
Disallow
Tells crawlers not to access specific paths.
- Disallow: /admin/ — blocks the entire /admin/ directory
- Disallow: /private-page — blocks a specific URL path
- Disallow: / — blocks the entire site
- Disallow: (empty) — allows everything
Allow
Overrides a disallow for specific paths within a blocked directory.
- Allow: /admin/public-page — allows access to one page within a blocked /admin/ directory
Sitemap
Points crawlers to your XML sitemap location.
- Sitemap: https://example.com/sitemap.xml
- Can include multiple Sitemap directives
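The directives above can be exercised with Python's standard-library parser. A caveat for this sketch: urllib.robotparser applies the first matching rule in file order, whereas Google uses the most specific (longest) match, so the Allow line is placed before the Disallow here; example.com is a placeholder domain.

```python
from urllib import robotparser

# Sample robots.txt: the Allow line precedes the Disallow because
# urllib.robotparser applies the first rule that matches the path.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public-page
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/settings"))     # False
print(rp.can_fetch("*", "https://example.com/admin/public-page"))  # True
print(rp.can_fetch("*", "https://example.com/blog/post"))          # True
```

Paths with no matching rule (like /blog/post) default to allowed, which is why an empty Disallow permits everything.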
Common robots.txt Configurations
Allow All Crawling
User-agent: *
Disallow:
This allows all crawlers to access all pages. Appropriate for most small to medium sites.
Block Specific Directories
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Blocks administrative areas while allowing WordPress AJAX functionality that some themes and plugins need.
Block URL Parameters
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?sessionid=
Prevents crawling of parameterized URLs that create duplicate content issues.
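Patterns like /*?sort= rely on wildcard support (* for any run of characters, $ to anchor the end of the URL) that Google and Bing document but the original robots.txt convention — and Python's stdlib parser — do not implement. A minimal sketch of that documented matching behavior, with illustrative paths:

```python
import re

def robots_pattern_to_regex(pattern: str) -> str:
    """Translate a robots.txt path pattern into a regex.

    Sketch of the matching Google documents for Disallow/Allow values:
    '*' matches any run of characters, a trailing '$' anchors the end
    of the URL, and everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return body + ("$" if anchored else "")

def pattern_matches(pattern: str, path: str) -> bool:
    # Rules match from the start of the path, hence re.match.
    return re.match(robots_pattern_to_regex(pattern), path) is not None

print(pattern_matches("/*?sort=", "/shoes?sort=price"))  # True
print(pattern_matches("/*?sort=", "/shoes"))             # False
print(pattern_matches("/*.pdf$", "/files/report.pdf"))   # True
```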
robots.txt Best Practices for SEO
1. Do Not Block CSS, JavaScript, or Images
- Google needs to render your pages to understand them — blocking CSS/JS prevents proper rendering
- Blocked resources can cause mobile usability and Core Web Vitals issues in Google’s assessment
- This was a common practice years ago but is now explicitly discouraged by Google
2. Block Administrative and System Paths
- Block CMS admin areas: /wp-admin/, /admin/, /backend/
- Block staging environments if accessible at a subdirectory
- Block internal search results pages to prevent thin content indexing
- Block user-specific pages like carts, wishlists, and account areas
3. Use robots.txt for Crawl Budget Optimization
For large sites (10,000+ pages), crawl budget matters — Google allocates a finite number of crawls per site.
- Block faceted navigation URLs that create thousands of low-value pages
- Block parameter-based duplicates (sort orders, filters, tracking parameters)
- Block paginated archive pages beyond a reasonable depth
- This ensures Google spends crawl budget on your most important pages
4. Do Not Use robots.txt for Deindexing
- Disallow in robots.txt prevents crawling, not indexing — Google can still index a blocked URL if other pages link to it
- For deindexing, use the noindex meta tag or X-Robots-Tag HTTP header instead
- If a page is already indexed and you block it in robots.txt, Google cannot see the noindex tag to remove it
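As an illustration of the X-Robots-Tag approach, here is a standard-library sketch that serves every page as crawlable but marks internal search results noindex. The /internal-search path is a made-up example; in practice you would set this header in your framework or web server configuration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoindexHandler(BaseHTTPRequestHandler):
    """Toy handler: all pages are crawlable, but internal search
    results carry X-Robots-Tag: noindex, so a crawler can fetch
    them, see the directive, and keep them out of the index."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        if self.path.startswith("/internal-search"):  # hypothetical path
            self.send_header("X-Robots-Tag", "noindex, follow")
        self.end_headers()
        self.wfile.write(b"<html><body>Example page</body></html>")

    def log_message(self, *args):  # silence per-request logging
        pass

# Quick self-check: serve on an ephemeral port and confirm the
# header appears on search pages only.
server = HTTPServer(("127.0.0.1", 0), NoindexHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

print(urllib.request.urlopen(base + "/internal-search?q=x").headers["X-Robots-Tag"])
print(urllib.request.urlopen(base + "/products").headers["X-Robots-Tag"])
```

Crucially, none of these paths are disallowed in robots.txt — the crawler must be able to fetch the page to see the noindex directive.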
5. Always Include Your Sitemap
- Add a Sitemap directive pointing to your XML sitemap
- This helps crawlers discover your sitemap even before it is submitted in Search Console
- You can include multiple sitemaps if your site uses a sitemap index
6. Test Before Deploying
- Google Search Console: Use the robots.txt report (which replaced the older robots.txt Tester) to validate how Google fetched and parsed your file
- Test specific URLs: Verify that important pages are not accidentally blocked
- Check after changes: Whenever you update robots.txt, test to confirm no critical pages are blocked
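The "test specific URLs" step can be scripted as a pre-deploy check with the standard library. The URL list and the Googlebot user-agent below are illustrative; note that urllib.robotparser does plain prefix matching and ignores * and $ wildcards, so this sketch is only reliable for simple path rules.

```python
from urllib import robotparser

def blocked_urls(robots_lines, urls, user_agent="Googlebot"):
    """Return the subset of `urls` that `robots_lines` blocks for
    `user_agent`. Caveat: urllib.robotparser uses simple prefix
    matching, so wildcard rules are not evaluated correctly."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return [u for u in urls if not rp.can_fetch(user_agent, u)]

robots = [
    "User-agent: *",
    "Disallow: /staging/",
]
must_crawl = [
    "https://example.com/",
    "https://example.com/products/widget",
    "https://example.com/staging/new-home",
]
print(blocked_urls(robots, must_crawl))
# ['https://example.com/staging/new-home']
```

Running a check like this in CI on the list of pages you care about catches an accidental block before it reaches production.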
Common robots.txt Mistakes
Accidentally Blocking the Entire Site
User-agent: *
Disallow: /
This blocks all crawlers from all pages. It is surprisingly common — especially when sites migrate from staging to production and forget to update robots.txt.
Blocking Critical Pages
- Overly broad disallow patterns can accidentally block important content
- Example: Disallow: /products blocks /products/, /products/widget, and any URL starting with /products
- Use specific paths and test thoroughly
Using robots.txt Instead of Noindex
A common misconception is that blocking a URL in robots.txt removes it from Google. It does not. If external sites link to a blocked URL, Google may still index it — just without seeing the content.
Forgetting Trailing Slashes
- Disallow: /admin blocks /admin, /admin/, and /administrator
- Disallow: /admin/ blocks only paths within the /admin/ directory
- Be intentional about trailing slashes to avoid over-blocking
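The over-matching that a missing trailing slash causes can be demonstrated with the standard-library parser (example.com is a placeholder host):

```python
from urllib import robotparser

def allowed(robots_lines, path):
    # Fresh parser per call so each rule set is evaluated independently.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch("*", "https://example.com" + path)

no_slash = ["User-agent: *", "Disallow: /admin"]
with_slash = ["User-agent: *", "Disallow: /admin/"]

print(allowed(no_slash, "/administrator"))    # False: '/admin' prefix over-matches
print(allowed(with_slash, "/administrator"))  # True
print(allowed(with_slash, "/admin/users"))    # False
```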
robots.txt Audit Checklist
- File exists at yourdomain.com/robots.txt
- No accidental Disallow: / blocking the entire site
- CSS, JavaScript, and images are not blocked
- Admin areas and system paths are blocked
- Duplicate-generating parameters are blocked (for large sites)
- Sitemap URL is included
- File is validated with the Google Search Console robots.txt report
- Important pages are verified as crawlable
- No sensitive content relies solely on robots.txt for protection
