Best Practices for robots.txt: A Complete SEO Guide

The robots.txt file tells search engine crawlers which parts of your site they may and may not request. A properly configured robots.txt conserves crawl budget, keeps crawlers out of low-value or administrative areas, and ensures search engines spend their time crawling your most important content. (Note that it controls crawling, not indexing — more on that below.)

What Is robots.txt?

robots.txt is a plain text file placed at the root of your domain (example.com/robots.txt) that provides directives to web crawlers about which URLs they are allowed to crawl.

  • Location: Must be at the exact path /robots.txt — no subdirectories
  • Format: Plain text with specific syntax for user-agent and allow/disallow directives
  • Advisory, not enforced: robots.txt is a suggestion — well-behaved crawlers follow it, but malicious bots may ignore it
  • Not for security: Do not use robots.txt to hide sensitive content — it is publicly readable

Basic robots.txt Syntax

User-Agent

Specifies which crawler the rules apply to.

  • User-agent: * — applies to all crawlers
  • User-agent: Googlebot — applies only to Google’s crawler
  • User-agent: Bingbot — applies only to Bing’s crawler
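Note that a crawler obeys only the most specific group that matches it — groups do not stack. In this illustrative file, Googlebot follows only its own group and ignores the rules in the * group:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/

Here Googlebot may crawl /private/ (its group does not block it), while all other crawlers may crawl /drafts/.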

Disallow

Tells crawlers not to access specific paths.

  • Disallow: /admin/ — blocks the entire /admin/ directory
  • Disallow: /private-page — blocks a specific URL path
  • Disallow: / — blocks the entire site
  • Disallow: (empty) — allows everything

Allow

Overrides a disallow for specific paths within a blocked directory.

  • Allow: /admin/public-page — allows access to one page within a blocked /admin/ directory

Sitemap

Points crawlers to your XML sitemap location.

  • Sitemap: https://example.com/sitemap.xml
  • Can include multiple sitemap directives
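Multiple Sitemap lines are common when a site splits its sitemaps by content type or uses a sitemap index (filenames below are illustrative):

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml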

Common robots.txt Configurations

Allow All Crawling

User-agent: *
Disallow:

This allows all crawlers to access all pages. Appropriate for most small to medium sites.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml

Blocks administrative areas while allowing WordPress AJAX functionality that some themes and plugins need.

Block URL Parameters

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?sessionid=

Prevents crawling of parameterized URLs that create duplicate content issues.
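The * wildcard used above is a Google extension to the original robots.txt standard, and matches any sequence of characters (a trailing $ anchors the end of the URL). A minimal sketch of that matching rule via regex translation — rule_matches is a hypothetical helper for illustration, not part of any library:

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Google-style robots.txt pattern match: '*' matches any character
    sequence; a trailing '$' anchors the end of the URL. Simplified sketch."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"  # restore the end-of-URL anchor
    return re.match(pattern, path) is not None

print(rule_matches("/*?sort=", "/shoes?sort=price"))  # True
print(rule_matches("/*?sort=", "/shoes?color=red"))   # False
```

Real crawlers apply additional precedence rules (longest match wins), but this captures why /*?sort= blocks any path carrying that parameter.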

robots.txt Best Practices for SEO

1. Do Not Block CSS, JavaScript, or Images

  • Google needs to render your pages to understand them — blocking CSS/JS prevents proper rendering
  • Blocked resources can cause mobile usability and Core Web Vitals issues in Google’s assessment
  • This was a common practice years ago but is now explicitly discouraged by Google

2. Block Administrative and System Paths

  • Block CMS admin areas: /wp-admin/, /admin/, /backend/
  • Block staging environments if accessible at a subdirectory
  • Block internal search results pages to prevent thin content indexing
  • Block user-specific pages like carts, wishlists, and account areas
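Taken together, these rules might form a group like the following — the exact paths are placeholders and should match your own CMS and URL structure:

User-agent: *
Disallow: /admin/
Disallow: /backend/
Disallow: /search/
Disallow: /cart/
Disallow: /account/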

3. Use robots.txt for Crawl Budget Optimization

For large sites (10,000+ pages), crawl budget matters — Google limits how many URLs it will crawl on a site within a given period, so crawls wasted on low-value URLs delay discovery of important ones.

  • Block faceted navigation URLs that create thousands of low-value pages
  • Block parameter-based duplicates (sort orders, filters, tracking parameters)
  • Block paginated archive pages beyond a reasonable depth
  • This ensures Google spends crawl budget on your most important pages

4. Do Not Use robots.txt for Deindexing

  • Disallow in robots.txt prevents crawling, not indexing — Google can still index a blocked URL if other pages link to it
  • For deindexing, use the noindex meta tag or X-Robots-Tag HTTP header instead
  • If a page is already indexed and you block it in robots.txt, Google cannot see the noindex tag to remove it
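To remove a page from the index, leave it crawlable and serve a noindex signal instead — either in the page's HTML:

<meta name="robots" content="noindex">

or, for non-HTML resources such as PDFs, as an HTTP response header:

X-Robots-Tag: noindex

Once the page has dropped out of the index, you can optionally add a robots.txt block to stop further crawling.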

5. Always Include Your Sitemap

  • Add a Sitemap directive pointing to your XML sitemap
  • This helps crawlers discover your sitemap even before it is submitted in Search Console
  • You can include multiple sitemaps if your site uses a sitemap index

6. Test Before Deploying

  • Google Search Console: Use the robots.txt report (which replaced the older robots.txt Tester) to confirm Google has fetched and parsed your file without errors
  • Test specific URLs: Verify that important pages are not accidentally blocked — the URL Inspection tool reports whether a URL is blocked by robots.txt
  • Check after changes: Whenever you update robots.txt, retest to confirm no critical pages are blocked
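You can also sanity-check a draft file locally with Python's standard-library urllib.robotparser before it goes live. Note that it implements the original first-match specification without wildcard support, not Google's longest-match behavior, so keep rules simple when testing this way (the rules below are a hypothetical draft):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical draft robots.txt — parse it locally instead of fetching.
draft = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
""".splitlines()

parser = RobotFileParser()
parser.parse(draft)

print(parser.can_fetch("*", "https://example.com/"))         # True
print(parser.can_fetch("*", "https://example.com/admin/x"))  # False
```

Running this against every template URL on your site makes a quick pre-deploy regression check.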

Common robots.txt Mistakes

Accidentally Blocking the Entire Site

User-agent: *
Disallow: /

This blocks all crawlers from all pages. It is surprisingly common — especially when sites migrate from staging to production and forget to update robots.txt.

Blocking Critical Pages

  • Overly broad disallow patterns can accidentally block important content
  • Example: Disallow: /products blocks /products/, /products/widget, and any URL starting with /products
  • Use specific paths and test thoroughly

Using robots.txt Instead of Noindex

A common misconception is that blocking a URL in robots.txt removes it from Google. It does not. If external sites link to a blocked URL, Google may still index it — just without seeing the content.

Forgetting Trailing Slashes

  • Disallow: /admin blocks /admin, /admin/, and /administrator
  • Disallow: /admin/ blocks only paths within the /admin/ directory
  • Be intentional about trailing slashes to avoid over-blocking
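The difference is easy to demonstrate with urllib.robotparser, which uses the same prefix matching:

```python
from urllib.robotparser import RobotFileParser

# No trailing slash: the rule matches any path beginning with "/admin".
broad = RobotFileParser()
broad.parse(["User-agent: *", "Disallow: /admin"])
print(broad.can_fetch("*", "https://example.com/administrator"))  # False

# Trailing slash: only paths inside the /admin/ directory are blocked.
scoped = RobotFileParser()
scoped.parse(["User-agent: *", "Disallow: /admin/"])
print(scoped.can_fetch("*", "https://example.com/administrator"))  # True
```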

robots.txt Audit Checklist

  • File exists at yourdomain.com/robots.txt
  • No accidental Disallow: / blocking the entire site
  • CSS, JavaScript, and images are not blocked
  • Admin areas and system paths are blocked
  • Duplicate-generating parameters are blocked (for large sites)
  • Sitemap URL is included
  • File is verified in the Google Search Console robots.txt report
  • Important pages are verified as crawlable
  • No sensitive content relies solely on robots.txt for protection

Try Autorank

Generate SEO-optimized blog content and publish to WordPress automatically.