The robots.txt file tells search engine crawlers which parts of your site they can and cannot access. A properly configured robots.txt optimizes crawl budget, prevents indexing of sensitive pages, and ensures search engines spend their time crawling your most important content.
What Is robots.txt?
robots.txt is a plain text file placed at the root of your domain (example.com/robots.txt) that provides directives to web crawlers about which URLs they are allowed to crawl.
- Location: Must be at the exact path /robots.txt — no subdirectories
- Format: Plain text with specific syntax for user-agent and allow/disallow directives
- Advisory, not enforced: robots.txt is a suggestion — well-behaved crawlers follow it, but malicious bots may ignore it
- Not for security: Do not use robots.txt to hide sensitive content — it is publicly readable
Basic robots.txt Syntax
User-Agent
Specifies which crawler the rules apply to.
- User-agent: * — applies to all crawlers
- User-agent: Googlebot — applies only to Google’s crawler
- User-agent: Bingbot — applies only to Bing’s crawler
Disallow
Tells crawlers not to access specific paths.
- Disallow: /admin/ — blocks the entire /admin/ directory
- Disallow: /private-page — blocks a specific URL path
- Disallow: / — blocks the entire site
- Disallow: (empty) — allows everything
Allow
Overrides a disallow for specific paths within a blocked directory.
- Allow: /admin/public-page — allows access to one page within a blocked /admin/ directory
Sitemap
Points crawlers to your XML sitemap location.
- Sitemap: https://example.com/sitemap.xml
- Can include multiple Sitemap directives
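The directives above can be exercised with Python's standard-library parser. A caveat for this sketch: urllib.robotparser applies the first matching rule in file order, whereas Google uses the most specific (longest) match, so the Allow line is placed before the Disallow here; example.com is a placeholder domain.

```python
from urllib import robotparser

# Sample robots.txt: the Allow line precedes the Disallow because
# urllib.robotparser applies the first rule that matches the path.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public-page
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/settings"))     # False
print(rp.can_fetch("*", "https://example.com/admin/public-page"))  # True
print(rp.can_fetch("*", "https://example.com/blog/post"))          # True
```

Paths with no matching rule (like /blog/post) default to allowed, which is why an empty Disallow permits everything.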
Common robots.txt Configurations
Allow All Crawling
User-agent: *
Disallow:
This allows all crawlers to access all pages. Appropriate for most small to medium sites.
Block Specific Directories
User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Blocks administrative areas while allowing WordPress AJAX functionality that some themes and plugins need.
Block URL Parameters
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?sessionid=
Prevents crawling of parameterized URLs that create duplicate content issues.
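Patterns like /*?sort= rely on wildcard support (* for any run of characters, $ to anchor the end of the URL) that Google and Bing document but the original robots.txt convention — and Python's stdlib parser — do not implement. A minimal sketch of that documented matching behavior, with illustrative paths:

```python
import re

def robots_pattern_to_regex(pattern: str) -> str:
    """Translate a robots.txt path pattern into a regex.

    Sketch of the matching Google documents for Disallow/Allow values:
    '*' matches any run of characters, a trailing '$' anchors the end
    of the URL, and everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return body + ("$" if anchored else "")

def pattern_matches(pattern: str, path: str) -> bool:
    # Rules match from the start of the path, hence re.match.
    return re.match(robots_pattern_to_regex(pattern), path) is not None

print(pattern_matches("/*?sort=", "/shoes?sort=price"))  # True
print(pattern_matches("/*?sort=", "/shoes"))             # False
print(pattern_matches("/*.pdf$", "/files/report.pdf"))   # True
```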
robots.txt Best Practices for SEO
1. Do Not Block CSS, JavaScript, or Images
- Google needs to render your pages to understand them — blocking CSS/JS prevents proper rendering
- Blocked resources can cause mobile usability and Core Web Vitals issues in Google’s assessment
- This was a common practice years ago but is now explicitly discouraged by Google
2. Block Administrative and System Paths
- Block CMS admin areas: /wp-admin/, /admin/, /backend/
- Block staging environments if accessible at a subdirectory
- Block internal search results pages to prevent thin content indexing
- Block user-specific pages like carts, wishlists, and account areas
3. Use robots.txt for Crawl Budget Optimization
For large sites (10,000+ pages), crawl budget matters — Google allocates a finite number of crawls per site.
- Block faceted navigation URLs that create thousands of low-value pages
- Block parameter-based duplicates (sort orders, filters, tracking parameters)
- Block paginated archive pages beyond a reasonable depth
- This ensures Google spends crawl budget on your most important pages
4. Do Not Use robots.txt for Deindexing
- Disallow in robots.txt prevents crawling, not indexing — Google can still index a blocked URL if other pages link to it
- For deindexing, use the noindex meta tag or X-Robots-Tag HTTP header instead
- If a page is already indexed and you block it in robots.txt, Google cannot see the noindex tag to remove it
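As an illustration of the X-Robots-Tag approach, here is a standard-library sketch that serves every page as crawlable but marks internal search results noindex. The /internal-search path is a made-up example; in practice you would set this header in your framework or web server configuration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoindexHandler(BaseHTTPRequestHandler):
    """Toy handler: all pages are crawlable, but internal search
    results carry X-Robots-Tag: noindex, so a crawler can fetch
    them, see the directive, and keep them out of the index."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        if self.path.startswith("/internal-search"):  # hypothetical path
            self.send_header("X-Robots-Tag", "noindex, follow")
        self.end_headers()
        self.wfile.write(b"<html><body>Example page</body></html>")

    def log_message(self, *args):  # silence per-request logging
        pass

# Quick self-check: serve on an ephemeral port and confirm the
# header appears on search pages only.
server = HTTPServer(("127.0.0.1", 0), NoindexHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

print(urllib.request.urlopen(base + "/internal-search?q=x").headers["X-Robots-Tag"])
print(urllib.request.urlopen(base + "/products").headers["X-Robots-Tag"])
```

Crucially, none of these paths are disallowed in robots.txt — the crawler must be able to fetch the page to see the noindex directive.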
5. Always Include Your Sitemap
- Add a Sitemap directive pointing to your XML sitemap
- This helps crawlers discover your sitemap even before it is submitted in Search Console
- You can include multiple sitemaps if your site uses a sitemap index
6. Test Before Deploying
- Google Search Console: Use the robots.txt report (which replaced the older robots.txt Tester) to validate how Google fetched and parsed your file
- Test specific URLs: Verify that important pages are not accidentally blocked
- Check after changes: Whenever you update robots.txt, test to confirm no critical pages are blocked
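The "test specific URLs" step can be scripted as a pre-deploy check with the standard library. The URL list and the Googlebot user-agent below are illustrative; note that urllib.robotparser does plain prefix matching and ignores * and $ wildcards, so this sketch is only reliable for simple path rules.

```python
from urllib import robotparser

def blocked_urls(robots_lines, urls, user_agent="Googlebot"):
    """Return the subset of `urls` that `robots_lines` blocks for
    `user_agent`. Caveat: urllib.robotparser uses simple prefix
    matching, so wildcard rules are not evaluated correctly."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return [u for u in urls if not rp.can_fetch(user_agent, u)]

robots = [
    "User-agent: *",
    "Disallow: /staging/",
]
must_crawl = [
    "https://example.com/",
    "https://example.com/products/widget",
    "https://example.com/staging/new-home",
]
print(blocked_urls(robots, must_crawl))
# ['https://example.com/staging/new-home']
```

Running a check like this in CI on the list of pages you care about catches an accidental block before it reaches production.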
Common robots.txt Mistakes
Accidentally Blocking the Entire Site
User-agent: *
Disallow: /
This blocks all crawlers from all pages. It is surprisingly common — especially when sites migrate from staging to production and forget to update robots.txt.
Blocking Critical Pages
- Overly broad disallow patterns can accidentally block important content
- Example: Disallow: /products blocks /products/, /products/widget, and any URL starting with /products
- Use specific paths and test thoroughly
Using robots.txt Instead of Noindex
A common misconception is that blocking a URL in robots.txt removes it from Google. It does not. If external sites link to a blocked URL, Google may still index it — just without seeing the content.
Forgetting Trailing Slashes
- Disallow: /admin blocks /admin, /admin/, and /administrator
- Disallow: /admin/ blocks only paths within the /admin/ directory
- Be intentional about trailing slashes to avoid over-blocking
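The over-matching that a missing trailing slash causes can be demonstrated with the standard-library parser (example.com is a placeholder host):

```python
from urllib import robotparser

def allowed(robots_lines, path):
    # Fresh parser per call so each rule set is evaluated independently.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch("*", "https://example.com" + path)

no_slash = ["User-agent: *", "Disallow: /admin"]
with_slash = ["User-agent: *", "Disallow: /admin/"]

print(allowed(no_slash, "/administrator"))    # False: '/admin' prefix over-matches
print(allowed(with_slash, "/administrator"))  # True
print(allowed(with_slash, "/admin/users"))    # False
```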
robots.txt Audit Checklist
- File exists at yourdomain.com/robots.txt
- No accidental Disallow: / blocking the entire site
- CSS, JavaScript, and images are not blocked
- Admin areas and system paths are blocked
- Duplicate-generating parameters are blocked (for large sites)
- Sitemap URL is included
- File is validated with the Google Search Console robots.txt report
- Important pages are verified as crawlable
- No sensitive content relies solely on robots.txt for protection
