Robots.txt Reference Guide
Complete reference for creating, testing, and troubleshooting robots.txt files.
Syntax Guide
Basic Structure
User-agent: [bot name]
Disallow: [path to block]
Allow: [path to allow]
Sitemap: [sitemap URL]
Crawl-delay: [seconds]
Core Directives
User-agent
Specifies which bot the rules apply to.
Syntax: User-agent: [bot-name]
Common user-agents:
User-agent: * # All bots
User-agent: Googlebot # Google's crawler
User-agent: Bingbot # Bing's crawler
User-agent: GPTBot # OpenAI's crawler
User-agent: CCBot # Common Crawl bot
User-agent: anthropic-ai # Anthropic's crawler
User-agent: PerplexityBot # Perplexity AI crawler
User-agent: ClaudeBot # Claude's web crawler
Multiple user-agents: To apply one set of rules to several bots, list the User-agent lines consecutively, with no blank lines between them, before the rules:
User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin/
Disallow
Blocks bots from crawling specified paths.
Syntax: Disallow: [path]
Examples:
Disallow: / # Block entire site
Disallow: /admin/ # Block admin directory
Disallow: /private # Block all paths starting with /private
Disallow: /*.pdf$ # Block all PDF files
Disallow: /*? # Block all URLs with parameters
Disallow: # Allow everything (empty disallow)
Path matching:
- / at end = block the directory and all its subdirectories
- Without / at end = block all paths starting with that string
- * = wildcard, matches any sequence of characters
- $ = matches the end of the URL
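For example, under the following rules (the paths are illustrative), the comments note what each pattern does and does not match:
User-agent: *
Disallow: /private/ # blocks /private/reports/ but not /private-files/
Disallow: /temp # blocks /temp, /temp/, and /template/ (prefix match)
Disallow: /*.pdf$ # blocks /docs/guide.pdf but not /docs/guide.pdf?v=2
Disallow: /*? # blocks any URL containing a query string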
Allow
Explicitly allows crawling (overrides Disallow).
Syntax: Allow: [path]
Common use: Allow specific subdirectories within blocked parent.
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Note: Allow is not part of the original robots.txt standard, but it is supported by Google, Bing, and most major crawlers. When Allow and Disallow rules conflict, Google follows the most specific (longest) matching rule.
Sitemap
Specifies location of XML sitemap.
Syntax: Sitemap: [absolute URL]
Examples:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/blog/sitemap.xml
Best practices:
- Use absolute URLs (not relative)
- Can include multiple Sitemap directives
- Place at end of file
- Submit same sitemap(s) to Google Search Console
Crawl-delay
Asks crawlers to wait the specified number of seconds between successive requests.
Syntax: Crawl-delay: [seconds]
Example:
User-agent: *
Crawl-delay: 10
Warning: Not supported by Googlebot (use Search Console rate limiting instead). Supported by Bing, Yandex, and others.
Common Configurations
1. Allow All Bots (Default)
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Use when you want all bots to crawl entire site.
2. Block All Bots
User-agent: *
Disallow: /
Use for development/staging sites or private content.
3. Block Specific Directories
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /cgi-bin/
Sitemap: https://example.com/sitemap.xml
Standard configuration blocking admin and utility directories.
4. Block All AI Crawlers
# Block OpenAI
User-agent: GPTBot
Disallow: /
# Block Anthropic
User-agent: anthropic-ai
User-agent: ClaudeBot
Disallow: /
# Block Common Crawl
User-agent: CCBot
Disallow: /
# Block Perplexity
User-agent: PerplexityBot
Disallow: /
# Block Google-Extended (Bard training)
User-agent: Google-Extended
Disallow: /
# Allow search engines
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
Sitemap: https://example.com/sitemap.xml
Use when you want search indexing but not AI training.
5. Allow Search Engines, Block Everything Else
# Block all by default
User-agent: *
Disallow: /
# Allow Google
User-agent: Googlebot
Disallow:
# Allow Bing
User-agent: Bingbot
Disallow:
# Allow DuckDuckGo
User-agent: DuckDuckBot
Disallow:
Sitemap: https://example.com/sitemap.xml
6. Block URL Parameters
User-agent: *
Disallow: /*? # Block all URLs with parameters
Allow: /? # Allow homepage with parameters
Sitemap: https://example.com/sitemap.xml
Prevents duplicate content from parameter variations.
7. Block File Types
User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.zip$
Sitemap: https://example.com/sitemap.xml
8. E-commerce Configuration
User-agent: *
# Block search/filter pages
Disallow: /*?q=
Disallow: /*?sort=
Disallow: /*?filter=
# Block account pages
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/
# Block admin
Disallow: /admin/
# Allow product pages
Allow: /products/
Sitemap: https://example.com/sitemap.xml
9. WordPress Configuration
User-agent: *
# WordPress core
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# WordPress directories
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
# Allow uploads
Allow: /wp-content/uploads/
# Block parameter pages
Disallow: /?s=
Disallow: /feed/
Disallow: /trackback/
Sitemap: https://example.com/sitemap_index.xml
10. Shopify Configuration
User-agent: *
# Block admin and account
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
# Block search
Disallow: /search
# Block collections with filters
Disallow: /collections/*+*
Disallow: /collections/*?*
Sitemap: https://example.com/sitemap.xml
Platform-Specific Templates
Wix
User-agent: *
Disallow: /_api/
Disallow: /_partials/
Sitemap: https://example.com/sitemap.xml
Squarespace
User-agent: *
Disallow: /config/
Disallow: /search
Sitemap: https://example.com/sitemap.xml
Webflow
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Drupal
User-agent: *
Disallow: /admin/
Disallow: /user/
Disallow: /node/add/
Disallow: /?q=
Sitemap: https://example.com/sitemap.xml
Testing and Validation
Google Search Console Robots.txt Tester
- Go to: Search Console → Settings → robots.txt
- View current robots.txt
- Test specific URLs
- See which user-agents are affected
Manual Testing
Test URL pattern: https://example.com/robots.txt
Check file is:
- Accessible (returns 200 status)
- Plain text format
- UTF-8 encoded
- Located at root domain
- No more than 500KB (Google limit)
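A quick scripted check can confirm most of these points. Below is a minimal sketch using Python and the requests library; the URL is a placeholder for your own domain, and the 500KB figure is Google's documented limit.
import requests
ROBOTS_URL = "https://example.com/robots.txt"  # replace with your domain
resp = requests.get(ROBOTS_URL, timeout=10)
print("Status:", resp.status_code)                         # should be 200
print("Content-Type:", resp.headers.get("Content-Type"))   # should be text/plain
print(f"Size: {len(resp.content) / 1024:.1f} KB")          # Google ignores content beyond 500KB
try:
    resp.content.decode("utf-8")                           # file should decode cleanly as UTF-8
    print("UTF-8: OK")
except UnicodeDecodeError:
    print("UTF-8: FAILED")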
Common Testing Scenarios
Test these URLs in tester:
- Homepage: /
- Product page: /products/example
- Admin page: /admin/
- Parameter page: /search?q=test
- File: /document.pdf
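These checks can also be scripted. The sketch below uses Python's standard-library urllib.robotparser with placeholder URLs; note that this parser follows the original robots.txt rules and does not understand Google's * and $ wildcard extensions, so verify wildcard rules in Search Console as well.
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()
test_urls = [
    "https://example.com/",                  # homepage
    "https://example.com/products/example",  # product page
    "https://example.com/admin/",            # admin page
    "https://example.com/search?q=test",     # parameter page
    "https://example.com/document.pdf",      # file
]
for url in test_urls:
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)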
Common Mistakes and Fixes
Mistake 1: Blocking CSS/JS Files
Wrong:
User-agent: *
Disallow: /css/
Disallow: /js/
Why it's wrong: Google needs CSS/JS to render pages properly.
Fix (remove the Disallow rules, or explicitly allow the paths):
User-agent: *
Allow: /css/
Allow: /js/
Mistake 2: Using Relative URLs for Sitemap
Wrong:
Sitemap: /sitemap.xml
Fix:
Sitemap: https://example.com/sitemap.xml
Mistake 3: Spaces in Directives
Wrong:
User-agent : Googlebot
Disallow : /admin/
Fix (no spaces before colons):
User-agent: Googlebot
Disallow: /admin/
Mistake 4: Forgetting Trailing Slash
Intention: Block /admin directory
Wrong:
Disallow: /admin
Result: Also blocks /admin-panel, /administrator, etc.
Fix:
Disallow: /admin/
Mistake 5: Blocking Entire Site Accidentally
Wrong:
User-agent: *
Disallow: /
Allow: /blog/
Why it's wrong: Many bots don't support the Allow directive and will treat this as blocking the entire site.
Fix: Use noindex meta tags for pages you don't want indexed, not robots.txt.
Mistake 6: Not Blocking Development Environments
Wrong: No robots.txt on staging.example.com
Result: Staging site gets indexed.
Fix:
User-agent: *
Disallow: /
On all non-production environments.
Mistake 7: Case Sensitivity Errors
Note: Directives are case-insensitive, but paths are case-sensitive.
Example:
Disallow: /Admin/ # Blocks /Admin/ but not /admin/
Fix: Block both if needed:
Disallow: /admin/
Disallow: /Admin/
Advanced Patterns
Wildcard Examples
# Block all PDFs
Disallow: /*.pdf$
# Block all URLs with parameters
Disallow: /*?
# Block all URLs ending in .php
Disallow: /*.php$
# Block admin directories nested under other paths
Disallow: /*/admin/
Multiple Sitemaps
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Bot-Specific Rules
# Aggressive bot - slow it down
User-agent: BadBot
Crawl-delay: 60
Disallow: /
# Good bots - full access
User-agent: Googlebot
User-agent: Bingbot
Disallow:
# Default for others
User-agent: *
Crawl-delay: 10
Disallow: /admin/
Robots.txt vs Meta Robots vs X-Robots-Tag
When to use each:
Robots.txt:
- Block crawling of entire directories
- Reduce crawl budget waste
- Block parameter variations
- Does NOT prevent indexing if page is linked from elsewhere
Meta robots tag:
- Prevent specific pages from being indexed
- Control snippet display
- Control following links
- Example:
<meta name="robots" content="noindex,follow">
X-Robots-Tag HTTP header:
- Control non-HTML files (PDFs, images)
- Server-level control
- Example:
X-Robots-Tag: noindex
Important: If you don't want a page indexed, use noindex (meta tag or header), NOT robots.txt.
Monitoring and Maintenance
Regular Checks
Monthly:
- Verify robots.txt is accessible (one way to automate this is sketched below)
- Check Search Console for blocked URLs
- Review crawl stats for blocked resources
Quarterly:
- Audit blocked paths - still relevant?
- Check for new admin/private sections to block
- Review AI crawler landscape (new bots?)
After site changes:
- Update robots.txt if URL structure changed
- Test new sections (should they be blocked?)
- Verify sitemaps still referenced
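One way to automate the monthly accessibility check and catch unexpected edits is to compare the live file against a stored baseline. A minimal sketch, assuming Python with the requests library and a local known-good copy saved as robots-baseline.txt (a hypothetical filename):
import requests
ROBOTS_URL = "https://example.com/robots.txt"  # placeholder domain
BASELINE_FILE = "robots-baseline.txt"          # hypothetical local known-good copy
resp = requests.get(ROBOTS_URL, timeout=10)
if resp.status_code != 200:
    print(f"ALERT: robots.txt returned HTTP {resp.status_code}")
else:
    with open(BASELINE_FILE, encoding="utf-8") as f:
        baseline = f.read().strip()
    if resp.text.strip() != baseline:
        print("ALERT: robots.txt differs from the baseline - review the change")
    else:
        print("OK: robots.txt is accessible and matches the baseline")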
Search Console Monitoring
Check these reports:
- Coverage → Excluded by robots.txt
- Settings → Crawl stats
- URL Inspection → Test specific URLs
Robots.txt Checklist
Before deploying:
- File is named exactly robots.txt (lowercase)
- Located at root domain (example.com/robots.txt)
- Plain text format (not HTML or PDF)
- UTF-8 encoding
- No HTML tags in file
- All paths start with /
- Sitemap URLs are absolute
- No spaces before colons
- Tested in Search Console robots.txt tester
- Not blocking important CSS/JS/images
- Not blocking content you want indexed
- Trailing slashes used correctly for directories
- Wildcard patterns tested
- File size under 500KB
Emergency Fixes
Accidentally Blocked Entire Site
Symptom: All pages blocked in Search Console
Fix:
- Edit robots.txt to:
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
- Test in Search Console
- Request urgent recrawl for key pages
- Monitor Coverage report for recovery
Recovery time: 1-7 days
Blocked CSS/JS Files
Symptom: "Blocked by robots.txt" in Mobile-Friendly Test
Fix:
- Add Allow directives:
User-agent: *
Allow: /css/
Allow: /js/
Allow: /wp-content/uploads/
- Test in robots.txt tester
- Request re-render in URL Inspection tool
Staging Site Indexed
Symptom: staging.example.com appears in search results
Fix:
- Add to staging robots.txt:
User-agent: *
Disallow: /
- Add noindex meta tag to all staging pages
- Remove staging URLs in Search Console (Removals tool)
Resources and Tools
Testing:
- Google Search Console robots.txt tester
- Bing Webmaster Tools robots.txt analyzer
- Technical SEO browser extensions
Validation:
- https://www.google.com/webmasters/tools/robots-testing-tool
- https://en.ryte.com/free-tools/robots-txt/
- https://technicalseo.com/tools/robots-txt/
Documentation: