Robots.txt Reference Guide

Complete reference for creating, testing, and troubleshooting robots.txt files.

Syntax Guide

Basic Structure

User-agent: [bot name]
Disallow: [path to block]
Allow: [path to allow]
Sitemap: [sitemap URL]
Crawl-delay: [seconds]

Core Directives

User-agent

Specifies which bot the rules apply to.

Syntax: User-agent: [bot-name]

Common user-agents:

User-agent: *                    # All bots
User-agent: Googlebot            # Google's crawler
User-agent: Bingbot              # Bing's crawler
User-agent: GPTBot               # OpenAI's crawler
User-agent: CCBot                # Common Crawl bot
User-agent: anthropic-ai         # Anthropic's crawler
User-agent: PerplexityBot        # Perplexity AI crawler
User-agent: ClaudeBot            # Claude's web crawler

Multiple user-agents: to apply one group of rules to several bots, list their User-agent lines consecutively, with no blank lines between them.

User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin/

Disallow

Blocks bots from crawling specified paths.

Syntax: Disallow: [path]

Examples:

Disallow: /                      # Block entire site
Disallow: /admin/                # Block admin directory
Disallow: /private               # Block any path starting with /private
Disallow: /*.pdf$                # Block all PDF files
Disallow: /*?                    # Block all URLs with parameters
Disallow:                        # Allow everything (empty disallow)

Path matching (illustrated in the sketch after this list):

  • / at end = block directory and all subdirectories
  • Without / at end = block all paths starting with string
  • * = wildcard, matches any sequence
  • $ = end of URL
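
The sketch below mirrors these wildcard semantics in plain Python, purely as an illustration of how patterns match (it is not any crawler's actual parser); the example pattern and URLs are placeholders.

import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex:
    '*' matches any character sequence, a trailing '$' anchors the end
    of the URL, and everything else is a literal prefix match."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")  # escape literals, restore wildcards
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))       # True  -> matched, so blocked
print(bool(rule.match("/files/report.pdf?dl=1")))  # False -> not matched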

Allow

Explicitly allows crawling of a path that a broader Disallow rule would otherwise block (for Google and Bing, the more specific rule wins).

Syntax: Allow: [path]

Common use: Allow specific subdirectories within blocked parent.

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Note: Allow was not part of the original robots.txt specification, but it is supported by Google, Bing, and most major crawlers.


Sitemap

Specifies location of XML sitemap.

Syntax: Sitemap: [absolute URL]

Examples:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/blog/sitemap.xml

Best practices:

  • Use absolute URLs (not relative)
  • Can include multiple Sitemap directives
  • Place at end of file
  • Submit same sitemap(s) to Google Search Console

Crawl-delay

Adds delay between requests (seconds).

Syntax: Crawl-delay: [seconds]

Example:

User-agent: *
Crawl-delay: 10

Warning: Not supported by Googlebot (use Search Console rate limiting instead). Supported by Bing, Yandex, and others.


Common Configurations

1. Allow All Bots (Default)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Use when you want all bots to crawl the entire site.


2. Block All Bots

User-agent: *
Disallow: /

Use for development/staging sites or private content.


3. Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /cgi-bin/

Sitemap: https://example.com/sitemap.xml

Standard configuration blocking admin and utility directories.


4. Block All AI Crawlers

# Block OpenAI
User-agent: GPTBot
Disallow: /

# Block Anthropic
User-agent: anthropic-ai
User-agent: ClaudeBot
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

# Block Google-Extended (Gemini / AI training)
User-agent: Google-Extended
Disallow: /

# Allow search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

Sitemap: https://example.com/sitemap.xml

Use when you want search indexing but not AI training.


5. Allow Search Engines, Block Everything Else

# Block all by default
User-agent: *
Disallow: /

# Allow Google
User-agent: Googlebot
Disallow:

# Allow Bing
User-agent: Bingbot
Disallow:

# Allow DuckDuckGo
User-agent: DuckDuckBot
Disallow:

Sitemap: https://example.com/sitemap.xml

6. Block URL Parameters

User-agent: *
Disallow: /*?                    # Block all URLs with parameters
Allow: /?                        # Allow homepage with parameters

Sitemap: https://example.com/sitemap.xml

Prevents duplicate content from parameter variations.


7. Block File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.zip$

Sitemap: https://example.com/sitemap.xml

8. E-commerce Configuration

User-agent: *
# Block search/filter pages
Disallow: /*?q=
Disallow: /*?sort=
Disallow: /*?filter=

# Block account pages
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/

# Block admin
Disallow: /admin/

# Allow product pages
Allow: /products/

Sitemap: https://example.com/sitemap.xml

9. WordPress Configuration

User-agent: *
# WordPress core
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# WordPress directories
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/

# Allow uploads
Allow: /wp-content/uploads/

# Block parameter pages
Disallow: /?s=
Disallow: /feed/
Disallow: /trackback/

Sitemap: https://example.com/sitemap_index.xml

10. Shopify Configuration

User-agent: *
# Block admin and account
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout

# Block search
Disallow: /search

# Block collections with filters
Disallow: /collections/*+*
Disallow: /collections/*?*

Sitemap: https://example.com/sitemap.xml

Platform-Specific Templates

Wix

User-agent: *
Disallow: /_api/
Disallow: /_partials/

Sitemap: https://example.com/sitemap.xml

Squarespace

User-agent: *
Disallow: /config/
Disallow: /search

Sitemap: https://example.com/sitemap.xml

Webflow

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Drupal

User-agent: *
Disallow: /admin/
Disallow: /user/
Disallow: /node/add/
Disallow: /?q=

Sitemap: https://example.com/sitemap.xml

Testing and Validation

Google Search Console Robots.txt Tester

  1. Go to: Search Console → Settings → robots.txt
  2. View current robots.txt
  3. Test specific URLs
  4. See which user-agents are affected

Manual Testing

Fetch the file directly: https://example.com/robots.txt

Check that the file is (a scripted version of these checks follows this list):

  • Accessible (returns 200 status)
  • Plain text format
  • UTF-8 encoded
  • Located at root domain
  • No more than 500KB (Google limit)
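
A minimal sketch of most of these checks, using the third-party requests library; the URL is a placeholder for your own domain.

import requests

resp = requests.get("https://example.com/robots.txt", timeout=10)

assert resp.status_code == 200, f"Unexpected status: {resp.status_code}"
assert "text/plain" in resp.headers.get("Content-Type", ""), "Not served as plain text"
assert len(resp.content) <= 500 * 1024, "Exceeds Google's 500KB limit"
resp.content.decode("utf-8")  # raises UnicodeDecodeError if the file is not valid UTF-8
print("robots.txt passed the basic checks")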

Common Testing Scenarios

Test these URLs in the tester (or script the checks, as sketched after this list):

  • Homepage: /
  • Product page: /products/example
  • Admin page: /admin/
  • Parameter page: /search?q=test
  • File: /document.pdf
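
These scenarios can also be checked locally with Python's standard urllib.robotparser, which answers allow/block questions per user-agent. Note that it implements the original specification and does not understand the * and $ wildcard extensions, so wildcard rules still need the Search Console tester. The domain and user-agent names below are placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

paths = ["/", "/products/example", "/admin/", "/search?q=test", "/document.pdf"]
for agent in ("Googlebot", "GPTBot"):
    for path in paths:
        verdict = "allowed" if rp.can_fetch(agent, "https://example.com" + path) else "blocked"
        print(f"{agent:<10} {path:<25} {verdict}")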

Common Mistakes and Fixes

Mistake 1: Blocking CSS/JS Files

Wrong:

User-agent: *
Disallow: /css/
Disallow: /js/

Why it's wrong: Google needs CSS/JS to render pages properly.

Fix:

User-agent: *
Allow: /css/
Allow: /js/

Mistake 2: Using Relative URLs for Sitemap

Wrong:

Sitemap: /sitemap.xml

Fix:

Sitemap: https://example.com/sitemap.xml

Mistake 3: Spaces in Directives

Wrong:

User-agent : Googlebot
Disallow : /admin/

Fix (no spaces before colons):

User-agent: Googlebot
Disallow: /admin/

Mistake 4: Forgetting Trailing Slash

Intention: Block /admin directory

Wrong:

Disallow: /admin

Result: Also blocks /admin-panel, /administrator, etc.

Fix:

Disallow: /admin/

Mistake 5: Blocking Entire Site Accidentally

Wrong:

User-agent: *
Disallow: /
Allow: /blog/

Why it's wrong: Many bots don't support the Allow directive, so for them the entire site stays blocked.

Fix: Disallow only the specific paths you need to block; for pages you don't want indexed, use noindex meta tags rather than robots.txt.


Mistake 6: Not Blocking Development Environments

Wrong: No robots.txt on staging.example.com

Result: Staging site gets indexed.

Fix:

User-agent: *
Disallow: /

Deploy this file on every non-production environment (staging, dev, previews).
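
One way to make this hard to forget is to generate robots.txt from the running application instead of shipping a static file. The sketch below assumes a Python/FastAPI service and an APP_ENV environment variable; both names are illustrative, not part of any particular platform.

import os

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n"

@app.get("/robots.txt", response_class=PlainTextResponse)
def robots_txt() -> str:
    # Only production serves the permissive file; staging, dev, and previews block all bots.
    return ALLOW_ALL if os.getenv("APP_ENV") == "production" else BLOCK_ALL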


Mistake 7: Case Sensitivity Errors

Note: Directives are case-insensitive, but paths are case-sensitive.

Example:

Disallow: /Admin/        # Blocks /Admin/ but not /admin/

Fix: Block both if needed:

Disallow: /admin/
Disallow: /Admin/

Advanced Patterns

Wildcard Examples

# Block all PDFs
Disallow: /*.pdf$

# Block all URLs with parameters
Disallow: /*?

# Block all URLs ending in .php
Disallow: /*.php$

# Block all admin paths regardless of location
Disallow: /*/admin/

Multiple Sitemaps

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

Bot-Specific Rules

# Aggressive bot - slow it down
User-agent: BadBot
Crawl-delay: 60
Disallow: /

# Good bots - full access
User-agent: Googlebot
User-agent: Bingbot
Disallow:

# Default for others
User-agent: *
Crawl-delay: 10
Disallow: /admin/

Robots.txt vs Meta Robots vs X-Robots-Tag

When to use each:

Robots.txt:

  • Block crawling of entire directories
  • Reduce crawl budget waste
  • Block parameter variations
  • Does NOT prevent indexing if page is linked from elsewhere

Meta robots tag:

  • Prevent specific pages from being indexed
  • Control snippet display
  • Control following links
  • Example: <meta name="robots" content="noindex,follow">

X-Robots-Tag HTTP header:

  • Control non-HTML files (PDFs, images)
  • Server-level control
  • Example: X-Robots-Tag: noindex

Important: If you don't want a page indexed, use noindex (meta tag or header), NOT robots.txt.
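
When X-Robots-Tag is in use, it is worth spot-checking that the header is actually served. A minimal sketch with the third-party requests library (the PDF URL is a placeholder):

import requests

resp = requests.head("https://example.com/document.pdf", timeout=10, allow_redirects=True)
print(resp.headers.get("X-Robots-Tag"))  # expect e.g. "noindex" if the header is configured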


Monitoring and Maintenance

Regular Checks

Monthly:

  • Verify robots.txt is accessible (see the scripted check below)
  • Check Search Console for blocked URLs
  • Review crawl stats for blocked resources

Quarterly:

  • Audit blocked paths - still relevant?
  • Check for new admin/private sections to block
  • Review AI crawler landscape (new bots?)

After site changes:

  • Update robots.txt if URL structure changed
  • Test new sections (should they be blocked?)
  • Verify sitemaps still referenced
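
The accessibility check lends itself to automation. A small sketch that could run from cron or CI and fail loudly if robots.txt disappears or starts blocking the homepage (the domain and user-agent are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # a connection failure raises here, which is itself a useful signal

if not rp.can_fetch("Googlebot", "https://example.com/"):
    raise SystemExit("ALERT: robots.txt is blocking the homepage for Googlebot")
print("robots.txt check passed")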

Search Console Monitoring

Check these reports:

  • Coverage → Excluded by robots.txt
  • Settings → Crawl stats
  • URL Inspection → Test specific URLs

Robots.txt Checklist

Before deploying:

  • File is named exactly robots.txt (lowercase)
  • Located at root domain (example.com/robots.txt)
  • Plain text format (not HTML or PDF)
  • UTF-8 encoding
  • No HTML tags in file
  • All paths start with /
  • Sitemap URLs are absolute
  • No spaces before colons
  • Tested in Search Console robots.txt tester
  • Not blocking important CSS/JS/images
  • Not blocking content you want indexed
  • Trailing slashes used correctly for directories
  • Wildcard patterns tested
  • File size under 500KB

Emergency Fixes

Accidentally Blocked Entire Site

Symptom: All pages blocked in Search Console

Fix:

  1. Edit robots.txt to:
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
  2. Test in Search Console
  3. Request urgent recrawl for key pages
  4. Monitor Coverage report for recovery

Recovery time: 1-7 days


Blocked CSS/JS Files

Symptom: "Blocked by robots.txt" in Mobile-Friendly Test

Fix:

  1. Add Allow directives:
User-agent: *
Allow: /css/
Allow: /js/
Allow: /wp-content/uploads/
  2. Test in robots.txt tester
  3. Request re-render in URL Inspection tool

Staging Site Indexed

Symptom: staging.example.com appears in search results

Fix:

  1. Add to staging robots.txt:
User-agent: *
Disallow: /
  2. Add noindex meta tag to all staging pages
  3. Remove staging URLs in Search Console (Removals tool)

Resources and Tools

Testing:

  • Google Search Console robots.txt tester
  • Bing Webmaster Tools robots.txt analyzer
  • Technical SEO browser extensions

Validation:

Documentation: