Mirror of https://github.com/MODSetter/SurfSense.git (synced 2026-04-28)
# Robots.txt Reference Guide

Complete reference for creating, testing, and troubleshooting robots.txt files.

## Syntax Guide

### Basic Structure

```
User-agent: [bot name]
Disallow: [path to block]
Allow: [path to allow]
Sitemap: [sitemap URL]
Crawl-delay: [seconds]
```

---

## Core Directives

### User-agent

Specifies which bot the rules apply to.

**Syntax**: `User-agent: [bot-name]`

**Common user-agents**:

```
User-agent: *               # All bots
User-agent: Googlebot       # Google's crawler
User-agent: Bingbot         # Bing's crawler
User-agent: GPTBot          # OpenAI's crawler
User-agent: CCBot           # Common Crawl bot
User-agent: anthropic-ai    # Anthropic's crawler
User-agent: PerplexityBot   # Perplexity AI crawler
User-agent: ClaudeBot       # Claude's web crawler
```

**Multiple user-agents**: Group rules by leaving no blank lines between user-agent declarations.

```
User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin/
```
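
Grouping can be sanity-checked with Python's standard-library parser. A minimal sketch, with illustrative bot names and URLs:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Both grouped agents inherit the same Disallow rule.
print(rp.can_fetch("Googlebot", "https://example.com/admin/users"))  # False
print(rp.can_fetch("Bingbot", "https://example.com/admin/users"))    # False
# An agent with no matching group (and no * group) is unrestricted.
print(rp.can_fetch("SomeOtherBot", "https://example.com/admin/users"))  # True
```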

---

### Disallow

Blocks bots from crawling specified paths.

**Syntax**: `Disallow: [path]`

**Examples**:

```
Disallow: /          # Block entire site
Disallow: /admin/    # Block admin directory
Disallow: /private   # Block all paths starting with /private
Disallow: /*.pdf$    # Block all PDF files
Disallow: /*?        # Block all URLs with parameters
Disallow:            # Allow everything (empty disallow)
```

**Path matching**:

- `/` at end = block directory and all subdirectories
- Without `/` at end = block all paths starting with string
- `*` = wildcard, matches any sequence
- `$` = end of URL
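
The prefix rules above can be verified with Python's `urllib.robotparser` (note the stdlib parser matches plain prefixes only; it does not implement the `*` and `$` extensions):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Without a trailing slash, every path starting with "/private" is blocked...
print(rp.can_fetch("TestBot", "https://example.com/private/notes"))  # False
print(rp.can_fetch("TestBot", "https://example.com/private-files"))  # False
# ...while unrelated paths stay crawlable.
print(rp.can_fetch("TestBot", "https://example.com/public/"))        # True
```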

---

### Allow

Explicitly allows crawling (overrides Disallow).

**Syntax**: `Allow: [path]`

**Common use**: Allow specific subdirectories within a blocked parent.

```
User-agent: *
Disallow: /admin/
Allow: /admin/public/
```

**Note**: Allow is not standard but is supported by Google, Bing, and most major crawlers.

---

### Sitemap

Specifies the location of the XML sitemap.

**Syntax**: `Sitemap: [absolute URL]`

**Examples**:

```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/blog/sitemap.xml
```

**Best practices**:

- Use absolute URLs (not relative)
- Can include multiple Sitemap directives
- Place at end of file
- Submit the same sitemap(s) to Google Search Console

---

### Crawl-delay

Adds a delay between requests (in seconds).

**Syntax**: `Crawl-delay: [seconds]`

**Example**:

```
User-agent: *
Crawl-delay: 10
```

**Warning**: Not supported by Googlebot (use Search Console rate limiting instead). Supported by Bing, Yandex, and others.
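
For crawlers that do honor it, the directive is exposed by Python's standard-library parser via `RobotFileParser.crawl_delay()`, which is handy when writing a polite crawler. A minimal sketch:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

delay = rp.crawl_delay("TestBot")  # returns None if the directive is absent
print(delay)  # 10

# A polite fetch loop would then sleep between requests, e.g.:
#   for url in urls:
#       fetch(url)
#       time.sleep(delay or 1)
```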

---

## Common Configurations

### 1. Allow All Bots (Default)

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```

Use when you want all bots to crawl the entire site.

---

### 2. Block All Bots

```
User-agent: *
Disallow: /
```

Use for development/staging sites or private content.

---

### 3. Block Specific Directories

```
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /cgi-bin/

Sitemap: https://example.com/sitemap.xml
```

Standard configuration blocking admin and utility directories.

---

### 4. Block All AI Crawlers

```
# Block OpenAI
User-agent: GPTBot
Disallow: /

# Block Anthropic
User-agent: anthropic-ai
User-agent: ClaudeBot
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

# Block Google-Extended (Gemini/Bard training)
User-agent: Google-Extended
Disallow: /

# Allow search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

Sitemap: https://example.com/sitemap.xml
```

Use when you want search indexing but not AI training.
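
Before deploying a configuration like this, it is worth confirming it behaves as intended. A quick check using Python's `urllib.robotparser` (the URL is illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False: AI crawler blocked
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True: search crawler allowed
```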

---

### 5. Allow Search Engines, Block Everything Else

```
# Block all by default
User-agent: *
Disallow: /

# Allow Google
User-agent: Googlebot
Disallow:

# Allow Bing
User-agent: Bingbot
Disallow:

# Allow DuckDuckGo
User-agent: DuckDuckBot
Disallow:

Sitemap: https://example.com/sitemap.xml
```

---

### 6. Block URL Parameters

```
User-agent: *
Disallow: /*?    # Block all URLs with parameters
Allow: /?        # Allow homepage with parameters

Sitemap: https://example.com/sitemap.xml
```

Prevents duplicate content from parameter variations.

---

### 7. Block File Types

```
User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.zip$

Sitemap: https://example.com/sitemap.xml
```
---

### 8. E-commerce Configuration

```
User-agent: *
# Block search/filter pages
Disallow: /*?q=
Disallow: /*?sort=
Disallow: /*?filter=

# Block account pages
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/

# Block admin
Disallow: /admin/

# Allow product pages
Allow: /products/

Sitemap: https://example.com/sitemap.xml
```

---

### 9. WordPress Configuration

```
User-agent: *
# WordPress core
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# WordPress directories
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/

# Allow uploads
Allow: /wp-content/uploads/

# Block search, feed, and trackback pages
Disallow: /?s=
Disallow: /feed/
Disallow: /trackback/

Sitemap: https://example.com/sitemap_index.xml
```
---

### 10. Shopify Configuration

```
User-agent: *
# Block admin and account
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout

# Block search
Disallow: /search

# Block collections with filters
Disallow: /collections/*+*
Disallow: /collections/*?*

Sitemap: https://example.com/sitemap.xml
```

---

## Platform-Specific Templates

### Wix

```
User-agent: *
Disallow: /_api/
Disallow: /_partials/

Sitemap: https://example.com/sitemap.xml
```

### Squarespace

```
User-agent: *
Disallow: /config/
Disallow: /search

Sitemap: https://example.com/sitemap.xml
```

### Webflow

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

### Drupal

```
User-agent: *
Disallow: /admin/
Disallow: /user/
Disallow: /node/add/
Disallow: /?q=

Sitemap: https://example.com/sitemap.xml
```

---
## Testing and Validation

### Google Search Console Robots.txt Tester

1. Go to: Search Console → Settings → robots.txt
2. View the current robots.txt
3. Test specific URLs
4. See which user-agents are affected

### Manual Testing

Test URL pattern: `https://example.com/robots.txt`

Check that the file is:

- Accessible (returns 200 status)
- Plain text format
- UTF-8 encoded
- Located at the root domain
- No more than 500 KB (Google's limit)
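
These checks can be scripted. A minimal sketch that validates an already-fetched response; the status code, headers, and body are passed in, so no network access is needed, and the function name is illustrative:

```python
def check_robots_response(status: int, content_type: str, body: bytes) -> list[str]:
    """Return a list of problems with a fetched robots.txt response."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    if "text/plain" not in content_type:
        problems.append(f"expected text/plain, got {content_type!r}")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    if len(body) > 500 * 1024:
        problems.append("file exceeds Google's 500 KB limit")
    return problems

print(check_robots_response(200, "text/plain; charset=utf-8", b"User-agent: *\nDisallow:\n"))  # []
print(check_robots_response(404, "text/html", b"Not found"))  # two problems
```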

### Common Testing Scenarios

Test these URLs in the tester:

- Homepage: `/`
- Product page: `/products/example`
- Admin page: `/admin/`
- Parameter page: `/search?q=test`
- File: `/document.pdf`

---

## Common Mistakes and Fixes

### Mistake 1: Blocking CSS/JS Files

**Wrong**:

```
User-agent: *
Disallow: /css/
Disallow: /js/
```

**Why it's wrong**: Google needs CSS/JS to render pages properly.

**Fix**:

```
User-agent: *
Allow: /css/
Allow: /js/
```

---

### Mistake 2: Using Relative URLs for Sitemap

**Wrong**:

```
Sitemap: /sitemap.xml
```

**Fix**:

```
Sitemap: https://example.com/sitemap.xml
```

---

### Mistake 3: Spaces in Directives

**Wrong**:

```
User-agent : Googlebot
Disallow : /admin/
```

**Fix** (no spaces before colons):

```
User-agent: Googlebot
Disallow: /admin/
```
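
Mistakes like these are easy to catch with a small lint pass. A sketch that checks just the two rules from Mistakes 2 and 3 (the function name is illustrative):

```python
def lint_robots(text: str) -> list[str]:
    """Flag spaces before colons and relative Sitemap URLs."""
    issues = []
    for n, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        field, sep, value = stripped.partition(":")
        if not sep:
            issues.append(f"line {n}: no colon in directive")
            continue
        if field != field.rstrip():
            issues.append(f"line {n}: space before colon in {field.rstrip()!r}")
        if field.strip().lower() == "sitemap" and \
                not value.strip().startswith(("http://", "https://")):
            issues.append(f"line {n}: Sitemap URL must be absolute")
    return issues

print(lint_robots("User-agent : Googlebot\nDisallow: /admin/\nSitemap: /sitemap.xml\n"))
```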

---

### Mistake 4: Forgetting the Trailing Slash

**Intention**: Block the /admin directory.

**Wrong**:

```
Disallow: /admin
```

**Result**: Also blocks /admin-panel, /administrator, etc.

**Fix**:

```
Disallow: /admin/
```

---

### Mistake 5: Blocking the Entire Site Accidentally

**Wrong**:

```
User-agent: *
Disallow: /
Allow: /blog/
```

**Why it's wrong**: Many bots don't support the Allow directive.

**Fix**: Use noindex meta tags for pages you don't want indexed, not robots.txt.

---

### Mistake 6: Not Blocking Development Environments

**Wrong**: No robots.txt on staging.example.com

**Result**: The staging site gets indexed.

**Fix**:

```
User-agent: *
Disallow: /
```

On all non-production environments.

---

### Mistake 7: Case Sensitivity Errors

**Note**: Directives are case-insensitive, but paths are case-sensitive.

**Example**:

```
Disallow: /Admin/    # Blocks /Admin/ but not /admin/
```

**Fix**: Block both if needed:

```
Disallow: /admin/
Disallow: /Admin/
```

---

## Advanced Patterns

### Wildcard Examples

```
# Block all PDFs
Disallow: /*.pdf$

# Block all URLs with parameters
Disallow: /*?

# Block all URLs ending in .php
Disallow: /*.php$

# Block all admin paths regardless of location
Disallow: /*/admin/
```
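
Google-style wildcard matching is easy to emulate for testing, since `*` maps to `.*` and a trailing `$` anchors the end of the path. A hand-rolled sketch (Python's `urllib.robotparser` does not implement these extensions, so this is a standalone helper):

```python
import re

def pattern_matches(pattern: str, path: str) -> bool:
    """Test a robots.txt path pattern containing * and $ against a URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal segments, join them with .* for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

print(pattern_matches("/*.pdf$", "/files/report.pdf"))      # True
print(pattern_matches("/*.pdf$", "/files/report.pdf?x=1"))  # False: $ anchors the end
print(pattern_matches("/*/admin/", "/en/admin/users"))      # True: prefix match, no anchor
```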

### Multiple Sitemaps

```
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
```

### Bot-Specific Rules

```
# Aggressive bot - slow it down
User-agent: BadBot
Crawl-delay: 60
Disallow: /

# Good bots - full access
User-agent: Googlebot
User-agent: Bingbot
Disallow:

# Default for others
User-agent: *
Crawl-delay: 10
Disallow: /admin/
```

---

## Robots.txt vs Meta Robots vs X-Robots-Tag

### When to use each

**Robots.txt**:

- Block crawling of entire directories
- Reduce crawl budget waste
- Block parameter variations
- Does NOT prevent indexing if the page is linked from elsewhere

**Meta robots tag**:

- Prevent specific pages from being indexed
- Control snippet display
- Control following links
- Example: `<meta name="robots" content="noindex,follow">`

**X-Robots-Tag HTTP header**:

- Control non-HTML files (PDFs, images)
- Server-level control
- Example: `X-Robots-Tag: noindex`

**Important**: If you don't want a page indexed, use noindex (meta tag or header), NOT robots.txt.
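
The X-Robots-Tag header is typically set with a server-level rule. Hedged sketches for nginx and Apache (mod_headers); adjust the file-extension list to your site:

```
# nginx: send X-Robots-Tag for PDFs and Word documents
location ~* \.(pdf|doc|docx)$ {
    add_header X-Robots-Tag "noindex, nofollow" always;
}
```

```
# Apache (mod_headers): same idea via FilesMatch
<FilesMatch "\.(pdf|doc|docx)$">
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```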

---

## Monitoring and Maintenance

### Regular Checks

**Monthly**:

- [ ] Verify robots.txt is accessible
- [ ] Check Search Console for blocked URLs
- [ ] Review crawl stats for blocked resources

**Quarterly**:

- [ ] Audit blocked paths - are they still relevant?
- [ ] Check for new admin/private sections to block
- [ ] Review the AI crawler landscape (new bots?)

**After site changes**:

- [ ] Update robots.txt if the URL structure changed
- [ ] Test new sections (should they be blocked?)
- [ ] Verify sitemaps are still referenced

### Search Console Monitoring

Check these reports:

- **Coverage** → Excluded by robots.txt
- **Settings** → Crawl stats
- **URL Inspection** → Test specific URLs

---

## Robots.txt Checklist

Before deploying:

- [ ] File is named exactly `robots.txt` (lowercase)
- [ ] Located at the root domain (`example.com/robots.txt`)
- [ ] Plain text format (not HTML or PDF)
- [ ] UTF-8 encoding
- [ ] No HTML tags in the file
- [ ] All paths start with `/`
- [ ] Sitemap URLs are absolute
- [ ] No spaces before colons
- [ ] Tested in the Search Console robots.txt tester
- [ ] Not blocking important CSS/JS/images
- [ ] Not blocking content you want indexed
- [ ] Trailing slashes used correctly for directories
- [ ] Wildcard patterns tested
- [ ] File size under 500 KB

---

## Emergency Fixes

### Accidentally Blocked the Entire Site

**Symptom**: All pages blocked in Search Console.

**Fix**:

1. Edit robots.txt to:

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```

2. Test in Search Console
3. Request urgent recrawl for key pages
4. Monitor the Coverage report for recovery

**Recovery time**: 1-7 days

---

### Blocked CSS/JS Files

**Symptom**: "Blocked by robots.txt" in the Mobile-Friendly Test.

**Fix**:

1. Add Allow directives:

```
User-agent: *
Allow: /css/
Allow: /js/
Allow: /wp-content/uploads/
```

2. Test in the robots.txt tester
3. Request a re-render in the URL Inspection tool

---

### Staging Site Indexed

**Symptom**: staging.example.com appears in search results.

**Fix**:

1. Add to the staging robots.txt:

```
User-agent: *
Disallow: /
```

2. Add a noindex meta tag to all staging pages
3. Remove staging URLs in Search Console (Removals tool)

---

## Resources and Tools

**Testing**:

- Google Search Console robots.txt tester
- Bing Webmaster Tools robots.txt analyzer
- Technical SEO browser extensions

**Validation**:

- https://www.google.com/webmasters/tools/robots-testing-tool
- https://en.ryte.com/free-tools/robots-txt/
- https://technicalseo.com/tools/robots-txt/

**Documentation**:

- Google: https://developers.google.com/search/docs/crawling-indexing/robots/intro
- Bing: https://www.bing.com/webmasters/help/robots-txt-validation
- Robots.txt spec: https://www.robotstxt.org/