Robots.txt Reference Guide

Complete reference for creating, testing, and troubleshooting robots.txt files.

Syntax Guide

Basic Structure

User-agent: [bot name]
Disallow: [path to block]
Allow: [path to allow]
Sitemap: [sitemap URL]
Crawl-delay: [seconds]

Core Directives

User-agent

Specifies which bot the rules apply to.

Syntax: User-agent: [bot-name]

Common user-agents:

User-agent: *                    # All bots
User-agent: Googlebot            # Google's crawler
User-agent: Bingbot              # Bing's crawler
User-agent: GPTBot               # OpenAI's crawler
User-agent: CCBot                # Common Crawl bot
User-agent: anthropic-ai         # Anthropic's crawler
User-agent: PerplexityBot        # Perplexity AI crawler
User-agent: ClaudeBot            # Claude's web crawler

Multiple user-agents: to apply one group of rules to several bots, list their User-agent lines consecutively, with no blank lines between them.

User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin/

Disallow

Blocks bots from crawling specified paths.

Syntax: Disallow: [path]

Examples:

Disallow: /                      # Block entire site
Disallow: /admin/                # Block admin directory
Disallow: /private               # Block any path starting with /private
Disallow: /*.pdf$                # Block all PDF files
Disallow: /*?                    # Block all URLs with parameters
Disallow:                        # Allow everything (empty disallow)

Path matching (illustrated in the sketch after this list):

  • / at end = block directory and all subdirectories
  • Without / at end = block all paths starting with string
  • * = wildcard, matches any sequence
  • $ = end of URL
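
The sketch below mirrors these wildcard semantics in plain Python, purely as an illustration of how patterns match (it is not any crawler's actual parser); the example pattern and URLs are placeholders.

import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex:
    '*' matches any character sequence, a trailing '$' anchors the end
    of the URL, and everything else is a literal prefix match."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")  # escape literals, restore wildcards
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))       # True  -> matched, so blocked
print(bool(rule.match("/files/report.pdf?dl=1")))  # False -> not matched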

Allow

Explicitly allows crawling of a path that a broader Disallow rule would otherwise block (for Google and Bing, the more specific rule wins).

Syntax: Allow: [path]

Common use: Allow specific subdirectories within blocked parent.

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Note: Allow was not part of the original robots.txt specification, but it is supported by Google, Bing, and most major crawlers.


Sitemap

Specifies location of XML sitemap.

Syntax: Sitemap: [absolute URL]

Examples:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/blog/sitemap.xml

Best practices:

  • Use absolute URLs (not relative)
  • Can include multiple Sitemap directives
  • Place at end of file
  • Submit same sitemap(s) to Google Search Console

Crawl-delay

Adds delay between requests (seconds).

Syntax: Crawl-delay: [seconds]

Example:

User-agent: *
Crawl-delay: 10

Warning: Not supported by Googlebot (use Search Console rate limiting instead). Supported by Bing, Yandex, and others.


Common Configurations

1. Allow All Bots (Default)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Use when you want all bots to crawl the entire site.


2. Block All Bots

User-agent: *
Disallow: /

Use for development/staging sites or private content.


3. Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /cgi-bin/

Sitemap: https://example.com/sitemap.xml

Standard configuration blocking admin and utility directories.


4. Block All AI Crawlers

# Block OpenAI
User-agent: GPTBot
Disallow: /

# Block Anthropic
User-agent: anthropic-ai
User-agent: ClaudeBot
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

# Block Google-Extended (Gemini / AI training)
User-agent: Google-Extended
Disallow: /

# Allow search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

Sitemap: https://example.com/sitemap.xml

Use when you want search indexing but not AI training.


5. Allow Search Engines, Block Everything Else

# Block all by default
User-agent: *
Disallow: /

# Allow Google
User-agent: Googlebot
Disallow:

# Allow Bing
User-agent: Bingbot
Disallow:

# Allow DuckDuckGo
User-agent: DuckDuckBot
Disallow:

Sitemap: https://example.com/sitemap.xml

6. Block URL Parameters

User-agent: *
Disallow: /*?                    # Block all URLs with parameters
Allow: /?                        # Allow homepage with parameters

Sitemap: https://example.com/sitemap.xml

Prevents duplicate content from parameter variations.


7. Block File Types

User-agent: *
Disallow: /*.pdf$
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.zip$

Sitemap: https://example.com/sitemap.xml

8. E-commerce Configuration

User-agent: *
# Block search/filter pages
Disallow: /*?q=
Disallow: /*?sort=
Disallow: /*?filter=

# Block account pages
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/

# Block admin
Disallow: /admin/

# Allow product pages
Allow: /products/

Sitemap: https://example.com/sitemap.xml

9. WordPress Configuration

User-agent: *
# WordPress core
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# WordPress directories
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/

# Allow uploads
Allow: /wp-content/uploads/

# Block parameter pages
Disallow: /?s=
Disallow: /feed/
Disallow: /trackback/

Sitemap: https://example.com/sitemap_index.xml

10. Shopify Configuration

User-agent: *
# Block admin and account
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout

# Block search
Disallow: /search

# Block collections with filters
Disallow: /collections/*+*
Disallow: /collections/*?*

Sitemap: https://example.com/sitemap.xml

Platform-Specific Templates

Wix

User-agent: *
Disallow: /_api/
Disallow: /_partials/

Sitemap: https://example.com/sitemap.xml

Squarespace

User-agent: *
Disallow: /config/
Disallow: /search

Sitemap: https://example.com/sitemap.xml

Webflow

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Drupal

User-agent: *
Disallow: /admin/
Disallow: /user/
Disallow: /node/add/
Disallow: /?q=

Sitemap: https://example.com/sitemap.xml

Testing and Validation

Google Search Console Robots.txt Tester

  1. Go to: Search Console → Settings → robots.txt
  2. View current robots.txt
  3. Test specific URLs
  4. See which user-agents are affected

Manual Testing

Fetch the file directly: https://example.com/robots.txt

Check that the file is (a scripted version of these checks follows this list):

  • Accessible (returns 200 status)
  • Plain text format
  • UTF-8 encoded
  • Located at root domain
  • No more than 500KB (Google limit)
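
A minimal sketch of most of these checks, using the third-party requests library; the URL is a placeholder for your own domain.

import requests

resp = requests.get("https://example.com/robots.txt", timeout=10)

assert resp.status_code == 200, f"Unexpected status: {resp.status_code}"
assert "text/plain" in resp.headers.get("Content-Type", ""), "Not served as plain text"
assert len(resp.content) <= 500 * 1024, "Exceeds Google's 500KB limit"
resp.content.decode("utf-8")  # raises UnicodeDecodeError if the file is not valid UTF-8
print("robots.txt passed the basic checks")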

Common Testing Scenarios

Test these URLs in the tester (or script the checks, as sketched after this list):

  • Homepage: /
  • Product page: /products/example
  • Admin page: /admin/
  • Parameter page: /search?q=test
  • File: /document.pdf
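
These scenarios can also be checked locally with Python's standard urllib.robotparser, which answers allow/block questions per user-agent. Note that it implements the original specification and does not understand the * and $ wildcard extensions, so wildcard rules still need the Search Console tester. The domain and user-agent names below are placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

paths = ["/", "/products/example", "/admin/", "/search?q=test", "/document.pdf"]
for agent in ("Googlebot", "GPTBot"):
    for path in paths:
        verdict = "allowed" if rp.can_fetch(agent, "https://example.com" + path) else "blocked"
        print(f"{agent:<10} {path:<25} {verdict}")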

Common Mistakes and Fixes

Mistake 1: Blocking CSS/JS Files

Wrong:

User-agent: *
Disallow: /css/
Disallow: /js/

Why it's wrong: Google needs CSS/JS to render pages properly.

Fix:

User-agent: *
Allow: /css/
Allow: /js/

Mistake 2: Using Relative URLs for Sitemap

Wrong:

Sitemap: /sitemap.xml

Fix:

Sitemap: https://example.com/sitemap.xml

Mistake 3: Spaces in Directives

Wrong:

User-agent : Googlebot
Disallow : /admin/

Fix (no spaces before colons):

User-agent: Googlebot
Disallow: /admin/

Mistake 4: Forgetting Trailing Slash

Intention: Block /admin directory

Wrong:

Disallow: /admin

Result: Also blocks /admin-panel, /administrator, etc.

Fix:

Disallow: /admin/

Mistake 5: Blocking Entire Site Accidentally

Wrong:

User-agent: *
Disallow: /
Allow: /blog/

Why it's wrong: Many bots don't support the Allow directive, so for them the entire site stays blocked.

Fix: Disallow only the specific paths you need to block; for pages you don't want indexed, use noindex meta tags rather than robots.txt.


Mistake 6: Not Blocking Development Environments

Wrong: No robots.txt on staging.example.com

Result: Staging site gets indexed.

Fix:

User-agent: *
Disallow: /

Deploy this file on every non-production environment (staging, dev, previews).
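
One way to make this hard to forget is to generate robots.txt from the running application instead of shipping a static file. The sketch below assumes a Python/FastAPI service and an APP_ENV environment variable; both names are illustrative, not part of any particular platform.

import os

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n"

@app.get("/robots.txt", response_class=PlainTextResponse)
def robots_txt() -> str:
    # Only production serves the permissive file; staging, dev, and previews block all bots.
    return ALLOW_ALL if os.getenv("APP_ENV") == "production" else BLOCK_ALL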


Mistake 7: Case Sensitivity Errors

Note: Directives are case-insensitive, but paths are case-sensitive.

Example:

Disallow: /Admin/        # Blocks /Admin/ but not /admin/

Fix: Block both if needed:

Disallow: /admin/
Disallow: /Admin/

Advanced Patterns

Wildcard Examples

# Block all PDFs
Disallow: /*.pdf$

# Block all URLs with parameters
Disallow: /*?

# Block all URLs ending in .php
Disallow: /*.php$

# Block all admin paths regardless of location
Disallow: /*/admin/

Multiple Sitemaps

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

Bot-Specific Rules

# Aggressive bot - slow it down
User-agent: BadBot
Crawl-delay: 60
Disallow: /

# Good bots - full access
User-agent: Googlebot
User-agent: Bingbot
Disallow:

# Default for others
User-agent: *
Crawl-delay: 10
Disallow: /admin/

Robots.txt vs Meta Robots vs X-Robots-Tag

When to use each:

Robots.txt:

  • Block crawling of entire directories
  • Reduce crawl budget waste
  • Block parameter variations
  • Does NOT prevent indexing if page is linked from elsewhere

Meta robots tag:

  • Prevent specific pages from being indexed
  • Control snippet display
  • Control following links
  • Example: <meta name="robots" content="noindex,follow">

X-Robots-Tag HTTP header:

  • Control non-HTML files (PDFs, images)
  • Server-level control
  • Example: X-Robots-Tag: noindex

Important: If you don't want a page indexed, use noindex (meta tag or header), NOT robots.txt.
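
When X-Robots-Tag is in use, it is worth spot-checking that the header is actually served. A minimal sketch with the third-party requests library (the PDF URL is a placeholder):

import requests

resp = requests.head("https://example.com/document.pdf", timeout=10, allow_redirects=True)
print(resp.headers.get("X-Robots-Tag"))  # expect e.g. "noindex" if the header is configured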


Monitoring and Maintenance

Regular Checks

Monthly:

  • Verify robots.txt is accessible (see the scripted check below)
  • Check Search Console for blocked URLs
  • Review crawl stats for blocked resources

Quarterly:

  • Audit blocked paths - still relevant?
  • Check for new admin/private sections to block
  • Review AI crawler landscape (new bots?)

After site changes:

  • Update robots.txt if URL structure changed
  • Test new sections (should they be blocked?)
  • Verify sitemaps still referenced
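
The accessibility check lends itself to automation. A small sketch that could run from cron or CI and fail loudly if robots.txt disappears or starts blocking the homepage (the domain and user-agent are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # a connection failure raises here, which is itself a useful signal

if not rp.can_fetch("Googlebot", "https://example.com/"):
    raise SystemExit("ALERT: robots.txt is blocking the homepage for Googlebot")
print("robots.txt check passed")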

Search Console Monitoring

Check these reports:

  • Coverage → Excluded by robots.txt
  • Settings → Crawl stats
  • URL Inspection → Test specific URLs

Robots.txt Checklist

Before deploying:

  • File is named exactly robots.txt (lowercase)
  • Located at root domain (example.com/robots.txt)
  • Plain text format (not HTML or PDF)
  • UTF-8 encoding
  • No HTML tags in file
  • All paths start with /
  • Sitemap URLs are absolute
  • No spaces before colons
  • Tested in Search Console robots.txt tester
  • Not blocking important CSS/JS/images
  • Not blocking content you want indexed
  • Trailing slashes used correctly for directories
  • Wildcard patterns tested
  • File size under 500KB

Emergency Fixes

Accidentally Blocked Entire Site

Symptom: All pages blocked in Search Console

Fix:

  1. Edit robots.txt to:
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
  2. Test in Search Console
  3. Request urgent recrawl for key pages
  4. Monitor Coverage report for recovery

Recovery time: 1-7 days


Blocked CSS/JS Files

Symptom: "Blocked by robots.txt" in Mobile-Friendly Test

Fix:

  1. Add Allow directives:
User-agent: *
Allow: /css/
Allow: /js/
Allow: /wp-content/uploads/
  2. Test in robots.txt tester
  3. Request re-render in URL Inspection tool

Staging Site Indexed

Symptom: staging.example.com appears in search results

Fix:

  1. Add to staging robots.txt:
User-agent: *
Disallow: /
  2. Add noindex meta tag to all staging pages
  3. Remove staging URLs in Search Console (Removals tool)

Resources and Tools

Testing:

  • Google Search Console robots.txt tester
  • Bing Webmaster Tools robots.txt analyzer
  • Technical SEO browser extensions

Validation:

Documentation: