Mirror of https://github.com/0xMassi/webclaw.git (synced 2026-06-06 22:05:13 +02:00)
# HTML to Markdown for RAG

Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.

## CLI

```bash
# Clean markdown with headings, links, and readable structure.
webclaw https://docs.anthropic.com --format markdown > page.md

# Token-optimized output for direct LLM context.
webclaw https://docs.anthropic.com --format llm > page.txt

# Keep the main article content and remove common navigation/footer noise.
webclaw https://docs.anthropic.com \
  --only-main-content \
  --format markdown \
  > page.md
```

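Once `page.md` exists, a common next step before embedding is to split it at headings. The following is a minimal sketch and not part of webclaw: the `split_markdown` function and the `chunk-NNN.md` file naming are hypothetical, and it assumes headings are lines that start with one or more `#` followed by a space.

```bash
# Hypothetical helper: write one chunk file per heading-delimited section.
split_markdown() {
  local input="$1" outdir="$2"
  mkdir -p "$outdir"
  awk -v outdir="$outdir" -v fence='```' '
    # Track fenced code so "# comment" lines inside it are not treated as headings.
    index($0, fence) == 1 { in_fence = !in_fence }
    # A heading outside a fence starts a new chunk file.
    !in_fence && /^#+ / {
      if (out != "") close(out)
      out = sprintf("%s/chunk-%03d.md", outdir, ++n)
    }
    {
      # Text before the first heading goes into its own first chunk.
      if (out == "") out = sprintf("%s/chunk-%03d.md", outdir, ++n)
      print > out
    }
  ' "$input"
}
```

For example, `split_markdown page.md chunks` would write `chunks/chunk-001.md`, `chunks/chunk-002.md`, and so on. Splitting at headings keeps each chunk on one topic; very long sections may still need a size-based split afterwards.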
## Batch a URL List

Create `urls.txt`:

```text
https://docs.anthropic.com/
https://docs.anthropic.com/en/docs/claude-code
https://docs.anthropic.com/en/api/messages
```

Run:

```bash
webclaw --urls-file urls.txt --format llm > corpus.txt
```

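The batch command writes every page into a single `corpus.txt`. When each chunk needs to be traced back to its source URL, one file per page can be easier to work with. The sketch below calls `webclaw` once per URL, using only the single-URL invocation from the CLI section. `url_to_filename` and `fetch_each` are hypothetical helper names, not part of webclaw.

```bash
# Hypothetical helper: turn a URL into a filesystem-safe file name.
url_to_filename() {
  local url="$1"
  url="${url#*://}"   # drop the scheme
  url="${url%/}"      # drop one trailing slash
  # Replace every character outside [A-Za-z0-9._-] with "_".
  printf '%s.txt\n' "$(printf '%s' "$url" | tr -c 'A-Za-z0-9._-' '_')"
}

# Hypothetical helper: fetch each URL in a list into its own file.
fetch_each() {
  local list="$1" outdir="$2" url
  mkdir -p "$outdir"
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue   # skip blank lines
    # </dev/null keeps the command from consuming the URL list on stdin.
    webclaw "$url" --format llm < /dev/null > "$outdir/$(url_to_filename "$url")" \
      || echo "failed: $url" >&2
  done < "$list"
}
```

Running `fetch_each urls.txt pages` would then leave files such as `pages/docs.anthropic.com_en_api_messages.txt`. This trades the speed of a single batch run for a simple one-to-one mapping between files and URLs.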
## Hosted API

```bash
curl https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "formats": ["markdown", "llm"],
    "only_main_content": true
  }'
```

Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.

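When the hosted API is called from a script, the request body usually has to be built from a variable URL and a format chosen per the guidance on `markdown` versus `llm`. The sketch below is one way to do that. `scrape_payload` and `scrape` are hypothetical helper names, and it assumes the API accepts a `formats` array with a single entry; only the endpoint, headers, and field names come from the request shown in this section.

```bash
# Hypothetical helper: build the JSON request body for one URL.
# The second argument is the format; it defaults to "llm".
scrape_payload() {
  local url="$1" format="${2:-llm}"
  # Escape backslashes and double quotes so the URL is a valid JSON string.
  # Control characters are not handled; this is a sketch, not a JSON encoder.
  url=$(printf '%s' "$url" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"url": "%s", "formats": ["%s"], "only_main_content": true}\n' "$url" "$format"
}

# Hypothetical wrapper around the curl call shown in this section.
# --fail makes curl exit non-zero on HTTP error responses.
scrape() {
  curl --fail --silent --show-error https://api.webclaw.io/v1/scrape \
    -H "Authorization: Bearer $WEBCLAW_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(scrape_payload "$@")"
}
```

With these in place, `scrape https://docs.anthropic.com markdown` would request output meant for human review, while `scrape https://docs.anthropic.com` would default to `llm` for pipeline use.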