# HTML to Markdown for RAG
Turn web pages into clean markdown or compact LLM text before chunking, embedding, or passing the page to an agent.
## CLI
```bash
# Clean markdown with headings, links, and readable structure.
webclaw https://docs.anthropic.com --format markdown > page.md
# Token-optimized output for direct LLM context.
webclaw https://docs.anthropic.com --format llm > page.txt
# Keep the main article content and remove common navigation/footer noise.
webclaw https://docs.anthropic.com \
  --only-main-content \
  --format markdown \
  > page.md
```
## Batch a URL List
Create `urls.txt`:
```text
https://docs.anthropic.com/
https://docs.anthropic.com/en/docs/claude-code
https://docs.anthropic.com/en/api/messages
```
Run:
```bash
webclaw --urls-file urls.txt --format llm > corpus.txt
```
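
The command above writes every page into a single `corpus.txt`. If each chunk needs to be traced back to its source page, one option is to fetch the URLs one at a time into separate files. The sketch below is a minimal illustration, not part of webclaw: the helper names `url_to_filename` and `fetch_each` are hypothetical, and it uses only the `--format` flag shown earlier.

```bash
#!/usr/bin/env bash
# Sketch: write one output file per URL so chunks keep their source page.
# url_to_filename and fetch_each are hypothetical helpers, not webclaw features.

# Turn a URL into a filesystem-safe name.
url_to_filename() {
  local url="$1"
  url="${url#*://}"   # drop the scheme
  url="${url%/}"      # drop one trailing slash
  # Replace each run of characters outside [A-Za-z0-9._-] with one underscore.
  printf '%s\n' "$url" | sed -E 's/[^A-Za-z0-9._-]+/_/g'
}

# Read URLs line by line and fetch each one into its own file.
fetch_each() {
  local list="$1" outdir="$2" url
  mkdir -p "$outdir"
  while IFS= read -r url || [ -n "$url" ]; do
    [ -z "$url" ] && continue   # skip blank lines
    # Redirect stdin so the command cannot consume the rest of the URL list.
    webclaw "$url" --format llm < /dev/null \
      > "$outdir/$(url_to_filename "$url").txt"
  done < "$list"
}
```

Call it as `fetch_each urls.txt corpus`. Two URLs that differ only in punctuation can map to the same filename, so check for collisions on large lists.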
## Hosted API
```bash
curl https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.anthropic.com",
    "formats": ["markdown", "llm"],
    "only_main_content": true
  }'
```
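
To send the same request for different pages, the URL can be passed in as a parameter. The sketch below assumes only the endpoint, headers, and body fields shown in the request above; `build_payload` and `scrape_url` are hypothetical helper names, and the response format is not assumed, so the output is printed as returned.

```bash
#!/usr/bin/env bash
# Sketch: parameterize the scrape request by URL.
# build_payload and scrape_url are hypothetical helpers for illustration.

# Build the JSON request body for one URL.
build_payload() {
  local url="$1"
  # Escape backslashes and double quotes so the URL is a valid JSON string.
  # Control characters are not handled; use a JSON tool for untrusted input.
  url=$(printf '%s' "$url" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"url": "%s", "formats": ["markdown", "llm"], "only_main_content": true}\n' "$url"
}

# POST the body; --fail makes curl exit non-zero on HTTP error responses.
scrape_url() {
  curl --fail --silent --show-error https://api.webclaw.io/v1/scrape \
    -H "Authorization: Bearer $WEBCLAW_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")"
}
```

For example, `scrape_url https://docs.anthropic.com/en/api/messages > response.json` saves the raw response for one page.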
Use `markdown` when humans may inspect the output. Use `llm` when the next step is chunking, embedding, summarization, or prompt context.
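
When the markdown output is headed for chunking, splitting on headings is one simple strategy. The sketch below is an assumption about a reasonable downstream step, not something webclaw does: the hypothetical `split_by_heading` helper writes one file per `#` or `##` heading and ignores `#` lines inside fenced code blocks, such as shell comments.

```bash
#!/usr/bin/env bash
# Sketch: split a markdown file into one chunk per level-1 or level-2 heading.
# split_by_heading is a hypothetical helper, not a webclaw command.

split_by_heading() {
  local input="$1" outdir="$2"
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }   # three backticks
    # Toggle on every line that opens or closes a fenced code block.
    index($0, fence) == 1 { in_fence = !in_fence }
    # Outside a fence, a level-1 or level-2 heading starts a new chunk.
    !in_fence && /^##? / {
      if (out != "") close(out)
      n++
      out = sprintf("%s/chunk_%03d.md", dir, n)
    }
    {
      # Text before the first heading goes to chunk_000.md.
      if (out == "") out = sprintf("%s/chunk_%03d.md", dir, 0)
      print > out
    }
  ' "$input"
}
```

Running `split_by_heading page.md chunks` produces `chunks/chunk_001.md`, `chunks/chunk_002.md`, and so on, each starting with its heading line. Sections that are still too long for the embedding model would need a further size-based split.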