webclaw/docs/config.md
Jacob Magar adf4b6ba55 feat(llm): add Gemini CLI provider as primary; set qwen3.5:9b as default Ollama model
- Add GeminiCliProvider: shells out to `gemini -p` with --output-format json,
  injection-safe prompt passing, MCP server suppression via temp workdir,
  6-slot concurrency semaphore, 60s subprocess deadline
- Add --llm-provider, --llm-model, --llm-base-url CLI flags for per-call overrides
- Provider chain: Gemini CLI → OpenAI → Ollama → Anthropic
- Move LLM timing to dispatch layer (LLM: Xs on stderr)
- Default Ollama model: qwen3:8b → qwen3.5:9b (benchmark shows better schema extraction)
- Add noxa mcp subcommand
- Add docs/reports/llm-benchmark-2026-04-11.md (Gemini vs qwen3.5:4b vs qwen3.5:9b)
- Bump version 0.3.11 → 0.4.0

Co-authored-by: Claude <claude@anthropic.com>
2026-04-12 00:52:53 -04:00

273 lines
8.4 KiB
Markdown

# Config and Environment
This document explains how `noxa` loads configuration, how it merges `config.json` with environment variables and CLI flags, and which settings belong in each place.
## Quick Summary
- `config.json` is for non-secret defaults.
- `.env` is for secrets and URLs.
- CLI flags always win over config and environment variables.
- Unknown keys in `config.json` are ignored.
- `config.json` uses `snake_case` keys.
## Load Order
`noxa` resolves settings in this order:
1. CLI flags
2. `config.json`
3. Environment variables
4. Built-in defaults
That means you can set a default in `config.json`, override it for a single run with a CLI flag, and keep secrets in `.env` without checking them into source control.
## Where `config.json` Comes From
By default, `noxa` loads `./config.json` from the current working directory.
You can override that in two ways:
- `--config <PATH>` on the CLI
- `NOXA_CONFIG=<PATH>` in the environment
If the file does not exist:
- an explicit `--config` path or `NOXA_CONFIG` path is an error
- the default `./config.json` is optional and missing files are ignored
To bypass config entirely for one run:
```bash
NOXA_CONFIG=/dev/null noxa https://example.com
```
## What Belongs Where
### `config.json`
Use `config.json` for stable, non-secret defaults such as:
- output format
- output directory
- browser fingerprint
- timeout
- crawl depth and page limits
- selector filters
- LLM provider and model
### `.env`
Use `.env` for secrets, URLs, and a small number of runtime overrides:
- `NOXA_API_KEY`
- `NOXA_PROXY`
- `NOXA_PROXY_FILE`
- `NOXA_WEBHOOK_URL`
- `NOXA_LLM_BASE_URL`
Those values are intentionally excluded from `config.json`.
If you run `setup.sh` or the Docker Compose stack, the generated `.env` may also include local deployment settings such as `NOXA_PORT`, `NOXA_HOST`, `NOXA_AUTH_KEY`, `NOXA_LOG`, `OLLAMA_HOST`, and `OLLAMA_MODEL`.
### CLI-only
These options stay on the command line and do not belong in `config.json`:
- `--on-change`
- `--raw-html`
`--on-change` is CLI-only because it executes shell commands. `--raw-html` is a per-run mode, not a persistent default.
## Config File Rules
- Keys are `snake_case`.
- All fields are optional.
- Unknown fields are ignored.
- Arrays are used for selector and path lists.
- Boolean flags have one important limitation: if you set them to `true` in `config.json`, you cannot disable them for a single CLI run with a `--no-...` flag because `noxa` does not define one.
The boolean fields with this limitation are:
- `metadata`
- `verbose`
- `only_main_content`
- `use_sitemap`
If you need to turn one of those off temporarily, bypass the config file with `NOXA_CONFIG=/dev/null`.
## Supported `config.json` Keys
### Output
| Key | Type | Default | Notes |
|---|---|---:|---|
| `format` | string | `markdown` | One of `markdown`, `json`, `text`, `llm`, `html` |
| `metadata` | boolean | `false` | Include metadata in output |
| `verbose` | boolean | `false` | Enable verbose logging |
| `output_dir` | string or null | `null` | Write outputs to files in this directory instead of stdout |
When `output_dir` is set, noxa writes results to files instead of printing them for the modes that support file output:
- single URL extraction
- multi-URL batch extraction
- crawl
- LLM extraction and summarization
- sitemap discovery
- diff output
- brand extraction
- research reports
- watch changes
File names are derived from the URL or mode name, and the directory is created on demand.
### Output Directory Layout
For URL-based output, noxa mirrors the URL path under `output_dir`:
| URL | Written file |
|---|---|
| `https://example.com/` | `output_dir/example_com/index.md` |
| `https://example.com/docs/api` | `output_dir/docs/api.md` |
| `https://example.com/docs/api/` | `output_dir/docs/api.md` |
| `https://example.com/blog/post?id=123` | `output_dir/blog/post_id_123.md` |
The extension comes from the selected output format:
| Format | Extension |
|---|---|
| `markdown` | `.md` |
| `llm` | `.md` |
| `json` | `.json` |
| `text` | `.txt` |
| `html` | `.html` |
For `--urls-file`, a CSV entry of `url,filename` uses the custom filename instead of the URL-derived name.
Examples:
```txt
https://example.com/docs/api,api.md
https://example.com/blog/post
```
Becomes:
```txt
output_dir/api.md
output_dir/blog/post.md
```
Mode-specific outputs use fixed filenames in the root of `output_dir`:
| Mode | File |
|---|---|
| `--map` | `sitemap.json` or `sitemap.txt` |
| `--diff-with` | `diff.json` or `diff.txt` |
| `--brand` | `brand.json` |
| `--research` | `research-<slug>.json` |
| `--watch` | `watch-<timestamp>.json` |
The directory tree is created automatically, so nested paths do not need to exist ahead of time.
### Fetch
| Key | Type | Default | Notes |
|---|---|---:|---|
| `browser` | string | `chrome` | One of `chrome`, `firefox`, `random` |
| `timeout` | integer | `30` | Request timeout in seconds |
| `pdf_mode` | string | `auto` | One of `auto`, `fast` |
| `only_main_content` | boolean | `false` | Auto-detect the main content area |
### Content Filtering
| Key | Type | Default | Notes |
|---|---|---:|---|
| `include_selectors` | array of strings | `[]` | CSS selectors to include |
| `exclude_selectors` | array of strings | `[]` | CSS selectors to exclude |
### Crawl
| Key | Type | Default | Notes |
|---|---|---:|---|
| `depth` | integer | `1` | Crawl depth |
| `max_pages` | integer | `20` | Maximum pages to crawl |
| `concurrency` | integer | `5` | Concurrent requests |
| `delay` | integer | `100` | Delay between requests in ms |
| `path_prefix` | string or null | `null` | Only crawl URLs whose path starts with this prefix |
| `include_paths` | array of strings | `[]` | Glob patterns to include |
| `exclude_paths` | array of strings | `[]` | Glob patterns to exclude |
| `use_sitemap` | boolean | `false` | Seed the crawl from sitemap discovery |
### LLM
| Key | Type | Default | Notes |
|---|---|---:|---|
| `llm_provider` | string | unset | Optional provider name: `gemini`, `ollama`, `openai`, `anthropic` |
| `llm_model` | string | unset | Optional model override |
## Environment Variables
| Variable | Purpose | Notes |
|---|---|---|
| `NOXA_API_KEY` | Cloud API key | Used for cloud fallback and cloud-only features |
| `NOXA_PROXY` | Single proxy URL | Takes priority over proxy file when set |
| `NOXA_PROXY_FILE` | Proxy pool file path | One proxy per line |
| `NOXA_WEBHOOK_URL` | Notification webhook | Used by watch/crawl/batch notifications |
| `NOXA_LLM_BASE_URL` | LLM endpoint URL | For Ollama or OpenAI-compatible endpoints |
| `NOXA_LLM_PROVIDER` | Default LLM provider | Environment override for the provider name |
| `NOXA_LLM_MODEL` | Default LLM model | Environment override for the model name |
| `NOXA_CONFIG` | Config file path | Override `./config.json` or bypass with `/dev/null` |
The following variables are not part of the `config.json` contract, but they still matter for LLM provider behavior:
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `OLLAMA_HOST`
- `OLLAMA_MODEL`
- `GEMINI_MODEL`
## Example
`config.example.json` shows the recommended baseline:
```json
{
"$schema": "./config.schema.json",
"_doc": [
"Copy to config.json and remove fields you don't need.",
"Secrets (api_key, proxy, webhook, llm_base_url) go in .env — NOT here."
],
"format": "markdown",
"browser": "chrome",
"timeout": 30,
"pdf_mode": "auto",
"metadata": false,
"verbose": false,
"only_main_content": false,
"include_selectors": [],
"exclude_selectors": ["nav", "footer", ".sidebar", ".cookie-banner"],
"depth": 1,
"max_pages": 20,
"concurrency": 5,
"delay": 100,
"path_prefix": null,
"include_paths": [],
"exclude_paths": ["/changelog/*", "/blog/*", "/releases/*"],
"use_sitemap": false,
"llm_provider": "gemini",
"llm_model": "gemini-2.5-pro"
}
```
## Gotchas
- `config.json` is permissive by design: unknown fields are ignored so newer config files still work on older binaries.
- `llm_provider` is validated by the CLI at runtime; invalid values will fail when the provider is selected.
- `browser`, `timeout`, `depth`, `max_pages`, `concurrency`, and `delay` are ordinary defaults, so CLI flags can override them per run.
- Boolean defaults set to `true` in config are sticky for that run unless you bypass the file.
## Related Files
- [`config.schema.json`](../config.schema.json)
- [`config.example.json`](../config.example.json)
- [`env.example`](../env.example)