mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-05-02 19:42:37 +02:00
- Add GeminiCliProvider: shells out to `gemini -p` with --output-format json, injection-safe prompt passing, MCP server suppression via temp workdir, 6-slot concurrency semaphore, 60s subprocess deadline - Add --llm-provider, --llm-model, --llm-base-url CLI flags for per-call overrides - Provider chain: Gemini CLI → OpenAI → Ollama → Anthropic - Move LLM timing to dispatch layer (LLM: Xs on stderr) - Default Ollama model: qwen3:8b → qwen3.5:9b (benchmark shows better schema extraction) - Add noxa mcp subcommand - Add docs/reports/llm-benchmark-2026-04-11.md (Gemini vs qwen3.5:4b vs qwen3.5:9b) - Bump version 0.3.11 → 0.4.0 Co-authored-by: Claude <claude@anthropic.com>
273 lines
8.4 KiB
Markdown
273 lines
8.4 KiB
Markdown
# Config and Environment
|
|
|
|
This document explains how `noxa` loads configuration, how it merges `config.json` with environment variables and CLI flags, and which settings belong in each place.
|
|
|
|
## Quick Summary
|
|
|
|
- `config.json` is for non-secret defaults.
|
|
- `.env` is for secrets and URLs.
|
|
- CLI flags always win over config and environment variables.
|
|
- Unknown keys in `config.json` are ignored.
|
|
- `config.json` uses `snake_case` keys.
|
|
|
|
## Load Order
|
|
|
|
`noxa` resolves settings in this order:
|
|
|
|
1. CLI flags
|
|
2. `config.json`
|
|
3. Environment variables
|
|
4. Built-in defaults
|
|
|
|
That means you can set a default in `config.json`, override it for a single run with a CLI flag, and keep secrets in `.env` without checking them into source control.
|
|
|
|
## Where `config.json` Comes From
|
|
|
|
By default, `noxa` loads `./config.json` from the current working directory.
|
|
|
|
You can override that in two ways:
|
|
|
|
- `--config <PATH>` on the CLI
|
|
- `NOXA_CONFIG=<PATH>` in the environment
|
|
|
|
If the file does not exist:
|
|
|
|
- an explicit `--config` path or `NOXA_CONFIG` path is an error
|
|
- the default `./config.json` is optional and missing files are ignored
|
|
|
|
To bypass config entirely for one run:
|
|
|
|
```bash
|
|
NOXA_CONFIG=/dev/null noxa https://example.com
|
|
```
|
|
|
|
## What Belongs Where
|
|
|
|
### `config.json`
|
|
|
|
Use `config.json` for stable, non-secret defaults such as:
|
|
|
|
- output format
|
|
- output directory
|
|
- browser fingerprint
|
|
- timeout
|
|
- crawl depth and page limits
|
|
- selector filters
|
|
- LLM provider and model
|
|
|
|
### `.env`
|
|
|
|
Use `.env` for secrets, URLs, and a small number of runtime overrides:
|
|
|
|
- `NOXA_API_KEY`
|
|
- `NOXA_PROXY`
|
|
- `NOXA_PROXY_FILE`
|
|
- `NOXA_WEBHOOK_URL`
|
|
- `NOXA_LLM_BASE_URL`
|
|
|
|
Those values are intentionally excluded from `config.json`.
|
|
|
|
If you run `setup.sh` or the Docker Compose stack, the generated `.env` may also include local deployment settings such as `NOXA_PORT`, `NOXA_HOST`, `NOXA_AUTH_KEY`, `NOXA_LOG`, `OLLAMA_HOST`, and `OLLAMA_MODEL`.
|
|
|
|
### CLI-only
|
|
|
|
These options stay on the command line and do not belong in `config.json`:
|
|
|
|
- `--on-change`
|
|
- `--raw-html`
|
|
|
|
`--on-change` is CLI-only because it executes shell commands. `--raw-html` is a per-run mode, not a persistent default.
|
|
|
|
## Config File Rules
|
|
|
|
- Keys are `snake_case`.
|
|
- All fields are optional.
|
|
- Unknown fields are ignored.
|
|
- Arrays are used for selector and path lists.
|
|
- Boolean flags have one important limitation: if you set them to `true` in `config.json`, you cannot disable them for a single CLI run with a `--no-...` flag because `noxa` does not define one.
|
|
|
|
The boolean fields with this limitation are:
|
|
|
|
- `metadata`
|
|
- `verbose`
|
|
- `only_main_content`
|
|
- `use_sitemap`
|
|
|
|
If you need to turn one of those off temporarily, bypass the config file with `NOXA_CONFIG=/dev/null`.
|
|
|
|
## Supported `config.json` Keys
|
|
|
|
### Output
|
|
|
|
| Key | Type | Default | Notes |
|
|
|---|---|---:|---|
|
|
| `format` | string | `markdown` | One of `markdown`, `json`, `text`, `llm`, `html` |
|
|
| `metadata` | boolean | `false` | Include metadata in output |
|
|
| `verbose` | boolean | `false` | Enable verbose logging |
|
|
| `output_dir` | string or null | `null` | Write outputs to files in this directory instead of stdout |
|
|
|
|
When `output_dir` is set, noxa writes results to files instead of printing them for the modes that support file output:
|
|
|
|
- single URL extraction
|
|
- multi-URL batch extraction
|
|
- crawl
|
|
- LLM extraction and summarization
|
|
- sitemap discovery
|
|
- diff output
|
|
- brand extraction
|
|
- research reports
|
|
- watch changes
|
|
|
|
File names are derived from the URL or mode name, and the directory is created on demand.
|
|
|
|
### Output Directory Layout
|
|
|
|
For URL-based output, noxa mirrors the URL path under `output_dir`:
|
|
|
|
| URL | Written file |
|
|
|---|---|
|
|
| `https://example.com/` | `output_dir/example_com/index.md` |
|
|
| `https://example.com/docs/api` | `output_dir/docs/api.md` |
|
|
| `https://example.com/docs/api/` | `output_dir/docs/api.md` |
|
|
| `https://example.com/blog/post?id=123` | `output_dir/blog/post_id_123.md` |
|
|
|
|
The extension comes from the selected output format:
|
|
|
|
| Format | Extension |
|
|
|---|---|
|
|
| `markdown` | `.md` |
|
|
| `llm` | `.md` |
|
|
| `json` | `.json` |
|
|
| `text` | `.txt` |
|
|
| `html` | `.html` |
|
|
|
|
For `--urls-file`, a CSV entry of `url,filename` uses the custom filename instead of the URL-derived name.
|
|
|
|
Examples:
|
|
|
|
```txt
|
|
https://example.com/docs/api,api.md
|
|
https://example.com/blog/post
|
|
```
|
|
|
|
Becomes:
|
|
|
|
```txt
|
|
output_dir/api.md
|
|
output_dir/blog/post.md
|
|
```
|
|
|
|
Mode-specific outputs use fixed filenames in the root of `output_dir`:
|
|
|
|
| Mode | File |
|
|
|---|---|
|
|
| `--map` | `sitemap.json` or `sitemap.txt` |
|
|
| `--diff-with` | `diff.json` or `diff.txt` |
|
|
| `--brand` | `brand.json` |
|
|
| `--research` | `research-<slug>.json` |
|
|
| `--watch` | `watch-<timestamp>.json` |
|
|
|
|
The directory tree is created automatically, so nested paths do not need to exist ahead of time.
|
|
|
|
### Fetch
|
|
|
|
| Key | Type | Default | Notes |
|
|
|---|---|---:|---|
|
|
| `browser` | string | `chrome` | One of `chrome`, `firefox`, `random` |
|
|
| `timeout` | integer | `30` | Request timeout in seconds |
|
|
| `pdf_mode` | string | `auto` | One of `auto`, `fast` |
|
|
| `only_main_content` | boolean | `false` | Auto-detect the main content area |
|
|
|
|
### Content Filtering
|
|
|
|
| Key | Type | Default | Notes |
|
|
|---|---|---:|---|
|
|
| `include_selectors` | array of strings | `[]` | CSS selectors to include |
|
|
| `exclude_selectors` | array of strings | `[]` | CSS selectors to exclude |
|
|
|
|
### Crawl
|
|
|
|
| Key | Type | Default | Notes |
|
|
|---|---|---:|---|
|
|
| `depth` | integer | `1` | Crawl depth |
|
|
| `max_pages` | integer | `20` | Maximum pages to crawl |
|
|
| `concurrency` | integer | `5` | Concurrent requests |
|
|
| `delay` | integer | `100` | Delay between requests in ms |
|
|
| `path_prefix` | string or null | `null` | Only crawl URLs whose path starts with this prefix |
|
|
| `include_paths` | array of strings | `[]` | Glob patterns to include |
|
|
| `exclude_paths` | array of strings | `[]` | Glob patterns to exclude |
|
|
| `use_sitemap` | boolean | `false` | Seed the crawl from sitemap discovery |
|
|
|
|
### LLM
|
|
|
|
| Key | Type | Default | Notes |
|
|
|---|---|---:|---|
|
|
| `llm_provider` | string | unset | Optional provider name: `gemini`, `ollama`, `openai`, `anthropic` |
|
|
| `llm_model` | string | unset | Optional model override |
|
|
|
|
## Environment Variables
|
|
|
|
| Variable | Purpose | Notes |
|
|
|---|---|---|
|
|
| `NOXA_API_KEY` | Cloud API key | Used for cloud fallback and cloud-only features |
|
|
| `NOXA_PROXY` | Single proxy URL | Takes priority over proxy file when set |
|
|
| `NOXA_PROXY_FILE` | Proxy pool file path | One proxy per line |
|
|
| `NOXA_WEBHOOK_URL` | Notification webhook | Used by watch/crawl/batch notifications |
|
|
| `NOXA_LLM_BASE_URL` | LLM endpoint URL | For Ollama or OpenAI-compatible endpoints |
|
|
| `NOXA_LLM_PROVIDER` | Default LLM provider | Environment override for the provider name |
|
|
| `NOXA_LLM_MODEL` | Default LLM model | Environment override for the model name |
|
|
| `NOXA_CONFIG` | Config file path | Override `./config.json` or bypass with `/dev/null` |
|
|
|
|
The following variables are not part of the `config.json` contract, but they still matter for LLM provider behavior:
|
|
|
|
- `OPENAI_API_KEY`
|
|
- `ANTHROPIC_API_KEY`
|
|
- `OLLAMA_HOST`
|
|
- `OLLAMA_MODEL`
|
|
- `GEMINI_MODEL`
|
|
|
|
## Example
|
|
|
|
`config.example.json` shows the recommended baseline:
|
|
|
|
```json
|
|
{
|
|
"$schema": "./config.schema.json",
|
|
"_doc": [
|
|
"Copy to config.json and remove fields you don't need.",
|
|
"Secrets (api_key, proxy, webhook, llm_base_url) go in .env — NOT here."
|
|
],
|
|
"format": "markdown",
|
|
"browser": "chrome",
|
|
"timeout": 30,
|
|
"pdf_mode": "auto",
|
|
"metadata": false,
|
|
"verbose": false,
|
|
"only_main_content": false,
|
|
"include_selectors": [],
|
|
"exclude_selectors": ["nav", "footer", ".sidebar", ".cookie-banner"],
|
|
"depth": 1,
|
|
"max_pages": 20,
|
|
"concurrency": 5,
|
|
"delay": 100,
|
|
"path_prefix": null,
|
|
"include_paths": [],
|
|
"exclude_paths": ["/changelog/*", "/blog/*", "/releases/*"],
|
|
"use_sitemap": false,
|
|
"llm_provider": "gemini",
|
|
"llm_model": "gemini-2.5-pro"
|
|
}
|
|
```
|
|
|
|
## Gotchas
|
|
|
|
- `config.json` is permissive by design: unknown fields are ignored so newer config files still work on older binaries.
|
|
- `llm_provider` is validated by the CLI at runtime; invalid values will fail when the provider is selected.
|
|
- `browser`, `timeout`, `depth`, `max_pages`, `concurrency`, and `delay` are ordinary defaults, so CLI flags can override them per run.
|
|
- Boolean defaults set to `true` in config are sticky for that run unless you bypass the file.
|
|
|
|
## Related Files
|
|
|
|
- [`config.schema.json`](../config.schema.json)
|
|
- [`config.example.json`](../config.example.json)
|
|
- [`env.example`](../env.example)
|