Fetch web pages as clean markdown for AI coding agents.
Single Go binary. One dependency (x/net/html). Fetches HTML, strips noise, outputs markdown. Works as a CLI, MCP server, and Claude Code plugin.
/install-plugin RandomCodeSpace/rawdoc
Adds /rawdoc and /rawdoc-crawl slash commands plus rawdoc_fetch and rawdoc_crawl MCP tools. The setup hook builds the binary automatically — requires Go 1.25+.
go install github.com/RandomCodeSpace/rawdoc@latest
rawdoc --serve
Runs as a JSON-RPC stdio server implementing the Model Context Protocol. Exposes rawdoc_fetch and rawdoc_crawl tools. See Manual MCP Setup below for configuration.
- Fetches HTML via plain HTTP with browser-like headers
- Strips noise — scripts, styles, navbars, footers, ads, cookie banners, hidden elements
- Extracts main content using site-specific selectors or readability scoring
- Converts to clean markdown (headings, code blocks, tables, lists)
- Crawls linked pages when given a depth > 0
Token reduction vs raw HTML can exceed 95%, depending on the page. Works on server-rendered sites. JS-only SPAs are not supported.
# Single page → stdout
rawdoc https://kubernetes.io/docs/concepts/workloads/pods/
# Just the code blocks
rawdoc https://www.baeldung.com/spring-kafka --code-only
# JSON output with metadata
rawdoc https://pkg.go.dev/fmt -f json
# YAML output
rawdoc https://pkg.go.dev/fmt -f yaml
# Save to file
rawdoc https://example.com -o docs.md
# Crawl docs to a directory (depth=2, max 50 pages)
rawdoc https://kubernetes.io/docs/concepts/workloads/ -d 2 -o ~/docs/k8s/
# Verbose — see fetch decisions and token stats
rawdoc https://www.baeldung.com/spring-kafka -v
# MCP server mode (stdio JSON-RPC)
rawdoc --serve

[tier1] https://pkg.go.dev/fmt → fetching
[stats] input: 139.2KB (35634 tokens) → output: 43.5KB (11135 tokens) | 69% saved
[output] wrote json to docs.json
All verbose output goes to stderr. stdout stays clean for piping.
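The percentage on the `[stats]` line can be reproduced from its two token counts. The sketch below is only an illustration of that arithmetic; `savedPercent` is a hypothetical helper name, not a function taken from rawdoc's source, and how rawdoc estimates the token counts themselves is not covered here.

```go
package main

import (
	"fmt"
	"math"
)

// savedPercent returns the share of input tokens removed, rounded to the
// nearest whole percent. Hypothetical helper, not taken from rawdoc's source.
func savedPercent(inputTokens, outputTokens int) int {
	if inputTokens <= 0 {
		return 0
	}
	saved := 1 - float64(outputTokens)/float64(inputTokens)
	return int(math.Round(saved * 100))
}

func main() {
	// Token counts copied from the sample [stats] line above.
	fmt.Printf("%d%% saved\n", savedPercent(35634, 11135))
}
```

With the sample counts, 11135 / 35634 is about 0.3125, so roughly 68.75% of the tokens are removed, which rounds to the 69% shown.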
| Flag | Default | Description |
|---|---|---|
| -o, --output | stdout | File or directory |
| -f, --format | markdown | markdown, text, json, yaml |
| --code-only | — | Extract only code blocks |
| --no-links | — | Strip link URLs, keep text only |
| Flag | Default | Description |
|---|---|---|
| -d, --depth | 0 | Crawl depth (0 = single page) |
| -c, --concurrency | 5 | Parallel fetches |
| --max-pages | 50 | Page limit |
| --delay | 1s | Delay between requests |
| --include | — | URL path glob to include |
| --exclude | — | URL path glob to exclude |
| --sitemap | — | Parse sitemap.xml for URL discovery |
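To make the interaction between --depth and --max-pages concrete, here is a minimal breadth-first sketch over an in-memory link graph. It assumes depth counts link hops from the seed and that the page limit stops the crawl as soon as it is reached; `crawlOrder` and the sample graph are hypothetical, and rawdoc's actual traversal (which also fetches concurrently) may differ.

```go
package main

import "fmt"

// crawlOrder walks a link graph breadth-first from seed, visiting pages up to
// maxDepth links away and stopping after maxPages pages. The in-memory graph
// stands in for real fetching; this is an illustration, not rawdoc's code.
func crawlOrder(links map[string][]string, seed string, maxDepth, maxPages int) []string {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{seed: true}
	queue := []item{{seed, 0}}
	var visited []string

	for len(queue) > 0 && len(visited) < maxPages {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)

		if cur.depth == maxDepth {
			continue // depth limit reached: do not follow this page's links
		}
		for _, next := range links[cur.url] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return visited
}

func main() {
	// Hypothetical link graph used only for this example.
	links := map[string][]string{
		"/docs/":  {"/docs/a", "/docs/b"},
		"/docs/a": {"/docs/a/deep"},
	}
	fmt.Println(crawlOrder(links, "/docs/", 0, 50)) // depth 0: seed only
	fmt.Println(crawlOrder(links, "/docs/", 2, 50)) // follows links two hops out
	fmt.Println(crawlOrder(links, "/docs/", 2, 3))  // page limit cuts the crawl short
}
```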
| Flag | Default | Description |
|---|---|---|
| --timeout | 15s | Per-request timeout |
| --max-time | 10m | Total runtime ceiling |
| --max-retries | 3 | Per-URL retries with exponential backoff |
| --header K=V | — | Extra header (repeatable) |
| Flag | Default | Description |
|---|---|---|
| -v, --verbose | — | Fetch log and token stats to stderr |
| -q, --quiet | — | Suppress all stderr |
| --serve | — | Run as MCP stdio server |
| --version | — | Print version |
rawdoc https://kubernetes.io/docs/concepts/workloads/ -d 2 --max-pages 50 -o ~/docs/k8s/
Writes one .md file per page plus an index.md:
~/docs/k8s/
├── index.md
├── workloads.md
├── workloads-pods.md
├── workloads-controllers-deployment.md
└── ...
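The file names in the tree above look like the URL path relative to the seed's parent directory, with slashes replaced by hyphens. That rule is inferred from the example only, not confirmed from rawdoc's source; `pageFileName` is a hypothetical helper that sketches it.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// pageFileName maps a crawled page's URL path to a flat .md file name by
// dropping the seed's parent directory and joining the remaining segments
// with hyphens. The rule is inferred from the example tree, not from rawdoc.
func pageFileName(seedPath, pagePath string) string {
	parent := path.Dir(strings.TrimSuffix(seedPath, "/"))
	rel := strings.Trim(strings.TrimPrefix(pagePath, parent), "/")
	if rel == "" {
		return "root.md" // arbitrary fallback chosen for this sketch
	}
	return strings.ReplaceAll(rel, "/", "-") + ".md"
}

func main() {
	seed := "/docs/concepts/workloads/"
	for _, p := range []string{
		"/docs/concepts/workloads/",
		"/docs/concepts/workloads/pods/",
		"/docs/concepts/workloads/controllers/deployment/",
	} {
		fmt.Println(pageFileName(seed, p))
	}
}
```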
Stays on the same domain. Respects --include/--exclude globs and --max-pages limit.
| Format | Description |
|---|---|
| markdown | Headings, code blocks, tables, lists (default) |
| text | Plain text, no markup |
| json | Structured: url, title, content, code_blocks, fetch_tier, token count |
| yaml | Same fields as JSON |
| --code-only | Only fenced code blocks from the page |
Built-in content selectors for: Baeldung, Docusaurus, GitBook, ReadTheDocs, MkDocs, Spring.io, GitHub, MDN, pkg.go.dev, StackOverflow, Medium, Dev.to, Confluence, Notion.
Falls back to readability scoring when no selector matches.
| Component | Name | Description |
|---|---|---|
| Command | /rawdoc <url> | Fetch a page as markdown |
| Command | /rawdoc-crawl <url> [depth] | Crawl linked pages |
| MCP Tool | rawdoc_fetch | Programmatic single-page fetch |
| MCP Tool | rawdoc_crawl | Programmatic multi-page crawl |
/install-plugin RandomCodeSpace/rawdoc
The setup hook builds the Go binary automatically. Requires Go 1.25+.
rawdoc_fetch — fetch a single page as markdown.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | yes | URL to fetch |
| format | string | no | markdown (default), text, json, yaml |
| code_only | boolean | no | Extract only code blocks |
rawdoc_crawl — crawl linked pages from a seed URL.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | yes | Seed URL to crawl |
| depth | integer | no | Crawl depth (default: 1) |
| max_pages | integer | no | Max pages to fetch (default: 20) |
| include | string | no | URL path glob to include |
| exclude | string | no | URL path glob to exclude |
| concurrency | integer | no | Parallel fetches (default: 3) |
If you prefer to configure the MCP server manually instead of using the plugin:
go install github.com/RandomCodeSpace/rawdoc@latest
Add to ~/.claude/settings.json:
{
  "mcpServers": {
    "rawdoc": {
      "command": "rawdoc",
      "args": ["--serve"]
    }
  }
}

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Fetch failure |
| 2 | Usage error (bad flags, invalid URL) |
git clone https://github.com/RandomCodeSpace/rawdoc.git
cd rawdoc
go build -o rawdoc .
Cross-compile:
GOOS=linux GOARCH=amd64 go build -o rawdoc-linux-amd64 .
GOOS=windows GOARCH=amd64 go build -o rawdoc.exe .
GOOS=darwin GOARCH=arm64 go build -o rawdoc-darwin-arm64 .
Requires: Go 1.25+ | Dependencies: golang.org/x/net (only)
- JS-rendered pages (React SPAs, Next.js CSR, Angular) return empty content — rawdoc uses plain HTTP, not a browser
- CAPTCHA/login-gated pages — returns whatever the public page shows
- Single IP — not designed for large-scale scraping or proxy rotation