rawdoc

Fetch web pages as clean markdown for AI coding agents.

Single Go binary. One dependency (x/net/html). Fetches HTML, strips noise, outputs markdown. Works as a CLI, MCP server, and Claude Code plugin.

Install

Claude Code Plugin (recommended)

/install-plugin RandomCodeSpace/rawdoc

Adds /rawdoc and /rawdoc-crawl slash commands plus rawdoc_fetch and rawdoc_crawl MCP tools. The setup hook builds the binary automatically — requires Go 1.25+.

CLI

go install github.com/RandomCodeSpace/rawdoc@latest

MCP Server

rawdoc --serve

Runs as a JSON-RPC stdio server implementing the Model Context Protocol. Exposes rawdoc_fetch and rawdoc_crawl tools. See Manual MCP Setup below for configuration.

What It Does

Fetches HTML via plain HTTP with browser-like headers
Strips noise — scripts, styles, navbars, footers, ads, cookie banners, hidden elements
Extracts main content using site-specific selectors or readability scoring
Converts to clean markdown (headings, code blocks, tables, lists)
Crawls linked pages when given a depth > 0

Token reduction versus raw HTML can exceed 95%, though it varies by page (the sample under Verbose Output shows 69%). Works on server-rendered sites. JS-only SPAs are not supported.
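
The first step above, a plain HTTP fetch with browser-like headers, can be sketched with the standard library alone. This is a minimal illustration, not rawdoc's actual code: the function name `newPageRequest` and the specific header values are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// newPageRequest builds a GET request that carries browser-like headers.
// The header values below are illustrative assumptions, not the ones
// rawdoc actually sends.
func newPageRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")
	req.Header.Set("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
	req.Header.Set("Accept-Language", "en-US,en;q=0.9")
	return req, nil
}

func main() {
	req, err := newPageRequest("https://example.com")
	if err != nil {
		fmt.Println("bad URL:", err)
		return
	}
	// Passing req to (&http.Client{Timeout: 15 * time.Second}).Do would
	// perform the fetch; here we only inspect the request.
	fmt.Println(req.Method, req.URL, req.Header.Get("Accept"))
}
```

Because no browser engine is involved, a page whose content is produced by client-side JavaScript would come back without that content, which is the limitation noted above.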

Usage

# Single page → stdout
rawdoc https://kubernetes.io/docs/concepts/workloads/pods/

# Just the code blocks
rawdoc https://www.baeldung.com/spring-kafka --code-only

# JSON output with metadata
rawdoc https://pkg.go.dev/fmt -f json

# YAML output
rawdoc https://pkg.go.dev/fmt -f yaml

# Save to file
rawdoc https://example.com -o docs.md

# Crawl docs to a directory (depth 2; --max-pages defaults to 50)
rawdoc https://kubernetes.io/docs/concepts/workloads/ -d 2 -o ~/docs/k8s/

# Verbose — see fetch decisions and token stats
rawdoc https://www.baeldung.com/spring-kafka -v

# MCP server mode (stdio JSON-RPC)
rawdoc --serve

Verbose Output

[tier1] https://pkg.go.dev/fmt → fetching
[stats] input: 139.2KB (35634 tokens) → output: 43.5KB (11135 tokens) | 69% saved
[output] wrote json to docs.json

All verbose output goes to stderr. stdout stays clean for piping.

Flags

Output

Flag	Default	Description
`-o, --output`	stdout	File or directory
`-f, --format`	`markdown`	`markdown` `text` `json` `yaml`
`--code-only`	—	Extract only code blocks
`--no-links`	—	Strip link URLs, keep text only

Crawling

Flag	Default	Description
`-d, --depth`	`0`	Crawl depth (0 = single page)
`-c, --concurrency`	`5`	Parallel fetches
`--max-pages`	`50`	Page limit
`--delay`	`1s`	Delay between requests
`--include`	—	URL path glob to include
`--exclude`	—	URL path glob to exclude
`--sitemap`	—	Parse sitemap.xml for URL discovery

HTTP

Flag	Default	Description
`--timeout`	`15s`	Per-request timeout
`--max-time`	`10m`	Total runtime ceiling
`--max-retries`	`3`	Per-URL retries with exponential backoff
`--header K=V`	—	Extra header (repeatable)

Info

Flag	Default	Description
`-v, --verbose`	—	Fetch log and token stats to stderr
`-q, --quiet`	—	Suppress all stderr
`--serve`	—	Run as MCP stdio server
`--version`	—	Print version

Crawl Mode

rawdoc https://kubernetes.io/docs/concepts/workloads/ -d 2 --max-pages 50 -o ~/docs/k8s/

Writes one .md file per page plus an index.md:

~/docs/k8s/
├── index.md
├── workloads.md
├── workloads-pods.md
├── workloads-controllers-deployment.md
└── ...

Stays on the same domain. Respects --include/--exclude globs and --max-pages limit.

Output Formats

Format	Description
`markdown`	Headings, code blocks, tables, lists (default)
`text`	Plain text, no markup
`json`	Structured: url, title, content, code_blocks, fetch_tier, token count
`yaml`	Same fields as JSON
`--code-only`	Only fenced code blocks from the page

Site-Specific Selectors

Built-in content selectors for: Baeldung, Docusaurus, GitBook, ReadTheDocs, MkDocs, Spring.io, GitHub, MDN, pkg.go.dev, StackOverflow, Medium, Dev.to, Confluence, Notion.

Falls back to readability scoring when no selector matches.

Claude Code Plugin

What You Get

Component	Name	Description
Command	`/rawdoc <url>`	Fetch a page as markdown
Command	`/rawdoc-crawl <url> [depth]`	Crawl linked pages
MCP Tool	`rawdoc_fetch`	Programmatic single-page fetch
MCP Tool	`rawdoc_crawl`	Programmatic multi-page crawl

Install

/install-plugin RandomCodeSpace/rawdoc

The setup hook builds the Go binary automatically. Requires Go 1.25+.

MCP Tools

rawdoc_fetch — fetch a single page as markdown.

Parameter	Type	Required	Description
`url`	string	yes	URL to fetch
`format`	string	no	`markdown` (default), `text`, `json`, `yaml`
`code_only`	boolean	no	Extract only code blocks

rawdoc_crawl — crawl linked pages from a seed URL.

Parameter	Type	Required	Description
`url`	string	yes	Seed URL to crawl
`depth`	integer	no	Crawl depth (default: 1)
`max_pages`	integer	no	Max pages to fetch (default: 20)
`include`	string	no	URL path glob to include
`exclude`	string	no	URL path glob to exclude
`concurrency`	integer	no	Parallel fetches (default: 3)

Manual MCP Setup

If you prefer to configure the MCP server manually instead of using the plugin:

go install github.com/RandomCodeSpace/rawdoc@latest

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "rawdoc": {
      "command": "rawdoc",
      "args": ["--serve"]
    }
  }
}

Exit Codes

Code	Meaning
`0`	Success
`1`	Fetch failure
`2`	Usage error (bad flags, invalid URL)

Building

git clone https://github.com/RandomCodeSpace/rawdoc.git
cd rawdoc
go build -o rawdoc .

Cross-compile:

GOOS=linux   GOARCH=amd64 go build -o rawdoc-linux-amd64 .
GOOS=windows GOARCH=amd64 go build -o rawdoc.exe .
GOOS=darwin  GOARCH=arm64 go build -o rawdoc-darwin-arm64 .

Requires: Go 1.25+ | Dependencies: golang.org/x/net (only)

Limitations

JS-rendered pages (React SPAs, Next.js CSR, Angular) return empty content — rawdoc uses plain HTTP, not a browser
CAPTCHA/login-gated pages — returns whatever the public page shows
Single IP — not designed for large-scale scraping or proxy rotation
