CrawlerScope is a public crawler intelligence dataset and static GitHub Pages dashboard for crawler, AI bot, SEO bot, monitoring probe, and scanner infrastructure.
The project aggregates operator-published IP ranges, normalizes them into CIDR prefixes, tracks source provenance, and publishes operational exports for gateways, analytics pipelines, SIEM enrichment, bot management, and infrastructure visibility.
Repository:
https://github.com/ipanalytics/CrawlerScope
Crawler infrastructure is fragmented across vendor JSON feeds, documentation pages, robots specifications, and unofficial community-maintained lists.
CrawlerScope consolidates those sources into a normalized operational dataset with:
- CIDR normalization
- source attribution
- operator metadata
- category classification
- service labeling
- export tooling
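As an illustration of the CIDR normalization step, the following is a minimal sketch using Python's standard `ipaddress` module. The function name `normalize_prefixes` is hypothetical and is not taken from the repository's scripts; the actual normalization logic may differ.

```python
import ipaddress


def normalize_prefixes(raw_entries):
    """Parse raw IP/CIDR strings and return collapsed CIDR strings.

    Hypothetical helper: illustrates one way to normalize upstream ranges.
    """
    v4, v6 = [], []
    for entry in raw_entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines from text feeds
        # strict=False accepts entries with host bits set, e.g. 192.0.2.5/24
        net = ipaddress.ip_network(entry, strict=False)
        (v4 if net.version == 4 else v6).append(net)
    # collapse_addresses merges overlapping and adjacent networks; it needs
    # inputs of a single IP version, so IPv4 and IPv6 are handled separately.
    merged = list(ipaddress.collapse_addresses(v4))
    merged += list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in merged]
```

Collapsing per IP version keeps the output deterministic, with IPv4 prefixes listed before IPv6 prefixes.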
The repository is designed for direct machine consumption and lightweight browser-based inspection.
- Googlebot
- Bingbot
- DuckDuckGo
- Applebot
- YandexBot
- Baiduspider
- OpenAI
- Anthropic
- Perplexity
- Meta
- Amazonbot
- Bytespider
- AhrefsBot
- SemrushBot
- Shodan
- Censys
- Datadog Synthetics
- Pingdom
- UptimeRobot
- Better Stack
- StatusCake
- Common Crawl
- Pinterestbot
- LinkedInBot
CrawlerScope separates datasets by source quality and publication model.
| Source Type | Description |
|---|---|
| official_json | Operator-published structured JSON |
| official_text | Operator-published text-based CIDR lists |
| documented_user_agent | Publicly documented crawler identity without authoritative IP feed |
| known_static | Operationally useful static ranges with limited authority guarantees |
This distinction is preserved in exports and dashboard filters.
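Consumers may want to treat operator-published feeds differently from lower-authority sources. The sketch below splits records on their source type; the field name `source_type` is an assumption made for illustration and should be checked against the actual schema of `crawlers.json`.

```python
# Source types that come directly from an operator-published feed.
OFFICIAL_SOURCE_TYPES = {"official_json", "official_text"}


def split_by_authority(records, field="source_type"):
    """Split records into (official, advisory) lists.

    `field` is an assumed key name, not confirmed by the dataset schema.
    """
    official, advisory = [], []
    for record in records:
        if record.get(field) in OFFICIAL_SOURCE_TYPES:
            official.append(record)
        else:
            advisory.append(record)
    return official, advisory
```

A gateway could, for example, enforce decisions only on the official list and use the advisory list for logging or enrichment.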
| Feature | Description |
|---|---|
| Interactive map | Country-level operator distribution |
| Category analytics | Operator/category mix charts |
| Cascading filters | Filter by category, operator, source type, or service |
| Full-text search | Search across operators, tags, URLs, and user-agents |
| Export generation | JSON, CSV, CIDR text, robots.txt, NGINX maps |
| Presets | AI crawlers, monitoring probes, official feeds |
| Service table | Sortable infrastructure inventory |
| Clipboard export | Copy filtered CIDR selections |
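To give a sense of what a robots.txt export can look like, here is a minimal sketch that renders one blocking group per user-agent token. The function `render_robots_block` and the token `ExampleBot` are hypothetical; the dashboard's own export format may differ.

```python
def render_robots_block(user_agent_tokens, disallow_path="/"):
    """Render robots.txt groups that disallow a path for each token.

    Hypothetical helper, not the dashboard's actual exporter.
    """
    lines = []
    for token in user_agent_tokens:
        lines.append(f"User-agent: {token}")
        lines.append(f"Disallow: {disallow_path}")
        lines.append("")  # a blank line separates groups
    return "\n".join(lines)


# "ExampleBot" is a placeholder token, not a real crawler.
print(render_robots_block(["ExampleBot"]))
```

Note that robots.txt is advisory: it only affects crawlers that choose to honor it.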
                   Public Sources
                          │
           ┌──────────────┼──────────────┐
           │              │              │
           ▼              ▼              ▼
      Vendor JSON   Documentation  Static Lists
           │              │              │
           └──────────────┴──────┬───────┘
                                 ▼
                        Normalization Layer
                       CIDR + metadata merge
                                 ▼
                       Classification Engine
                   category / tags / source type
                                 ▼
                          Export Pipeline
                    JSON / CSV / robots / nginx
                                 ▼
                         Static Dashboard
| File | Description |
|---|---|
| data/current/crawlers.json | Full normalized crawler dataset |
| data/current/robots-ai.txt | robots.txt snippets for AI crawlers |
| data/current/nginx-ai-map.conf | NGINX user-agent mapping |
| data/history/summary.csv | Historical build metrics |
| data/snapshots/*.json | Compact snapshot summaries |
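For consumers who prefer Python over shell tooling, the sketch below reads a downloaded copy of `crawlers.json` and extracts the prefixes for one category. The keys `records`, `category`, and `prefix` follow the jq usage example in this README; the helper names and the local file path are assumptions.

```python
import json


def load_dataset(path="crawlers.json"):
    """Load a locally downloaded copy of the dataset (path is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def prefixes_for_category(dataset, category):
    """Return the sorted, de-duplicated prefixes recorded for a category."""
    return sorted(
        {
            record["prefix"]
            for record in dataset.get("records", [])
            if record.get("category") == category and "prefix" in record
        }
    )
```

A typical call would be `prefixes_for_category(load_dataset(), "ai-crawler")`, which returns a list of CIDR strings. The sort is lexicographic, not numeric.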
curl -fsSLO \
  https://raw.githubusercontent.com/ipanalytics/CrawlerScope/main/data/current/crawlers.json

jq -r '
  .records[]
  | select(.category=="ai-crawler")
  | .prefix
' crawlers.json

curl -fsSL \
  https://raw.githubusercontent.com/ipanalytics/CrawlerScope/main/data/current/robots-ai.txt

include /etc/nginx/nginx-ai-map.conf;
if ($is_ai_crawler = 1) {
    return 403;
}

CrawlerScope/
├── .github/
│ └── workflows/
├── data/
│ ├── current/
│ ├── history/
│ └── snapshots/
├── public/
│ ├── assets/
│ └── index.html
├── scripts/
├── LICENSE
└── README.md
Generated site/ artifacts are intentionally excluded from version control.
python3 scripts/update.py

rm -rf site
cp -R public site
cp -R data site/data
python3 -m http.server 8080 --directory site

Open:
http://127.0.0.1:8080/
CrawlerScope is deployed through GitHub Actions.
Workflow:
.github/workflows/crawler-scope.yml
Pages configuration:
- Source: GitHub Actions
- Branch deployment is not required
- Generated assets are published from workflow artifacts
Default refresh interval:
schedule:
  - cron: "23 */6 * * *"

Most upstream crawler sources update daily or less frequently, so sub-hour refresh intervals generally provide limited value.
- IP inventories are only as complete as upstream disclosures
- User-Agent strings are trivially spoofable
- Some operators publish crawler identities without stable IP feeds
- Static/public ranges should be treated as operational hints, not authoritative truth
- Multiple services may legitimately share infrastructure prefixes
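Because User-Agent strings can be spoofed, a claimed crawler identity is more useful when checked against the source IP. The following is a minimal sketch of that check; the function names, the `prefixes_by_operator` mapping, and the three result labels are all hypothetical and not part of the dataset.

```python
import ipaddress


def ip_in_prefixes(ip, prefixes):
    """Return True if `ip` falls inside any of the given CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix, strict=False)
        # Compare only within the same IP version.
        if net.version == addr.version and addr in net:
            return True
    return False


def classify_request(claimed_operator, ip, prefixes_by_operator):
    """Compare a claimed operator (e.g. from the User-Agent) with the source IP.

    Returns "verified", "mismatch", or "unverifiable" (no known prefixes).
    """
    prefixes = prefixes_by_operator.get(claimed_operator)
    if not prefixes:
        return "unverifiable"
    return "verified" if ip_in_prefixes(ip, prefixes) else "mismatch"
```

Given the limitations above, a "verified" result from static or lower-authority ranges is best treated as a hint, and "unverifiable" is expected for operators that do not publish IP feeds.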
| Domain | Example |
|---|---|
| Bot Management | AI crawler detection and filtering |
| SIEM Enrichment | Infrastructure attribution |
| Analytics | Search and crawler traffic classification |
| WAF Pipelines | Allow/block automation logic |
| SEO Monitoring | Search crawler visibility |
| Threat Hunting | Scanner infrastructure correlation |
Planned additions:
- ASN-level crawler attribution
- Historical prefix diffing
- Provider overlap analysis
- Signed dataset releases
- Compressed bulk exports
- Additional crawler verification metadata
Licensed under CC0-1.0.
See LICENSE.
CrawlerScope aggregates publicly available infrastructure information for operational and analytical use. Consumers are responsible for validating suitability within their own environments.
