Documentation Scraper
Overview
The netapi docs command group provides a universal documentation scraper that converts web documentation to AsciiDoc or Markdown using pandoc.
Commands
scrape
Scrape a single documentation page:
netapi docs scrape -u "https://wiki.archlinux.org/title/Pacman" -f asciidoc -o ./output
Options:
| Option | Short | Description |
|---|---|---|
| | -u | URL to scrape (required) |
| | -f | Output format: asciidoc, markdown (default: asciidoc) |
| | -o | Output directory (default: ./docs-output) |
| | | Custom output filename (auto-generated if not set) |
| | | Delay between requests in seconds (default: 1.0) |
| --selector | -s | Custom CSS selector for content extraction |
scrape-guide
Scrape an entire documentation guide (all chapters):
netapi docs scrape-guide -u "https://docs.example.com/guide/" -o ./guide-output
This command:
- Fetches the index/TOC page
- Discovers all chapter links
- Scrapes each chapter sequentially
- Creates an index file linking all chapters
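The link-discovery and index steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tool's actual implementation: the names LinkCollector, discover_chapter_links, and build_index are hypothetical, and the rule "a chapter is any link below the index URL" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_chapter_links(index_html, index_url):
    """Return absolute chapter URLs below index_url, de-duplicated in order."""
    parser = LinkCollector()
    parser.feed(index_html)
    seen, chapters = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        url, _fragment = urldefrag(urljoin(index_url, href))
        # Assumption: chapters live below the guide root; skip the index itself.
        if url.startswith(index_url) and url != index_url and url not in seen:
            seen.add(url)
            chapters.append(url)
    return chapters


def build_index(chapters):
    """Render a numbered list of chapter URLs, one per line."""
    return "\n".join(f"{n:02d}. {url}" for n, url in enumerate(chapters, start=1))
```

Calling discover_chapter_links(html, "https://docs.example.com/guide/") on a fetched index page would yield the ordered chapter list that a sequential scrape could then walk.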
Options:
| Option | Short | Description |
|---|---|---|
| | -u | Guide index page URL (required) |
| | -f | Output format (default: asciidoc) |
| | -o | Output directory (default: ./docs-output) |
| | | Delay between requests (default: 1.5) |
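The delay option spaces out requests so that a sequential scrape does not hammer the server. A minimal sketch of that pattern, assuming a caller-supplied fetch function (fetch_all and its parameters are hypothetical names, not part of netapi):

```python
import time


def fetch_all(urls, fetch, delay=1.5, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # no pause before the first request
        results.append(fetch(url))
    return results
```

Passing sleep as a parameter is a design convenience: tests can substitute a recorder instead of actually waiting.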
ise
Scrape the Cisco ISE Administrator Guide:
# ISE 3.2 Admin Guide
netapi docs ise --version 3-2 -o ./ise-docs
# ISE 3.3 in Markdown
netapi docs ise --version 3-3 -f markdown -o ./ise-docs
Scraping ISE 3-2 Admin Guide
URL: https://www.cisco.com/c/en/us/td/docs/security/ise/3-2/admin_guide/b_ise_admin_3_2.html
[1/47] Introduction to Cisco ISE...
[2/47] Getting Started...
...
Scraped 47 files to: ./ise-docs
Options:
| Option | Short | Description |
|---|---|---|
| --version | | ISE version (e.g., 3-2, 3-3) (default: 3-2) |
| | -f | Output format (default: asciidoc) |
| | -o | Output directory (default: ./ise-docs) |
| | | Delay between requests (default: 1.5) |
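The sample output above shows the guide URL for version 3-2. One plausible way to derive that URL from the --version value is sketched below. The helper name ise_admin_guide_url is hypothetical, and the assumption that other versions follow the same URL pattern is not confirmed by this document.

```python
def ise_admin_guide_url(version="3-2"):
    """Build the admin guide URL for a version string such as '3-2'.

    Assumption: every version uses the path pattern seen for 3-2.
    """
    underscored = version.replace("-", "_")  # '3-2' -> '3_2' for the file name
    return (
        "https://www.cisco.com/c/en/us/td/docs/security/ise/"
        f"{version}/admin_guide/b_ise_admin_{underscored}.html"
    )
```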
github
Scrape documentation from a GitHub repository:
# Scrape Flask docs
netapi docs github pallets/flask --path docs/ -o ./flask-docs
# Full URL also works
netapi docs github https://github.com/ansible/ansible --path docs/
# Different branch
netapi docs github owner/repo --branch develop --path docs/
Options:
| Option | Short | Description |
|---|---|---|
| | -f | Output format (default: asciidoc) |
| | -o | Output directory (default: ./github-docs) |
| --branch | | Branch name (default: main) |
| --path | | Subdirectory path (e.g., docs/) |
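Because the command accepts either owner/repo or a full GitHub URL, the argument has to be normalized before use. A minimal sketch of such normalization (parse_repo_spec is a hypothetical helper, and stripping a trailing .git is an added assumption):

```python
from urllib.parse import urlparse


def parse_repo_spec(spec):
    """Return (owner, repo) from 'owner/repo' or a full repository URL."""
    path = urlparse(spec).path if "://" in spec else spec
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected owner/repo, got {spec!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]  # assumption: tolerate clone-style URLs
    return owner, repo
```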
arch
Scrape a page from the Arch Linux Wiki:
netapi docs arch Pacman -o ./arch-docs
netapi docs arch "SSH keys" -o ./arch-docs
netapi docs arch Installation_guide -f markdown
Options:
| Option | Short | Description |
|---|---|---|
| | -f | Output format (default: asciidoc) |
| | -o | Output directory (default: ./arch-docs) |
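The page argument is a wiki title, which may contain spaces. A sketch of how a title could map onto a wiki.archlinux.org URL, assuming the usual MediaWiki convention of replacing spaces with underscores (arch_wiki_url is a hypothetical helper):

```python
from urllib.parse import quote


def arch_wiki_url(title):
    """Map a page title such as 'SSH keys' to its Arch Wiki URL."""
    page = title.strip().replace(" ", "_")  # MediaWiki uses underscores for spaces
    return "https://wiki.archlinux.org/title/" + quote(page, safe="/:")
```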
Supported Sites
The scraper auto-detects optimal CSS selectors for common documentation sites:
| Domain | Selector | Notes |
|---|---|---|
| | | Wiki articles |
| | | Man pages |
| | | GitHub documentation |
| | | README files |
| | | DevNet documentation |
| | | ReadTheDocs projects |
| | | Flask, Jinja, etc. |
| | | Python documentation |
| | | Kernel docs |
| | | Linux man pages |
| | | Sheet music wiki |
For unlisted sites, use --selector to specify a custom CSS selector.
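Selector auto-detection with a manual override might work along these lines. This is a sketch only: pick_selector is a hypothetical name, and the domains, selectors, and the body fallback in the mapping are placeholders, not the scraper's actual table.

```python
from urllib.parse import urlparse

# Hypothetical entries for illustration; not the tool's real mapping.
SITE_SELECTORS = {
    "docs.example.com": "main",
    "wiki.example.org": "div#content",
}
DEFAULT_SELECTOR = "body"  # assumed fallback when nothing matches


def pick_selector(url, override=None):
    """Return the CSS selector to use for url; an explicit override wins."""
    if override:
        return override
    host = urlparse(url).hostname or ""
    # Match the host itself or any subdomain of a listed domain.
    for domain, selector in SITE_SELECTORS.items():
        if host == domain or host.endswith("." + domain):
            return selector
    return DEFAULT_SELECTOR
```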
Output Formats
| Format | Description |
|---|---|
| asciidoc | AsciiDoc (default) - Best for Antora integration |
| markdown | GitHub-Flavored Markdown - Compatible with Obsidian, GitHub, etc. |
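Since conversion is delegated to pandoc, each output format presumably maps to a pandoc writer. The sketch below shows one way to do that with pandoc's standard --from/--to options and its asciidoc and gfm writers. The function names are hypothetical, the exact arguments netapi passes are not documented here, and the .md extension is an assumption.

```python
import subprocess

# Output format name -> pandoc writer name.
PANDOC_WRITERS = {"asciidoc": "asciidoc", "markdown": "gfm"}
# Output format name -> file extension (.md is an assumption).
EXTENSIONS = {"asciidoc": ".adoc", "markdown": ".md"}


def pandoc_args(fmt="asciidoc"):
    """Build the pandoc command line for converting HTML to fmt."""
    return ["pandoc", "--from", "html", "--to", PANDOC_WRITERS[fmt]]


def convert_html(html, fmt="asciidoc"):
    """Pipe HTML through pandoc on stdin and return the converted text."""
    result = subprocess.run(
        pandoc_args(fmt), input=html, capture_output=True, text=True, check=True
    )
    return result.stdout
```

convert_html requires pandoc on the PATH; pandoc_args can be inspected without it.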
Examples
Build Local ISE Reference
# Scrape ISE 3.2 admin guide
netapi docs ise --version 3-2 -o ~/docs/ise-admin
# Result: 47 AsciiDoc files with index
ls ~/docs/ise-admin/
# 00-index.adoc
# 01-introduction-to-cisco-ise.adoc
# 02-getting-started.adoc
# ...
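The file names in the listing suggest a zero-padded chapter number followed by a slug of the chapter title. A sketch that reproduces that naming (chapter_filename is a hypothetical helper, and the exact slug rules are inferred from the listing rather than documented):

```python
import re


def chapter_filename(number, title, extension=".adoc"):
    """Build a name like '01-introduction-to-cisco-ise.adoc' from a chapter title."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{number:02d}-{slug}{extension}"
```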
Troubleshooting
No content extracted
Try specifying a custom selector:
# Inspect page structure first
curl -s "https://example.com/docs" | grep -E '<article|<main|<div.*content'
# Then use appropriate selector
netapi docs scrape -u "https://example.com/docs" -s "div.main-content"
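The same inspection can be done programmatically when the grep output is too noisy. The sketch below lists candidate selectors from already-downloaded HTML using only the standard library. ContainerFinder and candidate_selectors are hypothetical names, and the heuristic (article, main, or a div whose class contains "content") simply mirrors the grep pattern above.

```python
from html.parser import HTMLParser


class ContainerFinder(HTMLParser):
    """Record likely content containers as CSS selectors, in document order."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag in ("article", "main"):
            self.candidates.append(tag)
        elif tag == "div":
            for cls in classes:
                if "content" in cls:
                    self.candidates.append(f"div.{cls}")


def candidate_selectors(html):
    """Return selectors worth trying with the custom selector option."""
    finder = ContainerFinder()
    finder.feed(html)
    return finder.candidates
```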