Documentation Scraper

Overview

The netapi docs command group scrapes web documentation and converts it to AsciiDoc or Markdown using pandoc.

Prerequisites

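pandoc is required for format conversion. On Arch Linux, install it with the command below (for other platforms, see "Pandoc not found" under Troubleshooting):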

sudo pacman -S pandoc
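Before starting a long scrape, it can help to confirm that pandoc is actually on the PATH. The following is a minimal sketch in POSIX shell; the helper name require_cmd is hypothetical and not part of netapi.

```shell
# Hypothetical helper (not part of netapi): succeed only if a command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    printf 'missing dependency: %s\n' "$1" >&2
    return 1
}

# Print the installed pandoc version, or a warning on stderr if it is absent.
if require_cmd pandoc; then
    pandoc --version | head -n 1
fi
```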

Commands

scrape

Scrape a single documentation page:

netapi docs scrape -u "https://wiki.archlinux.org/title/Pacman" -f asciidoc -o ./output

Options:

Option       Short    Description
--url        -u       URL to scrape (required)
--format     -f       Output format: asciidoc, markdown (default: asciidoc)
--output     -o       Output directory (default: ./docs-output)
--filename   (none)   Custom output filename (auto-generated if not set)
--delay      -d       Delay between requests in seconds (default: 1.0)
--selector   -s       Custom CSS selector for content extraction

scrape-guide

Scrape an entire documentation guide (all chapters):

netapi docs scrape-guide -u "https://docs.example.com/guide/" -o ./guide-output

This command:

  1. Fetches the index/TOC page

  2. Discovers all chapter links

  3. Scrapes each chapter sequentially

  4. Creates an index file linking all chapters
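Steps 1 and 2 above can be approximated by hand when checking what a guide's index page links to. This is a rough sketch, not netapi's actual discovery logic: the helper names extract_hrefs and chapter_urls are hypothetical, and a regular expression is only a stand-in for a real HTML parser.

```shell
# Hypothetical helper: print every href value found in HTML read from stdin.
extract_hrefs() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Hypothetical helper: keep links that stay inside the guide.
# $1 is the index URL, assumed to end with a slash.
chapter_urls() {
    base=$1
    extract_hrefs | while IFS= read -r href; do
        case $href in
            "$base"*) printf '%s\n' "$href" ;;          # already absolute, inside the guide
            *://*|mailto:*|/*|'#'*) ;;                  # other sites, root paths, anchors: skip
            *) printf '%s%s\n' "$base" "$href" ;;       # relative link: resolve against the index
        esac
    done
}

# Demo on an inline HTML fragment; prints one URL per discovered chapter.
printf '%s\n' '<a href="intro.html">Intro</a> <a href="#top">Top</a>' |
    chapter_urls "https://docs.example.com/guide/"
```

Each printed URL could then be passed to netapi docs scrape -u in a loop, with a sleep between calls to mirror step 3.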

Options:

Option     Short   Description
--url      -u      Guide index page URL (required)
--format   -f      Output format (default: asciidoc)
--output   -o      Output directory (default: ./docs-output)
--delay    -d      Delay between requests (default: 1.5)

ise

Scrape the Cisco ISE Administrator Guide:

# ISE 3.2 Admin Guide
netapi docs ise --version 3-2 -o ./ise-docs

# ISE 3.3 in Markdown
netapi docs ise --version 3-3 -f markdown -o ./ise-docs

Example output:

Scraping ISE 3-2 Admin Guide
URL: https://www.cisco.com/c/en/us/td/docs/security/ise/3-2/admin_guide/b_ise_admin_3_2.html

[1/47] Introduction to Cisco ISE...
[2/47] Getting Started...
...
Scraped 47 files to: ./ise-docs
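The URL in the example output suggests how the --version value maps to a guide address: the hyphenated version appears once as a path segment and once with an underscore in the file name. A minimal sketch of that mapping follows. The helper name ise_guide_url is hypothetical, and the pattern is inferred from that single 3-2 URL, so it is an assumption for any other version.

```shell
# Hypothetical helper: build the admin guide URL for an ISE version such as 3-2.
# Assumption: other versions follow the same pattern as the 3-2 URL shown above.
ise_guide_url() {
    version=$1                                              # e.g. 3-2
    underscored=$(printf '%s' "$version" | sed 's/-/_/g')   # e.g. 3_2
    printf 'https://www.cisco.com/c/en/us/td/docs/security/ise/%s/admin_guide/b_ise_admin_%s.html\n' \
        "$version" "$underscored"
}

ise_guide_url 3-2
```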

Options:

Option      Short   Description
--version   -v      ISE version (e.g., 3-2, 3-3) (default: 3-2)
--format    -f      Output format (default: asciidoc)
--output    -o      Output directory (default: ./ise-docs)
--delay     -d      Delay between requests (default: 1.5)

github

Scrape documentation from a GitHub repository:

# Scrape Flask docs
netapi docs github pallets/flask --path docs/ -o ./flask-docs

# Full URL also works
netapi docs github https://github.com/ansible/ansible --path docs/

# Different branch
netapi docs github owner/repo --branch develop --path docs/
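Since both the owner/repo shorthand and a full URL are accepted, scripts that wrap this command may want a single canonical form. The sketch below shows one way to reduce either input to owner/repo. The helper name normalize_repo is hypothetical, and this is not necessarily how netapi parses its argument.

```shell
# Hypothetical helper: reduce a GitHub URL or owner/repo shorthand to owner/repo.
normalize_repo() {
    repo=$1
    repo=${repo#https://github.com/}   # drop the URL prefix if present
    repo=${repo%/}                     # drop a trailing slash
    repo=${repo%.git}                  # drop a .git suffix
    printf '%s\n' "$repo"
}

normalize_repo "https://github.com/ansible/ansible"
normalize_repo "pallets/flask"
```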

Options:

Option     Short   Description
--format   -f      Output format (default: asciidoc)
--output   -o      Output directory (default: ./github-docs)
--branch   -b      Branch name (default: main)
--path     -p      Subdirectory path (e.g., docs/)

arch

Scrape a page from the Arch Linux Wiki:

netapi docs arch Pacman -o ./arch-docs
netapi docs arch "SSH keys" -o ./arch-docs
netapi docs arch Installation_guide -f markdown
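The examples pass page titles both with spaces ("SSH keys") and with underscores (Installation_guide). On the Arch Wiki, as on other MediaWiki sites, spaces in a title become underscores in the page URL. A minimal sketch of that mapping, using the /title/ URL form shown in the scrape example; the helper name arch_wiki_url is hypothetical and not part of netapi.

```shell
# Hypothetical helper: turn an Arch Wiki page title into its URL.
arch_wiki_url() {
    title=$(printf '%s' "$1" | tr ' ' '_')   # MediaWiki uses underscores for spaces
    printf 'https://wiki.archlinux.org/title/%s\n' "$title"
}

arch_wiki_url "SSH keys"
arch_wiki_url Installation_guide
```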

Options:

Option     Short   Description
--format   -f      Output format (default: asciidoc)
--output   -o      Output directory (default: ./arch-docs)

Supported Sites

The scraper automatically picks a content CSS selector for these common documentation sites, based on the domain of the URL:

Domain                Selector                Notes
wiki.archlinux.org    #mw-content-text        Wiki articles
man.archlinux.org     main.container          Man pages
docs.github.com       article.markdown-body   GitHub documentation
github.com            article.markdown-body   README files
developer.cisco.com   article.content         DevNet documentation
readthedocs.io        div.document            ReadTheDocs projects
palletsprojects.com   div.document            Flask, Jinja, etc.
docs.python.org       div.body                Python documentation
kernel.org            div.kerneldoc           Kernel docs
man7.org              pre                     Linux man pages
imslp.org             #wiki-body              Sheet music wiki

For unlisted sites, use --selector to specify a custom CSS selector.
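To make the lookup concrete, the table can be expressed as a shell case statement keyed on the host name. This is only a sketch: the helper name selector_for is hypothetical, and matching subdomains by suffix (for example flask.palletsprojects.com) is an assumption about how the domains in the table are applied.

```shell
# Hypothetical helper: print the selector from the table for a given host.
# Returns non-zero for hosts that are not listed.
selector_for() {
    case $1 in
        wiki.archlinux.org)                      printf '%s\n' '#mw-content-text' ;;
        man.archlinux.org)                       printf '%s\n' 'main.container' ;;
        docs.github.com|github.com)              printf '%s\n' 'article.markdown-body' ;;
        developer.cisco.com)                     printf '%s\n' 'article.content' ;;
        readthedocs.io|*.readthedocs.io)         printf '%s\n' 'div.document' ;;
        palletsprojects.com|*.palletsprojects.com) printf '%s\n' 'div.document' ;;
        docs.python.org)                         printf '%s\n' 'div.body' ;;
        kernel.org|*.kernel.org)                 printf '%s\n' 'div.kerneldoc' ;;
        man7.org|*.man7.org)                     printf '%s\n' 'pre' ;;
        imslp.org|*.imslp.org)                   printf '%s\n' '#wiki-body' ;;
        *)                                       return 1 ;;
    esac
}

selector_for wiki.archlinux.org
selector_for flask.palletsprojects.com
```

When selector_for fails, the host is unlisted and a value for --selector has to be chosen by inspecting the page.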

Output Formats

Format     Description
asciidoc   AsciiDoc (default) - Best for Antora integration
markdown   GitHub-Flavored Markdown - Compatible with Obsidian, GitHub, etc.

Examples

Build Local ISE Reference

# Scrape ISE 3.2 admin guide
netapi docs ise --version 3-2 -o ~/docs/ise-admin

# Result: 47 AsciiDoc files with index
ls ~/docs/ise-admin/
# 00-index.adoc
# 01-introduction-to-cisco-ise.adoc
# 02-getting-started.adoc
# ...
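The file names in this listing look like a zero-padded chapter number followed by a lowercased, hyphenated form of the chapter title. The sketch below reproduces that naming for scripts that need to predict or match output files. The helper name chapter_filename is hypothetical, and the exact rules netapi applies (for instance to punctuation) are an assumption.

```shell
# Hypothetical helper: build NN-slug.ext from a chapter number, title and extension.
# Pass the number without leading zeros (printf would read 08 as invalid octal).
chapter_filename() {
    index=$1
    title=$2
    ext=$3
    # Lowercase the title, collapse runs of other characters to one hyphen,
    # then trim a hyphen left at either end.
    slug=$(printf '%s' "$title" |
        tr '[:upper:]' '[:lower:]' |
        sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
    printf '%02d-%s.%s\n' "$index" "$slug" "$ext"
}

chapter_filename 1 "Introduction to Cisco ISE" adoc
```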

Offline Arch Wiki

# Scrape common pages
for page in Pacman Systemd SSH "Network configuration" Btrfs; do
    netapi docs arch "$page" -o ~/docs/arch-wiki
done

Mirror GitHub Docs

# Scrape Ansible documentation
netapi docs github ansible/ansible --path docs/docsite/rst/ -o ~/docs/ansible

Troubleshooting

No content extracted

Try specifying a custom selector:

# Inspect page structure first
curl -s "https://example.com/docs" | grep -E '<article|<main|<div.*content'

# Then use appropriate selector
netapi docs scrape -u "https://example.com/docs" -s "div.main-content"

Pandoc not found

# Arch Linux
sudo pacman -S pandoc

# Ubuntu/Debian
sudo apt install pandoc

# macOS
brew install pandoc

Rate limiting

Increase the delay between requests:

netapi docs scrape-guide -u "https://docs.example.com" --delay 3.0
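If a site still rejects requests occasionally, a wrapper that retries with a growing pause can complement a larger --delay. This is a minimal sketch under one stated assumption: that netapi exits with a non-zero status when a scrape fails. The helper name retry_with_backoff is hypothetical.

```shell
# Hypothetical helper: run a command, doubling the pause after each failure.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_PAUSE_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
    max=$1
    pause=$2      # whole seconds; shell arithmetic is integer-only
    shift 2
    attempt=1
    while :; do
        "$@" && return 0                        # success: stop retrying
        [ "$attempt" -ge "$max" ] && return 1   # out of attempts: give up
        sleep "$pause"
        pause=$((pause * 2))
        attempt=$((attempt + 1))
    done
}
```

For example, retry_with_backoff 3 5 netapi docs scrape-guide -u "https://docs.example.com" --delay 3.0 would try up to three times, pausing 5 and then 10 seconds between attempts.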