Phase 1: Document Pipeline

Extract structured text from PDFs, scanned images, and office or web documents (DOCX, HTML, EPUB) using CLI tools, then build batch processing workflows for bulk document libraries.

Core Workflows

PDF Text Extraction

# Single file — preserve layout
pdftotext -layout document.pdf -

# Batch — all PDFs in directory to .txt
find /path/to/docs -name '*.pdf' -exec sh -c \
  'pdftotext -layout "$1" "${1%.pdf}.txt"' _ {} \;

# Extract specific pages
pdftotext -f 5 -l 10 document.pdf -
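
For large libraries, the batch command above can be wrapped so that reruns skip files that are already converted and a single bad PDF does not stop the run. The following is a minimal sketch, not a definitive implementation: the function name extract_pdfs and the default log file name are hypothetical, and the loop assumes file names contain no newline characters.

```shell
# Hypothetical helper: extract_pdfs DIR [LOGFILE]
# Converts every PDF under DIR to a sibling .txt file, skips outputs that are
# already newer than their source, and logs failures instead of stopping.
extract_pdfs() {
  dir=$1
  log=${2:-extract-failures.log}   # assumed default log name
  # Note: file names containing newlines are not handled by this loop.
  find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
    txt="${pdf%.pdf}.txt"
    # Skip when the text file exists and is newer than the PDF.
    if [ -f "$txt" ] && [ "$txt" -nt "$pdf" ]; then
      continue
    fi
    # Record the failing path and carry on with the next file.
    if ! pdftotext -layout "$pdf" "$txt"; then
      printf '%s\n' "$pdf" >> "$log"
    fi
  done
}
```

Usage: extract_pdfs /path/to/docs failures.log. Running it a second time only touches PDFs that changed or failed earlier.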

PDF Metadata & Image Extraction

# Metadata inspection
pdfinfo document.pdf

# Extract all embedded images (the output directory must exist first)
mkdir -p ./extracted
pdfimages -all document.pdf ./extracted/img

# Count pages across library
find . -name '*.pdf' -exec pdfinfo {} \; 2>/dev/null | awk '/^Pages:/{s+=$2} END{print s}'

OCR with Tesseract

# Single image to text
tesseract scan.png output -l eng

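# Scanned PDF → searchable PDF (via image extraction + reassembly)
# Assumes one embedded image per page, as is typical for scans
pdfimages -png document.pdf /tmp/pages
for f in /tmp/pages-*.png; do
  tesseract "$f" "${f%.png}" -l eng pdf   # writes /tmp/pages-NNN.pdf
done
# Reassemble the per-page PDFs in order (pdfunite ships with poppler-utils)
pdfunite /tmp/pages-*.pdf searchable.pdf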
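
OCR is slow, so it may be worth checking whether a PDF already has a usable text layer before sending it through the loop above. The sketch below is one possible heuristic under stated assumptions: the function name needs_ocr and the 50-byte threshold are hypothetical and should be tuned against your own documents.

```shell
# Hypothetical helper: needs_ocr FILE [MIN_BYTES]
# Succeeds (exit 0) when the PDF's text layer yields fewer than MIN_BYTES
# non-whitespace bytes, which suggests a scanned, image-only document.
needs_ocr() {
  min=${2:-50}   # assumed threshold, not a recommended value
  # Arithmetic expansion strips any padding that wc adds around the number.
  bytes=$(( $(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c) ))
  [ "$bytes" -lt "$min" ]
}
```

Usage: if needs_ocr document.pdf; then echo "route to Tesseract"; else pdftotext -layout document.pdf -; fi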

# Multi-language OCR
tesseract scan.png output -l eng+spa

# HOCR output (preserves coordinates)
tesseract scan.png output -l eng hocr

Format Conversion with Pandoc

# DOCX → plain text
pandoc -t plain document.docx -o output.txt

# HTML → Markdown
pandoc -f html -t markdown page.html -o page.md

# EPUB → text
pandoc -t plain book.epub -o book.txt

# Batch DOCX conversion
find . -name '*.docx' -exec sh -c \
  'pandoc -t plain "$1" -o "${1%.docx}.txt"' _ {} \;
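
For a mixed library, the per-format commands in this section can be combined behind a single dispatcher that picks the tool by file extension. This is a sketch only: the function name to_text is hypothetical, matching is case-sensitive, and the extension list is an assumption that covers just the formats mentioned in this document.

```shell
# Hypothetical helper: to_text FILE
# Writes FILE's text to a sibling .txt file, choosing the tool by extension.
# Returns 2 for extensions it does not recognise.
to_text() {
  src=$1
  out="${src%.*}.txt"
  case "$src" in
    *.pdf)                     pdftotext -layout "$src" "$out" ;;
    *.docx|*.epub|*.html)      pandoc -t plain "$src" -o "$out" ;;
    # Tesseract appends .txt to the output base itself.
    *.png|*.jpg|*.tif|*.tiff)  tesseract "$src" "${src%.*}" -l eng ;;
    *) printf 'skipping unsupported file: %s\n' "$src" >&2; return 2 ;;
  esac
}
```

Usage: find . -type f | while IFS= read -r f; do to_text "$f"; done. Note that report.pdf and report.docx in the same directory would both write report.txt, so add the source extension to the output name if such pairs exist.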

Accuracy Tuning

  • Tesseract PSM modes: --psm 3 (fully automatic, the default), --psm 6 (single uniform block of text), --psm 11 (sparse text)

  • Pre-processing scans (input file first, then operators): convert input.png -density 300 -threshold 50% cleaned.png

  • Custom dictionaries and training data for domain-specific OCR

  • Compare OCR output quality (stdout as the output base prints the text instead of writing a file): diff <(tesseract img.png stdout --psm 3) <(tesseract img.png stdout --psm 6)
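
A diff shows where two PSM modes disagree but not which one is better. One rough proxy is the mean word confidence that Tesseract reports in its TSV output. The sketch below assumes the standard TSV layout (level in column 1, conf in column 11, word rows at level 5); the function names mean_conf and compare_psm are hypothetical, and confidence is only a heuristic, not a measure of true accuracy.

```shell
# Hypothetical helper: mean_conf
# Reads Tesseract TSV on stdin and prints the mean word confidence (0-100).
# Only word rows (level 5) with a non-negative confidence are counted.
mean_conf() {
  awk -F '\t' '$1 == 5 && $11 >= 0 { sum += $11; n++ }
               END { if (n) printf "%.1f\n", sum / n; else print "n/a" }'
}

# Hypothetical helper: compare_psm IMAGE
# Prints one line per page segmentation mode with its mean confidence.
compare_psm() {
  for psm in 3 6 11; do
    printf 'psm %s: ' "$psm"
    tesseract "$1" stdout --psm "$psm" tsv 2>/dev/null | mean_conf
  done
}
```

Usage: compare_psm scan.png. A mode that scores clearly higher on a sample of pages is a reasonable candidate for the batch run, though spot-checking the text by eye is still advisable.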