Phase 1: Document Pipeline
Extract structured text from PDFs, scanned images, and office or web documents (DOCX, HTML, EPUB) using CLI tools. Build batch-processing workflows for bulk document libraries.
Core Workflows
PDF Text Extraction
# Single file — preserve layout
pdftotext -layout document.pdf -
# Batch — all PDFs in directory to .txt
find /path/to/docs -name '*.pdf' -exec sh -c \
'pdftotext -layout "$1" "${1%.pdf}.txt"' _ {} \;
# Extract specific pages
pdftotext -f 5 -l 10 document.pdf -
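For large libraries the batch command above may be interrupted partway through. The following is a minimal sketch of a resumable variant, not a definitive implementation: the function name batch_pdftotext and the default log file name are hypothetical, and it assumes file names contain no newlines.

```shell
# Hypothetical helper: convert every PDF under a directory to .txt,
# skipping files that already have a non-empty .txt next to them.
# Usage: batch_pdftotext /path/to/docs [failures.log]
batch_pdftotext() {
    dir=$1
    log=${2:-pdftotext-failures.log}   # assumed default log name
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt="${pdf%.pdf}.txt"
        # A non-empty .txt is treated as "already converted"
        if [ -s "$txt" ]; then
            continue
        fi
        # Record the path of any PDF that pdftotext could not convert
        pdftotext -layout "$pdf" "$txt" || printf '%s\n' "$pdf" >> "$log"
    done
}
```

Re-running batch_pdftotext /path/to/docs after an interruption only processes the files that are still missing output. Delete a .txt file to force its PDF to be converted again.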
PDF Metadata & Image Extraction
# Metadata inspection
pdfinfo document.pdf
# Extract all embedded images
pdfimages -all document.pdf ./extracted/img
# Count pages across library
find . -name '*.pdf' -exec pdfinfo {} \; 2>/dev/null | awk '/^Pages:/{s+=$2} END{print s}'
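A library-wide total hides which files are large. As a rough sketch, the same pdfinfo output can be turned into a per-file report; the function name page_report is hypothetical, and files that pdfinfo cannot read are reported with 0 pages.

```shell
# Hypothetical helper: print "<pages><TAB><file>" for each PDF, largest first.
# Usage: page_report [directory]   (defaults to the current directory)
page_report() {
    find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # Pull the number from the "Pages:" line of pdfinfo's output
        pages=$(pdfinfo "$pdf" 2>/dev/null | awk '/^Pages:/ {print $2}')
        printf '%s\t%s\n' "${pages:-0}" "$pdf"
    done | sort -rn
}
```

Piping the result through head -n 10 lists the ten longest documents, which is often where OCR and extraction time is concentrated.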
OCR with Tesseract
# Single image to text
tesseract scan.png output -l eng
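# Scanned PDF → searchable PDF (via image extraction + reassembly)
pdfimages -png document.pdf /tmp/pages
for f in /tmp/pages-*.png; do
    # Writes one searchable PDF per page image, e.g. /tmp/pages-000.pdf
    tesseract "$f" "${f%.png}" -l eng pdf
done
# Reassemble the per-page PDFs into one file (pdfunite ships with poppler-utils)
pdfunite /tmp/pages-*.pdf searchable.pdf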
# Multi-language OCR
tesseract scan.png output -l eng+spa
# HOCR output (preserves coordinates)
tesseract scan.png output -l eng hocr
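OCR is slow, so it can help to check first whether a PDF already carries a text layer. The sketch below is one possible heuristic rather than an established rule: the function name needs_ocr and the 50-character threshold are assumptions to tune for your own documents.

```shell
# Hypothetical helper: succeed (exit 0) when a PDF yields fewer than
# MIN_CHARS non-whitespace characters, which suggests a scanned document.
# Usage: needs_ocr file.pdf [MIN_CHARS]
needs_ocr() {
    min_chars=${2:-50}   # assumed threshold, not a tesseract or poppler default
    # Count the characters pdftotext can extract, ignoring whitespace
    chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "$chars" -lt "$min_chars" ]
}
```

A typical use is a guard such as: if needs_ocr document.pdf; then run the OCR steps; else run pdftotext directly. Because the count covers the whole file, PDFs that mix scanned and born-digital pages may pass the check even though some pages still need OCR.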
Format Conversion with Pandoc
# DOCX → plain text
pandoc -t plain document.docx -o output.txt
# HTML → Markdown
pandoc -f html -t markdown page.html -o page.md
# EPUB → text
pandoc -t plain book.epub -o book.txt
# Batch DOCX conversion
find . -name '*.docx' -exec sh -c \
'pandoc -t plain "$1" -o "${1%.docx}.txt"' _ {} \;
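The per-format commands above can be combined into a single entry point for mixed libraries. This is a minimal sketch under stated assumptions: the function name extract_text is hypothetical, the extension list is illustrative rather than exhaustive, and matching is case-sensitive.

```shell
# Hypothetical helper: write the plain text of one document to stdout,
# choosing the tool from the file extension.
# Usage: extract_text FILE
extract_text() {
    case $1 in
        *.pdf)
            pdftotext -layout "$1" - ;;
        *.docx|*.epub|*.html)
            # pandoc infers the input format from the extension
            pandoc -t plain "$1" ;;
        *.png|*.jpg|*.jpeg|*.tif|*.tiff)
            # "stdout" as the output base makes tesseract print the text
            tesseract "$1" stdout -l eng ;;
        *)
            printf 'unsupported format: %s\n' "$1" >&2
            return 1 ;;
    esac
}
```

For example, extract_text report.docx > report.txt converts one file, and the function can be called from a find loop to process a whole directory tree.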
Accuracy Tuning
- Tesseract PSM modes: --psm 6 (single uniform block of text), --psm 3 (fully automatic, the default), --psm 11 (sparse text)
- Pre-process scans: convert input.png -density 300 -threshold 50% cleaned.png
- Use custom dictionaries and training data for domain-specific OCR
- Compare OCR output quality: diff <(tesseract scan.png stdout --psm 3) <(tesseract scan.png stdout --psm 6)
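When a diff between two modes is too noisy to read, a quick summary per mode can narrow the choice. The sketch below uses word count as a crude proxy for how much text each mode recovers; that proxy and the function name compare_psm are assumptions, and a higher count does not guarantee better accuracy.

```shell
# Hypothetical helper: print how many words tesseract recovers
# from one image under each page segmentation mode listed above.
# Usage: compare_psm IMAGE
compare_psm() {
    for psm in 3 6 11; do
        # "stdout" as the output base sends recognized text to the pipe
        words=$(tesseract "$1" stdout --psm "$psm" 2>/dev/null | wc -w | tr -d ' ')
        printf 'psm %s: %s words\n' "$psm" "$words"
    done
}
```

Run compare_psm scan.png on a few representative pages, then inspect the output of the most promising mode by eye before applying it to the whole library.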