Source. Two scanned-style sample PDFs were produced by a public Python script (Pillow rasterizing original, hand-authored text onto a parchment background, plus mild grain and rotation) and packaged as image-only PDFs, so no third-party copyright is involved.
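For readers who want to reproduce a comparable input, here is a minimal sketch of that kind of generator. It is not the script used for these samples: the function names, page size, colours, noise level, and rotation angle are all illustrative assumptions, and a flat parchment colour stands in for a real background texture.

```python
import io

from PIL import Image, ImageDraw, ImageFont

PARCHMENT = (236, 222, 190)  # assumed background colour, not the original texture


def make_fake_scan(text, size=(850, 1100), angle=1.5, grain=12.0):
    """Return an RGB image that loosely imitates a scanned page (hypothetical helper)."""
    page = Image.new("RGB", size, PARCHMENT)
    draw = ImageDraw.Draw(page)
    # The built-in default font keeps the sketch dependency-free; it only covers
    # Latin text, so a real script would load a TrueType font for Chinese.
    draw.multiline_text((60, 60), text, fill=(40, 30, 20), font=ImageFont.load_default())
    # Mild grain: Gaussian noise centred on mid-grey, blended in at low opacity.
    noise = Image.effect_noise(size, grain).convert("RGB")
    page = Image.blend(page, noise, 0.08)
    # Slight rotation, filling the exposed corners with the background colour.
    return page.rotate(angle, resample=Image.BICUBIC, fillcolor=PARCHMENT)


def scan_to_pdf_bytes(image, dpi=150.0):
    """Serialize the image as a one-page, image-only PDF (no text layer)."""
    buf = io.BytesIO()
    image.save(buf, format="PDF", resolution=dpi)
    return buf.getvalue()
```

Writing the returned bytes to disk gives a PDF with no selectable text, which is the kind of input the OCR step expects.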
Each searchable sibling on the right was produced by running
    ocrmypdf -l eng+chi_sim --rotate-pages --deskew --skip-text <in> <out>
locally with the official ocrmypdf/OCRmyPDF CLI (MPL-2.0). Tesseract runs on the CPU; no API key, no cloud call. The OCR text layer is invisible: the page looks identical to the scan, but Cmd-F / Ctrl-F and copy-paste now work, and full-text indexers can crawl the file.
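To batch the same command over several files, it can be wrapped in a few lines of Python. This is a sketch only: the helper names build_ocr_command and run_ocr are hypothetical, and it assumes ocrmypdf and the eng and chi_sim Tesseract language data are already installed locally. The flags are exactly those in the command above.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, languages=("eng", "chi_sim")):
    """Return the argv list for the ocrmypdf invocation quoted above."""
    return [
        "ocrmypdf",
        "-l", "+".join(languages),  # Tesseract language codes joined with '+'
        "--rotate-pages",           # fix pages scanned sideways or upside down
        "--deskew",                 # straighten slightly tilted pages
        "--skip-text",              # leave pages that already have text untouched
        str(src),
        str(dst),
    ]


def run_ocr(src, dst):
    """Run OCR locally; raises if the CLI is missing or exits non-zero."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_ocr_command(src, dst), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.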