module documentation

This is an experiment in allowing per-page decisions to extract embedded text or do OCR.

To do so, it

  • builds on wetsuite.extras.pdf to extract embedded text
  • builds on wetsuite.extras.ocr to do OCR when necessary
Class PDFAugmenter Tries to combine embedded-text extraction where present and sensible, and OCR where necessary.
Function easyocr_to_hocr Express the
Function fake_hocr Yes, we are generating hOCR from what is already PDF text. In general this makes no sense - where would we take that data?
Function hocr_plaintext Takes an hOCR document as a bytestring and outputs plain text (in the order it is stored, with no analysis or sorting).
Function tesseract_hocr Renders a pymupdf Document as images, runs OCR, and gives the results as hOCR.
Function _page_fakehocr takes a fitz Page, extracts
Function _tesseract_hocr_merge Because tesseract is a single call on a single image, hOCR output will be a single-page result.
Function _tesseract_hocr_single Run tesseract OCR on an image, give results as hOCR. (which is just pytesseract.image_to_pdf_or_hocr())
def easyocr_to_hocr(list_of_pageresults, cropboxes=None, dpi=None):

Express the

You will probably want to pass a width_ths of perhaps 0.1 to the easyocr() call (its unit is box height, so it adapts to the text size), to avoid merging words into sentences.

cropboxes and dpi are only interesting if you want to get this

Note that the units this reports in are pixels.
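
As an illustration of the kind of conversion involved, here is a minimal sketch that turns word-level easyocr results into hOCR word spans. It is not this module's implementation: the function name easyocr_words_to_hocr_spans is hypothetical, and it assumes each result is a (corners, text, confidence) tuple with four [x, y] corner points in pixels, which is the shape easyocr's readtext() returns by default.

```python
from xml.sax.saxutils import escape


def easyocr_words_to_hocr_spans(page_results):
    """Hypothetical helper: express one page of easyocr results as hOCR word spans.

    page_results is assumed to be a list of (corners, text, confidence), where
    corners is four [x, y] points in pixels and confidence is in the range 0..1.
    """
    spans = []
    for index, (corners, text, confidence) in enumerate(page_results, start=1):
        xs = [point[0] for point in corners]
        ys = [point[1] for point in corners]
        # hOCR bboxes are axis-aligned (x0 y0 x1 y1), top-left origin, in pixels
        bbox = (int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys)))
        title = "bbox %d %d %d %d; x_wconf %d" % (bbox + (round(confidence * 100),))
        spans.append(
            "<span class='ocrx_word' id='word_1_%d' title='%s'>%s</span>"
            % (index, title, escape(text))
        )
    return spans
```

The spans would still need to be wrapped in page and line elements to form a complete hOCR document; that part is left out here.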

def fake_hocr(document, dpi=None):

Yes, we are generating hOCR from what is already PDF text. In general this makes no sense - where would we take that data?

It's primarily so that we can have a common (internal-ish) format between OCR and PDF within this project.

Note that

  • PDF's Y origin is at the bottom, unlike OCR's, so Y coordinates are recalculated with respect to the cropbox.
  • A PDF's units are pt (1 pt = 1/72 inch, about 0.35278 mm). Since OCR is in pixels and you sometimes might want to try for equivalence between different sources, you can hand in a dpi and the result should come out at the same scale. Chances are it's still offset.
Parameters
  document: pymupdf Document
  dpi: Undocumented
Returns
  XML as bytestring
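
The two notes for fake_hocr (flipping the Y axis and scaling pt to pixels) can be sketched as below. This is only an illustration of the arithmetic, not the module's code: pdf_box_to_pixel_box is a hypothetical name, and it assumes the cropbox starts at (0, 0) so that only its height is needed.

```python
def pdf_box_to_pixel_box(box_pt, cropbox_height_pt, dpi=72):
    """Hypothetical helper: convert a PDF-space box to an image-space box.

    box_pt is (x0, y0, x1, y1) in points with a bottom-left origin.
    Returns (x0, y0, x1, y1) in pixels with a top-left origin.
    Assumes the cropbox origin is (0, 0).
    """
    scale = dpi / 72.0  # 1 pt = 1/72 inch, so dpi/72 pixels per pt
    x0, y0, x1, y1 = box_pt
    # Flip Y: the box's top edge has the larger y in PDF space,
    # and becomes the smaller y in image space.
    top = cropbox_height_pt - y1
    bottom = cropbox_height_pt - y0
    return tuple(round(value * scale) for value in (x0, top, x1, bottom))
```

For example, on a page 842 pt high, a box one inch from the left and half an inch from the top, rendered at 200 dpi, starts at pixel (200, 100).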
def hocr_plaintext(hocrbytes):

Takes an hOCR document as a bytestring and outputs plain text (in the order it is stored, with no analysis or sorting).
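
A minimal sketch of this kind of extraction, using only the standard library. It is not the module's implementation: hocr_words_in_stored_order is a hypothetical name, and it assumes the hOCR is well-formed XML whose words are marked with the ocrx_word class (as tesseract output is).

```python
import xml.etree.ElementTree as ET


def hocr_words_in_stored_order(hocr_bytes):
    """Hypothetical helper: join the text of all ocrx_word elements, in document order."""
    root = ET.fromstring(hocr_bytes)
    words = []
    for element in root.iter():  # iter() walks the tree in stored (document) order
        classes = (element.get("class") or "").split()
        if "ocrx_word" in classes:
            # itertext() also picks up text inside nested markup such as <strong>
            text = "".join(element.itertext()).strip()
            if text:
                words.append(text)
    return " ".join(words)
```

Because nothing is sorted, multi-column or out-of-order pages come out in whatever order the OCR engine stored them.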

def tesseract_hocr(document, lang='eng', dpi=200):

Renders a pymupdf Document as images, runs OCR, and gives the results as hOCR.

def _page_fakehocr(page, add_under_node, page_num, dpimult=1.0):

takes a fitz Page, extracts

Note that in general, OCR and PDF and images can disagree on whether the Y origin is on top or bottom. Images, OCR, and hOCR have x,y origins in the top left. PDF and therefore PDF libraries will have it in the bottom left.

def _tesseract_hocr_merge(hocr_pages_xmlbytes):

Because tesseract is a single call on a single image, hOCR output will be a single-page result.

If you have used _tesseract_hocr_single() to analyze single pages, then to create a single hOCR for the document those came from, this puts the pages in sequence and rewrites the ids to ensure it's a valid document.

Currently hardcoded with assumptions about the tesseract output; watch for changes.
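
The merge described above could look roughly like the following sketch. It is not the module's code: merge_hocr_pages is a hypothetical name, and it assumes each input is an XHTML-namespaced hOCR document whose ids follow a type_page_index pattern such as page_1 or word_1_7 (an assumption about tesseract's output, in the same spirit as the hardcoding warning above).

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumed id shape: 'page_1', 'word_1_7' -> (type, page number, rest)
ID_PATTERN = re.compile(r"^([a-z]+)_(\d+)(.*)$")


def merge_hocr_pages(hocr_pages_xmlbytes):
    """Hypothetical helper: merge single-page hOCR documents into one document."""
    if not hocr_pages_xmlbytes:
        raise ValueError("need at least one hOCR page")
    ET.register_namespace("", XHTML_NS)  # serialize without an ns0: prefix
    merged_root = None
    merged_body = None
    for page_number, xmlbytes in enumerate(hocr_pages_xmlbytes, start=1):
        root = ET.fromstring(xmlbytes)
        body = root.find("{%s}body" % XHTML_NS)
        # Renumber the page part of every id so ids stay unique across pages
        for element in body.iter():
            match = ID_PATTERN.match(element.get("id", ""))
            if match:
                element.set("id", "%s_%d%s" % (match.group(1), page_number, match.group(3)))
        if merged_root is None:
            merged_root, merged_body = root, body  # first page supplies head and body
        else:
            merged_body.extend(list(body))  # later pages contribute only their page divs
    return ET.tostring(merged_root, encoding="utf-8")
```

Keeping the first page's head means its metadata is used for the whole document, which is reasonable when all pages come from the same OCR run.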

def _tesseract_hocr_single(image, lang='eng'):

Run tesseract OCR on an image, give results as hOCR. (which is just pytesseract.image_to_pdf_or_hocr())