class documentation

Tries to combine embedded-text extraction where present and sensible, and OCR where necessary.

Works per page, producing hOCR output intermeda for each page, so that we can opt to do more layout analysis -- or not.

Method __init__ DPI is both for the image to feed to OCR -- but also for the rescaling of the fake-hOCR-from-PDF
Method process_pages Processes a page at a time.
Instance Variable doc Undocumented
Instance Variable dpi Undocumented
Instance Variable ocr_preference Undocumented
def __init__(self, pdf, ocr_at_dpi=200, ocr_preference=('easyocr', 'tesseract')):

DPI is both for the image to feed to OCR -- but also for the rescaling of the fake-hOCR-from-PDF

def process_pages(self):

Processes a page at a time.

doc =

Undocumented

dpi =

Undocumented

ocr_preference =

Undocumented