wetsuite.extras.pdfhocr.PDFAugmenter

class documentation

class PDFAugmenter:

Tries to combine embedded-text extraction where present and sensible, and OCR where necessary.

Works per page, producing intermediate hOCR output for each page, so that we can opt to do more layout analysis -- or not.

Method	`__init__`	The DPI is used both to render the page image fed to OCR and to rescale the fake hOCR derived from the PDF.
Method	`process_pages`	Processes a page at a time.
Instance Variable	`doc`	Undocumented
Instance Variable	`dpi`	Undocumented
Instance Variable	`ocr_preference`	Undocumented

def __init__(self, pdf, ocr_at_dpi=200, ocr_preference=('easyocr', 'tesseract')):

The DPI is used both to render the page image fed to OCR and to rescale the fake hOCR derived from the PDF.
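The rescaling mentioned here can be illustrated with a minimal sketch. It is not this class's actual code: it only assumes that embedded-text boxes arrive in PDF points (1/72 inch, the default PDF user-space unit) and must be expressed in pixels at the OCR DPI so that both sources share one coordinate scale. The function name `points_to_pixels` and the sample box are hypothetical.

```python
PDF_POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch


def points_to_pixels(bbox_pt, dpi=200):
    """Scale an (x0, y0, x1, y1) box from PDF points to pixels at the given DPI.

    Hypothetical helper for illustration; not part of PDFAugmenter.
    """
    return tuple(round(v * dpi / PDF_POINTS_PER_INCH) for v in bbox_pt)


# Hypothetical word box from embedded text, in points.
word_box_pt = (72.0, 144.0, 360.0, 216.0)
word_box_px = points_to_pixels(word_box_pt, dpi=200)
```

Whether a y-axis flip is also needed depends on the coordinate origin used by the text-extraction library, which this page does not specify.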

def process_pages(self):

Processes a page at a time.

doc =

Undocumented

dpi =

Undocumented

ocr_preference =

Undocumented
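Since this attribute is undocumented, the following is only a guess at how an ordered preference such as `('easyocr', 'tesseract')` might be resolved: try each name in turn and use the first engine whose Python module can be imported. The mapping `ENGINE_MODULES`, the function `first_available_engine`, and the `is_installed` hook are all hypothetical and not part of wetsuite.

```python
import importlib.util

# Assumed mapping from preference name to importable module name.
ENGINE_MODULES = {"easyocr": "easyocr", "tesseract": "pytesseract"}


def first_available_engine(preference=("easyocr", "tesseract"), is_installed=None):
    """Return the first preferred engine name that appears usable, else None."""
    if is_installed is None:
        # Default check: can the module be located without importing it?
        def is_installed(module):
            return importlib.util.find_spec(module) is not None

    for name in preference:
        module = ENGINE_MODULES.get(name)
        if module is not None and is_installed(module):
            return name
    return None
```

Passing a custom `is_installed` callable makes the selection logic easy to exercise without having either OCR package present.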