class documentation
Tries to combine embedded-text extraction where present and sensible, and OCR where necessary.
Works per page, producing intermediate hOCR output for each page, so that we can opt to do more layout analysis, or not.
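The per-page choice between embedded text and OCR could look roughly like the sketch below. It is only an illustration, not this class's actual logic: the `Page` type, the `choose_method` and `process_pages` helpers, and the `min_chars` threshold are all hypothetical names, and the real "present and sensible" check may be more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical stand-in for one PDF page."""
    embedded_text: str  # text layer pulled from the PDF; may be empty


def choose_method(page: Page, min_chars: int = 20) -> str:
    """Return 'embedded' if the text layer looks usable, otherwise 'ocr'.

    Assumption: "sensible" is approximated here by a minimum number of
    non-whitespace-trimmed characters; this threshold is made up.
    """
    usable = len(page.embedded_text.strip()) >= min_chars
    return "embedded" if usable else "ocr"


def process_pages(pages: List[Page], min_chars: int = 20) -> List[str]:
    """Decide independently for each page, one page at a time."""
    return [choose_method(page, min_chars) for page in pages]
```

Deciding per page rather than per document means a scanned appendix inside an otherwise born-digital PDF would still go through OCR.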
Method | __init__ |
DPI is used both for the image fed to OCR and for rescaling the fake hOCR generated from the PDF's embedded text |
Method | process |
Processes a page at a time. |
Instance Variable | doc |
Undocumented |
Instance Variable | dpi |
Undocumented |
Instance Variable | ocr |
Undocumented |
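To illustrate why `dpi` matters for the hOCR derived from embedded text: PDF coordinates are expressed in points (1/72 inch), while OCR output is in pixels of the rendered image, so the two only line up after scaling by `dpi / 72`. The sketch below shows that arithmetic. The helper names are hypothetical, and it assumes the input box already uses a top-left origin, as hOCR does; it is not taken from this class.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def points_to_pixels(bbox_pt: BBox, dpi: int) -> Tuple[int, int, int, int]:
    """Scale a box from PDF points (1/72 inch) to pixels at the given DPI."""
    scale = dpi / 72.0
    x0, y0, x1, y1 = (round(v * scale) for v in bbox_pt)
    return (x0, y0, x1, y1)


def hocr_bbox_title(bbox_pt: BBox, dpi: int) -> str:
    """Format the scaled box as an hOCR 'bbox' property for a title attribute."""
    x0, y0, x1, y1 = points_to_pixels(bbox_pt, dpi)
    return f"bbox {x0} {y0} {x1} {y1}"
```

With the same `dpi` used for rendering the OCR image and for this scaling, boxes from both sources share one pixel coordinate space, which is what makes later layout analysis over mixed pages workable.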