This is an experiment in allowing per-page decisions to extract embedded text or do OCR.
To do so, it
- builds on wetsuite.extras.pdf to extract embedded text
- builds on wetsuite.extras.ocr to do OCR when necessary
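As an illustration of what a per-page decision could look like, here is a minimal sketch. It does not reproduce this module's actual logic: the function name `text_per_page`, the `ocr_page` callback and the `min_chars` threshold are all hypothetical, and the criterion "too little embedded text means OCR" is an assumption.

```python
from typing import Callable, List, Sequence


def text_per_page(
    embedded_texts: Sequence[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,  # hypothetical threshold, not taken from this module
) -> List[str]:
    """For each page, keep the embedded text if it looks substantial,
    otherwise fall back to OCR for that page only."""
    results = []
    for page_index, embedded in enumerate(embedded_texts):
        if len(embedded.strip()) >= min_chars:
            # Enough embedded text: trust it and skip the (slow) OCR step.
            results.append(embedded)
        else:
            # Little or no embedded text: assume a scanned page and OCR it.
            results.append(ocr_page(page_index))
    return results
```

The OCR step is passed in as a callable so that the decision logic can be tried without pymupdf or an OCR engine installed.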
Class | | Tries to combine embedded-text extraction where present and sensible, and OCR where necessary. |
Function | easyocr | Express the |
Function | fake | Yes, we are generating hOCR from what is already PDF text. In general this makes no sense - where would we take that data? |
Function | hocr | Take an hOCR document as a bytestring, output plain text (just in the order it is stored, no analysis or sorting) |
Function | tesseract | Render a pymupdf Document as images, OCR them, and give results as hOCR. |
Function | _page | takes a fitz Page, extracts |
Function | _tesseract | Because Tesseract is a single call on a single image, hOCR output will be a single-page result. |
Function | _tesseract | Run Tesseract OCR on an image, give results as hOCR (which is just pytesseract.image_to_pdf_or_hocr()). |
Express the
You will probably want to pass a width_ths of around 0.1 to the easyocr() call (its unit is box height, so it adapts to the text size) to avoid merging words into sentences.
cropboxes and dpi are only interesting if you want to get this
Note that this reports its units in pixels.
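To make the pixel-based output concrete, here is a minimal sketch of turning EasyOCR-style results into hOCR word spans. It assumes the usual `readtext()` result shape, a list of `(corner points, text, confidence)` entries with confidence between 0 and 1. The function name `easyocr_results_to_hocr_words` and the `word_1_N` id scheme are hypothetical, and this is not necessarily how this module builds its hOCR.

```python
from html import escape


def easyocr_results_to_hocr_words(results):
    """Turn EasyOCR-style (corners, text, confidence) entries into
    hOCR ocrx_word spans, with bounding boxes in pixels."""
    spans = []
    for number, (corners, text, confidence) in enumerate(results, start=1):
        # The four corner points may be slightly rotated; take the
        # axis-aligned box around them, as hOCR bbox expects.
        xs = [point[0] for point in corners]
        ys = [point[1] for point in corners]
        left, top = int(round(min(xs))), int(round(min(ys)))
        right, bottom = int(round(max(xs))), int(round(max(ys)))
        spans.append(
            "<span class='ocrx_word' id='word_1_%d' "
            "title='bbox %d %d %d %d; x_wconf %d'>%s</span>"
            % (number, left, top, right, bottom,
               int(round(confidence * 100)),  # hOCR x_wconf is 0..100
               escape(text))
        )
    return spans
```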
Yes, we are generating hOCR from what is already PDF text. In general this makes no sense - where would we take that data?
It's primarily so that we can have a common (internal-ish) format between OCR and PDF within this project.
Note that
- A PDF's Y origin is at the bottom, unlike OCR output, so Y coordinates are recalculated with respect to the cropbox.
- A PDF's units are pt (1 pt = 1/72 inch, roughly 0.35278 mm). Since OCR output is in pixels, and you may sometimes want comparable coordinates from different sources, you can hand in a dpi and the output should end up at the same scale. Chances are it is still offset.
Parameters | |
document | pymupdf Document |
dpi | Undocumented |
Returns | |
XML as bytestring |
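The two notes above (flipping Y relative to the cropbox, and scaling pt to pixels via a dpi) can be sketched as follows. This is an illustration under assumptions, not this module's code: `pdf_bbox_to_pixels` is a hypothetical helper, and it assumes both the bounding box and the cropbox are given as `(x0, y0, x1, y1)` in PDF points with a bottom-left origin.

```python
def pdf_bbox_to_pixels(bbox_pt, cropbox_pt, dpi=72):
    """Convert a bottom-left-origin box in PDF points into a
    top-left-origin box in pixels at the given dpi."""
    x0, y0, x1, y1 = bbox_pt
    crop_x0, crop_y0, crop_x1, crop_y1 = cropbox_pt
    scale = dpi / 72.0  # 1 pt is 1/72 inch, so dpi/72 pixels per pt

    # X only needs shifting to the cropbox's left edge, then scaling.
    left = (x0 - crop_x0) * scale
    right = (x1 - crop_x0) * scale
    # Y is measured down from the cropbox's top edge instead of up from
    # the bottom, so the higher PDF y (y1) becomes the smaller pixel y.
    top = (crop_y1 - y1) * scale
    bottom = (crop_y1 - y0) * scale
    return tuple(int(round(value)) for value in (left, top, right, bottom))
```

For example, on an A4-sized cropbox of `(0, 0, 595, 842)`, the box `(72, 700, 144, 720)` at 144 dpi becomes `(144, 244, 288, 284)`.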
Takes an hOCR document as a bytestring and outputs plain text (just in the order it is stored, with no analysis or sorting).
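A minimal sketch of this kind of "stored order" extraction, using only the standard library. The function name `hocr_words_in_stored_order` is hypothetical, and it assumes the hOCR is well-formed XHTML in which words are marked with class `ocrx_word` (as Tesseract does); the module's own function may differ in what it selects and how it joins the text.

```python
import xml.etree.ElementTree as ET


def hocr_words_in_stored_order(hocr_bytes):
    """Return the text of all ocrx_word elements, joined by spaces,
    in document order (no layout analysis, no sorting)."""
    root = ET.fromstring(hocr_bytes)
    words = []
    for element in root.iter():  # iter() walks the tree in document order
        if "ocrx_word" in element.get("class", "").split():
            # itertext() also picks up text nested in e.g. <strong>.
            word = "".join(element.itertext()).strip()
            if word:
                words.append(word)
    return " ".join(words)
```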
takes a fitz Page, extracts
Note that in general, OCR and PDF and images can disagree on whether the Y origin is on top or bottom. Images, OCR, and hOCR have x,y origins in the top left. PDF and therefore PDF libraries will have it in the bottom left.
Because Tesseract is a single call on a single image, its hOCR output will be a single-page result.
If you have used _tesseract_hocr_single() to analyze single pages, then to create a single hOCR document for the document those pages came from, this puts the pages in sequence and rewrites the ids to ensure the result is a valid document.
Currently hardcoded with assumptions about the tesseract output; watch for changes.
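The idea of sequencing pages and rewriting ids can be sketched as below. This is an illustration only: `merge_single_page_hocr` is a hypothetical name, and prefixing each id with `p<page number>_` is an assumed scheme chosen for simplicity, not the id rewriting this module actually performs. It also leaves the first page's head metadata untouched.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def _local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def merge_single_page_hocr(page_documents):
    """Merge single-page hOCR bytestrings into one document, making
    ids unique by prefixing them with the page number."""
    if not page_documents:
        raise ValueError("need at least one hOCR page")
    ET.register_namespace("", XHTML_NS)  # serialize without ns0: prefixes
    merged_root = merged_body = None
    for page_number, document in enumerate(page_documents, start=1):
        root = ET.fromstring(document)
        body = next(el for el in root.iter() if _local_name(el.tag) == "body")
        # Each page restarts its ids (page_1, word_1_1, ...), so rewrite them.
        for element in body.iter():
            old_id = element.get("id")
            if old_id is not None:
                element.set("id", "p%d_%s" % (page_number, old_id))
        if merged_root is None:
            # Keep the first document as the container for everything else.
            merged_root, merged_body = root, body
        else:
            # Append this page's content after the pages already merged.
            merged_body.extend(list(body))
    return ET.tostring(merged_root, encoding="utf-8")
```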