Query PDFs about the text objects that they contain (which is not always clean, structured, correct, or present at all)
If you want clean structured output, then you likely need to put in more work, but for a bag-of-words method this may be enough.
See also ocr.py; note that we also have "render PDF pages to image" functions here, so that we can hand those images to that OCR module.
TODO: read about natural reading order details at: https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order
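For the bag-of-words case mentioned above, a minimal usage sketch (assuming this module is importable as wetsuite.datacollect.pdf, the path used further below, and that page_embedded_text_generator accepts the PDF file contents as bytes):

    from wetsuite.datacollect.pdf import page_embedded_text_generator

    with open('example.pdf', 'rb') as f:   # placeholder path
        pdfbytes = f.read()

    # Concatenate whatever embedded text the pages report.
    # Good enough for bag-of-words; empty-ish for pages that are images of text.
    all_text = '\n'.join(page_embedded_text_generator(pdfbytes))
    print(len(all_text.split()), 'words of embedded text')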
Function | closest | Given a pymupdf Box, tells you the name of the size, and orientation. |
Function | count | Counts the number of pages that have a reasonable amount of embedded text on them. |
Function | do | Given a parsed pymupdf document object, tells us whether the pages (specifically their CropBox) vary in size at all. |
Function | document_fragments | Tries to be slightly smarter than page_embedded_text_generator, making it slightly easier to see paragraphs and _maybe_ headings. |
Function | embedded | For a PDF, walks through its pages, using embedded text where a page reports having it, and rendering the page and running OCR where it does not. |
Function | page | Takes a single pymupdf Page object, and renders it as a PIL color Image at a specific resolution. |
Function | page_embedded_as_xhtml | Extracts fragments using PyMuPDF's xhtml-style extraction, which analyzes some basic paragraphs and headers so that we don't have to (but also removes the low-level information it based that on). |
Function | page_embedded_fragments | Quick 'get fragments of text from a page', relying on some pymupdf analysis. |
Function | page_embedded_text_generator | Takes PDF file data and yields a page's worth of its embedded text at a time (it is a generator), according to the text objects in the PDF stream. |
Function | pages_as_images | Takes a PDF document as bytes, and yields one page at a time as a PIL image object. |
Function | pdf_text_ocr | Uses only OCR to process a PDF (makes no attempt to use text from PDF objects). |
Function | _open | Helper function that lets varied functions deal with either bytes or an already-opened document. |
Variable | _html | Undocumented
Variable | _page | Undocumented
Given a pymupdf Box, tells you the name of the size, and orientation.
Parameters | |
box | a pymupdf Box, e.g. a_page.cropbox |
within | the amount by which the size may be off. (Default is 36 pt, which is ~12 mm / 0.5 inch, and perhaps overly flexible) |
Returns | |
something like ('A4', 'portrait', 1, 0) |
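Since the underlying idea is just a tolerance comparison against standard page sizes, a rough standalone sketch of it (a hypothetical classify_papersize helper, not this module's actual implementation):

    import fitz  # PyMuPDF

    # Hypothetical helper, illustrating the idea only.
    PAPER_SIZES_PT = {'A4': (595, 842), 'A3': (842, 1191), 'Letter': (612, 792)}

    def classify_papersize(box, within=36):
        ''' Given a pymupdf rectangle (e.g. a_page.cropbox), guesses the closest named
            paper size and its orientation, allowing `within` points of slack. '''
        width, height = box.width, box.height
        orientation = 'portrait' if height >= width else 'landscape'
        shortside, longside = sorted((width, height))
        for name, (std_short, std_long) in PAPER_SIZES_PT.items():
            if abs(shortside - std_short) <= within and abs(longside - std_long) <= within:
                return name, orientation
        return None, orientation

    document = fitz.open('example.pdf')             # placeholder path
    print(classify_papersize(document[0].cropbox))  # e.g. ('A4', 'portrait')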
Counts the number of pages that have a reasonable amount of embedded text on them.
Intended to help detect PDFs that are partly or fully images-of-text instead.
Counts characters per page, including the spaces between words, but after strip()ping leading/trailing whitespace; TODO: think about that more.
Parameters | |
either PDF file contents as a bytes object, or an already-opened pymupdf Document | |
char | how long the text on a page should be, in characters, after strip()ping, for the page to count as having text. Defaults to 200, which is maybe 50 words. |
Returns | |
(list_of_number_of_chars_per_page, num_pages_with_text, num_pages) |
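The counting is simple to approximate directly with pymupdf; a sketch of the idea (hypothetical code, not necessarily this module's exact implementation):

    import fitz  # PyMuPDF

    def count_text_pages(pdfbytes, char_threshold=200):
        ''' Returns (chars_per_page, num_pages_with_text, num_pages), roughly as described above. '''
        document = fitz.open(stream=pdfbytes, filetype='pdf')
        chars_per_page = [len(page.get_text('text').strip()) for page in document]
        num_pages_with_text = sum(1 for count in chars_per_page if count >= char_threshold)
        return chars_per_page, num_pages_with_text, len(chars_per_page)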
Given a parsed pymupdf document object, tells us whether the pages (specifically their CropBox) vary in size at all.
Meant to help detect PDFs composited from multiple sources.
Parameters | |
the document under test (as bytes, or an already-parsed Document object) | |
allowance | the maximum height or width difference between largest and smallest, in pt (default is 36, which is ~12mm) |
Returns | |
a 3-tuple: ( whether there is more than allowance_pt difference, amount of width difference, amount of height difference ) |
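A sketch of that check (again hypothetical code, illustrating the idea only):

    import fitz  # PyMuPDF

    def pages_vary_in_size(document, allowance_pt=36):
        ''' Returns (varies_more_than_allowance, width_difference, height_difference). '''
        widths  = [page.cropbox.width  for page in document]
        heights = [page.cropbox.height for page in document]
        width_diff  = max(widths)  - min(widths)
        height_diff = max(heights) - min(heights)
        return (width_diff > allowance_pt or height_diff > allowance_pt), width_diff, height_diff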
Tries to be slightly smarter than page_embedded_text_generator, making it slightly easier to see paragraphs and _maybe_ headings.
Set up to do some more analysis than e.g. page_embedded_fragments does.
Note that this is the implementation of split.Fragments_PDF_Fallback, so when changing things, consider side effects there.
Parameters | |
Undocumented | |
hint | |
debug | Undocumented |
Returns | |
For a PDF, walk through its pages:
- if a page reports having text, use that text
- if it does not report having text, render it as an image and run OCR
...and relies on our own wetsuite.extras.ocr to do so.
compare with
- pdf_text_ocr(), which applies OCR to all pages
For context:
When given a PDF, you can easily decide to
- get all embedded text (and perhaps OCR the pages that come back empty), in which case page_embedded_text_generator() does most of what you want, which is fast, and as precise as that embedded text is.
- OCR all pages, in which case wetsuite.extras.ocr.easyocr and .easyocr_toplaintext() might do what you want.
The limitation is that you generally don't know what is in a PDF, so:
- you need to write that fallback yourself (just a few lines)
- if you OCR everything for thoroughness, you may get lower-quality text for pages that already contained good-quality embedded text
- you still can't deal with PDFs that are composited from sources that contain embedded text as well as images of text; these are relatively rare, but definitely happen.
This is mostly a convenience function to make your life simpler: it does that fallback, and it does it per PDF page (a sketch of the approach follows this entry).
This should be a decent balance: fast and precise where we have embedded text, and best-effort for pages that might contain images of text.
CONSIDER: rewriting this after restructuring the ocr interface.
Parameters | |
Undocumented | |
charint | Undocumented |
dpi:int | Undocumented |
cache | Undocumented |
use | Undocumented |
Returns | |
a list (one item for each page) of 2-tuples (first is 'embedded' or 'ocr', second item is the flattened text) |
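The per-page fallback described above amounts to just a few lines. A sketch of the idea, assuming you supply your own OCR callable that takes a PIL Image and returns text (hypothetical code, not this function's exact implementation):

    import fitz           # PyMuPDF
    from PIL import Image

    def embedded_or_ocr(pdfbytes, ocr_image_to_text, char_threshold=200, dpi=150):
        ''' For each page: use embedded text if there seems to be enough of it,
            otherwise render the page and hand it to the given OCR callable.
            Yields ('embedded', text) or ('ocr', text), one per page. '''
        document = fitz.open(stream=pdfbytes, filetype='pdf')
        for page in document:
            text = page.get_text('text')
            if len(text.strip()) >= char_threshold:
                yield 'embedded', text
            else:
                pixmap = page.get_pixmap(dpi=dpi)
                image = Image.frombytes('RGB', (pixmap.width, pixmap.height), pixmap.samples)
                yield 'ocr', ocr_image_to_text(image)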
Extracts fragments using PyMuPDF's xhtml-style extraction, which analyzes some basic paragraphs and headers so that we don't have to (but also removes the low-level information it based that on).
Parameters | |
page | pymupdf page object |
Returns | |
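Under the covers this leans on PyMuPDF's own xhtml extraction; if you want that raw markup yourself, the underlying call is:

    import fitz  # PyMuPDF

    document = fitz.open('example.pdf')   # placeholder path
    # xhtml-flavoured extraction: paragraph, header, bold, and italic style markup,
    # which is roughly what this function parses into fragments.
    xhtml = document[0].get_text('xhtml')
    print(xhtml[:500])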
Quick 'get fragments of text from a page', relying on some pymupdf analysis.
Note: does less processing than document_fragments, and defaults to simpler output (a string or a list of strings, not the hint structure that document_fragments gives). CONSIDER: making them work the same.
Parameters | |
page | pymupdf page object |
join | If False, we return a list of strings. If True, we return a single string. |
Returns | |
a single string (often with newlines), or a list of parts. |
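A usage sketch (assuming this module is importable as wetsuite.datacollect.pdf, and that join is accepted as a keyword argument as listed above):

    import fitz  # PyMuPDF
    from wetsuite.datacollect.pdf import page_embedded_fragments

    document = fitz.open('example.pdf')   # placeholder path
    first_page = document[0]
    parts  = page_embedded_fragments(first_page, join=False)  # a list of text parts
    as_one = page_embedded_fragments(first_page, join=True)   # a single string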
Takes PDF file data, yields a page's worth of its embedded text at a time (is a generator), according to the text objects in the PDF stream.
...which are essentially a "please render this text", but note that this is not 1:1 with the text you see, or as coherent as the way you would naturally read it. So this asks pymupdf to sort the text fragments into reading order, which is usually roughly right, but far from perfect.
Note that this is comparable with page_embedded_as_xhtml(); under the covers it is almost the same call but asks the library for plain text instead.
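For example (assuming this module is importable as wetsuite.datacollect.pdf):

    from wetsuite.datacollect.pdf import page_embedded_text_generator

    with open('example.pdf', 'rb') as f:   # placeholder path
        pdfbytes = f.read()

    for page_number, page_text in enumerate(page_embedded_text_generator(pdfbytes), start=1):
        print(page_number, repr(page_text[:80]))   # first bit of each page's embedded text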
Takes a PDF document as bytes, and yields one page at a time as a PIL image object.
Parameters | |
PDF file contents as a bytes object, or an already-opened fitz Document | |
dpi | the resolution to render at. Higher is slower, and not necessarily much better; in fact there are cases where higher is worse. 150 to 200 seems a good tradeoff. |
Returns | |
a generator yielding images, one page at a time (because consider what a 300-page PDF would do to RAM use) |
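For example, rendering each page to a PNG file (assuming this module is importable as wetsuite.datacollect.pdf):

    from wetsuite.datacollect.pdf import pages_as_images

    with open('example.pdf', 'rb') as f:   # placeholder path
        pdfbytes = f.read()

    # One PIL Image at a time, so even long PDFs do not blow up RAM use.
    for page_number, image in enumerate(pages_as_images(pdfbytes, dpi=150), start=1):
        image.save('page_%03d.png' % page_number)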
Use only OCR to process a PDF (makes no attempt to use text from PDF objects).
Mostly a call into wetsuite.extras.ocr, and so relies on it.
This is currently
- wetsuite.datacollect.pdf.pages_as_images()
- wetsuite.extras.ocr.easyocr()
and is also:
- slow (might take a minute or two per document); consider caching the result
- not clever in any way
so probably ONLY use this if
- extracting text objects (e.g. page_embedded_text_generator) gave you nothing
- you only care about what words exist, not about document structure
Parameters | |
filedata:bytes | the PDF file contents, as a bytes object | |
use | Undocumented |
Returns | |
all text, as a single string. |
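A usage sketch (assuming this module is importable as wetsuite.datacollect.pdf; note that, as mentioned, this can take minutes per document, so consider caching the result):

    from wetsuite.datacollect.pdf import pdf_text_ocr

    with open('scanned.pdf', 'rb') as f:   # placeholder path
        filedata = f.read()

    text = pdf_text_ocr(filedata)   # slow: renders every page and runs OCR on it
    print(text[:500])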