module documentation

Query PDFs about the text objects that they contain (which is not always clean, structured, correct, or present at all)

If you want clean, structured output, you will likely have to put in more work, but for a bag-of-words method this may be enough.

See also ocr.py, and note that we also have "render PDF pages to image" functions so we can hand pages to that OCR module.

TODO: read about natural reading order details at: https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order

Function closest_paper_size_name: Given a pymupdf Box, tells you the name of the size, and the orientation.
Function count_pages_with_embedded_text: Counts the number of pages that have a reasonable amount of embedded text on them.
Function do_page_sizes_vary: Given a parsed pymupdf document object, tells us whether the pages (specifically their CropBox) vary in size at all.
Function document_fragments: Tries to be slightly smarter than page_embedded_text_generator, making it slightly easier to see paragraphs and _maybe_ headings.
Function embedded_or_ocr_perpage: For a PDF, walks through its pages, using embedded text where present and OCR where not.
Function page_as_image: Takes a single pymupdf Page object, and renders it as a PIL color Image at a specific resolution.
Function page_embedded_as_xhtml: Extracts fragments using PyMuPDF's xhtml-style extraction, which analyzes some basic paragraphs and headers so that we don't have to (but also removes the low-level information it based that on).
Function page_embedded_fragments: Quick 'get fragments of text from a page', relying on some pymupdf analysis.
Function page_embedded_text_generator: Takes PDF file data, yields a page's worth of its embedded text at a time (is a generator), according to the text objects in the PDF stream.
Function pages_as_images: Takes a PDF bytes document, yields one page at a time as a PIL image object.
Function pdf_text_ocr: Uses only OCR to process a PDF (no attempt to use text from PDF objects).
Function _open_pdf: Helper that lets varied functions deal with an already-opened Document, or file data as bytes.
Variable _html_header_tag_names: Undocumented
Variable _page_sizes: Undocumented
def closest_paper_size_name(box, within_pt=36.0):

Given a pymupdf Box, tells you the name of the size, and orientation.

Parameters
box: a pymupdf Box, e.g. a_page.cropbox
within_pt: the amount the size may be off by. (Default is 36pt, which is ~12mm / 0.5 inch, which is perhaps overly flexible)
Returns

something like ('A4', 'portrait', 1, 0), which is:

  • the name of the size (currently 'A4', 'Letter', or 'other')
  • whether it's in 'portrait' or 'landscape'
  • the size mismatch to that size, width-wise and height-wise (if name is not 'other', this will be lower than within_pt)
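
For example, a minimal sketch (assuming this module is importable as wetsuite.datacollect.pdf; the filename is a placeholder):

    import pymupdf                                   # in older PyMuPDF versions: import fitz
    from wetsuite.datacollect.pdf import closest_paper_size_name

    with pymupdf.open('example.pdf') as doc:         # 'example.pdf' is a placeholder
        name, orientation, dw, dh = closest_paper_size_name(doc[0].cropbox)
        print(name, orientation)                     # e.g.: A4 portrait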
def count_pages_with_embedded_text(pdf, char_threshold=200):

Counts the number of pages that have a reasonable amount of embedded text on them.

Intended to help detect PDFs that are partly or fully images-of-text instead.

Counts characters per page, plus spaces between words, after strip()ping edge whitespace; TODO: think about that more.

Parameters
pdf: either:
  • PDF file data (as bytes)
  • the output of page_embedded_text_generator()
char_threshold: how long the text on a page should be, in characters, after strip()ping. Defaults to 200, which is maybe 50 words.
Returns
(list_of_number_of_chars_per_page, num_pages_with_text, num_pages)
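
A sketch of using this to spot PDFs that may need OCR (placeholder filename):

    from wetsuite.datacollect.pdf import count_pages_with_embedded_text

    with open('maybe-scanned.pdf', 'rb') as f:
        pdfbytes = f.read()
    chars_per_page, num_with_text, num_pages = count_pages_with_embedded_text(pdfbytes)
    if num_with_text < num_pages:
        print('%d of %d pages look like they may need OCR' % (num_pages - num_with_text, num_pages))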
def do_page_sizes_vary(pdf, allowance_pt=36):

Given a parsed pymupdf document object, tells us whether the pages (specifically their CropBox) vary in size at all.

Meant to help detect PDFs composited from multiple sources.

Parameters
pdf: the document under test (as bytes, or an already-parsed Document object)
allowance_pt: the maximum height or width difference between largest and smallest, in pt (default is 36, which is ~12mm)
Returns
a 3-tuple: ( whether there is more than allowance_pt difference, amount of width difference, amount of height difference )
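
For example (a sketch; placeholder filename):

    from wetsuite.datacollect.pdf import do_page_sizes_vary

    with open('composited.pdf', 'rb') as f:
        pdfbytes = f.read()
    varies, width_diff, height_diff = do_page_sizes_vary(pdfbytes)
    if varies:
        print('Page sizes vary by up to %.1fpt x %.1fpt' % (width_diff, height_diff))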
def document_fragments(pdf, hint_structure=True, debug=True):

Tries to be slightly smarter than page_embedded_text_generator, making it slightly easier to see paragraphs and _maybe_ headings.

Set up to do some more analysis than e.g. page_embedded_fragments does.

Note that this is the implementation of split.Fragments_PDF_Fallback, so when changing things, consider side effects there.

Parameters
pdf: the PDF to process (as bytes, or an already-opened pymupdf Document; see _open_pdf)
hint_structure:
  • if True, returns the structure internal to this function
  • if False, returns a text string.
debug: Undocumented
Returns
  • a list of strings, or
  • a list of (hintdict, emptydict, textfragment) (the empty dict is for drop-in use in the split module)
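
A sketch of walking the hinted structure (placeholder filename):

    from wetsuite.datacollect.pdf import document_fragments

    with open('example.pdf', 'rb') as f:
        pdfbytes = f.read()
    for hintdict, _emptydict, textfragment in document_fragments(pdfbytes, hint_structure=True):
        print(hintdict, textfragment[:60])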
def embedded_or_ocr_perpage(pdf, char_threshold=30, dpi=150, cache_store=None, use_gpu=False):

For a PDF, walks through its pages:

  • if it reports having text, use that text
  • if it does not report having text, render it as an image and run OCR

...and relies on our own wetsuite.extras.ocr to do so.

compare with

  • pdf_text_ocr(), which applies OCR to all pages

For context:

When given a PDF, you can easily decide to

  • get all embedded text (and perhaps OCR the pages that come out empty), in which case page_embedded_text_generator() does most of what you want; this is fast, and as precise as that embedded text is.
  • OCR all pages, in which case wetsuite.extras.ocr.easyocr and .easyocr_toplaintext() might do what you want.

The limitation is that you generally don't know what is in a PDF, so:

  • you need to write that fallback yourself (just a few lines)
  • if you OCR everything for thoroughness, you might end up with lower-quality OCR for pages that already contained good-quality text
  • you still can't deal with PDFs composited from sources that contain embedded text as well as images of text; these are relatively rare, but definitely happen.

This is mostly a convenience function to make your life simpler: it does that fallback, and it does it per PDF page.

This should be a decent balance of fast and precise when we have embedded text, and best-effort for pages that might contain images of text.

CONSIDER: rewriting this after restructuring the ocr interface.

Parameters
pdf: the PDF to process (as bytes, or an already-opened pymupdf Document)
char_threshold: the minimum amount of embedded text on a page, in characters, for us to use that text rather than fall back to OCR
dpi: the resolution to render a page at when we do fall back to OCR
cache_store: Undocumented
use_gpu: whether to ask the OCR backend to use a GPU
Returns
a list (one item for each page) of 2-tuples (first is 'embedded' or 'ocr', second item is the flattened text)
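
For example (a sketch; pages that fall back to OCR will be much slower):

    from wetsuite.datacollect.pdf import embedded_or_ocr_perpage

    with open('mixed.pdf', 'rb') as f:
        pdfbytes = f.read()
    for how, text in embedded_or_ocr_perpage(pdfbytes, dpi=150):
        print('%-8s %d chars' % (how, len(text)))    # how is 'embedded' or 'ocr'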
def page_as_image(page, dpi=150):

Takes a single pymupdf Page object, and renders it as a PIL color Image at a specific resolution.
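
A sketch (placeholder filenames):

    import pymupdf
    from wetsuite.datacollect.pdf import page_as_image

    with pymupdf.open('example.pdf') as doc:
        image = page_as_image(doc[0], dpi=150)   # a PIL Image
        image.save('page-000.png')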

def page_embedded_as_xhtml(page):

Extracts fragments using PyMuPDF's xhtml-style extraction, which analyzes some basic paragraphs and headers so that we don't have to (but also removes the low-level information it based that on).

Parameters
page: a pymupdf Page object
Returns
a string of XHTML for this page
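
For example (a sketch):

    import pymupdf
    from wetsuite.datacollect.pdf import page_embedded_as_xhtml

    with pymupdf.open('example.pdf') as doc:
        xhtml = page_embedded_as_xhtml(doc[0])
        print(xhtml[:300])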
def page_embedded_fragments(page, join=True):

Quick 'get fragments of text from a page', relying on some pymupdf analysis.

Note: does less processing than document_fragments, and defaults to simpler output (a string or list of strings, not the hint structure that document_fragments gives). CONSIDER: making them work the same.

Parameters
page: a pymupdf Page object
join: if True, we return a single string; if False, a list of strings.
Returns
a single string (often with newlines), or a list of parts.
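
For example (a sketch):

    import pymupdf
    from wetsuite.datacollect.pdf import page_embedded_fragments

    with pymupdf.open('example.pdf') as doc:
        as_one_string = page_embedded_fragments(doc[0])              # join=True (default)
        as_parts      = page_embedded_fragments(doc[0], join=False)  # list of strings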
def page_embedded_text_generator(pdf, option='text'):

Takes PDF file data, yields a page's worth of its embedded text at a time (is a generator), according to the text objects in the PDF stream.

...which are essentially a "please render this text", but note that this is not 1:1 with the text you see, or as coherent as the way you would naturally read it. So we ask the library to sort the text fragments into reading order, which is usually roughly right, but far from perfect.

Note that this is comparable with page_embedded_as_xhtml(); under the covers it is almost the same call but asks the library for plain text instead.
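
A sketch (placeholder filename):

    from wetsuite.datacollect.pdf import page_embedded_text_generator

    with open('example.pdf', 'rb') as f:
        pdfbytes = f.read()
    for page_number, page_text in enumerate(page_embedded_text_generator(pdfbytes)):
        print('page %d: %d characters' % (page_number, len(page_text)))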

def pages_as_images(pdf, dpi=150):

Takes a PDF bytes document, yields one page at a time as a PIL image object.

Parameters
pdf: PDF file contents as a bytes object, or an already-opened fitz Document
dpi: the resolution to render at. Higher is slower, and not necessarily much better; in fact there are cases where higher is worse. 150 to 200 seems a good tradeoff.
Returns
a generator yielding images, one page at a time (because consider what a 300-page PDF would do to RAM use)
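
For example (a sketch; placeholder filenames):

    from wetsuite.datacollect.pdf import pages_as_images

    with open('example.pdf', 'rb') as f:
        pdfbytes = f.read()
    for page_number, image in enumerate(pages_as_images(pdfbytes, dpi=150)):
        image.save('page-%03d.png' % page_number)   # one file per page, keeps RAM use low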
def pdf_text_ocr(filedata, use_gpu=True):

Use only OCR to process a PDF (no attempt is made to use text from PDF objects).

Mostly a call into wetsuite.extras.ocr, and so relies on that module.

This currently uses:

  • wetsuite.datacollect.pdf.pages_as_images()
  • wetsuite.extras.ocr.easyocr()

and is also:

  • slow (might take a minute or two per document) - consider caching the result
  • not clever in any way

so probably ONLY use this if

  • extracting text objects (e.g. page_embedded_text_generator) gave you nothing
  • you only care about what words exist, not about document structure
Parameters
filedata: PDF file data, as bytes
use_gpu: whether to ask the OCR backend to use a GPU
Returns
all text, as a single string.
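
A sketch (placeholder filename; note the slowness warning above):

    from wetsuite.datacollect.pdf import pdf_text_ocr

    with open('scanned.pdf', 'rb') as f:
        pdfbytes = f.read()
    text = pdf_text_ocr(pdfbytes, use_gpu=False)    # slow; consider caching this result
    print(text[:500])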
def _open_pdf(pdf):

Helper function that lets varied functions deal with any of:

  • an already-opened pymupdf Document
  • file as bytes
  • CONSIDER: or a filename
_html_header_tag_names: tuple[str, ...] =

Undocumented

_page_sizes: tuple[tuple, ...] =

Undocumented