module documentation

Query PDFs about the text objects that they contain (which is not always clean, structured, correct, or present at all)

If you want clean, structured output, you will likely have to put in more work, but for a bag-of-words method this may be enough.

See also ocr.py, and note that we also have "render PDF pages to image" functions so we can hand pages to that OCR module.

TODO: read about natural reading order details at: https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order

Function closest_paper_size_name: Given a pymupdf Box, tells you the name of the size, and the orientation.
Function count_pages_with_embedded_text: Counts the number of pages that have a reasonable amount of embedded text on them.
Function do_page_sizes_vary: Given a parsed pymupdf document object, tells us whether the pages (specifically their CropBox) vary in size at all.
Function document_fragments: Tries to be slightly smarter than page_embedded_text_generator, making it slightly easier to see paragraphs and _maybe_ headings.
Function embedded_or_ocr_perpage: For a PDF, walks through its pages, using embedded text where present and OCR where not.
Function page_as_image: Takes a single pymupdf Page object, and renders it as a PIL color Image at a specific resolution.
Function page_embedded_as_xhtml: Extracts fragments using PyMuPDF's xhtml-style extraction, which analyzes some basic paragraphs and headers so that we don't have to (but also removes the low-level information it based that on).
Function page_embedded_fragments: Quick 'get fragments of text from a page', relying on some pymupdf analysis.
Function page_embedded_text_generator: Takes PDF file data, yields a page's worth of its embedded text at a time (is a generator), according to the text objects in the PDF stream.
Function pages_as_images: Takes a PDF bytes document, yields one page at a time as a PIL image object.
Function pdf_text_ocr: Uses only OCR to process a PDF (no attempt to use text from PDF objects).
Function _open_pdf: Helper that lets varied functions deal with an already-opened Document, or file data as bytes.
Variable _html_header_tag_names: Undocumented
Variable _page_sizes: Undocumented
def closest_paper_size_name(box, within_pt=36.0):

Given a pymupdf Box, tells you the name of the size, and orientation.

Parameters
box: a pymupdf Box, e.g. a_page.cropbox
within_pt: the amount the size may be off by. (Default is 36pt, which is ~12mm / 0.5 inch, which is perhaps overly flexible)
Returns

something like ('A4', 'portrait', 1, 0), which is:

  • the name of the size (currently 'A4', 'Letter', or 'other')
  • whether it's in 'portrait' or 'landscape'
  • the size mismatch to that size, width-wise and height-wise (if name is not 'other', this will be lower than within_pt)
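
For example, a minimal sketch (assuming this module is importable as wetsuite.datacollect.pdf; the filename is a placeholder):

    import pymupdf                                   # in older PyMuPDF versions: import fitz
    from wetsuite.datacollect.pdf import closest_paper_size_name

    with pymupdf.open('example.pdf') as doc:         # 'example.pdf' is a placeholder
        name, orientation, dw, dh = closest_paper_size_name(doc[0].cropbox)
        print(name, orientation)                     # e.g.: A4 portrait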
def count_pages_with_embedded_text(pdf, char_threshold=200):

Counts the number of pages that have a reasonable amount of embedded text on them.

Intended to help detect PDFs that are partly or fully images-of-text instead.

Counts characters per page, plus spaces between words, after strip()ping edge whitespace; TODO: think about that more.

Parameters
pdf: either:
  • PDF file data (as bytes)
  • the output of page_embedded_text_generator()
char_threshold: how long the text on a page should be, in characters, after strip()ping. Defaults to 200, which is maybe 50 words.
Returns
(list_of_number_of_chars_per_page, num_pages_with_text, num_pages)
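
A sketch of using this to spot PDFs that may need OCR (placeholder filename):

    from wetsuite.datacollect.pdf import count_pages_with_embedded_text

    with open('maybe-scanned.pdf', 'rb') as f:
        pdfbytes = f.read()
    chars_per_page, num_with_text, num_pages = count_pages_with_embedded_text(pdfbytes)
    if num_with_text < num_pages:
        print('%d of %d pages look like they may need OCR' % (num_pages - num_with_text, num_pages))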
def do_page_sizes_vary(pdf, allowance_pt=36):

Given a parsed pymupdf document object, tells us whether the pages (specifically their CropBox) vary in size at all.

Meant to help detect PDFs composited from multiple sources.

Parameters
pdf: the document under test (as bytes, or an already-parsed Document object)
allowance_pt: the maximum height or width difference between largest and smallest, in pt (default is 36, which is ~12mm)
Returns
a 3-tuple: ( whether there is more than allowance_pt difference, amount of width difference, amount of height difference )
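
For example (a sketch; placeholder filename):

    from wetsuite.datacollect.pdf import do_page_sizes_vary

    with open('composited.pdf', 'rb') as f:
        pdfbytes = f.read()
    varies, width_diff, height_diff = do_page_sizes_vary(pdfbytes)
    if varies:
        print('Page sizes vary by up to %.1fpt x %.1fpt' % (width_diff, height_diff))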
def document_fragments(pdf, hint_structure=True, debug=True):

Tries to be slightly smarter than page_embedded_text_generator, making it slightly easier to see paragraphs and _maybe_ headings.

Set up to do some more analysis than e.g. page_embedded_fragments does.

Note that this is the implementation of split.Fragments_PDF_Fallback, so when changing things, consider side effects there.

Parameters
pdf: the PDF to process (as bytes, or an already-opened pymupdf Document; see _open_pdf)
hint_structure:
  • if True, returns the structure internal to this function
  • if False, returns a text string.
debug: Undocumented
Returns
  • a list of strings, or
  • a list of (hintdict, emptydict, textfragment) (the empty dict is for drop-in use in the split module)
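
A sketch of walking the hinted structure (placeholder filename):

    from wetsuite.datacollect.pdf import document_fragments

    with open('example.pdf', 'rb') as f:
        pdfbytes = f.read()
    for hintdict, _emptydict, textfragment in document_fragments(pdfbytes, hint_structure=True):
        print(hintdict, textfragment[:60])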
def embedded_or_ocr_perpage(pdf, char_threshold=30, dpi=150, cache_store=None, use_gpu=False):

For a PDF, walks through its pages:

  • if it reports having text, use that text
  • if it does not report having text, render it as an image and run OCR

...and relies on our own wetsuite.extras.ocr to do so.

compare with

  • pdf_text_ocr(), which applies OCR to all pages

For context:

When given a PDF, you can easily decide to

  • get all embedded text (and perhaps OCR the pages that come out empty), in which case page_embedded_text_generator() does most of what you want; this is fast, and as precise as that embedded text is.
  • OCR all pages, in which case wetsuite.extras.ocr.easyocr and .easyocr_toplaintext() might do what you want.

The limitation is that you generally don't know what is in a PDF, so:

  • you need to write that fallback yourself (just a few lines)
  • if you OCR everything for thoroughness, you might end up with lower-quality OCR for pages that already contained good-quality text
  • you still can't deal with PDFs composited from sources that contain embedded text as well as images of text; these are relatively rare, but definitely happen.

This is mostly a convenience function to make your life simpler: it does that fallback, and it does it per PDF page.

This should be a decent balance of fast and precise when we have embedded text, and best-effort for pages that might contain images of text.

CONSIDER: rewriting this after restructuring the ocr interface.

Parameters
pdf: the PDF to process (as bytes, or an already-opened pymupdf Document)
char_threshold: the minimum amount of embedded text on a page, in characters, for us to use that text rather than fall back to OCR
dpi: the resolution to render a page at when we do fall back to OCR
cache_store: Undocumented
use_gpu: whether to ask the OCR backend to use a GPU
Returns
a list (one item for each page) of 2-tuples (first is 'embedded' or 'ocr', second item is the flattened text)
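
For example (a sketch; pages that fall back to OCR will be much slower):

    from wetsuite.datacollect.pdf import embedded_or_ocr_perpage

    with open('mixed.pdf', 'rb') as f:
        pdfbytes = f.read()
    for how, text in embedded_or_ocr_perpage(pdfbytes, dpi=150):
        print('%-8s %d chars' % (how, len(text)))    # how is 'embedded' or 'ocr'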
def page_as_image(page, dpi=150):

Takes a single pymupdf Page object, and renders it as a PIL color Image at a specific resolution.
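
A sketch (placeholder filenames):

    import pymupdf
    from wetsuite.datacollect.pdf import page_as_image

    with pymupdf.open('example.pdf') as doc:
        image = page_as_image(doc[0], dpi=150)   # a PIL Image
        image.save('page-000.png')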

def page_embedded_as_xhtml(page):

Extracts fragments using PyMuPDF's xhtml-style extraction, which analyzes some basic paragraphs and headers so that we don't have to (but also removes the low-level information it based that on).

Parameters
page: a pymupdf Page object
Returns
a string of XHTML for this page
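
For example (a sketch):

    import pymupdf
    from wetsuite.datacollect.pdf import page_embedded_as_xhtml

    with pymupdf.open('example.pdf') as doc:
        xhtml = page_embedded_as_xhtml(doc[0])
        print(xhtml[:300])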
def page_embedded_fragments(page, join=True):

Quick 'get fragments of text from a page', relying on some pymupdf analysis.

Note: does less processing than document_fragments, and defaults to simpler output (a string or list of strings, not the hint structure that document_fragments gives). CONSIDER: making them work the same.

Parameters
page: a pymupdf Page object
join: if True, we return a single string; if False, a list of strings.
Returns
a single string (often with newlines), or a list of parts.
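
For example (a sketch):

    import pymupdf
    from wetsuite.datacollect.pdf import page_embedded_fragments

    with pymupdf.open('example.pdf') as doc:
        as_one_string = page_embedded_fragments(doc[0])              # join=True (default)
        as_parts      = page_embedded_fragments(doc[0], join=False)  # list of strings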
def page_embedded_text_generator(pdf, option='text'):

Takes PDF file data, yields a page's worth of its embedded text at a time (is a generator), according to the text objects in the PDF stream.

...which are essentially a "please render this text", but note that this is not 1:1 with the text you see, or as coherent as the way you would naturally read it. So we ask the library to sort the text fragments into reading order, which is usually roughly right, but far from perfect.

Note that this is comparable with page_embedded_as_xhtml(); under the covers it is almost the same call but asks the library for plain text instead.
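
A sketch (placeholder filename):

    from wetsuite.datacollect.pdf import page_embedded_text_generator

    with open('example.pdf', 'rb') as f:
        pdfbytes = f.read()
    for page_number, page_text in enumerate(page_embedded_text_generator(pdfbytes)):
        print('page %d: %d characters' % (page_number, len(page_text)))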

def pages_as_images(pdf, dpi=150):

Takes a PDF bytes document, yields one page at a time as a PIL image object.

Parameters
pdf: PDF file contents as a bytes object, or an already-opened fitz Document
dpi: the resolution to render at. Higher is slower, and not necessarily much better; in fact there are cases where higher is worse. 150 to 200 seems a good tradeoff.
Returns
a generator yielding images, one page at a time (because consider what a 300-page PDF would do to RAM use)
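
For example (a sketch; placeholder filenames):

    from wetsuite.datacollect.pdf import pages_as_images

    with open('example.pdf', 'rb') as f:
        pdfbytes = f.read()
    for page_number, image in enumerate(pages_as_images(pdfbytes, dpi=150)):
        image.save('page-%03d.png' % page_number)   # one file per page, keeps RAM use low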
def pdf_text_ocr(filedata, use_gpu=True):

Use only OCR to process a PDF (no attempt is made to use text from PDF objects).

Mostly a call into wetsuite.extras.ocr, and so relies on that module.

This currently uses:

  • wetsuite.datacollect.pdf.pages_as_images()
  • wetsuite.extras.ocr.easyocr()

and is also:

  • slow (might take a minute or two per document) - consider caching the result
  • not clever in any way

so probably ONLY use this if

  • extracting text objects (e.g. page_embedded_text_generator) gave you nothing
  • you only care about what words exist, not about document structure
Parameters
filedata: PDF file data, as bytes
use_gpu: whether to ask the OCR backend to use a GPU
Returns
all text, as a single string.
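
A sketch (placeholder filename; note the slowness warning above):

    from wetsuite.datacollect.pdf import pdf_text_ocr

    with open('scanned.pdf', 'rb') as f:
        pdfbytes = f.read()
    text = pdf_text_ocr(pdfbytes, use_gpu=False)    # slow; consider caching this result
    print(text[:500])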
def _open_pdf(pdf):

Helper function that lets varied functions deal with any of:

  • an already-opened pymupdf Document
  • file as bytes
  • CONSIDER: or a filename
_html_header_tag_names: tuple[str, ...] =

Undocumented

_page_sizes: tuple[tuple, ...] =

Undocumented