module documentation

Extract text from images, mainly aimed at PDFs that contain pictures of documents.

Largely a wrapper for an OCR package, currently just EasyOCR. We really should also support tesseract (TODO: add it, see https://github.com/sirfz/tesserocr).

And then, ideally, TODO: add an interface in front of both it and tesseract (and maybe, in terms of 'text fragment placed here', also pymupdf) so that the helper functions make equal sense for each.

Function bbox_height Calculate a bounding box's height.
Function bbox_max_x maximum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function bbox_max_y maximum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function bbox_min_x minimum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function bbox_min_y minimum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function bbox_width Calculate a bounding box's width.
Function bbox_xy_extent Calculate a bounding box's X and Y extents
Function doc_extent Like page_extent(), but considering all pages at once, mostly to get the overall margins, and e.g. avoid doing weird things on a last half-filled page.
Function easyocr Takes an image, returns structured OCR results as a specific python struct.
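Function easyocr_draw_eval Given a PIL image and the OCR results for it, draws the bounding boxes, with color indicating the confidence.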
Function easyocr_toplaintext Take intermediate results with boxes and, at least for now, smushes the text together as-is, without much care about placement.
Function ocr_pdf_pages This is a convenience function that uses OCR to get text from all of a PDF document, returning it in a per-page, structured way.
Function page_allxy Given a page's worth of OCR results, return list of X, and list of Y coordinates, meant for e.g. statistics use.
Function page_extent Estimates the bounds that contain most of the page contents (considers all bbox x and y coordinates)
Function page_fragment_filter Searches for specific text patterns on specific parts of pages.
Variable _easyocr_reader_cpu Undocumented
Variable _easyocr_reader_gpu Undocumented
def bbox_height(bbox):

Calculate a bounding box's height.

Parameters
bbox: a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's height
def bbox_max_x(bbox):

maximum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox: a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's maximum x coordinate
def bbox_max_y(bbox):

maximum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox: a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's maximum y coordinate
def bbox_min_x(bbox):

minimum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox: a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's minimum x coordinate
def bbox_min_y(bbox):

minimum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox: a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's minimum y coordinate
def bbox_width(bbox):

Calculate a bounding box's width.

Parameters
bbox: a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's width
def bbox_xy_extent(bbox):

Calculate a bounding box's X and Y extents.

Parameters
bbox: a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's (min(x), max(x), min(y), max(y))
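
The bbox helpers above can be illustrated with a minimal sketch. It assumes each of the four corners is an (x, y) pair, which is how the results are described elsewhere on this page; it is not necessarily how this module implements them.

```python
def bbox_xy_extent(bbox):
    """Return (min_x, max_x, min_y, max_y) for a 4-tuple of (x, y) corners."""
    xs = [x for x, _y in bbox]
    ys = [y for _x, y in bbox]
    return min(xs), max(xs), min(ys), max(ys)


def bbox_width(bbox):
    """Horizontal extent of the box."""
    min_x, max_x, _min_y, _max_y = bbox_xy_extent(bbox)
    return max_x - min_x


def bbox_height(bbox):
    """Vertical extent of the box."""
    _min_x, _max_x, min_y, max_y = bbox_xy_extent(bbox)
    return max_y - min_y


# A slightly skewed box, corners given as (tl, tr, br, bl)
box = ((10, 20), (110, 22), (111, 52), (11, 50))
print(bbox_xy_extent(box))                # (10, 111, 20, 52)
print(bbox_width(box), bbox_height(box))  # 101 32
```

Taking min and max over all four corners, rather than subtracting two fixed corners, keeps the result sensible for boxes that are not perfectly axis-aligned.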
def doc_extent(list_of_page_ocr_fragments, percentile_x=(1, 99), percentile_y=(1, 99)):

Like page_extent(), but considering all pages at once, mostly to get the overall margins, and e.g. avoid doing weird things on a last half-filled page.

Note that if many pages have little on them, this is somewhat fragile.

Parameters
list_of_page_ocr_fragments: A list of (bbox, text, cert).
percentile_x: Undocumented
percentile_y: Undocumented
Returns
(page_min_x, page_max_x, page_min_y, page_max_y) which, note, might not be exactly what you expected
def easyocr(image, pythontypes=True, use_gpu=True, languages=('nl', 'en'), debug=False):

Takes an image, returns structured OCR results as a specific python struct.

Requires easyocr to be installed. Will load easyocr's model on the first call, so try to do many calls from a single process, so that you pay that overhead just once.

CONSIDER: pass through kwargs to readtext()

CONSIDER: fall back to CPU if GPU init fails

Parameters
image: a single PIL image.
pythontypes: if False, easyocr gives you numpy.int64 in bbox and numpy.float64 for confidence; if True (default), we convert those to python int and float before returning.
use_gpu: whether to use GPU (True) or CPU (False). Only has an effect on the first call; later calls rely on that choice. GPU is generally a few times faster than a single CPU core (3 to 4 times in quick tests), so you may prefer GPU unless you don't have one, or don't want runtime competition with other GPU use.
languages: what languages to detect. Defaults to 'nl','en'. You might occasionally wish to add 'fr'.
debug: Undocumented
Returns
a list of [[topleft, topright, botright, botleft], text, confidence] (which are EasyOCR's results)
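
What pythontypes=True amounts to can be sketched as below. The helper name to_python_types is hypothetical and not part of this module; the sketch only relies on int() and float() accepting numpy scalars as well as plain numbers.

```python
def to_python_types(results):
    """Convert coordinates to int and confidence to float, keeping the structure."""
    converted = []
    for bbox, text, confidence in results:
        corners = [[int(x), int(y)] for x, y in bbox]
        converted.append([corners, text, float(confidence)])
    return converted


# Stand-in for OCR output; real output would carry numpy.int64 / numpy.float64
raw = [([[10, 20], [110, 20], [110, 50], [10, 50]], "Hello", 0.98)]
print(to_python_types(raw))
```

Plain python types are mostly useful when the results need to be serialized, for example to JSON, which does not accept numpy scalars as-is.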
def easyocr_draw_eval(image, ocr_results):

Given

  • a PIL image (the image you handed into OCR),
  • the results from easyocr()

draws the bounding boxes, with color indicating the confidence.

Made for inspection of how much OCR picks up, and what it might have trouble with.

Parameters
image: the image that you ran easyocr() on
ocr_results: the output of easyocr()
Returns
a copy of the input image with boxes drawn on it
def easyocr_toplaintext(results):

Takes intermediate results with boxes and, at least for now, smushes the text together as-is, without much care about placement.

This is currently NOT enough to be decent processing, and we plan to be smarter than this, given time.

There is some smarter code in the kansspelautoriteit fetching notebook.

CONSIDER: centralizing that and/or 'natural reading order' code

Parameters
results: the output of easyocr()
Returns
plain text
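
A minimal sketch of this kind of placement-ignoring join follows. The function name and the newline separator are assumptions for illustration; the description above does not say what the module puts between fragments.

```python
def toplaintext_sketch(results, separator="\n"):
    """Join fragment texts in the order given, ignoring the boxes entirely."""
    return separator.join(text for _bbox, text, _confidence in results)


results = [
    ([[10, 20], [110, 20], [110, 50], [10, 50]], "Hello", 0.98),
    ([[10, 60], [90, 60], [90, 90], [10, 90]], "world", 0.91),
]
print(toplaintext_sketch(results))
```

Because the boxes are ignored, multi-column pages or rotated text come out in whatever order the OCR reported them, which is why the text above calls this not enough for decent processing.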
def ocr_pdf_pages(pdfbytes, dpi=150, use_gpu=True, page_cache=None, verbose=True):

This is a convenience function that uses OCR to get text from all of a PDF document, returning it in a per-page, structured way.

More precisely, it

  • iterates through a PDF one page at a time,
    • renders that page to an image,
    • runs OCR on that page image.

This depends on another of our modules (pdf), and on pymupdf.

Parameters
pdfbytes: Undocumented
dpi: resolution to render the pages at, before OCRing them. Optimal may be around 200ish? (TODO: test)
use_gpu: Undocumented
page_cache: CONSIDER: allowing caching the result of the easyocr calls into a store
verbose: Undocumented
Returns

a 2-tuple:

  • a list of the results that easyocr() outputs, one per page
  • a list of "all text on all pages" strings (specifically, fed through the simple-and-stupid easyocr_toplaintext()). Less structure, and redundant with the first returned, but means less typing for some uses.
def page_allxy(page_ocr_fragments):

Given a page's worth of OCR results, return list of X, and list of Y coordinates, meant for e.g. statistics use.

Parameters
page_ocr_fragments: a page's worth of OCR results, as a list of (bbox, text, cert)
Returns
( all x list, all y list )
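
As a rough sketch of what collecting these coordinates looks like, assuming each fragment is (bbox, text, cert) and each bbox is four (x, y) corners (the function name is a stand-in, not this module's code):

```python
def page_allxy_sketch(page_ocr_fragments):
    """Collect every corner's x and y from every fragment into two flat lists."""
    xs, ys = [], []
    for bbox, _text, _cert in page_ocr_fragments:
        for x, y in bbox:
            xs.append(x)
            ys.append(y)
    return xs, ys


fragments = [
    (((10, 20), (110, 20), (110, 50), (10, 50)), "Hello", 0.98),
    (((10, 60), (90, 60), (90, 90), (10, 90)), "world", 0.91),
]
xs, ys = page_allxy_sketch(fragments)
print(min(xs), max(xs), min(ys), max(ys))  # 10 110 20 90
```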
def page_extent(page_ocr_fragments, percentile_x=(1, 99), percentile_y=(1, 99)):

Estimates the bounds that contain most of the page contents (considers all bbox x and y coordinates).

'Most' in that we use the 1st and 99th percentiles (by default) - may need tweaking

Parameters
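page_ocr_fragments: A list of (bbox, text, cert).
percentile_x: (low, high) percentiles of the x coordinates to use as bounds, by default (1, 99)
percentile_y: (low, high) percentiles of the y coordinates to use as bounds, by default (1, 99)
Returns
(page_min_x, page_max_x, page_min_y, page_max_y) which, note, might not be exactly what you expected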
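
The percentile idea can be sketched without dependencies as below. This uses a simple nearest-rank percentile, which is an assumption made for the sketch; the module may well use an interpolating percentile (for example numpy's), in which case the exact numbers would differ.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def page_extent_sketch(page_ocr_fragments, percentile_x=(1, 99), percentile_y=(1, 99)):
    """Return (min_x, max_x, min_y, max_y), ignoring the most extreme coordinates."""
    xs, ys = [], []
    for bbox, _text, _cert in page_ocr_fragments:
        for x, y in bbox:
            xs.append(x)
            ys.append(y)
    return (percentile(xs, percentile_x[0]), percentile(xs, percentile_x[1]),
            percentile(ys, percentile_y[0]), percentile(ys, percentile_y[1]))


# 120 text lines in one column, plus a small speck near the top-left corner
page = [(((100, 100 + 10 * r), (500, 100 + 10 * r),
          (500, 108 + 10 * r), (100, 108 + 10 * r)), "line", 0.9)
        for r in range(120)]
page.append((((5, 5), (6, 5), (6, 6), (5, 6)), ".", 0.2))

print(page_extent_sketch(page))  # (100, 500, 100, 1288)
```

The speck at (5, 5) no longer drags the minimum down to 5, but the same trimming also cuts slightly into the last text line (1288 rather than 1298), which is one way the result might not be exactly what you expected.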
def page_fragment_filter(page_ocr_fragments, textre=None, q_min_x=None, q_min_y=None, q_max_x=None, q_max_y=None, extent=None, verbose=False):

Searches for specific text patterns on specific parts of pages.

Takes the fragments from a single page (CONSIDER: making a doc_fragment_filter).

This is sometimes overkill, but for some uses this is easier, in particular the first one it was written for: trying to find the size of the header and footer, to be able to ignore them.

q_{min,max}_{x,y} can be

  • floats, which are relative to the height and width of the text: by default the text present within the page, or within the document if you hand in the document extent via extent (which can make more sense when the first and last pages are half filled),
  • otherwise assumed to be ints, in absolute units (which are likely to be pixels, and depend on the DPI).
Parameters
page_ocr_fragments: the OCR fragments of a single page, as a list of (bbox, text, cert)
textre: include only fragments that match this regular expression
q_min_x: helps restrict where on the page we search (see notes above)
q_min_y: helps restrict where on the page we search (see notes above)
q_max_x: helps restrict where on the page we search (see notes above)
q_max_y: helps restrict where on the page we search (see notes above)
extent: defines the extent (minx, miny, maxx, maxy) of the page which, note, is ONLY used when q_ are floats.
verbose: say what we're including/excluding and why
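
The float-versus-int handling of the q_ parameters can be sketched as follows, restricted to the y direction to keep it short. The names resolve, filter_by_y and y_span are hypothetical, and requiring the whole box to lie within the limits is an assumption; the actual function may test fragments differently.

```python
import re


def resolve(q, lo, hi):
    """Turn one q_* value into an absolute coordinate (or None for 'no limit')."""
    if q is None:
        return None
    if isinstance(q, float):   # relative: fraction of the lo..hi span
        return lo + q * (hi - lo)
    return q                   # int: already in absolute units


def filter_by_y(fragments, y_span, textre=None, q_min_y=None, q_max_y=None):
    """Keep fragments whose box lies within the y limits and whose text matches."""
    lo, hi = y_span
    min_y = resolve(q_min_y, lo, hi)
    max_y = resolve(q_max_y, lo, hi)
    kept = []
    for bbox, text, cert in fragments:
        ys = [y for _x, y in bbox]
        if min_y is not None and min(ys) < min_y:
            continue
        if max_y is not None and max(ys) > max_y:
            continue
        if textre is not None and re.search(textre, text) is None:
            continue
        kept.append((bbox, text, cert))
    return kept


page = [
    (((100, 100), (500, 100), (500, 120), (100, 120)), "Annual report 2021", 0.97),
    (((100, 600), (500, 600), (500, 620), (100, 620)), "Body text", 0.95),
    (((280, 1080), (320, 1080), (320, 1100), (280, 1100)), "Page 3", 0.93),
]
# Footer candidates: bottom 10% of the text area, matching "Page <number>"
footer = filter_by_y(page, y_span=(100, 1100), textre=r"Page \d+", q_min_y=0.9)
print([text for _bbox, text, _cert in footer])  # ['Page 3']
```

Passing q_max_y=150 (an int) instead would select by absolute position, which in this example keeps only the header line.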
_easyocr_reader_cpu =

Undocumented

_easyocr_reader_gpu =

Undocumented