wetsuite.extras.ocr

module documentation

Extract text from images, mainly aimed at PDFs that contain _pictures_ of documents, rather than text directly.

Largely a wrapper for an OCR package. Current code is centered specifically around EasyOCR.

We started adding tesseract, but it is more work right now; we should probably make tesseract an equally viable choice.

Function	`bbox_height`	Calculate a bounding box's height.
Function	`bbox_max_x`	maximum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function	`bbox_max_y`	maximum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function	`bbox_min_x`	minimum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function	`bbox_min_y`	minimum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Function	`bbox_width`	Calculate a bounding box's width.
Function	`bbox_xy_extent`	Calculate a bounding box's X and Y extents
Function	`doc_extent`	Like page_extent(), but considering all pages at once, mostly to get the overall margins, and e.g. avoid doing weird things on a last half-filled page.
Function	`easyocr`	Takes an image, returns structured OCR results as a specific python struct.
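Function	`easyocr_draw_eval`	Given a PIL image and the OCR results for it, draws the bounding boxes, with color indicating the confidence.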
Function	`easyocr_toplaintext`	Take intermediate results with boxes and, at least for now, smushes the text together as-is, without much care about placement.
Function	`ocr_pdf_pages`	This is a convenience function that wraps EasyOCR to get text from all of a PDF document, returning it in a per-page, structured way.
Function	`page_allxy`	Given a page's worth of OCR results, return list of X, and list of Y coordinates, meant for e.g. statistics use.
Function	`page_extent`	Estimates the bounds that contain most of the page contents (considers all bbox x and y coordinates)
Function	`page_fragment_filter`	Searches for specific text patterns on specific parts of pages.
Function	`tesseract_plain`	Run tesseract OCR on an image, give results as plain text.
Variable	`_easyocr_reader_cpu`	Undocumented
Variable	`_easyocr_reader_gpu`	Undocumented

def bbox_height(bbox):

Calculate a bounding box's height.

Parameters
bbox	a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's height

def bbox_max_x(bbox):

maximum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox	a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's maximum x coordinate

def bbox_max_y(bbox):

maximum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox	a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's maximum y coordinate

def bbox_min_x(bbox):

minimum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox	a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's minimum x coordinate

def bbox_min_y(bbox):

minimum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code

Parameters
bbox	a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's minimum y coordinate

def bbox_width(bbox):

Calculate a bounding box's width.

Parameters
bbox	a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's width

def bbox_xy_extent(bbox):

Calculate a bounding box's X and Y extents

Parameters
bbox	a bounding box, as a 4-tuple (tl,tr,br,bl)
Returns
the bounding box's (min(x), max(x), min(y), max(y))
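
The bbox helpers above can be pictured with a minimal sketch. It assumes each corner is an (x, y) pair, as in EasyOCR's output; the function names `xy_extent`, `width` and `height` and the example box are made up for illustration and are not this module's implementation.

```python
def xy_extent(bbox):
    """Return (min_x, max_x, min_y, max_y) for a 4-point bounding box."""
    xs = [point[0] for point in bbox]
    ys = [point[1] for point in bbox]
    return min(xs), max(xs), min(ys), max(ys)


def width(bbox):
    """Horizontal span of the box, from its extents."""
    min_x, max_x, _, _ = xy_extent(bbox)
    return max_x - min_x


def height(bbox):
    """Vertical span of the box, from its extents."""
    _, _, min_y, max_y = xy_extent(bbox)
    return max_y - min_y


# Made-up, slightly skewed box; corners in the order (tl, tr, br, bl)
example_bbox = ([10, 20], [110, 22], [110, 52], [10, 50])

print(xy_extent(example_bbox))                    # (10, 110, 20, 52)
print(width(example_bbox), height(example_bbox))  # 100 32
```

Taking min and max over all four corners, rather than trusting the corner order, keeps the result sensible for boxes that are slightly rotated.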

def doc_extent(list_of_page_easyocr_fragments, percentile_x=(1, 99), percentile_y=(1, 99)):

Like page_extent(), but considering all pages at once, mostly to get the overall margins, and e.g. avoid doing weird things on a last half-filled page.

Note that if many pages have little on them, this is somewhat fragile

Parameters
list_of_page_easyocr_fragments	A list of ( A list of (bbox, text, cert) ) for each page
percentile_x	Undocumented
percentile_y	Undocumented
Returns
(page_min_x, page_max_x, page_min_y, page_max_y) which, note, might not be exactly what you expected

def easyocr(image, pythontypes=True, use_gpu=True, languages=('nl', 'en'), **kwargs):

Takes an image, returns structured OCR results as a specific python struct.

Requires easyocr being installed. Will load easyocr's model on the first call, so try to do many calls from a single process to reduce that overhead to just once.

CONSIDER: pass through kwargs to readtext()

CONSIDER: fall back to CPU if GPU init fails

Parameters
image	a single PIL image.
pythontypes	if pythontypes==False, easyocr gives you numpy.int64 in bbox and numpy.float64 for confidence, if pythontypes==True (default), we make that python int and float for you before returning
use_gpu	whether to use GPU (True) or CPU (False). Only takes effect on the first call; later calls rely on that choice. GPU is generally several times faster than a single CPU core (3 to 4 times in quick tests), so you may prefer GPU unless you don't have one, or don't want runtime competition with other GPU use.
languages	what languages to detect. Defaults to 'nl','en'. You might occasionally wish to add 'fr'.
**kwargs	other keyword arguments are passed through to easyocr's `reader.readtext` call.
Returns
a list of `[[topleft, topright, botright, botleft], text, confidence]` (which are EasyOCR's results)
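
As a hedged illustration of working with that return structure, the sketch below filters fragments by confidence. The sample data is invented, the helper `confident_fragments` and its 0.5 threshold are hypothetical, and the commented-out call only shows where real results would come from.

```python
# In real use the results would come from something like:
#   results = wetsuite.extras.ocr.easyocr(pil_image, use_gpu=False)
# Here we use invented data in the documented shape:
#   [[topleft, topright, botright, botleft], text, confidence]
sample_results = [
    [[[10, 10], [200, 10], [200, 40], [10, 40]], "Besluit", 0.98],
    [[[10, 60], [180, 60], [180, 90], [10, 90]], "van 3 mei", 0.41],
    [[[10, 110], [90, 110], [90, 140], [10, 140]], "2021", 0.87],
]


def confident_fragments(results, min_confidence=0.5):
    """Keep only fragments whose confidence is at least min_confidence."""
    return [
        (bbox, text, confidence)
        for bbox, text, confidence in results
        if confidence >= min_confidence
    ]


kept = confident_fragments(sample_results)
print([text for _bbox, text, _confidence in kept])  # ['Besluit', '2021']
```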

def easyocr_draw_eval(image, ocr_results):

Given

- a PIL image (the image you handed into OCR),
- the results from ocr()

draws the bounding boxes, with color indicating the confidence.

Made for inspection of how much OCR picks up, and what it might have trouble with.

Parameters
image	the image that you ran ocr() on
ocr_results	the output of ocr()
Returns
a copy of the input image with boxes drawn on it

def easyocr_toplaintext(results):

Take intermediate results with boxes and, at least for now, smushes the text together as-is, without much care about placement.

This is currently NOT enough to be decent processing, and we plan to be smarter than this, given time.

For an idea of how to start doing this more cleverly, look at the kansspelautoriteit data collection notebook.

CONSIDER centralizing that and/or 'natural reading order' code

Parameters
results	the output of ocr()
Returns
plain text
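
A rough sketch of what "smushing together" could look like, next to one naive step toward reading order. The newline separator, the helper names and the sample data are assumptions for illustration; the sort by top-then-left is not the module's approach, and it would fail on multi-column layouts.

```python
def fragments_to_text(results, separator="\n"):
    """Join fragment texts in the order given, ignoring placement."""
    return separator.join(text for _bbox, text, _confidence in results)


def naive_reading_order(results):
    """Sort fragments top-to-bottom, then left-to-right (single column only)."""
    def sort_key(fragment):
        bbox = fragment[0]
        top = min(y for _x, y in bbox)
        left = min(x for x, _y in bbox)
        return (top, left)
    return sorted(results, key=sort_key)


# Invented fragments, deliberately out of order
sample = [
    [[[10, 60], [180, 60], [180, 90], [10, 90]], "second line", 0.9],
    [[[10, 10], [200, 10], [200, 40], [10, 40]], "first line", 0.9],
]

print(fragments_to_text(sample))
print(fragments_to_text(naive_reading_order(sample)))
```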

def ocr_pdf_pages(pdfbytes, dpi=150, use_gpu=True, page_cache=None, verbose=True):

This is a convenience function that wraps EasyOCR to get text from all of a PDF document, returning it in a per-page, structured way.

More precisely, it iterates through a PDF one page at a time, and for each page

- renders that page to an image,
- runs EasyOCR on that page image,
- collects the result data for that page.

This depends on another of our modules (pdf), and pymupdf

Parameters
pdfbytes	Undocumented
dpi	resolution to render the pages at, before OCRing them. Optimal may be around 200ish? (TODO: test)
use_gpu	Undocumented
page_cache	CONSIDER: allowing cacheing the result of the easyocr calls into a store
verbose	Undocumented
Returns
a 2-tuple:
- a list of the per-page results that easyocr() outputs
- a list of "all text on all pages" strings (specifically, fed through the simple-and-stupid easyocr_toplaintext()). Less structure, and redundant with the first returned, but means less typing for some uses.
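
A possible way to call this, sketched under assumptions: that `pdfbytes` is the raw content of the PDF file, and that the return value unpacks into the two lists described above. The wrapper `ocr_pdf_file` is a hypothetical helper, not part of the module.

```python
def ocr_pdf_file(path, dpi=150, use_gpu=False):
    """OCR every page of the PDF at `path` (hypothetical helper)."""
    # Imported lazily: this needs wetsuite, easyocr and pymupdf installed.
    import wetsuite.extras.ocr

    with open(path, "rb") as pdf_file:
        pdfbytes = pdf_file.read()  # assumption: raw file content is expected

    # Assumption: a 2-tuple of (per-page structured results, per-page text)
    structured_pages, page_texts = wetsuite.extras.ocr.ocr_pdf_pages(
        pdfbytes, dpi=dpi, use_gpu=use_gpu, verbose=False
    )
    return structured_pages, page_texts


# Example call; "scan.pdf" is a placeholder path:
#   structured_pages, page_texts = ocr_pdf_file("scan.pdf")
#   print(page_texts[0])
```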

def page_allxy(page_easyocr_fragments):

Given a page's worth of OCR results, return list of X, and list of Y coordinates, meant for e.g. statistics use.

Parameters
page_easyocr_fragments	A list of (bbox, text, cert).
Returns
( all x list, all y list )
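
The idea can be sketched as follows, assuming fragments shaped like (bbox, text, cert) with (x, y) corners. The name `all_xy` and the sample page are invented for illustration.

```python
import statistics


def all_xy(page_fragments):
    """Collect every corner's x and y coordinate into two flat lists."""
    xs, ys = [], []
    for bbox, _text, _confidence in page_fragments:
        for x, y in bbox:
            xs.append(x)
            ys.append(y)
    return xs, ys


# Invented page with two fragments
sample_page = [
    [[[10, 10], [200, 10], [200, 40], [10, 40]], "first", 0.9],
    [[[10, 60], [180, 60], [180, 90], [10, 90]], "second", 0.8],
]

xs, ys = all_xy(sample_page)
# The flat lists feed directly into simple statistics
print(min(xs), max(xs), statistics.median(xs))
```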

def page_extent(page_easyocr_fragments, percentile_x=(1, 99), percentile_y=(1, 99)):

Estimates the bounds that contain most of the page contents (considers all bbox x and y coordinates)

'Most' in that we use the 1st and 99th percentiles (by default) - may need tweaking

Parameters
page_easyocr_fragments	A list of (bbox, text, cert).
percentile_x
percentile_y
Returns
(page_min_x, page_max_x, page_min_y, page_max_y) which, note, might not be exactly what you expected
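
To make the percentile idea concrete, here is a standard-library sketch. It uses a nearest-rank percentile, which is an assumption; the module may well compute percentiles differently (for example with interpolation), so exact numbers can differ. All names and data are invented.

```python
import math


def nearest_rank_percentile(values, percentile):
    """Nearest-rank percentile of a non-empty list (percentile in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def extent_from_fragments(page_fragments, percentile_x=(1, 99), percentile_y=(1, 99)):
    """Return (min_x, max_x, min_y, max_y), trimmed to the given percentiles."""
    xs = [x for bbox, _t, _c in page_fragments for x, _y in bbox]
    ys = [y for bbox, _t, _c in page_fragments for _x, y in bbox]
    return (
        nearest_rank_percentile(xs, percentile_x[0]),
        nearest_rank_percentile(xs, percentile_x[1]),
        nearest_rank_percentile(ys, percentile_y[0]),
        nearest_rank_percentile(ys, percentile_y[1]),
    )


# Invented page: 30 body lines plus one stray mark in the top-right corner
body = [
    [[[50, 100 + 30 * i], [500, 100 + 30 * i], [500, 120 + 30 * i], [50, 120 + 30 * i]], "line", 0.9]
    for i in range(30)
]
stray = [[[[900, 5], [950, 5], [950, 15], [900, 15]], "x", 0.3]]

trimmed = extent_from_fragments(body + stray, percentile_x=(5, 95), percentile_y=(5, 95))
default = extent_from_fragments(body + stray)
print(trimmed)
print(default)
```

On a page with this few fragments, the default 1st and 99th percentiles still include the stray mark in this sketch, while the 5th and 95th exclude it, which is one way the defaults "may need tweaking".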

def page_fragment_filter(page_easyocr_fragments, textre=None, q_min_x=None, q_min_y=None, q_max_x=None, q_max_y=None, extent=None, verbose=False):

Searches for specific text patterns on specific parts of pages.

Takes the fragments from a single page (CONSIDER: making a doc_fragment_filter).

This is sometimes overkill, but for some uses this is easier, in particular the first one it was written for: trying to find the size of the header and footer, to be able to ignore them.

q_{min,max}_{x,y} can be

- floats, relative to the height and width of the text (by default the text present within the page; or within the document, if you hand in the document extent via extent, which can make more sense when first and last pages are half filled),
- otherwise assumed to be ints, in absolute units (which are likely to be pixels and depend on the DPI).

Parameters
page_easyocr_fragments
textre	include only fragments that match this regular expression
q_min_x	helps restrict where on the page we search (see notes above)
q_min_y	helps restrict where on the page we search (see notes above)
q_max_x	helps restrict where on the page we search (see notes above)
q_max_y	helps restrict where on the page we search (see notes above)
extent	defines the extent (minx, miny, maxx, maxy) of the page which, note, is ONLY used when q_ are floats.
verbose	say what we're including/excluding and why
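
The float-versus-int convention can be sketched as below, restricted to the y axis for brevity. This is an illustration under assumptions, not the module's code: it assumes a float means a fraction of the way between the extent's low and high edge, and that a fragment must lie entirely within the bounds to be kept. All names and data are invented.

```python
import re


def resolve_bound(q, low, high):
    """float: fraction between low and high; int: absolute; None: no bound."""
    if q is None:
        return None
    if isinstance(q, float):
        return low + q * (high - low)
    return q


def fragments_in_band(page_fragments, page_min_y, page_max_y,
                      textre=None, q_min_y=None, q_max_y=None):
    """Keep fragments inside the vertical band that also match textre."""
    lower = resolve_bound(q_min_y, page_min_y, page_max_y)
    upper = resolve_bound(q_max_y, page_min_y, page_max_y)
    selected = []
    for bbox, text, confidence in page_fragments:
        ys = [y for _x, y in bbox]
        if lower is not None and min(ys) < lower:
            continue
        if upper is not None and max(ys) > upper:
            continue
        if textre is not None and re.search(textre, text) is None:
            continue
        selected.append((bbox, text, confidence))
    return selected


# Invented page: a header, some body text, a page number
sample_page = [
    [[[50, 20], [300, 20], [300, 40], [50, 40]], "Staatscourant 2021", 0.95],
    [[[50, 400], [500, 400], [500, 420], [50, 420]], "body text", 0.9],
    [[[250, 960], [300, 960], [300, 980], [250, 980]], "Pagina 3", 0.92],
]

# Float: everything in the top 10% of the text area (y from 20 to 980)
header = fragments_in_band(sample_page, 20, 980, q_max_y=0.1)
# Int: fragments ending in digits, at or below y=900 in absolute units
footer = fragments_in_band(sample_page, 20, 980, textre=r"\d+$", q_min_y=900)

print([text for _b, text, _c in header])  # ['Staatscourant 2021']
print([text for _b, text, _c in footer])  # ['Pagina 3']
```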

def tesseract_plain(image, lang='eng'):

Run tesseract OCR on an image, give results as plain text.

_easyocr_reader_cpu =

Undocumented

_easyocr_reader_gpu =

Undocumented