Extract text from images, mainly aimed at PDFs that contain pictures of documents.
Largely a wrapper for an OCR package, currently just EasyOCR. TODO: add tesseract, e.g. via https://github.com/sirfz/tesserocr
Ideally, TODO: add an interface in front of both EasyOCR and tesseract (and maybe, in terms of 'text fragment placed here', also pymupdf) so that the helper functions make equal sense for each.
Function | bbox |
Calculate a bounding box's height. |
Function | bbox |
maximum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code |
Function | bbox |
maximum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code |
Function | bbox |
minimum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code |
Function | bbox |
minimum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code |
Function | bbox |
Calculate a bounding box's width. |
Function | bbox |
Calculate a bounding box's X and Y extents. |
Function | doc |
Like page_extent(), but considering all pages at once, mostly to get the overall margins, and e.g. avoid doing weird things on a last half-filled page. |
Function | easyocr |
Takes an image, returns structured OCR results as a specific python struct. |
Function | easyocr |
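Given a PIL image and the results from ocr(), draws the bounding boxes, with color indicating the confidence. |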
Function | easyocr |
Takes intermediate results with boxes and, at least for now, smushes the text together as-is, without much care about placement. |
Function | ocr |
This is a convenience function that uses OCR to get text from all of a PDF document, returning it in a per-page, structured way. |
Function | page |
Given a page's worth of OCR results, return list of X, and list of Y coordinates, meant for e.g. statistics use. |
Function | page |
Estimates the bounds that contain most of the page contents (considers all bbox x and y coordinates) |
Function | page |
Searches for specific text patterns on specific parts of pages. |
Variable | _easyocr |
Undocumented |
Variable | _easyocr |
Undocumented |
Calculate a bounding box's height.
Parameters | |
bbox | a bounding box, as a 4-tuple (tl,tr,br,bl) |
Returns | |
the bounding box's height |
maximum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Parameters | |
bbox | a bounding box, as a 4-tuple (tl,tr,br,bl) |
Returns | |
the bounding box's maximum x coordinate |
maximum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Parameters | |
bbox | a bounding box, as a 4-tuple (tl,tr,br,bl) |
Returns | |
the bounding box's maximum y coordinate |
minimum X coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Parameters | |
bbox | a bounding box, as a 4-tuple (tl,tr,br,bl) |
Returns | |
the bounding box's minimum x coordinate |
minimum Y coordinate - redundant with bbox_xy_extent, but sometimes more readable in code
Parameters | |
bbox | a bounding box, as a 4-tuple (tl,tr,br,bl) |
Returns | |
the bounding box's minimum y coordinate |
Calculate a bounding box's width.
Parameters | |
bbox | a bounding box, as a 4-tuple (tl,tr,br,bl) |
Returns | |
the bounding box's width |
Calculate a bounding box's X and Y extents.
Parameters | |
bbox | a bounding box, as a 4-tuple (tl,tr,br,bl) |
Returns | |
the bounding box's (min(x), max(x), min(y), max(y)) |
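The bbox helpers above all reduce to taking minima and maxima over the four corner points. A minimal sketch of that idea follows; the `sketch_` function names are hypothetical (not this module's names), and it assumes each corner is an (x, y) pair, as in EasyOCR's results.

```python
def sketch_bbox_xy_extent(bbox):
    """Return (min_x, max_x, min_y, max_y) for a 4-corner bounding box.

    Assumes bbox is (tl, tr, br, bl), each corner being an (x, y) pair.
    """
    xs = [point[0] for point in bbox]
    ys = [point[1] for point in bbox]
    return min(xs), max(xs), min(ys), max(ys)


def sketch_bbox_width(bbox):
    """Width as the distance between the smallest and largest x."""
    min_x, max_x, _, _ = sketch_bbox_xy_extent(bbox)
    return max_x - min_x


def sketch_bbox_height(bbox):
    """Height as the distance between the smallest and largest y."""
    _, _, min_y, max_y = sketch_bbox_xy_extent(bbox)
    return max_y - min_y


# A slightly skewed box, corners in (tl, tr, br, bl) order.
example_bbox = ([10, 20], [110, 22], [111, 52], [9, 50])
print(sketch_bbox_xy_extent(example_bbox))
print(sketch_bbox_width(example_bbox), sketch_bbox_height(example_bbox))
```

Using min/max over all four corners, rather than trusting the corner order, keeps the result sensible for boxes that are slightly rotated or skewed.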
Like page_extent(), but considering all pages at once, mostly to get the overall margins, and e.g. avoid doing weird things on a last half-filled page.
Note that if many pages have little on them, this is somewhat fragile.
Parameters | |
list | A list of (bbox, text, cert). |
percentile | Undocumented |
percentile | Undocumented |
Returns | |
(page_min_x, page_max_x, page_min_y, page_max_y) which, note, might not be exactly what you expected |
Takes an image, returns structured OCR results as a specific python struct.
Requires easyocr to be installed. Will load easyocr's model on the first call, so try to do many calls from a single process, so that you pay that overhead only once.
CONSIDER: pass through kwargs to readtext()
CONSIDER: fall back to CPU if GPU init fails
Parameters | |
image | a single PIL image. |
pythontypes | if pythontypes==False, easyocr gives you numpy.int64 in bbox and numpy.float64 for confidence; if pythontypes==True (default), we convert those to python int and float for you before returning |
use | whether to use GPU (True), or CPU (False). Only does anything on the first call; later calls rely on that first choice. GPU is generally a factor faster than a single CPU core (in quick tests, 3 to 4 times), so you may prefer GPU unless you don't have one, or don't want to compete with other GPU use. |
languages | what languages to detect. Defaults to 'nl','en'. You might occasionally wish to add 'fr'. |
debug | Undocumented |
Returns | |
a list of [[topleft, topright, botright, botleft], text, confidence] (which are EasyOCR's results) |
Given
- a PIL image (the image you handed into OCR),
- the results from ocr()
draws the bounding boxes, with color indicating the confidence.
Made for inspection of how much OCR picks up, and what it might have trouble with.
Parameters | |
image | the image that you ran ocr() on |
ocr | the output of ocr() |
Returns | |
a copy of the input image with boxes drawn on it |
Takes intermediate results with boxes and, at least for now, smushes the text together as-is, without much care about placement.
This is currently NOT enough to be decent processing, and we plan to be smarter than this, given time.
There is some smarter code in the kansspelautoriteit fetching notebook.
CONSIDER centralizing that and/or 'natural reading order' code
Parameters | |
results | the output of ocr() |
Returns | |
plain text |
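As an illustration of what "smushing together as-is" can amount to, here is a minimal sketch. The function name and the newline separator are assumptions for illustration; the module's own joining rule is not documented here.

```python
def sketch_to_plaintext(results, separator="\n"):
    """Join fragment texts in the order OCR returned them.

    results: list of (bbox, text, confidence), as returned by ocr().
    Placement (bbox) and confidence are deliberately ignored, which is
    why this is not decent processing for multi-column layouts.
    """
    return separator.join(text for _bbox, text, _confidence in results)


example_results = [
    ([[0, 0], [40, 0], [40, 10], [0, 10]], "Hello", 0.98),
    ([[0, 20], [45, 20], [45, 30], [0, 30]], "world", 0.91),
]
print(sketch_to_plaintext(example_results))
```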
This is a convenience function that uses OCR to get text from all of a PDF document, returning it in a per-page, structured way.
More precisely, it
- iterates through a PDF one page at a time,
- renders that page to an image,
- runs OCR on that page image.
This depends on another of our modules (pdf), and on pymupdf.
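The three steps above can be sketched roughly as follows. This is not this module's implementation: the function names are hypothetical, the rendering uses PyMuPDF and Pillow directly rather than the pdf module, and the return value is simplified to one result list per page instead of the structure the real function returns.

```python
def sketch_render_pages(pdfbytes, dpi=200):
    """Yield one PIL image per PDF page (requires PyMuPDF and Pillow)."""
    import fitz  # PyMuPDF; imported lazily so the rest works without it
    from PIL import Image

    with fitz.open(stream=pdfbytes, filetype="pdf") as document:
        for page in document:
            pixmap = page.get_pixmap(dpi=dpi)  # RGB, no alpha by default
            yield Image.frombytes(
                "RGB", (pixmap.width, pixmap.height), pixmap.samples
            )


def sketch_ocr_pages(page_images, ocr_image):
    """Run an OCR callable on each page image, keeping results per page.

    ocr_image is any callable taking an image and returning a list of
    (bbox, text, confidence); passing it in keeps the loop testable.
    """
    return [ocr_image(image) for image in page_images]


# Usage (assumes pdfbytes holds a PDF and ocr is this module's OCR call):
# per_page = sketch_ocr_pages(sketch_render_pages(pdfbytes, dpi=200), ocr)
```

Because the renderer is a generator, only one page image needs to be held in memory at a time.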
Parameters | |
pdfbytes | Undocumented |
dpi | resolution to render the pages at, before OCRing them. Optimal may be around 200ish? (TODO: test) |
use | Undocumented |
page | CONSIDER: allowing caching of the results of the easyocr calls into a store |
verbose | Undocumented |
Returns | |
a 2-tuple:
|
Given a page's worth of OCR results, return list of X, and list of Y coordinates, meant for e.g. statistics use.
Parameters | |
page | a page's worth of OCR results: a list of (bbox, text, cert) |
Returns | |
( all x list, all y list ) |
Estimates the bounds that contain most of the page contents (considers all bbox x and y coordinates).
'Most' in that we use the 1st and 99th percentiles (by default) - may need tweaking
Parameters | |
page | A list of (bbox, text, cert). |
percentile | |
percentile | |
Returns | |
(page_min_x, page_max_x, page_min_y, page_max_y) which, note, might not be exactly what you expected |
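To make the percentile idea concrete, here is a minimal pure-Python sketch. The function and parameter names are hypothetical, and the nearest-rank percentile is an assumption; the real code may use a different percentile definition (e.g. an interpolating one).

```python
import math


def nearest_rank_percentile(values, percentile):
    """Nearest-rank percentile of a non-empty list (no interpolation)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1]


def sketch_page_extent(page_results, low_percentile=1, high_percentile=99):
    """Return (min_x, max_x, min_y, max_y) covering most page content.

    page_results: list of (bbox, text, confidence); each bbox is four
    (x, y) corners. Every corner coordinate is pooled before taking
    percentiles, so a few stray fragments barely move the result.
    """
    xs, ys = [], []
    for bbox, _text, _confidence in page_results:
        for x, y in bbox:
            xs.append(x)
            ys.append(y)
    return (
        nearest_rank_percentile(xs, low_percentile),
        nearest_rank_percentile(xs, high_percentile),
        nearest_rank_percentile(ys, low_percentile),
        nearest_rank_percentile(ys, high_percentile),
    )


example_page = [
    ([[0, 0], [10, 0], [10, 5], [0, 5]], "a", 0.9),
    ([[20, 30], [40, 30], [40, 35], [20, 35]], "b", 0.9),
]
print(sketch_page_extent(example_page))
```

With only a handful of fragments the 1st and 99th percentiles collapse to the plain minimum and maximum; the trimming only starts to matter on pages with many fragments.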
Searches for specific text patterns on specific parts of pages.
Takes the fragments from a single page (CONSIDER: making a doc_fragment_filter).
This is sometimes overkill, but for some uses this is easier, in particular the first one it was written for: trying to find the size of the header and footer, to be able to ignore them.
q_{min,max}_{x,y} can be
- floats, taken as relative to the height and width of the text present within the page (the default) or within the document, if you hand in the document extent via extent (which can make more sense when first and last pages are only half filled),
- otherwise assumed to be ints, in absolute units (which are likely to be pixels and depend on the DPI).
Parameters | |
page | |
textre | include only fragments that match this regular expression |
q | helps restrict where on the page we search (see notes above) |
q | helps restrict where on the page we search (see notes above) |
q | helps restrict where on the page we search (see notes above) |
q | helps restrict where on the page we search (see notes above) |
extent | defines the extent (minx, miny, maxx, maxy) of the page which, note, is ONLY used when q_ are floats. |
verbose | say what we're including/excluding and why |
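The float-versus-int handling described above can be sketched as follows. This is a minimal illustration, not this module's code: the names are hypothetical, the default extent is the plain min/max of the page's fragments (not a percentile estimate), a fragment is kept only when its whole box lies inside the limits, and `extent` is assumed to be ordered (min_x, max_x, min_y, max_y), so check the real order before relying on this.

```python
import re


def sketch_resolve_bound(q, low, high):
    """Floats are fractions of [low, high]; ints are absolute units."""
    if q is None:
        return None
    if isinstance(q, float):
        return low + q * (high - low)
    return q


def sketch_page_fragment_filter(page_results, textre=None, q_min_x=None,
                                q_max_x=None, q_min_y=None, q_max_y=None,
                                extent=None):
    """Keep fragments matching textre that lie within the given limits."""
    if not page_results:
        return []
    if extent is None:  # assumption: fall back to this page's own content
        xs = [x for bbox, _t, _c in page_results for x, _y in bbox]
        ys = [y for bbox, _t, _c in page_results for _x, y in bbox]
        extent = (min(xs), max(xs), min(ys), max(ys))
    min_x, max_x, min_y, max_y = extent
    lim_min_x = sketch_resolve_bound(q_min_x, min_x, max_x)
    lim_max_x = sketch_resolve_bound(q_max_x, min_x, max_x)
    lim_min_y = sketch_resolve_bound(q_min_y, min_y, max_y)
    lim_max_y = sketch_resolve_bound(q_max_y, min_y, max_y)

    kept = []
    for bbox, text, confidence in page_results:
        if textre is not None and re.search(textre, text) is None:
            continue
        frag_xs = [point[0] for point in bbox]
        frag_ys = [point[1] for point in bbox]
        if lim_min_x is not None and min(frag_xs) < lim_min_x:
            continue
        if lim_max_x is not None and max(frag_xs) > lim_max_x:
            continue
        if lim_min_y is not None and min(frag_ys) < lim_min_y:
            continue
        if lim_max_y is not None and max(frag_ys) > lim_max_y:
            continue
        kept.append((bbox, text, confidence))
    return kept


example_page = [
    ([[0, 0], [50, 0], [50, 10], [0, 10]], "Page 1", 0.9),
    ([[0, 100], [100, 100], [100, 120], [0, 120]], "Hello", 0.9),
    ([[0, 190], [60, 190], [60, 200], [0, 200]], "Page footer", 0.9),
]
# Fragments entirely within the top 10% of the page's content height:
top = sketch_page_fragment_filter(example_page, q_max_y=0.1)
print([text for _bbox, text, _confidence in top])
```

For header detection, `q_max_y=0.1` (a float) selects the top tenth of the content, while `q_min_y=150` (an int) would select everything starting at or below y=150 in absolute units.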