module documentation

Various functions that allow you to be (a little too) lazy - less typing and/or less thinking.

This module is a little creative with many of its details, so don't count on those details staying the same, or on reproducibility even if they did.

In part it consists of calls to other parts of wetsuite.

Function etree Parse XML in a bytestring to an ET object. Mostly just ET.fromstring() with namespace stripping (that you can turn off)
Function html_text Takes an HTML file as a bytestring, returns its body text as a string.
Function pdf_embedded_text Given PDF (as a bytestring), returns the plain text it reports to have inside it.
Function pdf_text_ocr Given PDF as a bytestring, OCRs it and reports the text in that. Expect this to not be the cleanest.
Function spacy_parse Takes text and returns a spacy document for it.
Variable _loaded_models Undocumented
def etree(xmlbytes, strip_namespace=True):

Parse XML in a bytestring to an ET object. Mostly just ET.fromstring() with namespace stripping (that you can turn off)

Parameters
xmlbytes: XML document, as bytes object
strip_namespace: whether to strip namespaces from the parsed tree (default: True)
Returns
etree root node
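The namespace stripping can be sketched with the standard library's ElementTree; this is a minimal illustration of what strip_namespace=True amounts to, not necessarily the real implementation (which may use lxml and differ in detail):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    # Walk the tree and drop the '{namespace}' prefix from tag and
    # attribute names, roughly what strip_namespace=True does.
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith('{'):
            el.tag = el.tag.split('}', 1)[1]
        for name in list(el.attrib):
            if name.startswith('{'):
                el.attrib[name.split('}', 1)[1]] = el.attrib.pop(name)
    return root

root = strip_namespaces(ET.fromstring(b'<a xmlns="urn:example"><b>text</b></a>'))
print(root.tag, root[0].tag)  # -> a b
```

After stripping, you can address elements as `root.find('b')` instead of `root.find('{urn:example}b')`, which is the main convenience this buys you.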
def html_text(htmlbytes):

Takes an HTML file as a bytestring, returns its body text as a string.

(note: this is also roughly the implementation of wetsuite.helpers.split.Fragments_HTML_Fallback)
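The general idea can be sketched with the standard library's html.parser; the real implementation likely differs (the note above says it roughly matches wetsuite.helpers.split.Fragments_HTML_Fallback), but the shape is: parse, skip script/style, collect text, normalize whitespace:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    # Collects text content, skipping script/style contents.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1
    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_text(htmlbytes):
    # Assumption for this sketch: UTF-8 input; the real function
    # probably does smarter encoding detection.
    parser = _TextExtractor()
    parser.feed(htmlbytes.decode('utf-8', errors='replace'))
    return ' '.join(' '.join(parser.parts).split())

print(html_text(b'<html><body><p>Hello <b>world</b></p></body></html>'))  # -> Hello world
```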

def pdf_embedded_text(pdfbytes, page_join='\n\n'):

Given PDF (as a bytestring), returns the plain text it reports to have inside it.

Expect this to be missing for some PDFs; see our notebooks for why, and for how to use wetsuite.extras.pdf and wetsuite.extras.ocr to do better.

Parameters
pdfbytes: PDF document, as bytes object
page_join: string inserted between each page's text (default: two newlines)
Returns
all embedded text, as a single string
def pdf_text_ocr(pdfbytes):

Given PDF as a bytestring, OCRs it and reports the text in that. Expect this to not be the cleanest.

Parameters
pdfbytes: PDF document, as bytes object
Returns
one string (pages only introduce a double newline, which you can't really fish out later - if you want more control, you probably want to look at the underlying module)
def spacy_parse(string, force_model=None, force_language=None, detection_fallback='nl'):

Takes text and returns a spacy document for it.

By default, it

  • estimates the language (based on a specific language detection model)
  • picks an already-installed model of that determined language

In general you might care about the reproducibility of explicitly loading a model yourself, but this can be handy in experiments, to parse some fragments of text with less typing.

Note also that this would fail if it detects a language you do not have an installed model for; use force_language if you want to avoid that.

Parameters
string: string to parse
force_model: if None, detect model; if not None, load this one
force_language: if None, detect language; if not None, assume this one
detection_fallback: if language detection fails (e.g. because _its_ model was not installed), fall back to this language
Returns
a Doc of that text
_loaded_models: dict =

Undocumented; presumably a cache of already-loaded spaCy models, so that repeated spacy_parse calls do not reload them.