module documentation

Various functions that allow you to be (a little too) lazy - less typing and/or less thinking.

This module is a little creative with many of its details, so don't count on those details staying the same, or on reproducibility even if they did.

In part it consists of calls to other parts of wetsuite.

Function etree Parse XML in a bytestring to an ET object. Mostly just ET.fromstring() with namespace stripping (that you can turn off)
Function html_text Takes an HTML file as a bytestring, returns its body text as a string.
Function pdf_embedded_text Given PDF (as a bytestring), returns the plain text it reports to have inside it.
Function pdf_text_ocr Given PDF as a bytestring, OCRs it and reports the text in that. Expect this to not be the cleanest.
Function spacy_parse Takes text and returns a spacy document for it.
Variable _loaded_models Undocumented
def etree(xmlbytes, strip_namespace=True):

Parse XML in a bytestring to an ET object. Mostly just ET.fromstring() with namespace stripping (that you can turn off)

Parameters
xmlbytes: XML document, as bytes object
strip_namespace: whether to strip namespaces from the parsed tree (default: True)
Returns
etree root node
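The namespace stripping can be sketched with the standard library's ElementTree; this is a minimal illustration of what strip_namespace=True amounts to, not necessarily the real implementation (which may use lxml and differ in detail):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    # Walk the tree and drop the '{namespace}' prefix from tag and
    # attribute names, roughly what strip_namespace=True does.
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith('{'):
            el.tag = el.tag.split('}', 1)[1]
        for name in list(el.attrib):
            if name.startswith('{'):
                el.attrib[name.split('}', 1)[1]] = el.attrib.pop(name)
    return root

root = strip_namespaces(ET.fromstring(b'<a xmlns="urn:example"><b>text</b></a>'))
print(root.tag, root[0].tag)  # -> a b
```

After stripping, you can address elements as `root.find('b')` instead of `root.find('{urn:example}b')`, which is the main convenience this buys you.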
def html_text(htmlbytes):

Takes an HTML file as a bytestring, returns its body text as a string.

(note: this is also roughly the implementation of wetsuite.helpers.split.Fragments_HTML_Fallback)
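The general idea can be sketched with the standard library's html.parser; the real implementation likely differs (the note above says it roughly matches wetsuite.helpers.split.Fragments_HTML_Fallback), but the shape is: parse, skip script/style, collect text, normalize whitespace:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    # Collects text content, skipping script/style contents.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1
    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_text(htmlbytes):
    # Assumption for this sketch: UTF-8 input; the real function
    # probably does smarter encoding detection.
    parser = _TextExtractor()
    parser.feed(htmlbytes.decode('utf-8', errors='replace'))
    return ' '.join(' '.join(parser.parts).split())

print(html_text(b'<html><body><p>Hello <b>world</b></p></body></html>'))  # -> Hello world
```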

def pdf_embedded_text(pdfbytes, page_join='\n\n'):

Given PDF (as a bytestring), returns the plain text it reports to have inside it.

Expect this to be missing for some PDFs; see our notebooks for why, and for how to use wetsuite.extras.pdf and wetsuite.extras.ocr to do better.

Parameters
pdfbytes: PDF document, as bytes object
page_join: string inserted between each page's text (default: two newlines)
Returns
all embedded text, as a single string
def pdf_text_ocr(pdfbytes):

Given PDF as a bytestring, OCRs it and reports the text in that. Expect this to not be the cleanest.

Parameters
pdfbytes: PDF document, as bytes object
Returns
one string (pages only introduce a double newline, which you can't really fish out later - if you want more control, you probably want to look at the underlying module)
def spacy_parse(string, force_model=None, force_language=None, detection_fallback='nl'):

Takes text and returns a spacy document for it.

By default, it

  • estimates the language (based on a specific language detection model)
  • picks an already-installed model of that determined language

In general you might care about the reproducibility of explicitly loading a model yourself, but this can be handy in experiments, to parse some fragments of text with less typing.

Note also that this would fail if it detects a language you do not have an installed model for; use force_language if you want to avoid that.

Parameters
string: string to parse
force_model: if None, detect model; if not None, load this one
force_language: if None, detect language; if not None, assume this one
detection_fallback: if language detection fails (e.g. because _its_ model was not installed), fall back to this language
Returns
a Doc of that text
_loaded_models: dict =

Undocumented; presumably a cache of already-loaded spaCy models, so that repeated spacy_parse calls do not reload them.