Various functions that allow you to be (a little too) lazy - less typing and/or less thinking.
This module itself is a little creative with many details, so don't count on its details staying the same, or on reproducibility even if they did.
In part it is actually just calls to other parts of wetsuite.
Function | etree |
Parse XML in a bytestring to an ET object. Mostly just ET.fromstring() with namespace stripping (that you can turn off) |
Function | html |
Takes an HTML file as a bytestring, returns its body text as a string. |
Function | pdf |
Given a PDF (as a bytestring), returns the plain text it reports to have inside it. |
Function | pdf |
Given a PDF as a bytestring, OCRs it and reports the text in that. Expect this to not be the cleanest. |
Function | spacy |
Takes text and returns a spacy document for it. |
Variable | _loaded |
Undocumented |
Parse XML in a bytestring to an ET object. Mostly just ET.fromstring() with namespace stripping (that you can turn off)
Parameters | |
xmlbytes | XML document, as bytes object |
strip | Undocumented |
Returns | |
etree root node |
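The namespace stripping described above can be sketched with the standard library alone. This is a minimal illustration, not wetsuite's own implementation: the helper names `strip_namespaces` and `parse_xml` are hypothetical, and the real function may treat attributes or edge cases differently.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Remove '{uri}' prefixes from element tags and attribute names, in place."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip those.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
        for name in list(element.attrib):
            if name.startswith("{"):
                element.attrib[name.split("}", 1)[1]] = element.attrib.pop(name)
    return root


def parse_xml(xmlbytes, strip=True):
    """Hypothetical stand-in: parse bytes, optionally stripping namespaces."""
    root = ET.fromstring(xmlbytes)
    return strip_namespaces(root) if strip else root


xmlbytes = b'<a xmlns="http://example.com/ns"><b>text</b></a>'
stripped = parse_xml(xmlbytes)
unstripped = parse_xml(xmlbytes, strip=False)
```

With stripping on, `stripped.find("b")` works directly; without it, every lookup has to spell out the `{http://example.com/ns}` prefix.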
Takes an HTML file as a bytestring, returns its body text as a string.
(note: this is also roughly the implementation of wetsuite.helpers.split.Fragments_HTML_Fallback)
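To give an idea of what "body text" extraction involves, here is a rough sketch using only the standard library's `html.parser`. It is an assumption-laden illustration, not the code wetsuite uses: the class `BodyTextExtractor` and function `body_text` are hypothetical names, and skipping `<script>` and `<style>` content is a choice made for this example.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text inside <body>, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is inside the body and not in a skipped tag.
        if self.in_body and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def body_text(htmlbytes, encoding="utf-8"):
    """Hypothetical stand-in: decode the bytes and return body text, one chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(htmlbytes.decode(encoding, errors="replace"))
    parser.close()
    return "\n".join(parser.chunks)


page = (
    b"<html><head><title>T</title></head>"
    b"<body><h1>Hi</h1><script>x=1</script><p>There</p></body></html>"
)
text = body_text(page)
```

Real-world HTML is messier than this (missing `<body>` tags, unknown encodings), which is one reason to prefer the library function over a hand-rolled parser.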
Given a PDF (as a bytestring), returns the plain text it reports to have inside it.
Expect this text to be missing for some PDFs; read our notebooks, which explain why, and how to use wetsuite.extras.pdf and wetsuite.extras.ocr to do better.
Parameters | |
pdfbytes | PDF document, as bytes object |
page | Undocumented |
Returns | |
all embedded text, as a single string |
Given a PDF as a bytestring, OCRs it and reports the text in that. Expect this to not be the cleanest.
Parameters | |
pdfbytes | PDF document, as bytes object |
Returns | |
one string (pages only introduce a double newline, which you can't really fish out later; if you want more control, you probably want to look at the underlying module) |
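A small illustration of why the page boundaries cannot be recovered afterwards, assuming pages are joined with a double newline as described. The page texts here are made up for the example.

```python
# Two hypothetical page texts; the first happens to contain a blank line itself.
pages = [
    "Page one, first paragraph.\n\nPage one, second paragraph.",
    "Page two.",
]

joined = "\n\n".join(pages)

# Splitting on the separator yields paragraphs, not pages.
fragments = joined.split("\n\n")
```

Here two pages come back as three fragments, so the split no longer tells you where one page ended and the next began.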
Takes text and returns a spacy document for it.
By default, it:
- estimates the language (based on a specific language detection model)
- picks an already-installed model of that determined language
In general you might care about the reproducibility of explicitly loading a model yourself, but this can be handy in experiments, to parse some fragments of text with less typing.
Note also that this would fail if it detects a language you do not have an installed model for; use force_language if you want to avoid that.
Parameters | |
string | string to parse |
force | if None, detect model; if not None, load this one |
force | if None, detect language; if not None, assume this one |
detection | if language detection fails (e.g. because _its_ model was not installed), fall back to use this language |
Returns | |
a Doc of that text |
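The selection logic described above (explicit model wins, then explicit language, then detection with a fallback) can be sketched in plain Python. Everything here is hypothetical: `pick_model` and its parameter names `model_override`, `language_override` and `fallback_language` are invented for this example and are not the module's actual signature. The only outside fact relied on is spaCy's convention that model names start with the language code, as in `nl_core_news_sm`.

```python
def pick_model(installed_models, detect, model_override=None,
               language_override=None, fallback_language=None):
    """Return the name of the model to load, following the rules described above."""
    if model_override is not None:
        # An explicitly requested model skips language detection entirely.
        return model_override

    language = language_override
    if language is None:
        try:
            language = detect()
        except Exception:
            # Detection itself failed (e.g. its own model is missing).
            if fallback_language is None:
                raise
            language = fallback_language

    for name in installed_models:
        # Assumes spaCy's naming convention: "<language code>_<rest>".
        if name.startswith(language + "_"):
            return name
    raise ValueError(f"no installed model for language {language!r}")


installed = ["en_core_web_sm", "nl_core_news_sm"]
chosen = pick_model(installed, detect=lambda: "nl")
```

The final `ValueError` corresponds to the failure case noted above: a detected language with no installed model. Forcing the language sidesteps detection, but still requires a matching model to be installed.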