wetsuite.helpers.etree

module documentation

Helpers to deal with XML data, largely a wrapper around lxml and its ElementTree interface.

TODO: minimize the amount of "will break because we use the lxml flavour of ElementTree", and add more tests for that.

Some general helpers, including some helper functions shared by some debug scripts.

CONSIDER:

A "turn tree into nested dicts" function - see e.g. https://lxml.de/FAQ.html#how-can-i-map-an-xml-tree-into-a-dict-of-dicts
Have a fromstring() as a thin wrapper but with strip_namespace in there? (saves a line, but might be a confusing API change)

Class	`debug_color`	Takes XML, parses, reindents, strip_namespaces, returns a class that will render it in color in a jupyter notebook (using pygments).
Function	`all_text_fragments`	Returns all fragments of text contained in a subtree, as a list of strings.
Function	`debug_pretty`	Return (piece of) tree as a string, readable for debugging
Function	`html_text`	Take an etree (will also take a bytestring) presumed to contain elements with HTML names, extract the plain text as a single string.
Function	`indent`	Returns a 'reindented' copy of a tree, with text nodes altered to add spaces and newlines, so that if tostring()'d and printed, it would print indented by depth.
Function	`kvelements_to_dict`	Where people use elements for single text values, it's convenient to get them as a dict.
Function	`node_walk`	Walks all elements under the given element, remembering both path string and element reference as we go.
Function	`parse_html`	Parses HTML into an etree. NOTE: This is NOT what you would use for XML - fromstring() is for XML.
Function	`path_between`	Given an ancestor and a descendant element from the same tree, returns the xpath-style path between them (in many applications you want under_node to be the root element).
Function	`path_count`	Walks nodes under an etree element and counts how often each path occurs (counting the complete path string). Written to summarize the rough structure of a document.
Function	`strip_namespace`	Returns a copy of a tree that has its namespaces stripped.
Constant	`SOME_NS_PREFIXES`	some readable XML prefixes, for friendlier display. This is ONLY for consistent pretty-printing in debug, and WILL NOT BE CORRECT according to the document definition. (It is not used by code in this module, just one of our CLI utilities).
Function	`_indent_inplace`	Alters the text nodes so that the tostring()ed version will look nice and indented when printed as plain text.
Function	`_strip_namespace_inplace`	Takes a parsed ET structure and does an in-place removal of all namespaces. Returns a list of removed namespaces, which you can usually ignore. Not really meant to be used directly, in part because it assumes lxml etrees.
Variable	`_html_text_knowledge`	The data that html_text works from; we might make this a parameter so you can control that

def all_text_fragments(under_node, strip: str = None, ignore_empty: bool = False, ignore_tags=(), join: str = None, stop_at: list = None):

Returns all fragments of text contained in a subtree, as a list of strings.

Note that for simpler uses, this is itertext() with extra steps. You may not need this.

For example, all_text_fragments( fromstring('<a>foo<b>bar</b></a>') ) == ['foo', 'bar']

Note that if your source is XML, this is a convenience function that lets you be pragmatic with creative HTML-like nesting, and perhaps should not be used for things that are strictly data.

TODO: more tests, I'm moderately sure strip doesn't do quite what it should.

TODO: add add_spaces, an acknowledgment that in non-HTML, as well as in equally free-form documents like the ones this project often handles, some elements should be considered to split a word (e.g. p in HTML) while others probably do not (e.g. em, sup in HTML). The idea would be that you can specify which elements get spaces inserted and which do not, probably with defaults that are creative and not necessarily correct, but that on average make fewer weird mistakes (we would need to figure those out from the various schemas).

Parameters
under_node	an etree node to work under
strip:`str`	what to remove at the edges of each .text and .tail; this is handed to strip(). The default, None, strips whitespace. If you want it to strip nothing at all, pass in '' (empty string).
ignore_empty:`bool`	removes strings that are empty after that stripping
ignore_tags	ignores direct/first .text content of named tags (does not ignore .tail, does not ignore the subtree)
join:`str`	if None, return a list of text fragments; if a string, return a single string, joined on that
stop_at:`list`	should be None or a list of tag names. If a tag name is in this sequence, we stop walking the tree entirely. (note that it would still include that tag's tail; CONSIDER: changing that)
Returns
if join==None (the default), a list of text fragments. If join is a string, a single string (joined on that string)
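
The strip, ignore_empty and join behaviour described above can be approximated with plain itertext(). The following is a minimal sketch, not this module's implementation: it assumes the standard library's xml.etree.ElementTree rather than lxml, the name text_fragments_sketch is hypothetical, and ignore_tags and stop_at are left out.

```python
import xml.etree.ElementTree as ET


def text_fragments_sketch(under_node, strip=None, ignore_empty=False, join=None):
    """Hypothetical stand-in: collect .text and .tail fragments under a node."""
    fragments = []
    for fragment in under_node.itertext():
        # strip=None removes whitespace, strip='' removes nothing at all
        fragment = fragment.strip(strip)
        if ignore_empty and fragment == "":
            continue
        fragments.append(fragment)
    if join is not None:
        return join.join(fragments)
    return fragments


root = ET.fromstring("<a>foo <b>bar</b>\n</a>")
print(text_fragments_sketch(root))
print(text_fragments_sketch(root, ignore_empty=True, join=" "))
```

Note how the newline after b survives as an empty fragment unless ignore_empty is set.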

def debug_pretty(tree, reindent=True, strip_namespaces=True, encoding='unicode'):

Return (piece of) tree as a string, readable for debugging

Intended to take an etree object (but if given a bytestring we'll try to parse it as XML).

Because this is purely meant for debugging, it by default

strips namespaces
reindents
returns as unicode (not bytes) so we can print() it

It's also mostly just short for:

       etree.tostring(  etree.indent( etree.strip_namespace( tree ) ), encoding='unicode' )
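
For readers without this module at hand, roughly the same composition can be sketched with the standard library alone. This is an illustration under assumptions, not the module's code: it uses xml.etree.ElementTree (whose ET.indent needs Python 3.9 or later) instead of lxml, only strips namespaces from element names, and the name debug_pretty_sketch is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def debug_pretty_sketch(tree, reindent=True, strip_namespaces=True):
    """Hypothetical stand-in: namespace-stripped, reindented string for debugging."""
    if isinstance(tree, (bytes, str)):
        tree = ET.fromstring(tree)      # accept unparsed XML as well
    tree = copy.deepcopy(tree)          # never alter the caller's tree
    if strip_namespaces:
        for elem in tree.iter():
            # ElementTree stores namespaced names as '{uri}localname'
            if isinstance(elem.tag, str) and elem.tag.startswith("{"):
                elem.tag = elem.tag.split("}", 1)[1]
    if reindent:
        ET.indent(tree)                 # in-place, two spaces per level
    return ET.tostring(tree, encoding="unicode")


print(debug_pretty_sketch(b'<a xmlns="urn:x"><b>t</b></a>'))
```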

def html_text(etree, join=True, bodynodename='body'):

Take an etree (will also take a bytestring) presumed to contain elements with HTML names, extract the plain text as a single string.

What this adds over basic text extraction using "".join(elem.itertext()), (or all_text_fragments() in this module) is awareness of which HTML elements should be considered to split words and to split paragraphs.

It will selectively insert spaces and newlines, so as not to smash text together in ways unlike how a browser would render it. The downside is that this becomes more creative than some might like, so if you want precise control, take the code and refine your own. (Inspiration was taken from the html-text module. While we're being creative anyway, we might _also_ consider taking inspiration from jusText, to remove boilerplate content based on a few heuristics.)

While this will also take most of the more structured XML seen in BWB, CVDR, and OP, it mostly just passes the text through. If you care about structure, now or later, you may prefer wetsuite.helpers.split.

Parameters
etree	Can be one of: (1) an etree object (but there is little point, as most node names will not be known); (2) a bytes or str object, which will be assumed to be HTML that isn't parsed yet (bytes suggests properly storing file data, str that you might be more fiddly with encodings); (3) a bs4 object, which is a stretch, but could save you some time.
join	If True, returns a single string (with a little more polishing, of spaces after newlines). If False, returns the fragments it collected and added. Due to the insertion and handling of whitespace, this bears only limited relation to the parts.
bodynodename	start at the node with this name - defaults to 'body'. Use None to start at the root of what you handed in.
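
The core idea, that some elements separate paragraphs while others do not even separate words, can be sketched as follows. This is a toy illustration and not the data or logic html_text actually uses: the BLOCK_TAGS set and the name html_text_sketch are assumptions, and it parses well-formed markup with the standard library rather than real-world HTML.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a tiny hand-picked set of elements treated as paragraph-like.
BLOCK_TAGS = {"p", "div", "h1", "h2", "li", "br"}


def html_text_sketch(root):
    """Hypothetical stand-in: plain text with newlines around block elements."""
    parts = []

    def walk(elem):
        is_block = elem.tag in BLOCK_TAGS
        if is_block:
            parts.append("\n")
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                parts.append(child.tail)
        if is_block:
            parts.append("\n")

    walk(root)
    text = "".join(parts)
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces
    text = re.sub(r"\s*\n\s*", "\n", text)    # collapse runs of newlines
    return text.strip()


body = ET.fromstring("<body><p>One <em>two</em>three</p><p>Four</p></body>")
print(html_text_sketch(body))
```

Here em is not in the block set, so 'two' and 'three' stay joined, while the two p elements end up on separate lines.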

def indent(tree, strip_whitespace: bool = True):

Returns a 'reindented' copy of a tree, with text nodes altered to add spaces and newlines, so that if tostring()'d and printed, it would print indented by depth.

This may change the meaning of the document, so this output should _only_ be used for presentation of the debugging sort.
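
To see why the reindented output should be treated as presentation only, here is a small sketch of the copy-then-indent idea. It assumes the standard library's ET.indent (Python 3.9 or later) in place of this module's own logic, has no equivalent of strip_whitespace, and the name indent_copy_sketch is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def indent_copy_sketch(tree):
    """Hypothetical stand-in: return a reindented copy, leaving the input alone."""
    new_tree = copy.deepcopy(tree)
    ET.indent(new_tree)   # rewrites whitespace-only .text and .tail in place
    return new_tree


root = ET.fromstring("<a><b>x</b><c/></a>")
pretty = indent_copy_sketch(root)

print(repr(root.text))     # the original has no text node here
print(repr(pretty.text))   # the copy gained a newline plus indentation
```

The copy now contains whitespace text nodes that the original did not have, which is exactly the kind of change that can matter to code reading .text and .tail.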

For `parse_html`, see also https://lxml.de/lxmlhtml.html

Parameters (`parse_html`)
htmlbytes:`bytes`	an HTML file as a bytestring
Returns (`parse_html`)
an etree object

def path_between(under_node, element, excluding: bool = False):

Given an ancestor and a descendant element from the same tree (in many applications you want under_node to be the root element),

returns the xpath-style path to get from under_node to this specific element, or raises a ValueError mentioning that the element is not in this tree.

Keep in mind that if you reformat a tree, the ValueError is likely.

This function has very little code, and if you do this for much of a document, you may want to steal the code

Parameters
under_node
element
excluding:`bool`	If we have a/b/c and call this between a and c, there are cases for wanting either: (1) a complete path report, like `/a/b/c` (excluding=False); or (2) a relative path like `b/c` (excluding=True), in particular when we know we'll be calling xpath or find on node a.
Returns
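
One way to compute such a path is to walk from the element up to the ancestor. The sketch below is an assumption-laden illustration, not this module's code: standard-library elements have no parent pointers, so it builds a parent map first, it always writes positional indices (the real output format may differ), and the name path_between_sketch is hypothetical.

```python
import xml.etree.ElementTree as ET


def path_between_sketch(under_node, element, excluding=False):
    """Hypothetical stand-in: xpath-style path from under_node down to element."""
    # Map every child to its parent, for the subtree we were handed.
    parent_of = {child: parent for parent in under_node.iter() for child in parent}
    steps = []
    node = element
    while node is not under_node:
        if node not in parent_of:
            raise ValueError("element is not in this tree")
        parent = parent_of[node]
        same_tag = [sibling for sibling in parent if sibling.tag == node.tag]
        steps.append("%s[%d]" % (node.tag, same_tag.index(node) + 1))
        node = parent
    steps.reverse()
    if excluding:
        return "/".join(steps)                        # relative to under_node
    return "/" + "/".join([under_node.tag] + steps)   # includes under_node itself


root = ET.fromstring("<a><b><c/><c/></b></a>")
second_c = root[0][1]
print(path_between_sketch(root, second_c))
print(path_between_sketch(root, second_c, excluding=True))
```

The relative form can be handed back to find() on the ancestor, which is the use case the excluding parameter describes.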

def path_count(under_node, max_depth=None):

Walks nodes under an etree element and counts how often each path occurs (counting the complete path string). Written to summarize the rough structure of a document.

Path here means 'the name of each element', *not* xpath-style path with indices that resolve to the specific node.

Returns a dict from each path string to how often it occurs.
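
As an illustration of that counting, here is a minimal sketch. It is not this module's code: it assumes the standard library's ElementTree, assumes paths are written with a leading slash and include the starting element, and the name path_count_sketch is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def path_count_sketch(under_node, max_depth=None):
    """Hypothetical stand-in: count index-free paths like '/a/b/c'."""
    counts = Counter()

    def walk(elem, parent_path, depth):
        path = parent_path + "/" + elem.tag
        counts[path] += 1
        if max_depth is not None and depth >= max_depth:
            return                      # do not descend any further
        for child in elem:
            walk(child, path, depth + 1)

    walk(under_node, "", 0)
    return dict(counts)


doc = ET.fromstring("<a><b><c/><c/></b><b/></a>")
print(path_count_sketch(doc))
```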

def strip_namespace(tree, remove_from_attr=True):

Returns a copy of a tree that has its namespaces stripped.

More specifically it removes

namespace from element names
namespaces from attribute names (default, but optional)
default namespaces (TODO: test that properly)

Parameters
tree	The node under which to remove things (you would probably hand in the root)
remove_from_attr	Whether to remove namespaces from attributes as well. For attributes with the same name that are unique only because of a different namespace, this may cause attributes to be overwritten. For example, <e p:at="bar" at="quu"/> might become <e at="bar"/>. I've not yet seen any XML where this matters, but it might.
Returns
The URLs for the stripped namespaces. We don't expect you to have a use for this most of the time, but in some debugging you want to know, and report them.
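
The attribute-overwriting caveat is easy to reproduce. The following sketch is an assumption, not this module's implementation: it uses the standard library's ElementTree, returns only the stripped copy, and the name strip_namespace_sketch is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def strip_namespace_sketch(tree, remove_from_attr=True):
    """Hypothetical stand-in: copy of a tree without '{uri}' name prefixes."""
    new_tree = copy.deepcopy(tree)
    for elem in new_tree.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
        if remove_from_attr:
            for name in list(elem.attrib):      # list(): we modify while looping
                if name.startswith("{"):
                    value = elem.attrib.pop(name)
                    # May overwrite an un-namespaced attribute of the same name.
                    elem.attrib[name.split("}", 1)[1]] = value
    return new_tree


elem = ET.fromstring('<e xmlns:p="urn:p" p:at="bar" at="quu"/>')
print(elem.attrib)
print(strip_namespace_sketch(elem).attrib)
```

In this input the namespaced attribute replaces the plain one, so the value "quu" is lost in the copy; the original element is left untouched.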

SOME_NS_PREFIXES: dict[str, str] =

some readable XML prefixes, for friendlier display. This is ONLY for consistent pretty-printing in debug, and WILL NOT BE CORRECT according to the document definition. (It is not used by code in this module, just one of our CLI utilities).

Value

{'http://www.w3.org/2000/xmlns/': 'xmlns',
 'http://www.w3.org/2001/XMLSchema': 'xsd',
 'http://www.w3.org/XML/1998/namespace': 'xml',
 'http://www.w3.org/2001/XMLSchema-instance': 'xsi',
 'http://www.w3.org/1999/xhtml': 'xhtml',
 'http://www.w3.org/1999/xlink': 'xlink',
 'http://schema.org/': 'schema',
...
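
A mapping like this would typically be used to turn ElementTree's '{uri}name' form into something shorter for display. The sketch below is an assumption about such use, not code from the CLI utility mentioned above: the PREFIXES dict copies just two of the entries shown, and readable_tag is a hypothetical helper.

```python
# Two entries copied from the value shown above; the full constant has more.
PREFIXES = {
    "http://www.w3.org/1999/xhtml": "xhtml",
    "http://www.w3.org/1999/xlink": "xlink",
}


def readable_tag(tag, prefixes):
    """Hypothetical helper: '{uri}name' becomes 'prefix:name' for known URIs."""
    if tag.startswith("{"):
        uri, local_name = tag[1:].split("}", 1)
        prefix = prefixes.get(uri)
        if prefix is not None:
            return prefix + ":" + local_name
    return tag   # unknown namespace or no namespace: leave as-is


print(readable_tag("{http://www.w3.org/1999/xhtml}p", PREFIXES))
```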

def _indent_inplace(elem, level: int = 0, strip_whitepsace: bool = True):

Alters the text nodes so that the tostring()ed version will look nice and indented when printed as plain text.

def _strip_namespace_inplace(tree, remove_from_attr=True):

Takes a parsed ET structure and does an in-place removal of all namespaces. Returns a list of removed namespaces, which you can usually ignore. Not really meant to be used directly, in part because it assumes lxml etrees.

Parameters
tree	See `strip_namespace`
remove_from_attr	See `strip_namespace`
Returns
See `strip_namespace`

_html_text_knowledge: dict[str, tuple] =

The data that html_text works from; we might make this a parameter so you can control that

Parameters (`indent`)
tree	tree to copy and reindent
strip_whitespace:`bool`	make contents that contain a lot of newlines look cleaner, but changes the stored data even more.

Parameters (`kvelements_to_dict`)
under_node	etree node/element to work under (use the children of)
strip_text	whether to use strip() on text values (defaults to True)
ignore_tagnames	sequence of strings, naming tags/variables to not put into the dict
Returns (`kvelements_to_dict`)
`dict`	a python dict (see e.g. example above)
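
A minimal sketch of that elements-to-dict idea, based only on the parameter descriptions above. It is not this module's implementation: it assumes the standard library's ElementTree, the handling of repeated or empty elements is a guess, and the name kvelements_to_dict_sketch is hypothetical.

```python
import xml.etree.ElementTree as ET


def kvelements_to_dict_sketch(under_node, strip_text=True, ignore_tagnames=()):
    """Hypothetical stand-in: map each child's tag name to its text."""
    result = {}
    for child in under_node:
        if child.tag in ignore_tagnames:
            continue
        text = child.text
        if text is not None and strip_text:
            text = text.strip()
        result[child.tag] = text    # assumption: a later duplicate tag wins
    return result


meta = ET.fromstring("<meta><id> 12 </id><title>Foo</title><skip>x</skip></meta>")
print(kvelements_to_dict_sketch(meta, ignore_tagnames=("skip",)))
```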

Parameters (`node_walk`)
under_node	If given None, it emits nothing (we assume it's from a find() that hit nothing, and that it's slightly easier to ignore here than in your code)
max_depth
Returns (`node_walk`)
a generator yielding (path, element), and is mainly a helper used by path_count()
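
A generator with that shape can be sketched as below. This is an illustration under assumptions rather than this module's code: it uses the standard library's ElementTree, assumes the starting element itself is yielded and that paths start with a slash, and node_walk_sketch (with its private _path and _depth arguments) is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def node_walk_sketch(under_node, max_depth=None, _path="", _depth=0):
    """Hypothetical stand-in: yield (path, element) pairs in document order."""
    if under_node is None:
        return                          # e.g. the result of a find() that hit nothing
    path = _path + "/" + under_node.tag
    yield path, under_node
    if max_depth is not None and _depth >= max_depth:
        return
    for child in under_node:
        yield from node_walk_sketch(child, max_depth, path, _depth + 1)


doc = ET.fromstring("<a><b><c/></b><d/></a>")
for path, element in node_walk_sketch(doc):
    print(path, element.tag)
```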