Helpers to deal with XML data, largely a wrapper around lxml and its ElementTree interface.
TODO: minimize the amount of "will break because we use the lxml flavour of ElementTree", and add more tests for that.
Some general helpers. ...including some helper functions shared by some debug scripts.
CONSIDER:
- A "turn tree into nested dicts" function - see e.g. https://lxml.de/FAQ.html#how-can-i-map-an-xml-tree-into-a-dict-of-dicts
- have a fromstring() as a thin wrapper but with strip_namespace in there? (saves a line but might be a confusing API change)
Class | debug |
Takes XML, parses, reindents, strip_namespaces, returns a class that will render it in color in a jupyter notebook (using pygments). |
Function | all |
Returns all fragments of text contained in a subtree, as a list of strings. |
Function | debug |
Return (piece of) tree as a string, readable for debugging |
Function | html |
Take an etree presumed to contain elements with HTML names, extract the plain text as a single string. |
Function | indent |
Returns a 'reindented' copy of a tree, with text nodes altered to add spaces and newlines, so that if tostring()'d and printed, it would print indented by depth. |
Function | kvelements |
Where people use elements for single text values, it's convenient to get them as a dict. |
Function | node |
Walks all elements under the given element, remembering both path string and element reference as we go. |
Function | parse |
Parses HTML into an etree. NOTE: This is *NOT* what you would use for XML - fromstring() is for XML. |
Function | path |
Given an ancestor and a descendant element from the same tree (in many applications you want under to be the root element) |
Function | path |
Walk nodes under an etree element, count how often each path happens (counting the complete path string). Written to summarize the rough structure of a document. |
Function | strip |
Returns a copy of a tree that has its namespaces stripped. |
Constant | SOME |
some readable XML prefixes, for friendlier display. This is ONLY for consistent pretty-printing in debug, and WILL NOT BE CORRECT according to the document definition. (It is not used by the rest of this code, just one of our CLI utilities). |
Function | _indent |
Alters the text nodes so that the tostring()ed version will look nice and indented when printed as plain text. |
Function | _strip |
Takes a parsed ET structure and does an in-place removal of all namespaces. Returns a list of removed namespaces, which you can usually ignore. Not really meant to be used directly, in part because it assumes lxml etrees. |
Variable | _html |
The data that html_text works from; we might make this a parameter so you can control that |
Returns all fragments of text contained in a subtree, as a list of strings.
Note that for the simplest uses, this is itertext() with extra steps. You may not need this.
For example, all_text_fragments( fromstring('<a>foo<b>bar</b></a>') ) == ['foo', 'bar']
Note that:
- If your source is XML,
- this is a convenience function that lets you be pragmatic with creative HTML-like nesting, and perhaps should not be used for things that are strictly data.
TODO: more tests, I'm moderately sure strip doesn't do quite what it should.
TODO: add add_spaces. In non-HTML, as well as in the equally free-form documents this project often handles, some elements should be considered to split a word (e.g. p in HTML) and others probably do not (e.g. em, sup in HTML). The idea would be that you can specify which elements get spaces inserted and which do not, probably with defaults that are creative and not necessarily correct, but that on average make fewer weird mistakes (we would need to figure those out from the various schemas).
Parameters | |
under | an etree node to work under |
strip:str | is what to remove at the edges of each .text and .tail; it is handed to strip(). Note that the default, None, strips whitespace. If you want it to strip nothing at all, pass in '' (empty string). |
ignorebool | removes strings that are empty after that stripping |
ignore | ignores direct/first .text content of named tags (does not ignore .tail, does not ignore the subtree) |
join:str | if None, return a list of text fragments; if a string, we return a single string, joined on that |
stoplist | should be None or a list of tag names. If a tag name is in this sequence, we stop walking the tree entirely. (note that it would still include that tag's tail; CONSIDER: changing that) |
Returns | |
if join==None (the default), a list of text fragments. If join is a string, a single string (joined on that string) |
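As a rough illustration of the strip, ignore-empty and join behaviour described above, here is a minimal sketch. It is not this module's implementation: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, the name text_fragments_sketch and the parameter name ignore_empty are hypothetical, and the tag-based ignore and stop parameters are left out.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def text_fragments_sketch(under, strip=None, ignore_empty=False, join=None):
    """Hypothetical simplification: collect every .text and .tail under a node."""
    fragments = []
    for fragment in under.itertext():
        # str.strip(None) strips whitespace, str.strip('') strips nothing
        fragment = fragment.strip(strip)
        if ignore_empty and fragment == '':
            continue
        fragments.append(fragment)
    if join is None:
        return fragments
    return join.join(fragments)


tree = ET.fromstring('<a>foo<b>bar</b> </a>')
print(text_fragments_sketch(tree))                               # ['foo', 'bar', '']
print(text_fragments_sketch(tree, ignore_empty=True))            # ['foo', 'bar']
print(text_fragments_sketch(tree, strip=''))                     # ['foo', 'bar', ' ']
print(text_fragments_sketch(tree, ignore_empty=True, join=' '))  # foo bar
```

The whitespace-only tail after the b element shows why dropping empty fragments is a separate choice from stripping.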
Return (piece of) tree as a string, readable for debugging
Intended to take an etree object (but if given a bytestring we'll try to parse it as XML)
Because this is purely meant for debugging, it by default
- strips namespaces
- reindents
- returns as unicode (not bytes) so we can print() it
It's also mostly just short for:
etree.tostring( etree.indent( etree.strip_namespace( tree ) ), encoding='unicode' )
Take an etree presumed to contain elements with HTML names, extract the plain text as a single string.
Yes, you can get basic text extraction using "".join(elem.itertext()), or with a _little_ more control using all_text_fragments() in this module.
What this function adds is awareness of which HTML elements should be considered to split words and which to split paragraphs. It selectively inserts spaces and newlines, so as not to smash text together in ways a browser would not.
The downside is that this becomes more creative than some might like, so if you want precise control, take the code and refine your own.
(Inspiration was taken from the html-text module. While we're being creative anyway, we might _also_ consider taking inspiration from jusText, to remove boilerplate content based on a few heuristics.)
Parameters | |
etree | Can be one of * an etree object (but there is little point, as most node names will not be known) * a bytes or str object, which will be assumed to be HTML that isn't parsed yet (bytes suggests properly storing file data, str that you might be more fiddly with encodings) * a bs4 object (this is a stretch, but could save you some time) |
join | If True, returns a single string (with a little more polishing of spaces after newlines). If False, returns the fragments it collected and added. Due to the insertion and handling of whitespace, this bears only limited relation to the parts. |
bodynodename | start at the node with this name - defaults to 'body'. Use None to start at the root of what you handed in. |
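To make the idea of element-aware whitespace concrete, here is a small sketch that inserts a newline after a handful of block-level elements and leaves inline elements alone. It is an illustration only, not this function's code: the BLOCK_TAGS set and the name html_text_sketch are assumptions, and it uses the standard library's xml.etree.ElementTree on a well-formed fragment rather than a real HTML parser.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; real HTML needs an HTML parser

# Assumption: a tiny, illustrative set of elements that end a paragraph.
BLOCK_TAGS = {'p', 'div', 'h1', 'h2', 'li', 'br'}


def html_text_sketch(node):
    """Hypothetical sketch: join text, adding a newline after block elements."""
    pieces = []

    def walk(el):
        if el.text:
            pieces.append(el.text)
        for child in el:
            walk(child)
            if child.tag in BLOCK_TAGS:
                pieces.append('\n')
            if child.tail:
                pieces.append(child.tail)

    walk(node)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [' '.join(line.split()) for line in ''.join(pieces).split('\n')]
    return '\n'.join(line for line in lines if line)


body = ET.fromstring('<body><p>One <em>two</em>three.</p><p>Four</p></body>')
print(''.join(body.itertext()))  # One twothree.Four
print(html_text_sketch(body))    # prints "One twothree." and "Four" on separate lines
```

Plain itertext() glues the two paragraphs together, while the sketch keeps them apart without splitting the word that the inline em element sits in.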
Returns a 'reindented' copy of a tree, with text nodes altered to add spaces and newlines, so that if tostring()'d and printed, it would print indented by depth.
This may change the meaning of the document, so this output should _only_ be used for presentation of the debugging sort.
See also _indent_inplace
Parameters | |
tree | tree to copy and reindent |
stripbool | make contents that contain a lot of newlines look cleaner, but changes the stored data even more. |
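The copy-then-reindent approach can be sketched as follows. This is an assumption-laden illustration rather than the module's code: it uses the standard library's xml.etree.ElementTree, the names indent_sketch and indent_inplace_sketch are hypothetical, and it only touches text nodes that are empty or whitespace-only (it does not implement the strip option).

```python
import copy
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def indent_inplace_sketch(elem, level=0):
    """Hypothetical sketch: rewrite whitespace-only text/tail to indent by depth."""
    pad = '\n' + '  ' * level
    children = list(elem)
    if not children:
        return
    if not (elem.text or '').strip():
        elem.text = pad + '  '          # newline plus the children's indentation
    for i, child in enumerate(children):
        indent_inplace_sketch(child, level + 1)
        if not (child.tail or '').strip():
            is_last = (i == len(children) - 1)
            # the last child's tail returns to the parent's own depth
            child.tail = pad if is_last else pad + '  '


def indent_sketch(tree):
    """Return a reindented deep copy, leaving the original tree untouched."""
    new_tree = copy.deepcopy(tree)
    indent_inplace_sketch(new_tree)
    return new_tree


original = ET.fromstring('<a><b><c/></b><d>text</d></a>')
print(ET.tostring(indent_sketch(original), encoding='unicode'))
```

Working on a deep copy is what keeps the caller's data intact, which matters because the inserted whitespace is part of the document's content afterwards.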
Where people use elements for single text values, it's convenient to get them as a dict.
Given an etree element containing a series of such values, returns a dict that is mostly just { el.tag:el.text }, so it ignores .tail. Skips keys with empty values.
Would for example turn an etree fragment like :
<foo> <identifier>BWBR0001840</identifier> <title>Grondwet</title> <onderwerp/> </foo>
into python dict: :
{'identifier':'BWBR0001840', 'title':'Grondwet'}
Parameters | |
under | etree node/element to work under (use the children of) |
strip | whether to use strip() on text values (defaults to True) |
ignore | sequence of strings, naming tags/variables to not put into the dict |
Returns | |
dict | a python dict (see e.g. example above) |
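A minimal sketch of this behaviour, assuming the standard library's xml.etree.ElementTree in place of lxml; the name kvelements_sketch and the parameter name ignore_tagnames are hypothetical and not taken from the module.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def kvelements_sketch(under, strip=True, ignore_tagnames=()):
    """Hypothetical sketch: map each child's tag to its text, skipping empties."""
    result = {}
    for child in under:
        if child.tag in ignore_tagnames:
            continue
        text = child.text
        if text is None:        # e.g. <onderwerp/>
            continue
        if strip:
            text = text.strip()
        if text == '':
            continue
        result[child.tag] = text  # .tail is deliberately ignored
    return result


foo = ET.fromstring(
    '<foo> <identifier>BWBR0001840</identifier> <title>Grondwet</title> <onderwerp/> </foo>'
)
print(kvelements_sketch(foo))
# {'identifier': 'BWBR0001840', 'title': 'Grondwet'}
```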
Walks all elements under the given element, remembering both path string and element reference as we go.
(note that this is not an xpath style with specific indices, just the names of the elements)
For example: :
TODO
TODO: re-test now that I've added max_depth, because I'm not 100% on the details
Parameters | |
under | If given None, it emits nothing (we assume it's from a find() that hit nothing, and that it's slightly easier to ignore here than in your code) |
max | |
Returns | |
a generator yielding (path, element), and is mainly a helper used by path_count() |
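One way such a walk could look is sketched below. The details are assumptions, not the module's behaviour: the sketch includes the starting element in each path, counts the starting element as depth 0 for max_depth, and uses the standard library's xml.etree.ElementTree; the name node_walk_sketch is hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def node_walk_sketch(under, max_depth=None, _path='', _depth=0):
    """Hypothetical sketch: yield (path, element) for under and its descendants."""
    if under is None:           # e.g. the result of a find() that hit nothing
        return
    path = f'{_path}/{under.tag}'
    yield path, under
    if max_depth is not None and _depth >= max_depth:
        return
    for child in under:
        yield from node_walk_sketch(child, max_depth, path, _depth + 1)


tree = ET.fromstring('<a><b><c/></b><b/></a>')
print([path for path, _ in node_walk_sketch(tree)])
# ['/a', '/a/b', '/a/b/c', '/a/b']
print([path for path, _ in node_walk_sketch(tree, max_depth=1)])
# ['/a', '/a/b', '/a/b']
```

Note that the two b elements produce the same path string, since the paths carry names only and no indices.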
Parses HTML into an etree. NOTE: This is *NOT* what you would use for XML - fromstring() is for XML.
This parse_html() differs from etree.fromstring
- in that we use a parser that is more aware of HTML and deals with minor incorrectness
- and in that it creates lxml.html-based objects, which have more functions compared to their XML node counterparts
If you are doing this, consider also
- BeautifulSoup, as a slightly more HTML-aware parser, and an alternative API you might prefer to etree's (or specifically not; using both can be confusing)
- ElementSoup, to take more broken html into etree via beautifulsoup
See also https://lxml.de/lxmlhtml.html
Parameters | |
htmlbytes:bytes | an HTML file as a bytestring |
Returns | |
an etree object |
Given an ancestor and a descendant element from the same tree (in many applications you want under to be the root element)
Returns the xpath-style path to get from (under) to this specific element ...or raises a ValueError mentioning that the element is not in this tree
Keep in mind that if you reformat a tree, the latter is likely.
This function has very little code, and if you do this for much of a document, you may want to steal the code
Parameters | |
under | |
element | |
excluding:bool | if we have a/b/c and call this between a and c, there are cases for wanting either * a complete path report, like `/a/b/c` (excluding=False), e.g. as a 'complete * a relative path like `b/c` (excluding=True), in particular when we know we'll be calling xpath or find on node a |
Returns | |
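The following sketch shows one way to compute such a path, including the ValueError case and both settings of excluding. It is not the module's code: it uses the standard library's xml.etree.ElementTree (which has no parent pointers, hence the parent map), the name path_between_sketch is hypothetical, and the choice to add an index only when same-named siblings exist is an assumption.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def path_between_sketch(under, element, excluding=False):
    """Hypothetical sketch: xpath-style path from under down to element."""
    # The stdlib has no getparent(), so build a child -> parent map first.
    parent_of = {child: parent for parent in under.iter() for child in parent}
    steps = []
    current = element
    while current is not under:
        parent = parent_of.get(current)
        if parent is None:
            raise ValueError('element is not in the tree under the given ancestor')
        same_name = [sib for sib in parent if sib.tag == current.tag]
        if len(same_name) > 1:
            # xpath positions are 1-based and count same-named siblings only
            index = next(i for i, sib in enumerate(same_name, 1) if sib is current)
            steps.append(f'{current.tag}[{index}]')
        else:
            steps.append(current.tag)
        current = parent
    steps.reverse()
    if excluding:
        return '/'.join(steps)
    return '/' + '/'.join([under.tag] + steps)


tree = ET.fromstring('<a><b/><b><c/></b></a>')
c = tree[1][0]
print(path_between_sketch(tree, c))                  # /a/b[2]/c
print(path_between_sketch(tree, c, excluding=True))  # b[2]/c
```

The relative form can be handed straight back to find() on the ancestor, which is the use case the excluding parameter describes.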
Walk nodes under an etree element, count how often each path happens (counting the complete path string). Written to summarize the rough structure of a document.
Path here means 'the name of each element', *not* xpath-style path with indices that resolve to the specific node.
Returns a dict from each path string to how often it occurs
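A compact sketch of the counting, under the assumptions that paths include the starting element and that the standard library's xml.etree.ElementTree is an acceptable stand-in for lxml; the name path_count_sketch is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def path_count_sketch(under):
    """Hypothetical sketch: count how often each name-only path occurs."""
    counts = Counter()

    def walk(el, parent_path):
        path = f'{parent_path}/{el.tag}'
        counts[path] += 1
        for child in el:
            walk(child, path)

    walk(under, '')
    return dict(counts)


tree = ET.fromstring('<a><b><c/></b><b/></a>')
print(path_count_sketch(tree))
# {'/a': 1, '/a/b': 2, '/a/b/c': 1}
```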
Returns a copy of a tree that has its namespaces stripped.
More specifically it removes
- namespace from element names
- namespaces from attribute names (default, but optional)
- default namespaces (TODO: test that properly)
Parameters | |
tree | The node under which to remove things (you would probably hand in the root) |
remove | Whether to remove namespaces from attributes as well. For attributes with the same name that are unique only because of a different namespace, this may cause attributes to be overwritten. For example, <e p:at="bar" at="quu"/> might become <e at="bar"/>. I've not yet seen any XML where this matters, but it might. |
Returns | |
The URLs for the stripped namespaces. We don't expect you to have a use for this most of the time, but in some debugging you want to know, and report them. |
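The stripping of element and attribute names, including the attribute-overwrite caveat, can be sketched as below. This is an illustration under assumptions, not the module's implementation: it relies on the standard library's xml.etree.ElementTree, where namespaced names appear as '{uri}local', and the names strip_namespace_sketch and remove_from_attr are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def strip_namespace_sketch(tree, remove_from_attr=True):
    """Hypothetical sketch: return a copy with '{uri}' prefixes removed from names."""
    new_tree = copy.deepcopy(tree)
    for el in new_tree.iter():
        if isinstance(el.tag, str) and el.tag.startswith('{'):
            el.tag = el.tag.split('}', 1)[1]
        if remove_from_attr:
            for name in list(el.attrib):
                if name.startswith('{'):
                    value = el.attrib.pop(name)
                    # may overwrite an un-namespaced attribute of the same name
                    el.attrib[name.split('}', 1)[1]] = value
    return new_tree


tree = ET.fromstring('<e xmlns="urn:x" xmlns:p="urn:p" p:at="bar" at="quu"><f/></e>')
stripped = strip_namespace_sketch(tree)
print(tree.tag, tree.attrib)          # {urn:x}e {'{urn:p}at': 'bar', 'at': 'quu'}
print(stripped.tag, stripped.attrib)  # e {'at': 'bar'}
```

The second print shows the overwrite described for attributes: the namespaced at replaces the plain one once both share a name.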
some readable XML prefixes, for friendlier display. This is ONLY for consistent pretty-printing in debug, and WILL NOT BE CORRECT according to the document definition. (It is not used by the rest of this code, just one of our CLI utilities).
Value |
|
Alters the text nodes so that the tostring()ed version will look nice and indented when printed as plain text.
Parameters | |
elem | Undocumented |
level:int | Undocumented |
stripbool | Undocumented |
Takes a parsed ET structure and does an in-place removal of all namespaces. Returns a list of removed namespaces, which you can usually ignore. Not really meant to be used directly, in part because it assumes lxml etrees.
Parameters | |
tree | See strip_namespace |
remove | See strip_namespace |
Returns | |
See strip_namespace |