wetsuite.helpers.koop

module documentation

Data and metadata parsing that is probably specific to KOOP's SRU repositories.

For more general things, see

meta.py
patterns.py

Function	`alineas_with_selective_path`	Given document-style XML data such as that of CVDR XML documents, tries to capture most of the interesting structure in easier-to-digest python data form, and lessen the nested nature without quite throwing it away.
Function	`bwb_manifest_usefuls`	Fetch interesting metadata from manifest WTI XML. TODO: finish
Function	`bwb_merge_usefuls`	Merge the result of the above, into a flatter structure
Function	`bwb_searchresult_meta`	Takes an individual SRU result record as an etree subtree, picks out BWB-specific metadata (merging the separate metadata sections). Returns a dict.
Function	`bwb_title_looks_boring`	Given title text, estimate whether the content has much to say
Function	`bwb_toestand_text`	Given the document (as an etree object), this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such.
Function	`bwb_toestand_usefuls`	Fetch the most interesting metadata from parsed toestand XML. TODO: finish
Function	`bwb_wti_usefuls`	Fetch interesting metadata from parsed WTI XML. TODO: finish, and actually do it properly -- e.g. look to the schema as to what may be omitted, may repeat, etc.
Function	`cvdr_meta`	Extracts metadata from a CVDR SRU search result's individual record, or CVDR content xml's root.
Function	`cvdr_normalize_expressionid`	When indexing, you may care to transform varied forms (e.g. "112779_1" and "CVDR112779_1") into just one normalized form.
Function	`cvdr_param_parse`	Picks the parameters from a juriconnect (jci) style identifier string.
Function	`cvdr_parse_identifier`	Given a CVDR style identifier string (sometimes called JCDR), gives a more normalized version, e.g. useful for indexing.
Function	`cvdr_sourcerefs`	Given the CVDR XML content document as an etree object, looks for the <source> tags (meta/owmsmantel/source), which are references to laws and other regulations (VERIFY)
Function	`cvdr_text`	Given the XML content document as etree object, this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such.
Function	`cvdr_versions_for_work`	Takes a CVDR id (with or without _version, i.e. expression id or work id), searches KOOP's CVDR repo, and returns a list of all matching version expression ids.
Function	`merge_alinea_data`	Takes the output of alineas_with_selective_path() and puts text fragments together when their specified ['parts'] values are the same.
Function	`op_data_xml_seems_empty`	Some of the XML files presented to us as XML data are actually devoid of content text. This determines if it's such a case. We actually parse it, so we can better distinguish between empty and near-empty.
Function	`parse_op_metafile`	Parses two different metadata-only XML styles found in KOOP's Officiële Publicaties repositories
Function	`parse_op_searchmeta`	similar to cvdr_meta; CONSIDER: abstract most of that into one helper function
Function	`parse_repo_url`	TODO: see how well this holds up to all areas
Function	`prefer_types`	Given a bunch of document types such as:
Variable	`_re_repourl_parl`	Undocumented
Variable	`_versions_cache`	Undocumented

def alineas_with_selective_path(tree, start_at_path=None, alinea_elemnames=('al')): ¶

Given document-style XML data such as that of CVDR XML documents, tries to capture most of the interesting structure in easier-to-digest python data form, and lessen the nested nature without quite throwing it away.

Whenever we hit an <al>, we emit a dict that details all the interesting elements between body and this <al>

This is intended as the lower-level half of potential cleanup, grouping/splitting logic, and such. ...though it is specific to a handful of XML schemas, so not universal. For wider applicability you may want to look to helpers.split TODO

Parameters
tree	the etree tree/node to work on
start_at_path	if you gave it the root of an etree, you can do a subset by handing in xpath here (alternatively, you could navigate yourself and hand the interesting section in directly)
alinea_elemnames	will be ('al',) for the KOOP sources. Was made into a parameter only to make this perhaps-applicable elsewhere, you probably don't want to touch this.
Returns
Returns a list of dicts, one for each <al> (or whatever you handed into alinea_elemnames).

On some flat examples, e.g. officiele-publicaties XMLs, each output might not hold much structure. In some of the better-structured cases, e.g. BWB XMLs, each such output dict might look something like:

    { 'path': '/cvdr/body/regeling/regeling-tekst/hoofdstuk[1]/artikel[1]/al',
      'parts': [ {'what': 'hoofdstuk', 'hoofdstuklabel': 'Hoofdstuk', 'hoofdstuknr': '1',   'hoofdstuktitel': 'Algemene bepalingen',},
                 {'what': 'artikel',   'artikellabel': 'Artikel',     'artikelnr': '1:1',   'artikeltitel': 'Begripsomschrijvingen',} ],
      'merged': {'hoofdstuklabel': 'Hoofdstuk', 'hoofdstuknr': '1',   'hoofdstuktitel': 'Algemene bepalingen',
                 'artikellabel': 'Artikel',     'artikelnr': '1:1',   'artikeltitel': 'Begripsomschrijvingen', },
      'text-flat': 'In deze verordening wordt verstaan dan wel mede verstaan onder:' }

Where:

'parts' details each structural element (boek, hoofdstuk, afdeling, paragraaf, artikel, lid) that encompasses this fragment. The ...label keys are largely redundant, but there are documents that abuse these, which you may want to know.
'merged' is the part dicts, without the 'what' key, update()d into one dict. Intended to be the part you may want to filter on, all in one place. In simpler documents, this eases things and is correct. In complex documents, this may be incorrect, because when you e.g. have an afdeling nested in an afdeling, values from one overwrite the other.
'path' points at the item we're describing (in xpath terms), in case you want to find this element in the original XML / etree form.
'text-flat' is plain text, with any markup elements flattened out.

This is very much intended to be simplified further soon, likely using merge_alinea_data(). This is separated largely for flexibility's sake. (CONSIDER: also provide an extract-text-with-reasonable-defaults function)

WARNING: currently NONE of these keys in parts or merged are settled on yet -- things may change.
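To make the output shape more concrete, here is a minimal standalone sketch of the general idea: walk the tree, remember the structural elements passed on the way down, and emit one dict per <al>. It is not the library's implementation. The function name `sketch_alineas`, the tiny XML document, the `STRUCTURAL` tuple, and the assumption that labels live in a `kop` child are all made up for illustration, and the path it builds omits the positional indices shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; real CVDR/BWB XML is far richer.
XML = """<cvdr><body><regeling><regeling-tekst>
<hoofdstuk><kop><label>Hoofdstuk</label><nr>1</nr><titel>Algemene bepalingen</titel></kop>
<artikel><kop><label>Artikel</label><nr>1:1</nr><titel>Begripsomschrijvingen</titel></kop>
<al>In deze verordening wordt verstaan onder:</al>
</artikel></hoofdstuk>
</regeling-tekst></regeling></body></cvdr>"""

STRUCTURAL = ('hoofdstuk', 'artikel')  # assumption: only a small subset of structural elements


def sketch_alineas(node, parts=(), path=''):
    """Recursively yield one dict per <al>, carrying the enclosing structure."""
    path = f'{path}/{node.tag}'
    if node.tag in STRUCTURAL:
        part = {'what': node.tag}
        kop = node.find('kop')  # assumption: label/nr/titel sit in a <kop> child
        if kop is not None:
            for field in ('label', 'nr', 'titel'):
                value = kop.findtext(field)
                if value is not None:
                    part[node.tag + field] = value
        parts = parts + (part,)
    if node.tag == 'al':
        merged = {}
        for part in parts:
            # later parts overwrite earlier ones on key clashes, as described above
            merged.update({k: v for k, v in part.items() if k != 'what'})
        yield {'path': path, 'parts': list(parts), 'merged': merged,
               'text-flat': ''.join(node.itertext()).strip()}
        return
    for child in node:
        yield from sketch_alineas(child, parts, path)


results = list(sketch_alineas(ET.fromstring(XML)))
print(results[0]['merged'])
```

Because 'merged' is built with update(), two nested elements of the same kind would overwrite each other's keys, which is the caveat noted for complex documents.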

def bwb_manifest_usefuls(tree): ¶

Fetch interesting metadata from manifest WTI XML. TODO: finish

def bwb_merge_usefuls(toestand_usefuls=None, wti_usefuls=None, manifest_usefuls=None): ¶

Merge the result of the above, into a flatter structure

def bwb_searchresult_meta(record_node): ¶

Takes an individual SRU result record as an etree subtree, picks out BWB-specific metadata (merging the separate metadata sections). Returns a dict.

def bwb_title_looks_boring(text): ¶

Given title text, estimate whether the content has much to say

def bwb_toestand_text(tree, debug=False): ¶

Given the document (as an etree object), this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such.

Not structured data; intended to assist generic "how do these sentences parse" code.

TODO:

review this, it makes various assumptions about document structure
review the handling of certain elements, like lijst, table, definitielijst
see if there are contained elements to ignore, like maybe <redactie type="vervanging"> ?
generalize to have a parameter ignore_tags=['li.nr', 'meta-data', 'kop', 'tussenkop', 'plaatje', 'adres', 'specificatielijst', 'artikel.toelichting', 'citaat', 'wetcitaat']

def bwb_toestand_usefuls(tree): ¶

Fetch the most interesting metadata from parsed toestand XML. TODO: finish

def bwb_wti_usefuls(tree): ¶

Fetch interesting metadata from parsed WTI XML. TODO: finish, and actually do it properly -- e.g. look to the schema as to what may be omitted, may repeat, etc.

def cvdr_meta(tree, flatten=False): ¶

Extracts metadata from a CVDR SRU search result's individual record, or CVDR content xml's root.

Because various elements can repeat, and various things frequently do (e.g. 'source'), each value is a list. Using flatten=True MAY CLOBBER DATA and is only recommended for quick and dirty debug prints, not for real use.

Context for flatten:

In a lot of cases we care mainly for tagname and text, and there are no attributes, e.g.

owmskern's <identifier>CVDR641872_2</identifier>
owmskern's <title>Nadere regels jeugdhulp gemeente Pijnacker-Nootdorp 2020</title>
owmskern's <language>nl</language>
owmskern's <modified>2022-02-17</modified>
owmsmantel's <alternative>Nadere regels jeugdhulp gemeente Pijnacker-Nootdorp 2020</alternative>
owmsmantel's <subject>maatschappelijke zorg en welzijn</subject>
owmsmantel's <issued>2022-02-08</issued>
owmsmantel's <rights>De tekst in dit document is vrij van auteursrecht en databankrecht</rights>

In others you may also care about an attribute or two, e.g.:

owmskern's <type scheme="overheid:Informatietype">regeling</type> (except there's no variation in that value anyway)
owmskern's <creator scheme="overheid:Gemeente">Pijnacker-Nootdorp</creator>
owmsmantel's <isRatifiedBy scheme="overheid:BestuursorgaanGemeente">college van burgemeester en wethouders</isRatifiedBy>
owmsmantel's <isFormatOf resourceIdentifier="https://zoek.officielebekendmakingen.nl/gmb-2022-66747">gmb-2022-66747</isFormatOf>
owmsmantel's <source resourceIdentifier="https://lokaleregelgeving.overheid.nl/CVDR641839">Verordening jeugdhulp gemeente Pijnacker-Nootdorp 2020</source>

When those attributes matter, you want flatten=False (the default) and you will get a dict like:

        { 'creator': [{'attr': {'scheme': 'overheid:Gemeente'}, 'text': 'Zuidplas'}], ... }

Parameters
tree	an etree object that is either a search result's individual record (in which case we're looking for ./recordData/gzd/originalData/meta) or CVDR content xml's root (in which case it's ./meta), because both contain almost the same metadata almost the same way (the difference is enrichedData in the search results).
flatten	For quick and dirty presentation (only), you may wish to creatively smush those into one string by asking for `flatten=True`, in which case you get something like: { 'creator': 'Zuidplas (overheid:Gemeente)', ... } Please avoid this when you care to deal with data in a structured way (even if you can sometimes get away with it due to empty attributes).
Returns
a dict containing owmskern, owmsmantel, and cvdripm's elements merged into a single dict. If it's a search result, it will also mention its enrichedData.
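The following standalone sketch illustrates why flattening can lose data. It does not call the library: `sketch_flatten` and the sample `meta` dict are hypothetical, and the rule that the last repeated value wins is an assumption chosen to show the clobbering, not a statement of what cvdr_meta actually does with repeats.

```python
def sketch_flatten(meta):
    """Collapse {'key': [{'attr': {...}, 'text': '...'}, ...]} into {'key': 'text (attr values)'}.

    Assumption for illustration: when a key repeats, only the last value survives.
    """
    flat = {}
    for key, values in meta.items():
        for value in values:
            attrs = ', '.join(value['attr'].values())
            flat[key] = f"{value['text']} ({attrs})" if attrs else value['text']
    return flat


# Hypothetical metadata in the unflattened style shown above
meta = {
    'creator': [{'attr': {'scheme': 'overheid:Gemeente'}, 'text': 'Zuidplas'}],
    'source': [
        {'attr': {}, 'text': 'Participatiewet'},
        {'attr': {}, 'text': 'Gemeentewet'},
    ],
}

flat = sketch_flatten(meta)
print(flat['creator'])  # single value: nothing lost
print(flat['source'])   # two values went in, one string comes out
```

With a repeating element such as 'source', the structured form keeps both entries while the flattened form in this sketch keeps only one, which is why the structured form is the safer default.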

def cvdr_normalize_expressionid(text: str): ¶

When indexing, you may care to transform varied forms (e.g. "112779_1" and "CVDR112779_1") into just one normalized form.

Note that this does not work on workids - we raise an exception for them.

Parameters
text:`str`	the identifier to work on.
Returns
the expression - basically the thing cvdr_parse_identifier() returns, _including_ the 'CVDR' at the start

def cvdr_param_parse(rest: str): ¶

Picks the parameters from a juriconnect (jci) style identifier string.

Mostly a helper function, used e.g. by cvdr_sourcerefs. Duplicates code in meta.py - TODO: centralize that

Would turn "BWB://1.0:c:BWBR0008903&artikel=12&g=2011-11-08" into a dict where 'artikel' maps to ['12'], etc.

Parameters
rest:`str`	the string to parse
Returns
a dict mapping each variable to a list of the values present in this URL-like string
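The described behaviour can be approximated with the standard library, as in the sketch below. This is not the function's actual code; `sketch_param_parse` is a hypothetical name, and treating everything before the first '&' as the reference itself is an assumption based on the example above.

```python
from urllib.parse import parse_qs


def sketch_param_parse(rest):
    """Return {name: [values, ...]} for the key=value pairs after the first '&'."""
    # Assumption: the part before the first '&' is the reference, not a parameter.
    _, _, query = rest.partition('&')
    return parse_qs(query)


params = sketch_param_parse('BWB://1.0:c:BWBR0008903&artikel=12&g=2011-11-08')
print(params)
```

Each value is a list because a parameter name may appear more than once in such a string.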

def cvdr_parse_identifier(text: str, prepend_cvdr: bool = False): ¶

Given a CVDR style identifier string (sometimes called JCDR), gives a more normalized version, e.g. useful for indexing.

For example:

    cvdr_parse_identifier('101404_1')     ==  ('101404', '101404_1')
    cvdr_parse_identifier('CVDR101405_1') ==  ('101405', '101405_1')
    cvdr_parse_identifier('CVDR101406')   ==  ('101406',  None     )
    cvdr_parse_identifier('1.0:101407_1') ==  ('101407', '101407_1')

Parameters
text:`str`	the thing to parse
prepend_cvdr:`bool`	whether to put 'CVDR' in front of the work and the expression. Defaults to false.
Returns
a tuple of strings: (work ID, expression ID), the latter of which will be None if input was a work ID; see the examples above. If it makes _no_ sense as a CVDR number, it may raise a ValueError instead.
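As an illustration of the normalization described here, the following sketch reproduces the four examples above with a single regular expression. It is a guess at equivalent behaviour, not the library's code: `sketch_parse_identifier` and the pattern are hypothetical, and real identifiers may come in forms this pattern does not cover.

```python
import re

# Assumed shape: optional 'N.N:' prefix, optional 'CVDR', digits, optional '_version'
_ID_RE = re.compile(r'^(?:\d+\.\d+:)?(?:CVDR)?(\d+)(?:_(\d+))?$')


def sketch_parse_identifier(text, prepend_cvdr=False):
    """Return (work ID, expression ID or None), raising ValueError on nonsense input."""
    match = _ID_RE.match(text.strip())
    if match is None:
        raise ValueError(f'{text!r} does not look like a CVDR identifier')
    work, version = match.groups()
    prefix = 'CVDR' if prepend_cvdr else ''
    work_id = prefix + work
    expression_id = f'{work_id}_{version}' if version is not None else None
    return work_id, expression_id


print(sketch_parse_identifier('CVDR101405_1'))
print(sketch_parse_identifier('CVDR101406', prepend_cvdr=True))
```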

def cvdr_sourcerefs(tree, ignore_without_id=True, debug=False): ¶

Given the CVDR XML content document as an etree object, looks for the <source> tags (meta/owmsmantel/source), which are references to laws and other regulations (VERIFY)

This function

extracts (only) the source tags that specify ...and ignores

in part to normalize what is in there a bit. Be aware this is more creative than a helper function probably should be.

Returns a list of:

  (type, origref, specref, parts, source_text)

where

type: currently one of 'BWB' or 'CVDR'
origref: URL-like reference
specref: just the identifier
parts: dict of parts parsed from URL, or None
source_text: text (name and often reference to a part); seems to be more convention-based than standardized

For example (mostly to point out there is _plenty_ of variation in most parts):

 ('BWB',
  '1.0:c:BWBR0015703&artikel=6&g=2014-11-13',
  'BWBR0015703',
  {'artikel': ['6'], 'g': ['2014-11-13']},
  'Participatiewet, art. 6')

or:

 ('BWB',
  'http://wetten.overheid.nl/BWBR0015703/geldigheidsdatum_15-05-2011#Hoofdstuk1_12',
  'BWBR0015703',
  {},
  'Wet werk en bijstand, art. 8, lid 1')

or:

 ('CVDR',
  'CVDR://103202_1',
  '103202_1',
  None,
  'Inspraakverordening Spijkenisse, art. 2')

or:

 ('CVDR',
  '1.1:CVDR229520-1',
  '229520',
  None,
  'Verordening voorzieningen maatschappelijke participatie 2012-A, artikel 4')
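Since the tuples vary so much, code consuming them probably needs to be defensive. The sketch below works on the four example tuples above, copied in as literal data (it does not call cvdr_sourcerefs), and shows one way to count referenced BWB regulations and to read 'artikel' safely when parts is None or empty.

```python
from collections import Counter

# The four example tuples from above: (type, origref, specref, parts, source_text)
refs = [
    ('BWB', '1.0:c:BWBR0015703&artikel=6&g=2014-11-13', 'BWBR0015703',
     {'artikel': ['6'], 'g': ['2014-11-13']}, 'Participatiewet, art. 6'),
    ('BWB', 'http://wetten.overheid.nl/BWBR0015703/geldigheidsdatum_15-05-2011#Hoofdstuk1_12',
     'BWBR0015703', {}, 'Wet werk en bijstand, art. 8, lid 1'),
    ('CVDR', 'CVDR://103202_1', '103202_1', None,
     'Inspraakverordening Spijkenisse, art. 2'),
    ('CVDR', '1.1:CVDR229520-1', '229520', None,
     'Verordening voorzieningen maatschappelijke participatie 2012-A, artikel 4'),
]

# How often each BWB regulation is referenced
bwb_counts = Counter(specref for reftype, _, specref, _, _ in refs if reftype == 'BWB')

# parts can be a filled dict, an empty dict, or None, so guard before looking inside
articles = [(specref, (parts or {}).get('artikel')) for _, _, specref, parts, _ in refs]

print(bwb_counts)
print(articles)
```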

def cvdr_text(tree): ¶

Given the XML content document as etree object, this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such.

Returns a single string. This is currently a best-effort formatting, where you should e.g. find that paragraphs are split with double newlines.

This is currently mostly copy-pasted from the bwb code. TODO: unify, after I figure out all the varying structure

TODO: write functions that support "give me flat text for each article separately"

def cvdr_versions_for_work(cvdr_id: str) -> list: ¶

Takes a CVDR id (with or without _version, i.e. expression id or work id), searches KOOP's CVDR repo, and returns a list of all matching version expression ids.

Keep in mind that this actively does requests, so preferably don't do this in bulk, and/or cache your results.

Returns
`list`	a list of expression ids
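One way to follow the advice about caching is to wrap the lookup in a memoizing function, as sketched below. To keep the example runnable without network access, `fake_versions_for_work` is a hypothetical stand-in for the real lookup and returns invented ids; in real use you would call cvdr_versions_for_work inside the wrapper instead. Whether the module already caches internally is not stated here.

```python
import functools

calls = []  # records how often the (pretend) network lookup actually runs


def fake_versions_for_work(cvdr_id):
    """Hypothetical stand-in for a function that performs a network request."""
    calls.append(cvdr_id)
    return [f'{cvdr_id}_1', f'{cvdr_id}_2']


@functools.lru_cache(maxsize=None)
def cached_versions(cvdr_id):
    # lru_cache hands back the same object on every hit,
    # so return an immutable tuple rather than a list
    return tuple(fake_versions_for_work(cvdr_id))


first = cached_versions('CVDR112779')
second = cached_versions('CVDR112779')  # served from the cache, no second lookup
print(first, len(calls))
```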

def merge_alinea_data(alinea_dicts, if_same={'hoofdstuk': 'hoofdstuknr', 'afdeling': 'afdelingnr', 'paragraaf': 'paragraafnr', 'sub-paragraaf': 'subparagraafnr', 'artikel': 'artikelnr', 'lid': 'lidnr', 'definitie-item': 'term', 'uitspraak': 'id', 'uitspraak.info': None, 'section': 'nr'}): ¶

Takes the output of alineas_with_selective_path() and puts text fragments together when their specified ['parts'] values are the same.

In other words, this lets you control just how flat to make the text, e.g.

flatten all text within a lid (e.g. flattening lists),
smush all lid text within an article together, etc.
mostly flatten out the text, but still group it by hoofdstuk if those are present

...etc.

CONSIDER: returning a meta dict for each such grouped text (instead of the raw key)
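The grouping idea can be sketched as follows: build a key from the selected fields of each alinea's 'parts', and append text to the previous group while that key stays the same. This is an illustration under assumptions, not the library's implementation; `sketch_merge`, the sample data, and the (key, texts) return shape are all made up, and entries mapped to None in if_same are simply skipped here.

```python
def sketch_merge(alinea_dicts, if_same):
    """Group consecutive alineas whose selected 'parts' values are identical."""
    merged = []  # list of (key, [text, ...]) in document order
    for alinea in alinea_dicts:
        key = tuple(
            (part['what'], part.get(if_same[part['what']]))
            for part in alinea['parts']
            if if_same.get(part['what']) is not None  # assumption: None means "skip"
        )
        if merged and merged[-1][0] == key:
            merged[-1][1].append(alinea['text-flat'])
        else:
            merged.append((key, [alinea['text-flat']]))
    return merged


# Hypothetical input in the style produced by alineas_with_selective_path()
alineas = [
    {'parts': [{'what': 'artikel', 'artikelnr': '1'}, {'what': 'lid', 'lidnr': '1'}], 'text-flat': 'A'},
    {'parts': [{'what': 'artikel', 'artikelnr': '1'}, {'what': 'lid', 'lidnr': '2'}], 'text-flat': 'B'},
    {'parts': [{'what': 'artikel', 'artikelnr': '2'}], 'text-flat': 'C'},
]

by_article = sketch_merge(alineas, {'artikel': 'artikelnr'})
by_lid = sketch_merge(alineas, {'artikel': 'artikelnr', 'lid': 'lidnr'})
print(by_article)
print(len(by_lid))
```

Leaving 'lid' out of if_same smushes all lid text within an article together; including it keeps each lid separate.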

Returns

def op_data_xml_seems_empty(docbytes, minchars=1): ¶

Some of the XML files presented to us as XML data are actually devoid of content text. This determines if it's such a case. We actually parse it, so we can better distinguish between empty and near-empty.
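A rough standalone sketch of such a check: parse the bytes, join all text nodes, and count the non-whitespace characters. The name `sketch_seems_empty` is hypothetical, and interpreting minchars as "fewer than this many non-whitespace characters counts as empty" is an assumption; the real function may measure this differently.

```python
import xml.etree.ElementTree as ET


def sketch_seems_empty(docbytes, minchars=1):
    """True if the document has fewer than minchars non-whitespace text characters."""
    root = ET.fromstring(docbytes)
    text = ''.join(root.itertext())
    visible = ''.join(text.split())  # drop all whitespace before counting
    return len(visible) < minchars


print(sketch_seems_empty(b'<doc><p>   </p></doc>'))
print(sketch_seems_empty(b'<doc><p>tekst</p></doc>'))
```

Parsing first, rather than looking at the file size, means a document consisting only of tags and whitespace is still recognized as empty.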

def parse_op_metafile(input: bytes | wetsuite.helpers.etree.ElementTree, as_dict=False): ¶

Parses two different metadata-only XML styles found in KOOP's Officiële Publicaties repositories

the one that looks like `<metadata_gegevens>` with a set of `<metadata name="DC.title" scheme="" content="...`
the one that (after a namespace strip) looks like `<owms-metadata><owmskern>` with e.g. `<dcterms:identifier>gmb-...`

Could also raise, e.g. an XMLSyntaxError

NOT TO BE CONFUSED with parse_op_searchmeta

CONSIDER: TODO: a similar flatten parameter (probably defaulting False)

Tries to return them in the same style, e.g.

taking off the name-based grouping from DC.title
taking off the tag-based grouping (ignoring owmskern tag)

Returns
by default, a list of (key, schema, value) tuples; if as_dict=True, a dict like {key: [(schema, value), ...]}
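The relation between the two return styles can be shown with a small conversion. The sketch below uses invented sample tuples and a hypothetical helper, `tuples_to_dict`; it illustrates the two shapes described above and does not call parse_op_metafile.

```python
from collections import defaultdict


def tuples_to_dict(items):
    """Turn [(key, schema, value), ...] into {key: [(schema, value), ...]}."""
    result = defaultdict(list)
    for key, schema, value in items:
        result[key].append((schema, value))
    return dict(result)


# Hypothetical metadata in the default list-of-tuples style
items = [
    ('title', '', 'Voorbeeld'),
    ('creator', 'overheid:Gemeente', 'Zuidplas'),
    ('creator', 'overheid:Gemeente', 'Gouda'),
]

as_dict = tuples_to_dict(items)
print(as_dict['creator'])
```

The list form preserves the original order across keys, while the dict form makes lookups by key easier; repeated keys are kept in both.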

def parse_op_searchmeta(input, flatten=False): ¶

similar to cvdr_meta; CONSIDER: abstract most of that into one helper function

Note that

the 'enriched' and 'manifestations' keys show equivalent information
the 'enriched' and 'manifestations' keys are not affected by flattening.

def parse_repo_url(url): ¶

TODO: see how well this holds up to all areas

Parameters
url	a URL like `https://repository.overheid.nl/frbr/officielepublicaties/ag-ek/1995/ag-ek-1995-02-08/1/xml/ag-ek-1995-02-08.xml`
Returns
a dict like: {'doctype': 'ag-ek', 'group': '1995', 'doc_id': 'ag-ek-1995-02-08', 'mn': '1', 'exprtype': 'xml', 'bn': 'ag-ek-1995-02-08.xml', 'url': 'https://repository.overheid.nl/frbr/officielepublicaties/ag-ek/1995/ag-ek-1995-02-08/1/xml/ag-ek-1995-02-08.xml'}
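For the example URL above, the returned fields line up with the path segments, which the following sketch makes explicit. It is only an approximation for URLs of exactly this shape: `sketch_parse_repo_url` is a hypothetical name, and the real function may accept other layouts that this one rejects.

```python
from urllib.parse import urlsplit


def sketch_parse_repo_url(url):
    """Split a repository URL of the shape shown above into its named parts."""
    parts = urlsplit(url).path.strip('/').split('/')
    # assumed layout: frbr / <collection> / doctype / group / doc_id / mn / exprtype / bn
    if len(parts) != 8 or parts[0] != 'frbr':
        raise ValueError(f'unexpected URL layout: {url!r}')
    _, _, doctype, group, doc_id, mn, exprtype, bn = parts
    return {'doctype': doctype, 'group': group, 'doc_id': doc_id, 'mn': mn,
            'exprtype': exprtype, 'bn': bn, 'url': url}


URL = ('https://repository.overheid.nl/frbr/officielepublicaties/'
       'ag-ek/1995/ag-ek-1995-02-08/1/xml/ag-ek-1995-02-08.xml')
parsed = sketch_parse_repo_url(URL)
print(parsed['doctype'], parsed['doc_id'], parsed['exprtype'])
```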

def prefer_types(given_strlist, all_of=('metadata', 'metadataowms', 'xml'), first_of=('html', 'html.zip', 'pdf', 'odt'), never=('coordinaten', 'jpg', 'ocr'), require_present=('metadata')): ¶

Given a bunch of document types such as:

    ['metadata', 'metadataowms', 'pdf','odt', 'jpg', 'coordinaten', 'ocr', 'html', 'xml']

return only a subset of types, for defaults:

    ['metadata', 'metadataowms', 'xml', 'html']

...because those defaults prefer smaller, more data-like/simpler-to-parse variants and avoid content redundancy that we probably won't end up reading (e.g. why parse odt, pdf when you have html, xml).

DECISIONS/ASSUMPTIONS MADE:

things like 'don't care about HTML when we have XML' which might not fit your needs exactly.
Assumes xml is content, and it can be other things. You probably want to be sure about your specific purpose.
Consider that if you e.g. hand in `['metadata', 'metadataowms', 'xml', 'pdf']` it will return _both_ xml and PDF. If you specifically wanted only one content thing, say xml, then you should probably say first_of=('xml', 'html', 'html.zip', 'pdf', 'odt')

Parameters
given_strlist	Undocumented
all_of	always add these when they appear. Meant for things like metadata, and more data-like formats.
first_of	add the first in this list that matches and stop (stop regardless of whether it was already added via all_of). Meant for things like "if all_of didn't find something, add a single next best thing, probably the one most reasonable to parse"
never	never have these in the returned list (_even_ if they were mentioned in all_of or first_of)
require_present	Undocumented
Returns
a list of strings (often shorter than what we were given)
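The selection rules described by the parameters can be sketched as below. This is a reading of the documentation, not the library's code: `sketch_prefer_types` is a hypothetical name, and require_present is left out because its behaviour is undocumented here.

```python
def sketch_prefer_types(given,
                        all_of=('metadata', 'metadataowms', 'xml'),
                        first_of=('html', 'html.zip', 'pdf', 'odt'),
                        never=('coordinaten', 'jpg', 'ocr')):
    """Pick a subset of document types following the all_of / first_of / never rules."""
    # always keep these when present
    chosen = [t for t in all_of if t in given]
    # add only the first match from first_of, then stop
    for t in first_of:
        if t in given:
            if t not in chosen:
                chosen.append(t)
            break  # stop even if it was already added via all_of
    # never wins over everything else
    return [t for t in chosen if t not in never]


everything = ['metadata', 'metadataowms', 'pdf', 'odt', 'jpg',
              'coordinaten', 'ocr', 'html', 'xml']
print(sketch_prefer_types(everything))

# Putting 'xml' first in first_of means no second content format is added
only_xml = sketch_prefer_types(['metadata', 'metadataowms', 'xml', 'pdf'],
                               first_of=('xml', 'html', 'html.zip', 'pdf', 'odt'))
print(only_xml)
```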

_re_repourl_parl = ¶

Undocumented

_versions_cache: dict = ¶

Undocumented
