Data and metadata parsing that is probably specific to KOOP's SRU repositories.
For more general things, see
- meta.py
- patterns.py
Function | alineas |
Given document-style XML data such as that of CVDR XML documents, tries to capture most of the interesting structure in easier-to-digest python data form, and lessen the nested nature without quite throwing it away. |
Function | bwb |
Fetch interesting metadata from manifest WTI XML. TODO: finish |
Function | bwb |
Merge the result of the above, into a flatter structure |
Function | bwb |
Takes an individual SRU result record as an etree subtree, picks out BWB-specific metadata (merging the separate metadata sections). Returns a dict |
Function | bwb |
Given the title text, estimates whether the content has much to say |
Function | bwb |
Given the document (as an etree object), this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such. |
Function | bwb |
Fetch the most interesting metadata from parsed toestand XML. TODO: finish |
Function | bwb |
Fetch interesting metadata from parsed WTI XML. TODO: finish, and actually do it properly -- e.g. look to the schema as to what may be omitted, may repeat, etc. |
Function | cvdr |
Extracts metadata from a CVDR SRU search result's individual record, or CVDR content xml's root. |
Function | cvdr |
When indexing, you may care to transform varied forms (e.g. "112779_1" and "CVDR112779_1") into just one normalized form. |
Function | cvdr |
Picks the parameters from a juriconnect (jci) style identifier string. |
Function | cvdr |
Given a CVDR style identifier string (sometimes called JCDR), gives a more normalized version, e.g. useful for indexing. |
Function | cvdr |
Given the CVDR XML content document as an etree object, looks for the <source> tags (meta/owmsmantel/source), which are references to laws and other regulations (VERIFY) |
Function | cvdr |
Given the XML content document as etree object, this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such. |
Function | cvdr |
Takes a CVDR id (with or without _version, i.e. expression id or work id) and searches KOOP's CVDR repo. Returns a list of all matching version expression ids |
Function | merge |
Takes the output of alineas_with_selective_path() and puts text fragments together when their specified ['parts'] values are the same. |
Function | op |
Some of the XML files presented to us as XML data are actually devoid of content text. This determines if it's such a case. We actually parse it, so we can better distinguish between empty and near-empty. |
Function | parse |
Parses two different metadata-only XML styles found in KOOP's Officiële Publicaties repositories |
Function | parse |
similar to cvdr_meta; CONSIDER: abstract most of that into one helper function |
Function | parse |
TODO: see how well this holds up to all areas |
Function | prefer |
Given a bunch of document types such as: |
Variable | _re |
Undocumented |
Variable | _versions |
Undocumented |
Given document-style XML data such as that of CVDR XML documents, tries to capture most of the interesting structure in easier-to-digest python data form, and lessen the nested nature without quite throwing it away.
Whenever we hit an <al>, we emit a dict that details all the interesting elements between body and this <al>
This is intended as the lower-level half of potential cleanup, grouping/splitting logic, and such. ...though it is specific to a handful of XML schemas, so not universal. For wider applicability you may want to look to helpers.split TODO
Parameters | |
tree | the etree tree/node to work on |
start | if you gave it the root of an etree, you can do a subset by handing in xpath here (alternatively, you could navigate yourself and hand the interesting section in directly) |
alinea | will be ('al',) for the KOOP sources. Was made into a parameter only to make this perhaps-applicable elsewhere, you probably don't want to touch this. |
Returns | |
Returns a list of dicts, one for each <al> (or whatever you handed into alinea_elemnames). For flat sources, e.g. officiele-publicaties XMLs, each output might not hold much structure. For some of the better-structured cases, e.g. BWB XMLs, each such output dict might look something like: :
    {
      'path': '/cvdr/body/regeling/regeling-tekst/hoofdstuk[1]/artikel[1]/al',
      'parts': [
        {'what': 'hoofdstuk', 'hoofdstuklabel': 'Hoofdstuk', 'hoofdstuknr': '1', 'hoofdstuktitel': 'Algemene bepalingen',},
        {'what': 'artikel', 'artikellabel': 'Artikel', 'artikelnr': '1:1', 'artikeltitel': 'Begripsomschrijvingen',}
      ],
      'merged': {'hoofdstuklabel': 'Hoofdstuk', 'hoofdstuknr': '1', 'hoofdstuktitel': 'Algemene bepalingen',
                 'artikellabel': 'Artikel', 'artikelnr': '1:1', 'artikeltitel': 'Begripsomschrijvingen', },
      'text-flat': 'In deze verordening wordt verstaan dan wel mede verstaan onder:'
    }
Where:
|
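As a rough illustration of the "emit one dict per <al>" idea, here is a much-simplified sketch using only the standard library. The name alineas_sketch is hypothetical, and unlike the output shown above it records no positional indices, 'parts' or 'merged' details; it only shows the walk that remembers the ancestors of each <al>.

```python
import xml.etree.ElementTree as ET


def alineas_sketch(root, alinea_elemnames=("al",)):
    """Return one dict per alinea element, with its ancestor path and flat text."""
    results = []

    def walk(elem, ancestors):
        here = ancestors + [elem.tag]
        if elem.tag in alinea_elemnames:
            # Collapse all nested text and whitespace into a single line.
            flat = " ".join("".join(elem.itertext()).split())
            results.append({"path": "/" + "/".join(here), "text-flat": flat})
            return  # do not descend into the alinea itself
        for child in elem:
            walk(child, here)

    walk(root, [])
    return results


doc = ET.fromstring(
    "<body><artikel><al>Eerste <nadruk>lid</nadruk>.</al><al>Tweede.</al></artikel></body>"
)
found = alineas_sketch(doc)
```

The real function additionally describes each interesting ancestor (hoofdstuk, artikel, and so on), which this sketch leaves out.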
Takes an individual SRU result record as an etree subtree, picks out BWB-specific metadata (merging the separate metadata sections). Returns a dict
Given the document (as an etree object), this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such.
Not structured data, intended to assist generic "how do these sentences parse" code
TODO:
- review this, it makes various assumptions about document structure
- review the handling of certain elements, like lijst, table, definitielijst
- see if there are contained elements to ignore, like maybe <redactie type="vervanging"> ?
- generalize to have a parameter ignore_tags=['li.nr', 'meta-data', 'kop', 'tussenkop', 'plaatje', 'adres', 'specificatielijst', 'artikel.toelichting', 'citaat', 'wetcitaat']
Fetch interesting metadata from parsed WTI XML. TODO: finish, and actually do it properly -- e.g. look to the schema as to what may be omitted, may repeat, etc.
Extracts metadata from a CVDR SRU search result's individual record, or CVDR content xml's root.
Because various elements can repeat - and various things frequently do (e.g. 'source') - each value is a list. Using flatten=True MAY CLOBBER DATA and is only recommended for quick and dirty debug prints, not for real use.
Context for flatten:
In a lot of cases we care mainly for tagname and text, and there are no attributes, e.g.
- owmskern's <identifier>CVDR641872_2</identifier>
- owmskern's <title>Nadere regels jeugdhulp gemeente Pijnacker-Nootdorp 2020</title>
- owmskern's <language>nl</language>
- owmskern's <modified>2022-02-17</modified>
- owmsmantel's <alternative>Nadere regels jeugdhulp gemeente Pijnacker-Nootdorp 2020</alternative>
- owmsmantel's <subject>maatschappelijke zorg en welzijn</subject>
- owmsmantel's <issued>2022-02-08</issued>
- owmsmantel's <rights>De tekst in dit document is vrij van auteursrecht en databankrecht</rights>
In others you may also care about an attribute or two, e.g.:
- owmskern's <type scheme="overheid:Informatietype">regeling</type> (except there's no variation in that value anyway)
- owmskern's <creator scheme="overheid:Gemeente">Pijnacker-Nootdorp</creator>
- owmsmantel's <isRatifiedBy scheme="overheid:BestuursorgaanGemeente">college van burgemeester en wethouders</isRatifiedBy>
- owmsmantel's <isFormatOf resourceIdentifier="https://zoek.officielebekendmakingen.nl/gmb-2022-66747">gmb-2022-66747</isFormatOf>
- owmsmantel's <source resourceIdentifier="https://lokaleregelgeving.overheid.nl/CVDR641839">Verordening jeugdhulp gemeente Pijnacker-Nootdorp 2020</source>
When those attributes matter, you want flatten=False (the default) and you will get a dict like: :
{ 'creator': [{'attr': {'scheme': 'overheid:Gemeente'}, 'text': 'Zuidplas'}], ... }
Parameters | |
tree | an etree object that is either a CVDR SRU search result's individual record or a CVDR content XML's root
...because both contain almost the same metadata in almost the same way (the difference is enrichedData in the search results). |
flatten | For quick and dirty presentation (only) you may wish to creatively smush those into one string by asking for flatten=True, in which case you get something like: : { 'creator': 'Zuidplas (overheid:Gemeente)', ... } Please avoid this when you care to deal with data in a structured way (even if you can sometimes get away with it due to empty attributes). |
Returns | |
a dict containing owmskern, owmsmantel, and cvdripm's elements merged into a single dict. If it's a search result, it will also mention its enrichedData. |
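To make the flatten warning concrete, here is a minimal sketch of what such smushing could look like. The function flatten_meta_sketch is hypothetical and not the module's code. In particular, keeping only the last value of a repeated key is an assumption, chosen to show one way data can get clobbered.

```python
def flatten_meta_sketch(meta):
    """Smush each key's list of {'attr', 'text'} dicts into a single string.

    Assumption: when a key has several values, only the last one survives.
    """
    flat = {}
    for key, values in meta.items():
        for value in values:
            text = value.get("text") or ""
            attrs = ", ".join(str(v) for v in value.get("attr", {}).values())
            flat[key] = f"{text} ({attrs})" if attrs else text
    return flat


meta = {
    "creator": [{"attr": {"scheme": "overheid:Gemeente"}, "text": "Zuidplas"}],
    "source": [
        {"attr": {}, "text": "first reference"},
        {"attr": {}, "text": "second reference"},
    ],
}
flat = flatten_meta_sketch(meta)
# 'creator' becomes one readable string; 'source' loses its first entry.
```

With the unflattened form, every repeated 'source' stays available as its own list item, which is why that form is the safer one for real processing.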
When indexing, you may care to transform varied forms (e.g. "112779_1" and "CVDR112779_1") into just one normalized form.
Note that this does not work on workids - we raise an exception for them.
Parameters | |
text:str | the identifier to work on. |
Returns | |
the expression - basically the thing cvdr_parse_identifier() returns, _including_ the 'CVDR' at the start |
Picks the parameters from a juriconnect (jci) style identifier string.
Mostly a helper function, used e.g. by cvdr_sourcerefs. Duplicates code in meta.py - TODO: centralize that
Would turn "BWB://1.0:c:BWBR0008903&artikel=12&g=2011-11-08" into a dict where 'artikel' maps to ['12'], etc.
Parameters | |
rest:str | the string to parse |
Returns | |
a dict containing each variable to a list of values present in this URL-like string |
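The example above can be approximated with the standard library alone. This is a sketch under the assumption that everything after the first '&' is an ordinary query string; parse_jci_params_sketch is a hypothetical name, and the real helper may treat the leading part differently.

```python
from urllib.parse import parse_qs


def parse_jci_params_sketch(text):
    """Return the '&'-separated parameters of a jci-style string as a dict of lists."""
    # Everything before the first '&' is the identifier part; the rest are parameters.
    _head, _sep, query = text.partition("&")
    return parse_qs(query)


params = parse_jci_params_sketch("BWB://1.0:c:BWBR0008903&artikel=12&g=2011-11-08")
```

Because parse_qs always maps each name to a list, repeated parameters are preserved rather than overwritten.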
Given a CVDR style identifier string (sometimes called JCDR), gives a more normalized version, e.g. useful for indexing.
For example:
cvdr_parse_identifier('101404_1') == ('101404', '101404_1')
cvdr_parse_identifier('CVDR101405_1') == ('101405', '101405_1')
cvdr_parse_identifier('CVDR101406') == ('101406', None )
cvdr_parse_identifier('1.0:101407_1') == ('101407', '101407_1')
Parameters | |
text:str | the thing to parse |
prependbool | whether to put 'CVDR' in front of the work and the expression. Defaults to false. |
Returns | |
a tuple of strings: (work ID, expression ID), the latter of which will be None if input was a work ID; see the examples above. If it makes _no_ sense as a CVDR number, it may raise a ValueError instead. |
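One way to get the behaviour in the examples above is a single regular expression. This is a sketch, not the module's implementation: the name parse_cvdr_identifier_sketch, the parameter name prepend_cvdr, and the exact set of accepted prefixes are all assumptions.

```python
import re

# Optional 'CVDR://' scheme, optional 'N.N:' version prefix, optional 'CVDR',
# then the numeric work id and an optional '_version' suffix.
_CVDR_ID = re.compile(r"(?:CVDR://)?(?:\d+\.\d+:)?(?:CVDR)?(\d+)(?:_(\d+))?")


def parse_cvdr_identifier_sketch(text, prepend_cvdr=False):
    """Return (work id, expression id or None) for a CVDR-style identifier."""
    match = _CVDR_ID.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"does not look like a CVDR identifier: {text!r}")
    work, version = match.groups()
    work_id = ("CVDR" if prepend_cvdr else "") + work
    expression_id = f"{work_id}_{version}" if version is not None else None
    return work_id, expression_id
```

Using fullmatch rather than search means trailing junk is rejected instead of silently ignored, which fits the "may raise a ValueError" contract described above.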
Given the CVDR XML content document as an etree object, looks for the <source> tags (meta/owmsmantel/source), which are references to laws and other regulations (VERIFY)
This function
- extracts (only) the source tags that specify ...and ignores
in part to normalize what is in there a bit. Be aware this is more creative than a helper function probably should be.
Returns a list of :
(type, origref, specref, parts, source_text)
where
- type: currently one of 'BWB' or 'CVDR'
- origref: URL-like reference
- specref: just the identifier
- parts: dict of parts parsed from URL, or None
- source_text: text (name and often reference to a part); seems to be more convention-based than standardized
For example (mostly to point out there is _plenty_ of variation in most parts) :
('BWB', '1.0:c:BWBR0015703&artikel=6&g=2014-11-13', 'BWBR0015703', {'artikel': ['6'], 'g': ['2014-11-13']}, 'Participatiewet, art. 6')
or :
('BWB', 'http://wetten.overheid.nl/BWBR0015703/geldigheidsdatum_15-05-2011#Hoofdstuk1_12', 'BWBR0015703', {}, 'Wet werk en bijstand, art. 8, lid 1')
or :
('CVDR', 'CVDR://103202_1', '103202_1', None, 'Inspraakverordening Spijkenisse, art. 2')
or :
('CVDR', '1.1:CVDR229520-1', '229520', None, 'Verordening voorzieningen maatschappelijke participatie 2012-A, artikel 4')
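Since the return value is a plain list of 5-tuples, downstream code can unpack it directly. The sketch below reuses the example tuples above as literal data (it does not call the real function) and groups the referenced identifiers by type.

```python
from collections import defaultdict

# Example tuples copied from the documentation above.
refs = [
    ("BWB", "1.0:c:BWBR0015703&artikel=6&g=2014-11-13", "BWBR0015703",
     {"artikel": ["6"], "g": ["2014-11-13"]}, "Participatiewet, art. 6"),
    ("BWB", "http://wetten.overheid.nl/BWBR0015703/geldigheidsdatum_15-05-2011#Hoofdstuk1_12",
     "BWBR0015703", {}, "Wet werk en bijstand, art. 8, lid 1"),
    ("CVDR", "CVDR://103202_1", "103202_1", None,
     "Inspraakverordening Spijkenisse, art. 2"),
    ("CVDR", "1.1:CVDR229520-1", "229520", None,
     "Verordening voorzieningen maatschappelijke participatie 2012-A, artikel 4"),
]

by_type = defaultdict(list)
for reftype, origref, specref, parts, source_text in refs:
    by_type[reftype].append(specref)

# 'parts' may be None or an empty dict, so guard before looking inside it.
articles = [
    (specref, parts["artikel"])
    for _type, _orig, specref, parts, _text in refs
    if parts and "artikel" in parts
]
```

Note the guard on parts: as the examples show, it can be a filled dict, an empty dict, or None.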
Given the XML content document as etree object, this is a quick and dirty 'give me mainly the plaintext in it', skipping any introductions and such.
Returns a single string. This is currently a best-effort formatting, where you should e.g. find that paragraphs are split with double newlines.
This is currently mostly copy-pasted from the bwb code TODO: unify, after I figure out all the varying structure
TODO: write functions that support "give me flat text for each article separately"
Takes a CVDR id (with or without _version, i.e. expression id or work id) and searches KOOP's CVDR repo. Returns a list of all matching version expression ids
Keep in mind that this actively does requests, so preferably don't do this in bulk, and/or cache your results.
Parameters | |
cvdrstr | Undocumented |
Returns | |
list | a list of expression ids |
Takes the output of alineas_with_selective_path() and puts text fragments together when their specified ['parts'] values are the same.
In other words, this lets you control just how flat to make the text, e.g.
- flatten all text within a lid (e.g. flattening lists),
- smush all lid text within an article together, etc.
- mostly flatten out the text, but still group it by hoofdstuk if those are present
...etc.
CONSIDER: returning a meta dict for each such grouped text (instead of the raw key)
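The grouping idea can be sketched as follows. This assumes each input item is a dict with a 'parts' list of dicts and a 'text-flat' string, as in the alineas output; the actual output format of alineas_with_selective_path() is not documented here, and merge_alineas_sketch is a hypothetical name.

```python
from itertools import groupby


def merge_alineas_sketch(alineas):
    """Join the text of consecutive alineas whose 'parts' are identical."""

    def key(alinea):
        # Make the list of dicts hashable and comparable.
        return tuple(tuple(sorted(part.items())) for part in alinea["parts"])

    return [
        (group_key, " ".join(item["text-flat"] for item in group))
        for group_key, group in groupby(alineas, key=key)
    ]


alineas = [
    {"parts": [{"what": "artikel", "artikelnr": "1"}], "text-flat": "Eerste lid."},
    {"parts": [{"what": "artikel", "artikelnr": "1"}], "text-flat": "Tweede lid."},
    {"parts": [{"what": "artikel", "artikelnr": "2"}], "text-flat": "Ander artikel."},
]
merged_texts = [text for _key, text in merge_alineas_sketch(alineas)]
```

How coarse the result is depends entirely on what ends up in 'parts': fewer details there means more text merged together.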
Returns | |
Some of the XML files presented to us as XML data are actually devoid of content text. This determines if it's such a case. We actually parse it, so we can better distinguish between empty and near-empty.
Parses two different metadata-only XML styles found in KOOP's Officiële Publicaties repositories
- the one that looks like `<metadata_gegevens>` with a set of `<metadata name="DC.title" scheme="" content="...`
- the one that (after a namespace strip) looks like `<owms-metadata><owmskern>` with e.g. `<dcterms:identifier>gmb-...`
Could also raise an exception, e.g. an XMLSyntaxError
NOT TO BE CONFUSED with parse_op_searchmeta
CONSIDER: TODO: a similar flatten parameter (probably defaulting False)
Tries to return them in the same style, e.g.
- taking off the name-based grouping from DC.title
- taking off the tag-based grouping (ignoring owmskern tag)
Parameters | |
input:Union[ | Undocumented |
as | Undocumented |
Returns | |
|
similar to cvdr_meta; CONSIDER: abstract most of that into one helper function
Note that
- the 'enriched' and 'manifestations' keys show equivalent information
- the 'enriched' and 'manifestations' keys are not affected by flattening.
TODO: see how well this holds up to all areas
Parameters | |
url | a URL like https://repository.overheid.nl/frbr/officielepublicaties/ag-ek/1995/ag-ek-1995-02-08/1/xml/ag-ek-1995-02-08.xml |
Returns | |
a dict like: {'doctype': 'ag-ek', 'group': '1995', 'doc_id': 'ag-ek-1995-02-08', 'mn': '1', 'exprtype': 'xml', 'bn': 'ag-ek-1995-02-08.xml', 'url': 'https://repository.overheid.nl/frbr/officielepublicaties/ag-ek/1995/ag-ek-1995-02-08/1/xml/ag-ek-1995-02-08.xml'} |
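The dict above corresponds to the last six path segments of the URL. Here is a minimal sketch of that split; parse_repo_url_sketch is a hypothetical name, and the assumption that every area of the repository uses this same six-segment layout is exactly what the TODO above still questions.

```python
from urllib.parse import urlsplit


def parse_repo_url_sketch(url):
    """Split a repository URL into its trailing six path segments."""
    segments = urlsplit(url).path.strip("/").split("/")
    if len(segments) < 6:
        raise ValueError(f"unexpected URL layout: {url!r}")
    doctype, group, doc_id, mn, exprtype, bn = segments[-6:]
    return {
        "doctype": doctype,
        "group": group,
        "doc_id": doc_id,
        "mn": mn,
        "exprtype": exprtype,
        "bn": bn,
        "url": url,
    }


example_url = (
    "https://repository.overheid.nl/frbr/officielepublicaties/"
    "ag-ek/1995/ag-ek-1995-02-08/1/xml/ag-ek-1995-02-08.xml"
)
parsed = parse_repo_url_sketch(example_url)
```

Counting from the end rather than from the start keeps the sketch indifferent to how many prefix segments a given repository area has.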
Given a bunch of document types such as:
['metadata', 'metadataowms', 'pdf','odt', 'jpg', 'coordinaten', 'ocr', 'html', 'xml']
return only a subset of types, for defaults:
['metadata', 'metadataowms', 'xml', 'html']
...because those defaults prefer smaller, more data-like/simpler-to-parse variants and avoid content redundancy that we probably won't end up reading (e.g. why parse odt, pdf when you have html, xml).
DECISIONS/ASSUMPTIONS MADE:
- things like 'don't care about HTML when we have XML' which might not fit your needs exactly.
- Assumes xml is content, and it can be other things. You probably want to be sure about your specific purpose.
- Consider that if you e.g. hand in `['metadata', 'metadataowms', 'xml', 'pdf']` it will return _both_ xml and PDF. If you specifically wanted only one content type, e.g. xml, then you should probably say first_of=('xml', 'html', 'html.zip', 'pdf', 'odt')
Parameters | |
given | Undocumented |
all | always add these when they appear. Meant for things like metadata, and more data-like formats. |
first | add the first in this list that matches and stop (stop regardless of whether it was already added via always) Meant for things like "if always didn't find something, add a single next best thing, probably the one most reasonable to parse" |
never | never have these in the returned list (_even_ if they were mentioned in always or first_of) |
require | Undocumented |
Returns | |
a list of strings (often shorter than what we were given) |
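The interplay of always, first_of and never can be sketched like this. It is an illustration of the described rules only: prefer_types_sketch is a hypothetical name, the module's actual default values are not reproduced here, and the undocumented require parameter is left out.

```python
def prefer_types_sketch(given, always=(), first_of=(), never=()):
    """Pick a subset of the given document types according to the three rules."""
    # 1. Keep every 'always' type that is actually present.
    chosen = [t for t in always if t in given]
    # 2. Add the first present 'first_of' type, then stop looking,
    #    even if that type was already chosen via 'always'.
    for t in first_of:
        if t in given:
            if t not in chosen:
                chosen.append(t)
            break
    # 3. 'never' wins over both of the above.
    return [t for t in chosen if t not in never]


given = ["metadata", "metadataowms", "pdf", "odt", "jpg", "coordinaten", "ocr", "html", "xml"]
picked = prefer_types_sketch(
    given,
    always=("metadata", "metadataowms", "xml", "html"),
    first_of=("xml", "html", "pdf"),
)
```

With these explicit arguments the result matches the subset shown earlier; with only metadata and a PDF available, a first_of list ending in 'pdf' would fall back to the PDF.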