Things that parse metadata.
Specifically for things not tied to a single API or data source, or that we otherwise expect to see some reuse of.
There are similar functions in other places, in particular where they are more specific. For example, helpers specific to KOOP's presentations of BWB, CVDR, and OP sit in helpers.koop_parse
The function name should give you some indication of what it is associated with, and how specific it is.
| Function | findall |
Look for identifiers like stcrt-2009-9231 and ah-tk-20082009-2945. Might also find a few things that are not such identifiers. |
| Function | findall |
Within plain text, this tries to find all occurrences of things that look like an ECLI identifier |
| Function | group |
Given URLs from the officielepublicaties SRU repository, parse out useful information and also group by that information. |
| Function | is |
Do two CELEX identifiers refer to the same document? |
| Function | parse |
Parses identifiers like kst-26643-144-h1, h-tk-20082009-7140-7144, ah-tk-20082009-2945, stcrt-2009-9231 |
| Function | parse |
Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers. |
| Function | parse |
Parses something we know is an ECLI, reports the parts in a dict. |
| Function | parse |
Takes something in the form of jci{version}:{type}:{BWB-id}{key-value}*, so e.g. : |
| Function | parse |
Parse kamerstukken identifiers like kst-26643-144-h1 |
| Function | parse |
Parse an officielepublicaties repository URL. |
| Constant | CELEX |
The three-letter codes that CELEX uses to refer to countries |
| Constant | CELEX |
The document types defined within CELEX sectors |
| Constant | CELEX |
The sectors defined within CELEX |
| Function | _celex |
helper to search in CELEX_DOCTYPES. Returns None if nothing matches. |
| Function | _is |
Undocumented |
| Constant | _RE |
Undocumented |
| Constant | _RE |
Undocumented |
| Constant | _RE |
Undocumented |
| Variable | _re |
Undocumented |
Look for identifiers like stcrt-2009-9231 and ah-tk-20082009-2945. Might also find a few things that are not such identifiers.
TODO: give this function a better name, it's not just bekendmakingen.
| Parameters | |
instring:str | the string to look in |
| Returns | |
| a list of strings, like ["stcrt-2009-9231", "ah-tk-20082009-2945"] | |
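As a rough illustration of this kind of search, here is a minimal sketch using only the standard library. The pattern and the name findall_ids_sketch are assumptions made for this example, not the pattern or name this module uses.

```python
import re

# Assumption: a lowercase type prefix, optionally followed by one more
# lowercase part (such as "-tk"), then one or more hyphen-separated digit groups.
# This is an illustrative pattern, not the one used by the module.
_ID_SKETCH = re.compile(r"\b[a-z]+(?:-[a-z]+)?(?:-[0-9]+)+\b")


def findall_ids_sketch(instring: str) -> list:
    """Return every substring of instring that matches the assumed shape."""
    return _ID_SKETCH.findall(instring)


found = findall_ids_sketch("Zie stcrt-2009-9231 en ook ah-tk-20082009-2945.")
```

A pattern this loose also matches unrelated text of the same shape, for example covid-19, which is the kind of false positive the description warns about.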
Within plain text, this tries to find all occurrences of things that look like an ECLI identifier
| Parameters | |
string:str | the string to look in |
| rstrip | whether to strip any final dot(s) from each match. Dots are valid within an ECLI (typically used as a separator), but a dot at the end is more likely to be the end of a sentence that the ECLI sits in than part of the ECLI itself. This stripping is enabled by default, but it would be more correct for you to always set this parameter explicitly, and for well-controlled metadata fields it may be more correct to use False. |
| Returns | |
| a list of strings. | |
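The effect of rstrip can be sketched as follows. The pattern is an assumption based on the general ECLI shape (ECLI:country:court:year:ordinal), and findall_ecli_sketch is a hypothetical name; neither is taken from this module.

```python
import re

# Assumption: two-letter country code, court code of up to 7 characters,
# four-digit year, and an ordinal of letters, digits and dots.
_ECLI_SKETCH = re.compile(
    r"ECLI:[A-Z]{2}:[A-Za-z0-9.]{1,7}:[0-9]{4}:[A-Za-z0-9.]{1,25}"
)


def findall_ecli_sketch(string: str, rstrip: bool = True) -> list:
    """Find ECLI-like substrings; optionally strip trailing dots from each."""
    found = _ECLI_SKETCH.findall(string)
    if rstrip:
        # A trailing dot is valid in an ordinal, but more likely ends a sentence.
        found = [ecli.rstrip(".") for ecli in found]
    return found


sentence = "Zie ook ECLI:NL:HR:1977:AC1784."
stripped = findall_ecli_sketch(sentence)
unstripped = findall_ecli_sketch(sentence, rstrip=False)
```

Because a dot is a valid ordinal character, the pattern itself cannot tell the two cases apart, which is why the choice is left to the caller.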
Given URLs from the officielepublicaties SRU repository, parse out useful information and also group by that information.
Consider that metadata and the one-or-more content URLs are stored separately. You would probably want those to be grouped for you, to point out
- which urls belong together
- what (meta)data type they are (e.g. 'pdf' and 'metadata')
- what document identifier they are both part of (e.g. 'ag-10656')
- what area they share (e.g. 'ag')
For example, if you handed in:
- https://repository.overheid.nl/frbr/officielepublicaties/ag/onopgemaakt/ag-10656/1/metadata/metadata.xml
- https://repository.overheid.nl/frbr/officielepublicaties/ag/onopgemaakt/ag-10656/1/pdf/ag-10656.pdf
You would get:
{'ag': {
    'ag-10656': {
        'metadata': {
            'area': 'ag',
            'id': 'ag-10656',
            'mtype': 'metadata',
            'mnum': '1',
            'basename': 'metadata.xml',
            'url': 'https://repository.overheid.nl/frbr/officielepublicaties/ag/onopgemaakt/ag-10656/1/metadata/metadata.xml'
        },
        'pdf': {
            'area': 'ag',
            'id': 'ag-10656',
            'mtype': 'pdf',
            'mnum': '1',
            'basename': 'ag-10656.pdf',
            'url': 'https://repository.overheid.nl/frbr/officielepublicaties/ag/onopgemaakt/ag-10656/1/pdf/ag-10656.pdf'
        }
    }
}}
| Parameters | |
| ud | list of either URL strings, or dicts (presumed to be such URLs parsed by parse_op_repo_url) |
| Returns | |
| nested dict structure | |
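The grouping described above can be sketched roughly like this. Both function names are hypothetical, and the assumed path layout (area, one intermediate directory, identifier, number, type, basename as the last six path segments) is inferred only from the example URLs shown here.

```python
from urllib.parse import urlsplit


def parse_url_sketch(url: str) -> dict:
    """Split a repository URL into parts, assuming the last six path segments
    are area / intermediate dir / id / mnum / mtype / basename."""
    parts = urlsplit(url).path.strip("/").split("/")
    area, _subdir, doc_id, mnum, mtype, basename = parts[-6:]
    return {"area": area, "id": doc_id, "mtype": mtype,
            "mnum": mnum, "basename": basename, "url": url}


def group_urls_sketch(urls) -> dict:
    """Group by area, then document id, then (meta)data type."""
    grouped = {}
    for item in urls:
        # Accept either URL strings or already-parsed dicts.
        details = item if isinstance(item, dict) else parse_url_sketch(item)
        by_id = grouped.setdefault(details["area"], {})
        by_id.setdefault(details["id"], {})[details["mtype"]] = details
    return grouped


base = "https://repository.overheid.nl/frbr/officielepublicaties/ag/onopgemaakt/ag-10656/1/"
grouped = group_urls_sketch([base + "metadata/metadata.xml", base + "pdf/ag-10656.pdf"])
```

Nesting with setdefault keeps the sketch short; a URL with fewer than six path segments raises a ValueError at the unpacking step.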
Do two CELEX identifiers refer to the same document?
This is currently somewhat crude; we should read up on the details. Currently:
- ignores sector to be able to ignore sector 0
- tries to ignore transpositions
| Parameters | |
celex1:str | CELEX identifier as string. Will be parsed. |
celex2:str | CELEX identifier as string. Will be parsed. |
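One way to picture "ignoring the sector" is to compare only year, document type, and document number. The sketch below does that under assumptions of its own: the regular expression covers only common CELEX shapes, the name same_document_sketch is hypothetical, and national transpositions are not handled specially.

```python
import re

# Assumption: sector character, 4-digit year, 1-2 letter document type,
# 4-digit number with an optional addition such as "(01)".
# The sector is matched but deliberately not captured.
_CELEX_CORE = re.compile(r"^[0-9CE]([0-9]{4})([A-Z]{1,2})([0-9]{4}(?:\([0-9]+\))?)")


def _core(celex: str) -> tuple:
    """Return (year, document type, document number), ignoring the sector."""
    text = celex.strip().upper()
    if text.startswith("CELEX:"):
        text = text[len("CELEX:"):]
    match = _CELEX_CORE.match(text)
    if match is None:
        raise ValueError("does not look like a CELEX identifier: %r" % celex)
    return match.groups()


def same_document_sketch(celex1: str, celex2: str) -> bool:
    """Crude comparison: equal if year, type and number agree."""
    return _core(celex1) == _core(celex2)


consolidated_matches = same_document_sketch("32016R0679", "CELEX:02016R0679-20160504")
```

Anything after the document number, such as a consolidation date, is ignored here, while an addition like (01) is kept because it distinguishes separate documents.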
Parses identifiers like
kst-26643-144-h1
h-tk-20082009-7140-7144
ah-tk-20082009-2945
stcrt-2009-9231
TODO: give this function a better name, it's not just bekendmakingen. We're not sure it has a real name, though.
Notes:
as of this writing, this function still fails on ~0.01% of the keys I've seen, but most of those seem to be invalid (and almost all of those are kst-, since improved in parse_kst_id), or perhaps a rare variant.
if you match on something like ([a-z-]+)[0-9A-Z], you get more than the below - but it depends on the documents you source.
- sometimes you get a bunch of ids that suggest a soft subcategory, e.g. nds-bzk0700034-b1
- sometimes you get a capital you weren't expecting, e.g. Stcrt-2001-130-CAO1965
CONSIDER: also producing citation form(s) of each.
| Parameters | |
| s | the string to parse as a single identifier. |
| Returns | |
dict with basic details, e.g. parse_bekendmaking_id('stb-2023-281') gives:
{'type':'stb', 'jaar':'2023', 'docnum':'281'}
where 'type' and 'docnum' are guaranteed to be there, and 'jaar' is often but not always there. If the type is not known, or it is known but the identifier seems invalid for that type, we raise a ValueError. | |
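For the simplest shape only (type, year, number), the returned dict can be illustrated with a small sketch. The function name parse_simple_id_sketch and its pattern are assumptions for this example; the real function handles many more variants than this.

```python
import re

# Assumption: only the plain "type-year-number" shape, e.g. stb-2023-281.
_SIMPLE_ID = re.compile(r"^(?P<type>[a-z]+)-(?P<jaar>[0-9]{4})-(?P<docnum>[0-9]+)$")


def parse_simple_id_sketch(s: str) -> dict:
    """Parse a single type-year-number identifier, or raise ValueError."""
    match = _SIMPLE_ID.match(s.strip())
    if match is None:
        raise ValueError("not in type-year-number form: %r" % s)
    return match.groupdict()


parsed = parse_simple_id_sketch("stb-2023-281")
```

Identifiers with other shapes, such as kst-26643-144-h1, are rejected by this sketch with a ValueError rather than parsed.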
Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers.
Also produces a somewhat-normalized form (e.g. strips a 'CELEX:' in front)
Returns a dict detailing the parts. NOTE that the details will change when I actually read the specs properly
- norm is what you fed in, uppercased, and with an optional 'CELEX:' stripped but otherwise untouched
- id is recomposed from sector_number, year, document_type, document_number, which means it is stripped of additions - it may strip more than you want!
Keep in mind that this will _not_ resolve things like "go to the consolidated version" like the EUR-Lex site will do
TODO: read the spec, because I'm not 100% on
- sector 0
- sector C
- whether additions like (01) in e.g. 32012A0424(01) are part of the identifier or not (...yes. These are unique documents)
- national transposition
- if you have multiple additions like '(01)' and '-20160504' and 'FIN_240353', ...what order they should appear in
| Parameters | |
celex:str | CELEX identifier as string. Will be parsed. |
| Returns | |
a dict like:
{ 'norm': '32016R0679',
'id': '32016R0679', 'document_number': '0679', 'nattrans': '', 'specdate': '',
'sector_number': '3', 'sector_name': 'Legislation',
'year': '2016',
'document_type': 'R', 'document_type_description': 'Regulations',
}
| |
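The difference between norm and id can be sketched for the common shapes. This is a minimal sketch under assumptions: parse_celex_sketch is a hypothetical name, the pattern covers only a sector character, year, one or two type letters, and a four-digit number, and it produces only a subset of the keys listed above.

```python
import re

# Assumption: covers only the common sector/year/type/number shape.
_CELEX_SKETCH = re.compile(
    r"^(?P<sector_number>[0-9CE])(?P<year>[0-9]{4})"
    r"(?P<document_type>[A-Z]{1,2})(?P<document_number>[0-9]{4})"
)


def parse_celex_sketch(celex: str) -> dict:
    """Return a dict of string parts, plus 'norm' and a recomposed 'id'."""
    norm = celex.strip().upper()
    if norm.startswith("CELEX:"):
        norm = norm[len("CELEX:"):]
    match = _CELEX_SKETCH.match(norm)
    if match is None:
        raise ValueError("does not look like a CELEX identifier: %r" % celex)
    parts = match.groupdict()
    parts["norm"] = norm
    # Recomposed from the four core parts, so additions are dropped.
    parts["id"] = "".join(
        parts[key]
        for key in ("sector_number", "year", "document_type", "document_number")
    )
    return parts


simple = parse_celex_sketch("celex:32016R0679")
with_addition = parse_celex_sketch("32012A0424(01)")
```

For 32012A0424(01) the sketch keeps the addition in norm but drops it from id, which shows how the recomposed form can strip more than you want.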
Parses something we know is an ECLI, reports the parts in a dict.
Currently hardcoded to remove any final period.
Returns a dict with keys that contain at least:
'country_code': 'NL',
'court_code': 'HR',
'year': '1977',
'caseid': 'AC1784',
And perhaps (TODO: settle this):
'normalized': 'ECLI:NL:HR:1977:AC1784',
'removed': ').',
'court_details': {'abbrev': 'HR', 'extra': ['hr'], 'name': 'Hoge Raad'}
As an experiment, we try to report more about the court in question, but note the key ('court_details') is not guaranteed to be there.
| Parameters | |
string:str | the string to parse as an ECLI |
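The guaranteed keys follow directly from the colon-separated structure of an ECLI. A minimal sketch, assuming exactly five colon-separated fields and using the hypothetical name parse_ecli_sketch (it does not attempt the 'court_details' lookup):

```python
def parse_ecli_sketch(string: str) -> dict:
    """Split an ECLI into its parts; trailing dots are removed first."""
    text = string.strip().rstrip(".")
    parts = text.split(":")
    if len(parts) != 5 or parts[0].upper() != "ECLI":
        raise ValueError("does not look like an ECLI: %r" % string)
    _prefix, country_code, court_code, year, caseid = parts
    return {
        "country_code": country_code,
        "court_code": court_code,
        "year": year,
        "caseid": caseid,
    }


ecli_parts = parse_ecli_sketch("ECLI:NL:HR:1977:AC1784.")
```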
Takes something in the form of jci{version}:{type}:{BWB-id}{key-value}*, so e.g. :
jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1
returns something like :
{'version': '1.31', 'type': 'c', 'bwb': 'BWBR0012345',
'params': {'g': ['2005-01-01'], 'artikel': ['3.1']}}
Notes:
params is actually an OrderedDict, so you can also fetch them in the order they appeared, for the cases where that matters.
tries to be robust to a few non-standard things we've seen in real use
for type=='c' (single consolidation), expected params include
- g geldigheidsdatum
- z zichtdatum
for type=='v' (collection), expected params include
- s start of geldigheid
- e end of geldigheid
- z zichtdatum
Note that precise interpretation, and generation of these links, is a little more involved, in that versions made small semantic changes to the meanings of some parts.
| Parameters | |
text:str | jci-style identifier as string. Will be parsed. |
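The structure above maps fairly directly onto standard-library parsing. The following is a sketch only, with the hypothetical name parse_jci_sketch; it assumes well-formed input and has none of the robustness to non-standard variants mentioned in the notes.

```python
from collections import OrderedDict
from urllib.parse import parse_qsl


def parse_jci_sketch(text: str) -> dict:
    """Parse jci{version}:{type}:{BWB-id} followed by &key=value pairs."""
    head, _sep, query = text.strip().partition("&")
    fields = head.split(":")
    if len(fields) != 3 or not fields[0].startswith("jci"):
        raise ValueError("does not look like a jci identifier: %r" % text)
    params = OrderedDict()
    # Each key maps to a list of values, kept in order of first appearance.
    for key, value in parse_qsl(query, keep_blank_values=True):
        params.setdefault(key, []).append(value)
    return {
        "version": fields[0][len("jci"):],
        "type": fields[1],
        "bwb": fields[2],
        "params": params,
    }


jci = parse_jci_sketch("jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1")
```

Storing values as lists allows for a key that appears more than once, and the OrderedDict preserves the order in which keys first appeared.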
Parse kamerstukken identifiers like kst-26643-144-h1
Also a helper for parse_bekendmaking_id to parse this particular subset.
There is more description of the most common variations in one of our notebooks
TODO: review the tests. The separation of e.g. dossier and vergaderjaar, plus the acceptance of some weirder cases, probably means it over-accepts now. (It would be more understandable to just add each pattern, even if there are twenty of them.)
| Parameters | |
string:str | kst-style identifier as string, to be parsed. |
debug:bool | whether to report some debug information |
| Returns | |
a dict with keys
Note that due to the varied types, you should consider that variation before assuming the presence of other keys | |
Parse an officielepublicaties repository URL.
| Parameters | |
| url | as an example, given a URL such as 'https://repository.overheid.nl/frbr/officielepublicaties/ah-ek/20182019/ah-ek-20182019-2/1/metadata/metadata.xml' |
| Returns | |
...a dict with keys like
| |
The three-letter codes that CELEX uses to refer to countries
| Value |
|
The document types defined within CELEX sectors
| Value |
|
The sectors defined within CELEX
| Value |
|