Things that parse metadata.
Specifically, things that are not tied to a single API or data source, or that we otherwise expect to be reused.
There are similar functions in other places, in particular when they are more specific. For example, helpers specific to KOOP's presentations of BWB, CVDR, and OP sit in helpers.koop_parse.
The function name should give you some indication of what it associates with, and how specific it is.
Function | findall |
Look for identifiers like stcrt-2009-9231 and ah-tk-20082009-2945. Might find a few things that are not. |
Function | findall |
Within plain text, this tries to find all occurrences of things that look like an ECLI identifier |
Function | is |
Do two CELEX identifiers refer to the same document? |
Function | parse |
Parses identifiers like |
Function | parse |
Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers. |
Function | parse |
Parses something we know is an ECLI, reports the parts in a dict. |
Function | parse |
Takes something in the form of jci{version}:{type}:{BWB-id}{key-value}*, so e.g. : |
Function | parse |
Parse kamerstukken identifiers like kst-26643-144-h1 |
Constant | CELEX |
The three-letter codes that CELEX uses to refer to countries |
Constant | CELEX |
The document types defined within CELEX sectors |
Constant | CELEX |
The sectors defined within CELEX |
Function | _celex |
helper to search in CELEX_DOCTYPES. Returns None if nothing matches. |
Function | _is |
Undocumented |
Constant | _RE |
Undocumented |
Constant | _RE |
Undocumented |
Constant | _RE |
Undocumented |
Variable | _re |
Undocumented |
Look for identifiers like stcrt-2009-9231 and ah-tk-20082009-2945. Might find a few things that are not.
TODO: give this function a better name, it's not just bekendmakingen.
Parameters | |
instring:str | the string to look in |
Returns | |
a list of values, like ["stcrt-2009-9231", "ah-tk-20082009-2945"] |
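A minimal sketch of this kind of search, assuming a small regular expression is enough for illustration. The name findall_ids_sketch, the pattern, and the limited list of prefixes are hypothetical and are not the module's own implementation; in particular, kst-style identifiers are left out here.

```python
import re

# Hypothetical pattern, NOT the module's own: a known type prefix
# followed by one or more hyphen-separated groups of digits.
_ID_SKETCH = re.compile(r"\b(?:stcrt|stb|ah-tk|h-tk)-[0-9]+(?:-[0-9]+)*\b")


def findall_ids_sketch(instring: str) -> list:
    """Return every substring that looks like one of the identifiers above."""
    return _ID_SKETCH.findall(instring)


text = "Zie stcrt-2009-9231 en ah-tk-20082009-2945."
print(findall_ids_sketch(text))
# prints ['stcrt-2009-9231', 'ah-tk-20082009-2945']
```

Because the pattern only checks the overall shape, it can also match strings that are not real identifiers, which is the same caveat the description above gives.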
Within plain text, this tries to find all occurrences of things that look like an ECLI identifier
Parameters | |
string:str | the string to look in |
rstrip | whether to return the match stripped of any final dot(s). While dots are valid in an ECLI (typically used as a separator), a dot at the end is more likely to mean the ECLI sits at the end of a sentence than to be part of the ECLI. This stripping is enabled by default, but it would be more correct for you to always set this parameter explicitly, and for well-controlled metadata fields it may be more correct to use False. |
Returns | |
a list of strings. |
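To illustrate why the rstrip choice matters, here is a rough sketch with a simplified pattern. The name findall_ecli_sketch and the regular expression are assumptions made for this example, not the module's actual code; the real pattern may be stricter or looser.

```python
import re

# Simplified, hypothetical ECLI shape: ECLI:country:court:year:ordinal.
# The ordinal part may contain dots, which is why a sentence-final dot
# ends up inside the match.
_ECLI_SKETCH = re.compile(
    r"ECLI:[A-Z]{2}:[A-Za-z0-9]{1,7}:[0-9]{4}:[A-Za-z0-9.]{1,25}"
)


def findall_ecli_sketch(string: str, rstrip: bool = True) -> list:
    """Find ECLI-like substrings, optionally dropping trailing dots."""
    found = _ECLI_SKETCH.findall(string)
    if rstrip:
        found = [ecli.rstrip(".") for ecli in found]
    return found


sentence = "Zie ECLI:NL:HR:1977:AC1784."
print(findall_ecli_sketch(sentence))                # ['ECLI:NL:HR:1977:AC1784']
print(findall_ecli_sketch(sentence, rstrip=False))  # ['ECLI:NL:HR:1977:AC1784.']
```

In running text the stripped form is usually what you want; in a clean metadata field, passing rstrip=False keeps the value exactly as it was given.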
Do two CELEX identifiers refer to the same document?
Currently:
- ignores sector to be able to ignore sector 0
- tries to ignore
This is currently based on estimation - we should read up on the details.
Parameters | |
celex1:str | CELEX identifier as string. Will be parsed. |
celex2:str | CELEX identifier as string. Will be parsed. |
Parses identifiers like
kst-26643-144-h1
h-tk-20082009-7140-7144
ah-tk-20082009-2945
stcrt-2009-9231
TODO: give this function a better name, it's not just bekendmakingen.
Notes:
as of this writing it still fails on ~0.01% of keys I've seen, but most of those seem to be invalid (though almost all of those are kst-, so we may just not know an uncommon variant).
if you match on something like ([a-z-]+)[0-9A-Z], you get more than the below - but it depends on the documents you source.
- sometimes you get a bunch of ids that suggest a soft subcategory, e.g. nds-bzk0700034-b1
- sometimes you get a capital you weren't expecting, e.g. Stcrt-2001-130-CAO1965
CONSIDER: also producing citation form(s) of each.
Parameters | |
s | the string to parse as a single identifier. |
Returns | |
dict with basic details, e.g. parse_bekendmaking_id('stb-2023-281') == {'type':'stb', 'jaar':'2023', 'docnum':'281'} where 'type' and 'docnum' are guaranteed to be there, and 'jaar' is often but not always there. If it is not a known type of identifier, or it is known but seems invalid, it raises a ValueError. |
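The return shape and the ValueError behaviour can be sketched for the simplest case only, namely identifiers of the form type-jaar-docnum. The name parse_simple_id_sketch and its pattern are hypothetical; the real function handles many more variants (kst-, h-tk-, ah-tk- and so on) that this sketch deliberately rejects.

```python
import re

# Hypothetical pattern covering only the simplest identifiers,
# e.g. stb-2023-281 or stcrt-2009-9231.
_SIMPLE_ID = re.compile(
    r"^(?P<type>stb|stcrt)-(?P<jaar>[0-9]{4})-(?P<docnum>[0-9]+)$"
)


def parse_simple_id_sketch(s: str) -> dict:
    """Split a simple identifier into type, jaar and docnum (all strings)."""
    match = _SIMPLE_ID.match(s.strip())
    if match is None:
        # Mirrors the documented behaviour: unknown or invalid input raises.
        raise ValueError("not a recognized identifier: %r" % s)
    return match.groupdict()


print(parse_simple_id_sketch("stb-2023-281"))
# prints {'type': 'stb', 'jaar': '2023', 'docnum': '281'}
```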
Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers.
Also produces a somewhat-normalized form (e.g. strips a 'CELEX:' in front)
Returns a dict detailing the parts. NOTE that the details will change when I actually read the specs properly
- norm is what you fed in, uppercased, and with an optional 'CELEX:' stripped but otherwise untouched
- id is recomposed from sector_number, year, document_type, document_number which means it is stripped of additions - it may strip more than you want!
Keep in mind that this will _not_ resolve things like "go to the consolidated version" like the EUR-Lex site will do
TODO: read the spec, because I'm not 100% on
- sector 0
- sector C
- whether additions like (01) in e.g. 32012A0424(01) are part of the identifier or not (...yes. These are unique documents)
- national transposition
- if you have multiple additions like '(01)' and '-20160504' and 'FIN_240353', ...what order they should appear in
TODO: we might be able to assist in those common cases (e.g. a test for "is this equivalent"). For example, I do not know whether id_nonattrans is useful or correct
Parameters | |
celex:str | CELEX identifier as string. Will be parsed. |
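The difference between norm and id described above can be sketched as follows. This is an assumption-laden illustration, not the module's code: parse_celex_sketch and its pattern are hypothetical, and the pattern assumes a one-character sector, a four-digit year, a one- or two-letter document type and a four-digit number, which does not cover every real CELEX identifier.

```python
import re

# Hypothetical, simplified CELEX shape; anything after the four-digit
# document number is kept as 'additions'.
_CELEX_SKETCH = re.compile(
    r"^(?P<sector_number>[0-9A-Z])"
    r"(?P<year>[0-9]{4})"
    r"(?P<document_type>[A-Z]{1,2})"
    r"(?P<document_number>[0-9]{4})"
    r"(?P<additions>.*)$"
)


def parse_celex_sketch(celex: str) -> dict:
    """Split a CELEX-like string into parts; all values stay strings."""
    norm = celex.strip().upper()
    if norm.startswith("CELEX:"):
        norm = norm[len("CELEX:"):]
    match = _CELEX_SKETCH.match(norm)
    if match is None:
        raise ValueError("does not look like a CELEX identifier: %r" % celex)
    parts = match.groupdict()
    parts["norm"] = norm  # input, uppercased, optional prefix removed
    parts["id"] = (       # recomposed without any additions
        parts["sector_number"] + parts["year"]
        + parts["document_type"] + parts["document_number"]
    )
    return parts


result = parse_celex_sketch("celex:32012A0424(01)")
print(result["norm"])  # 32012A0424(01)
print(result["id"])    # 32012A0424
```

Note how id drops the (01) addition while norm keeps it, which is the "may strip more than you want" caveat in practice.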
Parses something we know is an ECLI, reports the parts in a dict.
Currently hardcoded to remove any final period.
Returns a dict with keys that contain at least:
'country_code': 'NL', 'court_code': 'HR', 'year': '1977', 'caseid': 'AC1784',
And perhaps (TODO: settle this):
'normalized': 'ECLI:NL:HR:1977:AC1784', 'removed': ').', 'court_details': {'abbrev': 'HR', 'extra': ['hr'], 'name': 'Hoge Raad'}
As an experiment, we try to report more about the court in question, but note the key ('court_details') is not guaranteed to be there.
Parameters | |
string:str | the string to parse as an ECLI |
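As a rough sketch of the guaranteed part of that dict, assuming the input is a single well-formed ECLI with at most some trailing periods. The name parse_ecli_sketch is hypothetical, and the optional keys ('normalized', 'removed', 'court_details') are intentionally left out because they are not settled.

```python
def parse_ecli_sketch(string: str) -> dict:
    """Split an ECLI into its four variable parts (all strings)."""
    # Like the documented behaviour, drop any final period first.
    cleaned = string.strip().rstrip(".")
    parts = cleaned.split(":")
    if len(parts) != 5 or parts[0].upper() != "ECLI":
        raise ValueError("does not look like an ECLI: %r" % string)
    _, country_code, court_code, year, caseid = parts
    return {
        "country_code": country_code,
        "court_code": court_code,
        "year": year,
        "caseid": caseid,
    }


print(parse_ecli_sketch("ECLI:NL:HR:1977:AC1784."))
# prints {'country_code': 'NL', 'court_code': 'HR', 'year': '1977', 'caseid': 'AC1784'}
```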
Takes something in the form of jci{version}:{type}:{BWB-id}{key-value}*, so e.g. :
jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1
returns something like :
{'version': '1.31', 'type': 'c', 'bwb': 'BWBR0012345', 'params': {'g': ['2005-01-01'], 'artikel': ['3.1']}}
Notes:
params is actually an OrderedDict, so you can also fetch them in the order they appeared, for the cases where that matters.
tries to be robust to a few non-standard things we've seen in real use
for type=='c' (single consolidation), expected params include
- g geldigheidsdatum
- z zichtdatum
for type=='v' (collection), expected params include
- s start of geldigheid
- e end of geldigheid
- z zichtdatum
Note that precise interpretation, and generation of these links, is a little more involved, in that versions made small semantic changes to the meanings of some parts.
Parameters | |
text:str | jci-style identifier as string. Will be parsed. |
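The structure above can be approximated with the standard library, as in this sketch. The name parse_jci_sketch is hypothetical, it handles only the well-formed case rather than the non-standard variants mentioned in the notes, and it returns a plain dict for params (insertion-ordered in current Python) instead of an OrderedDict.

```python
from urllib.parse import parse_qs


def parse_jci_sketch(text: str) -> dict:
    """Split jci{version}:{type}:{BWB-id}{key-value}* into its parts."""
    if not text.startswith("jci"):
        raise ValueError("does not look like a jci identifier: %r" % text)
    # Everything after the first '&' is treated as key=value pairs.
    head, _, query = text.partition("&")
    fields = head.split(":")
    if len(fields) != 3:
        raise ValueError("expected jci{version}:{type}:{BWB-id}, got %r" % head)
    return {
        "version": fields[0][len("jci"):],
        "type": fields[1],
        "bwb": fields[2],
        # parse_qs collects each key's values in a list, so repeated keys survive.
        "params": parse_qs(query),
    }


print(parse_jci_sketch("jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1"))
```

Keeping the values as lists matches the example output above, and it means a key that appears more than once is not silently overwritten.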
Parse kamerstukken identifiers like kst-26643-144-h1
Also a helper for parse_bekendmaking_id to parse this particular subset.
There is more description of the variations in one of our notebooks
Parameters | |
string:str | kst-style identifier as string. Will be parsed. |
debug:bool | whether to print some debugging information |
Returns | |
a dict with keys
|
The three-letter codes that CELEX uses to refer to countries
Value |
|
The document types defined within CELEX sectors
Value |
|
The sectors defined within CELEX
Value |
|
helper to search in CELEX_DOCTYPES. Returns None if nothing matches.
Parameters | |
sectorstr | Undocumented |
documentstr | Undocumented |
Undocumented
Value |
|
Undocumented
Value |
|
Undocumented
Value |
|