wetsuite.helpers.meta

module documentation

Things that parse metadata.

Specifically, helpers that are not tied to a single API or data source, or that we otherwise expect to be reused.

There are similar functions in other places, in particular when they are more specific. For example, helpers specific to KOOP's BWB, CVDR, and OP presentations sit in helpers.koop_parse.

The function name should give you some indication of what it associates with, and how specific it is.

Function	`findall_bekendmaking_ids`	Look for identifiers like `stcrt-2009-9231` and `ah-tk-20082009-2945`. Might find a few things that are not.
Function	`findall_ecli`	Within plain text, this tries to find all occurrences of things that look like an ECLI identifier.
Function	`is_equivalent_celex`	Do two CELEX identifiers refer to the same document?
Function	`parse_bekendmaking_id`	Parses identifiers like `kst-26643-144-h1`, `h-tk-20082009-7140-7144`, `ah-tk-20082009-2945`, and `stcrt-2009-9231`
Function	`parse_celex`	Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers.
Function	`parse_ecli`	Parses something we know is an ECLI, reports the parts in a dict.
Function	`parse_jci`	Takes something in the form of `jci{version}:{type}:{BWB-id}{key-value}*`, e.g. `jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1`
Function	`parse_kst_id`	Parse kamerstukken identifiers like `kst-26643-144-h1`
Constant	`CELEX_COUNTRIES`	The three-letter codes that CELEX uses to refer to countries
Constant	`CELEX_DOCTYPES`	The document types defined within CELEX sectors
Constant	`CELEX_SECTORS`	The sectors defined within CELEX
Function	`_celex_doctype`	helper to search in CELEX_DOCTYPES. Returns None if nothing matches.
Function	`_is_all_digits`	Undocumented
Constant	`_RE_CELEX`	Undocumented
Constant	`_RE_ECLIFIND`	Undocumented
Constant	`_RE_JCIFIND`	Undocumented
Variable	`_re_bekendid`	Undocumented

def findall_bekendmaking_ids(instring):

Look for identifiers like stcrt-2009-9231 and ah-tk-20082009-2945. Might find a few things that are not.

TODO: give this function a better name, it's not just bekendmakingen.

Parameters
instring:`str`	the string to look in
Returns
a list of values, like ["stcrt-2009-9231", "ah-tk-20082009-2945"]
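
To make the "might find a few things that are not" caveat concrete, here is a minimal sketch of this kind of search. The pattern and the name `find_id_like` are hypothetical, written for illustration only; they are not the library's own `_re_bekendid` and will differ from it in what they accept.

```python
import re

# Hypothetical pattern, NOT the library's _re_bekendid:
# one or more lowercase parts, then one or more numeric parts, joined by '-'.
ID_LIKE = re.compile(r"\b[a-z]+(?:-[a-z]+)*(?:-[0-9]+)+\b")


def find_id_like(instring: str) -> list[str]:
    """Return every substring that looks like a hyphenated identifier."""
    return ID_LIKE.findall(instring)


text = "Zie stcrt-2009-9231 en ah-tk-20082009-2945, gepubliceerd tijdens covid-19."
found = find_id_like(text)
# 'covid-19' is picked up too: a false positive of the kind described above.
print(found)
```

A looser pattern finds more variants but also more noise, so it is usually worth passing each hit through a stricter parser afterwards.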

def findall_ecli(string, rstrip_dot=True):

Within plain text, this tries to find all occurrences of things that look like an ECLI identifier.

Parameters
string:`str`	the string to look in
rstrip_dot	whether to return the match stripped of any final dot(s). While dots are valid in an ECLI (typically used as a separator), a dot at the end is more likely to be the end of the sentence the ECLI appears in than part of the ECLI itself. This stripping is enabled by default, but it is better to set this parameter explicitly; for well-controlled metadata fields, False may be more correct.
Returns
a list of strings.
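
The effect of rstrip_dot can be sketched as follows. The pattern is copied from the documented value of `_RE_ECLIFIND`; the function name `findall_ecli_sketch` and its body are an illustrative assumption about how the stripping might work, not the library's implementation.

```python
import re

# Pattern copied verbatim from the documented value of _RE_ECLIFIND.
RE_ECLIFIND = re.compile(
    r'ECLI:[A-Za-z]{2}:[A-Za-z0-9\.]{1,7}:[0-9]{1,4}:[A-Z-z0-9\.]{1,25}', re.M
)


def findall_ecli_sketch(string: str, rstrip_dot: bool = True) -> list[str]:
    """Find ECLI-like substrings, optionally dropping trailing dots."""
    matches = RE_ECLIFIND.findall(string)
    if rstrip_dot:
        # A dot is valid inside an ECLI, so the pattern happily swallows
        # the full stop that ends a sentence; strip it afterwards.
        matches = [m.rstrip('.') for m in matches]
    return matches


sentence = "Zie ook ECLI:NL:HR:1977:AC1784."
stripped = findall_ecli_sketch(sentence)
unstripped = findall_ecli_sketch(sentence, rstrip_dot=False)
print(stripped, unstripped)
```

In running text the stripped form is almost always what you want; in a metadata field that holds only the identifier, the unstripped form avoids altering a value that legitimately ends in a dot.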

def is_equivalent_celex(celex1, celex2):

Do two CELEX identifiers refer to the same document?

Currently:

ignores sector to be able to ignore sector 0
tries to ignore

This is currently based on estimation - we should read up on the details.

Parameters
celex1:`str`	CELEX identifier as string. Will be parsed.
celex2:`str`	CELEX identifier as string. Will be parsed.

def parse_bekendmaking_id(s):

Parses identifiers like

kst-26643-144-h1
h-tk-20082009-7140-7144
ah-tk-20082009-2945
stcrt-2009-9231

TODO: give this function a better name, it's not just bekendmakingen.

Notes:
as of this writing it still fails on ~0.01% of the keys I've seen, but most of those seem to be invalid (though almost all of those are kst-, so we may just not know an uncommon variant).
if you match on something like ([a-z-]+)[0-9A-Z], you get more than the below - but it depends on the documents you source.
- sometimes you get a bunch of ids that suggest a soft subcategory, e.g. nds-bzk0700034-b1
- sometimes you get a capital you weren't expecting, e.g. Stcrt-2001-130-CAO1965
CONSIDER: also producing citation form(s) of each.

Parameters
s	the string to parse as a single identifier.
Returns
dict with basic details, e.g. parse_bekendmaking_id('stb-2023-281') == {'type':'stb', 'jaar':'2023', 'docnum':'281'}, where 'type' and 'docnum' are guaranteed to be there, and 'jaar' is often but not always there. If it is not a known type of identifier, or it is known but seems invalid, a ValueError is raised.
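
The return contract above (a dict on success, ValueError otherwise) can be illustrated with a deliberately narrow sketch. The name `parse_simple_id` and its pattern are hypothetical and cover only the simplest `{type}-{jaar}-{docnum}` shape; the real function handles many more variants, such as the kst- and h-tk- forms listed earlier.

```python
import re

# Hypothetical: matches only the simplest {type}-{jaar}-{docnum} shape.
SIMPLE_ID = re.compile(r"^([a-z]+)-([0-9]{4})-([0-9]+)$")


def parse_simple_id(s: str) -> dict:
    """Split an identifier like 'stb-2023-281' into its parts."""
    m = SIMPLE_ID.match(s)
    if m is None:
        raise ValueError(f"not a recognised identifier: {s!r}")
    doctype, jaar, docnum = m.groups()
    # All values stay strings, as in the documented example.
    return {"type": doctype, "jaar": jaar, "docnum": docnum}


parsed = parse_simple_id("stb-2023-281")

# Callers should expect ValueError for anything that does not parse.
try:
    parse_simple_id("not an identifier")
    raised = False
except ValueError:
    raised = True
```

Because 'jaar' is not guaranteed in the real output, code that consumes the result is safer using `.get('jaar')` than indexing directly.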

def parse_celex(celex):

Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers.

Also produces a somewhat-normalized form (e.g. strips a 'CELEX:' in front)

Returns a dict detailing the parts. NOTE that the details will change when I actually read the specs properly

norm is what you fed in, uppercased, and with an optional 'CELEX:' stripped but otherwise untouched
id is recomposed from sector_number, year, document_type, document_number, which means it is stripped of additions - it may strip more than you want!

Keep in mind that this will _not_ resolve things like "go to the consolidated version" like the EUR-Lex site will do

TODO: read the spec, because I'm not 100% sure about

sector 0
sector C
whether additions like (01) in e.g. 32012A0424(01) are part of the identifier or not (...yes. These are unique documents)
national transposition
if you have multiple additions like '(01)' and '-20160504' and 'FIN_240353', ...what order they should appear in

TODO: we might be able to assist in those common cases (e.g. a test for "is this equivalent"). For example, I do not know whether id_nonattrans is useful or correct.

Parameters
celex:`str`	CELEX identifier as string. Will be parsed.
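
As a rough illustration of the splitting described above, the sketch below applies the documented `_RE_CELEX` pattern (rejoined onto one line) after the same kind of normalization. The function name `split_celex`, the 'sector_name' key, and the choice to raise ValueError on a non-match are assumptions made for this example; the real function's keys and error behaviour may differ.

```python
import re

# Pattern taken from the documented value of _RE_CELEX.
RE_CELEX = re.compile(
    r'(\b[1234567890CE])([0-9]{4})([A-Z][A-Z]?)([0-9\(\)]{4,})(\b[^\s">&\.]*)?'
)

# A subset of the documented CELEX_SECTORS value.
SECTORS = {'1': 'Treaties', '2': 'External Agreements', '3': 'Legislation'}


def split_celex(celex: str) -> dict:
    """Split a CELEX identifier into string parts (illustrative only)."""
    norm = celex.strip().upper()
    if norm.startswith('CELEX:'):
        norm = norm[len('CELEX:'):]
    m = RE_CELEX.match(norm)
    if m is None:
        # Assumption of this sketch, not documented library behaviour.
        raise ValueError(f"does not look like a CELEX identifier: {celex!r}")
    sector, year, doctype, docnum, _rest = m.groups()
    return {
        'norm': norm,
        'sector_number': sector,
        'sector_name': SECTORS.get(sector),  # hypothetical key
        'year': year,  # kept as a string, like every other value
        'document_type': doctype,
        'document_number': docnum,
    }


parts = split_celex('celex:32012A0424(01)')
print(parts)
```

Note that the addition `(01)` ends up inside the document number here, because the pattern's fourth group accepts parentheses.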

def parse_ecli(string):

Parses something we know is an ECLI, reports the parts in a dict.

Currently hardcoded to remove any final period.

Returns a dict with keys that contain at least:

    'country_code': 'NL',
    'court_code': 'HR',
    'year': '1977',
    'caseid': 'AC1784',

And perhaps (TODO: settle this):

    'normalized': 'ECLI:NL:HR:1977:AC1784',
    'removed': ').',
    'court_details': {'abbrev': 'HR', 'extra': ['hr'], 'name': 'Hoge Raad'}

As an experiment, we try to report more about the court in question, but note the key ('court_details') is not guaranteed to be there.

Parameters
string:`str`	the string to parse as an ECLI
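
The four guaranteed keys map directly onto the colon-separated fields of an ECLI. A minimal sketch of that mapping follows; the function name `split_ecli` and its validation are hypothetical, and the sketch does not produce the optional 'normalized', 'removed', or 'court_details' keys.

```python
def split_ecli(string: str) -> dict:
    """Split a known ECLI into its four variable fields (illustrative only)."""
    # Mirrors the documented behaviour of removing any final period.
    fields = string.strip().rstrip('.').split(':')
    if len(fields) != 5 or fields[0].upper() != 'ECLI':
        # Assumption of this sketch: reject anything without five fields.
        raise ValueError(f"does not look like an ECLI: {string!r}")
    _, country_code, court_code, year, caseid = fields
    return {
        'country_code': country_code,
        'court_code': court_code,
        'year': year,  # left as a string
        'caseid': caseid,
    }


ecli_parts = split_ecli('ECLI:NL:HR:1977:AC1784.')
print(ecli_parts)
```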

def parse_jci(text):

Takes something in the form of jci{version}:{type}:{BWB-id}{key-value}*, so e.g. :

    jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1

returns something like :

    {'version': '1.31', 'type': 'c', 'bwb': 'BWBR0012345',
     'params': {'g': ['2005-01-01'], 'artikel': ['3.1']}}

Notes:

params is actually an OrderedDict, so you can also fetch them in the order they appeared, for the cases where that matters.
tries to be robust to a few non-standard things we've seen in real use
for type=='c' (single consolidation), expected params include
- g geldigheidsdatum
- z zichtdatum
for type=='v' (collection), expected params include
- s start of geldigheid
- e end of geldigheid
- z zichtdatum
Note that precise interpretation, and generation of these links, is a little more involved, in that versions made small semantic changes to the meanings of some parts.

Parameters
text:`str`	jci-style identifier as string. Will be parsed.
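
The shape of the result can be reproduced with the documented `_RE_JCIFIND` pattern plus standard query-string parsing. This is a sketch under assumptions: the name `split_jci` is hypothetical, `urllib.parse.parse_qs` is a stand-in for whatever the library does with the key-value tail, and a plain dict (which preserves insertion order) is used instead of an OrderedDict. It does not include the robustness to non-standard input that the real function aims for.

```python
import re
from urllib.parse import parse_qs

# Pattern copied from the documented value of _RE_JCIFIND.
RE_JCIFIND = re.compile(
    r'(?:jci)?([0-9\.]+):([a-z]):(BWB[RV][0-9]+)([^\s;"\']*)', re.M
)


def split_jci(text: str) -> dict:
    """Split a jci reference into version, type, BWB-id and parameters."""
    m = RE_JCIFIND.search(text)
    if m is None:
        # Assumption of this sketch, not documented library behaviour.
        raise ValueError(f"does not look like a jci reference: {text!r}")
    version, jci_type, bwb, tail = m.groups()
    # The tail looks like '&g=2005-01-01&artikel=3.1'; parse_qs collects
    # each key's values into a list, in order of first appearance.
    params = parse_qs(tail.lstrip('&'))
    return {'version': version, 'type': jci_type, 'bwb': bwb, 'params': params}


jci = split_jci('jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1')
print(jci)
```

Every parameter value is a list because a key may legitimately repeat in the tail.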

def parse_kst_id(string, debug=False):

Parse kamerstukken identifiers like kst-26643-144-h1

Also a helper for parse_bekendmaking_id to parse this particular subset.

There is more description of the variations in one of our notebooks

Parameters
string:`str`	kst-style identifier as string. Will be parsed.
debug:`bool`	whether to report some debugging information
Returns
a dict with keys
- `dossiernum` - a kamerstukdossier, where it applies
- `docnum` - a document identifier
- `_var` to mention an internal variant that our parsing used

CELEX_COUNTRIES: list[str] =

The three-letter codes that CELEX uses to refer to countries

Value

['BEL',
 'DEU',
 'FRA',
 'CZE',
 'ESP',
 'PRT',
 'AUT',
...

CELEX_DOCTYPES: tuple =

The document types defined within CELEX sectors

Value

(('1',
  'K',
  'Treaty establishing the European Coal and Steel Community (ECSC Treaty) 1951'↵
),
 ('1',
  'A',
  'Treaty establishing the European Atomic Energy Community (EAEC Treaty or Eura↵
...

CELEX_SECTORS: dict[str, str] =

The sectors defined within CELEX

Value

{'1': 'Treaties',
 '2': 'External Agreements',
 '3': 'Legislation',
 '4': 'Internal Agreements',
 '5': 'Proposals + other preparatory documents',
 '6': 'Case Law',
 '7': 'National Implementation',
...

def _celex_doctype(sector_number, document_type):

helper to search in CELEX_DOCTYPES. Returns None if nothing matches.

Parameters
sector_number:`str`	Undocumented
document_type:`str`	Undocumented

def _is_all_digits(s):

Undocumented

_RE_CELEX =

Undocumented

Value

re.compile(r'(\b[1234567890CE])([0-9]{4})([A-Z][A-Z]?)([0-9\(\)]{4,})(\b[^\s">&\.]*)?')

_RE_ECLIFIND =

Undocumented

Value

re.compile(r'ECLI:[A-Za-z]{2}:[A-Za-z0-9\.]{1,7}:[0-9]{1,4}:[A-Z-z0-9\.]{1,25}',
           re.M)

_RE_JCIFIND =

Undocumented

Value

re.compile(r'(?:jci)?([0-9\.]+):([a-z]):(BWB[RV][0-9]+)([^\s;"\']*)',
           re.M)

_re_bekendid =

Undocumented