module documentation

Things that parse metadata.

Specifically, this covers things that are not tied to a single API or data source, or that we otherwise expect to be reused.

There are similar functions in other places, in particular when they are specific. For example, helpers specific to KOOP's presentations of BWB, CVDR, and OP sit in helpers.koop_parse

The function name should give you some indication of what it is associated with, and how specific it is.

Function findall_bekendmaking_ids Look for identifiers like stcrt-2009-9231 and ah-tk-20082009-2945. Might also find a few things that are not such identifiers.
Function findall_ecli Within plain text, this tries to find all occurrences of things that look like an ECLI identifier
Function is_equivalent_celex Do two CELEX identifiers refer to the same document?
Function parse_bekendmaking_id Parses identifiers like
Function parse_celex Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers.
Function parse_ecli Parses something we know is an ECLI, reports the parts in a dict.
Function parse_jci Takes something in the form of jci{version}:{type}:{BWB-id}{key-value}*, so e.g. :
Function parse_kst_id Parse kamerstukken identifiers like kst-26643-144-h1
Constant CELEX_COUNTRIES The three-letter codes that CELEX uses to refer to countries
Constant CELEX_DOCTYPES The document types defined within CELEX sectors
Constant CELEX_SECTORS The sectors defined within CELEX
Function _celex_doctype helper to search in CELEX_DOCTYPES. Returns None if nothing matches.
Function _is_all_digits Undocumented
Constant _RE_CELEX Undocumented
Constant _RE_ECLIFIND Undocumented
Constant _RE_JCIFIND Undocumented
Variable _re_bekendid Undocumented
def findall_bekendmaking_ids(instring):

Look for identifiers like stcrt-2009-9231 and ah-tk-20082009-2945. Might also find a few things that are not such identifiers.

TODO: give this function a better name, it's not just bekendmakingen.

Parameters
instring: str - the string to look in
Returns
a list of values, like ["stcrt-2009-9231", "ah-tk-20082009-2945"]
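
As a rough illustration of this kind of search, here is a minimal sketch. The pattern and the name findall_ids_sketch are assumptions made for this example (it only knows the prefixes that appear in the examples on this page); the module's own _re_bekendid is undocumented and probably differs.

```python
import re

# Hypothetical pattern: a known prefix, a number, then optional dash-separated parts.
# Only the prefixes seen in this page's examples are listed.
_IDS_SKETCH_RE = re.compile(r"\b(?:stcrt|stb|kst|ah-tk|h-tk)-[0-9]+(?:-[0-9a-z]+)*")


def findall_ids_sketch(instring):
    """Return every substring that looks like one of the identifier shapes above."""
    return _IDS_SKETCH_RE.findall(instring)


found = findall_ids_sketch("See stcrt-2009-9231 and ah-tk-20082009-2945 for details.")
print(found)
```

Like the real function, a loose pattern such as this one can also match text that merely resembles an identifier.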
def findall_ecli(string, rstrip_dot=True):

Within plain text, this tries to find all occurrences of things that look like an ECLI identifier

Parameters
string: str - the string to look in
rstrip_dot - whether to return the match stripped of any final dot(s). While dots are valid in an ECLI (typically used as a separator), a dot on the end is more likely to be the end of a sentence that the ECLI was placed in than part of the ECLI itself. This stripping is enabled by default, but it would be more correct for you to always set this parameter explicitly, and for well-controlled metadata fields it may be more correct to use False.
Returns
a list of strings.
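
The effect of rstrip_dot can be sketched as follows. This is a simplified stand-in, not the module's code: findall_ecli_sketch and its pattern are assumptions for this example, loosely modelled on the shape of _RE_ECLIFIND.

```python
import re

# Simplified pattern for this sketch; dots are allowed in the court code and case id,
# which is why a sentence-final dot ends up inside the raw match.
_ECLI_SKETCH_RE = re.compile(
    r"ECLI:[A-Za-z]{2}:[A-Za-z0-9.]{1,7}:[0-9]{1,4}:[A-Za-z0-9.]{1,25}"
)


def findall_ecli_sketch(string, rstrip_dot=True):
    """Find ECLI-like substrings, optionally stripping trailing dots from each."""
    ret = []
    for match in _ECLI_SKETCH_RE.finditer(string):
        text = match.group(0)
        if rstrip_dot:
            text = text.rstrip(".")
        ret.append(text)
    return ret


sentence = "This was decided in ECLI:NL:HR:1977:AC1784."
stripped = findall_ecli_sketch(sentence)
unstripped = findall_ecli_sketch(sentence, rstrip_dot=False)
print(stripped, unstripped)
```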
def is_equivalent_celex(celex1, celex2):

Do two CELEX identifiers refer to the same document?

Currently:

  • ignores sector to be able to ignore sector 0
  • tries to ignore

This is currently based on estimation - we should read up on the details.

Parameters
celex1: str - CELEX identifier as string. Will be parsed.
celex2: str - CELEX identifier as string. Will be parsed.
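
A minimal sketch of one way to compare while ignoring the sector: reduce each identifier to year, document type and document number, and compare those. The function name, the pattern, and the choice to drop everything after the four-digit number are all assumptions for this example, not the module's actual rules.

```python
import re

# Hypothetical, simplified CELEX shape: sector, year, type letters, four-digit number.
_CELEX_SKETCH_RE = re.compile(r"^([0-9CE])([0-9]{4})([A-Z]{1,2})([0-9]{4})")


def is_equivalent_sketch(celex1, celex2):
    """Compare two CELEX-like strings on year, document type and number only."""

    def key(celex):
        norm = celex.strip().upper()
        if norm.startswith("CELEX:"):
            norm = norm[len("CELEX:"):]
        match = _CELEX_SKETCH_RE.match(norm)
        if match is None:
            raise ValueError("does not look like a CELEX identifier: %r" % celex)
        _sector, year, doctype, number = match.groups()
        return (year, doctype, number)  # the sector is deliberately left out

    return key(celex1) == key(celex2)


same = is_equivalent_sketch("32016R0679", "CELEX:02016R0679-20160504")
different = is_equivalent_sketch("32016R0679", "32016R0680")
print(same, different)
```

Note that this sketch also ignores additions such as (01), which the notes on parse_celex suggest identify distinct documents, so it is too permissive for real use.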
def parse_bekendmaking_id(s):

Parses identifiers like

  • kst-26643-144-h1

  • h-tk-20082009-7140-7144

  • ah-tk-20082009-2945

  • stcrt-2009-9231

    TODO: give this function a better name, it's not just bekendmakingen.

    Notes:

  • as of this writing it still fails on ~ .01% of keys I've seen, but most of those seem to be invalid (though almost all of those are kst-, so we may just not know an uncommon variant).

  • if you match on something like ([a-z-]+)[0-9A-Z], you get more than the below - but it depends on the documents you source.

    • sometimes you get a bunch of ids that suggest a soft subcategory, e.g. nds-bzk0700034-b1
    • sometimes you get a capital you weren't expecting, e.g. Stcrt-2001-130-CAO1965

    CONSIDER: also producing citation form(s) of each.

Parameters
s - the string to parse as a single identifier.
Returns
dict with basic details, e.g. parse_bekendmaking_id('stb-2023-281') == {'type':'stb', 'jaar':'2023', 'docnum':'281'}, where 'type' and 'docnum' are guaranteed to be there, and 'jaar' is often but not always there. If it is not a known type of identifier, or it is known but seems invalid, this raises a ValueError.
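
The return-or-raise contract can be illustrated with a minimal sketch that handles only the simplest type-year-number shape. The name parse_simple_id_sketch and its pattern are assumptions for this example; the real function covers many more variants.

```python
import re

# Hypothetical pattern covering only identifiers shaped like stb-2023-281.
_SIMPLE_ID_RE = re.compile(r"^(stb|stcrt)-([0-9]{4})-([0-9]+)$")


def parse_simple_id_sketch(s):
    """Parse a stb-/stcrt-style identifier into a dict, or raise ValueError."""
    match = _SIMPLE_ID_RE.match(s)
    if match is None:
        raise ValueError("does not look like a stb/stcrt identifier: %r" % s)
    doctype, jaar, docnum = match.groups()
    # All values stay strings, as in the documented example.
    return {"type": doctype, "jaar": jaar, "docnum": docnum}


parsed = parse_simple_id_sketch("stb-2023-281")
print(parsed)
```

Callers that process mixed input would typically wrap the call in try/except ValueError and decide per case whether to skip or log the identifier.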
def parse_celex(celex):

Describes CELEX's parts in more readable form, where possible. All values are returned as strings, even where they are (ostensibly) numbers.

Also produces a somewhat-normalized form (e.g. strips a 'CELEX:' in front)

Returns a dict detailing the parts. NOTE that the details will change when I actually read the specs properly

  • norm is what you fed in, uppercased, and with an optional 'CELEX:' stripped but otherwise untouched
  • id is recomposed from sector_number, year, document_type, document_number which means it is stripped of additions - it may strip more than you want!

Keep in mind that this will _not_ resolve things like "go to the consolidated version" like the EUR-Lex site will do

TODO: read the spec, because I'm not 100% on

  • sector 0
  • sector C
  • whether additions like (01) in e.g. 32012A0424(01) are part of the identifier or not (...yes. These are unique documents)
  • national transposition
  • if you have multiple additions like '(01)' and '-20160504' and 'FIN_240353', ...what order they should appear in

TODO: we might be able to assist in the common cases (e.g. a test for "is this equivalent"). I do not know, for example, whether id_nonattrans is useful or correct

Parameters
celex: str - CELEX identifier as string. Will be parsed.
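
The difference between norm and id can be sketched as follows. This is a simplified illustration, not the module's implementation: parse_celex_sketch, its pattern, and the 'additions' key are assumptions for this example, and the real function reports more detail.

```python
import re

# Hypothetical, simplified shape: sector, year, type letters, four-digit number,
# then whatever additions follow, such as "(01)" or "-20160504".
_CELEX_PARTS_RE = re.compile(r"^([0-9CE])([0-9]{4})([A-Z]{1,2})([0-9]{4})(.*)$")


def parse_celex_sketch(celex):
    """Split a CELEX-like string into parts; all values are returned as strings."""
    norm = celex.strip().upper()
    if norm.startswith("CELEX:"):
        norm = norm[len("CELEX:"):]
    match = _CELEX_PARTS_RE.match(norm)
    if match is None:
        raise ValueError("does not look like a CELEX identifier: %r" % celex)
    sector, year, doctype, number, additions = match.groups()
    return {
        "norm": norm,  # input, uppercased, prefix stripped
        "id": sector + year + doctype + number,  # recomposed, without additions
        "sector_number": sector,
        "year": year,
        "document_type": doctype,
        "document_number": number,
        "additions": additions,  # hypothetical key, only in this sketch
    }


parts = parse_celex_sketch("celex:32012A0424(01)")
print(parts["norm"], parts["id"])
```

Here norm keeps the (01) while id drops it, which is the "may strip more than you want" caveat in practice.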
def parse_ecli(string):

Parses something we know is an ECLI, reports the parts in a dict.

Currently hardcoded to remove any final period.

Returns a dict with keys that contain at least:

    'country_code': 'NL',
    'court_code': 'HR',
    'year': '1977',
    'caseid': 'AC1784',

And perhaps (TODO: settle this):

    'normalized': 'ECLI:NL:HR:1977:AC1784',
    'removed': ').',
    'court_details': {'abbrev': 'HR', 'extra': ['hr'], 'name': 'Hoge Raad'}

As an experiment, we try to report more about the court in question, but note the key ('court_details') is not guaranteed to be there.

Parameters
string: str - the string to parse as an ECLI
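
The four guaranteed keys follow directly from the colon-separated structure of an ECLI. A minimal sketch, assuming a hypothetical parse_ecli_sketch that only strips trailing dots and does no court lookup:

```python
def parse_ecli_sketch(string):
    """Split an ECLI into its colon-separated parts and report them in a dict."""
    text = string.strip().rstrip(".")  # this sketch only removes final periods
    parts = text.split(":")
    if len(parts) != 5 or parts[0].upper() != "ECLI":
        raise ValueError("does not look like an ECLI: %r" % string)
    _prefix, country_code, court_code, year, caseid = parts
    return {
        "country_code": country_code,
        "court_code": court_code,
        "year": year,
        "caseid": caseid,
    }


ecli_parts = parse_ecli_sketch("ECLI:NL:HR:1977:AC1784.")
print(ecli_parts)
```

The optional keys described above ('normalized', 'removed', 'court_details') are left out here because their shape is not settled.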
def parse_jci(text):

Takes something in the form of jci{version}:{type}:{BWB-id}{key-value}*, so e.g. :

    jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1

returns something like :

    {'version': '1.31', 'type': 'c', 'bwb': 'BWBR0012345',
     'params': {'g': ['2005-01-01'], 'artikel': ['3.1']}}

Notes:

  • params is actually an OrderedDict, so you can also fetch them in the order they appeared, for the cases where that matters.

  • tries to be robust to a few non-standard things we've seen in real use

  • for type=='c' (single consolidation), expected params include

    • g geldigheidsdatum
    • z zichtdatum
  • for type=='v' (collection), expected params include

    • s start of geldigheid
    • e end of geldigheid
    • z zichtdatum

    Note that precise interpretation, and generation of these links, is a little more involved, in that versions made small semantic changes to the meanings of some parts.

Parameters
text: str - jci-style identifier as string. Will be parsed.
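
The overall parse can be sketched with a regular expression shaped like _RE_JCIFIND plus the standard library's query-string parser. parse_jci_sketch is a hypothetical name, and treating the key-value tail as a URL query string is an assumption of this sketch; the real function is more tolerant of non-standard input.

```python
import re
from urllib.parse import parse_qs

# Same overall shape as the documented _RE_JCIFIND value, used here on one identifier.
_JCI_SKETCH_RE = re.compile(r'(?:jci)?([0-9.]+):([a-z]):(BWB[RV][0-9]+)([^\s;"\']*)')


def parse_jci_sketch(text):
    """Split a jci identifier into version, type, BWB-id and key-value params."""
    match = _JCI_SKETCH_RE.match(text.strip())
    if match is None:
        raise ValueError("does not look like a jci identifier: %r" % text)
    version, jci_type, bwb, rest = match.groups()
    # parse_qs skips the empty piece before the first '&' and returns
    # a dict of lists, in order of first appearance.
    params = parse_qs(rest)
    return {"version": version, "type": jci_type, "bwb": bwb, "params": params}


jci = parse_jci_sketch("jci1.31:c:BWBR0012345&g=2005-01-01&artikel=3.1")
print(jci)
```

Each value in params is a list because a key may in principle repeat.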
def parse_kst_id(string, debug=False):

Parse kamerstukken identifiers like kst-26643-144-h1

Also a helper for parse_bekendmaking_id to parse this particular subset.

There is more description of the variations in one of our notebooks

Parameters
string: str - kst-style identifier as string. Will be parsed.
debug: bool - whether to report some debug information
Returns

a dict with keys

  • dossiernum - a kamerstukdossier, where it applies
  • docnum - a document identifier
  • _var to mention an internal variant that our parsing used
CELEX_COUNTRIES: list[str] =

The three-letter codes that CELEX uses to refer to countries

Value
['BEL',
 'DEU',
 'FRA',
 'CZE',
 'ESP',
 'PRT',
 'AUT',
...
CELEX_DOCTYPES: tuple =

The document types defined within CELEX sectors

Value
(('1',
  'K',
  'Treaty establishing the European Coal and Steel Community (ECSC Treaty) 1951'
),
 ('1',
  'A',
  'Treaty establishing the European Atomic Energy Community (EAEC Treaty or Eura
...
CELEX_SECTORS: dict[str, str] =

The sectors defined within CELEX

Value
{'1': 'Treaties',
 '2': 'External Agreements',
 '3': 'Legislation',
 '4': 'Internal Agreements',
 '5': 'Proposals + other preparatory documents',
 '6': 'Case Law',
 '7': 'National Implementation',
...
def _celex_doctype(sector_number, document_type):

helper to search in CELEX_DOCTYPES. Returns None if nothing matches.

Parameters
sector_number: str - Undocumented
document_type: str - Undocumented
def _is_all_digits(s):

Undocumented

_RE_CELEX =

Undocumented

Value
re.compile(r'(\b[1234567890CE])([0-9]{4})([A-Z][A-Z]?)([0-9\(\)]{4,})(\b[^\s">&\
.]*)?')
_RE_ECLIFIND =

Undocumented

Value
re.compile(r'ECLI:[A-Za-z]{2}:[A-Za-z0-9\.]{1,7}:[0-9]{1,4}:[A-Z-z0-9\.]{1,25}',
           re.M)
_RE_JCIFIND =

Undocumented

Value
re.compile(r'(?:jci)?([0-9\.]+):([a-z]):(BWB[RV][0-9]+)([^\s;"\']*)',
           re.M)
_re_bekendid =

Undocumented