Extracting specific patterns from text.
Much of the code in here is aimed at identifiers and references - identifiers like BWB-ID, CVDR-ID, ECLI, and CELEX, and more textual ones like EU OJ and directive references, vindplaatsen, kamerstukken, and "artikel 3"-style references -- see in particular the find_references function for a little more detail.
Currently this is a best-effort proof of concept of each of those matchers, containing copious hardcoding and messiness.
We _will_ miss things, as most things like this do. Arguably the only real metric is making a list of _everything_ you want to catch and seeing how well you do.
Right now the implementation is mostly regexes -- which aren't great for some of these. (But neither are formal grammars, because real-world variation would be missed.)
Note that if this is refined further, it should probably be restructured so that each matcher can register itself, so that there _isn't_ one central controller function entangling everything.
Function | abbrev | In case you have a lot of data, you can get cleaner (but reduced!) results by reporting how many distinct documents report the same specific explanation |
Function | abbrev | Finds abbreviations with explanations next to them. |
Function | find | Attempts to find references like "artikel 5.1, tweede lid, aanhef en onder i, van de Woo" and to parse and resolve as much as it can. |
Function | find | Looks for various different kinds of references in the given text, and sorts the results. |
Function | mark | Takes a spaCy Doc and the matches from calling find_references, and marks them as entities. |
Function | simple | Quick and dirty splitter into words. Mainly used by abbrev_find. |
Function | _wetnamen | A dict from law name to law BWB-id, mostly used to match law names in find_nonidentifier_references. |
Variable | _mw | Undocumented |
Variable | _mw | Undocumented |
In case you have a lot of data, you can get cleaner (but reduced!) results by reporting how many distinct documents report the same specific explanation
Parameters | |
l | A nested structure, where …
remove:bool | Whether to normalize the abbreviated form by removing any dots. |
case | Whether we consider the explanatory words in a case-insensitive way while counting. We report whatever the most common capitalisation is. |
Returns | |
Something like: { 'AE' : { ('Abbreviation', 'Explanation'): 3, ('Abbreviation', 'Erroneous'): 1 } }, where the number is how many documents had this explanation (NOT how often we saw this explanation). |
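That per-document counting can be sketched as follows. The function name and the exact input shape (a list of per-document lists of `(abbreviation, explanation_words)` tuples) are assumptions based on the description above, not the module's actual signature:

```python
import collections

def merge_abbrev_results(per_document_results):
    """Count, per abbreviation and explanation, in how many *distinct documents*
    that explanation appeared (not how often it appeared overall).

    per_document_results: a list where each item is one document's list of
    (abbrev, explanation_words) tuples -- a hypothetical shape, loosely
    following the docstring above.
    """
    counts = collections.defaultdict(collections.Counter)
    for document in per_document_results:
        # Deduplicate within the document so each (abbrev, explanation)
        # pair counts at most once per document:
        seen_in_this_document = {(abbrev, tuple(words))
                                 for abbrev, words in document}
        for abbrev, words in seen_in_this_document:
            counts[abbrev][words] += 1
    return counts
```

Note the explanation is stored as a tuple: a hashable key is required to use it in a Counter.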
Finds abbreviations with explanations next to them.
Looks for patterns like
- "Word Combination (WC)"
- "Wet Oven Overheid (Woo)"
- "With Periods (W.P.)"
- "(EA) Explained After" (probably rare)
- "BT (Bracketed terms)"
- "(Bracketed terms) BT" (probably rare)
Will both over- and under-accept, so if you want clean results, consider e.g. reporting only things present in multiple documents -- see e.g. merge_results().
CONSIDER:
how permissive to be with capitalization. Maybe make that a parameter?
allow and ignore words like 'of', 'the'
rewrite to deal with cases like
- Autoriteit Consument en Markt (ACM)
- De Regeling werving, reclame en verslavingspreventie kansspelen (hierna: Rwrvk)
- Nationale Postcode Loterij N.V. (hierna: NPL)
- Edelmetaal Waarborg Nederland B.V. (EWN)
- College voor Toetsen en Examens (CvTE)
- (and maybe:)
- Pensioen- en Uitkeringsraad (PUR)
- Nederlandse Loodsencorporatie (NLC)
- Nederlandse Emissieautoriteit (NEa)
- Kamer voor de Binnenvisserij (Kabivi)
- (and maybe not:)
- College van toezicht collectieve beheersorganisaties auteurs- en naburige rechten (College van Toezicht Auteursrechten (CvTA))
- Keurmerkinstituut jeugdzorg (KMI)
listening to 'hierna: ', e.g.
"Wet Bevordering Integriteitbeoordelingen door het Openbaar Bestuur (hierna: Wet BIBOB)"
"Drank- en horecawet (hierna: DHW)"
"Algemene wet bestuursrecht (hierna: Awb)"
"het Verdrag betreffende de werking van de Europese Unie (hierna: VWEU)"
"de Subsidieregeling OPZuid 2021-2027 (hierna: Subsidieregeling OPZuid)"
"de Wet werk en bijstand (hierna: WWB)"
"de Wet werk en inkomen naar arbeidsvermogen (hierna: WIA)"
"de Wet maatschappelijke ondersteuning (hierna: Wmo)"
These seem to be more structured, in particular when you use (de|het) as a delimiter. That seems overly specific, but works well to extract a bunch of these.
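As an illustration of that observation, a minimal pattern for the '(hierna: …)' form, using the article as the left delimiter, might look like this. It is a sketch only, not the module's actual implementation:

```python
import re

# Sketch: capture "<name> (hierna: <abbrev>)" pairs, using 'de'/'het'
# (when present) as the left edge of the name. Illustrative only.
_hierna_re = re.compile(
    r'(?:\b(?:de|het|De|Het)\s+)?'    # optional article delimiter
    r'([A-Z][^()]{2,80}?)\s*'         # the (capitalized) full name
    r'\(\s*hierna:\s*([^)]+?)\s*\)'   # the bracketed short form
)

def find_hierna(text):
    """Return (short_form, full_name) tuples for 'hierna:' explanations."""
    return [(m.group(2), m.group(1).strip())
            for m in _hierna_re.finditer(text)]
```

This will still stumble over the harder cases listed above (nested brackets, long mixed-case names), which is part of why the real code is messier.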
Parameters | |
string:str | python string to look in. CONSIDER: accept spacy objects as well |
Returns | |
a list of ('ww', ['word', 'word']) tuples, pretty much as-is, so it (intentionally) contains duplicates |
Attempts to find references like:
"artikel 5.1, tweede lid, aanhef en onder i, van de Woo"
and parse and resolve as much as it can.
This is a separate function because it is more complex than most others, but if you want to look for more than just these, then you probably want to wield it via find_references.
These references do not follow a formalized format: while the law ( https://wetten.overheid.nl/BWBR0005730/ ) that suggests the format of these says they should be succinct, and sometimes they look near-templated, that is not what real-world use looks like.
Another reasonable approach might be to include each real-world variant format explicitly: it lets you put stronger patterns first and fall back on fuzzier ones, it makes it clear what is being matched, and it's easier to see how common each is. However, that also easily leads to false negatives -- missing real references.
Instead, we
- start by finding some strong anchors
- keep accepting bits of adjacent string as long as they look like things we know, e.g. "artikel 5.1," "tweede lid," "aanhef en onder i"
- then see what text is around it, which should include at least the law name
Neither will deal well with the briefest forms, e.g. "(81 WWB)", which is arguably only reasonable to recognize when you recognize either side (by known law name, which is harder for abbreviations in that it probably leads to false positives) ...and in that example, we might want to
- see if character context makes it reasonable - the parentheses make it more reasonable than if you found the six characters '81 WWB' in any context
- check whether the estimated law (Wet werk en bijstand - BWBR0015703) has an article 81
- check, in some semantic way, whether Wet werk en bijstand makes any sense in context of the text
TODO: also make this return some estimation of
- how sure we are this is a reference,
- how complete a reference is, and/or
- how easy to resolve a reference is.
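The anchor-and-grow idea above can be sketched roughly as follows. This is a deliberately simplified illustration (hypothetical function and pattern names), not the actual implementation:

```python
import re

# Sketch of "find a strong anchor, then keep accepting adjacent bits that
# look like things we know": anchor on "artikel <number>", then greedily
# consume comma-separated continuations like "tweede lid" or
# "aanhef en onder i". Illustrative only.
_anchor_re = re.compile(r'artikel\s+\d+(?:\.\d+)?[a-z]?', re.I)
_continuation_res = [
    re.compile(r'\s*,\s*(eerste|tweede|derde|vierde|vijfde)\s+lid', re.I),
    re.compile(r'\s*,\s*aanhef\s+en\s+onder\s+[a-z]', re.I),
    re.compile(r'\s*,\s*onder\s+[a-z]', re.I),
]

def find_artikel_refs(text):
    matches = []
    for m in _anchor_re.finditer(text):
        start, end = m.start(), m.end()
        grew = True
        while grew:  # keep extending while any continuation fits right here
            grew = False
            for cont in _continuation_res:
                cm = cont.match(text, end)
                if cm:
                    end = cm.end()
                    grew = True
                    break
        matches.append({'start': start, 'end': end, 'text': text[start:end]})
    return matches
```

A real version would carry many more continuation patterns, and would then inspect the surrounding text for the law name.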
Parameters | |
string:str | the text to look in |
contextint | how much context to find another piece in (TODO: make this part of internal parameters) |
debug:bool | Undocumented |
Returns | |
a list of dict matches, as also mentioned on find_references() |
Looks for various different kinds of references in the given text, sorts the results.
Note that there is a sliding scale between 'is this an identifier, and will we probably find most of them' and 'is this more textual and more varied, so will we more easily miss parts' (...and should this perhaps not be implemented with regexes, as it currently is).
See also:
- Leidraad voor juridische auteurs
Parameters | |
string:str | the string to look in. Note that matches return offsets within this string. |
bwb:bool | whether to look for BWB identifiers, e.g. BWBR0006501 |
cvdr:bool | whether to look for CVDR work and expression identifiers, e.g. CVDR101405_1 CVDR101405/1 CVDR101405 |
ecli:bool | whether to look for ECLI identifiers, e.g. ECLI:NL:HR:2005:AT4537 |
celex:bool | whether to look for CELEX identifiers, e.g. 32000L0060 and some variations |
ljn:bool | whether to look for LJN identifiers, e.g. AT4537 (disabled by default because we want you to be explicitly aware of false positives; also, they aren't used anymore) |
bekendmaking:bool | whether to look for bekendmaking-ids like kst-26643-144-h1 and h-tk-20082009-7140-7144. Disabled by default because you don't usually see these in text. |
vindplaatsen:bool | whether to look for vindplaatsen for Trb, Stb, Stcrt, e.g. "Stb. 2011, 35"; these are actually quite regular (mostly by merit of being simple) |
artikel:bool | whether to look for artikel 3, lid 3, aanhef en onder c style references |
kamerstukken:bool | whether to look for kamerstukken references, the ones that look like: "Kamerstukken I 1995/96, 23700, nr. 188b, p. 3.", "Kamerstukken I 2014/15, 33802, C, p. 3.", "Kamerstukken II 1999/2000, 2000/2001, 2001/2002, 26 855.", "Kamerstukken I 2000/2001, 26 855 (250, 250a); 2001/2002, 26 855 (16, 16a, 16b, 16c)." |
euoj:bool | whether to look for EU Official Journal references, the ones that look like: "OJ L 69, 13.3.2013, p. 1", "OJ L 168, 30.6.2009, p. 41–47" |
eudir:bool | whether to look for EU directive references, the ones that look like: "Council Directive 93/42/EEC of 14 June 1993", "Directive 93/42/EEC of 14 June 1993" |
eureg:bool | whether to look for EU regulation references, the ones that look like: "Council Regulation (EEC) No 2658/87" |
debug:bool | Undocumented |
Returns | |
A list of dicts (sorted by the value of `start`), each with at least a standard set of keys, and probably more. |
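On the 'identifier' end of that sliding scale, a matcher can be little more than a regex. For instance, sketches for ECLI and BWB identifiers, returning the sorted-by-`start` dict shape described above (the patterns and the function name are illustrative assumptions; the module's real patterns are more involved):

```python
import re

# Illustrative identifier patterns -- real matchers need more care:
ECLI_RE = re.compile(r'ECLI:[A-Z]{2}:[A-Z0-9]{1,7}:\d{4}:[A-Z0-9.]{1,25}')
BWB_RE = re.compile(r'BWB[RV]\d{7}')   # e.g. BWBR0006501

def find_identifiers(string):
    """Return a list of match dicts sorted by `start`, loosely mirroring
    the return shape described above (hypothetical helper)."""
    matches = []
    for kind, pattern in (('ecli', ECLI_RE), ('bwb', BWB_RE)):
        for m in pattern.finditer(string):
            matches.append({'type': kind, 'start': m.start(),
                            'end': m.end(), 'text': m.group(0)})
    matches.sort(key=lambda d: d['start'])
    return matches
```

The textual matchers (kamerstukken, vindplaatsen, artikel references) sit at the other end of the scale and cannot be captured this tersely.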
Takes a spaCy Doc and the matches from calling find_references, and marks them as entities.
*Replaces* the currently marked entities, to avoid overlap. (CONSIDER: marking up in spans instead -- also because char_span() with alignment_mode='expand' probably makes this easier.)
Bases this on the plain text, and then tries to find all the tokens necessary to cover that (that code needs some double-checking).
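That 'find all the tokens necessary to cover that' step amounts to mapping character offsets back onto token boundaries. In isolation (without spaCy), the logic looks roughly like this sketch, under the assumption that per-token character offsets are available:

```python
def covering_token_range(token_offsets, start, end):
    """Given per-token (start_char, end_char) offsets, return (first, last+1)
    token indices whose characters overlap the [start, end) character span,
    or None if no token overlaps it.

    Sketch of the offset-to-token logic only; spaCy's own
    doc.char_span(..., alignment_mode='expand') does essentially this.
    """
    first = last = None
    for i, (tok_start, tok_end) in enumerate(token_offsets):
        if tok_end > start and tok_start < end:   # token overlaps the span
            if first is None:
                first = i
            last = i
    if first is None:
        return None
    return (first, last + 1)
```

Expanding to whole tokens this way means a match that starts or ends mid-token still yields a well-formed entity span.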
Quick and dirty splitter into words. Mainly used by abbrev_find
Parameters | |
string:str | the string to split up. |
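Such a splitter can be as small as a single regex. A sketch (the actual helper may differ in exactly what it keeps):

```python
import re

def simple_split_words(string):
    """Quick-and-dirty word splitter: keeps runs of word characters
    (unicode-aware, so accented letters survive) and drops punctuation.
    Sketch only; the real helper may behave differently."""
    return re.findall(r'\w+', string)
```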