Extracting specific patterns from text.
Much of the code in here is aimed at identifiers and references - identifiers like BWB-ID, CVDR-ID, ECLI, and CELEX, and more textual ones like EU OJ and directive references, vindplaatsen, kamerstukken, and "artikel 3"-style references -- see in particular the find_references function for a little more detail.
Currently this is a best-effort proof of concept of each of those matchers, containing copious hardcoding and messiness.
We _will_ miss things, as most things like this do. Arguably the only real metric is making a list of _everything_ you want to catch and seeing how well you do.
Right now the implementation is mostly regexes -- which aren't great for some of these. (But neither are formal grammars, because real-world variation would be missed.)
Note that if this is refined further, it should probably be restructured so that each matcher can register itself, so that there _isn't_ one central controller function entangling everything.
Function | abbrev | In case you have a lot of data, you can get cleaner (but reduced!) results by reporting how many distinct documents report the same specific explanation |
Function | abbrev | Finds abbreviations with explanations next to them. |
Function | find | Attempts to find references like "artikel 5.1, tweede lid, aanhef en onder i, van de Woo" and to parse and resolve as much as it can. |
Function | find | Looks for various different kinds of references in the given text, and sorts the results. |
Function | mark | Takes a spaCy Doc and the matches from calling find_references, and marks them as entities. |
Function | simple | Quick and dirty splitter into words. Mainly used by abbrev_find. |
Function | _wetnamen | A dict from law name to law BWB-id, mostly used to match law names in find_nonidentifier_references. |
Variable | _mw | Undocumented |
Variable | _mw | Undocumented |
In case you have a lot of data, you can get cleaner (but reduced!) results by reporting how many distinct documents report the same specific explanation
Parameters | |
l | A nested structure, where …
remove:bool | Whether to normalize the abbreviated form by removing any dots. |
case | Whether we consider the explanatory words in a case-insensitive way while counting. We report whatever the most common capitalisation is. |
Returns | |
Something like: { 'AE' : { ('Abbreviation', 'Explanation'): 3, ('Abbreviation', 'Erroneous'): 1 } }, where the number is how many documents had this explanation (NOT how often we saw this explanation). |
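That per-document counting can be sketched as follows. The function name and the exact input shape (a list of per-document lists of `(abbreviation, explanation_words)` tuples) are assumptions based on the description above, not the module's actual signature:

```python
import collections

def merge_abbrev_results(per_document_results):
    """Count, per abbreviation and explanation, in how many *distinct documents*
    that explanation appeared (not how often it appeared overall).

    per_document_results: a list where each item is one document's list of
    (abbrev, explanation_words) tuples -- a hypothetical shape, loosely
    following the docstring above.
    """
    counts = collections.defaultdict(collections.Counter)
    for document in per_document_results:
        # Deduplicate within the document so each (abbrev, explanation)
        # pair counts at most once per document:
        seen_in_this_document = {(abbrev, tuple(words))
                                 for abbrev, words in document}
        for abbrev, words in seen_in_this_document:
            counts[abbrev][words] += 1
    return counts
```

Note the explanation is stored as a tuple: a hashable key is required to use it in a Counter.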
Finds abbreviations with explanations next to them.
Looks for patterns like
- "Word Combination (WC)"
- "Wet Oven Overheid (Woo)"
- "With Periods (W.P.)"
- "(EA) Explained After" (probably rare)
- "BT (Bracketed terms)"
- "(Bracketed terms) BT" (probably rare)
Will both over- and under-accept, so if you want clean results, consider e.g. reporting only things present in multiple documents -- see e.g. merge_results().
CONSIDER:
how permissive to be with capitalization. Maybe make that a parameter?
allow and ignore words like 'of', 'the'
rewrite to deal with cases like
- Autoriteit Consument en Markt (ACM)
- De Regeling werving, reclame en verslavingspreventie kansspelen (hierna: Rwrvk)
- Nationale Postcode Loterij N.V. (hierna: NPL)
- Edelmetaal Waarborg Nederland B.V. (EWN)
- College voor Toetsen en Examens (CvTE)
- (and maybe:)
- Pensioen- en Uitkeringsraad (PUR)
- Nederlandse Loodsencorporatie (NLC)
- Nederlandse Emissieautoriteit (NEa)
- Kamer voor de Binnenvisserij (Kabivi)
- (and maybe not:)
- College van toezicht collectieve beheersorganisaties auteurs- en naburige rechten (College van Toezicht Auteursrechten (CvTA))
- Keurmerkinstituut jeugdzorg (KMI)
listening to 'hierna: ', e.g.
"Wet Bevordering Integriteitbeoordelingen door het Openbaar Bestuur (hierna: Wet BIBOB)"
"Drank- en horecawet (hierna: DHW)"
"Algemene wet bestuursrecht (hierna: Awb)"
"het Verdrag betreffende de werking van de Europese Unie (hierna: VWEU)"
"de Subsidieregeling OPZuid 2021-2027 (hierna: Subsidieregeling OPZuid)"
"de Wet werk en bijstand (hierna: WWB)"
"de Wet werk en inkomen naar arbeidsvermogen (hierna: WIA)"
"de Wet maatschappelijke ondersteuning (hierna: Wmo)"
These seem to be more structured, in particular when you use (de|het) as a delimiter. That seems overly specific, but works well to extract a bunch of these.
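As an illustration of that observation, a minimal pattern for the '(hierna: …)' form, using the article as the left delimiter, might look like this. It is a sketch only, not the module's actual implementation:

```python
import re

# Sketch: capture "<name> (hierna: <abbrev>)" pairs, using 'de'/'het'
# (when present) as the left edge of the name. Illustrative only.
_hierna_re = re.compile(
    r'(?:\b(?:de|het|De|Het)\s+)?'    # optional article delimiter
    r'([A-Z][^()]{2,80}?)\s*'         # the (capitalized) full name
    r'\(\s*hierna:\s*([^)]+?)\s*\)'   # the bracketed short form
)

def find_hierna(text):
    """Return (short_form, full_name) tuples for 'hierna:' explanations."""
    return [(m.group(2), m.group(1).strip())
            for m in _hierna_re.finditer(text)]
```

This will still stumble over the harder cases listed above (nested brackets, long mixed-case names), which is part of why the real code is messier.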
Parameters | |
string:str | python string to look in. CONSIDER: accept spacy objects as well |
Returns | |
a list of ('ww', ['word', 'word']) tuples, pretty much as-is, so it (intentionally) contains duplicates |
Attempts to find references like:
"artikel 5.1, tweede lid, aanhef en onder i, van de Woo"
and parse and resolve as much as it can.
This is a separate function because it is more complex than most others, but if you want to look for more than just these, then you probably want to wield it via find_references.
These references do not follow a formalized format: while the law ( https://wetten.overheid.nl/BWBR0005730/ ) that suggests the format of these says they should be succinct, and sometimes they look near-templated, that is not what real-world use looks like.
Another reasonable approach might be to include each real-world variant format explicitly: it lets you put stronger patterns first and fall back on fuzzier ones, it makes it clear what is being matched, and it's easier to see how common each is. However, that also easily leads to false negatives -- missing real references.
Instead, we
- start by finding some strong anchors
- keep accepting bits of adjacent string as long as they look like things we know, e.g. "artikel 5.1," "tweede lid," "aanhef en onder i"
- then see what text is around it, which should include at least the law name
Neither will deal well with the briefest forms, e.g. "(81 WWB)", which is arguably only reasonable to recognize when you recognize either side (by known law name, which is harder for abbreviations in that it probably leads to false positives) ...and in that example, we might want to
- see if character context makes it reasonable - the parentheses make it more reasonable than if you found the six characters '81 WWB' in any context
- check whether the estimated law (Wet werk en bijstand - BWBR0015703) has an article 81
- check, in some semantic way, whether Wet werk en bijstand makes any sense in context of the text
TODO: also make this return some estimation of
- how sure we are this is a reference,
- how complete a reference is, and/or
- how easy to resolve a reference is.
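The anchor-and-grow idea above can be sketched roughly as follows. This is a deliberately simplified illustration (hypothetical function and pattern names), not the actual implementation:

```python
import re

# Sketch of "find a strong anchor, then keep accepting adjacent bits that
# look like things we know": anchor on "artikel <number>", then greedily
# consume comma-separated continuations like "tweede lid" or
# "aanhef en onder i". Illustrative only.
_anchor_re = re.compile(r'artikel\s+\d+(?:\.\d+)?[a-z]?', re.I)
_continuation_res = [
    re.compile(r'\s*,\s*(eerste|tweede|derde|vierde|vijfde)\s+lid', re.I),
    re.compile(r'\s*,\s*aanhef\s+en\s+onder\s+[a-z]', re.I),
    re.compile(r'\s*,\s*onder\s+[a-z]', re.I),
]

def find_artikel_refs(text):
    matches = []
    for m in _anchor_re.finditer(text):
        start, end = m.start(), m.end()
        grew = True
        while grew:  # keep extending while any continuation fits right here
            grew = False
            for cont in _continuation_res:
                cm = cont.match(text, end)
                if cm:
                    end = cm.end()
                    grew = True
                    break
        matches.append({'start': start, 'end': end, 'text': text[start:end]})
    return matches
```

A real version would carry many more continuation patterns, and would then inspect the surrounding text for the law name.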
Parameters | |
string:str | the text to look in |
contextint | how much context to find another piece in (TODO: make this part of internal parameters) |
debug:bool | Undocumented |
Returns | |
a list of dict matches, as also mentioned on find_references() |
Looks for various different kinds of references in the given text, sorts the results.
Note that there is a sliding scale between 'is this an identifier, and will we probably find most of them' and 'is this more textual and more varied, so will we more easily miss parts' (...and should this perhaps not be implemented with regexes, as it currently is).
See also:
- Leidraad voor juridische auteurs
Parameters | |
string:str | the string to look in. Note that matches return offsets within this string. |
bwb:bool | whether to look for BWB identifiers, e.g. BWBR0006501 |
cvdr:bool | whether to look for CVDR work and expression identifiers, e.g. CVDR101405_1 CVDR101405/1 CVDR101405 |
ecli:bool | whether to look for ECLI identifiers, e.g. ECLI:NL:HR:2005:AT4537 |
celex:bool | whether to look for CELEX identifiers, e.g. 32000L0060 and some variations |
ljn:bool | whether to look for LJN identifiers, e.g. AT4537 (disabled by default because we want you to be explicitly aware of false positives; also, they aren't used anymore) |
bekendmaking:bool | whether to look for bekendmaking-ids like kst-26643-144-h1 and h-tk-20082009-7140-7144. Disabled by default because you don't usually see these in text. |
vindplaatsen:bool | whether to look for vindplaatsen for Trb, Stb, Stcrt, e.g. "Stb. 2011, 35"; these are actually quite regular (mostly by merit of being simple) |
artikel:bool | whether to look for artikel 3, lid 3, aanhef en onder c style references |
kamerstukken:bool | whether to look for kamerstukken references, the ones that look like: "Kamerstukken I 1995/96, 23700, nr. 188b, p. 3.", "Kamerstukken I 2014/15, 33802, C, p. 3.", "Kamerstukken II 1999/2000, 2000/2001, 2001/2002, 26 855.", "Kamerstukken I 2000/2001, 26 855 (250, 250a); 2001/2002, 26 855 (16, 16a, 16b, 16c)." |
euoj:bool | whether to look for EU Official Journal references, the ones that look like: "OJ L 69, 13.3.2013, p. 1", "OJ L 168, 30.6.2009, p. 41–47" |
eudir:bool | whether to look for EU directive references, the ones that look like: "Council Directive 93/42/EEC of 14 June 1993", "Directive 93/42/EEC of 14 June 1993" |
eureg:bool | whether to look for EU regulation references, the ones that look like: "Council Regulation (EEC) No 2658/87" |
debug:bool | Undocumented |
Returns | |
A list of dicts (sorted by the value of `start`), each with at least a standard set of keys, and probably more. |
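On the 'identifier' end of that sliding scale, a matcher can be little more than a regex. For instance, sketches for ECLI and BWB identifiers, returning the sorted-by-`start` dict shape described above (the patterns and the function name are illustrative assumptions; the module's real patterns are more involved):

```python
import re

# Illustrative identifier patterns -- real matchers need more care:
ECLI_RE = re.compile(r'ECLI:[A-Z]{2}:[A-Z0-9]{1,7}:\d{4}:[A-Z0-9.]{1,25}')
BWB_RE = re.compile(r'BWB[RV]\d{7}')   # e.g. BWBR0006501

def find_identifiers(string):
    """Return a list of match dicts sorted by `start`, loosely mirroring
    the return shape described above (hypothetical helper)."""
    matches = []
    for kind, pattern in (('ecli', ECLI_RE), ('bwb', BWB_RE)):
        for m in pattern.finditer(string):
            matches.append({'type': kind, 'start': m.start(),
                            'end': m.end(), 'text': m.group(0)})
    matches.sort(key=lambda d: d['start'])
    return matches
```

The textual matchers (kamerstukken, vindplaatsen, artikel references) sit at the other end of the scale and cannot be captured this tersely.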
Takes a spaCy Doc and the matches from calling find_references, and marks them as entities.
*Replaces* the currently marked entities, to avoid overlap. (CONSIDER: marking up in spans instead -- also because char_span() with alignment_mode='expand' probably makes this easier.)
Bases this on the plain text, and then tries to find all the tokens necessary to cover that (that code needs some double-checking).
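That 'find all the tokens necessary to cover that' step amounts to mapping character offsets back onto token boundaries. In isolation (without spaCy), the logic looks roughly like this sketch, under the assumption that per-token character offsets are available:

```python
def covering_token_range(token_offsets, start, end):
    """Given per-token (start_char, end_char) offsets, return (first, last+1)
    token indices whose characters overlap the [start, end) character span,
    or None if no token overlaps it.

    Sketch of the offset-to-token logic only; spaCy's own
    doc.char_span(..., alignment_mode='expand') does essentially this.
    """
    first = last = None
    for i, (tok_start, tok_end) in enumerate(token_offsets):
        if tok_end > start and tok_start < end:   # token overlaps the span
            if first is None:
                first = i
            last = i
    if first is None:
        return None
    return (first, last + 1)
```

Expanding to whole tokens this way means a match that starts or ends mid-token still yields a well-formed entity span.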
Quick and dirty splitter into words. Mainly used by abbrev_find
Parameters | |
string:str | the string to split up. |
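Such a splitter can be as small as a single regex. A sketch (the actual helper may differ in exactly what it keeps):

```python
import re

def simple_split_words(string):
    """Quick-and-dirty word splitter: keeps runs of word characters
    (unicode-aware, so accented letters survive) and drops punctuation.
    Sketch only; the real helper may behave differently."""
    return re.findall(r'\w+', string)
```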