module documentation

mostly-basic string helper functions

Many are simple or specific enough that you'd easily implement them as you need them, so not that much time is saved.

Function canonical_compare return whether two unicode strings are the same after canonical decomposition
Function compatibility_compare return whether two unicode strings are the same after compatibility decomposition
Function contains_all_of Given a string and a list of strings, returns whether the former contains all of the substrings in the latter. Note that no attention is paid to word boundaries.
Function contains_any_of Given a string and a list of strings, returns whether the former contains at least one of the strings in the latter, e.g. contains_any_of('microfishes', ['mikrofi','microfi','fiches']) == True
Function count_case_insensitive Calls count_normalized() with normalize_func=lambda s:s.lower(), which means it counts strings case-insensitively, but reports the most common capitalisation.
Function count_normalized Takes a list of strings, returns a string:count dict, with some extra processing
Function count_unicode_categories Count the unicode categories within the given string - and also simplify that.
Function findall_with_context Matches substrings/regexps, and for each match also gives some of the text context (on a character-amount basis).
Function has_lowercase_letter Returns whether the string contains at least one lowercase letter (that is, one that would change when calling upper())
Function has_text Does this string contain at least something we can consider text? Based on unicode codepoint categories - see count_unicode_categories
Function interpret_ordinal_nl Given a Dutch ordinal as text, gives the integer it represents (for 0..99)
Function is_mainly_numeric Returns whether the characters of the string that are 0123456789, -, or space make up more than the given threshold fraction of the entire string length.
Function is_numeric Does this string contain _only_ something we can probably consider a number? That is, [0-9.,] and optional whitespace around it
Function ngram_count Takes a string, figures out the n-grams, and returns a dict from n-gram strings to how often they occur in this string.
Function ngram_generate Gives all n-grams of a specific length. Generator function. Quick and dirty version.
Function ngram_matchcount Score by overlapping n-grams (outputs of ngram_count())
Function ngram_sort_by_matches Scores each item in the string list option_strings by how well it matches string, counting matching n-gram strings (with n in 1..4); more matching n-grams means a higher score.
Function ordered_unique Takes a list of strings, returns one without duplicates, keeping the first of each (so unlike a plain set(strlist), it keeps the order of what we keep). (Not the fastest implementation.)
Function ordinal_nl Given a number, gives the ordinal word for that number in Dutch (0..99)
Function remove_deheteen remove 'de', 'het', and 'een' as words from the start of a string - meant to help normalize phrases
Function remove_diacritics Unicode decomposes, remove combining characters, unicode compose. Note that not everything next to a letter is considered a diacritic.
Function remove_initial remove strings from the start of a string, based on a list of regexps
Function remove_privateuse Removes unicode characters within private use areas, because they have no semantic meaning (U+E000 through U+F8FF, U+F0000 through U+FFFFD, U+100000 to U+10FFFD).
Function simple_tokenize Split string into words. _Very_ basic - splits on and swallows symbols and such.
Function simplify_whitespace Replaces newlines with spaces, squeezes multiple spaces into one, then strip()s the whole. May be useful e.g. before handing text to functions that trip over newlines, series of newlines, or series of spaces.
Variable stopwords_en some English stopwords
Variable stopwords_nl some Dutch stopwords
Function _matches_anyall helper for contains_any_of and contains_all_of. See the docstrings for both.
Variable _ordinal_nl_20 Undocumented
Variable _ordinal_nl_20_rev Undocumented
Variable _re_combining helps remove diacritics - lists a number of combining (but not actually combin*ed*) character ranges in unicode, since you often want to remove these (after decomposition)
Variable _re_tig Undocumented
Variable _tigste1 Undocumented
Variable _tigste10 Undocumented
Variable _tigste10_rev Undocumented
Variable _tigste1_rev Undocumented
def canonical_compare(string1, string2):

return whether two unicode strings are the same after canonical decomposition

def compatibility_compare(string1, string2):

return whether two unicode strings are the same after compatibility decomposition

def contains_all_of(haystack, needles, case_sensitive=True, regexp=False, encoding='utf8'):

Given a string and a list of strings, returns whether the former contains all of the substrings in the latter. Note that no attention is paid to word boundaries, e.g.:

  • contains_all_of('AA (B/CCC)', ('AA', 'BB') ) == False
  • strings.contains_all_of('Wetswijziging', ['wijziging', 'wet'], case_sensitive=False) == True
  • strings.contains_all_of('wijziging wet A', ['wijziging', 'wet'], case_sensitive=False) == True
Parameters
haystack:str - the string to search in. When regexp=True, each needle is treated like a regular expression (the test is whether re.search for it is not None).
needles:List[str] - the things to look for. Note that if you use regexp=True and case_sensitive=False, the regexp gets lowercased before compilation, which may not always be correct.
case_sensitive - if False, haystack and needles are lowercased before testing. Defaults to True.
regexp - treat needles as regexps rather than substrings. Defaults to False, i.e. substrings.
encoding - lets us deal with bytes, by saying "if you see a bytes haystack or needle, decode it using this encoding". Defaults to utf-8.
def contains_any_of(haystack, needles, case_sensitive=True, regexp=False, encoding='utf8'):

Given a string and a list of strings, returns whether the former contains at least one of the strings in the latter, e.g. contains_any_of('microfishes', ['mikrofi','microfi','fiches']) == True

Parameters
haystack:str - the string to search in. When regexp=True, each needle is treated like a regular expression (the test is whether re.search for it is not None).
needles:List[str] - the things to look for. Note that if you use regexp=True and case_sensitive=False, the regexp gets lowercased before compilation, which may not always be correct.
case_sensitive - if False, haystack and needles are lowercased before testing. Defaults to True.
regexp - treat needles as regexps rather than substrings. Defaults to False, i.e. substrings.
encoding - lets us deal with bytes, by saying "if you see a bytes haystack or needle, decode it using this encoding". Defaults to utf-8.
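
A minimal sketch of how both functions could be implemented (the _contains_sketch name and its logic are illustrative only, not the module's actual code; the real functions also handle bytes via the encoding parameter):

    import re

    def _contains_sketch(haystack, needles, case_sensitive=True, regexp=False, match_all=False):
        # illustrative reimplementation of the contains_all_of / contains_any_of idea
        if not case_sensitive:
            haystack = haystack.lower()
            needles = [needle.lower() for needle in needles]
        if regexp:
            hits = (re.search(needle, haystack) is not None for needle in needles)
        else:
            hits = (needle in haystack for needle in needles)
        return all(hits) if match_all else any(hits)

    _contains_sketch('microfishes', ['mikrofi', 'microfi', 'fiches'])    # True  (any-of behaviour)
    _contains_sketch('AA (B/CCC)', ['AA', 'BB'], match_all=True)        # False (all-of behaviour)
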
def count_case_insensitive(strings, min_count=1, min_word_length=0, stopwords=(), stopwords_i=(), **kwargs):

Calls count_normalized() with normalize_func=lambda s:s.lower(), which means it counts strings case-insensitively, but it reports the most common capitalisation.

Explicitly writing a function for such singular use is almost pointless, yet this seems like a common case and saves some typing.

Parameters
strings:List[str]
min_count
min_word_length
stopwords
stopwords_i - Undocumented
**kwargs - Undocumented
Returns
a { string: count } dict (see count_normalized)
def count_normalized(strings, min_count=1, min_word_length=0, normalize_func=None, stopwords=(), stopwords_i=()):

Takes a list of strings, returns a string:count dict, with some extra processing

Parameters beyond normalize_func are mostly about removing things you would probably want removed anyway, so you do not have to do that separately.

Note that if you are using spacy or other POS tagging anyway, filtering e.g. just nouns and such before handing it into this is a lot cleaner and easier (if a little slower).

CONSIDER:

  • imitating wordcloud collocations= behaviour
  • imitating wordcloud normalize_plurals=True
  • imitating wordcloud include_numbers=False
  • separating out different parts of these behaviours
Parameters
strings:List[str] - a list of strings, the thing we count.
min_count:int
  • if integer, or float >1: we remove if final count is < that count,
  • if float in 0 to 1.0 range: we remove if the final count is < this fraction times the maximum count we see
min_word_length
  • strings shorter than this are removed. This is tested after normalization, so you can remove things in normalization too.
normalize_func

half the point of this function. Should be a str->str function.

  • We group things by what is equal after this function is applied, but we report the most common case before it is. For example, to _count_ blind to case, but report just one (the most common case):

        count_normalized( "a A A a A A a B b b B b".split(),  normalize_func=lambda s:s.lower() )
    

    would give:

        {"A":7, "b":5}
    
  • Could be used for other things. For example, if you make normalize_func map a word to its lemma, then you unify all inflections, and get reported the most common one.

stopwords
  • defaults to not removing anything
  • handing in True adds some of our own (Dutch and English)
  • handing in a list uses yours instead. There are stopwords_nl and stopwords_en lists in this module to get you started, but you may want to refine your own
stopwords_i
  • defaults to not removing anything
Returns
a { string: count } dict
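
The grouping idea (count under the normalized form, but report the most common original spelling) could look roughly like this sketch; it leaves out the min_count, min_word_length, and stopword handling of the real function, and the count_normalized_sketch name is illustrative only:

    from collections import Counter, defaultdict

    def count_normalized_sketch(strings, normalize_func=None):
        # group by the normalized form, but remember how often each original spelling occurred
        normalize_func = normalize_func or (lambda s: s)
        groups = defaultdict(Counter)
        for s in strings:
            groups[normalize_func(s)][s] += 1
        # report each group's most common original spelling, with the group's total count
        return {variants.most_common(1)[0][0]: sum(variants.values())
                for variants in groups.values()}

    count_normalized_sketch("a A A a A A a B b b B b".split(), normalize_func=lambda s: s.lower())
    # {'A': 7, 'b': 5}
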
def count_unicode_categories(string, nfc_first=True):

Count the unicode categories within the given string - and also simplify that.

For reference:

  • Lu - uppercase letter
  • Ll - lowercase letter
  • Lt - titlecase letter
  • Lm - modifier letter
  • Lo - other letter
  • Mn - nonspacing mark
  • Mc - spacing combining mark
  • Me - enclosing mark
  • Nd - number: decimal digit
  • Nl - number: letter
  • No - number: other
  • Pc - punctuation: connector
  • Pd - punctuation: dash
  • Ps - punctuation: open
  • Pe - punctuation: close
  • Pi - punctuation: initial quote (may behave like Ps or Pe depending on usage)
  • Pf - punctuation: final quote (may behave like Ps or Pe depending on usage)
  • Po - punctuation: other
  • Sm - math symbol
  • Sc - currency symbol
  • Sk - modifier symbol
  • So - other symbol
  • Zs - space separator
  • Zl - line separator
  • Zp - paragraph separator
  • Cc - control character
  • Cf - format character
  • Cs - surrogate codepoint
  • Co - private use character
  • Cn - character not assigned
Parameters
string:str - the string to look in
nfc_first:bool - whether to do a normalization first (that e.g. merges diacritics into the letters they are on)
Returns

two dicts, one counting the unicode categories per character, one simplified creatively. For example:

    count_unicode_categories('Fisher 99 ∢ 쎩 🧀')

would return:

  • {'textish': 7, 'space': 4, 'number': 2, 'symbol': 2},
  • {'Lu': 1, 'Ll': 5, 'Zs': 4, 'Nd': 2, 'Sm': 1, 'Lo': 1, 'So': 1}
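
A sketch of the per-category half of this, using the standard unicodedata module (the category_counts_sketch name is illustrative; the module's "simplified" second dict and its exact grouping are not reproduced here):

    import unicodedata
    from collections import Counter

    def category_counts_sketch(string, nfc_first=True):
        # count the two-letter unicode category of each character
        if nfc_first:
            string = unicodedata.normalize('NFC', string)
        return dict(Counter(unicodedata.category(ch) for ch in string))

    category_counts_sketch('Fisher 99')
    # {'Lu': 1, 'Ll': 5, 'Zs': 1, 'Nd': 2}
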
def findall_with_context(pattern, s, context_amt):

Matches substrings/regexps, and for each match also gives some of the text context (on a character-amount basis).

For example:

        list(findall_with_context(" a ", "I am a fork and a spoon", 5))

would return:

        [('I am', ' a ', <re.Match object; span=(4, 7), match=' a '>,   'fork '),
        ('k and', ' a ', <re.Match object; span=(15, 18), match=' a '>, 'spoon')]
Parameters
pattern:str - the regex (/string) to look for
s:str - the string to find things in
context_amt:int - amount of context, in number of characters
Returns

a generator that yields 4-tuples:

  • string before
  • matched string
  • match object - may seem redundant, but you often want a distinction between what is matched and captured. Also, the offset can be useful
  • string after
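
The same behaviour can be sketched with re.finditer and plain slicing (illustrative only; the findall_with_context_sketch name is not the module's, and argument handling may differ):

    import re

    def findall_with_context_sketch(pattern, s, context_amt):
        # for each match, also yield up to context_amt characters before and after it
        for match in re.finditer(pattern, s):
            before = s[max(0, match.start() - context_amt):match.start()]
            after = s[match.end():match.end() + context_amt]
            yield before, match.group(), match, after

    list(findall_with_context_sketch(" a ", "I am a fork and a spoon", 5))
    # [('I am', ' a ', <re.Match ...>, 'fork '), ('k and', ' a ', <re.Match ...>, 'spoon')]
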
def has_lowercase_letter(s):

Returns whether the string contains at least one lowercase letter (that is, one that would change when calling upper())

def has_text(string, mincount=1):

Does this string contain at least something we can consider text? Based on unicode codepoint categories - see count_unicode_categories

Parameters
string:str - the text to count in
mincount:int - how many text-like characters to demand
Returns
True or False
def interpret_ordinal_nl(string):

Given a Dutch ordinal as text, gives the integer it represents (for 0..99)

Parameters
string:str - the string with the ordinal as text
Returns

the integer, e.g.:

    interpret_ordinal_nl('eerste') == 1
def is_mainly_numeric(string, threshold=0.8):

Returns whether the characters of the string that are 0123456789, -, or space make up more than the threshold fraction of the entire string length.

Meant to help ignore serial numbers and such.

Parameters
string:str - the text to look in
threshold - if more than this fraction of the characters are digits (or the other mentioned characters), we return True.
Returns
whether it's mostly numbers
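
A sketch of that fraction test, assuming only the characters 0-9, '-', and space count as numeric-ish (illustrative; not the module's exact code):

    def is_mainly_numeric_sketch(string, threshold=0.8):
        # fraction of characters that are digits, dashes or spaces
        if len(string) == 0:
            return False
        numericish = sum(1 for ch in string if ch in '0123456789- ')
        return (numericish / len(string)) > threshold

    is_mainly_numeric_sketch('123-456 789')   # True
    is_mainly_numeric_sketch('order 12')      # False
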
def is_numeric(string):

Does this string contain _only_ something we can probably consider a number? That is, [0-9.,] and optional whitespace around it

Parameters
string:str - the string to look in
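
A one-regexp sketch of that test (illustrative; the exact set of accepted characters in the real function may differ):

    import re

    def is_numeric_sketch(string):
        # digits, dots and commas, with optional whitespace around them
        return re.match(r'^\s*[0-9.,]+\s*$', string) is not None

    is_numeric_sketch(' 1,250.00 ')   # True
    is_numeric_sketch('12 kg')        # False
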
def ngram_count(string, gramlens=(2, 3, 4), splitfirst=False):

Takes a string, figures out the n-grams, and returns a dict from n-gram strings to how often they occur in this string.

Parameters
string:str - the string to count n-grams from
gramlens:List[int] - list of lengths you want (defaults to (2,3,4): 2-grams, 3-grams and 4-grams)
splitfirst:bool - is here if you want to apply it to words - that is, do a (dumb) split so that we don't collect n-grams across word boundaries
Returns
a dict with string : occurrences
def ngram_generate(string, n):

Gives all n-grams of a specific length. Generator function. Quick and dirty version.

Treats input as sequence, so you can be creative and e.g. give it lists of strings (e.g. already-split words from sentences)

Parameters
string:str - the string to take slices of
n:int - the size, the n in n-gram
Returns
a generator that yields all the n-grams
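
Such a generator can be as simple as yielding every length-n slice, which is a sketch of the "treats input as sequence" point above (the ngram_generate_sketch name is illustrative, not the module's):

    def ngram_generate_sketch(seq, n):
        # yield every contiguous slice of length n
        for i in range(len(seq) - n + 1):
            yield seq[i:i + n]

    list(ngram_generate_sketch('fork', 2))            # ['fo', 'or', 'rk']
    list(ngram_generate_sketch(['a', 'b', 'c'], 2))   # [['a', 'b'], ['b', 'c']]
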
def ngram_matchcount(count_1, count_2):

Score by overlapping n-grams (outputs of ngram_count())

Parameters
count_1:dict - one dict of counts, e.g. from ngram_count
count_2:dict - another dict of counts, e.g. from ngram_count
Returns
a fraction: the number of matching n-grams divided by the total number of n-grams
def ngram_sort_by_matches(string, option_strings, gramlens=(1, 2, 3, 4), with_scores=False):

Scores each item in the string list option_strings by how well it matches string, counting matching n-gram strings (with n in 1..4); more matching n-grams means a higher score:

    ngram_sort_by_matches( 'for', ['spork', 'knife', 'spoon', 'fork']) == ['fork', 'spork', 'knife', 'spoon']

Note that if you pick the first, this is effectively a "which one is the closest string?" function

Parameters
string:str - the string to be most similar to
option_strings:List[str] - the string list to sort by similarity
gramlens:List[int] - the n-grams to use; defaults to (1,2,3,4), it may be a little faster to do (1,2,3)
with_scores:bool - if False, returns a list of strings. If True, returns a list of (string, score).
Returns
List of strings, or of tuples if with_scores==True
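
A rough sketch of the sorting idea: score each option by how many of its n-grams also appear in the target string, then sort by that score. This is illustrative only, and the module's scoring (for example, whether it uses counts rather than sets) may differ:

    def ngram_sort_sketch(string, option_strings, gramlens=(1, 2, 3, 4)):
        def grams(s):
            # all n-grams of the requested lengths, as a set
            return set(s[i:i + n] for n in gramlens for i in range(len(s) - n + 1))
        target = grams(string)
        return sorted(option_strings, key=lambda option: len(grams(option) & target), reverse=True)

    ngram_sort_sketch('for', ['spork', 'knife', 'spoon', 'fork'])
    # ['fork', 'spork', 'knife', 'spoon']
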
def ordered_unique(strlist, case_sensitive=True, remove_none=True):

Takes a list of strings, returns one without duplicates, keeping the first of each (so unlike a plain set(strlist), it keeps the order of what we keep). (Not the fastest implementation.)

Parameters
strlist:List[str] - the list of strings to work on
case_sensitive:bool - if False, duplicates are detected case-insensitively; it then keeps the _first_ casing it saw
remove_none:bool - remove list elements that are None instead of a string
Returns
a list of strings
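
A straightforward sketch of order-preserving deduplication (illustrative; the real function's None handling and tie-breaking may differ, and the ordered_unique_sketch name is not the module's):

    def ordered_unique_sketch(strlist, case_sensitive=True, remove_none=True):
        # remember what we have seen; keep only the first occurrence of each string
        seen, out = set(), []
        for s in strlist:
            if s is None:
                if remove_none:
                    continue
                key = None
            else:
                key = s if case_sensitive else s.lower()
            if key not in seen:
                seen.add(key)
                out.append(s)
        return out

    ordered_unique_sketch(['Fork', 'spoon', 'fork', None], case_sensitive=False)
    # ['Fork', 'spoon']
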
def ordinal_nl(integer):

Given a number, gives the ordinal word for that number in Dutch (0..99)

Parameters
integer:int - the number as an int
Returns

that number as a word in a string, e.g.:

    ordinal_nl(1) == 'eerste'
def remove_deheteen(string, remove=('de\\b', 'het\\b', 'een\\b')):

remove 'de', 'het', and 'een' as words from the start of a string - meant to help normalize phrases

Parameters
string
remove
Returns
the string with those initial words removed
def remove_diacritics(string):

Unicode-decomposes, removes combining characters, then unicode-composes again. Note that not everything next to a letter is considered a diacritic.

Parameters
string:str - the string to work on
Returns

a string where diacritics on characters have been removed, e.g.:

    remove_diacritics( 'olé' ) == 'ole'
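
The decompose / strip / recompose steps could look like this sketch, which drops characters in the Mn (nonspacing mark) category; the module itself matches combining-character ranges with the _re_combining regex, which is not identical:

    import unicodedata

    def remove_diacritics_sketch(string):
        # decompose, drop nonspacing marks, then recompose
        decomposed = unicodedata.normalize('NFD', string)
        stripped = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
        return unicodedata.normalize('NFC', stripped)

    remove_diacritics_sketch('olé')   # 'ole'
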
def remove_initial(string, remove_relist, flags=re.I):

remove strings from the start of a string, based on a list of regexps

Parameters
string:str
remove_relist
flags
Returns
the string with those initial matches removed
def remove_privateuse(string, replace_with=' '):

Removes unicode characters within private use areas, because they have no semantic meaning (U+E000 through U+F8FF, U+F0000 through U+FFFFD, U+100000 to U+10FFFD).
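
A sketch of that filtering, checking each codepoint against the three ranges mentioned above (illustrative, not necessarily how the module does it; the remove_privateuse_sketch name is hypothetical):

    def remove_privateuse_sketch(string, replace_with=' '):
        def is_private(ch):
            # the three private use ranges: BMP, plane 15, and plane 16
            cp = ord(ch)
            return (0xE000 <= cp <= 0xF8FF) or (0xF0000 <= cp <= 0xFFFFD) or (0x100000 <= cp <= 0x10FFFD)
        return ''.join(replace_with if is_private(ch) else ch for ch in string)

    remove_privateuse_sketch('a\ue000b')   # 'a b'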

def simple_tokenize(text):

Split string into words. _Very_ basic - splits on and swallows symbols and such.

Real NLP tokenizers are often more robust, but for a quick test we can avoid a big dependency (and sometimes slow execution)

Parameters
text - a single string
Returns
a list of words
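
A sketch of such a basic tokenizer (illustrative; the real function's exact splitting rules may differ, and the simple_tokenize_sketch name is not the module's):

    import re

    def simple_tokenize_sketch(text):
        # keep runs of word characters, swallow everything else
        return re.findall(r'\w+', text)

    simple_tokenize_sketch('Fish, chips & mushy peas!')   # ['Fish', 'chips', 'mushy', 'peas']
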
def simplify_whitespace(string):

Replaces newlines with spaces, squeezes multiple spaces into one, then strip()s the whole. May be useful e.g. before handing text to functions that trip over newlines, series of newlines, or series of spaces.

WARNING: Don't use this when you want to preserve empty lines.

Parameters
string:str - the string you want less whitespace in
Returns
that string with less whitespace
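
A sketch of that behaviour with a single substitution (illustrative; note that, as the warning above says, it collapses all whitespace, empty lines included):

    import re

    def simplify_whitespace_sketch(string):
        # collapse any run of whitespace (newlines included) to a single space, then strip
        return re.sub(r'\s+', ' ', string).strip()

    simplify_whitespace_sketch('one\n\n  two ')   # 'one two'
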
stopwords_en: list[str] =

some English stopwords

stopwords_nl: tuple[str, ...] =

some Dutch stopwords

def _matches_anyall(haystack, needles, case_sensitive=True, regexp=False, encoding=None, matchall=False):

helper for contains_any_of and contains_all_of. See the docstrings for both.

Parameters
haystack:str - Undocumented
needles:List[str] - Undocumented
case_sensitive - Undocumented
regexp - Undocumented
encoding - Undocumented
matchall - Undocumented
_ordinal_nl_20: dict[str, int] =

Undocumented

_ordinal_nl_20_rev: dict =

Undocumented

_re_combining =

helps remove diacritics - lists a number of combining (but not actually combin*ed*) character ranges in unicode, since you often want to remove these (after decomposition)

_re_tig =

Undocumented

_tigste1: dict[str, int] =

Undocumented

_tigste10: dict[str, int] =

Undocumented

_tigste10_rev: dict =

Undocumented

_tigste1_rev: dict =

Undocumented