module documentation

mostly-basic string helper functions

Many are simple or specific enough that you'd easily implement them as you need them, so not that much time is saved.

Function canonical_compare return whether two unicode strings are the same after canonical decomposition
Function compatibility_compare return whether two unicode strings are the same after compatibility decomposition
Function contains_all_of Given a string and a list of strings, returns whether the former contains all of the substrings in the latter. Note that no attention is paid to word boundaries.
Function contains_any_of Given a string and a list of strings, returns whether the former contains at least one of the strings in the latter, e.g. contains_any_of('microfishes', ['mikrofi','microfi','fiches']) == True
Function count_case_insensitive Calls count_normalized() with normalize_func=lambda s:s.lower(), which means it counts strings case-insensitively, but reports the most common capitalisation.
Function count_normalized Takes a list of strings, returns a string:count dict, with some extra processing
Function count_unicode_categories Count the unicode categories within the given string - and also simplify that.
Function findall_with_context Matches substrings/regexps, and for each match also gives some of the text context (on a character-amount basis).
Function has_lowercase_letter Returns whether the string contains at least one lowercase letter (that is, one that would change when calling upper())
Function has_text Does this string contain at least something we can consider text? Based on unicode codepoint categories - see count_unicode_categories
Function interpret_ordinal_nl Given a Dutch ordinal as text, gives the integer it represents (for 0..99)
Function is_mainly_numeric Returns whether the characters of the string that are 0123456789, -, or space make up more than the given threshold fraction of the entire string length.
Function is_numeric Does this string contain _only_ something we can probably consider a number? That is, [0-9.,] and optional whitespace around it
Function ngram_count Takes a string, figures out the n-grams, and returns a dict from n-gram strings to how often they occur in this string.
Function ngram_generate Gives all n-grams of a specific length. Generator function. Quick and dirty version.
Function ngram_matchcount Score by overlapping n-grams (outputs of ngram_count())
Function ngram_sort_by_matches Scores each item in the string list option_strings by how well it matches string, counting matching n-gram strings (with n in 1..4); more matching n-grams means a higher score.
Function ordered_unique Takes a list of strings, returns one without duplicates, keeping the first of each (so unlike a plain set(strlist), it keeps the order of what we keep). (Not the fastest implementation.)
Function ordinal_nl Given a number, gives the ordinal word for that number in Dutch (0..99)
Function remove_deheteen remove 'de', 'het', and 'een' as words from the start of a string - meant to help normalize phrases
Function remove_diacritics Unicode decomposes, remove combining characters, unicode compose. Note that not everything next to a letter is considered a diacritic.
Function remove_initial remove strings from the start of a string, based on a list of regexps
Function remove_privateuse Removes unicode characters within private use areas, because they have no semantic meaning (U+E000 through U+F8FF, U+F0000 through U+FFFFD, U+100000 to U+10FFFD).
Function simple_tokenize Split string into words. _Very_ basic - splits on and swallows symbols and such.
Function simplify_whitespace Replaces newlines with spaces, squeezes multiple spaces into one, then strip()s the whole. May be useful e.g. before handing text to functions that trip over newlines, series of newlines, or series of spaces.
Variable stopwords_en some English stopwords
Variable stopwords_nl some Dutch stopwords
Function _matches_anyall helper for contains_any_of and contains_all_of. See the docstrings for both.
Variable _ordinal_nl_20 Undocumented
Variable _ordinal_nl_20_rev Undocumented
Variable _re_combining helps remove diacritics - lists a number of combining (but not actually combin*ed*) character ranges in unicode, since you often want to remove these (after decomposition)
Variable _re_tig Undocumented
Variable _tigste1 Undocumented
Variable _tigste10 Undocumented
Variable _tigste10_rev Undocumented
Variable _tigste1_rev Undocumented
def canonical_compare(string1, string2):

return whether two unicode strings are the same after canonical decomposition

def compatibility_compare(string1, string2):

return whether two unicode strings are the same after compatibility decomposition

def contains_all_of(haystack, needles, case_sensitive=True, regexp=False, encoding='utf8'):

Given a string and a list of strings, returns whether the former contains all of the substrings in the latter. Note that no attention is paid to word boundaries, e.g.:

  • contains_all_of('AA (B/CCC)', ('AA', 'BB') ) == False
  • strings.contains_all_of('Wetswijziging', ['wijziging', 'wet'], case_sensitive=False) == True
  • strings.contains_all_of('wijziging wet A', ['wijziging', 'wet'], case_sensitive=False) == True
Parameters
haystack:str - the string to search in. When regexp=True, each needle is treated like a regular expression (the test is whether re.search for it is not None).
needles:List[str] - the things to look for. Note that if you use regexp=True and case_sensitive=False, the regexp gets lowercased before compilation, which may not always be correct.
case_sensitive - if False, haystack and needles are lowercased before testing. Defaults to True.
regexp - treat needles as regexps rather than substrings. Defaults to False, i.e. substrings.
encoding - lets us deal with bytes, by saying "if you see a bytes haystack or needle, decode it using this encoding". Defaults to utf-8.
def contains_any_of(haystack, needles, case_sensitive=True, regexp=False, encoding='utf8'):

Given a string and a list of strings, returns whether the former contains at least one of the strings in the latter, e.g. contains_any_of('microfishes', ['mikrofi','microfi','fiches']) == True

Parameters
haystack:str - the string to search in. When regexp=True, each needle is treated like a regular expression (the test is whether re.search for it is not None).
needles:List[str] - the things to look for. Note that if you use regexp=True and case_sensitive=False, the regexp gets lowercased before compilation, which may not always be correct.
case_sensitive - if False, haystack and needles are lowercased before testing. Defaults to True.
regexp - treat needles as regexps rather than substrings. Defaults to False, i.e. substrings.
encoding - lets us deal with bytes, by saying "if you see a bytes haystack or needle, decode it using this encoding". Defaults to utf-8.
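
A minimal sketch of how both functions could be implemented (the _contains_sketch name and its logic are illustrative only, not the module's actual code; the real functions also handle bytes via the encoding parameter):

    import re

    def _contains_sketch(haystack, needles, case_sensitive=True, regexp=False, match_all=False):
        # illustrative reimplementation of the contains_all_of / contains_any_of idea
        if not case_sensitive:
            haystack = haystack.lower()
            needles = [needle.lower() for needle in needles]
        if regexp:
            hits = (re.search(needle, haystack) is not None for needle in needles)
        else:
            hits = (needle in haystack for needle in needles)
        return all(hits) if match_all else any(hits)

    _contains_sketch('microfishes', ['mikrofi', 'microfi', 'fiches'])    # True  (any-of behaviour)
    _contains_sketch('AA (B/CCC)', ['AA', 'BB'], match_all=True)        # False (all-of behaviour)
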
def count_case_insensitive(strings, min_count=1, min_word_length=0, stopwords=(), stopwords_i=(), **kwargs):

Calls count_normalized() with normalize_func=lambda s:s.lower(), which means it counts strings case-insensitively, but it reports the most common capitalisation.

Explicitly writing a function for such singular use is almost pointless, yet this seems like a common case and saves some typing.

Parameters
strings:List[str]
min_count
min_word_length
stopwords
stopwords_i - Undocumented
**kwargs - Undocumented
Returns
a { string: count } dict (see count_normalized)
def count_normalized(strings, min_count=1, min_word_length=0, normalize_func=None, stopwords=(), stopwords_i=()):

Takes a list of strings, returns a string:count dict, with some extra processing

Parameters beyond normalize_func are mostly about removing things you would probably want removed anyway, so you do not have to do that separately.

Note that if you are using spacy or other POS tagging anyway, filtering e.g. just nouns and such before handing it into this is a lot cleaner and easier (if a little slower).

CONSIDER:

  • imitating wordcloud collocations= behaviour
  • imitating wordcloud normalize_plurals=True
  • imitating wordcloud include_numbers=False
  • separating out different parts of these behaviours
Parameters
strings:List[str] - a list of strings, the thing we count.
min_count:int
  • if integer, or float >1: we remove if final count is < that count,
  • if float in 0 to 1.0 range: we remove if the final count is < this fraction times the maximum count we see
min_word_length
  • strings shorter than this are removed. This is tested after normalization, so you can remove things in normalization too.
normalize_func

half the point of this function. Should be a str->str function.

  • We group things by what is equal after this function is applied, but we report the most common case before it is. For example, to _count_ blind to case, but report just one (the most common case):

        count_normalized( "a A A a A A a B b b B b".split(),  normalize_func=lambda s:s.lower() )
    

    would give:

        {"A":7, "b":5}
    
  • Could be used for other things. For example, if you make normalize_func map a word to its lemma, then you unify all inflections, and get reported the most common one.

stopwords
  • defaults to not removing anything
  • handing in True adds some of our own (Dutch and English)
  • handing in a list uses yours instead. There are stopwords_nl and stopwords_en lists in this module to get you started, but you may want to refine your own
stopwords_i
  • defaults to not removing anything
Returns
a { string: count } dict
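
The grouping idea (count under the normalized form, but report the most common original spelling) could look roughly like this sketch; it leaves out the min_count, min_word_length, and stopword handling of the real function, and the count_normalized_sketch name is illustrative only:

    from collections import Counter, defaultdict

    def count_normalized_sketch(strings, normalize_func=None):
        # group by the normalized form, but remember how often each original spelling occurred
        normalize_func = normalize_func or (lambda s: s)
        groups = defaultdict(Counter)
        for s in strings:
            groups[normalize_func(s)][s] += 1
        # report each group's most common original spelling, with the group's total count
        return {variants.most_common(1)[0][0]: sum(variants.values())
                for variants in groups.values()}

    count_normalized_sketch("a A A a A A a B b b B b".split(), normalize_func=lambda s: s.lower())
    # {'A': 7, 'b': 5}
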
def count_unicode_categories(string, nfc_first=True):

Count the unicode categories within the given string - and also simplify that.

For reference:

  • Lu - uppercase letter
  • Ll - lowercase letter
  • Lt - titlecase letter
  • Lm - modifier letter
  • Lo - other letter
  • Mn - nonspacing mark
  • Mc - spacing combining mark
  • Me - enclosing mark
  • Nd - number: decimal digit
  • Nl - number: letter
  • No - number: other
  • Pc - punctuation: connector
  • Pd - punctuation: dash
  • Ps - punctuation: open
  • Pe - punctuation: close
  • Pi - punctuation: initial quote (may behave like Ps or Pe depending on usage)
  • Pf - punctuation: final quote (may behave like Ps or Pe depending on usage)
  • Po - punctuation: other
  • Sm - math symbol
  • Sc - currency symbol
  • Sk - modifier symbol
  • So - other symbol
  • Zs - space separator
  • Zl - line separator
  • Zp - paragraph separator
  • Cc - control character
  • Cf - format character
  • Cs - surrogate codepoint
  • Co - private use character
  • Cn - character not assigned
Parameters
string:str - the string to look in
nfc_first:bool - whether to do a normalization first (that e.g. merges diacritics into the letters they are on)
Returns

two dicts, one counting the unicode categories per character, one simplified creatively. For example:

    count_unicode_categories('Fisher 99 ∢ 쎩 🧀')

would return:

  • {'textish': 7, 'space': 4, 'number': 2, 'symbol': 2},
  • {'Lu': 1, 'Ll': 5, 'Zs': 4, 'Nd': 2, 'Sm': 1, 'Lo': 1, 'So': 1}
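
A sketch of the per-category half of this, using the standard unicodedata module (the category_counts_sketch name is illustrative; the module's "simplified" second dict and its exact grouping are not reproduced here):

    import unicodedata
    from collections import Counter

    def category_counts_sketch(string, nfc_first=True):
        # count the two-letter unicode category of each character
        if nfc_first:
            string = unicodedata.normalize('NFC', string)
        return dict(Counter(unicodedata.category(ch) for ch in string))

    category_counts_sketch('Fisher 99')
    # {'Lu': 1, 'Ll': 5, 'Zs': 1, 'Nd': 2}
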
def findall_with_context(pattern, s, context_amt):

Matches substrings/regexps, and for each match also gives some of the text context (on a character-amount basis).

For example:

        list(findall_with_context(" a ", "I am a fork and a spoon", 5))

would return:

        [('I am', ' a ', <re.Match object; span=(4, 7), match=' a '>,   'fork '),
        ('k and', ' a ', <re.Match object; span=(15, 18), match=' a '>, 'spoon')]
Parameters
pattern:str - the regex (/string) to look for
s:str - the string to find things in
context_amt:int - amount of context, in number of characters
Returns

a generator that yields 4-tuples:

  • string before
  • matched string
  • match object - may seem redundant, but you often want a distinction between what is matched and captured. Also, the offset can be useful
  • string after
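
The same behaviour can be sketched with re.finditer and plain slicing (illustrative only; the findall_with_context_sketch name is not the module's, and argument handling may differ):

    import re

    def findall_with_context_sketch(pattern, s, context_amt):
        # for each match, also yield up to context_amt characters before and after it
        for match in re.finditer(pattern, s):
            before = s[max(0, match.start() - context_amt):match.start()]
            after = s[match.end():match.end() + context_amt]
            yield before, match.group(), match, after

    list(findall_with_context_sketch(" a ", "I am a fork and a spoon", 5))
    # [('I am', ' a ', <re.Match ...>, 'fork '), ('k and', ' a ', <re.Match ...>, 'spoon')]
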
def has_lowercase_letter(s):

Returns whether the string contains at least one lowercase letter (that is, one that would change when calling upper())

def has_text(string, mincount=1):

Does this string contain at least something we can consider text? Based on unicode codepoint categories - see count_unicode_categories

Parameters
string:str - the text to count in
mincount:int - how many text-like characters to demand
Returns
True or False
def interpret_ordinal_nl(string):

Given a Dutch ordinal as text, gives the integer it represents (for 0..99)

Parameters
string:str - the string with the ordinal as text
Returns

the integer, e.g.:

    interpret_ordinal_nl('eerste') == 1
def is_mainly_numeric(string, threshold=0.8):

Returns whether the characters of the string that are 0123456789, -, or space make up more than the threshold fraction of the entire string length.

Meant to help ignore serial numbers and such.

Parameters
string:str - the text to look in
threshold - if more than this fraction of the characters are digits (or the other mentioned characters), we return True.
Returns
whether it's mostly numbers
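
A sketch of that fraction test, assuming only the characters 0-9, '-', and space count as numeric-ish (illustrative; not the module's exact code):

    def is_mainly_numeric_sketch(string, threshold=0.8):
        # fraction of characters that are digits, dashes or spaces
        if len(string) == 0:
            return False
        numericish = sum(1 for ch in string if ch in '0123456789- ')
        return (numericish / len(string)) > threshold

    is_mainly_numeric_sketch('123-456 789')   # True
    is_mainly_numeric_sketch('order 12')      # False
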
def is_numeric(string):

Does this string contain _only_ something we can probably consider a number? That is, [0-9.,] and optional whitespace around it

Parameters
string:str - the string to look in
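
A one-regexp sketch of that test (illustrative; the exact set of accepted characters in the real function may differ):

    import re

    def is_numeric_sketch(string):
        # digits, dots and commas, with optional whitespace around them
        return re.match(r'^\s*[0-9.,]+\s*$', string) is not None

    is_numeric_sketch(' 1,250.00 ')   # True
    is_numeric_sketch('12 kg')        # False
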
def ngram_count(string, gramlens=(2, 3, 4), splitfirst=False):

Takes a string, figures out the n-grams, and returns a dict from n-gram strings to how often they occur in this string.

Parameters
string:str - the string to count n-grams from
gramlens:List[int] - list of lengths you want (defaults to (2,3,4): 2-grams, 3-grams and 4-grams)
splitfirst:bool - is here if you want to apply it to words - that is, do a (dumb) split so that we don't collect n-grams across word boundaries
Returns
a dict with string : occurrences
def ngram_generate(string, n):

Gives all n-grams of a specific length. Generator function. Quick and dirty version.

Treats input as sequence, so you can be creative and e.g. give it lists of strings (e.g. already-split words from sentences)

Parameters
string:str - the string to take slices of
n:int - the size, the n in n-gram
Returns
a generator that yields all the n-grams
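
Such a generator can be as simple as yielding every length-n slice, which is a sketch of the "treats input as sequence" point above (the ngram_generate_sketch name is illustrative, not the module's):

    def ngram_generate_sketch(seq, n):
        # yield every contiguous slice of length n
        for i in range(len(seq) - n + 1):
            yield seq[i:i + n]

    list(ngram_generate_sketch('fork', 2))            # ['fo', 'or', 'rk']
    list(ngram_generate_sketch(['a', 'b', 'c'], 2))   # [['a', 'b'], ['b', 'c']]
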
def ngram_matchcount(count_1, count_2):

Score by overlapping n-grams (outputs of ngram_count())

Parameters
count_1:dict - one dict of counts, e.g. from ngram_count
count_2:dict - another dict of counts, e.g. from ngram_count
Returns
a fraction: the number of matching n-grams divided by the total number of n-grams
def ngram_sort_by_matches(string, option_strings, gramlens=(1, 2, 3, 4), with_scores=False):

Scores each item in the string list option_strings by how well it matches string, counting matching n-gram strings (with n in 1..4); more matching n-grams means a higher score:

    ngram_sort_by_matches( 'for', ['spork', 'knife', 'spoon', 'fork']) == ['fork', 'spork', 'knife', 'spoon']

Note that if you pick the first, this is effectively a "which one is the closest string?" function

Parameters
string:str - the string to be most similar to
option_strings:List[str] - the string list to sort by similarity
gramlens:List[int] - the n-grams to use; defaults to (1,2,3,4), it may be a little faster to do (1,2,3)
with_scores:bool - if False, returns a list of strings. If True, returns a list of (string, score).
Returns
List of strings, or of tuples if with_scores==True
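
A rough sketch of the sorting idea: score each option by how many of its n-grams also appear in the target string, then sort by that score. This is illustrative only, and the module's scoring (for example, whether it uses counts rather than sets) may differ:

    def ngram_sort_sketch(string, option_strings, gramlens=(1, 2, 3, 4)):
        def grams(s):
            # all n-grams of the requested lengths, as a set
            return set(s[i:i + n] for n in gramlens for i in range(len(s) - n + 1))
        target = grams(string)
        return sorted(option_strings, key=lambda option: len(grams(option) & target), reverse=True)

    ngram_sort_sketch('for', ['spork', 'knife', 'spoon', 'fork'])
    # ['fork', 'spork', 'knife', 'spoon']
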
def ordered_unique(strlist, case_sensitive=True, remove_none=True):

Takes a list of strings, returns one without duplicates, keeping the first of each (so unlike a plain set(strlist), it keeps the order of what we keep). (Not the fastest implementation.)

Parameters
strlist:List[str] - the list of strings to work on
case_sensitive:bool - if False, duplicates are detected case-insensitively; it then keeps the _first_ casing it saw
remove_none:bool - remove list elements that are None instead of a string
Returns
a list of strings
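
A straightforward sketch of order-preserving deduplication (illustrative; the real function's None handling and tie-breaking may differ, and the ordered_unique_sketch name is not the module's):

    def ordered_unique_sketch(strlist, case_sensitive=True, remove_none=True):
        # remember what we have seen; keep only the first occurrence of each string
        seen, out = set(), []
        for s in strlist:
            if s is None:
                if remove_none:
                    continue
                key = None
            else:
                key = s if case_sensitive else s.lower()
            if key not in seen:
                seen.add(key)
                out.append(s)
        return out

    ordered_unique_sketch(['Fork', 'spoon', 'fork', None], case_sensitive=False)
    # ['Fork', 'spoon']
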
def ordinal_nl(integer):

Given a number, gives the ordinal word for that number in Dutch (0..99)

Parameters
integer:int - the number as an int
Returns

that number as a word in a string, e.g.:

    ordinal_nl(1) == 'eerste'
def remove_deheteen(string, remove=('de\\b', 'het\\b', 'een\\b')):

remove 'de', 'het', and 'een' as words from the start of a string - meant to help normalize phrases

Parameters
string
remove
Returns
the string with those initial words removed
def remove_diacritics(string):

Unicode-decomposes, removes combining characters, then unicode-composes again. Note that not everything next to a letter is considered a diacritic.

Parameters
string:str - the string to work on
Returns

a string where diacritics on characters have been removed, e.g.:

    remove_diacritics( 'olé' ) == 'ole'
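
The decompose / strip / recompose steps could look like this sketch, which drops characters in the Mn (nonspacing mark) category; the module itself matches combining-character ranges with the _re_combining regex, which is not identical:

    import unicodedata

    def remove_diacritics_sketch(string):
        # decompose, drop nonspacing marks, then recompose
        decomposed = unicodedata.normalize('NFD', string)
        stripped = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
        return unicodedata.normalize('NFC', stripped)

    remove_diacritics_sketch('olé')   # 'ole'
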
def remove_initial(string, remove_relist, flags=re.I):

remove strings from the start of a string, based on a list of regexps

Parameters
string:str
remove_relist
flags
Returns
the string with those initial matches removed
def remove_privateuse(string, replace_with=' '):

Removes unicode characters within private use areas, because they have no semantic meaning (U+E000 through U+F8FF, U+F0000 through U+FFFFD, U+100000 to U+10FFFD).
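
A sketch of that filtering, checking each codepoint against the three ranges mentioned above (illustrative, not necessarily how the module does it; the remove_privateuse_sketch name is hypothetical):

    def remove_privateuse_sketch(string, replace_with=' '):
        def is_private(ch):
            # the three private use ranges: BMP, plane 15, and plane 16
            cp = ord(ch)
            return (0xE000 <= cp <= 0xF8FF) or (0xF0000 <= cp <= 0xFFFFD) or (0x100000 <= cp <= 0x10FFFD)
        return ''.join(replace_with if is_private(ch) else ch for ch in string)

    remove_privateuse_sketch('a\ue000b')   # 'a b'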

def simple_tokenize(text):

Split string into words. _Very_ basic - splits on and swallows symbols and such.

Real NLP tokenizers are often more robust, but for a quick test we can avoid a big dependency (and sometimes slow execution)

Parameters
text - a single string
Returns
a list of words
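
A sketch of such a basic tokenizer (illustrative; the real function's exact splitting rules may differ, and the simple_tokenize_sketch name is not the module's):

    import re

    def simple_tokenize_sketch(text):
        # keep runs of word characters, swallow everything else
        return re.findall(r'\w+', text)

    simple_tokenize_sketch('Fish, chips & mushy peas!')   # ['Fish', 'chips', 'mushy', 'peas']
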
def simplify_whitespace(string):

Replaces newlines with spaces, squeezes multiple spaces into one, then strip()s the whole. May be useful e.g. before handing text to functions that trip over newlines, series of newlines, or series of spaces.

WARNING: Don't use this when you want to preserve empty lines.

Parameters
string:str - the string you want less whitespace in
Returns
that string with less whitespace
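
A sketch of that behaviour with a single substitution (illustrative; note that, as the warning above says, it collapses all whitespace, empty lines included):

    import re

    def simplify_whitespace_sketch(string):
        # collapse any run of whitespace (newlines included) to a single space, then strip
        return re.sub(r'\s+', ' ', string).strip()

    simplify_whitespace_sketch('one\n\n  two ')   # 'one two'
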
stopwords_en: list[str] =

some English stopwords

stopwords_nl: tuple[str, ...] =

some Dutch stopwords

def _matches_anyall(haystack, needles, case_sensitive=True, regexp=False, encoding=None, matchall=False):

helper for contains_any_of and contains_all_of. See the docstrings for both.

Parameters
haystack:str - Undocumented
needles:List[str] - Undocumented
case_sensitive - Undocumented
regexp - Undocumented
encoding - Undocumented
matchall - Undocumented
_ordinal_nl_20: dict[str, int] =

Undocumented

_ordinal_nl_20_rev: dict =

Undocumented

_re_combining =

helps remove diacritics - lists a number of combining (but not actually combin*ed*) character ranges in unicode, since you often want to remove these (after decomposition)

_re_tig =

Undocumented

_tigste1: dict[str, int] =

Undocumented

_tigste10: dict[str, int] =

Undocumented

_tigste10_rev: dict =

Undocumented

_tigste1_rev: dict =

Undocumented