Mostly-basic string helper functions.
Many are simple or specific enough that you'd easily implement them as you need them, so not that much time is saved.
| Kind | Name | Description |
|---|---|---|
| Function | canonical… | Returns whether two unicode strings are the same after canonical decomposition. |
| Function | compatibility… | Returns whether two unicode strings are the same after compatibility decomposition. |
| Function | contains_all_of | Given a string and a list of strings, returns whether the former contains all of the substrings in the latter. Note that no attention is paid to word boundaries. |
| Function | contains_any_of | Given a string and a list of strings, returns whether the former contains at least one of the strings in the latter, e.g. contains_any_of('microfishes', ['mikrofi','microfi','fiches']) == True |
| Function | count… | Calls count_normalized() with normalize_func=lambda s: s.lower(), which means it counts strings case-insensitively, but reports the most common capitalisation. |
| Function | count_normalized | Takes a list of strings, returns a string:count dict, with some extra processing. |
| Function | count_unicode_categories | Counts the unicode categories within the given string - and also gives a simplified version of that. |
| Function | findall_with_context | Matches substrings/regexps, and for each match also gives some of the text context (on a character-amount basis). |
| Function | has… | Returns whether the string contains at least one lowercase letter (that is, one that would change when calling upper()). |
| Function | has… | Does this string contain at least something we can consider text? Based on unicode codepoint categories - see count_unicode_categories. |
| Function | interpret_ordinal_nl | Given a Dutch ordinal word, gives the integer it represents (for 0..99). |
| Function | is… | Returns whether the characters of the string that are 0123456789, -, or space make up more than a threshold fraction of the entire string length. |
| Function | is… | Does this string contain _only_ something we can probably consider a number? That is, [0-9.,] with optional whitespace around it. |
| Function | ngram_count | Takes a string, figures out the n-grams, and returns a dict from n-gram strings to how often they occur in this string. |
| Function | ngram… | Gives all n-grams of a specific length. Generator function. Quick and dirty version. |
| Function | ngram… | Scores by overlapping n-grams (outputs of ngram_count()). |
| Function | ngram_sort_by_matches | Scores each item in a string list by how many n-gram strings (with n in 1..4) it shares with a given string; more matching n-grams means a higher score. |
| Function | ordered… | Takes a list of strings, returns one without duplicates, keeping the first of each (so unlike a plain set(strlist), it keeps the order of what we keep). Not the fastest implementation. |
| Function | ordinal_nl | Given a number, gives the Dutch ordinal word for it (0..99). |
| Function | remove… | Removes 'de', 'het', and 'een' as words from the start of a string - meant to help normalize phrases. |
| Function | remove_diacritics | Unicode-decomposes, removes combining characters, then unicode-composes. Note that not everything next to a letter is considered a diacritic. |
| Function | remove… | Removes strings from the start of a string, based on a list of regexps. |
| Function | remove… | Removes unicode characters within private use areas, because they have no semantic meaning (U+E000 through U+F8FF, U+F0000 through U+FFFFD, U+100000 through U+10FFFD). |
| Function | simple… | Splits a string into words. _Very_ basic - splits on and swallows symbols and such. |
| Function | simplify… | Replaces newlines with spaces, squeezes multiple spaces into one, then strip()s the whole. |
| Variable | stopwords… | Some English stopwords. |
| Variable | stopwords… | Some Dutch stopwords. |
| Function | _matches | Helper for contains_any_of and contains_all_of. See the docstrings for both. |
| Variable | _ordinal… | Undocumented |
| Variable | _ordinal… | Undocumented |
| Variable | _re… | Helps remove diacritics - lists a number of combining (but not actually combin*ed*) character ranges in unicode, since you often want to remove these (after decomposition). |
| Variable | _re… | Undocumented |
| Variable | _tigste1 | Undocumented |
| Variable | _tigste10 | Undocumented |
| Variable | _tigste10 | Undocumented |
| Variable | _tigste1 | Undocumented |
Given a string and a list of strings, returns whether the former contains all of the substrings in the latter. Note that no attention is paid to word boundaries. E.g.:
- contains_all_of('AA (B/CCC)', ('AA', 'BB')) == False
- strings.contains_all_of('Wetswijziging', ['wijziging', 'wet'], case_sensitive=False) == True
- strings.contains_all_of('wijziging wet A', ['wijziging', 'wet'], case_sensitive=False) == True
Parameters:
- haystack:str - the string to search in
- needles:List[str] - the things to look for. Note that if you use regexp=True with case_sensitive=False, each regexp gets lowercased before compilation, which may not always be correct.
- case_sensitive - if False, lowercases haystack and needles before testing. Defaults to True.
- regexp - treat needles as regexps rather than substrings (the test is whether re.search for each is not None). Default is False, i.e. substrings.
- encoding - lets us deal with bytes, by saying "if you see a bytes haystack or needle, decode it using this encoding". Defaults to utf-8.
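For intuition, a minimal sketch of what both contains_ functions do under the documented parameters (not the actual implementation; the bytes-decoding behaviour is omitted):

```python
import re

def contains_all_of_sketch(haystack, needles, case_sensitive=True, regexp=False):
    # minimal sketch: does haystack contain every needle?
    if not case_sensitive:  # lowercase both sides before testing
        haystack = haystack.lower()
        needles = [needle.lower() for needle in needles]
    for needle in needles:
        if regexp:
            found = re.search(needle, haystack) is not None
        else:
            found = needle in haystack
        if not found:
            return False
    return True

# contains_any_of would be the same loop with any-match semantics:
# return True as soon as one needle is found.
```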
Given a string and a list of strings, returns whether the former contains at least one of the strings in the latter, e.g. contains_any_of('microfishes', ['mikrofi','microfi','fiches']) == True

Parameters:
- haystack:str - the string to search in
- needles:List[str] - the things to look for. Note that if you use regexp=True with case_sensitive=False, each regexp gets lowercased before compilation, which may not always be correct.
- case_sensitive - if False, lowercases haystack and needles before testing. Defaults to True.
- regexp - treat needles as regexps rather than substrings (the test is whether re.search for each is not None). Default is False, i.e. substrings.
- encoding - lets us deal with bytes, by saying "if you see a bytes haystack or needle, decode it using this encoding". Defaults to utf-8.
Calls count_normalized() with normalize_func=lambda s: s.lower(), which means it counts strings case-insensitively, but reports the most common capitalisation.
Explicitly writing a function for such singular use is almost pointless, yet this seems like a common case and saves some typing.

Parameters:
- strings:List[str]
- min…
- min…
- stopwords…
- stopwords… - Undocumented
- **kwargs - Undocumented

Returns: a { string: count } dict, as from count_normalized()
Takes a list of strings, returns a string:count dict, with some extra processing.
Parameters beyond normalize_func are mostly about removing things you would probably want removed anyway, so you do not have to do that separately.
Note that if you are using spacy or other POS tagging anyway, filtering e.g. just nouns and such before handing it into this is a lot cleaner and easier (if a little slower).
CONSIDER:
- imitating wordcloud collocations= behaviour
- imitating wordcloud normalize_plurals=True
- imitating wordcloud include_numbers=False
- separating out different parts of these behaviours
Parameters:
- strings:List[str] - a list of strings, the thing we count
- min…:int
- min…
- normalize_func - half the point of this function. Should be a str -> str function.
- stopwords…
- stopwords…

Returns: a { string: count } dict
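A sketch of the core idea - counting under a normalized form while reporting the most common original spelling - assuming only the documented normalize_func and stopwords behaviour (the min… and other parameters are omitted):

```python
from collections import Counter, defaultdict

def count_normalized_sketch(strings, normalize_func=lambda s: s, stopwords=()):
    # count under the normalized form, but remember the original spellings
    # so we can report the most common one (e.g. the most common
    # capitalisation when normalize_func is str.lower)
    norm_counts = Counter()
    spellings = defaultdict(Counter)
    for s in strings:
        norm = normalize_func(s)
        if norm in stopwords:  # assumption: stopwords are matched post-normalization
            continue
        norm_counts[norm] += 1
        spellings[norm][s] += 1
    return {spellings[norm].most_common(1)[0][0]: count
            for norm, count in norm_counts.items()}

# count_normalized_sketch(['de', 'De', 'de', 'Het'], normalize_func=str.lower)
# == {'de': 3, 'Het': 1}
```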
Count the unicode categories within the given string - and also simplify that.
For reference:
- Lu - uppercase letter
- Ll - lowercase letter
- Lt - titlecase letter
- Lm - modifier letter
- Lo - other letter
- Mn - nonspacing mark
- Mc - spacing combining mark
- Me - enclosing mark
- Nd - number: decimal digit
- Nl - number: letter
- No - number: other
- Pc - punctuation: connector
- Pd - punctuation: dash
- Ps - punctuation: open
- Pe - punctuation: close
- Pi - punctuation: initial quote (may behave like Ps or Pe depending on usage)
- Pf - punctuation: final quote (may behave like Ps or Pe depending on usage)
- Po - punctuation: other
- Sm - math symbol
- Sc - currency symbol
- Sk - modifier symbol
- So - other symbol
- Zs - space separator
- Zl - line separator
- Zp - paragraph separator
- Cc - control character
- Cf - format character
- Cs - surrogate codepoint
- Co - private use character
- Cn - character not assigned
Parameters:
- string:str - the string to look in
- nfc…:bool - whether to first do a normalization (one that e.g. merges diacritics into the letters they are on)

Returns: two dicts, one counting the unicode categories per character, one simplified creatively (e.g. for count_unicode_categories('Fisher 99 ∢ 쎩 🧀')).
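The per-character counting maps directly onto the standard unicodedata module; a sketch of just the first returned dict (the parameter name nfc_first and the simplified second dict are assumptions):

```python
import unicodedata
from collections import Counter

def count_unicode_categories_sketch(string, nfc_first=True):
    # nfc_first is a hypothetical name for the documented nfc… parameter:
    # NFC normalization merges decomposed diacritics into their base letters
    if nfc_first:
        string = unicodedata.normalize('NFC', string)
    return dict(Counter(unicodedata.category(char) for char in string))

# count_unicode_categories_sketch('Fisher 99')
# == {'Lu': 1, 'Ll': 5, 'Zs': 1, 'Nd': 2}
```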
Matches substrings/regexps, and for each match also gives some of the text context (on a character-amount basis).
For example:
list(findall_with_context(" a ", "I am a fork and a spoon", 5))
would return:
[('I am', ' a ', <re.Match object; span=(4, 7), match=' a '>, 'fork '), ('k and', ' a ', <re.Match object; span=(15, 18), match=' a '>, 'spoon')]
Parameters:
- pattern:str - the regex (/string) to look for
- s:str - the string to find things in
- context…:int - amount of context, in number of characters

Returns: a generator that yields 4-tuples: (text before the match, the matched text, the match object, text after the match)
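The behaviour in the example can be sketched with re.finditer (a rough equivalent, not the actual implementation):

```python
import re

def findall_with_context_sketch(pattern, s, context_amount):
    # yields (before, matched_text, match_object, after) for each match,
    # with up to context_amount characters on either side
    for match in re.finditer(pattern, s):
        start, end = match.span()
        yield (s[max(0, start - context_amount):start],
               match.group(),
               match,
               s[end:end + context_amount])
```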
Returns whether the string contains at least one lowercase letter (that is, one that would change when calling upper())
Does this string contain at least something we can consider text? Based on unicode codepoint categories - see count_unicode_categories
Parameters:
- string:str - the text to count in
- mincount:int - how many text-like characters to demand

Returns: True or False
Given a Dutch ordinal word, gives the integer it represents (for 0..99).

Parameters:
- string:str - the ordinal as text

Returns: the integer, e.g. interpret_ordinal_nl('eerste') == 1
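At its simplest this is a lookup; a sketch covering only 0..10, where the real function handles 0..99:

```python
_ORDINALS_NL = {
    'nulde': 0, 'eerste': 1, 'tweede': 2, 'derde': 3, 'vierde': 4,
    'vijfde': 5, 'zesde': 6, 'zevende': 7, 'achtste': 8, 'negende': 9,
    'tiende': 10,
}

def interpret_ordinal_nl_sketch(string):
    # raises KeyError for words outside this small table
    return _ORDINALS_NL[string.strip().lower()]
```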
Returns whether the characters of the string that are 0123456789, -, or space make up more than a threshold fraction of the entire string length.
Meant to help ignore serial numbers and such.

Parameters:
- string:str - the text to look in
- threshold - if more than this fraction of the characters is digits (or the other mentioned characters), we return True

Returns: whether it's mostly numbers
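A sketch of that fraction test; the default threshold value here is an assumption:

```python
def is_mainly_numeric_sketch(string, threshold=0.8):
    # threshold=0.8 is an assumed default, not taken from the source
    if len(string) == 0:
        return False
    numberlike = sum(1 for char in string if char in '0123456789- ')
    return (numberlike / len(string)) > threshold
```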
Does this string contain _only_ something we can probably consider a number? That is, [0-9.,] with optional whitespace around it.

Parameters:
- string:str - the string to look in
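That amounts to a single anchored regexp, roughly:

```python
import re

def is_numeric_sketch(string):
    # only [0-9.,], with optional whitespace around it
    return re.fullmatch(r'\s*[0-9.,]+\s*', string) is not None
```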
Takes a string, figures out the n-grams, and returns a dict from n-gram strings to how often they occur in this string.

Parameters:
- string:str - the string to count n-grams from
- gramlens:List[int] - list of lengths you want (defaults to (2,3,4): 2-grams, 3-grams, and 4-grams)
- splitfirst:bool - is here if you want to apply it to words - that is, do a (dumb) split first so that we don't collect n-grams across word boundaries

Returns: a dict mapping n-gram string to occurrence count
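A sketch, assuming gramlens and splitfirst behave as documented:

```python
from collections import Counter

def ngram_count_sketch(string, gramlens=(2, 3, 4), splitfirst=False):
    counts = Counter()
    # splitfirst: do a dumb whitespace split first, so that n-grams
    # do not cross word boundaries
    for part in (string.split() if splitfirst else [string]):
        for n in gramlens:
            for i in range(len(part) - n + 1):
                counts[part[i:i + n]] += 1
    return dict(counts)

# ngram_count_sketch('fish', gramlens=(3,)) == {'fis': 1, 'ish': 1}
```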
Gives all n-grams of a specific length. Generator function. Quick and dirty version.
Treats input as sequence, so you can be creative and e.g. give it lists of strings (e.g. already-split words from sentences)
Parameters:
- string:str - the string to take slices of
- n:int - the size, the n in n-gram

Returns: a generator that yields all the n-grams
Score by overlapping n-grams (outputs of ngram_count())
Parameters:
- count…:dict - one dict of counts, e.g. from ngram_count
- count…:dict - another dict of counts, e.g. from ngram_count

Returns: a fraction: the number of matches divided by the total number of n-grams
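One plausible reading of that scoring, sketched; the exact weighting in the real function may differ:

```python
def ngram_matches_sketch(count_1, count_2):
    # n-grams present in both dicts (counting repeats), divided by the
    # total number of n-grams seen; this weighting is an assumption
    matches = sum(min(count, count_2[gram])
                  for gram, count in count_1.items() if gram in count_2)
    total = sum(count_1.values()) + sum(count_2.values())
    if total == 0:
        return 0.0
    return matches / total
```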
Scores each item in the string list option_strings by how well it matches the string string, counting matching n-gram strings (with n in 1..4); more matching n-grams means a higher score:
ngram_sort_by_matches( 'for', ['spork', 'knife', 'spoon', 'fork']) == ['fork', 'spork', 'knife', 'spoon']
Note that if you pick the first, this is effectively a "which one is the closest string?" function
Parameters:
- string:str - the string to be most similar to
- option_strings:List[str] - the string list to sort by similarity
- gramlens:List[int] - the n-gram lengths to use; defaults to (1,2,3,4), it may be a little faster to do (1,2,3)
- with_scores:bool - if False, returns a list of strings. If True, returns a list of (string, score) tuples.

Returns: a list of strings, or of (string, score) tuples if with_scores==True
Takes a list of strings, returns one without duplicates, keeping the first of each (so unlike a plain set(strlist), it keeps the order of what we keep). Not the fastest implementation.

Parameters:
- strlist:List[str] - the list of strings to work on
- case…:bool - if False, deduplicates case-insensitively, and keeps the _first_ casing it saw
- remove…:bool - remove list elements that are None instead of a string

Returns: a list of strings
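A minimal sketch of the order-preserving deduplication; the full parameter names (case_sensitive, remove_none) are assumptions based on the truncated names above:

```python
def ordered_unique_sketch(strlist, case_sensitive=True, remove_none=True):
    seen = set()
    result = []
    for s in strlist:
        if s is None:
            if not remove_none:
                result.append(s)  # assumption: Nones are passed through as-is
            continue
        key = s if case_sensitive else s.lower()
        if key not in seen:
            seen.add(key)
            result.append(s)  # keep the first form we saw
    return result
```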
Given a number, gives the Dutch ordinal word for it (0..99).

Parameters:
- integer:int - the number, as an int

Returns: that number as an ordinal word in a string, e.g. ordinal_nl(1) == 'eerste'
Removes 'de', 'het', and 'een' as words from the start of a string - meant to help normalize phrases.

Parameters:
- string
- remove…
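A sketch using a single anchored regexp; the function name here and the case-insensitivity are assumptions:

```python
import re

def remove_articles_nl_sketch(string):
    # strip one leading 'de', 'het', or 'een' as a whole word;
    # whether the real function ignores case is an assumption
    return re.sub(r'^\s*(de|het|een)\s+', '', string, flags=re.IGNORECASE)

# remove_articles_nl_sketch('de wet') == 'wet'
```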
Unicode-decomposes, removes combining characters, then unicode-composes. Note that not everything next to a letter is considered a diacritic.
Parameters:
- string:str - the string to work on

Returns: a string where diacritics on characters have been removed, e.g. remove_diacritics('olé') == 'ole'
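That decompose / strip / recompose pipeline maps onto the standard unicodedata module; the real function reportedly considers a wider set of combining ranges (see the _re… variable), but the core looks like:

```python
import unicodedata

def remove_diacritics_sketch(string):
    # NFD-decompose, drop combining marks (category Mn), then NFC-recompose
    decomposed = unicodedata.normalize('NFD', string)
    stripped = ''.join(char for char in decomposed
                       if unicodedata.category(char) != 'Mn')
    return unicodedata.normalize('NFC', stripped)

# remove_diacritics_sketch('olé') == 'ole'
```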
Removes strings from the start of a string, based on a list of regexps.

Parameters:
- string:str
- remove…
- flags
Removes unicode characters within private use areas, because they have no semantic meaning (U+E000 through U+F8FF, U+F0000 through U+FFFFD, U+100000 to U+10FFFD).
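The listed ranges translate directly into one regular expression:

```python
import re

# the three Private Use Areas listed above
_PRIVATE_USE_RE = re.compile(
    '[\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD]'
)

def remove_privateuse_sketch(string):
    return _PRIVATE_USE_RE.sub('', string)
```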
Splits a string into words. _Very_ basic - splits on and swallows symbols and such.
Real NLP tokenizers are often more robust, but for a quick test we can avoid a big dependency (and sometimes execution slowness).

Parameters:
- text - a single string

Returns: a list of words
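One way to do such a basic split-and-swallow, sketched:

```python
import re

def simple_tokenize_sketch(text):
    # split on runs of non-word characters, swallowing the symbols themselves
    return [word for word in re.split(r'\W+', text) if word]

# simple_tokenize_sketch('a quick (brown) fox!') == ['a', 'quick', 'brown', 'fox']
```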
Replaces newlines with spaces, squeezes multiple spaces into one, then strip()s the whole. May e.g. be useful to push strings into functions that trip over newlines, series of newlines, or series of spaces.
WARNING: don't use this when you wanted to preserve empty lines.

Parameters:
- string:str - the string you want less whitespace in

Returns: that string with less whitespace
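Which is essentially one substitution:

```python
import re

def simplify_whitespace_sketch(string):
    # newlines (and any other whitespace runs) become a single space
    return re.sub(r'\s+', ' ', string).strip()

# simplify_whitespace_sketch('a\n\n  b') == 'a b'
```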
Helper for contains_any_of and contains_all_of. See the docstrings for both.

Parameters:
- haystack:str - Undocumented
- needles:List[str] - Undocumented
- case_sensitive - Undocumented
- regexp - Undocumented
- encoding - Undocumented
- matchall - Undocumented
Helps remove diacritics - lists a number of combining (but not actually combin*ed*) character ranges in unicode, since you often want to remove these (after decomposition).