module documentation

A module to create wordcloud images.

This is _mostly_ a thin wrapper module around an existing wordcloud module that does the actual job of creating the image.

That image will look a lot cleaner if you clean up the string:count data first. The module we use does some of that -- though it does everything at the same time, in one big class.

This wrapper module exists largely to separate out some of those parts, which gives more flexibility in how we count terms, and also makes that more flexible counting reusable for things that are _not_ wordclouds.

Note that much of that code actually lives in wetsuite.helpers.strings, so this module is still mostly glue, plus some convenience functions that save typing and use our _own_ defaults.

Function count_from_string - Take string, tokenize, count, return word:count dict.
Function count_from_stringlist - Takes a list of strings (e.g. a document that you have already tokenized into words; if you want us to do that for you, look at count_from_string).
Function merge_counts - Take a sequence of string-to-count dicts, add counts together into one dict.
Function wordcloud_from_freqs - Takes a {string: count} dict, returns a PIL image object with a wordcloud.
Function wordcloud_from_string - Convenience function - works from a single input string, which it assumes it first has to tokenize and then count.
Function wordcloud_from_stringlist - Convenience function - works from a list of strings, which it assumes are already tokens that only need counting.
def count_from_string(s: str, tokenizer=wetsuite.helpers.strings.simple_tokenize, stopwords=(), stopwords_i=()):

Take string, tokenize, count, return word:count dict.

Parameters
s: str - the string to work on
tokenizer - function to tokenize with
stopwords - sequence of strings to remove (case sensitive)
stopwords_i - sequence of strings to remove (case insensitive)
Returns
a dict of string to count
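
The tokenize, filter, count steps described above can be sketched with the standard library alone. This is a minimal illustration, not wetsuite's implementation: `simple_tokenize_sketch` and `count_from_string_sketch` are hypothetical names, the regex tokenizer is an assumed stand-in for wetsuite.helpers.strings.simple_tokenize, and the sketch counts exact token strings because the case handling of count_from_string is not stated here.

```python
import re
from collections import Counter


def simple_tokenize_sketch(text):
    """Hypothetical stand-in tokenizer: split into runs of word characters."""
    return re.findall(r"\w+", text)


def count_from_string_sketch(s, tokenizer=simple_tokenize_sketch,
                             stopwords=(), stopwords_i=()):
    """Tokenize s, drop stopwords, and return a {token: count} dict."""
    stop = set(stopwords)                        # compared case-sensitively
    stop_i = {w.lower() for w in stopwords_i}    # compared case-insensitively
    counts = Counter()
    for token in tokenizer(s):
        if token in stop or token.lower() in stop_i:
            continue
        counts[token] += 1
    return dict(counts)


counts = count_from_string_sketch("the cat and the hat", stopwords=["and"])
print(counts)  # {'the': 2, 'cat': 1, 'hat': 1}
```

Because the tokenizer is just a callable, any function that takes a string and returns a list of strings (for example `str.split`) could be passed in its place.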
def count_from_stringlist(string_list: list[str], stopwords=(), stopwords_i=()):

Takes a list of strings (e.g. a document that you have already tokenized into words; if you want us to do that for you, look at count_from_string).

Always counts in a way that is case insensitive, and then reports each word under the most common capitalisation it saw. If you want control over the counting, do it yourself and pass the result to wordcloud_from_freqs.

Parameters
string_list: list[str] - list of strings, the input to work on
stopwords - sequence of strings to remove (case sensitive)
stopwords_i - sequence of strings to remove (case insensitive)
Returns
a dict of string to count
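
To make "case insensitive, then most common capitalisation" concrete, here is a minimal sketch of one way such counting could work. It is an assumption about the behaviour described above, not a copy of wetsuite's code, and `count_case_insensitive_sketch` is a hypothetical name.

```python
from collections import Counter, defaultdict


def count_case_insensitive_sketch(strings, stopwords=(), stopwords_i=()):
    """Count strings ignoring case; report each under its most common spelling."""
    stop = set(stopwords)                        # compared case-sensitively
    stop_i = {w.lower() for w in stopwords_i}    # compared case-insensitively
    # lowercased form -> Counter of the exact spellings seen for it
    variants = defaultdict(Counter)
    for s in strings:
        if s in stop or s.lower() in stop_i:
            continue
        variants[s.lower()][s] += 1
    result = {}
    for spellings in variants.values():
        most_common_spelling = spellings.most_common(1)[0][0]
        result[most_common_spelling] = sum(spellings.values())
    return result


tokens = ["The", "the", "the", "Court", "court", "Court", "of"]
print(count_case_insensitive_sketch(tokens, stopwords_i=["OF"]))
# {'the': 3, 'Court': 3}
```

Keeping a per-spelling counter costs a little memory, but it lets the result show "Court" rather than "court" when the capitalised form dominates the input.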
def merge_counts(count_dicts: list[dict]):

Take a sequence of string-to-count dicts, add counts together into one dict.
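
As an illustration of what adding counts together means, the following sketch merges count dicts with `collections.Counter`. `merge_counts_sketch` is a hypothetical name and this is an assumed equivalent, not necessarily how merge_counts is implemented.

```python
from collections import Counter


def merge_counts_sketch(count_dicts):
    """Add several {string: count} dicts together into one dict."""
    total = Counter()
    for counts in count_dicts:
        total.update(counts)  # Counter.update adds to existing counts
    return dict(total)


per_document = [{"wet": 2, "artikel": 1}, {"wet": 3}, {"lid": 4}]
print(merge_counts_sketch(per_document))
# {'wet': 5, 'artikel': 1, 'lid': 4}
```

This is useful when counts are produced per document and a single wordcloud should cover the whole collection.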

def wordcloud_from_freqs(freqs: dict, width: int = 1200, height: int = 300, background_color='white', min_font_size=10, **kwargs):

Takes a {string: count} dict, returns a PIL image object with a wordcloud.

Parameters
freqs: dict - a dict of word:count
width: int - image width, in pixels
height: int - image height, in pixels
background_color - the color to use for the background; defaults to white
min_font_size - the smallest font size to use; the layout stops adding words once no more fit at this size
**kwargs - any other keyword arguments are passed through to wordcloud.WordCloud
Returns
a PIL image (you can e.g. display() or .save() this)
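
For readers who want to see roughly what such a wrapper does, here is a hedged sketch built directly on the third-party wordcloud package. It assumes that package is installed and uses its `WordCloud` class with `generate_from_frequencies` and `to_image`. The names `wordcloud_options` and `wordcloud_from_freqs_sketch` are hypothetical, and the defaults simply mirror the signature above.

```python
def wordcloud_options(width=1200, height=300, background_color="white",
                      min_font_size=10, **kwargs):
    """Combine the wrapper's defaults with any extra keyword arguments."""
    options = {
        "width": width,
        "height": height,
        "background_color": background_color,
        "min_font_size": min_font_size,
    }
    options.update(kwargs)  # extra arguments are passed through untouched
    return options


def wordcloud_from_freqs_sketch(freqs, **kwargs):
    """Render a {string: count} dict as a PIL image (needs the wordcloud package)."""
    import wordcloud  # third-party; imported lazily so the helper above works without it

    cloud = wordcloud.WordCloud(**wordcloud_options(**kwargs))
    cloud.generate_from_frequencies(freqs)
    return cloud.to_image()
```

With the package installed, `wordcloud_from_freqs_sketch({"wet": 5, "lid": 4}, max_words=50).save("cloud.png")` would write the image to disk; the filename is only an example.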
def wordcloud_from_string(s: str, tokenizer=wetsuite.helpers.strings.simple_tokenize, counter=wetsuite.helpers.strings.count_case_insensitive, **kwargs):

Convenience function - works from a single input string, which it assumes it first has to tokenize and then count.

Defaults to simple tokenizing, and case-insensitive counting.

Parameters
s: str - string to work on
tokenizer - function to tokenize with
counter - function to count with
**kwargs - any other keyword arguments are passed through (probably to wordcloud.WordCloud)
Returns
a PIL Image
def wordcloud_from_stringlist(string_list: list[str], counter=wetsuite.helpers.strings.count_case_insensitive, **kwargs):

Convenience function - works from a list of strings, which it assumes are already tokens that only need counting.

Defaults to case-insensitive counting.

Parameters
string_list: list[str] - string list to work on
counter - function to count with
**kwargs - any other keyword arguments are passed through (probably to wordcloud.WordCloud)
Returns
a PIL Image