wetsuite.extras.word_cloud

module documentation

A module to create wordcloud images.

This is _mostly_ a thin wrapper module around an existing wordcloud module that does the actual job of creating the image.

The image will look a good deal cleaner if you clean up the string:count data first. The module we use does some of that, though it does so at the same time as everything else, in one big class.

This wrapper module exists largely to separate out some of those parts, which gives more flexibility in how we count terms, and also makes that more flexible counting reusable for things that are _not_ wordclouds.

Note that much of that code actually lives in our wetsuite.helpers.strings, so this module is still mostly glue, plus some convenience functions that save typing and use our _own_ defaults.

Function	`count_from_string`	Take string, tokenize, count, return word:count dict.
Function	`count_from_stringlist`	Takes a list of strings (e.g. a document that you have already tokenized into words; if you want us to do that for you, look at count_from_string).
Function	`merge_counts`	Take a sequence of string-to-count dicts, add counts together into one dict
Function	`wordcloud_from_freqs`	Takes a {string: count} dict, returns a PIL image object with a wordcloud.
Function	`wordcloud_from_string`	Convenience function: works from a single string, which it assumes it first has to tokenize and then count the tokens of.
Function	`wordcloud_from_stringlist`	Convenience function: works from a list of strings, which it assumes you have already tokenized, so it only has to count.

def count_from_string(s: str, tokenizer=wetsuite.helpers.strings.simple_tokenize, stopwords=(), stopwords_i=()):

Take string, tokenize, count, return word:count dict.

Parameters
s:`str`	the string to work on
tokenizer	function to tokenize with.
stopwords	sequence of strings to remove (case sensitive)
stopwords_i	sequence of strings to remove (case insensitive)
Returns
a dict of string to count
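
As an illustration of the tokenize, filter, count steps described above, here is a minimal standalone sketch. It is not the wetsuite implementation: `sketch_tokenize` and `sketch_count_from_string` are hypothetical names, the regex tokenizer is only a stand-in for simple_tokenize, and the sketch counts tokens exactly as they appear, which may differ from what the real function does with capitalisation.

```python
import re
from collections import Counter


def sketch_tokenize(s):
    """Hypothetical stand-in tokenizer: return runs of word characters."""
    return re.findall(r"\w+", s)


def sketch_count_from_string(s, tokenizer=sketch_tokenize, stopwords=(), stopwords_i=()):
    """Tokenize s, drop stopwords, and return a {token: count} dict (illustrative only)."""
    stop_exact = set(stopwords)                        # compared case-sensitively
    stop_lower = {w.lower() for w in stopwords_i}      # compared case-insensitively
    counts = Counter()
    for token in tokenizer(s):
        if token in stop_exact or token.lower() in stop_lower:
            continue
        counts[token] += 1
    return dict(counts)


counts = sketch_count_from_string(
    "The cat and the hat and THE bat",
    stopwords=("and",),
    stopwords_i=("the",),
)
print(counts)
```

The point of the two stopword parameters is visible here: "and" is removed only in that exact spelling, while "the" is removed in any capitalisation.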

def count_from_stringlist(string_list: list[str], stopwords=(), stopwords_i=()):

Takes a list of strings (e.g. a document that you have already tokenized into words; if you want us to do that for you, look at count_from_string).

Always counts case-insensitively, and reports each word under the most common capitalisation it saw. If you want control over the counting, do the counting yourself and pass the result to wordcloud_from_freqs.

Parameters
string_list:`list[str]`	list of strings, the input to work on
stopwords	sequence of strings to remove (case sensitive)
stopwords_i	sequence of strings to remove (case insensitive)
Returns
a dict of string to count
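
The idea of counting case-insensitively while keeping the most common capitalisation can be sketched as follows. This is an assumption about the general technique, not a copy of the code in wetsuite.helpers.strings; `sketch_count_case_insensitive` is a hypothetical name.

```python
from collections import Counter, defaultdict


def sketch_count_case_insensitive(strings):
    """Count strings ignoring case; key each total by the spelling seen most often."""
    # lowercased form -> Counter of the exact spellings seen for it
    variants = defaultdict(Counter)
    for s in strings:
        variants[s.lower()][s] += 1

    result = {}
    for spellings in variants.values():
        # most_common(1) gives the most frequent spelling (first seen wins on ties)
        preferred, _ = spellings.most_common(1)[0]
        result[preferred] = sum(spellings.values())
    return result


counts = sketch_count_case_insensitive(["Wet", "wet", "wet", "Besluit", "WET"])
print(counts)
```

Here "Wet", "wet" and "WET" are added together into a single entry, filed under "wet" because that spelling occurred most often.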

def merge_counts(count_dicts: list[dict]):

Take a sequence of string-to-count dicts, add counts together into one dict
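
A minimal sketch of what adding counts together means, using only the standard library. `sketch_merge_counts` is a hypothetical name, and the real function may differ in details such as its return type.

```python
from collections import Counter


def sketch_merge_counts(count_dicts):
    """Add several {string: count} dicts together into one dict (illustrative only)."""
    total = Counter()
    for d in count_dicts:
        total.update(d)  # Counter.update adds to existing counts instead of replacing them
    return dict(total)


merged = sketch_merge_counts([
    {"wet": 2, "artikel": 1},
    {"wet": 3, "besluit": 4},
])
print(merged)
```

This is useful when you count per document and want one combined wordcloud for a whole set.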

def wordcloud_from_freqs(freqs: dict, width: int = 1200, height: int = 300, background_color='white', min_font_size=10, **kwargs):

Takes a {string: count} dict, returns a PIL image object with a wordcloud.

Parameters
freqs:`dict`	a dict of word:count
width:`int`	image width, in pixels
height:`int`	image height, in pixels
background_color	the color to use for the background; defaults to white.
min_font_size	smallest font size to use; no word is drawn smaller than this, and layout stops once nothing more fits at this size.
**kwargs	any other keyword arguments are passed through to wordcloud.WordCloud
Returns
a PIL image (you can e.g. display() or .save() this)
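
To show roughly how a thin wrapper with these defaults could hand its arguments to wordcloud.WordCloud, here is a hedged sketch. It assumes the third-party wordcloud package is installed, and `sketch_wordcloud_from_freqs` is a hypothetical name, not the actual wetsuite code.

```python
def sketch_wordcloud_from_freqs(freqs, width=1200, height=300,
                                background_color="white", min_font_size=10, **kwargs):
    """Render a {string: count} dict as a PIL image (illustrative wrapper only)."""
    # Imported lazily so this sketch can be defined without the dependency installed.
    import wordcloud

    wc = wordcloud.WordCloud(
        width=width,
        height=height,
        background_color=background_color,
        min_font_size=min_font_size,
        **kwargs,  # anything else goes straight through to WordCloud
    )
    wc.generate_from_frequencies(freqs)
    return wc.to_image()


# Possible usage, assuming wordcloud is installed:
#   image = sketch_wordcloud_from_freqs({"wet": 5, "artikel": 1, "besluit": 4})
#   image.save("cloud.png")
```

Keeping the counting outside this function is the design choice the module description argues for: any {string: count} dict works as input, however it was produced.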

def wordcloud_from_string(s: str, tokenizer=wetsuite.helpers.strings.simple_tokenize, counter=wetsuite.helpers.strings.count_case_insensitive, **kwargs):

Convenience function: works from a single string, which it assumes it first has to tokenize and then count the tokens of.

Defaults to simple tokenizing, and case-insensitive counting.

Parameters
s:`str`	string to work on
tokenizer	function to tokenize with.
counter	function to count with
**kwargs	any other keyword arguments are passed through (probably to Wordcloud)
Returns
a PIL Image

def wordcloud_from_stringlist(string_list: list[str], counter=wetsuite.helpers.strings.count_case_insensitive, **kwargs):

Convenience function: works from a list of strings, which it assumes you have already tokenized, so it only has to count.

Defaults to case-insensitive counting.

Parameters
string_list:`list[str]`	string list to work on
counter	function to count with
**kwargs	any other keyword arguments are passed through (probably to Wordcloud)
Returns
a PIL Image

wetsuite.extras.word_cloud