A module to create wordcloud images.
This is _mostly_ a thin wrapper around an existing wordcloud module, which does the actual work of creating the image.
The image looks a lot cleaner when you clean up the string:count data first. The module we use does some of that, but does it all at once, in one big class.
This wrapper exists largely to separate out some of those parts, which gives more flexibility in how we count terms, and also makes that more flexible counting reusable for things that are _not_ wordclouds.
Note that much of that code actually lives in our wetsuite.helpers.strings, so this module is still mostly glue, plus some convenience functions that save typing and use our _own_ defaults instead.
Function | count_from_string |
Take string, tokenize, count, return word:count dict. |
Function | count |
Takes a list of strings (e.g. a document that you have already tokenized into words; if you want us to do that for you, look at count_from_string). |
Function | merge |
Take a sequence of string-to-count dicts, add counts together into one dict |
Function | wordcloud_from_freqs |
Takes a {string: count} dict, returns a PIL image object with a wordcloud. |
Function | wordcloud |
Convenience function: works from a single string, which it assumes it first has to tokenize, and then counts the resulting tokens. |
Function | wordcloud |
Convenience function: works from a list of strings, which it assumes you have already tokenized, so it only has to count them. |
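As a rough illustration of the merge step described in the table above (adding several string-to-count dicts together into one), here is a minimal sketch in plain Python. The name `merge_counts_sketch` is hypothetical and this is not the module's own code; it only shows the kind of result to expect.

```python
from collections import Counter


def merge_counts_sketch(count_dicts):
    """Add up a sequence of {string: count} dicts into one dict.

    Hypothetical stand-in for illustration, not the wetsuite implementation.
    """
    total = Counter()
    for counts in count_dicts:
        # Counter.update adds to existing counts instead of replacing them
        total.update(counts)
    return dict(total)


# e.g. one count dict per document
per_document = [
    {"court": 2, "law": 1},
    {"law": 3, "appeal": 1},
]
merged = merge_counts_sketch(per_document)
print(merged)
```

This is handy when you count documents one at a time and only want a single wordcloud for the whole set.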
def count_from_string(s: str, tokenizer=wetsuite.helpers.strings.simple_tokenize, stopwords=(), stopwords_i=()):
Take string, tokenize, count, return word:count dict.
Parameters | |
s:str | the string to work on |
tokenizer | function to tokenize with. |
stopwords | sequence of strings to remove (case sensitive) |
stopwords_i | sequence of strings to remove (case insensitive) |
Returns | |
a dict of string to count |
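To make the two stopword parameters concrete, here is a small sketch of tokenize, filter, count. Everything in it is an assumption for illustration: `simple_tokenize_sketch` is a regex stand-in rather than wetsuite's own tokenizer, and the counting here is plain case-sensitive counting, which may differ from what the real function does.

```python
import re
from collections import Counter


def simple_tokenize_sketch(text):
    """Stand-in tokenizer (runs of word characters); NOT the wetsuite tokenizer."""
    return re.findall(r"\w+", text)


def count_from_string_sketch(s, tokenizer=simple_tokenize_sketch,
                             stopwords=(), stopwords_i=()):
    """Tokenize, drop stopwords, count. Hypothetical illustration only."""
    exact = set(stopwords)                        # matched case-sensitively
    lowered = {w.lower() for w in stopwords_i}    # matched case-insensitively
    counts = Counter()
    for token in tokenizer(s):
        if token in exact or token.lower() in lowered:
            continue
        counts[token] += 1
    return dict(counts)


counts = count_from_string_sketch(
    "The court and the Court of Appeal",
    stopwords=("of",),
    stopwords_i=("the", "AND"),
)
print(counts)
```

Note how `stopwords=("of",)` would leave a capitalised "Of" in place, while anything listed in `stopwords_i` is removed regardless of case.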
Takes a list of strings (e.g. a document that you have already tokenized into words; if you want us to do that for you, look at count_from_string).
Fixed to count in a way that is case insensitive, and then uses the most common capitalisation it saw. If you want control over the counting, do it yourself and look at wordcloud_from_freqs.
Parameters | |
stringlist:list[str] | list of strings, the input to work on |
stopwords | sequence of strings to remove (case sensitive) |
stopwords_i | sequence of strings to remove (case insensitive) |
Returns | |
a dict of string to count |
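The "count case-insensitively, then report the most common capitalisation" behaviour described above could look roughly like the following. This is a sketch under assumptions, with a hypothetical function name; the real counting lives in wetsuite.helpers.strings and may differ in detail.

```python
from collections import Counter, defaultdict


def count_case_insensitive_sketch(tokens):
    """Count tokens ignoring case; key each total by the spelling seen most often.

    Hypothetical illustration, not the wetsuite implementation.
    """
    # lowercased form -> Counter of the exact spellings that were seen
    variants = defaultdict(Counter)
    for token in tokens:
        variants[token.lower()][token] += 1

    result = {}
    for spellings in variants.values():
        best_spelling, _ = spellings.most_common(1)[0]
        result[best_spelling] = sum(spellings.values())
    return result


counts = count_case_insensitive_sketch(
    ["Court", "court", "Court", "law", "LAW", "law"]
)
print(counts)
```

Here "Court" wins over "court" because it was seen twice, and the reported count is the total over all spellings.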
def wordcloud_from_freqs(freqs: dict, width: int = 1200, height: int = 300, background_color='white', min_font_size=10, **kwargs):
Takes a {string: count} dict, returns a PIL image object with a wordcloud.
Parameters | |
freqs:dict | a dict of word:count |
width:int | image width, in pixels |
height:int | image height, in pixels |
background_color | the color to use for the background; defaults to white. |
min_font_size | minimum font size; no word is drawn smaller than this, which also affects when the layout stops adding words. |
**kwargs | any other keyword arguments are passed through to wordcloud.WordCloud |
Returns | |
a PIL image (you can e.g. display() or .save() this) |
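For orientation, a wrapper like this can be sketched directly on top of the third-party `wordcloud` package. The function name below is hypothetical and the body is an assumption about the general shape, not a copy of this module's code; the `WordCloud` constructor arguments, `generate_from_frequencies` and `to_image` are part of that package.

```python
def wordcloud_from_freqs_sketch(freqs, width=1200, height=300,
                                background_color="white", min_font_size=10,
                                **kwargs):
    """Render a {string: count} dict to a PIL image. Hypothetical illustration."""
    # Imported here so that this sketch can be defined without the package installed
    from wordcloud import WordCloud

    wc = WordCloud(
        width=width,
        height=height,
        background_color=background_color,
        min_font_size=min_font_size,
        **kwargs,  # anything else goes straight to WordCloud
    )
    wc.generate_from_frequencies(freqs)
    return wc.to_image()
```

With the package installed, something like `wordcloud_from_freqs_sketch({"law": 3, "court": 2}).save("cloud.png")` would write the image to disk.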
(s: str, tokenizer=wetsuite.helpers.strings.simple_tokenize, counter=wetsuite.helpers.strings.count_case_insensitive, **kwargs):
Convenience function: works from a single string, which it assumes it first has to tokenize, and then counts the resulting tokens.
Defaults to simple tokenizing, and case-insensitive counting.
Parameters | |
s:str | string to work on |
tokenizer | function to tokenize with. |
counter | function to count with |
**kwargs | any other keyword arguments are passed through (probably to wordcloud.WordCloud) |
Returns | |
a PIL Image |
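Because the tokenizer and counter are just functions passed in as arguments, either step can be swapped out. The sketch below shows that pattern up to the point where the counts would be handed to the rendering step. All names in it are hypothetical, and the defaults (`str.split`, `Counter`) are simple stand-ins rather than the wetsuite defaults.

```python
from collections import Counter


def freqs_from_string_sketch(s, tokenizer=str.split, counter=Counter):
    """Tokenize, then count; both steps are plain callables that can be replaced.

    Hypothetical illustration of the pluggable design only.
    """
    return dict(counter(tokenizer(s)))


def lowercase_counter(tokens):
    """Alternative counter that folds everything to lowercase first."""
    return Counter(token.lower() for token in tokens)


text = "Law law LAW court"
default_freqs = freqs_from_string_sketch(text)
folded_freqs = freqs_from_string_sketch(text, counter=lowercase_counter)
print(default_freqs)
print(folded_freqs)
```

The resulting dict is the kind of {string: count} input that the frequency-based wordcloud function expects.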
(stringlist: list[str], counter=wetsuite.helpers.strings.count_case_insensitive, **kwargs):
Convenience function: works from a list of strings, which it assumes you have already tokenized, so it only has to count them.
Defaults to case-insensitive counting.
Parameters | |
stringlist:list[str] | string list to work on |
counter | function to count with |
**kwargs | any other keyword arguments are passed through (probably to wordcloud.WordCloud) |
Returns | |
a PIL Image |