wetsuite.helpers.collocation.Collocation

class documentation

class Collocation:

A basic collocation calculator class.

Method	`__init__`	connectors takes a list of words that are removed when they appear at the _edge_ of an n-gram (for n > 1), but are left in place when they appear inside one (which is only possible for n >= 3)
Method	`add_gram`	Used by consume_tokens; you typically should not need to call this yourself
Method	`add_uni`	Used by consume_tokens; you typically should not need to call this yourself
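Method	`cleanup_ngrams`	Judging by its name and its mincount parameter, removes n-grams that appear fewer than mincount times (by default: just once). CONSIDER: allow a different threshold for each length, e.g. via a list for mincount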
Method	`cleanup_ngrams_func`	Remove n-grams for which the given function returns true
Method	`cleanup_unigrams`	Remove unigrams that are rare; by default, those that appear just once. You may wish to increase this threshold. Ideally we would also remove all n-grams that use them, but it is faster to waste the memory and leave them there.
Method	`cleanup_unigrams_func`	Remove unigrams for which the given function returns true
Method	`consume_tokens`	Takes a list of string tokens. Counts the unigrams and n-grams in it, for the given values of n.
Method	`counts`	Returns counts of tokens, unigrams, and n-grams (n >= 2)
Method	`score_ngrams`	Takes the counts already collected and returns a list of scored n-gram items
Instance Variable	`connectors`	Undocumented
Instance Variable	`grams`	Undocumented
Instance Variable	`saw_tokens`	Undocumented
Instance Variable	`uni`	Undocumented

def __init__(self, connectors=()):

connectors takes a list of words that are removed when they appear at the _edge_ of an n-gram (for n > 1), but are left in place when they appear inside one (which is only possible for n >= 3).
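The edge rule can be illustrated with a small standalone sketch. It does not use the class itself, and it assumes one reading of "removed": an n-gram that starts or ends with a connector word is skipped entirely. The function name `ngrams_skipping_edge_connectors` is hypothetical.

```python
def ngrams_skipping_edge_connectors(tokens, n, connectors=()):
    """Yield n-grams as tuples, skipping any that start or end with a connector.

    Assumption: this mirrors the documented edge rule, not the library's code.
    """
    connectors = set(connectors)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        # Only the first and last token are checked, and only for n > 1,
        # so a connector in the middle of a trigram is left alone.
        if n > 1 and (gram[0] in connectors or gram[-1] in connectors):
            continue
        yield gram


tokens = ["raad", "van", "state", "besluit"]
bigrams = list(ngrams_skipping_edge_connectors(tokens, 2, connectors=("van",)))
trigrams = list(ngrams_skipping_edge_connectors(tokens, 3, connectors=("van",)))
```

Here both bigrams containing "van" are skipped, while the trigram with "van" in the middle survives.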

def add_gram(self, strtup, cnt=1):

Used by consume_tokens; you typically should not need to call this yourself.

def add_uni(self, s, cnt=1):

Used by consume_tokens; you typically should not need to call this yourself.

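def cleanup_ngrams(self, mincount=2):

Judging by its name and its mincount parameter, removes n-grams that appear fewer than mincount times (by default: just once).

CONSIDER: allow a different threshold for each length, e.g. via a list for mincount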

def cleanup_ngrams_func(self, bad_func):

Remove n-grams for which the given function returns true

def cleanup_unigrams(self, mincount=2):

Remove unigrams that are rare; by default, those that appear just once. You may wish to increase this threshold. Ideally we would also remove all n-grams that use them, but it is faster to waste the memory and leave them there.
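As a rough illustration of the thresholding described here, the following standalone sketch drops entries from a `collections.Counter`. It assumes the cutoff is "count below mincount", which matches the default of 2 removing words seen just once; `drop_rare` is a hypothetical helper, not part of the class.

```python
from collections import Counter


def drop_rare(counter, mincount=2):
    """Remove, in place, every entry whose count is below mincount."""
    # Collect the keys first, so the Counter is not modified while iterating.
    for key in [k for k, c in counter.items() if c < mincount]:
        del counter[key]


uni = Counter({"de": 5, "raad": 2, "zeldzaam": 1})
drop_rare(uni)  # default mincount=2 removes words seen only once
```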

def cleanup_unigrams_func(self, bad_func):

Remove unigrams for which the given function returns true
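A predicate-based cleanup might look like the sketch below. It is standalone and assumes that the function is called with the unigram string itself; the docs above do not state what argument bad_func receives. `drop_matching` is a hypothetical name.

```python
def drop_matching(counts, bad_func):
    """Remove, in place, every key for which bad_func(key) is true."""
    for key in [k for k in counts if bad_func(k)]:
        del counts[key]


uni = {"de": 5, "2023": 3, "raad": 2}
# Example predicate: treat purely numeric tokens as uninteresting.
drop_matching(uni, str.isdigit)
```

The same helper would work for n-gram counts if the predicate accepts a tuple of strings instead of a single string.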

def consume_tokens(self, token_list, gramlens=(2, 3, 4)):

Takes a list of string tokens. Counts the unigrams and n-grams in it, for the given values of n.
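The counting itself can be sketched with `collections.Counter`. This is a minimal standalone illustration of sliding-window n-gram counting, not the class's implementation. It ignores connectors, and `count_tokens` is a hypothetical name.

```python
from collections import Counter


def count_tokens(token_list, gramlens=(2, 3, 4)):
    """Return (unigram counts, n-gram counts) for the given n-gram lengths."""
    uni = Counter(token_list)
    grams = Counter()
    for n in gramlens:
        # Slide a window of length n over the tokens; n-grams are stored as tuples.
        for i in range(len(token_list) - n + 1):
            grams[tuple(token_list[i:i + n])] += 1
    return uni, grams


uni, grams = count_tokens(["de", "hoge", "raad", "de", "hoge", "raad"], gramlens=(2, 3))
```

Calling such a function once per document or sentence, and adding the results together, would accumulate counts over a corpus.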

def counts(self):

Returns counts of tokens, unigrams, and n-grams (n >= 2)

def score_ngrams(self, method='mik2', sort=True):

Takes the counts already collected and returns a list of items like:

    (string_tuple,              score,   count_combo,  [count, part, ...])

e.g.:

    (('aangetekende', 'brief'), 1085.12, 16, [17, 17])

The scoring logic is currently somewhat arbitrary, and needs work before it is meaningful in a _remotely_ linear way.
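To make the shape of the returned items concrete, here is a standalone sketch that produces tuples in the same layout. The score is a stand-in (observed count divided by the count expected if the words were independent), not the library's 'mik2' method, and sorting highest score first is also an assumption. `score_sketch` is a hypothetical name.

```python
from math import prod


def score_sketch(uni, grams, sort=True):
    """Return items shaped like (string_tuple, score, count_combo, [part counts])."""
    total = sum(uni.values())
    scored = []
    for strtup, count_combo in grams.items():
        part_counts = [uni.get(word, 0) for word in strtup]
        if 0 in part_counts:
            continue  # a part was removed from the unigram counts; skip this n-gram
        # Stand-in score: how much more often the combination occurs than
        # independent words would predict.
        expected = prod(part_counts) / total ** (len(strtup) - 1)
        scored.append((strtup, count_combo / expected, count_combo, part_counts))
    if sort:
        scored.sort(key=lambda item: item[1], reverse=True)
    return scored


uni = {"aangetekende": 17, "brief": 17, "de": 966}
grams = {("aangetekende", "brief"): 16, ("de", "brief"): 1}
result = score_sketch(uni, grams)
```

With these made-up counts, the fixed combination ranks well above the incidental one, though the numeric scores will not match those of the real method.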

connectors

Undocumented

grams

Undocumented

saw_tokens: int

Undocumented

uni

Undocumented