class documentation

class Collocation:


A basic collocation calculator class.

Method __init__ connectors takes a list of words that are removed when they appear at the _edge_ of an n-gram (for n > 1), but are kept when they appear inside it (which is only possible for n >= 3)
Method add_gram Used by consume_tokens; you typically should not need to call this yourself
Method add_uni Used by consume_tokens; you typically should not need to call this yourself
Method cleanup_ngrams CONSIDER: allow a different threshold for each n-gram length, e.g. by passing a list for mincount
Method cleanup_ngrams_func Remove n-grams for which the given function returns true
Method cleanup_unigrams Remove unigrams that are rare; by default, those that appear just once. You may wish to increase this threshold. Ideally we would also remove all n-grams that use them, but it is faster to waste the memory and leave them in place.
Method cleanup_unigrams_func Remove unigrams for which the given function returns true
Method consume_tokens Takes a list of string tokens. Counts unigrams and n-grams from it, for the given values of n.
Method counts Returns counts of tokens, unigrams, and n-grams (n >= 2)
Method score_ngrams Takes the counts gathered so far and returns a list of items like:
Instance Variable connectors Undocumented
Instance Variable grams Undocumented
Instance Variable saw_tokens Undocumented
Instance Variable uni Undocumented
def __init__(self, connectors=()):

connectors takes a list of words that are removed when they appear at the _edge_ of an n-gram (for n > 1), but are kept when they appear inside it (which is only possible for n >= 3)
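
The edge-versus-inside distinction can be illustrated with a small standalone check. This is only a sketch and not the class's own code: `has_edge_connector` and the example connector words are hypothetical, and the documentation does not say whether an n-gram with an edge connector is dropped or trimmed, so the sketch only detects the situation.

```python
def has_edge_connector(ngram, connectors):
    """Return True if the first or last word of the n-gram is a connector.

    Hypothetical helper for illustration; not part of the Collocation class.
    """
    if len(ngram) < 2:
        return False  # the rule is only described for n > 1
    return ngram[0] in connectors or ngram[-1] in connectors


connectors = {"de", "van"}  # assumed example connector words

# Connector at the edge of a bigram: affected by the rule.
edge = has_edge_connector(("van", "brief"), connectors)

# Connector only in the middle of a trigram: left alone.
inner = has_edge_connector(("raad", "van", "state"), connectors)
```

Here `edge` is true and `inner` is false, which matches the description: a connector can only sit strictly inside an n-gram once n is at least 3.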

def add_gram(self, strtup, cnt=1):

Used by consume_tokens; you typically should not need to call this yourself

def add_uni(self, s, cnt=1):

Used by consume_tokens; you typically should not need to call this yourself

def cleanup_ngrams(self, mincount=2):

CONSIDER: allow a different threshold for each n-gram length, e.g. by passing a list for mincount

def cleanup_ngrams_func(self, bad_func):

Remove n-grams for which the given function returns true

def cleanup_unigrams(self, mincount=2):

Remove unigrams that are rare; by default, those that appear just once. You may wish to increase this threshold. Ideally we would also remove all n-grams that use them, but it is faster to waste the memory and leave them in place.
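
A minimal sketch of threshold-based pruning, assuming the counts live in a dict-like mapping from string to count. `prune_rare` is a hypothetical stand-in and not the method's actual implementation; it mirrors the documented default, where mincount=2 removes anything seen only once.

```python
from collections import Counter


def prune_rare(counts, mincount=2):
    """Delete entries whose count is below mincount (hypothetical helper)."""
    # Collect the keys first so the mapping is not modified while iterating.
    rare = [key for key, count in counts.items() if count < mincount]
    for key in rare:
        del counts[key]
    return counts


uni = Counter(["brief", "brief", "aangetekende"])
prune_rare(uni)  # "aangetekende" appears once and is removed
```

As the description notes, n-grams that contain a removed unigram are not touched by this kind of pruning.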

def cleanup_unigrams_func(self, bad_func):

Remove unigrams for which the given function returns true
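
Predicate-based removal can be sketched the same way. The assumption here, which the documentation does not confirm, is that the function is called with the unigram string alone and that a true result means "remove". `prune_by_func` and the sample counts are hypothetical.

```python
def prune_by_func(counts, bad_func):
    """Delete entries for which bad_func(key) is true (hypothetical helper)."""
    # Collect the keys first so the mapping is not modified while iterating.
    bad = [key for key in counts if bad_func(key)]
    for key in bad:
        del counts[key]
    return counts


uni = {"brief": 3, "12": 5, "de": 40}  # made-up counts for illustration
prune_by_func(uni, str.isdigit)        # drops the purely numeric token "12"
```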

def consume_tokens(self, token_list, gramlens=(2, 3, 4)):

Takes a list of string tokens. Counts unigrams and n-grams from it, for the given values of n.
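
The counting step can be sketched with a sliding window over the token list. This is a standalone approximation rather than the method's code: `count_ngrams` is a hypothetical function, it ignores connectors, and it returns fresh counters instead of accumulating into the instance.

```python
from collections import Counter


def count_ngrams(token_list, gramlens=(2, 3, 4)):
    """Count unigrams and n-grams in one token list (hypothetical helper)."""
    uni = Counter(token_list)
    grams = Counter()
    for n in gramlens:
        # Slide a window of width n over the tokens; each window is one n-gram.
        for start in range(len(token_list) - n + 1):
            grams[tuple(token_list[start:start + n])] += 1
    return uni, grams


tokens = ["a", "b", "a", "b", "c"]
uni, grams = count_ngrams(tokens, gramlens=(2, 3))
```

With these tokens the bigram ('a', 'b') is counted twice, and every trigram once. N-grams are stored as tuples of strings, matching the string_tuple shape that score_ngrams reports.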

def counts(self):

Returns counts of tokens, unigrams, and n-grams (n >= 2)

def score_ngrams(self, method='mik2', sort=True):

Takes the counts gathered so far and returns a list of items like:

    (string_tuple,              score,   count_combo,  [count, part, ...])

e.g.:

    (('aangetekende', 'brief'), 1085.12, 16, [17, 17])

The scoring logic is currently somewhat arbitrary, and needs work before it is meaningful in a _remotely_ linear way.
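
A short sketch of unpacking one result item, using the example tuple shown above. The variable names are illustrative, and reading the final list as one count per word of the n-gram is an assumption based on the "[count, part, ...]" label.

```python
# One item in the documented shape: (string_tuple, score, count_combo, [count, part, ...])
item = (('aangetekende', 'brief'), 1085.12, 16, [17, 17])

string_tuple, score, count_combo, part_counts = item

phrase = " ".join(string_tuple)                  # the n-gram as a readable string
by_part = dict(zip(string_tuple, part_counts))   # assumed: one count per word
```

Because the scores are not yet meaningful on a linear scale, it is probably safer to use them for ranking than to compare their magnitudes.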

connectors =

Undocumented

grams =

Undocumented

saw_tokens: int =

Undocumented

uni =

Undocumented