class documentation

class Collocation:

Constructor: Collocation(connectors)


A basic collocation calculator class.

Method __init__ connectors takes a list of words that are removed when they appear at the _edge_ of an n-gram (for n > 1), but are kept when they appear inside it (only possible for n >= 3).
Method add_gram Used by consume_tokens; you typically should not need to call this yourself.
Method add_uni Used by consume_tokens; you typically should not need to call this yourself.
Method cleanup_ngrams Remove n-grams that occur fewer than mincount times, or for which disqualify_func returns true.
Method cleanup_unigrams Remove unigrams that are rare; by default, those that appear just once. You may wish to increase this threshold.
Method consume_tokens Takes a list of string tokens. Counts the unigrams and n-grams in it, for the given values of n.
Method counts Returns counts of tokens, unigrams, and n-grams (n >= 2).
Method score_ngrams Takes the counts collected so far and returns a list of scored n-gram tuples.
Instance Variable connectors Undocumented
Instance Variable grams Undocumented
Instance Variable saw_tokens Undocumented
Instance Variable uni Undocumented
def __init__(self, connectors=()):

connectors takes a list of words that are removed when they appear at the _edge_ of an n-gram (for n > 1), but are kept when they appear inside it (which is only possible for n >= 3).
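To make the edge/inside distinction concrete, here is a minimal standalone sketch. The helper name has_edge_connector and the word list are hypothetical and not part of this class; the sketch only tests where a connector sits and makes no claim about what the class then does with such an n-gram.

```python
def has_edge_connector(gram, connectors):
    """Return True if the first or last word of an n-gram tuple (n > 1) is a connector.

    Hypothetical helper for illustration; not part of the Collocation class.
    """
    return len(gram) > 1 and (gram[0] in connectors or gram[-1] in connectors)


connectors = {"van", "de", "het"}  # illustrative connector words

# Connector at the edge of a bigram: this is the case the connectors rule targets.
edge_hit = has_edge_connector(("van", "state"), connectors)

# Connector only in the middle of a trigram: left alone.
inside_hit = has_edge_connector(("raad", "van", "state"), connectors)

print(edge_hit, inside_hit)
```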

def add_gram(self, strtup, cnt=1):

Used by consume_tokens; you typically should not need to call this yourself.

def add_uni(self, s, cnt=1):

Used by consume_tokens; you typically should not need to call this yourself.

def cleanup_ngrams(self, mincount=2, disqualify_func=None):

Remove n-grams if either

  • they occur fewer than mincount times
  • disqualify_func returns true for them (the function is passed the n-gram string tuple and its count)

Both can be None (though mincount==1 is functionally the same as None).

CONSIDER: allow a different threshold for each length, e.g. via a list for mincount.
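The two removal conditions can be sketched on a plain dict of counts. This is an illustration only, not the class's implementation: keep_gram and has_digit_token are hypothetical names, the counts other than the documented ('aangetekende', 'brief') pair are made up, and the sketch assumes the disqualifying function is called with two positional arguments (the tuple, then the count).

```python
def keep_gram(gram, count, mincount=2, disqualify_func=None):
    """Hypothetical stand-in for the cleanup rule: True if the n-gram survives."""
    if mincount is not None and count < mincount:
        return False  # too rare
    if disqualify_func is not None and disqualify_func(gram, count):
        return False  # rejected by the caller's function
    return True


def has_digit_token(gram, count):
    """Example disqualifier: reject n-grams that contain a purely numeric token."""
    return any(token.isdigit() for token in gram)


# Made-up counts for illustration.
grams = {
    ("aangetekende", "brief"): 16,
    ("artikel", "12"): 5,
    ("een", "keer"): 1,
}

kept = {
    gram: count
    for gram, count in grams.items()
    if keep_gram(gram, count, mincount=2, disqualify_func=has_digit_token)
}
print(kept)
```

Passing None for both arguments keeps every n-gram, which matches the note that mincount==1 behaves the same as None.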

def cleanup_unigrams(self, mincount=2, disqualify_func=None):

Remove unigrams that are rare; by default, those that appear just once. You may wish to increase this threshold. Ideally we would also remove all n-grams that use them, but it is faster to waste the memory and leave them there.

def consume_tokens(self, token_list, gramlens=(2, 3, 4)):

Takes a list of string tokens. Counts the unigrams and n-grams in it, for the given values of n (gramlens).
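As a rough picture of what counting unigrams and n-grams for several values of n involves, here is a minimal sketch using collections.Counter. The function count_grams is hypothetical and is not this class's code; in particular it ignores the connectors handling described for the constructor.

```python
from collections import Counter


def count_grams(token_list, gramlens=(2, 3, 4)):
    """Hypothetical sketch: count unigrams and sliding-window n-grams."""
    uni = Counter(token_list)
    grams = Counter()
    for n in gramlens:
        # A list of k tokens contains k - n + 1 windows of length n
        # (none at all when the list is shorter than n).
        for i in range(len(token_list) - n + 1):
            grams[tuple(token_list[i:i + n])] += 1
    return uni, grams


tokens = ["de", "raad", "van", "state", "de", "raad"]  # illustrative input
uni, grams = count_grams(tokens, gramlens=(2, 3))

print(uni["de"], grams[("de", "raad")], grams[("raad", "van", "state")])
```

Storing n-grams as tuples of strings mirrors the string tuples that appear in the score_ngrams output.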

def counts(self):

Returns counts of tokens, unigrams, and n-grams (n >= 2).

def score_ngrams(self, method='mik2', sort=True):

Takes the counts collected so far and returns a list of items like:

    (string_tuple,              score,   count_combo,  [count, part, ...])

e.g.:

    (('aangetekende', 'brief'), 1085.12, 16, [17, 17])

The scoring logic is currently somewhat arbitrary, and needs work before it is meaningful in a _remotely_ linear way.
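A small sketch of unpacking one such item, using only the example tuple shown above. The variable names are assumptions; in particular, reading the final list as per-part counts is a guess based on the "[count, part, ...]" placeholder, not something this page confirms.

```python
# The example item from the documentation above.
item = (("aangetekende", "brief"), 1085.12, 16, [17, 17])

# Assumed field names, following the documented layout.
string_tuple, score, count_combo, part_counts = item

# Tab-separated line: the n-gram as text, its score, and its combined count.
line = f"{' '.join(string_tuple)}\t{score:.2f}\t{count_combo}"
print(line)
```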

connectors =

Undocumented

grams =

Undocumented

saw_tokens: int =

Undocumented

uni =

Undocumented