Module documentation

Helper functions related to spacy natural language parsing.

TODO: decide whether we want to import spacy globally - if a user hasn't imported it yet, it is a heavy import, and there may be things you want to control before that happens, like tensorflow warning suppression. ...though chances are you'll import this helper after doing that, so it might be fine.

TODO: cleanup

Class notebook_content_visualisation Python notebook visualisation to give some visual idea of contents: marks out-of-vocabulary tokens in red, and highlights the more interesting words (by POS).
Function detect_language Detects the language of the given text. Note that this depends on the spacy_fastlang library, which depends on the fasttext library.
Function en_noun_chunks Quick and dirty way to get some noun chunks out of English text.
Function installed_model_for_language Picks an installed model for the given language (where language is the initial string in the model name, e.g. 'en' or 'nl'). You can crudely indicate which among multiple model names to prefer.
Function interesting_words Takes an already-parsed spacy span (or something else that iterates as tokens), uses the pos_ attribute to return only the more interesting tokens, ignoring stopwords, function words, and such.
Function list_installed_models List loadable spacy model names. Spacy models are regular python packages, so this is somewhat tricky to do directly, but was implemented in spacy.util in version 3.something.
Function nl_noun_chunks Meant as a quick and dirty way to pre-process text when experimenting with models, in particular to remove function words.
Function reload quick and dirty way to save some time reloading during development
Function sentence_complexity_spacy Takes an already-parsed spacy sentence and returns a rough complexity estimate.
Function sentence_split A language-agnostic sentence splitter based on the `xx_sent_ud_sm` model.
Function span_as_doc unused? Also, what was its purpose again?
Function subjects_in_doc Given a parsed document, returns the nominal/clausal subjects for each sentence individually, as a list of lists (of Tokens), e.g.
Function subjects_in_span For a given span, returns a list of subjects (there can be zero, one, or more)
Variable _dutch Undocumented
Variable _english Undocumented
Variable _langdet_model Undocumented
Variable _xx_sent_model Undocumented
def detect_language(string):

Detects the language of the given text. Note that this depends on the spacy_fastlang library, which depends on the fasttext library.

Returns (lang, score)

  • lang: a language string as used by spacy ('xx' if it doesn't know)
  • score: an approximate certainty

Depends on spacy_fastlang, which is loaded on the first call of this function; that will fail if it is not installed.

CONSIDER: truncating the text to something reasonable so we don't use too much memory. Perhaps controlled by a parameter?

Parameters
  string (str): the text to determine the language of
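
A minimal usage sketch (the module import name spacy_helpers and the 0.8 threshold are assumptions, not part of this documentation):

    import spacy_helpers  # hypothetical import name for this module

    lang, score = spacy_helpers.detect_language("Dit is een korte Nederlandse zin.")
    if lang != 'xx' and score > 0.8:   # arbitrary certainty threshold
        print(f"detected {lang} with certainty {score:.2f}")
    else:
        print("language unclear")
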
def en_noun_chunks(text, load_model_name='en_core_web_trf'):

Quick and dirty way to get some noun chunks out of English text.

Parameters
  text (str): the English text to extract noun chunks from
  load_model_name (str): name of the spacy model to load for parsing
Returns
  list: the noun chunks found in the text
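
A minimal usage sketch (the import name is an assumption; assumes the default en_core_web_trf model is installed):

    import spacy_helpers  # hypothetical import name for this module

    chunks = spacy_helpers.en_noun_chunks("The quick brown fox jumps over the lazy dog.")
    print(chunks)  # expect noun chunks like "The quick brown fox" and "the lazy dog"
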
def installed_model_for_language(lang, prefer=('_lg$', '_md$', '_sm$')):

Picks an installed model for the given language (where language is the initial string in the model name, e.g. 'en' or 'nl'). You can crudely indicate which among multiple model names to prefer.

Parameters
  lang: a language string, like 'nl' or 'en'
  prefer: a list treated as regexes to be matched against each model name, where matches earlier in that list are preferred
Returns
the model name that seems to match best. Raises a ValueError if there are no models for the given language.
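
A minimal usage sketch (the import name is an assumption; which name is returned depends on what models you have installed):

    import spacy
    import spacy_helpers  # hypothetical import name for this module

    model_name = spacy_helpers.installed_model_for_language('nl')   # e.g. 'nl_core_news_lg' if present
    small_first = spacy_helpers.installed_model_for_language('en', prefer=('_sm$', '_md$', '_lg$'))
    nlp = spacy.load(model_name)
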
def interesting_words(span, ignore_stop=True, ignore_pos_=('PUNCT', 'SPACE', 'X', 'AUX', 'DET', 'CCONJ'), as_text=False):

Takes an already-parsed spacy span (or something else that iterates as tokens), uses the pos_ attribute to return only the more interesting tokens, ignoring stopwords, function words, and such.

Currently tries to include only tokens where the part of speech (`pos_`) is one of "NOUN", "PROPN", "NUM", "ADJ", "VERB", "ADP", "ADV"

Parameters
  span: the doc, sentence, or other span to iterate for Tokens
  ignore_stop: whether to ignore what spacy considers is_stop
  ignore_pos_: what list of pos_ to ignore (meant to avoid the things that it would normally include)
  as_text: return a list of strings, rather than a list of spans
Returns
list of either tokens, or strings (according to as_text)
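
A minimal usage sketch (the import name is an assumption; assumes some English model with a POS tagger is installed):

    import spacy
    import spacy_helpers  # hypothetical import name for this module

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("The committee will, in all likelihood, approve the new budget next week.")
    print(spacy_helpers.interesting_words(doc, as_text=True))
    # roughly: ['committee', 'likelihood', 'approve', 'new', 'budget', 'next', 'week']
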
def list_installed_models():

List loadable spacy model names. Spacy models are regular python packages, so this is somewhat tricky to do directly, but was implemented in spacy.util in version 3.something.

Returns
model names, as a list of strings
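
For reference, a rough sketch of how this can be done in spacy 3.x, which exposes a helper for it (a sketch of the general approach, not necessarily this module's exact implementation):

    import spacy.util

    def list_installed_models_sketch():
        # spacy >= 3.0 can enumerate installed model packages for us
        return list(spacy.util.get_installed_models())

    print(list_installed_models_sketch())   # e.g. ['en_core_web_sm', 'nl_core_news_lg', 'xx_sent_ud_sm']
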
def nl_noun_chunks(text, load_model_name='nl_core_news_lg'):

Meant as a quick and dirty way to pre-process text when experimenting with models, in particular to remove function words.

To be more than that we might use something like spacy's pattern matching

# CONSIDER: taking a model name, and/or nlp object.

Parameters
  text (str): the Dutch text to extract noun chunks from
  load_model_name (str): name of the spacy model to load for parsing
Returns
  list: the noun chunks found in the text
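
What such a helper presumably boils down to is spacy's own noun_chunks iterator; a rough sketch under that assumption (requires the model to be installed, and a spacy version whose Dutch pipeline supports noun chunks):

    import spacy

    def noun_chunk_texts(text, model_name='nl_core_news_lg'):
        # illustrative sketch, not this module's actual implementation
        nlp = spacy.load(model_name)
        doc = nlp(text)
        return [chunk.text for chunk in doc.noun_chunks]
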
def reload():

quick and dirty way to save some time reloading during development

def sentence_complexity_spacy(span):

Takes an already-parsed spacy sentence and returns a rough complexity estimate.

Mainly uses the distance of the dependencies involved, ...which is fairly decent for how simple it is.

Consider e.g.

  • long sentences aren't necessarily complex at all (they can just be separate things joined by a comma); they mainly become harder to parse if they introduce long-distance references.
  • parenthetical sentences will lengthen references across them
  • lists and flat compounds will drag the complexity down

Also, this doesn't really need normalization

Downsides include that spacy seems to assign some dependencies just because it needs to, not necessarily sensibly. Also, we should probably count most named entities as a single thing, not the number of tokens in them.
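
The docstring doesn't pin down the exact formula, but as an illustration of the dependency-distance idea, a rough sketch might look like this (an assumption, not this module's actual implementation):

    import spacy

    def mean_dependency_distance(span):
        # average absolute distance (in tokens) between each token and its syntactic head,
        # skipping the root (whose head is itself)
        distances = [abs(token.i - token.head.i) for token in span if token.head is not token]
        return sum(distances) / max(len(distances), 1)

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("The report, which the committee that met last week produced, was long.")
    for sent in doc.sents:
        print(sent.text, mean_dependency_distance(sent))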

def sentence_split(string, as_plain_sents=False):

A language-agnostic sentence splitter based on the `xx_sent_ud_sm` model.

Parameters
  string (str): the text to split into sentences
  as_plain_sents: whether to return plain sentence strings instead of a Doc (see Returns)
Returns
  • if as_plain_sents==False: a Doc, so you can pick out the .sents attribute
  • if as_plain_sents==True: a sequence of strings (from each sentence Span)
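
A minimal usage sketch (the import name is an assumption; assumes the xx_sent_ud_sm model is installed):

    import spacy_helpers  # hypothetical import name for this module

    doc = spacy_helpers.sentence_split("Dr. Smith arrived. He left again. Why?")
    for sent in doc.sents:
        print(sent.text)

    # or, if you only want the strings:
    for s in spacy_helpers.sentence_split("Eerste zin. Tweede zin.", as_plain_sents=True):
        print(s)
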
def span_as_doc(span):

unused? Also, what was its purpose again?

def subjects_in_doc(doc):

Given a parsed document, returns the nominal/clausal subjects for each sentence individually, as a list of lists (of Tokens), e.g.

  • I am a fish. You are a moose -> [ [I ], [You] ]

If no sentences are annotated, it will return None

Parameters
  doc: a parsed spacy Document
Returns
  a list of lists of Tokens
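
A minimal usage sketch (the import name is an assumption; assumes an installed English model with a parser):

    import spacy
    import spacy_helpers  # hypothetical import name for this module

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("I am a fish. You are a moose.")
    print(spacy_helpers.subjects_in_doc(doc))   # expect something like [[I], [You]] (lists of Tokens)
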
def subjects_in_span(span):

For a given span, returns a list of subjects (there can be zero, one, or more)

If given a Doc, that means subjects from all sentences. Sometimes that's what you want; if you want them per sentence, see subjects_in_doc.

Returns a mapping from each subject to related information, e.g.

  • Token(she): { verb:Token(went) }
  • Token(Taking): { verb:Token(relax), object:Token(nap), clause:[Token(Taking), Token(a), Token(nap)] }

You may only be interested in its keys. What's in the values is undecided and may change.

Relevant here are

  • nsubj - nominal subject, a non-clausal constituent in the subject position of an active verb. A non-clausal constituent with the SBJ function tag is considered an nsubj.

TODO: actually implement

_dutch =

Undocumented

_english =

Undocumented

_langdet_model =

Undocumented

_xx_sent_model =

Undocumented