Module documentation

Helper functions related to spacy natural language parsing.

TODO: decide whether we want to import spacy globally - if a user hasn't imported it yet, it is a heavy import, and there may be things you want to control before that happens, like tensorflow warning suppression. ...though chances are you'll import this helper after doing that, so it might be fine.

TODO: cleanup

Class notebook_content_visualisation Python notebook visualisation to give some visual idea of contents: marks out-of-vocabulary tokens in red, and highlights the more interesting words (by POS).
Function detect_language Detects the language of the given text. Note that this depends on the spacy_fastlang library, which depends on the fasttext library.
Function en_noun_chunks Quick and dirty way to get some noun chunks out of English text.
Function installed_model_for_language Picks an installed model for the given language (where language is the initial string in the model name, e.g. 'en' or 'nl'). You can crudely indicate which among multiple model names to prefer.
Function interesting_words Takes an already-parsed spacy span (or something else that iterates as tokens), uses the pos_ attribute to return only the more interesting tokens, ignoring stopwords, function words, and such.
Function list_installed_models List loadable spacy model names. Spacy models are regular python packages, so this is somewhat tricky to do directly, but was implemented in spacy.util in version 3.something.
Function nl_noun_chunks Meant as a quick and dirty way to pre-process text when experimenting with models, in particular to remove function words.
Function reload quick and dirty way to save some time reloading during development
Function sentence_complexity_spacy Takes an already-parsed spacy sentence and returns a rough complexity estimate.
Function sentence_split A language-agnostic sentence splitter based on the `xx_sent_ud_sm` model.
Function span_as_doc unused? Also, what was its purpose again?
Function subjects_in_doc Given a parsed document, returns the nominal/clausal subjects for each sentence individually, as a list of lists (of Tokens), e.g.
Function subjects_in_span For a given span, returns a list of subjects (there can be zero, one, or more)
Variable _dutch Undocumented
Variable _english Undocumented
Variable _langdet_model Undocumented
Variable _xx_sent_model Undocumented
def detect_language(string):

Detects the language of the given text. Note that this depends on the spacy_fastlang library, which depends on the fasttext library.

Returns (lang, score)

  • lang: a language string as used by spacy ('xx' if it doesn't know)
  • score: an approximate certainty

Depends on spacy_fastlang, which is loaded on the first call of this function; that will fail if it is not installed.

CONSIDER: truncating the text to something reasonable so we don't use too much memory. Perhaps controlled by a parameter?

Parameters
  string (str): the text to determine the language of
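
A minimal usage sketch (the module import name spacy_helpers and the 0.8 threshold are assumptions, not part of this documentation):

    import spacy_helpers  # hypothetical import name for this module

    lang, score = spacy_helpers.detect_language("Dit is een korte Nederlandse zin.")
    if lang != 'xx' and score > 0.8:   # arbitrary certainty threshold
        print(f"detected {lang} with certainty {score:.2f}")
    else:
        print("language unclear")
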
def en_noun_chunks(text, load_model_name='en_core_web_trf'):

Quick and dirty way to get some noun chunks out of English text.

Parameters
  text (str): the English text to extract noun chunks from
  load_model_name (str): name of the spacy model to load for parsing
Returns
  list: the noun chunks found in the text
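
A minimal usage sketch (the import name is an assumption; assumes the default en_core_web_trf model is installed):

    import spacy_helpers  # hypothetical import name for this module

    chunks = spacy_helpers.en_noun_chunks("The quick brown fox jumps over the lazy dog.")
    print(chunks)  # expect noun chunks like "The quick brown fox" and "the lazy dog"
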
def installed_model_for_language(lang, prefer=('_lg$', '_md$', '_sm$')):

Picks an installed model for the given language (where language is the initial string in the model name, e.g. 'en' or 'nl'). You can crudely indicate which among multiple model names to prefer.

Parameters
  lang: a language string, like 'nl' or 'en'
  prefer: a list treated as regexes to be matched against each model name, where matches earlier in that list are preferred
Returns
the model name that seems to match best. Raises a ValueError if there are no models for the given language.
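
A minimal usage sketch (the import name is an assumption; which name is returned depends on what models you have installed):

    import spacy
    import spacy_helpers  # hypothetical import name for this module

    model_name = spacy_helpers.installed_model_for_language('nl')   # e.g. 'nl_core_news_lg' if present
    small_first = spacy_helpers.installed_model_for_language('en', prefer=('_sm$', '_md$', '_lg$'))
    nlp = spacy.load(model_name)
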
def interesting_words(span, ignore_stop=True, ignore_pos_=('PUNCT', 'SPACE', 'X', 'AUX', 'DET', 'CCONJ'), as_text=False):

Takes an already-parsed spacy span (or something else that iterates as tokens), uses the pos_ attribute to return only the more interesting tokens, ignoring stopwords, function words, and such.

Currently tries to include only tokens where the part of speech (`pos_`) is one of "NOUN", "PROPN", "NUM", "ADJ", "VERB", "ADP", "ADV"

Parameters
  span: the doc, sentence, or other span to iterate for Tokens
  ignore_stop: whether to ignore what spacy considers is_stop
  ignore_pos_: what list of pos_ to ignore (meant to avoid the things that it would normally include)
  as_text: return a list of strings, rather than a list of spans
Returns
list of either tokens, or strings (according to as_text)
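
A minimal usage sketch (the import name is an assumption; assumes some English model with a POS tagger is installed):

    import spacy
    import spacy_helpers  # hypothetical import name for this module

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("The committee will, in all likelihood, approve the new budget next week.")
    print(spacy_helpers.interesting_words(doc, as_text=True))
    # roughly: ['committee', 'likelihood', 'approve', 'new', 'budget', 'next', 'week']
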
def list_installed_models():

List loadable spacy model names. Spacy models are regular python packages, so this is somewhat tricky to do directly, but was implemented in spacy.util in version 3.something.

Returns
model names, as a list of strings
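
For reference, a rough sketch of how this can be done in spacy 3.x, which exposes a helper for it (a sketch of the general approach, not necessarily this module's exact implementation):

    import spacy.util

    def list_installed_models_sketch():
        # spacy >= 3.0 can enumerate installed model packages for us
        return list(spacy.util.get_installed_models())

    print(list_installed_models_sketch())   # e.g. ['en_core_web_sm', 'nl_core_news_lg', 'xx_sent_ud_sm']
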
def nl_noun_chunks(text, load_model_name='nl_core_news_lg'):

Meant as a quick and dirty way to pre-process text when experimenting with models, in particular to remove function words.

To be more than that we might use something like spacy's pattern matching

# CONSIDER: taking a model name, and/or nlp object.

Parameters
  text (str): the Dutch text to extract noun chunks from
  load_model_name (str): name of the spacy model to load for parsing
Returns
  list: the noun chunks found in the text
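
What such a helper presumably boils down to is spacy's own noun_chunks iterator; a rough sketch under that assumption (requires the model to be installed, and a spacy version whose Dutch pipeline supports noun chunks):

    import spacy

    def noun_chunk_texts(text, model_name='nl_core_news_lg'):
        # illustrative sketch, not this module's actual implementation
        nlp = spacy.load(model_name)
        doc = nlp(text)
        return [chunk.text for chunk in doc.noun_chunks]
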
def reload():

quick and dirty way to save some time reloading during development

def sentence_complexity_spacy(span):

Takes an already-parsed spacy sentence and returns a rough complexity estimate.

Mainly uses the distance of the dependencies involved, ...which is fairly decent for how simple it is.

Consider e.g.

  • long sentences aren't necessarily complex at all (they can just be separate things joined by a comma); they mainly become harder to parse if they introduce long-distance references.
  • parenthetical sentences will lengthen references across them
  • lists and flat compounds will drag the complexity down

Also, this doesn't really need normalization

Downsides include that spacy seems to assign some dependencies just because it needs to, not necessarily sensibly. Also, we should probably count most named entities as a single thing, not the number of tokens in them.
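
The docstring doesn't pin down the exact formula, but as an illustration of the dependency-distance idea, a rough sketch might look like this (an assumption, not this module's actual implementation):

    import spacy

    def mean_dependency_distance(span):
        # average absolute distance (in tokens) between each token and its syntactic head,
        # skipping the root (whose head is itself)
        distances = [abs(token.i - token.head.i) for token in span if token.head is not token]
        return sum(distances) / max(len(distances), 1)

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("The report, which the committee that met last week produced, was long.")
    for sent in doc.sents:
        print(sent.text, mean_dependency_distance(sent))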

def sentence_split(string, as_plain_sents=False):

A language-agnostic sentence splitter based on the `xx_sent_ud_sm` model.

Parameters
  string (str): the text to split into sentences
  as_plain_sents: whether to return plain sentence strings instead of a Doc (see Returns)
Returns
  • if as_plain_sents==False: a Doc, so you can pick out the .sents attribute
  • if as_plain_sents==True: a sequence of strings (from each sentence Span)
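
A minimal usage sketch (the import name is an assumption; assumes the xx_sent_ud_sm model is installed):

    import spacy_helpers  # hypothetical import name for this module

    doc = spacy_helpers.sentence_split("Dr. Smith arrived. He left again. Why?")
    for sent in doc.sents:
        print(sent.text)

    # or, if you only want the strings:
    for s in spacy_helpers.sentence_split("Eerste zin. Tweede zin.", as_plain_sents=True):
        print(s)
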
def span_as_doc(span):

unused? Also, what was its purpose again?

def subjects_in_doc(doc):

Given a parsed document, returns the nominal/clausal subjects for each sentence individually, as a list of lists (of Tokens), e.g.

  • I am a fish. You are a moose -> [ [I ], [You] ]

If no sentences are annotated, it will return None

Parameters
  doc: a parsed spacy Document
Returns
  a list of lists of Tokens
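
A minimal usage sketch (the import name is an assumption; assumes an installed English model with a parser):

    import spacy
    import spacy_helpers  # hypothetical import name for this module

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("I am a fish. You are a moose.")
    print(spacy_helpers.subjects_in_doc(doc))   # expect something like [[I], [You]] (lists of Tokens)
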
def subjects_in_span(span):

For a given span, returns a list of subjects (there can be zero, one, or more)

If given a Doc, that means subjects from all sentences. Sometimes that's what you want; if you want them per sentence, see subjects_in_doc.

Returns a mapping from each subject to related information, e.g.

  • Token(she): { verb:Token(went) }
  • Token(Taking): { verb:Token(relax), object:Token(nap), clause:[Token(Taking), Token(a), Token(nap)] }

You may only be interested in its keys. What's in the values is undecided and may change.

Relevant here are

  • nsubj - nominal subject, a non-clausal constituent in the subject position of an active verb. A non-clausal constituent with the SBJ function tag is considered an nsubj.

TODO: actually implement

_dutch =

Undocumented

_english =

Undocumented

_langdet_model =

Undocumented

_xx_sent_model =

Undocumented