Helper functions related to spacy natural language parsing.
TODO: decide whether we want to import spacy globally. If a user hasn't imported it yet, it is a heavy import, and there may be things you want to control before it happens, such as suppressing tensorflow warnings. Then again, chances are you will import this helper after doing that, so it may be fine.
TODO: cleanup
Kind | Name | Description |
Class | notebook | Python notebook visualisation to give some visual idea of contents: marks out-of-vocabulary tokens red, and highlights the more interesting words (by POS). |
Function | detect | Detects language. Note that this depends on the spacy_fastlang library, which depends on the fasttext library. |
Function | en | Quick and dirty way to get some noun chunks out of English text. |
Function | installed | Picks an installed model for the given language (where language is the initial string in the model name, e.g. 'en' or 'nl'). You can crudely state a preference among multiple model names. |
Function | interesting | Takes an already-parsed spacy span (or something else that iterates as tokens) and uses the pos_ attribute to return only the more interesting tokens, ignoring stopwords, function words, and such. |
Function | list | Lists loadable spacy model names. Spacy models are regular python packages, so this is somewhat tricky to do directly, but was implemented in spacy.util in version 3.something. |
Function | nl | Meant as a quick and dirty way to pre-process text when experimenting with models, particularly to remove function words. |
Function | reload | Quick and dirty way to save some time reloading during development. |
Function | sentence | Takes an already-parsed spacy sentence and estimates its complexity. |
Function | sentence | A language-agnostic sentence splitter based on the `xx_sent_ud_sm` model. |
Function | span | Unused? Also, what was its purpose again? |
Function | subjects | Given a parsed document, returns the nominal/clausal subjects for each sentence individually, as a list of lists (of Tokens). |
Function | subjects | For a given span, returns a list of subjects (there can be zero, one, or more). |
Variable | _dutch | Undocumented |
Variable | _english | Undocumented |
Variable | _langdet | Undocumented |
Variable | _xx | Undocumented |
Detects language. Note that this depends on the spacy_fastlang library, which in turn depends on the fasttext library.
Returns (lang, score)
- lang: the language string as used by spacy ('xx' if it doesn't know)
- score: an approximated certainty
Depends on spacy_fastlang, which is loaded on the first call of this function - and that call will fail if it is not installed.
CONSIDER: truncating the text to something reasonable, to not use too much memory. Or a parameter?
Parameters | |
string:str | the text to determine the language of |
Quick and dirty way to get some noun chunks out of English text.
Parameters | |
text:str | Undocumented |
loadstr | Undocumented |
Returns | |
list | Undocumented |
Picks an installed model for the given language (where language is the initial string in the model name, e.g. 'en' or 'nl'). You can crudely state a preference among multiple model names.
Parameters | |
lang | a language string, like 'nl' or 'en' |
prefer | a list of regexes matched against each model name; matches earlier in that list are preferred |
Returns | |
the model name that seems to match best. Raises a ValueError if there are no models for the given language. |
Takes an already-parsed spacy span (or something else that iterates as tokens) and uses the pos_ attribute to return only the more interesting tokens, ignoring stopwords, function words, and such.
Currently tries to include only tokens where the part of speech (`pos_`) is one of "NOUN", "PROPN", "NUM", "ADJ", "VERB", "ADP", "ADV".
Parameters | |
span | the doc, sentence, or other span to iterate for Tokens |
ignore | whether to ignore what spacy considers is_stop |
ignore | a list of pos_ values to ignore (meant to exclude things that would normally be included) |
as_text | return a list of strings, rather than a list of tokens |
Returns | |
list of either tokens or strings (according to as_text) |
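The filtering can be illustrated with stand-in tokens; `FakeToken` and this `interesting` body are a sketch of the behaviour described above (the ignore-pos_ parameter is omitted), not the module's actual code:

```python
from collections import namedtuple

# stand-in for a spacy Token, carrying only what the filter reads
FakeToken = namedtuple("FakeToken", ["text", "pos_", "is_stop"])

INTERESTING_POS = {"NOUN", "PROPN", "NUM", "ADJ", "VERB", "ADP", "ADV"}

def interesting(span, ignore_stop=True, as_text=False):
    """Keep only tokens whose pos_ is in the interesting set, optionally
    dropping stopwords; return tokens, or their text if as_text is set."""
    kept = []
    for token in span:
        if ignore_stop and token.is_stop:
            continue
        if token.pos_ not in INTERESTING_POS:
            continue
        kept.append(token.text if as_text else token)
    return kept
```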
List loadable spacy model names. Spacy models are regular python packages, so this is somewhat tricky to do directly, but was implemented in spacy.util in version 3.something.
Returns | |
model names, as a list of strings |
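Since spacy 3.x, installed models register themselves via a package entry-point group (assumed here to be named `spacy_models`, which is what spacy's own lookup appears to read), so the same list can be sketched with just the standard library:

```python
from importlib.metadata import entry_points

def list_installed_spacy_models():
    """List installed spacy model package names by querying the
    'spacy_models' entry-point group directly."""
    try:
        eps = entry_points(group="spacy_models")      # Python 3.10+
    except TypeError:
        eps = entry_points().get("spacy_models", [])  # older dict-style API
    return sorted(ep.name for ep in eps)
```

On an environment without any models installed, this simply returns an empty list.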
Meant as a quick and dirty way to pre-process text when experimenting with models, particularly to remove function words.
To be more than that, we might use something like spacy's pattern matching.
# CONSIDER: taking a model name, and/or nlp object.
Parameters | |
text:str | Undocumented |
loadstr | Undocumented |
Returns | |
list | Undocumented |
Takes an already-parsed spacy sentence and estimates its complexity.
Mainly uses the distance of the dependencies involved, which is fairly decent for how simple it is. Consider e.g.:
- long sentences aren't necessarily complex at all (they can just be separate things joined by a comma); they mainly become harder to parse when they introduce long-distance references.
- parenthetical sentences will lengthen references across them.
- lists and flat compounds will drag the complexity down.
Also, this doesn't really need normalization.
Downsides include that spacy seems to assign some dependencies just because it needs to, not necessarily sensibly. Also, we should probably count most named entities as a single thing, rather than as the number of tokens in them.
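The dependency-distance idea can be sketched with stand-in tokens that carry only their own index and their head's index (the names here are assumptions; a real spacy Token exposes these as `token.i` and `token.head.i`):

```python
from collections import namedtuple

# stand-in for a parsed token: its own index, and its syntactic head's index
DepToken = namedtuple("DepToken", ["i", "head_i"])

def dependency_distance(sentence):
    """Average distance, in tokens, between each token and its head.
    Long-distance references push this up; flat lists keep it low."""
    distances = [abs(tok.i - tok.head_i) for tok in sentence]
    if not distances:
        return 0.0
    return sum(distances) / len(distances)
```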
A language-agnostic sentence splitter based on the `xx_sent_ud_sm` model.
Parameters | |
string:str | the text to split into sentences |
as | Undocumented |
Returns | |
|
Given a parsed document, returns the nominal/clausal subjects for each sentence individually, as a list of lists (of Tokens), e.g.
- I am a fish. You are a moose. -> [ [I], [You] ]
If no sentences are annotated, it returns None.
Parameters | |
doc | spacy Document |
Returns | |
list of lists of tokens |
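The per-sentence extraction can be sketched with stand-in tokens; the `Tok` type, the function name, and the exact set of subject dependency labels are assumptions here:

```python
from collections import namedtuple

# stand-in for a spacy Token; dep_ is its dependency label
Tok = namedtuple("Tok", ["text", "dep_"])

SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "csubjpass"}

def subjects_per_sentence(sentences):
    """Given sentences as lists of tokens, return the subjects of each
    sentence as a list of lists; None if there are no sentences."""
    if not sentences:
        return None
    return [[tok for tok in sent if tok.dep_ in SUBJECT_DEPS]
            for sent in sentences]
```

Fed the two example sentences, this yields the `[ [I], [You] ]` shape described above.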
For a given span, returns a list of subjects (there can be zero, one, or more).
If given a Doc, that means all sentences' subjects. Sometimes that's what you want; if you want them per sentence, see subjects_in_doc.
Returns a mapping from each subject to related information, e.g.
- Token(she): { verb:Token(went) }
- Token(Taking): { verb:Token(relax), object:Token(nap), clause:[Token(Taking), Token(a), Token(nap)] }
You may only be interested in its keys. What's in the values is undecided and may change.
Relevant here are
- nsubj - nominal subject, a non-clausal constituent in the subject position of an active verb. A non-clausal constituent with the SBJ function tag is considered an nsubj.
TODO: actually implement
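Until this is implemented, the intended return shape could look roughly like the sketch below; the stand-in token type and helper name are assumptions, and only the `verb` key from the description above is filled in:

```python
from collections import namedtuple

# stand-in for a spacy Token: text, dependency label, and its head's text
STok = namedtuple("STok", ["text", "dep_", "head_text"])

def subjects_with_info(span):
    """Map each subject token in the span to related information;
    here only the governing verb (the token's dependency head)."""
    out = {}
    for tok in span:
        if tok.dep_ in ("nsubj", "nsubjpass", "csubj", "csubjpass"):
            out[tok] = {"verb": tok.head_text}
    return out
```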