class documentation

Very minimal SRU implementation - just enough to access the KOOP repositories.

Method __init__ No summary
Method explain Does an explain operation; returns the XML.
Method explain_parsed Does an explain operation, fishes out some interesting details, and returns those as a dict.
Method num_records Returns the number of records listed in the last search's results.
Method search_retrieve Fetches a small range of result records for the given query.
Method search_retrieve_many Fetch _many_ results, in chunks, by calling search_retrieve() repeatedly, in part because there is typically a server-side limit on how many you can fetch at once.
Instance Variable base_url The base URL that other things add to; added from instantiation.
Instance Variable extra_query Extra piece of query to AND into the query you do later. This lets us represent subsets of larger repositories.
Instance Variable number_of_records the number of results reported in the last query we did. None before you do a query. CONSIDER: changing that.
Instance Variable sru_version hardcoded to "1.2"
Instance Variable verbose whether to print out things while we do them.
Instance Variable x_connection The x_connection attribute that some of these need; added from instantiation.
Method _url Combines the basic URL parts given to the constructor, and ensures there's a '?' (so you know you can append &k=v). This can probably go into the constructor, once I know how much is constant across SRU URLs...
def __init__(self, base_url: str, x_connection: str = None, extra_query: str = None, verbose=False):
Parameters
base_url (str): The base URL that other things add to; basically everything up to the '?'.
x_connection (str): An attribute that some of these need in the URL. Seems to be non-standard, but required for these repos.
extra_query (str): Used to let us AND something into the query, intended to restrict results to a subset of documents. This lets us represent subsets of larger repositories (somewhat related to x_connection).
verbose: whether to print out things while we do them.
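To make the roles of these parameters concrete, here is a minimal sketch of how a searchRetrieve request URL could be assembled from them, assuming the standard SRU 1.2 request parameters. The helper name build_search_url and the exact way extra_query is combined are illustrative assumptions, not this class's actual code:

```python
from urllib.parse import urlencode

def build_search_url(base_url, query, extra_query=None, x_connection=None,
                     start_record=1, maximum_records=10, sru_version="1.2"):
    """Sketch: assemble an SRU 1.2 searchRetrieve URL (illustrative, not this class's code)."""
    if extra_query is not None:
        # AND the restriction into the user's CQL query, to select a subset of the repository
        query = '(%s) and (%s)' % (query, extra_query)
    # ensure we can append k=v parameters (what _url() is described as doing)
    if '?' not in base_url:
        base_url += '?'
    elif not base_url.endswith(('?', '&')):
        base_url += '&'
    params = {
        'operation':      'searchRetrieve',  # standard SRU 1.2 parameters
        'version':        sru_version,
        'query':          query,
        'startRecord':    start_record,      # note: one-based
        'maximumRecords': maximum_records,
    }
    if x_connection is not None:
        params['x-connection'] = x_connection  # non-standard, repo-specific
    return base_url + urlencode(params)
```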
def explain(self, readable=True, strip_namespaces=True, timeout=10):

Does an explain operation; returns the XML.

  • if readable==False, it returns it as-is
  • if readable==True (default), it will ease human readability:
    • strips namespaces
    • reindents

The XML is a unicode string (for consistency with other parts of this codebase)
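The namespace stripping and reindenting described above can be sketched with the standard library's ElementTree. This is an illustrative sketch of the cleanup, not this method's actual code:

```python
import xml.etree.ElementTree as ET

def strip_namespaces(tree):
    """Remove namespace parts in-place, e.g. '{http://...}explain' -> 'explain'."""
    for el in tree.iter():
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
        # attribute names can be namespaced too
        for name in list(el.attrib):
            if '}' in name:
                el.attrib[name.split('}', 1)[1]] = el.attrib.pop(name)
    return tree

xml = '<sru:explain xmlns:sru="http://www.loc.gov/zing/srw/"><sru:record/></sru:explain>'
root = strip_namespaces(ET.fromstring(xml))
ET.indent(root)  # reindent for readability (Python 3.9+)
print(ET.tostring(root, encoding='unicode'))
```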

def explain_parsed(self, timeout=10):

Does an explain operation, fishes out some interesting details, and returns those as a dict.

TODO: actually read the standard instead of assuming things.

def num_records(self):

Returns the number of records listed in the last search's results.

If you call it before doing a search_retrieve it will raise an error.

This function may change.

def search_retrieve(self, query: str, start_record=None, maximum_records=None, callback=None, verbose=False):

Fetches a small range of result records for the given query.

Exactly what each record contains will vary per repository, sometimes even per presumably-sensible-subset of records.

Returns a list of result records (each an ElementTree object), because in some cases the search records contain metadata that is not as easily fetched from the result documents themselves, and you may wish to decide how to fish it out.

Notes:

  • search_retrieve will update the number of matching records, which backs num_records()
  • strips namespaces from the results - this makes writing code more convenient

CONSIDER:

  • option to return the URL instead of searching
Parameters
query (str): The query string, in CQL form (see the Library of Congress spec). The set of indices you can search in (e.g. 'dcterms.modified>=2000-01-01') varies with each repo; take a look at explain_parsed() (a parsed summary) or explain() (the actual explain XML).
start_record: What record offset to start fetching at. Note: one-based counting.
maximum_records: How many records to fetch (from start_record). Note that repositories may not like high values here, so if you care about _all_ results of a possibly large set, you probably want to use search_retrieve_many() instead.
callback: If not None, this function calls it for each record node. Alternatively, you can wait for the entire range of fetches to conclude and be handed the complete list of result records.
verbose: whether to be even more verbose during this query
def search_retrieve_many(self, query: str, at_a_time: int = 10, start_record: int = 1, up_to: int = 250, callback=None, wait_between_sec: float = 0.5, verbose: bool = False):

Fetch _many_ results, in chunks, by calling search_retrieve() repeatedly, in part because there is typically a server-side limit on how many you can fetch at once.

You will often rely on either

  • it returning a list of all records (as ElementTree objects), which can be more convenient if you want to handle the results as a whole, but only happens at the very end
  • it calling a callback function on each individual record _during_ the fetching process, which can be a more convenient way of dealing with many results as they come in, especially for very large fetches
Parameters
query (str): Like in search_retrieve().
at_a_time (int): How many records to fetch in a single request.
start_record (int): Like in search_retrieve().
up_to (int): The last record to fetch, as an absolute offset, so e.g. start_record=200, up_to=250 gives you records 200..250, not 200..450.
callback: Like in search_retrieve().
wait_between_sec (float): A backoff sleep between each search request, to avoid hammering a server too much. You can lower this where you know that is overly cautious. Note that we skip this sleep if one fetch was enough.
verbose (bool): Whether to be even more verbose during this query.

Since we fetch in chunks, we may overshoot in the last fetch by up to at_a_time entries. The code should avoid returning those.

CONSIDER:

  • maybe yield something including numberOfRecords before yielding results?
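The chunked fetching described above amounts to walking the one-based, inclusive range in at_a_time-sized steps. A minimal sketch of that range arithmetic (this variant avoids overshoot by capping the request size rather than trimming afterwards; the helper name is an illustrative assumption):

```python
def chunk_ranges(start_record, up_to, at_a_time):
    """Yield (startRecord, maximumRecords) pairs for chunked SRU fetching.

    Offsets are one-based and up_to is an absolute, inclusive offset,
    matching search_retrieve_many()'s description. Illustrative sketch only.
    """
    start = start_record
    while start <= up_to:
        # never ask for more than remain, so the last chunk cannot overshoot up_to
        count = min(at_a_time, up_to - start + 1)
        yield start, count
        start += count

# e.g. records 1..25 in chunks of 10:
list(chunk_ranges(1, 25, 10))  # -> [(1, 10), (11, 10), (21, 5)]
```

Note that start_record=200, up_to=250 yields a single (200, 51) request, covering records 200..250 inclusive, as the up_to parameter describes.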
base_url =

The base URL that other things add to; added from instantiation.

extra_query =

Extra piece of query to AND into the query you do later. This lets us represent subsets of larger repositories.

number_of_records =

the number of results reported in the last query we did. None before you do a query. CONSIDER: changing that.

sru_version: str =

hardcoded to "1.2"

verbose =

whether to print out things while we do them.

x_connection =

The x_connection attribute that some of these need; added from instantiation.

def _url(self):

Combines the basic URL parts given to the constructor, and ensures there's a '?' (so you know you can append &k=v). This can probably go into the constructor, once I know how much is constant across SRU URLs...