class documentation

Very minimal SRU implementation - just enough to access the KOOP repositories.

Method __init__ No summary
Method explain Does an explain operation; returns the XML.
Method explain_parsed Does an explain operation, fishes out some interesting details, and returns those as a dict.
Method num_records Returns the number of records listed in the last search's results.
Method search_retrieve Fetches a small range of result records for the given query.
Method search_retrieve_many Fetch _many_ results, in chunks, by calling search_retrieve() repeatedly, in part because there is typically a server-side limit on how many you can fetch at once.
Instance Variable base_url The base URL that other things add to; added from instantiation.
Instance Variable extra_query Extra piece of query to AND into the query you do later. This lets us represent subsets of larger repositories.
Instance Variable number_of_records the number of results reported in the last query we did. None before you do a query. CONSIDER: changing that.
Instance Variable sru_version hardcoded to "1.2"
Instance Variable verbose whether to print out things while we do them.
Instance Variable x_connection The x_connection attribute that some of these need; added from instantiation.
Method _url Combines the basic URL parts given to the constructor, and ensures there's a '?' (so you know you can append &k=v). This can probably go into the constructor, once I know how much is constant across SRU URLs...
def __init__(self, base_url: str, x_connection: str = None, extra_query: str = None, verbose=False):
Parameters
base_url (str): The base URL that other things add to; basically everything up to the '?'.
x_connection (str): An attribute that some of these need in the URL. Seems to be non-standard, but required for these repos.
extra_query (str): Used to let us AND something into the query, intended to restrict results to a subset of documents. This lets us represent subsets of larger repositories (somewhat related to x_connection).
verbose: whether to print out things while we do them.
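To make the roles of these parameters concrete, here is a minimal sketch of how a searchRetrieve request URL could be assembled from them, assuming the standard SRU 1.2 request parameters. The helper name build_search_url and the exact way extra_query is combined are illustrative assumptions, not this class's actual code:

```python
from urllib.parse import urlencode

def build_search_url(base_url, query, extra_query=None, x_connection=None,
                     start_record=1, maximum_records=10, sru_version="1.2"):
    """Sketch: assemble an SRU 1.2 searchRetrieve URL (illustrative, not this class's code)."""
    if extra_query is not None:
        # AND the restriction into the user's CQL query, to select a subset of the repository
        query = '(%s) and (%s)' % (query, extra_query)
    # ensure we can append k=v parameters (what _url() is described as doing)
    if '?' not in base_url:
        base_url += '?'
    elif not base_url.endswith(('?', '&')):
        base_url += '&'
    params = {
        'operation':      'searchRetrieve',  # standard SRU 1.2 parameters
        'version':        sru_version,
        'query':          query,
        'startRecord':    start_record,      # note: one-based
        'maximumRecords': maximum_records,
    }
    if x_connection is not None:
        params['x-connection'] = x_connection  # non-standard, repo-specific
    return base_url + urlencode(params)
```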
def explain(self, readable=True, strip_namespaces=True, timeout=10):

Does an explain operation; returns the XML.

  • if readable==False, it returns it as-is
  • if readable==True (default), it will ease human readability:
    • strips namespaces
    • reindents

The XML is a unicode string (for consistency with other parts of this codebase)
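The namespace stripping and reindenting described above can be sketched with the standard library's ElementTree. This is an illustrative sketch of the cleanup, not this method's actual code:

```python
import xml.etree.ElementTree as ET

def strip_namespaces(tree):
    """Remove namespace parts in-place, e.g. '{http://...}explain' -> 'explain'."""
    for el in tree.iter():
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
        # attribute names can be namespaced too
        for name in list(el.attrib):
            if '}' in name:
                el.attrib[name.split('}', 1)[1]] = el.attrib.pop(name)
    return tree

xml = '<sru:explain xmlns:sru="http://www.loc.gov/zing/srw/"><sru:record/></sru:explain>'
root = strip_namespaces(ET.fromstring(xml))
ET.indent(root)  # reindent for readability (Python 3.9+)
print(ET.tostring(root, encoding='unicode'))
```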

def explain_parsed(self, timeout=10):

Does an explain operation, fishes out some interesting details, and returns those as a dict.

TODO: actually read the standard instead of assuming things.

def num_records(self):

Returns the number of records listed in the last search's results.

If you call it before doing a search_retrieve it will raise an error.

This function may change.

def search_retrieve(self, query: str, start_record=None, maximum_records=None, callback=None, verbose=False):

Fetches a small range of result records for the given query.

Exactly what each record contains will vary per repository, sometimes even per presumably-sensible-subset of records.

Returns a list of result records (each an ElementTree object), because in some cases the search records contain metadata that is not as easily fetched from the result documents themselves, and you may wish to decide how to fish it out.

Notes:

  • search_retrieve will update the number of matching records, which backs num_records()
  • strips namespaces from the results - this makes writing code more convenient

CONSIDER:

  • option to return the URL instead of searching
Parameters
query (str): The query string, in CQL form (see the Library of Congress spec). The set of indices you can search in (e.g. 'dcterms.modified>=2000-01-01') varies with each repo; take a look at explain_parsed() (a parsed summary) or explain() (the actual explain XML).
start_record: What record offset to start fetching at. Note: one-based counting.
maximum_records: How many records to fetch (from start_record). Note that repositories may not like high values here, so if you care about _all_ results of a possibly large set, you probably want to use search_retrieve_many() instead.
callback: If not None, this function calls it for each record node. Alternatively, you can wait for the entire range of fetches to conclude and be handed the complete list of result records.
verbose: whether to be even more verbose during this query
def search_retrieve_many(self, query: str, at_a_time: int = 10, start_record: int = 1, up_to: int = 250, callback=None, wait_between_sec: float = 0.5, verbose: bool = False):

Fetch _many_ results, in chunks, by calling search_retrieve() repeatedly, in part because there is typically a server-side limit on how many you can fetch at once.

You will often rely on either

  • it returning a list of all records (as ElementTree objects), which can be more convenient if you want to handle the results as a whole, but only happens at the very end
  • it calling a callback function on each individual record _during_ the fetching process, which can be a more convenient way of dealing with many results as they come in, especially for very large fetches
Parameters
query (str): Like in search_retrieve().
at_a_time (int): How many records to fetch in a single request.
start_record (int): Like in search_retrieve().
up_to (int): The last record to fetch, as an absolute offset, so e.g. start_record=200, up_to=250 gives you records 200..250, not 200..450.
callback: Like in search_retrieve().
wait_between_sec (float): A backoff sleep between each search request, to avoid hammering a server too much. You can lower this where you know that is overly cautious. Note that we skip this sleep if one fetch was enough.
verbose (bool): Whether to be even more verbose during this query.

Since we fetch in chunks, we may overshoot in the last fetch by up to at_a_time entries. The code should avoid returning those.

CONSIDER:

  • maybe yield something including numberOfRecords before yielding results?
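The chunked fetching described above amounts to walking the one-based, inclusive range in at_a_time-sized steps. A minimal sketch of that range arithmetic (this variant avoids overshoot by capping the request size rather than trimming afterwards; the helper name is an illustrative assumption):

```python
def chunk_ranges(start_record, up_to, at_a_time):
    """Yield (startRecord, maximumRecords) pairs for chunked SRU fetching.

    Offsets are one-based and up_to is an absolute, inclusive offset,
    matching search_retrieve_many()'s description. Illustrative sketch only.
    """
    start = start_record
    while start <= up_to:
        # never ask for more than remain, so the last chunk cannot overshoot up_to
        count = min(at_a_time, up_to - start + 1)
        yield start, count
        start += count

# e.g. records 1..25 in chunks of 10:
list(chunk_ranges(1, 25, 10))  # -> [(1, 10), (11, 10), (21, 5)]
```

Note that start_record=200, up_to=250 yields a single (200, 51) request, covering records 200..250 inclusive, as the up_to parameter describes.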
base_url =

The base URL that other things add to; added from instantiation.

extra_query =

Extra piece of query to AND into the query you do later. This lets us represent subsets of larger repositories.

number_of_records =

the number of results reported in the last query we did. None before you do a query. CONSIDER: changing that.

sru_version: str =

hardcoded to "1.2"

verbose =

whether to print out things while we do them.

x_connection =

The x_connection attribute that some of these need; added from instantiation.

def _url(self):

Combines the basic URL parts given to the constructor, and ensures there's a '?' (so you know you can append &k=v). This can probably go into the constructor, once I know how much is constant across SRU URLs...