class SRUBase:
Known subclasses: wetsuite.datacollect.koop_sru.BWB, wetsuite.datacollect.koop_sru.CVDR, wetsuite.datacollect.koop_sru.EuropeseRichtlijnen, wetsuite.datacollect.koop_sru.LokaleBekendmakingen, wetsuite.datacollect.koop_sru.OfficielePublicaties, wetsuite.datacollect.koop_sru.PLOOI, wetsuite.datacollect.koop_sru.PUCOpenData, wetsuite.datacollect.koop_sru.SamenwerkendeCatalogi, wetsuite.datacollect.koop_sru.StatenGeneraalDigitaal, wetsuite.datacollect.koop_sru.TuchtRecht, wetsuite.datacollect.koop_sru.WetgevingsKalender
Constructor: SRUBase(base_url, x_connection, extra_query, verbose)
Very minimal SRU implementation - just enough to access the KOOP repositories.
| Method | __init__ | No summary |
| Method | explain | Does an explain operation; returns the XML |
| Method | explain_parsed | Does an explain operation, fishes out some interesting details, returns that as a dict |
| Method | num_records | Returns the number of records listed in the last search's results |
| Method | search_retrieve | Fetches a small range of result records for the given query |
| Method | search_retrieve_many | Fetch _many_ results, in chunks, by calling search_retrieve() repeatedly, in part because there is typically a server-side limit on how many you can fetch at once |
| Instance Variable | base_url | The base URL that other things add to; added from instantiation |
| Instance Variable | extra_query | Extra piece of query to AND into the query you do later. This lets us represent subsets of larger repositories |
| Instance Variable | number_of_records | The number of results reported in the last query we did. None before you do a query. CONSIDER: changing that |
| Instance Variable | sru_version | Hardcoded to "1.2" |
| Instance Variable | verbose | Whether to print out things while we do them |
| Instance Variable | x_connection | The x_connection attribute that some of these need; added from instantiation |
| Method | _url | Combines the basic URL parts given to the constructor, and ensures there's a '?' (so you know you can add &k=v). This can probably go into the constructor, when I know how much is constant across SRU URLs... |
def __init__(self, base_url: str, x_connection: str = None, extra_query: str = None, verbose=False):
| Parameters | |
| base_url: str | The base URL that other things add to. Basically everything up to the '?' |
| x_connection: str | An attribute that some of these need in the URL. Seems to be non-standard, and required for these repos |
| extra_query: str | Lets us AND something into the query, intended to restrict to a subset of documents. This lets us represent subsets of larger repositories (somewhat related to x_connection) |
| verbose | Whether to print out things while we do them |
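For orientation, a minimal sketch of constructing an instance directly. The endpoint URL and x_connection value below are hypothetical placeholders; in practice you would usually instantiate one of the subclasses listed above, which presumably fill in these repository-specific details for you.

```python
from wetsuite.datacollect.koop_sru import SRUBase

# Hypothetical endpoint details, purely for illustration -- the listed
# subclasses (BWB, CVDR, ...) exist so you don't have to supply these yourself.
sru = SRUBase(
    base_url="https://example.org/sru",  # everything up to the '?'
    x_connection="examplecollection",    # non-standard, but these repos want it
    extra_query=None,                    # optionally ANDed in, to restrict to a subset
    verbose=True,                        # print what we do, while we do it
)
```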
Does an explain operation; returns the XML.
- if readable==False, it returns it as-is
- if readable==True (default), it will ease human readability:
- strips namespaces
- reindents
The XML is a unicode string (for consistency with other parts of this codebase)
Does an explain operation, fishes out some interesting details, returns that as a dict.
TODO: actually read the standard instead of assuming things.
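A short illustrative sketch of using explain() and explain_parsed() together, e.g. to see what a repository says about itself before you write a query. Here `sru` is any instance of this class or a subclass, as constructed above.

```python
# Human-readable explain XML: namespaces stripped, reindented
print(sru.explain(readable=True))

# The parsed summary of some interesting details, as a dict
for key, value in sru.explain_parsed().items():
    print(key, ":", value)
```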
Returns the number of records listed in the last search's results.
If you call it before doing a search_retrieve it will raise an error.
This function may change.
def search_retrieve(self, query: str, start_record=None, maximum_records=None, callback=None, verbose=False):
Fetches a small range of result records for the given query.
Exactly what each record contains will vary per repository, sometimes even per presumably-sensible-subset of records.
Returns a list of result records (each an ElementTree object -- because in some cases the search records contain metadata that is not as easily fetched from the result documents themselves, and you may wish to decide how to fish it out).
Notes:
- search_retrieve will update the number of matching records, which backs num_records()
- strips namespaces from the results - this makes writing code more convenient
CONSIDER:
- option to return the URL instead of searching
| Parameters | |
| query: str | The query string, in CQL form (see the Library of Congress spec). The list of indices you can search in (e.g. 'dcterms.modified>=2000-01-01') varies with each repo; take a look at explain_parsed() (a parsed summary) or explain() (the actual explain XML) |
| start_record | What record offset to start fetching at. Note: one-based counting |
| maximum_records | How many records to fetch (from start_record). Note that repositories may not like high values here, so if you care about _all_ results of a possibly-large set, you probably want to use search_retrieve_many() instead |
| callback | If not None, this function calls it for each such record node. You can instead wait for the entire range of fetches to conclude and hand you the complete list of result records |
| verbose | whether to be even more verbose during this query |
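A sketch of a single small fetch, reusing the dcterms.modified example from the parameter description above. Whether that index exists depends on the repository, so treat the query itself as illustrative; num_records() here refers to the count method from the member table near the top of this page.

```python
records = sru.search_retrieve(
    "dcterms.modified>=2000-01-01",  # CQL; valid indices vary per repository
    start_record=1,                  # one-based counting
    maximum_records=10,              # keep this modest; servers may refuse large values
)

print(sru.num_records())  # how many matches the server reported for this query
for record in records:    # each record is an ElementTree node, namespaces stripped
    print(record.tag)
```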
def search_retrieve_many(self, query: str, at_a_time: int = 10, start_record: int = 1, up_to: int = 250, callback=None, wait_between_sec: float = 0.5, verbose: bool = False):
Fetch _many_ results, in chunks, by calling search_retrieve() repeatedly, in part because there is typically a server-side limit on how many you can fetch at once.
You will often rely on either
- it returning a list of all records (as ElementTree objects) ...which can be more convenient if you want to handle the results as a whole, but that only happens at the very end
- it calling a callback function on each individual record _during_ the fetching process ...which can be a more convenient way of dealing with many results while they come in, especially when dealing with very large fetches
| Parameters | |
| query: str | Like in search_retrieve() |
| at_a_time: int | How many records to fetch in a single request |
| start_record: int | Like in search_retrieve() |
| up_to: int | The last record to fetch, as an absolute offset, so e.g. start_record=200, up_to=250 gives you records 200..250, not 200..450 |
| callback | Like in search_retrieve() |
| wait_between_sec: float | A backoff sleep between each search request, to avoid hammering a server too much. You can lower this where you know it is overly cautious. Note that we skip this sleep if one fetch was enough |
| verbose: bool | Whether to be even more verbose during this query |
Since we fetch in chunks, we may overshoot in the last fetch by up to at_a_time entries; the code should avoid returning those.
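A sketch of a larger, chunked fetch that handles each record as it arrives via a callback. The numbers are just the documented defaults, repeated here for illustration, and the query is the same illustrative example as before.

```python
def handle(record):
    # Called once per result record, while the chunked fetching is still going on
    print(record.tag)

sru.search_retrieve_many(
    "dcterms.modified>=2000-01-01",  # CQL, like in search_retrieve()
    at_a_time=10,                    # records per request
    start_record=1,                  # one-based, like in search_retrieve()
    up_to=250,                       # absolute offset of the last record to fetch
    callback=handle,
    wait_between_sec=0.5,            # polite backoff between requests
)
```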
Extra piece of query to add to the query you do later. This lets us represent subsets of larger repositories.
the number of results reported in the last query we did. None before you do a query. CONSIDER: changing that.