class FRBRFetcher:
Helper class to fetch data from an area of https://repository.overheid.nl/frbr/. See the constructor's docstring for more. In theory we could use the bulk service to do functionally the same thing, which would be more efficient for both sides, yet almost no SSH tool seems able to negotiate with the way they configured it (SFTP imitating anonymous FTP, which is a great idea in theory).
Method | __init__ | Hand in two LocalKV-style stores; see the constructor docstring below.
Method | add | Add a URL to an internal "folders to still look at" set (unless it was previously added or fetched). Mostly intended to be used by handle_url().
Method | add_page | Add a URL to an internal "pages to still look at" set (unless it was previously added or fetched). Mostly intended to be used by handle_url().
Method | cached | Cache-backed fetch (from the first store you handed into the constructor).
Method | handle_url | Handle a URL that should be what we consider either a page or a folder.
Method | uncached | Unconditional fetch from a URL.
Method | work | Generator that yields fairly frequently, mainly so that you can do something like a progress bar.
Instance Variable | cache | Undocumented
Instance Variable | count | Undocumented
Instance Variable | fetch | Undocumented
Instance Variable | fetched | Undocumented
Instance Variable | to | Undocumented
Instance Variable | verbose | Undocumented
Instance Variable | waittime | Undocumented
Hand in two LocalKV-style stores:
- one that the documents will get fetched into (almost all useful content),
- one that the intermediate folders get fetched into (mostly pointless outside of this fetcher).
After this you will want to:
- hand in a starting point, like:
fetcher.add_page( 'https://repository.overheid.nl/frbr/cga?start=1' )
- use fetcher.work() to start it fetching.
- work() is a generator function that tries to return frequently, so that you can read out some "what have we done" counters (see .count_*).
- if you just want it to run until it's done, you can do `list( fetcher.work() )`.
It will only go deeper from the starting page you give it.
Parameters | |
fetch | |
cache | |
verbose | |
waittime | How long to sleep after every actual network fetch, to be nicer to the servers. |
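Putting the constructor and work() notes together, a minimal sketch of the calling pattern. `StubFetcher` below is a stand-in written for illustration only (the real FRBRFetcher does actual network fetches; the parameter names `fetch`, `cache`, `verbose`, and `waittime` are taken from the table above):

```python
import time

class StubFetcher:
    """Illustrative stand-in mimicking FRBRFetcher's calling pattern (not the real class)."""
    def __init__(self, fetch, cache, verbose=False, waittime=0.0):
        self.fetch = fetch          # store that documents end up in
        self.cache = cache          # store that intermediate folder listings end up in
        self.waittime = waittime
        self._to_visit = []
    def add_page(self, url):
        self._to_visit.append(url)
    def work(self):
        while self._to_visit:
            url = self._to_visit.pop()
            self.fetch[url] = b'...document bytes...'   # stands in for a real network fetch
            time.sleep(self.waittime)                   # be nicer to the servers
            yield url                                   # yields often; handy for progress bars

doc_store, folder_store = {}, {}
fetcher = StubFetcher(fetch=doc_store, cache=folder_store, waittime=0.0)
fetcher.add_page('https://repository.overheid.nl/frbr/cga?start=1')
for _ in fetcher.work():
    pass   # you could print progress counters here
```

Any plain dict works as a stand-in store here; the real LocalKV stores presumably persist to disk.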
Add a URL to an internal "folders to still look at" set (unless it was previously added or fetched). Mostly intended to be used by handle_url().
Add a URL to an internal "pages to still look at" set (unless it was previously added or fetched). Mostly intended to be used by handle_url().
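The "unless it was previously added or fetched" behaviour amounts to something like the following sketch (the function and attribute names are illustrative, not FRBRFetcher's actual internals):

```python
def add_page(to_visit_pages, fetched, url):
    """Queue url for fetching, unless we queued or fetched it before (a sketch)."""
    if url in to_visit_pages or url in fetched:
        return False          # already known; nothing to do
    to_visit_pages.add(url)
    return True

pages, done = set(), set()
first  = add_page(pages, done, 'https://example.invalid/?start=1')  # queued
second = add_page(pages, done, 'https://example.invalid/?start=1')  # deduplicated
```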
This is a generator so that it can yield fairly frequently during its task, mainly so that you can do something like a progress bar. The simplest use is probably something like:
for _ in fetcher.work(): pass # (you could access and print counters here)
The yielded value is not functional to anything and may be a dummy string, though parts try to make it something you might want to display.
Note that there is actually no real distinction between what this class calls folders and pages. Given the way the repositories are structured, though, pretending there is helps us recurse into what we added as folders before what we added as pages, so that we act depth-first-like and will e.g. start fetching documents before we have gone through all pages (which seems like a good idea when some things have 100K pages, and there are reasons our fetching gets interrupted).
On a similar note, we cache things we consider folders, not things we consider pages, because we assume that many things at one level means it is the level (or at least a level) at which things get added over time. As such, you can make the folder store persistent, and it saves _some_ time when updating a local copy.
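The ordering described above can be sketched as two queues where folders always win (the names here are illustrative, not FRBRFetcher's actual attributes):

```python
from collections import deque

def next_url(to_visit_folders, to_visit_pages):
    """Pick the next URL to handle: newest folder first (depth-first-like), then pages in order."""
    if to_visit_folders:
        return to_visit_folders.pop()      # LIFO: descend into the most recently found folder
    if to_visit_pages:
        return to_visit_pages.popleft()    # FIFO: long page listings wait their turn
    return None

folders = ['https://example.invalid/frbr/x/expression/1']
pages = deque(['https://example.invalid/frbr/x?start=2'])
url = next_url(folders, pages)   # the folder is handled before the next page listing
```

Treating the folder queue as a stack is what makes the traversal depth-first-like, so documents start arriving long before a 100K-page listing is exhausted.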