class documentation

class FRBRFetcher:

Helper class to fetch data from an area of https://repository.overheid.nl/frbr/ (see the constructor's docstring for more). In theory we could use the bulk service to do functionally the same, which would be more efficient for both sides, yet almost no SSH/SFTP client seems able to negotiate with the way they have configured it (SFTP imitating anonymous FTP, which is a great idea in theory).
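
A minimal usage sketch. The store construction below is an assumption, for illustration only; substitute however you create LocalKV-style stores in your setup:

    fetch_store = LocalKV('frbr_documents.db')   # hypothetical construction; documents end up here
    cache_store = LocalKV('frbr_folders.db')     # hypothetical construction; folder listings end up here

    fetcher = FRBRFetcher(fetch_store, cache_store, verbose=True, waittime_sec=1.0)
    fetcher.add_page('https://repository.overheid.nl/frbr/cga?start=1')

    for _ in fetcher.work():   # a generator; yields frequently until there is nothing left
        pass                   # you could read fetcher.count_* here for progress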

Method __init__ Hand in two LocalKV-style stores.
Method add_folder Add a URL to an internal "folders to still look at" set (unless it was previously added / fetched). Mostly intended to be used by handle_url().
Method add_page Add a URL to an internal "pages to still look at" set (unless it was previously added / fetched). Mostly intended to be used by handle_url().
Method cached_folder_fetch Cache-backed fetch (backed by the cache_store you handed into the constructor).
Method handle_url Handle a URL that we consider to be either a page or a folder.
Method uncached_fetch Unconditional fetch from a URL.
Method work A generator that yields fairly frequently during its task, mainly so that you can show something like a progress bar.
Instance Variable cache_store The LocalKV-style store that intermediate folder listings get fetched into.
Instance Variable count_cacheds Counter: fetches answered from the cache store rather than the network.
Instance Variable count_dupadd Counter: URLs that were added again after already being added / fetched.
Instance Variable count_errors Counter: fetches that failed.
Instance Variable count_fetches Counter: actual network fetches.
Instance Variable count_folders Counter: URLs handled as folders.
Instance Variable count_items Counter: document items handled.
Instance Variable count_pages Counter: URLs handled as pages.
Instance Variable count_skipped Counter: URLs skipped, e.g. because they were already fetched.
Instance Variable fetch_store The LocalKV-style store that documents get fetched into.
Instance Variable fetched URLs we have already fetched, used to avoid adding or fetching them again.
Instance Variable to_fetch_folders The internal "folders to still look at" set.
Instance Variable to_fetch_pages The internal "pages to still look at" set.
Instance Variable verbose Whether to print progress details.
Instance Variable waittime_sec How long to sleep after every actual network fetch, to be nicer to the servers.
def __init__(self, fetch_store, cache_store, verbose=True, waittime_sec=1.0):

Hand in two LocalKV-style stores:

  • one that the documents will get fetched into (almost all useful content),

  • one that the intermediate folders get fetched into (mostly pointless outside of this fetcher).

After this you will want to:

  • hand in a starting point, like:

        fetcher.add_page( 'https://repository.overheid.nl/frbr/cga?start=1' )

  • use fetcher.work() to start it fetching.

      • work() is a generator function that tries to yield frequently, so that you can read out some "what have we done" counters (see .count_*)
      • if you just want it to run until it is done, you can do `list( fetcher.work() )`

It will only go deeper from the starting point you give it.

Parameters
fetch_store
    The LocalKV-style store that documents get fetched into.
cache_store
    The LocalKV-style store that intermediate folder listings get fetched into.
verbose
    Whether to print progress details.
waittime_sec
    How long to sleep after every actual network fetch, to be nicer to the servers.
def add_folder(self, folder_url):

Add a URL to an internal "folders to still look at" set (unless it was previously added / fetched). Mostly intended to be used by handle_url().

def add_page(self, page_url):

Add a URL to an internal "pages to still look at" set (unless it was previously added / fetched). Mostly intended to be used by handle_url().
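
For illustration, a sketch of seeding; re-adding is harmless, since previously added / fetched URLs are ignored (the folder URL below is hypothetical):

    fetcher.add_page('https://repository.overheid.nl/frbr/cga?start=1')
    fetcher.add_page('https://repository.overheid.nl/frbr/cga?start=1')   # duplicate: ignored (presumably counted in count_dupadd)
    fetcher.add_folder('https://repository.overheid.nl/frbr/cga/2024')    # hypothetical folder URL, for illustration only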

def cached_folder_fetch(self, url, retries=3):

Cache-backed fetch of a folder listing (backed by the cache_store you handed into the constructor).

def handle_url(self, h_url, is_folder=False):

Handle a URL that we consider to be either a page or a folder.

def uncached_fetch(self, url, retries=3):

Unconditional fetch from a URL.

def work(self):

This is a generator so that it can yield fairly frequently during its task, mainly so that you can do something like a progress bar. The simplest use is probably something like:

    for _ in fetcher.work():
        pass   # (you could access and print counters)

The yielded value is not functionally meaningful and may be a dummy string, though parts of the code try to make it something you might want to display.
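
For example, a sketch of reading out the counters every so many yields (which counters you display is up to you):

    for i, _ in enumerate(fetcher.work()):
        if i % 100 == 0:   # print a brief progress report every 100 yields
            print('fetches: %d  cached: %d  errors: %d  pages: %d  folders: %d' % (
                fetcher.count_fetches, fetcher.count_cacheds, fetcher.count_errors,
                fetcher.count_pages, fetcher.count_folders))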

Note that there is actually no real distinction between what this class calls folders and pages, but given the way the repositories are structured, pretending there is helps us recurse into what we added as folders before what we added as pages. That lets us act depth-first-like, and e.g. start fetching documents before we have gone through all pages, which seems like a good idea when some areas have 100K pages and there are reasons our fetching gets interrupted.

On a similar note, we cache the things we consider folders, not the things we consider pages, because we assume that a level with many things on it is the level (or at least a level) at which things get added over time. As such, you can make the folder store persistent, and it saves _some_ time when updating a local copy.
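
Under that assumption, a short sketch of what a repeat run looks like, reusing the same persistent stores as a previous run (store construction not shown, as it is not part of this class):

    fetcher = FRBRFetcher(fetch_store, cache_store)   # same (persistent) stores as before
    fetcher.add_page('https://repository.overheid.nl/frbr/cga?start=1')
    list(fetcher.work())   # already-cached folder listings come from cache_store, not the network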

cache_store =

The LocalKV-style store that intermediate folder listings get fetched into (the second constructor argument).

count_cacheds: int =

Counter: fetches answered from the cache store rather than the network.

count_dupadd: int =

Counter: URLs that were added again after already being added / fetched.

count_errors: int =

Counter: fetches that failed.

count_fetches: int =

Counter: actual network fetches.

count_folders: int =

Counter: URLs handled as folders.

count_items: int =

Counter: document items handled.

count_pages: int =

Counter: URLs handled as pages.

count_skipped: int =

Counter: URLs skipped, e.g. because they were already fetched.

fetch_store =

The LocalKV-style store that documents get fetched into (the first constructor argument).

fetched: dict =

URLs we have already fetched, used to avoid adding or fetching them again.

to_fetch_folders =

The internal "folders to still look at" set.

to_fetch_pages =

The internal "pages to still look at" set.

verbose =

Whether to print progress details.

waittime_sec =

How long to sleep after every actual network fetch, to be nicer to the servers.