module documentation

Code to help fetch things from https://www.rijksoverheid.nl/documenten

Function scrape_pagination Go through the pagination for a specific document type, calling a callback for each item's detail page URL.
Variable doctypes Undocumented
Variable ministeries Undocumented
def scrape_pagination(doctype, detail_page_callback, from_date=None, to_date=None, debug=False):

Go through the pagination for a specific document type, calling a callback for each item's detail page URL.

What to do with the result is still up to you: you implement a detail_page_callback that gets the URL. There is a notebook with some examples.
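As a rough illustration of such a callback (a sketch, not taken from the notebook): the function below just collects the detail page URLs into a list. The helper name make_collector and the base_url default are hypothetical, and whether the URL handed to the callback is absolute or relative is an assumption this sketch sidesteps by passing it through urljoin, which handles both.

```python
from urllib.parse import urljoin


def make_collector(base_url="https://www.rijksoverheid.nl"):
    """Return (urls, callback); the callback appends each detail page URL to urls.

    make_collector and base_url are illustrative names, not part of the module.
    """
    urls = []

    def detail_page_callback(soup_fragment, detail_url):
        # The first argument (the soup fragment from the pagination page)
        # is ignored here; only the detail page URL is kept.
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        urls.append(urljoin(base_url, detail_url))

    return urls, detail_page_callback


# Calling the callback by hand, the way scrape_pagination would call it per item:
urls, callback = make_collector()
callback(None, "/documenten/example-item")  # made-up path, for illustration only
```

The returned callback can then be passed as the detail_page_callback argument; fetching and parsing each collected URL is left to a later step.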

As of this writing, we work around a flaw that has probably been corrected since; TODO: describe, check, remove?

This should take on the _order of magnitude_ of dozens of minutes per thousand items, mostly because of the backoff to be nice to the server.

This function hardcodes some delays, to not be rude to the server. We could make that async.

Parameters
doctype Undocumented
detail_page_callback

This is called for each item. It should accept two arguments:

  • the soup fragment for that item on the pagination page (you can often ignore this)
  • the item's detail page URL
from_date Start of date range to fetch. When from_date and to_date are not given, the range defaults to the last four weeks, counted from call time. If given, both should be a date or datetime. This is in part because, if you want to fetch _everything_ from the servers, we make you be explicit about it.
to_date End of date range to fetch (if from_date is also given)
debug Undocumented
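To make the date range explicit rather than relying on the default, one option is to build both ends yourself. The sketch below is only an illustration: the helper last_weeks is hypothetical, and it is an assumption that "the last four weeks" means 28 days back from today; the function's own default may be computed differently.

```python
from datetime import date, timedelta


def last_weeks(weeks=4, today=None):
    """Return (from_date, to_date) covering the given number of weeks up to today.

    last_weeks is an illustrative helper, not part of the module.
    The today parameter exists only so the result can be made reproducible.
    """
    to_date = today if today is not None else date.today()
    from_date = to_date - timedelta(weeks=weeks)
    return from_date, to_date


from_date, to_date = last_weeks(4)

# Both ends are passed together, as the parameter notes require.
# some_doctype and my_callback are placeholders for your own values:
# scrape_pagination(some_doctype, my_callback, from_date=from_date, to_date=to_date)
```

Passing a much earlier from_date is how you would ask for a larger slice of the archive, at the cost of a proportionally longer run.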
doctypes: list =

Undocumented

ministeries: list =

Undocumented