wetsuite.datacollect.rijksoverheid_nl

module documentation

Code to help fetch things from https://www.rijksoverheid.nl/documenten

Function	`scrape_pagination`	Goes through the pagination for a specific document type, calling a callback for each item's detail page URL.
Variable	`doctypes`	Undocumented
Variable	`ministeries`	Undocumented

def scrape_pagination(doctype, detail_page_callback, from_date=None, to_date=None, debug=False):

Goes through the pagination for a specific document type, calling a callback for each item's detail page URL.

What to do with the result is still up to you: you implement a detail_page_callback that gets the URL. There is a notebook with some examples.
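As a minimal sketch of such a callback (not taken from the notebook): the function below ignores the soup fragment and records each detail page URL once. The name `collect_detail_url`, the two module-level containers, and the example.org URLs are all made up for illustration. It also assumes the two arguments are passed positionally, fragment first.

```python
# Hypothetical containers for this sketch, not part of the wetsuite module.
seen_urls = set()    # used to skip a URL we have already recorded
detail_urls = []     # URLs in the order they were first seen


def collect_detail_url(soup_fragment, detail_url):
    """Hypothetical detail_page_callback: remember each detail page URL once.

    Assumes positional arguments: the soup fragment first (ignored here),
    then the detail page URL.
    """
    if detail_url in seen_urls:
        return
    seen_urls.add(detail_url)
    detail_urls.append(detail_url)


# Simulate what the scraper would do, using placeholder URLs
# instead of real detail pages.
collect_detail_url(None, "https://example.org/a")
collect_detail_url(None, "https://example.org/b")
collect_detail_url(None, "https://example.org/a")  # duplicate, skipped
```

A callback like this only gathers URLs. Fetching and storing the detail pages themselves could be done afterwards, or inside the callback.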

As of this writing, we work around a flaw that has probably been corrected since; TODO: describe, check, remove?

Expect this to take on the order of dozens of minutes per thousand items, mostly because of the backoff used to be nice to the server.

This function hardcodes some delays, to not be rude to the server. We could make that async.

Parameters
doctype	Undocumented
detail_page_callback	this is called for each item. It should accept two arguments: the soup fragment for the item on the pagination page (you can often ignore this), and the detail page URL.
from_date	Start of date range to fetch. When from_date and to_date are not given, the range defaults to the last four weeks, counted from call time. If given, both should be a date or datetime. This is in part because, if you want to fetch _everything_ from the servers, we make you be explicit about it.
to_date	End of date range to fetch (if from_date is also given)
debug	Undocumented
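
The date parameters above could be used along these lines. This is a sketch under assumptions: `date_range_ending_today` and `fetch_range` are hypothetical helpers, not part of the module, and the `doctype` value is left to the caller because the accepted values are not documented here.

```python
import datetime


def date_range_ending_today(days, today=None):
    """Hypothetical helper: return (from_date, to_date) covering the last `days` days."""
    if today is None:
        today = datetime.date.today()
    return today - datetime.timedelta(days=days), today


def fetch_range(doctype, detail_page_callback, days):
    """Hypothetical wrapper: call scrape_pagination with an explicit date range.

    The import is done lazily so that the helper above can be used
    without the wetsuite package installed. Calling this function does
    network requests and, per the notes above, can take a long time.
    """
    from wetsuite.datacollect import rijksoverheid_nl

    from_date, to_date = date_range_ending_today(days)
    rijksoverheid_nl.scrape_pagination(
        doctype,
        detail_page_callback,
        from_date=from_date,   # both dates are given together,
        to_date=to_date,       # as the parameter notes ask
    )
```

Passing both dates explicitly keeps the fetched range visible at the call site, rather than relying on the four-week default.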

doctypes: list =

Undocumented

ministeries: list =

Undocumented

wetsuite.datacollect.rijksoverheid_nl_documenten