package documentation

Fetch and load already-created datasets that we provide.

You may also be interested in the example notebooks that introduce most datasets.

As this is often structured data, each dataset may work a little differently, so there is a describe() to get you started - each dataset should fill that out.

Note that these datasets are separate from the code, so availability may change.

From __init__.py:

Class Dataset If you're looking for details about a specific dataset, look at its .description
Function description Fetch the description field for a specifically named dataset. Simple, but less typing than picking it out of the index yourself.
Function fetch_index Fetch the index, which is expected to be a list of dicts, each with keys detailed below.
Function generated_today_text Used when generating datasets
Function list_datasets Fetch index, report (only) dataset names.
Function load Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory
Function print_dataset_summary Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), a little less work than sifting through the dicts for each yourself, but only useful in notebooks or from the console...
Function _data_from_path Given a path to a data file, return the data in python-object form -- and a description (based on contents). This wraps opening and dealing with file type, and separates that from the download phase.
Function _load_bare Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory
Constant _INDEX_URL Undocumented
Variable _index_data Undocumented
Variable _index_fetch_no_more_often_than_sec Undocumented
Variable _index_fetch_time Undocumented
def description(dataset_name: str):

Fetch the description field for a specifically named dataset. Simple, but less typing than picking it out of the index yourself.
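
For example (hypothetical usage; this assumes the module is importable as wetsuite.datasets):

    import wetsuite.datasets

    # dataset name as reported by list_datasets()
    print( wetsuite.datasets.description('woo_besluiten_docs_text') )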

def fetch_index():

Index is expected to be a list of dicts, each with keys including

  • url
  • version (should probably become semver)
  • description_short one-line summary of what this is
  • description longer description, perhaps with some example data
  • download_size how much transfer you'll need
  • real_size Disk storage we expect to need once decompressed
  • download_size_human, real_size_human: more readable versions, e.g. where real_size might be the integer 397740, the human size would be '388KiB'
  • type content type of dataset
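
As an illustration, a single index entry might look something like the following sketch (keys from the list above; the values here are made up):

    {
        'url':                 'https://wetsuite.knobs-dials.com/datasets/example.json.xz',  # made-up path
        'version':             '1.0',
        'description_short':   'One-line summary of what this is',
        'description':         'Longer description, perhaps with some example data',
        'download_size':       102400,
        'real_size':           397740,
        'download_size_human': '100KiB',
        'real_size_human':     '388KiB',
        'type':                'application/json',
    }
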
def generated_today_text():

Used when generating datasets

Returns
a string like 'This dataset was generated on 2024-02-02'
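
A minimal sketch of what this helper plausibly does, given the documented return format (the implementation shown is an assumption):

    import datetime

    def generated_today_text():
        # format taken from the Returns description above
        return 'This dataset was generated on %s' % datetime.date.today().isoformat()
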
def list_datasets():

Fetch index, report (only) dataset names.

If you care about the details in data form, use fetch_index.

If you care about the details in a console or notebook, see print_dataset_summary.

Returns
a list of strings, e.g. ['bwb-mostrecent-xml','woo_besluiten_docs_text']
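
Hypothetical usage (again assuming the module is importable as wetsuite.datasets):

    import wetsuite.datasets

    for name in wetsuite.datasets.list_datasets():
        print( name )
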
def load(dataset_name: str, verbose=None, force_refetch=False, check_free_space=True):

Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory

Wraps _load_bare, which does most of the heavy lifting.

This primarily adds what is necessary to load that downloaded thing and give it to you as a usable Dataset object

Parameters
dataset_name: str; Undocumented
verbose: tells you more about the download (on stderr). Can be given True or False. By default (None), we try to detect whether we are in an interactive context, and print only if we are.
force_refetch: whether to remove the current contents before fetching. Dataset naming should prevent the need for this (except if you're the wetsuite programmer).
check_free_space: Undocumented
Returns

a Dataset object - which is a container object with little more than

  • a .description (a string)
  • a .data member, some kind of iterable of items. The .description should mention what .data will contain and should give an example of how to use it.
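
Hypothetical usage, tying the above together (the dataset name is one that might appear in list_datasets() output):

    import wetsuite.datasets

    ds = wetsuite.datasets.load('woo_besluiten_docs_text')
    print( ds.description )     # should explain what .data contains and how to use it
    for item in ds.data:        # .data is some kind of iterable of items
        pass                    # the item type varies per dataset; see its .description
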
def print_dataset_summary():

Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), a little less work than sifting through the dicts for each yourself, but only useful in notebooks or from the console
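
For example:

    import wetsuite.datasets

    wetsuite.datasets.print_dataset_summary()   # one short summary per dataset, on stdout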

def _data_from_path(data_path):

Given a path to a data file, return the data in python-object form -- and a description (based on contents). This wraps opening and dealing with file type, and separates that from the download phase.

def _load_bare(dataset_name: str, verbose=None, force_refetch=False, check_free_space=True):

Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory

If compressed, will uncompress. Does not think about the type of data

Note: You normally would use load(), which takes the same name but gives you a usable object, instead of just a filename.

Returns
the filename we fetched to
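
A sketch of how load() presumably composes these two helpers (that _data_from_path returns a (data, description) pair is an assumption based on its docstring):

    data_path = _load_bare('some-dataset')            # fetch, or reuse the cached copy; returns a filename
    data, description = _data_from_path(data_path)    # parse that file into python-object form
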
_INDEX_URL: str =

Undocumented

Value
'https://wetsuite.knobs-dials.com/datasets/index.json'
_index_data =

Undocumented

_index_fetch_no_more_often_than_sec: int =

Undocumented

_index_fetch_time: int =

Undocumented