package documentation

Fetch and load already-created datasets that we provide. (To see a list of the actual datasets, look at the wetsuite_datasets.ipynb notebook.)

As this is often structured data, each dataset may work a little differently, so there is a describe() to get you started, which each dataset should fill out.

TODO:

  • If we want updateable datasets (right now there is no plan for that), think more about the robustness around re-fetching indices. Decide whether it's cheap enough to fetch each time (but fall back onto a stored copy)?

From __init__.py:

Class Dataset
    If you're looking for details about a specific dataset, look at its .description
Function description
    Fetch the description field for a specifically named dataset. Simple, but less typing than picking it out yourself.
Function fetch_index
    Index is expected to be a list of dicts; see below for the expected keys.
Function generated_today_text
    Used when generating datasets
Function list_datasets
    Fetch the index and report dataset names _only_. If you also want the details, see fetch_index if you care about the data form, or print_dataset_summary if you want it printed on the console/notebook output.
Function load
    Takes a dataset name (that you learned of from the index), and downloads it if necessary - after the first time it's cached in your home directory
Function print_dataset_summary
    Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), a little less work than sifting through the dicts yourself, but only useful in notebooks or from the console...
Function _data_from_path
    Given a path to a data file, return the data in python-object form, and a description (based on contents). This wraps opening and dealing with the file type, and separates that from the download phase.
Function _load_bare
    Takes a dataset name (that you learned of from the index), and downloads it if necessary - after the first time it's cached in your home directory
Constant _INDEX_URL
    Undocumented
Variable _index_data
    Undocumented
Variable _index_fetch_no_more_often_than_sec
    Undocumented
Variable _index_fetch_time
    Undocumented
_INDEX_URL: str = 'https://wetsuite.knobs-dials.com/datasets/index.json'
    Undocumented

_index_data
    Undocumented

_index_fetch_time: int
    Undocumented

_index_fetch_no_more_often_than_sec: int
    Undocumented
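
A sketch of how these presumably fit together (an assumption based purely on the names; the real implementation, and the interval value below, may differ):

    import time, json, urllib.request

    _INDEX_URL = 'https://wetsuite.knobs-dials.com/datasets/index.json'
    _index_data = None
    _index_fetch_time = 0
    _index_fetch_no_more_often_than_sec = 3600   # made-up interval

    def fetch_index():
        ''' Fetch the dataset index, reusing a recently fetched copy
            so we do not hammer the server. '''
        global _index_data, _index_fetch_time
        now = time.time()
        if _index_data is None or (now - _index_fetch_time) > _index_fetch_no_more_often_than_sec:
            with urllib.request.urlopen(_INDEX_URL) as response:
                _index_data = json.loads(response.read())
            _index_fetch_time = now
        return _index_data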

def fetch_index():

Index is expected to be a list of dicts, each with keys including

  • url
  • version (should probably become semver)
  • description_short: one-line summary of what this is
  • description: longer description, perhaps with some example data
  • download_size: how much transfer you'll need
  • real_size: disk storage we expect to need once decompressed
  • download_size_human, real_size_human: more readable versions, e.g. where real_size might be the integer 397740, real_size_human would be 388KiB
  • type: content type of dataset

TODO: an example (a hypothetical one is sketched below)

CONSIDER: keep hosting generic (HTTP fetch?) so that any host will do.
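
To illustrate the expected shape, a hypothetical index entry (every value below is made up, not taken from the real index):

    {
        'url':                 'https://wetsuite.knobs-dials.com/datasets/example.db.xz',  # made-up URL
        'version':             '0.1',
        'description_short':   'One-line summary of what this is',
        'description':         'Longer description, perhaps with some example data',
        'download_size':       97740,        # bytes to transfer
        'real_size':           397740,       # bytes once decompressed
        'download_size_human': '95KiB',
        'real_size_human':     '388KiB',
        'type':                'application/x-example',   # made-up content type
    }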

def list_datasets():

Fetch the index and report dataset names _only_. If you also want the details, see fetch_index if you care about the data form, or print_dataset_summary if you want it printed on the console/notebook output.

Returns
a list of strings, e.g. ['bwb-mostrecent-xml', 'woo_besluiten_docs_text']
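
A minimal usage sketch, assuming the package is importable as wetsuite.datasets:

    import wetsuite.datasets

    for name in wetsuite.datasets.list_datasets():
        print(name)     # e.g. 'bwb-mostrecent-xml'
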
def print_dataset_summary():

Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), a little less work than sifting through the dicts yourself, but only useful in notebooks or from the console.

def description(dataset_name):

Fetch the description field, for a specifically named dataset. Simple, but less typing than picking it out yourself.

Parameters
    dataset_name: str
        Undocumented
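
For example (same assumed import as above; the dataset name is one that list_datasets() is documented to return):

    print( wetsuite.datasets.description('bwb-mostrecent-xml') )
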
def _data_from_path(data_path):

Given a path to a data file, return the data in python-object form, and a description (based on contents). This wraps opening and dealing with the file type, and separates that from the download phase.

def _load_bare(dataset_name, verbose=None, force_refetch=False, check_free_space=True):

Takes a dataset name (that you learned of from the index), and downloads it if necessary - after the first time it's cached in your home directory.

If compressed, it will uncompress. It does not think about the type of data.

Note: You normally would use load(), which takes the same name but gives you a usable object, instead of just a filename.

Parameters
    dataset_name: str
        Undocumented
    verbose
        Undocumented
    force_refetch
        Undocumented
    check_free_space
        Undocumented
Returns
the filename we fetched to
def load(dataset_name, verbose=None, force_refetch=False, check_free_space=True):

Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory

Wraps _load_bare, which does most of the heavy lifting.

This primarily adds what is necessary to load that downloaded thing and give it to you as a usable Dataset object

Parameters
    dataset_name: str
        Undocumented
    verbose
        Tells you more about the download (on stderr). Can be given True or False. By default (None), we try to detect whether we are in an interactive context, and print only if we are.
    force_refetch
        Whether to remove the current contents before fetching. Dataset naming should prevent the need for this (except if you're the wetsuite programmer).
    check_free_space
        Undocumented
Returns

a Dataset object - which is a container object with little more than

  • a .description (a string)
  • a .data member, some kind of iterable of items. The .description should mention what .data will contain and should give an example of how to use it.
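
A minimal usage sketch (dataset name illustrative, import path assumed as before):

    import wetsuite.datasets

    ds = wetsuite.datasets.load('bwb-mostrecent-xml')
    print( ds.description )     # should explain what .data contains and how to use it
    for item in ds.data:        # some kind of iterable of items
        print( item )
        break
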
def generated_today_text():

Used when generating datasets

Returns
a string like 'This dataset was generated on 2024-02-02'
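
Presumably little more than formatting today's date (a sketch based on the documented return value, not the actual implementation):

    import datetime

    def generated_today_text():
        # assumed: ISO-format today's date into the documented sentence
        return 'This dataset was generated on %s' % datetime.date.today().isoformat()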