package documentation

Fetch and load already-created datasets that we provide. (To see a list of the actual datasets, look at the wetsuite_datasets.ipynb notebook.)

As this is often structured data, each dataset may work a little differently, so there is a describe() to get you started, which each dataset should fill out.

TODO:

  • If we want updateable datasets (right now there is no plan for that), think more about the robustness around re-fetching indices. Decide whether it's cheap enough to fetch each time (but fall back onto a stored copy)?

From __init__.py:

Class Dataset
    If you're looking for details about a specific dataset, look at its .description
Function description
    Fetch the description field for a specifically named dataset. Simple, but less typing than picking it out yourself.
Function fetch_index
    Index is expected to be a list of dicts; see below for the expected keys.
Function generated_today_text
    Used when generating datasets
Function list_datasets
    Fetch the index and report dataset names _only_. If you also want the details, see fetch_index if you care about the data form, or print_dataset_summary if you want it printed on the console/notebook output.
Function load
    Takes a dataset name (that you learned of from the index), and downloads it if necessary - after the first time it's cached in your home directory
Function print_dataset_summary
    Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), a little less work than sifting through the dicts yourself, but only useful in notebooks or from the console...
Function _data_from_path
    Given a path to a data file, return the data in python-object form, and a description (based on contents). This wraps opening and dealing with the file type, and separates that from the download phase.
Function _load_bare
    Takes a dataset name (that you learned of from the index), and downloads it if necessary - after the first time it's cached in your home directory
Constant _INDEX_URL
    Undocumented
Variable _index_data
    Undocumented
Variable _index_fetch_no_more_often_than_sec
    Undocumented
Variable _index_fetch_time
    Undocumented
_INDEX_URL: str = 'https://wetsuite.knobs-dials.com/datasets/index.json'
    Undocumented

_index_data
    Undocumented

_index_fetch_time: int
    Undocumented

_index_fetch_no_more_often_than_sec: int
    Undocumented
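
A sketch of how these presumably fit together (an assumption based purely on the names; the real implementation, and the interval value below, may differ):

    import time, json, urllib.request

    _INDEX_URL = 'https://wetsuite.knobs-dials.com/datasets/index.json'
    _index_data = None
    _index_fetch_time = 0
    _index_fetch_no_more_often_than_sec = 3600   # made-up interval

    def fetch_index():
        ''' Fetch the dataset index, reusing a recently fetched copy
            so we do not hammer the server. '''
        global _index_data, _index_fetch_time
        now = time.time()
        if _index_data is None or (now - _index_fetch_time) > _index_fetch_no_more_often_than_sec:
            with urllib.request.urlopen(_INDEX_URL) as response:
                _index_data = json.loads(response.read())
            _index_fetch_time = now
        return _index_data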

def fetch_index():

Index is expected to be a list of dicts, each with keys including

  • url
  • version (should probably become semver)
  • description_short: one-line summary of what this is
  • description: longer description, perhaps with some example data
  • download_size: how much transfer you'll need
  • real_size: disk storage we expect to need once decompressed
  • download_size_human, real_size_human: more readable versions, e.g. where real_size might be the integer 397740, real_size_human would be 388KiB
  • type: content type of dataset

TODO: an example (a hypothetical one is sketched below)

CONSIDER: keep hosting generic (HTTP fetch?) so that any host will do.
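
To illustrate the expected shape, a hypothetical index entry (every value below is made up, not taken from the real index):

    {
        'url':                 'https://wetsuite.knobs-dials.com/datasets/example.db.xz',  # made-up URL
        'version':             '0.1',
        'description_short':   'One-line summary of what this is',
        'description':         'Longer description, perhaps with some example data',
        'download_size':       97740,        # bytes to transfer
        'real_size':           397740,       # bytes once decompressed
        'download_size_human': '95KiB',
        'real_size_human':     '388KiB',
        'type':                'application/x-example',   # made-up content type
    }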

def list_datasets():

Fetch the index and report dataset names _only_. If you also want the details, see fetch_index if you care about the data form, or print_dataset_summary if you want it printed on the console/notebook output.

Returns
a list of strings, e.g. ['bwb-mostrecent-xml', 'woo_besluiten_docs_text']
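
A minimal usage sketch, assuming the package is importable as wetsuite.datasets:

    import wetsuite.datasets

    for name in wetsuite.datasets.list_datasets():
        print(name)     # e.g. 'bwb-mostrecent-xml'
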
def print_dataset_summary():

Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), a little less work than sifting through the dicts yourself, but only useful in notebooks or from the console.

def description(dataset_name):

Fetch the description field, for a specifically named dataset. Simple, but less typing than picking it out yourself.

Parameters
    dataset_name: str
        Undocumented
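
For example (same assumed import as above; the dataset name is one that list_datasets() is documented to return):

    print( wetsuite.datasets.description('bwb-mostrecent-xml') )
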
def _data_from_path(data_path):

Given a path to a data file, return the data in python-object form, and a description (based on contents). This wraps opening and dealing with the file type, and separates that from the download phase.

def _load_bare(dataset_name, verbose=None, force_refetch=False, check_free_space=True):

Takes a dataset name (that you learned of from the index), and downloads it if necessary - after the first time it's cached in your home directory.

If compressed, it will uncompress. It does not think about the type of data.

Note: You normally would use load(), which takes the same name but gives you a usable object, instead of just a filename.

Parameters
    dataset_name: str
        Undocumented
    verbose
        Undocumented
    force_refetch
        Undocumented
    check_free_space
        Undocumented
Returns
the filename we fetched to
def load(dataset_name, verbose=None, force_refetch=False, check_free_space=True):

Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory

Wraps _load_bare, which does most of the heavy lifting.

This primarily adds what is necessary to load that downloaded thing and give it to you as a usable Dataset object

Parameters
    dataset_name: str
        Undocumented
    verbose
        Tells you more about the download (on stderr). Can be given True or False. By default (None), we try to detect whether we are in an interactive context, and print only if we are.
    force_refetch
        Whether to remove the current contents before fetching. Dataset naming should prevent the need for this (except if you're the wetsuite programmer).
    check_free_space
        Undocumented
Returns

a Dataset object - which is a container object with little more than

  • a .description (a string)
  • a .data member, some kind of iterable of items. The .description should mention what .data will contain and should give an example of how to use it.
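
A minimal usage sketch (dataset name illustrative, import path assumed as before):

    import wetsuite.datasets

    ds = wetsuite.datasets.load('bwb-mostrecent-xml')
    print( ds.description )     # should explain what .data contains and how to use it
    for item in ds.data:        # some kind of iterable of items
        print( item )
        break
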
def generated_today_text():

Used when generating datasets

Returns
a string like 'This dataset was generated on 2024-02-02'
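
Presumably little more than formatting today's date (a sketch based on the documented return value, not the actual implementation):

    import datetime

    def generated_today_text():
        # assumed: ISO-format today's date into the documented sentence
        return 'This dataset was generated on %s' % datetime.date.today().isoformat()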