Fetch and load already-created datasets that we provide. (To see a list of the actual datasets, look at the wetsuite_datasets.ipynb notebook.)
As this is often structured data, each dataset may work a little differently, so there is a describe() to get you started, which each dataset should fill out.
TODO:
- If we want updateable datasets (right now there is no plan for that), think more about the robustness around re-fetching indices. Decide it's cheap enough to fetch each time? (but fall back onto a stored copy?)
From `__init__.py`:
| Kind | Name | Summary |
|------|------|---------|
| Class | Dataset | If you're looking for details about the specific dataset, look at the .description |
| Function | description | Fetch the description field for a specifically named dataset. Simple, but less typing than picking it out yourself. |
| Function | fetch | Index is expected to be a list of dicts, each with keys as described below |
| Function | generated | Used when generating datasets |
| Function | list | Fetch the index, report dataset names _only_. If you also want the details, see fetch_index if you care about the data form, or print_dataset_summary if you want it printed on the console/notebook output. |
| Function | load | Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory |
| Function | print | Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), a little less work than sifting through the dicts yourself, but only useful in notebooks or from the console... |
| Function | _data | Given a path to a data file, return the data in python-object form and a description (based on contents). This wraps opening and dealing with the file type, and separates that from the download phase. |
| Function | _load | Takes a dataset name (that you learned of from the index), downloads it if necessary - after the first time it's cached in your home directory |
| Constant | _INDEX | Undocumented |
| Variable | _index | Undocumented |
| Variable | _index | Undocumented |
| Variable | _index | Undocumented |
Index is expected to be a list of dicts, each with keys including:
- url
- version (should probably become semver)
- description_short: a one-line summary of what this is
- description: a longer description, perhaps with some example data
- download_size: how much transfer you'll need
- real_size: the disk storage we expect to need once decompressed
- download_size_human, real_size_human: more readable versions, e.g. where real_size might be the integer 397740, real_size_human would be '388KiB'
- type: the content type of the dataset
TODO: an example
CONSIDER: keep hosting generic (HTTP fetch?) so that any hoster will do.
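To make the key list above concrete, here is a made-up index entry. Only the key names come from the description above; all values (including the URL and sizes) are invented for illustration. The `_human` fields are the byte counts rendered in binary units, roughly as the sketch helper below does (this is not wetsuite's actual code):

```python
# A made-up index entry; only the key names are from the docs above,
# all values (including the URL) are invented for illustration.
example_entry = {
    "url": "https://example.org/datasets/example-dataset.json.xz",
    "version": "preliminary",
    "description_short": "One-line summary of what this is",
    "description": "Longer description, perhaps with some example data",
    "download_size": 102400,        # bytes to transfer
    "real_size": 397740,            # bytes on disk once decompressed
    "download_size_human": "100KiB",
    "real_size_human": "388KiB",
    "type": "application/json",
}

def human_size(n_bytes):
    """Render a byte count in binary units, e.g. 397740 -> '388KiB'.
    Illustrative helper, not wetsuite's actual implementation."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n_bytes < 1024:
            return "%d%s" % (round(n_bytes), unit)
        n_bytes /= 1024.0
    return "%d%s" % (round(n_bytes), "PiB")
```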
Fetch the index and report dataset names _only_. If you also want the details, see fetch_index if you care about the data form, or print_dataset_summary if you want it printed on the console/notebook output.

Returns: a list of strings, e.g. ['bwb-mostrecent-xml', 'woo_besluiten_docs_text']
Print a short summary per dataset, on stdout. A little more to go on than just the names from list_datasets(), and a little less work than sifting through the dicts yourself, but only useful in notebooks or from the console.
Fetch the description field for a specifically named dataset. Simple, but less typing than picking it out yourself.

Parameters:
- dataset (str): Undocumented
Given a path to a data file, return the data in python-object form and a description (based on contents). This wraps opening and dealing with the file type, and separates that from the download phase.
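As a rough sketch of that idea (this is not wetsuite's actual code; the extension-to-opener mapping and the assumption that the content is JSON are mine):

```python
import bz2
import gzip
import json
import lzma
import pathlib

def data_from_path(path):
    # Sketch only: pick a decompressor by file extension, then parse.
    # Real datasets may be other formats (e.g. SQLite); JSON is assumed here.
    path = pathlib.Path(path)
    opener = {".xz": lzma.open, ".bz2": bz2.open, ".gz": gzip.open}.get(path.suffix, open)
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```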
Takes a dataset name (that you learned of from the index) and downloads it if necessary - after the first time it's cached in your home directory.
If compressed, it will be uncompressed. Does not think about the type of data.
Note: you would normally use load(), which takes the same name but gives you a usable object, instead of just a filename.

Parameters:
- dataset (str): Undocumented
- verbose: Undocumented
- force: Undocumented
- check: Undocumented

Returns: the filename we fetched to
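The download-and-cache behaviour described here can be sketched as follows. The cache location (`~/.wetsuite/datasets`) and the function name are assumptions for illustration, and unlike the real function this sketch does not uncompress anything:

```python
import pathlib
import urllib.request

def cached_fetch(url, cache_dir=None):
    """Download url once, reuse the local copy on later calls.
    Sketch of the caching idea only; cache_dir default is an assumption."""
    if cache_dir is None:
        cache_dir = pathlib.Path.home() / ".wetsuite" / "datasets"  # assumed location
    cache_dir = pathlib.Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / url.rsplit("/", 1)[-1]
    if not local_path.exists():  # only fetch the first time
        urllib.request.urlretrieve(url, local_path)
    return str(local_path)
```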
Takes a dataset name (that you learned of from the index) and downloads it if necessary - after the first time it's cached in your home directory.
Wraps _load_bare, which does most of the heavy lifting; this primarily adds what is necessary to load that downloaded thing and give it to you as a usable Dataset object.

Parameters:
- dataset (str): Undocumented
- verbose: tells you more about the download (on stderr). Can be given True or False; by default (None), we try to detect whether we are in an interactive context, and print only if we are.
- force: whether to remove the current contents before fetching. Dataset naming should prevent the need for this (except if you're the wetsuite programmer).
- check: Undocumented

Returns: a Dataset object - which is a container object with little more than ...
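To give a concrete sense of that container, here is a minimal sketch. Only .description and describe() are mentioned on this page; the .data attribute and the constructor signature are assumptions for illustration:

```python
class Dataset:
    """Minimal sketch of the container that load() returns;
    the real wetsuite Dataset carries more than this."""

    def __init__(self, data, description):
        self.data = data                # the loaded dataset contents (assumed attribute)
        self.description = description  # per-dataset details, as mentioned above

    def describe(self):
        # each dataset is meant to fill this out to get you started
        return self.description
```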