module documentation

This is intended to store collections of data on disk, to be relatively unobtrusive to use (better than e.g. lots of files), and to give quick random access (better than e.g. JSONL).

It currently centers on a key-value store. See the docstring on the LocalKV class for more details.

This is used e.g. by various kinds of data collection, and by distributed datasets.

There are a lot of general notes in LocalKV's docstring, and many of them also apply to MsgpackKV.

  • Class LocalKV: A key-value store backed by a local filesystem. It's a wrapper around sqlite3.
  • Class MsgpackKV: Like LocalKV, but the value can be a nested python type (serialized via msgpack).
  • Function cached_fetch: Helper to fetch URL contents into a str-to-bytes (url-to-content) LocalKV store.
  • Function is_file_a_store: Checks that the path seems to point to one of our stores. More specifically: whether it is an sqlite(3) database, and has a table called 'kv'.
  • Function list_stores: Checks a directory for files that seem to be our stores, and lists some basic details about each.
  • Function resolve_path: Note: the KV classes call this internally. This is here less for you to use directly, more to explain how it works and why.

def cached_fetch(store, url, force_refetch=False, sleep_sec=None, timeout=20, maxsize_bytes=(500*1024)*1024, commit=True):

Helper to fetch URL contents into str-to-bytes (url-to-content) LocalKV store:

  • if URL is a key in the given store, fetch from the store and return its value
  • if URL is not a key in the store, do wetsuite.helpers.net.download(url), store in store, and return its value.
    • note that it will do a commit, unless you tell it not to.

Arguably belongs in a mixin or such, but for now its usefulness puts it here.

Parameters
  • store (LocalKV): a store to get/put data from
  • url (str): a URL string to fetch
  • force_refetch (bool): fetch even if we had it already
  • sleep_sec (float): sleep this long whenever we did an actual fetch (and not when we return data from cache), so that when you use this in scraping, we can easily be nicer to a server.
  • timeout (float): timeout of the fetch
  • maxsize_bytes: don't try to store something larger than this (because SQLite may trip over it anyway), defaults to 500MiB
  • commit (bool): whether to put() with an immediate commit (False can help make some bulk updates faster)
Returns
Tuple[bytes, bool]

(data:bytes, whether_it_came_from_cache:bool)

May raise

  • whatever requests.get may raise (e.g. "timeout waiting for store" type things)
  • ValueError when networking says (not response.ok), or if the HTTP code is >=400 (which is behaviour from wetsuite.helpers.net.download()) ...to force us to deal with issues and not store error pages.
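
The cache-or-fetch behaviour described above can be sketched roughly as follows. This is only an illustration of the logic, not the actual implementation: `cached_fetch_sketch` is a hypothetical name, a plain dict stands in for the LocalKV store, and the `fetch` argument stands in for wetsuite.helpers.net.download() so that the example does not touch the network. Timeouts, size limits and commits are left out.

```python
import time


def cached_fetch_sketch(store, url, fetch, force_refetch=False, sleep_sec=None):
    """Return (data, came_from_cache), mirroring the documented return shape.

    `store` is any dict-like object and `fetch` is any callable that takes a
    URL and returns bytes. Both are stand-ins (assumptions of this sketch),
    not the real LocalKV or download().
    """
    if not force_refetch and url in store:
        return store[url], True      # cache hit: no fetch, no sleep
    data = fetch(url)                # may raise; in that case nothing is stored
    store[url] = data
    if sleep_sec is not None:
        time.sleep(sleep_sec)        # only after an actual fetch
    return data, False


# Usage with a fake fetcher that records how often it was called
calls = []

def fake_fetch(url):
    calls.append(url)
    return b"hello"

store = {}
first = cached_fetch_sketch(store, "https://example.com/", fake_fetch)
second = cached_fetch_sketch(store, "https://example.com/", fake_fetch)
```

The second call finds the URL in the store, so the fetcher runs only once, and the boolean in the returned tuple tells the two cases apart.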
def is_file_a_store(path, skip_table_check=False):

Checks that the path seems to point to one of our stores. More specifically: whether it is an sqlite(3) database, and has a table called 'kv'.

You can skip the latter test. Skipping it avoids opening the store, and so avoids a possible timeout when someone else has the store open for writing.

Parameters
  • path: the filesystem path to test
  • skip_table_check (bool): don't check for the right table name, e.g. to make it faster or avoid opening a store
Returns
Whether it seems like a store we could open
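
One way such a two-step check could be done is sketched below, as an assumption about the approach rather than a copy of the actual code. It first compares the file's first 16 bytes to the SQLite 3 header string, and only then (optionally) opens the database to look for a table called 'kv'. The name `looks_like_store` and the column layout of the demo table are hypothetical.

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # the first 16 bytes of an SQLite 3 database file


def looks_like_store(path, skip_table_check=False):
    """Hypothetical stand-in for is_file_a_store; the real checks may differ."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:          # plain read of the header, no database connection
        if f.read(16) != SQLITE_MAGIC:
            return False
    if skip_table_check:
        return True
    conn = sqlite3.connect(path)         # this is the step that can run into locks
    try:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name='kv'"
        ).fetchone()
        return row is not None
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.db")
    conn = sqlite3.connect(good)
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)")  # columns are an assumption
    conn.commit()
    conn.close()

    other = os.path.join(tmp, "other.db")
    conn = sqlite3.connect(other)
    conn.execute("CREATE TABLE something_else (x INTEGER)")
    conn.commit()
    conn.close()

    text = os.path.join(tmp, "notes.txt")
    with open(text, "w") as f:
        f.write("not a database")

    results = {
        "good": looks_like_store(good),
        "other": looks_like_store(other),
        "other_skipped": looks_like_store(other, skip_table_check=True),
        "text": looks_like_store(text),
    }
```

Note how a database without a 'kv' table is only rejected when the table check is left on.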
def list_stores(skip_table_check=True, get_num_items=False, look_under=None):

Checks a directory for files that seem to be our stores, and lists some basic details about each.

Does filesystem access and IO reading to do so, and with some parameter combinations will fail to open currently write-locked databases.

By default this looks in the directory that resolve_path() (and everything that uses it) puts things in; you can give it another directory to look in.

Will only look at direct contents of that directory.

Parameters
  • skip_table_check (bool): if True, only tests whether it's a sqlite file, not whether it contains the table we expect. When a file is in the stores directory, chances are we put it there, so we can avoid the extra IO and locking.
  • get_num_items (bool): whether to also get the number of items. Off by default, because that can need a bunch of IO, and locking.
  • look_under: the directory to look in. By default (None), the directory that resolve_path() puts things in.
Returns

a dict with details for each store, like:

    {
        'path': '/home/example/.wetsuite/stores/thing.db',
        'basename': 'thing.db',
        'size_bytes': 40980480,
        'size_readable': '41M',
        'description': None
    },
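
As a rough illustration of such a scan, the sketch below walks only the direct contents of a directory, keeps files that start with the SQLite 3 header, and collects a few of the fields shown above. It is a sketch under assumptions, not the real function: `list_stores_sketch` is a hypothetical name, returning a sorted list of dicts is a choice made here, and the readable size, description and item count are left out.

```python
import os
import sqlite3
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # the first 16 bytes of an SQLite 3 database file


def list_stores_sketch(look_under):
    """Hypothetical stand-in for list_stores, covering only the cheap checks."""
    found = []
    with os.scandir(look_under) as entries:   # direct contents only, no recursion
        for entry in entries:
            if not entry.is_file():
                continue
            with open(entry.path, "rb") as f:
                if f.read(16) != SQLITE_MAGIC:
                    continue
            found.append({
                "path": entry.path,
                "basename": entry.name,
                "size_bytes": entry.stat().st_size,
            })
    return sorted(found, key=lambda item: item["basename"])


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "thing.db"))
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)")  # columns are an assumption
    conn.commit()
    conn.close()
    with open(os.path.join(tmp, "readme.txt"), "w") as f:
        f.write("not a store")
    os.mkdir(os.path.join(tmp, "subdir"))     # subdirectories are not descended into

    listing = list_stores_sketch(tmp)
```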
def resolve_path(name):

Note: the KV classes call this internally. This is here less for you to use directly, more to explain how it works and why.

For context, handing a pathless base name to the underlying sqlite would just put it in the current directory, which would often not be where you think, so is likely to sprinkle databases all over the place. This is a common point of confusion/mistake around sqlite (and files in general), so we make it harder to do accidentally.

Using this function makes it a little more controlled where things go:

  • Given a **bare name**, e.g. 'extracted.db', this returns an absolute path within the directory where wetsuite keeps its stores, inside your user profile, e.g. /home/myuser/.wetsuite/stores/extracted.db or C:\Users\myuser\AppData\Roaming\.wetsuite\stores\extracted.db. Most of the point is that handing in the same name will lead to opening the same store, regardless of details.
  • hand in **`:memory:`** if you want a memory-only store, not backed by disk
  • given an absolute path, it will use that as-is. So if you actually _wanted_ it in the current directory, then instead of relying on this function, consider something like `os.path.abspath('mystore.db')`
  • given a relative path, it will pass that through -- which will open it relative to the current directory
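
The four cases above could be sketched like this. It is a minimal sketch rather than the actual implementation: `resolve_path_sketch` and its `stores_dir` parameter are hypothetical (the real function figures out the directory in your user profile by itself), and "contains a path separator" is assumed here as the test for "is already a path".

```python
import os


def resolve_path_sketch(name, stores_dir):
    """Hypothetical stand-in for resolve_path; stores_dir is made explicit for illustration."""
    if name == ":memory:":
        return name                          # memory-only store, no file involved
    seps = [os.sep] + ([os.altsep] if os.altsep else [])
    if any(sep in name for sep in seps):
        return name                          # absolute or relative path: passed through as-is
    return os.path.join(stores_dir, name)    # bare name: goes into the stores directory


# The directory used here is an assumption, loosely following the examples above
stores_dir = os.path.join(os.path.expanduser("~"), ".wetsuite", "stores")

bare = resolve_path_sketch("extracted.db", stores_dir)
again = resolve_path_sketch(bare, stores_dir)            # already a path, so unchanged
memory = resolve_path_sketch(":memory:", stores_dir)
relative = resolve_path_sketch(os.path.join("sub", "mystore.db"), stores_dir)
```

Because the result for a bare name always contains a path separator, calling the function on its own output passes it through unchanged.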

Notes:

  • should be idempotent, so shouldn't hurt to call this more than once on the same path (in that it _should_ always produce a path with os.sep (...or :memory: ...), which it would pass through the second time)
  • When you rely on the 'base name means it goes to a wetsuite directory', it is suggested that you use descriptive names (e.g. 'rechtspraaknl_extracted.db', not 'extracted.db') so that you don't open existing stores without meaning to.
  • us figuring out a place in your user profile for you _is_ double-edged, though, in that we will get fair questions like
    • "I can't tell why my user profile is full" and
    • "I can't find my files" (sorry, they're not so useful to access directly)

CONSIDER:

  • listening to a WETSUITE_BASE_DIR to override our "put in user's homedir" behaviour, this might make more sense e.g. to point it at distributed storage without e.g. you having to do symlink juggling
Parameters
  • name (str): the name or path to interpret
Returns
a more resolved path, as described above