wetsuite.helpers.localdata.LocalKV

class documentation

class LocalKV:

Known subclasses: wetsuite.helpers.localdata.MsgpackKV

Constructor: LocalKV(path, key_type, value_type, read_only)

A key-value store backed by a local filesystem - it's a wrapper around sqlite3.

Given:

    db = LocalKV('path/to/dbfile', str, str)

Basic use is:

    db.put('foo', 'bar')
    db.get('foo')

Notes:

on the path/name argument:
- just a name (without os.sep, that is, / or \) will be resolved to a path where wetsuite keeps various stores
- an absolute path will be passed through and used as-is, but this is NOT very portable until you do things like `os.path.join( myproject_data_dir, 'docstore.db')`
- a relative path with os.sep will be passed through, which is only as portable as the cwd is predictable
- ':memory:' is in-memory only
- See also resolve_path for more details
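
The rules above can be sketched roughly as follows. This is not wetsuite's resolve_path; the function name `sketch_resolve` and the directory `ASSUMED_STORE_DIR` are hypothetical, made up only to illustrate the three cases.

```python
import os

# Hypothetical stand-in for the directory where named stores are kept;
# the real location is decided by wetsuite, not by this sketch.
ASSUMED_STORE_DIR = os.path.join(os.path.expanduser("~"), "assumed-store-dir")


def sketch_resolve(name: str, store_dir: str = ASSUMED_STORE_DIR) -> str:
    """Mimic the rules listed above; not wetsuite's actual resolve_path."""
    if name == ":memory:":
        return name  # in-memory database, no file involved
    has_sep = os.sep in name or (os.altsep is not None and os.altsep in name)
    if has_sep:
        return name  # absolute or relative path: passed through as-is
    return os.path.join(store_dir, name)  # bare name: placed in the store dir


bare = sketch_resolve("docstore.db")
relative = sketch_resolve(os.path.join("data", "docstore.db"))
memory = sketch_resolve(":memory:")
```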
By default, each write is committed individually (because SQLite3's driver defaults to autocommit). If you want more performant bulk writes, use put() with commit=False, and do an explicit commit() afterwards. BUT if a script fails in the middle of something uncommitted, you will need to do manual cleanup.
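
To illustrate the difference between committing every write and committing once per batch, here is a minimal sketch using Python's sqlite3 module directly. The `kv` table layout is an assumption for the example; with LocalKV the second loop would correspond to put(key, value, commit=False) followed by one commit().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.commit()

# One commit per write: simple, but every write pays the cost of a commit.
for i in range(100):
    conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", (f"a{i}", "x"))
    conn.commit()

# One commit for the whole batch: the writes share a single transaction.
for i in range(100):
    conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", (f"b{i}", "x"))
pending = conn.in_transaction  # True: nothing from this batch is saved yet
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
conn.close()
```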
On typing:
- SQLite will just store what it gets, which makes it easy to store mixed types. To allow programmers to enforce some runtime checking, you can specify key_type and value_type.
- This class won't do conversions for you, it only enforces the values that go in are of the type you said you would put in.
- This should make things more consistent, but is not a strict guarantee, and you can subvert this easily.
- Some uses may wish for a specific key and value type. You could change both key and value types, e.g. the cached_fetch function expects a str:bytes store
- It is a good idea to open the store with the same typing every time, or you will still confuse yourself. CONSIDER: storing typing in the file in the meta table so we can warn you.
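
The "check but do not convert" idea can be sketched like this. The helper `check_type` is hypothetical and only illustrates the principle; it is not the class's own _checktype_key or _checktype_value.

```python
def check_type(value, expected_type, what="value"):
    """Raise TypeError unless value is an instance of expected_type (no conversion)."""
    if not isinstance(value, expected_type):
        raise TypeError(
            f"{what} should be {expected_type.__name__}, got {type(value).__name__}"
        )
    return value


accepted = check_type("some-key", str, what="key")

try:
    check_type(123, str, what="key")  # an int is rejected, not turned into '123'
    rejected = False
except TypeError:
    rejected = True
```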
making you do CRUD via functions is a little more typing,
- yet is arguably clearer than 'this particular dict-like happens to get stored magically'
- and it lets us expose some sqlite things (using transactions, vacuum) for when you know how to use them.
On concurrency: As per basic sqlite behaviour,
- multiple processes can read the same database,
- but only one can write,
- writing is exclusive with reading,
- and there are timeouts on opening and operations.
- So...
  - when you leave a writer with uncommitted data for nontrivial amounts of time, readers are likely to time out.
    - If you leave it on autocommit this should be a little rarer
  - and a very slow read through the database might time out a write.
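
A small sketch of the "only one can write" case, using plain sqlite3 with two connections to the same temporary file. The table layout and the short timeout are assumptions chosen to keep the demonstration quick; it only shows a second writer being refused while the first has uncommitted data.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path, timeout=0.2)
writer.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
writer.commit()

other = sqlite3.connect(path, timeout=0.2)  # short timeout keeps the demo quick

# The first connection starts a write and leaves it uncommitted.
writer.execute("INSERT INTO kv VALUES ('a', '1')")

# A second writer cannot get the write lock and fails
# (after waiting at most its timeout).
try:
    other.execute("INSERT INTO kv VALUES ('b', '2')")
    blocked = False
except sqlite3.OperationalError:  # typically "database is locked"
    blocked = True
other.rollback()  # leave the second connection in a clean state

# Once the first writer commits, the second one can proceed.
writer.commit()
other.execute("INSERT INTO kv VALUES ('b', '2')")
other.commit()

count = other.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
writer.close()
other.close()
```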
It wouldn't be hard to also make it act largely like a dict, implementing __getitem__, __setitem__, and __delitem__ but this muddies the waters as to its semantics, in particular around when things you set are actually saved - if ever.

So we try to avoid a leaky abstraction, by making you write out all the altering operations, and actually all of them, e.g. get(), put(), keys(), values(), and items(), because those can at least have docstrings to warn you, rather than breaking your reasonable expectations.

...exceptions are
- __len__ for amount of items (CONSIDER making that len())
- __contains__ backing 'is this key in the store' (CONSIDER making that has_key())
and also:
- __iter__ which is actually iterkeys() CONSIDER: removing it
- __getitem__ supports the views-with-a-len
The last were tentative until keys(), values(), and items() started giving views.

TODO: make a final decision where to sit between clean abstractions and convenience.
Yes, you _could_ access these SQLite databases yourself, particularly when just reading. Our code is mainly there for convenience and checks. Consider: `sqlite3 store.db 'select key,value from kv limit 10 ' | less`. It only starts getting special once you are using MsgpackKV, or the extra parsing and wrapping that wetsuite.datasets adds.

Method	`__contains__`	will return whether the store contains a key
Method	`__enter__`	supports use as a context manager
Method	`__exit__`	supports use as a context manager - close()s on exit
Method	`__getitem__`	(only meant to support ValuesView and ItemsView)
Method	`__init__`	Specify the path to the database file to open.
Method	`__iter__`	Using this object as an iterator yields its keys (equivalent to .iterkeys())
Method	`__len__`	Return the amount of entries in this store
Method	`__repr__`	show useful representation
Method	`bytesize`	Returns the approximate amount of the contained data, in bytes (may be a few dozen kilobytes off, or more, because it counts in pages)
Method	`close`	Closes file if still open. Note that if there was a transaction still open, it will be rolled back, not committed.
Method	`commit`	commit changes - for when you use put() or delete() with commit=False to do things in a larger transaction
Method	`delete`	delete item by key.
Method	`estimate_waste`	Estimate how many bytes might be cleaned by a .vacuum()
Method	`get`	Gets value for key. The key type is checked against how you constructed this LocalKV class (doesn't guarantee it matches what's in the database). If not present, this will raise KeyError (by default) or return None (if you set missing_as_none=True) (this is unlike a dict...
Method	`items`	Returns an iterable of all items. (a view with a len, rather than just a generator)
Method	`iteritems`	Returns a generator that yields all items
Method	`iterkeys`	Returns a generator that yields all keys. If you wanted a list with all keys, use list( store.keys() )
Method	`itervalues`	Returns a generator that yields all values. If you wanted a list with all the values, use list( store.values() )
Method	`keys`	Returns an iterable of all keys. (a view with a len, rather than just a generator)
Method	`put`	Sets/updates value for a key.
Method	`random_choice`	Returns a single (key, value) item from the store, selected randomly.
Method	`random_keys`	Returns a number of keys in a list, selected randomly. Can be faster/cheaper to do than random_sample when the values are large
Method	`random_sample`	Returns a list of [(key, value), ...] items from the store, selected randomly.
Method	`random_sample_generator`	A generator that yields one (key,value) tuple at a time, intended to avoid materializing all values before we return.
Method	`random_values`	Returns a number of values in a list, selected randomly.
Method	`random_values_generator`	A generator that yields one value at a time, intended to avoid materializing all values before we return.
Method	`rollback`	roll back changes
Method	`summary`	Gives the byte size, and optionally the number of items and average size
Method	`truncate`	remove all kv entries. If we were still in a transaction, we roll that back first
Method	`vacuum`	After a lot of deletes you could compact the store with vacuum(). WARNING: rewrites the entire file, so the more data you store, the longer this takes. And it may make no difference - you probably want to check estimate_waste() first...
Method	`values`	Returns an iterable of all values. (a view with a len, rather than just a generator)
Instance Variable	`conn`	Undocumented
Instance Variable	`key_type`	Undocumented
Instance Variable	`path`	Undocumented
Instance Variable	`read_only`	Undocumented
Instance Variable	`value_type`	Undocumented
Method	`_checktype_key`	checks a value according to the key_type you handed into the constructor
Method	`_checktype_value`	checks a value according to the value_type you handed into the constructor
Method	`_delete_meta`	For internal use, preferably don't use. See also _get_meta(), _put_meta(). Note this does an implicit commit()
Method	`_get_meta`	For internal use, preferably don't use.
Method	`_open`	Open the path previously set by init. This function could probably be merged into init, it was separated mostly with the idea that we could keep it closed when not using it.
Method	`_put_meta`	For internal use, preferably don't use. See also _get_meta(), _delete_meta(). Note this does an implicit commit()
Instance Variable	`_in_transaction`	Undocumented

def __contains__(self, key): ¶

will return whether the store contains a key

def __enter__(self): ¶

supports use as a context manager

def __exit__(self, exc_type, exc_value, exc_traceback): ¶

supports use as a context manager - close()s on exit

def __getitem__(self, key): ¶

(only meant to support ValuesView and ItemsView)

def __init__(self, path, key_type, value_type, read_only=False): ¶

overridden in wetsuite.helpers.localdata.MsgpackKV

Specify the path to the database file to open.

key_type and value_type do not have defaults, so that you think about how you are using these, but we often use str,str and str,bytes

Parameters
path	database name/path. The file will be created if it does not yet exist, so you probably want to think about repeating the same path in an absolute sense. See also the module docstring, and in particular resolve_path()'s docstring
key_type	the key type you have set
value_type	the value type you have set
read_only	whether we have told ourselves to treat this as read-only (our wrapper enforces this instead of sqlite, though mostly to give slightly more useful errors).

def __iter__(self): ¶

Using this object as an iterator yields its keys (equivalent to .iterkeys())

def __len__(self): ¶

Return the amount of entries in this store

def __repr__(self): ¶

show useful representation

def bytesize(self) -> int: ¶

Returns the approximate amount of the contained data, in bytes (may be a few dozen kilobytes off, or more, because it counts in pages)
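
One plausible way to get a page-based size estimate from SQLite is to multiply the page count by the page size. This is a sketch under that assumption, not necessarily how bytesize() is implemented; the table and data are made up.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)")
conn.execute("INSERT INTO kv VALUES (?, ?)", ("k", b"x" * 10_000))
conn.commit()

page_size = conn.execute("PRAGMA page_size").fetchone()[0]
page_count = conn.execute("PRAGMA page_count").fetchone()[0]
approx_bytes = page_size * page_count  # always a whole number of pages
conn.close()
```

Because the result is rounded up to whole pages and includes index and bookkeeping pages, it is larger than the raw data stored.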

def close(self): ¶

Closes file if still open. Note that if there was a transaction still open, it will be rolled back, not committed.
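
The rollback-on-close behaviour matches what plain sqlite3 does with an uncommitted write, as this small sketch shows. The table layout and file name are assumptions for the example.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO kv VALUES ('saved', '1')")
conn.commit()

conn.execute("INSERT INTO kv VALUES ('lost', '2')")  # never committed
conn.close()  # the open transaction is rolled back, not committed

conn = sqlite3.connect(path)
keys = [row[0] for row in conn.execute("SELECT key FROM kv")]
conn.close()
```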

def commit(self): ¶

commit changes - for when you use put() or delete() with commit=False to do things in a larger transaction

def delete(self, key, commit: bool = True): ¶

delete item by key.

Note that you should not expect the file to shrink until you do a vacuum() (which will need to rewrite the file).
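
A sketch of why the file does not shrink on delete, using plain sqlite3 on a temporary file. The table layout and row sizes are made up. Counting free pages is one plausible way to estimate reclaimable space, which may or may not be what estimate_waste() does.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)")
conn.executemany(
    "INSERT INTO kv VALUES (?, ?)",
    ((f"key{i}", b"x" * 1000) for i in range(2000)),
)
conn.commit()
size_full = os.path.getsize(path)

conn.execute("DELETE FROM kv")
conn.commit()
size_after_delete = os.path.getsize(path)  # freed pages stay inside the file

# Freed pages are kept on a free list; their total size hints at reclaimable space.
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
reclaimable = page_size * free_pages

conn.execute("VACUUM")  # rewrites the file without the free pages
size_after_vacuum = os.path.getsize(path)
conn.close()
```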

def estimate_waste(self): ¶

Estimate how many bytes might be cleaned by a .vacuum()

def get(self, key, missing_as_none: bool = False): ¶

overridden in wetsuite.helpers.localdata.MsgpackKV

Gets value for key. The key type is checked against how you constructed this LocalKV class (this doesn't guarantee it matches what's in the database). If the key is not present, this will raise KeyError (by default) or return None (if you set missing_as_none=True). This is unlike dict.get, which has default=None.
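
The two ways of handling a missing key can be sketched on top of plain sqlite3. The function `sketch_get` and the `kv` table layout are assumptions for illustration, not the class's own code.

```python
import sqlite3


def sketch_get(conn, key, missing_as_none=False):
    """Look up one value; raise KeyError or return None when the key is absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    if row is None:
        if missing_as_none:
            return None
        raise KeyError(key)
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO kv VALUES ('foo', 'bar')")
conn.commit()

found = sketch_get(conn, "foo")
absent = sketch_get(conn, "nope", missing_as_none=True)
try:
    sketch_get(conn, "nope")
    raised = False
except KeyError:
    raised = True
```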

def items(self): ¶

Returns an iterable of all items. (a view with a len, rather than just a generator)

def iteritems(self): ¶

overridden in wetsuite.helpers.localdata.MsgpackKV

Returns a generator that yields all items

def iterkeys(self): ¶

Returns a generator that yields all keys. If you wanted a list with all keys, use list( store.keys() )

def itervalues(self): ¶

overridden in wetsuite.helpers.localdata.MsgpackKV

Returns a generator that yields all values. If you wanted a list with all the values, use list( store.values() )

def keys(self): ¶

Returns an iterable of all keys. (a view with a len, rather than just a generator)

def put(self, key, value, commit: bool = True): ¶

overridden in wetsuite.helpers.localdata.MsgpackKV

Sets/updates value for a key.

Types will be checked according to what you inited this class with.

commit=False lets us do bulk commits, mostly when you want to do a load of (small) changes without becoming IOPS bound, at the risk of locking/blocking other access. If you care less about speed, and/or more about parallel access, ignore this.

CONSIDER: making commit take an integer as well, meaning 'commit every X operations'
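
Until something like that exists, 'commit every X operations' can be done by the caller. A minimal sketch with plain sqlite3 follows; the helper `put_many`, its `commit_every` parameter and the `kv` table layout are all hypothetical. With LocalKV the loop body would instead be put(key, value, commit=False) with an occasional commit().

```python
import sqlite3


def put_many(conn, pairs, commit_every=1000):
    """Insert or replace (key, value) pairs, committing every commit_every writes."""
    commits = 0
    pending = 0
    for key, value in pairs:
        conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )
        pending += 1
        if pending >= commit_every:
            conn.commit()
            commits += 1
            pending = 0
    if pending:  # commit whatever is left over from the last partial batch
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.commit()
num_commits = put_many(conn, ((f"k{i}", "v") for i in range(2500)), commit_every=1000)
num_rows = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```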

def random_choice(self): ¶

Returns a single (key, value) item from the store, selected randomly.

A convenience function, because doing this properly yourself takes two or three lines (you can't random.choice/random.sample a view, so to do it properly you basically have to materialize all keys - and not accidentally all values) BUT assume this is slower than working on the keys yourself.
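
The "two or three lines" mentioned above might look like the following when written against plain sqlite3: materialize the keys only, pick one, then fetch just that value. The `kv` table layout and the sample data are assumptions.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO kv VALUES (?, ?)", [(f"k{i}", f"v{i}") for i in range(50)]
)
conn.commit()

# Materialize only the keys, which are usually much smaller than the values.
keys = [row[0] for row in conn.execute("SELECT key FROM kv")]
chosen_key = random.choice(keys)

# Fetch the one value that belongs to the chosen key.
chosen_value = conn.execute(
    "SELECT value FROM kv WHERE key = ?", (chosen_key,)
).fetchone()[0]
```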

def random_keys(self, n=10): ¶

Returns a number of keys in a list, selected randomly. Can be faster/cheaper to do than random_sample when the values are large.

On very large stores (tens of millions of items and/or hundreds of gigabytes) this still ends up taking dozens of seconds, because we still skip through a bunch of that data.

def random_sample(self, n): ¶

Returns a list of [(key, value), ...] items from the store, selected randomly.

WARNING: This materializes all keys and the chosen values in RAM, so can use considerable RAM if values are large. To avoid that RAM use, use random_keys() and get() one key at a time, or use random_sample_generator().

Note that when you ask for a larger sample than the entire population, you get the entire population (and unlike random.sample, we don't raise a ValueError to point out this is no longer a subselection)
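
The difference with random.sample can be sketched as below. The helper `clamped_sample` is hypothetical and only illustrates the clamping described in the note.

```python
import random


def clamped_sample(population, n):
    """Like random.sample, but return everything when n exceeds the population."""
    population = list(population)
    return random.sample(population, min(n, len(population)))


keys = ["a", "b", "c"]

try:
    random.sample(keys, 10)  # the standard library refuses an oversized sample
    stdlib_raised = False
except ValueError:
    stdlib_raised = True

everything = clamped_sample(keys, 10)  # all three keys, in random order
```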

def random_sample_generator(self, n=10): ¶

A generator that yields one (key,value) tuple at a time, intended to avoid materializing all values before we return.

Still materializes all the keys before starting to yield, but that should only become troublesome on many-gigabyte stores, and it might avoid some locking issues.

def random_values(self, n=10): ¶

Returns a number of values in a list, selected randomly.

WARNING: this materializes the values, so this can be very large in RAM. Consider using random_values_generator, or using random_keys and get() one key at a time.

def random_values_generator(self, n=10): ¶

A generator that yields one value at a time, intended to avoid materializing all values before we return.

Still materializes all the keys before starting to yield, but that should only become troublesome on many-gigabyte stores, and it might avoid some locking issues.

def rollback(self): ¶

roll back changes

def summary(self, get_num_items: bool = False): ¶

Gives the byte size, and optionally the number of items and average size

Note that the byte size includes waste, so this will over-estimate if you have altered/removed without doing a vacuum().

Parameters
get_num_items:`bool`	Also find the number of items, and calculate average size. This is slower than not doing it (proportionally slower with underlying size). Adds entries like: 'num_items': 856716, 'avgsize_bytes': 63585, 'avgsize_readable': '62K',
Returns
a dictionary with at least: {'size_bytes': 54474244096, 'size_readable': '54G'}

def truncate(self, vacuum=True): ¶

remove all kv entries. If we were still in a transaction, we roll that back first

def vacuum(self): ¶

After a lot of deletes you could compact the store with vacuum(). WARNING: rewrites the entire file, so the more data you store, the longer this takes. And it may make no difference - you probably want to check estimate_waste() first. NOTE: if we were left in a transaction (due to commit=False), this is commit()ed first.

def values(self): ¶

Returns an iterable of all values. (a view with a len, rather than just a generator)

conn = ¶

Undocumented

key_type = ¶

Undocumented

path = ¶

Undocumented

read_only = ¶

Undocumented

value_type = ¶

Undocumented

def _checktype_key(self, val): ¶

checks a value according to the key_type you handed into the constructor

def _checktype_value(self, val): ¶

checks a value according to the value_type you handed into the constructor

def _delete_meta(self, key: str): ¶

For internal use, preferably don't use. See also _get_meta(), _put_meta(). Note this does an implicit commit()

def _get_meta(self, key: str, missing_as_none=False): ¶

For internal use, preferably don't use.

This is an extra str:str table in there that is intended to be separate, with some keys special to these classes. ...you could abuse this for your own needs if you wish, but try not to.

If the key is not present, raises an exception - unless missing_as_none is set, in which case it returns None.

def _open(self, timeout=3.0): ¶

Open the path previously set by init. This function could probably be merged into init, it was separated mostly with the idea that we could keep it closed when not using it.

timeout: how long to wait on opening. Lowered from the default just to avoid a lot of waiting half a minute when it was usually just accidentally left locked. (note that this is different from busy_timeout)

def _put_meta(self, key: str, value: str): ¶

For internal use, preferably don't use. See also _get_meta(), _delete_meta(). Note this does an implicit commit()

_in_transaction: bool = ¶

Undocumented