class documentation

A key-value store backed by a local filesystem - it's a wrapper around sqlite3.

Given:

    db = LocalKV('path/to/dbfile', str, str)

Basic use is:

    db.put('foo', 'bar')
    db.get('foo')

Notes:

  • on the path/name argument:

    • just a name (without os.sep, that is, / or \) will be resolved to a path where wetsuite keeps various stores
    • an absolute path will be passed through and used as-is ...but this is NOT very portable until you do things like `os.path.join(myproject_data_dir, 'docstore.db')`
    • a relative path with os.sep will be passed through (...which is only as portable as the cwd is predictable)
    • ':memory:' is in-memory only
    • See also resolve_path for more details
  • by default, each write is committed individually (because SQLite3's driver defaults to autocommit). If you want more performant bulk writes, use put() with commit=False, and do an explicit commit() afterwards ...BUT if a script fails in the middle of something uncommitted, you will need to do manual cleanup.

  • On typing:

    • SQLite will just store what it gets, which makes it easy to store mixed types. To allow programmers to enforce some runtime checking, you can specify key_type and value_type.
    • This class won't do conversions for you; it only enforces that the values that go in are of the type you said you would put in.
    • This should make things more consistent, but is not a strict guarantee, and you can subvert this easily.
    • Some uses may wish for a specific key and value type. You could change both key and value types, e.g. the cached_fetch function expects a str:bytes store
    • It is a good idea to open the store with the same typing every time, or you will still confuse yourself. CONSIDER: storing typing in the file in the meta table so we can warn you.
  • making you do CRUD via functions is a little more typing,

    • yet is arguably clearer than 'this particular dict-like happens to get stored magically'
    • and it lets us expose some sqlite things (using transactions, vacuum) for when you know how to use them.
  • On concurrency: As per basic sqlite behaviour, multiple processes can read the same database, but only one can write, and writing is exclusive with reading. So

    • when you leave a writer with uncommitted data for nontrivial amounts of time, readers are likely to time out.
      • If you leave it on autocommit this should be a little rarer
    • and a very slow read through the database might time out a write.
  • It wouldn't be hard to also make it act largely like a dict, implementing __getitem__, __setitem__, and __delitem__ but this muddies the waters as to its semantics, in particular when things you set are actually saved.

    So we try to avoid a leaky abstraction, by making you write out all the altering operations, and actually all of them, e.g. get(), put(), keys(), values(), and items(), because those can at least have docstrings to warn you, rather than breaking your reasonable expectations.

    ...exceptions are

    • __len__ for amount of items (CONSIDER making that len())
    • __contains__ backing 'is this key in the store' (CONSIDER making that has_key())

    and also:

    • __iter__ which is actually iterkeys() CONSIDER: removing it
    • __getitem__ supports the views-with-a-len

    The last were tentative until keys(), values(), and items() started giving views.

    TODO: make a final decision where to sit between clean abstractions and convenience.

  • yes, you _could_ access these SQLite databases yourself, particularly when just reading. Our code is mainly there for convenience and checks. Consider: `sqlite3 store.db 'select key,value from kv limit 10' | less` It only starts getting special once you are using MsgpackKV, or the extra parsing and wrapping that wetsuite.datasets adds.

Method __contains__ will return whether the store contains a key
Method __enter__ supports use as a context manager
Method __exit__ supports use as a context manager - close()s on exit
Method __getitem__ (only meant to support ValuesView and ItemsView)
Method __init__ Specify the path to the database file to open.
Method __iter__ Using this object as an iterator yields its keys (equivalent to .iterkeys())
Method __len__ Return the amount of entries in this store
Method __repr__ show useful representation
Method bytesize Returns the approximate amount of the contained data, in bytes (may be a few dozen kilobytes off, or more, because it counts in pages)
Method close Closes file if still open. Note that if there was a transaction still open, it will be rolled back, not committed.
Method commit commit changes - for when you use put() or delete() with commit=False to do things in a larger transaction
Method delete delete item by key.
Method estimate_waste Estimate how many bytes might be cleaned by a .vacuum()
Method get Gets value for key. The key type is checked against how you constructed this LocalKV class (doesn't guarantee it matches what's in the database). If not present, this will raise KeyError (by default) or return None (if you set missing_as_none=True) (this is unlike a dict...
Method items Returns an iterable of all items. (a view with a len, rather than just a generator)
Method iteritems Returns a generator that yields all items
Method iterkeys Returns a generator that yields all keys. If you wanted a list with all keys, use list( store.keys() )
Method itervalues Returns a generator that yields all values. If you wanted a list with all the values, use list( store.values() )
Method keys Returns an iterable of all keys. (a view with a len, rather than just a generator)
Method put Sets/updates value for a key.
Method random_choice Returns a single (key, value) item from the store, selected randomly.
Method random_keys Returns a number of keys in a list, selected randomly. Can be faster/cheaper to do than random_sample when the values are large
Method random_sample Returns a list of (key, value) tuples from the store, selected randomly.
Method random_sample_generator A generator that yields one (key,value) tuple at a time, intended to avoid materializing all values before we return.
Method random_values Returns a number of values in a list, selected randomly.
Method random_values_generator A generator that yields one value at a time, intended to avoid materializing all values before we return.
Method rollback roll back changes
Method summary Gives the byte size, and optionally the number of items and average size
Method truncate remove all kv entries. If we were still in a transaction, we roll that back first
Method vacuum After a lot of deletes you could compact the store with vacuum(). WARNING: rewrites the entire file, so the more data you store, the longer this takes. And it may make no difference - you probably want to check estimate_waste() first...
Method values Returns an iterable of all values. (a view with a len, rather than just a generator)
Instance Variable conn connection to the sqlite database that we set up
Instance Variable key_type the key type you have set
Instance Variable path the path we opened (after resolving)
Instance Variable read_only whether we have told ourselves to treat this as read-only. That _should_ also make it hard for _us_ to be the cause of leaving the database in a locked state.
Instance Variable value_type the value type you have set
Method _checktype_key checks a value according to the key_type you handed into the constructor
Method _checktype_value checks a value according to the value_type you handed into the constructor
Method _delete_meta For internal use, preferably don't use. See also _get_meta(), _put_meta(). Note this does an implicit commit()
Method _get_meta For internal use, preferably don't use.
Method _open Open the path previously set by init. This function could probably be merged into init, it was separated mostly with the idea that we could keep it closed when not using it.
Method _put_meta For internal use, preferably don't use. See also _get_meta(), _delete_meta(). Note this does an implicit commit()
Instance Variable _in_transaction Undocumented
def __contains__(self, key):

will return whether the store contains a key

def __enter__(self):

supports use as a context manager

def __exit__(self, exc_type, exc_value, exc_traceback):

supports use as a context manager - close()s on exit

def __getitem__(self, key):

(only meant to support ValuesView and ItemsView)

def __init__(self, path, key_type, value_type, read_only=False):

Specify the path to the database file to open.

key_type and value_type do not have defaults, so that you think about how you are using these, but we often use str,str and str,bytes

Parameters
path: database name/path. File will be created if it does not yet exist, so you probably want to think about repeating the same path in an absolute sense. See also the module docstring, and in particular resolve_path()'s docstring
key_type
value_type
read_only: is only enforced in this wrapper to give slightly more useful errors. (we also give SQLite a PRAGMA)
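
To illustrate what "enforces the type, but does not convert" can look like in practice, here is a minimal sketch of such a runtime check. It is not the library's implementation: the function name `check_type` is hypothetical, and raising TypeError is an assumption, since the text above does not say which exception is used.

```python
def check_type(value, expected_type, role):
    """Raise TypeError unless `value` is an instance of `expected_type`.

    Hypothetical helper: the exception type is an assumption.
    """
    if not isinstance(value, expected_type):
        raise TypeError(
            f"{role} should be {expected_type.__name__}, "
            f"got {type(value).__name__}"
        )


# A str:bytes store would accept these silently.
check_type("foo", str, "key")
check_type(b"\x00\x01", bytes, "value")

# No conversion is attempted, so an int key is rejected rather than stringified.
try:
    check_type(123, str, "key")
    rejected = False
except TypeError:
    rejected = True

print(rejected)
```

A check like this only guards what goes in through this wrapper; anything written to the file by other means is not covered.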
def __iter__(self):

Using this object as an iterator yields its keys (equivalent to .iterkeys())

def __len__(self):

Return the amount of entries in this store

def __repr__(self):

show useful representation

def bytesize(self):

Returns the approximate amount of the contained data, in bytes (may be a few dozen kilobytes off, or more, because it counts in pages)

Returns
int: Undocumented
def close(self):

Closes file if still open. Note that if there was a transaction still open, it will be rolled back, not committed.
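
The rollback-on-close behaviour can be seen with the standard sqlite3 module on its own. This is a sketch of the underlying driver behaviour rather than of this class: the `kv` table layout and the temporary file are assumptions made so the example is self-contained.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "store.db")

    conn = sqlite3.connect(demo_path)
    conn.execute("CREATE TABLE kv (key PRIMARY KEY, value)")  # assumed layout
    conn.commit()

    conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", ("a", "committed"))
    conn.commit()  # this row is now durable

    conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", ("b", "pending"))
    conn.close()  # closed without commit: the pending insert is discarded

    # Reopen and look at what actually survived.
    check = sqlite3.connect(demo_path)
    surviving_keys = [row[0] for row in check.execute("SELECT key FROM kv")]
    check.close()

print(surviving_keys)
```

So when writing with commit=False, call commit() before close() (or before leaving a `with` block) if the changes should be kept.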

def commit(self):

commit changes - for when you use put() or delete() with commit=False to do things in a larger transaction

def delete(self, key, commit=True):

delete item by key.

Note that you should not expect the file to shrink until you do a vacuum() (which will need to rewrite the file).

Parameters
key: Undocumented
commit (bool): Undocumented
def estimate_waste(self):

Estimate how many bytes might be cleaned by a .vacuum()

def get(self, key, missing_as_none=False):

Gets value for key. The key type is checked against how you constructed this LocalKV class (doesn't guarantee it matches what's in the database). If not present, this will raise KeyError (by default) or return None (if you set missing_as_none=True). This is unlike dict.get, which has a default=None.

Parameters
key: Undocumented
missing_as_none (bool): Undocumented
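
The two ways of handling a missing key can be sketched as follows. This is a stand-in function over a plain sqlite3 connection, not this class's code: the `kv` table layout is assumed, and the type check described above is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key PRIMARY KEY, value)")  # assumed layout
conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", ("foo", "bar"))
conn.commit()


def get(conn, key, missing_as_none=False):
    """Return the value for `key`; raise KeyError or return None when absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    if row is None:
        if missing_as_none:
            return None
        raise KeyError(key)
    return row[0]


found = get(conn, "foo")
missing = get(conn, "nope", missing_as_none=True)

try:
    get(conn, "nope")  # default behaviour: an absent key is an error
    raised = False
except KeyError:
    raised = True

print(found, missing, raised)
```

Note that with missing_as_none=True a stored None-like value and an absent key become indistinguishable, which is one reason to keep raising as the default.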
def items(self):

Returns an iterable of all items. (a view with a len, rather than just a generator)

def iteritems(self):

Returns a generator that yields all items

def iterkeys(self):

Returns a generator that yields all keys. If you wanted a list with all keys, use list( store.keys() )

def itervalues(self):

Returns a generator that yields all values. If you wanted a list with all the values, use list( store.values() )

def keys(self):

Returns an iterable of all keys. (a view with a len, rather than just a generator)

def put(self, key, value, commit=True):

Sets/updates value for a key.

Types will be checked according to what you inited this class with.

commit=False lets us do bulk commits, mostly when you want to do a load of (small) changes without becoming IOPS bound, at the risk of locking/blocking other access. If you care less about speed, and/or more about parallel access, ignore this.

CONSIDER: making commit take an integer as well, meaning 'commit every X operations'

Parameters
key: Undocumented
value: Undocumented
commit (bool): Undocumented
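
The bulk-write pattern, and the "commit every X operations" idea floated above, can be sketched on a plain sqlite3 connection. Everything here is illustrative: the `kv` layout, the stand-in `put`, and the `put_many` helper with its `commit_every` parameter are assumptions, not part of this class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key PRIMARY KEY, value)")  # assumed layout
conn.commit()


def put(conn, key, value, commit=True):
    """Stand-in for a put(): insert or overwrite one key, optionally committing."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
    )
    if commit:
        conn.commit()


def put_many(conn, items, commit_every=500):
    """Hypothetical helper: write many items, committing once per batch."""
    pending = 0
    for key, value in items:
        put(conn, key, value, commit=False)
        pending += 1
        if pending >= commit_every:
            conn.commit()
            pending = 0
    conn.commit()  # flush whatever is left in the last partial batch


put_many(conn, ((f"key{i}", f"value{i}") for i in range(1200)))
count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
print(count)
```

Smaller batches hold the write lock for shorter stretches, which trades some speed for friendlier behaviour towards other readers and writers.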
def random_choice(self):

Returns a single (key, value) item from the store, selected randomly.

A convenience function, because doing this properly yourself takes two or three lines (you can't random.choice/random.sample a view, so to do it properly you basically have to materialize all keys - and not accidentally all values) BUT assume this is slower than working on the keys yourself.

def random_keys(self, n=10):

Returns a number of keys in a list, selected randomly. Can be faster/cheaper to do than random_sample when the values are large.

On very large stores (tens of millions of items and/or hundreds of gigabytes) this still ends up taking dozens of seconds, because we still skip through a bunch of that data.

def random_sample(self, n):

Returns a list of n (key, value) tuples from the store, selected randomly.

WARNING: This materializes all keys and the chosen values in RAM, so can use considerable RAM if values are large. To avoid that RAM use, use random_keys() and get() one key at a time, or use random_sample_generator().

Note that when you ask for a larger sample than the entire population, you get the entire population (and unlike random.sample, we don't raise a ValueError to point out this is no longer a subselection)
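
The behaviour described above (materialize the keys, pick some, fetch only the chosen values, and cap the sample at the population size) might look roughly like this. It is a sketch over a plain sqlite3 connection with an assumed `kv` layout, not the library's actual code.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key PRIMARY KEY, value)")  # assumed layout
conn.executemany(
    "INSERT INTO kv (key, value) VALUES (?, ?)",
    [(f"k{i}", f"v{i}") for i in range(5)],
)
conn.commit()


def random_sample(conn, n):
    """Return up to n random (key, value) pairs; all keys are held in RAM."""
    keys = [row[0] for row in conn.execute("SELECT key FROM kv")]
    # Cap at the population size instead of letting random.sample raise ValueError.
    chosen = random.sample(keys, min(n, len(keys)))
    result = []
    for key in chosen:
        row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        result.append((key, row[0]))
    return result


small = random_sample(conn, 3)
everything = random_sample(conn, 100)
print(len(small), len(everything))
```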

def random_sample_generator(self, n=10):

A generator that yields one (key,value) tuple at a time, intended to avoid materializing all values before we return.

Still materializes all the keys before starting to yield, but that should only start to become troublesome on many-gigabyte stores, and it might avoid some locking issues.
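
A generator variant along these lines could be sketched as below: keys are listed up front, values are fetched one at a time as the caller iterates. The `kv` layout is assumed, and capping n at the number of keys and skipping keys that vanished in the meantime are choices made for this sketch, not documented behaviour.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key PRIMARY KEY, value)")  # assumed layout
conn.executemany(
    "INSERT INTO kv (key, value) VALUES (?, ?)",
    [(f"k{i}", f"v{i}") for i in range(20)],
)
conn.commit()


def random_sample_generator(conn, n=10):
    """Yield up to n random (key, value) pairs, fetching each value lazily."""
    keys = [row[0] for row in conn.execute("SELECT key FROM kv")]
    for key in random.sample(keys, min(n, len(keys))):
        row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        if row is not None:  # the key may have been deleted since it was listed
            yield key, row[0]


pairs = list(random_sample_generator(conn, n=4))
print(pairs)
```

Only one value is in memory at a time unless the caller collects them, as the `list(...)` call here does for demonstration.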

def random_values(self, n=10):

Returns a number of values in a list, selected randomly.

WARNING: this materializes the values, so this can be very large in RAM. Consider using random_values_generator, or using random_keys and get() one key at a time.

def random_values_generator(self, n=10):

A generator that yields one value at a time, intended to avoid materializing all values before we return.

Still materializes all the keys before starting to yield, but that should only start to become troublesome on many-gigabyte stores, and it might avoid some locking issues.

def rollback(self):

roll back changes

def summary(self, get_num_items=False):

Gives the byte size, and optionally the number of items and average size

Note that the byte size includes waste, so this will over-estimate if you have altered/removed without doing a vacuum().

Parameters
get_num_items:bool

Also find the number of items, and calculate average size. This is slower than not doing it (proportionally slower with underlying size), and adds entries like:

    'num_items':     856716,
    'avgsize_bytes': 63585,
    'avgsize_readable': '62K',
Returns

a dictionary with at least:

    {'size_bytes':     54474244096,
     'size_readable': '54G'}
def truncate(self, vacuum=True):

remove all kv entries. If we were still in a transaction, we roll that back first

def vacuum(self):

After a lot of deletes you could compact the store with vacuum(). WARNING: rewrites the entire file, so the more data you store, the longer this takes. And it may make no difference - you probably want to check estimate_waste() first. NOTE: if we were left in a transaction (due to commit=False), this is commit()ed first.
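
One way to see why a vacuum may or may not help is to look at SQLite's page counters before and after. The sketch below uses the standard `page_size`, `page_count` and `freelist_count` pragmas; whether bytesize() and estimate_waste() are computed exactly this way is an assumption, and the `page_stats` helper and `kv` layout are made up for the example.

```python
import os
import sqlite3
import tempfile


def page_stats(conn):
    """Return total and reclaimable bytes, both counted in whole pages."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return {
        "size_bytes": page_size * page_count,
        "waste_bytes": page_size * free_pages,
    }


with tempfile.TemporaryDirectory() as demo_dir:
    conn = sqlite3.connect(os.path.join(demo_dir, "store.db"))
    conn.execute("CREATE TABLE kv (key PRIMARY KEY, value)")  # assumed layout
    conn.executemany(
        "INSERT INTO kv (key, value) VALUES (?, ?)",
        ((f"k{i}", "x" * 1000) for i in range(2000)),
    )
    conn.commit()

    conn.execute("DELETE FROM kv")  # frees pages, but the file does not shrink
    conn.commit()                   # VACUUM cannot run inside a transaction
    before = page_stats(conn)

    conn.execute("VACUUM")          # rewrites the file without the free pages
    after = page_stats(conn)
    conn.close()

print(before, after)
```

If the reclaimable figure is small relative to the total, the rewrite is unlikely to be worth its cost.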

def values(self):

Returns an iterable of all values. (a view with a len, rather than just a generator)

conn =

connection to the sqlite database that we set up

key_type =

the key type you have set

path =

the path we opened (after resolving)

read_only =

whether we have told ourselves to treat this as read-only. That _should_ also make it hard for _us_ to be the cause of leaving the database in a locked state.

value_type =

the value type you have set

def _checktype_key(self, val):

checks a value according to the key_type you handed into the constructor

def _checktype_value(self, val):

checks a value according to the value_type you handed into the constructor

def _delete_meta(self, key):

For internal use, preferably don't use. See also _get_meta(), _put_meta(). Note this does an implicit commit()

Parameters
key (str): Undocumented
def _get_meta(self, key, missing_as_none=False):

For internal use, preferably don't use.

This is an extra str:str table in there that is intended to be separate, with some keys special to these classes. ...you could abuse this for your own needs if you wish, but try not to.

If the key is not present, raises an exception - unless missing_as_none is set, in which case it returns None.

Parameters
key (str): Undocumented
missing_as_none: Undocumented
def _open(self, timeout=3.0):

Open the path previously set by init. This function could probably be merged into init, it was separated mostly with the idea that we could keep it closed when not using it.

timeout: how long to wait on opening. Lowered from the default just to avoid waiting half a minute when the database was usually just accidentally left locked. (note that this is different from busy_timeout)
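
The effect of a short timeout on an accidentally locked database can be reproduced with two plain sqlite3 connections. This sketch assumes the timeout ends up as the `timeout` argument of `sqlite3.connect()`, which is a guess about this method's internals; the `kv` layout and the 0.2 second value are chosen only for the demonstration.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "store.db")

    writer = sqlite3.connect(demo_path)
    writer.execute("CREATE TABLE kv (key PRIMARY KEY, value)")  # assumed layout
    writer.commit()
    # An uncommitted insert leaves a transaction open and holds the write lock.
    writer.execute("INSERT INTO kv (key, value) VALUES (?, ?)", ("a", "pending"))

    other = sqlite3.connect(demo_path, timeout=0.2)  # give up quickly
    try:
        other.execute("INSERT INTO kv (key, value) VALUES (?, ?)", ("b", "blocked"))
        timed_out = False
        message = ""
    except sqlite3.OperationalError as exc:
        timed_out = True
        message = str(exc)

    writer.rollback()
    writer.close()
    other.close()

print(timed_out, message)
```

With a longer timeout the second connection would simply wait that much longer before reporting the same error, which is the half-minute wait the text above describes.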

def _put_meta(self, key, value):

For internal use, preferably don't use. See also _get_meta(), _delete_meta(). Note this does an implicit commit()

Parameters
key (str): Undocumented
value (str): Undocumented
_in_transaction: bool =

Undocumented