A key-value store backed by a local filesystem - it's a wrapper around sqlite3.
Given:
    db = LocalKV('path/to/dbfile')
Basic use is:
    db.put('foo', 'bar')
    db.get('foo')
Notes:
on the path/name argument:
- just a name ( without os.sep, that is, / or \ ) will be resolved to a path where wetsuite keeps various stores
- an absolute path will be passed through, used as-is ...but this is NOT very portable until you do things like `os.path.join( myproject_data_dir, 'docstore.db')`
- a relative path with os.sep will be passed through (...which is only as portable as the cwd is predictable)
- ':memory:' is in-memory only
- See also resolve_path for more details
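To make the rules above concrete, here is a minimal sketch of that decision logic. The function name `sketch_resolve_path` and its `store_dir` argument are hypothetical; this is not wetsuite's resolve_path, only an illustration of the cases listed.

```python
import os

def sketch_resolve_path(name, store_dir):
    """Hypothetical stand-in for the rules listed above; not wetsuite's resolve_path."""
    if name == ':memory:':
        return name                        # in-memory database, no file involved
    separators = [os.sep] + ([os.altsep] if os.altsep else [])
    if any(sep in name for sep in separators):
        return name                        # absolute or relative path: passed through as-is
    return os.path.join(store_dir, name)   # bare name: placed in a shared store directory

# A bare name ends up under the (assumed) store directory
resolved = sketch_resolve_path('docstore.db', os.path.join('my', 'stores'))
```

Anything containing a path separator is left alone, which is why relative paths depend on the current working directory.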
by default, each write is committed individually (because SQLite3's driver defaults to autocommit). If you want more performant bulk writes, use put() with commit=False, and do an explicit commit() afterwards ...BUT if a script fails in the middle of something uncommitted, you will need to do manual cleanup.
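One way to limit that manual cleanup is to roll back when a bulk write fails halfway. The sketch below uses the standard sqlite3 module directly rather than this class. The `bulk_put` helper is hypothetical, and the `kv` table layout is an assumption.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')  # assumed layout
conn.commit()

def bulk_put(pairs):
    """Hypothetical helper: write all pairs in one transaction, or none if anything fails."""
    try:
        for key, value in pairs:
            if not isinstance(key, str):
                raise TypeError(f'bad key: {key!r}')
            conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', (key, value))
        conn.commit()          # one commit for the whole batch
    except Exception:
        conn.rollback()        # discard the partial batch
        raise

bulk_put([('a', '1'), ('b', '2')])
try:
    bulk_put([('c', '3'), (None, '4')])   # second pair is invalid, so 'c' is rolled back too
except TypeError:
    pass

count = conn.execute('SELECT COUNT(*) FROM kv').fetchone()[0]
```

With this class, the equivalent would presumably be put(..., commit=False) in a try block, followed by commit() on success or rollback() on failure.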
On typing:
- SQLite will just store what it gets, which makes it easy to store mixed types. To allow programmers to enforce some runtime checking, you can specify key_type and value_type.
- This class won't do conversions for you, it only enforces that the values that go in are of the type you said you would put in.
- This should make things more consistent, but is not a strict guarantee, and you can subvert this easily.
- Some uses may wish for a specific key and value type. You could change both key and value types, e.g. the cached_fetch function expects a str:bytes store
- It is a good idea to open the store with the same typing every time, or you will still confuse yourself. CONSIDER: storing typing in the file in the meta table so we can warn you.
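The 'check, don't convert' behaviour described above can be sketched as follows. The `check_type` function is hypothetical and only illustrates the idea; the class's own checks may differ in detail.

```python
def check_type(value, expected_type, role):
    """Hypothetical check: refuse values of the wrong type, and convert nothing."""
    if not isinstance(value, expected_type):
        raise TypeError(
            f'{role} should be {expected_type.__name__}, got {type(value).__name__}'
        )
    return value

check_type('foo', str, 'key')        # passes, and returns the value unchanged

try:
    check_type(b'foo', str, 'key')   # bytes where str was promised: rejected, not decoded
    rejected = False
except TypeError:
    rejected = True
```

Because this only looks at what goes in, data already in the file (or written with different type settings) is not covered.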
making you do CRUD via functions is a little more typing,
- yet is arguably clearer than 'this particular dict-like happens to get stored magically'
- and it lets us expose some sqlite things (using transactions, vacuum) for when you know how to use them.
On concurrency: As per basic sqlite behaviour, multiple processes can read the same database, but only one can write, and writing is exclusive with reading. So
- when you leave a writer with uncommitted data for nontrivial amounts of time, readers are likely to time out.
- If you leave it on autocommit this should be a little rarer
- and a very slow read through the database might time out a write.
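The locking behaviour can be seen with plain sqlite3 connections. This sketch shows the simplest case to reproduce: a second connection that wants to write while the first still holds uncommitted data. The file name and `kv` table layout are assumptions, and the 0.2 second timeout is chosen only to keep the example quick.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'locktest.db')

writer = sqlite3.connect(path)
writer.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')
writer.commit()

# The first connection writes but does not commit, so it keeps its write lock.
writer.execute("INSERT INTO kv VALUES ('foo', 'bar')")

# The second connection gives up after 0.2 seconds rather than sqlite3's default 5.
other = sqlite3.connect(path, timeout=0.2)
try:
    other.execute("INSERT INTO kv VALUES ('spam', 'eggs')")
    blocked = False
except sqlite3.OperationalError:      # typically 'database is locked'
    blocked = True
    other.rollback()                  # leave the failed transaction cleanly

writer.commit()                       # releasing the lock lets the other connection proceed
other.execute("INSERT INTO kv VALUES ('spam', 'eggs')")
other.commit()
```

Committing promptly (or leaving the store on autocommit) keeps that locked window short.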
It wouldn't be hard to also make it act largely like a dict, implementing __getitem__, __setitem__, and __delitem__ but this muddies the waters as to its semantics, in particular when things you set are actually saved.
So we try to avoid a leaky abstraction, by making you write out all the altering operations, and actually all of them, e.g. get(), put(), keys(), values(), and items(), because those can at least have docstrings to warn you, rather than breaking your reasonable expectations.
...exceptions are
- __len__ for amount of items (CONSIDER making that len())
- __contains__ backing 'is this key in the store' (CONSIDER making that has_key())
and also:
- __iter__ which is actually iterkeys() CONSIDER: removing it
- __getitem__ supports the views-with-a-len
The last were tentative until keys(), values(), and items() started giving views.
TODO: make a final decision where to sit between clean abstractions and convenience.
yes, you _could_ access these SQLite databases yourself, particularly when just reading. Our code is mainly there for convenience and checks. Consider: `sqlite3 store.db 'select key,value from kv limit 10' | less` It only starts getting special once you are using MsgpackKV, or the extra parsing and wrapping that wetsuite.datasets adds.
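The same direct read can be done from Python with the standard sqlite3 module. The table and column names come from the shell command above. The column types, and the in-memory database standing in for a real store file, are assumptions made to keep the sketch self-contained.

```python
import sqlite3

# Stand-in for an existing store file; for a real one you would connect to its path.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')   # assumed layout
conn.executemany('INSERT INTO kv VALUES (?, ?)', [('foo', 'bar'), ('spam', 'eggs')])
conn.commit()

# The same query as the command-line example
rows = conn.execute('SELECT key, value FROM kv LIMIT 10').fetchall()
for key, value in rows:
    print(key, value)
```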
Method | __contains__ |
will return whether the store contains a key |
Method | __enter__ |
supports use as a context manager |
Method | __exit__ |
supports use as a context manager - close()s on exit |
Method | __getitem__ |
(only meant to support ValuesView and ItemsView) |
Method | __init__ |
Specify the path to the database file to open. |
Method | __iter__ |
Using this object as an iterator yields its keys (equivalent to .iterkeys()) |
Method | __len__ |
Return the amount of entries in this store |
Method | __repr__ |
show useful representation |
Method | bytesize |
Returns the approximate amount of the contained data, in bytes (may be a few dozen kilobytes off, or more, because it counts in pages) |
Method | close |
Closes file if still open. Note that if there was a transaction still open, it will be rolled back, not committed. |
Method | commit |
commit changes - for when you use put() or delete() with commit=False to do things in a larger transaction |
Method | delete |
delete item by key. |
Method | estimate_waste |
Estimate how many bytes might be cleaned by a .vacuum() |
Method | get |
Gets value for key. The key type is checked against how you constructed this LocalKV (which doesn't guarantee it matches what's in the database). If the key is not present, this will raise KeyError (by default) or return None (if you set missing_as_None=True) (this is unlike a dict... |
Method | items |
Returns an iterable of all items. (a view with a len, rather than just a generator) |
Method | iteritems |
Returns a generator that yields all items |
Method | iterkeys |
Returns a generator that yields all keys. If you wanted a list with all keys, use list( store.keys() ) |
Method | itervalues |
Returns a generator that yields all values. If you wanted a list with all the values, use list( store.values() ) |
Method | keys |
Returns an iterable of all keys. (a view with a len, rather than just a generator) |
Method | put |
Sets/updates value for a key. |
Method | random |
Returns a single (key, value) item from the store, selected randomly. |
Method | random_keys |
Returns a number of keys in a list, selected randomly. Can be faster/cheaper to do than random_sample when the values are large |
Method | random_sample |
Returns a given number of items from the store as a [(key, value), ...] list, selected randomly. |
Method | random_sample_generator |
A generator that yields one (key,value) tuple at a time, intended to avoid materializing all values before we return. |
Method | random |
Returns a number of values in a list, selected randomly. |
Method | random_values_generator |
A generator that yields one value at a time, intended to avoid materializing all values before we return. |
Method | rollback |
roll back changes |
Method | summary |
Gives the byte size, and optionally the number of items and average size |
Method | truncate |
remove all kv entries. If we were still in a transaction, we roll that back first |
Method | vacuum |
After a lot of deletes you could compact the store with vacuum(). WARNING: rewrites the entire file, so the more data you store, the longer this takes. And it may make no difference - you probably want to check estimate_waste() first... |
Method | values |
Returns an iterable of all values. (a view with a len, rather than just a generator) |
Instance Variable | conn |
connection to the sqlite database that we set up |
Instance Variable | key |
the key type you have set |
Instance Variable | path |
the path we opened (after resolving) |
Instance Variable | read |
whether we have told ourselves to treat this as read-only. That _should_ also make it hard for _us_ to be the cause of leaving the database in a locked state. |
Instance Variable | value |
the value type you have set |
Method | _checktype |
checks a value according to the key_type you handed into the constructor |
Method | _checktype |
checks a value according to the value_type you handed into the constructor |
Method | _delete |
For internal use, preferably don't use. See also _get_meta(), _delete_meta(). Note this does an implicit commit() |
Method | _get |
For internal use, preferably don't use. |
Method | _open |
Open the path previously set by init. This function could probably be merged into init, it was separated mostly with the idea that we could keep it closed when not using it. |
Method | _put |
For internal use, preferably don't use. See also _get_meta(), _delete_meta(). Note this does an implicit commit() |
Instance Variable | _in |
Undocumented |
wetsuite.helpers.localdata.MsgpackKV
Specify the path to the database file to open.
key_type and value_type do not have defaults, so that you think about how you are using these, but we often use str,str and str,bytes
Parameters | |
path | database name/path. The file will be created if it does not yet exist, so you probably want to think about repeating the same path in an absolute sense. See also the module docstring, and in particular resolve_path()'s docstring |
key_type | |
value_type | |
read | is only enforced in this wrapper to give slightly more useful errors. (we also give SQLite a PRAGMA) |
Returns the approximate amount of the contained data, in bytes (may be a few dozen kilobytes off, or more, because it counts in pages)
Returns | |
int | Undocumented |
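The 'counts in pages' remark can be illustrated with two SQLite pragmas. This is a sketch of one plausible way to get such a number, not necessarily how this method does it. The `kv` table layout is an assumption.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')   # assumed layout
conn.executemany(
    'INSERT INTO kv VALUES (?, ?)',
    [(f'key{i}', 'x' * 100) for i in range(1000)],
)
conn.commit()

page_size = conn.execute('PRAGMA page_size').fetchone()[0]     # bytes per page
page_count = conn.execute('PRAGMA page_count').fetchone()[0]   # pages in the database
approx_bytes = page_size * page_count                          # always a whole number of pages
```

Since partially filled and unused pages count in full, the result is an upper bound on the stored data rather than an exact figure.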
Closes file if still open. Note that if there was a transaction still open, it will be rolled back, not committed.
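The underlying sqlite3 behaviour, where closing a connection discards an uncommitted transaction, can be checked directly. This sketch uses the standard sqlite3 module rather than this class. The file name and `kv` table layout are assumptions.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'closetest.db')

conn = sqlite3.connect(path)
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')
conn.commit()

conn.execute("INSERT INTO kv VALUES ('foo', 'bar')")   # opens a transaction, never committed
conn.close()                                           # the pending insert is discarded

# Reopen and look at what actually reached the file
conn = sqlite3.connect(path)
remaining = conn.execute('SELECT COUNT(*) FROM kv').fetchone()[0]
conn.close()
```

So when writing with commit=False, call commit() before close(), or before leaving a with block, since __exit__ calls close().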
commit changes - for when you use put() or delete() with commit=False to do things in a larger transaction
delete item by key.
Note that you should not expect the file to shrink until you do a vacuum() (which will need to rewrite the file).
Parameters | |
key | Undocumented |
commit:bool | Undocumented |
wetsuite.helpers.localdata.MsgpackKV
Gets value for key. The key type is checked against how you constructed this LocalKV (which doesn't guarantee it matches what's in the database). If the key is not present, this will raise KeyError (by default) or return None (if you set missing_as_None=True). This is unlike dict.get, which has a default=None.
Parameters | |
key | Undocumented |
missing_as_None:bool | Undocumented |
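The difference from dict.get can be sketched with a stand-in function. Here `sketch_get` is hypothetical, and works on a plain sqlite3 connection with an assumed `kv` table.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')   # assumed layout
conn.execute("INSERT INTO kv VALUES ('foo', 'bar')")
conn.commit()

def sketch_get(key, missing_as_None=False):
    """Hypothetical stand-in: raise KeyError on a missing key unless asked not to."""
    row = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    if row is None:
        if missing_as_None:
            return None
        raise KeyError(key)
    return row[0]

present = sketch_get('foo')
absent = sketch_get('nope', missing_as_None=True)   # None instead of an exception
```

Raising by default means a missing key cannot be mistaken for a stored None.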
Returns a generator that yields all keys. If you wanted a list with all keys, use list( store.keys() )
wetsuite.helpers.localdata.MsgpackKV
Returns a generator that yields all values. If you wanted a list with all the values, use list( store.values() )
wetsuite.helpers.localdata.MsgpackKV
Sets/updates value for a key.
Types will be checked according to what you inited this class with.
commit=False lets us do bulk commits, mostly when you want to do a load of (small) changes without becoming IOPS bound, at the risk of locking/blocking other access. If you care less about speed, and/or more about parallel access, ignore this.
CONSIDER: making commit take an integer as well, meaning 'commit every X operations'
Parameters | |
key | Undocumented |
value | Undocumented |
commit:bool | Undocumented |
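Until something like 'commit every X operations' exists, the caller can do it by hand. The sketch below shows the idea on a plain sqlite3 connection. The `put_many` helper and its `commit_every` parameter are hypothetical and not part of this class.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')   # assumed layout
conn.commit()

def put_many(pairs, commit_every=250):
    """Hypothetical helper: commit after every commit_every writes, and once at the end."""
    commits = 0
    pending = 0
    for key, value in pairs:
        conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', (key, value))
        pending += 1
        if pending >= commit_every:
            conn.commit()
            commits += 1
            pending = 0
    if pending:                 # flush the final partial batch
        conn.commit()
        commits += 1
    return commits

commits = put_many(((f'key{i}', f'value{i}') for i in range(1000)), commit_every=300)
```

Smaller batches hold the write lock for shorter stretches, at the cost of more commits.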
Returns a single (key, value) item from the store, selected randomly.
A convenience function, because doing this properly yourself takes two or three lines (you can't random.choice/random.sample a view, so to do it properly you basically have to materialize all keys - and not accidentally all values) BUT assume this is slower than working on the keys yourself.
Returns a number of keys in a list, selected randomly. Can be faster/cheaper to do than random_sample when the values are large.
On very large stores (tens of millions of items and/or hundreds of gigabytes) this still ends up taking dozens of seconds, because we still skip through a bunch of that data.
Returns a given number of items from the store as a [(key, value), ...] list, selected randomly.
WARNING: This materializes all keys and the chosen values in RAM, so can use considerable RAM if values are large. To avoid that RAM use, use random_keys() and get() one key at a time, or use random_sample_generator().
Note that when you ask for a larger sample than the entire population, you get the entire population (and unlike random.sample, we don't raise a ValueError to point out this is no longer a subselection)
A generator that yields one (key,value) tuple at a time, intended to avoid materializing all values before we return.
Still materializes all the keys before starting to yield, but that should only become troublesome on many-gigabyte stores, and it might avoid some locking issues.
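The 'materialize the keys, fetch values one at a time' approach described above might look roughly like this. The function name is hypothetical, and it works on a plain sqlite3 connection with an assumed `kv` table. It is a sketch of the technique, not this class's code.

```python
import random
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')   # assumed layout
conn.executemany('INSERT INTO kv VALUES (?, ?)', [(f'key{i}', f'value{i}') for i in range(100)])
conn.commit()

def sketch_random_sample_generator(amount):
    """Hypothetical sketch: sample from the keys, then fetch each value only when needed."""
    keys = [row[0] for row in conn.execute('SELECT key FROM kv')]   # keys only, no values
    amount = min(amount, len(keys))    # asking for too many yields the whole population
    for key in random.sample(keys, amount):
        value = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()[0]
        yield key, value

sample = list(sketch_random_sample_generator(5))
everything = list(sketch_random_sample_generator(1000))   # more than the store holds
```

Only the list of keys and one value at a time are held in memory, which is the point when values are large.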
Returns a number of values in a list, selected randomly.
WARNING: this materializes the values, so this can be very large in RAM. Consider using random_values_generator, or using random_keys and get() one key at a time.
A generator that yields one value at a time, intended to avoid materializing all values before we return.
Still materializes all the keys before starting to yield, but that should only become troublesome on many-gigabyte stores, and it might avoid some locking issues.
Gives the byte size, and optionally the number of items and average size
Note that the byte size includes waste, so this will over-estimate if you have altered/removed without doing a vacuum().
Parameters | |
getbool | Also find the number of items, and calculate average size. This is slower than not doing it (proportionally slower with underlying size), and adds entries like: 'num_items': 856716, 'avgsize_bytes': 63585, 'avgsize_readable': '62K' |
Returns | |
a dictionary with at least: {'size_bytes': 54474244096, 'size_readable': '54G'} |
After a lot of deletes you could compact the store with vacuum(). WARNING: rewrites the entire file, so the more data you store, the longer this takes. And it may make no difference - you probably want to check estimate_waste() first. NOTE: if we were left in a transaction (due to commit=False), this is commit()ed first.
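What 'waste' means here, and what a vacuum does about it, can be seen with plain sqlite3. This sketch estimates reclaimable space from SQLite's free-page count, which is one plausible approach and not necessarily what estimate_waste() does. The file name, the `kv` table layout, and the `pragma` helper are assumptions.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'vacuumtest.db')
conn = sqlite3.connect(path)
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')   # assumed layout
conn.executemany('INSERT INTO kv VALUES (?, ?)', [(f'key{i}', 'x' * 1000) for i in range(2000)])
conn.commit()

conn.execute('DELETE FROM kv')    # frees pages inside the file, but the file does not shrink
conn.commit()

def pragma(name):
    """Small helper (hypothetical): read a single-value PRAGMA and close its cursor."""
    cursor = conn.execute(f'PRAGMA {name}')
    value = cursor.fetchone()[0]
    cursor.close()
    return value

reclaimable_bytes = pragma('page_size') * pragma('freelist_count')
size_before = os.path.getsize(path)

conn.execute('VACUUM')            # rewrites the file without the free pages
size_after = os.path.getsize(path)
conn.close()
```

If the estimate is small compared to the file size, the rewrite is probably not worth its cost.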
whether we have told ourselves to treat this as read-only. That _should_ also make it hard for _us_ to be the cause of leaving the database in a locked state.
For internal use, preferably don't use. See also _get_meta(), _delete_meta(). Note this does an implicit commit()
Parameters | |
key:str | Undocumented |
For internal use, preferably don't use.
This is an extra str:str table in there that is intended to be separate, with some keys special to these classes. ...you could abuse this for your own needs if you wish, but try not to.
If the key is not present, raises an exception - unless missing_as_none is set, in which case it returns None.
Parameters | |
key:str | Undocumented |
missing | Undocumented |
Open the path previously set by init. This function could probably be merged into init, it was separated mostly with the idea that we could keep it closed when not using it.
timeout: how long to wait on opening. Lowered from the default just to avoid a lot of half-minute waits when it was usually just accidentally left locked. (note that this is different from busy_timeout)