wetsuite.helpers.util

module documentation

General utility functions, such as "give me a path where wetsuite can store data", plus debug tools for inspecting data.

Function	`free_space`	Reports how many bytes are free on the filesystem that holds the given path.
Function	`get_ziphtml`	Made for the .html.zip files that KOOP puts e.g. in its BUS.
Function	`has_xml_header`	Mostly meant as a "reject as HTML?" optimization (it is not actually an is_xml test because the declaration is optional in 1.0 (required in 1.1) and this won't match e.g. UTF-16 XML, but it still catches a _lot_ of real-world XML)...
Function	`hash_color`	Give a CSS color for a string - consistently the same each time based on a hash.
Function	`hash_hex`	Given some byte data, calculate SHA1 hash. Returns that hash as a hex string, unless you specify as_bytes=True
Function	`is_doc`	Does this seem like some kind of office document type? This is currently quick and dirty based on some observations that may not even be correct. TODO: improve
Function	`is_empty_zip`	Does this bytestring look like an empty ZIP file?
Function	`is_html`	Do these bytes look like an HTML document? (no specific distinction to XHTML)
Function	`is_htmlzip`	Made for the .html.zip files that KOOP puts e.g. in its BUS.
Function	`is_pdf`	Does this bytestring look like a PDF document?
Function	`is_xml`	Does this look and work like an XML file?
Function	`is_zip`	Does this bytestring look like a ZIP file?
Function	`unified_diff`	Returns a unified-diff-like difference between two strings. Not meant for actual patching, just for quick debug-printing of changes.
Function	`wetsuite_dir`	Figure out where we can store data.
Function	`_filetype`	describe which of these basic types a bytes object contains

def free_space(path=None):

Reports how many bytes are free on the filesystem that holds the given path.

Parameters
path	path to check (shutil will figure out which filesystem it is on). Defaults to the directory we would store datasets in.
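
As a rough illustration of the kind of check described here, the following is a minimal sketch built on `shutil.disk_usage`. The name `free_space_sketch` is hypothetical and it requires an explicit path; it is not wetsuite's own implementation.

```python
import shutil
import tempfile


def free_space_sketch(path):
    """Return the number of free bytes on the filesystem that holds `path`."""
    # shutil.disk_usage returns a named tuple (total, used, free), all in bytes
    return shutil.disk_usage(path).free


# Try it on the system's temporary directory
free_bytes = free_space_sketch(tempfile.gettempdir())
print(f"{free_bytes / 1024**3:.1f} GiB free")
```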

def get_ziphtml(bytesdata):

Made for the .html.zip files that KOOP puts e.g. in its BUS.

Gets the contents of the first file in the ZIP whose name ends in .html. Assuming you tested with is_htmlzip(), this should be the main file (there might also e.g. be images in there).

Returns a bytestring, or raises an exception.

Parameters
bytesdata:`bytes`	the bytestring to treat as a ZIP file.
Returns
the HTML file as a bytes object
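
The behaviour described above could be approximated with the standard `zipfile` module. This is only a sketch under assumptions: `first_html_member` is a hypothetical name, and raising `ValueError` when nothing matches is a choice made for this example, not necessarily what wetsuite does.

```python
import io
import zipfile


def first_html_member(bytesdata):
    """Return the bytes of the first ZIP member whose name ends in .html."""
    with zipfile.ZipFile(io.BytesIO(bytesdata)) as archive:
        for name in archive.namelist():
            if name.endswith(".html"):
                return archive.read(name)
    # Assumption for this sketch: signal "nothing found" with ValueError
    raise ValueError("no .html member found in ZIP")


# Build a small in-memory ZIP that also holds a non-HTML member
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("image.png", b"not really an image")
    archive.writestr("doc.html", b"<html><body>hi</body></html>")

html_bytes = first_html_member(buffer.getvalue())
print(html_bytes)
```

Data that is not a ZIP at all makes `zipfile.ZipFile` raise `zipfile.BadZipFile`, which is one way the "raises an exception" case can come about.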

def has_xml_header(bytesdata):

Mostly meant as a "reject as HTML?" optimization. It is not actually an is_xml test: the declaration is optional in XML 1.0 (required in 1.1), and this won't match e.g. UTF-16 XML. It still catches a _lot_ of real-world XML.

Parameters
bytesdata:`bytes`	Undocumented
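
To make the caveats concrete, here is a minimal sketch of a declaration check. The name `starts_with_xml_declaration` is hypothetical, and skipping a UTF-8 byte order mark and leading whitespace are assumptions of this example rather than documented behaviour.

```python
def starts_with_xml_declaration(bytesdata):
    """Cheap test: do these bytes start with an '<?xml' declaration?"""
    head = bytesdata[:64]
    # Assumption: tolerate a UTF-8 byte order mark and leading whitespace
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.lstrip().startswith(b"<?xml")


with_declaration = b'<?xml version="1.0" encoding="UTF-8"?><a/>'
without_declaration = b"<a/>"  # still well-formed XML 1.0
as_utf16 = '<?xml version="1.0"?><a/>'.encode("utf-16")

print(starts_with_xml_declaration(with_declaration))
print(starts_with_xml_declaration(without_declaration))
print(starts_with_xml_declaration(as_utf16))
```

Only the first input is accepted. The other two are the false negatives mentioned above: XML without a declaration, and XML in an encoding whose bytes do not literally start with `<?xml`.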

def hash_color(string, on=None):

Give a CSS color for a string - consistently the same each time based on a hash.

Takes a string, and returns (css_str, r,g,b), where r,g,b are the 0..255 values of that same color.

Usable e.g. to make tables with categorical values more skimmable.

Parameters
string:`str`	the string to hash
on	if 'dark', we try for a bright color, if 'light', we try to give a dark color, otherwise not restricted
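
One way such a stable color could be derived is sketched below. Everything here is an assumption for illustration: the name `hash_color_sketch`, the use of SHA1, taking the first three digest bytes as r, g, b, the way `on` squeezes the channels, and returning a flat 4-tuple.

```python
import hashlib


def hash_color_sketch(string, on=None):
    """Map a string to a stable (css_str, r, g, b) tuple."""
    digest = hashlib.sha1(string.encode("utf8")).digest()
    r, g, b = digest[0], digest[1], digest[2]
    if on == "dark":
        # Squeeze every channel into 128..255 so the color is bright
        r, g, b = (128 + c // 2 for c in (r, g, b))
    elif on == "light":
        # Squeeze every channel into 0..127 so the color is dark
        r, g, b = (c // 2 for c in (r, g, b))
    return f"#{r:02x}{g:02x}{b:02x}", r, g, b


for category in ("wet", "besluit", "wet"):
    css, r, g, b = hash_color_sketch(category, on="light")
    print(category, css, (r, g, b))
```

The same string always yields the same color, so repeated categorical values in a table line up visually.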

def hash_hex(data, as_bytes=False):

Given some byte data, calculate its SHA1 hash. Returns that hash as a hex string, unless you specify as_bytes=True.

If instead given a (unicode) string, it accepts it and UTF-8 encodes it first. Warning: this is not always what you want.

Parameters
data:`bytes`	the bytes to hash
as_bytes:`bool`	whether to return the hash digest as a bytes object. Defaults to False, meaning a hex string (like 'a49d')
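
The description maps closely onto `hashlib.sha1`. The sketch below uses the hypothetical name `sha1_sketch` and also shows why the implicit UTF-8 encoding deserves the warning: the same text encoded differently hashes differently.

```python
import hashlib


def sha1_sketch(data, as_bytes=False):
    """SHA1 of bytes; str input is UTF-8 encoded first."""
    if isinstance(data, str):
        data = data.encode("utf8")
    hasher = hashlib.sha1(data)
    return hasher.digest() if as_bytes else hasher.hexdigest()


print(sha1_sketch(b"abc"))                  # 40-character hex string
print(len(sha1_sketch(b"abc", as_bytes=True)))  # raw digest is 20 bytes

# The same text hashes differently depending on how it was encoded
from_str = sha1_sketch("é")
from_latin1 = sha1_sketch("é".encode("latin-1"))
print(from_str == from_latin1)
```

If the bytes you want to compare against were stored in another encoding, encode explicitly and pass bytes rather than relying on the UTF-8 default.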

def is_doc(bytesdata):

Does this seem like some kind of office document type? This is currently quick and dirty based on some observations that may not even be correct. TODO: improve

Parameters
bytesdata:`bytes`	Undocumented
Returns
`bool`	Undocumented

def is_empty_zip(bytesdata):

Does this bytestring look like an empty ZIP file?

Parameters
bytesdata:`bytes`	the bytestring to check contains a ZIP file that stores nothing.
Returns
`bool`	whether it is an empty ZIP
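
A check like this can be done on leading signature bytes alone. The sketch below assumes that approach (the source does not say how wetsuite tests it) and uses hypothetical names. An archive with members starts with a local file header, `PK\x03\x04`, while an archive with no members starts directly with the end-of-central-directory record, `PK\x05\x06`.

```python
import io
import zipfile


def looks_like_empty_zip(bytesdata):
    """No members: the data starts with the end-of-central-directory record."""
    return bytesdata.startswith(b"PK\x05\x06")


def looks_like_zip(bytesdata):
    """Either a local file header or an empty archive."""
    return bytesdata.startswith(b"PK\x03\x04") or looks_like_empty_zip(bytesdata)


def make_zip(members):
    """Build an in-memory ZIP from a {name: bytes} dict."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


empty_zip = make_zip({})
filled_zip = make_zip({"doc.html": b"<html></html>"})

print(looks_like_empty_zip(empty_zip), looks_like_empty_zip(filled_zip))
print(looks_like_zip(empty_zip), looks_like_zip(filled_zip))
```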

def is_html(bytesdata):

Do these bytes look like an HTML document? (no specific distinction to XHTML)

Parameters
bytesdata:`bytes`	the bytestring to check is a HTML file.
Returns
`bool`	whether it is HTML

def is_htmlzip(bytesdata):

Made for the .html.zip files that KOOP puts e.g. in its BUS.

Is this a ZIP file with one entry for which the name ends with .html? (We could test its content with is_html, but given the context we can assume it.)

Parameters
bytesdata:`bytes`	the bytestring to check/treat as a ZIP file.
Returns
`bool`	whether it is a ZIP containing HTML

def is_pdf(bytesdata):

Does this bytestring look like a PDF document?

Parameters
bytesdata:`bytes`	the bytestring to check looks like the start of a PDF
Returns
`bool`	whether it is PDF
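
Since the parameter description speaks of "the start of a PDF", a header check is one plausible reading. The sketch below assumes exactly that and uses the hypothetical name `looks_like_pdf`. PDF files begin with `%PDF-` followed by a version number.

```python
def looks_like_pdf(bytesdata):
    """Do these bytes start with a PDF header such as b'%PDF-1.7'?"""
    return bytesdata.startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
print(looks_like_pdf(b"<html><body>not a pdf</body></html>"))
```

A check this strict rejects files that carry stray bytes before the header, so treat it as a quick filter rather than validation.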

def is_xml(bytesdata_or_filename, return_for_html=False, accept_after_n_nodes=25):

Does this look and work like an XML file?

Yes, we could answer "does it look vaguely like the start of an XML" for a lot cheaper than parsing it.

Yet that doesn't have a lot of value in practice, because you would probably only use this function right before handing it to an XML parser to do a full parse - or deciding not to.

A full parse is the best test, but that would just be double work. So in practice, a function that answers 'would a real XML parser probably accept this?' more cheaply than a full parse is probably more useful.

It's probably lighter to see if the first bunch of nodes don't break XML semantics (we use lxml's iterparse and stop after a number of nodes went fine).

(TODO: the input-type code can be made lighter)

Note on semantics: We consider HTML, and also XHTML, to warrant a False, mostly due to the way we use this function ourselves.

Parameters
bytesdata_or_filename	If given bytes, they are treated as file contents. If given a str object, it is treated as a _filesystem filename_ to read from
return_for_html	Undocumented
accept_after_n_nodes:`int`	After how many nodes (that parsed fine) do we accept this and stop parsing?
Returns
`bool`	whether it is XML - and probably not XHTML or HTML
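
The "parse a few nodes, then stop" idea can be sketched as follows. Note the assumptions: this uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for lxml's iterparse, the name `probably_xml` is hypothetical, only bytes input is handled, and the HTML/XHTML rejection described above is left out.

```python
import io
import xml.etree.ElementTree as ET


def probably_xml(bytesdata, accept_after_n_nodes=25):
    """Accept once enough elements parsed cleanly, or the document ended fine."""
    seen = 0
    try:
        events = ET.iterparse(io.BytesIO(bytesdata), events=("start",))
        for _event, _element in events:
            seen += 1
            if seen >= accept_after_n_nodes:
                return True  # enough nodes went fine; skip the rest
    except ET.ParseError:
        return False
    return seen > 0


print(probably_xml(b"<root><a/><b/></root>"))
print(probably_xml(b"<root><a></root>"))

# Breakage after the first 25 nodes goes unnoticed; that is the trade-off
long_but_broken = b"<r>" + b"<i/>" * 100 + b"<broken"
print(probably_xml(long_but_broken))
```

The last call is accepted even though a full parse would fail, which is why the answer is "a real parser would probably accept this" rather than a guarantee.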

def is_zip(bytesdata):

Does this bytestring look like a ZIP file?

Parameters
bytesdata:`bytes`	the bytestring to check contains a ZIP file.
Returns
`bool`	whether it is a ZIP

def unified_diff(before, after, strip_header=True, context_n=999):

Returns a unified-diff-like difference between two strings. Not meant for actual patching, just for quick debug-printing of changes.

Parameters
before:`str`	a string to treat as the original
after:`str`	a string to treat as the new version
strip_header	whether to strip the first lines
context_n	how much context to include. Defaults to something high so that it omits little to nothing.
Returns
`str`	a string that contains plain text unified-diff-like output (with initial header cut off)
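
Output of this kind can be produced with `difflib.unified_diff` from the standard library. The sketch below is an approximation under assumptions: the name `unified_diff_sketch` is hypothetical, and "the first lines" is taken to mean the two file-name lines (`---` and `+++`), which may not match what wetsuite strips.

```python
import difflib


def unified_diff_sketch(before, after, strip_header=True, context_n=999):
    """Unified-diff-like text for two strings, for debug printing."""
    lines = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            n=context_n,   # large value: keep nearly all unchanged lines
            lineterm="",   # inputs carry no newlines, so add none to headers
        )
    )
    if strip_header:
        # Assumption: drop only the '--- before' and '+++ after' lines
        lines = lines[2:]
    return "\n".join(lines)


diff_text = unified_diff_sketch("a\nb\nc", "a\nB\nc")
print(diff_text)
```

With the large default for `context_n`, unchanged lines appear in full around each change, which suits eyeballing small documents more than producing compact patches.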

def wetsuite_dir():

Figure out where we can store data.

Returns a dict with keys mentioning directories:

wetsuite_dir: a directory in the user profile where we can store things
datasets_dir: a directory inside wetsuite_dir that datasets.load() will put dataset files in
stores_dir: a directory inside wetsuite_dir that localdata will put sqlite files in

Keep in mind:

When Windows users have their user profile on the network, we try to pick a directory more likely to be shared by your other logins.
...BUT keep in mind that network mounts tend not to implement proper locking, so for certain things (e.g. our localdata.LocalKV) you invite corruption when multiple workstations write at the same time, so don't do that. If you need distributed work, use an actually networked store, or be okay with read-only access.

def _filetype(docbytes):

describe which of these basic types a bytes object contains

Parameters
docbytes:`bytes`	Undocumented