module documentation

General utility functions, such as "give me a path where wetsuite can store data", plus debug tools for inspecting data.

Function free_space - Reports how many bytes are free on the filesystem that holds the given path.
Function get_ziphtml - Made for the .html.zip files that KOOP puts e.g. in its BUS.
Function has_xml_header - Mostly meant as a "reject as HTML?" optimization. It is not an actual is_xml test, because the declaration is optional in XML 1.0 (required in 1.1) and this won't match e.g. UTF-16 XML, but it still catches a _lot_ of real-world XML.
Function hash_color - Give a CSS color for a string, consistently the same each time, based on a hash.
Function hash_hex - Given some byte data, calculate its SHA1 hash. Returns that hash as a hex string, unless you specify as_bytes=True.
Function is_doc - Does this seem like some kind of office document type? This is currently quick and dirty, based on some observations that may not even be correct. TODO: improve
Function is_empty_zip - Does this bytestring look like an empty ZIP file?
Function is_html - Do these bytes look like an HTML document? (no specific distinction to XHTML)
Function is_htmlzip - Made for the .html.zip files that KOOP puts e.g. in its BUS.
Function is_pdf - Does this bytestring look like a PDF document?
Function is_xml - Does this look and work like an XML file?
Function is_zip - Does this bytestring look like a ZIP file?
Function unified_diff - Returns a unified-diff-like difference between two strings. Not meant for actual patching, just for quick debug-printing of changes.
Function wetsuite_dir - Figure out where we can store data.
Function _filetype - Describe which of these basic types a bytes object contains.
def free_space(path=None):

Reports how many bytes are free on the filesystem that holds the given path.

Parameters
path - path to check (shutil will figure out which filesystem it is on). Defaults to the directory we would store datasets into.
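The parameter description mentions shutil. A minimal sketch of that kind of check, assuming the standard library's shutil.disk_usage is the relevant call (free_space_sketch is a hypothetical name, not the module's implementation):

```python
import shutil
import tempfile


def free_space_sketch(path):
    """Return the number of free bytes on the filesystem that holds `path`."""
    # disk_usage returns a named tuple (total, used, free), all in bytes
    return shutil.disk_usage(path).free


# The temp directory is used here only because it exists on every system
free_bytes = free_space_sketch(tempfile.gettempdir())
print(f"{free_bytes / 1024**3:.1f} GiB free")
```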
def get_ziphtml(bytesdata):

Made for the .html.zip files that KOOP puts e.g. in its BUS.

Gets the contents of the first file in the ZIP with a name ending in .html. Assuming you tested with is_htmlzip(), this should be the main file (there might also e.g. be images in there).

Returns a bytestring, or raises an exception.

Parameters
bytesdata: bytes - the bytestring to treat as a ZIP file.
Returns
the HTML file as a bytes object
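The behaviour described above can be sketched with the standard zipfile module. This is an illustration, not the module's code: get_ziphtml_sketch is a hypothetical name, and raising ValueError when nothing matches is an assumption, since the text does not say which exception is raised.

```python
import io
import zipfile


def get_ziphtml_sketch(bytesdata):
    """Return the bytes of the first ZIP member whose name ends in .html."""
    with zipfile.ZipFile(io.BytesIO(bytesdata)) as zf:
        for name in zf.namelist():
            if name.endswith(".html"):
                return zf.read(name)
    # Assumption: the exception type is our choice for this sketch
    raise ValueError("no .html member found in ZIP")


# Build a small in-memory ZIP to try it on
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("image.png", b"\x89PNG")
    zf.writestr("document.html", b"<html><body>test</body></html>")

html_bytes = get_ziphtml_sketch(buf.getvalue())
```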
def has_xml_header(bytesdata):

Mostly meant as a "reject as HTML?" optimization. It is not an actual is_xml test, because the declaration is optional in XML 1.0 (required in 1.1) and this won't match e.g. UTF-16 XML, but it still catches a _lot_ of real-world XML.

Parameters
bytesdata: bytes - Undocumented
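A rough sketch of such a header check, to make the caveats concrete. The details are assumptions (the name has_xml_header_sketch, skipping a UTF-8 byte order mark, tolerating leading whitespace); the real function may differ.

```python
def has_xml_header_sketch(bytesdata):
    """Does the data start with an XML declaration, as seen in ASCII-compatible encodings?"""
    head = bytesdata[:100]
    # Assumption: tolerate a UTF-8 byte order mark and leading whitespace
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.lstrip().startswith(b"<?xml")


with_decl = has_xml_header_sketch(b'<?xml version="1.0" encoding="UTF-8"?><a/>')
html_doc = has_xml_header_sketch(b"<!DOCTYPE html><html></html>")
# Valid XML 1.0 without a declaration is missed, as is UTF-16 encoded XML
no_decl = has_xml_header_sketch(b"<a/>")
utf16 = has_xml_header_sketch('<?xml version="1.0"?><a/>'.encode("utf-16"))
```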
def hash_color(string, on=None):

Give a CSS color for a string - consistently the same each time based on a hash.

Takes a string, and returns (css_str, r,g,b), where r,g,b are 255-scale r,g,b values for the same color.

Usable e.g. to make tables with categorical values more skimmable.

Parameters
string: str - the string to hash
on - if 'dark', we try for a bright color; if 'light', we try to give a dark color; otherwise not restricted
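One way such a hash-to-color mapping could work, as a sketch only. The choice of SHA1, the use of the first three digest bytes, and the channel ranges for 'dark' and 'light' are all assumptions made for illustration; hash_color_sketch is a hypothetical name.

```python
import hashlib


def hash_color_sketch(string, on=None):
    """Return (css_str, r, g, b) derived deterministically from the string."""
    digest = hashlib.sha1(string.encode("utf8")).digest()
    r, g, b = digest[0], digest[1], digest[2]
    if on == "dark":
        # bright color for a dark background: squeeze channels into 128..255
        r, g, b = (128 + c // 2 for c in (r, g, b))
    elif on == "light":
        # dark color for a light background: squeeze channels into 0..127
        r, g, b = (c // 2 for c in (r, g, b))
    return f"#{r:02x}{g:02x}{b:02x}", r, g, b


css, r, g, b = hash_color_sketch("gemeente", on="dark")
print(css, r, g, b)
```

Because the color depends only on the string, the same category gets the same color in every table it appears in.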
def hash_hex(data, as_bytes=False):

Given some byte data, calculate its SHA1 hash. Returns that hash as a hex string, unless you specify as_bytes=True.

If instead given a (unicode) string, it accepts it and deals with unicode by UTF8-encoding it first. Warning: This is not always what you want.

Parameters
data: bytes - the bytes to hash
as_bytes: bool - whether to return the hash digest as a bytes object. Defaults to False, meaning a hex string (like 'a49d')
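A minimal sketch of the described behaviour using hashlib (hash_hex_sketch is a hypothetical name). It also shows why the str convenience comes with a warning: the hash depends on the encoding, so a string hashed this way will not match the same text stored in another encoding.

```python
import hashlib


def hash_hex_sketch(data, as_bytes=False):
    """SHA1 of bytes; str input is UTF-8 encoded first."""
    if isinstance(data, str):
        data = data.encode("utf8")
    h = hashlib.sha1(data)
    return h.digest() if as_bytes else h.hexdigest()


hex_digest = hash_hex_sketch(b"abc")
raw_digest = hash_hex_sketch(b"abc", as_bytes=True)

# Same text, different encodings, different hashes
same_as_utf8 = hash_hex_sketch("é") == hash_hex_sketch("é".encode("utf8"))
same_as_latin1 = hash_hex_sketch("é") == hash_hex_sketch("é".encode("latin-1"))
```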
def is_doc(bytesdata):

Does this seem like some kind of office document type? This is currently quick and dirty based on some observations that may not even be correct. TODO: improve

Parameters
bytesdata: bytes - Undocumented
Returns
bool - Undocumented
def is_empty_zip(bytesdata):

Does this bytestring look like an empty ZIP file?

Parameters
bytesdata: bytes - the bytestring to check for being a ZIP file that stores nothing.
Returns
bool - whether it is an empty ZIP
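For background: a ZIP archive without members consists of only the end-of-central-directory record, which starts with the bytes PK\x05\x06, while an archive with members starts with a local file header, PK\x03\x04. A sketch of a check based on that, which is an assumption about the approach rather than the module's actual test:

```python
import io
import zipfile


def is_empty_zip_sketch(bytesdata):
    """True if the data starts with the end-of-central-directory signature."""
    return bytesdata.startswith(b"PK\x05\x06")


# An archive that was opened for writing and closed without adding anything
empty_buf = io.BytesIO()
with zipfile.ZipFile(empty_buf, "w"):
    pass

# An archive with one member, for contrast
full_buf = io.BytesIO()
with zipfile.ZipFile(full_buf, "w") as zf:
    zf.writestr("a.txt", b"hello")

empty_result = is_empty_zip_sketch(empty_buf.getvalue())
full_result = is_empty_zip_sketch(full_buf.getvalue())
```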
def is_html(bytesdata):

Do these bytes look like an HTML document? (no specific distinction to XHTML)

Parameters
bytesdata: bytes - the bytestring to check for being an HTML file.
Returns
bool - whether it is HTML
def is_htmlzip(bytesdata):

Made for the .html.zip files that KOOP puts e.g. in its BUS.

Is this a ZIP file with one entry for which the name ends with .html? (We could test its content with is_html, but given the context we can assume it.)

Parameters
bytesdata: bytes - the bytestring to check/treat as a ZIP file.
Returns
bool - whether it is a ZIP containing HTML
def is_pdf(bytesdata):

Does this bytestring look like a PDF document?

Parameters
bytesdata: bytes - the bytestring to check for looking like the start of a PDF
Returns
bool - whether it is PDF
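PDF files begin with a header of the form %PDF-1.x, so a start-of-data check can be sketched as below. This is a strict variant written for illustration (is_pdf_sketch is a hypothetical name); real-world files sometimes carry a few bytes before the header, which this version would reject.

```python
def is_pdf_sketch(bytesdata):
    """True if the data starts with the PDF header marker."""
    return bytesdata.startswith(b"%PDF-")


pdf_result = is_pdf_sketch(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")
zip_result = is_pdf_sketch(b"PK\x03\x04")
```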
def is_xml(bytesdata_or_filename, return_for_html=False, accept_after_n_nodes=25):

Does this look and work like an XML file?

Yes, we could answer "does it look vaguely like the start of an XML" for a lot cheaper than parsing it.

Yet that doesn't have a lot of value in practice, because you would probably only use this function right before handing it to an XML parser to do a full parse - or deciding not to.

A full parse is the best test, but that would just be double work. So in practice, a function that answers 'would a real XML parser probably accept this?' more cheaply than a full parse is probably more useful.

It's probably lighter to see if the first bunch of nodes don't break XML semantics (we use lxml's iterparse and stop after a number of nodes went fine).

(TODO: the input-type code can be made lighter)

Note on semantics: We consider HTML, and also XHTML, to warrant a False, mostly due to the way we use this function ourselves.

Parameters
bytesdata_or_filename - If given bytes, they are treated as file contents. If given a str object, it is treated as a _filesystem filename_ to read from
return_for_html - Undocumented
accept_after_n_nodes: int - After how many nodes (that parsed fine) do we accept this and stop parsing?
Returns
bool - whether it is XML, and probably not XHTML or HTML
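The "parse only the first few nodes" idea can be sketched as follows. Note the swap: the text says the function uses lxml's iterparse, while this sketch uses the standard library's xml.etree.ElementTree.iterparse so that it runs without extra dependencies. It also takes only bytes and does not implement the HTML/XHTML rejection; looks_like_xml is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET


def looks_like_xml(bytesdata, accept_after_n_nodes=25):
    """Accept once enough elements have started without a parse error."""
    try:
        seen = 0
        for _event, _elem in ET.iterparse(io.BytesIO(bytesdata), events=("start",)):
            seen += 1
            if seen >= accept_after_n_nodes:
                break  # enough nodes parsed fine; skip the rest of the document
        return True
    except ET.ParseError:
        return False


well_formed = looks_like_xml(b"<root><a/><b/></root>")
mismatched = looks_like_xml(b"<root><a></root>")
plain_text = looks_like_xml(b"just some text")
```

The trade-off is the one the text describes: an error that occurs after the accepted number of nodes goes unnoticed, in exchange for not parsing large documents twice.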
def is_zip(bytesdata):

Does this bytestring look like a ZIP file?

Parameters
bytesdata: bytes - the bytestring to check for being a ZIP file.
Returns
bool - whether it is a ZIP
def unified_diff(before, after, strip_header=True, context_n=999):

Returns a unified-diff-like difference between two strings. Not meant for actual patching, just for quick debug-printing of changes.

Parameters
before: str - a string to treat as the original
after: str - a string to treat as the new version
strip_header - whether to strip the first lines
context_n - how much context to include. Defaults to something high so that it omits little to nothing.
Returns
str - a string that contains plain-text unified-diff-like output (with the initial header cut off)
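A sketch of how such a helper could be built on the standard difflib module. Two assumptions are made here: that the comparison is line-based, and that "the first lines" means the two file-header lines (--- and +++). unified_diff_sketch is a hypothetical name.

```python
import difflib


def unified_diff_sketch(before, after, strip_header=True, context_n=999):
    """Line-based unified diff of two strings, for debug printing."""
    lines = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            n=context_n,
            lineterm="",
        )
    )
    if strip_header:
        # Assumption: drop only the '--- before' and '+++ after' lines
        lines = lines[2:]
    return "\n".join(lines)


diff_text = unified_diff_sketch("a\nb\nc", "a\nB\nc")
print(diff_text)
```

With the high default for context_n, unchanged lines around each change are all included, which suits reading a diff more than applying one.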
def wetsuite_dir():

Figure out where we can store data.

Returns a dict with keys mentioning directories:

  • wetsuite_dir: a directory in the user profile we can store things
  • datasets_dir: a directory inside wetsuite_dir that datasets.load() will put dataset files in
  • stores_dir: a directory inside wetsuite_dir that localdata will put sqlite files in

Keep in mind:

  • When Windows users have their user profile on the network, we try to pick a directory more likely to be shared by your other logins
  • ...BUT network mounts tend not to implement proper locking, so around certain things (e.g. our localdata.LocalKV) you invite corruption when multiple workstations write at the same time. So don't do that. If you need distributed work, use an actually networked store, or be okay with read-only access.
def _filetype(docbytes):

Describe which of these basic types a bytes object contains.

Parameters
docbytes: bytes - Undocumented