General utility functions, like "give me a path to where wetsuite can store data", and debug tools for inspecting data.
Function | free |
Says how many bytes are free on the filesystem that stores the given path. |
Function | get |
Made for the .html.zip files that KOOP puts e.g. in its BUS. |
Function | has |
Mostly meant as a "reject as HTML?" optimization. It is not actually an is_xml test: the declaration is optional in XML 1.0 (required in 1.1), and this won't match e.g. UTF-16 XML. It still catches a _lot_ of real-world XML. |
Function | hash |
Give a CSS color for a string - consistently the same each time based on a hash. |
Function | hash |
Given some byte data, calculate SHA1 hash. Returns that hash as a hex string, unless you specify as_bytes=True. |
Function | is |
Does this seem like some kind of office document type? This is currently quick and dirty based on some observations that may not even be correct. TODO: improve |
Function | is |
Does this bytestring look like an empty ZIP file? |
Function | is |
Do these bytes look like an HTML document? (no specific distinction to XHTML) |
Function | is |
Made for the .html.zip files that KOOP puts e.g. in its BUS. |
Function | is |
Does this bytestring look like a PDF document? |
Function | is |
Does this look and work like an XML file? |
Function | is |
Does this bytestring look like a ZIP file? |
Function | unified |
Returns a unified-diff-like difference between two strings. Not meant for actual patching, just for quick debug-printing of changes. |
Function | wetsuite |
Figure out where we can store data. |
Function | _filetype |
describe which of these basic types a bytes object contains |
Says how many bytes are free on the filesystem that stores the given path.
Parameters | |
path | path to check for (shutil will figure out what filesystem that is on). Defaults to the directory we would store datasets into. |
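The free-space lookup can be illustrated with the standard library alone. This is a minimal sketch, not the library's own code; the name `free_bytes` and the default of the current directory are assumptions made for the example.

```python
import shutil


def free_bytes(path="."):
    """Return the number of free bytes on the filesystem that holds `path`."""
    # shutil.disk_usage returns a named tuple (total, used, free), all in bytes;
    # it resolves which filesystem the path lives on by itself.
    return shutil.disk_usage(path).free


print(free_bytes("."))
```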
Made for the .html.zip files that KOOP puts e.g. in its BUS.
Gets the contents of the first file in the zip with a name ending in .html. Assuming you tested with is_htmlzip(), this should be the main file (there might also e.g. be images in there).
Returns a bytestring, or raises an exception.
Parameters | |
bytesdata:bytes | the bytestring to treat as a ZIP file. |
Returns | |
the HTML file as a bytes object |
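A sketch of the extraction described above, using only `zipfile` and `io`. The function name `first_html_from_zip` and the choice of `ValueError` are assumptions for illustration; the source only says that an exception is raised.

```python
import io
import zipfile


def first_html_from_zip(bytesdata: bytes) -> bytes:
    """Return the contents of the first entry whose name ends in .html."""
    with zipfile.ZipFile(io.BytesIO(bytesdata)) as zf:
        for name in zf.namelist():
            if name.endswith(".html"):
                return zf.read(name)
    # Assumption: which exception the real function raises is not documented here.
    raise ValueError("no .html entry found in ZIP data")


# Build a small archive in memory to try it on.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("image.png", b"not really an image")
    zf.writestr("doc.html", b"<html><body>hi</body></html>")

print(first_html_from_zip(buf.getvalue()))
```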
Mostly meant as a "reject as HTML?" optimization. It is not actually an is_xml test: the declaration is optional in XML 1.0 (required in 1.1), and this won't match e.g. UTF-16 XML. It still catches a _lot_ of real-world XML.
Parameters | |
bytesdata:bytes | Undocumented |
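A header check of this kind could look roughly as follows. This is a sketch under the assumption that the test is a plain byte-prefix comparison; the name `has_xml_declaration` is hypothetical.

```python
def has_xml_declaration(bytesdata: bytes) -> bool:
    """True if the data starts with an XML declaration in an ASCII-compatible encoding."""
    # UTF-16 data will not match, because each character takes two bytes there.
    return bytesdata.startswith(b"<?xml")


print(has_xml_declaration(b'<?xml version="1.0"?><a/>'))
print(has_xml_declaration(b"<html></html>"))
```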
Give a CSS color for a string - consistently the same each time based on a hash.
Takes a string, and returns (css_str, r,g,b), where r,g,b are 255-scale r,g,b values for the same color.
Usable e.g. to make tables with categorical values more skimmable.
Parameters | |
string:str | the string to hash |
on | if 'dark', we try for a bright color, if 'light', we try to give a dark color, otherwise not restricted |
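The idea of a stable color per string can be sketched as below. The digest choice (SHA1), the use of its first three bytes, and the way `on` narrows the brightness range are all assumptions for the example, not the library's actual method; `color_for` is a hypothetical name.

```python
import hashlib


def color_for(string: str, on=None):
    """Return (css_str, r, g, b) derived deterministically from the string."""
    digest = hashlib.sha1(string.encode("utf-8")).digest()
    r, g, b = digest[0], digest[1], digest[2]
    if on == "dark":
        # squeeze each channel into 128..255 for a brighter color
        r, g, b = (128 + v // 2 for v in (r, g, b))
    elif on == "light":
        # squeeze each channel into 0..127 for a darker color
        r, g, b = (v // 2 for v in (r, g, b))
    return "#%02x%02x%02x" % (r, g, b), r, g, b


print(color_for("category A", on="dark"))
```

Because the color depends only on the string, the same category gets the same color in every table.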
Given some byte data, calculate SHA1 hash. Returns that hash as a hex string, unless you specify as_bytes=True.
If instead given a (unicode) string, it accepts it and deals with unicode by UTF8-encoding it first. Warning: This is not always what you want.
Parameters | |
data:bytes | the bytes to hash |
as_bytes:bool | whether to return the hash digest as a bytes object. Defaults to False, meaning a hex string (like 'a49d') |
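The behavior described above maps directly onto `hashlib`. A minimal sketch, with `sha1_of` as a hypothetical name:

```python
import hashlib


def sha1_of(data, as_bytes=False):
    """SHA1 of bytes; str input is UTF-8 encoded first."""
    if isinstance(data, str):
        # This is the convenience the warning refers to: the hash then depends on the encoding chosen.
        data = data.encode("utf-8")
    h = hashlib.sha1(data)
    return h.digest() if as_bytes else h.hexdigest()


print(sha1_of(b"abc"))  # a9993e364706816aba3e25717850c26c9cd0d89d
```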
Does this seem like some kind of office document type? This is currently quick and dirty based on some observations that may not even be correct. TODO: improve
Parameters | |
bytesdata:bytes | Undocumented |
Returns | |
bool | Undocumented |
Does this bytestring look like an empty ZIP file?
Parameters | |
bytesdata:bytes | the bytestring to check contains a ZIP file that stores nothing. |
Returns | |
bool | whether it is an empty ZIP |
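An archive without entries consists of only the end-of-central-directory record, which starts with the signature `PK\x05\x06`. The following sketch assumes the check is a simple prefix test; `looks_like_empty_zip` is a hypothetical name.

```python
import io
import zipfile


def looks_like_empty_zip(bytesdata: bytes) -> bool:
    """True if the data starts with the end-of-central-directory signature."""
    # A ZIP that stores files starts with a local file header (PK\x03\x04) instead.
    return bytesdata.startswith(b"PK\x05\x06")


# An archive that is opened for writing and closed without adding anything.
empty = io.BytesIO()
zipfile.ZipFile(empty, "w").close()

print(looks_like_empty_zip(empty.getvalue()))
```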
Do these bytes look like an HTML document? (no specific distinction to XHTML)
Parameters | |
bytesdata:bytes | the bytestring to check is a HTML file. |
Returns | |
bool | whether it is HTML |
Made for the .html.zip files that KOOP puts e.g. in its BUS.
Is this a ZIP file with one entry for which the name ends with .html? (we could test its content with is_html, but given the context we can assume it)
Parameters | |
bytesdata:bytes | the bytestring to check/treat as a ZIP file. |
Returns | |
bool | whether it is a ZIP containing HTML |
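This check can be sketched with `zipfile` by looking only at entry names. The sketch reads "one entry" as "at least one entry", since other files such as images may be present; that reading and the name `looks_like_htmlzip` are assumptions.

```python
import io
import zipfile


def looks_like_htmlzip(bytesdata: bytes) -> bool:
    """True if the data is a ZIP with at least one entry named *.html."""
    try:
        with zipfile.ZipFile(io.BytesIO(bytesdata)) as zf:
            return any(name.endswith(".html") for name in zf.namelist())
    except zipfile.BadZipFile:
        # not a ZIP at all
        return False


sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr("doc.html", b"<html></html>")

print(looks_like_htmlzip(sample.getvalue()))
print(looks_like_htmlzip(b"plain text"))
```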
Does this bytestring look like a PDF document?
Parameters | |
bytesdata:bytes | the bytestring to check looks like the start of a PDF |
Returns | |
bool | whether it is PDF |
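PDF files start with the marker `%PDF-` followed by a version number, so a check on the start of the data can be sketched as below. The name `looks_like_pdf` is hypothetical, and the real function may be more lenient.

```python
def looks_like_pdf(bytesdata: bytes) -> bool:
    """True if the data starts with the PDF header marker."""
    return bytesdata.startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n..."))
print(looks_like_pdf(b"<html></html>"))
```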
Does this look and work like an XML file?
Yes, we could answer "does it look vaguely like the start of an XML" for a lot cheaper than parsing it.
Yet that doesn't have a lot of value in practice, because you would probably only use this function right before handing it to an XML parser to do a full parse - or deciding not to.
A full parse is the best test, but that would just be double work. So in practice, a function that answers 'would a real XML parser probably accept this?' more cheaply than a full parse is probably more useful.
It's probably lighter to see if the first bunch of nodes don't break XML semantics (we use lxml's iterparse and stop after a number of nodes went fine).
(TODO: the input-type code can be made lighter)
Note on semantics: We consider HTML but also XHTML to warrant a False, mostly due to the way we use this function ourselves.
Parameters | |
bytesdata | If given bytes, it considers it file contents. If given a str object, it considers that a _filesystem filename_ to read from |
return | Undocumented |
acceptint | After how many nodes (that parsed fine) do we accept this and stop parsing? |
Returns | |
bool | whether it is XML - and probably not XHTML or HTML |
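The "parse the first few nodes and stop" approach can be sketched with the standard library's `xml.etree.ElementTree.iterparse` in place of lxml's. The name `probably_xml`, the parameter name `accept_after`, and its default of 25 are assumptions. The sketch only takes bytes and does not implement the HTML/XHTML rejection described above.

```python
import io
import xml.etree.ElementTree as ET


def probably_xml(bytesdata: bytes, accept_after: int = 25) -> bool:
    """True if the first `accept_after` elements parse without error."""
    seen = 0
    try:
        # 'start' events fire as soon as an opening tag has been parsed
        for _event, _elem in ET.iterparse(io.BytesIO(bytesdata), events=("start",)):
            seen += 1
            if seen >= accept_after:
                break  # enough nodes went fine, stop parsing
    except ET.ParseError:
        return False
    return seen > 0


print(probably_xml(b"<a><b/></a>"))
print(probably_xml(b"<a><b></a>"))
```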
Does this bytestring look like a ZIP file?
Parameters | |
bytesdata:bytes | the bytestring to check contains a ZIP file. |
Returns | |
bool | whether it is a ZIP |
Returns a unified-diff-like difference between two strings. Not meant for actual patching, just for quick debug-printing of changes.
Parameters | |
before:str | a string to treat as the original |
after:str | a string to treat as the new version |
strip | whether to strip the first lines |
context | how much context to include. Defaults to something high so that it omits little to nothing. |
Returns | |
str | a string that contains plain text unified-diff-like output (with initial header cut off) |
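Output of this kind can be produced with `difflib.unified_diff`. In this sketch the name `quick_diff`, the default context of 999 lines, and dropping exactly the two file-header lines are assumptions; the real function's defaults are not stated here.

```python
import difflib


def quick_diff(before: str, after: str, context: int = 999) -> str:
    """Unified-diff-like text for two strings, without the ---/+++ header."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="after",
        n=context,      # high value, so nearly all unchanged lines are shown
        lineterm="",
    )
    # The first two lines are the '--- before' / '+++ after' file headers.
    return "\n".join(list(lines)[2:])


print(quick_diff("a\nb\nc", "a\nB\nc"))
```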
Figure out where we can store data.
Returns a dict with keys mentioning directories:
- wetsuite_dir: a directory in the user profile we can store things
- datasets_dir: a directory inside wetsuite_dir that datasets.load() will put dataset files in
- stores_dir: a directory inside wetsuite_dir that localdata will put sqlite files in
Keep in mind:
- When Windows users have their user profile on the network, we try to pick a directory more likely to be shared by your other logins
- ...BUT keep in mind network mounts tend not to implement proper locking, so around certain things (e.g. our localdata.LocalKV) you invite corruption when multiple workstations write at the same time, ...so don't do that. If you need distributed work, use an actually networked store, or be okay with read-only access.
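The shape of the returned dict can be illustrated as follows. Only the three key names come from the description above; the base directory name, the subdirectory names, and the function name `storage_dirs` are hypothetical, and the Windows network-profile handling is not modeled.

```python
import os


def storage_dirs(base=None):
    """Return a dict of directories, keyed like the description above."""
    if base is None:
        # hypothetical location inside the user profile
        base = os.path.join(os.path.expanduser("~"), "wetsuite_example")
    return {
        "wetsuite_dir": base,
        "datasets_dir": os.path.join(base, "datasets"),  # hypothetical subdirectory name
        "stores_dir": os.path.join(base, "stores"),      # hypothetical subdirectory name
    }


for key, path in storage_dirs().items():
    print(key, path)
```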