This module tries to wrangle distinct types of documents, from HTML to PDF and from various specific sources, into plain text, so that you can consume them more easily.
It tries to give you that text in smallish chunks, ideally informed by the document structure.
Ideally, this module gets ongoing attention to cover all the documents that this project cares about, and to that end it has a somewhat modular design.
You won't care about that until you want to add your own splitter, but it does have some implications for how it should be used:
- decide(docbytes) will give you (score, splitter_object) tuples
- you can avoid looser-structured splitters by testing for bad scores, _or_ by using decide's thresh to the same effect.
- each splitter object, when called, gives (metadata, intermediate, flat_text) tuples
- flat_text is a string. Sometimes you care about only this
- the other two are more source-specific
- intermediate tends to be the structure that flat_text is from, in case that
- metadata i
So if you want more control, and want to deal with the structure hints it also produces, your code might look something like:
    import wetsuite.helpers.split

    # docbytes is the document as a bytes object
    for score, splitter in wetsuite.helpers.split.decide( docbytes ):
        for metadata, intermediate, text in splitter.fragments():
            print( text )
            print( '--------------------' )
**If you just want text**, with little control over the details:
string_list = feeling_lucky( docbytes )
TODO:
- parameters, like "OCR that PDF if you have to"
CONSIDER:
think about separating the "read document at lower level" code into functions you can use without having to rip it out of these classes. Right now, even if you know the class you want, you need to do:
    f = Fragments_XML_BWB()
    f.accepts()
    f.suitableness()
    for frag in f.fragments():
        print( frag )
in particular the HTML code can probably be made rather faster
- using lxml.html (or specifically iterparse?) instead of bs4, because bs4 is currently the slowest part of this ...bit of a rewrite, though.
- having one class contain all of these - mostly to _share_ state like 'we tried to parse with bs4'
always have a reasonable fallback for each document type (TODO: XML?)
settling the intermediate format more?
- Maybe we shouldn't make too many decisions for you, and merely yield suggestions you can easily ignore, e.g.
- 'seems like same section, <TEXT>'
- 'seems like new section, <TEXT>'
- 'seems like page header/footer, <TEXT>'
- 'seems like page footnotes, <TEXT>'
- ...so that you, or a helper function, can group into different-sized chunks in the way you need it.
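A minimal sketch of the kind of helper function this idea has in mind. The hint strings and the `group_by_hints` name are made up for illustration, and nothing like this is part of the module as documented here.

```python
def group_by_hints(hinted, skip=('header/footer', 'footnotes')):
    """Group (hint, text) pairs into one string per section.

    hinted: iterable of (hint, text) pairs, where hint is one of the
            hypothetical labels 'new section', 'same section',
            'header/footer', 'footnotes'.
    skip:   hints whose text is dropped entirely.
    """
    chunks = []
    current = []
    for hint, text in hinted:
        if hint in skip:
            continue  # the caller chose to ignore this suggestion
        if hint == 'new section' and current:
            # close the section collected so far, start a new one
            chunks.append(' '.join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(' '.join(current))
    return chunks


hinted = [
    ('new section',   'Artikel 1'),
    ('same section',  'Lid een.'),
    ('header/footer', 'Pagina 3'),
    ('new section',   'Artikel 2'),
    ('same section',  'Lid twee.'),
]
for chunk in group_by_hints(hinted):
    print(chunk)
```

Because the hints are only suggestions, a caller who does want page headers could pass `skip=()` and get them inline.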
maybe rename feeling_lucky, maybe move it to lazy?
we were trying to be inspired by LaTeX hyphenation, which has a simple-but-pretty-great "this is the cost of breaking off here" model, the analogue of which would make "Hey, can you break this to different degrees (sections, paragraphs)?" possible
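To make the break-cost idea concrete, here is a rough sketch under stated assumptions: each piece of text carries a made-up cost for breaking just before it (0 for a section start, 10 for a paragraph, 50 for a sentence), and the caller picks how high a cost they are willing to break at. This is a plain threshold, not TeX's actual line-breaking optimisation, and the names and cost values are hypothetical.

```python
def split_by_cost(pieces, max_cost):
    """Split a sequence of (break_cost, text) pairs into chunks.

    A new chunk starts before any piece whose break cost is at most
    max_cost. Lower cost means a more natural place to break.
    """
    chunks = []
    current = []
    for cost, text in pieces:
        if current and cost <= max_cost:
            chunks.append(' '.join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(' '.join(current))
    return chunks


pieces = [
    (0,  'Hoofdstuk 1'),      # section start: cheapest break
    (50, 'Eerste zin.'),      # sentence inside a paragraph: expensive
    (10, 'Nieuwe alinea.'),   # paragraph start
    (0,  'Hoofdstuk 2'),
    (50, 'Nog een zin.'),
]

by_section   = split_by_cost(pieces, max_cost=0)
by_paragraph = split_by_cost(pieces, max_cost=10)
print(by_section)
print(by_paragraph)
```

The same annotated stream serves both callers, which is the appeal: the splitter only reports costs, and the degree of splitting is decided afterwards.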
There was an earlier idea to specify pattern matchers entirely as data, like:
    presets = {
        'beleidsregels': {
            'body': { 'name': 'div', 'id': 'PaginaContainer' },
            'keep_headers_html': True,
            'keep_tables_html': True,
        },
    }
- with a 'this is how to select for the document' and 'this is how to get the body from it', etc.
- which is worth re-visiting, because
  - it may be more easily altered in the long run,
  - it would probably be more portable than this code currently is.
- a limitation might be expressing more complex selection
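As an illustration of how such a data-only preset might be applied, the sketch below reads the 'body' entry and collects the text inside the first matching element. It uses the standard library's html.parser rather than bs4 so that it runs without dependencies, it assumes well-nested markup, and the `BodyText` and `body_text` names are hypothetical. It ignores the 'keep_headers_html' and 'keep_tables_html' keys.

```python
from html.parser import HTMLParser

# Shaped like the 'beleidsregels' preset; only 'body' is used here.
PRESET = {'body': {'name': 'div', 'id': 'PaginaContainer'}}

# Elements that never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class BodyText(HTMLParser):
    """Collect text inside the first element matching preset['body']."""

    def __init__(self, preset):
        super().__init__()
        self.want = preset['body']
        self.depth = 0       # 0 means we are not inside the body element
        self.done = False    # True once the body element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif tag == self.want['name'] and dict(attrs).get('id') == self.want['id']:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def body_text(html, preset):
    parser = BodyText(preset)
    parser.feed(html)
    parser.close()
    return parser.parts


page = ('<html><body><div id="nav">menu</div>'
        '<div id="PaginaContainer"><h1>Titel</h1><p>Tekst<br>meer</p></div>'
        '<p>footer</p></body></html>')
print(body_text(page, PRESET))
```

The limitation mentioned above shows up quickly: a name-plus-id match is easy to express as data, while "the second table after the heading that mentions X" is not.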
Class |
|
Abstractish base class explaining the purpose of implementing this |
Class |
|
Turn kamer-related HTMLs (from KOOP's BUS) into fragments |
Class |
|
Turn CVDR in HTML form into fragments |
Class |
|
Extract text from HTML from non-specific source into fragments |
Class |
|
Turn HTML pages from degeschillencommissie.nl into fragments |
Class |
|
Turn blad gemeenschappelijke regeling in HTML form (from KOOP's BUS) into fragments |
Class |
|
Turn gemeenteblad in HTML form (from KOOP's BUS) into fragments |
Class |
|
Turn provincieblad in HTML form (from KOOP's BUS) into fragments |
Class |
|
Turn staatsblad in HTML form (from KOOP's BUS) into fragments |
Class |
|
Turn staatscourant in HTML form (from KOOP's BUS) into fragments |
Class |
|
Turn tractatenblad in HTML form (from KOOP's BUS) into fragments |
Class |
|
Turn waterschapsblad in HTML form (from KOOP's BUS) into fragments |
Class |
|
Turn HTML pages from into fragments |
Class |
|
Extract text from PDF from non-specific source into fragments |
Class |
|
Turn other kamer XMLs (from KOOP's BUS) into fragments (TODO: re-check which these are) |
Class |
|
Turn BWB in XML form into fragments |
Class |
|
Turn CVDR in XML form into fragments |
Class |
|
Extract text from XML from non-specific source into fragments |
Class |
|
Turn blad gemeenschappelijke regeling in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn gemeenteblad in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn handelingen in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn provincieblad in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn staatsblad in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn staatscourant in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn tractatenblad in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn waterschapsblad in XML form (from KOOP's BUS) into fragments |
Class |
|
Turn rechtspraak.nl's open-rechtspraak XML form into fragments |
Class |
|
A notebook-style formatter that does little more than take a list of three-item tuples (meant for the output of fragments()), and print them in a table. |
Function | decide |
Ask all processors to say how well they would do, pick any that seem applicable enough (by our threshold). |
Function | feeling |
If you are sure this code understands a particular document format, you can hand it in here, and it will return a list of strings for a document. No control, just text from whatever said it applied best. |
Function | fix |
There are a bunch of XMLs that are invalid _only_ because they contain UTF8 but say they are US-ASCII. This seems constrained to some parliamentary XMLs. |
Variable | header |
Undocumented |
Function | _split |
Code shared between a lot of the officiele-publicaties HTML extraction |
Function | _split |
Code shared between a lot of the officiele-publicaties XML extraction |
Variable | _content |
Undocumented |
Variable | _inhoud |
Undocumented |
Variable | _op |
Undocumented |
Variable | _p |
Undocumented |
Variable | _registered |
Undocumented |
Variable | _stuk |
Undocumented |
Ask all processors to say how well they would do, pick any that seem applicable enough (by our threshold).
Returns a list of (score, processing_object)
Note that each processing object has already had accepts() and suitableness() called, so you can now call fragments() to get the fragments.
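The flow described here can be sketched with toy stand-ins. Everything below is illustrative rather than this module's code: the `ToyText` and `ToyXML` classes, the `REGISTERED` list, the `decide_sketch` name, the default threshold, and in particular the assumption that a lower score means a better fit.

```python
class ToyText:
    """Hypothetical fallback splitter: accepts anything, one fragment per blank-line-separated block."""

    def __init__(self, docbytes):
        self.docbytes = docbytes

    def accepts(self):
        return True

    def suitableness(self):
        return 500  # assumed: a high score marks a loose, generic fit

    def fragments(self):
        for block in self.docbytes.decode('utf8').split('\n\n'):
            if block.strip():
                # (metadata, intermediate, flat_text), as in the module docs
                yield ({}, {}, block.strip())


class ToyXML(ToyText):
    """Hypothetical stricter splitter: only accepts input that starts like XML."""

    def accepts(self):
        return self.docbytes.lstrip().startswith(b'<')

    def suitableness(self):
        return 5


REGISTERED = [ToyText, ToyXML]


def decide_sketch(docbytes, thresh=1000):
    """Return a list of (score, splitter) for splitters that apply well enough."""
    candidates = []
    for cls in REGISTERED:
        splitter = cls(docbytes)
        if splitter.accepts():
            score = splitter.suitableness()
            if score < thresh:          # assumed: lower is better
                candidates.append((score, splitter))
    candidates.sort(key=lambda pair: pair[0])
    return candidates


for score, splitter in decide_sketch(b'<doc>hello</doc>', thresh=100):
    for metadata, intermediate, text in splitter.fragments():
        print(score, text)
```

Lowering the threshold in the usage loop is what keeps the generic fallback out of the results, which matches the "avoid looser-structured splitters" advice in the module description.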
If you are sure this code understands a particular document format, you can hand it in here, and it will return a list of strings for a document. No control, just text from whatever said it applied best.
This needs to be renamed. Maybe this needs to go to wetsuite.helpers.lazy instead.
Returns | |
a list of strings |
There are a bunch of XMLs that are invalid _only_ because they contain UTF8 but say they are US-ASCII. This seems constrained to some parliamentary XMLs.
This is a crude patch-up for someone else's mistake, so arguably doesn't really belong in this module, but hey.
Parameters | |
bytesdata:bytes | Undocumented |
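One way such a patch-up could work is sketched below. It is a guess at the approach, not this function's actual implementation: it rewrites a US-ASCII encoding declaration to utf-8, and only when the bytes are not in fact pure ASCII. The `fix_declared_ascii` name and the regular expression are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the start of the data that claims US-ASCII.
_DECL = re.compile(
    rb'^(\s*<\?xml[^>]*encoding=["\'])us-ascii(["\'])',
    re.IGNORECASE,
)


def fix_declared_ascii(bytesdata):
    """Return bytesdata with a false US-ASCII declaration changed to utf-8."""
    try:
        bytesdata.decode('ascii')
        return bytesdata      # really is ASCII, so the declaration is truthful
    except UnicodeDecodeError:
        pass
    return _DECL.sub(rb'\g<1>utf-8\g<2>', bytesdata, count=1)


raw = '<?xml version="1.0" encoding="US-ASCII"?><a>\u00e9</a>'.encode('utf8')
fixed = fix_declared_ascii(raw)
print(fixed[:40])
print(ET.fromstring(fixed).text)
```

Working on bytes rather than decoded text matters here, because the whole problem is that the declared encoding cannot be trusted for decoding.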