wetsuite.helpers.split

module documentation

This module tries to wrangle distinct types of documents for you, from HTML to PDF, from varied specific sources, into plain text, so that you can consume it more easily.

It tries to return that text in smallish chunks, ideally informed by the document structure.

Ideally, this module gets ongoing attention to cover all the documents that this project cares about, and to that end it has a somewhat modular design.

You won't care about that until you want to add your own, but it does have some implications for how it should be used:

decide(docbytes) will give you (score, splitter_object) tuples
- you can avoid looser-structured splitters by testing for bad scores, _or_ by using decide's thresh to the same effect.
each splitter object, when called, gives (metadata, intermediate, flat_text) tuples
- flat_text is a string. Sometimes you care about only this
- the other two are more source-specific
  - intermediate tends to be the structure that that flat_text is from, in case that
  - metadata i

So if you want more control, dealing with the structure hints it also spits out, your code might look something like:

    import wetsuite.helpers.split

    for score, splitter in wetsuite.helpers.split.decide( docbytes ):
        for metadata, intermediate, text in splitter.fragments():
            print( text )
            print( '--------------------' )

**If you just want text**, with little control over the details:

    string_list = wetsuite.helpers.split.feeling_lucky( docbytes )

TODO:

parameters, like "OCR that PDF if you have to"

CONSIDER:

think about separating the "read document at lower level" code into functions you can use without having to rip it out of these classes. Right now, even if you know the class you want, you need to do:
```
    f = Fragments_XML_BWB()
    f.accepts()
    f.suitableness()
    for frag in f.fragments():
        print( frag )
```
in particular the HTML code can probably be made rather faster
- using lxml.html (or specifically iterparse?) instead of bs4, because bs4 is currently the slowest part of this. That would be a bit of a rewrite, though.
- having one class contain all of these - mostly to _share_ state like 'we tried to parse with bs4'
always have a reasonable fallback for each document type (TODO: XML?)
settling the intermediate format more?
- Maybe we shouldn't make too many decisions for you, and merely yield suggestions you can easily ignore, e.g.
  - 'seems like same section, <TEXT>'
  - 'seems like new section, <TEXT>'
  - 'seems like page header/footer, <TEXT>'
  - 'seems like page footnotes, <TEXT>'
- ...so that you, or a helper function, can group into different-sized chunks in the way you need it.
maybe rename feeling_lucky, maybe move it to lazy?
we were trying to be inspired by LaTeX hyphenation, which has a simple-but-pretty-great "this is the cost of breaking off here", the analogue of which would make "Hey can you break this to different degrees (sections, paragraphs)" possible
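
A minimal sketch of what that break-cost idea could look like, assuming each fragment carries a hypothetical "cost of breaking before this fragment" number. The function name, the cost scale and the sample data are all made up for illustration; nothing here is part of this module.

```python
from typing import List, Tuple

def chunk_by_break_cost(fragments: List[Tuple[int, str]], max_cost: int) -> List[str]:
    """Join fragments, breaking only where the break cost is at most max_cost.

    Each fragment is (cost_of_breaking_before_it, text).
    Hypothetical helper, not part of wetsuite.
    """
    chunks: List[List[str]] = []
    for cost, text in fragments:
        if not chunks or cost <= max_cost:
            chunks.append([text])       # cheap enough: start a new chunk here
        else:
            chunks[-1].append(text)     # too expensive: keep with the previous text
    return [' '.join(parts) for parts in chunks]

# Made-up costs: 0 before a new section, 10 before a new paragraph, 50 mid-paragraph.
fragments = [
    (0,  'Artikel 1'),
    (50, 'First sentence.'),
    (50, 'Second sentence.'),
    (10, 'Second paragraph.'),
    (0,  'Artikel 2'),
    (50, 'Only sentence.'),
]

by_section   = chunk_by_break_cost(fragments, max_cost=0)    # break at sections only
by_paragraph = chunk_by_break_cost(fragments, max_cost=10)   # also break at paragraphs
```

Raising `max_cost` gives smaller chunks from the same input, which is the "break this to different degrees" behaviour described above.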

There was an earlier idea to specify pattern matchers entirely as data, like:

      presets = {
          'beleidsregels': {
              'body': { 'name': 'div', 'id': 'PaginaContainer' },
              'keep_headers_html': True,
              'keep_tables_html':  True,
          },
      }
  - with a 'this is how to select for the document' and 'this is how to get the body from it', etc.
  - which is worth re-visiting, because 
    - it may be more easily altered in the long run, 
    - it would probably be more portable than this code currently is.
  - a limitation might be expressing more complex selection
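
To make the data-driven idea a little more concrete, here is a rough sketch of how a preset's 'body' selector could be applied. It uses only the standard library's `html.parser` rather than bs4, and the `BodyText` class, the `body_text` function and the sample HTML are hypothetical, not code from this module.

```python
from html.parser import HTMLParser

# Hypothetical preset, reusing the 'body' selector from the example above.
PRESET = {'body': {'name': 'div', 'id': 'PaginaContainer'}}

class BodyText(HTMLParser):
    """Collect the text found inside the element that a 'body' selector names."""

    def __init__(self, selector):
        super().__init__()
        self.name = selector['name']
        self.wanted_id = selector['id']
        self.depth = 0      # nesting depth of same-named tags inside the match
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == self.name:
                self.depth += 1         # nested tag of the same name
        elif tag == self.name and dict(attrs).get('id') == self.wanted_id:
            self.depth = 1              # entered the selected element

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.name:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())

def body_text(html, preset):
    """Return the text pieces inside the element selected by preset['body']."""
    parser = BodyText(preset['body'])
    parser.feed(html)
    parser.close()
    return parser.parts

html = ('<html><body><div id="nav">Menu</div>'
        '<div id="PaginaContainer"><h1>Titel</h1><div><p>Tekst</p></div></div>'
        '<p>Footer</p></body></html>')
print(body_text(html, PRESET))
```

This also hints at the limitation mentioned above: a name-plus-id selector is easy to express as data, while anything more complex would need a richer selection language.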

Class	`Fragments`	Abstractish base class explaining the purpose of implementing this
Class	`Fragments_HTML_BUS_kamer`	Turn kamer-related HTMLs (from KOOP's BUS) into fragments
Class	`Fragments_HTML_CVDR`	Turn CVDR in HTML form into fragments
Class	`Fragments_HTML_Fallback`	Extract text from HTML from non-specific source into fragments
Class	`Fragments_HTML_Geschillencommissie`	Turn HTML pages from degeschillencommissie.nl into fragments
Class	`Fragments_HTML_OP_Bgr`	Turn blad gemeenschappelijke regeling in HTML form (from KOOP's BUS) into fragments
Class	`Fragments_HTML_OP_Gmb`	Turn gemeenteblad in HTML form (from KOOP's BUS) into fragments
Class	`Fragments_HTML_OP_Prb`	Turn provincieblad in HTML form (from KOOP's BUS) into fragments
Class	`Fragments_HTML_OP_Stb`	Turn staatsblad in HTML form (from KOOP's BUS) into fragments
Class	`Fragments_HTML_OP_Stcrt`	Turn staatscourant in HTML form (from KOOP's BUS) into fragments
Class	`Fragments_HTML_OP_Trb`	Turn tractatenblad in HTML form (from KOOP's BUS) into fragments
Class	`Fragments_HTML_OP_Wsb`	Turn waterschapsblad in HTML form (from KOOP's BUS) into fragments
Class	`Fragments_HTML_Tuchtrecht`	Turn HTML pages from into fragments
Class	`Fragments_PDF_Fallback`	Extract text from PDF from non-specific source into fragments
Class	`Fragments_XML_BUS_Kamer`	Turn other kamer XMLs (from KOOP's BUS) into fragments (TODO: re-check which these are)
Class	`Fragments_XML_BWB`	Turn BWB in XML form into fragments
Class	`Fragments_XML_CVDR`	Turn CVDR in XML form into fragments
Class	`Fragments_XML_Fallback`	Extract text from XML from non-specific source into fragments
Class	`Fragments_XML_OP_Bgr`	Turn blad gemeenschappelijke regeling in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_OP_Gmb`	Turn gemeenteblad in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_OP_Handelingen`	Turn handelingen in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_OP_Prb`	Turn provincieblad in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_OP_Stb`	Turn staatsblad in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_OP_Stcrt`	Turn staatscourant in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_OP_Trb`	Turn tractatenblad in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_OP_Wsb`	Turn waterschapsblad in XML form (from KOOP's BUS) into fragments
Class	`Fragments_XML_Rechtspraak`	Turn rechtspraak.nl's open-rechtspraak XML form into fragments
Class	`SplitDebug`	A notebook-style formatter that does little more than take a list of three-item tuples (meant for the output of fragments()) and print them in a table.
Function	`decide`	Ask all processors to say how well they would do, pick any that seem applicable enough (by our threshold).
Function	`feeling_lucky`	If you are sure this code understands a particular document format, you can hand it in here, and it will return a list of strings for a document. No control, just text from whatever said it applied best.
Function	`fix_ascii_blah`	There are a bunch of XMLs that are invalid _only_ because they contain UTF8 but say they are US-ASCII. This seems constrained to some parliamentary XMLs.
Variable	`header_tag_names`	Undocumented
Function	`_split_officielepublicaties_html`	Code shared between a lot of the officiele-publicaties HTML extraction
Function	`_split_officielepublicaties_xml`	Code shared between a lot of the officiele-publicaties XML extraction
Variable	`_content_re`	Undocumented
Variable	`_inhoud_re`	Undocumented
Variable	`_op_re`	Undocumented
Variable	`_p_re`	Undocumented
Variable	`_registered_fragment_parsers`	Undocumented
Variable	`_stuk_re`	Undocumented

def decide(docbytes, thresh=1000, first_only=False, debug=False):

Ask all processors to say how well they would do, pick any that seem applicable enough (by our threshold).

Returns a list of (score, processing_object) tuples.

Note that each processing object has already had accepts() and suitableness() called, so you can call fragments() on it directly to get the fragments.
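
The accepts() / suitableness() / fragments() flow can be sketched with toy stand-ins. Everything in this sketch is hypothetical: the class names, the constructor taking docbytes, the specific scores, and in particular the assumption that a lower score means a better fit (the text above does not say which direction is better).

```python
class ToyParser:
    """Stand-in for a Fragments-style class (hypothetical, not wetsuite code)."""
    def __init__(self, docbytes):
        self.docbytes = docbytes

    def accepts(self):          # can this parser handle these bytes at all?
        raise NotImplementedError

    def suitableness(self):     # ASSUMPTION: lower score means better fit
        raise NotImplementedError

    def fragments(self):        # yields (metadata, intermediate, flat_text)
        raise NotImplementedError

class ToyXML(ToyParser):
    def accepts(self):
        return self.docbytes.lstrip().startswith(b'<?xml')
    def suitableness(self):
        return 5
    def fragments(self):
        yield ({}, None, self.docbytes.decode('utf-8'))

class ToyFallback(ToyParser):
    def accepts(self):
        return True
    def suitableness(self):
        return 500
    def fragments(self):
        yield ({}, None, self.docbytes.decode('utf-8', errors='replace'))

def toy_decide(docbytes, parsers, thresh=1000):
    """Return (score, parser_object) pairs under thresh, best (lowest) first."""
    scored = []
    for cls in parsers:
        obj = cls(docbytes)
        if obj.accepts():
            score = obj.suitableness()
            if score < thresh:
                scored.append((score, obj))
    scored.sort(key=lambda pair: pair[0])
    return scored

doc = b'<?xml version="1.0"?><a/>'
choices = toy_decide(doc, [ToyFallback, ToyXML])            # both apply
strict  = toy_decide(doc, [ToyFallback, ToyXML], thresh=100)  # fallback filtered out
```

Under those assumptions, lowering `thresh` is what drops the looser-structured fallback, and the first entry is the one you would call fragments() on.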

def feeling_lucky(docbytes):

If you are sure this code understands a particular document format, you can hand it in here, and it will return a list of strings for a document. No control, just text from whatever said it applied best.

This needs to be renamed. Maybe this needs to go to wetsuite.helpers.lazy instead.

Returns
a list of strings

def fix_ascii_blah(bytesdata):

There are a bunch of XMLs that are invalid _only_ because they contain UTF8 but say they are US-ASCII. This seems constrained to some parliamentary XMLs.

This is a crude patch-up for someone else's mistake, so arguably doesn't really belong in this module, but hey.

Parameters
bytesdata:`bytes`	Undocumented
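
As an illustration of the kind of patch-up described here, the sketch below relabels an XML declaration that claims US-ASCII as UTF-8. It is not this function's actual implementation; the name `relabel_ascii_as_utf8`, the regular expression and the sample document are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration whose encoding attribute says US-ASCII.
_ASCII_DECL = re.compile(rb'^(\s*<\?xml[^>]*encoding=)(["\'])US-ASCII\2', re.IGNORECASE)

def relabel_ascii_as_utf8(bytesdata: bytes) -> bytes:
    """Rewrite encoding="US-ASCII" in the XML declaration to encoding="UTF-8".

    Hypothetical helper: only the declaration is touched, the content bytes are left as-is.
    """
    return _ASCII_DECL.sub(rb'\g<1>\g<2>UTF-8\g<2>', bytesdata, count=1)

# A document that contains UTF-8 bytes (e with acute accent) but declares US-ASCII.
broken = b'<?xml version="1.0" encoding="US-ASCII"?><a>\xc3\xa9</a>'
fixed = relabel_ascii_as_utf8(broken)
tree = ET.fromstring(fixed)     # parses once the declaration matches the content
```

Documents that do not declare US-ASCII pass through unchanged, since the substitution only applies when the pattern matches.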

header_tag_names: tuple[str, ...] =

Undocumented

def _split_officielepublicaties_html(soup):

Code shared between a lot of the officiele-publicaties HTML extraction

def _split_officielepublicaties_xml(tree, start_at):

Code shared between a lot of the officiele-publicaties XML extraction

_content_re =

Undocumented

_inhoud_re =

Undocumented

_op_re =

Undocumented

_p_re =

Undocumented

_registered_fragment_parsers =

Undocumented

_stuk_re =

Undocumented