module documentation

This module tries to wrangle distinct types of documents for you, from HTML to PDF and from varied specific sources, into plain text, so that you can consume them more easily.

It tries to split those into smallish chunks, ideally informed by the document structure.

Ideally, this module gets ongoing attention to cover all the documents that this project cares about, and to that end it has a somewhat modular design.

You won't care about that until you want to add your own, but it does have some implications for how it should be used:

  • decide(docbytes) will give you (score, splitter_object) tuples
    • you can avoid looser-structured splitters by testing for bad scores, _or_ by using decide's thresh parameter to the same effect.
  • each splitter object, via its fragments() method, gives (metadata, intermediate, flat_text) tuples
    • flat_text is a string; sometimes this is all you care about
    • the other two are more source-specific
      • intermediate tends to be the structure that flat_text was taken from, in case you need it
      • metadata gives further details about the fragment, and varies per source

So if you want more control, dealing with the structure hints it also spits out, your code might look something like:

    import wetsuite.helpers.split

    for score, splitter in wetsuite.helpers.split.decide( docbytes ):
        for metadata, intermediate, text in splitter.fragments():
            print( text )
            print( '--------------------' )

**If you just want text**, with little control over the details:

    string_list = feeling_lucky( docbytes )

TODO:

  • parameters, like "OCR that PDF if you have to"

CONSIDER:

  • think about separating the "read document at lower level" code into functions you can use without having to rip it out of these classes here. Right now, even if you know the class you want, you need to do:

        f = Fragments_XML_BWB()
        f.accepts()
        f.suitableness()
        for frag in f.fragments():
            print( frag )
    
  • in particular the HTML code can probably be made rather faster

    • using lxml.html (or specifically iterparse?) instead of bs4, because bs4 is currently the slowest part of this. It would be a bit of a rewrite, though (the lxml.html route is roughly sketched below).
    • having one class contain all of these - mostly to _share_ state like 'we tried to parse with bs4'
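      As a rough sketch only, not current module code, of what the lxml.html route could look like:

          from lxml import html as lxml_html

          def fast_flat_text(docbytes):
              # parse once with lxml's C-backed HTML parser (typically much faster than bs4)
              tree = lxml_html.fromstring(docbytes)
              # drop script/style elements, then join the remaining text
              for el in tree.xpath('//script | //style'):
                  el.drop_tree()
              return tree.text_content()
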
  • always have a reasonable fallback for each document type (TODO: XML?)

  • settling the intermediate format more?

    • Maybe we shouldn't make too many decisions for you, and merely yield suggestions you can easily ignore, e.g.
      • 'seems like same section, <TEXT>'
      • 'seems like new section, <TEXT>'
      • 'seems like page header/footer, <TEXT>'
      • 'seems like page footnotes, <TEXT>'
    • ...so that you, or a helper function, can group them into different-sized chunks in the way you need them (a sketch of such a helper follows just below).
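      For illustration only, a minimal sketch of such a grouping helper, assuming fragments arrive as (hint, text) suggestions like the ones above; group_by_hint and the exact hint strings are made up here, not part of the module:

          def group_by_hint(suggestions, max_chars=2000):
              """Glue 'same section' texts together, start a new chunk on
              'new section' or when a chunk grows large, and drop page furniture."""
              chunks, current, size = [], [], 0
              for hint, text in suggestions:
                  if hint in ('seems like page header/footer', 'seems like page footnotes'):
                      continue                       # page furniture: usually not wanted in chunks
                  if hint == 'seems like new section' or size + len(text) > max_chars:
                      if current:
                          chunks.append('\n'.join(current))
                      current, size = [], 0
                  current.append(text)
                  size += len(text)
              if current:
                  chunks.append('\n'.join(current))
              return chunks
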
  • maybe rename feeling_lucky, maybe move it to lazy?

  • we were trying to be inspired by LaTeX hyphenation, which has a simple-but-pretty-great "this is the cost of breaking off here" notion, the analogue of which would make "hey, can you break this to different degrees (sections, paragraphs)?" possible (roughly sketched below)
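    Roughly the idea, in a made-up sketch (the cost values, boundary kinds, and break_cheaply are illustrative, not existing code):

        # hypothetical costs per candidate break point; lower means a more natural place to break
        BREAK_COST = {'section': 0, 'paragraph': 10, 'sentence': 50}

        def break_cheaply(fragments, max_cost):
            """Start a new chunk wherever breaking is cheap enough for the caller:
            max_cost=0 gives section-sized chunks, max_cost=10 also breaks at paragraphs."""
            chunks, current = [], []
            for boundary_kind, text in fragments:
                if current and BREAK_COST.get(boundary_kind, 999) <= max_cost:
                    chunks.append(' '.join(current))
                    current = []
                current.append(text)
            if current:
                chunks.append(' '.join(current))
            return chunks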

  • There was an earlier idea to specify pattern matchers entirely as data, like:

          presets = {
              'beleidsregels': {
                  'body': { 'name': 'div', 'id': 'PaginaContainer' },
                  'keep_headers_html': True,
                  'keep_tables_html':  True,
              },
          }
      - with a 'this is how to select for the document' and 'this is how to get the body from it', etc.
      - which is worth re-visiting, because 
        - it may be more easily altered in the long run, 
        - it would probably be more portable than this code currently is.
      - a limitation might be expressing more complex selection
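      For illustration, applying such a preset might look something like this (extract_body is hypothetical, assuming bs4 and the presets dict above):

          from bs4 import BeautifulSoup

          def extract_body(docbytes, preset):
              """Select the body element according to the preset's selection criteria."""
              soup = BeautifulSoup(docbytes, 'lxml')
              criteria = dict(preset['body'])           # e.g. {'name': 'div', 'id': 'PaginaContainer'}
              tagname  = criteria.pop('name', None)
              return soup.find(tagname, attrs=criteria)

          body = extract_body(docbytes, presets['beleidsregels'])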
    
Class Fragments Abstractish base class explaining the purpose of implementing this
Class Fragments_HTML_BUS_kamer Turn kamer-related HTMLs (from KOOP's BUS) into fragments
Class Fragments_HTML_CVDR Turn CVDR in HTML form into fragments
Class Fragments_HTML_Fallback Extract text from HTML from non-specific source into fragments
Class Fragments_HTML_Geschillencommissie Turn HTML pages from degeschillencommissie.nl into fragments
Class Fragments_HTML_OP_Bgr Turn blad gemeenschappelijke regeling in HTML form (from KOOP's BUS) into fragments
Class Fragments_HTML_OP_Gmb Turn gemeenteblad in HTML form (from KOOP's BUS) into fragments
Class Fragments_HTML_OP_Prb Turn provincieblad in HTML form (from KOOP's BUS) into fragments
Class Fragments_HTML_OP_Stb Turn staatsblad in HTML form (from KOOP's BUS) into fragments
Class Fragments_HTML_OP_Stcrt Turn staatscourant in HTML form (from KOOP's BUS) into fragments
Class Fragments_HTML_OP_Trb Turn tractatenblad in HTML form (from KOOP's BUS) into fragments
Class Fragments_HTML_OP_Wsb Turn waterschapsblad in HTML form (from KOOP's BUS) into fragments
Class Fragments_HTML_Tuchtrecht Turn tuchtrecht HTML pages into fragments
Class Fragments_PDF_Fallback Extract text from PDF from non-specific source into fragments
Class Fragments_XML_BUS_Kamer Turn other kamer XMLs (from KOOP's BUS) into fragments (TODO: re-check which these are)
Class Fragments_XML_BWB Turn BWB in XML form into fragments
Class Fragments_XML_CVDR Turn CVDR in XML form into fragments
Class Fragments_XML_Fallback Extract text from XML from non-specific source into fragments
Class Fragments_XML_OP_Bgr Turn blad gemeenschappelijke regeling in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_OP_Gmb Turn gemeenteblad in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_OP_Handelingen Turn handelingen in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_OP_Prb Turn provincieblad in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_OP_Stb Turn staatsblad in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_OP_Stcrt Turn staatscourant in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_OP_Trb Turn tractatenblad in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_OP_Wsb Turn waterschapsblad in XML form (from KOOP's BUS) into fragments
Class Fragments_XML_Rechtspraak Turn rechtspraak.nl's open-rechtspraak XML form into fragments
Class SplitDebug A notebook-style formatter that does little more than take a list of tuples of three things (meant for the output of fragments()), and print them in a table.
Function decide Ask all processors to say how well they would do, pick any that seem applicable enough (by our threshold).
Function feeling_lucky If you are sure this code understands a particular document format, you can hand it in here, and it will return a list of strings for the document. No control, just text from whatever said it applied best.
Function fix_ascii_blah There are a bunch of XMLs that are invalid _only_ because they contain UTF8 but say they are US-ASCII. This seems constrained to some parliamentary XMLs.
Variable header_tag_names Undocumented
Function _split_officielepublicaties_html Code shared between a lot of the officiele-publicaties HTML extraction
Function _split_officielepublicaties_xml Code shared between a lot of the officiele-publicaties XML extraction
Variable _content_re Undocumented
Variable _inhoud_re Undocumented
Variable _op_re Undocumented
Variable _p_re Undocumented
Variable _registered_fragment_parsers Undocumented
Variable _stuk_re Undocumented
def decide(docbytes, thresh=1000, first_only=False, debug=False):

Ask all processors to say how well they would do, pick any that seem applicable enough (by our threshold).

Returns a list of (score, processing_object) tuples.

Note that the processing object has already had accepts() and suitableness() called, so you can now call fragments() to get the fragments.

def feeling_lucky(docbytes):

If you are sure this code understands a particular document format, you can hand it in here, and it will return a list of strings for the document. No control, just text from whatever said it applied best.

This needs to be renamed. Maybe this needs to go to wetsuite.helpers.lazy instead.

Returns
a list of strings

def fix_ascii_blah(bytesdata):

There are a bunch of XMLs that are invalid _only_ because they contain UTF8 but say they are US-ASCII. This seems constrained to some parliamentary XMLs.

This is a crude patch-up for someone else's mistake, so arguably doesn't really belong in this module, but hey.
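The actual implementation may well differ; as a rough illustration of the kind of patch-up this describes (the regex and the helper name here are illustrative only, not the module's code):

    import re

    def _fix_encoding_declaration(xmlbytes):
        # swap a lying 'US-ASCII' encoding declaration in the XML prolog for UTF-8,
        # so that the (actually UTF-8) bytes that follow parse without complaint
        return re.sub(
            rb'encoding=(["\'])US-ASCII\1',
            b'encoding="UTF-8"',
            xmlbytes,
            count=1,
            flags=re.IGNORECASE,
        )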

Parameters
bytesdata: bytes
Undocumented
header_tag_names: tuple[str, ...] =

Undocumented

def _split_officielepublicaties_html(soup):

Code shared between a lot of the officiele-publicaties HTML extraction

def _split_officielepublicaties_xml(tree, start_at):

Code shared between a lot of the officiele-publicaties XML extraction

_content_re =

Undocumented

_inhoud_re =

Undocumented

_op_re =

Undocumented

_p_re =

Undocumented

_registered_fragment_parsers =

Undocumented

_stuk_re =

Undocumented