class documentation

class Fragments_HTML_Geschillencommissie(Fragments):


Turn HTML pages from degeschillencommissie.nl into fragments

Method __init__ Pass in the document bytestring. Nothing is parsed yet; call accepts(), then suitableness(), then possibly fragments(). See the example use in decide().
Method accepts Whether we would consider parsing this document at all. Often this amounts to "is this the right file type?".
Method suitableness A score for how specifically this class fits the document; see the example scores in the method description.
Instance Variable soup Undocumented

Inherited from Fragments:

Method fragments yields a tuple for each fragment
Instance Variable debug Undocumented
Instance Variable docbytes Undocumented
def __init__(self, docbytes, debug=False):

Pass in the document bytestring. Nothing is parsed yet; call accepts(), then suitableness(), then possibly fragments(). See the example use in decide().
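The call order described above can be sketched as follows. This is only an illustration with a stand-in class: `DemoFragments`, its acceptance check, its score, and the shape of the tuples it yields are all hypothetical, and `process()` is not the real decide().

```python
class DemoFragments:
    """Hypothetical stand-in that mimics the documented call order."""

    def __init__(self, docbytes, debug=False):
        # Only store the input; nothing is parsed at construction time.
        self.docbytes = docbytes
        self.debug = debug

    def accepts(self):
        # Assumed check: does the start of the document look like HTML?
        return b"<html" in self.docbytes[:1000].lower()

    def suitableness(self):
        # Assumed score; a lower number is taken to mean a more specific match.
        return 500

    def fragments(self):
        # Assumed tuple shape (index, text); the real tuple layout is not documented here.
        for i, part in enumerate(self.docbytes.split(b"<p>")[1:]):
            yield (i, part.decode("utf-8", errors="replace"))


def process(docbytes):
    """Run accepts(), then suitableness(), then fragments(), in that order."""
    handler = DemoFragments(docbytes)
    if not handler.accepts():
        return None
    score = handler.suitableness()
    return score, list(handler.fragments())
```

Calling `process()` on bytes that do not look like HTML returns None without ever reaching `fragments()`, which is the point of asking `accepts()` first.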

def accepts(self):

Whether we would consider parsing this document at all. Often this amounts to "is this the right file type?".

def suitableness(self):

Example scores and what they signal:

  • 5: I recognize that's PDF, from OP, and specifically Stcrt, so I probably know how to extract the text fairly well
  • 50: I recognize that's PDF, from OP, so I may do better than entirely generic
  • 500: I recognize that's PDF, I will do something generic (because I am a fallback for PDFs)
  • 5000: I recognize that's PDF, but I'm specific and it's probably a bad idea if I do something generic

The idea is that with several of these classes, we can find the one that is (or says it is) most specific to this document.
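One way to read the scores above is that the lowest number marks the most specific handler. The sketch below picks a handler on that assumption; `Candidate` and `pick_most_specific` are hypothetical names for illustration, not part of the documented API, and the real selection logic in decide() may differ.

```python
class Candidate:
    """Hypothetical stand-in exposing accepts() and suitableness()."""

    def __init__(self, name, accepted, score):
        self.name = name
        self._accepted = accepted
        self._score = score

    def accepts(self):
        return self._accepted

    def suitableness(self):
        return self._score


def pick_most_specific(candidates):
    # Drop candidates that would not parse this document at all.
    accepted = [c for c in candidates if c.accepts()]
    if not accepted:
        return None
    # Assumption: the lowest score marks the most specific candidate.
    return min(accepted, key=lambda c: c.suitableness())


candidates = [
    Candidate("generic_pdf", True, 500),
    Candidate("op_pdf", True, 50),
    Candidate("stcrt_pdf", True, 5),
    Candidate("html_only", False, 5),  # good score, but it does not accept
]
best = pick_most_specific(candidates)
```

Filtering on `accepts()` before comparing scores keeps a handler for the wrong file type from winning on score alone.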
soup =

Undocumented