class documentation
class Fragments_HTML_OP_Gmb(Fragments):
Turn gemeenteblad in HTML form (from KOOP's BUS) into fragments
Method | __init__ |
Hand the document bytestring into this. Nothing happens yet; you call accepts(), then suitableness(), then possibly fragments() -- see example use in decide(). |
Method | accepts |
Whether we would consider parsing this document at all. Often this comes down to "is this the right file type". |
Method | fragments |
yields a tuple for each fragment |
Method | suitableness |
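A score for how well this class fits the document; judging by the examples in the details, lower is better. |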
Instance Variable | docbytes |
Undocumented |
Instance Variable | soup |
Undocumented |
Inherited from Fragments:
Instance Variable | debug |
Undocumented |
Hand the document bytestring into this. Nothing happens yet; you call accepts(), then suitableness(), then possibly fragments() -- see example use in decide().
overrides wetsuite.helpers.split.Fragments.accepts
Whether we would consider parsing this document at all. Often this comes down to "is this the right file type".
e.g.
- 5: I recognize that's PDF, from OP, and specifically Stcrt so I probably know how to fetch out the text fairly well
- 50: I recognize that's PDF, from OP, so I may do better than entirely generic
- 500: I recognize that's PDF, I will do something generic (because I am a fallback for PDFs)
- 5000: I recognize that's PDF, but I'm specific and it's probably a bad idea if I do something generic
The idea is that with multiple of these classes, we can find the one that says it is most specific to this document.
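The selection idea described here can be sketched roughly as follows. This is not the wetsuite implementation of decide(): the classes SketchSplitter, SpecificHtmlSplitter and GenericHtmlSplitter, the helper rank_splitters, and the layout of the yielded tuples are all hypothetical stand-ins. It also assumes that a lower suitableness() score means a better fit, which is what the example scores suggest.

```python
class SketchSplitter:
    """Hypothetical stand-in for a Fragments-like class (not the wetsuite one)."""

    def __init__(self, docbytes):
        # Nothing is parsed yet; we only keep the bytes around.
        self.docbytes = docbytes

    def accepts(self):
        raise NotImplementedError

    def suitableness(self):
        raise NotImplementedError

    def fragments(self):
        raise NotImplementedError


class SpecificHtmlSplitter(SketchSplitter):
    """Pretends to know one particular kind of HTML document well."""

    def accepts(self):
        # Crude file type check: does the start look like HTML?
        return b"<html" in self.docbytes[:200].lower()

    def suitableness(self):
        # Very specific match, or "a bad idea to use me" otherwise.
        return 5 if b"gemeenteblad" in self.docbytes.lower() else 5000

    def fragments(self):
        # The tuple layout here is made up for illustration.
        yield ("specific", self.docbytes.decode("utf8"))


class GenericHtmlSplitter(SketchSplitter):
    """Pretends to be a generic fallback for any HTML."""

    def accepts(self):
        return b"<html" in self.docbytes[:200].lower()

    def suitableness(self):
        return 500

    def fragments(self):
        yield ("generic", self.docbytes.decode("utf8"))


def rank_splitters(docbytes, splitter_classes):
    """Return (score, splitter) pairs for accepting classes, best (lowest) first."""
    ranked = []
    for cls in splitter_classes:
        splitter = cls(docbytes)
        if not splitter.accepts():
            continue  # wrong file type, so do not even ask for a score
        ranked.append((splitter.suitableness(), splitter))
    ranked.sort(key=lambda pair: pair[0])
    return ranked
```

A caller would then take the first entry of the ranked list, if there is one, and iterate over its fragments(). Keeping accepts() separate from suitableness() means the cheap file type check can rule a class out before any scoring is attempted.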