wetsuite.helpers.split.Fragments_PDF

class documentation

class Fragments_PDF_Fallback(Fragments):

Constructor: Fragments_PDF_Fallback(docbytes, debug)

Extracts text from a PDF from a non-specific source and splits it into fragments.

Tries to look at section titles, but is currently too crude to deal with page headers and footers.

Method	`__init__`	Hand the document bytestring into this. Nothing happens yet; you call accepts(), then suitableness(), then possibly fragments() -- see example use in decide().
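Method	`accepts`	Whether we would consider parsing this document at all. Often this amounts to "is this the right file type".
Method	`fragments`	Yields a tuple for each fragment.
Method	`suitableness`	Score for how specific this class is to the document, e.g. 500 for a generic PDF fallback.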
Instance Variable	`part_ary`	Undocumented
Instance Variable	`part_name`	Undocumented

Inherited from Fragments:

Instance Variable	`debug`	Undocumented
Instance Variable	`docbytes`	Undocumented

def __init__(self, docbytes, debug=False):

overrides wetsuite.helpers.split.Fragments.__init__

Hand the document bytestring into this. Nothing happens yet; you call accepts(), then suitableness(), then possibly fragments() -- see example use in decide().
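The call order described above can be sketched as follows. This is only an illustration, not the wetsuite implementation: `StandInFragments` and `process` are hypothetical names, the class mimics just the documented method names so that the example runs without wetsuite installed, and the layout of the yielded tuple is an assumption.

```python
class StandInFragments:
    """Hypothetical stand-in that mimics the documented method names only."""

    def __init__(self, docbytes, debug=False):
        # Only store the inputs; no parsing happens at construction time.
        self.docbytes = docbytes
        self.debug = debug

    def accepts(self):
        # Cheap file-type check: PDF files start with the bytes b"%PDF".
        return self.docbytes.startswith(b"%PDF")

    def suitableness(self):
        # 500 is the score the documentation gives for a generic PDF fallback.
        return 500

    def fragments(self):
        # Assumption: this tuple layout is invented for illustration.
        yield ({"hint": "placeholder"}, "placeholder text")


def process(docbytes):
    """Follow the documented order: accepts(), suitableness(), fragments()."""
    candidate = StandInFragments(docbytes)
    if not candidate.accepts():
        return None  # wrong file type, so never ask for fragments
    score = candidate.suitableness()
    return score, list(candidate.fragments())
```

The point of the ordering is that `accepts()` and `suitableness()` are cheap questions, so the potentially expensive `fragments()` is only called on a class that has already claimed the document.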

def accepts(self):

overrides wetsuite.helpers.split.Fragments.accepts

Whether we would consider parsing this document at all. Often this amounts to "is this the right file type".

def fragments(self):

overrides wetsuite.helpers.split.Fragments.fragments

Yields a tuple for each fragment.

def suitableness(self):

overrides wetsuite.helpers.split.Fragments.suitableness

Returns a score for how specific this class is to the document. Lower values mean more specific, e.g.

5: I recognize that's PDF, from OP, and specifically Stcrt, so I probably know how to fetch out the text fairly well
50: I recognize that's PDF, from OP, so I may do better than entirely generic
500: I recognize that's PDF, I will do something generic (because I am a fallback for PDFs)
5000: I recognize that's PDF, but I'm specific and it's probably a bad idea if I do something generic

The idea is that with multiple of these, we can find the one that is (or says it is) most specific to this document.
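A minimal sketch of that selection idea, assuming that a lower score means a more specific class, as the example values suggest. `Candidate` and `pick_most_specific` are hypothetical names introduced here for illustration; this is not necessarily how `decide()` is implemented.

```python
class Candidate:
    """Hypothetical stand-in exposing only accepts() and suitableness()."""

    def __init__(self, name, accepted, score):
        self.name = name
        self._accepted = accepted
        self._score = score

    def accepts(self):
        return self._accepted

    def suitableness(self):
        return self._score


def pick_most_specific(candidates):
    """Return the accepting candidate with the lowest score, or None."""
    # Candidates that reject the document are never scored.
    accepting = [c for c in candidates if c.accepts()]
    if not accepting:
        return None
    return min(accepting, key=lambda c: c.suitableness())


candidates = [
    Candidate("generic PDF fallback", True, 500),
    Candidate("OP PDF", True, 50),
    Candidate("HTML splitter", False, 5),  # rejects the document
]
best = pick_most_specific(candidates)
print(best.name)
```

Here the HTML splitter has the lowest score but does not accept the document, so the choice falls to the accepting candidate with the next-lowest score.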

part_ary =

Undocumented

part_name =

Undocumented
