class documentation

class Fragments_PDF_Fallback(Fragments):

Extracts text from a PDF of unspecified origin and splits it into fragments.

Tries to use section titles, but is currently too crude to deal with page headers and footers.

Method __init__ Hand the document bytestring into this. Nothing happens yet; you call accepts(), then suitableness(), then possibly fragments() -- see example use in decide().
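Method accepts Whether we would consider parsing this document at all. Often this amounts to "is this the right file type?".
Method fragments Yields a tuple for each fragment.
Method suitableness A score for how well suited this class is to the document, e.g. 5, 50, 500 or 5000.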
Instance Variable part_ary Undocumented
Instance Variable part_name Undocumented

Inherited from Fragments:

Instance Variable debug Undocumented
Instance Variable docbytes Undocumented
def __init__(self, docbytes, debug=False):

Hand the document bytestring into this. Nothing happens yet; you call accepts(), then suitableness(), then possibly fragments() -- see example use in decide().
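
The call order described above can be sketched as follows. This is a minimal illustration and not the library's own code: `DemoFragments` and `extract` are hypothetical names, the `%PDF` magic-bytes check is an assumption about what a file-type test could look like, and the layout of the yielded tuple is a placeholder because it is not documented here. How decide() actually drives these calls is not shown in this page.

```python
class DemoFragments:
    """Hypothetical stand-in that mimics the documented call order."""

    def __init__(self, docbytes, debug=False):
        # Only store the input; nothing is parsed yet.
        self.docbytes = docbytes
        self.debug = debug

    def accepts(self):
        # Assumption: a cheap file-type check on the PDF magic bytes.
        return self.docbytes.startswith(b"%PDF")

    def suitableness(self):
        # 500 is the score the documentation lists for a generic PDF fallback.
        return 500

    def fragments(self):
        # Assumption: the real tuple layout is not documented here,
        # so this yields a placeholder (metadata, text) pair.
        yield ({"note": "placeholder"}, self.docbytes[:8].decode("latin-1"))


def extract(docbytes):
    """Follow the documented order: accepts(), suitableness(), fragments()."""
    splitter = DemoFragments(docbytes)
    if not splitter.accepts():
        return None
    score = splitter.suitableness()
    return score, list(splitter.fragments())
```

Keeping `__init__` free of work means that many candidate classes can be constructed cheaply, and only the one that is finally chosen pays the cost of parsing.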

def accepts(self):

Whether we would consider parsing this document at all. Often this amounts to "is this the right file type?".

def fragments(self):

Yields a tuple for each fragment.

def suitableness(self):

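A score for how well suited this class is to the document. Judging by the examples, lower means more specific, e.g.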

  • 5: I recognize this as a PDF from OP, and specifically Stcrt, so I probably know how to extract the text fairly well
  • 50: I recognize this as a PDF from OP, so I may do better than something entirely generic
  • 500: I recognize this as a PDF, and I will do something generic (because I am a fallback for PDFs)
  • 5000: I recognize this as a PDF, but I am a specific parser and it is probably a bad idea for me to do something generic

The idea is that, with several of these classes, we can find the one that says it is most specific to this document.
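
The selection idea above can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: `Candidate` and `pick_most_specific` are hypothetical names, and the rule "lowest score wins" is inferred from the example scores rather than stated by the source.

```python
class Candidate:
    """Hypothetical stand-in for an instance of a Fragments subclass."""

    def __init__(self, name, accepted, score):
        self.name = name
        self._accepted = accepted
        self._score = score

    def accepts(self):
        return self._accepted

    def suitableness(self):
        return self._score


def pick_most_specific(candidates):
    """Return the accepting candidate with the lowest score, or None."""
    # Assumption: candidates that do not accept the document are skipped
    # before their score is even asked for.
    accepted = [c for c in candidates if c.accepts()]
    if not accepted:
        return None
    # Assumption: a lower suitableness score means a more specific parser.
    return min(accepted, key=lambda c: c.suitableness())


candidates = [
    Candidate("generic PDF fallback", True, 500),
    Candidate("PDF from OP", True, 50),
    Candidate("other format", False, 5),
]
best = pick_most_specific(candidates)
print(best.name)  # prints "PDF from OP"
```

The third candidate has the lowest score but is ignored because it does not accept the document, which is why accepts() is checked before suitableness().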
part_ary =

Undocumented

part_name =

Undocumented