module documentation

Helps interact with the EUR-Lex website and APIs.

Function extract_html: Extract data from formatted HTML from the website itself.
Function fetch_by_resource_type: Intended to query the SPARQL endpoint for most CELEXes of a specific type (defaulting to court judgments, for no particular reason).

def extract_html(htmlbytes):

Extract data from formatted HTML from the website itself.

Written for JUDG pages, probably needs work for others.

Also, this code makes plenty of assumptions that are unlikely to hold over time, so for serious projects you should probably use a data API instead.

TODO: see how language-sensitive this is.

CONSIDER: extract more link hrefs (would probably need to pass in the page URL too)

Parameters
htmlbytes: the page, as a bytes object
Returns
a nested structure
def fetch_by_resource_type(typ='JUDG'):

Intended to query the SPARQL endpoint for most CELEXes of a specific type (defaulting to court judgments, for no particular reason).

TODO: fetch the values (e.g. those at https://github.com/SEMICeu/Excel-to-CPSVAP-RDF-transformation/blob/master/page-objects/utils/CPSVtemplateWithCodelists.json ) in a handier form

Asks the endpoint to return its results as JSON, which we parse and return as a Python structure.
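
To illustrate the request described above, here is a minimal sketch of asking a SPARQL endpoint for JSON results. It is not this module's actual code: the endpoint URL, the CDM property names in the query, and the helper names `build_query`, `build_request` and `run_query` are all assumptions made for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the public Cellar SPARQL endpoint; the module may use another URL.
ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql"


def build_query(typ="JUDG", limit=10):
    """Build an illustrative SPARQL query for works of one resource type."""
    # Assumption: these CDM property names; check them against the CDM ontology.
    return f"""
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT ?work ?type ?celex WHERE {{
  ?work cdm:work_has_resource-type ?type .
  FILTER(?type = <http://publications.europa.eu/resource/authority/resource-type/{typ}>)
  ?work cdm:resource_legal_id_celex ?celex .
}} LIMIT {int(limit)}
"""


def build_request(query):
    """Wrap the query in a GET request that asks for SPARQL JSON results."""
    url = ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query, timeout=30):
    """Send the request and parse the JSON response into Python objects."""
    with urlopen(build_request(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_query(build_query("REG", 5))` would need network access. The two `build_*` helpers only assemble strings and a request object, so they can be inspected offline.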

Parameters
typ

the type to fetch, e.g.

  • 'JUDG' for court judgments
  • 'REG' for regulations (though there are a handful of related types)
Returns

a nested Python structure, loaded from JSON, possibly containing many items

The structure you get back looks like the following (see also https://www.w3.org/TR/2013/REC-sparql11-results-json-20130321/ ):

    {
        'head': {
            'link': [], 'vars': ['work', 'type', 'celex', 'date', 'force']
        },
        'results': {
            'distinct': False,
            'ordered': True,
            'bindings': [
                {
                    'work':{
                        'type':'uri',
                        'value':'http://publications.europa.eu/resource/cellar/1e3100ce-8a71-433a-8135-15f5cc0e927c'
                    },
                    'type':{
                        'type':'uri',
                        'value':'http://publications.europa.eu/resource/authority/resource-type/JUDG'
                    },
                    'celex':{
                        'type':'typed-literal',
                        'value':'61996CJ0080',
                        'datatype': 'http://www.w3.org/2001/XMLSchema#string'
                    },
                    'date': {
                        'type': 'typed-literal',
                        'value':'1998-01-15',
                        'datatype': 'http://www.w3.org/2001/XMLSchema#date'
                    }
                },
                # ...one of these for each result
            ]
        }
    }
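
Since each binding wraps its value in a small dict, callers will often want plain values. The following is a minimal sketch of flattening such a result. The helper name `simplify_bindings` is hypothetical and not part of this module, and the sample data is a trimmed copy of the structure above.

```python
def simplify_bindings(result):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        row = {}
        for var in variables:
            # Unbound variables are absent from a binding, so use .get()
            cell = binding.get(var)
            row[var] = cell["value"] if cell is not None else None
        rows.append(row)
    return rows


# Trimmed sample in the shape shown above ('force' is listed but unbound)
sample = {
    "head": {"link": [], "vars": ["work", "type", "celex", "date", "force"]},
    "results": {
        "distinct": False,
        "ordered": True,
        "bindings": [
            {
                "celex": {
                    "type": "typed-literal",
                    "value": "61996CJ0080",
                    "datatype": "http://www.w3.org/2001/XMLSchema#string",
                },
                "date": {
                    "type": "typed-literal",
                    "value": "1998-01-15",
                    "datatype": "http://www.w3.org/2001/XMLSchema#date",
                },
            }
        ],
    },
}

rows = simplify_bindings(sample)
```

Variables named in `head.vars` but missing from a binding (such as `force` in the sample) come back as `None`, so no key lookup fails.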