The ckanext-harvest extension covers only the case of CKAN catalogues, since it relies on the target catalogue exposing the CKAN API. Hence, different solutions are needed to cover those catalogues that are not deployed on CKAN (or, in some cases, catalogues that are deployed on CKAN but do not allow access to the API). One possible approach is to implement custom harvesters addressing specific cases, i.e. to create extensions similar to ckanext-harvest that implement the harvesting interface specified by CKAN. Indeed, a few such custom harvesters already exist that target specific catalogues, such as daten.berlin.de, data.london.gov.uk, opendata.paris.fr, etc.

However, such an approach does not scale easily when the aim is to monitor a large number of different catalogues. Instead, we have implemented a rather generic harvester that is based on scraping the contents of HTML pages, and hence does not rely on specific APIs or other knowledge of the underlying platform on which the targeted catalogue is deployed. The idea is to navigate to the pages of the catalogue presenting the metadata of each dataset, to parse and analyse the HTML tree structure of each page, and then to extract from it the elements of interest, as described above.

For the first part, i.e. parsing the HTML code of the page and creating the corresponding tree representation, we use the Python package Beautiful Soup. The second part, i.e. locating the elements that contain the information of interest, is driven by the information provided during the registration of the catalogue. Recall that when a catalogue is registered to be harvested using the HTML Harvester, a sample HTML page displaying a randomly selected dataset is given, together with an indication of the elements to be extracted (either their labels, if available, or their values). Based on this example, the relevant paths in the HTML tree for each attribute to be collected are identified and stored in the job configuration. The HTML Harvester leverages this information when processing each HTML page of the harvested catalogue in order to locate the relevant elements in the HTML tree. Notice that this step is rather involved, since a certain level of tolerance is needed when searching the HTML code of a page, e.g. allowing labels to match approximately if no exact match is found, skipping certain HTML tags (e.g. line breaks) in certain cases, etc.
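To illustrate the kind of tolerant matching involved, the following minimal sketch (not the actual harvester code; the function name, the fuzzy-matching threshold and the sibling-walking heuristic are simplifying assumptions) shows how a stored label could be located in the parsed tree with Beautiful Soup, falling back to an approximate match when no exact match is found:

```python
from bs4 import BeautifulSoup
from difflib import SequenceMatcher

def find_value_by_label(html, label, min_ratio=0.8):
    """Locate the element whose text matches `label` (exactly or
    approximately) and return the text that follows it, skipping
    empty tags such as line breaks."""
    soup = BeautifulSoup(html, "html.parser")

    # First try an exact match on the label text.
    node = soup.find(string=lambda s: s and s.strip() == label)

    # Otherwise fall back to an approximate (fuzzy) match.
    if node is None:
        best, best_ratio = None, 0.0
        for candidate in soup.find_all(string=True):
            ratio = SequenceMatcher(None, candidate.strip(), label).ratio()
            if ratio > best_ratio:
                best, best_ratio = candidate, ratio
        if best_ratio >= min_ratio:
            node = best

    if node is None:
        return None

    # Walk to the next sibling of the label's tag that carries actual
    # text, skipping tags like <br/> that contain no content.
    for sibling in node.parent.next_siblings:
        text = getattr(sibling, "get_text", lambda: str(sibling))().strip()
        if text:
            return text
    return None
```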

The HTML Harvester has been developed as a CKAN extension. Hence, it also implements the three steps gather, fetch and import. However, in this case, retrieving the dataset identifiers is not done separately from collecting the metadata of each dataset (as opposed to ckanext-harvest, in which these two steps are naturally distinguished due to the two different functions supported by the CKAN API); instead, the whole metadata extraction process takes place during the gather stage, and the collected metadata are then imported into the Raw Metadata Repository in the import stage.
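Schematically, and assuming the extension follows the standard plugin interface of ckanext-harvest (gather_stage, fetch_stage, import_stage), the structure of the harvester looks roughly as follows; the helper methods _scrape_catalogue and _store_raw_metadata are placeholders for illustration, not actual methods of the implementation:

```python
from ckanext.harvest.harvesters.base import HarvesterBase
from ckanext.harvest.model import HarvestObject


class HTMLHarvester(HarvesterBase):
    """Sketch of the HTML Harvester plugin structure."""

    def info(self):
        return {
            'name': 'html',
            'title': 'HTML Harvester',
            'description': 'Harvests dataset metadata by scraping HTML pages.',
        }

    def gather_stage(self, harvest_job):
        # Navigate the catalogue pages, scrape the metadata of every dataset
        # and store it in HarvestObject records. The actual metadata
        # extraction happens here, not in the fetch stage.
        object_ids = []
        for dataset_url, metadata in self._scrape_catalogue(harvest_job):
            obj = HarvestObject(guid=dataset_url, job=harvest_job,
                                content=metadata)
            obj.save()
            object_ids.append(obj.id)
        return object_ids

    def fetch_stage(self, harvest_object):
        # Nothing to do: the content was already collected in gather_stage.
        return True

    def import_stage(self, harvest_object):
        # Store the scraped metadata in the Raw Metadata Repository
        # (placeholder helper).
        return self._store_raw_metadata(harvest_object)
```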

Throughout the course of the project, as more catalogues were registered and added for harvesting, several issues came up that required improvements and enhancements to the HTML harvester. These improvements enabled us to increase not only the number of catalogues covered (for the full list, visit the online platform) but also the accuracy of the harvesting process. In what follows, we present in more detail the changes and improvements made to the way the HTML harvester operates.

Ability to process RDF content

RDF is a general-purpose data model for describing information on the Web. Although many catalogues do not provide an API compatible with what was already supported by the ODM platform, they do publish metadata about their datasets in RDF format. Thus, we decided to enhance the HTML harvester’s functionality and increase its accuracy and completeness by enabling it to collect and process metadata in RDF format, when available, instead of relying on HTML scraping.

In this case, the overall metadata collection and processing still follows the general steps described in D3.3; however, instead of defining rules for each of the attributes to harvest, we use the RDF description. The figure below shows one such example from the Loire-Atlantique open data catalogue.

Link to RDF description of a dataset’s metadata

We can see that all information contained in the HTML page is presented in a structured way at the RDF link. We then parse the content with the xmltodict Python library, which converts the XML content into a JSON-like dictionary. Finally, we need to map every extracted attribute and its value to our internal schema. Default mappings are included for the case where the RDF description of the metadata follows the DCAT vocabulary; in that case, the mappings are applied automatically and the harvesting process completes successfully. Otherwise, custom mapping rules for the catalogue’s schema need to be added; the code to be changed is in the file RdfToJson.py, which is available in the project’s GitHub repository.
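A minimal sketch of this step is given below, assuming an RDF/XML description that follows the DCAT vocabulary; the exact dictionary structure returned by xmltodict and the internal field names used in the mappings are illustrative assumptions, not the actual schema:

```python
import requests
import xmltodict

# Illustrative default mappings from DCAT terms to the internal schema;
# the internal field names here are assumptions.
DCAT_MAPPINGS = {
    'dct:title': 'title',
    'dct:description': 'notes',
    'dct:issued': 'date_released',
    'dct:publisher': 'organization',
    'dcat:keyword': 'tags',
}

def rdf_to_internal(rdf_url):
    """Download an RDF/XML description of a dataset and map its DCAT
    attributes to the internal metadata schema."""
    xml_content = requests.get(rdf_url).content
    doc = xmltodict.parse(xml_content)        # XML -> nested (ordered) dict
    # The exact path to the dataset node depends on the catalogue;
    # the one below is assumed for illustration.
    dataset = doc['rdf:RDF']['dcat:Dataset']

    metadata = {}
    for dcat_key, internal_key in DCAT_MAPPINGS.items():
        if dcat_key in dataset:
            metadata[internal_key] = dataset[dcat_key]
    return metadata
```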

The rules defined during the registration process are used in the fetch stage of the harvesting. We select one of the pre-defined methods for collecting the meta-attributes by checking whether or not the ‘RDF Path’ field has a valid value. The check is performed with the expression if 'rdf' in rules.keys(), where 'rdf' is the internal name used in the code to denote this field of the form.
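In pseudocode, the selection in the fetch stage can therefore be sketched as follows; only the 'rdf' in rules.keys() check comes from the actual code, while the layout of the rules dictionary and the two helpers (the RDF routine sketched above and a hypothetical HTML-scraping routine) are assumptions:

```python
def fetch_metadata(dataset_url, rules):
    # 'rdf' is the internal name of the 'RDF Path' form field filled in
    # during catalogue registration.
    if 'rdf' in rules.keys():
        # A valid RDF path was provided: harvest via the RDF description
        # (see the rdf_to_internal sketch above).
        return rdf_to_internal(rules['rdf'])
    # Otherwise fall back to scraping the HTML page using the stored rules
    # (scrape_html_page is a hypothetical helper standing in for the
    # HTML-scraping path of the harvester).
    return scrape_html_page(dataset_url, rules)
```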

Ability to process JavaScript code

The gather stage of the HTML harvester is used to collect all available URLs of the datasets. In this stage, we need to go through all the different pages on which these URLs are listed. In case a catalogue uses a navigation system based on JavaScript snippets in order to reach and retrieve each hosted dataset, a different mechanism is used: during catalogue registration, we label the catalogue as such a case and provide inputs in specific fields, which are then used to handle this type of situation.

To address this issue, we used the Selenium Python library, which makes it possible to execute code enclosed in a JavaScript snippet. In our case, this made it possible to perform automatic paging in the catalogue. However, specific technical issues had to be overcome in order to support this functionality; in particular, selecting a browser that would actually execute the snippet while being able to run on a server with no GUI environment. Our first attempt was with PhantomJS, a browser capable of performing all typical browser tasks, including the execution of JavaScript code, without the need to load a GUI. However, after a few experiments, we concluded that this software was not mature enough to cover our needs and to execute JavaScript reliably. We therefore resorted to Mozilla’s Firefox web browser, in combination with the Python library pyvirtualdisplay, which made it possible to run headless Selenium/WebDriver sessions on our ODM server. Having set up and configured these tools, we now describe the process performed by the harvester.
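The setup on the server can be sketched as follows (a simplified example rather than the actual harvester code; it assumes Firefox, Selenium and pyvirtualdisplay are installed, and uses the catalogue URL from the example below):

```python
from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual X display so that Firefox can run on a GUI-less server.
display = Display(visible=0, size=(1024, 768))
display.start()

driver = webdriver.Firefox()
try:
    driver.get('http://opendata.cloudbcn.cat/MULTI/es/catalog/')
    # Execute a JavaScript snippet extracted from the page (with the
    # leading 'javascript:' prefix stripped), e.g. a __doPostBack call
    # that navigates to the next page of dataset links.
    driver.execute_script(
        "__doPostBack('ctl00$ContentPlaceHolder1$DataPager1$ctl02$ctl00','')")
    html = driver.page_source  # HTML of the newly loaded page
finally:
    driver.quit()
    display.stop()
```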

First, the harvester takes the value of the ‘btn_identifier’ field provided in the registration form. This value, in combination with the value of the Action Type field, is used to search the HTML code and retrieve the JavaScript snippet. The snippet is then passed to the Selenium library, which executes it in order to access the pages containing the dataset URL links. In this way, the harvester goes through all the pages of the catalogue until everything has been collected. Two types of values can be provided in ‘btn_identifier’, which slightly change the process:

  • number fields, i.e. 1, 2, 3, …, which navigate directly to a certain page;
  • a text button (in the example below, ‘Siguiente’), which navigates to the page that follows the current one.

In the first case, we need to modify the JavaScript code in order to access all the pages. For instance, in the http://opendata.cloudbcn.cat/MULTI/es/catalog/ catalogue, the code for the first page of datasets is the following:

javascript:__doPostBack('ctl00$ContentPlaceHolder1$DataPager1$ctl01$ctl00',''). To access the next pages, we need to increment the last control id by one, i.e. javascript:__doPostBack('ctl00$ContentPlaceHolder1$DataPager1$ctl01$ctl01','') navigates to the second page. We do this each time we want to move to another page with URLs of datasets. In the other case, things are simpler: since the button by default navigates to the next page, we only need to retrieve the embedded JavaScript code, and each time we want to navigate to the next page we simply execute the same code again with the Selenium library. In our example, the code reported from the text button field is javascript:__doPostBack('ctl00$ContentPlaceHolder1$DataPager1$ctl02$ctl00',''), which is executed as is every time by Selenium. Finally, the harvester returns when all metadata have been collected; this happens when the executed code returns an empty page (in the first case) or when the identifier button is missing (in the second case).
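Putting the two cases together, the paging loop of the gather stage can be sketched as follows; extract_dataset_links is a hypothetical helper that pulls the dataset URLs out of the HTML, the stopping condition is simplified, and the numeric branch hardcodes the pager path from the opendata.cloudbcn.cat example (in the real harvester the snippet is derived from the ‘btn_identifier’ value):

```python
def collect_dataset_urls(driver, btn_snippet, numeric_paging):
    """Go through all result pages of the catalogue and collect the
    dataset URLs. `btn_snippet` is the JavaScript code retrieved via
    the 'btn_identifier' field."""
    urls = []
    page = 0
    while True:
        if numeric_paging:
            # Increment the trailing control id of the __doPostBack call
            # (ctl00, ctl01, ctl02, ...) to jump directly to each page.
            # The pager path below is the one from the cloudbcn example.
            snippet = ("__doPostBack('ctl00$ContentPlaceHolder1$DataPager1"
                       "$ctl01$ctl%02d','')" % page)
        else:
            # The 'next' button always leads to the following page, so the
            # very same snippet is executed on every iteration.
            snippet = btn_snippet.replace('javascript:', '', 1)
        driver.execute_script(snippet)
        page_links = extract_dataset_links(driver.page_source)  # hypothetical helper
        if not page_links:
            # Empty page (numeric case) or no further 'next' button
            # (text-button case): everything has been collected.
            break
        urls.extend(page_links)
        page += 1
    return urls
```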

 

More information on this topic can be found in Deliverable D3.6.