The ckanext-harvest extension covers only the case of CKAN catalogues, since it relies on the target catalogue exposing the CKAN API. Hence, different solutions are needed for catalogues that are not deployed on CKAN (or, in some cases, catalogues that are deployed on CKAN but do not allow access to the API). One possible approach is to implement custom harvesters for specific cases, by creating extensions similar to ckanext-harvest that implement the harvesting interface specified by CKAN. Indeed, a few such custom harvesters already exist that target specific catalogues, such as daten.berlin.de, data.london.gov.uk, opendata.paris.fr, etc.
However, such an approach does not scale easily when the aim is to monitor a large number of different catalogues. We have therefore implemented a rather generic harvester based on scraping the contents of HTML pages, which does not rely on specific APIs or other knowledge of the underlying platform on which the targeted catalogue is deployed. The idea is to navigate to the pages of the catalogue presenting the metadata of each dataset, to parse and analyse the HTML tree structure of each page, and then to extract from it the elements of interest, as described above.
For the first part, i.e. parsing the HTML code of the page and creating the corresponding tree representation, we use the Python package Beautiful Soup. The second part, i.e. locating the elements that contain the information of interest, is driven by the information provided during the registration of the catalogue. Recall that when a catalogue to be harvested with the HTML Harvester is registered, a sample HTML page displaying a randomly selected dataset is given, together with an indication of the elements to be extracted (either their labels, if available, or their values). Based on this example, the relevant path in the HTML tree for each attribute to be collected is identified and stored in the job configuration. The HTML Harvester leverages this information when processing each HTML page of the harvested catalogue to locate the relevant elements in the HTML tree. Notice that this step is rather involved, since one needs to allow for a certain level of tolerance when searching the HTML code of a page, e.g. allowing labels to match approximately if no exact match is found, skipping certain HTML tags (e.g. line breaks) in certain cases, etc.
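The tolerant label matching described above can be sketched as follows. The actual harvester uses Beautiful Soup to build the HTML tree; for a dependency-free illustration, this sketch parses a small, well-formed snippet with the standard library instead, and the sample page, helper name and similarity cutoff are all assumptions rather than the harvester's real code.

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical metadata snippet from a dataset page (label/value pairs).
SAMPLE_PAGE = """
<dl>
  <dt>Licence</dt><dd>CC-BY 4.0</dd>
  <dt>Last updated</dt><dd>2016-03-01</dd>
</dl>
"""

def find_value_by_label(root, label, cutoff=0.8):
    """Return the value next to the label that best matches `label`,
    falling back to approximate matching when no exact match exists."""
    children = list(root)
    labels = {}
    for i, el in enumerate(children):
        # Pair each <dt> label with the <dd> value that follows it.
        if el.tag == "dt" and i + 1 < len(children):
            labels[(el.text or "").strip()] = children[i + 1]
    # get_close_matches returns an exact match when one exists, and the
    # closest approximate match above the cutoff otherwise.
    match = difflib.get_close_matches(label, labels, n=1, cutoff=cutoff)
    return (labels[match[0]].text or "").strip() if match else None

root = ET.fromstring(SAMPLE_PAGE)
# "License" still matches the "Licence" label despite the spelling difference.
print(find_value_by_label(root, "License"))
```

The cutoff balances tolerance against false matches: too low, and unrelated labels start matching; too high, and minor spelling variations across catalogues are missed.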
The HTML Harvester has been developed as a CKAN extension. Hence, it also implements the three steps gather, fetch and import. However, in this case, retrieving the dataset identifiers is not done separately from collecting the metadata of each dataset (as opposed to ckanext-harvest, in which these two steps are naturally distinguished due to the two different functions supported by the CKAN API); instead, the whole metadata extraction process takes place during the gather stage, and the extracted metadata are then imported into the Raw Metadata Repository in the import stage.
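This division of labour between the stages can be illustrated with a simplified, self-contained sketch. The real extension subclasses the base classes provided by ckanext-harvest, which are not reproduced here; the class and method bodies below are purely illustrative stand-ins.

```python
class HtmlHarvesterSketch:
    """Illustrative stand-in for the HTML Harvester's three CKAN stages."""

    def gather_stage(self, job):
        # Unlike ckanext-harvest, the full metadata extraction happens here:
        # each scraped page already yields a complete metadata record.
        records = self._scrape_catalogue(job)
        # One harvest object per dataset, already carrying its metadata.
        return [{"id": i, "content": rec} for i, rec in enumerate(records)]

    def fetch_stage(self, harvest_object):
        # Nothing left to fetch: the metadata was collected during gather.
        return True

    def import_stage(self, harvest_object):
        # Store the record in the Raw Metadata Repository (stubbed here).
        self._store(harvest_object["content"])
        return True

    def _scrape_catalogue(self, job):
        return [{"title": "Example dataset"}]  # stub for the HTML scraping

    def _store(self, record):
        self.last_stored = record  # stub for the repository write
```

In the real extension the fetch stage is effectively a pass-through, since the gather stage has already attached the scraped metadata to each harvest object.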
Throughout the course of the project, as more catalogues were registered and added for harvesting, issues came up that required improvements and enhancements to the HTML Harvester. These improvements enabled us to increase not only the number of catalogues covered (for more, visit the online platform) but also the accuracy of the harvesting process. In what follows, we present in more detail the changes and improvements made to the way the HTML Harvester operates.
RDF is a general-purpose data model for describing information on the Web. Although many catalogues do not provide an API compatible with what was already supported by the ODM platform, they do publish metadata about their datasets in RDF format. Thus, we decided to enhance the HTML Harvester's functionality to increase its accuracy and completeness by enabling it to collect and process metadata in RDF format, when available, instead of relying on HTML scraping.
In this case, the overall metadata collection and processing still follows the general steps described in D3.3; however, instead of defining rules for each of the attributes to harvest, we use the RDF description. The figure below shows one such example from the Loire-Atlantique open data catalogue.
We can see that all information contained in the HTML page is structurally presented in the linked RDF document. We parse this content with the xmltodict Python library, which exposes the XML content as JSON-like dictionaries. Finally, we need to map every extracted attribute and its value to our internal schema. Default mappings have been included for the case where the RDF description of the metadata follows the DCAT vocabulary. If so, the mappings are applied automatically and the harvesting process completes successfully. Otherwise, custom mapping rules for the catalogue's schema need to be added. In that case, the code that needs to be changed is in the file RdfToJson.py, which is available in the project's GitHub repository.
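The default DCAT mapping step can be sketched as follows. The actual code in RdfToJson.py uses xmltodict; this self-contained example uses the standard library's XML parser instead, and the internal field names, the mapping table and the sample RDF are all assumptions rather than the platform's real schema.

```python
import xml.etree.ElementTree as ET

# Namespaces used by DCAT descriptions.
NS = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# A minimal, hypothetical DCAT dataset description.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:dcat="http://www.w3.org/ns/dcat#"
                 xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Dataset>
    <dct:title>Example dataset</dct:title>
    <dct:issued>2015-06-01</dct:issued>
    <dct:license>CC-BY</dct:license>
  </dcat:Dataset>
</rdf:RDF>"""

# Default DCAT -> internal-schema mapping (internal names are illustrative).
DCAT_MAPPING = {
    "dct:title": "title",
    "dct:issued": "date_released",
    "dct:license": "license",
}

def rdf_to_record(rdf_text):
    """Apply the default DCAT mapping to one dataset description."""
    root = ET.fromstring(rdf_text)
    dataset = root.find("dcat:Dataset", NS)
    record = {}
    for dcat_field, internal_field in DCAT_MAPPING.items():
        el = dataset.find(dcat_field, NS)
        if el is not None and el.text:
            record[internal_field] = el.text.strip()
    return record

print(rdf_to_record(SAMPLE_RDF))
```

Supporting a catalogue with a non-DCAT schema then amounts to supplying a different mapping table, which is the kind of change that goes into RdfToJson.py.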
The rules defined during the registration process are used in the fetch stage of the harvesting. We select one of the pre-defined methods for collecting the meta-attributes by checking whether or not the 'RDF Path' field has a valid value. The check is performed in the code as if 'rdf' in rules.keys(), where 'rdf' is the internal name used in the code for the corresponding field in the form.
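A minimal sketch of this selection logic, assuming a simple dictionary of registration rules (the rule structure and function name are illustrative, not the harvester's actual code):

```python
def select_method(rules):
    """Pick the RDF path when the registration rules include a valid
    'rdf' entry, and fall back to HTML scraping otherwise."""
    if 'rdf' in rules.keys() and rules['rdf']:
        return 'rdf'
    return 'html'

# A catalogue registered with an RDF Path uses the RDF-based harvesting.
print(select_method({'rdf': 'http://example.org/dataset.rdf'}))
# One registered only with HTML scraping rules falls back to scraping.
print(select_method({'title': {'path': '//h1'}}))
```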
More information on this topic can be found in Deliverable D3.6.