This harvester relies on the ckanext-harvest extension, which allows a host CKAN instance to collect and import datasets and/or metadata from other CKAN catalogues. This is useful, for example, for a national catalogue that also aggregates datasets from local catalogues. The extension in turn relies on the CKAN API, which allows a client to access (and potentially modify) the contents of a CKAN catalogue. Among other things, the API includes a function that returns the identifiers of all the datasets in the catalogue, and a function that retrieves the complete set of metadata for a given dataset. The API offers a richer set of features, such as more complex searches or, for clients with appropriate authorisation, the modification of information, but these two functions are the ones used by the ckanext-harvest extension. Moreover, we use this extension to collect only the metadata of the datasets and not the datasets themselves.
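The two API functions mentioned above correspond to the `package_list` and `package_show` actions of the CKAN Action API. The following is a minimal sketch of how a client could combine them to collect metadata only. The helper names `call_action` and `harvest_metadata`, and the injectable `opener` parameter, are illustrative assumptions and are not part of ckanext-harvest.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def call_action(base_url, action, params=None, opener=urlopen):
    """Call a CKAN Action API endpoint and return its "result" field."""
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    if params:
        url += "?" + urlencode(params)
    with opener(url) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN action {action} failed: {payload.get('error')}")
    return payload["result"]


def harvest_metadata(base_url, opener=urlopen):
    """Yield the metadata record of every dataset in the catalogue."""
    # package_list returns the identifiers of all datasets
    for dataset_id in call_action(base_url, "package_list", opener=opener):
        # package_show returns the full metadata of a single dataset
        yield call_action(base_url, "package_show", {"id": dataset_id}, opener=opener)
```

Passing the opener as a parameter keeps the sketch testable without network access. With the default `urlopen`, `harvest_metadata("https://catalogue.example.org")` would iterate over the records of a (hypothetical) remote catalogue.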
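The harvesting process comprises three steps: gather (collecting the identifiers of the datasets to harvest), fetch (retrieving the metadata of each dataset) and import (storing the retrieved records).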
In our case, we have modified the import step of the process, since we want to store the collected metadata not in the database used by CKAN but in our own database, the Raw Metadata Repository, which is essentially a collection of JSON documents stored in a MongoDB database. This also involves some content manipulation to replace certain special characters or keywords that are not allowed when importing data into MongoDB. Another change stems from the fact that the ckanext-harvest extension is intended for retrieving both the metadata of the datasets and the datasets themselves. In our case, we store only the metadata records, while for the datasets we keep only the URL and some derived information (file format, file size, MD5 checksum) that is needed to compute some of the metrics.
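The content manipulation and the derived information described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: we assume the problematic characters are dots in key names and a leading `$`, both of which MongoDB has traditionally restricted in field names, and the replacement character `_` is an arbitrary choice. The function names `sanitize_keys` and `summarise_resource` are hypothetical.

```python
import hashlib


def sanitize_keys(value):
    """Recursively rewrite dictionary keys that MongoDB may reject."""
    if isinstance(value, dict):
        clean = {}
        for key, item in value.items():
            key = str(key).replace(".", "_")  # assumption: dots become underscores
            if key.startswith("$"):
                key = "_" + key[1:]           # assumption: leading "$" becomes "_"
            clean[key] = sanitize_keys(item)
        return clean
    if isinstance(value, list):
        return [sanitize_keys(item) for item in value]
    return value


def summarise_resource(url, file_format, content):
    """Build the derived information kept for a dataset file instead of the file itself."""
    return {
        "url": url,
        "format": file_format,
        "size": len(content),                      # file size in bytes
        "md5": hashlib.md5(content).hexdigest(),   # checksum of the file contents
    }
```

A sanitised record could then be stored with a MongoDB client, for instance through `insert_one` in PyMongo. Note that this simple renaming can make two distinct keys collide (for example `a.b` and `a_b`), which a production version would need to handle.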