This harvester essentially relies on the ckanext-harvest extension, which allows a host CKAN instance to collect and import datasets and/or metadata from other CKAN catalogues. This is useful, for example, for a national catalogue that also aggregates datasets from other, local catalogues. In turn, this extension relies on the CKAN API, which allows a client to access, and potentially modify, the contents of a CKAN catalogue. Among other things, the API includes a function that returns the identifiers of all the datasets contained in the catalogue, as well as a function that retrieves the complete set of metadata for a given dataset. The API itself has a rich feature set, allowing a client to perform more complex searches or even to modify information, assuming it has the appropriate authorisation, but the two aforementioned functions are the ones used by the ckanext-harvest extension. Moreover, we use this extension to collect only the metadata of the datasets and not the datasets themselves.
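The two API calls in question are part of CKAN's Action API (`package_list` and `package_show`). The sketch below illustrates how a client would address them and unwrap their responses; the base URL is a placeholder, and the helper names are our own, not part of CKAN.

```python
import json
from urllib.parse import urlencode


def package_list_url(base_url):
    """URL of the Action API call that lists the IDs of all datasets."""
    return f"{base_url}/api/3/action/package_list"


def package_show_url(base_url, dataset_id):
    """URL of the call that returns the full metadata of one dataset."""
    return f"{base_url}/api/3/action/package_show?" + urlencode({"id": dataset_id})


def unwrap(response_text):
    """Every Action API response wraps its payload in a
    {"success": ..., "result": ...} envelope; return the result part."""
    body = json.loads(response_text)
    if not body.get("success"):
        raise RuntimeError("CKAN API call failed")
    return body["result"]
```

A client would issue an HTTP GET to `package_list_url(...)`, pass the body through `unwrap`, and then call `package_show_url(...)` once per returned identifier.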

The harvesting process comprises three steps:

  • gather: in this step, the API is called to retrieve the IDs of all the datasets available in the target catalogue;
  • fetch: in this step, for each dataset ID in the list, a call is issued to retrieve its metadata;
  • import: this step stores all the retrieved content in the internal database of the host CKAN instance.
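Taken together, the three steps can be sketched as a small pipeline. The function below takes the API and storage operations as injected callables, since the concrete HTTP and database clients are implementation details; the structure, not the names, is what mirrors the gather/fetch/import cycle.

```python
def harvest(list_ids, fetch_metadata, store):
    """Run one gather -> fetch -> import cycle.

    list_ids:       () -> list of dataset IDs          (gather step)
    fetch_metadata: dataset_id -> metadata document    (fetch step)
    store:          metadata document -> None          (import step)
    """
    dataset_ids = list_ids()                             # gather
    records = [fetch_metadata(i) for i in dataset_ids]   # fetch
    for record in records:                               # import
        store(record)
    return len(records)
```

In the stock extension, `store` writes into the host CKAN database; the modification described below redirects it to the Raw Metadata Repository instead.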

In our case, we have modified the import step of the process, since we are interested in storing the collected metadata not in the database used by CKAN but in our own database, the Raw Metadata Repository, which is essentially a collection of JSON documents stored in a MongoDB database. This also involves some content manipulation to replace certain special characters or keywords that are not allowed when importing the data into the MongoDB database. Another change concerns the fact that the ckanext-harvest extension is intended to retrieve both the metadata of the datasets and the datasets themselves. In our case, we store only the metadata records, while for the datasets we keep only the URL and some derived information (file format, file size, MD5 checksum) that is needed to compute some of the metrics.
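The content manipulation mentioned above is typically needed because MongoDB rejects field names that contain a `.` or begin with `$`. A minimal sketch of such a sanitiser follows; the full-width substitute characters are an illustrative choice, not mandated by MongoDB or used verbatim by our implementation.

```python
def sanitize_keys(doc):
    """Recursively rewrite keys so a JSON document can be stored in MongoDB.

    MongoDB disallows field names containing '.' and names starting with '$',
    so both are replaced with full-width look-alikes before insertion.
    """
    if isinstance(doc, dict):
        clean = {}
        for key, value in doc.items():
            key = key.replace(".", "\uff0e")   # '.' -> full-width dot
            if key.startswith("$"):
                key = "\uff04" + key[1:]       # leading '$' -> full-width dollar
            clean[key] = sanitize_keys(value)
        return clean
    if isinstance(doc, list):
        return [sanitize_keys(item) for item in doc]
    return doc
```

Running every harvested record through such a function before insertion keeps the import step from failing on metadata keys that happen to contain the forbidden characters.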

More information can be found in Deliverables D3.3 and D3.6.