The harmonization process is a Python service that checks for harmonization jobs waiting to be executed. These jobs are created automatically when the harvesting process for a catalogue finishes, and all of them are stored in the harmonise_jobs collection of the database. Each job stores the information required to run the process correctly. The fields assigned to a job fall into two categories. The first category contains the fields that describe which attributes of the collected metadata need to be harmonized.
The following metadata attributes, whenever they exist, are harmonized: dates, formats, mime-types, licenses, categories, languages and countries. For some of these, namely dates, categories, languages and countries, we harmonize both labels and values, because the name used to describe the attribute does not always comply with our internal schema. For instance, the attribute date_released could be encountered as publish-date, deposit_date, etc. A dictionary of mappings is used to perform these transformations.
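The label harmonization described above can be sketched as a simple dictionary lookup. This is a minimal illustration, not the service's actual code: the names LABEL_MAPPINGS and harmonise_labels are hypothetical, and only the two variants publish-date and deposit_date (mapped to date_released) are taken from the text.

```python
# Hypothetical mapping from source-specific labels to the internal schema.
# Only publish-date and deposit_date -> date_released come from the text.
LABEL_MAPPINGS = {
    "publish-date": "date_released",
    "deposit_date": "date_released",
}


def harmonise_labels(metadata, mappings=LABEL_MAPPINGS):
    """Return a copy of metadata with known label variants renamed."""
    harmonised = {}
    for label, value in metadata.items():
        target = mappings.get(label, label)
        # Do not let a variant overwrite a value already stored
        # under the target label.
        if target in harmonised and label != target:
            continue
        harmonised[target] = value
    return harmonised


record = {"title": "Budget 2014", "publish-date": "2014-01-01"}
renamed = harmonise_labels(record)
```

Labels that have no entry in the dictionary are passed through unchanged, so the same function can be applied to every metadata object regardless of its source catalogue.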
The second category contains the fields that identify which metadata the harmonization rules apply to and that give an overview of the current status of the process and its execution. In particular, the id field is used to collect statistics about the harmonization process from another collection, jobs, such as when the process was last executed, which fields were successfully harmonized, etc. The cat_url field references the catalogue whose metadata are about to be harmonized. Finally, the harmonised and status fields let the platform track the progress of the harmonization process, i.e. whether it has ever been executed for the specific catalogue, whether it is currently running, pending, etc. The possible values for the harmonised field are:
Similarly, the possible values for the status field are:
In the figure below, we see an example of such a harmonization job for the Polish open data catalogue (http://pl.ckan.net). We can see that it has already been harmonised once in the past (harmonised: 'finished') and that a job is waiting to be executed (status: 'pending'); it is first in the queue and will run as soon as the service finishes its current work. The remaining fields refer to the attributes that are actually going to be harmonised, e.g. dates, categories, licenses, etc.
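To make the structure of a job concrete, the following sketch shows how such a job might look as a Python dictionary and how the service could pick the next job to run. It is an assumption-laden illustration: the field names id, cat_url, harmonised and status and the values 'finished' and 'pending' come from the text, while the id value, the boolean attribute switches and the helper next_pending_job are hypothetical.

```python
# Hypothetical job document. The id value and the boolean attribute
# switches are placeholders; the actual schema may differ.
example_job = {
    "id": "example-job-id",
    "cat_url": "http://pl.ckan.net",
    "harmonised": "finished",  # harmonised at least once in the past
    "status": "pending",       # waiting to be executed
    "dates": True,
    "categories": True,
    "licenses": True,
}


def next_pending_job(jobs):
    """Return the first job whose status is 'pending', or None.

    Assumes jobs is already ordered by queue position.
    """
    for job in jobs:
        if job.get("status") == "pending":
            return job
    return None


selected = next_pending_job([example_job])
```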
During the harmonisation phase, the newly collected metadata are first transferred into an intermediate database ('odm_harmonised_temp') so that the harmonization process can start. Every metadata object in the odm collection is in one of three states: new, copied or updated. We transfer all metadata that are new or updated. Before we start processing the temp collection, we add the flag copied:true and delete the updated flag for all metadata transferred to the temp collection. Then we apply the harmonization rules to the fields defined in the harmonization job. Each metadata object that has been processed is copied to the odm_harmonised collection. Before storing it, however, we check whether the metadata object already exists from a previously applied harmonization. This is crucial because, if it is an update, certain fields created earlier, e.g. duplication flags, need to be preserved. After that, we can safely delete the temp collection.
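The workflow above can be sketched with plain in-memory dictionaries standing in for the database collections. This is only an illustration under several assumptions: a "new" object is taken to be one without a copied flag, the flags are kept on the objects in the odm collection, and the names needs_transfer, run_harmonisation, STATE_FLAGS, PRESERVED_FIELDS and is_duplicate are hypothetical.

```python
import copy

STATE_FLAGS = ("copied", "updated")
# Hypothetical name for a duplication flag created by an earlier run.
PRESERVED_FIELDS = ("is_duplicate",)


def needs_transfer(doc):
    # Assumption: "new" means no copied flag yet; "updated" is an explicit flag.
    return not doc.get("copied") or bool(doc.get("updated"))


def run_harmonisation(odm, odm_harmonised, rules):
    """odm and odm_harmonised map a metadata id to a dict;
    rules is a list of callables taking and returning a dict."""
    # 1. Transfer new or updated metadata into a temporary collection.
    temp = {key: copy.deepcopy(doc)
            for key, doc in odm.items() if needs_transfer(doc)}

    # 2. Mark the transferred metadata as copied and drop the updated flag.
    for key in temp:
        odm[key]["copied"] = True
        odm[key].pop("updated", None)

    # 3. Apply the rules and store each result in the harmonised collection.
    for key, doc in temp.items():
        for flag in STATE_FLAGS:
            doc.pop(flag, None)  # bookkeeping flags are not metadata
        for rule in rules:
            doc = rule(doc)
        previous = odm_harmonised.get(key)
        if previous is not None:
            # Keep fields created after an earlier harmonization.
            for field in PRESERVED_FIELDS:
                if field in previous:
                    doc[field] = previous[field]
        odm_harmonised[key] = doc

    # 4. The temporary collection is no longer needed.
    temp.clear()
    return odm_harmonised
```

Running the function a second time without new or updated objects transfers nothing, which mirrors the purpose of the copied flag.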
More information can be found in Deliverable D3.6.