The various catalogues that have been included in the system for harvesting and monitoring come from different geographical levels around Europe. Thus, the list includes catalogues ranging from regional level to national and pan-European. Although this improves coverage, a problem that arises is that duplicate datasets often exist among the monitored catalogues, since a catalogue at a higher regional level (e.g., national) may often aggregate datasets from lower levels (e.g. city-level). As a result, there is a need to identify and exclude duplicates when computing the various metrics in order to avoid bias in the results.
Before describing how we handled the problem, we note that we consider datasets hosted within the same catalogue to be duplicate-free. There were two reasons for making this assumption: (a) an investigation of a sample of the collected metadata did not provide evidence to the contrary, and (b) limiting the duplicate detection task to only consider different catalogues reduces the time and resources required. Nevertheless, it is straightforward to modify this process to also check for duplicates within the same catalogue.
Before we proceed, we need to define the terms duplicate and candidate. We define a duplicate as a pair of metadata records stored in the database that refer to the same dataset but were harvested from two different catalogues. Correspondingly, a candidate is a pair of metadata records that is highly likely to describe the same dataset, again hosted in different catalogues. Having clarified these terms, we continue by presenting how the module operates.
The de-duplication module consists of two distinct phases:
- indexing: this step indexes all initial metadata, as described in more detail later, in order to reduce the number of comparisons that are needed to identify duplicates
- searching: this step performs the actual comparisons to identify (potential) duplicates
During the indexing phase, we perform operations that subsequently make it faster and more accurate to identify potential duplicates in the database. Specifically, this includes the following steps, performed for all metadata records:
- prepare the content for each metadata record that will be used as an indicator for identifying duplicates; specifically, we use for this purpose the concatenation of the fields <title> and <notes>;
- tokenize the content string and remove punctuation, stop words, etc.; the Whoosh API is used for this;
- calculate the md5sum of the content string for every metadata record; this attribute will be used for exact matching (see below);
- create 4-gram shingles from the content string and calculate the minhash; this attribute will be used for approximate matching (see below);
- store the above calculated attributes (this is done in the MongoDB database, under the collection dedups), and create an index on the minhash.
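The indexing steps above can be sketched as follows. This is a minimal, self-contained illustration using only the standard library: the stop-word list is a tiny illustrative subset (the module uses the Whoosh API for tokenization), the minhash is a simple salted-md5 approximation rather than a production implementation, and the MongoDB write to the dedups collection is omitted.

```python
import hashlib
import re

# Illustrative subset only; the real module relies on Whoosh's analyzers.
STOP_WORDS = {"the", "a", "an", "of", "and"}

def prepare_content(title, notes):
    """Concatenate <title> and <notes>, tokenize, strip punctuation and stop words."""
    text = f"{title} {notes}".lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def shingles(content, n=4):
    """Character 4-gram shingles of the cleaned content string."""
    return {content[i:i + n] for i in range(len(content) - n + 1)}

def minhash(shingle_set, num_perm=32):
    """Toy MinHash signature: min of salted md5 digests per 'permutation'."""
    sig = []
    for i in range(num_perm):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def index_record(record):
    """Produce the attributes that would be stored in the dedups collection."""
    content = prepare_content(record.get("title", ""), record.get("notes", ""))
    return {
        "content": content,
        "md5sum": hashlib.md5(content.encode()).hexdigest(),  # exact matching
        "minhash": minhash(shingles(content)),                # approximate matching
    }
```

Records with identical content yield identical md5sum and minhash attributes, which is what the searching phase exploits.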
The process is illustrated in the diagram below.
Flow chart for the indexing phase of the de-duplication process
The result of the above process is a fully indexed set of metadata records in our repository, ready to be used by the searching phase.
During the searching phase, we identify and log candidate pairs of metadata. The procedure comprises the following steps:
- select a newly harvested metadata record in the repository;
- find records with an identical minhash;
- apply the similarity criteria (see below);
- log any produced candidates for verification;
- repeat steps 1-4 for every metadata record.
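Steps 2-4 above can be sketched as follows. This is an illustrative, in-memory stand-in: the `dedups` dict plays the role of the MongoDB dedups collection, and the similarity criteria are left as a placeholder since they are detailed in the next section.

```python
def find_candidates(new_id, dedups):
    """Return candidate pairs for a newly harvested record.

    `dedups` is a hypothetical in-memory stand-in for the MongoDB dedups
    collection, mapping record id -> {"minhash": [...], ...}.  In the real
    module this lookup is served by the index built on the minhash.
    """
    new_sig = dedups[new_id]["minhash"]
    candidates = []
    for other_id, attrs in dedups.items():
        if other_id != new_id and attrs["minhash"] == new_sig:
            # here the similarity criteria (content, resources, date_updated)
            # would be applied before logging the pair for verification
            candidates.append((new_id, other_id))
    return candidates
```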
Criteria for similarity
In order to identify candidates among metadata records, we check whether a number of attributes comply with specific rules. The number and type of fields chosen should fulfil a few requirements. Firstly, the number should be as small as possible, to minimize processing time, but still sufficient to produce reliable results. The fields should also exist in most metadata records, ideally in all of them, and have valid values in order to be useful. We ended up with the following attributes: title, notes, resources and date_updated. All but date_updated satisfy these requirements; however, we also chose the date because, whenever it existed, it was necessary for automating steps of our process, as will become clear below. For each of the aforementioned attributes, we apply the following checks:
- Content (<title>+<notes>): given the contents of a pair of metadata records, C and C’, we calculate the edit distance (Levenshtein distance), dist(C,C’). The requirement is that the edit distance must be less than or equal to a threshold. An optimization is applied for the special case of dist(C,C’)=0: we use the md5sum, already calculated in the indexing phase, to find equal strings. This saves time otherwise spent on string similarity calculations performed with the edit distance algorithm.
- Resources: given a pair of metadata records, each has a set of resources attached, R and R’. A resource R1 ∈ R is equal to R2 ∈ R’ if and only if they have the same URL and size. The following cases are valid:
- R ∩ R’ ≠ ∅: the two sets have resources in common, and the relation between them can be: a) R = R’, when the two sets are equal; b) (R ⊂ R’ ⋁ R ⊃ R’), when R is a strict subset or a strict superset of R’, respectively; and c) (R ⊄ R’ ∧ R ⊅ R’), when neither set is a superset of the other;
- R ∩ R’ = ∅ ∧ (R ≠ ∅ ⋁ R’ ≠ ∅): the two sets have no common resources and at least one of them is not empty;
- R = ∅ ∧ R’ = ∅: both sets are empty.
- Update date: given the date_updated values for a pair of metadata records, DU and DU’, we compare them. One requirement for the comparison is that both DU field values must exist (∃DU ∧ ∃DU’). In this case, we can have the following:
- DU = DU’: date_updated values are equal;
- DU ≠ DU’: one of the two dates is newer than the other;
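The three checks above can be sketched as follows. The threshold value is illustrative (the module's actual threshold is a configuration choice), and resources are modelled as (URL, size) tuples as defined above.

```python
import hashlib

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, dist(C, C')."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def content_match(c1, c2, threshold=5):
    """Content check: md5sum shortcut for dist = 0, else edit distance vs threshold."""
    if hashlib.md5(c1.encode()).hexdigest() == hashlib.md5(c2.encode()).hexdigest():
        return True  # equal strings, no need to run the DP algorithm
    return levenshtein(c1, c2) <= threshold

def resource_relation(r, r_prime):
    """Classify resource sets R and R'; resources are (url, size) tuples."""
    if not r and not r_prime:
        return "both empty"                # R = ∅ ∧ R' = ∅
    if r.isdisjoint(r_prime):
        return "no common resources"       # R ∩ R' = ∅
    if r == r_prime:
        return "equal"                     # R = R'
    if r < r_prime or r > r_prime:
        return "strict subset/superset"    # R ⊂ R' ⋁ R ⊃ R'
    return "overlapping"                   # R ⊄ R' ∧ R ⊅ R'

def dates_comparable(du, du_prime):
    """Update-date check: both date_updated values must exist."""
    return du is not None and du_prime is not None
```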
To formulate requirements and cover all alternative conditions for the above rules, we use the decision table presented in the table below. Based on this, the process labels each of the metadata records stored in the database as:
- Unique: the metadata object is successfully identified as unique (i.e. no candidate duplicates were found) and no further processing is needed;
- Candidate: a pair of metadata is marked as candidate and waits for verification.
Decision table to label metadata as Candidate or Unique
Finally, having the list of candidate pairs, we need to determine which pairs are indeed duplicates and reject those that are not. Moreover, for each pair identified as duplicates, we must label one member of the pair as the original. This flag is used when we need to query the database and take into account only one of the metadata records that were found and labelled as duplicates. This distinction is made in one of the following ways:
- Automatically: this is the case where the applied rules result in classifying a candidate pair as duplicates and assigning the original flag to one of them. If we look at the decision table, we find that this is the outcome of applying rules R3 and R6;
- Semi-automatically: in this case, the duplicate pairs are produced automatically; however, this is not the case for the original flag. The rules R2, R5 and R9 result in this situation. Thus, we need to apply another step, called partial ordering: we construct a hierarchy of importance between catalogues. This means that metadata belonging to a catalogue of greater importance can be considered original with respect to metadata that comes from catalogues lower in the hierarchy. Many factors can characterize one catalogue as more important than another: for instance, how trustworthy or up to date it is, whether it is considered official for a country, or even the geographical area covered by the hosted datasets;
- Manually: this is the trivial case, where each of the candidate pairs needs to be manually verified as a duplicate. This is the result when the rest of the rules in the table above are applied.
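The partial ordering used in the semi-automatic case can be sketched as follows. The catalogue names and ranks are made-up placeholders; the actual hierarchy of importance is a curation decision, based on factors such as trustworthiness and geographical coverage.

```python
# Hypothetical hierarchy of importance: lower rank = more important catalogue.
CATALOGUE_RANK = {
    "pan-european-portal": 0,  # placeholder names, not the project's real list
    "national": 1,
    "city": 2,
}

def pick_original(meta_a, meta_b):
    """Partial ordering: the record from the more important catalogue is original."""
    rank_a = CATALOGUE_RANK.get(meta_a["catalogue"], 99)  # unknown -> least important
    rank_b = CATALOGUE_RANK.get(meta_b["catalogue"], 99)
    return meta_a if rank_a <= rank_b else meta_b
```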
The whole process is illustrated in the diagram below.
Flow chart for the searching phase of the de-duplication process
Finally, we modify each metadata object in our harmonised instance of the collected metadata to mark our findings, as in Figure 17. The following meta-attribute fields are used for this:
- is_duplicate: this flag is set to true when the metadata record is labelled as a duplicate; otherwise it is set to false;
- duplicates: an array of ids of the metadata records found to be duplicates of the current one;
- is_original: a flag indicating whether this metadata record is used as the original or not. It takes the values true or false.
Example of duplicate metadata objects
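For illustration, a marked record might carry these fields in the following hypothetical shape (the ids are made up; the actual objects are shown in Figure 17):

```python
# Hypothetical example of the meta-attribute fields on a harmonised record;
# field names follow the text above, identifiers are invented for illustration.
original_record = {
    "id": "dataset-123",
    "is_duplicate": True,            # part of a duplicate pair
    "duplicates": ["dataset-456"],   # ids of its duplicates
    "is_original": True,             # the copy kept when querying the database
}
```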
More information can be found in Deliverable D3.6.