When publishing open data sets, uploaders or maintainers usually provide complementary metadata. The metadata describes important information about the data set, such as its title, license, publishing body and update frequency. The goal of the ODM project is to collect and analyse these valuable metadata sources.

This section reviews the major open data metadata standards published and used by different agents, and briefly analyses the relationships and mappings between them.

DCAT

Data Catalog Vocabulary (DCAT), a W3C recommendation established on 16 January 2014, is designed to “facilitate interoperability between data catalogs published on the Web” (W3C). The main goal of DCAT is to improve the interoperability of data catalogues and to let applications easily consume metadata from multiple catalogues.

According to the DCAT specification, the main concepts defined are dcat:Catalog, dcat:Dataset and dcat:Distribution. The last of these represents “an accessible form of a data set as for example a downloadable file, an RSS feed or a web service that provides the data” (W3C).
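The relationship between the three concepts can be sketched in a few lines of Python. This is only an illustrative model, not part of the DCAT specification: the class names, the to_jsonld helper and the example URL are assumptions, and the output is a simplified JSON-LD-style dictionary rather than a fully conformant serialisation.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One accessible form of a data set (dcat:Distribution)."""
    access_url: str
    media_type: str


@dataclass
class Dataset:
    """A data set (dcat:Dataset) together with its distributions."""
    title: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Catalog:
    """A collection of data set descriptions (dcat:Catalog)."""
    title: str
    datasets: List[Dataset] = field(default_factory=list)


def to_jsonld(catalog: Catalog) -> dict:
    """Build a simplified JSON-LD-style dictionary for the catalogue."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Catalog",
        "dct:title": catalog.title,
        "dcat:dataset": [
            {
                "@type": "dcat:Dataset",
                "dct:title": ds.title,
                "dcat:distribution": [
                    {
                        "@type": "dcat:Distribution",
                        "dcat:accessURL": dist.access_url,
                        "dcat:mediaType": dist.media_type,
                    }
                    for dist in ds.distributions
                ],
            }
            for ds in catalog.datasets
        ],
    }


# Hypothetical example data; the URL is a placeholder.
catalog = Catalog(
    title="Example catalogue",
    datasets=[
        Dataset(
            title="Air quality measurements",
            distributions=[
                Distribution("https://example.org/air.csv", "text/csv")
            ],
        )
    ],
)
print(json.dumps(to_jsonld(catalog), indent=2))
```

The nesting mirrors the vocabulary: a catalogue lists data sets, and each data set lists the distributions through which it can be obtained.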

Asset Description Metadata Schema (ADMS)

ADMS is a metadata schema created by the EU’s Interoperability Solutions for European Public Administrations (ISA) Programme. The goal of ADMS is to help publishers of standards to document the metadata of the standards, such as name, status, theme, version, etc.

ADMS is closely related to DCAT; what chiefly distinguishes the two is the difference in user expectation. ADMS is a profile of DCAT for describing so-called Semantic Assets. DCAT is designed to facilitate interoperability between data catalogs, while ADMS is focused on the assets within a catalog. The core concepts in the vocabulary include: title, alternative title, description, keyword, identifier, document, document/type, document/url, etc.

DCAT-AP

The DCAT Application Profile (DCAT-AP) for data portals in Europe is a specification that re-uses terms from DCAT, ADMS, etc., and adds more specificity by identifying mandatory, recommended and optional elements to be used for a particular open data catalogue. Studies conducted by the European Commission (Vickery, 2011) have shown that businesses and citizens face difficulties in finding and reusing data sets from the public sector. A unified method for describing data sets in a machine-readable format, using a small number of commonly agreed metadata elements, could therefore greatly improve co-referencing and interoperability among different data catalogues. DCAT-AP was developed in this context and is expected to be applied across Open Data portals in EU countries.
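The idea of requirement levels can be illustrated with a small checker. This is a sketch under stated assumptions: the property lists below are examples chosen for illustration, not the normative DCAT-AP lists, and the function names are hypothetical.

```python
from typing import Dict, List

# ASSUMPTION: illustrative requirement levels only. Consult the DCAT-AP
# specification for the normative mandatory/recommended/optional sets.
REQUIREMENT_LEVELS: Dict[str, List[str]] = {
    "mandatory": ["dct:title", "dct:description"],
    "recommended": ["dcat:keyword", "dct:publisher", "dcat:distribution"],
    "optional": ["dct:issued", "dct:modified"],
}


def check_record(record: dict) -> Dict[str, List[str]]:
    """Return, per requirement level, the properties that are missing or empty."""
    return {
        level: [prop for prop in properties if not record.get(prop)]
        for level, properties in REQUIREMENT_LEVELS.items()
    }


def has_mandatory_properties(record: dict) -> bool:
    """A record passes this sketch's check if no mandatory property is missing."""
    return not check_record(record)["mandatory"]


# Hypothetical data set description.
record = {
    "dct:title": "Air quality measurements",
    "dct:description": "Hourly readings from monitoring stations.",
    "dcat:keyword": ["air", "environment"],
}

report = check_record(record)
print(report)
print(has_mandatory_properties(record))
```

A harvester could use such a report to rank catalogues by metadata completeness, treating missing mandatory elements as errors and missing recommended elements as warnings.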

CKAN Attributes

CKAN is the most widely used open data portal software to date, and as such its metadata schema is highly relevant to the ODM project. Unlike the W3C standards mentioned above, CKAN metadata is exposed via a RESTful API, and data uploaders need to supply the metadata in the API request.

CKAN defines the following top-level metadata concepts to describe a given data set:

1. package: title, notes, tags, revision_timestamp, owner_org, maintainer, maintainer_email, etc.

2. resource: description, format, resource_type, webstore_url, size, etc.

3. group: name, title, type, state, etc.

4. organisation: name, id, title, description, state, etc.

The package, resource, group and organisation concepts can be roughly mapped to the DCAT terms dcat:Dataset, dcat:Distribution, dcat:Catalog and foaf:Agent.
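The package-to-dataset and resource-to-distribution part of this mapping can be sketched as follows. The input dictionary is shaped like a CKAN package (tags as a list of objects with a name, resources as a nested list), which is an assumption about the portal's output. The choice of target properties (dct:description for notes, dcat:byteSize for size, and so on) is also an assumption made for illustration, not an official mapping.

```python
def ckan_package_to_dcat(package: dict) -> dict:
    """Map a CKAN-style package dictionary to a DCAT-style dictionary.

    ASSUMPTION: the property choices below are illustrative only.
    """
    return {
        "@type": "dcat:Dataset",
        "dct:title": package.get("title"),
        "dct:description": package.get("notes"),
        "dcat:keyword": [tag["name"] for tag in package.get("tags", [])],
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dct:description": resource.get("description"),
                "dct:format": resource.get("format"),
                "dcat:byteSize": resource.get("size"),
            }
            for resource in package.get("resources", [])
        ],
    }


# Hypothetical package; all values are placeholders.
package = {
    "title": "Air quality measurements",
    "notes": "Hourly readings from monitoring stations.",
    "tags": [{"name": "air"}, {"name": "environment"}],
    "resources": [
        {"description": "Full export", "format": "CSV", "size": 2048},
    ],
}

dataset = ckan_package_to_dcat(package)
print(dataset["dcat:keyword"])
```

Fields without an obvious DCAT counterpart, such as revision_timestamp or owner_org, are left out here; a real mapping would need an explicit decision for each of them.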

INSPIRE Metadata Schema

INSPIRE is a Directive of the European Parliament and of the Council aiming to establish an “EU-wide spatial data infrastructure to give access to information that can be used to support EU environmental policies across different countries and public sectors” (INSPIRE). The actual scope of this information corresponds to 34 environmental themes, covering areas having cross-sector relevance, e.g. addresses, buildings, population distribution and demography.

To maximise the interoperability of data infrastructures operated by EU members, INSPIRE proposes a framework using common specifications for metadata, data monitoring, sharing and reporting. INSPIRE consists of a set of implementing rules along with corresponding technical guidelines. For metadata, the INSPIRE implementing rules include rules for the description of data sets, which could be adopted by open data publishers.

Common Core Metadata Schema (CCMS) in Project Open Data

The Common Core Metadata Schema is based on DCAT and provides a common vocabulary that different open data metadata schemas can map to. The standard consists of a number of schemas (hierarchical vocabulary terms) that represent things that are most often looked for on the web. CCMS also provides mappings to their equivalents in other standards.

The schema is implemented in JSON and CSV formats. Similar to DCAT and CKAN, CCMS also defines top-level concepts such as:

1. dataset: title, description, keyword, modified, publisher, contactPoint, mbox, identifier, accessLevel, bureauCode, programCode, distribution, etc

2. data catalog: id, title, description, type, items

CCMS provides mappings to other major metadata vocabularies, such as DCAT, CKAN and Schema.org. CCMS also comes with a Catalog Generator to help users publish metadata in CCMS format.
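To make the two serialisations concrete, the sketch below writes one dataset entry as JSON and as a CSV row, using field names from the dataset list above. The sample values, the helper names and the convention of joining keywords with a semicolon in the CSV output are assumptions made for illustration and are not taken from the CCMS documentation.

```python
import csv
import io
import json
from typing import List

# A subset of the dataset fields listed above.
FIELDS = [
    "title", "description", "keyword", "modified", "publisher",
    "contactPoint", "mbox", "identifier", "accessLevel",
]

# Hypothetical entry; all values are placeholders.
entry = {
    "title": "Air quality measurements",
    "description": "Hourly readings from monitoring stations.",
    "keyword": ["air", "environment"],
    "modified": "2014-06-01",
    "publisher": "Example Agency",
    "contactPoint": "Jane Doe",
    "mbox": "jane.doe@example.org",
    "identifier": "example-air-quality",
    "accessLevel": "public",
}


def to_json(entries: List[dict]) -> str:
    """Serialise the entries as a JSON array."""
    return json.dumps(entries, indent=2)


def to_csv(entries: List[dict]) -> str:
    """Serialise the entries as CSV, one row per dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for item in entries:
        row = {name: item.get(name, "") for name in FIELDS}
        # ASSUMPTION: multi-valued keywords are flattened with ';'.
        row["keyword"] = ";".join(item.get("keyword", []))
        writer.writerow(row)
    return buffer.getvalue()


print(to_json([entry]))
print(to_csv([entry]))
```

The JSON form keeps multi-valued fields as arrays, while the CSV form has to flatten them, which is one reason a catalogue may prefer JSON for machine consumption.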

Data Catalog Interoperability Protocol (DCIP)

DCAT is the most recent metadata standard that enables the sharing of metadata across different data catalogs. However, an actual implementation of DCAT is still needed to access the metadata and serialize it into different formats. In this context, DCIP is a specification designed to “facilitate interoperability between data catalogs published on the Web” (spec.datacatalogs.org) and is complementary to DCAT. It provides an agreed protocol (a REST API) for accessing the data defined in DCAT. One of DCIP’s main targets is to develop a CKAN plugin that exposes CKAN metadata as DCAT, but this work is still in progress.

Vocabulary of Interlinked Datasets (VoID)

VoID is an “RDF Schema vocabulary for describing metadata about RDF data sets” (VOID). Its primary purpose is to bridge the gap between data publishers and data consumers by using a dedicated vocabulary to describe different data set attributes. The core concepts related to open data sets are: void:Dataset, void:Linkset, void:subset.
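As a rough illustration of how void:Dataset and void:subset fit together, the following sketch generates a short Turtle description. The function name, the example URIs and the triple count are hypothetical, and the string formatting does no escaping, so it is only suitable for simple titles.

```python
from typing import List


def void_description(dataset_uri: str, title: str, triples: int,
                     subsets: List[str]) -> str:
    """Return a Turtle snippet describing a data set with VoID terms."""
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
    ]
    # Each subset is itself a data set that forms part of the whole.
    for subset_uri in subsets:
        lines.append(f"    void:subset <{subset_uri}> ;")
    lines.append(f"    void:triples {triples} .")
    return "\n".join(lines)


# Hypothetical URIs and figures.
turtle = void_description(
    "https://example.org/data/air-quality",
    "Air quality measurements",
    1200,
    ["https://example.org/data/stations"],
)
print(turtle)
```

A production harvester would more likely use an RDF library to build and serialise such descriptions; plain strings are used here only to keep the example dependency-free.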

Schema.org

Schema.org is a collection of schemas (in RDF/Microdata format) that webmasters can use to mark up HTML pages in ways recognised by major search engines. Schema.org covers many domains and defines classes such as DataCatalog and Dataset, together with their properties. The metadata harvester within the ODM project can make use of the schema.org vocabulary to discover the data sets and data catalogs hosted on a given website.
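One way such discovery might work is to scan a page for Microdata itemtype attributes that point to the schema.org Dataset or DataCatalog classes. The sketch below uses only the Python standard library; it is not the ODM harvester itself, and the class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser
from typing import List

TARGET_TYPES = {"Dataset", "DataCatalog"}


class SchemaOrgFinder(HTMLParser):
    """Collect schema.org Dataset/DataCatalog types declared via Microdata."""

    def __init__(self) -> None:
        super().__init__()
        self.found: List[str] = []

    def handle_starttag(self, tag, attrs):
        item_type = dict(attrs).get("itemtype") or ""
        # An itemtype attribute may hold several space-separated URLs.
        for url in item_type.split():
            name = url.rstrip("/").rsplit("/", 1)[-1]
            if "schema.org" in url and name in TARGET_TYPES:
                self.found.append(name)


def find_schema_org_types(html: str) -> List[str]:
    """Return the matching type names in document order."""
    finder = SchemaOrgFinder()
    finder.feed(html)
    return finder.found


# Hypothetical page fragment.
page = """
<div itemscope itemtype="http://schema.org/DataCatalog">
  <span itemprop="name">Example portal</span>
  <div itemprop="dataset" itemscope itemtype="http://schema.org/Dataset">
    <span itemprop="name">Air quality measurements</span>
  </div>
</div>
"""

types_found = find_schema_org_types(page)
print(types_found)
```

This only detects that a page declares a data set or catalogue; extracting the individual properties would require tracking itemprop attributes and the nesting of itemscope elements.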

Google Dataset Publishing Language

Google Dataset Publishing Language is a “representation language for the data and metadata of data sets”. Data sets described using this format can be visualised directly in Google Public Data Explorer.

 

References can be found here: OpenDataMonitor Project – Shared References