When publishing open data sets, the uploaders or maintainers usually provide complementary metadata. The metadata describes important information about the data set, such as title, license, publication body, update frequency, etc., and the goal of the ODM project is to collect and analyse such valuable and insightful metadata sources.
This section reviews the major open data metadata standards published and used by different agents and briefly analyses how they relate and map to one another.
Data Catalog Vocabulary (DCAT)
DCAT is a W3C Recommendation published on 16 January 2014 and designed to “facilitate interoperability between data catalogs published on the Web” (W3C). Its main goal is to improve the interoperability of data catalogues so that applications can easily consume metadata from multiple catalogues.
According to the DCAT specification, the main concepts defined are dcat:Catalog, dcat:Dataset and dcat:Distribution. The last of these represents “an accessible form of a data set as for example a downloadable file, an RSS feed or a web service that provides the data” (W3C).
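The nesting of the three concepts can be illustrated with a minimal Python sketch that builds a JSON-LD-style record. The helper functions, titles and URLs are hypothetical and serve only as an illustration; the property names (dct:title, dcat:dataset, dcat:distribution, dcat:downloadURL, dcat:mediaType) are taken from the DCAT vocabulary, but the record is not a complete or validated DCAT description.

```python
import json


def make_distribution(download_url, media_type):
    # One accessible form of the data, e.g. a downloadable file.
    return {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": download_url,
        "dcat:mediaType": media_type,
    }


def make_dataset(title, distributions):
    # A data set groups one or more distributions.
    return {
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:distribution": list(distributions),
    }


def make_catalog(title, datasets):
    # A catalogue lists the data sets it holds.
    return {
        "@type": "dcat:Catalog",
        "dct:title": title,
        "dcat:dataset": list(datasets),
    }


# Hypothetical example values.
catalog = make_catalog(
    "Example catalogue",
    [
        make_dataset(
            "Air quality measurements",
            [make_distribution("https://example.org/air.csv", "text/csv")],
        )
    ],
)

print(json.dumps(catalog, indent=2))
```

A catalogue therefore contains data sets, and each data set points to the concrete files or services through which its data can be obtained.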
Asset Description Metadata Schema (ADMS)
ADMS is a metadata schema created by the EU’s Interoperability Solutions for European Public Administrations (ISA) Programme. The goal of ADMS is to help publishers of standards to document the metadata of the standards, such as name, status, theme, version, etc.
ADMS is closely related to DCAT; what chiefly distinguishes the two is the difference in user expectations. ADMS is a profile of DCAT for describing so-called Semantic Assets. DCAT is designed to facilitate interoperability between data catalogs, while ADMS focuses on the assets within a catalog. The core concepts in the vocabulary include: title, alternative title, description, keyword, identifier, document, document/type, document/url, etc.
DCAT Application Profile (DCAT-AP)
DCAT-AP, the DCAT Application Profile for data portals in Europe, is a specification that re-uses terms from DCAT, ADMS, etc., and adds more specificity by identifying mandatory, recommended and optional elements to be used for a particular open data catalogue. Studies conducted by the European Commission (Vickery, 2011) have shown that businesses and citizens face difficulties in searching for and reusing data sets from the public sector. A unified method for describing data sets in a machine-readable format, using a small number of commonly agreed metadata elements, could therefore greatly improve co-referencing and interoperability among different data catalogues. DCAT-AP was developed in this context and is expected to be applied across Open Data portals in EU countries.
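The distinction between mandatory and recommended elements lends itself to a simple completeness check. The following Python sketch is illustrative only: the function name is hypothetical, and the two property lists are assumptions chosen for the example, so they should be checked against the DCAT-AP version actually in use.

```python
# Assumed property lists for illustration; verify against the DCAT-AP release in use.
MANDATORY = ("dct:title", "dct:description")
RECOMMENDED = ("dcat:keyword", "dct:publisher", "dcat:distribution")


def check_dataset(record):
    """Return (missing_mandatory, missing_recommended) for one data set record."""
    missing_mandatory = [p for p in MANDATORY if not record.get(p)]
    missing_recommended = [p for p in RECOMMENDED if not record.get(p)]
    return missing_mandatory, missing_recommended


# Hypothetical record with a title but no description.
record = {
    "dct:title": "Air quality measurements",
    "dcat:keyword": ["air", "environment"],
}

missing_mandatory, missing_recommended = check_dataset(record)
print("Missing mandatory:", missing_mandatory)
print("Missing recommended:", missing_recommended)
```

A harvester could reject records with missing mandatory elements and merely flag those with missing recommended ones.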
CKAN
CKAN is the most widely used open data portal software to date, and as such its metadata schema is highly relevant to the ODM project. Unlike the W3C standards mentioned above, CKAN metadata is exposed via a RESTful API, and data uploaders need to supply the metadata in the API request.
CKAN defines three top-level metadata concepts to describe a given data set:
1. package: title, notes, tags, revision_timestamp, owner_org, maintainer, maintainer_email, etc.
The package, resource and group can be roughly mapped to DCAT as dcat:Dataset, dcat:Distribution, dcat:Catalog and foaf:Agent.
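A minimal Python sketch of this rough mapping is given below. It builds the URL of CKAN's package_show action and converts a package, with its resources, into a DCAT-style record. The portal address and sample response are hypothetical, and the property-level correspondences (for example notes to dct:description) are one plausible reading rather than a mapping defined by CKAN or the W3C.

```python
import json
from urllib.parse import urlencode


def package_show_url(portal, dataset_id):
    # CKAN's Action API exposes package metadata via package_show.
    query = urlencode({"id": dataset_id})
    return f"{portal.rstrip('/')}/api/3/action/package_show?{query}"


def package_to_dcat(package):
    # Assumed mapping: package -> dcat:Dataset, resource -> dcat:Distribution.
    return {
        "@type": "dcat:Dataset",
        "dct:title": package.get("title"),
        "dct:description": package.get("notes"),
        "dcat:keyword": [tag["name"] for tag in package.get("tags", [])],
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:accessURL": resource.get("url"),
                "dct:format": resource.get("format"),
            }
            for resource in package.get("resources", [])
        ],
    }


# Hypothetical response body, shaped like a package_show result.
SAMPLE_RESPONSE = """
{"success": true,
 "result": {"title": "Air quality measurements",
            "notes": "Hourly readings from monitoring stations.",
            "tags": [{"name": "air"}, {"name": "environment"}],
            "resources": [{"url": "https://example.org/air.csv", "format": "CSV"}]}}
"""

url = package_show_url("https://demo.example.org/", "air-quality")
package = json.loads(SAMPLE_RESPONSE)["result"]
dcat_dataset = package_to_dcat(package)

print(url)
print(json.dumps(dcat_dataset, indent=2))
```

The sample response is parsed from a string so that the sketch runs without network access; a real harvester would fetch the URL and check the success flag before using the result.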
INSPIRE Metadata Schema
INSPIRE is a Directive of the European Parliament and of the Council aiming to establish an “EU-wide spatial data infrastructure to give access to information that can be used to support EU environmental policies across different countries and public sectors” (INSPIRE). The scope of this information corresponds to 34 environmental themes, covering areas of cross-sector relevance, e.g. addresses, buildings, population distribution and demography.
To maximise the interoperability of data infrastructures operated by EU members, INSPIRE proposes a framework using common specifications for metadata, data monitoring, sharing and reporting. INSPIRE consists of a set of implementing rules along with a listing of corresponding technical guidelines. For metadata schema, the INSPIRE Implementing rules include rules for the description of data sets, which could be adopted by open data publishers.
Common Core Metadata Schema (CCMS) in Project Open Data
The Common Core Metadata Schema is based on DCAT and provides a common vocabulary to which different open data metadata schemas can map. The standard consists of a number of schemas (hierarchical vocabulary terms) that represent things that are most often looked for on the web. CCMS also provides mappings to their equivalents in other standards.
The schema is implemented in JSON and CSV formats. Similar to DCAT and CKAN, CCMS also defines top-level concepts such as:
1. dataset: title, description, keyword, modified, publisher, contactPoint, mbox, identifier, accessLevel, bureauCode, programCode, distribution, etc.
2. data catalog: id, title, description, type, items
CCMS provides mappings to other major metadata vocabularies, such as DCAT, CKAN and Schema.org. CCMS also develops a Catalog Generator to help users publish metadata in CCMS format.
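To illustrate the two serialisations, the following Python sketch writes one dataset record to both JSON and CSV. The field names come from the dataset concept listed above; all values, the function names and the choice to join keywords with commas in the CSV are assumptions made for the example, not rules taken from the CCMS specification.

```python
import csv
import io
import json

# Hypothetical record using field names from the CCMS dataset concept.
dataset = {
    "title": "Air quality measurements",
    "description": "Hourly readings from monitoring stations.",
    "keyword": ["air", "environment"],
    "modified": "2014-06-01",
    "publisher": "Example Agency",
    "contactPoint": "Jane Doe",
    "mbox": "jane.doe@example.org",
    "identifier": "example-air-quality",
    "accessLevel": "public",
}


def to_json(records):
    # JSON keeps list-valued fields such as keyword as arrays.
    return json.dumps(records, indent=2)


def to_csv(records):
    # CSV is flat, so list values are joined into one cell (an assumption).
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    for record in records:
        row = {
            key: ", ".join(value) if isinstance(value, list) else value
            for key, value in record.items()
        }
        writer.writerow(row)
    return buffer.getvalue()


json_text = to_json([dataset])
csv_text = to_csv([dataset])
print(json_text)
print(csv_text)
```

The JSON form preserves structure, whereas the CSV form needs a convention for multi-valued fields, which is why the two outputs are not byte-for-byte interchangeable.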
Data Catalog Interoperability Protocol (DCIP)
DCAT is the most recent metadata standard that enables the sharing of metadata across different data catalogs. However, an actual implementation is still needed to access the metadata and serialize it into different formats. In this context, DCIP is a specification designed to “facilitate interoperability between data catalogs published on the Web” (spec.datacatalogs.org) and is complementary to DCAT. It provides an “agreed” protocol (REST API) to access the data defined in DCAT. One of DCIP’s main targets is to develop a CKAN plugin to expose CKAN metadata as DCAT, but this work is still in progress.
Vocabulary of Interlinked Datasets (VoID)
VoID is an “RDF Schema vocabulary for describing metadata about RDF data sets” (VOID). Its primary purpose is to bridge the gap between data publishers and data consumers by using a dedicated vocabulary to describe different data set attributes. The core concepts related to open data sets are: void:Dataset, void:Linkset, void:subset.
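How the three concepts fit together can be sketched in Python by emitting a handful of triples in N-Triples syntax. The function names and all resource URIs under example.org are hypothetical; the sketch uses only the void:Dataset and void:Linkset classes and the void:subset and void:target properties, and it is not a complete VoID description.

```python
VOID = "http://rdfs.org/ns/void#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe(base):
    # Hypothetical resources: a data set, one of its parts, and a set of links.
    dataset = base + "dataset"
    part = base + "dataset/stations"
    other = base + "other-dataset"
    links = base + "links"
    return [
        (dataset, RDF_TYPE, VOID + "Dataset"),
        (part, RDF_TYPE, VOID + "Dataset"),
        (dataset, VOID + "subset", part),     # part is a subset of dataset
        (links, RDF_TYPE, VOID + "Linkset"),
        (links, VOID + "target", dataset),    # a linkset connects two data sets
        (links, VOID + "target", other),
    ]


def to_ntriples(triples):
    # All terms here are URIs, so each one is written in angle brackets.
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = describe("https://example.org/")
print(to_ntriples(triples))
```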
Schema.org
Schema.org is a collection of schemas (in RDF/Microdata format) that webmasters can use to mark up HTML pages in ways recognised by major search engines. Schema.org covers many domains and defines, among others, the classes DataCatalog and Dataset. The metadata harvester within the ODM project can make use of the schema.org vocabulary to discover the data sets and data catalogs hosted on a given website.
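A minimal sketch of such discovery is shown below, using only the Python standard library. It assumes that pages embed schema.org descriptions as JSON-LD in script elements, which is one of the syntaxes schema.org supports but not the Microdata form mentioned above; the class and function names and the sample page are hypothetical and do not describe the ODM harvester itself.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect JSON-LD objects embedded in <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def find_datasets(html):
    # Keep only items typed as Dataset or DataCatalog (string-valued @type only).
    collector = JsonLdCollector()
    collector.feed(html)
    return [
        item for item in collector.items
        if isinstance(item, dict) and item.get("@type") in ("Dataset", "DataCatalog")
    ]


# Hypothetical page content.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Air quality measurements"}
</script>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Agency"}
</script>
</head><body></body></html>
"""

found = find_datasets(SAMPLE_PAGE)
print([item["name"] for item in found])
```

Pages that use Microdata or RDFa attributes instead would need a different extraction step, but the filtering on Dataset and DataCatalog would stay the same.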
Google Dataset Publishing Language
Google Dataset Publishing Language is a “representation language for the data and metadata of data sets”. Data sets described using this format can be visualised directly from Google Public Data Explorer.
References can be found here: OpenDataMonitor Project – Shared References