
DCAT-mapping of CKAN “extra” keys

The CKAN software allows portal providers to add further metadata fields to the metadata schema. When the metadata description of a dataset is retrieved via the API, these fields are included in the resulting JSON under the key “extras”. However, it is not guaranteed that the DCAT conversion of the CKAN metadata contains these extra keys. Depending on the version and configuration of the DCAT export extension, there are three different cases:

1) Portal-specific mapping

The portal provider defines a mapping for certain CKAN fields to a specific RDF property.

For instance, data.gov.ie maps certain metadata keys to DCAT properties: in the CKAN metadata description [json] of the dataset Modes of Travel in Dublin Region, the key theme-primary is mapped to the DCAT property dcat:theme (see the exported DCAT metadata [rdf]).

2) No mapping

Looking at the same dataset, we can see that there are other “extra” keys for which no mapping to an RDF property exists, e.g., the key collection-name. This metadata is lost if we only consider the exported DCAT.

3) Generic mapping by extension

Certain CKAN data portals map all available extra metadata keys by using the dct:relation (Dublin Core vocabulary) property. The key gets mapped to the rdfs:label property and the value to the rdf:value property, e.g., for the contact-email metadata key:

<http://example.com/example-dataset>
   dct:relation [
      rdfs:label "contact-email" ;
      rdf:value "example@email.com"
   ] ;

This generic mapping can be found on various data portals, for instance at data.gov.uk: the DCAT export [rdf] for the Road Safety Data dataset contains all extra keys, expressed with the dct:relation property.

A more detailed report can be found here.

Prototype implementation of ISWC 2016 paper is online

The code of the prototype implementation of our ISWC 2016 paper (“Multi-level semantic labelling of numerical values”) is available online:

https://github.com/sebneu/number_labelling

 

Please be aware that this implementation is for demonstration purposes only. The underlying background knowledge graph is based on 50 DBpedia properties, which are described in detail in the paper. This is a research project: we try to fix bugs and plan to extend the knowledge graph to other data sources.

CKAN instances on GitHub

Here I started to collect URLs and APIs of existing CKAN instances. I came across the dataportals.org portal, which provides a comprehensive list of Open Data sites and portals. However, this collection contains only about 50 CKAN portals, some of which are down, and a lot of API URLs are missing.

Using a short script, I harvested the CKAN portals from dataportals.org and merged them with my list. Before adding a new portal to the list, I check whether its URL (and its API) is accessible by performing an HTTP GET request.
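The check-then-merge step could look roughly like the following Python sketch. It is not the original script. The function names, the 2xx success criterion, the timeout and the duplicate test (ignoring a trailing slash) are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def is_accessible(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if an HTTP GET on `url` answers with a 2xx status.

    `opener` can be swapped for a stub, so the function can be tried out
    without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, unreachable hosts, timeouts and malformed URLs.
        return False


def _normalise(url):
    """Assumed duplicate test: ignore surrounding spaces and a trailing slash."""
    return url.strip().rstrip("/")


def merge_portals(known, harvested, check=is_accessible):
    """Append harvested portal URLs that are new and pass the check."""
    merged = list(known)
    seen = {_normalise(url) for url in known}
    for url in harvested:
        key = _normalise(url)
        if key not in seen and check(url):
            merged.append(url)
            seen.add(key)
    return merged
```

The API URL of a portal can be tested with the same is_accessible call before the entry is written to the list.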

The resulting portals are published on GitHub as CSV or as JSON data:
instances.csv
instances.json

CKAN Revision Feed

CKAN provides an Atom feed to get updates on edited and created datasets and resources. By default, the feed can be found at {portal-url}/revision/list?format=atom.

In the Python code example I accessed the feed using the feedparser package. The titles of the feed entries are of the following form:

r863bfa65-bfed-40b1-b6ee-e78275be3283 [ra2014:updated:resources]: REST API: Update Objekt ra2014

I used the information within the square brackets: ra2014 is the internal dataset name, the second value (here updated) tells you whether the dataset was created or updated, and the third value, resources, appears only if a resource of the dataset has been edited.
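The bracket part of such a title can be taken apart with a regular expression. The following is a minimal sketch, not the code example mentioned above; the names FeedUpdate, parse_title, feed_url and iter_updates are made up here, and iter_updates assumes that the third-party feedparser package is installed.

```python
import re
from typing import NamedTuple, Optional


class FeedUpdate(NamedTuple):
    dataset: str             # internal dataset name, e.g. "ra2014"
    action: str              # second bracket value, e.g. "updated"
    resources_changed: bool  # True if the third value "resources" is present


# Matches "[name:action]" or "[name:action:resources]" inside a title.
_BRACKETS = re.compile(r"\[([^\]:]+):([^\]:]+)(?::([^\]:]+))?\]")


def parse_title(title: str) -> Optional[FeedUpdate]:
    """Extract the bracket information of a feed entry title, if present."""
    match = _BRACKETS.search(title)
    if match is None:
        return None
    dataset, action, third = match.groups()
    return FeedUpdate(dataset, action, third == "resources")


def feed_url(portal_url: str) -> str:
    """Build the default location of the revision feed for a portal."""
    return portal_url.rstrip("/") + "/revision/list?format=atom"


def iter_updates(portal_url: str):
    """Yield a FeedUpdate for every feed entry whose title can be parsed."""
    import feedparser  # third-party; imported lazily so parse_title works without it

    for entry in feedparser.parse(feed_url(portal_url)).entries:
        update = parse_title(entry.title)
        if update is not None:
            yield update
```

For the title shown above, parse_title returns the dataset name ra2014, the action updated, and resources_changed set to True.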

CKAN Open Data portals

The CKAN software is the de facto standard for Open Data portals. So I started to manually collect a list of portal URLs together with their API.

At the moment, the list focuses on portals in Europe, and especially on portals in Austria. If you know of any portals that are not listed (no matter in which country), please contact me or leave them in the comments.

(Here is the link to the document on Google Docs)

I want to start a second list containing Socrata portals, but I am currently missing URLs. So I would also be interested in any Socrata portals!

UPDATE: I started to harvest dataportals.org and to publish an up-to-date list to GitHub. Here you can find more information.