A Proposal for Governmental Data URIs

From Data-gov Wiki



This page discusses the design of URIs used in converting governmental datasets to RDF. We briefly review the current practice of converting datasets using the old style of hash-based URIs, and then detail the new style of slash-based URIs and the various types of data that use them in the process of converting and enhancing governmental datasets.


Old Style

The original TWC data-gov project used a conversion tool that was developed very quickly, with the intention of rapidly converting the data.gov datasets into RDF. The URIs created by this tool weren't terribly friendly for a number of reasons. Here's an example:

http://www.data.gov/semantic/data/alpha/353/dataset-353.rdf#entry1

This URI identifies the first row of the original CSV file for data.gov dataset 353. Each instance has one property for each column in the original CSV file, as well as an rdf:type of

http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry

The property URIs are derived from the header of the corresponding column in the CSV file. Here's an example property:

http://www.data.gov/semantic/data/alpha/353/dataset-353.rdf#lcswdbl

While this approach has helped us produce a great number of triples, there are several issues that make it less than ideal going forward. Because all the triples for dataset 353 are grouped into a single RDF file, it is difficult to interact with information relating to just one instance without downloading the entire dataset. Also, the hash-based URIs make serving alternative representations of the data (e.g. HTML or different RDF serialization formats) all but impossible.
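One way to see the problem with hash-based URIs: an HTTP client strips the fragment (the part after '#') before making a request, so the server only ever sees the URL of the whole file. The following is a minimal Python sketch using the standard library's urldefrag; the variable names are illustrative and not part of the conversion tool.

```python
from urllib.parse import urldefrag

# The two old-style URIs quoted above: one instance and one property.
entry = "http://www.data.gov/semantic/data/alpha/353/dataset-353.rdf#entry1"
prop = "http://www.data.gov/semantic/data/alpha/353/dataset-353.rdf#lcswdbl"

# urldefrag splits a URI into the part a client would request and the
# fragment, which is resolved on the client side only.
entry_doc, entry_fragment = urldefrag(entry)
prop_doc, prop_fragment = urldefrag(prop)

print(entry_doc)                     # URL actually requested for the instance
print(entry_fragment, prop_fragment)
print(entry_doc == prop_doc)         # both resolve to the same RDF file
```

Both URIs lead to the same request for dataset-353.rdf, which is why a single instance cannot be fetched on its own.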


New Style

We've been developing a new system for converting data.gov datasets that takes to heart some of the lessons learned in our early conversions. Many of the decisions about what URIs look like in our new system are based on the idea of an initial "raw" conversion of a dataset, followed by an iterative process of enhancing the data. The raw conversion is meant to be just as easy as our previous system, requiring little to no specific knowledge of a particular dataset's domain or modeling. Designing the system to allow enhancement of a raw dataset, however, leads to better decisions about what the output of a raw conversion should look like.

We'll start out by describing a dataset. Each dataset has an identifier (e.g. '1530') and a version (this can be any identifying string, but here we use a date such as '2009-10-08'). The version changes whenever the underlying data for the dataset changes. With this information, we can now construct a URI for the dataset:

http://www.data.gov/semantic/dataset/1530

And a URI for a specific version of a dataset:

http://www.data.gov/semantic/dataset/1530/version/2009-10-08
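The two URI patterns above can be sketched as simple string templates. This is only an illustration in Python; the function names dataset_uri and version_uri are hypothetical and do not come from the conversion system.

```python
BASE = "http://www.data.gov/semantic"

def dataset_uri(dataset_id: str) -> str:
    """URI for a dataset, independent of any version."""
    return f"{BASE}/dataset/{dataset_id}"

def version_uri(dataset_id: str, version: str) -> str:
    """URI for one specific version of a dataset."""
    return f"{dataset_uri(dataset_id)}/version/{version}"

print(dataset_uri("1530"))
print(version_uri("1530", "2009-10-08"))
```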

Again, we will be converting each row of a dataset CSV file to an instance, and assigning property values to the instances for each column. If we don't know anything about the data in the dataset, we will create a URI for a row based on its row number. Therefore, row 1 of dataset 1530 would be assigned the URI:

http://www.data.gov/semantic/dataset/1530/version/2009-10-08/thing_1

However, if we know enough about the dataset to know that each row represents a FOIA request, and that the "Request ID" column contains a unique value for each row (a primary key), row 1 would instead be assigned the URI:

http://www.data.gov/semantic/dataset/1530/version/2009-10-08/request/07-F-0001
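The choice between the two row URI forms can be sketched as follows. This is a hedged Python illustration: the function row_uri and its parameters are hypothetical, and percent-encoding the key is our own assumption about how unsafe characters might be handled.

```python
from typing import Optional
from urllib.parse import quote

VERSION_BASE = "http://www.data.gov/semantic/dataset/1530/version/2009-10-08"

def row_uri(row_number: int,
            type_name: Optional[str] = None,
            key: Optional[str] = None) -> str:
    """Build a row URI within one dataset version.

    With a known row type and primary-key value, use them; otherwise
    fall back to the generic row-number form.
    """
    if type_name and key:
        # Assumption: keys are percent-encoded so they are safe in a path.
        return f"{VERSION_BASE}/{type_name}/{quote(key, safe='')}"
    return f"{VERSION_BASE}/thing_{row_number}"

print(row_uri(1))                          # nothing known about the data
print(row_uri(1, "request", "07-F-0001"))  # "Request ID" is a primary key
```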

As with the old-style URIs, each CSV column is represented by a property URI based on the column's name. For example, the "Received Date" column is assigned the property URI:

http://www.data.gov/semantic/dataset/1530/vocab/raw/received_date

Note that property URIs exist "within" the dataset scope (/dataset/1530/), but outside any particular dataset version (2009-10-08). Upon a first, "raw" conversion, all properties also have "/raw/" in their URI. All "raw" property values are plain literals, taken directly from the underlying CSV file. As the dataset is enhanced, the property values may change to better represent the underlying data. For example, if dataset 1530 is enhanced from its "raw" form to the first enhancement by transforming the "Received Date" values from plain literals to xsd:date datatyped literals, the new property URI becomes

http://www.data.gov/semantic/dataset/1530/vocab/enhancement/1/received_date
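The raw and enhancement property URIs can be sketched as one function. This is a Python illustration under assumptions: the names column_slug and property_uri are hypothetical, and the normalization rule (lowercase, runs of non-alphanumeric characters replaced by underscores) is inferred only from the "Received Date" example above.

```python
import re
from typing import Optional

DATASET_BASE = "http://www.data.gov/semantic/dataset/1530"

def column_slug(column_name: str) -> str:
    """Assumed normalization: 'Received Date' -> 'received_date'."""
    return re.sub(r"[^a-z0-9]+", "_", column_name.strip().lower()).strip("_")

def property_uri(column_name: str, enhancement: Optional[int] = None) -> str:
    """Property URI in the raw namespace, or in a numbered enhancement."""
    layer = "raw" if enhancement is None else f"enhancement/{enhancement}"
    return f"{DATASET_BASE}/vocab/{layer}/{column_slug(column_name)}"

print(property_uri("Received Date"))     # raw conversion
print(property_uri("Received Date", 1))  # first enhancement
```

Note that neither URI contains a version segment, matching the scoping described above.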

Every time the dataset is enhanced, all the properties in the dataset are moved into a new enhancement namespace. While this results in more triples being created every time a dataset is enhanced, it allows existing applications and queries over the published data to continue to work as expected because any particular property URI will always be used with the same value types (plain literals, datatyped literals, URIs, etc.). We expect that datasets will only go through a small number of enhancements before they stabilize as useful, semantically-enhanced datasets.
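The guarantee that a property URI always carries the same value type can be sketched with plain Python tuples standing in for triples. The cell value "2007-10-01" is made up for illustration and is not taken from dataset 1530; a Python date stands in for an xsd:date literal.

```python
from datetime import date

ROW = "http://www.data.gov/semantic/dataset/1530/version/2009-10-08/request/07-F-0001"
RAW = "http://www.data.gov/semantic/dataset/1530/vocab/raw/received_date"
ENH = "http://www.data.gov/semantic/dataset/1530/vocab/enhancement/1/received_date"

raw_value = "2007-10-01"  # hypothetical cell value, for illustration only

# The raw triple is kept; the enhancement adds a second triple under a
# new property URI instead of changing the value of the old one.
triples = [
    (ROW, RAW, raw_value),                      # plain literal
    (ROW, ENH, date.fromisoformat(raw_value)),  # typed value
]

# Collect the value types seen for each property URI.
types_by_property = {}
for _, prop, value in triples:
    types_by_property.setdefault(prop, set()).add(type(value).__name__)

print(types_by_property)
```

A query written against the raw property keeps receiving strings, while new queries can opt in to the typed values.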

Property values (cells in the original CSV) can be promoted from string literals to resources. When this happens, the promoted values can be given an optional rdf:type using a newly created (dataset-local) rdfs:Class (this class can be mapped to an existing, external class in a later enhancement). For example, the "Requester Name" field in dataset 1530 represents people. The values in this field can be promoted to resources either in a value space scoped to the "Requester Name" property or, if we know there are other fields that map to the same values, in a dataset-scoped value space. Examples of these two resource promotions (the dataset-scoped form first, then the property-scoped form) are:

http://www.data.gov/semantic/dataset/1530/type/person/Connolly_Ward
http://www.data.gov/semantic/dataset/1530/value/requester_name/Connolly_Ward

If we assert during such a resource promotion that these resources are of type "Person" (as is required when defining a dataset-scoped value space), each of these resources also has an rdf:type of

http://www.data.gov/semantic/dataset/1530/vocab/Person
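The promotion URIs and the dataset-local class URI above can be sketched as three small helpers. This is a Python illustration; the function names are hypothetical, and the local name "Connolly_Ward" is taken as given because the original cell text is not shown here.

```python
DATASET_BASE = "http://www.data.gov/semantic/dataset/1530"

def property_scoped_value_uri(property_slug: str, local_name: str) -> str:
    """Resource in a value space scoped to a single property."""
    return f"{DATASET_BASE}/value/{property_slug}/{local_name}"

def dataset_scoped_value_uri(type_slug: str, local_name: str) -> str:
    """Resource in a value space shared across the dataset, keyed by type."""
    return f"{DATASET_BASE}/type/{type_slug}/{local_name}"

def class_uri(class_name: str) -> str:
    """Dataset-local rdfs:Class used as the rdf:type of promoted values."""
    return f"{DATASET_BASE}/vocab/{class_name}"

print(property_scoped_value_uri("requester_name", "Connolly_Ward"))
print(dataset_scoped_value_uri("person", "Connolly_Ward"))
print(class_uri("Person"))
```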

Discussion

The URI design discussed above makes some assumptions about how governmental datasets are produced, converted to RDF, and published. Currently this design is influenced by the data.gov approach of aggregating and publishing governmental datasets in bulk. Ideally, the owners of each dataset would design and own its URIs, allowing them to make informed decisions based on relevant domain information, such as when the underlying dataset version changes and which fields, if any, uniquely identify a row (primary key) or value (e.g. are two rows with the same "requester name" value referencing the same person, or two different people with the same name?). In such a situation, some of the URI design assumptions made here would be affected. Understanding these assumptions and how they might be affected by a change in dataset publisher is important, but beyond the scope of this document.

There has been quite a bit of work put into similar issues surrounding the data.gov.uk datasets. Within the UK effort, URIs are designed, "both to encourage those that definitively own reference data to make it available for re-use, and to give those that have data that could be linked, the confidence to re-use a URI set that is not under their direct control."

Those working on data.gov.uk have also described versioning of linked data, dataset statistics, and mappings to external vocabularies. Although not described here, our system supports approaches to versioning and mapping very similar to those described.

Recent Developments

[25 August 2010]

We have made some great progress throughout the summer. The following three pages provide more in-depth descriptions of our work to design, implement, and adopt a new URI naming scheme. We just finished leading a Mashathon with representatives of many federal agencies, and our two-day experience "in the trenches" has reassured us that we are on the right track.

  • URI design for RDF conversion of CSV-based data - lists enhancement parameters that can describe how tabular data should be interpreted and cast into Linked-friendly RDF. Examples are provided with snippets of input, the (RDF) enhancement parameter, and the output.
  • Csv2rdf4lod - describes (and provides a pointer to download) some automation infrastructure to ease the conversion process. While a user-friendly interface is very much desirable, the system currently uses Unix shell scripts to invoke a Java jar.
  • Triplify challenge 2010 - lebo and williams - describes a submission to the I-SEMANTICS Triplification Challenge, which was nominated and accepted for presentation.
Facts about A Proposal for Governmental Data URIs
Dcterms:created: 17 May 2010
Dcterms:creator: Gregory Todd Williams, Tim Lebo, and Alvaro Graves
Dcterms:description: This article discusses the design of URIs used in converting governmental datasets to RDF.
Dcterms:modified: 25 August 2010
Foaf:name: A Proposal for Governmental Data URIs