Many national, regional and local governments, as well as other organizations inside and outside of the public sector, are operating data catalogs – web portals that provide access to machine-readable data published by these organizations. The need for a standard format for representing the metadata contained in these catalogs has been recognized. This document is a collection of use cases for such a standard.
The use cases presented in this document were originally collected by the Data Catalog Vocabulary Task Force of the W3C eGovernment Interest Group.
A dataset is a collection of information in a machine-readable format. It is published by an agency, usually some sort of official government organisation, and thought to be useful to the public.
A catalog record consists of metadata for a dataset. It thus describes the dataset. The actual dataset is not considered part of the catalog record, but the catalog record usually contains a download link or web page link from where the actual dataset can be obtained.
A catalog is a collection of catalog records, and thus contains metadata for a collection of datasets. It is operated by a catalog operator, which could be a government agency, citizen initiative, …
A catalog operator is an organization that collects metadata about datasets and publishes them as a catalog on the Web.
Metadata are data that provide information about aspects of other data, such as the time and date of creation, the creator or author, the purpose, the format, and so on.
A format is machine-readable if it is amenable to automated processing by a machine, as opposed to presentation to a human user.
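The relationship between the terms defined above can be illustrated with a minimal data model. This is only a sketch: the class names, the field names and the example values are assumptions made for illustration and are not part of any standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogRecord:
    """Metadata describing one dataset; the dataset itself is not included."""
    identifier: str    # hypothetical field: identifier of the record
    title: str
    publisher: str     # the agency that published the dataset
    data_format: str   # for example "CSV"
    download_url: str  # link from which the actual dataset can be obtained


@dataclass
class Catalog:
    """A collection of catalog records, run by a catalog operator."""
    operator: str
    records: List[CatalogRecord] = field(default_factory=list)

    def add(self, record: CatalogRecord) -> None:
        self.records.append(record)


# Made-up example data.
catalog = Catalog(operator="Example City Open Data Team")
catalog.add(CatalogRecord(
    identifier="bus-stops",
    title="Bus stop locations",
    publisher="Example City Transport Office",
    data_format="CSV",
    download_url="https://data.example.org/bus-stops.csv",
))
```

Note that the record only points to the dataset through its download link; the dataset is stored elsewhere.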
This section presents scenarios that would be enabled by the existence of a standard for the representation of data catalogs.
An increasing number of government agencies make their data available online in the form of data catalogs such as data.gov (see datacatalogs.org for a list). Catalogs exist at national, regional and local level; some are operated by official government bodies and others by citizen initiatives; some have general coverage, while others have a specific focus (e.g., statistical data, historical datasets).
Citizens, journalists, researchers and businesses thus may have to spend considerable amounts of time searching a number of catalogs for relevant datasets. Federated catalogs such as the Guardian's World Government Data site, Sunlight Labs' National Data Catalog, OKFN's publicdata.eu and RPI's IOGDS are emerging as a response to this problem. They present a unified catalog and unified user interface. They may also provide additional advanced features that individual catalog operators will not or cannot supply, such as convenient APIs for mashup developers.
The federated catalog replicates individual catalogs' contents into its local database. A website interface similar to those of current individual catalogs is offered for interacting with the federated catalog. Updates to the individual catalogs (new datasets, modified metadata, deleted datasets) also have to be reflected in the federated catalog.
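The replication step described above can be sketched as a comparison between the federated catalog's local copy and the current contents of one individual catalog. This is an illustration under assumptions: the function name `sync_catalog`, the dictionary layout and the sample records are hypothetical, and how the individual catalog's records are obtained is left open.

```python
def sync_catalog(federated, source_name, source_records):
    """Replicate one individual catalog into the federated store.

    federated:      dict mapping (source_name, record_id) -> metadata dict
    source_records: dict mapping record_id -> metadata dict, the current
                    contents of the individual catalog
    Returns three lists of record ids: added, updated, deleted.
    """
    added, updated, deleted = [], [], []
    for record_id, metadata in source_records.items():
        key = (source_name, record_id)
        if key not in federated:
            added.append(record_id)
        elif federated[key] != metadata:
            updated.append(record_id)
        else:
            continue  # unchanged, nothing to copy
        federated[key] = dict(metadata)
    # Remove records that no longer exist in the individual catalog.
    stale = [
        key for key in federated
        if key[0] == source_name and key[1] not in source_records
    ]
    for key in stale:
        del federated[key]
        deleted.append(key[1])
    return added, updated, deleted


# Made-up example: two successive synchronisation runs.
federated = {}
sync_catalog(federated, "city", {"a": {"title": "Parks"}, "b": {"title": "Roads"}})
changes = sync_catalog(
    federated, "city",
    {"a": {"title": "Parks and gardens"}, "c": {"title": "Trees"}},
)
```

Comparing full snapshots is the simplest approach but requires retrieving every record on each run; a catalog that exposes modification dates would allow a cheaper incremental update.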
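Creating federated catalogs is challenging for three reasons: many individual catalogs do not publish machine-readable metadata, a separate importer is needed for each catalog that does, and the metadata fields differ between catalogs and need to be harmonised.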
A standard format for data catalogs helps with all three problems: First, the existence of a well-documented standard creates an additional incentive for catalog operators to publish machine-readable metadata. Second, a single importer can be used to import all catalogs that support the format. Third, harmonising metadata fields becomes the job of individual catalog operators, who know the contents of their own catalog best.
The model of most current data catalogs assumes that agencies publish datasets on their own website, and then register the dataset with the central catalog by providing the download location and other metadata to the catalog operator. This model is not always efficient. Individual agencies sometimes have existing dataset publishing workflows and metadata management capabilities (e.g., statistics offices). Also, the amount and nature of metadata that agencies can provide differs widely, and a central catalog with a single, non-extensible metadata schema cannot capture the requirements of a wide range of government institutions.
In a distributed publishing model, on the other hand, agencies manage their own metadata on their own websites, using their own publishing workflows and information systems. Central catalogs such as data.gov play the role of aggregator that collects dataset descriptions from different agency websites and presents them in a unified user interface. The central catalog must somehow be able to discover newly published datasets on an agency's web site, e.g., by crawling or by receiving an automated notification from the agency. There also has to be a way of notifying the central catalog about changes to the metadata.
Note that individual agencies in this scenario may not want to run a full-blown “agency-level data catalog”, but may just want to make metadata available in a more structured form alongside the datasets that are already scattered throughout their web sites. This distinguishes this use case from the catalog federation scenario (UC1), which assumes that the sites to be federated are dedicated data catalog websites.
All catalog websites provide some sort of parametric search facility (e.g., search by publishing agency, by data format, or by theme). Available search parameters differ among catalogs, and they are not sufficient for all users' needs. For example, data.gov provides search by department, format and category, but not by keyword, update date, or temporal/geographic coverage.
If catalogs are exposed in a standard machine-readable format, then third parties are able to replicate the contents of a catalog into their own database, and run advanced queries over the catalog, or provide interfaces for performing such queries to the general public.
Queries may rely on information that is not present in the catalog but in external sources. For example, by using the US Government Structure Ontology one can query for datasets published by an agency that directly reports to the Executive Office of the President.
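A query of this kind, combining replicated catalog records with an external source, could look roughly like the following sketch. The record fields, the `reports_to` mapping and the agency names are made up for illustration; a real deployment would take the reporting structure from a source such as the ontology mentioned above and would more likely use a standard query language.

```python
from datetime import date


def find_datasets(records, reports_to, parent, keyword=None, modified_since=None):
    """Return ids of records whose publisher reports directly to `parent`.

    reports_to is external information (agency -> parent body) that is
    not part of the catalog itself. keyword and modified_since are
    optional extra filters on title and update date.
    """
    matches = []
    for record in records:
        if reports_to.get(record["publisher"]) != parent:
            continue
        if keyword and keyword.lower() not in record["title"].lower():
            continue
        if modified_since and record["modified"] < modified_since:
            continue
        matches.append(record["id"])
    return matches


# Hypothetical replicated catalog records.
records = [
    {"id": "d1", "title": "Federal budget outlays", "publisher": "Agency A",
     "modified": date(2011, 3, 1)},
    {"id": "d2", "title": "Budget requests", "publisher": "Agency B",
     "modified": date(2011, 5, 9)},
    {"id": "d3", "title": "Trade statistics", "publisher": "Agency A",
     "modified": date(2010, 1, 15)},
]
# Hypothetical external reporting structure.
reports_to = {
    "Agency A": "Executive Office of the President",
    "Agency B": "Department X",
}

recent = find_datasets(
    records, reports_to, "Executive Office of the President",
    modified_since=date(2011, 1, 1),
)
```

The keyword and update-date filters correspond to search parameters that, as noted above, not every catalog website offers.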
Data catalogs support the creation of innovative mashups of government data by making it easier for developers to find data sources of interest. Developers may browse or search the catalog until they have found a dataset of interest, and then download the linked file.
However, some mashups and applications may access not just one but a very large number of datasets from a catalog. For example, an application could make all geographic datasets (in ESRI shapefile, GML, KML formats) available for display on a map.
The creation of such applications would become much easier if it were possible to automate the downloading of all datasets that meet certain criteria. Furthermore, the ability to automatically discover new datasets that meet those criteria, and to discover updated datasets, would be useful.
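Given machine-readable catalog records, selecting all datasets that meet a criterion reduces to a simple filter. The sketch below assumes hypothetical `format` and `download_url` fields and made-up example URLs; it only selects the links, and the actual download would be a loop over the returned list.

```python
# Format labels for the geographic formats named above (labels are assumed).
GEO_FORMATS = {"shp", "gml", "kml"}


def select_downloads(records, formats, already_fetched=()):
    """Return download URLs of records in one of the wanted formats.

    URLs listed in already_fetched are skipped, so repeated runs only
    return datasets that have not been retrieved before.
    """
    wanted = {f.lower() for f in formats}
    skip = set(already_fetched)
    return [
        record["download_url"]
        for record in records
        if record["format"].lower() in wanted
        and record["download_url"] not in skip
    ]


# Made-up example records.
records = [
    {"title": "District boundaries", "format": "KML",
     "download_url": "https://data.example.org/districts.kml"},
    {"title": "Budget 2011", "format": "CSV",
     "download_url": "https://data.example.org/budget.csv"},
    {"title": "Road network", "format": "GML",
     "download_url": "https://data.example.org/roads.gml"},
]
urls = select_downloads(records, GEO_FORMATS)
```

Passing the previously returned URLs as `already_fetched` on the next run yields only newly listed datasets; detecting updated datasets would additionally need a modification date or similar field in each record.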
The use cases presented in the previous section give rise to the following requirements for a standard representation of data catalogs. Requirements are cross-linked with the use cases that motivate them.
Must allow retrieval of a machine-readable representation of catalog entries.
Must allow retrieval of all entries in a catalog.
Must provide stable, persistent identifiers for individual entries.
Must allow checking whether an individual dataset has changed or was updated.
Required by: UC2
Must allow the discovery of new entries in a catalog, and the discovery of entries that have been recently updated.
Must include pointers/links to original catalog record when an entry is federated into another catalog.
Must cover the metadata that is found in typical government data catalogs.
Must allow population from existing data catalogs without requiring the production of new metadata, or an expensive (that is, manual) modification of existing metadata. In other words, implementing the standard format for an existing data catalog must not require cleaning up or otherwise modifying the metadata that the catalog collects beyond simple mechanical transformations.
Must be extensible with additional, catalog-specific metadata fields.
Required by: UC2
Must scale to catalogs that contain thousands of datasets without putting unreasonable strain on the bandwidth resources of catalog operator and catalog consumer.
Must allow querying of the entries and catalog metadata using a standard mechanism (e.g., SPARQL, XQuery, OpenSearch, etc.).
Required by: UC3
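Two of the requirements above, population through simple mechanical transformations and extensibility with catalog-specific fields, can be illustrated together. In the following sketch a catalog's native record is converted by renaming fields only; anything without a counterpart is carried along unchanged as an extension. All field names, both native and target, are assumptions and do not come from any existing catalog or standard.

```python
# Hypothetical mapping from one catalog's native field names to the
# field names of a shared format.
FIELD_MAP = {
    "name": "title",
    "agency": "publisher",
    "file_type": "format",
    "url": "download_url",
}


def to_standard(native, field_map=FIELD_MAP):
    """Convert a native record by renaming fields only.

    Values are copied as they are, without clean-up. Fields that have
    no counterpart in field_map are kept under "extensions".
    """
    standard, extra = {}, {}
    for key, value in native.items():
        if key in field_map:
            standard[field_map[key]] = value
        else:
            extra[key] = value
    if extra:
        standard["extensions"] = extra
    return standard


# Made-up native record with one catalog-specific field.
native_record = {
    "name": "Air quality measurements",
    "agency": "Example Environment Agency",
    "file_type": "CSV",
    "url": "https://data.example.org/air-quality.csv",
    "sampling_interval": "hourly",
}
converted = to_standard(native_record)
```

Because the conversion is a pure renaming, it can be run automatically over an entire catalog without manual editing of individual records.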
The editors are very thankful for comments and contributions from Vassilios Peristeras, Martin Alvarez, Ed Summers, Christopher Gutteridge, and David Read.
This document is the result of a collective effort of the W3C's eGovernment Interest Group and the Government Linked Data Working Group. Many members of these groups have provided valuable input.