--- a/data-cube-ucr/index.html Wed Feb 27 19:19:36 2013 +0100
+++ b/data-cube-ucr/index.html Wed Feb 27 23:37:55 2013 +0100
@@ -19,20 +19,21 @@
<section id="abstract">
<p>Many national, regional and local governments, as well as other
- organizations in- and outside of the public sector, collect numeric
+ organisations in- and outside of the public sector, collect numeric
data and aggregate this data into statistics. There is a need to
- publish theses statistics in a standardised, machine-readable way on
+ publish these statistics in a standardised, machine-readable way on
the web, so that they can be freely integrated and reused in consuming
applications.</p>
<p>
- This document presents the preparatory work for a W3C recommendation
- of the RDF Data Cube Vocabulary [<cite><a href="#ref-QB-2013">QB-2013</a></cite>].
- It lists representative use cases, which were partly obtained from
- existing deployments of an earlier version of the vocabulary [<cite><a
- href="#ref-QB-2010">QB-2010</a></cite>] and partly obtained from discussions
- within the working group. This document also features a set of
- requirements that have been derived from the use cases and are
- considered in the specification.
+ In this document, the <a href="http://www.w3.org/2011/gld/">W3C
+ Government Linked Data Working Group</a> presents use cases and
+ requirements supporting a recommendation of the RDF Data Cube
+ Vocabulary [<cite><a href="#ref-QB-2013">QB-2013</a></cite>]. The
+ group obtained use cases from existing deployments of and experiences
+ with an earlier version of the data cube vocabulary [<cite><a
+ href="#ref-QB-2010">QB-2010</a></cite>]. The group also describes a set of
+ requirements derived from the use cases and to be considered in the
+ recommendation.
</p>
</section>
@@ -43,32 +44,44 @@
<a href="http://www.w3.org/2011/gld/">W3C Government Linked Data
Working Group</a>.
</p>
- <p>
- Comments on this document may be sent to <a
- href="mailto:public-gld-comments@w3.org">mailto:public-gld-comments@w3.org</a>;
- please include the text "[QB] UCR comment" in the subject line. All
- messages received at this address are viewable in a <a
- href="http://lists.w3.org/Archives/Public/public-gld-comments/">public
- archive</a>.
- </p>
</section>
<section>
<h2 id="introduction">Introduction</h2>
- The aim of this document is to present use cases (rather than general
- scenarios) that benefit from a standard vocabulary to publish
- statistics as Linked Data. These use cases are used to derive and
- justify requirements to the specification. Use cases do not necessarily
- need to be implemented, their main purpose is to document and
- illustrate design decisions.
+ The aim of this document is to present concrete use cases and
+ requirements for a vocabulary to publish statistics as Linked Data. An
+ earlier version of the data cube vocabulary [<cite><a
+ href="#ref-QB-2010">QB-2010</a></cite>] has existed for some time and
+ has proven applicable in <a
+ href="http://wiki.planet-data.eu/web/Datasets">several deployments</a>.
+ The <a href="http://www.w3.org/2011/gld/">W3C Government Linked
+ Data Working Group</a> intends to transform the data cube vocabulary into
+ a W3C recommendation of the RDF Data Cube Vocabulary [<cite><a
+ href="#ref-QB-2013">QB-2013</a></cite>]. This document describes use cases
+ and requirements derived from existing data cube deployments in order
+ to document and illustrate design decisions that have driven the work.
+ <p>The rest of this document is structured as follows. We will
+ first give a short introduction of the specificities of modelling
+ statistics. Then, we will describe use cases that have been derived
+ from existing deployments or feedback to the earlier data cube
+ vocabulary version. In particular, we describe possible benefits and
+ challenges of use cases. Afterwards, we will describe concrete
+ requirements that were derived from those use cases and that have been
+ taken into account for the specification.</p>
+
+ <p>We use the name data cube vocabulary throughout the document
+ when referring to the vocabulary.</p>
+
+ <section>
+ <h3 id="describing statistics">Describing statistics</h3>
<p>In the following, we describe the challenge of an RDF vocabulary
for publishing statistics as Linked Data.</p>
- <p>Publishing statistics - collected and aggregated numeric data -
+ <p>Describing statistics - collected and aggregated numeric data -
is challenging for the following reasons:</p>
<ul>
<li>Representing statistics requires more complex modeling as
- discussed by Martin Fowler [<cite><a href="#ref-Fowler1997">Fowler1997</a></cite>]:
+ discussed by Martin Fowler [<cite><a href="#ref-FOWLER97">FOWLER97</a></cite>]:
Recording a statistic simply as an attribute to an object (e.g., the
fact that a person weighs 185 pounds) fails with representing
important concepts such as quantity, measurement, and unit. Instead,
@@ -84,9 +97,9 @@
and machines can appropriately visualize such observations or have
conversions between different quantities.</li>
<li>Also, an observation separates a value from the actual event
- at which it was collected; for instance, one can describe the person
- that collected the observation and the time the observation was
- collected.</li>
+ at which it was collected; for instance, one can describe the
+ "Person" that collected the observation and the "Time" the
+ observation was collected.</li>
</ul>
The following figure illustrates this specificitiy of modelling in a
class diagram:
@@ -103,7 +116,7 @@
<p>
The Statistical Data and Metadata eXchange [<cite><a
href="#ref-SDMX">SDMX</a></cite>] - the ISO standard for exchanging and
- sharing of statistical data and metadata among organizations - uses
+ sharing of statistical data and metadata among organisations - uses
"multidimensional model" that caters for the specificity of modelling
statistics. It allows to describe statistics as observations.
Observations exhibit values (Measures) that depend on dimensions
@@ -112,17 +125,15 @@
<p>Since the SDMX standard has proven applicable in many contexts,
the vocabulary adopts the multidimensional model that underlies SDMX
and will be compatible to SDMX.</p>
- <p>We use the name data cube vocabulary throughout the document
- when referring to the vocabulary.</p>
- </section>
+ </section> </section>
<section>
<h2 id="terminology">Terminology</h2>
<p>
<dfn>Statistics</dfn>
is the <a href="http://en.wikipedia.org/wiki/Statistics">study</a> of
- the collection, organization, analysis, and interpretation of data.
+ the collection, organisation, analysis, and interpretation of data.
Statistics comprise statistical data.
</p>
@@ -167,7 +178,7 @@
<p>
A
<dfn>publisher</dfn>
- is a person or organization that exposes source data as Linked Data on
+ is a person or organisation that exposes source data as Linked Data on
the Web.
</p>
@@ -187,8 +198,8 @@
<section>
<h2 id="usecases">Use cases</h2>
<p>This section presents scenarios that are enabled by the
- existence of an vocabulary for the representation of statistics as
- Linked Data.</p>
+ existence of a standard vocabulary for the representation of
+ statistics as Linked Data.</p>
<section>
<h3 id="SDMXWebDisseminationUseCase">SDMX Web Dissemination Use
@@ -212,9 +223,10 @@
consumption application (consumer) that first discovers data from the
registry, then queries data from the corresponding publisher of
selected data, and then visualises the data.</p>
- <p>Abstracted from the SDMX specificities, this use case contains
- the following processes, also illustrated in a process flow diagram by
- SDMX and in more detail described as follows:</p>
+ <p>In the following, we illustrate the processes from this use case
+ in a flow diagram by SDMX and describe what activities are enabled in
+ this use case by having statistics described in a machine-readable
+ format.</p>
<p class="caption">
Figure: Process flow diagram by SDMX [<cite><a
@@ -226,20 +238,27 @@
src="./figures/SDMX_Web_Dissemination_Use_Case.png" width="1000px"></img>
</p>
<p>Benefits:</p>
- <p>A structural metadata source (registry) collects metadata about
- statistical data.</p>
- <p>A data web service (publisher) registers statistical data in a
- registry, and provides statistical data from a database and metadata
- from a metadata repository for consumers. For that, the publisher
- creates database tables (see 1 in figure), and loads statistical data
- in a database and metadata in a metadata repository.</p>
- <p>A consumer discovers data from a registry (3) and creates a
- query to the publisher for selected statistical data (4).</p>
- <p>The publisher translates the query to a query to its database
- (5) as well as metadata repository (6) and returns the statistical
- data and metadata.</p>
- <p>The consumer visualises the returned statistical data and
- metadata.</p>
+ <ul>
+ <li>A structural metadata source (registry) can collect metadata
+ about statistical data.</li>
+
+ <li>A data web service (publisher) can register statistical data
+ in a registry, and can provide statistical data from a database and
+ metadata from a metadata repository for consumers. For that, the
+ publisher creates database tables (see 1 in figure), and loads
+ statistical data in a database and metadata in a metadata repository.</li>
+
+ <li>A consumer can discover data from a registry (3) and can
+ automatically create a query to the publisher for selected
+ statistical data (4).</li>
+
+ <li>The publisher can translate the query to a query to its
+ database (5) as well as metadata repository (6) and return the
+ statistical data and metadata.</li>
+
+ <li>The consumer can visualise the returned statistical data and
+ metadata.</li>
+ </ul>
<p>Requirements:</p>
<ul>
@@ -264,7 +283,7 @@
href="#ref-COINS">COINS</a></cite>])
</span>
</p>
- <p>More and more organizations want to publish statistics on the
+ <p>More and more organisations want to publish statistics on the
web, for reasons such as increasing transparency and trust. Although
in the ideal case, published data can be understood by both humans and
machines, data often is simply published as CSV, PDF, XSL etc.,
@@ -274,52 +293,52 @@
machine-readable and application-independent description of common
statistics with use of open standards, to foster usage and innovation
on the published data.</p>
- <p>In the "COINS as Linked Data" project (Ian Dickinson et al.
- COINS as Linked Data. http://data.gov.uk/resources/coins. Last visited
- on Jan 9 2013), the Combined Online Information System (COINS)
- (Treasury's web site.
- http://www.hm-treasury.gov.uk/psr_coins_data.htm. Last visited on Jan
- 9 2013) shall be published using a standard Linked Data vocabulary.</p>
- <p>In the Combined Online Information System (COINS), HM Treasury,
- the principal custodian of financial data for the UK government,
- released previously restricted financial information about government
- spendings.</p>
+ <p>
+ In the "COINS as Linked Data" project [<cite><a
+ href="#ref-COINS">COINS</a></cite>], the Combined Online Information System
+ (COINS) shall be published using a standard Linked Data vocabulary.
+ </p>
+ <p>
+ Via the Combined Online Information System (COINS), <a
+ href="http://www.hm-treasury.gov.uk/psr_coins_data.htm">HM
+ Treasury</a>, the principal custodian of financial data for the UK
+ government, releases previously restricted financial information about
+ government spending.
+ </p>
- <p>Benefits:</p>
- According to the COINS as Linked Data project, the reason for
- publishing COINS as Linked Data are threefold.
-
+ <p>According to the COINS as Linked Data project, the reasons for
+ publishing COINS as Linked Data are threefold:</p>
<ul>
- <li>using open standard representation makes it easier to work
- with the data with available technologies and promises innovative
- third-party tools and usages</li>
- <li>individual transactions and groups of transactions are given
- an identity, and so can be referenced by web address (URL), to allow
- them to be discussed, annotated, or listed as source data for
- articles or visualizations</li>
- <li>cross-links between linked-data datasets allow for much
- richer exploration of related datasets</li>
+ <li>
+ <ul>
+ <li>using open standard representation makes it easier to work
+ with the data with available technologies and promises innovative
+ third-party tools and usages</li>
+ <li>individual transactions and groups of transactions are
+ given an identity, and so can be referenced by web address (URL),
+ to allow them to be discussed, annotated, or listed as source data
+ for articles or visualizations</li>
+ <li>cross-links between linked-data datasets allow for much
+ richer exploration of related datasets</li>
+ </ul>
+ </li>
+ <li>The COINS data has a hypercube structure. It describes
+ financial transactions using seven independent dimensions (time,
+ data-type, department etc.) and one dependent measure (value). Also,
+ it allows thirty-three attributes that may further describe each
+ transaction. For further information, see the "COINS as Linked Data"
+ project website.</li>
+ <li>COINS is an example of one of the more complex statistical
+ datasets being published via data.gov.uk.</li>
+ <li>Part of the complexity of COINS arises from the nature of the
+ data being released.</li>
+ <li>The published COINS datasets cover expenditure related to
+ five different years (2005–06 to 2009–10). The actual COINS database
+ at HM Treasury is updated daily. In principle at least, multiple
+ snapshots of the COINS data could be released through the year.</li>
</ul>
- <p>The COINS data has a hypercube structure. It describes financial
- transactions using seven independent dimensions (time, data-type,
- department etc.) and one dependent measure (value). Also, it allows
- thirty-three attributes that may further describe each transaction.
- For further information, see the "COINS as Linked Data" project
- website.</p>
-
- <p>COINS is an example of one of the more complex statistical
- datasets being publishing via data.gov.uk.</p>
-
- <p>Part of the complexity of COINS arises from the nature of the
- data being released.</p>
-
- <p>The published COINS datasets cover expenditure related to five
- different years (2005–06 to 2009–10). The actual COINS database at HM
- Treasury is updated daily. In principle at least, multiple snapshots
- of the COINS data could be released through the year.</p>
-
<p>The COINS use case leads to the following challenges:</p>
<ul>
<li>The actual data and its hypercube structure are to be
@@ -362,9 +381,6 @@
should clarify the use of subsets of observations</a></li>
</ul>
-
-
-
</section> <section>
<h3 id="PublishingExcelSpreadsheetsasLinkedData">Publisher Use
Case: Publishing Excel Spreadsheets as Linked Data</h3>
@@ -384,62 +400,80 @@
spreadsheets with several multidimensional data tables, having a name
and notes, as well as column values, row values, and cell values.</p>
<p>Benefits:</p>
- <p>The goal in this use case is to to publish spreadsheet
- information in a machine-readable format on the web, e.g., so that
- crawlers can find spreadsheets that use a certain column value. The
- published data should represent and make available for queries the
- most important information in the spreadsheets, e.g., rows, columns,
- and cell values.</p>
- <p>
- For instance, in the C<a href="http://ehumanities.nl/ceda_r/">CEDA_R</a>
- and <a href="http://www.data2semantics.org/">Data2Semantics</a>
- projects publishing and harmonizing Dutch historical census data (from
- 1795 onwards) is a goal. These censuses are now only available as
- Excel spreadsheets (obtained by data entry) that closely mimic the way
- in which the data was originally published and shall be published as
- Linked Data.
- </p>
+ <ul>
+ <li>The goal in this use case is to publish spreadsheet
+ information in a machine-readable format on the web, e.g., so that
+ crawlers can find spreadsheets that use a certain column value. The
+ published data should represent and make available for queries the
+ most important information in the spreadsheets, e.g., rows, columns,
+ and cell values.</li>
+ <li>For instance, in the <a href="http://ehumanities.nl/ceda_r/">CEDA_R</a>
+ and <a href="http://www.data2semantics.org/">Data2Semantics</a>
+ projects publishing and harmonizing Dutch historical census data
+ (from 1795 onwards) is a goal. These censuses are now only available
+ as Excel spreadsheets (obtained by data entry) that closely mimic the
+ way in which the data was originally published and shall be published
+ as Linked Data.
+ </li>
+ </ul>
<p>Challenges in this use case:</p>
- <p>All context and so all meaning of the measurement point is
- expressed by means of dimensions. The pure number is the star of an
- ego-network of attributes or dimensions. In a RDF representation it is
- then easily possible to define hierarchical relationships between the
- dimensions (that can be exemplified further) as well as mapping
- different attributes across different value points. This way a
- harmonization among variables is performed around the measurement
- points themselves.</p>
- <p>In historical research, until now, harmonization across datasets
- is performed by hand, and in subsequent iterations of a database: it
- is very hard to trace back the provenance of decisions made during the
- harmonization procedure.</p>
- <p>Combining Data Cube with SKOS to allow for cross-location and
- cross-time historical analysis</p>
- <p>Novel visualisation of census data</p>
- <p>Integration with provenance vocabularies, e.g., PROV-O, for
- tracking of harmonization steps</p>
- <p>These challenges may seem to be particular to the field of
- historical research, but in fact apply to government information at
- large. Government is not a single body that publishes information at a
- single point in time. Government consists of multiple (altering)
- bodies, scattered across multiple levels, jurisdictions and areas.
- Publishing government information in a consistent, integrated manner
- requires exactly the type of harmonization required in this use case.</p>
- <p>Excel sheets provide much flexibility in arranging information.
- It may be necessary to limit this flexibility to allow automatic
- transformation.</p>
- <p>There are many spreadsheets.</p>
- <p>Semi-structured information, e.g., notes about lineage of data
- cells, may not be possible to be formalized.</p>
- <p>Another concrete example is the Stats2RDF [1] project that
- intends to publish biomedical statistical data that is represented as
- Excel sheets. Here, Excel files are first translated into CSV and then
- translated into RDF.</p>
+
+ <ul>
+ <li>All context and so all meaning of the measurement point is
+ expressed by means of dimensions. The pure number is the star of an
+ ego-network of attributes or dimensions. In an RDF representation it
+ is then easily possible to define hierarchical relationships between
+ the dimensions (that can be exemplified further) as well as mapping
+ different attributes across different value points. This way a
+ harmonization among variables is performed around the measurement
+ points themselves.</li>
+ <li>In historical research, until now, harmonization across
+ datasets is performed by hand, and in subsequent iterations of a
+ database: it is very hard to trace back the provenance of decisions
+ made during the harmonization procedure.</li>
+ <li>Combining Data Cube with SKOS [<cite><a
+ href="#ref-skos">SKOS</a></cite>] to allow for cross-location and
+ cross-time historical analysis
+ </li>
+ <li>Novel visualisation of census data</li>
+ <li>Integration with provenance vocabularies, e.g., PROV-O, for
+ tracking of harmonization steps</li>
+ <li>These challenges may seem to be particular to the field of
+ historical research, but in fact apply to government information at
+ large. Government is not a single body that publishes information at
+ a single point in time. Government consists of multiple (altering)
+ bodies, scattered across multiple levels, jurisdictions and areas.
+ Publishing government information in a consistent, integrated manner
+ requires exactly the type of harmonization required in this use case.</li>
+ <li>Excel sheets provide much flexibility in arranging
+ information. It may be necessary to limit this flexibility to allow
+ automatic transformation.</li>
+ <li>There are many spreadsheets.</li>
+ <li>Semi-structured information, e.g., notes about lineage of
+ data cells, may be impossible to formalize.</li>
+ </ul>
+ <p>Existing work:</p>
+ <ul>
+ <li>Another concrete example is the <a
+ href="http://ontowiki.net/Projects/Stats2RDF?show_comments=1">Stats2RDF</a>
+ project that intends to publish biomedical statistical data that is
+ represented as Excel sheets. Here, Excel files are first translated
+ into CSV and then translated into RDF.
+ </li>
+ <li>Some of the challenges are met by the work on an ISO
+ Extension to SKOS [<cite><a href="#ref-xkos">XKOS</a></cite>].
+ </li>
+ </ul>
+
<p>Requirements:</p>
<ul>
<li><a
href="#Vocabularyshouldrecommendamechanismtosupporthierarchicalcodelists">Vocabulary
should recommend a mechanism to support hierarchical code lists</a></li>
+ <li><a
+ href="#Thereshouldbearecommendedwayofdeclaringrelationsbetweencubes">There
+ should be a recommended way of declaring relations between cubes</a></li>
</ul>
@@ -450,17 +484,52 @@
and Open Data Communities</h3>
<p>
<span style="font-size: 10pt">(Use case has been taken from [<cite><a
- href="#ref-SDMX-21">QB4OLAP</a></cite>])
+ href="#ref-QB4OLAP">QB4OLAP</a></cite>] and from discussions at <a
+ href="http://groups.google.com/group/publishing-statistical-data/msg/7c80f3869ff4ba0f">publishing-statistical-data
+ mailing list</a>)
</span>
</p>
<p>It often comes up in statistical data that you have some kind of
- 'overall' figure, which is then broken down into parts (GLD mailing
- list discussion.
- http://groups.google.com/group/publishing-statistical-data/msg/7c80f3869ff4ba0f).</p>
+ 'overall' figure, which is then broken down into parts.</p>
+
+ <p>Example (in pseudo-turtle RDF):</p>
+ <pre>
+ex:obs1
+ sdmx:refArea &lt;uk&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "60" .
+ex:obs2
+ sdmx:refArea &lt;england&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "50" .
+ex:obs3
+ sdmx:refArea &lt;scotland&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "5" .
+ex:obs4
+ sdmx:refArea &lt;wales&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "3" .
+ex:obs5
+ sdmx:refArea &lt;northernireland&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "2" .
+ </pre>
<p>
- Etcheverry and Vaisman [<cite><a href="#ref-SDMX-21">QB4OLAP</a></cite>]
+ We are looking for the best way (in the context of the RDF/Data
+ Cube/SDMX approach) to express that the values for England,
+ Scotland, Wales and Northern Ireland ought to add up to the value
+ for the UK and constitute a more detailed breakdown of the overall UK
+ figure. Since we might also have population figures for France,
+ Germany, EU27, it is not as simple as just taking a
+ <code>qb:Slice</code>
+ where you fix the time period and the measure.
+ </p>
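<p>As a rough illustration of the consistency condition described above, the following Python sketch checks that the regional figures add up to the overall figure. It is not part of the vocabulary: the dictionary layout, the <code>breakdown</code> mapping and the function name <code>check_breakdown</code> are assumptions made for this sketch, and the numbers are copied from the example observations.</p>

```python
# Population figures copied from the pseudo-Turtle example above.
observations = {
    "uk": 60,
    "england": 50,
    "scotland": 5,
    "wales": 3,
    "northernireland": 2,
}

# Assumed (hypothetical) description of which areas break down an overall area.
breakdown = {"uk": ["england", "scotland", "wales", "northernireland"]}


def check_breakdown(observations, breakdown):
    """Return {overall_area: (reported, summed_parts)} for every mismatch."""
    mismatches = {}
    for overall_area, parts in breakdown.items():
        summed = sum(observations[part] for part in parts)
        if summed != observations[overall_area]:
            mismatches[overall_area] = (observations[overall_area], summed)
    return mismatches


# An empty result means every overall figure equals the sum of its parts.
print(check_breakdown(observations, breakdown))
```

<p>The sketch only works because the part-of relationship between areas is stated explicitly in <code>breakdown</code>; the example observations alone do not carry that information, which is the gap this use case points at.</p>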
+
+ <p>
+ Similarly, Etcheverry and Vaisman [<cite><a href="#ref-QB4OLAP">QB4OLAP</a></cite>]
present the use case to publish household data from <a
href="http://statswales.wales.gov.uk/index.htm">StatsWales</a> and <a
href="http://opendatacommunities.org/doc/dataset/housing/household-projections">Open
@@ -468,7 +537,7 @@
</p>
<p>This multidimensional data contains for each fact a time
- dimension with one level year and a location dimension with levels
+ dimension with one level Year and a location dimension with levels
Unitary Authority, Government Office Region, Country, and ALL.</p>
<p>As unit, units of 1000 households is used.</p>
@@ -486,9 +555,8 @@
<p>Importantly, one would like to maintain the relationship between
the resulting datasets, i.e., the levels and aggregation functions.</p>
- <p>Note, this use case does not simply need a selection (or "dice"
- in OLAP context) where one fixes the time period and the measure
- (qb:Slice where you fix the time period and the measure).</p>
+ <p>Again, this use case does not simply need a selection (or "dice"
+ in OLAP context) where one fixes the time period dimension.</p>
<p>Requirements:</p>
<ul>
@@ -503,15 +571,18 @@
Use Case: Publishing slices of data about UK Bathing Water Quality</h3>
<p>
<span style="font-size: 10pt">(Use case has been provided by
- Epimorphics Ltd (<a
- href="http://www.epimorphics.com/web/projects/bathing-water-quality">http://www.epimorphics.com/web/projects/bathing-water-quality</a>))
+ Epimorphics Ltd, in their <a
+ href="http://www.epimorphics.com/web/projects/bathing-water-quality">UK
+ Bathing Water Quality</a> deployment)
</span>
</p>
- <p>As part of their work with data.gov.uk and the UK Location
- Programme Epimorphics Ltd have been working to pilot the publication
- of both current and historic bathing water quality information from
- the UK Environment Agency (http://www.environment-agency.gov.uk/) as
- Linked Data.</p>
+ <p>
+ As part of their work with data.gov.uk and the UK Location Programme
+ Epimorphics Ltd have been working to pilot the publication of both
+ current and historic bathing water quality information from the <a
+ href="http://www.environment-agency.gov.uk/">UK Environment
+ Agency</a> as Linked Data.
+ </p>
<p>The UK has a number of areas, typically beaches, that are
designated as bathing waters where people routinely enter the water.
The Environment Agency monitors and reports on the quality of the
@@ -530,34 +601,36 @@
<p>The most important dimensions of the data are bathing water,
sampling point, and compliance classification.</p>
<p>Challenges:</p>
-
- <p>Observations may exhibit a number of attributes, e.g., whether
- ther was an abnormal weather exception.</p>
- <p>
- Relevant slices of both datasets are to be created:
- <ul>
- <li>Annual Compliance Assessment Dataset: all the observations
- for a specific sampling point, all the observations for a specific
- year.</li>
- <li>In-Season Sample Assessment Dataset: samples for a given
- sampling point, samples for a given week, samples for a given year,
- samples for a given year and sampling point, latest samples for each
- sampling point.</li>
- <li>The use case suggests more arbitrary subsets of the
- observations, e.g., collecting all the "latest" observations in a
- continuously updated data set.</li>
- </ul>
+ <ul>
+ <li>Observations may exhibit a number of attributes, e.g.,
+ whether there was an abnormal weather exception.</li>
+ <li>Relevant slices of both datasets are to be created:
+ <ul>
+ <li>Annual Compliance Assessment Dataset: all the observations
+ for a specific sampling point, all the observations for a specific
+ year.</li>
+ <li>In-Season Sample Assessment Dataset: samples for a given
+ sampling point, samples for a given week, samples for a given year,
+ samples for a given year and sampling point, latest samples for
+ each sampling point.</li>
+ <li>The use case suggests more arbitrary subsets of the
+ observations, e.g., collecting all the "latest" observations in a
+ continuously updated data set.</li>
+ </ul>
- </p>
+ </li>
+ </ul>
<p>Existing Work:</p>
<ul>
- <li>Semantic Sensor Network ontology (SSN) [2] already provides a
- way to publish sensor information. SSN data provides statistical
- Linked Data and grounds its data to the domain, e.g., sensors that
- collect observations (e.g., sensors measuring average of temperature
- over location and time).</li>
- <li>A number of organizations, particularly in the Climate and
+ <li>The <a href="http://purl.oclc.org/NET/ssnx/ssn">Semantic
+ Sensor Network ontology</a> (SSN) already provides a way to publish
+ sensor information. SSN data provides statistical Linked Data and
+ grounds its data to the domain, e.g., sensors that collect
+ observations (e.g., sensors measuring average of temperature over
+ location and time).
+ </li>
+ <li>A number of organisations, particularly in the Climate and
Meteorological area already have some commitment to the OGC
"Observations and Measurements" (O&M) logical data model, also
published as ISO 19156.</li>
@@ -583,16 +656,18 @@
Linked Data Wrapper</a> and <a
href="http://eurostat.linked-statistics.org/">Linked Statistics
Eurostat Data</a>, both deployments for publishing Eurostat SDMX as
- Linked Data using the draft version of the vocabulary)
+ Linked Data using the draft version of the data cube vocabulary)
</span>
</p>
- <p>As mentioned already, the ISO standard for exchanging and
- sharing statistical data and metadata among organizations is
- Statistical Data and Metadata eXchange (SDMX). Since this standard has
- proven applicable in many contexts, we adopt the multidimensional
- model that underlies SDMX and intend the standard vocabulary to be
- compatible to SDMX.</p>
+ <p>
+ As mentioned already, the ISO standard for exchanging and sharing
+ statistical data and metadata among organisations is Statistical Data
+ and Metadata eXchange [<cite><a href="#ref-SDMX">SDMX</a></cite>].
+ Since this standard has proven applicable in many contexts, we adopt
+ the multidimensional model that underlies SDMX and intend the standard
+ vocabulary to be compatible to SDMX.
+ </p>
<p>
Therefore, in this use case we intend to explain the benefit and
@@ -602,19 +677,18 @@
warehouse as SDMX and other formats on the web. Eurostat also provides
an interface to browse and explore the datasets. However, linking such
multidimensional data to related data sets and concepts would require
- download of interesting datasets and manual integration.The goal here
- is to improve integration with other datasets; Eurostat data should be
- published on the web in a machine-readable format, possible to be
- linked with other datasets, and possible to be freeley consumed by
- applications. Both <a href="http://estatwrap.ontologycentral.com/">Eurostat
+ downloading of interesting datasets and manual integration. The goal
+ here is to improve integration with other datasets; Eurostat data
+ should be published on the web in a machine-readable format that can
+ be linked with other datasets and freely consumed
+ by applications. Both <a href="http://estatwrap.ontologycentral.com/">Eurostat
Linked Data Wrapper</a> and <a
href="http://eurostat.linked-statistics.org/">Linked Statistics
- Eurostat Data</a> intend to publish Eurostat SDMX data as Linked Data. In
- these use cases, <a
+ Eurostat Data</a> intend to publish <a
href="http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/">Eurostat
- data</a> shall be published as <a href="http://5stardata.info/">5-star
- Linked Open Data</a>. Eurostat data is partly published as SDMX, partly
- as tabular data (TSV, similar to CSV). Eurostat provides a <a
+ SDMX data</a> as <a href="http://5stardata.info/">5-star Linked Open
+ Data</a>. Eurostat data is partly published as SDMX, partly as tabular
+ data (TSV, similar to CSV). Eurostat provides a <a
href="http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=table_of_contents_en.xml">TOC
of published datasets</a> as well as a feed of modified and new datasets.
@@ -632,8 +706,9 @@
<ul>
<li>Possible implementation of ETL pipelines based on Linked Data
- technologies (e.g., LDSpider) to load the data into a data warehouse
- for analysis</li>
+ technologies (e.g., <a href="http://code.google.com/p/ldspider/">LDSpider</a>)
+ to effectively load the data into a data warehouse for analysis
+ </li>
<li>Allows useful queries to the data, e.g., comparison of
statistical indicators across EU countries.</li>
@@ -677,7 +752,7 @@
possible slices is a challenge.</li>
<li>Each dimension used by a dataset has a range of permitted
- values that ought to be represented.</li>
+ values that need to be described.</li>
<li>The Eurostat SDMX as Linked Data use case suggests to have
time lines on data aggregating over the gender dimension.</li>
@@ -709,7 +784,7 @@
</li>
- <p>Query interface</p>
+ <li>Query interface</li>
<ul>
<li>Eurostat - Linked Data provides SPARQL endpoint for the
@@ -718,34 +793,37 @@
use Qcrumb.com to query the data.</li>
</ul>
- <p>
- Browsing and visualising interface:
+ <li>Browsing and visualising interface:
<ul>
<li>Eurostat Linked Data Wrapper provides for each dataset an
HTML page showing a visualisation of the data.</li>
</ul>
- </p>
-
- <p>Non-requirements:</p>
- <ul>
- <li>One possible application would run validation checks over
- Eurostat data. The intended standard vocabulary is to publish the
- Eurostat data as-is and is not intended to represent information for
- validation (similar to business rules).</li>
- </ul>
+ </li>
+ </ul>
- <p>Requirements:</p>
- <ul>
- <li><a href="#VocabularyshouldbuildupontheSDMXinformationmodel">There
- should be mechanisms and recommendations regarding publication and
- consumption of large amounts of statistical data</a></li>
- <li><a
- href="#Thereshouldbearecommendedmechanismtoallowforpublicationofaggregateswhichcrossmultipledimensions">There
- should be a recommended mechanism to allow for publication of
- aggregates which cross multiple dimensions</a></li>
- </ul>
+ <p>Non-requirements:</p>
+ <ul>
+ <li>One possible application would run validation checks over
+ Eurostat data. The standard vocabulary is intended to publish the
+ Eurostat data as-is and is not intended to represent information for
+ validation (similar to business rules).</li>
+ <li>Information on how to match elements of the geo-spatial
+ dimension to elements of other data sources, e.g., NUTS, GADM, is not
+ part of a vocabulary recommendation.</li>
+ </ul>
+
+ <p>Requirements:</p>
+ <ul>
+ <li><a href="#VocabularyshouldbuildupontheSDMXinformationmodel">There
+ should be mechanisms and recommendations regarding publication and
+ consumption of large amounts of statistical data</a></li>
+ <li><a
+ href="#Thereshouldbearecommendedmechanismtoallowforpublicationofaggregateswhichcrossmultipledimensions">There
+ should be a recommended mechanism to allow for publication of
+ aggregates which cross multiple dimensions</a></li>
+ </ul>
</section> <section>
<h3 id="Representingrelationshipsbetweenstatisticaldata">Publisher
Use Case: Representing relationships between statistical data</h3>
@@ -792,7 +870,8 @@
the following figure.
</p>
- <p class="caption">Figure: Illustration of ETL of statistics</p>
+ <p class="caption">Figure: Illustration of ETL workflows to process
+ statistics</p>
<p align="center">
<img alt="COGS relationships between statistics example"
@@ -830,21 +909,19 @@
</ul>
<p>
- Existing Work (optional):
-
- <p>
- Possible relation to <a
+ Existing Work:
+ <ul>
+ <li>Possible relation to <a
href="http://www.w3.org/2011/gld/wiki/Best_Practices_Discussion_Summary#Versioning">Versioning</a>
- part of GLD Best Practices Document, where it is specified how to
- publish data which has multiple versions.
- </p>
- <p>
- The <a href="http://sites.google.com/site/cogsvocab/">COGS</a>
- vocabulary [<cite><a href="#ref-COGS">COGS</a></cite>] is related to
- this use case since it may complement the standard vocabulary for
- representing ETL pipelines processing statistics.
- </p>
-
+ part of GLD Best Practices Document, where it is specified how to
+ publish data which has multiple versions.
+ </li>
+ <li>The <a href="http://sites.google.com/site/cogsvocab/">COGS</a>
+ vocabulary [<cite><a href="#ref-COGS">COGS</a></cite>] is related to
+ this use case since it may complement the standard vocabulary for
+ representing ETL pipelines processing statistics.
+ </li>
+ </ul>
</p>
<p>Requirements:</p>
<ul>
@@ -946,7 +1023,7 @@
from the web as shown in the following figure:</p>
<p class="caption">Figure: An interactive chart in GPDE for
- visualising Eurostat data in the DSPL</p>
+ visualising Eurostat data described with DSPL</p>
<p align="center">
<img
alt="An interactive chart in GPDE for visualising Eurostat data in the DSPL"
@@ -974,10 +1051,15 @@
without knowing the data.</li>
</ul>
- <p>Unanticipated Uses (optional): DSPL is representative for using
- statistical data published on the web in available tools for analysis.
- Similar tools that may be automatically covered are: Weka (arff data
- format), Tableau, SPSS, STATA, PC-Axis etc.</p>
+ <p>
+ Non-requirements:
+ <ul>
+ <li>DSPL is representative of using statistical data published
+ on the web in available tools for analysis. Similar tools that may
+ be automatically covered are: Weka (ARFF data format), Tableau,
+ SPSS, STATA, PC-Axis, etc.</li>
+ </ul>
+ </p>
<p>Requirements:</p>
<ul>
@@ -1025,8 +1107,8 @@
An example scenario of this use case is the Financial Information
Observation System (FIOS) [<cite><a href="#ref-FIOS">FIOS</a></cite>],
where XBRL data provided by the SEC on the web is to be re-published
- as Linked Data and made analysable for stakeholders in a web-based
- OLAP client Saiku.
+ as Linked Data so that stakeholders can explore and analyse it in
+ the web-based OLAP client Saiku.
</p>
<p>The following figure shows an example of using FIOS. Here, for
@@ -1066,6 +1148,8 @@
<li>Depending on the expressivity of the OLAP queries (e.g.,
aggregation functions, hierarchies, ordering), performance plays an
important role.</li>
+ <li>OLAP systems have to cater for possibly missing information
+ (e.g., the aggregation function or a human-readable label).</li>
</ul>
@@ -1105,22 +1189,30 @@
<p>
A concrete use case is the structured collection of <a
href="http://wiki.planet-data.eu/web/Datasets">RDF Data Cube
- Vocabulary datasets</a> in the PlanetData Wiki. It is supposed to list
- statistical data published. This list is supposed to describe the
- formal RDF descriptions on a higher level and to provide a useful
- overview of RDF Data Cube deployments in the Linked Data cloud.
+ Vocabulary datasets</a> in the PlanetData Wiki. This list is supposed to
+ describe statistical datasets on a higher level - for easy discovery
+ and selection - and to provide a useful overview of RDF Data Cube
+ deployments in the Linked Data cloud.
</p>
- <p>Unanticipated Uses: If data catalogs contain statistics, they do
- not expose those using Linked Data but for instance using CSV or HTML
- (e.g., Pangea). It could also be a use case to publish such data using
- the standard vocabulary.</p>
- <p>
- Existing Work: The <a href="http://www.w3.org/TR/vocab-dcat/">Data
- Catalog vocabulary</a> (DCAT) is strongly related to this use case since
- it may complement the standard vocabulary for representing statistics
- in the case of registering data in a data catalog.
- </p>
+ <p>Unanticipated Uses:</p>
+
+ <ul>
+ <li>If data catalogs contain statistics, they do not expose them
+ as Linked Data but, for instance, as CSV or HTML (e.g., Pangea).
+ Publishing such data using the data cube vocabulary could also be
+ a use case.</li>
+ </ul>
+
+ <p>Existing Work:</p>
+ <ul>
+ <li>The <a href="http://www.w3.org/TR/vocab-dcat/">Data
+ Catalog vocabulary</a> (DCAT) is strongly related to this use case since
+ it may complement the standard vocabulary for representing statistics
+ in the case of registering data in a data catalog.
+ </li>
+ </ul>
+
<p>Requirements:</p>
<ul>
@@ -1201,19 +1293,40 @@
<h3
id="Vocabularyshouldrecommendamechanismtosupporthierarchicalcodelists">Vocabulary
should recommend a mechanism to support hierarchical code lists</h3>
- <p>First, hierarchical code lists may be supported via SKOS. Allow
- for cross-location and cross-time analysis of statistical datasets.</p>
- <p>Second, one can think of non-SKOS hierarchical code lists. E.g.,
- if simple skos:narrower/skos:broader relationships are not sufficient
- or if a vocabulary uses specific hierarchical properties, e.g.,
- geo:containedIn.</p>
- <p>Also, the use of hierarchy levels needs to be clarified. It has
- been suggested, to allow skos:Collections as value of qb:codeList.</p>
+ <p>
+ First, hierarchical code lists may be supported via SKOS [<cite><a
+ href="#ref-skos">SKOS</a></cite>]. This allows for cross-location and cross-time
+ analysis of statistical datasets.
+ </p>
+ <p>
+ Second, one can think of non-SKOS hierarchical code lists. E.g., if
+ simple
+ <code>skos:narrower</code>
+ /
+ <code>skos:broader</code>
+ relationships are not sufficient or if a vocabulary uses specific
+ hierarchical properties, e.g.,
+ <code>geo:containedIn</code>
+ .
+ </p>
+ <p>
+ Also, the use of hierarchy levels needs to be clarified. It has been
+ suggested to allow instances of
+ <code>skos:Collection</code>
+ as value of
+ <code>qb:codeList</code>
+ .
+ </p>
<p>Background information:</p>
<ul>
<li>Issue: <a href="http://www.w3.org/2011/gld/track/issues/31">http://www.w3.org/2011/gld/track/issues/31</a></li>
<li>Issue: <a href="http://www.w3.org/2011/gld/track/issues/39">http://www.w3.org/2011/gld/track/issues/39</a>
</li>
+ <li>Discussion at publishing-statistical-data mailing list: <a
+ href="http://groups.google.com/group/publishing-statistical-data/msg/7c80f3869ff4ba0f">http://groups.google.com/group/publishing-statistical-data/msg/7c80f3869ff4ba0f</a></li>
+ <li>Part of the requirement is met by the work on an ISO
+ Extension to SKOS [<cite><a href="#ref-xkos">XKOS</a></cite>]
+ </li>
</ul>
<p>Required by:</p>
@@ -1226,12 +1339,12 @@
<h3
id="VocabularyshoulddefinerelationshiptoISO19156ObservationsMeasurements">Vocabulary
should define relationship to ISO19156 - Observations & Measurements</h3>
- <p>An number of organizations, particularly in the Climate and
+ <p>A number of organisations, particularly in the Climate and
Meteorological area already have some commitment to the OGC
"Observations and Measurements" (O&M) logical data model, also
published as ISO 19156. Are there any statements about compatibility
and interoperability between O&M and Data Cube that can be made to
- give guidance to such organizations?</p>
+ give guidance to such organisations?</p>
<p>Background information:</p>
<ul>
<li>Issue: <a href="http://www.w3.org/2011/gld/track/issues/32">http://www.w3.org/2011/gld/track/issues/32</a></li>
@@ -1272,6 +1385,10 @@
<p>Background information:</p>
<ul>
<li>Issue: <a href="http://www.w3.org/2011/gld/track/issues/30">http://www.w3.org/2011/gld/track/issues/30</a></li>
+ <li>Discussion in <a
+ href="http://groups.google.com/group/publishing-statistical-data/browse_thread/thread/75762788de10de95">publishing-statistical-data
+ mailing list</a>
+ </li>
</ul>
<p>Required by:</p>
@@ -1375,7 +1492,7 @@
<dd>
Ian Dickinson et al., COINS as Linked Data <a
href="http://data.gov.uk/resources/coins">http://data.gov.uk/resources/coins</a>,
- Last visited on Jan 9 2013
+ last visited on Jan 9 2013
</dd>
<dt id="ref-FIOS">[FIOS]</dt>
@@ -1386,7 +1503,7 @@
</dd>
- <dt id="ref-Fowler1997">[Fowler1997]</dt>
+ <dt id="ref-FOWLER97">[FOWLER97]</dt>
<dd>Fowler, Martin (1997). Analysis Patterns: Reusable Object
Models. Addison-Wesley. ISBN 0201895420.</dd>
@@ -1451,14 +1568,20 @@
<dd>
SMDX - SDMX User Guide Version 2009.1, <a
href="http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf">http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf</a>,
- Last visited Jan 8 2013.
+ last visited Jan 8 2013.
</dd>
<dt id="ref-SDMX-21">[SMDX 2.1]</dt>
<dd>
SDMX 2.1 User Guide Version. Version 0.1 - 19/09/2012. <a
href="http://sdmx.org/wp-content/uploads/2012/11/SDMX_2-1_User_Guide_draft_0-1.pdf">http://sdmx.org/wp-content/uploads/2012/11/SDMX_2-1_User_Guide_draft_0-1.pdf</a>.
- Last visited on 8 Jan 2013.
+ last visited on 8 Jan 2013.
+ </dd>
+
+ <dt id="ref-xkos">[XKOS]</dt>
+ <dd>
+ Extended Knowledge Organization System (XKOS), <a
+ href="https://github.com/linked-statistics/xkos">https://github.com/linked-statistics/xkos</a>
</dd>
</dl>