--- a/data-cube-ucr/index.html Mon Feb 25 16:52:05 2013 +0100
+++ b/data-cube-ucr/index.html Mon Feb 25 16:52:18 2013 +0100
@@ -1,68 +1,148 @@
<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
- "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
-<html xmlns="http://www.w3.org/1999/xhtml">
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
+
<head>
+<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<title>Use Cases and Requirements for the Data Cube Vocabulary</title>
-<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-<script type="text/javascript" src='../respec/respec3/builds/respec-w3c-common.js' class='remove'></script>
+
+<script type="text/javascript"
+ src='../respec/respec3/builds/respec-w3c-common.js' class='remove'></script>
<script src="respec-ref.js"></script>
<script src="respec-config.js"></script>
<link rel="stylesheet" type="text/css" href="local-style.css" />
</head>
+
<body>
<section id="abstract">
<p>Many national, regional and local governments, as well as other
- organizations inside and outside of the public sector, create
- statistics. There is a need to publish those statistics in a
- standardized, machine-readable way on the web, so that statistics can
- be freely integrated and reused in consuming applications. This
- document is a collection of use cases for a standard vocabulary to
- publish statistics as Linked Data.</p>
+		organizations inside and outside of the public sector collect numeric
+		data and aggregate this data into statistics. There is a need to
+		publish these statistics in a standardised, machine-readable way on
+		the web, so that they can be freely integrated and reused in consuming
+		applications.</p>
+ <p>
+ This document presents the preparatory work for a W3C recommendation
+ of the RDF Data Cube Vocabulary [<cite><a href="#ref-QB-2013">QB-2013</a></cite>].
+ It lists representative use cases, which were partly obtained from
+ existing deployments of an earlier version of the vocabulary [<cite><a
+ href="#ref-QB-2010">QB-2010</a></cite>] and partly obtained from discussions
+ within the working group. This document also features a set of
+ requirements that have been derived from the use cases and are
+ considered in the specification.
+ </p>
</section>
<section id="sotd">
<p>
- This is a working document of the <a
- href="http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary">Data
- Cube Vocabulary project</a> within the <a
- href="http://www.w3.org/2011/gld/">W3C Government Linked Data
- Working Group</a>. Feedback is welcome and should be sent to the <a
- href="mailto:public-gld-comments@w3.org">public-gld-comments@w3.org
- mailing list</a>.
+ This document is an editorial update to an Editor's Draft of the "Use
+ Cases and Requirements for the Data Cube Vocabulary" developed by the
+ <a href="http://www.w3.org/2011/gld/">W3C Government Linked Data
+ Working Group</a>.
+ </p>
+ <p>
+ Comments on this document may be sent to <a
+ href="mailto:public-gld-comments@w3.org">mailto:public-gld-comments@w3.org</a>;
+ please include the text "[QB] UCR comment" in the subject line. All
+ messages received at this address are viewable in a <a
+ href="http://lists.w3.org/Archives/Public/public-gld-comments/">public
+ archive</a>.
</p>
</section>
<section>
- <h2>Introduction</h2>
+ <h2 id="introduction">Introduction</h2>
+	<p>The aim of this document is to present use cases (rather than general
+	scenarios) that benefit from a standard vocabulary to publish
+	statistics as Linked Data. These use cases are used to derive and
+	justify requirements for the specification. Use cases do not necessarily
+	need to be implemented; their main purpose is to document and
+	illustrate design decisions.</p>
- <p>Many national, regional and local governments, as well as other
- organizations inside and outside of the public sector, create
- statistics. There is a need to publish those statistics in a
- standardized, machine-readable way on the web, so that statistics can
- be freely linked, integrated and reused in consuming applications.
- This document is a collection of use cases for a standard vocabulary
- to publish statistics as Linked Data.</p>
+		<p>In the following, we describe the challenges that an RDF vocabulary
+			for publishing statistics as Linked Data has to address.</p>
+ <p>Publishing statistics - collected and aggregated numeric data -
+ is challenging for the following reasons:</p>
+ <ul>
+			<li>Representing statistics requires more complex modeling, as
+				discussed by Martin Fowler [<cite><a href="#ref-Fowler1997">Fowler1997</a></cite>]:
+				recording a statistic simply as an attribute of an object (e.g., the
+				fact that a person weighs 185 pounds) fails to represent
+				important concepts such as quantity, measurement, and unit. Instead,
+				a statistic is modeled as a distinguishable object, an observation.
+			</li>
+			<li>The object describes an observation of a value, e.g., a
+				numeric value (e.g., 185) in the case of a measurement or a categorical
+				value (e.g., "blood group A") in the case of a categorical observation.</li>
+			<li>To allow correct interpretation of the value, the object can
+				be further described by "dimensions", e.g., the specific phenomenon
+				"weight" observed and the unit "pounds". Given background
+				information, e.g., arithmetical and comparative operations, humans
+				and machines can appropriately visualize such observations or
+				convert between different quantities.</li>
+			<li>Also, an observation separates a value from the actual event
+				at which it was collected; for instance, one can describe the person
+				who collected the observation and the time at which the observation
+				was collected.</li>
+ </ul>
+		<p>The following figure illustrates this specificity of modelling in a
+			class diagram:</p>
+
+ <p class="caption">Figure demonstrating specificity of modelling a
+ statistic</p>
+
+ <p align="center">
+ <img alt="specificity of modelling a
+ statistic"
+ src="./figures/modeling_quantity_measurement_observation.png"></img>
+ </p>
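+
+	<p>As a purely illustrative sketch (all terms in the
+		<code>ex:</code> namespace are invented for this example and belong to
+		no vocabulary discussed in this document), the contrast between
+		recording a statistic as a plain attribute and modelling it as an
+		observation can be expressed in Turtle as follows:</p>
+	<pre>
+@prefix ex:  &lt;http://example.org/&gt; .
+@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
+
+# A naive attribute-style triple loses the unit, the phenomenon observed,
+# and the circumstances of collection:
+ex:alice ex:weight 185 .
+
+# Modelling the statistic as a distinguishable observation keeps them explicit:
+ex:observation1 a ex:Observation ;
+    ex:observedProperty ex:weight ;
+    ex:value            185 ;
+    ex:unit             ex:pound ;
+    ex:collectedBy      ex:nurseJones ;
+    ex:collectedAt      "2013-01-09"^^xsd:date .
+</pre>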
+
+ <p>
+ The Statistical Data and Metadata eXchange [<cite><a
+ href="#ref-SDMX">SDMX</a></cite>] - the ISO standard for exchanging and
+ sharing of statistical data and metadata among organizations - uses
+ "multidimensional model" that caters for the specificity of modelling
+ statistics. It allows to describe statistics as observations.
+ Observations exhibit values (Measures) that depend on dimensions
+ (Members of Dimensions).
+ </p>
+ <p>Since the SDMX standard has proven applicable in many contexts,
+ the vocabulary adopts the multidimensional model that underlies SDMX
+		and will be compatible with SDMX.</p>
+	<p>We use the name "data cube vocabulary" throughout this document
+		when referring to the vocabulary.</p>
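+	<p>The following Turtle sketch is not normative: the dimension,
+		measure, and attribute properties in the <code>ex:</code> namespace are
+		invented for illustration, while the <code>qb:</code> terms refer to the
+		vocabulary under development [<cite><a href="#ref-QB-2013">QB-2013</a></cite>].
+		It shows how a single observation of the multidimensional model might be
+		represented: a measure value that depends on three dimensions, with a
+		unit attached as an attribute:</p>
+	<pre>
+@prefix qb:  &lt;http://purl.org/linked-data/cube#&gt; .
+@prefix ex:  &lt;http://example.org/&gt; .
+@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
+
+# One observation: the unemployment rate for a given area, period, and sex.
+ex:obs-ie-2012-female a qb:Observation ;
+    qb:dataSet          ex:unemploymentDataSet ;  # the data set it belongs to
+    ex:refArea          ex:Ireland ;              # dimension
+    ex:refPeriod        "2012"^^xsd:gYear ;       # dimension
+    ex:sex              ex:female ;               # dimension
+    ex:unemploymentRate 14.2 ;                    # measure
+    ex:unit             ex:percent .              # attribute
+</pre>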
</section>
<section>
- <h2>Terminology</h2>
+ <h2 id="terminology">Terminology</h2>
<p>
<dfn>Statistics</dfn>
is the <a href="http://en.wikipedia.org/wiki/Statistics">study</a> of
- the collection, organization, analysis, and interpretation of data. A
- statistic is a statistical dataset.
+ the collection, organization, analysis, and interpretation of data.
+ Statistics comprise statistical data.
</p>
<p>
- A
- <dfn>statistical dataset</dfn>
- comprises multidimensional data - a set of observed values organized
- along a group of dimensions, together with associated metadata. Basic
- structure of (aggregated) statistical data is a multidimensional table
- (also called a cube) <a href="#ref-SDMX">[SDMX]</a>.
+
+ The basic structure of
+ <dfn>statistical data</dfn>
+ is a multidimensional table (also called a data cube) [<cite><a
+ href="#ref-SDMX">SDMX</a></cite>], i.e., a set of observed values organized
+		along a group of dimensions, together with associated metadata. If
+		aggregated, we refer to statistical data as "macro-data"; if
+		not, we refer to it as "micro-data".
+ </p>
+ <p>
+ Statistical data can be collected in a
+		<dfn>dataset</dfn>,
+		typically published and maintained by an organisation [<cite><a
+ href="#ref-SDMX">SDMX</a></cite>]. The dataset contains metadata, e.g.,
+ about the time of collection and publication or about the maintaining
+ and publishing organisation.
</p>
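+	<p>As an illustration only (this document does not fix the terms for
+		such metadata; Dublin Core properties are used here merely as one
+		plausible choice), the metadata of a data set might look as follows:</p>
+	<pre>
+@prefix dcterms: &lt;http://purl.org/dc/terms/&gt; .
+@prefix ex:      &lt;http://example.org/&gt; .
+@prefix xsd:     &lt;http://www.w3.org/2001/XMLSchema#&gt; .
+
+# Hypothetical metadata attached to a statistical data set.
+ex:unemploymentDataSet
+    dcterms:title     "Unemployment rates by area, period and sex" ;
+    dcterms:publisher ex:NationalStatisticsOffice ;
+    dcterms:issued    "2013-01-09"^^xsd:date .
+</pre>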
<p>
@@ -96,506 +176,88 @@
<dfn>consumer</dfn>
is a person or agent that uses Linked Data from the Web.
</p>
-
+ <p>
+ A
+ <dfn>registry</dfn>
+		collects metadata about statistical data through a registration process.
+ </p>
</section>
<section>
- <h2>Use cases</h2>
- <p>
- This section presents scenarios that would be enabled by the existence
- of a standard vocabulary for the representation of statistics as
- Linked Data. Since a draft of the specification of the cube vocabulary
- has been published, and the vocabulary already is in use, we will call
- this standard vocabulary after its current name RDF Data Cube
- vocabulary (short <a href="#ref-QB">[QB]</a>) throughout the document.
- </p>
- <p>We distinguish between use cases of publishing statistical data,
- and use cases of consuming statistical data since requirements for
- publishers and consumers of statistical data differ.</p>
- <section>
- <h3>Publishing statistical data</h3>
+ <h2 id="usecases">Use cases</h2>
+ <p>This section presents scenarios that are enabled by the
+		existence of a vocabulary for the representation of statistics as
+ Linked Data.</p>
<section>
- <h4>Publishing general statistics in a machine-readable and
- application-independent way (UC 1)</h4>
- <p>More and more organizations want to publish statistics on the
- web, for reasons such as increasing transparency and trust. Although
- in the ideal case, published data can be understood by both humans and
- machines, data often is simply published as CSV, PDF, XSL etc.,
- lacking elaborate metadata, which makes free usage and analysis
- difficult.</p>
-
- <p>The goal in this use case is to use a machine-readable and
- application-independent description of common statistics with use of
- open standards. The use case is fulfilled if QB will be a Linked Data
- vocabulary for encoding statistical data that has a hypercube
- structure and as such can describe common statistics in a
- machine-readable and application-independent way.</p>
-
+ <h3 id="SDMXWebDisseminationUseCase">SDMX Web Dissemination Use
+ Case</h3>
<p>
- An example scenario of this use case has been to publish the Combined
- Online Information System (<a
- href="http://data.gov.uk/resources/coins">COINS</a>). There, HM
- Treasury, the principal custodian of financial data for the UK
- government, released previously restricted information from its
- Combined Online Information System (COINS). Five data files were
- released containing between 3.3 and 4.9 million rows of data. The
- COINS dataset was translated into RDF for two reasons:
+ <span style="font-size: 10pt">(Use case taken from SDMX Web
+ Dissemination Use Case [<cite><a href="#ref-SDMX-21">SDMX
+ 2.1</a></cite>]
+ </span>
</p>
+ <p>Since we have adopted the multidimensional model that underlies
+				SDMX, we also adopt the "Web Dissemination Use Case", which is the
+				prime use case for SDMX, as it is an increasingly popular use of SDMX
+ and enables organisations to build a self-updating dissemination
+ system.</p>
+			<p>The Web Dissemination Use Case contains three actors: a
+ structural metadata web service (registry) that collects metadata
+ about statistical data in a registration fashion, a data web service
+ (publisher) that publishes statistical data and its metadata as
+ registered in the structural metadata web service, and a data
+ consumption application (consumer) that first discovers data from the
+ registry, then queries data from the corresponding publisher of
+ selected data, and then visualises the data.</p>
+ <p>Abstracted from the SDMX specificities, this use case contains
+				the following processes, also illustrated in a process flow diagram by
+				SDMX and described in more detail below:</p>
- <ol>
- <li>To publish statistics (e.g., as data files) are too large to
- load into widely available analysis tools such as Microsoft Excel, a
- common tool-of-choice for many data investigators.</li>
- <li>COINS is a highly technical information source, requiring
- both domain and technical skills to make useful applications around
- the data.</li>
- </ol>
- <p>Publishing statistics is challenging for the several reasons:</p>
- <p>
- Representing observations and measurements requires more complex
- modeling as discussed by Martin Fowler <a href="#Fowler1997">[Fowler,
- 1997]</a>: Recording a statistic simply as an attribute to an object
- (e.g., a the fact that a person weighs 185 pounds) fails with
- representing important concepts such as quantity, measurement, and
- observation.
+ <p class="caption">
+ Process flow diagram by SDMX [<cite><a href="#ref-SDMX-21">SDMX
+ 2.1</a></cite>]
</p>
- <p>Quantity comprises necessary information to interpret the value,
- e.g., the unit and arithmetical and comparative operations; humans and
- machines can appropriately visualize such quantities or have
- conversions between different quantities.</p>
-
- <p>A Measurement separates a quantity from the actual event at
- which it was collected; a measurement assigns a quantity to a specific
- phenomenon type (e.g., strength). Also, a measurement can record
- metadata such as who did the measurement (person), and when was it
- done (time).</p>
-
- <p>Observations, eventually, abstract from measurements only
- recording numeric quantities. An Observation can also assign a
- category observation (e.g., blood group A) to an observation. Figure
- demonstrates this relationship.</p>
- <p>
- <div class="fig">
- <a href="figures/modeling_quantity_measurement_observation.png"><img
- src="figures/modeling_quantity_measurement_observation.png"
- alt="Modeling quantity, measurement, observation" /> </a>
- <div>Modeling quantity, measurement, observation</div>
- </div>
- </div>
+ <p align="center">
+ <img alt="SDMX Web Dissemination Use Case"
+ src="./figures/SDMX_Web_Dissemination_Use_Case.png"></img>
</p>
+ <p>Benefits:</p>
+ <p>A structural metadata source (registry) collects metadata about
+ statistical data.</p>
+ <p>A data web service (publisher) registers statistical data in a
+ registry, and provides statistical data from a database and metadata
+ from a metadata repository for consumers. For that, the publisher
+				creates database tables (see 1 in the figure), and loads statistical data
+				into a database and metadata into a metadata repository.</p>
+ <p>A consumer discovers data from a registry (3) and creates a
+ query to the publisher for selected statistical data (4).</p>
+			<p>The publisher translates the query into queries against its database
+				(5) and its metadata repository (6) and returns the statistical
+ data and metadata.</p>
+ <p>The consumer visualises the returned statistical data and
+ metadata.</p>
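+			<p>This document does not prescribe how a registry entry is
+				represented. Purely as a sketch, and assuming (hypothetically, for this
+				example) that DCAT and Dublin Core terms are used, the record a consumer
+				discovers in step (3) might resemble the following; the access URL is a
+				placeholder:</p>
+			<pre>
+@prefix dcat:    &lt;http://www.w3.org/ns/dcat#&gt; .
+@prefix dcterms: &lt;http://purl.org/dc/terms/&gt; .
+@prefix ex:      &lt;http://example.org/&gt; .
+
+# A registry record pointing the consumer to the publisher's data service.
+ex:registryEntry-unemployment a dcat:Dataset ;
+    dcterms:title     "Unemployment rates by area, period and sex" ;
+    dcterms:publisher ex:NationalStatisticsOffice ;
+    dcat:distribution [
+        a dcat:Distribution ;
+        dcat:accessURL &lt;http://example.org/data/unemployment&gt;
+    ] .
+</pre>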
- <p>QB deploys the multidimensional model (made of observations with
- Measures depending on Dimensions and Dimension Members, and further
- contextualized by Attributes) and should cater for these complexity in
- modelling.</p>
- <p>Another challenge is that for brevity reasons and to avoid
- repetition, it is useful to have abbreviation mechanisms such as
- assigning overall valid properties of observations at the dataset or
- slice level, and become implicitly part of each observation. For
- instance, in the case of COINS, all of the values are in thousands of
- pounds sterling. However, one of the use cases for the linked data
- version of COINS is to allow others to link to individual
- observations, which suggests that these observations should be
- standalone and self-contained – and should therefore have explicit
- multipliers and units on each observation. One suggestion is to author
- data without the duplication, but have the data publication tools
- "flatten" the compact representation into standalone observations
- during the publication process.</p>
- <p>A further challenge is related to slices of data. Slices of data
- group observations that are of special interest, e.g., slices
- unemployment rates per year of a specific gender are suitable for
- direct visualization in a line diagram. However, depending on the
- number of Dimensions, the number of possible slices can become large
- which makes it difficult to select all interesting slices. Therefore,
- and because of their additional complexity, not many publishers create
- slices. In fact, it is somewhat unclear at this point which slices
- through the data will be useful to (COINS-RDF) users.</p>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): -</p>
+ <p>Requirements:</p>
- </section> <section>
- <h4>Publishing one or many MS excel spreadsheet files with
- statistical data on the web (UC 2)</h4>
- <p>Not only in government, there is a need to publish considerable
- amounts of statistical data to be consumed in various (also
- unexpected) application scenarios. Typically, Microsoft Excel sheets
- are made available for download. Those excel sheets contain single
- spreadsheets with several multidimensional data tables, having a name
- and notes, as well as column values, row values, and cell values.</p>
- <p>The goal in this use case is to to publish spreadsheet
- information in a machine-readable format on the web, e.g., so that
- crawlers can find spreadsheets that use a certain column value. The
- published data should represent and make available for queries the
- most important information in the spreadsheets, e.g., rows, columns,
- and cell values. QB should provide the level of detail that is needed
- for such a transformation in order to fulfil this use case.</p>
- <p>In a possible use case scenario an institution wants to develop
- or use a software that transforms their excel sheets into the
- appropriate format.</p>
-
- <p class="editorsnote">@@TODO: Concrete example needed.</p>
- <p>Challenges of this use case are:</p>
- <ul>
- <li>Excel sheets provide much flexibility in arranging
- information. It may be necessary to limit this flexibility to allow
- automatic transformation.</li>
- <li>There may be many spreadsheets.</li>
- <li>Semi-structured information, e.g., notes about lineage of
- data cells, may not be possible to be formalized.</li>
- </ul>
- <p>Unanticipated Uses (optional): -</p>
- <p>
- Existing Work (optional): Stats2RDF uses OntoWiki to translate CSV
- into QB <a href="http://aksw.org/Projects/Stats2RDF">[Stats2RDF]</a>.
- </p>
+ <p>The SDMX Web Dissemination Use Case can be concretised by
+ several sub-use cases, detailed in the following sections.</p>
</section> <section>
- <h4>Publishing SDMX as Linked Data (UC 3)</h4>
- <p>The ISO standard for exchanging and sharing statistical data and
- metadata among organizations is Statistical Data and Metadata eXchange
- (SDMX). Since this standard has proven applicable in many contexts, QB
- is designed to be compatible with the multidimensional model that
- underlies SDMX.</p>
- <p class="editorsnote">@@TODO: The QB spec should maybe also use
- the term "multidimensional model" instead of the less clear "cube
- model" term.</p>
- <p>Therefore, it should be possible to re-publish SDMX data using
- QB.</p>
- <p>
- The scenario for this use case is Eurostat <a
- href="http://epp.eurostat.ec.europa.eu/">[EUROSTAT]</a>, which
- publishes large amounts of European statistics coming from a data
- warehouse as SDMX and other formats on the web. Eurostat also provides
- an interface to browse and explore the datasets. However, linking such
- multidimensional data to related data sets and concepts would require
- download of interesting datasets and manual integration.
- </p>
- <p>The goal of this use case is to improve integration with other
- datasets; Eurostat data should be published on the web in a
- machine-readable format, possible to be linked with other datasets,
- and possible to be freeley consumed by applications. This use case is
- fulfilled if QB can be used for publishing the data from Eurostat as
- Linked Data for integration.</p>
- <p>A publisher wants to make available Eurostat data as Linked
- Data. The statistical data shall be published as is. It is not
- necessary to represent information for validation. Data is read from
- tsv only. There are two concrete examples of this use case: Eurostat
- Linked Data Wrapper (http://estatwrap.ontologycentral.com/), and
- Linked Statistics Eurostat Data
- (http://eurostat.linked-statistics.org/). They have slightly different
- focus (e.g., with respect to completeness, performance, and agility).
- </p>
- <p>Challenges of this use case are:</p>
- <ul>
- <li>There are large amounts of SDMX data; the Eurostat dataset
- comprises 350 GB of data. This may influence decisions about toolsets
- and architectures to use. One important task is to decide whether to
- structure the data in separate datasets.</li>
- <li>Again, the question comes up whether slices are useful.</li>
- </ul>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): -</p>
- </section> <section>
- <h4>Publishing sensor data as statistics (UC 4)</h4>
- <p>Typically, multidimensional data is aggregated. However, there
- are cases where non-aggregated data needs to be published, e.g.,
- observational, sensor network and forecast data sets. Such raw data
- may be available in RDF, already, but using a different vocabulary.</p>
- <p>The goal of this use case is to demonstrate that publishing of
- aggregate values or of raw data should not make much of a difference
- in QB.</p>
- <p>
- For example the Environment Agency uses it to publish (at least
- weekly) information on the quality of bathing waters around England
- and Wales <A
- href="http://www.epimorphics.com/web/wiki/bathing-water-quality-structure-published-linked-data">[EnvAge]</A>.
- In another scenario DERI tracks from measurements about printing for a
- sustainability report. In the DERI scenario, raw data (number of
- printouts per person) is collected, then aggregated on a unit level,
- and then modelled using QB.
- </p>
- <p>Problems and Limitations:</p>
- <ul>
- <li>This use case also shall demonstrate how to link statistics
- with other statistics or non-statistical data (metadata).</li>
- </ul>
- <p>Unanticipated Uses (optional): -</p>
+ <h3 id="COINS">Publisher Use Case: UK government financial data
+ from Combined Online Information System (COINS)</h3>
<p>
- Existing Work (optional): Semantic Sensor Network ontology <A
- href="http://purl.oclc.org/NET/ssnx/ssn">[SSN]</A> already provides a
- way to publish sensor information. SSN data provides statistical
- Linked Data and grounds its data to the domain, e.g., sensors that
- collect observations (e.g., sensors measuring average of temperature
- over location and time). A number of organizations, particularly in
- the Climate and Meteorological area already have some commitment to
- the OGC "Observations and Measurements" (O&M) logical data model, also
- published as ISO 19156. The QB spec should maybe also prefer the term
- "multidimensional model" instead of the less clear "cube model" term.
-
-
-
- <p class="editorsnote">@@TODO: Are there any statements about
- compatibility and interoperability between O&M and Data Cube that can
- be made to give guidance to such organizations?</p>
- </p>
- </section> <section>
- <h4>Registering statistical data in dataset catalogs (UC 5)</h4>
- <p>
- After statistics have been published as Linked Data, the question
- remains how to communicate the publication and let users find the
- statistics. There are catalogs to register datasets, e.g., CKAN, <a
- href="http://www.datacite.org/datacite.org">datacite.org</a>, <a
- href="http://www.gesis.org/dara/en/home/?lang=en">da|ra</a>, and <a
- href="http://pangaea.de/">Pangea</a>. Those catalogs require specific
- configurations to register statistical data.
+ <span style="font-size: 10pt">(This use case has been
+ summarised from Ian Dickinson et al. (COINS as Linked Data.
+					http://data.gov.uk/resources/coins. Last visited on Jan 9 2013).)</span>
</p>
- <p>The goal of this use case is to demonstrate how to expose and
- distribute statistics after modeling using QB. For instance, to allow
- automatic registration of statistical data in such catalogs, for
- finding and evaluating datasets. To solve this issue, it should be
- possible to transform QB data into formats that can be used by data
- catalogs.</p>
-
- <p class="editorsnote">@@TODO: Find specific use case scenario or
- ask how other publishers of QB data have dealt with this issue Maybe
- relation to DCAT?</p>
- <p>Problems and Limitations: -</p>
- <p>Unanticipated Uses (optional): If data catalogs contain
- statistics, they do not expose those using Linked Data but for
- instance using CSV or HTML (Pangea [11]). It could also be a use case
- to publish such data using QB.</p>
- <p>Existing Work (optional): -</p>
- </section> <section>
- <h4>Making transparent transformations on or different versions of
- statistical data (UC 6)</h4>
- <p>Statistical data often is used and further transformed for
- analysis and reporting. There is the risk that data has been
- incorrectly transformed so that the result is not interpretable any
- more. Therefore, if statistical data has been derived from other
- statistical data, this should be made transparent.</p>
- <p>The goal of this use case is to describe provenance and
- versioning around statistical data, so that the history of statistics
- published on the web becomes clear. This may also relate to the issue
- of having relationships between datasets published using QB. To fulfil
- this use case QB should recommend specific approaches to transforming
- and deriving of datasets which can be tracked and stored with the
- statistical data.</p>
-
- <p>A simple specific use case is that the Welsh Assembly government
- publishes a variety of population datasets broken down in different
- ways. For many uses then population broken down by some category (e.g.
- ethnicity) is expressed as a percentage. Separate datasets give the
- actual counts per category and aggregate counts. In such cases it is
- common to talk about the denominator (often DENOM) which is the
- aggregate count against which the percentages can be interpreted.</p>
- <p>Challenges of this use case are:</p>
- <ul>
- <li>Operations on statistical data result in new statistical
- data, depending on the operation. For intance, in terms of Data Cube,
- operations such as slice, dice, roll-up, drill-down will result in
- new Data Cubes. This may require representing general relationships
- between cubes (as discussed here: [12]).</li>
- <li>Should Data Cube support explicit declaration of such
- relationships either between separated qb:DataSets or between
- measures with a single qb:DataSet (e.g. ex:populationCount and
- ex:populationPercent)?</li>
- <li>If so should that be scoped to simple, common relationships
- like DENOM or allow expression of arbitrary mathematical relations?</li>
- </ul>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): Possible relation to Best Practices
- part on Versioning [13], where it is specified how to publish data
- which has multiple versions.</p>
-
-
- </section></section> <section>
- <h3>Consuming published statistical data</h3>
+ </section> </section>
<section>
- <h4>Simple chart visualizations of (integrated) published
- statistical datasets (UC 7)</h4>
- <p>Data that is published on the Web is typically visualized by
- transforming it manually into CSV or Excel and then creating a
- visualization on top of these formats using Excel, Tableau,
- RapidMiner, Rattle, Weka etc.</p>
- <p>This use case shall demonstrate how statistical data published
- on the web can be directly visualized, without using commercial or
- highly-complex tools. This use case is fulfilled if data that is
- published in QB can be directly visualized inside a webpage.</p>
- <p>An example scenario is environmental research done within the
- SMART research project (http://www.iwrm-smart.org/). Here, statistics
- about environmental aspects (e.g., measurements about the climate in
- the Lower Jordan Valley) shall be visualized for scientists and
- decision makers. Statistics should also be possible to be integrated
- and displayed together. The data is available as XML files on the web.
- On a separate website, specific parts of the data shall be queried and
- visualized in simple charts, e.g., line diagrams. The following figure
- shows the wanted display of an environmental measure over time for
- three regions in the lower Jordan valley; displayed inside a web page:</p>
-
- <p>
- <div class="fig">
- <a href="figures/Level_above_msl_3_locations.png"><img
- width="800px" src="figures/Level_above_msl_3_locations.png"
- alt="Line chart visualization of QB data" /> </a>
- <div>Line chart visualization of QB data</div>
- </div>
- </div>
- </p>
-
- <p>The following figure shows the same measures in a pivot table.
- Here, the aggregate COUNT of measures per cell is given.</p>
-
- <p>
- <div class="fig">
- <a href="figures/pivot_analysis_measurements.PNG"><img
- src="figures/pivot_analysis_measurements.PNG"
- alt="Pivot analysis measurements" /> </a>
- <div>Pivot analysis measurements</div>
- </div>
- </div>
- </p>
-
- <p>The use case uses Google App Engine, Qcrumb.com, and Spark. An
- example of a line diagram is given at [14] (some loading time needed).
- Current work tries to integrate current datasets with additional data
- sources, and then having queries that take data from both datasets and
- display them together.</p>
- <p>Challenges of this use case are:</p>
- <ul>
- <li>The difficulties lay in structuring the data appropriately so
- that the specific information can be queried.</li>
- <li>Also, data shall be published with having potential
- integration in mind. Therefore, e.g., units of measurements need to
- be represented.</li>
- <li>Integration becomes much more difficult if publishers use
- different measures, dimensions.</li>
-
- </ul>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): -</p>
- </section> <section>
- <h4>Uploading published statistical data in Google Public Data
- Explorer (UC 8)</h4>
- <p>Google Public Data Explorer (GPDE -
- http://code.google.com/apis/publicdata/) provides an easy possibility
- to visualize and explore statistical data. Data needs to be in the
- Dataset Publishing Language (DSPL -
- https://developers.google.com/public-data/overview) to be uploaded to
- the data explorer. A DSPL dataset is a bundle that contains an XML
- file, the schema, and a set of CSV files, the actual data. Google
- provides a tutorial to create a DSPL dataset from your data, e.g., in
- CSV. This requires a good understanding of XML, as well as a good
- understanding of the data that shall be visualized and explored.</p>
- <p>In this use case, it shall be demonstrate how to take any
- published QB dataset and to transform it automatically into DSPL for
- visualization and exploration. A dataset that is published conforming
- to QB will provide the level of detail that is needed for such a
- transformation.</p>
- <p>In an example scenario, a publisher P has published data using
- QB. There are two different ways to fulfil this use case: 1) A
- customer C is downloading this data into a triple store; SPARQL
- queries on this data can be used to transform the data into DSPL and
- uploaded and visualized using GPDE. 2) or, one or more XLST
- transformation on the RDF/XML transforms the data into DSPL.</p>
- <p>Challenges of this use case are:</p>
- <ul>
- <li>The technical challenges for the consumer here lay in knowing
- where to download what data and how to get it transformed into DSPL
- without knowing the data.</li>
- <p>Unanticipated Uses (optional): DSPL is representative for using
- statistical data published on the web in available tools for
- analysis. Similar tools that may be automatically covered are: Weka
- (arff data format), Tableau, etc.</p>
- <p>Existing Work (optional): -</p>
- </ul>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): -</p>
- </section> <section>
- <h4>Allow Online Analytical Processing on published datasets of
- statistical data (UC 9)</h4>
- <p>Online Analytical Processing [15] is an analysis method on
- multidimensional data. It is an explorative analysis methode that
- allows users to interactively view the data on different angles
- (rotate, select) or granularities (drill-down, roll-up), and filter it
- for specific information (slice, dice).</p>
- <p>The multidimensional model used in QB to model statistics should
- be usable by OLAP systems. More specifically, data that conforms to QB
- can be used to define a Data Cube within an OLAP engine and can then
- be queries by OLAP clients.</p>
- <p>An example scenario of this use case is the Financial
- Information Observation System (FIOS) [16], where XBRL data has been
- re-published using QB and made analysable for stakeholders in a
- web-based OLAP client. The following figure shows an example of using
- FIOS. Here, for three different companies, cost of goods sold as
- disclosed in XBRL documents are analysed. As cell values either the
- number of disclosures or - if only one available - the actual number
- in USD is given:</p>
-
- <p>
- <div class="fig">
- <a href="figures/FIOS_example.PNG"><img
- src="figures/FIOS_example.PNG" alt="OLAP of QB data" /> </a>
- <div>OLAP of QB data</div>
- </div>
- </div>
- </p>
- <p>Challenges of this use case are:</p>
- <ul>
- <li>A problem lies in the strict separation between queries for
- the structure of data, and queries for actual aggregated values.</li>
- <li>Another problem lies in defining Data Cubes without greater
- insight in the data beforehand.</li>
- <li>Depending on the expressivity of the OLAP queries (e.g.,
- aggregation functions, hierarchies, ordering), performance plays an
- important role.</li>
- <li>QB allows flexibility in describing statistics, e.g., in
- order to reduce redundancy of information in single observations.
- These alternatives make general consumption of QB data more complex.
- Also, it is not clear, what "conforms" to QB means, e.g., is a
- qb:DataStructureDefinition required?</li>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): -</p>
- </ul>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): -</p>
- </section> <section>
- <h4>Transforming published statistics into XBRL (UC 10)</h4>
- <p>XBRL is a standard data format for disclosing financial
- information. Typically, financial data is not managed within the
- organization using XBRL but instead, internal formats such as excel or
- relational databases are used. If different data sources are to be
- summarized in XBRL data formats to be published, an internally-used
- standard format such as QB could help integrate and transform the data
- into the appropriate format.</p>
- <p>In this use case data that is available as data conforming to QB
- should also be possible to be automatically transformed into such XBRL
- data format. This use case is fulfilled if QB contains necessary
- information to derive XBRL data.</p>
- <p>In an example scenario, DERI has had a use case to publish
- sustainable IT information as XBRL to the Global Reporting Initiative
- (GRI - https://www.globalreporting.org/). Here, raw data (number of
- printouts per person) is collected, then aggregated on a unit level
- and modelled using QB. QB data shall then be used directly to fill-in
- XBRL documents that can be published to the GRI.</p>
- <p>Challenges of this use case are:</p>
- <ul>
- <li>So far, QB data has been transformed into semantic XBRL, a
- vocabulary closer to XBRL. There is the chance that certain
- information required in a GRI XBRL document cannot be encoded using a
- vocabulary as general as QB. In this case, QB could be used in
- concordance with semantic XBRL.</li>
- </ul>
- <p class="editorsnote">@@TODO: Add link to semantic XBRL.</p>
- <p>Unanticipated Uses (optional): -</p>
- <p>Existing Work (optional): -</p>
-
- </section> </section></section>
- <section>
- <h2>Requirements</h2>
+ <h2 id="requirements">Requirements</h2>
<p>The use cases presented in the previous section give rise to the
following requirements for a standard representation of statistics.
@@ -684,6 +346,41 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
</pre>
<p>What is the best way (in the context of the RDF/Data Cube/SDMX
approach) to express that the values for the England/Scotland/Wales/
@@ -793,7 +490,7 @@
<p>Required by: UC7, UC8, UC9, UC10</p>
</section> </section> </section>
<section class="appendix">
- <h2>Acknowledgments</h2>
+ <h2 id="acknowledgements">Acknowledgements</h2>
<p>The editors are very thankful for comments and suggestions ...</p>
</section>
@@ -802,18 +499,32 @@
<dl>
<dt id="ref-SDMX">[SMDX]</dt>
<dd>
- SMDX - User Guide 2009, <a
- href="http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf">http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf</a>
+			SDMX User Guide Version 2009.1, <a
+ href="http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf">http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf</a>,
+ last visited Jan 8 2013.
</dd>
- <dt id="ref-SDMX">[Fowler1997]</dt>
+ <dt id="ref-SDMX-21">[SMDX 2.1]</dt>
+ <dd>
+			SDMX 2.1 User Guide, Version 0.1 - 19/09/2012. <a
+ href="http://sdmx.org/wp-content/uploads/2012/11/SDMX_2-1_User_Guide_draft_0-1.pdf">http://sdmx.org/wp-content/uploads/2012/11/SDMX_2-1_User_Guide_draft_0-1.pdf</a>.
+ Last visited on 8 Jan 2013.
+ </dd>
+
+ <dt id="ref-Fowler1997">[Fowler1997]</dt>
<dd>Fowler, Martin (1997). Analysis Patterns: Reusable Object
Models. Addison-Wesley. ISBN 0201895420.</dd>
- <dt id="ref-QB">[QB]</dt>
+ <dt id="ref-QB">[QB-2010]</dt>
<dd>
RDF Data Cube vocabulary, <a
- href="http://dvcs.w3.org/hg/gld/raw-file/default/data-cube/index.html">http://dvcs.w3.org/hg/gld/raw-file/default/data-cube/index.html</a>
+ href="http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html">http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html</a>
+ </dd>
+
+ <dt id="ref-QB">[QB-2013]</dt>
+ <dd>
+ RDF Data Cube vocabulary, <a
+ href="http://www.w3.org/TR/vocab-data-cube/">http://www.w3.org/TR/vocab-data-cube/</a>
</dd>
<dt id="ref-OLAP">[OLAP]</dt>