--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data-cube-ucr/data-cube-ucr-20120222/index.html Wed Feb 27 23:44:50 2013 +0100
@@ -0,0 +1,860 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
+ "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<title>Use Cases and Requirements for the Data Cube Vocabulary</title>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
+<script type="text/javascript"
+ src="http://dev.w3.org/2009/dap/ReSpec.js/js/respec.js" class="remove"></script>
+<script src="respec-ref.js"></script>
+<script src="respec-config.js"></script>
+<link rel="stylesheet" type="text/css" href="local-style.css" />
+</head>
+<body>
+
+ <section id="abstract">
+ <p>Many national, regional and local governments, as well as other
+ organizations inside and outside of the public sector, create
+ statistics. There is a need to publish those statistics in a
+ standardized, machine-readable way on the web, so that statistics can
+ be freely integrated and reused in consuming applications. This
+ document is a collection of use cases for a standard vocabulary to
+ publish statistics as Linked Data.</p>
+ </section>
+
+ <section id="sotd">
+ <p>
+ This is a working document of the <a
+ href="http://www.w3.org/2011/gld/wiki/Data_Cube_Vocabulary">Data
+ Cube Vocabulary project</a> within the <a
+ href="http://www.w3.org/2011/gld/">W3C Government Linked Data
+ Working Group</a>. Feedback is welcome and should be sent to the <a
+ href="mailto:public-gld-comments@w3.org">public-gld-comments@w3.org
+ mailing list</a>.
+ </p>
+ </section>
+
+ <section>
+ <h2>Introduction</h2>
+
+ <p>Many national, regional and local governments, as well as other
+ organizations inside and outside of the public sector, create
+ statistics. There is a need to publish those statistics in a
+ standardized, machine-readable way on the web, so that statistics can
+ be freely linked, integrated and reused in consuming applications.
+ This document is a collection of use cases for a standard vocabulary
+ to publish statistics as Linked Data.</p>
+ </section>
+
+
+ <section>
+ <h2>Terminology</h2>
+ <p>
+ <dfn>Statistics</dfn>
+ is the <a href="http://en.wikipedia.org/wiki/Statistics">study</a> of
+ the collection, organization, analysis, and interpretation of data. A
+ statistic is a statistical dataset.
+ </p>
+
+ <p>
+ A
+ <dfn>statistical dataset</dfn>
+ comprises multidimensional data - a set of observed values organized
+ along a group of dimensions, together with associated metadata. The
+ basic structure of (aggregated) statistical data is a multidimensional
+ table (also called a cube) <a href="#ref-SDMX">[SDMX]</a>.
+ </p>
+
+ <p>
+ <dfn>Source data</dfn>
+ is data from datastores such as RDBs or spreadsheets that acts as a
+ source for the Linked Data publishing process.
+ </p>
+
+ <p>
+ <dfn>Metadata</dfn>
+ about statistics defines the data structure and gives contextual
+ information about the statistics.
+ </p>
+
+ <p>
+ A format is
+ <dfn>machine-readable</dfn>
+ if it is amenable to automated processing by a machine, as opposed to
+ presentation to a human user.
+ </p>
+
+ <p>
+ A
+ <dfn>publisher</dfn>
+ is a person or organization that exposes source data as Linked Data on
+ the Web.
+ </p>
+
+ <p>
+ A
+ <dfn>consumer</dfn>
+ is a person or agent that uses Linked Data from the Web.
+ </p>
+
+ </section>
+
+
+ <section>
+ <h2>Use cases</h2>
+ <p>
+ This section presents scenarios that would be enabled by the existence
+ of a standard vocabulary for the representation of statistics as
+ Linked Data. Since a draft of the specification of the cube vocabulary
+ has been published, and the vocabulary is already in use, we refer to
+ this standard vocabulary by its current name, RDF Data Cube
+ vocabulary (QB for short, <a href="#ref-QB">[QB]</a>), throughout the document.
+ </p>
+ <p>We distinguish between use cases of publishing statistical data,
+ and use cases of consuming statistical data since requirements for
+ publishers and consumers of statistical data differ.</p>
+ <section>
+ <h3>Publishing statistical data</h3>
+
+ <section>
+ <h4>Publishing general statistics in a machine-readable and
+ application-independent way (UC 1)</h4>
+ <p>More and more organizations want to publish statistics on the
+ web, for reasons such as increasing transparency and trust. Although
+ in the ideal case published data can be understood by both humans and
+ machines, data is often simply published as CSV, PDF, XLS etc.,
+ lacking elaborate metadata, which makes free usage and analysis
+ difficult.</p>
+
+ <p>The goal in this use case is to use a machine-readable and
+ application-independent description of common statistics based on
+ open standards. The use case is fulfilled if QB is a Linked Data
+ vocabulary for encoding statistical data that has a hypercube
+ structure and as such can describe common statistics in a
+ machine-readable and application-independent way.</p>
+
+ <p>
+ An example scenario of this use case is the publication of data from
+ the Combined Online Information System (<a
+ href="http://data.gov.uk/resources/coins">COINS</a>). HM
+ Treasury, the principal custodian of financial data for the UK
+ government, released previously restricted information from COINS.
+ Five data files were released, containing between 3.3 and 4.9 million
+ rows of data. The COINS dataset was translated into RDF for two
+ reasons:
+ </p>
+
+ <ol>
+ <li>The published statistics (e.g., as data files) are too large to
+ load into widely available analysis tools such as Microsoft Excel, a
+ common tool of choice for many data investigators.</li>
+ <li>COINS is a highly technical information source, requiring
+ both domain and technical skills to make useful applications around
+ the data.</li>
+ </ol>
+ <p>Publishing statistics is challenging for several reasons:</p>
+ <p>
+ Representing observations and measurements requires more complex
+ modeling, as discussed by Martin Fowler <a href="#Fowler1997">[Fowler,
+ 1997]</a>: Recording a statistic simply as an attribute of an object
+ (e.g., the fact that a person weighs 185 pounds) fails to
+ represent important concepts such as quantity, measurement, and
+ observation.
+ </p>
+
+ <p>A Quantity comprises the information necessary to interpret the value,
+ e.g., the unit and arithmetical and comparative operations; humans and
+ machines can then appropriately visualize such quantities or
+ convert between different quantities.</p>
+
+ <p>A Measurement separates a quantity from the actual event at
+ which it was collected; a measurement assigns a quantity to a specific
+ phenomenon type (e.g., strength). Also, a measurement can record
+ metadata such as who did the measurement (person), and when it was
+ done (time).</p>
+
+ <p>Observations, finally, generalize measurements, which only
+ record numeric quantities. An Observation can also record a
+ category (e.g., blood group A) instead of a numeric quantity. The
+ following figure demonstrates this relationship.</p>
+ <p>
+ <div class="fig">
+ <a href="figures/modeling_quantity_measurement_observation.png"><img
+ src="figures/modeling_quantity_measurement_observation.png"
+ alt="Modeling quantity, measurement, observation" /> </a>
+ <div>Modeling quantity, measurement, observation</div>
+ </div>
+ </p>
+
+ <p>QB deploys the multidimensional model (made of observations with
+ Measures depending on Dimensions and Dimension Members, and further
+ contextualized by Attributes) and should cater for this complexity in
+ modelling.</p>
+ <p>Another challenge is that, for brevity and to avoid
+ repetition, it is useful to have abbreviation mechanisms such as
+ assigning properties that are valid for all observations at the dataset
+ or slice level, so that they implicitly become part of each observation. For
+ instance, in the case of COINS, all of the values are in thousands of
+ pounds sterling. However, one of the use cases for the linked data
+ version of COINS is to allow others to link to individual
+ observations, which suggests that these observations should be
+ standalone and self-contained – and should therefore have explicit
+ multipliers and units on each observation. One suggestion is to author
+ data without the duplication, but have the data publication tools
+ "flatten" the compact representation into standalone observations
+ during the publication process.</p>
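+ <p>As a rough sketch of such flattening (not a mechanism prescribed
+ by QB), the following Turtle first states the unit and multiplier once
+ at the dataset level and then repeats them on a standalone
+ observation. The ex: names and the amount are hypothetical, and the
+ use of sdmx-attribute:unitMult with a power-of-ten value of 3 for
+ "thousands" is an assumption made for illustration:</p>
+ <pre>
+@prefix qb: &lt;http://purl.org/linked-data/cube#&gt; .
+@prefix sdmx-attribute: &lt;http://purl.org/linked-data/sdmx/2009/attribute#&gt; .
+@prefix ex: &lt;http://example.org/coins#&gt; .
+
+# Compact form: unit and multiplier are stated once for the whole dataset.
+ex:dataset a qb:DataSet ;
+ sdmx-attribute:unitMeasure ex:GBP ;
+ sdmx-attribute:unitMult 3 .
+
+ex:obs1 a qb:Observation ;
+ qb:dataSet ex:dataset ;
+ ex:amount 1250 .
+
+# Flattened form: the same observation after a publication tool has
+# copied the dataset-level properties onto it, making it self-contained.
+ex:obs1 a qb:Observation ;
+ qb:dataSet ex:dataset ;
+ sdmx-attribute:unitMeasure ex:GBP ;
+ sdmx-attribute:unitMult 3 ;
+ ex:amount 1250 .
+ </pre>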
+ <p>A further challenge is related to slices of data. Slices of data
+ group observations that are of special interest, e.g., slices of
+ unemployment rates per year for a specific gender are suitable for
+ direct visualization in a line diagram. However, depending on the
+ number of Dimensions, the number of possible slices can become large,
+ which makes it difficult to select all interesting slices. Therefore,
+ and because of their additional complexity, not many publishers create
+ slices. In fact, it is somewhat unclear at this point which slices
+ through the data will be useful to (COINS-RDF) users.</p>
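+ <p>For illustration only, a slice that fixes the gender and leaves
+ the year free might be described roughly as follows. The ex: names and
+ the observation value are hypothetical, and the qb: and sdmx- terms
+ are assumed to be used as in the current QB draft:</p>
+ <pre>
+@prefix qb: &lt;http://purl.org/linked-data/cube#&gt; .
+@prefix sdmx-dimension: &lt;http://purl.org/linked-data/sdmx/2009/dimension#&gt; .
+@prefix sdmx-code: &lt;http://purl.org/linked-data/sdmx/2009/code#&gt; .
+@prefix ex: &lt;http://example.org/unemployment#&gt; .
+
+# Slice key: names the dimension that is fixed within each slice.
+ex:sliceBySex a qb:SliceKey ;
+ qb:componentProperty sdmx-dimension:sex .
+
+# Slice: groups the yearly observations for one gender.
+ex:sliceFemale a qb:Slice ;
+ qb:sliceStructure ex:sliceBySex ;
+ sdmx-dimension:sex sdmx-code:sex-F ;
+ qb:observation ex:obs2010F, ex:obs2011F .
+
+ex:dataset a qb:DataSet ;
+ qb:slice ex:sliceFemale .
+
+# One of the grouped observations (made-up value).
+ex:obs2010F a qb:Observation ;
+ qb:dataSet ex:dataset ;
+ sdmx-dimension:refPeriod "2010" ;
+ ex:unemploymentRate 7.1 .
+ </pre>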
+ <p>Unanticipated Uses (optional): -</p>
+ <p>Existing Work (optional): -</p>
+
+ </section> <section>
+ <h4>Publishing one or many MS Excel spreadsheet files with
+ statistical data on the web (UC 2)</h4>
+ <p>Not only in government, there is a need to publish considerable
+ amounts of statistical data to be consumed in various (also
+ unexpected) application scenarios. Typically, Microsoft Excel sheets
+ are made available for download. Those Excel sheets contain single
+ spreadsheets with several multidimensional data tables, having a name
+ and notes, as well as column values, row values, and cell values.</p>
+ <p>The goal in this use case is to publish spreadsheet
+ information in a machine-readable format on the web, e.g., so that
+ crawlers can find spreadsheets that use a certain column value. The
+ published data should represent and make available for queries the
+ most important information in the spreadsheets, e.g., rows, columns,
+ and cell values. QB should provide the level of detail that is needed
+ for such a transformation in order to fulfil this use case.</p>
+ <p>In a possible use case scenario, an institution wants to develop
+ or use software that transforms its Excel sheets into the
+ appropriate format.</p>
+
+ <p class="editorsnote">@@TODO: Concrete example needed.</p>
+ <p>Challenges of this use case are:</p>
+ <ul>
+ <li>Excel sheets provide much flexibility in arranging
+ information. It may be necessary to limit this flexibility to allow
+ automatic transformation.</li>
+ <li>There may be many spreadsheets.</li>
+ <li>Semi-structured information, e.g., notes about lineage of
+ data cells, may be impossible to formalize.</li>
+ </ul>
+ <p>Unanticipated Uses (optional): -</p>
+ <p>
+ Existing Work (optional): Stats2RDF uses OntoWiki to translate CSV
+ into QB <a href="http://aksw.org/Projects/Stats2RDF">[Stats2RDF]</a>.
+ </p>
+
+ </section> <section>
+ <h4>Publishing SDMX as Linked Data (UC 3)</h4>
+ <p>The ISO standard for exchanging and sharing statistical data and
+ metadata among organizations is Statistical Data and Metadata eXchange
+ (SDMX). Since this standard has proven applicable in many contexts, QB
+ is designed to be compatible with the multidimensional model that
+ underlies SDMX.</p>
+ <p class="editorsnote">@@TODO: The QB spec should maybe also use
+ the term "multidimensional model" instead of the less clear "cube
+ model" term.</p>
+ <p>Therefore, it should be possible to re-publish SDMX data using
+ QB.</p>
+ <p>
+ The scenario for this use case is Eurostat <a
+ href="http://epp.eurostat.ec.europa.eu/">[EUROSTAT]</a>, which
+ publishes large amounts of European statistics coming from a data
+ warehouse as SDMX and other formats on the web. Eurostat also provides
+ an interface to browse and explore the datasets. However, linking such
+ multidimensional data to related data sets and concepts would require
+ download of interesting datasets and manual integration.
+ </p>
+ <p>The goal of this use case is to improve integration with other
+ datasets; Eurostat data should be published on the web in a
+ machine-readable format so that it can be linked with other datasets
+ and freely consumed by applications. This use case is
+ fulfilled if QB can be used for publishing the data from Eurostat as
+ Linked Data for integration.</p>
+ <p>A publisher wants to make Eurostat data available as Linked
+ Data. The statistical data shall be published as is. It is not
+ necessary to represent information for validation. Data is read from
+ TSV only. There are two concrete examples of this use case: Eurostat
+ Linked Data Wrapper (http://estatwrap.ontologycentral.com/), and
+ Linked Statistics Eurostat Data
+ (http://eurostat.linked-statistics.org/). They have a slightly different
+ focus (e.g., with respect to completeness, performance, and agility).
+ </p>
+ <p>Challenges of this use case are:</p>
+ <ul>
+ <li>There are large amounts of SDMX data; the Eurostat dataset
+ comprises 350 GB of data. This may influence decisions about toolsets
+ and architectures to use. One important task is to decide whether to
+ structure the data in separate datasets.</li>
+ <li>Again, the question comes up whether slices are useful.</li>
+ </ul>
+ <p>Unanticipated Uses (optional): -</p>
+ <p>Existing Work (optional): -</p>
+ </section> <section>
+ <h4>Publishing sensor data as statistics (UC 4)</h4>
+ <p>Typically, multidimensional data is aggregated. However, there
+ are cases where non-aggregated data needs to be published, e.g.,
+ observational, sensor network and forecast data sets. Such raw data
+ may be available in RDF, already, but using a different vocabulary.</p>
+ <p>The goal of this use case is to demonstrate that publishing of
+ aggregate values or of raw data should not make much of a difference
+ in QB.</p>
+ <p>
+ For example, the Environment Agency uses QB to publish (at least
+ weekly) information on the quality of bathing waters around England
+ and Wales <a
+ href="http://www.epimorphics.com/web/wiki/bathing-water-quality-structure-published-linked-data">[EnvAge]</a>.
+ In another scenario, DERI tracks measurements about printing for a
+ sustainability report. In the DERI scenario, raw data (number of
+ printouts per person) is collected, then aggregated on a unit level,
+ and then modelled using QB.
+ </p>
+ <p>Problems and Limitations:</p>
+ <ul>
+ <li>This use case also shall demonstrate how to link statistics
+ with other statistics or non-statistical data (metadata).</li>
+ </ul>
+ <p>Unanticipated Uses (optional): -</p>
+ <p>
+ Existing Work (optional): Semantic Sensor Network ontology <a
+ href="http://purl.oclc.org/NET/ssnx/ssn">[SSN]</a> already provides a
+ way to publish sensor information. SSN data provides statistical
+ Linked Data and grounds its data to the domain, e.g., sensors that
+ collect observations (e.g., sensors measuring the average temperature
+ over location and time). A number of organizations, particularly in
+ the Climate and Meteorological area, already have some commitment to
+ the OGC "Observations and Measurements" (O&amp;M) logical data model, also
+ published as ISO 19156. The QB spec should maybe also prefer the term
+ "multidimensional model" instead of the less clear "cube model" term.
+ </p>
+
+ <p class="editorsnote">@@TODO: Are there any statements about
+ compatibility and interoperability between O&amp;M and Data Cube that can
+ be made to give guidance to such organizations?</p>
+ </section> <section>
+ <h4>Registering statistical data in dataset catalogs (UC 5)</h4>
+ <p>
+ After statistics have been published as Linked Data, the question
+ remains how to communicate the publication and let users find the
+ statistics. There are catalogs to register datasets, e.g., CKAN, <a
+ href="http://www.datacite.org/datacite.org">datacite.org</a>, <a
+ href="http://www.gesis.org/dara/en/home/?lang=en">da|ra</a>, and <a
+ href="http://pangaea.de/">PANGAEA</a>. Those catalogs require specific
+ configurations to register statistical data.
+ </p>
+ <p>The goal of this use case is to demonstrate how to expose and
+ distribute statistics after modeling using QB, for instance, to allow
+ automatic registration of statistical data in such catalogs for
+ finding and evaluating datasets. To solve this issue, it should be
+ possible to transform QB data into formats that can be used by data
+ catalogs.</p>
+
+ <p class="editorsnote">@@TODO: Find specific use case scenario or
+ ask how other publishers of QB data have dealt with this issue. Maybe
+ relation to DCAT?</p>
+ <p>Problems and Limitations: -</p>
+ <p>Unanticipated Uses (optional): If data catalogs contain
+ statistics, they do not expose those using Linked Data but, for
+ instance, using CSV or HTML (PANGAEA [11]). It could also be a use case
+ to publish such data using QB.</p>
+ <p>Existing Work (optional): -</p>
+ </section> <section>
+ <h4>Making transparent transformations on or different versions of
+ statistical data (UC 6)</h4>
+ <p>Statistical data often is used and further transformed for
+ analysis and reporting. There is the risk that data has been
+ incorrectly transformed so that the result is not interpretable any
+ more. Therefore, if statistical data has been derived from other
+ statistical data, this should be made transparent.</p>
+ <p>The goal of this use case is to describe provenance and
+ versioning around statistical data, so that the history of statistics
+ published on the web becomes clear. This may also relate to the issue
+ of having relationships between datasets published using QB. To fulfil
+ this use case, QB should recommend specific approaches to transforming
+ and deriving datasets which can be tracked and stored with the
+ statistical data.</p>
+ <p class="editorsnote">@@TODO: Add concrete example use case
+ scenario.</p>
+ <p>Challenges of this use case are:</p>
+ <ul>
+ <li>Operations on statistical data result in new statistical
+ data, depending on the operation. For instance, in terms of Data Cube,
+ operations such as slice, dice, roll-up, and drill-down will result in
+ new Data Cubes. This may require representing general relationships
+ between cubes (as discussed here: [12]).</li>
+ </ul>
+ <p>Unanticipated Uses (optional): -</p>
+ <p>Existing Work (optional): Possible relation to Best Practices
+ part on Versioning [13], where it is specified how to publish data
+ which has multiple versions.</p>
+
+
+ </section></section> <section>
+ <h3>Consuming published statistical data</h3>
+
+ <section>
+ <h4>Simple chart visualizations of (integrated) published
+ statistical datasets (UC 7)</h4>
+ <p>Data that is published on the Web is typically visualized by
+ transforming it manually into CSV or Excel and then creating a
+ visualization on top of these formats using Excel, Tableau,
+ RapidMiner, Rattle, Weka etc.</p>
+ <p>This use case shall demonstrate how statistical data published
+ on the web can be directly visualized, without using commercial or
+ highly-complex tools. This use case is fulfilled if data that is
+ published in QB can be directly visualized inside a webpage.</p>
+ <p>An example scenario is environmental research done within the
+ SMART research project (http://www.iwrm-smart.org/). Here, statistics
+ about environmental aspects (e.g., measurements about the climate in
+ the Lower Jordan Valley) shall be visualized for scientists and
+ decision makers. It should also be possible to integrate statistics
+ and display them together. The data is available as XML files on the web.
+ On a separate website, specific parts of the data shall be queried and
+ visualized in simple charts, e.g., line diagrams. The following figure
+ shows the desired display of an environmental measure over time for
+ three regions in the Lower Jordan Valley, displayed inside a web page:</p>
+
+ <p>
+ <div class="fig">
+ <a href="figures/Level_above_msl_3_locations.png"><img
+ width="800px" src="figures/Level_above_msl_3_locations.png"
+ alt="Line chart visualization of QB data" /> </a>
+ <div>Line chart visualization of QB data</div>
+ </div>
+ </p>
+
+ <p>The following figure shows the same measures in a pivot table.
+ Here, the aggregate COUNT of measures per cell is given.</p>
+
+ <p>
+ <div class="fig">
+ <a href="figures/pivot_analysis_measurements.PNG"><img
+ src="figures/pivot_analysis_measurements.PNG"
+ alt="Pivot analysis measurements" /> </a>
+ <div>Pivot analysis measurements</div>
+ </div>
+ </p>
+
+ <p>The use case uses Google App Engine, Qcrumb.com, and Spark. An
+ example of a line diagram is given at [14] (some loading time needed).
+ Current work tries to integrate current datasets with additional data
+ sources, and then to run queries that take data from both datasets and
+ display them together.</p>
+ <p>Challenges of this use case are:</p>
+ <ul>
+ <li>The difficulties lie in structuring the data appropriately so
+ that the specific information can be queried.</li>
+ <li>Also, data shall be published with potential
+ integration in mind. Therefore, e.g., units of measurement need to
+ be represented.</li>
+ <li>Integration becomes much more difficult if publishers use
+ different measures or dimensions.</li>
+
+ </ul>
+ <p>Unanticipated Uses (optional): -</p>
+ <p>Existing Work (optional): -</p>
+ </section> <section>
+ <h4>Uploading published statistical data in Google Public Data
+ Explorer (UC 8)</h4>
+ <p>Google Public Data Explorer (GPDE -
+ http://code.google.com/apis/publicdata/) provides an easy way
+ to visualize and explore statistical data. Data needs to be in the
+ Dataset Publishing Language (DSPL -
+ https://developers.google.com/public-data/overview) to be uploaded to
+ the data explorer. A DSPL dataset is a bundle that contains an XML
+ file (the schema) and a set of CSV files (the actual data). Google
+ provides a tutorial to create a DSPL dataset from your data, e.g., in
+ CSV. This requires a good understanding of XML, as well as a good
+ understanding of the data that shall be visualized and explored.</p>
+ <p>In this use case, it shall be demonstrated how to take any
+ published QB dataset and transform it automatically into DSPL for
+ visualization and exploration. A dataset that is published conforming
+ to QB will provide the level of detail that is needed for such a
+ transformation.</p>
+ <p>In an example scenario, a publisher P has published data using
+ QB. There are two different ways to fulfil this use case: 1) A
+ consumer C downloads this data into a triple store; SPARQL
+ queries on this data can be used to transform the data into DSPL, which
+ is then uploaded and visualized using GPDE. 2) Alternatively, one or more
+ XSLT transformations on the RDF/XML transform the data into DSPL.</p>
+ <p>Challenges of this use case are:</p>
+ <ul>
+ <li>The technical challenges for the consumer here lie in knowing
+ where to download what data and how to get it transformed into DSPL
+ without knowing the data.</li>
+ </ul>
+ <p>Unanticipated Uses (optional): DSPL is representative of using
+ statistical data published on the web in available tools for
+ analysis. Similar tools that may be automatically covered are: Weka
+ (ARFF data format), Tableau, etc.</p>
+ <p>Existing Work (optional): -</p>
+ </section> <section>
+ <h4>Allow Online Analytical Processing on published datasets of
+ statistical data (UC 9)</h4>
+ <p>Online Analytical Processing [15] is an analysis method on
+ multidimensional data. It is an explorative analysis method that
+ allows users to interactively view the data from different angles
+ (rotate, select) or at different granularities (drill-down, roll-up),
+ and to filter it for specific information (slice, dice).</p>
+ <p>The multidimensional model used in QB to model statistics should
+ be usable by OLAP systems. More specifically, data that conforms to QB
+ can be used to define a Data Cube within an OLAP engine and can then
+ be queried by OLAP clients.</p>
+ <p>An example scenario of this use case is the Financial
+ Information Observation System (FIOS) [16], where XBRL data has been
+ re-published using QB and made analysable for stakeholders in a
+ web-based OLAP client. The following figure shows an example of using
+ FIOS. Here, for three different companies, cost of goods sold as
+ disclosed in XBRL documents is analysed. As cell values, either the
+ number of disclosures or, if only one is available, the actual number
+ in USD is given:</p>
+
+ <p>
+ <div class="fig">
+ <a href="figures/FIOS_example.PNG"><img
+ src="figures/FIOS_example.PNG" alt="OLAP of QB data" /> </a>
+ <div>OLAP of QB data</div>
+ </div>
+ </p>
+ <p>Challenges of this use case are:</p>
+ <ul>
+ <li>A problem lies in the strict separation between queries for
+ the structure of data, and queries for actual aggregated values.</li>
+ <li>Another problem lies in defining Data Cubes without greater
+ insight in the data beforehand.</li>
+ <li>Depending on the expressivity of the OLAP queries (e.g.,
+ aggregation functions, hierarchies, ordering), performance plays an
+ important role.</li>
+ <li>QB allows flexibility in describing statistics, e.g., in
+ order to reduce redundancy of information in single observations.
+ These alternatives make general consumption of QB data more complex.
+ Also, it is not clear what "conforms" to QB means, e.g., is a
+ qb:DataStructureDefinition required?</li>
+ </ul>
+ <p>Unanticipated Uses (optional): -</p>
+ <p>Existing Work (optional): -</p>
+ </section> <section>
+ <h4>Transforming published statistics into XBRL (UC 10)</h4>
+ <p>XBRL is a standard data format for disclosing financial
+ information. Typically, financial data is not managed within the
+ organization using XBRL; instead, internal formats such as Excel or
+ relational databases are used. If different data sources are to be
+ summarized in XBRL data formats to be published, an internally-used
+ standard format such as QB could help integrate and transform the data
+ into the appropriate format.</p>
+ <p>In this use case, it should be possible to automatically
+ transform data that conforms to QB into such an XBRL
+ data format. This use case is fulfilled if QB contains the necessary
+ information to derive XBRL data.</p>
+ <p>In an example scenario, DERI has had a use case to publish
+ sustainable IT information as XBRL to the Global Reporting Initiative
+ (GRI - https://www.globalreporting.org/). Here, raw data (number of
+ printouts per person) is collected, then aggregated on a unit level
+ and modelled using QB. QB data shall then be used directly to fill in
+ XBRL documents that can be published to the GRI.</p>
+ <p>Challenges of this use case are:</p>
+ <ul>
+ <li>So far, QB data has been transformed into semantic XBRL, a
+ vocabulary closer to XBRL. There is the chance that certain
+ information required in a GRI XBRL document cannot be encoded using a
+ vocabulary as general as QB. In this case, QB could be used in
+ concordance with semantic XBRL.</li>
+ </ul>
+ <p class="editorsnote">@@TODO: Add link to semantic XBRL.</p>
+ <p>Unanticipated Uses (optional): -</p>
+ <p>Existing Work (optional): -</p>
+
+ </section> </section></section>
+ <section>
+ <h2>Requirements</h2>
+
+ <p>The use cases presented in the previous section give rise to the
+ following requirements for a standard representation of statistics.
+ Requirements are cross-linked with the use cases that motivate them.
+ Requirements are similarly categorized as deriving from publishing or
+ consuming use cases.</p>
+
+ <section>
+ <h3>Publishing requirements</h3>
+
+ <section>
+ <h4>Machine-readable and application-independent representation of
+ statistics</h4>
+ <p>It should be possible to add abstraction, multiple levels of
+ description, summaries of statistics.</p>
+
+ <p>Required by: UC1, UC2, UC3, UC4</p>
+ </section> <section>
+ <h4>Representing statistics from various sources</h4>
+ <p>It should be possible to translate statistics from various
+ source data into QB. QB should be very general and should be usable for
+ other data sets such as survey data, spreadsheets and OLAP data cubes.
+ The kinds of statistics described are simple CSV tables (UC 1), Excel
+ (UC 2) and more complex SDMX (UC 3) data about government statistics
+ or other public-domain relevant data.</p>
+
+ <p>Required by: UC1, UC2, UC3</p>
+ </section> <section>
+ <h4>Communicating, exposing statistics on the web</h4>
+ <p>It should become clear how to make statistical data available on
+ the web, including how to expose it, and how to distribute it.</p>
+
+ <p>Required by: UC5</p>
+ </section> <section>
+ <h4>Coverage of typical statistics metadata</h4>
+ <p>It should be possible to add metainformation to statistics as
+ found in typical statistics or statistics catalogs.</p>
+
+ <p>Required by: UC1, UC2, UC3, UC4, UC5</p>
+ </section> <section>
+ <h4>Expressing hierarchies</h4>
+ <p>It should be possible to express hierarchies on Dimensions of
+ statistics. Some of this requirement is met by the work on ISO
+ Extension to SKOS [17].</p>
+
+ <p>Required by: UC3, UC9</p>
+ </section> <section>
+ <h4>Expressing aggregation relationships in Data Cube</h4>
+ <p>Based on [18]: It often comes up in statistical data that you
+ have some kind of 'overall' figure, which is then broken down into
+ parts. Suppose I have a set of population observations, expressed
+ with the Data Cube vocabulary - something like (in pseudo-turtle):</p>
+ <pre>
+ex:obs1
+ sdmx:refArea &lt;UK&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "60" .
+
+ex:obs2
+ sdmx:refArea &lt;England&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "50" .
+
+ex:obs3
+ sdmx:refArea &lt;Scotland&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "5" .
+
+ex:obs4
+ sdmx:refArea &lt;Wales&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "3" .
+
+ex:obs5
+ sdmx:refArea &lt;NorthernIreland&gt;;
+ sdmx:refPeriod "2011";
+ ex:population "2" .
+ </pre>
+ <p>What is the best way (in the context of the RDF/Data Cube/SDMX
+ approach) to express that the values for the England/Scotland/Wales/
+ Northern Ireland ought to add up to the value for the UK and
+ constitute a more detailed breakdown of the overall UK figure? I might
+ also have population figures for France, Germany, EU27, etc...so it's
+ not as simple as just taking a qb:Slice where you fix the time period
+ and the measure.</p>
+ <p>Some of this requirement is met by the work on ISO Extension to
+ SKOS [19].</p>
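+ <p>One conceivable way to make the breakdown explicit, sketched here
+ under the assumption that the areas are modelled as SKOS concepts in a
+ hypothetical ex: namespace, is to state the part-whole relationship
+ between the dimension values. Note that skos:broader and
+ skos:narrower by themselves do not say that the values must add up;
+ that would remain a convention between publisher and consumer:</p>
+ <pre>
+@prefix skos: &lt;http://www.w3.org/2004/02/skos/core#&gt; .
+@prefix ex: &lt;http://example.org/geo#&gt; .
+
+# Hypothetical code list for the area dimension.
+ex:areas a skos:ConceptScheme .
+
+# The overall area lists its parts.
+ex:UK a skos:Concept ;
+ skos:inScheme ex:areas ;
+ skos:narrower ex:England, ex:Scotland, ex:Wales, ex:NorthernIreland .
+
+# Each part points back to the overall area; Scotland, Wales and
+# Northern Ireland would be described analogously.
+ex:England a skos:Concept ;
+ skos:inScheme ex:areas ;
+ skos:broader ex:UK .
+ </pre>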
+
+
+ <p>Required by: UC1, UC2, UC3, UC9</p>
+ </section> <section>
+ <h4>Scale - how to publish large amounts of statistical data</h4>
+ <p>Publishers that are restricted by the size of the statistics
+ they publish shall have possibilities to reduce the size or remove
+ redundant information. Scalability issues can arise both with
+ people's effort and with the performance of applications.</p>
+
+ <p>Required by: UC1, UC2, UC3, UC4</p>
+ </section> <section>
+ <h4>Compliance-levels or criteria for well-formedness</h4>
+ <p>The formal RDF Data Cube vocabulary expresses few formal
+ semantic constraints. Furthermore, in RDF the omission of
+ otherwise-expected properties on resources does not lead to any formal
+ inconsistencies. However, to build reliable software to process Data
+ Cubes, data consumers need to know what assumptions they can make
+ about a dataset purporting to be a Data Cube.</p>
+ <p>What <em>well-formedness</em> criteria should Data Cube publishers
+ conform to? Specific areas which may need explicit clarification in
+ the well-formedness criteria include (but may not be limited to):</p>
+ <ul>
+ <li>use of abbreviated data layout based on attachment levels</li>
+ <li>when to use qb:Slice (completeness, requirements for an
+ explicit qb:SliceKey?)</li>
+ <li>avoiding a mix of the two approaches to handling multiple
+ measures</li>
+ <li>optional triples (e.g. type triples)</li>
+ </ul>
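+ <p>To make these points concrete, the following is a minimal sketch
+ of a dataset that uses a slice with an explicit slice key and an
+ abbreviated observation. The ex: resources are hypothetical; the qb:
+ and sdmx-dimension: terms are used as defined in the Data Cube
+ vocabulary. Whether the type triples shown here may be omitted, and
+ whether the period may be left off the observation, is exactly what
+ the well-formedness criteria would need to settle:</p>
+ <pre>
+@prefix qb: &lt;http://purl.org/linked-data/cube#&gt; .
+@prefix sdmx-dimension: &lt;http://purl.org/linked-data/sdmx/2009/dimension#&gt; .
+@prefix ex: &lt;http://example.org/&gt; .
+
+# Slice key: fixes the period and leaves the area free
+ex:sliceByPeriod a qb:SliceKey ;
+  qb:componentProperty sdmx-dimension:refPeriod .
+
+ex:dsd a qb:DataStructureDefinition ;
+  qb:component [ qb:dimension sdmx-dimension:refArea ] ,
+               [ qb:dimension sdmx-dimension:refPeriod ] ,
+               [ qb:measure ex:population ] ;
+  qb:sliceKey ex:sliceByPeriod .
+
+ex:dataset a qb:DataSet ;
+  qb:structure ex:dsd ;
+  qb:slice ex:slice2011 .
+
+ex:slice2011 a qb:Slice ;
+  qb:sliceStructure ex:sliceByPeriod ;
+  sdmx-dimension:refPeriod "2011" ;
+  qb:observation ex:obs2 .
+
+# Abbreviated layout: refPeriod is attached to the slice, not repeated here
+ex:obs2 a qb:Observation ;
+  qb:dataSet ex:dataset ;
+  sdmx-dimension:refArea ex:England ;
+  ex:population 50 .
+ </pre>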
+
+ <p>Required by all use cases.</p>
+ </section> <section>
+ <h4>Declaring relations between Cubes</h4>
+ <p>In some situations statistical data sets are used to derive
+ further datasets. Should Data Cube be able to explicitly convey these
+ relationships?</p>
+ <p>A simple specific use case is that the Welsh Assembly government
+ publishes a variety of population datasets broken down in different
+ ways. For many uses, population broken down by some category (e.g.
+ ethnicity) is expressed as a percentage. Separate datasets give the
+ actual counts per category and aggregate counts. In such cases it is
+ common to talk about the denominator (often DENOM), which is the
+ aggregate count against which the percentages can be interpreted.</p>
+ <p>Should Data Cube support explicit declaration of such
+ relationships, either between separate qb:DataSets or between measures
+ within a single qb:DataSet (e.g. ex:populationCount and
+ ex:populationPercent)?</p>
+ <p>If so, should that be scoped to simple, common relationships like
+ DENOM, or should it allow expression of arbitrary mathematical
+ relations?</p>
+ <p>Note that there has been some work towards this within the SDMX
+ community, as indicated here: <a
+ href="http://groups.google.com/group/publishing-statistical-data/msg/b3fd023d8c33561d">http://groups.google.com/group/publishing-statistical-data/msg/b3fd023d8c33561d</a></p>
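+ <p>Purely as a sketch of what such a declaration might look like, the
+ denominator relationship could be attached either to a measure or to
+ a dataset. The properties ex:denominator and ex:denominatorDataSet,
+ and the two dataset names, are hypothetical and are not part of the
+ Data Cube vocabulary or of SDMX:</p>
+ <pre>
+@prefix qb: &lt;http://purl.org/linked-data/cube#&gt; .
+@prefix ex: &lt;http://example.org/&gt; .
+
+ex:populationCount a qb:MeasureProperty .
+
+# Between measures within a single qb:DataSet (hypothetical property)
+ex:populationPercent a qb:MeasureProperty ;
+  ex:denominator ex:populationCount .
+
+# Between separate qb:DataSets (hypothetical property)
+ex:totalCounts a qb:DataSet .
+
+ex:percentByEthnicity a qb:DataSet ;
+  ex:denominatorDataSet ex:totalCounts .
+ </pre>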
+
+ <p>Required by: UC6</p>
+ </section> </section> <section>
+ <h3>Consumption requirements</h3>
+
+ <section>
+ <h4>Finding statistical data</h4>
+ <p>Finding statistical data should be possible, perhaps through an
+ authoritative service.</p>
+
+ <p>Required by: UC5</p>
+ </section> <section>
+ <h4>Retrieval of fine grained statistics</h4>
+ <p>This requirement concerns query formulation and execution
+ mechanisms. It should be possible to use SPARQL to query for fine
+ grained statistics.</p>
+
+ <p>Required by: UC1, UC2, UC3, UC4, UC5, UC6, UC7</p>
+ </section> <section>
+ <h4>Understanding - End user consumption of statistical data</h4>
+ <p>It must be possible to present and visualize statistical data.</p>
+
+ <p>Required by: UC7, UC8, UC9, UC10</p>
+ </section> <section>
+ <h4>Comparing and trusting statistics</h4>
+ <p>It must be possible to find what the statistics of two or more
+ datasets have in common. This requirement also deals with information
+ quality (assessing statistical datasets) and trust (making trust
+ judgements on statistical data).</p>
+
+ <p>Required by: UC5, UC6, UC9</p>
+ </section> <section>
+ <h4>Integration of statistics</h4>
+ <p>This requirement concerns interoperability: combining statistics
+ produced by multiple different systems. It should be possible to
+ combine two statistics that contain related data and that may have
+ been published independently. It should also be possible to implement
+ value conversions.</p>
+
+ <p>Required by: UC1, UC3, UC4, UC7, UC9, UC10</p>
+ </section> <section>
+ <h4>Scale - how to consume large amounts of statistical data</h4>
+ <p>Consumers that want to access large amounts of statistical data
+ need guidance.</p>
+
+ <p>Required by: UC7, UC9</p>
+ </section> <section>
+ <h4>Common internal representation of statistics, to be exported
+ in other formats</h4>
+ <p>It should be possible to transform QB data into data formats
+ such as XBRL, which are required by certain institutions.</p>
+
+ <p>Required by: UC10</p>
+ </section> <section>
+ <h4>Dealing with imperfect statistics</h4>
+ <p>This requirement concerns imperfections: it should be possible to
+ reason about statistical data that is not complete or correct.</p>
+
+ <p>Required by: UC7, UC8, UC9, UC10</p>
+ </section> </section> </section>
+ <section class="appendix">
+ <h2>Acknowledgments</h2>
+ <p>The editors are very thankful for comments and suggestions ...</p>
+ </section>
+
+ <h2 id="references">References</h2>
+
+ <dl>
+ <dt id="ref-SDMX">[SDMX]</dt>
+ <dd>
+ SDMX - User Guide 2009, <a
+ href="http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf">http://sdmx.org/wp-content/uploads/2009/02/sdmx-userguide-version2009-1-71.pdf</a>
+ </dd>
+
+ <dt id="ref-Fowler1997">[Fowler1997]</dt>
+ <dd>Fowler, Martin (1997). Analysis Patterns: Reusable Object
+ Models. Addison-Wesley. ISBN 0201895420.</dd>
+
+ <dt id="ref-QB">[QB]</dt>
+ <dd>
+ RDF Data Cube vocabulary, <a
+ href="http://dvcs.w3.org/hg/gld/raw-file/default/data-cube/index.html">http://dvcs.w3.org/hg/gld/raw-file/default/data-cube/index.html</a>
+ </dd>
+
+ <dt id="ref-OLAP">[OLAP]</dt>
+ <dd>
+ Online Analytical Processing Data Cubes, <a
+ href="http://en.wikipedia.org/wiki/OLAP_cube">http://en.wikipedia.org/wiki/OLAP_cube</a>
+ </dd>
+
+ <dt id="ref-linked-data">[LOD]</dt>
+ <dd>
+ Linked Data, <a href="http://linkeddata.org/">http://linkeddata.org/</a>
+ </dd>
+
+ <dt id="ref-rdf">[RDF]</dt>
+ <dd>
+ Resource Description Framework, <a href="http://www.w3.org/RDF/">http://www.w3.org/RDF/</a>
+ </dd>
+
+ <dt id="ref-scovo">[SCOVO]</dt>
+ <dd>
+ The Statistical Core Vocabulary, <a
+ href="http://sw.joanneum.at/scovo/schema.html">http://sw.joanneum.at/scovo/schema.html</a>
+ <br /> SCOVO: Using Statistics on the Web of data, <a
+ href="http://sw-app.org/pub/eswc09-inuse-scovo.pdf">http://sw-app.org/pub/eswc09-inuse-scovo.pdf</a>
+ </dd>
+
+ <dt id="ref-skos">[SKOS]</dt>
+ <dd>
+ Simple Knowledge Organization System, <a
+ href="http://www.w3.org/2004/02/skos/">http://www.w3.org/2004/02/skos/</a>
+ </dd>
+
+ <dt id="ref-cog">[COG]</dt>
+ <dd>
+ SDMX Content Oriented Guidelines, <a
+ href="http://sdmx.org/?page_id=11">http://sdmx.org/?page_id=11</a>
+ </dd>
+
+ </dl>
+</body>
+</html>