Introduction
When data scientists discuss graph normalization, they
do so in the context of achieving a particular set of goals. Since a directed
graph can express the same information in a variety of different ways, it
becomes necessary to be able to transform a graph of information into a
standard form for the purposes of calculating differences between two graphs,
generating a cryptographically-strong hash identifier for a graph, or digitally
signing a graph.
Many RDF graphs, containing nodes with unique identifiers, can be normalized
quite easily. However, when a node does not have a unique identifier,
graph normalization becomes exponentially more difficult. This problem is
called the graph isomorphism problem, and it is believed to be a
NP-hard problem in the worst cases. This means that the problem is very
difficult to solve in a reasonable timeframe for certain inputs. Luckily, these
types of inputs are extremely rare in the real world. Graph isomorphisms are
most commonly used as a denial of service attack and thus any software system
attempting to solve the graph normalization problem should be able to detect
graph isomorphisms correctly.
This document outlines an algorithm for generating a normalized RDF
graph given an input RDF graph. The algorithm is called the
Universal RDF Graph Normalization Algorithm 2011 or
URGNA2011.
How to Read this Document
This document is a detailed specification for an RDF graph normalization
algorithm. The document is primarily intended for the following audiences:
- Software developers that want to implement an RDF graph normalization
algorithm.
- Masochists.
To understand the basics in this specification you must be familiar with basic
RDF concepts [[!RDF-CONCEPTS]]. A working knowledge of
graph theory and
graph isomorphisms
is also recommended.
Contributing
There are a number of ways that one may participate in the development of
this specification:
- Technical discussion typically occurs on the public mailing list:
public-linked-json@w3.org
- Public teleconferences are held
on Tuesdays at 1500UTC on the second and fourth week of each month.
- Specification bugs and issues should be reported in the
issue tracker.
- Source code for the
specification can be found on Github.
- The #json-ld
IRC channel is available for real-time discussion on irc.freenode.net.
Normalization
This algorithm is a work in progress, do not implement it.
There is a newer, simpler to implement algorithm that has not yet been
described here. The algorithm below is obsolete.
Normalization is the process of taking input graph and
performing a transformation on that input that results in all
aspects of the graph being arranged in a deterministic way in the
output graph. That is, for two input graphs
containing the same
information, regardless of their arrangement, the output graphs
will be identical. The problem is a fairly difficult technical
problem to solve because it requires a directed graph to be ordered into a
set of nodes and edges in a deterministic way. This is easy to do when all of
the nodes have unique names, but very difficult to do when some of the nodes
are not labeled and thus names must be generated for those nodes in a
deterministic fashion.
In time, there may be more than one normalization algorithm that will need
to be identified. For identification purposes, this algorithm is named the
"Universal RDF Graph Normalization Algorithm 2011"
(URGNA2011).
Normalization Algorithm Terms
- input graph
-
The data structure that is provided as input to the algorithm.
- output graph
-
The data structure that is produced as output by the algorithm.
- label
-
The subject IRI associated with a graph node.
- list of expanded nodes
-
A list of all nodes in the input graph.
- alpha and beta values
-
The words alpha and beta refer to the first and
second nodes or values being examined in an algorithm. The names are
merely used to refer to each input value to a comparison algorithm.
- renaming counter
-
A counter that is used during the
Node Relabeling Algorithm. The
counter typically starts at one (1) and counts up for every node that is
relabeled. There will be two such renaming counters in an implementation
of the normalization algorithm. The first is the
labeling counter and the second is the
deterministic labeling counter.
- serialization label
-
An identifier that is created to aid in the normalization process in the
Deep Comparison Algorithm. The
value typically takes the form of
s<NUMBER>
or
c<NUMBER>
.
Normalization State
When performing the steps required by the normalization algorithm,
it is helpful to track the many pieces of information in a
data structure called the normalization state. Many of these
pieces simply provide indexes into the graph. The information
contained in the normalization state is described below.
- node state
-
Each node in the graph will be assigned a node state. This
state contains the information necessary to deterministically
label all nodes in the graph. A node state
includes:
- node reference
-
A node reference is a reference to a node in the graph.
For a given node state, its node reference
refers to the node that the state is for. When a
node state is created, its node reference
should be to the node it is created for.
- outgoing list
-
Lists the labels for all nodes that are properties of
the node reference. This list should be initialized
by iterating over every object associated with a property in the
node reference adding its label if it is another node.
- incoming list
-
Lists the labels for all nodes in the graph for which
the node reference is a property. This list is
initialized to an empty list.
- outgoing serialization map
-
Maps node labels to serialization labels.
This map is initialized to an empty map. When this map is populated,
it will be filled with keys that are the labels of every node in the
graph with a label that begins with
_:
and that has a
path, via properties, that starts with the
node reference.
- outgoing serialization
-
A string that can be lexicographically compared to the
outgoing serializations of other
node states. It is a representation of the
outgoing serialization map and other related
information. This string is initialized to an empty string.
- incoming serialization map
-
Maps node labels to serialization labels.
This map is initialized to an empty map. When this map is populated,
it will be filled with keys that are the labels of every
node in the graph with a label that begins with
_:
and that has a path, via properties, that ends with
the node reference.
- incoming serialization
-
A string that can be lexicographically compared to the
outgoing serializations of other
node states. It is a representation of the
incoming serialization map and other related
information. This string is initialized to an empty string.
- node state map
-
A mapping from a node's label to a node state.
It is initialized to an empty map.
- labeling prefix
-
The labeling prefix is a string that is used as the beginning of a node
label. It should be initialized to a random base string that
starts with the characters
_:
, is not used by any other
node's label in the input graph, and does not
start with the characters _:c14n
. The prefix has two uses.
First it is used to temporarily name nodes during the normalization
algorithm in a way that doesn't collide with the names that already
exist as well as the names that will be generated by the normalization
algorithm. Second, it will eventually be set to _:c14n
to
generate the final, deterministic labels for nodes in the graph. This
prefix will be concatenated with the labeling counter to
produce a node label. For example, _:j8r3k
is
a proper initial value for the labeling prefix.
- labeling counter
-
A counter that is used to label nodes. It is appended to the
labeling prefix to create a node label. It is
initialized to
1
.
- map of flattened nodes
-
A map containing a representation of all nodes in the graph where the
key is a node label and the value is a single
JSON object that has no nested sub-objects
and has had all properties for the same node merged into a single
JSON object.
Normalization Algorithm
The normalization algorithm expands the input graph,
flattens the data structure, and creates an initial set of names for all
nodes in the graph. The flattened data structure is then processed by a
node labeling algorithm in order to get a fully expanded and named list of
nodes which is then sorted. The result is a deterministically named and
ordered list of graph nodes.
- Expand the input graph according to the steps in
the Expansion Algorithm and store the
result as the expanded input.
- Create a normalization state.
- Initialize the map of flattened nodes by recursively
processing every expanded node in the
expanded input in depth-first order:
- If the expanded node is an unlabeled node, add a
new key-value pair to the expanded node
where the key is
@id
and the value is the
concatenation of the labeling prefix
and the string value of the labeling counter.
Increment the labeling counter.
- Add the expanded node to the
map of flattened nodes:
- If the expanded node's label is already
in the
map of flattened nodes merge all properties from the
entry in the map of flattened nodes into the
expanded node.
- Go through every property associated with an array in the
expanded node and remove any duplicate IRI entries from
the array. If the resulting array only has one IRI entry, change it
from an array to an object.
- Set the entry for the expanded node's label
in the map of flattened nodes to the
expanded node.
- After exiting the recursive step, replace the reference to the
expanded node with an object containing a single
key-value pair where the key is
@id
and the value is
the value of the @id
key in the node.
- For every entry in the map of flattened nodes, insert a
key-value pair into the node state map where the key is the
key from the map of flattened nodes and the value is a
node state where its node reference refers to
the value from the map of flattened nodes.
- Populate the incoming list for each node state
by iterating over every node in the graph and adding its label
to the incoming list associated with each node found in its
properties.
- For every entry in the node state map that has a
label that begins with
_:c14n
, relabel the node
using the Node Relabeling Algorithm.
- Label all of the nodes that contain a
@id
key associated
with a value starting with _:
according to the steps in the
Deterministic Labeling Algorithm.
Node Relabeling Algorithm
This algorithm renames a node by generating a unique
new label and updating all references to that label
in the node state map. The old label and the
normalization state must be given as an input to the
algorithm. The old label is the current label of
the node that is to be relabeled.
The node relabeling algorithm is as follows:
- If the labeling prefix is
_:c14n
and the
old label begins with _:c14n
then return as
the node has already been renamed.
- Generate the new label by concatenating the
labeling prefix with the string value of the
labeling counter. Increment the labeling counter.
- For the node state associated with the
old label, update every node in the incoming list
by changing all the properties that reference the old label to
the new label.
- Change the old label key in the node state map
to the new label and set the associated
node reference's label to the
new label.
Deterministic Labeling Algorithm
The deterministic labeling algorithm takes the
normalization state
and produces a list of finished nodes that is sorted and
contains deterministically named and expanded nodes from the graph.
- Set the labeling prefix to
_:c14n
, the
labeling counter to 1
,
the list of finished nodes to an empty array, and create
an empty array, the list of unfinished nodes.
- For each node reference in the node state map:
- If the node's label does not start with
_:
then put the node reference in the
list of finished nodes.
- If the node's label does start with
_:
then put the node reference in the
list of unfinished nodes.
- Append to the list of finished nodes by processing
the remainder of the list of unfinished nodes until it is
empty:
- Sort the list of unfinished nodes in descending order
according to the
Deep Comparison Algorithm to
determine the sort order.
- Create a list of labels and initialize it to an
empty array.
- For the first node from the list of unfinished nodes:
- Add its label to the list of labels.
- For each key-value pair from its associated
outgoing serialization map, add the key to a list and
then sort the list according to the lexicographical order of the
keys' associated values. Append the list to the
list of nodes to label.
- For each key-value pair from its associated
incoming serialization map, add the key to a list and
then sort the list according to the lexicographical order of the
keys' associated values. Append the list to the
list of nodes to label.
- For each label in the list of labels,
relabel the associated node according to the
Node Relabeling Algorithm. If
any outgoing serialization map contains a key that
matches the label, clear the map and set the associated
outgoing serialization to an empty string. If any
incoming serialization map contains a key that
matches the label, clear the map and set the associated
incoming serialization to an empty string.
-
Remove each node with a label that starts with
_:c14n
from the list of unfinished nodes and
add it to the list of finished nodes.
- Sort the list of finished nodes in descending order
according to the
Deep Comparison Algorithm to
determine the sort order.
Shallow Comparison Algorithm
The shallow comparison algorithm takes two unlabeled nodes,
alpha and beta, as input and
determines which one should come first in a sorted list. The following
algorithm determines the steps that are executed in order to determine the
node that should come first in a list:
- Compare the total number of node properties. The node with fewer
properties is first.
- Lexicographically sort the property IRIs for each node and compare
the sorted lists. If an IRI is found to be lexicographically smaller, the
node containing that IRI is first.
- Compare the values of each property against one another:
- The node associated with fewer property values is first.
- Create an alpha list by adding all values associated
with the alpha property that are not unlabeled nodes.
- Create a beta list by adding all values associated
with the beta property that is not an unlabeled node.
- Compare the length of alpha list and
beta list. The node associated with the list containing
the fewer number of items is first.
- Sort alpha list and beta list according to
the
Object Comparison Algorithm.
For each offset into the alpha list, compare the item
at the offset against the item at the same offset in the
beta list according to the
Object Comparison Algorithm.
The node associated with the lesser item is first.
- Process the incoming lists associated with each node to
determine order:
- The node with the shortest incoming list is first.
- Sort the incoming lists according to incoming property
and then incoming label.
- The node associated with the fewest number of incoming nodes is
first.
- For each offset into the incoming lists,
compare the associated properties and labels:
- The node associated with a label that does not begin with
_:
is first.
- If the nodes' labels do not begin with
_:
, then the node associated with the
lexicographically lesser label is first.
- The node associated with the lexicographically lesser associated
property is first.
- The node with the label that does not begin with
_:c14n
is first.
- The node with the lexicographically lesser label
is first.
Otherwise, the nodes are equivalent.
Object Comparison Algorithm
The object comparison algorithm is designed to compare two graph node
property values, alpha and beta, against the other.
The algorithm is useful when sorting two lists of graph node properties.
- If one of the values is a string and the other is not, the value that is
a string is first.
- If both values are strings, the lexicographically lesser string is
first.
- If one of the values is a literal and the other is not, the value that is
a literal is first.
- If both values are literals:
- The lexicographically lesser string associated with
@literal
is first.
- The lexicographically lesser string associated with
@datatype
is first.
- The lexicographically lesser string associated with
@language
is first.
- If both values are expanded IRIs, the
lexicographically lesser string associated with
@id
is first.
- Otherwise, the two values are equivalent.
Deep Comparison Algorithm
The deep comparison algorithm is used to compare the difference between two
nodes, alpha and beta.
A deep comparison takes the incoming and outgoing node edges in
a graph into account if the number of properties and value of those properties
are identical. The algorithm is helpful when sorting a list of nodes and will
return whichever node should be placed first in a list if the two nodes are
not truly equivalent.
When performing the steps required by the deep comparison algorithm, it
is helpful to track state information about mappings. The information
contained in a mapping state is described below.
- mapping state
-
- mapping counter
-
Keeps track of the number of nodes that have been mapped to
serialization labels. It is initialized to
1
.
- processed labels map
-
Keeps track of the labels of nodes that have already
been assigned serialization labels. It is initialized
to an empty map.
- serialized labels map
-
Maps a node label to its associated
serialization label. It is initialized to an empty map.
- adjacent info map
-
Maps a serialization label to the node
label associated with it, the list of sorted
serialization labels for adjacent nodes, and the map of
adjacent node serialiation labels to their associated
node labels. It is initialized to an empty map.
- key stack
-
A stack where each element contains an array of adjacent
serialization labels and an index into that array. It
is initialized to a stack containing a single element where its
array contains a single string element
s1
and its
index is set to 0
.
- serialized keys
-
Keeps track of which serialization labels have already
been written at least once to the serialization string.
It is initialized to an empty map.
- serialization string
-
A string that is incrementally updated as a serialization is built.
It is initialized to an empty string.
The deep comparison algorithm is as follows:
- Perform a comparison between alpha and beta
according to the
Shallow Comparison Algorithm.
If the result does not show that the two nodes are equivalent, return
the result.
- Compare incoming and outgoing edges for each node, updating their
associated node state as each node is processed:
- If the outgoing serialization map for alpha
is empty, generate the serialization according to the
Node Serialization Algorithm.
Provide alpha's node state, a new
mapping state,
outgoing direction
to the algorithm as inputs.
- If the outgoing serialization map for beta
is empty, generate the serialization according to the
Node Serialization Algorithm.
Provide beta's node state, a new
mapping state, and
outgoing direction
to the algorithm as inputs.
- If alpha's outgoing serialization is
lexicographically less than beta's, then
alpha is first. If it is greater, then beta
is first.
- If the incoming serialization map for alpha
is empty, generate the serialization according to the
Node Serialization Algorithm.
Provide alpha's node state, a new
mapping state with its serialized labels map
set to a copy of alpha's
outgoing serialization map, and
incoming direction
to the algorithm as inputs.
- If the incoming serialization map for beta
is empty, generate the serialization according to the
Node Serialization Algorithm.
Provide beta's node state, a new
mapping state with its serialized labels map
set to a copy of beta's
outgoing serialization map, and
incoming direction
to the algorithm as inputs.
- If alpha's incoming serialization is
lexicographically less than beta's, then
alpha is first. If it is greater, then beta
is first.
Node Serialization Algorithm
The node serialization algorithm takes a node state, a
mapping state, and a direction (either
outgoing direction
or incoming direction
) as
inputs and generates a deterministic serialization for the
node reference.
- If the label exists in the
processed labels map, terminate the algorithm as the
serialization label has already been created.
- Set the value associated with the label in the
processed labels map to
true
.
- Generate the next serialization label for the
label according to the
Serialization Label Generation Algorithm.
- Create an empty map called the adjacent serialized labels map
that will store mappings from serialized labels to adjacent
node labels.
- Create an empty array called the
adjacent unserialized labels list that will store
labels of adjacent nodes that haven't been assigned
serialization labels yet.
- For every label in a list, where the list the outgoing list if
the direction is
outgoing direction
and the
incoming list otherwise, if the label starts with
_:
, it is the target node label:
- Look up the target node label in the
processed labels map and if a mapping exists,
update the adjacent serialized labels map where the key is
the value in the serialization map and the value is the
target node label.
- Otherwise, add the target node label to the
adjacent unserialized labels list.
- Set the maximum serialization combinations to
1
or the length of the
adjacent unserialized labels list, whichever is greater.
- While the maximum serialization combinations is greater than
0
, perform the
Combinatorial Serialization Algorithm
passing the node state, the mapping state for the
first iteration and a copy of it for each subsequent iteration, the
generated serialization label, the direction,
the adjacent serialized labels map, and the
adjacent unserialized labels list.
Decrement the maximum serialization combinations by
1
for each iteration.
Serialization Label Generation Algorithm
The algorithm generates a serialization label given a
label and a mapping state and returns the
serialization label.
- If the label is already in the
serialization labels map, return its associated value.
- If the label starts with the string
_:c14n
,
the serialization label is the letter c
followed by the number that follows _:c14n
in the
label.
- Otherwise, the serialization label is the
letter
s
followed by the string value of
mapping count. Increment the mapping count by
1
.
- Create a new key-value pair in the serialization labels map
where the key is the label and the value is the
generated serialization label.
Combinatorial Serialization Algorithm
The combinatorial serialization algorithm takes a node state, a
mapping state, a serialization label, a
direction, a adjacent serialized labels map,
and a adjacent unserialized labels list as inputs and generates
the lexicographically least serialization of nodes relating to the
node reference.
- If the adjacent unserialized labels list is not empty:
- Copy the adjacent serialized labels map to the
adjacent serialized labels map copy.
- Remove the first unserialized label from the
adjacent unserialized labels list and create a new
new serialization label according to the
Serialization Label Generation Algorithm.
- Create a new key-value mapping in the
adjacent serialized labels map copy
where the key is the new serialization label and the value
is the unserialized label.
- Set the maximum serialization rotations to
1
or the length of the
adjacent unserialized labels list, whichever is greater.
- While the maximum serialization rotations is greater than
0
:
- Recursively perform the
Combinatorial Serialization Algorithm
passing the mapping state for the first iteration of the
loop, and a copy of it for each subsequent iteration.
- Rotate the elements in the
adjacent unserialized labels list by shifting each of
them once to the right, moving the element at the end of the list
to the beginning of the list.
- Decrement the maximum serialization rotations by
1
for each iteration.
- If the adjacent unserialized labels list is empty:
- Create a list of keys from the keys in the
adjacent serialized labels map and sort it
lexicographically.
- Add a key-value pair to the adjacent info map where
the key is the serialization label and the value is
an object containing the node reference's label, the
list of keys and the
adjacent serialized labels map.
- Update the serialization string according to the
Mapping Serialization Algorithm.
- If the direction is
outgoing direction
then directed serialization refers to the
outgoing serialization and the
directed serialization map refers to the
outgoing serialization map, otherwise it refers to the
incoming serialization and the
directed serialization map refers to the
incoming serialization map. Compare the
serialization string to the
directed serialization according to the
Serialization Comparison Algorithm.
If the serialization string is less than or equal to
the directed serialization:
- For each value in the list of keys, run the
Node Serialization Algorithm.
- Update the serialization string according to the
Mapping Serialization Algorithm.
- Compare the serialization string to the
directed serialization again and if it is less than
or equal and the length of the serialization string is
greater than or equal to the length of the
directed serialization, then set the
directed serialization to the
serialization string and set the
directed serialization map to the
serialized labels map.
Serialization Comparison Algorithm
The serialization comparison algorithm takes two serializations,
alpha and beta and returns either which of the two
is less than the other or that they are equal.
- Whichever serialization is an empty string is greater. If they are
both empty strings, they are equal.
- Return the result of a lexicographical comparison of alpha
and beta up to the number of characters in the shortest of
the two serializations.
Mapping Serialization Algorithm
The mapping serialization algorithm incrementally updates the
serialization string in a mapping state.
- If the key stack is not empty:
- Pop the serialization key info off of the
key stack.
- For each serialization key in the
serialization key info array, starting at
the serialization key index from the
serialization key info:
- If the serialization key is not in the
adjacent info map, push the
serialization key info onto the
key stack and exit from this loop.
- If the serialization key is a key in
serialized keys, a cycle has been detected. Append
the concatenation of the
_
character and the
serialization key to the
serialization string.
- Otherwise, serialize all outgoing and incoming edges in the
related node by performing the following steps:
- Mark the serialization key as having
been processed by adding a new key-value pair to
serialized keys where the key
is the serialization key and the value is
true
.
- Set the serialization fragment to the value of
the serialization key.
- Set the adjacent info to the value of the
serialization key in the
adjacent info map.
- Set the adjacent node label to the node
label from the adjacent info.
- If a mapping for the adjacent node label
exists in the map of all labels:
- Append the result of the
Label Serialization Algorithm to the
serialization fragment.
- Append all of the keys in the adjacent info
to the serialization fragment.
- Append the serialization fragment to the
serialization string.
- Push a new key info object containing the keys from the
adjacent info and an index of
0
onto the key stack.
- Recursively update the serialization string
according to the
Mapping Serialization Algorithm.
Label Serialization Algorithm
The label serialization algorithm serializes information about a node that
has been assigned a particular serialization label.
- Initialize the label serialization to an empty string.
- Append the
[
character to the
label serialization.
- Append all properties to the label serialization by
processing each key-value pair in the node reference,
excluding the
@id
property. The keys should be processed in
lexicographical order and their associated values should be processed
in the order produced by the
Object Comparison Algorithm:
- Build a string using the pattern
<
KEY>
where KEY is the current key. Append string to the
label serialization.
- The value may be a single object or an array of objects.
Process all of the objects that are associated with the key, building
an object string for each item:
- If the object contains an
@id
key with a
value that starts
with _:
, set the object string to
the value _:
. If the value does not
start with _:
, build the object string
using the pattern
<
IRI>
where IRI is the value associated with the
@id
key.
- If the object contains a
@literal
key and a
@datatype
key, build the object string
using the pattern
"
LITERAL"^^<
DATATYPE>
where LITERAL is the value associated with the
@literal
key and DATATYPE is the
value associated with the @datatype
key.
- If the object contains a
@literal
key and a
@language
key, build the object string
using the pattern
"
LITERAL"@
LANGUAGE
where LITERAL is the value associated with the
@literal
key and LANGUAGE is the
value associated with the @language
key.
- Otherwise, the value is a string. Build the
object string using the pattern
"
LITERAL"
where LITERAL is the value associated with the
current key.
- If this is the second iteration of the loop,
append a
|
separator character to the
label serialization.
- Append the object string to the
label serialization.
- Append the
]
character to the
label serialization.
- Append the
[
character to the
label serialization.
- Append all incoming references for the current
label to the label serialization by
processing all of the items associated with the incoming list:
- Build a reference string
using the pattern
<
PROPERTY>
<
REFERER>
where PROPERTY is the property associated with the
incoming reference and REFERER is either the subject of
the node referring to the label in the incoming reference
or _:
if REFERER begins with
_:
.
- If this is the second iteration of the loop,
append a
|
separator character to the
label serialization.
- Append the reference string to the
label serialization.
- Append the
]
character to the
label serialization.
- Append all adjacent node labels to the
label serialization by concatenating the string value
for all of them, one after the other, to the
label serialization.
- Push the adjacent node labels onto the
key stack and append the result of the
Mapping Serialization Algorithm
to the label serialization.