Reification Index =
Current Reifications ==¶
Normal reification requires 4 statements in the store:
<ID> <rdf:type> <rdf:Statement>
<ID> <rdf:subject> <subject>
<ID> <rdf:predicate> <predicate>
<ID> <rdf:object> <object>
Note that this does not assert the statement itself. This requires a fifth statement:
<subject> <predicate> <object>
This is very expensive if many statements are to be reified. With the current storage of 64 bits per gNode (gNode = graph node: The number representing the resource) then this requires 768 bytes for every reified statement.
8 bytes per resource x 4 nodes per statement x 4 statements per reification x 6 indexes
Reifications in the Index¶
Another alternative is to modify the indexes to handle reified statements, and to write code which returns these modification as separate statements. Fortunately, the mechanism to artificially create statements already exists in the Resolver interface. So all that remains is to modify the indexes accordingly.
The current indexes are:
We can modify the first index to include the reification ID:
SPOM --> SPOMR
This ID (R) can be 0 when a statement is not reified. Otherwise it will be set to the gNode of the reification ID. Currently, negative numbers are never stored in any of these indexes (though they get used internally to represent temporary nodes during queries), so a negative gNode can be used to represent a statement that has been reified, but not asserted. Any code which lazily resolves on this index can then skip over any statements with a negative R.
This mechanism maps statments to a possible reification ID. With this technique any statement can be checked quickly (O(log(N))) in this index to see if it is reified. It has an associated cost of 8 bytes per statement (or an increase of about 4%). This applies if statements are reified or not.
A Reification Index¶
The other requirement is to map reification IDs back to their statments. It might be tempting to have an index which maps an ID directly to a pointer which indicates an offset in the first index. However, this loses coherency as that index gets modified, with blocks of the index being copied through various phases, etc. Instead, the only way to map indexes to the associated statements is to introduce a new index of the form:
Unlike the other indexes, this one only requires an entry whenever a statement is reified. The result is zero overhead for non-reified statements, and 40 bytes for each reification.
This proposal requires a constant overhead of 4% for the entire system. This seems acceptably low, but may be enough to consider allowing reified indexes as a different graph type. This should be possible with the current mechanism of model types.
The 4% increase should not matter from the perspective of disk space, but may be an issue in terms of performance, given that the more data each statement takes up, the more swapping the system needs to make (both in terms of heap space, and memory mapped files).
Once the 4% overhead has been accepted, the savings in reification are significant. Instead of an extra 768 bytes (over 4 statements) to reify a single statement, this new system will require just 40 extra bytes, and will only need to touch two indexes. This saves over 80% in space, and results in less disk access, making this a much more scalable form of reification.
Asserting the reified statement will only update the first index (instead of taking more space), and add to the remaining 5. This would have otherwise added to 6 indexes, so there is a slight saving of 32 bytes out of the original 192.
Another problem with adjusting the SPOM index to SPOMR is that the count of a range of statements will be incorrect if any of those statements contain negative values for R. This is because those statements would be counted, even though they have not been asserted.
There are three approaches to dealing with this situation:
- Brute-force the count. This is not scalable at all.
- Maintain 2 counts in the AVL nodes. The first is the count of all statements. The second is a count of only the asserted statements. This would incur an overhead of 8 bytes per 256 statements.
- Use counts from another index. This is not practical unless we have redundant indexes. These are not currently available, but will be discussed on another page.