Redundant Indexes

Indexes with Models

Part of the nature of "perfect" indexes is that they are completely symmetric about all axes. This makes it possible to search on all elements in any combination. This was an important property to have for our early code, as the fourth element (the graph node) was being used for different purposes.

Initially this node represented security permissions, as each statement could be individually secured. Over time it became apparent that security needed to be administered for groups of statements, and that these groups usually coincided with graphs. While the RDF specification does not have any definition for provenance, it does still describe statements as appearing in graphs, and the demand for graphs quickly made it clear that this is what we should be using the fourth node for.

Once we moved the fourth node to representing graphs (or models as they were known), we maintained symmetry with all the axes in the indexes, as we were hearing of use-cases for thousands of graphs, sometimes with only a single statement per graph. Over time we found that these sorts of applications were unrealistic, and the requirement seemed to go away. However, the indexes are still completely symmetric, and can be as easily searched for graphs as for subjects, predicates or objects.

Overhead

In reality, all use-cases I am aware of use only a few indexes. This leads to a massive overhead on the indexes: each index entry contains 4 elements when it could contain 3, and we have 6 indexes when we could have just 3. The extra space costs us in speed of disk access: the larger the records, the more seeking, reading and writing is required. Disk seeking, in particular, is our largest overhead.
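The size difference can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic: the 8-byte node width and the helper name `bytes_per_statement` are assumptions for illustration, and tree and block overhead are ignored.

```python
# Assumption: every element of an index entry is a fixed-width node ID.
# The 8-byte width is a hypothetical figure chosen for illustration.
NODE_BYTES = 8


def bytes_per_statement(index_count, elements_per_entry, node_bytes=NODE_BYTES):
    """Raw index payload written for one statement, ignoring tree overhead."""
    return index_count * elements_per_entry * node_bytes


current = bytes_per_statement(6, 4)  # six indexes, four elements per entry
minimal = bytes_per_statement(3, 3)  # three indexes, three elements per entry

print(current, minimal, minimal / current)
```

Under these assumptions the minimal layout stores well under half of the raw payload per statement, before counting any saving in seeks.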

It is only realistic to keep the indexes at their current size and number if they are being employed. However, the first 3 indexes:
  • SPOM
  • POSM
  • OSPM

are almost never used. This is because the model is usually chosen for a query, or else it is determined by the end of the first few constraints. The only way these particular indexes can be used is if the first constraint contains a variable for the model.
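The reasoning above can be sketched in code. This is a simplified model, not the actual query planner: it assumes an index can resolve a constraint when its leading columns are exactly the bound positions, and the name `usable_indexes` and the string encoding of bound positions are hypothetical.

```python
ALL_INDEXES = ["SPOM", "POSM", "OSPM", "MSPO", "MPOS", "MOSP"]


def usable_indexes(bound, indexes=ALL_INDEXES):
    """Return the indexes whose leading columns are exactly the bound positions.

    `bound` is a string of the bound positions, e.g. "MS" for a constraint
    with a fixed model and subject.
    """
    bound_set = set(bound)
    prefix_len = len(bound_set)
    return [ix for ix in indexes if set(ix[:prefix_len]) == bound_set]


# Model and predicate fixed: a model-first index is enough.
print(usable_indexes("MP"))
# Only the subject fixed, model left variable: a model-last index is needed.
print(usable_indexes("S"))
```

In this model, every combination of bound positions that includes M is served by one of the model-first indexes, so the model-last indexes only matter when the model is left as a variable.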

Dropping Model Symmetry

Dropping model symmetry would lead to just the following indexes:
  • MSPO
  • MPOS
  • MOSP

Usage patterns have shown that these are the indexes which get used all the time.

The missing indexes allowed statement selection with variable models. This could still be achieved by performing multiple searches: one per model. While this would be inefficient for enormous numbers of models, it would still be practical if the number of models were up into the hundreds.
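A minimal sketch of the one-search-per-model approach follows. `ToyStore` is a hypothetical in-memory stand-in for a store that can only search within a known model, and `search_all_models` is an illustrative name, not an existing API.

```python
class ToyStore:
    """Hypothetical store that can only be searched one model at a time."""

    def __init__(self, quads):
        self._by_model = {}
        for s, p, o, m in quads:
            self._by_model.setdefault(m, []).append((s, p, o))

    def models(self):
        return sorted(self._by_model)

    def search(self, model, s=None, p=None, o=None):
        """Yield triples in `model` matching the pattern; None is a wildcard."""
        for triple in self._by_model.get(model, []):
            if all(w is None or w == g for w, g in zip((s, p, o), triple)):
                yield triple


def search_all_models(store, s=None, p=None, o=None):
    """Emulate a variable model by running one search per known model."""
    for model in store.models():
        for s2, p2, o2 in store.search(model, s, p, o):
            yield (s2, p2, o2, model)


store = ToyStore([
    ("alice", "knows", "bob", "m1"),
    ("alice", "likes", "tea", "m1"),
    ("carol", "knows", "dave", "m2"),
])
matches = list(search_all_models(store, p="knows"))
```

The cost grows linearly with the number of models, which is why this is only reasonable while the model count stays in the hundreds.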

One suggestion to save on file size is to drop the M at the start of each index (saving 25% of space), and to have each model stored in its own directory. However, this is not practical for more than a few models, as processes are limited in the number of files which may be open. (If we go to configurable stores, then this may be an option for users who have a small number of models and require high scalability.)

Reification

This proposal is also compatible with the scheme proposed in Reification Index. In this case, the resulting indexes would be:
  • MSPOR
  • MPOS
  • MOSP
  • RMSPO

Updated by Paula Gearon over 16 years ago · 2 revisions