Rationale

The string pool in XA 1 was designed to have information regularly inserted and removed from it. However, a subtle race in deleting items from the string pool has led to the disabling of the remove() function.

Also, common use cases have demonstrated that removing strings and other data have a negative impact on performance, as that data is often re-loaded. This is especially true for URIs. If data is to be removed from a string pool, it should be done in a background task, rather than immediately.

This has led to an acceptance of the WORM paradigm in the String Pool (Write Once Read Many). Given that the design of the string pool accepted numerous performance compromises in order to allow for arbitrary insertion and removal of data, it is appropriate to consider if there are some simple updates that take advantage of the new assumptions.

Similarly, when the string pool was first written, there was an assumption that we were working with 32 bit values. The ubiquity of 64 bit systems has made this our default data type, and the resulting data space offers a different set of features for us.

XA 1.1 Structure

This structure is an iterative improvement over XA 1, borrowing some features from the XA2 design, along with some original elements. The plan is to gradually improve the design where it is inexpensive to do so, or where new code can be re-used in XA 2.

The String Pool (or more properly, the "Data Pool") is where the majority of the changes are made, as this is the primary bottleneck in XA 1. The String Pool is made up of 3 main components:
  • Map of data to gNodes.
  • Map of gNodes to data.
  • Raw data.

The Raw Data is kept separate to both maps in order to allow these structures to use records of fixed (and hence, removable) size, with a reference to any remaining data that does not fit in the structure.

Map of Data to gNodes

This map is performed using a phased AVL tree. Nodes contain the left and right pointers to the child nodes, the head of the raw data being stored, the size of the data, and a pointer to any remaining data.

Phase management requires copying of tree nodes from one phase to the next, which is something we can eventually overcome. However, the complexity of this structure is such that it will be dealt with in a later iteration.

For the moment, this structure is being left almost untouched. Only the stored data is being updated.

Raw Data

Raw data was being stored in 20 ManagedBlockFiles. Each of these objects refers to a file for holding blocks of data, a "free list" file for managing which blocks are available, and a free list phase file for managing which blocks were allocated in which phase. This meant that this part of the string pool used 60 files.

Each of the 20 ManagedBlockFile objects handled blocks of increasing size (doubling in size for each one), and any data to be stored was put into the smallest block file that could handle it. This meant that up to 50% of the blocks were being left empty. However, having blocks of equal size in each file makes them easy to reuse when they are freed.

The free list files are all used by ManagedBlockFile to allocate which blocks are available, and to store those that have been freed. Since no blocks were ever being release, the free lists never had to grow.

The phase data for the free lists stores a phase number as an integer at the offset for each allocated block. This means that if block 1000 is allocated in phase 42, the phase file will have the 1000th integer (the 4 bytes at offset 4000 bytes) set to the value 42. As the block files grow, so to do the phase maps.

With no need to re-use space (at least, not immediately), this data can be stored packed. Packing also means that it is not necessary to put data of different sizes in different files. Since there are no longer any blocks to re-use, there is no longer a need to manage them with free lists and free list phase files.

Map of gNodes to Data

This file was a raw array of records, with each record containing similar information to the AVL Nodes in the Data to gNode map. The record number (offset divided by the record size) was the gNode value for the data. The data in the record included:
  • type information
  • data length
  • start of data
  • "pointer" to remaining data (pointer included a file ID and block number in that file)

Blank nodes were also in this array, where the type information would be left as "0".

The new design here is the same data described in the last section. That is, the data is packed together. This time, instead of having a pointer to remaining data, the entire data block is stored.

This then leaves the question of mapping a gNode to the offset of the data that it represents. In this case, the gNode can be the offset. That means that most gNode values will not be valid, but since their allocation is controlled internally, this is not a problem.

Blank nodes do not need on-disk representation. Instead, they are allocated with a counter, and the second-top bit is set on them. This makes them easy to allocate and discover. Since we cannot store 2^62 items, there is no concern of conflict with normal gNodes.

Issues

There are some minor gotchas to be managed here.

Offset 0

A gNode of 0 indicates a missing value, so the first offset cannot be returned as a gNode. This means we can either start storing data at offset 1, or else add 1 to all offsets to get the associated gNode. Since "empty" string pools will be initialized with an empty file (and a file pointer initialized to offset 0), it is simpler to do the numeric conversion.

Data Alignment

While each new item to store at the end of the data file can be stored anywhere, modern computer hardware is faster accessing the data if it is aligned on a word boundary. After seeing the performance problem associated with unaligned data, all data has been padded out to the nearest 8 byte boundary. This improved the speed of the system.

File Appending

In some environments a file opened for appending means that the file pointer is immediately set to the end of the file. However, in Ant/JUnit the code changed behavior. So now whenever the file is opened, the position is set.

Also, multiple reads/writes had their own performance impact. This is probably because there was no caching done on the Java side of the JNI calls to the underlying filesystem. So now data is read in using 2 large reads (the first for the header, the second for the data, once the size is known), and written using a single block write.

Directories

Since writing is ALWAYS to the end of the flat file, this offers the chance to avoid seeking altogether when writing large amounts of data. Unfortunately, the AVL tree is going to keep the drive head busy here. With this in mind, a "directory" parameter has been introduced to the NodePool and StringPool elements in the config file. This is actually a set of directories, separated by a path separator character (;* on Windows, and *: on Unix). If a second path is present, then this is the location where the flat file will be stored.

Similarly, reading from the flat file will cause seeks that will slow down the write process. Ideally, a second flat file will be mirrored onto a 3rd drive. This file should be used for reading, while the other file will be used for writing. Performing a commit would result in a copy from the end of the writing file to the end of the "reader" file. Of course, reading past the end of the "reader" file would require a read from the writer file, but the need for this is instantly seen, simply by the value of the gNode (which is the offset).