Performance Comparison Between X42_VER_V4b and X42_VER_V5

X42_VER_V5 introduces a radically different animation data encoding to the x42 format. The new format is much more efficient than the previous incarnation, to the point where scanning for the current pair of keys to lerp between has gone from one of the heaviest functions to one that's barely noticed when profiling.

The Old Way

File versions X42_VER_V0 through X42_VER_V4b broke the animation keys down by key type, by time and value channels, and then by bone. Finding the values to lerp between involved searching (either linearly or using a binary search) the times array to find the correct value. This was done three times per bone (for position, rotation, and scale). Once the values were found, the values were pulled from the parallel values array and interpolated.

Note that the searches are spread over many separate arrays (though, in fairness, many of these arrays were small and adjacent) causing many data cache misses.

The New Way

The animation system is now modeled as a state machine that stores a pair of time-value pairs to interpolate between for each key type of each bone. A specially sorted stream of bone-type-time-value tuples is used to initialize and update the state. The start of the stream contains two key stream elements per bone-type pair and is used to initialize the state machine. Subsequent keys update each pair by replacing its first value with the second and then replacing the second value with that stored in the key.

A running maximum of the frame values in the first pair element is maintained, and represents the number of the minimum frame that can be interpolated from the current pairs. The maximum frame that can be interpolated is the frame number of the element that will be replaced by the next key stream entry (or the number of frames in the file if there are no more elements). Keys are read and processed until the desired frame is bounded by the minimum and maximum valid frame numbers.

While this approach has a high setup cost it is extremely cheap to advance the animation to the next frame. Often, the desired frame will still be within the valid range from the last use and nothing need be done. In the case where the animation must advance to the next set of keys only a few elements from the key stream need be read. Moving back in the animation is, however, very expensive as it requires resetting the state machine and then advancing to the desired frame - though this cost can also be mitigated by caching intermediate states (allowing one to reset to a closer frame to one's target than frame zero).

Interpolating the actual animation values is done in a separate pass by looping over each pair, reading the value indexed by that pair (values are now palletized rather than running parallel to the time values), and interpolating the final result.

Test and Results

The performance implications of the change have been checked out in a character-heavy test level.

The Setup

The test level consisted of a simple square floor (to keep world drawing from showing up in the profile) with one light. 56 characters were then spawned into this level and were left to play their idle animations for several seconds, 556 frames of which were recorded. Characters were rendered using their roughest LOD (and would not have been drawn at all except that it was important that we visually confirm that the new algorithm was indeed working).

The characters in the scene had an average of 36 bones, yielding roughly 2016 bones in the scene.

Summary

The following table summarizes our results. Each row is explained more fully below:

TestKeys ReadState Updates
Baseline3,362,6883,362,688
New Code, Basic Caching2,057,650711,342
Combining Groups1,091,154733,212
Better Caching937,233622,046
One Scan for All Groups867,471701,816
Even Better Caching804,497612,584
With Start Points687,610513,174

Baseline

The old algorithm did the same amount of work each frame regardless of the desired pose. In the best case, searching each bone's tracks would return the correct position in one operation. This gives us (2016 bones * 556 frames * 3 key types) 3,362,688 memory lookups over the course of the recorded test. Note that this is the best case and was certainly not representative of any real measurements (which, unfortunately, we did not collect with the old code). Earlier testing suggests that the searches took an average of 1.6 iterations per bone, per key type (which makes the number of lookups even higher).

New Code, Basic Caching

Running the same test using the new system with a very simple frame-to-frame caching algorithm (maintain one cache per character) yielded a surprising result: only 2,057,650 key stream entries were read, and only 711,342 of those keys actually caused updates (our model skeletons are split into multiple animation groups, each of which was animated separately - skipped keys belong to groups other than the one we're currently processing).

Performance had also visibly increased. This was partly due to the much better cache coherency exhibited by the new system since keys are always in order and processed sequentially.

Also, note that this system is already superior to the old system in terms of the number of times we hit memory. This is due to the large number of bone attributes (most position and scale values) that simply do not animate. These are set up once and then these bones are never touched again.

Combining Groups

This was an incremental improvement on the prior code, except that it updates adjacent groups of bones together whenever possible.

Better Caching

This was an improvement on the prior caching algorithm. It allowed two characters, with the same mode, who were animating in lock-step to share a single cache. It also allowed characters to trade caches in order to avoid updates. For example:

Character one plays frames 50-60 in a loop while character two plays 70-80 in a loop. Previously both caches had to be reset each time their character's animation looped. With the new code the characters would swap their caches each time they both loop, allowing character two to avoid resetting its cache (since it could just increment character one's cache from frame 60 up to 70).

One Scan for All Groups

The scan code was re-worked to update all groups at once. While it did cause some groups to read more keys than strictly necessary, it reduced the total number of scanned keys since the algorithm no longer had to read each key once per batch of groups (discarding it when it belonged to another animation group's bone).

Even Better Caching

Another incremental improvement to the caching algorithm. This prioritized the returned caches better. Previously if a character needed a cache for frame 60, and there were two available, one already at 25 and one at 35, the cache manager would return whichever it had seen first. The new cache manager made sure to return the closer cache.

This is a bit of a milestone as well, since we're now not just suffering fewer cache misses while processing the key stream, but we're also definitely reading less memory as well (each key stream element is four times the size of the time value array elements stored in the old system).

With Start Points

This update introduced the idea of start points to our engine. Start points are partial cache states, created at various frames and spaced evenly throughout the file. They often allow us to jump back to a point much closer to the target frame than the beginning of the animation.

Regrets

No attempt was made to see how the old system would perform if coupled with a cache. This is unfortunate as it would be interesting to know how well it might have benefited from frame-to-frame coherency. I suspect it would have performed similarly to the new system in terms of memory words read, though it would still suffer from poor cache coherency.