The two most common contiguous memory containers that I personally use are std::vector<T> and std::array<T>. Both guarantee contiguous memory, and when you need to iterate over them, the CPU caches are happy.
While going through the 3D engine code I am working on, I found a pattern that involves vectors, pointers, and objects. Normally this pattern takes the form of a vector holding pointers to the objects.
As you might know, there is no guarantee that the objects pointed to by those pointers are stored in one contiguous memory chunk, which, of course, makes the caches not happy at all.
At the same time, I think it feels natural for a human to think in terms of discrete objects, but as mentioned before, that "natural thinking" can make the CPU slower than if we think about the memory layout of the data first.
So, the experiment I wanted to try was to create a Contiguous Memory Container (CMC) that guarantees contiguous memory for all the stored objects while still providing discrete handling of each individual object. The idea is also that this container stores objects that are bigger than normal variables; otherwise the memory overhead will not be worth the strategy.
https://github.com/llorens-marti/cpp-continuous-memory-container
When I started thinking about this container, I wanted some characteristics to be met:
Based on those characteristics, the internals of the Contiguous Memory Container are implemented with a std::vector<T> that stores the objects, for convenience. The container also uses a minimum of one non-atomic "smart" pointer Ptr<T> as a handle per object. This handle is returned to the consumer so the object can be accessed.
The CMC has 4 internal vectors:
std::vector<T> objects;
std::vector<unsigned int> refCount;
std::vector<unsigned int> ptrOffset;
std::vector<ContPtr<T>*> ptrAddress;
The "objects" vector contains the objects we want to store inside the container.
The "refCount" vector has always the same size as the objects vector, and in every cell it contains the total count of Container pointers Ptr<T>
pointing to that object.
The "ptrOffset" vector acts as a level of indirection between the Ptr<T>
and the object it is pointing too. This allows many Ptrs to be referencing the same object. This vector has always a size equal or greater than "objects" and "refCount".
The "ptrAddress" vector contains the address of the Ptr. This will allow the CMC inner code to update the internals of the Ptr to make sure there are no gaps in memory at any given time. This vector has the same size as "ptrOffset".
To instantiate a Container of type T, we only need to do:
Container<T> c;
The CMC API provides a function to create new objects of type T. This function accepts all the parameters the T constructor might need and returns a Ptr (the ContPtr<T> from the code above) so the consumer can access the object later.
Ptr<ObjA> pA = c.make(param1, param2, ...);
Every Ptr<T> holds two member variables:
Container<T>* c_;
unsigned int index_;
The "c_" variable is a pointer to the container of that type of objects. This pointer will almost never change its value, and it can not be const because sometimes we will need to perform non-const operations through it.
The "index_" variable refers to the pointer information of "ptrAddress" and "ptrOffset" that will allow us to access the correct object.
Ptr<T> has pretty much all the characteristics of a self-managed smart pointer, but without any atomic operations. This should keep execution fast within a single thread.
The API of Ptr<T> allows us to call any function of ObjA with the "->" operator.
pA->someFunction();
It can be copied as normal and the reference count will be managed internally.
Ptr<ObjA> pA = c.make(param1, param2, ...);
Ptr<ObjA> pB = pA;
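To illustrate both behaviors, the "->" operator and the non-atomic reference counting, in isolation, here is a heavily simplified, self-contained toy version of the handle. It skips ptrOffset/ptrAddress and copy assignment entirely, so it is a sketch of the idea rather than the code from the repository.

#include <cstdio>
#include <vector>

struct ObjA {
    int value = 0;
    void someFunction() { std::printf("value = %d\n", value); }
};

// Toy pool: just the objects plus a per-object reference count.
struct ToyContainer {
    std::vector<ObjA> objects;
    std::vector<unsigned int> refCount;
};

// Toy handle: a plain pointer to the container and an index, no atomics anywhere.
class ToyPtr {
public:
    ToyPtr(ToyContainer* c, unsigned int index) : c_(c), index_(index) {
        ++c_->refCount[index_];
    }
    ToyPtr(const ToyPtr& other) : c_(other.c_), index_(other.index_) {
        ++c_->refCount[index_];                 // copying only bumps a plain integer
    }
    ~ToyPtr() { --c_->refCount[index_]; }       // a non-trivial destructor (see the cons later)
    ToyPtr& operator=(const ToyPtr&) = delete;  // omitted to keep the sketch short

    ObjA* operator->() { return &c_->objects[index_]; } // resolve the index to the object

private:
    ToyContainer* c_;
    unsigned int index_;
};

int main() {
    ToyContainer c;
    ObjA a;
    a.value = 7;
    c.objects.push_back(a);
    c.refCount.push_back(0);

    ToyPtr pA(&c, 0);
    {
        ToyPtr pB = pA;                         // refCount[0] goes from 1 to 2
        pB->someFunction();                     // prints "value = 7"
    }                                           // pB destroyed, refCount[0] back to 1
    std::printf("refCount = %u\n", c.refCount[0]);
}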
For this experiment, the Container will keep its internal memory without gaps. This means that when an object is destroyed/removed, the last object will take its place.
To achieve this behavior, it is crucial that the container keeps the address of every Ptr that is pointing to an object; this way the container can update the Ptr's internal "index_" variable to the correct value.
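Here is a minimal, self-contained sketch of that removal step. To keep it short it assumes exactly one handle per object and stores plain ints, so it only shows the swap-with-last move and the handle patching, not the repository's full bookkeeping (refCount, the destroyed handle itself, and so on are left out).

#include <cstdio>
#include <vector>

struct Handle { unsigned int index_; };      // stand-in for Ptr<T>::index_

struct ToyContainer {
    std::vector<int> objects;                // toy payload: plain ints
    std::vector<Handle*> handleAddress;      // address of the handle tied to each object

    // Create an object at the end and register the handle that refers to it.
    void add(int value, Handle* h) {
        h->index_ = static_cast<unsigned int>(objects.size());
        objects.push_back(value);
        handleAddress.push_back(h);
    }

    // Destroy the object a handle refers to, keeping "objects" gapless:
    // the last object moves into the freed slot and its handle is patched.
    void destroy(const Handle& h) {
        unsigned int gap  = h.index_;
        unsigned int last = static_cast<unsigned int>(objects.size()) - 1;
        if (gap != last) {
            objects[gap]       = objects[last];   // move the last object into the gap
            handleAddress[gap] = handleAddress[last];
            handleAddress[gap]->index_ = gap;     // patch the moved object's handle
        }
        objects.pop_back();
        handleAddress.pop_back();
    }
};

int main() {
    ToyContainer c;
    Handle a, b, d;
    c.add(10, &a);
    c.add(20, &b);
    c.add(30, &d);

    c.destroy(b);                            // 30 moves into slot 1, d.index_ becomes 1

    std::printf("a -> %d, d -> %d, size = %zu\n",
                c.objects[a.index_], c.objects[d.index_], c.objects.size());
    // prints: a -> 10, d -> 30, size = 2
}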
As far as I've been able to test, this approach has some characteristics worth taking into account:
Creating an object always happens at the end of the internal vectors. This is important because it means that there will be no "search" for a suitable empty slot while trying to create a new object.
Destroying an object is more expensive. This is because when an object is destroyed, the container needs to copy the last object into its place and update all the associated references. As I will mention in the conclusions, having code in the Ptr destructor makes that destructor non-trivial, and we will need to pay the price.
Some very common scenarios can be:
These objects are not guaranteed to be in contiguous memory, and in those cases, if you want to iterate over them, there is a good chance that the CPU caches are not going to be happy.
So I wanted to see if the CMC was able to make any difference. The test was to instantiate N objects of type A, but between every instantiation of an A object the code would instantiate M objects of type B, so that the A objects would not be contiguous in memory. Something like this:
for (int i = 0; i < 200000; i++) {
    // Instantiate an object of type A
    for (int j = 0; j < 100; j++) {
        // Instantiate an object of type B
    }
}
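The following is not the benchmark from the repository, just a self-contained sketch with the same shape: it scatters A objects across the heap by allocating B objects in between, iterates them through pointers, and compares that with iterating the same kind of objects stored contiguously, with a plain std::vector standing in for the CMC. The counts are smaller than above so it runs with modest memory.

#include <chrono>
#include <cstdio>
#include <memory>
#include <vector>

struct ObjA {
    double data[16] = {};
    double sum() const { double s = 0; for (double d : data) s += d; return s; }
};

struct ObjB { char padding[64]; };   // exists only to scatter the ObjA allocations

int main() {
    const int N = 200000;  // A objects
    const int M = 10;      // B objects allocated between every two A objects

    std::vector<std::unique_ptr<ObjA>> scattered;
    std::vector<std::unique_ptr<ObjB>> noise;
    for (int i = 0; i < N; ++i) {
        scattered.push_back(std::make_unique<ObjA>());
        for (int j = 0; j < M; ++j) noise.push_back(std::make_unique<ObjB>());
    }

    std::vector<ObjA> contiguous(N);             // stand-in for the CMC's contiguous storage

    using Clock = std::chrono::steady_clock;
    double total = 0;

    auto t0 = Clock::now();
    for (const auto& p : scattered) total += p->sum();  // pointer chasing, cache-unfriendly
    auto t1 = Clock::now();
    for (const ObjA& a : contiguous) total += a.sum();  // linear scan, cache-friendly
    auto t2 = Clock::now();

    std::printf("non-contiguous: %f s\n", std::chrono::duration<double>(t1 - t0).count());
    std::printf("contiguous:     %f s\n", std::chrono::duration<double>(t2 - t1).count());
    std::printf("(checksum: %f)\n", total);              // keeps the loops from being optimized away
}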
With that pseudocode in mind, I was able to get these results:
-- execution: 0.120272s
test_with_non_linear_memory_with_regular_vector_and_pointers (6.375813s)
-- execution: 0.026821s <---- !!
test_with_linear_memory_with_experimental_container (3.868148s)
Because the CMC kept all the objects contiguous in memory, it is highly probable that the CPU was able to take advantage of its caches, and the operations could be performed at a much higher speed.
The execution time can vary from machine to machine, but I would like to point out the huge difference between using non-contiguous memory and contiguous memory.
~0.026s / ~0.12s = ~0.21 ----> 21%
Or, said another way, it was about 5 times faster using the CMC than using regular pointers and vectors.
This experiment has been very interesting for learning how important it is to keep the CPU caches happy.
Speaking about the pros, I would list these:
As for the cons that I was able to perceive:
The Ptr<T> destructor is non-trivial, and it can make the deletion process a lot slower.