Memory Allocation - Part I

In this post, we tackle the memory system implementation of Sempervirens. We spent quite some time reviewing the literature, as developing a full-fledged memory allocation system was new to us. We finally settled on the approach presented by Stefan Reinalter on his development blog for the Molecule Engine. Stefan wrote an amazing series of posts on the memory system he came up with, as well as posts on various memory allocation strategies. His dev blog is a fantastic resource that we cannot recommend enough to new developers!

The memory system of Sempervirens consists of different memory arenas. An arena is basically a wrapper around an allocator that does the actual job in the background (i.e. allocating a memory block of the required size and returning a pointer to that block). However, before and after forwarding allocation calls to its allocator, an arena can take care of ancillary functionality such as thread safety, bounds checking, and memory tracking/tagging. The policy-based design followed by Stefan offers an elegant implementation of the relationship between the arena and these orthogonal functionalities, and provides much flexibility. Here is the implementation of our memory arena with allocation, bounds checking, and memory tracking policies:

template<class AllocationPolicy, class BoundsCheckingPolicy, class MemoryTrackingPolicy>
class MemoryArena {
public:
  MemoryArena(AllocationPolicy* allocator);
  ~MemoryArena() = default;

  MemoryArena(MemoryArena const&) = delete;
  MemoryArena& operator=(MemoryArena const&) = delete;
  MemoryArena(MemoryArena&&) = delete;
  MemoryArena& operator=(MemoryArena&&) = delete; 

  void* allocate(std::size_t size, std::size_t alignment, const char* file, int line);
  void deallocate(void* ptr, std::size_t size);

private:
  AllocationPolicy* _allocator;
  BoundsCheckingPolicy _boundsChecking;
  MemoryTrackingPolicy _memoryTracking;
};

Unlike other policies, the allocation policy reflects the type of allocator the arena uses rather than a functionality. That policy is passed by pointer, as different types of allocators usually have different interfaces (e.g. constructors). The other policies follow the same convention of presenting a common interface for three different policy levels: void (not doing anything), default, and extended. For example, the bounds checking policy levels are implemented as:

class VoidBoundsChecking {
public:
  inline void onAlloc(void* const ptr, std::size_t size) const {}
  inline void onDealloc(void* const ptr, std::size_t size) const {}
};

class DefaultBoundsChecking {
public:
  static const std::uint32_t GUARD_VALUE = 0xbcbcbcbc;
  static const std::size_t GUARD_SIZE = sizeof(GUARD_VALUE); 

  inline void onAlloc(void* const ptr, std::size_t size) const {
    APtr<char> p{ptr};
    Memset(p.asVoid(), GUARD_VALUE);
    p += (GUARD_SIZE + size);
    Memset(p.asVoid(), GUARD_VALUE);  
  }
  inline void onDealloc(void* const ptr, std::size_t size) const {
    APtr<char> p{ptr};
    p -= GUARD_SIZE;
    Memcheck(p.asVoid(), GUARD_VALUE);
    p += (GUARD_SIZE + size);
    Memcheck(p.asVoid(), GUARD_VALUE);
  }
};  

class ExtendedBoundsChecking {
public:
  static const std::uint32_t GUARD_VALUE = 0xbcbcbcbc;
  static const std::size_t GUARD_SIZE = sizeof(GUARD_VALUE); 
  static const std::size_t SIZE_SIZE = sizeof(std::uint32_t);
  static const std::size_t PTR_SIZE = sizeof(void*);

  void onAlloc(void* const ptr, std::size_t size) const;
  void onDealloc(void* const ptr, std::size_t size) const;
};

All three levels provide the same interface. The void policy simply inlines empty functions, while the default policy writes front and back guards at the beginning and end of each block upon allocation, and checks them at deallocation. Our extended policy requires a bit more memory to store the size of the allocation as well as a pointer to the next allocation block. This allows the policy to keep/update a global list of all allocation blocks, and to check the guards of all blocks at each allocation and deallocation. The way the extended policy keeps track of all allocation blocks could be customized for each allocation strategy (e.g. allocated blocks follow each other for a linear allocator, and are never individually deallocated), but this starts bringing in dependencies among policies we want to keep orthogonal, and the extended policy is only used in Debug builds, not in Release, anyway. As a side note, the APtr<T> class in the previous code block and some of the following ones is just a simple wrapper around a union of T* and void* with basic pointer arithmetic. We just got tired of having to constantly static_cast pointers around.

As another example, the default memory tracking policy implements the common interface as:

class DefaultMemoryTracking {
public:
  ~DefaultMemoryTracking() {
    SEMPERVIRENS_MSG("Net allocations: %u\n", NUM_ALLOC);
  }
  inline void onAlloc() const { ++NUM_ALLOC; }
  inline void onDealloc() const { --NUM_ALLOC; }

  static std::uint32_t NUM_ALLOC;
};

It simply counts allocations and deallocations, and reports the net result upon destruction, which should be zero if every allocation was matched by a deallocation.

For now, we are not implementing the thread safety policy until we end up needing to allocate/deallocate memory across threads. The memory tagging is also kind of a bonus for extreme debugging that we do not need for the moment.

As mentioned at the beginning, before/after it calls its allocator for allocation/deallocation, the arena will call its different policies, e.g.:

template<class AllocationPolicy, class BoundsCheckingPolicy, class MemoryTrackingPolicy>
void* MemoryArena<AllocationPolicy, BoundsCheckingPolicy, MemoryTrackingPolicy>::
allocate(std::size_t size, std::size_t alignment, const char* file, int line) {
  // Compute total size and offset based on the bounds checking policy
  auto totalSize = size;
  auto offset = std::size_t{0};
  if constexpr (std::is_same_v<BoundsCheckingPolicy, DefaultBoundsChecking>) {
    totalSize += 2 * BoundsCheckingPolicy::GUARD_SIZE;
    offset = BoundsCheckingPolicy::GUARD_SIZE;
  }
  if constexpr (std::is_same_v<BoundsCheckingPolicy, ExtendedBoundsChecking>) {
    totalSize += 2 * BoundsCheckingPolicy::GUARD_SIZE
                 + BoundsCheckingPolicy::SIZE_SIZE
                 + BoundsCheckingPolicy::PTR_SIZE;
    offset = BoundsCheckingPolicy::GUARD_SIZE
             + BoundsCheckingPolicy::SIZE_SIZE
             + BoundsCheckingPolicy::PTR_SIZE;
  }
  auto ptr = _allocator->allocate(totalSize, alignment, offset);

  _boundsChecking.onAlloc(ptr, size);
  _memoryTracking.onAlloc();

  APtr<char> p{ptr};
  p += offset;
  return p.asVoid();
}

template <class AllocationPolicy, class BoundsCheckingPolicy, class MemoryTrackingPolicy>
void MemoryArena<AllocationPolicy, BoundsCheckingPolicy, MemoryTrackingPolicy>::
deallocate(void* ptr, std::size_t size) {
  _boundsChecking.onDealloc(ptr, size);
  _memoryTracking.onDealloc();

  // Shift the user pointer back by the same offset applied in allocate(), and
  // restore the total size, so the allocator frees the block it handed out
  auto totalSize = size;
  auto offset = std::size_t{0};
  if constexpr (std::is_same_v<BoundsCheckingPolicy, DefaultBoundsChecking>) {
    totalSize += 2 * BoundsCheckingPolicy::GUARD_SIZE;
    offset = BoundsCheckingPolicy::GUARD_SIZE;
  }
  if constexpr (std::is_same_v<BoundsCheckingPolicy, ExtendedBoundsChecking>) {
    totalSize += 2 * BoundsCheckingPolicy::GUARD_SIZE
                 + BoundsCheckingPolicy::SIZE_SIZE
                 + BoundsCheckingPolicy::PTR_SIZE;
    offset = BoundsCheckingPolicy::GUARD_SIZE
             + BoundsCheckingPolicy::SIZE_SIZE
             + BoundsCheckingPolicy::PTR_SIZE;
  }

  APtr<char> p{ptr};
  p -= offset;
  _allocator->deallocate(p.asVoid(), totalSize);
}

This pretty much covers the memory arena template in Sempervirens. Concerning the memory allocation itself, several allocation strategies are used for different use cases. So far, we have only considered the linear, stack, and pool allocators. Here is the interface of the linear allocator:

class LinearAllocator {
public:
  LinearAllocator(void* ptr, std::size_t size);
  ~LinearAllocator() = default;

  void* allocate(std::size_t size, std::size_t alignment, std::size_t offset);
  inline void deallocate(void* ptr, std::size_t size) {}
  inline void reset() { _current = _begin; }

  inline void* begin() const { return static_cast<void*>(_begin); }
  inline void* current() const { return static_cast<void*>(_current); }
  inline void* end() const { return static_cast<void*>(_end); }

private:
  char* _begin;
  char* _current;
  char* _end;
};

The linear allocation is the simplest strategy: the allocator holds a view (pointer and size) of a memory block, which can live on the stack or on the heap (e.g. returned by malloc). The allocator keeps track of where the last allocation ends, and simply allocates blocks one after the other. In this strategy, individual blocks are never deallocated; the allocator is simply reset to the beginning of the underlying memory area. The allocate method is implemented as follows:

void* LinearAllocator::allocate(std::size_t size, std::size_t alignment, std::size_t offset) {
  APtr<char> p{_current};
  // offset pointer first, align it, and offset it back
  p += offset;
  p = alignUp(p.asVoid(), alignment);
  p -= offset;
  // advance current ptr past this allocation
  auto nLostBytes = p.asType() - _current;
  _current += nLostBytes + size;
  assert(_current <= _end);
  return p.asVoid();
}

Here the size argument represents the total size of the allocation, including the user requested size, plus any additional bytes the bounds checking policy of the memory arena requires. The offset parameter is also defined by the bounds checking policy before allocation, e.g. in the default bounds checking policy the offset will be the size of the front guard (4 bytes). The alignment parameter corresponds to the intrinsic memory alignment of the type being allocated. Depending on where the current pointer of the allocator is, and the size of the offset, we might not end up on a proper multiple of the required alignment. The alignUp function takes care of that:

inline void* alignUp(void* ptr, std::size_t alignment) {
  return reinterpret_cast<void*>(
      (reinterpret_cast<std::uintptr_t>(ptr) + alignment - 1) &
      ~(alignment - 1));
}

Unless you are fluent in bitwise operations, this takes a little while to get (pen and paper helped!). What the (ptr + alignment - 1) & ~(alignment - 1) does is shift your pointer up by the number of bytes left to the next alignment boundary, if the pointer does not already sit on the required boundary. So for example, if the considered type has an alignment requirement of 4 bytes and the pointer sits one byte away from a boundary, then the pointer is shifted up by 1 or 3 bytes if below or above that boundary respectively. The reinterpret_cast to std::uintptr_t is unfortunately necessary, as bitwise operations are not allowed on pointers, only on integral types; std::uintptr_t is an integral type guaranteed to be large enough to store a pointer. Note that this trick only works when the alignment is a power of two, which is always the case for the alignment requirements of C++ types. If the alignments/offsets do not play well together, we might lose a few bytes between two successive allocations.

In Sempervirens, we use a linear arena to set up the whole application on start up (allocation for all subsystems, e.g. inputs, renderer, etc). This would look something like this:

using Arena_t = MemoryArena<LinearAllocator, VoidBoundsChecking, VoidMemoryTracking>;
auto allocator = LinearAllocator{allocateHeap(SEMPERVIRENS_B(512)), SEMPERVIRENS_B(512)};
auto arena = Arena_t{&allocator};

// create app and input devices
auto app = createApplication(arena);
auto window = SEMPERVIRENS_NEW(Window_t, arena);
app._ptr->setWindow(window._ptr);
auto keyboard = SEMPERVIRENS_NEW(Keyboard_t, arena);
app._ptr->setKeyboard(keyboard._ptr);
auto mouse = SEMPERVIRENS_NEW(Mouse_t, arena);
app._ptr->setMouse(mouse._ptr);
auto renderer = SEMPERVIRENS_NEW(Renderer_t, arena);
app._ptr->setRenderer(renderer._ptr);

app._ptr->run();

// delete app and input devices
SEMPERVIRENS_DELETE(renderer, arena);
SEMPERVIRENS_DELETE(mouse, arena);
SEMPERVIRENS_DELETE(keyboard, arena);
SEMPERVIRENS_DELETE(window, arena);
SEMPERVIRENS_DELETE(app, arena);

Two things to note here. First, in Sempervirens, allocations are always returned as blocks, which are simple templated structs holding the pointer to the allocation as well as the size of the allocation. That way, allocators do not have to store the size of each allocation (usually in its first few bytes); it is passed back to them when a block needs to be deallocated. Second, all allocation/deallocation calls are wrapped up in a few macros that call the corresponding functions for scalar and array types:

#define SEMPERVIRENS_NEW(type, arena, ...) \
  New<type>(arena, __FILENAME__, __LINE__, ##__VA_ARGS__)
#define SEMPERVIRENS_DELETE(block, arena) Delete(block, arena)
#define SEMPERVIRENS_NEW_ARRAY(type, arena, N) \
  NewArray<type>(arena, __FILENAME__, __LINE__, N)
#define SEMPERVIRENS_DELETE_ARRAY(block, arena) DeleteArray(block, arena)

template<typename T, class Arena, typename... Args>
Block<T> New(Arena& arena, const char* file, int line, Args&&... args);

template<typename T, class Arena> 
void Delete(Block<T> block, Arena& arena);

template<typename T, class Arena>
Block<T> NewArray(Arena& arena, const char* file, int line, std::size_t N);

template<typename T, class Arena> 
void DeleteArray(Block<T> block, Arena& arena);

The helper functions distinguish between POD and non-POD types. This provides some optimization for the allocation/deallocation of arrays, where no constructor/destructor calls are needed for POD types. The standard type trait std::is_pod<T> is deprecated since C++20 in favor of more granular type requirements: the standard specifies that this trait is equivalent to the conjunction of the std::is_trivial<T> and std::is_standard_layout<T> type traits.

The allocation helper functions use the placement syntax of the new operator, which calls the placement overload operator new(std::size_t bytes, void* ptr). That way, the arena/allocator is responsible for the memory allocation, while the new expression simply calls the constructor of the considered type at the given address. Since all instances are created with placement new, the delete operator (which would also deallocate the memory) cannot be called on these instances. Instead, the destructors must be called manually as shown below, before the arena takes care of deallocating the memory.

template<typename T>
struct IsPOD {
  static const bool value =
      std::conjunction_v<std::is_trivial<T>, std::is_standard_layout<T>>;
};

template<typename T>
inline constexpr bool isPOD = IsPOD<T>::value;

template<typename T, class Arena, typename... Args>
Block<T> New(Arena& arena, const char* file, int line, Args&&... args) {
  void* ptr = arena.allocate(sizeof(T), std::alignment_of_v<T>, file, line);
  if constexpr (isPOD<T>)
    return {static_cast<T*>(new (ptr) T{std::forward<Args>(args)...}),
            sizeof(T)};
  else
    return {static_cast<T*>(new (ptr) T(std::forward<Args>(args)...)),
            sizeof(T)};
}

template<typename T, class Arena>
void Delete(Block<T> block, Arena& arena) {
  if constexpr (!isPOD<T>)
    block._ptr->~T();
  arena.deallocate(block._ptr, block._size);
}

template<typename T, class Arena>
Block<T> NewArray(Arena& arena, const char* file, int line, std::size_t N) {
  if constexpr (isPOD<T>)
    return {static_cast<T*>(arena.allocate(sizeof(T) * N, std::alignment_of_v<T>, file, line)),
            sizeof(T) * N};
  else {
    union {
      void* as_void;
      T* as_T;
    };

    as_void = arena.allocate(sizeof(T) * N, std::alignment_of_v<T>, file, line);

    // construct instances using placement new
    const T* const onePastLast = as_T + N;
    while (as_T < onePastLast)
      new (as_T++) T;

    // return pointer to the first instance
    return {(as_T - N), sizeof(T) * N}; 
  }
}

template<typename T, class Arena>
void DeleteArray(Block<T> block, Arena& arena) {
  if constexpr (isPOD<T>)
    arena.deallocate(block._ptr, block._size);
  else {
    auto as_T = block._ptr; 

    // Call instances' destructor in reverse order, per C++ std
    auto N = block._size / sizeof(T);
    for (std::size_t i = N; i > 0; --i)
      as_T[i - 1].~T();

    arena.deallocate(block._ptr, block._size);
  }
}

And that is about it for the memory system in Sempervirens for now. We will cover the implementation of the stack and pool allocators when the need arises.