Lecture Notes: 21 Advanced Malloc
Optimizing a Memory Allocator #
Buddy System #
Buddy trick:
- Power of two allocations.
- We can split a power of two block in half, into two “buddies”.
- e.g. The size 128 range 1024-1152 can be split into two size 64 buddies, 1024-1088 and 1088-1152.
- 64 is 2^6
- The addresses of the two buddies (1024 and 1088) differ only in bit #6 (counting from bit #0).
- That’s 0100 0000 0000 and 0100 0100 0000
- So given either address, we can find the other in O(1) time by XORing with 2^6, the size of each buddy.
- If our allocations have a “used” header bit, then when either is freed we can find the other, check if it’s free, and merge in O(1) time.
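A minimal sketch of the buddy trick in C. The helper names (`buddy_of`, `merged_offset`) are made up for these notes, and offsets are assumed to be measured from the start of a heap whose base is aligned to the heap size. `order` is log2 of the block size, so two 64-byte buddies have order 6.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the buddy of the block at `offset`, where the block is
 * 2^order bytes. Flipping the single bit worth 2^order moves between
 * the low and high halves of the parent block. */
static size_t buddy_of(size_t offset, unsigned order)
{
    size_t size = (size_t)1 << order;
    assert(offset % size == 0);   /* blocks are aligned to their own size */
    return offset ^ size;
}

/* Offset of the parent block formed by merging a block with its buddy:
 * clear the bit that distinguishes the two halves. */
static size_t merged_offset(size_t offset, unsigned order)
{
    return offset & ~((size_t)1 << order);
}
```

With these, `buddy_of(1024, 6)` is 1088 and `buddy_of(1088, 6)` is 1024, and merging either one gives a 128-byte block at 1024.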
Example:
- Start with a constant heap size, e.g. 1MB
- If we run out of space, we allocate a whole new heap.
- So we innately need to handle multiple arenas - each buddy system heap is a separate arena.
- Have an array of free list pointers.
- One bucket for each power of two up to the heap size.
- For a 1MB = 2^20 B heap, that’s slots numbered 0..20. We don’t use index zero since we can’t do 1B allocations, but that’s fine.
- Doubly linked for O(1) insert / remove.
- Allocate 128k
- Once: Split 1MB => 512k, 512k => 256k, 256k => 128k
- Again: No need to split, we’ve got one.
- Again: Split 256k => 128k
- Free the first one, no merge.
- Free the last one: it merges with its free 128k buddy (left over from the last split) back into 256k, but no further.
- Free the middle one, merge back to 1MB.
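The walkthrough above can be traced with a small bookkeeping-only simulation. This is a sketch under simplifying assumptions, not a real allocator: it tracks offsets only (no memory is handed out), each free list is a small fixed array searched linearly rather than a doubly linked list, and the caller passes the order to `buddy_free` instead of it being read from a header. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 20   /* 2^20 B = 1MB arena */
#define MAX_FREE  64   /* capacity of each toy free list (assumption) */

struct buddy {
    size_t free_off[MAX_ORDER + 1][MAX_FREE];  /* free block offsets, per order */
    int    nfree[MAX_ORDER + 1];
};

static void buddy_init(struct buddy *b)
{
    for (int o = 0; o <= MAX_ORDER; o++) b->nfree[o] = 0;
    b->free_off[MAX_ORDER][0] = 0;             /* one free 1MB block at offset 0 */
    b->nfree[MAX_ORDER] = 1;
}

static void push_free(struct buddy *b, int order, size_t off)
{
    assert(b->nfree[order] < MAX_FREE);
    b->free_off[order][b->nfree[order]++] = off;
}

/* Remove `off` from the free list for `order`; returns 1 if it was there. */
static int take_free(struct buddy *b, int order, size_t off)
{
    for (int i = 0; i < b->nfree[order]; i++) {
        if (b->free_off[order][i] == off) {
            b->free_off[order][i] = b->free_off[order][--b->nfree[order]];
            return 1;
        }
    }
    return 0;
}

/* Allocate a 2^order block; returns its offset, or (size_t)-1 if none fits. */
static size_t buddy_alloc(struct buddy *b, int order)
{
    assert(order >= 0 && order <= MAX_ORDER);
    int o = order;
    while (o <= MAX_ORDER && b->nfree[o] == 0) o++;  /* smallest free block that fits */
    if (o > MAX_ORDER) return (size_t)-1;
    size_t off = b->free_off[o][--b->nfree[o]];
    while (o > order) {                              /* split, keeping the low half */
        o--;
        push_free(b, o, off + ((size_t)1 << o));     /* high half goes on the free list */
    }
    return off;
}

/* Free a 2^order block, merging with its buddy while the buddy is free.
 * Returns the order of the block that ends up on the free list. */
static int buddy_free(struct buddy *b, size_t off, int order)
{
    while (order < MAX_ORDER && take_free(b, order, off ^ ((size_t)1 << order))) {
        off &= ~((size_t)1 << order);                /* merged block starts at the low buddy */
        order++;
    }
    push_free(b, order, off);
    return order;
}
```

128k is order 17. Three calls to `buddy_alloc(&h, 17)` on a fresh heap return offsets 0, 128k, and 256k. Freeing them in the order first, last, middle makes `buddy_free` return 17, 18, and 20: the last free merges all the way back to a single 1MB block.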
Problem: Our free list structure is too big. #
struct cell {
    // in both cell and header
    long size;
    long arena;
    bool used;          // needs <stdbool.h> in C
    // not in header
    struct cell* next;
    struct cell* prev;
};
sizeof(cell) = 40
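Where the 40 comes from can be checked with `offsetof`. This sketch assumes an LP64 platform such as AMD64 Linux, where `long` and pointers are 8 bytes; `padding_after_used` is a helper invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cell {
    long size;          /* 8 bytes */
    long arena;         /* 8 bytes */
    bool used;          /* 1 byte, then padding so `next` is 8-byte aligned */
    struct cell *next;  /* 8 bytes */
    struct cell *prev;  /* 8 bytes */
};

/* Bytes of padding the compiler inserts between `used` and `next`. */
static size_t padding_after_used(void)
{
    return offsetof(struct cell, next)
         - (offsetof(struct cell, used) + sizeof(bool));
}
```

On such a platform the one-byte flag costs a full 8 bytes (1 byte of data plus 7 of padding), which gives 8 + 8 + 8 + 8 + 8 = 40.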
On AMD64, a reasonable minimum header size is 8 bytes. Pointers need to be 8-byte aligned, so using an 8-byte header maintains that alignment for the pointer we return to the user.
Similarly, a reasonable minimum cell size is 16 bytes (an 8-byte header plus 8 bytes of payload), since nobody should really expect the allocator to be efficient for allocations smaller than 8 bytes.
So let’s fit in those sizes:
struct cell {
    // in both cell and header (8 bytes)
    uint32_t arena;  // arena number: up to ~4 billion arenas
    uint8_t size;    // store which power of two
    uint8_t used;    // used flag
    uint16_t _pad;
    // not in header (8 more bytes, 16 total)
    int32_t next;    // offset from start of arena, -1 is EOL
    int32_t prev;    // offset from start of arena, -1 is EOL
};
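A sketch of how the compact cell might be used. Since `next` and `prev` are offsets rather than pointers, they have to be converted against the arena's base address, and the block size has to be recovered from the stored exponent. The helper names and the `CELL_EOL` constant are assumptions for illustration; the static assertions check that the layout really hits the 8 and 16 byte targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cell {
    uint32_t arena;
    uint8_t  size;   /* log2 of the block size */
    uint8_t  used;
    uint16_t _pad;
    int32_t  next;   /* offset from start of arena, -1 is end of list */
    int32_t  prev;
};

_Static_assert(offsetof(struct cell, next) == 8, "header part should be 8 bytes");
_Static_assert(sizeof(struct cell) == 16, "free cell should be 16 bytes");

#define CELL_EOL (-1)

/* Turn a stored offset into a pointer, or NULL at the end of the list. */
static struct cell *cell_at(void *arena_base, int32_t off)
{
    return off == CELL_EOL ? NULL : (struct cell *)((char *)arena_base + off);
}

/* Turn a cell pointer back into an offset from the arena base. */
static int32_t cell_offset(void *arena_base, struct cell *c)
{
    return c == NULL ? CELL_EOL : (int32_t)((char *)c - (char *)arena_base);
}

/* Block size in bytes, recovered from the stored power of two. */
static size_t cell_bytes(const struct cell *c)
{
    return (size_t)1 << c->size;
}
```

The tradeoff is an extra add on every list traversal in exchange for halving the link fields. A signed 32-bit offset also caps each arena at 2GB, which is far more than the 1MB heaps used in the example.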
Crazy Chunk Allocator #
Some of the ideas here are loosely based on the jemalloc allocator (originally written for FreeBSD, later developed at Facebook), but much simpler.
How can we do allocations with no minimum size?
- Worst case: 1 byte allocations
- We can’t even have a size field in our allocations, much less stick a linked list in there.
- So we need to put that data outside the allocation.
- Idea: Have a whole page of allocations of the same size, and put our metadata at the beginning of the page.
- Metadata:
- The size of the items in this chunk (8 bytes).
- One bit per item to track which items in the chunk are allocated.
- For 1 byte items in 4k chunk, allocated bitmap:
- 4096 bits is 512 bytes. That leaves 4096-512-8 = 3576 bytes to allocate
- For 2 byte items in 4k chunk, allocated bitmap:
- 2k bits is 256 bytes
- Problem 1: How to find metadata?
- Round address down to the closest multiple of 4096.
- Problem 2: Large allocations don’t fit on a page.
- Not a problem. These allocations will be some number of whole pages, so we can just have a size field at the start of the first page, and we can tell from the size field that it’s a large allocation with no bitmap.
- Problem 3: Medium size allocations (e.g. 2048 bytes).
- Can’t fit two 2k items on one page with metadata.
- Solution: Make the chunks bigger than one page.
- In jemalloc they use 2MB chunks, but that’s probably too big
- The chunk size should be bigger than the largest allocation handled by chunks. Why?
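The single-page version of this scheme can be sketched as follows. The layout and names (`struct chunk_header`, `chunk_of`, and the `slot_*` helpers) are assumptions for illustration, not jemalloc's actual structures. The bitmap is sized by the simple rule used above: one bit per `CHUNK_SIZE / item_size` slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 4096u   /* one page; must be a power of two */

/* Hypothetical chunk layout: 8-byte item size, then the allocated bitmap. */
struct chunk_header {
    uint64_t item_size;
    uint8_t  bitmap[];     /* one bit per item slot */
};

/* Bytes of bitmap needed for items of the given size. */
static size_t bitmap_bytes(size_t item_size)
{
    return (CHUNK_SIZE / item_size + 7) / 8;
}

/* Bytes left for items after the size field and the bitmap. */
static size_t usable_bytes(size_t item_size)
{
    return CHUNK_SIZE - sizeof(uint64_t) - bitmap_bytes(item_size);
}

/* Problem 1: find the metadata by rounding the address down to the
 * nearest multiple of CHUNK_SIZE. */
static struct chunk_header *chunk_of(const void *p)
{
    return (struct chunk_header *)((uintptr_t)p & ~(uintptr_t)(CHUNK_SIZE - 1));
}

/* Test, set, and clear the allocated bit for slot i. */
static int slot_used(const struct chunk_header *h, size_t i)
{
    return (h->bitmap[i / 8] >> (i % 8)) & 1;
}

static void slot_set(struct chunk_header *h, size_t i)
{
    h->bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

static void slot_clear(struct chunk_header *h, size_t i)
{
    h->bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}
```

For 1-byte items this gives a 512-byte bitmap and 3576 usable bytes, matching the arithmetic above. For 2-byte items the bitmap is 256 bytes. Moving to multi-page chunks only changes `CHUNK_SIZE`, as long as each chunk is aligned to its own size so that the mask in `chunk_of` still lands on the header.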