General purpose memory allocator, better and safer than dlmalloc and ptmalloc
This is a general purpose memory allocator written in ANSI C (C99). It has different design considerations than most other allocators, which makes it unique. Its source is relatively simple and straightforward (compared to other allocators) and more bullet-proof, and it was designed to be thread-safe, yet portable and independent of the threading implementation.
Unlike Doug Lea's allocator and just like jemalloc, this allocator does not handle the whole memory, only programmer-defined blocks, so-called arenas. Unlike dlmalloc (which has been used in glibc for a long time) and its successor, ptmalloc, this allocator does not mix allocator data with application data. That can lead to serious failures (imagine what would happen if the application data happened to contain the same footer magic as the allocator's footer at an unlucky offset). The bztalloc allocator makes no assumptions about the content of the application data at all (like hopefully not containing certain magic bytes); instead it keeps allocator data separated.
It provides a relatively small memory footprint (a lot smaller than jemalloc) and very good performance for the common case. But for edge cases (like implementing something insane in C++ which needs hundreds of thousands of new/free calls every second), jemalloc will perform better, for the price of more memory overhead and considerably more complex structures and algorithms. I'd suggest avoiding allocating and freeing small pieces of memory all the time in the first place.
Also, unlike Hoard and Linux's slab allocator (both originated from SUN Solaris' allocator), it does not use fixed-sized heaps and does not suffer from fragmentation due to limited splitting, merging and freeing of blocks. Arenas are divided into chunks of the same quantum according to the application's usage. When the application hasn't allocated any memory of, let's say, 32 bytes, then there will be no 32-byte quantum chunk for sure (not even free ones). Also, bztalloc uses considerably less memory for its structures than Hoard, yet by having as many arenas as CPU cores it can guarantee cache locality for each core.
Unlike other thread-aware allocators, bztalloc does not depend on pthreads, and it can easily be used with any threading model. It can also be used to allocate thread-local, process-local (thread-global) and shared (process-global) memory, just by passing different (but proper) arguments to it.
- PAGESIZE - page frame size, which is CPU MMU dependent. Mostly 4096 bytes, but the allocator works with any power of two size.
- SLOTSIZE - a big memory block as handled by the CPU MMU. Mostly 2M (pagesize/sizeof(uintptr_t)*pagesize), but could be any other size.
- ALLOCSIZE - a block that is allocated from the VMM if the requirement is bigger than PAGESIZE. A multiple of SLOTSIZE.
- ARENASIZE - a block that contains the arena pointers. Usually equals SLOTSIZE (meaning an arena can have 256k fragments).
- quantum - the memory unit in which memory is allocated by the application.
- chunk - an array which has a small header, an allocation bitmap, and contains quantum sized elements.
- fragment - bztalloc describes memory with independent chunks. One fragment has exactly one chunkmap_t. How much memory it can describe depends on the quantum size of that fragment.
The smallest unit bztalloc can handle is 8 bytes. When the application allocates memory, the smallest fitting power-of-two quantum size is used. For example, allocating 211 bytes and 245 bytes will both use a quantum size of 256, and they will end up in the same chunk.
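As a rough illustration of that rounding rule (the helper below is not part of the bztalloc API, just a sketch of the behaviour described above):

```c
#include <stddef.h>

/* Sketch only: the smallest power of two that is at least 8 bytes
 * and can hold the requested size (overflow handling omitted). */
static size_t quantum_for(size_t size)
{
    size_t q = 8;               /* smallest unit bztalloc handles */
    while (q < size)
        q <<= 1;
    return q;
}

/* quantum_for(211) == 256 and quantum_for(245) == 256, so both
 * allocations land in a 256-byte quantum chunk. */
```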
Depending on the quantum size, one of three different allocation strategies takes place, for small, medium and large quantums respectively.
Note that the allocator does not make assumptions about quantum sizes and their element counts; it allocates chunks dynamically according to the application's usage. For example, it's perfectly valid to have 10 fragments, all of which contain 8-byte quantum blocks, only one fragment for the 512-byte quantum, and no fragments for medium or large quantums at all.
The allocator operates on arenas. It's up to the programmer to define these arenas according to what their code needs. Only one allocator instance is allowed to operate on one arena at any given time. So if you have several concurrent threads, I'd suggest defining more arenas (using threadid mod numcpucore for example). Note that the arenas only describe the address space, not RAM. Therefore the minimum physical RAM requirement for an arena is one page only (4096 bytes), regardless of how big an address space it covers. This also means that there's no upper limit on the address space of an arena, so if you use more than one, make sure you use an arena address which does not overlap with the previous arena (depending on what you have allocated in it). Luckily the 64 bit address space is so huge that this can easily be done.
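A hypothetical way to pick one arena per CPU core; the addresses, the core count and the `arena_for_thread()` helper are all made-up names for this sketch, not part of bztalloc:

```c
#include <stdint.h>

/* Example layout: NUMCORES arenas, each covering a fixed,
 * non-overlapping slice of the 64 bit address space. */
#define NUMCORES      4
#define ARENA_BASE    0x0000200000000000UL   /* example start address    */
#define ARENA_SPACING 0x0000010000000000UL   /* example per-arena spacing */

static uint64_t *arena_for_thread(unsigned long tid)
{
    /* "threadid mod numcpucore", as suggested above */
    return (uint64_t*)(ARENA_BASE + (tid % NUMCORES) * ARENA_SPACING);
}
```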
The allocator does not contain any thread-related code at all. Instead it delegates that to three functions which the programmer (or the underlying OS) must provide. Therefore this allocator can be used with LWPs just as well as with pthreads, out-of-the-box.
The run-time dependency is minimal: the arena pointer must point to a single, empty, zeroed-out page. You can use your linker script for that, or you can call mmap()+memzero() from your libc's initialization code, or you could allocate that page from a page fault handler if you already have a neat CoW paging implementation. Apart from that, the allocator depends on seven functions, out of which only four are mandatory. These functions are:
```c
void seterr(int errno);
```
This function sets the POSIX errno in a thread-safe way. If you do not care about threads, simply use `#define seterr(x) errno=x`. But for pthreads, for example, you must set a statically allocated int variable that is local to the current thread (which you should already have for other libc calls).
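For example, with C11 thread-local storage this could look like the following sketch (assuming you control how errno is defined in your libc):

```c
/* Sketch: one errno per thread. In a real libc the errno macro
 * usually already expands to such a per-thread object. */
static _Thread_local int errno_tls;

void seterr(int e)
{
    errno_tls = e;
}
```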
```c
void lockacquire(int bit, uint64_t *ptr);
void lockrelease(int bit, uint64_t *ptr);
```
Used to lock and unlock an arena. Again, if you're not into threads or shared memory, these can be empty macros or functions doing nothing. Otherwise, for shared memory they must be atomic, and if lockacquire fails to take the lock, you must yield to give the CPU to another thread or process; failing to do so could cause a deadlock. One implementation of these atomic locking functions for the x86_64 architecture could be (for the fastcall ABI, in AT&T syntax):
```asm
lockacquire:
1:  lock btsq %rdi, (%rsi)          # atomically test and set the lock bit
    jnc   1f                        # carry clear: bit was free, we own the lock
    movl  $SYS_sched_yield, %eax
    syscall                         # bit was set: yield the CPU, then retry
    jmp   1b
1:  ret

lockrelease:
    lock btrq %rdi, (%rsi)          # atomically clear the lock bit
    ret
```
The allocator always calls these functions with bit 63 (meaning the most significant bit of a long int) and a pointer to the arena. It passes the bit anyway because it assumes you already have a universal locking function that can accept any bit from 0 to 63.
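If you would rather stay in C than assembly, a minimal sketch using the GCC/Clang `__atomic` builtins could look like this (it assumes a POSIX-like host where `sched_yield()` is the proper way to give up the CPU):

```c
#include <stdint.h>
#include <sched.h>   /* sched_yield(); assumes a POSIX-like host */

/* Atomically set the given bit; if it was already set, somebody else
 * holds the lock, so yield and retry. */
void lockacquire(int bit, uint64_t *ptr)
{
    uint64_t mask = (uint64_t)1 << bit;
    while (__atomic_fetch_or(ptr, mask, __ATOMIC_ACQUIRE) & mask)
        sched_yield();
}

void lockrelease(int bit, uint64_t *ptr)
{
    __atomic_fetch_and(ptr, ~((uint64_t)1 << bit), __ATOMIC_RELEASE);
}
```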
For pthreads, you could simply implement these with (note we don't use the arguments at all, as we have a single mutex object):
```c
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
#define lockacquire(b,p) while(pthread_mutex_trylock(&mtx))pthread_yield()
#define lockrelease(b,p) pthread_mutex_unlock(&mtx)
```
Or, for better performance, you could write a function that uses a different mutex for each arena pointer, should you have more than one arena.
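A sketch of that per-arena variant; the table size and the way the arena pointer is hashed are arbitrary choices for the example:

```c
#include <pthread.h>
#include <stdint.h>

#define NUMARENAS 4
static pthread_mutex_t mtx[NUMARENAS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Map an arena pointer to one of the mutexes (2M granularity here,
 * purely as an example). */
static pthread_mutex_t *arena_mutex(uint64_t *arena)
{
    return &mtx[((uintptr_t)arena >> 21) % NUMARENAS];
}

void lockacquire(int bit, uint64_t *ptr) { (void)bit; pthread_mutex_lock(arena_mutex(ptr)); }
void lockrelease(int bit, uint64_t *ptr) { (void)bit; pthread_mutex_unlock(arena_mutex(ptr)); }
```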
```c
void memzero (void *dest, size_t n);
void *memcpy (void *dest, void *src, size_t n);
```
Not much to say about these; they are well-known libc functions, sometimes provided as compiler intrinsics. The first zeroes out a memory region, the second copies non-overlapping areas efficiently.
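If you build freestanding and don't have them yet, naive byte-by-byte fallbacks are enough to get started (real implementations are of course much faster):

```c
#include <stddef.h>

void memzero(void *dest, size_t n)
{
    unsigned char *d = dest;
    while (n--) *d++ = 0;
}

void *memcpy(void *dest, void *src, size_t n)
{
    unsigned char *d = dest, *s = src;
    while (n--) *d++ = *s++;
    return dest;
}
```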
```c
void *mmap (void *addr, size_t len, int prot, int flags, -1, 0);
int munmap (void *addr, size_t len);
```
These are provided by the standard POSIX memory management library (sys/mman.h). They allocate physical RAM into the address space and release it back to the OS. This allocator does not use the file mapping part of these routines, only MAP_ANONYMOUS mappings, which means the last two arguments of mmap are always -1 and 0. This makes the VMM implementation in kernels easier (no need for an FS subsystem) while retaining compatibility with the POSIX standard. Also note that bztalloc minimizes these calls by allocating and freeing ALLOCSIZE (2M) blocks if possible (medium and large quantums), or PAGESIZE blocks if the quantum is smaller than PAGESIZE.
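For reference, the shape of the calls the allocator ends up making looks roughly like this (a sketch; `grab_block` and `drop_block` are illustrative names, not bztalloc internals):

```c
#include <stddef.h>
#include <sys/mman.h>

#define ALLOCSIZE (2*1024*1024)   /* the 2M block size mentioned above */

/* Anonymous mapping only: fd is always -1, offset is always 0. */
static void *grab_block(void *hint, int flag)
{
    return mmap(hint, ALLOCSIZE, PROT_READ | PROT_WRITE,
                flag | MAP_ANONYMOUS, -1, 0);
}

static int drop_block(void *addr)
{
    return munmap(addr, ALLOCSIZE);
}
```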
The API is very simple and self-explanatory; it has only two functions, plus one for debugging purposes.
```c
void bzt_free(uint64_t *arena, void *ptr)
```
Frees previously allocated memory. If you accidentally pass NULL as ptr, it won't crash or misbehave in any way. If you pass a ptr which was not previously allocated with bztalloc, then errno is set to EFAULT.
```c
__attribute__((malloc)) void *bzt_alloc(uint64_t *arena, size_t align, void *ptr, size_t size, int flag)
```
The allocator. The first argument (`arena`) is the arena descriptor, which must be mapped into the address space before bztalloc uses it. On the first call it must be one page (4096 bytes) long, page-aligned, and zeroed out (probably by the run-time linker or the libc initialization code). Subsequent calls and resizing of the arena descriptor are taken care of by the allocator, transparently to the programmer.
The second argument (`align`) specifies an alignment requirement for the allocation. It must be a power of two (8, 16, 32, etc.). If the quantum is bigger than the alignment, then alignment to the quantum is guaranteed. For example, allocating 256 bytes with an 8-byte alignment requirement will also guarantee 256-byte alignment. Allocating 8 bytes with 16-byte alignment is also valid.
The third argument (`ptr`) can be a pointer returned by a previous bzt_alloc() call. It is only used if you reallocate previously allocated memory; for new allocations it must be NULL. When reallocating, bztalloc may return a different pointer than it received, but it tries to return the same address if possible.
The fourth argument (`size`) specifies the size that you want to allocate (or reallocate to).
The last argument (`flag`) will be passed to the mmap call, so that you can use MAP_SHARED for example. It also understands one bztalloc-specific bit, MAP_SPARSE (0x40). By default, for security reasons, bztalloc returns zeroed-out memory; if you don't want that, you can pass MAP_SPARSE along with the other mmap flags.
Returns the address of the newly allocated memory (zeroed out by default), or NULL on error with errno set to ENOMEM. That can happen if mmap() failed, or if all fragments for the quantum are full and there's no more space for a new fragment. In that case you should free some memory first (probably), or you can also create a new arena dynamically (unlikely).
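Putting the above together, a typical call sequence could look like this (assuming `arena` already points to a properly prepared page as described earlier):

```c
#include <stdint.h>
#include <sys/mman.h>   /* MAP_PRIVATE */
#include "bztalloc.h"

void example(uint64_t *arena)
{
    /* new allocation: 8 byte alignment, no previous pointer */
    char *buf = bzt_alloc(arena, 8, NULL, 1000, MAP_PRIVATE);
    if (!buf)
        return;                       /* errno is ENOMEM here */

    /* grow it; bzt_alloc may move the data, so use the returned pointer */
    char *bigger = bzt_alloc(arena, 8, buf, 4000, MAP_PRIVATE);
    if (bigger)
        buf = bigger;

    bzt_free(arena, buf);
}
```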
```c
void bzt_dumpmem(uint64_t *arena)
```
Dumps the memory map and allocation info of an arena if compiled with the DEBUG define. It uses a printf-compatible dbg_printf() function to generate the output. For simplicity, you can use `#define dbg_printf printf`.
In its simplest form, bztalloc can be used to provide the libc malloc functions with the following defines:
```c
#define malloc(s)          bzt_alloc((void*)BSS_ADDRESS,8,NULL,s,MAP_PRIVATE)
#define calloc(n,s)        bzt_alloc((void*)BSS_ADDRESS,8,NULL,n*s,MAP_PRIVATE)
#define realloc(p,s)       bzt_alloc((void*)BSS_ADDRESS,8,p,s,MAP_PRIVATE)
#define aligned_alloc(a,s) bzt_alloc((void*)BSS_ADDRESS,a,NULL,s,MAP_PRIVATE)
#define free(p)            bzt_free ((void*)BSS_ADDRESS,p)
```
This assumes there's an empty, zeroed-out page mapped at BSS_ADDRESS at start.
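One way to provide that page from your libc's start-up code could be the following sketch (BSS_ADDRESS is whatever fixed, otherwise unused address you picked; error handling omitted):

```c
#include <sys/mman.h>

#define BSS_ADDRESS 0x0000400000000000UL   /* example value, pick your own */

/* Map one page at BSS_ADDRESS before the first bzt_alloc() call.
 * On a POSIX host MAP_ANONYMOUS memory is already zero-filled; in
 * other environments call memzero() on it as mentioned earlier. */
void init_arena(void)
{
    mmap((void*)BSS_ADDRESS, 4096, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```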
It's easy to use bztalloc for shared memory too. All you have to do is pass MAP_SHARED in the flags instead of MAP_PRIVATE. This is not checked, but it's very important that you consistently use either MAP_PRIVATE or MAP_SHARED for a specific arena. The same requirement stands for the shared memory arena too: it has to be an empty, zeroed-out page before the first call to bztalloc.
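A minimal sketch of that shared setup on a POSIX host (it relies on the fact that anonymous MAP_SHARED mappings created before fork() are inherited by the child; the address is an arbitrary example):

```c
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>
#include "bztalloc.h"

#define SHARED_ARENA ((uint64_t*)0x0000300000000000UL)  /* example address */

int main(void)
{
    /* one empty, zeroed, shared page at a fixed address, before any allocation */
    mmap(SHARED_ARENA, 4096, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    /* always MAP_SHARED for this arena, never mixed with MAP_PRIVATE */
    int *counter = bzt_alloc(SHARED_ARENA, 8, NULL, sizeof(int), MAP_SHARED);

    if (counter && fork() == 0) {
        *counter = 42;        /* the mapping existed before fork(),
                                 so the parent sees this write too  */
        _exit(0);
    }
    return 0;
}
```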
The first ARENASIZE (2M) slot is used to store the arena fragment pointers. That can hold 256k pointers (unless you define ARENASIZE otherwise). Each pointer points to an allocmap_t structure, followed by an application data slot. The layout of the allocation map depends on the quantum size, but it's always in a different slot than the application data. This does not mean RAM is also allocated: it's a very common case that the allocation map slot has only its first page mapped (using 4096 bytes of RAM), and the data slot too (another 4096 bytes). That means that although the allocator uses 4 megabytes of address space, it only needs 8 kilobytes of physical RAM for it.
That's all, hope it will be useful,
bzt