Discussion:
[musl] What's wrong with musl's malloc
Rich Felker
2018-04-20 20:09:04 UTC
I've talked on and off about musl's malloc needing to be overhauled or
replaced, and gaining the ability to experiment with that is one of
the motivations for making malloc interposable/replaceable. Here's a
brief write-up of what's wrong that needs to be addressed:


The main benefit of musl's malloc vs the standard dlmalloc algorithm
it's based on is its fine-grained locking. As long as there are binned
free chunks of various sizes available, threads calling malloc will
only contend for a lock when they're requesting allocations of the
same or similar size. This works out well under artificial random
loads; I'm not sure how much it helps on typical real-world loads.


In any case, the fine-grained locking also created a problem I didn't
anticipate when I designed it: when allocating memory that involves
splitting a free chunk, or when freeing memory that will get merged
with an adjacent free chunk, the operation as a whole is not atomic;
rather, a large free chunk is consumed, possibly emptying the bin it
lies in, split or merged, then returned either to the same or a
different bin. By saying this operation is non-atomic, I mean other
threads see the intermediate state where the large free chunk has been
consumed but not yet returned, and when this happens, they
unnecessarily split another free chunk or expand the heap rather than
waiting for it to be returned. This yields bad, potentially
catastrophic, fragmentation.
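To make the window concrete, here's a toy single-threaded model of
that interleaving (simplified counters standing in for musl's real bin
structures; this is not actual allocator code):

```c
#include <assert.h>

/* Toy model of the race described above: bins[i] counts free chunks in
 * size class i. Thread A's split is modeled as unbin-then-rebin; the
 * "observer" models thread B looking at the bin between those steps. */
static int bins[64];

static int bin_has_chunk(int i) { return bins[i] > 0; }

/* Thread A, step 1: consume the large chunk (its bin may go empty). */
static void begin_split(int large) { bins[large]--; }

/* Thread A, step 2: return the leftover to a (possibly different) bin. */
static void finish_split(int leftover) { bins[leftover]++; }

static int demo_transient_empty(void) {
    bins[40] = 1;              /* one large free chunk exists */
    begin_split(40);
    /* A concurrent malloc running here sees bin 40 empty and expands
     * the heap instead of waiting -- the fragmentation-causing state. */
    int b_saw_empty = !bin_has_chunk(40);
    finish_split(24);
    return b_saw_empty;        /* 1: the intermediate state was visible */
}
```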

The malloc side already partially mitigates this with the "pretrim"
function, which is used whenever the leftover part after splitting
would remain in the same bin it was already in. In particular this
covers splitting small allocations off a large "top of heap" zone, but
it doesn't help with splitting small chunks. And the free side is much
worse -- large free zones momentarily disappear every time something
adjacent to them is freed.

In order to overcome this fully while maintaining the fine-grained
locking, we'd need the ability to hold multiple locks concurrently,
which would require a protocol for lock acquisition order to avoid
deadlock. But there may be cheap mitigations that could be made in
free, like the one on the malloc side. For instance, it might be
possible to make a simpler protocol whereby we could always hold the lock for bin
63, thereby avoiding exposing a state where large free zones are
unavailable. If there's a way to make this work, moving the binmap
masking for newly-emptied bins from unbin() to unlock_bin() would
ensure that malloc doesn't "see" a "temporary empty" state.
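A minimal sketch of that idea, with names loosely following malloc.c
(the pending_clear flag is a hypothetical addition, and a real version
would use atomics for the binmap update):

```c
#include <stdint.h>

/* Sketch of the proposal above: clear a bin's binmap bit in
 * unlock_bin() instead of unbin(), so no other thread can observe the
 * bin as empty while its chunk is merely in transit. */
static uint64_t binmap;
static int pending_clear[64];

static void unbin_model(int i, int bin_now_empty) {
    /* current code clears binmap here, exposing the transient state */
    if (bin_now_empty) pending_clear[i] = 1;   /* proposed: defer it */
}

static void unlock_bin_model(int i) {
    if (pending_clear[i]) {
        binmap &= ~(1ULL << i);    /* cleared only as the lock drops */
        pending_clear[i] = 0;
    }
    /* release of the actual bin lock elided */
}
```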


In the long term, I think the whole malloc design we're using now is
wrong, and should be replaced, but until the time comes to actually do
that, it may make sense to try to improve some of the worst
properties, or even to reduce or eliminate the granularity of the
locking if it proves not to be very valuable on real-world loads.

Rich
Rich Felker
2018-04-21 05:28:12 UTC
Post by Rich Felker
I've talked on and off about musl's malloc needing to be overhauled or
replaced, and gaining the ability to experiment with that is one of
the motivations for making malloc interposable/replaceable. Here's a
The main benefit of musl's malloc vs the standard dlmalloc algorithm
it's based on is its fine-grained locking. As long as there are binned
free chunks of various sizes available, threads calling malloc will
only contend for a lock when they're requesting allocations of the
same or similar size. This works out well under artificial random
loads; I'm not sure how much it helps on typical real-world loads.
In any case, the fine-grained locking also created a problem I didn't
anticipate when I designed it: when allocating memory that involves
splitting a free chunk, or when freeing memory that will get merged
with an adjacent free chunk, the operation as a whole is not atomic;
rather, a large free chunk is consumed, possibly emptying the bin it
lies in, split or merged, then returned either to the same or a
different bin. By saying this operation is non-atomic, I mean other
threads see the intermediate state where the large free chunk has been
consumed but not yet returned, and when this happens, they
unnecessarily split another free chunk or expand the heap rather than
waiting for it to be returned. This yields bad, potentially
catastrophic, fragmentation.
The malloc side already partially mitigates this with the "pretrim"
function, which is used whenever the leftover part after splitting
would remain in the same bin it was already in. In particular this
covers splitting small allocations off a large "top of heap" zone, but
it doesn't help with splitting small chunks. And the free side is much
worse -- large free zones momentarily disappear every time something
adjacent to them is freed.
In order to overcome this fully while maintaining the fine-grained
locking, we'd need the ability to hold multiple locks concurrently,
which would require a protocol for lock acquisition order to avoid
deadlock. But there may be cheap mitigations that could be made in
free, like the one on the malloc side. For instance, it might be
possible to make a simpler protocol whereby we could always hold the lock for bin
63, thereby avoiding exposing a state where large free zones are
unavailable. If there's a way to make this work, moving the binmap
masking for newly-emptied bins from unbin() to unlock_bin() would
ensure that malloc doesn't "see" a "temporary empty" state.
One protocol that might work:

When locking multiple bins, they must be locked in decreasing order.

For malloc, this is the natural thing to do -- the bin you return the
excess to will be the same or smaller than the bin you allocate from.

For free, take the freelock from the beginning, rather than just
momentarily at the end. Then you know the bin sizes you'll be working
with (up to at most 3 -- self, prev, and next) and can lock them in
the protocol order, merge them, and bin the result without any
looping. The MADV_DONTNEED operation (the most expensive and
contention-generating part when it happens) can be deferred until
after the freelock and all but one of the bin locks are released.
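The locking discipline above can be sketched like this (lock_bin and
unlock_bin correspond to musl's functions; everything else here is
illustrative, not real malloc.c code):

```c
/* Decreasing-order protocol: once the freelock pins down the (at most
 * three) bins involved -- self, prev, next -- sort their indices and
 * take the bin locks from highest index to lowest. */
static void sort_desc(int *idx, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (idx[j] > idx[i]) {
                int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
            }
}

/* free() under the protocol (locking calls stubbed out as comments):
 *   take freelock;
 *   idx[] = { bin(self), bin(prev), bin(next) }; sort_desc(idx, n);
 *   for each i: lock_bin(idx[i]);            -- decreasing order
 *   merge the chunks and bin the result, no looping;
 *   release freelock and bin locks; defer MADV_DONTNEED until after. */
```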

Overall I like this because it creates an opportunity to simplify the
code. My guess is that performance would be roughly the same, or at
least not significantly worse (and probably significantly better for
single-threaded use), and the chances at catastrophic fragmentation
from contention would be gone.

I still think a complete redesign is the right way forward in the long
term.

Rich
Rich Felker
2018-04-21 05:52:36 UTC
Post by Rich Felker
When locking multiple bins, they must be locked in decreasing order.
For malloc, this is the natural thing to do -- the bin you return the
excess to will be the same or smaller than the bin you allocate from.
For free, take the freelock from the beginning, rather than just
momentarily at the end. Then you know the bin sizes you'll be working
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
with (up to at most 3 -- self, prev, and next) and can lock them in
^^^^^^^^^^^^^^^^^^^^^

I don't think this is actually quite correct. A concurrent malloc
could consume part of either of the adjacent free chunks. It's okay if
it consumes the whole thing, but if it only consumes part, you'll end
up with a new adjacent free chunk of a different size (different bin).

Of course loop-and-retry is an option but doesn't seem like a good
one; the retry time could be unbounded.

Not sure if this is salvagable or not. Problems like this keep
pointing in the direction of wanting to lock chunks (via their
header/footer) rather than locking just bins. On the plus side it's
much finer-grained locking and MT-friendly. Unfortunately it's also
more complex and might have more fixed overhead.
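A rough sketch of what chunk-level locking might look like, stealing a
header bit as a spinlock (the header layout and bit choice here are
assumptions for illustration, not musl's actual chunk format):

```c
#include <stdatomic.h>
#include <stdint.h>

/* One bit of each chunk header serves as a per-chunk lock, so a merge
 * locks the chunks involved rather than their bins. */
#define CHUNK_LOCK_BIT 0x2u

/* Returns 1 if the chunk lock was acquired, 0 if it was already held. */
static int try_lock_chunk(_Atomic uint32_t *hdr) {
    uint32_t h = atomic_load(hdr);
    if (h & CHUNK_LOCK_BIT) return 0;
    return atomic_compare_exchange_strong(hdr, &h, h | CHUNK_LOCK_BIT);
}

static void unlock_chunk(_Atomic uint32_t *hdr) {
    atomic_fetch_and(hdr, ~(uint32_t)CHUNK_LOCK_BIT);
}
```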

Rich
Markus Wichmann
2018-04-22 19:34:50 UTC
Post by Rich Felker
I've talked on and off about musl's malloc needing to be overhauled or
replaced, and gaining the ability to experiment with that is one of
the motivations for making malloc interposable/replaceable. Here's a
Yeah, I was about to ask for one.
Post by Rich Felker
The main benefit of musl's malloc vs the standard dlmalloc algorithm
it's based on is its fine-grained locking. As long as there are binned
free chunks of various sizes available, threads calling malloc will
only contend for a lock when they're requesting allocations of the
same or similar size. This works out well under artificial random
loads; I'm not sure how much it helps on typical real-world loads.
That chiefly depends on the design of the programs. Mine tend to avoid
heap allocation wherever possible, and wherever impossible tend to
coalesce allocations into few large requests. However, a lot of
object-oriented code I have seen (even in C) allocates many small
objects.

But that is good: In the small numbers, there are many bins available.
If the program makes random requests for anything between 16 and 128
bytes, there are 4 bins for that. Add 2k to each of these and they all
fall into one bin.
Post by Rich Felker
In any case, the fine-grained locking also created a problem I didn't
anticipate when I designed it: when allocating memory that involves
splitting a free chunk, or when freeing memory that will get merged
with an adjacent free chunk, the operation as a whole is not atomic;
rather, a large free chunk is consumed, possibly emptying the bin it
lies in, split or merged, then returned either to the same or a
different bin. By saying this operation is non-atomic, I mean other
threads see the intermediate state where the large free chunk has been
consumed but not yet returned, and when this happens, they
unnecessarily split another free chunk or expand the heap rather than
waiting for it to be returned. This yields bad, potentially
catastrophic, fragmentation.
Fragmentation is bad with the current malloc() even in the single
threaded case. A simple example is a request for fifty bytes followed by
a request for two-thousand fifty bytes. If the heap was empty before,
the first request will allocate a page from the OS. Assuming that was
4k, malloc will now split off the rest of the page and put it in the bin
for "chunks sized 2k - 4k". The second request however will be rounded
up, and seeing as the bin for "chunks sized 4k - 8k" is still empty, two
more pages will be allocated from the system. Then the requested 2k and
change will be split off, leaving the heap with one free chunk in the
"2k - 4k" bin that is just a bit beneath 4k, and one free chunk in the
"4k - 8k" bin that is circa 6k in size. Three pages were allocated where
one would have sufficed. (That is one definition of fragmentation: The
request could not be serviced although the resources were available.)

The benefit of this scheme, of course, is that allocations in the
single-threaded case are bounded time: The algorithm is to pop off the
first chunk in the bins large enough to support the request, or to
allocate the memory necessary for that from the OS. In the worst case,
allocation is
- OS memory allocation
- allocation of a chunk from the bin
- splitting that chunk
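The bounded-time lookup described above boils down to a first-set-bit
scan over the bin occupancy mask, something like the following (a
simplified stand-in for musl's binmap search, assuming want < 64):

```c
#include <stdint.h>

/* Find the first non-empty bin at least as large as the request: a
 * bounded scan of a 64-bit occupancy mask. Returns the bin index, or
 * -1 if no bin can satisfy it and the heap must be expanded. */
static int first_nonempty_bin(uint64_t binmap, int want) {
    uint64_t m = binmap & (~0ULL << want);  /* mask off smaller bins */
    if (!m) return -1;
    int i = 0;
    while (!(m & 1)) { m >>= 1; i++; }      /* portable count-trailing-zeros */
    return i;
}
```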

Each of these is constant time. I am not sure optimizing fragmentation
is worth reducing the performance for. Especially in today's desktop
and server systems, where anything below 16GB of RAM will just get
laughed off the stage (slight exaggeration).

[...]
Post by Rich Felker
In the long term, I think the whole malloc design we're using now is
wrong, and should be replaced, but until the time comes to actually do
that, it may make sense to try to improve some of the worst
properties, or even to reduce or eliminate the granularity of the
locking if it proves not to be very valuable on real-world loads.
Care to elaborate on what is wrong with the current design of the
malloc? Everything you named so far was just implementation issues, but
nothing that's design related.

So far, I am also not impressed with the competition. dietlibc for
instance runs essentially the same algorithm, but with a big global lock
in the multi-threaded case. tcmalloc runs essentially the same
algorithm, but with an arena for each thread. Therefore every chunk has
to know which arena it came from, thus increasing the overhead by a
pointer. But it is all fundamentally the same. Not like with sorting
algorithms, where you get meaningful differences and tradeoffs. No, the
tradeoffs appear to be entirely in the area I'd call tuning: Where do
you put the mmap threshold, how do you lock, do you optimize locality of
reference and tiny allocations, all that stuff. But as far as I can
tell, they all suffer from the problem described above. Well, OK, for
32-bit versions of dietlibc, 2050 bytes is already beyond the mmap
threshold. But if it weren't, my point would still stand.
Post by Rich Felker
Rich
Ciao,
Markus
Rich Felker
2018-04-23 02:00:10 UTC
Post by Markus Wichmann
Post by Rich Felker
I've talked on and off about musl's malloc needing to be overhauled or
replaced, and gaining the ability to experiment with that is one of
the motivations for making malloc interposable/replaceable. Here's a
Yeah, I was about to ask for one.
Post by Rich Felker
The main benefit of musl's malloc vs the standard dlmalloc algorithm
it's based on is its fine-grained locking. As long as there are binned
free chunks of various sizes available, threads calling malloc will
only contend for a lock when they're requesting allocations of the
same or similar size. This works out well under artificial random
loads; I'm not sure how much it helps on typical real-world loads.
That chiefly depends on the design of the programs. Mine tend to avoid
heap allocation wherever possible, and wherever impossible tend to
coalesce allocations into few large requests. However, a lot of
For such programs, the properties of malloc are largely irrelevant, so
the only optimization that makes sense for them is ensuring that the
malloc implementation not do anything dumb like allocate tens to
hundreds of kB or even MB for expensive bookkeeping structures, etc.
Post by Markus Wichmann
object-oriented code I have seen (even in C) allocates many small
objects.
Indeed, this is the class of programs that are most affected, and
includes things like web browsers.
Post by Markus Wichmann
But that is good: In the small numbers, there are many bins available.
If the program makes random requests for anything between 16 and 128
bytes, there are 4 bins for that. Add 2k to each of these and they all
fall into one bin.
This is incorrect. Up to size 512 bytes (1024 on 64-bit archs), every
single multiple of 16 (or of 32, on 64-bit) is its own bin. Only past
that point does sparse spacing of bins begin.
Post by Markus Wichmann
Post by Rich Felker
In any case, the fine-grained locking also created a problem I didn't
anticipate when I designed it: when allocating memory that involves
splitting a free chunk, or when freeing memory that will get merged
with an adjacent free chunk, the operation as a whole is not atomic;
rather, a large free chunk is consumed, possibly emptying the bin it
lies in, split or merged, then returned either to the same or a
different bin. By saying this operation is non-atomic, I mean other
threads see the intermediate state where the large free chunk has been
consumed but not yet returned, and when this happens, they
unnecessarily split another free chunk or expand the heap rather than
waiting for it to be returned. This yields bad, potentially
catastrophic, fragmentation.
Fragmentation is bad with the current malloc() even in the single
threaded case. A simple example is a request for fifty bytes followed by
a request for two-thousand fifty bytes. If the heap was empty before,
the first request will allocate a page from the OS. Assuming that was
4k, malloc will now split off the rest of the page and put it in the bin
for "chunks sized 2k - 4k". The second request however will be rounded
up, and seeing as the bin for "chunks sized 4k - 8k" is still empty, two
more pages will be allocated from the system. Then the requested 2k and
change will be split off, leaving the heap with one free chunk in the
"2k - 4k" bin that is just a bit beneath 4k, and one free chunk in the
"4k - 8k" bin that is circa 6k in size. Three pages were allocated where
one would have sufficed. (That is one definition of fragmentation: The
request could not be serviced although the resources were available.)
This is not how it works, at least not at scale. When heap is expanded
via brk, the expansion merges with the existing top of the heap;
discrete pages are not relevant. When it's expanded via mmap (because
brk doesn't work/is blocked on some other mapping), it's slightly true
for the first few allocations, but asymptotically irrelevant since we
don't keep allocating small maps. The amount of new anon memory mapped
grows exponentially, doubling every two times it happens.
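The doubling policy is easy to see with a little arithmetic (the
64 KiB seed value here is made up for illustration, not musl's actual
constant):

```c
#include <stddef.h>

/* Model of "doubling every two times it happens": the nth mmap
 * expansion requests a size that doubles every second expansion, so
 * total mapped memory grows geometrically and the early small maps
 * quickly stop mattering. Seed size is illustrative only. */
static size_t expansion_size(int nth) {        /* nth = 0, 1, 2, ... */
    return (size_t)65536 << (nth / 2);
}
```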

There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to
2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to
the next power of 2. So if you do start with a 4k chunk, after
splitting off 50 bytes to allocate, you have left a chunk in the
"sized 3586 up to 4096" bin, and the 2050-byte allocation is perfectly
satisfiable from it. You could adjust the example to work, but then
the fragmentation you find as a result is much lower.
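The spacing described can be modeled like this (an illustrative
approximation of the bin layout, not musl's actual bin_index(), whose
exact boundaries also account for chunk overhead and alignment):

```c
#include <stddef.h>

/* Illustrative size-class model: one bin per 16-byte step up to 512
 * bytes, then four bins per power of two (quarter steps). Boundary
 * handling is approximate; musl's real bin_index() differs in detail. */
static int bin_index_model(size_t size) {
    if (size <= 512) return (int)(size / 16);   /* linear range */
    int lg = 0;
    for (size_t s = size; s > 1; s >>= 1) lg++; /* floor(log2(size)) */
    int quarter = (int)((size >> (lg - 2)) & 3);
    int b = 32 + 4 * (lg - 9) + quarter;
    return b > 63 ? 63 : b;                     /* cap at the top bin */
}
```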
Post by Markus Wichmann
The benefit of this scheme, of course, is that allocations in the
single-threaded case are bounded time: The algorithm is to pop off the
first chunk in the bins large enough to support the request, or to
allocate the memory necessary for that from the OS. In the worst case,
allocation is
- OS memory allocation
- allocation of a chunk from the bin
- splitting that chunk
Each of these is constant time. I am not sure optimizing fragmentation
is worth reducing the performance for. Especially in today's desktop
and server systems, where anything below 16GB of RAM will just get
laughed off the stage (slight exaggeration).
I've rarely used a system with 16GB ram, and plenty of systems using
musl have under 1GB. The old musl git/web server had 256MB before the
host shut it down; now we're stuck with a more expensive one with
something like 768MB or 1GB. Also 32-bit archs have a hard *virtual*
limit of 2, 3, or 4 GB (mostly 2 or 3) regardless of how much physical
memory you have. Treating 16GB like something reasonable to have is
how you get the glibc & mainstream browser ecosystem...
Post by Markus Wichmann
Post by Rich Felker
In the long term, I think the whole malloc design we're using now is
wrong, and should be replaced, but until the time comes to actually do
that, it may make sense to try to improve some of the worst
properties, or even to reduce or eliminate the granularity of the
locking if it proves not to be very valuable on real-world loads.
Care to elaborate on what is wrong with the current design of the
malloc? Everything you named so far was just implementation issues, but
nothing that's design related.
Designs that require splitting/merging chunks inherently require
excessive synchronization that makes them not scale well to many
threads/cores (and awful on NUMA, if anyone cares). They also have
inherent fragmentation problems with certain allocation patterns, like
alternating between small/large allocations then freeing all the large
ones, which is a reasonable pattern (e.g. allocating large working
spaces and small result spaces, then freeing all the working spaces).

Designs where all the metadata (and especially freelist pointers) are
inline are highly vulnerable to heap overflow and use-after-free
attacks. If possible, I'd rather have less metadata inline and have it
efficiently coupled with something out-of-line that's harder to
attack. This might also improve performance and reduce contention,
e.g. if we used atomic bitmasks in an index of chunks to allocate/free
them rather than having to move them in and out of linked lists.
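As a sketch of the bitmask idea (all names here are hypothetical; this
models a 64-slot group of equal-sized chunks, not anything in current
musl):

```c
#include <stdatomic.h>
#include <stdint.h>

/* One atomic word tracks a group of 64 equal-sized chunks, so
 * allocation and free are a CAS on the mask instead of linked-list
 * surgery on inline freelist pointers. */
static _Atomic uint64_t free_mask = ~0ULL;     /* bit i set => slot i free */

static int slot_alloc(void) {
    uint64_t m = atomic_load(&free_mask);
    while (m) {
        int i = 0;
        uint64_t t = m;
        while (!(t & 1)) { t >>= 1; i++; }     /* lowest free slot */
        if (atomic_compare_exchange_weak(&free_mask, &m, m & ~(1ULL << i)))
            return i;                          /* claimed slot i */
        /* CAS failure reloads m; retry with the fresh mask */
    }
    return -1;                                 /* group is full */
}

static void slot_free(int i) {
    atomic_fetch_or(&free_mask, 1ULL << i);
}
```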
Post by Markus Wichmann
So far, I am also not impressed with the competition. dietlibc for
instance runs essentially the same algorithm, but with a big global lock
in the multi-threaded case. tcmalloc runs essentially the same
algorithm, but with an arena for each thread. Therefore every chunk has
to know which arena it came from, thus increasing the overhead by a
pointer. But it is all fundamentally the same. Not like with sorting
algorithms, where you get meaningful differences and tradeoffs. No, the
Here when I say "whole design is wrong" I'm considering all these as
the same basic design -- they're all dlmalloc variants. Doing better
in important ways while not doing worse in any important way is a hard
problem. But it seems like one worth solving.

Rich
Markus Wichmann
2018-04-23 19:02:05 UTC
Post by Rich Felker
This is incorrect. Up to size 512 bytes (1024 on 64-bit archs), every
single multiple of 16 (or of 32, on 64-bit) is its own bin. Only past
that point does sparse spacing of bins begin.
Yeah, I see that now. My error was assuming that bin_index() was
basically just a binary logarithm. It is not. It is some logarithm
beyond the linear range, but not a binary log, as one can easily see
by plotting the results of that function in a graph.

Maybe I should have verified my assumptions before posting.
Post by Rich Felker
There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to
2560 bytes", etc. -- granularity of bin sizes is in 1/4 steps up to
the next power of 2. So If you do start with a 4k chunk, after
splitting off 50 bytes to allocate, you have left a chunk in the
"sized 3586 up to 4096" bin, and the 2050-byte allocation is perfectly
satisfiable from it. You could adjust the example to work, but then
the fragmentation you find as a result is much lower.
Yup, by my calculation I get about 400 bytes of fragmentation in the
worst case. The "just shy of 4k" bin starts at 3616 bytes, so after
allocating a tiny amount of memory, allocating 3648 bytes or more will
require heap expansion even though the memory to fulfill the request is
already available. And expand_heap() will only round up to one page
size. So afterwards, roughly 3700 bytes will be allocated in two pages.

The example also does not scale, as repeating the tiny allocation will
not get another page. And repeatedly allocating 3650 bytes will allocate
one page per request, but it is doubtful that a better scheme is
available.
Post by Rich Felker
Post by Markus Wichmann
The benefit of this scheme, of course, is that allocations in the
single-threaded case are bounded time: The algorithm is to pop off the
first chunk in the bins large enough to support the request, or to
allocate the memory necessary for that from the OS. In the worst case,
allocation is
- OS memory allocation
- allocation of a chunk from the bin
- splitting that chunk
Each of these is constant time. I am not sure optimizing fragmentation
is worth reducing the performance for. Especially in today's desktop
and server systems, where anything below 16GB of RAM will just get
laughed off the stage (slight exaggeration).
I've rarely used a system with 16GB ram, and plenty of systems using
musl have under 1GB. The old musl git/web server had 256MB before the
host shut it down; now we're stuck with a more expensive one with
something like 768MB or 1GB. Also 32-bit archs have a hard *virtual*
limit of 2, 3, or 4 GB (mostly 2 or 3) regardless of how much physical
memory you have. Treating 16GB like something reasonable to have is
how you get the glibc & mainstream browser ecosystem...
Yeah, that statement was not terribly well thought through. My point
was, though, that the current design manages to make single-threaded
allocation constant time (well, as constant as expand_heap()). At the
cost of more fragmentation than necessary.

Of course 16GB is a ridiculous number. That's why I chose it. My largest
system has 8GB, and that was after scavenging some parts out of laptops
due for the scrap heap.
Post by Rich Felker
Post by Markus Wichmann
Post by Rich Felker
In the long term, I think the whole malloc design we're using now is
wrong, and should be replaced, but until the time comes to actually do
that, it may make sense to try to improve some of the worst
properties, or even to reduce or eliminate the granularity of the
locking if it proves not to be very valuable on real-world loads.
Care to elaborate on what is wrong with the current design of the
malloc? Everything you named so far was just implementation issues, but
nothing that's design related.
Designs that require splitting/merging chunks inherently require
excessive synchronization that makes them not scale well to many
threads/cores (and awful on NUMA, if anyone cares). They also have
inherent fragmentation problems with certain allocation patterns, like
alternating between small/large allocations then freeing all the large
ones, which is a reasonable pattern (e.g. allocating large working
spaces and small result spaces, then freeing all the working spaces).
How do we work around this, then? We could go the tcmalloc route and
essentially run the allocator separately for each thread (arena
allocation), which reduces the amount of synchronization at the cost of
more fragmentation; this time cross-thread fragmentation (no-one will be
able to utilize all the memory that is lost to overhead across all
threads).

We could also go the dietlibc route, which uses fixed widths for each
bin. There is a bin of 16-byte elements, and it is a list of chunks that
are all 16 bytes in size. If someone wants 16 bytes, we know where to
look. Of course, if someone wants first 16, then 32, then 48 bytes, that
will allocate 96 bytes in three pages. Fragmentation doesn't begin to
describe it.
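The arithmetic behind that example can be modeled in a few lines (a
toy accounting of the fixed-width scheme, where the first request in
each size class faults in a fresh page; not dietlibc's code):

```c
#include <stddef.h>

/* Fixed-width bins: each size class draws from its own page, so every
 * distinct first-time class costs a page, no matter how small the
 * request. Requests are assumed to be multiples of 16 under 1k. */
static size_t pages_touched(const size_t *reqs, int n) {
    int seen[64] = {0};
    size_t pages = 0;
    for (int i = 0; i < n; i++) {
        size_t c = reqs[i] / 16;        /* class index = width / 16 */
        if (c < 64 && !seen[c]) { seen[c] = 1; pages++; }
    }
    return pages;
}
```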
Post by Rich Felker
Designs where all the metadata (and especially freelist pointers) are
inline are highly vulnerable to heap overflow and use-after-free
attacks. If possible, I'd rather have less metadata inline and have it
efficiently coupled with something out-of-line that's harder to
attack. This might also improve performance and reduce contention,
e.g. if we used atomic bitmasks in an index of chunks to allocate/free
them rather than having to move them in and out of linked lists.
That is a very good idea. Say, don't OS memory managers have to do it
this way? Maybe pick up some pointers from them...
Post by Rich Felker
Here when I say "whole design is wrong" I'm considering all these as
the same basic design -- they're all dlmalloc variants. Doing better
in important ways while not doing worse in any important way is a hard
problem. But it seems like one worth solving.
Rich
And here I thought memory management was a solved problem. Maybe it's
time I hit a library again.

Ciao,
Markus
Rich Felker
2018-04-23 19:47:22 UTC
A. Wilcox
2018-04-23 06:50:47 UTC
Post by Markus Wichmann
Each of these is constant time. I am not sure optimizing fragmentation
is worth reducing the performance for. Especially in today's desktop
and server systems, where anything below 16GB of RAM will just get
laughed off the stage (slight exaggeration).
this is so far off the mark I am actually writing a reply on a Sunday.

FreeBSD is currently discussing optimising NFS to run better on 256 MB
RAM i386 machines, and is considering the case where a non-zero number
of people are running it in 64 MB virtual machines.

Adelie is running KDE Plasma 5 on a Pentium III with 256 MB RAM, with
enough left over for a few tabs in Otter browser. In fact, our base
specifications for a server install are a 40 MB 486 or better (you can
squeeze text mode much lower, but apk won't run right below 40; our
baseline is 32 MB, however, on the PowerPC).

"Today's" systems fall into two buckets: computers with insane specs
bought by people in the upper classes, and used computers with lower
specs bought by people in the lower classes. after spending a
significant chunk of my life (~15 years) in both, I can't see defences
like "anything below 16GB of RAM will get laughed off the stage" as
anything *but* internalised classism. as engineers, our jobs are to
make software for all people, not for the chosen few that can afford
16GB of RAM.

hell, my new Talos cost me so much that even though I'd technically be
considered upper-class now, I still couldn't afford more than 8 GB RAM
for it in the beginning (I'm planning to upgrade in the next months as
my finances allow).

I'm sorry if this comes off as ranty or preachy. I'm just trying to
enlighten everyone that 1) not everyone has or *can afford* 16 GB RAM;
2) that's a poor excuse for not tuning software to be the best it can be.

After all, wasted memory is wasted memory, whether you have a lot or a
little. (Wouldn't you rather fit more photos / videos / text in there
instead of wasting it on malloc overhead?)

Best to you and yours,
--arw
--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org
Rich Felker
2018-04-23 16:43:01 UTC
Post by A. Wilcox
Post by Markus Wichmann
Each of these is constant time. I am not sure optimizing fragmentation
is worth reducing the performance for. Especially in today's desktop
and server systems, where anything below 16GB of RAM will just get
laughed off the stage (slight exaggeration).
this is so far off the mark I am actually writing a reply on a Sunday.
FreeBSD is currently discussing optimising NFS to run better on 256 MB
RAM i386 machines, and is considering the case where a non-zero number
of people are running it in 64 MB virtual machines.
Adelie is running KDE Plasma 5 on a Pentium III with 256 MB RAM, with
enough left over for a few tabs in Otter browser. In fact, our base
specifications for a server install are a 40 MB 486 or better (you can
squeeze text mode much lower, but apk won't run right below 40; our
baseline is 32 MB, however, on the PowerPC).
"Today's" systems fall into two buckets: computers with insane specs
bought by people in the upper classes, and used computers with lower
specs bought by people in the lower classes. after spending a
significant chunk of my life (~15 years) in both, I can't see defences
like "anything below 16GB of RAM will get laughed off the stage" as
anything *but* internalised classism. as engineers, our jobs are to
make software for all people, not for the chosen few that can afford
16GB of RAM.
hell, my new Talos cost me so much that even though I'd technically be
considered upper-class now, I still couldn't afford more than 8 GB RAM
for it in the beginning (I'm planning to upgrade in the next months as
my finances allow).
I'm sorry if this comes off as ranty or preachy. I'm just trying to
enlighten everyone that 1) not everyone has or *can afford* 16 GB RAM;
2) that's a poor excuse for not tuning software to be the best it can be.
After all, wasted memory is wasted memory, whether you have a lot or a
little. (Wouldn't you rather fit more photos / videos / text in there
instead of wasting it on malloc overhead?)
This. All of this. Considerations like the above were largely the
motivation for musl.

Beyond the question of whether everyone has or can afford 16GB of ram
financially -- while I believe it's important, I'm also less
idealistic than I used to be about whether we can engineer efficient
replacements for bloated stuff faster than Moore's law makes the
bloated stuff usable on affordable machines -- there are questions of
what you can do with what you can afford that immediately become
relevant once you stop thinking just about the OS as something that
runs on a physical present-day high-end box and consider uses like
containers, full virtual machines, embedded systems, etc. I recall
hearing a story I never verified that there was one PC game with
virtualized computers inside the game world running musl on them. This
is also the whole reason people are interested in Alpine on Docker.

Rich
Michael Clark
2018-04-22 21:40:08 UTC
Permalink
Post by Rich Felker
I've talked on and off about musl's malloc needing to be overhauled or
replaced, and gaining the ability to experiment with that is one of
the motivations for making malloc interposable/replaceable. Here's a
The main benefits of musl's malloc vs the standard dlmalloc algorithms
it's based on is the fine-grained locking. As long as there are binned
free chunks of various sizes available, threads calling malloc will
only contend for a lock when they're requesting allocations of same or
similar size. This works out well under artificial random loads; I'm
not sure how much it helps on typical real-world loads.
In any case, the fine-grained locking also created a problem I didn't
anticipate when I designed it: when allocating memory that involves
splitting a free chunk, or when freeing memory that will get merged
with an adjacent free chunk, the operation as a whole is not atomic;
rather, a large free chunk is consumed, possibly emptying the bin it
lies in, split or merged, then returned either to the same or a
different bin. By saying this operation is non-atomic, I mean other
threads see the intermediate state where the large free chunk has been
consumed but not yet returned, and when this happens, they
unnecessarily split another free chunk or expand the heap rather than
waiting for it to be returned. This yields bad, potentially
catastrophic, fragmentation.
The malloc side already partially mitigates this with the "pretrim"
function, which is used whenever the leftover part after splitting
would remain in the same bin it was already in. In particular this
covers splitting small allocations off a large "top of heap" zone, but
it doesn't help with splitting small chunks. And the free side is much
worse -- large free zones momentarily disappear every time something
adjacent to them is freed.
In order to overcome this fully while maintaining the fine-grained
locking, we'd need the ability to hold multiple locks concurrently,
which would require a protocol for lock acquisition order to avoid
deadlock. But there may be cheap mitigations that could be made in
free like on the malloc side. For instance it might be possible to
make a simpler protocol whereby we could always hold the lock for bin
63, thereby avoiding exposing a state where large free zones are
unavailable. If there's a way to make this work, moving the binmap
masking for newly-emptied bins from unbin() to unlock_bin() would
ensure that malloc doesn't "see" a "temporary empty" state.
In the long term, I think the whole malloc design we're using now is
wrong, and should be replaced, but until the time comes to actually do
that, it may make sense to try to improve some of the worst
properties, or even to reduce or eliminate the granularity of the
locking if it proves not to be very valuable on real-world loads.
Hi Rich,

Have you considered importing jemalloc?

It’s been battle-tested for 12 years in FreeBSD and in many other apps. It’s bundled with and used in MariaDB, which is undoubtedly a demanding user, and presumably Monty approved the choice of allocator, so it shouldn’t be a bad one. I’d trust the allocator choice of a well-respected database architect.

From looking at the Lockless, Inc benchmarks jemalloc has good performance from very small to very large allocations. From the Lockless Inc memory allocator comparisons: “The disadvantage of the jemalloc allocator is its memory usage. It uses power-of-two sized bins, which leads to a greatly increased memory footprint compared to other allocators. This can affect real-world performance due to excess cache and TLB misses.”

From the wiki:

- https://github.com/jemalloc/jemalloc/wiki/Background

“jemalloc is a general purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. It is intended for use as the system-provided memory allocator, as in FreeBSD's libc library, as well as for linking into C/C++ applications. jemalloc provides many introspection, memory management, and tuning features beyond the standard allocator functionality.”

It’s a libc allocator.

Here are some of the major users (who, it’s fair to say, must have exercised due consideration in choosing their memory allocator, given the list includes a browser (Firefox’s mozjemalloc fork) and several databases and caches):

- FreeBSD
- NetBSD
- Mozilla Firefox
- Varnish
- Cassandra
- Redis
- hhvm
- MariaDB
- Aerospike
- Android

Curious how big the linked code is... assuming code size is an important consideration...

Looking at the code, it doesn’t look too big. It’s also interesting to note that the recent changes to glibc to support per-thread pools use the same tcache feature name as jemalloc, whose per-thread pool implementation is in tcache.c:

- https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html
- https://sourceware.org/ml/libc-alpha/2017-01/msg00524.html

musl libc could vendor the jemalloc code with some sed scripts to modify includes so that jemalloc fits into the musl build structure and can be easily synced with upstream. Ideally musl would want to import/vendor vs using a submodule so musl remains stand-alone. I assume many projects have done the same. MariaDB bundles it. Firefox forked it.

I think the primary disadvantage of jemalloc is its power-of-2 size bins. Non-power-of-2 strides for small structures have better caching performance for many workloads, due to stochastic cache eviction when not using power-of-2 strides. I’ve noticed that power-of-2 alignments can have pathological performance in some cases, e.g. witnessing an array with sizeof(struct) == 28 or 36 perform much better than one with sizeof(struct) == 32. That said, most apps would use a single malloc call for these types of performance-critical arrays, so it’s only an issue for apps that individually malloc many small objects.

Worth considering.

BTW - I owe you an update on the riscv musl port. I’ll post one soon... I’ve been busy on the QEMU RISC-V port, whose ‘virt’ board has recently been used by Fedora and Debian distro porters for riscv64 arch support (glibc based distros). Stefan O’Rear got MTTCG working on qemu/target/riscv so we can run SMP Linux in QEMU with reasonable performance. Debian for example now has QEMU RISC-V builders for the riscv64 arch.

- https://github.com/riscv/riscv-qemu/wiki

It will be nice to get the riscv musl support upstream. The major issue at present is ELF thread-local storage (pthreads are working but GCC’s __thread is not). The RISC-V TLS model is not documented, so it requires some reverse engineering of the riscv support in GCC/binutils/glibc. I haven’t been able to isolate the issue and could benefit from some help from someone more knowledgeable in that area (ELF TLS and architecture-specific TLS models).

Regards,
Michael
Rich Felker
2018-04-23 02:09:33 UTC
Permalink
Post by Michael Clark
Post by Rich Felker
I've talked on and off about musl's malloc needing to be overhauled or
replaced, and gaining the ability to experiment with that is one of
the motivations for making malloc interposable/replaceable. Here's a
The main benefits of musl's malloc vs the standard dlmalloc algorithms
it's based on is the fine-grained locking. As long as there are binned
free chunks of various sizes available, threads calling malloc will
only contend for a lock when they're requesting allocations of same or
similar size. This works out well under artificial random loads; I'm
not sure how much it helps on typical real-world loads.
In any case, the fine-grained locking also created a problem I didn't
anticipate when I designed it: when allocating memory that involves
splitting a free chunk, or when freeing memory that will get merged
with an adjacent free chunk, the operation as a whole is not atomic;
rather, a large free chunk is consumed, possibly emptying the bin it
lies in, split or merged, then returned either to the same or a
different bin. By saying this operation is non-atomic, I mean other
threads see the intermediate state where the large free chunk has been
consumed but not yet returned, and when this happens, they
unnecessarily split another free chunk or expand the heap rather than
waiting for it to be returned. This yields bad, potentially
catastrophic, fragmentation.
The malloc side already partially mitigates this with the "pretrim"
function, which is used whenever the leftover part after splitting
would remain in the same bin it was already in. In particular this
covers splitting small allocations off a large "top of heap" zone, but
it doesn't help with splitting small chunks. And the free side is much
worse -- large free zones momentarily disappear every time something
adjacent to them is freed.
In order to overcome this fully while maintaining the fine-grained
locking, we'd need the ability to hold multiple locks concurrently,
which would require a protocol for lock acquisition order to avoid
deadlock. But there may be cheap mitigations that could be made in
free like on the malloc side. For instance it might be possible to
make a simpler protocol whereby we could always hold the lock for bin
63, thereby avoiding exposing a state where large free zones are
unavailable. If there's a way to make this work, moving the binmap
masking for newly-emptied bins from unbin() to unlock_bin() would
ensure that malloc doesn't "see" a "temporary empty" state.
In the long term, I think the whole malloc design we're using now is
wrong, and should be replaced, but until the time comes to actually do
that, it may make sense to try to improve some of the worst
properties, or even to reduce or eliminate the granularity of the
locking if it proves not to be very valuable on real-world loads.
Hi Rich,
Have you considered importing jemalloc?
No, and that's not happening. It's got serious bloat problems,
problems with undermining ASLR, and is optimized pretty much only for
being as fast as possible without caring how much memory you use.
That's nothing like the goals of musl. With the recent changes to
allow interposing malloc, projects that want to use it now can, but it
definitely shouldn't be imposed on the vast majority of programs,
which would only be hurt (grow much bigger in memory) by using it.

Aside from that, importing nontrivial code is pretty much off the
table -- every time we've done it it's been almost entirely a mistake
(TRE being the worst), since it ends up being lots of code that nobody
understands, with bugs and UB and often vulns. Just this release cycle,
blatant nonsensical special cases in the OpenBSD complex math code bit
us.
Post by Michael Clark
It’s been battle tested for 12 years in FreeBSD and in many other
apps. Given it’s bundled and used in MariaDB which is undoubtedly a
demanding user and presumably Monty approved the choice of
allocator, so it shouldn’t be a bad choice. I’d trust the allocator
choice of a well respected database architect.
From looking at the Lockless, Inc benchmarks jemalloc has good
These benchmarks were specifically tuned for thresholds that make
glibc's ptmalloc look bad. They're not a credible source. While
Lockless Inc has some nice documentation on multithreading topics, I'm
a lot more skeptical of their product and performance claims about
other products...
Post by Michael Clark
performance from very small to very large allocations. From the
Lockless Inc memory allocator comparisons: “The disadvantage of the
jemalloc allocator is its memory usage. It uses power-of-two sized
bins, which leads to a greatly increased memory footprint compared
to other allocators. This can affect real-world performance due to
excess cache and TLB misses.”
Yes, see above.

Rich
Rich Felker
2018-04-23 02:51:46 UTC
Permalink
Post by Michael Clark
BTW - I owe you an update on the riscv musl port. I’ll post one
soon.... I’ve been busy on the QEMU RISC-V port, whose ‘virt’ board
has recently been used by Fedora and Debian distro porters for
riscv64 arch support (glibc based distros). Stefan O’Rear got MTTCG
working on qemu/target/riscv so we can run SMP Linux in QEMU with
reasonable performance. Debian for example now has QEMU RISC-V
builders for the riscv64 arch.
- https://github.com/riscv/riscv-qemu/wiki
It will be nice to get the riscv musl support upstream. The major
issue at present is ELF thread local storage (pthreads are working
but GCC’s __thread is not). The RISC-V TLS model is not documented
so it requires some reverse engineering of the riscv support in
GCC/binutils/glibc. I haven’t been able to isolate the issue and
could benefit from some help from someone more knowledgeable in that
area (ELF TLS and architecture specific TLS models).
If you have specific questions here I can probably help you find
answers quickly. It probably suffices just to compile and link a
trivial test program using a single TLS variable, and see what offset
from the thread pointer the compiler accesses it at. That should tell
you if TLS is above or below the thread pointer, and if there's any
offset that needs to be applied, what it is.

Once it's all working, the big thing that might take some additional
effort to get it merged is making sure contributions have been
documented. I know a bunch of different people worked on this on and
off and I want to make sure they all get credited properly.

Rich