I recently came across this fun story about getting an out-of-memory error on Linux when trying to free memory: Production postmortem: ENOMEM when trying to free memory
That error made absolutely no sense, as you can imagine. We are trying to release memory, not allocate it. Common sense says that you can’t really fail when you are freeing memory. After all, how can you run out of memory? I’m trying to give you some, damn it!
The linked blog post does a good job of explaining the problem briefly and clearly, so I’m just going to provide some further context on how this can indeed be a problem.
Modern computer architectures like x86-64 and even most (all?) ARMs have a much funkier relationship with memory than most people – and even most software developers! – tend to be aware of. (This is a good thing! You don’t need to know how the sausage is made. Abstraction is an invaluable tool for managing complexity.)
The memory that a running application (any running application) sees and lives in (where it loads all of its data and where all of its allocations live) is actually not a direct representation of the main memory of the computer (the RAM sticks). It used to be, long ago, but we’ve moved away from that. Instead, what your process is interacting with is called virtual memory, or more specifically, a virtual memory address-space. Virtual memory is a layer of abstraction on top of physical memory, and this decoupling is useful because it grants additional performance, stability, and safety. How?
A process’s virtual memory address-space is constructed specifically for that one particular process, and reflects only the parts of the physical memory that are used by it. Memory used by other processes is normally not visible here: it is simply not mapped (outside of special cases like shared memory created by `shm_open()` and friends, or debugging via `ptrace()`). This means that, by default, no process can accidentally or maliciously access the memory of other processes and extract secrets from it or change the values in unexpected ways, leading to erratic behaviour or crashes.
This layer of abstraction also enables other useful features that reduce overall memory usage at the expense of runtime performance:
- Demand-paging: memory allocated but not yet accessed doesn’t actually need to be backed by anything yet: if the application decided to allocate a gigantic 2 GB buffer but hasn’t yet used more than 100 KB of it, then the rest doesn’t have to be available (there is a small sketch of this right after the list)
- Swapping: rarely used regions of memory are written to disk, and the physical memory they occupied is made available for new allocations
- Compression: rarely used regions of memory are compressed, and the space freed up is reclaimed for other purposes
- De-duplication: if there are multiple identical copies of the same sequence of bytes in memory, we don’t actually need to store them multiple times (though caveats apply!)
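To make demand-paging concrete, here is a minimal sketch (assuming 64-bit Linux with the default overcommit settings): we ask `mmap()` for a large anonymous region and use `mincore()` to check which pages are actually backed by physical memory before and after touching the first 100 KB.

```c
/* Demand-paging sketch: a 2 GB allocation that costs (almost) nothing
 * until it is touched. Assumes 64-bit Linux, default overcommit. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 2UL * 1024 * 1024 * 1024;          /* 2 GB */

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* mincore() fills one byte per page; bit 0 tells us whether that page
     * is currently resident in physical memory. Right after mmap(), none are. */
    unsigned char vec[4];
    mincore(buf, 4 * page, vec);
    printf("resident before touch: %d %d %d %d\n",
           vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);

    memset(buf, 0xAB, 100 * 1024);                    /* use only the first 100 KB */

    mincore(buf, 4 * page, vec);
    printf("resident after touch:  %d %d %d %d\n",
           vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);

    munmap(buf, len);
    return 0;
}
```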
For any physical memory to be accessible in a virtual memory address space, it has to be mapped in, typically by the operating system kernel. This is done by writing into a data structure called the page table, which is shared between the kernel and the CPU’s memory management unit (MMU). A page table entry typically represents the mapping of some physical memory region into a virtual memory address space: in this case, the entry will contain the physical memory address that it corresponds to. You may have multiple page table entries refer to the same or overlapping regions of physical memory: that is totally cool and allowed! (This has some very interesting uses e.g. in cybersecurity that I wrote my Master’s thesis on.)
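As a small, hedged illustration of multiple page table entries referring to the same physical memory (this relies on the Linux-specific `memfd_create()`, available since glibc 2.27): mapping the same anonymous file twice gives us two distinct virtual addresses backed by the same physical pages, so a write through one mapping is immediately visible through the other.

```c
/* Two virtual mappings, one piece of physical memory. Linux-specific. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);

    int fd = memfd_create("demo", 0);                 /* anonymous in-memory file */
    if (fd < 0) { perror("memfd_create"); return 1; }
    ftruncate(fd, page);

    /* Map the same file twice: two page table entries, same physical page. */
    char *a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(a, "written through mapping a");
    printf("read through mapping b: %s\n", b);        /* same bytes... */
    printf("a=%p b=%p\n", (void *)a, (void *)b);      /* ...different addresses */

    munmap(a, page); munmap(b, page); close(fd);
    return 0;
}
```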
A page table entry, beyond the physical memory address, also contains bits used for various purposes. Some of these are used for protection, to indicate whether the region is readable, writable, or executable. This is enforced by the MMU: attempting to write to a virtual memory region whose page table entry does not have the writable bit set will fail (well, it will trigger a page fault, which can be handled – more on this later). For example, after a running application’s executable is loaded into memory at startup, the region is marked as non-writable (but readable and executable) to prevent buggy software from accidentally modifying its own code, and also to limit the options of malware. Similarly, memory regions where the application’s runtime data is stored (so the stack, the heap, but also the `.data`, `.bss`, and `.rodata` sections containing global variables) are marked as non-executable, as that is a very easy attack vector for malicious users. (They can be made executable – otherwise things like Just-in-Time compilers would not work – but you have to explicitly request it, e.g. on Linux by using `mmap()` with `PROT_EXEC`.)
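Here is a rough sketch of that explicit-request pattern as a JIT compiler might use it, assuming Linux on x86-64 (the machine-code bytes are just an illustrative `mov eax, 42; ret`): the region starts out writable but not executable, and is only flipped to read+execute with `mprotect()` once the code has been written, so it is never writable and executable at the same time.

```c
/* Minimal JIT-style sketch for Linux/x86-64: write code, then flip the
 * page to read+execute before running it. No error handling for brevity. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* x86-64 machine code for: mov eax, 42; ret */
    unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    /* Start writable, NOT executable. */
    unsigned char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, sizeof code);

    /* Writing is done; now explicitly request executability and drop write. */
    mprotect(buf, page, PROT_READ | PROT_EXEC);

    int (*fn)(void) = (int (*)(void))buf;
    printf("generated code returned %d\n", fn());     /* prints 42 */

    munmap(buf, page);
    return 0;
}
```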
The page table entry also contains “accessed” and “dirty” bits. These are set by the MMU whenever that particular memory region is used (the dirty bit, specifically, when it is written to). The kernel can periodically clear these bits and check later whether they have been set again: if not, then the corresponding memory region has not been used recently, and so may be a candidate to be swapped out or compressed.
We need a mechanism, though, for detecting when e.g. a swapped-out piece of memory is accessed. Clearly, such an access cannot simply proceed as normal: the page table entry is marked as invalid, as there is no physical memory backing it. The MMU will trigger a page fault in this case, which is an interrupt that the kernel handles. The kernel will consult its own internal book-keeping of the process’s memory and will find, in this case, that it has decided to swap the memory to disk: now that the data is requested, it must be loaded back from disk into main memory and the page table entry must be made valid again. Once this is done, the process resumes execution from the exact same place where it was left suspended when trying to access the swapped memory, none the wiser to what happened in the background – except perhaps that some operation ended up taking much more time than usual.
This pattern is actually very common, and is used for all of the memory-use optimizations I’ve mentioned so far. For instance, in the case of memory de-duplication, identical memory regions may be referenced by multiple processes at the same time. So long as they all only read from it, life is good, but when one of them wants to write to it, the kernel needs to intervene, as the write would make the memory region no longer contain what the other processes expect it to. This is achieved by marking the page table entries as non-writeable, such that attempting to write results in a page fault. On such a page fault, the kernel quietly duplicates the memory region (by allocating a new one and then copying into it) and updates the writing process’s page table entry to point to the new address, restoring its writeability before returning control to the process.
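The same mechanism backs the copy-on-write behaviour of `fork()`, which makes for an easy demonstration, assuming Linux/POSIX: right after the fork, parent and child share the same physical pages, marked read-only in both page tables; the child’s write triggers a page fault, the kernel copies the page for the child, and the parent’s view stays untouched.

```c
/* Copy-on-write after fork(): the child's write does not affect the parent. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    char *buf = malloc(4096);
    strcpy(buf, "original");

    pid_t pid = fork();
    if (pid == 0) {                        /* child */
        strcpy(buf, "modified by child");  /* page fault -> private copy */
        printf("child sees:  %s\n", buf);
        _exit(0);
    }

    waitpid(pid, NULL, 0);                 /* parent */
    printf("parent sees: %s\n", buf);      /* still "original" */
    free(buf);
    return 0;
}
```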
You may have noticed that I was talking about individual bits of the page table entry. Indeed, these entries are extremely limited in size: typically just 64 bits (8 bytes), most of which is needed to store the referenced physical memory address. This is not always enough for more complicated features built on top of the virtual memory abstraction, so the kernel keeps additional book-keeping of its own: in Linux, these records are called virtual memory areas, or VMAs for short – one for each distinct memory region of each running process. Crucially, the number of VMAs a single process may have is capped: on Linux, the limit is the vm.max_map_count sysctl, which defaults to 65530.
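If you want to see these numbers on a live system, here is a rough sketch (Linux-specific paths, nothing portable about it): each line of `/proc/self/maps` corresponds to one VMA of the current process, and `/proc/sys/vm/max_map_count` holds the per-process ceiling.

```c
/* Print the current number of VMAs of this process and the system-wide
 * per-process limit. Linux-specific /proc paths. */
#include <stdio.h>

static long count_lines(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    long n = 0;
    for (int c; (c = fgetc(f)) != EOF; )
        if (c == '\n') n++;
    fclose(f);
    return n;
}

int main(void) {
    long vmas = count_lines("/proc/self/maps");   /* one line per VMA */

    long limit = -1;
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
    if (f) { fscanf(f, "%ld", &limit); fclose(f); }

    printf("current VMAs: %ld, vm.max_map_count: %ld\n", vmas, limit);
    return 0;
}
```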
This takes us back to the blog post that motivated this write-up: freeing or de-allocating memory may require an additional VMA to be allocated in the case where you’re not freeing an entire allocated memory region, but only a part of it. This is not possible to do via the commonly used `malloc()` / `free()` APIs, or even the equivalent C++ `new` and `delete` operators, but it is possible via direct system calls using `mmap()` and `munmap()`. To illustrate:
- We allocate 1000 bytes of memory
- Linux creates a new VMA for us to record this (note: no actual memory has been allocated unless you used `MAP_POPULATE`, as demand paging kicks in!)
- We now wish to free the middle 500 bytes
- Linux now has no choice but to split the original VMA spanning 1000 bytes into two of 250 bytes each, with a 500 byte gap in the middle
Granted, this is a bit contrived, and not even accurate, as you cannot free 500 bytes: memory is split into pages of 4096 bytes (4 KB) each (by default, on x86-64). But you can imagine the same situation happening when e.g. different protection bits are applied to different parts of a huge memory region: some parts are writeable, others are not. The VMA having to be split causes the memory deallocation operation to actually have to allocate memory: for the new VMA. This is how you can get an out-of-memory error.
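To see the split happen for real, here is a hedged sketch using page-sized regions on Linux: we map three pages as one region, unmap the middle page, and count the entries in `/proc/self/maps` before and after – one mapping becomes two, and the kernel had to allocate a new VMA to describe the second half.

```c
/* Punching a hole in a mapping forces the kernel to split one VMA in two. */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

static long count_mappings(void) {
    FILE *f = fopen("/proc/self/maps", "r");  /* one line per VMA */
    long n = 0;
    for (int c; (c = fgetc(f)) != EOF; )
        if (c == '\n') n++;
    fclose(f);
    return n;
}

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    char *buf = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mappings after mmap:   %ld\n", count_mappings());

    /* Unmap the middle page: the kernel now has to track two separate VMAs,
     * which means allocating a new one. If the process is already at
     * vm.max_map_count, this munmap() can fail with ENOMEM. */
    if (munmap(buf + page, page) != 0)
        perror("munmap");
    printf("mappings after munmap: %ld\n", count_mappings());

    munmap(buf, page);
    munmap(buf + 2 * page, page);
    return 0;
}
```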
As you can imagine, there are a lot more details to this topic. Some teasers:
- A single virtual memory address space can span up to 256 terabytes of memory, making virtual memory much more plentiful than physical memory.
- Resolving virtual memory addresses into physical memory addresses is slow even when performed by dedicated hardware: modern CPUs spend something like 25% of their total execution time just doing this.
- Caching (the TLB) is used to mitigate this, and while its hit-rate is very high (96%+ in typical applications), it opens up timing-related side-channel attacks that have become well-known in recent years.
- Huge pages of 2 MB and 1 GB are another effort to improve performance, but are a mess, and difficult to work with in practice.