Yes, LMDB is fast, blindingly fast. Can we make it faster still? Yes.
As you may know, LMDB uses mmap() to map an entire database into a process's address space. A client of ours asked if we could adapt it to use DI-MMAP, a user-space memory-map manager that claims to be much more efficient (and therefore much faster) than the Linux kernel's memory manager. We could do it, but we probably won't.
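To make that core idea concrete, here's a minimal sketch of "map the whole database file and read it in place" at the system-call level. This is not LMDB's actual code - the filename and the byte access are purely illustrative:

```c
/* Illustrative only: one read-only, file-backed mapping covering
 * the whole database file. The filename "data.mdb" is assumed. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.mdb", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the entire file read-only. The OS page cache is the only
     * cache; no user-level copies of the data are ever made. */
    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Data is accessed in place, straight out of the mapping. */
    printf("first byte: %d\n", ((unsigned char *)map)[0]);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```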
It’s important to remember that LMDB was created to solve a number of different problems, and performance was only one of them (along with robustness, reliability, portability, simplicity and ease of use, to name a few). Another of these many goals was grace under pressure - specifically memory pressure. Something that often came up with its predecessor, BerkeleyDB, was that a mis-tuned or heavily overloaded database could lead to heavy swapping. Once your server starts hitting swap space, kiss any hopes of decent performance goodbye.
Back in 2004 another OpenLDAP team member, working at IBM, started an effort to make the server more resilient to memory pressure. The work is briefly summarized in this email message and presented in greater depth in their OpenLDAP Developer Day presentation. They outlined the problems pretty clearly: even if you started out with a well-tuned, balanced cache configuration, things could still go bad when other applications were started later. Even worse, if your server was running on a virtual machine, demand for memory from other VMs might come into play as well.
The 2004 effort focused on redesigning back-bdb's caches to be dynamically and automatically resizable. The code would measure the latency of accesses to detect paging/swapping activity and start releasing cached memory accordingly. I personally saw this effort as misguided in a number of ways:
- It depended on timing measurements to detect memory pressure, which requires high-resolution time information - and clocks are notoriously unreliable inside VMs.
- It used mmap for its dynamic cache, but the mmaps were anonymous (i.e., mapped memory that is not explicitly backed by a file). This meant that under memory pressure, the OS would be forced to write these pages out to swap space, aggravating the very swap storms the design was intended to remedy (see the sketch after this list).
- The resulting code was extremely complex.
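To make the second point concrete, here's a minimal sketch of the two kinds of mapping side by side. This isn't code from back-bdb or LMDB; the cache size and filename are arbitrary placeholders. The flags alone decide whether the OS can reclaim a page for free or must push it to swap first:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

#define CACHE_SIZE (64UL * 1024 * 1024)   /* arbitrary 64 MiB */

int main(void)
{
    /* Anonymous mapping, as the 2004 cache used: there is no
     * backing file, so under memory pressure the OS must write
     * these pages out to swap before it can reclaim them. */
    void *anon = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* File-backed read-only mapping, as LMDB uses: every page is
     * clean by definition, so the OS can drop any of them on the
     * spot and re-read from the file later. Zero swap traffic. */
    int fd = open("data.mdb", O_RDONLY);   /* filename assumed */
    void *filemap = mmap(NULL, CACHE_SIZE, PROT_READ,
                         MAP_SHARED, fd, 0);

    (void)anon; (void)filemap;             /* illustration only */
    return 0;
}
```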
The approach taken with LMDB and back-mdb solved all of these problems by doing away with application-level caching entirely: map the actual database into the process's address space, and they simply vanish.
The user code doesn't need to do any monitoring or timing measurements; it just goes about its business. No wasted effort here.
Since we’re using a file-backed mmap, none of our usage contributes to swap storms. Any clean page in the map can immediately be reused by the OS for any other purpose when memory pressure is high; no writeback of any kind is needed. And since LMDB’s default mode is to use a read-only mmap, by definition *every* page in the map is always clean. (If using a read/write mmap, then writes to the map create dirty pages that must eventually be written back to disk by the OS. When using anonymous memory, both clean *and* dirty pages must be written to swap if the OS wants to reuse the memory pages.)
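In LMDB's own API, none of this requires anything special from the caller - the read-only map is simply the default. Here's a minimal sketch using the real LMDB calls; error handling is omitted, and the path, map size, and key are placeholders:

```c
#include <lmdb.h>
#include <stdio.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;

    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL << 30);    /* 1 GiB map; placeholder */

    /* Default flags: the map is read-only and writes go through
     * normal write() calls, so every mapped page stays clean.
     * Passing MDB_WRITEMAP here instead opts into a writable map,
     * and dirty pages the OS must eventually write back. */
    mdb_env_open(env, "./db", 0, 0664);     /* dir "./db" must exist */

    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    MDB_val key = { 5, "hello" }, data;     /* key assumed present */
    if (mdb_get(txn, dbi, &key, &data) == MDB_SUCCESS) {
        /* data.mv_data points directly into the read-only map -
         * no copy into any application-level cache is ever made. */
        printf("found %zu bytes\n", data.mv_size);
    }

    mdb_txn_abort(txn);
    mdb_env_close(env);
    return 0;
}
```

The zero-copy mdb_get() is the whole point: the only "cache" in the picture is the OS page cache itself.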
The resulting code is extremely simple. (About 7000 lines of code for LMDB vs. 1.5 million for BerkeleyDB, and one-third less code in back-mdb than in back-bdb or back-hdb.)
So, with that history out of the way, back to DI-MMAP. Yes, it might be faster, but it suffers from a major design limitation - it requires the app to grab a statically sized chunk of memory from the OS, which it then manages itself for its mmap purposes. An app that adopts DI-MMAP will be vulnerable to swap storms, same as any other application-level cache manager. It will behave poorly when running inside a VM, unless additional code is added to inform it of the host hypervisor’s memory management activities. Of course, DI-MMAP was developed for the High Performance Computing audience, not as a general purpose utility.
If you know that your app is the only thing running on the machine, that approach is perfectly viable. But for a general purpose database engine like LMDB, it’s really not appropriate.
The other obvious downside is that DI-MMAP doesn’t appear to be actively maintained; the only available source code contains a kernel module specific to the Linux 2.6.32 kernel, which is several years out of date now.
Conclusion - yes, there may be ways to make LMDB even faster than it already is. But they come at a cost in usability, and after all the hard-won gains we've made, that's a tradeoff that doesn't make sense. Application-level caching - Just Say No.