Q&A WITH SYMAS CORPORATION’S HOWARD CHU ABOUT SYMAS’S LIGHTNING MEMORY-MAPPED DATABASE
Howard Chu, the Chief Architect for the OpenLDAP Project and CTO of Symas Corporation, discusses Symas’s Lightning Memory-Mapped Database (LMDB), the memory-mapped database that was developed and contributed to the OpenLDAP Project by Symas. In this interview we discuss the nitty gritty of the database and why it’s “not just another new database”.
Q: What exactly is LMDB? A: LMDB is a B+tree database that exploits the memory-mapped file capabilities available in modern operating systems. It is fully transactional, ACID compliant, and uses multi-version concurrency control (MVCC) to avoid the need for most database locks. Its compiled size is small enough to fit completely inside the L1 cache of most modern CPUs, making it extremely fast.
Q: Is this a completely new design? A: Not really. LMDB builds on several pieces of work done since the 1960s. The notion of a memory-mapped database came from work done on the Multics operating system in the 1960s, based on a concept called “Single-Level Store”. It was mostly abandoned by the mid-1990s when available disk space outstripped the the capacity of 32-bit address spaces (4 gigabytes). This has changed with the widespread availability of systems with 64-bit addressing, which puts the upper bound of database size at 8 exabytes. B+trees were first described in 1972 and are commonly used in databases and some file systems. The code implementing LMDB’s database structure has its origins in the append-only B-tree code written in 2010 by Martin Hedenfalk for OpenBSD’s ldapd implementation. I also added features available in Berkeley DB (BDB), which we used in OpenLDAP’s back-bdb and back-hdb backends, to simplify writing a slapd backend with LMDB. So the basics have been around for many years. I’ve just brought them together and made selective improvements on them to create LMDB.
Q: Why did you choose to base LMDB on an append-only design? A: The motivation behind an append-only design is that existing data never gets overwritten, making it impossible for a database to be corrupted by an interrupted operation (e.g. due to system crash). As such, these database designs are crash-proof and need no intensive recovery process when restarting after a crash. LMDB achieves the same goal (instantly available, crash-proof) but uses disk space more efficiently. Not everyone is like Google with endless strings of disks at their disposal.
Q: What do you think are the biggest contributors to LMDB’s performance? A: Database reads can be satisfied with no memory allocations (mallocs) or memory-to-memory copies (memcpy). The other major contributor is that no locks are required to satisfy a read operation. All of those can be significant performance bottlenecks in typical code. This is part of the reason that LMDB reads are so much faster than every other database, and why reads scale perfectly linearly across multiple CPUs - there are no bottlenecks in the read path.
Q: Databases that provide MVCC are notorious for their need of excess storage space- typically double that of the maximum expected database size. You also mentioned that the B+tree code is append-only. Doesn’t that make LMDB very wasteful in its use of disk space? A: No. LMDB maintains a second B+tree that tracks pages that have been freed. Updaters will re-use pages from this free list whenever possible and the database size will remain fairly static. This is a key advantage of LMDB over other MVCC databases like CouchDB.
Q: Another problem with MVCC databases is that updates may be denied while garbage collection takes place. Is that the case with LMDB? A: No. By using the second B+tree to manage free space, we avoid the need to perform garbage collection in the first place.
Q: Is the size of the database restricted to the amount of physical memory in the system? A: No. The maximum database size is constrained only by the amount of disk space available and by the size of the address space on the machine. For 32-bit implementations it’s restricted to approximately 2^31 bytes (2 GB), and for 64-bit implementations, which typically bring 48 address bits out of the CPU, it’s restricted to 2^47 bytes (128 TB). The operating system takes care of moving data in and out of available memory as needed. This means that database sizes can be many multiples of available physical memory.
Q: How does that work? A: If a portion of the database that is not currently in memory is accessed, the OS suspends the database task, reads the needed data from disk into memory, and then resumes the database task. If there’s no memory available at the moment, another page from the memory map is reused to hold the data. An interesting point here is that since most of the pages in LMDB’s map are never written/dirty (since LMDB does copy-on-write), they don’t need to be flushed to disk. They can be reused immediately by the kernel, the old contents are discarded with zero page-out overhead.
Q: How does available memory relate to database performance then? A: Moving data between disk and memory takes time, so having more memory available for the database to use reduces the number of times that movement takes place and improves performance. The best case is having enough memory to hold the entire database in memory, but that’s not always practical. As a general rule you want to have enough memory to accommodate as much of your “working set”, the data that’s commonly accessed, as possible. That will insure that most database operations use data that’s already in memory.
Q: What does the API look like? A: The API was deliberately modeled on that of Oracle’s Berkeley DB to ease porting applications that use that database.
Q: Does LMDB have the same access methods as Berkeley DB? A: No. While LMDB is basically modeled after BDB, it is overall much simpler. BDB provides extensible hashing as well as B-tree based methods. LMDB only implements B+trees. Hashing is cache-unfriendly by design, and doesn’t perform well as data volumes grow. B+trees leverage locality of reference thus maximizing the efficiency of system caches.
Q: Does LMDB have caches or need special tuning? A: No. The SLS design means that database reads are satisfied directly from the file system cache without the need to copy data into separate database buffers first. This is such a simple design that there is nothing to tune. The choice of underlying file system can have a strong impact on database performance, though, and there are several file system tuning parameters that can be tweaked to improve performance in certain applications.
Q: How do you back up an LMDB instance? A: MDB includes an mdb_copy utility that can make a complete copy of an existing LMDB instance. The new copy of the database will itself be a proper LMDB database, so there’s no need of a separate mdb_load utility.
Q: Can you back up an LMDB instance while it is actively updating, or do you need to stop it? A: There is no need to stop the database while performing a backup, but… LMDB uses MVCC, so once a read transaction starts, it is guaranteed a self-consistent view of the database until the transaction ends. Keeping a read transaction open for a very long time will prevent the dirty page reclaimer from re-using the pages that are referenced by that read transaction. As such, any ongoing write activity may be forced to use new pages. The database file size can grow very rapidly in these situations until the read transaction concludes.
Q: One of the issues with B-tree databases like Berkeley DB is that they periodically experience significant slowdowns for certain requests when B-tree re-balancing takes place. What is the typical response-time distribution for inserts with LMDB?
A: LMDB is also a B+tree; at a high level it will have characteristics similar to those of Berkeley DB. It’s always possible for an insert that results in a page split to have a cascade effect, causing page splits to propagate all the way back up to the root of the tree. But in general, LMDB is still more efficient than BDB so the actual variance in performance caused by re-balancing will be much smaller.
Also note that if you’re bulk-loading records in sorted order, using the available MDB_APPEND option basically transforms the adds into a series sequential write operations - when one page fills, instead of splitting it in half as usual, we just allocate a new sibling page and continue filling it. Page “splits” of this sort can still ripple upward, but they’re so cheap as to be unmeasurable.
The test results that I posted for memcachedb bear this out. In a test using four client threads, the average latency for a BDB write operation was 655 microseconds and the maximum was 336 milliseconds. In the same test, the average latency for LMDB was 127 microseconds and the maximum latency was 517 microseconds. So while on average LMDB performed 5 times faster than BDB, the difference in maximums was three orders of magnitude. So LMDB has a much more predictable response profile than BDB.
Q: Does an LMDB instance exploit multiple cores? If so, what is the structure of this usage? A: Within an LMDB environment, multiple readers can run concurrently with a single writer. Readers are never blocked by anything and do not themselves block writers.
Q: How does LMDB compare to other databases such as SAP HANA, Hadoop, and Cassandra? A: SAP HANA is an in-memory database and has all the same drawbacks as every other in-memory DB - it is limited to the size of physical RAM, which makes it useless for anything other than a constrained cache. It probably also suffers from excessive use of malloc and memcpy, like most other in-memory databases. Hadoop and Cassandra are both nice computer science projects, but they’re also both written in Java. They work, but we can do better with LMDB.
Q: We’ve talked a lot about design- what about performance? Can you give me some round numbers for how LMDB stacks up to other databases? A: Sure. We started with a fairly small system by today’s standards- 32GB of RAM, a 512 GB SSD, and 4 processors running at 2.2 GHz. We compared LMDB to the Berkeley DB in the OpenLDAP use cases using a 10-million entry database. We found that with LMDB the time to bulk-load the database was one third of that required for BDB, with the load time going from 113 minutes to 20 minutes. Read performance was where we saw the greatest improvement: LMDB posted search rate results 31,674 searches per second- more than twice that of BDB’s 14,567 searches per second. OpenLDAP’s memory footprint after running that test under LMDB was just about a quarter of what BDB required, so it’s clear that memory efficiency is also much better with LMDB.
We also ran a set of concurrency tests on a larger, faster machine. In those tests performance improvements ranged from about a 2x increase in a 2-processor configuration to a 15x increase in a 16-processor configuration. This is due to LMDB’s lockless read design. With randomly-generated searches across the entire database, we reached 119,000 searches/second vs 67,000 searches/second for BDB. These numbers are extremely good- no other directory server we tested has ever reached even half of what BDB can do. Update performance is excellent and continues to improve. In these tests a mixed search-update job pegged at 40,600 ops/second for HDB, while LMDB performed about about 86,000 ops/second. We are continuing to work on write performance, and we expect it to continue to improve in later releases. [Note: Additional benchmarking information can be found at https://symas.com/mdb/#bench]
Q: Where do you see LMDB having the most impact? A: It can replace pretty much any piece of code where Berkeley DB is being used, and the product will be many times faster with a smaller memory footprint. Obviously it’s a game changer for OpenLDAP and it’s seeing a tremendous acceptance. Where I really see benefit, though, is in mobile devices as a replacement for SQLite, which is used pervasively in mobile OSes for storing settings, configurations, and data. SQLite has a much larger footprint and is quite a bit less efficient than LMDB, so mobile devices that use LMDB instead of SQLite will be more responsive and will be able to do more. Other areas I think this would have a big impact include session management in web applications, replacing things like memcache and memcacheDB.
Q: I understand you’ve run benchmarks with several different types of file systems to see how LMDB behaves with them. What can you tell me about that work? A: Yes, and we’ve also run against other databases that are either established or now coming into vogue. The results of that work has been collected at https://symas.com/mdb/#bench.
As a general rule, all of the journaling systems perform worse than plain old Linux ext2. However, that’s just in their default configuration with their journals embedded in the same partition as their filesystem. I also set them up using an external journal. In this case, since I had no other hard drive available, I created a 1GB file on tmpfs (temporary file system) and used a loopback device to get mkfs to use it.
With an external journal the synchronous write results on JFS are quite fast, faster than even ext2. If you have an application where fully synchronous transactions are required, JFS looks to be the way to go.
I have not seen anyone else doing filesystem benchmarks with external journals. It seems to me most people are using these filesystems in quite sub-optimal configurations.
Q: What about clustering? Can LMDB be used in a clustered environment? A: We’ve looked at using LMDB as the backend for several existing distributed database systems, such as Hadoop, Cassandra, MongoDB, Riak, and HyperDex, among others. Many of these are poor candidates because their data stores are integral parts of their software architectures and it would take a great deal of effort to work LMDB in as a backend. Some, like Riak and HyperDex, have a very well-defined modular backend architecture, which makes them much better targets for adapting to LMDB. We’ve completed a port to HyperDex and the Riak folks report that they have an experimental backend.
Q: Are you encouraging other Open Source projects to adopt LMDB? A: Yes. Several projects have already integrated LMDB with their code, and we urge anyone interested in using LMDB to contact us if they need assistance.
Q: What, if any, other projects are adopting LMDB for use as their database? A: LMDB was integrated with OpenDKIM several months ago and is now available in public releases. Symas did the work to integrate LMDB with Heimdal Kerberos, Postfix, and Cyrus SASL, and contributed the results to each of the projects. We expect LMDB support to appear in public releases of these packages soon. We’ve also developed versions of MemcacheDB and SQLite3 that use LMDB, and that code was posted to Gitorious. As an interesting side-note, according to the memcache testing utility, MemcacheDB with LMDB is faster than the memory-only version of Memcache. LMDB is also integrated with Riak and we are working on an interface to SQLite4. Many other projects are using LMDB as well- the complete list can be found at https://symas.com/mdb/#projects.
Q: Can LMDB be used with programming languages? A: Yes. The major languages- C, C++, Perl, Python, PHP, etc, are all supported now, and many more are coming. The complete list of supported languages is at https://symas.com/mdb/#wrappers.
Q: What comes with LMDB? A: Besides the LMDB library itself, we offer mdb_stat, which provides information about the database and its state, and mdb_copy, which creates a complete backup of the database. We also provide the complete API documentation.
Q: How is LMDB licensed? A: It’s part of OpenLDAP, so it’s under the OpenLDAP license- basically a BSD-style license.
Q: What about support? A: My company, Symas, offers commercial-grade support programs for LMDB to OEMs. We’ll also be offering supported versions of memcacheDB and other NoSQL-type databases to end users.
About Howard Chu: Howard Chu is a co-founder of Symas and serves as its CTO. He is also Chief Architect for the OpenLDAP Project and plays a mean Irish fiddle. You can find out more at highlandsun.com.
More information: Symas Corporation
MDB: A Memory-Mapped Database and Backend for OpenLDAP (http://www.openldap.org/pub/hyc/mdm-paper.pdf) The slides from the talk (http://www.openldap.org/pub/hyc/mdm-slides.pdf)
LDAPCon 2011 - Memory-mapped Database for OpenLDAP (http://www.youtube.com/watch?v=SrKQNed7KK8)
Life After BerkeleyDB: Symas’s Lightning Memory-Mapped Database lanyrd.com/2012/linuxcon-north-america/sxxkq/
MDB and NoSQL (http://nosql-databases.blogspot.com/2012/08/openldap-mdb-memory-mapped-database.html)
SQLite and LMDB - AKA SQLightning (http://www.openldap.org/lists/openldap-devel/201203/msg00017.html)
The OhLoh Page for SQLightning (http://www.ohloh.net/p/SQLightning)
Wikipedia Article on B+ Trees: http://en.wikipedia.org/wiki/B%2Btree
LMDB Microbenchmarks: https://symas.com/mdb/microbench/
The collection of LMDB technical information: https://symas.com/mdb/