Anticipating the Future With LMDB
Updated: Oct 6, 2022
As we’ve noted before, LMDB was designed to be a data storage solution for today and especially for tomorrow, not to keep rehashing yesterday’s problems.
That’s why our focus from the beginning was on solid-state storage. Although SSDs were still considered too risky back in 2011, when LMDB was first being written, they are now universally accepted, both in large-scale data centers and in mainstream consumer gear. The performance gain more than justified the higher price per gigabyte, and that price has continued to decline steadily. For example, in just the few years from 2011 to today, the price of SSDs has dropped from $2/GB to only $0.12/GB. We knew that LMDB’s ACID transaction design would exact a high performance cost on old-fashioned rotating platters, but we also knew that rotating media would rapidly be displaced by solid state. Instead of optimizing for HDD efficiency, we focused on CPU efficiency, recognizing that this area of software design was generally stagnant and in dire need of help.
Another point that we recognized early on, and which the rest of the industry is only now grudgingly admitting, is that we’ve hit a CPU performance wall, and from here on out there are no easy answers. CPU cycles aren’t getting any faster, and every cycle counts. Today’s CPUs have single-thread performance only a few percent better than CPUs of 5 years ago. Aggregate multi-thread performance is still increasing marginally, but only by virtue of cramming more cores and hardware threads into a chip. Of course, this is still an overall improvement in cost per operation, assuming your software parallelizes well. This is why LMDB’s design focused on lockless read performance, allowing reads to scale perfectly across arbitrarily many CPUs: we knew that single-thread performance was at a standstill, and the only way forward was to support massively parallel scaling.
One avenue we discarded, however, which other DB designers chose to invest in, is intrinsic data compression. There are several reasons for this decision:
It’s out of scope for a database engine. A DB engine’s #1 job is to accept data from the user and store it to disk in a way that it can be efficiently located for future reference. In particular, a Key/Value store like LMDB is not concerned with the semantics of the data being stored - it is just a binary blob, to be stored and fetched verbatim. It is the job of some other software abstraction layer to worry about the content and form of the data, including whether or not it is compressed or encrypted.
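The layering described here can be sketched in a few lines. This is a minimal illustration, not LMDB’s API: a plain Python dict stands in for the key/value engine, and the `BlobStore` and `CompressingLayer` classes and their method names are hypothetical. The store sees only opaque bytes; the decision to compress lives entirely in the layer above it.

```python
import zlib


class BlobStore:
    """Stand-in for a key/value engine: stores and returns bytes verbatim.

    A dict is used here purely for illustration; it is not LMDB.
    """

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes:
        return self._data[key]


class CompressingLayer:
    """Hypothetical application-side wrapper that owns the compression choice."""

    def __init__(self, store: BlobStore, level: int = 6):
        self._store = store
        self._level = level

    def put(self, key: bytes, value: bytes) -> None:
        # The engine below never learns that the value was compressed.
        self._store.put(key, zlib.compress(value, self._level))

    def get(self, key: bytes) -> bytes:
        return zlib.decompress(self._store.get(key))


store = BlobStore()
layer = CompressingLayer(store)
payload = b"hello world " * 100
layer.put(b"greeting", payload)
```

An application that does not want compression simply talks to the store directly, and the engine itself stays unchanged either way.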
The choice to automatically compress data only makes sense if it boosts overall system performance and throughput. Other designers reason that their memory or I/O costs are high, and they have CPU cycles to spare, so it’s worth their effort. Unfortunately, that’s a decision made from an extremely myopic viewpoint - from a designer who sees only his DB engine and believes that the CPU has nothing better to do with its time. In a busy application environment, where the CPU has to actually do work on behalf of real users, and not just spin around crunching data for the DB engine, this perspective is completely invalid.
Ultimately, the tradeoff isn’t worth it. Compressing data by a factor of 2, in the hope of saving a factor of 2 in memory space or I/O bandwidth, costs more than 2x the CPU time. The fastest compression algorithms will not reliably deliver 2x compression, and the strongest compression algorithms will cost far more than 2x as much CPU time.
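Readers who want to check this tradeoff against their own data can measure it directly. The sketch below is only a rough harness under stated assumptions: it uses Python’s standard `zlib` module (not necessarily the algorithm any particular DB engine uses), the `measure` helper is hypothetical, and the sample payload is synthetic. Results depend heavily on the data and the hardware, so no expected numbers are given.

```python
import time
import zlib


def measure(data: bytes, level: int, repeats: int = 5):
    """Return (compression_ratio, best_seconds) for zlib at the given level."""
    best = float("inf")
    compressed = b""
    for _ in range(repeats):
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        best = min(best, time.perf_counter() - start)
    return len(data) / len(compressed), best


# Synthetic, highly repetitive payload; substitute a sample of your real data.
sample = b"".join(b"user:%06d,status=active;" % i for i in range(20000))

# Level 1 is zlib's fastest setting, level 9 its strongest.
results = {level: measure(sample, level) for level in (1, 6, 9)}
for level, (ratio, seconds) in results.items():
    print(f"level {level}: ratio {ratio:.2f}x, {seconds * 1000:.2f} ms")
```

Comparing the measured ratio against the CPU time spent at each level shows what a given workload actually pays for the space it saves.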
Historically, the trend has been to make compression in the DB layer irrelevant. While the cost of CPU performance has been declining steadily at 33%/year, the cost of storage has been decreasing at 38%/year. That is, CPU cycles are becoming more expensive relative to storage space, so it’s more cost-effective not to waste CPU cycles on compression. Indeed, the gap between CPU performance cost and storage cost is only going to widen, as Intel not only acknowledges the end of Moore’s Law, but predicts the next few generations to deliver *less* performance than today’s chips.
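To see how the two rates compound, here is a small worked calculation. It assumes, purely for illustration, that both rates quoted above stay constant and compound annually; the function names and the year horizons are hypothetical choices, not figures from any source.

```python
CPU_DECLINE = 0.33      # yearly decline in the cost of CPU performance (from the text)
STORAGE_DECLINE = 0.38  # yearly decline in the cost of storage (from the text)


def relative_cost(decline: float, years: int) -> float:
    """Cost after `years`, as a fraction of today's cost, at a constant yearly decline."""
    return (1.0 - decline) ** years


def storage_vs_cpu(years: int) -> float:
    """Storage cost relative to CPU cost, normalized so that today's ratio is 1."""
    return relative_cost(STORAGE_DECLINE, years) / relative_cost(CPU_DECLINE, years)


for years in (1, 5, 10):
    print(f"after {years:2d} years: storage/CPU cost ratio = {storage_vs_cpu(years):.3f}")
```

A ratio below 1 means storage has become cheaper relative to CPU. After one year the ratio is 0.62/0.67, roughly 0.93, and it keeps shrinking every year that the two rates hold.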
As time goes on, LMDB’s leanness will continue to serve its users well, while other folks’ investment in irrelevant features will simply decay into useless bloat. Don’t get mired in software written for yesterday’s problems. Look ahead, and meet the new challenges head on. We’ll be right there with you.