Musing on the Future of Computing
- Howard Chu
Symas has set the pace in the database world with LMDB. The efficiency gains from LMDB’s Single Level Store approach make LMDB’s performance unmatched by any other technology. LMDB’s approach is also ideally positioned to leverage future developments in memory and storage technology, all without changing a single line of code. How did we get here, and where might things be going next?

It’s not hard to understand why LMDB’s approach of eliminating the distinction between in-memory and on-disk storage improves efficiency. When traditional software architectures try to manage on-disk data separately from in-memory data, they invariably perform complex transformations on the data to move it between disk and memory, and must maintain sophisticated bookkeeping records to track the movements of the data.
Leveraging the virtual memory support of modern operating systems removes any need for such bookkeeping or transformations, saving a large number of unnecessary CPU operations.
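To illustrate why this matters, here is a minimal sketch (not LMDB’s actual code) of the single-level-store idea: a record is written to a file in its in-memory layout, then read back directly through a memory mapping. The on-disk bytes *are* the in-memory representation, so there is no serialization, deserialization, or copy into a separate buffer.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* A record stored on disk in its in-memory layout: no serialization step. */
typedef struct { int id; char name[28]; } Record;

/* Write a Record to a file, then mmap the file and read the struct back
 * directly through the mapping.  The OS pages the bytes in on demand;
 * no deserialization or buffer copy is performed.
 * Returns the id seen through the mapping, or -1 on error. */
int mapped_id(const char *path) {
    Record r = { 42, "example" };
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (write(fd, &r, sizeof r) != (ssize_t)sizeof r) { close(fd); return -1; }
    Record *m = mmap(NULL, sizeof r, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping outlives the descriptor */
    if (m == MAP_FAILED) return -1;
    int id = m->id;              /* read straight through the mapping */
    munmap(m, sizeof r);
    return id;
}
```

A traditional architecture would instead encode the record into an on-disk format, track which pages hold it, and decode it again on every read; here the virtual memory system does all of that for free.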
But to really understand how LMDB came to be you have to step back and look at a slightly bigger picture. The design philosophy runs counter to the current trends in computer programming. It’s not just about unifying the dichotomy between disk and memory. It’s also about unifying the diaspora of data types so deeply entrenched in modern programming language design. While modern languages take great pains to design type systems, abstracting data into structures of various incompatible forms, LMDB is written with the understanding that ultimately, it’s all just bytes. This is a level of type agnosticism that is easily implemented in the C programming language, but that is actively prohibited in most modern languages. In my opinion, the fanatical adherence to type systems, and denial of the true digital nature of data, is a severe handicap to modern software engineering. Abstractions are useful when they help humans reason about the tasks at hand, but they have no place at the machine level.
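A small illustration of this type agnosticism in C: the same four bytes can be viewed as a float or as an unsigned integer, with no conversion of the underlying data. Only the interpretation changes.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bytes of a float as an unsigned 32-bit integer.
 * memcpy is the strictly portable way to do this in C; the bytes
 * themselves are untouched -- only their interpretation changes. */
uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

On an IEEE-754 machine, the four bytes of the float 1.0, read as an integer, are 0x3F800000: same storage, two “types.”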
Another consequence of the realization “it’s all just bytes” is that the distinction between code and data is another artificial handicap. None of this is novel to seasoned hackers - the fact that code is just data is par for the course. Real hackers - the people who specialize in copy protection systems, code obfuscation, and the breaking of such systems - build all their work on this fact. A single byte may be used as a valid instruction, or as a data parameter for a computation, or as a character in an output message - all at once, or at varying times in a run of a program. For example, the following seven bytes may be interpreted just as bytes, or as ASCII text, or as valid machine instructions. It all depends on context, and all interpretations are valid simultaneously:
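(The original byte listing has not survived here; the following stand-in sequence, chosen for this illustration rather than taken from the original, has the same property.)

```c
/* Seven bytes.  As ASCII text they read "ABCDEFG"; decoded as 32-bit
 * x86 machine code they are seven valid one-byte instructions
 * (0x41..0x47 = inc ecx, inc edx, inc ebx, inc esp, inc ebp, inc esi,
 * inc edi); and as plain data they are simply the numbers 65 through 71.
 * (In 64-bit mode the same bytes act as REX prefixes instead --
 * context is everything.) */
const unsigned char seven[7] = { 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47 };
```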
Taking it further, reverse engineering is the art of unraveling the obfuscation, analyzing a block of unidentified data and identifying its behavior when executed as code. Engineering in this realm requires the greatest attention to detail and the greatest breadth and depth of knowledge of how computer systems are put together.
We’ve come a long way in processor design using the modified Harvard architecture, which keeps machine instructions separate from ordinary data. Like training wheels, it helps designers get a grasp on the task of building a computer, but it’s not the way to create the most efficient system. Indeed, today we know that genetic algorithms often find the best solutions to hard problems, though why they work so well is still poorly understood. To me it’s obvious that part of their strength is that they don’t segregate code from data - it’s all just bytes - and it all mutates without regard to syntax or structure.
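The “no segregation” point can be sketched in a few lines (a toy mutation loop, not a serious genetic algorithm): the mutation operator below flips whole bytes at random and keeps whichever candidate scores better, with no idea whether the buffer holds code, text, or numbers.

```c
#include <stdlib.h>
#include <string.h>

/* Sum of per-byte distances between two buffers: a crude fitness score. */
int byte_distance(const unsigned char *a, const unsigned char *b, size_t n) {
    int d = 0;
    for (size_t i = 0; i < n; i++)
        d += (a[i] > b[i]) ? a[i] - b[i] : b[i] - a[i];
    return d;
}

/* Blindly mutate one byte at a time, keeping any candidate at least as
 * fit as the current buffer.  The operator never looks at syntax or
 * structure -- it works the same whether buf holds "code" or "data".
 * Requires n <= 64. */
void evolve(unsigned char *buf, const unsigned char *target, size_t n, int iters) {
    for (int i = 0; i < iters; i++) {
        unsigned char cand[64];
        memcpy(cand, buf, n);
        cand[rand() % n] = (unsigned char)rand();   /* blind byte mutation */
        if (byte_distance(cand, target, n) <= byte_distance(buf, target, n))
            memcpy(buf, cand, n);                   /* keep the fitter candidate */
    }
}
```

Because acceptance never keeps a worse candidate, the buffer only ever drifts toward the target - the selection pressure operates on raw bytes alone.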
Again, you need to step back and expand your view. In nature, brains are composed of neurons. There’s no distinction between “compute” neurons and “memory” neurons as there is in a contemporary digital computer. Every neuron serves both purposes simultaneously. The power and adaptability of the system as a whole comes from the individual components being so flexible themselves.
It looked like we might arrive at an analogous system when Hewlett Packard announced their ambitious “The Machine” project. They intended to use memristors the same way brains use neurons. Unfortunately HP has had to scale back their ambitions since the initial announcement. Still, I believe it points the way forward: the computer system of the future will be based on something like memristors, will not distinguish between logic units and data storage units, and will not distinguish between “RAM” and “disk” storage. The software for such systems will not segregate instructions from data, and will not be designed around distinct data types. Any “bit” will serve as memory when needed, or participate in a computation when needed. It may be an integer in one instance or a fully fleshed object in another instance.
How long it takes to get to that future remains to be seen. We’re still in the stage of unifying RAM and disk storage, with persistent RAM technologies like MRAM just entering the scene. Computer science curricula are nowhere near ready yet to deal with a world where instructions and data are interchangeable. They’re still struggling to teach how to properly design data structures. But, just as our brains are the culmination of millions of years of evolution, eventually our technology will get there.