Toby Jungen

March 30, 2010

NoSQL Smackdown Postmortem, Part 1

Two weeks ago, I had the honor of co-leading a breakout talk at the SXSWi Big Data meetup coordinated by infochimps. I’ll leave thoughts on my session for another time. Instead, I am writing about the highlight of the event - the so-titled “NoSQL Smackdown”, a four-way debate between Jan Lehnardt of CouchDB, Wynn Netherland standing in for MongoDB, Stu Hood of Cassandra, and, as a surprise guest, Werner Vogels jumping in for Amazon’s SimpleDB/Dynamo. You can find a recording of the whole thing here, courtesy of the Changelog show. In the interest of readability, I’ll split this into multiple posts.

Needless to say, the debate got rather heated. Represented were two of the biggest ideologies in the somewhat controversially named NoSQL movement. For the uninitiated, NoSQL is an umbrella term for alternative database systems that do not conform to the traditional RDBMS structure of highly structured, possibly interconnected tables organized by columns and rows. RDBMS’s have served their purpose extremely well, but the proliferation of a more dynamic web with much higher levels of interaction and much larger data volumes means that the traditional RDBMS is no longer well-suited for every task.

Two fundamental problems RDBMS’s struggle with are scalability and the impedance mismatch. The former is a rather obvious problem - as the data managed by a database grows, so do the demands on the hardware running the database system. Moore’s Law allowed traditional, single-machine solutions to remain passable for many years, and while it still applies, it has not kept up with certain data demands. Some applications can easily outstrip the computational capacity of even the most exotic supercomputers. Internet companies serving hundreds of millions of users daily, such as Google or Facebook, have to manage petabytes of data being accessed and manipulated constantly. No single computer exists that can even store this data, much less perform the computations needed to serve thousands of requests a second against it.

To ameliorate this dilemma, engineers have concocted database systems that utilize a cluster of networked computers rather than a single machine. I won’t go into the nasty details for now, but needless to say, such systems have to deal with many complexities that are simply not present when running on a single machine: machines in the cluster can and do fail, connections can be interrupted, and network latency makes most operations orders of magnitude slower than they would be locally. Tremendous research has gone into such distributed systems, and they are now sufficiently mature to form the backbones of some of the biggest systems in existence today.
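To make the idea of spreading data across a cluster concrete, here is a minimal sketch of hash-based partitioning - the simplest version of how a distributed store decides which machine owns a given record. The node names and the `node_for_key` helper are mine, purely for illustration:

```python
import hashlib

# A hypothetical four-node cluster; the node names are purely illustrative.
NODES = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def node_for_key(key):
    """Map a record key to the cluster node responsible for storing it.

    Hashing spreads keys roughly evenly across the cluster, so no
    single machine has to hold (or serve) the entire data set.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for_key("user:12345"))  # prints one of the four node names
```

Real systems like Dynamo and Cassandra use consistent hashing rather than this naive modulo scheme, so that adding or removing a node relocates only a small fraction of the keys instead of reshuffling everything.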

However, in order to make these systems work, some compromises had to be made. Perhaps the most noticeable is that ad-hoc joins are no longer feasible. Many applications built on top of RDBMS’s rely fundamentally on such joins, so modifying an application to work with a distributed database often requires a ground-up rewrite. Note that relationships in data sets can still be represented in such systems, but it is now incumbent on the application to manage those relationships, rather than relying on the database to compute them on demand, as the sketch below illustrates.
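Here is a rough sketch of what “the application manages the relationship” looks like in practice. The `FakeDB` client, the key naming scheme, and the record layout are all invented stand-ins for whatever store you might actually use:

```python
# A tiny in-memory stand-in for a key-value store client; a real system
# would go over the network. get() fetches a single record by key.
class FakeDB:
    def __init__(self, records):
        self.records = records

    def get(self, key):
        return self.records[key]

# In an RDBMS, the database computes the relationship on demand:
#   SELECT posts.* FROM posts
#   JOIN users ON posts.author_id = users.id
#   WHERE users.name = 'jan';
# Without joins, the relationship is stored explicitly as a list of keys,
# and the application stitches the records together itself.
def posts_by_author(db, author_name):
    user = db.get("user:" + author_name)  # one lookup for the user record
    return [db.get("post:" + pid) for pid in user["post_ids"]]

db = FakeDB({
    "user:jan": {"name": "jan", "post_ids": ["1", "2"]},
    "post:1": {"title": "Relax"},
    "post:2": {"title": "Eventual consistency"},
})
print(posts_by_author(db, "jan"))  # two posts, each fetched by key
```

Note that keeping `post_ids` up to date as posts are created and deleted is now the application’s job as well - exactly the kind of bookkeeping the RDBMS used to handle for you.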

The other major problem with RDBMS’s, the impedance mismatch, is not quite as easily understood. Fundamentally, it refers to the logical disconnect between how data is represented in the database (tables with rows and columns) and how data is represented in the application (usually objects). In a database, relationships are managed via keys or identifiers. In an application, they are managed via references, which allow the data to be thought of as a graph. Granted, the object model is an abstraction - under the hood, objects are laid out in memory in a fashion highly reminiscent of tables with rows, columns, and keys. But it has proven to be a highly useful abstraction, and nearly every modern programming language has some notion of an object (yes, even functional languages - I’ll get into language issues at a later time).

To bridge the impedance mismatch, software engineers have developed object-relational mapping libraries, which facilitate the translation between data stored in tables and application objects. These libraries are very useful and can greatly reduce the amount of code programmers have to maintain, but they have their limitations. Most importantly, they only make dealing with the mismatch easier - they do not eliminate it.

To this end, database system developers have created systems that abandon the typical table model and instead represent data in ways much more natural to object-oriented programmers. Such systems are often described as object or document stores (there is a distinction between the two, but I’m simplifying here). Data stored in these systems can be mapped to application objects with very little extra work, resulting in much more concise and understandable application code. Furthermore, these systems are far more flexible than table-based systems, as they generally do not constrain their data to fit a rigid schema that must be defined a priori. However, as with distributed database systems, compromises are made: data relationships cannot be queried as easily as in an RDBMS, and the lack of a predefined schema pushes data validation and consistency checking into the application.
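A small sketch makes the contrast concrete. Assume a user with several addresses: in an RDBMS that typically means a `users` table and an `addresses` table linked by a foreign key, reassembled with a join; in a document store the record keeps its natural nested shape and maps onto an application object almost for free. The class and field names here are invented for illustration:

```python
import json

class User:
    """Plain application object whose fields mirror the stored document."""
    def __init__(self, name, email, addresses):
        self.name = name
        self.email = email
        self.addresses = addresses  # nested list - no separate join table

# A document store hands back the record in its natural nested shape:
doc = json.loads("""
{
  "name": "jan",
  "email": "jan@example.com",
  "addresses": [{"city": "Austin", "state": "TX"}]
}
""")

user = User(**doc)  # the mapping is a one-liner; no ORM machinery needed
print(user.addresses[0]["city"])  # Austin
```

The flip side is exactly the trade-off described above: nothing stops the next document from spelling `email` differently or omitting it entirely, so the schema now effectively lives in your application code.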

Now that the boring stuff is out of the way - stay tuned until next time! Part 2 will follow shortly!