Toby Jungen

March 30, 2010

NoSQL Smackdown Postmortem, Part 2

Yesterday I wrote a post outlining the status of NoSQL, and what technology is involved. With the boring stuff out of the way, on to the more interesting questions regarding the NoSQL Smackdown. Where do the participants fit in? What do they have to say? Who cried uncle and ran home to mommy?

On the one side, we have Cassandra and SimpleDB, which are both distributed database systems built with scalability concerns as the driving motivator. On the other side, we have CouchDB and MongoDB, both document databases with flexible schema and programmer-friendly data representations.

  • Cassandra originates from Facebook, which built the system as a means to store and deliver private messages between users. The system was released as open source, and is now under heavy development, with many high profile companies contributing both funds and code. The system has proven itself in a number of operational environments, including popular sites like Digg and Twitter.
  • SimpleDB was developed at Amazon as a means to manage shopping cart information. It applies some principles of Dynamo, a research project developed by Amazon researchers. SimpleDB is available for use in Amazon’s cloud computing platform, at charge. While not open source, some implementation concepts are outlined in the research publications of the Dynamo project, and a number of open source implementations have arisen out of this public knowledge (of which Cassandra may be considered one).
  • CouchDB was one of the first document database systems to gain popularity. It was built as an open source system from the start, and notably was implemented in the rather novel Erlang programming language, a rather unusual choice for a database system.
  • MongoDB is also a document database system, built to satisfy many of the same needs CouchDB aims to address. It is also open source, but implemented in C++. There are a number of features that differentiate MongoDB from CouchDB, such as a custom built binary serialization format, or as of recent betas, automatic replication and sharding.

Where’s the beef?

One could wonder, which so many similarities between these systems, why all the disagreement? Well, even though these systems all make many of the same assumptions, and all aim to offer solutions where traditional RDBMS’s have failed, there are strong opinions about which problems are most important to solve, and how exactly those problems must be solved.

One major rift was between Werner Vogels and the other three contenders - as CTO of Amazon, Werner is representing a major enterprise, with strong business needs and shareholders to answer to. The other contenders are all open source. Though many businesses rely on open source, the divide between the corporate world and open source community is firmly entrenched, and heated debate has been waged across those lines for decades.

Another rift was between two obvious contenders - CouchDB and MongoDB. The two are competing in the same space, and are vying for the same audience. As a result, each attempted to one-up the other by trumpeting certain distinguishing features, ease of use, or specific performance benchmarks.

Finally, there was also a rift in ideology. Both Cassandra and SimpleDB tout scalability concerns as the most pressing issue for many applications, while CouchDB and MongoDB place emphasis on a simpler data model for higher productivity and more agile development practices. While both Cassandra and SimpleDB also offer a unique data model, and both CouchDB and MongoDB promise scalability, the difference in focus is clear.

The missing factor

One issue I felt wasn’t very well addressed during these debates (and I voiced this as a question near the end), was how traditional RDBMS’s fit into the picture. These systems have served us well for decades, and there is still much ongoing and exciting development in such traditional systems (e.g. RethinkDB or Drizzle). Stu was quick to counter that outgrowing hardware and machine failure can be painful and disastrous for a live application. I agree, but there are known solutions to dealing with such problems (e.g. master-slave replication and judicious use of data sharding). Werner gave a more elaborate rebuttal, mentioning that RDBMS’s are used for several of Amazon’s components, and that many, if not most of the applications deployed on the Amazon cloud platform make use of traditional database systems. This eventually led Amazon to release an offering specifically centered around RDBMS’s. I may not be sufficiently remembering the event, but I cannot recall the rebuttals from Wynn and Jan.

Another interesting mention made in closing remarks was a third ideology in the NoSQL movement - graph databases. Specifically mentioned was Neo4J, a database capable of managing data stored in a graph consisting of millions of nodes. As mentioned in the last post, application objects are treated as nodes in a graph, so the mapping between such a database and an application may be even more natural than with document databases. Furthermore, certain complex operations such as deep graph traversal can have very high computational costs in non-graph based systems. These operations are trivial on a graph database, as such databases are built to optimize such operations.

Conclusion of sorts

One final thought I’d like to take away from this is that the days where a single system can be used for every job are over. There is no swiss army knife, there is no magic bullet when it comes to storing and managing data. Every application has unique needs; what works well for one application can be horrible solution for another. It is vital for modern application developers to be knowledgeable of their needs, problems that may arise, and what solutions are available. And perhaps more importantly, developers must be able to prioritize their needs accordingly. Everyone has scaling problems of some form, but does that really mean you need a fully distributed database system? Sometimes the old trusted “good enough” solution is just that - good enough.

My thoughts on the NoSQL Smackdown event end here, but there’s much more to be said about NoSQL. Such as how to deal with the trolls. I’ll keep you posted.

March 30, 2010

NoSQL Smackdown Postmortem, Part 1

Two weeks ago, I had the honor of co-leading a breakout talk at the SXSWi Big Data meetup coordinated by infochimps. I’ll leave thoughts on my session for another time. Instead, I am writing about the highlight of the event - the so titled “NoSQL Smackdown”, which consisted of a four-way debate between Jan Lehnardt of CouchDB, Wynn Netherland standing in for MongoDB, Stu Hood of Cassandra, and as a surprise guest, Werner Vogels jumping in for Amazon’s SimpleDB/Dynamo. You can find a recording of the whole thing here, courtesy of the changelog show. In the interest of readability, I’ll split this into multiple posts.

Needless to say, the debate got rather heated. Represented were two of the biggest ideologies in the somewhat controversially named NoSQL movement. For the uninitiated, NoSQL is an umbrella term to describe alternative database systems that do not conform to the traditional RDBMS structure of highly structured and possibly interconnected tables organized by columns and rows. RDBMS’s have served their purpose extremely well, but the proliferation of a more dynamic web with much higher levels of interaction and much larger date volumes means that the traditional RDBMS is no longer well-suited for every task.

Two fundamental problems RDBMS’s struggle with are scalability and impedance mismatch. The former is a rather obvious problem - as the data managed by a database grows, so do the demands on the hardware running the database system. Moore’s Law has allowed traditional, single-machine based solutions to remain passable solutions for many years. And while Moore’s Law still applies, it has not kept up with certain data demands. Certain applications can easily outstrip the computational capacity of even more exotic supercomputers. Internet companies serving hundreds of millions of users daily, such as Google or Facebook, have to manage petabytes of data being accessed and manipulated constantly. A single computer that can even only just store this data does not exist, much less a computer that can perform the necessary computations to serve thousands of requests a second. To ameliorate this dilemma, engineers have concocted database systems that can utilize a cluster of networked computers rather than a single machine. I won’t go into the nasty details for now, but needless to say, such systems have to deal with many complexities that are simply not present when running on a single machine. Machines in the cluster can and do fail, connections can be interrupted, and network latency makes most operations orders of magnitude slower than they would be on a single machine. Tremendous research has been done on such distributed systems, and they are now sufficiently mature to form the backbones of the biggest of systems in existence today.

However, in order to make these systems work, some compromises had to be made. Perhaps the most noticeable compromise is that ad-hoc joins are no longer feasible. Many applications built on top of RDBMS’s rely fundamentally on such joins being available, so to modify an application to work with a distributed database often requires a grounds-up rewrite of the entire application. Note that relationships in data sets can still be represented in such systems, but it is now incumbent on the application to manage such relationships, rather than rely on the database to compute the relationships on demand.

The other major problem with RDBMS’s, the impedance mismatch, is not quite as easily understood. Fundamentally, this refers to the logical disconnect between how data is represented in the database (tables with rows and columns) and how data is represented in the application (usually objects). In a database, relationships in the data are managed via keys or identifiers. In an application, such relationships are instead managed via references, which allow the data to be thought of as a graph. Granted, the object model is an abstraction, as under the hood, objects are represented in memory in a fashion highly reminiscent of tables with rows, columns, and keys. But the object model has proven to be a highly useful abstraction, and nearly every modern programming language has some notion of an object (yes, even functional languages - I’ll get into language issues at a later time). To solve the impedance mismatch, software engineers have developed object-relational mapping libraries, which aim to facilitate the translation of data stored in tables to application objects. These libraries are very useful and can greatly reduce the amount of code programmers need to manage, but they have their limitations. And most importantly, these libraries only make dealing with the mismatch easier - they do not eliminate the mismatch. To this end, database system developers have created systems that abandon the typical table model of data and instead represent data in ways much more natural to object-oriented programmers. Such systems are often described as object or document stores (there is a distinction there, but I’m simplifying here). Data stored in these systems can be mapped to application objects with very little extra work, resulting in much more concise and understandable application code. Furthermore, these systems are far more flexible than table-based systems, as they generally do not constrain their data to fit into a rigid schema that must be defined a priori. However, as with distributed database systems, compromises are made. Data relationships cannot be queried as easily as in an RDBMS, and the lack of a predefined schema push data validation and consistency checking to the application.

Now that the boring stuff is out of the way - stay tuned until next time! Part 2 will follow shortly!

April 03, 2009

Importance of Terminology

Recently, I came across an article on TechCrunch, which reported on Amazon’s recent addition of Hadoop to their cloud infrastructure. I won’t go into what all of this means as there is a differnet point I want to make. The author of this article mistakenly called MapReduce a file system, which isn’t entirely correct. Readers were quick to point out the mistake (myself included), with some responses being rather harsh. Why would readers react so strongly to a misuse of terminology that is entirely understandable if you have not had intimate experience with the technology in question?

The reason for such a reaction is simple. For a developer, using the correct terminology when describing and discussing his / her work is crucial. If the wrong terminology is used, then grave misunderstandings can result, which could easily lead to rather disastrous consequences. To use another example, I recently attended VoCamp Austin, which was a technology meetup to discuss Semantic Web technology. During the meetup it became exceedingly clear that a number of the attendees had a poor grasp of the terminology used to describe Semantic Web concepts. Terms such as triples, ontology, semantics, structured data, linked data, schemas, RDF, OWL, and XML were thrown around, often incorrectly.

The improper use of terminology when discussing something as involved as the Semantic Web introduces a large communication barrier when developers interact. My definition of linked data may differ from yours, and when I describe the role of linked data in my application, you will be utterly confused. In the worst case, the term will mean something utterly different to you, and if you are using my description as the foundation for developing your application, then your implementation will be fundamentally flawed. The cost of such miscommunication may be subtle, but the consequences can be disastrous, such as in the case of the Mars Climate Orbiter, where certain values were communicated incorrectly, and one team understood the values to be in metric, while others understood the values to be in the imperial system.

Lesson learned? When discussing something of significant complexity and substantial inherent terminology, be sure to make the definitions of the terminology used absolutely clear. This may sometimes require your audience to forget their existing definitions, and you should be extra cautious when introducing terminology that uses common words or is borrowed from other terminology.

February 23, 2009

Site Update

It probably still looks the same but I’ve updated alot of the site behind the scenes. Yeah and I haven’t posted in like 6 months. Anyways, now that the site isn’t quite as much of a hack anymore, hopefully I’ll be updating more often again. Check out the other sections for some more updates. Expect more posts to follow here soonish.