In school courses on data mining and big data, one of the first lessons is that we have to be careful about how we represent our data and, consequently, our results. Most of us don’t think of databases when we think of data representation, but how we store our information probably contributes more to overall system performance than almost anything else these days.
Today’s post discusses this NoSQL report. The original article is about 10 pages long and gets a little dense at times, but it makes a couple of important points. I’ll try to work around the technical details here and explain some idiosyncrasies of the NoSQL world. This post assumes some basic knowledge of the difference between SQL and NoSQL.
No Free Lunch
This first quote from the report explains why data representation is so important in today’s world:
SQL redefined the types of technical problems that programmers tackle every day. Equipped with relational tables and fast queries, companies and institutions could largely stop worrying about how to represent and store their data. Come up with a few good relational rules, choose your preferred flavor of DBMS, and cram your data into the tables. For nearly three decades, Dijkstra’s creed of ‘usefully structured’ could be delegated away so that the programmer could focus on deliverables.
This approach to data management has been losing traction for many years now as the computational world shifts its focus to new types of problems. In a deterministic world with cleanly formatted data collection, SQL reigned. But in a world with asymmetric, fast-flowing data, a relational system can no longer keep up. The responsibility falls once more to the programmer to decide how to demonstrate correctness in data storage.
Sometimes we find what seems like a magic technology: some tool that’s so good, we’re willing to forgive every shortcoming and apply it to our technical problems like it’s a panacea. While this tends to work for a little while, it soon hurts the efficiency of our programs, and we’re forced to move on. This applies to both SQL and NoSQL technology. As the report mentions, we spent a long time believing that SQL was the cure for all our data storage problems. Now we have to be careful not to fall into the same trap with NoSQL.
Why is MongoDB so loved and so hated?
Although [Mongo] was first undertaken as a way to solve the data sparseness problem, it has developed into a sort of silver bullet for the average programmer. For this reason, and because of its popularity, it has attracted more criticism than nearly any other NoSQL system. Although some of this criticism is deserved, most of the frustrations are the result of conscious design choices and the inevitable disillusionment as a programmer realizes that MongoDB is not as magic as it once seemed.
It’s really easy to love MongoDB. SQL makes you spend hours coming up with the perfect schema, find special tools to interact with your database, and hire a whole team to maintain it. MongoDB is smooth by comparison. Write two lines in Python and you’ve got a database set up and ready for you to cram all your information into. Not only is it easy to use, it’s also wicked fast.
The problem is that it becomes easy to believe that Mongo is the only database you’ll ever need. Those two lines look really sexy when you’re trying to prove your concept and impress your peers, but six months later, when your project has actually started to get traction, chances are you’ll regret having picked Mongo; I’d say that holds for 70-80% of applications. Sure, it’s nice to just cram in whatever you like without wasting any time figuring out how it’s supposed to fit, but you’ll eventually find that “a few weeks of programming can save you 30 minutes of design.”
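To make that design cost a bit more concrete, here’s a minimal sketch in plain Python, with no MongoDB involved. A list of dicts stands in for a schemaless collection. The names, the field names, and both helper functions are made up for illustration; the point is only that when nothing enforces a shape, every reader of the data has to defend against every shape that ever got written.

```python
# Hypothetical documents, stored exactly as a schemaless collection would accept them.
users = [
    {"id": "id300432", "name": "Bob", "age": 30},
    {"id": "id234099", "name": "Alice", "Age": "26"},  # different key case, string value
    {"id": "id013992", "full_name": "Carol"},          # renamed field, no age at all
]


def average_age(docs):
    """Average the integer 'age' field, silently skipping documents without one."""
    ages = [d["age"] for d in docs if isinstance(d.get("age"), int)]
    return sum(ages) / len(ages) if ages else None


def missing_fields(docs, required):
    """Map each document id to the required fields it lacks."""
    report = {}
    for d in docs:
        missing = [field for field in required if field not in d]
        if missing:
            report[d["id"]] = missing
    return report


print(average_age(users))                      # only Bob's record counts
print(missing_fields(users, ["name", "age"]))  # the drift nobody designed for
```

The average quietly covers one user out of three, and nothing complained when the odd records were written. A schema, or even a validation function you agree on up front, is the 30 minutes of design.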
Cassandra and column-oriented storage
Column orientation is the idea that rows (or instances) of data usually contain more information than you need. The way Cassandra stores its data for queries is based on columns. Imagine, for example, that instead of having a row looking like “id300432, Bob, Male, 30” we instead get to write columnar information. The “age” column, therefore, would look like “30:id300432, 26:id234099, 41:id013992,…” etc.
Although the advantages might not be immediately obvious, column-orientation has enormous advantages when querying diverse information requests across a wide variety of rows. This happens because no row needs to be read (or even known about) unless it belongs to the column family being requested. This is, of course, moderately slower when the whole row is being queried for, but most of the problem sets interested in a full-row query are typically better suited for an RDBMS setup anyway.
What this basically boils down to is that Cassandra can go really quickly when your lookup only cares about one column. If you’ve got some rows that don’t have information in that column, then great. You don’t have to look at them. No idea what the other columns look like? No problem. Cassandra only deals with one column’s data, meaning it can go lightning fast on these types of queries. Just be careful when your search is more structured and needs whole rows back, since that’s where this layout slows down.
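Here’s a toy version of that idea in plain Python. It is only an in-memory sketch of the row-versus-column layout described above, not how Cassandra actually stores anything on disk. The first three ids and ages come from the example; the names, the fourth row, and the helpers `to_columns` and `ids_where` are made up for illustration.

```python
# Row-oriented records. The last row has no age recorded (hypothetical).
rows = [
    {"id": "id300432", "name": "Bob", "gender": "Male", "age": 30},
    {"id": "id234099", "name": "Ann", "gender": "Female", "age": 26},
    {"id": "id013992", "name": "Raj", "gender": "Male", "age": 41},
    {"id": "id555555", "name": "Eve", "gender": "Female"},
]


def to_columns(records, key="id"):
    """Pivot row records into one list of (value, row_id) pairs per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            if field == key:
                continue
            columns.setdefault(field, []).append((value, record[key]))
    return columns


def ids_where(column, predicate):
    """Scan a single column and return the ids whose value satisfies predicate."""
    return [row_id for value, row_id in column if predicate(value)]


columns = to_columns(rows)
print(columns["age"])                               # just the ages, tagged with ids
print(ids_where(columns["age"], lambda a: a >= 30)) # never touches name or gender
```

The row with no age never shows up in the `age` column, so the query doesn’t even know it exists. Rebuilding a full record, on the other hand, means visiting every column, which is the slow case mentioned above.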
Anyway, there’s a lot more information available in the report. Hope you enjoy!