Is the Relational Database Doomed? (readwriteweb.com)
71 points by timf on Feb 12, 2009 | 47 comments


This article is fairly well-balanced, but the hype around these "hashes in the sky" is completely out of control. This article tries to be objective, but it's not doing enough to counter the rabid zeal that is gripping silver-bullet-seeking, Twitter-wannabe young web developers. My observations are that too many people are making the following mistakes:

* Overestimating their need to scale - a powerful box with a cache-backed site can serve a LOT of hits. The vast majority of apps will never need to scale beyond this, don't kid yourself.

* Underestimating the amount of work and the cost of failure in maintaining data integrity at the application level - for all but the most simplistic applications you will end up writing a ton of code that replicates what a RDBMS does, except slower, with more bugs, and less generality.

* Underestimating the importance of data integrity - When your massive database with billions of records has subtle data integrity issues, how hard will it be to fix? My guess is that some truly nasty situations will arise and over time the hype will be tempered by the horror stories.

* Underestimating the constraints this puts on future development - Sure, greenfield apps may seem like a great candidate for a document-oriented DB, but how often do apps look like the original version 1 year later, 2 years later, 5 years later? With a relational database your bets are hedged automatically. You can go in a million different directions with your data. The relational model gives you orders of magnitude more flexibility than complex algorithms designed for map-reduce-style scalability. If you don't need the scalability you are throwing an awful lot of flexibility out the window for nothing.

* Underestimating how relational their data really is - It's not just that developers understand the relational model, it's the fact that it's a rigorous theoretical model that actually models relationships in an academically complete way. Sure, real life RDBMSs don't live up to the theory, and sometimes things are too slow to be practical, in which case you fudge things as necessary or add layers of caching. But at the end of the day, the relational model can cover pretty much anything, whereas these hash-databases give you a small set of scalable functionality and some clever algorithms to accomplish a bunch of different things. The things you can accomplish are not comprehensive in a theoretical sense, they are just techniques that have proven useful for a number of applications recently. However, the limitations are not as well defined as those of an RDBMS.
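
To make the data-integrity point above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`users`, `orders`, `total_cents`) are hypothetical and not taken from the article. One declared constraint is enough for the database to reject an orphaned row; with a plain key/value store that check would have to live in application code.

```python
import sqlite3

def try_orphan_insert():
    """Return True if the database rejects an order that points at a missing user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " user_id INTEGER NOT NULL REFERENCES users(id),"
        " total_cents INTEGER NOT NULL CHECK (total_cents >= 0))"
    )
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
    conn.execute("INSERT INTO orders (user_id, total_cents) VALUES (1, 500)")
    try:
        # User 99 does not exist, so the foreign key constraint should fail.
        conn.execute("INSERT INTO orders (user_id, total_cents) VALUES (99, 500)")
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False

rejected = try_orphan_insert()
```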


>> Underestimating the amount of work and the cost of failure in maintaining data integrity at the application level - for all but the most simplistic applications you will end up writing a ton of code that replicates what a RDBMS does, except slower, with more bugs, and less generality.

If I'm the first to explicate the underlying rule, can I name it after myself? Please? Pretty please with Lisp on it?

Doffing's Tenth Rule of Database Systems (with apologies to Philip Greenspun):

"Any sufficiently complicated database management system contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of a good RDBMS."


The answer is "no". The question is "will key/value cloud computing data bases replace relational"? RDBs are very good at modeling your data, but have scaling problems. The cloud computing solutions scale well but are really poor at modeling data; you have to do it yourself. Example problem: you commit an update to a record. Subsequent reads may not see the change until the DB gets around to committing it some time later. Can't run your airlines or banks that way. The eventual solution will have RDB-like modeling with cloud-like scaling, replication, robustness, and all that good stuff. The real solution will allow you to request high data integrity along with I-dont-care-what-happens for storing your RSS feeds or slashdot comments.
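
The stale-read problem described above can be sketched with a toy model. This is an illustration under simplifying assumptions, not how any particular cloud store works: the class name `EventuallyConsistentStore`, the single replica, and the explicit `sync()` step are all made up for the example.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on a primary; a replica catches up only on sync()."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # replication log not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)

    def sync(self):
        # Apply the replication log in order, then clear it.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("seat_12A", "booked")
stale = store.read_from_replica("seat_12A")  # replica has not caught up yet
store.sync()
fresh = store.read_from_replica("seat_12A")  # visible after replication
```

Between `write()` and `sync()` a second customer reading from the replica would still see the seat as free, which is the airline/bank problem in miniature.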


>> RDBs are very good at modeling your data, but have scaling problems.

I think the scaling problems of relational databases are vastly overstated. Right now, using off the shelf kit, you could build a relational database handling 10,000 commits/sec and 100T of data, without using "sharding" or any nonsense like that. Too many people think MySQL and its limitations are representative of this technology; it isn't, not by a long way.


>> Too many people think MySQL and its limitations are representative of this technology; it isn't, not by a long way.

This is about as true as it is sad, but when your only exposure to relational databases is the least mature and featured one on the market, and everyone is too busy sharding and remaining oblivious about it, this is the kind of discussion you get.


"Sharding" is shorthand for "MySQL's concurrent write performance is abysmal, let's spend more on trying to make it work than it would have cost to just buy Oracle".


Not to mention that sharding probably is one of those things which might seem clever when your database engine doesn't even have hash joins.

I mean, when your DB can't join large result-sets quickly and efficiently in process, why care about the loss of joinability when you spread your data across several servers?
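
For anyone who hasn't met the term, a hash join can be sketched in a few lines. This is a simplified in-memory illustration, not any engine's actual implementation; the function name `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict

def hash_join(left_rows, right_rows, left_key, right_key):
    """Equi-join two lists of dicts where left_key equals right_key."""
    # Build phase: index one input (ideally the smaller one) by its join key.
    buckets = defaultdict(list)
    for row in left_rows:
        buckets[row[left_key]].append(row)
    # Probe phase: stream the other input and look up matches in the hash table.
    joined = []
    for row in right_rows:
        for match in buckets.get(row[right_key], []):
            joined.append({**match, **row})
    return joined

users = [{"user_id": 1, "name": "alice"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"order_id": 10, "buyer_id": 1},
    {"order_id": 11, "buyer_id": 1},
    {"order_id": 12, "buyer_id": 3},  # no matching user, so it drops out
]
result = hash_join(users, orders, "user_id", "buyer_id")
```

Each input is read once, so the cost grows roughly with the combined size of the inputs rather than their product, which is why the technique matters for large result sets.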


DB2 is free up to something like eight gig of memory usage. I don't understand why people are so wedded to MySQL and Postgres now that DB2 and Oracle have free and very cheap versions.


I can't speak to DB2, but I am really sick of the weird complexity of Oracle. Its power comes at a fairly high cost of complexity, whatever the pricing structure looks like.


It sure is mired in complexity. I am an Oracle programmer by day and once you put it under load there is a pile of bugs that starts to show up! That said, Oracle can do some amazing work, if you have people on hand who can make it do it!


Whoever solves the "scalability or consistency: choose one" problem will make a boatload of money. From what I can tell this is a very hard problem, so software that makes it seamless to operate at multiple levels of this tradeoff curve -- perhaps at the same time, for different parts of your data -- would also be a big win.


I think people are doing it, they are just not slicing it up in a pay-as-you-go format for end users.

The best talk I ever saw in this area was by Paul Strong from eBay (they obviously have a very strong requirement for both scalability and consistency). He talked about how they eventually ripped out all transactions and stored procedures from the RDBMS layers and built their own giant, cross-cluster, consistent system to make the zillion customer/auction problem finally manageable.

For the less giant problems, I think Amazon could charge a premium for an API-managed RDBMS solution, they have all the tools for making one in place (as opposed to requiring you go build it yourself with EBS and EC2 nodes).

add-slave-node() ...


In addition to a cloud-hosted solution, packaged software (along the lines of Hadoop) would be great too. Not everyone wants their corporate data out in the cloud, but I suspect they would all be happy to skip writing one-off "fix RDBMS scalability" layers.


It is a very difficult problem. I worked for a short time for a company that was trying to solve consistency, distributed redundancy, and scalability. It succeeded at consistency and redundancy, but failed because transaction chatter caused it to fail at scaling. I think that within an application or enterprise, schemas will be partitioned among types of database by whether the data must always be consistent (RDBs), is mostly read-only and can be updated leisurely (contact lists, personnel records), or can afford to be inconsistent (archives, web chatter).


I see this as an area where Microsoft could pull out a big win. Imagine if, instead of their half-baked BigTable clone that they just released, they had instead put 1000 or so brains on the problem of distributing a single SQL Server database across N machines.

Scaling out to a dozen DB servers that master/slave their way to scalability is no fun, but it's solved. The problem is that you're currently required to do it yourself. It would rock to be able to outsource that to the cloud.

I want to toss my ASP.NET application up into the Microsoft Cloud, where it will figure out how many webservers it needs to spread itself across and how many database servers it needs to fire up to handle the load it's seeing. And I want it to pretend like it's a single webserver talking to a single DB instance on a single box.

Say what you will about Microsoft, but they have the skills to pull that off. I sure hope they're working on it.


>> Imagine if, instead of their half-baked BigTable clone that they just released, they had instead put 1000 or so brains on the problem of distributing a single SQL Server database across N machines.

Unfortunately brains don't scale, so you're better off using one huge brain, like Michael Stonebraker.

http://db.cs.yale.edu/hstore/


Right and his commercial startup is Vertica http://www.vertica.com


I am pretty sure the Azure services platform will eventually have full RDBMS


I assume such a hypothetical company would become a huge Fortune 500 company selling their database technology at a very high price!


Context: So, why are relational databases going to die? Because only key-value hash lookups are fast enough. Wait, can't RDBMSs do hash-based lookups too? Yes they can. So what's the problem? Well, if you want to take advantage of the relations, you have to use joins and such. Why is that bad? The query plans the RDBMS uses end up being slow. Putative solution? Throw out RDBMSs, replace them with key/value DBs (which are strict and severely impoverished subsets of RDBMSs), and implement your own relations.

My comment: This makes me wonder if the solution shouldn't be looked for in a different direction. Instead of throwing out the RDBMS, why not give me one that gives me complete control of the query plan? That's the real stickler here, not the relations themselves, which are there whether you maintain them or not. If the DB can't guess up a good query plan, why not let me simply feed one to it?

Is there any DB that lets me have complete control over the query plan? (And even if there is, it's probably a half-hearted, unoptimized feature with little design effort poured behind it.)

Yes, even so you may have to give up 40-table joins, but it seems to me RDBMSs, with more control given to the user, could still provide a lot of value without throwing out the baby too.


>> Why is that bad? The query plans the RDBMS uses end up being slow.

I think it's simply not the case that people are using key-value stores because the query plans that a typical DBMS optimizer chooses are suboptimal for the kinds of queries most web apps use.


>> If the DB can't guess up a good query plan, why not let me simply feed one to it?

You speak as if good databases won't let you override the query optimizer. Look up optimizer hints.

Most real database systems worth using allow this, but it is discouraged because for most queries, most of the time, the statistics the DB has on existing indexes, data uniqueness, estimated size of result sets, etc. provide a better basis for optimizing the query than anything you can come up with.
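
As a small illustration of overriding the planner, here is a sketch using SQLite through Python's built-in sqlite3 module. SQLite's `INDEXED BY` clause is used only as a stand-in to show the mechanics; SQLite's own documentation says it is not meant as a performance-tuning tool, and the hint syntax of the larger commercial systems differs. The `events` table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
conn.execute("CREATE INDEX idx_events_kind ON events (kind)")

def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

# Let the optimizer pick whichever index it prefers.
default_plan = plan("SELECT id FROM events WHERE user_id = 7 AND kind = 'click'")

# Insist on one specific index; SQLite raises an error if it cannot be used.
forced_plan = plan(
    "SELECT id FROM events INDEXED BY idx_events_kind "
    "WHERE user_id = 7 AND kind = 'click'"
)
```

The exact wording of the plan text varies between SQLite versions, so it is better read by a human than parsed.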


The rise of functional programming might not be accelerated by the pain of impedance matching to RDBMSs. Converting between the relational and OO models of data is a time-consuming pain. If this is widely realized, it could give FP a boost.

>> Is the Relational Database Doomed?

Ah, linkbait! Sorry, no sale. Using a (pre-compiled) stored procedure to do a key value lookup will likely be about as fast as a lookup in a key/value DB. In any case, it's easier to add a key/value lookup capability to an RDBMS than it is to add efficient query capabilities to a KVDB.

The article focuses on scaling, which is rarely an issue. KVDBs win on simplicity and flexibility (no need to pre-define a data model) and a better impedance match with OOPLs. RDBMSs crush on querying, which most (though not all) applications need.


>> The rise of functional programming might not be accelerated by the pain of impedance matching to RDBMSs.

Arrrgh! I meant that it might be accelerated by the pain of impedance matching to RDBMSs. Sorry, it was late at night.


The future is here already, and has been for decades.

These non-relational database structures are not a new or revolutionary idea. The same concepts have long been implemented in systems like LDAP, X.500, and document-oriented databases like Lotus Domino.

The real mystery is the recent trend towards completely reinventing the wheel. For instance, LDAP (the protocol) is incredibly efficient, standardized, and well supported on most development platforms. It provides a standard query / filter language, facilities for scoping, and data updates. There are standardized exchange formats, wide availability of OSS and commercial management tools, and multiple server implementations.

Yet, there seems to be a great interest in developing from scratch highly proprietary and non-portable alternatives that do exactly the same thing (or less). These systems ignore decades of research and lessons learned through practical implementation.

The real issue though is that direct comparison of the two models is inherently flawed. They each have their own strengths and weaknesses, each excelling in situations that the other falls short. Looking at them as complementary models is perhaps a better approach.


As I started reading the article my very first thought went to Lotus Notes (never developed in it, but had friends that did). I agree with you that looking at them as complementary models is probably a better approach to the comparison.


The future is probably going to be a mix. RDBMS are good for data integrity. The non-relational databases are great for distributing queries. I've seen some apps which use a slow, asynchronous RDBMS to supply denormalized 'documents' for the non-relational stores.

Mind you, the RDBMS could still make a comeback. The world is looking to non-relational because since 1998 or so, throwing multiple pieces of hardware and network capacity at the problem is cheaper than solving it all in one giant node or cluster. But maybe one day an off-the-shelf computer (perhaps with better concurrent programming techniques for multicore) will be able to handle all the queries you would ever want and have the bandwidth to match.

Also, just wondering; exactly how many organizations really need a distributed database? There are some top websites, like LinkedIn, that do it all in one giant node, and my guess is that even sites like Facebook have only a few hundred database shards, maybe a few thousand at most.


I do wonder how many people on EC2 use SimpleDB because they need (or think they will need) the scalability, as opposed to thinking it's easier to code for, or simply just so they can avoid the drudgery of running their own RDBMS.


Are Non-Baiting Titles Doomed?


yes

(i meant to say that i hope one word answers are doomed as well)


Jokey comments that are easy to read and digest and are mildly amusing and wind up at the top of the comment sections just like [PIC] submissions on lesser websites...where have I seen those before...?


There may be a good point in here somewhere, but it's not clear what it is. Relational databases don't scale, so... try CouchDB! or Drizzle! ...which also don't scale. What?

For a more complete list of RDBMS-killing science projects, try http://www.metabrew.com/article/anti-rdbms-a-list-of-distrib... (discussion: https://hackernews.hn/item?id=440687 )


Mr. Bain's article would be more worthwhile if he were to elaborate on what it means for a knowledge representation system to be "doomed".

The relational model proposed by Codd[1] was formulated so that applications which adhere to it would continue to work and be free of inconsistencies when the underlying data is updated or reorganized. That may not sound like much but it's surprisingly useful when the data is important or valuable and needs to always be correct. This is also the whole point of all those irritating normal forms.

The relational model has a lot of shortcomings but it's not likely to be replaced by anything that doesn't address these issues.

That said, for some kinds of data it is definitely overkill. Religiously following the relational model for short-lived stuff with little significance makes about as much sense as filling out your shopping lists in triplicate, storing permanent copies in a safe deposit box and requiring two signatures for every alteration, the same way you might if you were changing your will or a deed to your property.

[1] - "Relational Model of Data for Large Shared Data Banks" , http://www.cis.upenn.edu/~zives/03f/cis550/codd.pdf


Codd's paper is instructive from a historical perspective as well. When you read it, you realize that people were grappling with all these same issues before the relational database came on the scene. His description of hierarchical data storage systems describes some of the issues with things like XML databases and XQuery pretty well.


From the submitted article: "But in making your decision, remember the database's limitations and the risks you face by branching off the relational path.

"For all other requirements, you are probably best off with the good old RDBMS. So, is the relational database doomed? Clearly not. Well, not yet at least."

So the answer is no. The article doesn't do justice to the theoretical rigor of the relational database model. An online posting with some interesting discussion of technical trade-offs in popular terms and links to other posts can be found at

http://highscalability.com/paper-dynamo-amazon-s-highly-avai...


RDBMSs are not doomed, but people are beginning to realize that they are not the Solution To Every Problem. I don't know why people ever started treating them like that; they really only handle one (small) problem space well -- modeling relational data. If that's the problem space you're working in, there's no better tool. If that's not the problem space you're working in, then now we have some better options.

The reality is that 99% of web apps don't really want a RDBMS. There are no arbitrary queries to run, they just need to extract some objects from the database, interact with them, and render a web page. As of late, people have been using ORMs to make their relational database look like an object database. The problem is that the objects you get from an ORM don't work like real objects (try making a cyclical structure, or try `change-class`-ing the objects; doesn't work). This usually means you need to map the objects from your ORM into a better model of your app, adding complexity. (Some ORMs also screw up the relational part, making the data in the database nearly useless to any other apps. Oops.)

People are starting to realize that this model is a waste of their time, and they are using object databases instead. Now there is no ORM, they write their objects that the app interacts with, and those are persisted as needed. Suddenly half your app's code is gone, and it runs faster. That is why people are excited. Less code is always exciting.

This article is really heavy on key/value databases, probably because they are easy to understand. But really, key/value databases are like the assembly language of databases. You will have to do a lot of work to run your app on a key/value database. It is better to use something higher-level, like CouchDB or an object database. (For an OODB, I recommend KiokuDB. It has indexing (stolen from Postgres), and can store your data to BDB, a directory on disk, a RDBMS, Amazon SimpleDB, or CouchDB. This gives you a lot of flexibility.)


"The reality is that 99% of web apps don't really want a RDBMS. There are no arbitrary queries to run, they just need to extract some objects from the database"

Well, I'll tell you what my reality has been. It's been that every time I started a project, it appeared like I just needed to extract some objects from the database. But very soon it turned out that there wasn't just one way to view that information. There were two or three important use cases that destroyed my preferred, supposedly "natural" view.

Now, I'm fully aware that sometimes one preferred view of the information you have is so dominant that you have to optimize for that case. The question is, how do you optimize? Do you hard code that particular view from the beginning, or do you normalize first and then layer an optimized view on top of it, e.g. in the form of caching or materialized views? RDBMSs make it easy to do the latter.
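
A minimal sketch of "normalize first, then layer a view on top", using Python's built-in sqlite3 module. The schema (`authors`, `posts`, `post_cards`) is hypothetical. Note that SQLite only has plain views, not materialized ones, so this shows the layering idea rather than the performance side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES authors(id),
    title TEXT NOT NULL
);
-- The "preferred view" is layered on top instead of being baked into storage.
CREATE VIEW post_cards AS
    SELECT posts.id AS post_id, posts.title, authors.name AS author
    FROM posts JOIN authors ON authors.id = posts.author_id;
INSERT INTO authors (id, name) VALUES (1, 'alice'), (2, 'bob');
INSERT INTO posts (id, author_id, title) VALUES
    (1, 1, 'Joins'), (2, 1, 'Indexes'), (3, 2, 'Caching');
""")

# Dominant use case: read the denormalized, document-like shape.
cards = conn.execute(
    "SELECT title, author FROM post_cards ORDER BY post_id"
).fetchall()

# A second use case that shows up later and needs a different cut of the same data.
counts = conn.execute(
    "SELECT authors.name, COUNT(posts.id) FROM authors "
    "LEFT JOIN posts ON posts.author_id = authors.id "
    "GROUP BY authors.id ORDER BY authors.name"
).fetchall()
```

The second query needs no change to how the data is stored, which is the hedge the normalized layout buys.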

You say "they really only handle one (small) problem space well -- modeling relational data."

I think being relational is not an a priori property of data and relational modelling is not a problem space. It's the other way around. First you have a problem space, then you decide how to model it, and then you get the data according to that model. Assuming that information already has a kind of "natural" model attached to it is a fallacy in my view.


Where does the notion come from that it has to be either this or that?

Why not use a relational database for data that actually suits the relational model, and an object/hierarchical/keyvalue database for complete entities that are often mapped to objects in code? Just put each kind of data to a storage that makes sense.

If you start with a RDBMS, there are plenty of options of how to store the objects. You can just use plain files in the simplest case. Or a good dbm style database. Or if your objects are mostly read-only you can keep the data in the RDB and cache the big joins. Or if you're really stuck with the RDB only you can even store the key-value pairs in the RDB: just make a two-column table for run-time data and keep it separate from the actual relational data. Or keep everything in the RDB but make a live copy off the RDB when the user logs in to keep the runtime data accessible and then unpack it back to the RDB lazily or when the user logs out. There are pretty much endless possibilities depending on your business and search needs.
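
The two-column table idea can be sketched like this with Python's built-in sqlite3 and json modules. The table name `kv`, the helper names, and the use of JSON for serialization are assumptions made for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Run-time data lives in its own table, separate from the relational data.
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)")

def kv_put(key, obj):
    """Store a JSON-serializable object under key, overwriting any old value."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
        (key, json.dumps(obj)),
    )
    conn.commit()

def kv_get(key, default=None):
    """Fetch and decode the object stored under key, or return default."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else default

kv_put("session:42", {"user": "alice", "cart": [3, 5]})
kv_put("session:42", {"user": "alice", "cart": [3, 5, 8]})  # overwrite in place
session = kv_get("session:42")
missing = kv_get("session:404")
```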

Ain't easy but non-trivial databases never are.


Aside from RDB and key-value, one other model I haven't heard mentioned much is Terracotta (currently the only example of its kind, AFAIK). That gives you basically all the advantages of a database, including replicated ACID transactions, but with data that behaves like live objects. Better than live objects, in fact, because of lazy loading and unloading, which let objects grow beyond the limits of RAM. Better than a database, too, in that it only moves diffs, and it knows exactly where they're needed. If you are the only one currently touching a data structure, the only traffic is "may I?" "carry on until I say otherwise".


I've been fascinated with Key-value databases for almost as long as my career. Interestingly enough, I first came across the concept when I did a little work with Lotus Notes, way back in 2001. My memory is a little rusty on the full capabilities of Lotus databases, so I don't know if it applies to their format, but what I miss most from other HashMaps in the cloud are aggregate functions.

As a developer, I'd prefer one of two innovations: relational databases that scale better across machines or k/v databases that provide a familiar mechanism for ad-hoc querying and aggregation across keys.
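
To show what is missing, here is a rough sketch of aggregation done by hand over a key/value store, with a plain dict standing in for the store. The keys, fields, and the `sum_by` helper are hypothetical. In SQL the same result would be a single `SELECT region, SUM(amount) ... GROUP BY region`.

```python
from collections import defaultdict

# Hypothetical key/value contents: each key maps to one opaque document.
store = {
    "order:1": {"region": "eu", "amount": 30},
    "order:2": {"region": "us", "amount": 45},
    "order:3": {"region": "eu", "amount": 25},
}

def sum_by(kv, prefix, group_field, value_field):
    """Scan every key with the given prefix and total value_field per group."""
    totals = defaultdict(int)
    for key, doc in kv.items():
        if key.startswith(prefix):
            totals[doc[group_field]] += doc[value_field]
    return dict(totals)

totals = sum_by(store, "order:", "region", "amount")
```

Every new ad-hoc question means another hand-written full scan like this one, or a precomputed key that has to be kept up to date on every write.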


why can't they both just get along


A colleague once pointed out to me that my preference for flat file systems over relational databases was likely why I tended to have trouble forming and maintaining relationships. This I could not dispute, but nor has the knowledge altered my preference for flat files.


The memcached + MySQL solution was invented by livejournal.com and is used by facebook.com. So-called in-memory databases are also on the market. A second important factor is the price of RAM modules, SSDs and commodity hardware. In some cases it is better to build an in-house solution that reflects your data structures and data flows.
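
The usual shape of a cache-in-front-of-database setup is the cache-aside pattern, sketched below. To keep the example self-contained, a dict stands in for the memcached client and an in-memory SQLite database stands in for MySQL; the `profiles` table and function names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO profiles (id, name) VALUES (1, 'alice')")
db.commit()

cache = {}                  # stand-in for a memcached client
stats = {"db_reads": 0}     # counts how often a read falls through to the database

def get_profile(profile_id):
    """Cache-aside read: try the cache, fall back to the database, then fill the cache."""
    key = f"profile:{profile_id}"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute(
        "SELECT id, name FROM profiles WHERE id = ?", (profile_id,)
    ).fetchone()
    if row is None:
        return None
    cache[key] = {"id": row[0], "name": row[1]}
    return cache[key]

def rename_profile(profile_id, new_name):
    """Write to the database first, then invalidate the cached copy."""
    db.execute("UPDATE profiles SET name = ? WHERE id = ?", (new_name, profile_id))
    db.commit()
    cache.pop(f"profile:{profile_id}", None)

first = get_profile(1)        # miss: goes to the database
second = get_profile(1)       # hit: served from the cache
rename_profile(1, "alicia")
third = get_profile(1)        # miss again after invalidation
```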


Thanks for downgrading me! I didn't know about CouchDB! I'm stupid!


no.


Insightful!


I'm sure the day will come when RDBMSs will be a distant memory, but sometimes it is so comforting to open a database and see all the data in pretty columns and rows.



