I'm failing to see the utility of hstore or json over adding nullable columns here, especially if the latest version makes adding a nullable column essentially instantaneous. It doesn't seem to simplify the handling of missing/unavailable data, but it does manage to complicate the SQL syntax.
I have clients that use a variety of databases for a variety of reasons, not the least of which is "just because that is where our data is." All the value of most projects is in the data, and the value of data grows with age, because you cannot recreate the past. If you lose something or you fail to record something, then you can't ever get it back in a way that will stand up to audit scrutiny. This has the awful effect of making technology-specific details of databases get pushed into the business-decision realm, rather than the technology-decision realm.
So complicating the SQL syntax is a significant issue for me. Sticking to as much standard, ANSI SQL as possible makes my programs more portable across RDBMSes. With some of the tools I've written, I can make a full transition from MySQL to MS SQL Server and back again and the application doesn't care. Having that sort of power makes upgrading your database a technology decision, not a business one.
Yes, the features that Postgres has are nice, but to me they represent a very real chance of vendor lock-in. I don't believe that the Postgres team will ever pull anything to make me hate them, but then I thought the same about Sun at one point, too.
HStore and especially JSON are useful because they allow you to store complex data structures fairly easily. Do you have something that contains a number of somethings that contain a number of somethings that contain a number of somethings? Normally you would need something like 4 tables to store this and carefully planned queries to work with them efficiently. However, if the object graph has no external references (besides the top-level object) you can just safely shove the whole thing into a JSON field without losing referential integrity.
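For instance, here's a minimal sketch of what I mean (the orders table and its columns are made up for illustration): the whole nested structure goes into one json column instead of three or four join tables.

    -- Hypothetical schema: each order's line items (and their options) are
    -- stored as a single json document instead of separate order_lines /
    -- line_options tables.
    CREATE TABLE orders (
        id       serial PRIMARY KEY,
        customer text NOT NULL,
        detail   json NOT NULL   -- the whole object graph lives here
    );

    INSERT INTO orders (customer, detail) VALUES
    ('acme',
     '{"lines": [
         {"sku": "A-1", "qty": 2, "options": [{"name": "color", "value": "red"}]},
         {"sku": "B-7", "qty": 1, "options": []}
       ]}');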
Your reasons are very valid, though, so obviously you have to think about the consequences before using non-standard features. On the other hand, not using advanced or novel features (which non-standard features tend to be) can also lead to stagnation and loss of competitiveness, so you also have to be careful with being careful, if that makes sense.
I'm not very familiar with the JSON datatype in postgres. Is it possible to update subvalues without writing the entire JSON bag?
For example: {"a": 1, "b": 2} (in pseudocode)
User 1: select JSON
User 2: select JSON
User 1: update JSON set "a" = 11
User 2: update JSON set "b" = 22
What would the value of the JSON column be after that? In other systems I've worked in (where it's JSON in a database column), the answer is {"a": 1, "b": 22} or a failure on the second update letting the client know that their copy is out of date.
I'd love to know if postgres can deal with this in the normal relational way so that the result is actually {"a": 11, "b": 22}
Depending on the points at which these transactions are started and committed, and on the transaction isolation level, this would result in either {"a": 1, "b": 22}, a failure at step 3, a failure at step 4, or a failure when one of the transactions is committed. However, since locks are at row level rather than at value (cell) level, this would be the same even if these values were stored separately in ordinary column types.
You could still achieve what you want by using read committed transaction isolation and keeping your values in separate rows, e.g. a row with a/1 and a second row with b/2. But basically you would need a separate record for every property in your object graph that you want this behaviour to apply to, which could get pretty messy. Does that make sense to you?
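Something like this, roughly, is what I mean (table and column names made up): each property gets its own row, so the two updates touch different rows and both survive.

    -- One row per property instead of one json blob per object.
    CREATE TABLE obj_props (
        obj_id integer NOT NULL,
        key    text    NOT NULL,
        value  text    NOT NULL,
        PRIMARY KEY (obj_id, key)
    );

    INSERT INTO obj_props VALUES (1, 'a', '1'), (1, 'b', '2');

    -- Under the default READ COMMITTED isolation these can run in two
    -- concurrent transactions; each locks only its own row, so the end
    -- state is a = 11 and b = 22.
    UPDATE obj_props SET value = '11' WHERE obj_id = 1 AND key = 'a';  -- user 1
    UPDATE obj_props SET value = '22' WHERE obj_id = 1 AND key = 'b';  -- user 2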
> Such data can also be stored as text, but the json data type has the advantage of checking that each stored value is a valid JSON value. There are also related support functions available
That doesn't answer my question. Storing structured data in any single-field, be it checked or unchecked, violates the first normal form, if you're going to be querying directly on sub-fields of that field. If the json document is just getting exported out to the client without any inspection of the document from the SQL side, then it's fine and I see the point of having a JSON type that can validate the data. But to extend the SQL syntax to allow for querying sub-fields of that data is the wrong solution to the problem.
SQL Server has had this feature for quite some time as well, in the ability to store and query on XML documents. It's awful. The few times that it has "saved the day" on certain queries just because the data was already stored as XML, it turned out to be a false prophet and became a serious issue down the road.
> That doesn't answer my question. Storing structured data in any single-field, be it checked or unchecked, violates the first normal form...
This seems to be one of the most common mistakes regarding understanding 1NF I come across.
The atomicity requirement doesn't say anything about structured text, or internally atomic data types. That a data type could be further decomposed doesn't impact first normal form at all. Otherwise, using a data type like DATETIME would be a total no-no. Neither does storing arrays, and so forth.
What 1NF requires is that for the domain the value is atomic. If I store "192.168.1.1/24" in a text field, that is structured text but it is still atomic. It may include both a netmask and an IP address, but the data taken as a whole represents one domain value. Similarly, storing "1-509-555-1212" in a text field contains an area code number, an exchange number, an end terminal number, and a country-code designation, but it represents a single phone number and that's all that is required. I could represent the phone number as ARRAY[1,5,0,9,5,5,5,1,2,1,2] and it would still be atomic as long as the array was properly ordered.
What I am getting at is that all 1NF requires is that the item represents a single value in a domain.
There's no reason why JSON would necessarily violate that either. Consider {ip: '192.168.1.1', netbits: '24'}
That's no different from storing '192.168.1.1/24' in a text field. And for my phone number, I could store it as {ccode: '1', areacode: '509', exchange: '555', terminal: '1212'} which again is no different from storing '1-509-555-1212' in a text field.
This isn't to say that JSON doesn't often violate 1NF, but 1NF is not as restricting as people sometimes think.
You had to trim my comment to be able to make yours. As I said, "if you're going to be querying directly on sub-fields of that field. If the json document is just getting exported out to the client without any inspection of the document from the SQL side, then it's fine".
You can argue whether or not the Normal Forms are a good idea, but you can't argue that this tool--in extending the SQL syntax to be able to give access to fields within the JSON document--is not meant for violating 1NF.
> As I said, "if you're going to be querying directly on sub-fields of that field. If the json document is just getting exported out to the client without any inspection of the document from the SQL side, then it's fine".
How is that different from, say, pulling area codes from text phone numbers, subnet masks from CIDR designations, or dates from DATETIME fields?
Where first normal form is sometimes (productively) violated is something like storing a list of tags in a varchar[] array. Since the ordering doesn't matter, it is merely a set of tags (not a field which stores a single value as a whole), and violates 1NF, but it can be a reasonable tradeoff in certain cases.
Consider a common use case of wanting to mix SQL structured data and JSON structured data:
You want a logging service that records events in an SQL table. Each event has base attributes common to all events such as date, event name, component name, user id ... these parameters become columns in the SQL table. Some events can also include arbitrary custom data particular to that kind of event, and you want that data to also be in the database, but you don't want to play whack-a-mole with the table schema to make sure every possible custom parameter has a column in the SQL table. In this case, just stash the custom data as JSON, and hopefully the database server offers JSON support functions that let you access any field within the JSON data via SQL.
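A rough sketch of that layout (table and column names invented for illustration), assuming a PostgreSQL version with the json ->> operator (9.3+):

    CREATE TABLE event_log (
        id          bigserial PRIMARY KEY,
        occurred_at timestamptz NOT NULL DEFAULT now(),
        event_name  text NOT NULL,
        component   text NOT NULL,
        user_id     integer,
        custom      json            -- anything event-specific goes here
    );

    -- No schema change needed when a new event type adds new parameters:
    INSERT INTO event_log (event_name, component, user_id, custom)
    VALUES ('upload_failed', 'importer', 42,
            '{"filename": "report.csv", "size_bytes": 1048576}');

    -- And you can still reach into the custom data from SQL when you need to:
    SELECT occurred_at, custom ->> 'filename' AS filename
    FROM event_log
    WHERE event_name = 'upload_failed';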
GIN index on all the text arrays. There are about 150k events saved per day and PostgreSQL is running on the same kinds of hardware as SQL Server was. The application is quite read heavy.
There are actually times I find it helpful to break 1NF in PostgreSQL. This is certainly not free but it really helps some sorts of queries, and it avoids having to do a sort of pseudo-EAV to get certain things working.
Maybe not, but you significantly edited your post after I answered. I was just replying to, "Why would I use this over a nullable column?" (paraphrasing)
> I'm failing to see the utility of hstore or json over adding nullable columns here
Well, for example, they make a good way to store user defined fields.
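For instance (a sketch with invented table and field names), every row can carry whatever user defined fields it happens to have, without any ALTER TABLE:

    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE contacts (
        id     serial PRIMARY KEY,
        name   text NOT NULL,
        custom hstore          -- user defined fields, different per row
    );

    INSERT INTO contacts (name, custom) VALUES
    ('Alice', 'twitter => "@alice", shoe_size => "38"'),
    ('Bob',   'department => "sales"');

    -- Query on a user defined field much like a real column:
    SELECT name FROM contacts WHERE custom -> 'department' = 'sales';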
> So complicating the SQL syntax is a significant issue for me. Sticking to as much standard, ANSI SQL as possible makes my programs more portable across RDBMSes. With some of the tools I've written, I can make a full transition from MySQL to MS SQL Server and back again and the application doesn't care. Having that sort of power makes upgrading your database a technology decision, not a business one.
I think the tradeoff is between having a database for multiple apps and having an app for multiple RDBMS's. If you are going to go with the latter, absolutely avoid every PostgreSQL-ism. But if you want to make a database more generally useful to an organization, you need a more intelligent database, and that means going to things like user defined functions, complex triggers and more.
It's not really a different kind of code, though. It's the same wrongness as CoffeeScript vs. JavaScript: there is no paradigm shift, it just adorns the same old dog chow with a slightly different set of ASCII characters. Postgres is turning into FoxPro, from the opposite direction. FoxPro was this BASIC sort of application language thing that had delusions of grandeur to be a database, too. Postgres is growing delusions of grandeur for being an applications programming language, too.
> Postgres is growing delusions of grandeur for being an applications programming language, too.
PostgreSQL has had an extraordinarily extensible dialect of SQL since they moved from Quel in 1995. This is because the db was designed so you can plug whatever programming languages you want into the back end.
For example, if you have a database of images tagged with location data, and you want to make it so you can run an SQL query that gives you all images matching a photo recognition algorithm for spotting sunsets within 20 miles of Sacramento, California, you can build the spatial and image recognition functions into the database, and you can tune it so that the least expensive filters are run first, so that the image recognition algorithm outputs are indexed, and all kinds of other things.
Since the project was founded in the 1980s, PostgreSQL has been aimed at those sorts of problems.
> It's not really a different kind of code, though.
Yes and no. On one hand, you are right. It is all code.
On the other hand, what it gives you is an ability to write code that can work in a set-based paradigm, and be subject to the various optimizations that PostgreSQL can give you in that regard. And because inside such extensions you have no transaction control, there are hard limits to what one should do or try to do inside the database.
For example, some people decide they want to send emails out of the database back-end MS SQL-style. This is a bad idea because it necessitates mixing transactional and non-transactional code in dangerous ways. You can, however, raise a message on commit and have another process process it, and handle transactions accordingly.
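A sketch of that pattern (the channel name and payload are invented): NOTIFY is only delivered when the transaction commits, so a listening worker never acts on work that got rolled back.

    -- Inside the transaction that does the actual work:
    BEGIN;
    -- ... insert the order, update balances, etc. ...
    NOTIFY order_created, '42';   -- payload: hypothetical order id
    COMMIT;                       -- the notification is delivered only now

    -- A separate worker process (which actually talks to the mail server):
    LISTEN order_created;
    -- ...and it sends the email when the notification arrives.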
TL;DR: PostgreSQL has always been a development platform, and a very good one, but one which is also quite limited in what kinds of problems you can solve with it.
Most people would say that PostgreSQL gives you the same portability that you have with your app, but at a different level. You can port your PostgreSQL database from a Linux server to a Windows server and back again and your application doesn't care. But because you are porting at the db layer, your application can leverage a bunch of PostgreSQL features that are very useful. As a developer, this might not matter a lot to you, but to a company that relies on its data and expects longevity from its systems, this kind of thing matters a lot. PostgreSQL is a technology that you can rely on for the long haul, and you will not be held hostage to a technology owner whose interests are not aligned with yours.
I just see it like all applications and libraries that adhere to some standard. Some adhere strictly to the standard whatever that is. Some do that but also add more features to push the industry forward. NVIDIA does that for graphics. Mozilla does that for browsers. Looks like PostgreSQL is starting to take that role with relational databases.
I generally view it as a good thing unless you care about cross-platform. If you do, then don't use the non-standard features.
> Looks like PostgreSQL is starting to take that role with relational databases.
PostgreSQL has been doing this for a long time. It started, actually, as a research project for how to manage very complex forms of data in a relational-like database (many people use it as a purely relational database but that doesn't even begin to scratch the surface of what PostgreSQL can do).
Like the ability to write user defined functions to process data in SQL queries in whatever language you want, for example. Language handlers have long been available not only for SQL, C, and PL/pgSQL (PostgreSQL's variant of Oracle's PL/SQL), but also for Tcl, Perl, and Python. Third party language handlers are available for Java, PHP, and many more languages. Yes, you can run PHP inside your PostgreSQL database.....
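For instance, assuming the PL/Perl handler is installed, a tiny sketch of a user defined function written in Perl and then called from plain SQL:

    CREATE EXTENSION IF NOT EXISTS plperl;

    -- The function body is ordinary Perl, but it is called like any SQL function.
    CREATE FUNCTION reverse_words(text) RETURNS text
        LANGUAGE plperl
    AS $$
        my ($s) = @_;
        return join ' ', reverse split ' ', $s;
    $$;

    SELECT reverse_words('run inside the database');  -- => 'database the inside run'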
Now, new data types are pretty much C-only, but you can also create a lot of the same functionality using composite types and SQL. Composite types (and even arrays of composite types) can be tuple elements.
Add to this the object-oriented-inspired features, allowing you to use multiple inheritance to mix in tables and create common interfaces to data (both features older than PostgreSQL's use of SQL).
What this means is you can do some extremely advanced db-stuff with PostgreSQL. This can include:
- I want to create a common interface for notes but avoid a global notes table (see the sketch after this list).
- I want to scan a series of images stored in the db for those matching an algorithm in a Java library showing which are likely images of sunsets.
- I want to add support for a third party Java full text search engine into the back-end so I can call its functions from inside SQL queries.
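For the first item, a rough sketch using table inheritance (table names invented): each kind of note lives in its own table, but they all share a common, queryable interface.

    -- Common interface: every kind of note shares these columns.
    CREATE TABLE note (
        subject    text NOT NULL,
        body       text NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- Per-entity note tables inherit the interface and add their own key.
    CREATE TABLE customer_note (customer_id integer NOT NULL) INHERITS (note);
    CREATE TABLE invoice_note  (invoice_id  integer NOT NULL) INHERITS (note);

    -- Querying the parent sees rows from every child table...
    SELECT subject, created_at FROM note ORDER BY created_at;

    -- ...while each child can still be queried (and constrained) on its own:
    SELECT subject FROM customer_note WHERE customer_id = 7;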
> But how is that different from the old MS battle-cry "embrace, extend, extinguish"?
It's OSS. That alone makes it different. Anybody can continue work on it or even fork it.
Postgres is an awesome database and an awesome project. I think the core developers have a good blend of being committed to supporting the SQL standard and adding new non-standard features. If you read through the docs they usually specify if a given feature is standard SQL or a Postgres extension. If you want to use something new that isn't standardized yet you have that option. If you don't, you have the info that it's not. The choice is yours.
> It's the "extinguish" part I'm concerned about.
I'm not sure I understand this. Are you worried about Postgres extinguishing its competitor databases (MySQL, Oracle, SQL Server, ... etc)? I'll be honest that I do see that happening but it's not malicious. It's just a better product. For transactional/persistent storage of data I don't see myself recommending anything else.
There is no single point of ownership behind the Postgres project, no profit margins or sales goals. Thus there is no need to eliminate other platforms, but simply to be the best project the contributors can create.
Anecdotally, I migrated a sluggish multi-table SQL Server setup to a single-table PostgreSQL one by replacing the previous EAV setup with hstore and the m2m "article-tag" tables with array fields. Exactly the same functionality, only 20x faster. Sometimes vendor lock-in is worth it :)
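Roughly, the shape of that single table (column names invented; the real schema differs):

    CREATE EXTENSION IF NOT EXISTS hstore;

    -- One table replaces the old EAV attribute tables and the m2m tag tables.
    CREATE TABLE article (
        id    serial PRIMARY KEY,
        title text NOT NULL,
        attrs hstore,          -- former EAV rows, now key/value pairs
        tags  text[]           -- former article<->tag join table
    );

    -- GIN indexes make both containment-style lookups fast:
    CREATE INDEX article_attrs_idx ON article USING gin (attrs);
    CREATE INDEX article_tags_idx  ON article USING gin (tags);

    -- "All articles tagged 'postgres' whose colour attribute is red":
    SELECT id, title
    FROM article
    WHERE tags @> ARRAY['postgres']
      AND attrs @> 'colour => red';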
Hmmm... I'm having evil thoughts about putting a whole web application including templating inside of Postgres and just exposing a few functions via a web interface.
Don't worry, I won't really do it, but does anyone worry that Postgres is doing too much and that focus might be lost on being a reliable and fast relational DB? I haven't seen any signs of problems, but I do have this slight concern with all the array/hstore/json features they have been adding recently.
I think it's impressive how quickly the Postgres world was able to adapt to the shifting needs of webapps and the whole nosql thing. The latest JSON features are a natural extension of the key-value & array stuff that has been around for years.
There's also a bunch of work focused on replication enhancements underway; adding first-party replication tooling will be a huge reliability improvement for Postgres clusters.
I don't think it's that they lost focus, it's that the project is picking up steam.
Yes you are probably right. I really like Postgres, the attitude (reliability, standards etc.), the documentation, basically everything and don't want it to change too much.
It's worth looking at things like hstore and json through the history of the project. The project has always been one which has focused on how to manage complex data in a relatively relational way, but has tended to go where no other database has gone before (table inheritance for example).
Now, it is true that when you get into the advanced capabilities of the database you run into hard edges that just don't make much sense at first, in part because they represent real disputes regarding how everything is supposed to work. Composite types in fields and table inheritance are well known for these sorts of problems, but once you get used to the idiosyncrasies they aren't bad.
This is true for JSON too, as there is no real way to map nested composite types to JSON constructs both ways (you can do tuple -> json, but not json -> tuple if the json object is nested). But a lot of these things just take time.
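For the tuple -> json direction, a tiny example (the json -> tuple direction is the part that doesn't have a clean path for nested objects):

    -- row_to_json (9.2+) turns a row value into a json document:
    SELECT row_to_json(t) FROM (SELECT 1 AS id, 'alice' AS name) AS t;
    -- => {"id":1,"name":"alice"}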
I guess C++-like philosophy of "pay (with performance) only for what you use" applies here. If you're not using hstore/json/v8 it probably won't affect you. Also they keep doing solid work on the SQL front, just see http://www.postgresql.org/docs/devel/static/release-9-3.html.
I was a little hesitant to use hstore in my latest project, but it ended up being really cool.
I'm basically storing trees of metadata that catalog, in this case, a "season" of television as a root and then each individual episode as a leaf.
What I thought was really cool was that I can store the object references (as a self-referential foreign key) outside of the hstore: this means that the join I use when fetching the tree is still indexed and very inexpensive.
However, the actual metadata itself is stored as an hstore payload -- which can be indexed in addition to being part of your query. The only downside is you lose a bit of type safety since hstore is `text->text`.
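Roughly what that looks like (names invented): the tree structure lives in ordinary indexed columns, and only the loose metadata goes into the hstore.

    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE catalog_item (
        id        serial PRIMARY KEY,
        parent_id integer REFERENCES catalog_item (id),  -- season -> episode tree
        title     text NOT NULL,
        meta      hstore                                 -- loose text->text metadata
    );

    CREATE INDEX catalog_item_parent_idx ON catalog_item (parent_id);
    CREATE INDEX catalog_item_meta_idx   ON catalog_item USING gin (meta);

    -- Fetching a season and its episodes stays a cheap, indexed join:
    SELECT child.title, child.meta -> 'air_date' AS air_date
    FROM catalog_item AS season
    JOIN catalog_item AS child ON child.parent_id = season.id
    WHERE season.id = 1;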
It let me throw an application together without committing to a rigid schema, which was nice -- and it actually ends up being pretty easy to work with.
I just thought it was really cool how easy it was to separate the unstructured data, and the well-structured relation of that data.
I think you could turn JSON into XML with a little work, and then run xslt on it in the db ;-)
> Don't worry, I won't really do it, but does anyone worry that Postgres is doing too much and that focus might be lost on being a reliable and fast relational DB?
Not really. PostgreSQL has never been a purely relational DB. See table inheritance for example (another feature very easily misused but which makes some very difficult things very possible).
It isn't quite the same, but a large amount of configuration and content gets commingled in Drupal databases. It can make for interesting upgrade / migration challenges if you aren't careful.
That isn't evil at all, and I saw someone was actually working on doing that. For web apps that just need to expose a JSON API, we already just generate boilerplate code from the database anyway; we literally don't write any code for the webapp.
PostgreSQL continues to get better in real, practical ways for us every release. I haven't seen any features cause problems or interfere with other features. I am not concerned that it is going to happen.
If you're looking to try Postgres as a full on document store there's more and more tooling around making this feasible. Here's one newer gem that makes it pretty straightforward for Rails: https://github.com/webnuts/post_json
I'm pretty sure that "ALTER TABLE ... ADD COLUMN ... NULL;" holds an AccessExclusiveLock on the relation being altered for the duration of the transaction in PG 9.1 and 9.2. Is there some trick to the zero downtime schema changes that are mentioned in this slide deck? I only ask because I recently had to get creative with zero downtime schema migrations for my current project.
I believe the AccessExclusive is only if you are setting to NOT NULL, such as "ALTER TABLE my_table ADD COLUMN my_col boolean NOT NULL DEFAULT false"
We were just down for 56 hours due to our 3rd party platform vendor applying that type of update.
AFAICT the only workaround is to do it in steps: set the DEFAULT, fill existing rows with the default, then apply the NOT NULL constraint (which will still lock it for a full table scan to check the validity of the constraint).
ALTER TABLE always creates the AccessExclusive lock.
If you add a 'not null default something' to the column, it'll rewrite the whole table, which can take some time. And since the table has the AccessExclusive lock, this is what will block reads/writes.
Doesn't matter for small tables.
For large tables, you want to add the column without a not null or default, commit that transaction, then populate the data, then add the not null and default constraint.
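In SQL, a sketch of those steps (the table, column, and id ranges are invented):

    -- Step 1: add the column with no DEFAULT and no NOT NULL. This only
    -- touches the catalogs, so the exclusive lock is held very briefly.
    ALTER TABLE my_table ADD COLUMN my_col boolean;

    -- Step 2: backfill existing rows in batches, outside the DDL transaction,
    -- so no single statement holds locks for long or rewrites the whole table.
    UPDATE my_table SET my_col = false
    WHERE my_col IS NULL AND id BETWEEN 1 AND 100000;
    -- ...repeat for the remaining id ranges...

    -- Step 3: once everything is populated, add the default and constraint.
    ALTER TABLE my_table ALTER COLUMN my_col SET DEFAULT false;
    ALTER TABLE my_table ALTER COLUMN my_col SET NOT NULL;  -- still scans the table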
You are right: in the fast case of "NULL" the amount of physical work amounts to diddling catalogs around rather than copying potentially gigabytes of data.
So in any case a lock is taken, but I've never seen anyone get too bent out of shape over that momentary mutual exclusion to deform the table's type (exception: in transactional DDL where cheap and expensive steps are inter-mingled).
More specifically in the multistage work flow, you are only holding the lock while writing the changes to the system catalogs which is not a significant period of time.
To my mind it doesn't offer any extra ability to add or remove columns over what we have already.
E.g. I'd never write SELECT *; I'd always name columns in my query so that I can be immune to reordering or adding columns on the underlying table or view.
Where we are looking at it in LedgerSMB is to allow user defined extra fields for forms. A previous approach was a framework for managing join tables, but json would be much cleaner.