PostgreSQL Columnar Store for Analytic Workloads

chrisfarms · on April 3, 2014

I've never really digged into column-oriented storage, so had a quick skim... Would the below excerpts/example be a fair note of pros/cons of the general idea?

> Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.

Example: SELECT sum(a) FROM things;

> Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.

Example: UPDATE things SET a = a+1;

> Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.

Example: SELECT * FROM things;

> Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.

Example: INSERT INTO things (a,b,c,d,e,f,g) VALUES (1,2,3,4,5,6,7);

natebrennand · on April 3, 2014

That's pretty accurate. Column-stores also make data compression significantly more effective because they are storing many values of the same type together [1].

The improvements to your first two points are typically 1-2 magnitudes faster with column-stores [2].

[1] http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf

[2] http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

ozgune · on April 3, 2014

That sounds like a pretty good summary.

From a workload perspective, row-stores are predominant in the OLTP (transactional insert/update) domain. They are also used in OLAP and data warehousing. Still, column stores have benefits when the underlying tables have many columns and the user's running analytic queries over a small subset of those columns.

jorgeleo · on April 4, 2014

Also column-oriented db are much more efficient in filtering rows, specially for very large datasets with complex filtering expressions. This is what makes them good at data analysis.

I consider different from your first point because sometimes is not the aggregation, but the drill down what is needed

rusanu · on April 4, 2014

Good work, and I'm impressed by the PostgreSQL foreign data wrapped API power to allow for such ease of implementation.

The .proto defined for cstore_fwd differs from the ORC as defined for Hive. At a quick glance I can't find references to dictionary encoding nor statistics, and the datatypes used are apparently the native PostgreSQL data types. From what I can tell this implementation leverages reduced IO (fetch only columns of interest from disk), segment elimination (use min/max info to skip over entire row groups) and pgzl compression for the data stream. I couldn't find references to run-length encoding (RLE) or dictionary encoding. I'm sure the shortcoming will be improved in future iterations, specially better encoding schemes which would result in better compression.

But I'm a bit disappointed that the ORC format used is not the same as the one originally used in Hive, Pig and the rest of Java/Hadoop ecosystem. Had it shared the actual file format it would had enabled a number of very interesting scenarios. Think about Hive/PIG serving as ETL pipelines to produce ORC file that are attached directly to PostgreSQL (add today's partition as FOREIGN DATA WRAPPER ... OPTIONS (filename '.../4_4_2014.cstore') and then do interactive analytic driven by PostgreSQL. It would reduce the cost of ingress significantly (fast attach, no need for expensive INSERT and re-compression). I realize data type translation would had been thorny, to say the least (datetime and decimal as primitives, probably many of the structured types too).

- github.com/citusdata/cstore_fdw/blob/master/cstore.proto - github.com/apache/hive/blob/trunk/ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto

tkyjonathan · on April 4, 2014

I think the emphasis should be on the fact that Postgres now has a free and open source columnar store. For many years (probably decades) there have been companies that have developed "big data"/analytics system with Postgres as the base, but have not contributed back to the Postgres ecosystem.

While this new columnar store is not as speedy as ones that have been around since the early 2000s, it does give a platform for CitusData and other companies to build on and share solutions.

Having a DB that can hold both the transactional data and data for fast analytical purposes is very advantageous as you have less moving parts and much less ETL work.

What I am looking forward to now is a few start ups similar to CitusData that solve different "Big Data" problems and for them to work together to disrupt the multi-billion dollar datawarehouse/analytics vendors.

lucian1900 · on April 3, 2014

This is potentially very interesting, it's the only open source SQL columnar store I'm aware of. Bonus points for being part of Postgres.

Basically ParAccel (like used by Amazon RedShift) at reasonable cost.

alecco · on April 3, 2014

InfinyDB did this with MySQL a long time ago http://infinidb.co/products

lneves · on April 4, 2014

There is MonetDB: https://www.monetdb.org/

I use it with great success.

erichocean · on April 4, 2014

And it's been around forever. If you're interested in high-speed processing, there's lots of good papers from that project (e.g. how to avoid creating branches, stuff like that).

paperwork · on April 3, 2014

I agree, I'm excited. Hopefully the optimizer can take full advantage of these column stores. However, postgresql isn't the only open source column store. MonetDB is another one. It is an academic project but people do use it commercially. I know some people who get great performance out of MonetDB, even though its optimizer has some gaps.

alexdean · on April 3, 2014

https://www.infobright.org/ has been around for ages.

jerven · on April 4, 2014

Virtioso 7 by openlinksw is a GPL columnar sql (and sparql) RDBMS

rhelmer · on April 3, 2014

Don't you still need to purchase citus DB to use this?

ozgune · on April 3, 2014

Nope, you don't need to purchase CitusDB.

We open sourced cstore_fdw with the Apache license. You can just build and load the extension, and start using it with your PostgreSQL 9.3+ database.

If you later need to scale out your PostgreSQL cluster, just please let us know. :)

spaznode · on April 4, 2014

I know some statisticians / analyst minded people your generous contribution may make extremely happy. Myself as well of course.

Thank you very very much in advance. This is a really huge deal. Even for general things like academic research. We shall see.

rhelmer · on April 4, 2014

Very cool! Thanks for contributing this.

monstrado · on April 3, 2014

Any particular reason why ORC was chosen as the columnar store format over Parquet (https://github.com/Parquet/parquet-format)? Reason I ask is because Parquet seems to have its own development cycle, roadmap, and is pretty continuously updated with enhancements.

mason55 · on April 3, 2014

I'd be interested to see more benchmarks. The improvements in this post are not anywhere close to what we've seen going from PG to Infobright for our reporting queries - we get speedups from 10x - 1000x, the one speed benchmark they have here is only 2x.

ozgune · on April 3, 2014

(Ozgun from Citus Data here.)

We're going to publish more comprehensive results in a few weeks. In our initial tests, we found the speed up to depend a lot on the underlying data and the type of queries.

For example, we found that when compression reduced the working set from being on disk to in-memory, there was a significant jump.

Also, we're looking to do optimizations on the cost estimation side -- these will notably help with queries that join multiple tables together, a common scenario for the TPC-H benchmark.

mason55 · on April 3, 2014

Awesome, can't wait to see more benchmarks. I would love to be able to switch back to 100% Postgres. IB gives us tons of speed but even ignoring the MySQL warts it just feels incomplete. And we ran into a bug in production where the order of AND was actually having an effect on query results, something that's completely unacceptable and makes me worry about my data.

antonmks · on April 4, 2014

Doesn't look like it uses any kind of vectorized execution. So the only advantage one gets is reduced amount of data to process. It also doesn't mention execution on compressed data so it has to decompress the data which doesn't come cheap.

erichocean · on April 4, 2014

A relatively easy way to change that would be to use Intel's ISPC compiler.

http://ispc.github.io/

alecco · on April 3, 2014

Because they still use PostgreSQL engine, very slow.

There is a great talk by Stonebraker about this. (He created PostgreSQL about 3 decades ago and moved on to newer database engines).

jeffdavis · on April 3, 2014

He started Postgres before it became PostgreSQL (i.e. before it even supported SQL). It's quite a different system now.

jeltz · on April 4, 2014

The PostgreSQL Stonebraker left was a very different one than the current. It did not even have MVCC.

Alex3917 · on April 3, 2014

Is this talk online?

garenp · on April 3, 2014

Probably referring to this:

http://blog.jooq.org/2013/08/24/mit-prof-michael-stonebraker...

alecco · on April 3, 2014

How is this different than the many other columnar SQL databases and extensions?

Columnar querying is typical for OLAP, PostgreSQL engine is aimed at OLTP. This doesn't look like a good idea. Like adding side floats to a car and paddles to use it like a boat.

This goes against using the right tool for the right job.

ddorian43 · on April 3, 2014

How is 'jsonb' better than mongodb ? Because you can use the same tool, you'll support the same db, you'll pay 0$ for licensing, you may pay 0$ for sharding (postgresql-xc) etc.

alecco · on April 3, 2014

Downvotes? Lovely.

Also, I bet VoltDB, a modern open source OLTP, can beat this thing hands down. Also in-memory and clusters. Complex store procedures precompiled and many other goodies.

Commercial column stores like Vertica should be orders of magnitude faster.

nemothekid · on April 3, 2014

Don't know why you are being down voted, but you should understand its about tradeoffs.

If I want to increase my database performance I can either

1.) Build this plugin and integrate it into my already working ecosystem 2.) Spend time researching, testing, and deploying VoltDB.

Given the popularity of postgres, and the relatively low friction solution of (1.), its clear why this could be an adequate solution. Sure you won't be as fast as VoltDB, but as an outside engineer we don't know the potential customers requirements, and if being as fast as VoltDB actually matters.

noelherrick · on April 3, 2014

This is really exciting! Columnar storage is something that the big boys like Microsoft and Oracle charge an arm and a leg for. You can currently get column-based SQL open-source databases, but this new FDW allows you to mix workloads on the same machine.

mimighost · on April 3, 2014

I am curious about how this thing compares to something like Amazon Redshift.

Briefly skimming, it looks pretty similar, except for the data compression part, which uses RCFile. It also supports more data types. If this being adapted by redshift or something else, I will be thrilled.

capkutay · on April 3, 2014

So with Cassandra you have a pretty nice, scalable, columnar DB with a SQL interface[0]. Not to mention, it's free and apache licensed so you can distribute it as part of your own software. I guess I've only looked at cassandra from the scope of a developer. Would a DBA prefer using a columnar version of PostgreSQL rather than using cassandra for free?

0:http://cassandra.apache.org/doc/cql/CQL.html

Edit: I didn't realize Citus Data was making the columnar postgres offering open source...that's great!

nemothekid · on April 3, 2014

Cassandra does not have a SQL interface. CQL (Cassandra Query Language) provides none of the aggregation, grouping and dearth of other features an analyst would depend on from a SQL-like database.

Not to say there aren't tools to get SQL ontop of Cassandra (Datastax has a hadoop/mr driver, that you can probably put pig/hive/presto(?) on top of).

squarecog · on April 4, 2014

As pointed out by another commenter, Cassandra is not a column store. Its storage format is quite inefficient for compression and for large scans of subsets of columns, which is what columnar stores are designed for (Cassandra and HBase optimize for different use cases, which are much closer to OLTP).

You can find a brief description of why and when you want actual columnar formats in the introduction section of this blog post: https://blog.twitter.com/2013/dremel-made-simple-with-parque... .

You can read more about columnar databases here: https://en.wikipedia.org/wiki/Column-oriented_DBMS

A columnar storage format for a database engine that does not have operations specifically tuned for vectorization and taking advantage of data being cheaply available a column chunk at a time instead of a row chunk at a time is only half the battle -- you can get massive perf wins on top of the work CitusDB did by doing things like operating on compressed data / delaying decompression, avoiding deserialization for column values based on a filtering scan of another column, pipelining operations on column values to fit better into your cpu architecture, etc.

But just having the IO wins is nice, too.

arielweisberg · on April 3, 2014

Cassandra is a Bigtable style column store not an analytic column store.

It's an unfortunate naming collision.

capkutay · on April 3, 2014

I think cassandra is pretty good for analytical workloads...for instance "Give me all the customers from pittsburgh" is a faster scan when the city column is laid out contiguously on disk.

arielweisberg · on April 3, 2014

Cassandra doesn't store columns from different rows together to my knowledge. That means it is no way like an analytic column store where columns from all rows are stored together. This is not a bad thing, it just wasn't intended to be an analytic column store and makes design choices suited for other things.

See http://wiki.apache.org/cassandra/ArchitectureSSTable

Analytic column stores can exploit storing columns separately by using various encoding schemes to compress data that are far superior to generalized compression and they can use the sort to zero on relevant ranges for each column.

capkutay · on April 5, 2014

Ah ok. I understand why a columnar database would be better for analytical workloads, but I didn't realize that HBase and Cassandra aren't actually laid out in a columnar fashion on disk. I was confused by the 'column store' terminology. This makes Citus data's release even more compelling.

Thanks for clearing that up!

foolinaround · on April 3, 2014

can you kindly provide at a high level the differences between the two?

arielweisberg · on April 4, 2014

For Cassandra see http://www.datastax.com/docs/0.8/ddl/index

An analytic column store like say Vertica has a schema like a regular SQL database. I don't know what their flexible schema story is right now.

Instead of storing the columns of a row together an analytic column store will store columns from many rows together in sorted runs. When you go to do a scan your disk will only read the columns you have selected. The format for column storage is optimized for specific types and uses type specific compression so 10-50x is something that is claimed. This further improves the IO situation. They can also zero in on relevant ranges of data for each column because they are indexed and this further reduces the IO requirements.

Where other databases are bound on seeks or sequential throughput an analytic column store will be bound on CPU, especially CPU for the non-parallel portions of every query.

Obviously a column store will have a hard time selecting individual rows because the data is not stored together so it will be expensive to materialize. They also have trouble with updates/deletes to already inserted data, in some cases requiring the data be reloaded because updates have dragged everything down.

aaronblohowiak · on April 3, 2014

>We are excited to open source our columnar store extension for PostgreSQL and share it with the community

https://github.com/citusdata/cstore_fdw/blob/master/LICENSE

hfmuehleisen · on April 7, 2014

I have written a short blog post on how CitusDB's column store compares with purpose-built systems such as MonetDB. Repeatability ftw. Disclosure: I am part of the MonetDB development team.

CitusDB vs. MonetDB TPC-H Shootout http://www.monetdb.org/content/citusdb-postgresql-column-sto...

rpedela · on April 3, 2014

I am curious, why are foreign tables necessary?

ozgune · on April 3, 2014

(Ozgun from Citus Data here.)

Foreign tables provide a nice abstraction to separate the storage layer from the rest of the database. They are similar to MySQL storage engines in a sense. With them, users don't need to make any changes to their existing PostgreSQL databases, and instead just Create Extension and get rolling.

You can also use both foreign and regular tables in the same query: http://citusdata.github.io/cstore_fdw/#toc_4

darksaints · on April 3, 2014

My question is what do you lose by doing so? FDW has always felt like a little bit of a hack, but I guess if the optimizer and execution engines can use it as if they were native tables, then there isn't much lost. But the fact that ProtoBuf is used makes me think that there is some overhead that doesn't occur in native tables.

Jweb_Guru · on April 3, 2014

I could be a bit off here, but from what I remember about working with FDWs I think the biggest drawback is that you can't assume native access to system tables on a remote server, which means you can't do things like CREATE INDEX on a foreign table and aren't going to have, e.g., statistics about those tables. However, it does have pretty much all the information the optimizer does about the actual query, and if you're implementing your own storage engine like Citus rather than trying to hook into someone else's you can probably get around those problems.

jasonmp85 · on April 3, 2014

You're correct that foreign tables do not allow explicit index creation (though cstore_fdw does use skip indexes internally), but ANALYZE can ask them for statistics: http://www.postgresql.org/docs/current/static/sql-analyze.ht...

cstore_fdw supports ANALYZE so these statistics can indeed be used by the planner; however, the autovacuum daemon doesn't do this automatically for foreign tables, so it's up to the user to decide how often to run ANALYZE on them.

Jweb_Guru · on April 3, 2014

That's pretty cool, I had no idea that was supported :)

rpedela · on April 3, 2014

Makes sense, thanks for the explanation. Mixing table types is a pretty nice feature.

mason55 · on April 3, 2014

Likely because you have to run this as a separate db server. The speed improvements come from adjusting data locality on disk so they're probably not able to have columnar & regular data stores in the same PG instance/data files.

Similar to how Infobright is just MySQL but you still can't mix columnar and regular DBs in the same instance, you have to run IB separately.

jasonmp85 · on April 3, 2014

(I'm Jason from Citus Data)

As Ozgun notes above, the Foreign Data Wrapper APIs just provide a powerful abstraction for what this extension is doing.

Though the DDL to set up a foreign table requires first creating a "server" with CREATE SERVER, this is merely a formality: as in file_fdw the server doesn't actually represent an external process. All I/O occurs within the postgres backend process servicing the query.

hans_castorp · on April 4, 2014

This sounds very interesting. Are there any pre-built Windows binaries available for this extension?

joevandyk · on April 3, 2014

You'd have to mangage backups of the storage file separately than backups of postgresql, right?

yazun · on April 4, 2014

Any chance it will support INSERT some day?