How do you compare to Iceberg?

Shelnutt2 · on May 7, 2020

Disclosure: I am a member of the TileDB team.

Iceberg, similar to Delta Lake, is another layer you have to add on top of Parquet, which limits your mobility of compute. You can currently use it either with Spark or Presto, but not directly via language APIs or other compute engines (e.g., pandas, Dask, MariaDB, etc). For example, the pandas "read_parquet" will not be able to read your dataframe, you will have to use pyspark or presto + sql query to fetch the data into pandas. TileDB has updates and data versioning built into its format and storage engine, therefore reading an array directly via a TileDB language API, or Spark, Dask, PrestoDB or MariaDB will have the same behavior. You will be able to see the dataframe after all the changes, or time travel and "activate" only some of the changes in time.

xyzzy_plugh · on May 8, 2020

This isn't strictly true.

You can't arbitrarily reach into the data from the client with Iceberg or Delta Lake, true, but that's intentional. The service on top provides a lot of the functionality TileDB provides. There's nothing preventing you from writing a pandas integration, for example.

Notably, the most significant difference I can see is that, as far as I can tell, TileDB will not solve for failures due to S3 consistency, which are solved by Iceberg and Delta Lake. It's strictly necessary to have a central, ACID-ish place to record transactions, as probing S3 is not reliable.

At the very least, you are definitely suspectable to stale reads, which means users will see all sorts of bizarre failures at scale. I'd be pretty terrified of putting anything serious on TileDB.

Shelnutt2 · on May 8, 2020

Disclosure: I am a member of the TileDB team.

Several important concepts are being conflated here, so I'll elaborate on each separately.

S3 eventual consistency: TileDB is fully aware and designed around the eventual consistency guarantees of cloud object stores [1]. When an array is opened, only the committed writes (up to the specified timestamp if using time traveling) are seen by the reader (each write produces a timestamped "fragment", which is essentially a folder). There will never be partial reads, or corrupt reads. The array is always in a readable state with committed data. This works very similar to Iceberg's method of opening a table at a snapshot for reads [2]. I don't believe that TileDB is any more affected by stall reads than Iceberg (or Delta lake), the user must reopen or re-query a table to see data from a newer timestamp/snapshot.

Handling S3 write failures: TileDB is designed for a lock-free, multi-writer scenario. All writes produce a new timestamped fragment, which is immutable after completion of the write. TileDB performs an atomic write of an "ok" object, which signals when the fragment is complete. Any fragment which is missing the "ok" file, is ignored by the reader [3]. TileDB handles corrupt or incomplete fragments by erroring out on the read. In the future, we could offer a retry mechanism for failed reads, but the important thing is we will never return corrupt or invalid results. Incomplete fragments can happen because of S3's eventual consistency; it is possible for the atomic ok file to show up before an object in the fragment "folder".

Write serializability: TileDB's fragments are written at a (timestamp + uuid) fashion. In the event of a conflicting write at the same timestamp, the uuid provides uniqueness and guarantees that there are no errors by essentially randomly ordering the conflicting fragments. The effect in the end is similar to Iceberg's cancel and retry conflicting writes, except TileDB does not have the penalty of retrying. For Iceberg, if there are two simultaneous conflicting writes, it is effectively random which one would be accepted and which one would be retried.

ACID: TileDB does not support ACID intentionally, as it was not designed to be a transactional database. TileDB was designed with a lock-free multi-writer/multi-reader model, as our use cases up until now involved a one-off massive parallel write, and then multiple concurrent reads. Lack of transactions does not yield inconsistent data though, as stated above the read/write algorithms are specifically designed with this in mind. That said, we have recognized that there are workloads where transactions are important, and we have plans to eventually add transactions to our cloud product where we have the orchestration layer needed to manage them. It is important to keep the transactional layer modular and format-agnostic (Delta Lake is blending it with Parquet) and, therefore, we will build it on top of TileDB, not inside the storage engine.

Interoperability: Obviously nothing stops someone from writing a pandas integration for Iceberg or Delta Lake. An important part of our philosophy at TileDB is extreme integration with existing tools and frameworks, which is why we strive to support things like returning numpy arrays for the Python results and doing zero-copying everywhere possible. In other words, it is not trivial to add a tool integration, just the same as it is not trivial to just add an ACID layer.

[1] https://docs.tiledb.com/main/basic-concepts/consistency

[2] https://iceberg.apache.org/reliability/

[3] https://docs.tiledb.com/main/basic-concepts/physical-storage

xyzzy_plugh · on May 8, 2020

Thanks for the detailed response.

This is what I was getting at, mostly:

> Any fragment which is missing the "ok" file, is ignored by the reader [3].

As a client, how do I detect this? Say I just wrote some data and now I want to generate a new total. Could my data be silently missing?

> TileDB handles corrupt or incomplete fragments by erroring out on the read. In the future, we could offer a retry mechanism for failed reads, but the important thing is we will never return corrupt or invalid results.

But now I have to handle this everywhere I interact with TileDB. Most users don't expect essentially random errors. Retries are a stop-gap, as it can take arbitrary time for consistency to converge. I've observed O(hours) regularly.

> Incomplete fragments can happen because of S3's eventual consistency; it is possible for the atomic ok file to show up before an object in the fragment "folder".

Indeed! And from what I understand, mostly from conversation with the Iceberg team last year, they solve this entirely by using a central ACID service backed by RDBMS (which, aside, I find very amusing!! They never listen to Stonebraker...).

Indeed, as far as I can tell, the only way to remove consistency issues is to: - store an index to objects somewhere central - never list the bucket (or speculatively stat, etc.)

Until either a) you support a central index or b) S3 has a better consistency model (not that DynamoDB garbage Hadoop uses) I would be very reluctant to use TileDB if it wasn't using my own disks.

I think for small stuff, it'll work great, but at scale you're going to hit some roadblocks. I recently encountered splitting a bucket into to many buckets and prefixing UUIDs at the root of the bucket to reduce consistency problems caused by S3's rebalancing, and it ended up being a very expensive marginal improvement.

stavrospap · on May 9, 2020

Folks, apologies, but I think we got a bit side tracked here, TileDB does not suffer from the consistency issues mentioned above.

Here is how TileDB performs a new (potentially concurrent with other reads and writes) write:

- It creates a fragment folder (or "prefix" of a set of objects on S3 - there are no "folders" on S3) which is timestamped and carries a unique UUID. This fragment is self-contained and represents the entire write (e.g., all cells and all attribute values)

- It writes all data objects under the fragment prefix. Note that TileDB never updates, it always writes new immutable objects.

- After all the PUT requests succeed for the data objects, it creates an empty "ok" object.

Here is how TileDB performs a (potentially concurrent with other reads and writes) read:

- It lists the array prefix to get the ok objects

- There are two cases:

1. The ok object is not there for some fragment. That fragment is completely ignored.

2. The ok object is there. Since TileDB writes the ok object last, all the data objects it wrote have been committed and are all visible with GET requests. TileDB reads the data objects only with GET requests (not ListObject requests). Due to S3’s read-after-write consistency model (https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction...), all those objects will be available for reading (now on all S3 regions) with GET and there will be no errors.

Therefore, TileDB follows the eventual consistency model of S3 without any errors and surprises. The user doesn't need to handle anything. Our customers have been using TileDB in production for a long time, storing hundreds of TBs of data on S3, and no consistency issue has ever come up.

Summarizing, what xyzzy_plugh is raising here is that TileDB does not have ACID guarantees. And that is true (we never claimed the contrary) and intentional. We are building a transactional layer outside of the storage engine. The reason is that this transactional layer indeed needs to be a constantly running distributed service, whereas we want the TileDB storage engine to be embeddable and used without performance regression even by applications that do not need ACID (that is, the majority of our data science applications).

willvarfar · on May 8, 2020

How does tiledb deal with appends and upserts etc in s3? Is there success files, atomic folder renaming and things? Or?

Shelnutt2 · on May 8, 2020

Please see my response here:

https://hackernews.hn/item?id=23117117