This pattern is really meaningful conceptually, but the tricky thing is to not create a mess in the process.
If it's too easy to branch, people will do so and the (knowledge) economies of scale disappear (and we'll have a mess).
If the common data definition is too hard to branch from, no experiments will happen (slow).
I think most tech for this seems to make it too easy, and in the process injects a bunch of dependencies that make it slow and harder to access. May have changed since I last looked.
I found that the simple pattern of versioned paths/table names such as `s3://mybucket/mystage/version=42/` or `my_table_v42` puts a high enough evolutionary cost on branching (as consumers need to explicitly adapt) while also avoiding the costs associated with using special tech (legacy/lock-in/dependencies).
It's also searchable on github/slack/etc if done right.
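To make the versioned-name pattern concrete, here is a minimal sketch. The helper names (`versioned_path`, `versioned_table`, `parse_version`) are made up for illustration and not from any library; the only things taken from the comment above are the two naming shapes themselves.

```python
import re

def versioned_path(bucket: str, stage: str, version: int) -> str:
    """Build a path shaped like s3://mybucket/mystage/version=42/."""
    return f"s3://{bucket}/{stage}/version={version}/"

def versioned_table(name: str, version: int) -> str:
    """Build a table name shaped like my_table_v42."""
    return f"{name}_v{version}"

# Matches a trailing "version=<n>/" (path style) or "_v<n>" (table style).
_VERSION_RE = re.compile(r"(?:version=|_v)(\d+)/?$")

def parse_version(ref: str) -> int:
    """Extract the version number from a versioned path or table name."""
    match = _VERSION_RE.search(ref)
    if match is None:
        raise ValueError(f"no version suffix found in {ref!r}")
    return int(match.group(1))

# A consumer pins the version explicitly, so moving to a new branch of the
# data means a visible, greppable code change.
INPUT_PATH = versioned_path("mybucket", "mystage", 42)
INPUT_TABLE = versioned_table("my_table", 42)
```

Because the version is a plain string in the consumer's code, a search for `version=42` or `my_table_v42` finds every reader that still depends on it.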
Love this idea! The biggest hurdle, though, has been getting predictable auth & IO across multiple Python/Scala versions and everything else (Spark, orchestrators, CLIs for teams on varying OSes, etc.). Add access logs to that.
S3Fs/boto/botocore versions x Scala/Spark x parquet x iceberg x k8s etc., each with readers that make their own assumptions, make reading from S3 alone a maintenance and compatibility nightmare.
Will the mounted system _really_ be accessible as local fs and seen as such to all running processes? No surprises? No need for python specific filesystem like S3Fs?
If so, then you will win 100%. I wouldn't even care about speed/cost as long as it's on par with S3.
Yeah, that's exactly right. I had some... experiences with Spark recently, that convinced me that this is something that could really help. I also really like the idea that organizations can continue to use S3 as the source of truth for their data (as you mention, it means that you can continue to use Access Logs, which would capture all usage of your S3 bucket across your applications).
> Will the mounted system _really_ be accessible as local fs and seen as such to all running processes? No surprises? No need for python specific filesystem like S3Fs?
Ha, well it depends on what you mean by surprises. We won't have a Python-specific file system. Our client is going to come in two flavors. Today, you can mount Regatta over NFSv3 (which we wrap in TLS to make it secure). This works for some workloads, but doesn't provide like-for-like performance with EBS. Over the next month, we plan to release the "custom protocol" that I wrote about above, that we expect to send to customers in the form of a FUSE file system.
Either way, it should be one package, you shouldn't need to worry about versioning, and it will appear as a real, local file system. :D
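As an illustration of what "appears as a real, local file system" would mean for a consumer, here is a minimal sketch that uses only the Python standard library, with no S3-specific client. The function name and the relative path are hypothetical, and a temporary directory stands in for the mount point, so this says nothing about how the actual client behaves.

```python
import tempfile
from pathlib import Path

def write_and_read_back(mount_root: Path, relative: str, payload: bytes) -> bytes:
    """Write a file under mount_root with plain file APIs and read it back."""
    target = mount_root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target.read_bytes()

if __name__ == "__main__":
    # Assumption: a temp dir stands in for wherever the file system is mounted.
    with tempfile.TemporaryDirectory() as fake_mount:
        data = write_and_read_back(
            Path(fake_mount), "mystage/version=42/part-0.bin", b"hello"
        )
        print(len(data), "bytes read back")
```

If the mount really behaves like a local file system, the same code should work unchanged when `mount_root` points at it, and the same holds for any other process or language on the machine.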
Right, some laws are just FTP-dragged & dropped to prod, with the understanding that they don't compile yet. But if the legislator managed to express the _intention_, the courts will over time add all the test cases and make them pass.
I'm sure the mentality of a PHP developer running a successful but insane legacy site is a better model for this than a perfect OCaml project :)
* Optimization (this is less important but extremely useful)
If you know the mechanics of multivariate calculus you'll be fine learning the above. The course that has had the most payoff for me personally was functional analysis. It's a purely theoretical course that will give you no practical skills and at first glance seems unrelated to ML, but it (subtly) gave me a much deeper understanding of what ML is all about.