Hacker News | data_ders's comments

I think the advantage is simplicity. Why connect first to duckdb and attach the db when you can query it directly with ADBC, which is guaranteed to be fast?

You don’t need to connect to duckdb, it’s just a process that you spawn.

You spawn an in-memory instance of duckdb and connect to it.

Yeah, for me standardization is the big win. Not just output formatting, but also CLI commands and a guarantee that they’re as fast as possible, given that all the connectors use ADBC.

Reminds me of the common observance of “machine elves” when taking DMT


hiya, anders from dbt here. cool project -- I especially love the branching and budgeting options you've built in. both are things that I'd love for the dbt standard to include one day. was it dbt's lack of those features that inspired you to start this project? It also seems you have an aversion to Jinja, which, believe me, I get!

FYI dbt-fusion [1] is going GA next week (though GA for Databricks will come later). Most of it is source-available and ELv2-licensed, but there are a number of crates that are Apache 2.0, namely: dbt-xdbc, dbt-adapter, dbt-auth, dbt-jinja, dbt-agate. We also have plans to OSS more as time goes on (stay tuned).

I just wanted to call out the OSS crates in case you'd rather focus on "making your beer taste better" than have to re-build foundations. I'd love to hear if any of those crates come in handy for you (even more so if they don't work for you).

Feel free to reach out on LinkedIn or dbt community Slack if you ever want to chat more!

[1]: https://github.com/dbt-labs/dbt-fusion


Hey Anders! Thanks a lot for dropping a comment and showing interest in Rocky. Yes! I'm not going to lie, Jinja is one of the things that gives me some itches :). But it wasn't the major reason I started building Rocky.

It all started with the need for auto-generating dbt models from the FiveTran connections I integrate with, then having to hot reload code location in Dagster to discover new assets. All in a zero-touch data pipeline. FiveTran connections are discovered as they're created, assets are materialized as these connections sync.

Auto-generating these dbt models and getting the manifest aligned between Dagster code location reloads, plus spinning up pods in EKS for each Dagster run that needs to rely on these auto-generated models, has some impact on overall performance. That is true not only in production; it also affects DX in the local environment.

Rocky wasn't born with a "dbt replacement" in mind at all; it was born to solve a real issue I'm facing. I made sure it integrates well with dbt, as I plan to leverage the awesome work available as dbt packages for FiveTran.

I'll definitely have a look at the crates you mentioned! Thank you!


thanks for the context!

> Auto-generating these dbt models and getting the manifest aligned between Dagster code location

I just added you on LinkedIn. if you accept my connection there I can DM you a private preview document that you might find very interesting related to dbt project metadata (that is way less painful than `manifest.json`)


Accepted! :)


plus 1 for ADBC!


ok, this is definitely up my alley. color me nerd-sniped and forgive the onslaught of questions.

my questions are less about the syntax, which i'm largely familiar with knowing both SQL and ggplot.

i'm more interested in the backend architecture. Looking at the Cargo.toml [1], I was surprised to not see a visualization dependency like D3 or Vega. Is this intentional?

I'm certainly going to take this for a spin and I think this could be incredible for agentic analytics. I'm mostly curious right now what "deployment" looks like, both currently and in a utopian future.

utopia is easier -- what if databases supported it directly?!? but even then I think I'd rather have databases spit out an intermediate representation (IR) that could be handed to a viz engine, similar to how vega works. or perhaps the SQL is the IR?!

another question that arises from the question of composability: how distinct would a ggplot IR be from a metrics layer spec? could i use ggsql to create an IR that I then use R's ggplot to render (or vice versa maybe?)

as for the deployment story today, I'll likely learn most by doing (with agents). My experiment will be to kick off an agent to do something like: extract this dataset to S3 using dlt [2], model it using dbt [3], then use ggsql to visualize.

p.s. @thomasp85, I was a big fan of tidygraph back in the day [4]. love how small our data world is.

[1]: https://github.com/posit-dev/ggsql/blob/main/Cargo.toml

[2]: https://github.com/dlt-hub/dlt

[3]: https://github.com/dbt-labs/dbt-fusion

[4]: https://stackoverflow.com/questions/46466351/how-to-hide-unc...


Let me try to not miss any of the questions :-)

ggsql is modular by design. It consists of various reader modules that take care of connecting with different data backends (currently we have a DuckDB, an SQLite, and an ODBC reader), a central plot module, and various writer modules that take care of the rendering (currently only Vega-Lite, but I plan to write my own renderer from scratch).
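To make the reader / plot / writer split concrete, here is a rough Python sketch of that shape. It is not ggsql's code or API: every class and function name below (`SQLiteReader`, `build_plot`, `VegaLiteWriter`, `render`) is made up for illustration, and the plot object is a plain dict. The only assumption about the real thing is the three-stage layout described above.

```python
import sqlite3
from typing import Any, Dict, List, Protocol

Rows = List[Dict[str, Any]]


class Reader(Protocol):
    """Hypothetical reader interface: runs SQL against a backend, returns rows."""
    def read(self, sql: str) -> Rows: ...


class Writer(Protocol):
    """Hypothetical writer interface: turns a plot object into an output spec."""
    def write(self, plot: Dict[str, Any]) -> Dict[str, Any]: ...


class SQLiteReader:
    """One possible reader; a DuckDB or ODBC reader would expose the same method."""
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def read(self, sql: str) -> Rows:
        cur = self.conn.execute(sql)
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]


def build_plot(rows: Rows, x: str, y: str, geom: str = "point") -> Dict[str, Any]:
    """Central plot step: a backend-neutral description of the chart."""
    return {"data": rows, "geom": geom, "aes": {"x": x, "y": y}}


class VegaLiteWriter:
    """One possible writer; emits a minimal Vega-Lite style spec as a dict."""
    def write(self, plot: Dict[str, Any]) -> Dict[str, Any]:
        return {
            "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
            "data": {"values": plot["data"]},
            "mark": plot["geom"],
            "encoding": {
                "x": {"field": plot["aes"]["x"], "type": "quantitative"},
                "y": {"field": plot["aes"]["y"], "type": "quantitative"},
            },
        }


def render(reader: Reader, writer: Writer, sql: str, x: str, y: str) -> Dict[str, Any]:
    """Wire the three stages together: read, build the plot, write."""
    return writer.write(build_plot(reader.read(sql), x, y))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (hp REAL, mpg REAL)")
conn.executemany("INSERT INTO cars VALUES (?, ?)", [(110, 21.0), (93, 22.8)])

spec = render(SQLiteReader(conn), VegaLiteWriter(),
              "SELECT hp, mpg FROM cars ORDER BY hp", "hp", "mpg")
```

Swapping the backend or the renderer then only means passing a different reader or writer to `render`; the plot step in the middle stays the same.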

As for deployment, I can only talk about a utopian future since this alpha release doesn't provide much that is tangible in that area. The ggsql Jupyter kernel already allows you to execute ggsql queries in Jupyter and Quarto notebooks, so deployment of reports should kinda work already, though we are still looking at making it as easy as possible to move database credentials along with the deployment. I also envision deployment of single .ggsql files that result in embeddable visualisations you can reference on websites etc. Our focus in this area will be Posit Connect in the short term.

I'm afraid I don't know what IR stands for - can you elaborate?


Intermediate Representation


Ah - yes, in theory you could create a "ggplot2 writer" which renders the plot object to an R file you can execute. It is not too far away from the current Vega-Lite writer we use. The other direction (ggplot2->ggsql) is not really feasible.


right? like it's a graph and a relational model query and a pipeline and a language and an abstract syntax tree and declarative logical plan


what do you think is the "most bad" thing about SQL?


Lack of abstractions like (block-scoped) variables and lambda functions.

SQL is declarative and purely functional anyways, so implementing these is a no-brainer.


TIL about Verse. Looks cool, I'll have to check it out.

> SQL is not a pipeline, it is a graph.

Maybe it's both? and maybe there will always be hard-to-express queries in SQL, and that's ok?

the RDBMS's relational model is certainly a graph and joins accordingly introduce complexity.

For me, just as creators of the internet regret that subdomains come before domains, I really wish we could go back in time and have `FROM` be the first clause and not `SELECT`. This is much more intuitive and lends itself to the idea of a pipeline: a table scan (FROM) that is piped to a projection (SELECT).
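A small Python sketch of that reading order, using the standard-library `sqlite3` module. The `scan`, `where`, `select`, and `pipe` helpers are hypothetical names made up for this illustration (they are not from any SQL dialect or library), and the `orders` table is toy data. The point is only that the pipeline version lists its steps in the order they run, while the SQL text starts with the projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ann", 10.0), (2, "bob", 25.0), (3, "cat", 40.0)],
)

# Standard SQL: the projection is written first, even though it runs last.
sql_rows = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > 20 ORDER BY id"
).fetchall()


def scan(conn, table):
    """FROM: yield every row of the table as a dict (table name is trusted here)."""
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY id")
    cols = [d[0] for d in cur.description]
    for row in cur:
        yield dict(zip(cols, row))


def where(pred):
    """WHERE: a step that keeps only rows matching the predicate."""
    return lambda rows: (r for r in rows if pred(r))


def select(*cols):
    """SELECT: a step that projects each row down to the named columns."""
    return lambda rows: (tuple(r[c] for c in cols) for r in rows)


def pipe(source, *steps):
    """Feed the source through each step, top to bottom."""
    for step in steps:
        source = step(source)
    return source


# FROM-first: scan, then filter, then project, in execution order.
pipeline_rows = list(pipe(
    scan(conn, "orders"),
    where(lambda r: r["amount"] > 20),
    select("customer", "amount"),
))
```

Both versions return the same rows; only the order in which the steps are written differs.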


Pipeline is a specific kind of a graph.
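One way to pin that down, as a sketch: treat a query plan as a directed graph of operators and call it a pipeline when it is a single linear chain. The `is_pipeline` function and the two example plans are made up for illustration and do not come from any particular query engine.

```python
from typing import Dict, List


def is_pipeline(edges: Dict[str, List[str]]) -> bool:
    """Return True if the operator graph is one linear chain.

    `edges` maps each operator to the operators that consume its output.
    """
    nodes = set(edges) | {c for consumers in edges.values() for c in consumers}
    if not nodes:
        return False

    indegree = {n: 0 for n in nodes}
    for consumers in edges.values():
        for c in consumers:
            indegree[c] += 1
    outdegree = {n: len(edges.get(n, [])) for n in nodes}

    # In a chain, no operator has more than one input or more than one output.
    if any(d > 1 for d in indegree.values()) or any(d > 1 for d in outdegree.values()):
        return False

    # There must be exactly one starting operator (this also rules out pure cycles).
    starts = [n for n in nodes if indegree[n] == 0]
    if len(starts) != 1:
        return False

    # Walk the chain from the start; it must visit every operator.
    visited, node = 1, starts[0]
    while edges.get(node):
        node = edges[node][0]
        visited += 1
    return visited == len(nodes)


# scan -> filter -> project: a pipeline.
linear_plan = {"scan orders": ["filter"], "filter": ["project"]}

# Two scans feeding a join: still a graph, but no longer a pipeline.
join_plan = {
    "scan orders": ["join"],
    "scan customers": ["join"],
    "join": ["project"],
}
```

Here `is_pipeline(linear_plan)` holds and `is_pipeline(join_plan)` does not, because the join operator has two inputs.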

Yes, there will always be hard-to-express queries, the question is how far can we go?


I'm as big a SQL stan as the next person and I'm also very skeptical anytime anyone says that SQL needs to be replaced.

At the same time, it's challenging that SQL cannot be iteratively improved and experimented upon.

IMHO, PRQL is a reasonable approach to extending SQL without replacing SQL.

But what I'd love to see is projects like Google's zeta-sql [1] and Substrait [2] get more traction. They would provide a more stable, standardized foundation upon which SQL could be improved, which would make the case for "SQL forever" even stronger.

I've blogged about this before [3].

[1]: https://github.com/google/googlesql

[2]: https://substrait.io/

[3]: https://roundup.getdbt.com/p/problem-exists-between-database...

