TLDR: There are two versions of the Parquet file format, but adoption of Version 2 is slow due to limited compatibility in major engines and tools. While Version 2 offers improvements (smaller file sizes, faster write/read times), these gains are modest, and ecosystem support remains fragmented. If full control over the data pipeline is possible, using Version 2 can be worthwhile; otherwise, compatibility concerns with third-party integrations may outweigh the benefits. Parquet remains dominant, and its utility far surpasses these challenges.
Yes, but she never bothered to offer any alternative, and she never addressed the elephant in the room: conducting experiments in fundamental physics is expensive. How are we supposed to best use our limited funds?
You can't run experiments willy-nilly in the hopes that something interesting happens. We run experiments to test theories and the only tool we have for developing theories is mathematics - and we're ignoring when our mathematical models lead to nonsense such as singularities.
This comes up every time gdpr or ads are discussed. But it’s pretty simple I think: not enforcing privacy regulations forces site owners to break them.
The reason is that so long as some sites show tracking ads, the monetization possible by privacy-friendly ads is almost nothing.
The long-term goal must be that no one cheats, so that the revenue from well-behaved advertising can go up.
Remember, the consent dialogs aren’t ever asking permission to show ads.
C’mon. OpenAI is a large company now with 1000+ employees. You’re really going to air this hot take?
- if they release a model “they’re just releasing models without use cases”
- if they release safety guardrails “they are just doing this to avoid launching models”
- if the release has a waitlist “they’re losing their velocity”
- if they launch without a waitlist “they weren’t considering the safety implications”
- if they hire a top researcher “they’re conspiring to outspend open source”
- if they fire a top researcher “there’s too much politics taking over”
Probably because the benchmark gains from larger models are, at this time, negligible. Scaling transformers and iterating on attention might be a dead end for more capable models beyond 2T parameters. But I'm not sure.
To what extent 4o is a new model or a refinement depends on:
a) technology
b) thresholds for what it means for a model to be "new"
Not the name.
We have no clue about what happens within the super-secretive ironically-named OpenAI. To me, it feels like a new model. To you, it feels like a refinement. Unless one of us has insider information, I'm not sure it's worth disputing. We have a difference of opinion, and likely, neither of us has anything to back it up.
Literally every color grading example shows log footage as the "before". Of course this lacks contrast and vibrancy because it's not meant to be watched "as is". Please show me regular footage as a baseline so it's a fair comparison.
I just watched the demo with the Apollo 11 transcript. (sidenote: maybe Gemini is named after the space program?).
Wouldn't the transcript or at least a timeline of Apollo 11 be part of the training corpus?
So even without the 400 pages in the context window, just given the drawing, I would assume a prompt like "In the context of Apollo 11, what moment does the drawing refer to?" would yield the same result.
I asked ChatGPT-4 to identify three humorous moments in the Apollo 11 transcript and it hallucinated all 3 of them (I think -- I can't find what it's referring to). Presumably it's in its corpus, too.
> The "Snoopy" Moment:
During the mission, the crew had a small, black-and-white cartoon Snoopy doll as a semi-official mascot, representing safety and mission success. At one point, Collins joked about "Snoopy" floating into his view in the spacecraft, which was a light moment reflecting the camaraderie and the use of humor to ease the intense focus required for their mission.
The "Biohazard" Joke:
After the successful moon landing and upon preparing for re-entry into Earth's atmosphere, the crew humorously discussed among themselves the potential of being quarantined back on Earth due to unknown lunar pathogens. They joked about the extensive debriefing they'd have to go through and the possibility of being a biohazard. This was a light-hearted take on the serious precautions NASA was taking to prevent the hypothetical contamination of Earth with lunar microbes.
The "Mailbox" Comment:
In the midst of their groundbreaking mission, there was an exchange where one of the astronauts joked about expecting to find a mailbox on the Moon, or asking where they should leave a package, playing on the surreal experience of being on the lunar surface, far from the ordinary elements of Earthly life. This comment highlighted the astronauts' ability to find humor in the extraordinary circumstances of their journey.
You probably should not use UUIDs in your database to begin with, at least not as an ID. UUIDv7 aims to solve some of the issues of UUIDv4 that make it even less suitable for databases. 99% of the time, using a BigInt for an ID is better.
There are some nice features of using UUIDs rather than ints. It's been written about before; a few off the top of my head: client-side generation of ids, and no risk of faulty joins (joining two tables on the wrong ids can never produce hits with UUIDs, but it can with ints).
Those two suck for us right now (we're planning to move to UUIDs).
The nice thing about a Snowflake ID is that you can encode it into 11 characters in base 62; a UUID needs 22. Maybe that doesn't really matter, given that even 11 characters isn't something anyone will want to type. Snowflake IDs also require a bit of extra caution to avoid collisions, since the number you can generate per second is limited by the size of your sequence field.
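The base-62 length claim is easy to check in a few lines. A minimal sketch (the alphabet order here is arbitrary; real implementations should pick one and stick to it):

```python
import string

# Digits, then uppercase, then lowercase: 62 symbols total.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def to_base62(n: int) -> str:
    """Encode a non-negative integer in base 62."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

# A 64-bit Snowflake-style ID fits in 11 characters (62**11 > 2**64),
# while a 128-bit UUID needs 22 (62**22 > 2**128 > 62**21).
print(len(to_base62(2**64 - 1)))   # 11
print(len(to_base62(2**128 - 1)))  # 22
```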
Integers don't scale because you need a central server to keep track of the next integer in the sequence. UUIDs and other random IDs can be generated in a distributed fashion. Many examples, but the first one that comes to mind is Twitter writing their own custom ID implementation, Snowflake, to scale tweets [0]
I get what you’re saying but this feels like a premature optimization that only becomes necessary at scale.
It reminds me a bit of the microservices trend. People tried to mimic big tech companies but the community slowly realized that it’s not necessary for most companies and adds a lot of complexity.
I’ve worked at a variety of companies from small to medium-large and I can’t remember a single instance where we wish we used integer ids. It’s always been the opposite where we have to work around conflicts and auto incrementing.
In the same vein, distributed DBs are not required for most companies (from a technical standpoint; data locality for things like GDPR is another story). You can vertically scale _a lot_ before you even get close to the limits of a modern RDBMS. Like hundreds of thousands of QPS.
I've personally run MySQL in RDS on a mid-level instance, nowhere near maxing out RAM or IOPS, and it handled 120K QPS just fine. Notably, this was with a lot of UUIDv4 PKs.
I'd wager with intelligent schema design, good queries, and careful tuning, you could surpass 1 million QPS on a single instance.
Auto-incrementing integers mean you're always dependent on a central server. UUIDs break that dependency, so you can scale writes up to multiple databases in parallel.
If you're using MySQL maybe integer ids make sense, because it scales differently than PostgreSQL.
If the DB fails to assign an ID, it's probably broken, so having an external ID won't help you.
If you're referring to avoiding conflicts between distributed nodes, that's a solved problem as well – distribute chunked ranges of size N to each node.
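A minimal sketch of that chunked-range scheme (the `Coordinator`/`Node` names are hypothetical; a real allocator would persist its high-water mark so blocks survive restarts):

```python
import threading

BLOCK_SIZE = 1_000_000

class Coordinator:
    """Central allocator: one round trip per million ids, not per id."""
    def __init__(self):
        self._next = 1
        self._lock = threading.Lock()

    def reserve_block(self) -> range:
        with self._lock:
            start = self._next
            self._next += BLOCK_SIZE
        return range(start, start + BLOCK_SIZE)

class Node:
    """Each node assigns ids locally from its reserved block."""
    def __init__(self, coordinator: Coordinator):
        self._coord = coordinator
        self._ids = iter(())  # empty until the first block is reserved

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = iter(self._coord.reserve_block())
            return next(self._ids)

coord = Coordinator()
a, b = Node(coord), Node(coord)
# Each node draws from a disjoint block, so ids never collide.
print(a.next_id(), a.next_id(), b.next_id())  # 1 2 1000001
```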
The distributed database needs a coordination system anyway, so it's not an additional point.
> In general you shouldn't need to make a roundtrip to produce an ID.
Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.
> The distributed database needs a coordination system anyway, so it's not an additional point.
Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.
> Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.
OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.
> Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.
I'm assuming a system that tracks nodes and checks for quorum(s), because if you let isolated servers be authoritative then your data integrity goes to hell. If you have that system, you can use it for low-bandwidth coordinated decisions like reserving blocks of ids.
Am I wrong to think that most distributed databases have systems like that?
> OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.
Sure, but the first thing you said in this conversation was "Whatever is distributing the chunks is still a point of central coordination." which is equally narrow, so I wasn't expecting you to suddenly broaden when I asked why that mattered.
Though if you're running AP then I sure hope you have a reconciliation system, and a good reconciliation system can handle that kind of ID conflict. (Maybe you still want to avoid it to speed that process up but that really gets into the weeds.)
Yes, but with PostgreSQL (and any other SQL server I'm aware of) you already have a central server that can do that. If you have multiple SQL servers this won't work, obviously, unless you pair it with a unique server ID.
I recently worked on a data import project and because we used UUIDs I was able to generate all the ids offline. And because they’re randomly generated there was no risk of conflict.
This was nice because if the script failed half way through I could easily lookup which ids were already imported and continue where I left off.
The point is, this property of UUIDs occasionally comes in handy and it’s a life saver.
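A sketch of that import pattern (row shapes here are hypothetical): assign every source row a random UUID up front, before touching the database, so the ids are fixed in the prepared file and a failed run can be resumed by skipping ids that already landed.

```python
import uuid

# Prepare the rows offline: no round trip to the db, no conflict risk.
rows = [{"name": "alice"}, {"name": "bob"}, {"name": "carol"}]
for row in rows:
    row["id"] = str(uuid.uuid4())

# Persist `rows` (e.g. to a file) before importing. On a retry, reload
# that file, ask the db which ids already exist, and skip them.
already_imported = {rows[0]["id"]}  # pretend the first row made it in
remaining = [r for r in rows if r["id"] not in already_imported]
print(len(remaining))  # 2
```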
postgres=# CREATE TABLE foo(id INT, bar TEXT);
CREATE TABLE
postgres=# INSERT INTO foo (id, bar) VALUES (1, 'Hello, world');
INSERT 0 1
postgres=# ALTER TABLE foo ALTER id SET NOT NULL, ALTER id ADD GENERATED
ALWAYS AS IDENTITY (START WITH 2);
ALTER TABLE
postgres=# INSERT INTO foo (bar) VALUES ('ACK');
INSERT 0 1
postgres=# TABLE foo;
id | bar
----+--------------
1 | Hello, world
2 | ACK
(2 rows)
You said data import, so I assumed it was pulling rows into an empty table. The example I posted was a way to create a table with a static integer PK that you could rapidly generate in a loop, and then later convert it to auto-incrementing.
> I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.
IME, if something is easy with RDBMS in prod, it usually means you’re paying for it later. This is definitely the case with UUIDv4 PKs.
No I mean an active prod table with people adding new rows all the time. It's just so much easier not having to worry about integer conflicts and auto-incrementing shenanigans.
But I get you like integers so whatever works for you, I just don't think they're the right tradeoff for most projects.
Now you can use PG to generate the UUIDv7 at the beginning, then easily switch to generating it in the client if you need to in the future, but I think OP was talking about UUID vs auto-incrementing integers in general, not specific to Postgres.
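For the client-side half of that switch: newer Pythons ship `uuid.uuid7()`, but on older versions a minimal RFC 9562 sketch looks like the following. Assumptions: no monotonicity guarantee for ids minted within the same millisecond, which real implementations usually add.

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7: 48-bit unix-ms timestamp, version, 74 random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF        # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & (2**62 - 1)   # 62 bits
    value = (
        ((ts_ms & (2**48 - 1)) << 80)  # timestamp in the top 48 bits
        | (0x7 << 76)                  # version 7
        | (rand_a << 64)               # rand_a
        | (0b10 << 62)                 # RFC variant
        | rand_b                       # rand_b
    )
    return uuid.UUID(int=value)

u = uuid7()
print(u.version)  # 7
```

Because the timestamp occupies the most significant bits, ids minted in different milliseconds sort in creation order, which is exactly what makes UUIDv7 friendlier to b-tree indexes than UUIDv4.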
I encountered this once: If you use integer IDs, try to scale horizontally, and do not generate the IDs in the database, you'll get in deep trouble. The solution for us was to let the DB handle ID generation.
Here are some reasons for using UUIDs; they don't all apply to every business:
- client-side generation (e.g. it can reduce complexity when creating complex data on the client side and only some time later actually inserting it into your db)
- Global identification (being able to look up an unknown thing by just an id - very useful in log searching / admin dashboards / customer support tools)
I would never advise this. I use UUIDv4 for basically everything. It adds minimal overhead to small systems and adds HUGE benefits if/when you need to scale. If you need to sort by creation date use a "created" column (or UUIDv7 if appropriate).
If your system ever becomes distributed you will sing the praises of whoever chose UUIDs over int IDs, and if it never becomes distributed, UUIDs won't hurt you.
Note: this is for web systems. If it's embedded systems then the overhead starts to matter and the usefulness of UUID is probably nil.
It is worth mentioning that the reason UUIDv4 is strictly forbidden in some large decentralized systems is the myriad cases of collisions that occurred because the "random number" wasn't as random as people thought. There have been far too many cases of people not using a cryptographically strong RNG, whether unwittingly or out of ignorance that they needed to.
Less of an issue if you have total control of the operational environment and code base, but that is not always the case.
It comes in a couple common flavors. Most commonly it is people just rolling their own implementation and using a PRNG or similar. Not every environment has a ready-made UUIDv4 implementation, and not all UUIDv4 implementations in the wild are strict. A rarer horror story I've heard a couple times is discovering that the strong RNG provided by their environment is broken in some way. Both of these cases are particularly problematic because they are difficult to detect operationally until something goes horribly wrong.
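The first failure mode can be sketched in a few lines: a "UUIDv4" built from a seeded PRNG is fully deterministic, so two nodes that happen to seed identically (e.g. from the same boot-time value) mint identical ids. A strict implementation draws its 122 random bits from a CSPRNG instead.

```python
import random
import secrets
import uuid

def bad_uuid4(rng: random.Random) -> uuid.UUID:
    """Hand-rolled 'UUIDv4' from a non-cryptographic, seedable PRNG."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)

# Two nodes, same seed -> guaranteed collision.
node_a, node_b = random.Random(42), random.Random(42)
print(bad_uuid4(node_a) == bad_uuid4(node_b))  # True

def good_uuid4() -> uuid.UUID:
    """Strict UUIDv4: randomness from the OS CSPRNG."""
    return uuid.UUID(bytes=secrets.token_bytes(16), version=4)

print(good_uuid4().version)  # 4
```

(In practice you would just call `uuid.uuid4()`, which already uses `os.urandom`; the point is what goes wrong when environments without a vetted implementation roll their own.)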
The main reason non-probabilistic UUID-like types are used for high-reliability environments is that it is easy to verify the correctness of the operational implementation. It isn't that difficult to deterministically generate globally unique keys in a distributed system unless you have extremely unusual requirements.
It adds a lot of overhead at any scale, it’s just that the overhead is hidden due to the absurd speed of modern hardware.
I’ll again point out (I said this elsewhere in a post today on UUIDs) that PlanetScale uses int PKs internally. [0] That is a MASSIVE distributed system, working flawlessly with integers as keys. They absolutely can scale, it just requires more thoughtful data modeling and queries.
Sure, if you don't offer pagination or only have small tables, you can get away with offsets. I tend to go for cursors as a default because I like to build applications with performance in mind and it’s the same effort.
We may be talking about different things. I thought you were referring specifically to [database cursors](https://en.wikipedia.org/wiki/Cursor_(databases)), so that's what I was talking about. If you're talking about something else, like the concept of so-called "cursor-based pagination" in general, then that is still an option even with randomly-generated primary keys, so long as there are other attributes that can be used to establish an order (which attributes need not be visible to a user).
I have offered pagination over large tables, without database cursors or non-random keys, without offsets, while keeping performance in mind, with little effort.
I don't have a resource off the top of my head to point you to, but at the very least keyset pagination is superior to offset pagination because it does not get invalidated by new inserts.
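The difference is easy to demonstrate. A runnable sketch against SQLite (which supports the row-value comparison since 3.15; the `events` table and its columns are hypothetical): instead of `OFFSET`, page 2 seeks past the last row the client saw, so it scans nothing and a concurrent insert ahead of the cursor cannot shift the page.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
db.executemany(
    "INSERT INTO events (id, created_at) VALUES (?, ?)",
    [(i, f"2024-01-{i:02d}") for i in range(1, 10)],
)

# Page 1: first 3 rows in cursor order. The `id` tiebreaker keeps the
# order total even when timestamps collide.
page1 = db.execute(
    "SELECT created_at, id FROM events ORDER BY created_at, id LIMIT 3"
).fetchall()

# Page 2: seek past the last row seen instead of OFFSET.
last_created, last_id = page1[-1]
page2 = db.execute(
    "SELECT created_at, id FROM events "
    "WHERE (created_at, id) > (?, ?) "
    "ORDER BY created_at, id LIMIT 3",
    (last_created, last_id),
).fetchall()
print(page2[0])  # ('2024-01-04', 4)
```

With an index on `(created_at, id)` the `WHERE` clause becomes an index seek, so page 500 costs the same as page 1, whereas `OFFSET 24950` must read and discard 24,950 rows first.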