TLDR: There are two versions of the Parquet file format, but adoption of Version 2 is slow due to limited compatibility in major engines and tools. While Version 2 offers improvements (smaller file sizes, faster write/read times), these gains are modest, and ecosystem support remains fragmented. If full control over the data pipeline is possible, using Version 2 can be worthwhile; otherwise, compatibility concerns with third-party integrations may outweigh the benefits. Parquet remains dominant, and its utility far surpasses these challenges.
Yes, but she never bothered to offer any alternative, and she never addressed the elephant in the room: conducting experiments in fundamental physics is expensive. How are we supposed to best use our limited funds?
You can't run experiments willy-nilly in the hopes that something interesting happens. We run experiments to test theories and the only tool we have for developing theories is mathematics - and we're ignoring when our mathematical models lead to nonsense such as singularities.
This comes up every time gdpr or ads are discussed. But it’s pretty simple I think: not enforcing privacy regulations forces site owners to break them.
The reason is that so long as some sites show tracking ads, the monetization possible by privacy-friendly ads is almost nothing.
The long-term goal must be that no one cheats, so that the revenue from well-behaved advertising can go up.
Remember, the consent dialogs aren’t ever asking permission to show ads.
C’mon. OpenAI is a large company now with 1000+ employees. You’re really going to air this hot take?
- if they release a model “they’re just releasing models without use cases”
- if they release safety guardrails “they are just doing this to avoid launching models”
- if the release has a waitlist “they’re losing their velocity”
- if they launch without a waitlist “they weren’t considering the safety implications”
- if they hire a top researcher “they’re conspiring to outspend open source”
- if they fire a top researcher “there’s too much politics taking over”
Probably because the benchmark gains from larger models are, at this time, negligible. Scaling transformers and iterating on attention might be a dead end for more capable models beyond 2T parameters. But I'm not sure.
To what extent 4o is a new model or a refinement depends on:
a) technology
b) thresholds for what it means for a model to be "new"
Not the name.
We have no clue about what happens within the super-secretive ironically-named OpenAI. To me, it feels like a new model. To you, it feels like a refinement. Unless one of us has insider information, I'm not sure it's worth disputing. We have a difference of opinion, and likely, neither of us has anything to back it up.
Literally every color grading example shows log footage as the "before". Of course this lacks contrast and vibrancy because it's not meant to be watched "as is". Please show me regular footage as a baseline so it's a fair comparison.
I just watched the demo with the Apollo 11 transcript. (sidenote: maybe Gemini is named after the space program?).
Wouldn't the transcript or at least a timeline of Apollo 11 be part of the training corpus?
So even without the 400 pages in the context window, just given the drawing, I would assume a prompt like "In the context of Apollo 11, what moment does the drawing refer to?" would yield the same result.
I asked ChatGPT-4 to identify three humorous moments in the Apollo 11 transcript and it hallucinated all 3 of them (I think -- I can't find what it's referring to). Presumably it's in its corpus, too.
> The "Snoopy" Moment:
During the mission, the crew had a small, black-and-white cartoon Snoopy doll as a semi-official mascot, representing safety and mission success. At one point, Collins joked about "Snoopy" floating into his view in the spacecraft, which was a light moment reflecting the camaraderie and the use of humor to ease the intense focus required for their mission.
The "Biohazard" Joke:
After the successful moon landing and upon preparing for re-entry into Earth's atmosphere, the crew humorously discussed among themselves the potential of being quarantined back on Earth due to unknown lunar pathogens. They joked about the extensive debriefing they'd have to go through and the possibility of being a biohazard. This was a light-hearted take on the serious precautions NASA was taking to prevent the hypothetical contamination of Earth with lunar microbes.
The "Mailbox" Comment:
In the midst of their groundbreaking mission, there was an exchange where one of the astronauts joked about expecting to find a mailbox on the Moon, or asking where they should leave a package, playing on the surreal experience of being on the lunar surface, far from the ordinary elements of Earthly life. This comment highlighted the astronauts' ability to find humor in the extraordinary circumstances of their journey.
You probably should not use UUIDs in your database to begin with, at least not as an ID. UUIDv7 aims to solve some of the issues of UUIDv4 that make it even less suitable for databases. 99% of the time, using a BigInt for an ID is better.
There are some nice features of using UUIDs rather than ints. It's been written about before; a few off the top of my head: client-side generation of ids, and no risk of faulty joins (joining two tables on the wrong ids can never produce hits with UUIDs, but it can with ints).
Those two suck for us right now (we're planning to move to UUIDs).
The nice thing about a Snowflake ID is that you can encode it into 11 characters in base 62; a UUID needs 22. Maybe that doesn't really matter, given that even 11 characters isn't something anyone will want to type. Snowflake IDs also require a bit of extra caution to avoid collisions, since the number you can generate per second is limited by the size of your sequence field.
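The base-62 length claim is easy to check in a few lines. A minimal sketch (the alphabet order here is arbitrary; real implementations should pick one and stick to it):

```python
import string

# Digits, then uppercase, then lowercase: 62 symbols total.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def to_base62(n: int) -> str:
    """Encode a non-negative integer in base 62."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

# A 64-bit Snowflake-style ID fits in 11 characters (62**11 > 2**64),
# while a 128-bit UUID needs 22 (62**22 > 2**128 > 62**21).
print(len(to_base62(2**64 - 1)))   # 11
print(len(to_base62(2**128 - 1)))  # 22
```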
Integers don't scale because you need a central server to keep track of the next integer in the sequence. UUIDs and other random IDs can be generated in a distributed fashion. Many examples, but the first one that comes to mind is Twitter writing their own custom ID implementation, Snowflake, to scale tweets [0]
I get what you’re saying but this feels like a premature optimization that only becomes necessary at scale.
It reminds me a bit of the microservices trend. People tried to mimic big tech companies but the community slowly realized that it’s not necessary for most companies and adds a lot of complexity.
I’ve worked at a variety of companies from small to medium-large and I can’t remember a single instance where we wish we used integer ids. It’s always been the opposite where we have to work around conflicts and auto incrementing.
In the same vein, distributed DBs are not required for most companies (from a technical standpoint; data locality for things like GDPR is another story). You can vertically scale _a lot_ before you even get close to the limits of a modern RDBMS. Like hundreds of thousands of QPS.
I've personally run MySQL in RDS on a mid-level instance, nowhere near maxing out RAM or IOPS, and it handled 120K QPS just fine. Notably, this was with a lot of UUIDv4 PKs.
I'd wager with intelligent schema design, good queries, and careful tuning, you could surpass 1 million QPS on a single instance.
Auto-incrementing integers mean you're always dependent on a central server. UUIDs break that dependency, so you can scale writes up to multiple databases in parallel.
If you're using MySQL maybe integer ids make sense, because it scales differently than PostgreSQL.
If the DB fails to assign an ID, it's probably broken, so having an external ID won't help you.
If you're referring to avoiding conflicts between distributed nodes, that's a solved problem as well – distribute chunked ranges of size N to each node.
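A minimal sketch of that chunked-range scheme (the `Coordinator`/`Node` names are hypothetical; a real allocator would persist its high-water mark so blocks survive restarts):

```python
import threading

BLOCK_SIZE = 1_000_000

class Coordinator:
    """Central allocator: one round trip per million ids, not per id."""
    def __init__(self):
        self._next = 1
        self._lock = threading.Lock()

    def reserve_block(self) -> range:
        with self._lock:
            start = self._next
            self._next += BLOCK_SIZE
        return range(start, start + BLOCK_SIZE)

class Node:
    """Each node assigns ids locally from its reserved block."""
    def __init__(self, coordinator: Coordinator):
        self._coord = coordinator
        self._ids = iter(())  # empty until the first block is reserved

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = iter(self._coord.reserve_block())
            return next(self._ids)

coord = Coordinator()
a, b = Node(coord), Node(coord)
# Each node draws from a disjoint block, so ids never collide.
print(a.next_id(), a.next_id(), b.next_id())  # 1 2 1000001
```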
The distributed database needs a coordination system anyway, so it's not an additional point.
> In general you shouldn't need to make a roundtrip to produce an ID.
Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.
> The distributed database needs a coordination system anyway, so it's not an additional point.
Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.
> Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.
OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.
> Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.
I'm assuming a system that tracks nodes and checks for quorum(s), because if you let isolated servers be authoritative then your data integrity goes to hell. If you have that system, you can use it for low-bandwidth coordinated decisions like reserving blocks of ids.
Am I wrong to think that most distributed databases have systems like that?
> OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.
Sure, but the first thing you said in this conversation was "Whatever is distributing the chunks is still a point of central coordination." which is equally narrow, so I wasn't expecting you to suddenly broaden when I asked why that mattered.
Though if you're running AP then I sure hope you have a reconciliation system, and a good reconciliation system can handle that kind of ID conflict. (Maybe you still want to avoid it to speed that process up but that really gets into the weeds.)
Yes, but with PostgreSQL (and any other SQL server I'm aware of) you already have a central server that can do that. If you have multiple SQL servers this won't work, obviously, unless you pair it with a unique server ID.
I recently worked on a data import project and because we used UUIDs I was able to generate all the ids offline. And because they’re randomly generated there was no risk of conflict.
This was nice because if the script failed half way through I could easily lookup which ids were already imported and continue where I left off.
The point is, this property of UUIDs occasionally comes in handy and it’s a life saver.
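A sketch of that import pattern (row shapes here are hypothetical): assign every source row a random UUID up front, before touching the database, so the ids are fixed in the prepared file and a failed run can be resumed by skipping ids that already landed.

```python
import uuid

# Prepare the rows offline: no round trip to the db, no conflict risk.
rows = [{"name": "alice"}, {"name": "bob"}, {"name": "carol"}]
for row in rows:
    row["id"] = str(uuid.uuid4())

# Persist `rows` (e.g. to a file) before importing. On a retry, reload
# that file, ask the db which ids already exist, and skip them.
already_imported = {rows[0]["id"]}  # pretend the first row made it in
remaining = [r for r in rows if r["id"] not in already_imported]
print(len(remaining))  # 2
```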
postgres=# CREATE TABLE foo(id INT, bar TEXT);
CREATE TABLE
postgres=# INSERT INTO foo (id, bar) VALUES (1, 'Hello, world');
INSERT 0 1
postgres=# ALTER TABLE foo ALTER id SET NOT NULL, ALTER id ADD GENERATED
ALWAYS AS IDENTITY (START WITH 2);
ALTER TABLE
postgres=# INSERT INTO foo (bar) VALUES ('ACK');
INSERT 0 1
postgres=# TABLE foo;
id | bar
----+--------------
1 | Hello, world
2 | ACK
(2 rows)
You said data import, so I assumed it was pulling rows into an empty table. The example I posted was a way to create a table with a static integer PK that you could rapidly generate in a loop, and then later convert it to auto-incrementing.
> I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.
IME, if something is easy with RDBMS in prod, it usually means you’re paying for it later. This is definitely the case with UUIDv4 PKs.
No I mean an active prod table with people adding new rows all the time. It's just so much easier not having to worry about integer conflicts and auto-incrementing shenanigans.
But I get you like integers so whatever works for you, I just don't think they're the right tradeoff for most projects.
Now you can use PG to generate the UUIDv7 at the beginning, then easily switch to generating it in the client if you need to in the future, but I think OP was talking about UUID vs auto-incrementing integers in general, not specific to Postgres.
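For the client-side half of that switch: newer Pythons ship `uuid.uuid7()`, but on older versions a minimal RFC 9562 sketch looks like the following. Assumptions: no monotonicity guarantee for ids minted within the same millisecond, which real implementations usually add.

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7: 48-bit unix-ms timestamp, version, 74 random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF        # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & (2**62 - 1)   # 62 bits
    value = (
        ((ts_ms & (2**48 - 1)) << 80)  # timestamp in the top 48 bits
        | (0x7 << 76)                  # version 7
        | (rand_a << 64)               # rand_a
        | (0b10 << 62)                 # RFC variant
        | rand_b                       # rand_b
    )
    return uuid.UUID(int=value)

u = uuid7()
print(u.version)  # 7
```

Because the timestamp occupies the most significant bits, ids minted in different milliseconds sort in creation order, which is exactly what makes UUIDv7 friendlier to b-tree indexes than UUIDv4.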
I encountered this once: If you use integer IDs, try to scale horizontally, and do not generate the IDs in the database, you'll get in deep trouble. The solution for us was to let the DB handle ID generation.
Here are some reasons for using UUIDs; they don't all apply to every business:
- client-side generation (e.g. it can reduce complexity when creating complex data on the client side and only some time later actually inserting it into your db)
- Global identification (being able to look up an unknown thing by just an id - very useful in log searching / admin dashboards / customer support tools)
I would never advise this. I use UUIDv4 for basically everything. It adds minimal overhead to small systems and adds HUGE benefits if/when you need to scale. If you need to sort by creation date use a "created" column (or UUIDv7 if appropriate).
If your system ever becomes distributed you will sing the praises of whoever chose UUIDs over int IDs, and if it never becomes distributed, UUIDs won't hurt you.
Note: this is for web systems. If it's embedded systems then the overhead starts to matter and the usefulness of UUID is probably nil.
It is worth mentioning that the reason UUIDv4 is strictly forbidden in some large decentralized systems is the myriad cases of collisions that occurred because the "random number" wasn't as random as people thought. There have been far too many cases of people not using a cryptographically strong RNG, whether unwittingly or out of ignorance that they needed to.
Less of an issue if you have total control of the operational environment and code base, but that is not always the case.
It comes in a couple common flavors. Most commonly it is people just rolling their own implementation and using a PRNG or similar. Not every environment has a ready-made UUIDv4 implementation, and not all UUIDv4 implementations in the wild are strict. A rarer horror story I've heard a couple times is discovering that the strong RNG provided by their environment is broken in some way. Both of these cases are particularly problematic because they are difficult to detect operationally until something goes horribly wrong.
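The first failure mode can be sketched in a few lines: a "UUIDv4" built from a seeded PRNG is fully deterministic, so two nodes that happen to seed identically (e.g. from the same boot-time value) mint identical ids. A strict implementation draws its 122 random bits from a CSPRNG instead.

```python
import random
import secrets
import uuid

def bad_uuid4(rng: random.Random) -> uuid.UUID:
    """Hand-rolled 'UUIDv4' from a non-cryptographic, seedable PRNG."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)

# Two nodes, same seed -> guaranteed collision.
node_a, node_b = random.Random(42), random.Random(42)
print(bad_uuid4(node_a) == bad_uuid4(node_b))  # True

def good_uuid4() -> uuid.UUID:
    """Strict UUIDv4: randomness from the OS CSPRNG."""
    return uuid.UUID(bytes=secrets.token_bytes(16), version=4)

print(good_uuid4().version)  # 4
```

(In practice you would just call `uuid.uuid4()`, which already uses `os.urandom`; the point is what goes wrong when environments without a vetted implementation roll their own.)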
The main reason non-probabilistic UUID-like types are used for high-reliability environments is that it is easy to verify the correctness of the operational implementation. It isn't that difficult to deterministically generate globally unique keys in a distributed system unless you have extremely unusual requirements.
It adds a lot of overhead at any scale, it’s just that the overhead is hidden due to the absurd speed of modern hardware.
I’ll again point out (I said this elsewhere in a post today on UUIDs) that PlanetScale uses int PKs internally. [0] That is a MASSIVE distributed system, working flawlessly with integers as keys. They absolutely can scale, it just requires more thoughtful data modeling and queries.
Sure, if you don't offer pagination or only have small tables, you can get away with offsets. I tend to go for cursors as a default because I like to build applications with performance in mind and it’s the same effort.
We may be talking about different things. I thought you were referring specifically to [database cursors](https://en.wikipedia.org/wiki/Cursor_(databases)), so that's what I was talking about. If you're talking about something else, like the concept of so-called "cursor-based pagination" in general, then that is still an option even with randomly-generated primary keys, so long as there are other attributes that can be used to establish an order (which attributes need not be visible to a user).
I have offered pagination over large tables, without database cursors or non-random keys, without offsets, while keeping performance in mind, with little effort.
I don't have a resource off the top of my head to point you to, but at the very least keyset pagination is superior to offset pagination because it does not get invalidated by new inserts.
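The difference is easy to demonstrate. A runnable sketch against SQLite (which supports the row-value comparison since 3.15; the `events` table and its columns are hypothetical): instead of `OFFSET`, page 2 seeks past the last row the client saw, so it scans nothing and a concurrent insert ahead of the cursor cannot shift the page.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
db.executemany(
    "INSERT INTO events (id, created_at) VALUES (?, ?)",
    [(i, f"2024-01-{i:02d}") for i in range(1, 10)],
)

# Page 1: first 3 rows in cursor order. The `id` tiebreaker keeps the
# order total even when timestamps collide.
page1 = db.execute(
    "SELECT created_at, id FROM events ORDER BY created_at, id LIMIT 3"
).fetchall()

# Page 2: seek past the last row seen instead of OFFSET.
last_created, last_id = page1[-1]
page2 = db.execute(
    "SELECT created_at, id FROM events "
    "WHERE (created_at, id) > (?, ?) "
    "ORDER BY created_at, id LIMIT 3",
    (last_created, last_id),
).fetchall()
print(page2[0])  # ('2024-01-04', 4)
```

With an index on `(created_at, id)` the `WHERE` clause becomes an index seek, so page 500 costs the same as page 1, whereas `OFFSET 24950` must read and discard 24,950 rows first.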