
TLDR: There are two versions of the Parquet file format, but adoption of Version 2 is slow due to limited compatibility in major engines and tools. While Version 2 offers improvements (smaller file sizes, faster write/read times), these gains are modest, and ecosystem support remains fragmented. If you have full control over the data pipeline, using Version 2 can be worthwhile; otherwise, compatibility concerns with third-party integrations may outweigh the benefits. Parquet remains dominant, and its utility far surpasses these challenges.


Or you simply use the pytorch.powerplant_no_blow_up operator [1]

[1] https://www.youtube.com/watch?v=vXsT6lBf0X4



Pretty much. From the article:

> Another solution is dummy calculations, which run while there are no spikes, to smooth out demand.


But I still need pip to install uv, right? Or download it using a one-liner alternatively.


You can install it in several ways without pip; the easiest is the standalone installer (which can upgrade itself):

https://docs.astral.sh/uv/getting-started/installation/


cargo install


Yes, but the criticism is that using mathematics is a really bad guide, as this approach keeps failing


Yes, but she never bothered to offer any alternative, and also never addressed the elephant in the room: conducting experiments in fundamental physics is expensive. How are we supposed to best use our limited funds?

You can't run experiments willy-nilly in the hopes that something interesting happens. We run experiments to test theories and the only tool we have for developing theories is mathematics - and we're ignoring when our mathematical models lead to nonsense such as singularities.


I’d like the option to automatically choose the LEAST privacy conserving option, because

1. I don’t care

2. It should work better since it aligns with the goal of the site


Regarding 2: That's the fun part! Manual consent isn't required for functional cookies, only for marketing garbage that doesn't help you at all.


What if the goal of the site is to monetize views so it is economically viable to produce content?

Then GP's point about 'it should work better' implies working well over the long term, not just for a single interaction.

I find ads frustrating as well, but they are a powerful monetization strategy with no real substitute.


You don't need invasive and pervasive tracking and wholesale trade of user data to display ads.

Google earned billions of dollars doing contextual ads before tracking users' every move became the norm.


This comes up every time gdpr or ads are discussed. But it’s pretty simple I think: not enforcing privacy regulations forces site owners to break them.

The reason is that so long as some sites show tracking ads, the monetization possible by privacy-friendly ads is almost nothing.

The long-term goal must be that no one cheats, so that the revenue from well-behaved advertising can go up.

Remember the consent dialogs aren’t ever asking permission to show ads.


Hot take: People who produce content with the goal of getting money should just do something else.


That is an option with consent-o-matic. You just go to the first page of the preferences and turn everything on.


The extension allows you to choose what settings you want.


The fact that OpenAI is releasing a fancy UI instead of an improved model says something. I'm afraid GPT-5 won't be there any time soon.


C’mon. OpenAI is a large company now with 1000+ employees. You’re really going to air this hot take?

- if they release a model: “they’re just releasing models without use cases”
- if they release safety guardrails: “they are just doing this to avoid launching models”
- if the release has a waitlist: “they’re losing their velocity”
- if they launch without a waitlist: “they weren’t considering the safety implications”
- if they hired a top researcher: “they’re conspiring to outspend open source”
- if they fire a top researcher: “there’s too much politics taking over”


> I'm afraid GPT-5 won't be there any time soon.

Based on nothing but idle speculation.


Probably because the benchmark gains from larger models are, at this time, negligible. Scaling up transformers and iterating on attention might be a dead end for more capable models beyond 2T parameters. But I'm not sure.


You realize GPT-4o was released in May? And the new Facebook models within the past week?

New models are coming fast too.


New models, not refinements on old models. You know what the OP was saying. Why the pedantry?


I don't think it's pedantry.

To what extent 4o is a new model or a refinement depends on:

a) technology

b) thresholds for what it means for a model to be "new"

Not the naming.

We have no clue about what happens within the super-secretive ironically-named OpenAI. To me, it feels like a new model. To you, it feels like a refinement. Unless one of us has insider information, I'm not sure it's worth disputing. We have a difference of opinion, and likely, neither of us has anything to back it up.


Aren't 'new models' always technically just refinements of old models? isn't that the point?


refinements = faster, cheaper

new = better, new use cases


Literally every color grading example shows log footage as the "before". Of course this lacks contrast and vibrancy because it's not meant to be watched "as is". Please show me regular footage as a baseline so it's a fair comparison.


I didn't read this as a comparison against normal iPhone output.

Normal iPhone output is great looking, just not "the look".

I prefer it without a comparison to the iPhone's version.


It's a meaningless comparison. They're showing a "before and after" but the "before" is something that nobody uses.


Yeah talk about disingenuous.


I just watched the demo with the Apollo 11 transcript. (sidenote: maybe Gemini is named after the space program?).

Wouldn't the transcript, or at least a timeline of Apollo 11, be part of the training corpus? So even without the 400 pages in the context window, just given the drawing, I would assume a prompt like "In the context of Apollo 11, what moment does the drawing refer to?" would yield the same result.


Gemini is named that way because of the collaboration between Google brain and deep mind


Gemini is named after the spacecraft that put the second person into orbit - pretty aptly named, but not sure if this was the intention.


The second person in space was launched by MR-3 (Mercury, not Gemini): https://en.m.wikipedia.org/wiki/Timeline_of_space_travel_by_...


Google needs their Apollo.


Correct except that it spits out the timestamp


i asked chatgpt4 to identify three humorous moments in the apollo 11 transcript and it hallucinated all 3 of them (i think -- i can't find what they refer to). Presumably it's in its corpus, too.

> The "Snoopy" Moment: During the mission, the crew had a small, black-and-white cartoon Snoopy doll as a semi-official mascot, representing safety and mission success. At one point, Collins joked about "Snoopy" floating into his view in the spacecraft, which was a light moment reflecting the camaraderie and the use of humor to ease the intense focus required for their mission.

The "Biohazard" Joke: After the successful moon landing and upon preparing for re-entry into Earth's atmosphere, the crew humorously discussed among themselves the potential of being quarantined back on Earth due to unknown lunar pathogens. They joked about the extensive debriefing they'd have to go through and the possibility of being a biohazard. This was a light-hearted take on the serious precautions NASA was taking to prevent the hypothetical contamination of Earth with lunar microbes.

The "Mailbox" Comment: In the midst of their groundbreaking mission, there was an exchange where one of the astronauts joked about expecting to find a mailbox on the Moon, or asking where they should leave a package, playing on the surreal experience of being on the lunar surface, far from the ordinary elements of Earthly life. This comment highlighted the astronauts' ability to find humor in the extraordinary circumstances of their journey.


You probably should not use UUIDs in your database to begin with, at least not as an ID. UUIDv7 aims to solve some of the issues that make UUIDv4 even less suitable for databases. 99% of the time, using a BigInt for an ID is better.


There are some nice features of using UUIDs rather than ints. It's been written about before; a few off the top of my head: client-side generation of ids, and no risk of faulty joins (joining two tables on the wrong ids can never produce hits with UUIDs; with ints it can).

Those two suck for us right now (we're planning to move to UUIDs).


Uniqueness aside, UUIDs for public-facing IDs also prevent enumeration attacks and leaking business information other than timestamps.


> No risk of faulty joins

Wouldn't Snowflake IDs also solve that problem? A Snowflake ID will fit within a signed 64-bit int.

https://en.wikipedia.org/wiki/Snowflake_ID

The nice thing about a Snowflake ID is that you can encode it into 11 characters in base 62. If I have a UUID, I'm going to need 22 characters. Maybe that doesn't really matter given that 11 characters isn't something someone will want to be typing anyway and Snowflake IDs do require a bit of extra caution to make sure you don't get collisions (since the number you can make per second is limited to how big your sequence generation is).
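For illustration, here's a minimal base-62 encoder in Python showing that any 63-bit Snowflake-style ID fits in 11 characters. The alphabet ordering and bit layout are one common convention, not the only one:

```python
import string

# digits, then uppercase, then lowercase: 62 symbols total
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def b62_encode(n: int) -> str:
    """Encode a non-negative integer in base 62."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

# A Snowflake-style 63-bit ID: 41-bit ms timestamp | 10-bit machine | 12-bit sequence
ts_ms, machine, seq = 1_700_000_000_000, 42, 7
snowflake = (ts_ms << 22) | (machine << 12) | seq
encoded = b62_encode(snowflake)

# The largest 63-bit value still fits in 11 base-62 characters
assert len(b62_encode(2**63 - 1)) == 11
```

By contrast, a 128-bit UUID needs ceil(128 / log2(62)) = 22 base-62 characters, which matches the comparison above.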


The same idea, but this is an IETF standard.


The "faulty joins" can be solved by having a shared sequence for all tables. A bigint column should be enough for most use cases.


shared sequence for all tables can't be parallelized


Sorry but that’s terrible advice. I’ve worked on projects that started with integer ids and it caused nothing but problems.


What kind of problems did you encounter?


Not OP, but I can answer this:

Integers don't scale because you need a central server to keep track of the next integer in the sequence. UUIDs and other random IDs can be generated in a distributed fashion. There are many examples, but the first one that comes to mind is Twitter writing their own custom UUID implementation to scale tweets [0]

[0]: https://blog.twitter.com/engineering/en_us/a/2010/announcing...


> Integers don't scale because you need a central server to keep track of the next integer in the sequence.

They most assuredly do scale. [0]

Also, Slack is built on MySQL + Vitess [1], the same system behind PlanetScale, which internally uses integer IDs [2].

[0]: https://www.enterprisedb.com/docs/pgd/latest/sequences/#glob...

[1]: https://slack.engineering/scaling-datastores-at-slack-with-v...

[2]: https://github.com/planetscale/discussion/discussions/366


I get what you’re saying but this feels like a premature optimization that only becomes necessary at scale.

It reminds me a bit of the microservices trend. People tried to mimic big tech companies but the community slowly realized that it’s not necessary for most companies and adds a lot of complexity.

I’ve worked at a variety of companies from small to medium-large and I can’t remember a single instance where we wish we used integer ids. It’s always been the opposite where we have to work around conflicts and auto incrementing.


In the same vein, distributed DBs are not required for most companies (from a technical standpoint; data locality for things like GDPR is another story). You can vertically scale _a lot_ before you even get close to the limits of a modern RDBMS. Like hundreds of thousands of QPS.

I've personally run MySQL in RDS on a mid-level instance, nowhere near close to maxing out RAM or IOPS, and it handled 120K QPS just fine. Notably, this was with a lot of UUIDv4 PKs.

I'd wager with intelligent schema design, good queries, and careful tuning, you could surpass 1 million QPS on a single instance.


Auto-incrementing integers mean you're always dependent on a central server. UUIDs break that dependency, so you can scale writes up to multiple databases in parallel.

If you're using MySQL maybe integer ids make sense, because it scales differently than PostgreSQL.


If the DB fails to assign an ID, it's probably broken, so having an external ID won't help you.

If you're referring to avoiding conflicts between distributed nodes, that's a solved problem as well – distribute chunked ranges of size N to each node.
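As a toy sketch of that idea – a hypothetical in-process coordinator standing in for the real one; class and method names are made up:

```python
import itertools
import threading

class ChunkAllocator:
    """Coordinator that leases contiguous blocks of ids to nodes.
    Nodes only come back when their block runs out."""
    def __init__(self, chunk_size: int = 1000):
        self._next_chunk = itertools.count(0)
        self._chunk_size = chunk_size
        self._lock = threading.Lock()

    def lease_chunk(self) -> range:
        with self._lock:
            start = next(self._next_chunk) * self._chunk_size
        return range(start, start + self._chunk_size)

class Node:
    """A database node that mints ids locally from its leased chunk."""
    def __init__(self, coordinator: ChunkAllocator):
        self._coord = coordinator
        self._ids = iter(())  # empty until first lease

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            # One roundtrip per chunk, not per id
            self._ids = iter(self._coord.lease_chunk())
            return next(self._ids)

coord = ChunkAllocator(chunk_size=3)
a, b = Node(coord), Node(coord)
ids = [a.next_id(), a.next_id(), b.next_id(), a.next_id(), b.next_id()]
# Distinct across nodes, with no per-id coordination
assert len(set(ids)) == len(ids)
```

The tradeoff: ids are no longer globally monotonic (node b's first id is 3 while node a is still handing out 2), and a crashed node burns the rest of its leased chunk.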


Whatever is distributing the chunks is still a point of central coordination.


Yes, and?

If you can't manage minor levels of coordination because your database is on fire, the problem is that your database is on fire.


Fewer points of coordination is always better.

In general you shouldn't need to make a roundtrip to produce an ID.


> Fewer points of coordination is always better.

The distributed database needs a coordination system anyway, so it's not an additional point.

> In general you shouldn't need to make a roundtrip to produce an ID.

Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.


> The distributed database needs a coordination system anyway, so it's not an additional point.

Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.

> Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.

OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.


> Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.

I'm assuming a system that tracks nodes and checks for quorum(s), because if you let isolated servers be authoritative then your data integrity goes to hell. If you have that system, you can use it for low-bandwidth coordinated decisions like reserving blocks of ids.

Am I wrong to think that most distributed databases have systems like that?

> OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.

Sure, but the first thing you said in this conversation was "Whatever is distributing the chunks is still a point of central coordination." which is equally narrow, so I wasn't expecting you to suddenly broaden when I asked why that mattered.


> I'm assuming a system that tracks nodes and checks for quorum(s)

Not sure why.

> because if you let isolated servers be authoritative then your data integrity goes to hell

Many AP systems maintain data integrity without central authorities or quorums for data.

> Am I wrong to think that most distributed databases have systems like that?

No, not wrong! Just that it's one class of distributed systems, among many.


Though if you're running AP then I sure hope you have a reconciliation system, and a good reconciliation system can handle that kind of ID conflict. (Maybe you still want to avoid it to speed that process up but that really gets into the weeds.)


The way to solve that is giving each server its own range of IDs.


Yes, but with PostgreSQL (and any other SQL server I'm aware of) you already have a central server that can do that. If you have multiple SQL servers this won't work, obviously, unless you pair it with a unique server ID.


I recently worked on a data import project and because we used UUIDs I was able to generate all the ids offline. And because they’re randomly generated there was no risk of conflict.

This was nice because if the script failed half way through I could easily lookup which ids were already imported and continue where I left off.

The point is, this property of UUIDs occasionally comes in handy and it’s a life saver.


    postgres=# CREATE TABLE foo(id INT, bar TEXT);
    CREATE TABLE
    postgres=# INSERT INTO foo (id, bar) VALUES (1, 'Hello, world');
    INSERT 0 1
    postgres=# ALTER TABLE foo ALTER id SET NOT NULL, ALTER id ADD GENERATED 
               ALWAYS AS IDENTITY (START WITH 2);
    ALTER TABLE
    postgres=# INSERT INTO foo (bar) VALUES ('ACK');
    INSERT 0 1
    postgres=# TABLE foo;
     id |     bar
    ----+--------------
      1 | Hello, world
      2 | ACK
    (2 rows)


I don’t understand what you’re getting at. This was a pre-existing Postgres db in production.

I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.


You said data import, so I assumed it was pulling rows into an empty table. The example I posted was a way to create a table with a static integer PK that you could rapidly generate in a loop, and then later convert it to auto-incrementing.

> I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.

IME, if something is easy with RDBMS in prod, it usually means you’re paying for it later. This is definitely the case with UUIDv4 PKs.


No I mean an active prod table with people adding new rows all the time. It's just so much easier not having to worry about integer conflicts and auto-incrementing shenanigans.

But I get you like integers so whatever works for you, I just don't think they're the right tradeoff for most projects.


This doesn’t really help you in this case, because the patch is to generate the UUIDs in the database?


Now you can use PG to generate the UUIDv7 in the beginning, then easily switch to generating it in the client if you need to in the future. But I think OP was talking about UUIDs vs auto-incrementing integers in general, not specific to Postgres.
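For what it's worth, client-side UUIDv7 generation is only a few lines in most languages. Here's a minimal Python sketch following the RFC 9562 layout (48-bit millisecond timestamp, version, then random bits) – an illustration, not a vetted library:

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Minimal UUIDv7: 48-bit unix ms timestamp | ver(4) | rand_a(12) | var(2) | rand_b(62)."""
    ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80
    value |= 0x7 << 76        # version 7
    value |= rand_a << 64
    value |= 0b10 << 62       # RFC variant
    value |= rand_b
    return uuid.UUID(int=value)

a = uuid7()
time.sleep(0.002)
b = uuid7()
assert a.version == 7
assert a.int < b.int  # time-ordered: later ids sort after earlier ones
```

The time-ordered prefix is what makes v7 friendlier to B-tree indexes than v4; within the same millisecond, ordering is random.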


IMO it’s always been easier to generate them in the client. Every major platform has had libraries since forever.


They also leak information.


I encountered this once: If you use integer IDs, try to scale horizontally, and do not generate the IDs in the database, you'll get in deep trouble. The solution for us was to let the DB handle ID generation.


Yes, but the only sane way to generate integer IDs is in the database.


Here are some reasons for using UUIDs; not all apply to every business:

- client-side generation (e.g. can reduce complexity when doing complex creation of data on the client side, and then some time later actually inserting it into your db)

- sequential ids leak competitive information: https://en.wikipedia.org/wiki/German_tank_problem

- Global identification (being able to look up an unknown thing by just an id - very useful in log searching / admin dashboards / customer support tools)
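On the second point, the leak is concrete: with the classic German tank estimator, a handful of sequential ids is enough to estimate the table size. A quick sketch (the signup numbers are made up):

```python
def estimate_total(max_id_seen: int, samples: int) -> float:
    """Classic German tank estimator: m + m/k - 1,
    where m is the largest id observed and k is the sample count."""
    return max_id_seen + max_id_seen / samples - 1

# e.g. a competitor creates 5 accounts and the highest id seen is 14200:
# they now have a decent estimate of your total user count
assert round(estimate_total(14200, 5)) == 17039
```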


It's also much easier to merge data from different sources when they all use UUIDs for row identification.


I would never advise this. I use UUIDv4 for basically everything. It adds minimal overhead to small systems and adds HUGE benefits if/when you need to scale. If you need to sort by creation date, use a "created" column (or UUIDv7 if appropriate).

If your system ever becomes distributed you will sing the praises of whoever choose UUID over an int ID, and if it never becomes distributed UUID won't hurt you.

Note: this is for web systems. If it's embedded systems then the overhead starts to matter and the usefulness of UUID is probably nil.


It is worth mentioning that the reason UUIDv4 is strictly forbidden in some large decentralized systems is the myriad cases of collisions because the "random number" wasn't quite as random as people thought it was. Far too many cases of people not using a cryptographically strong RNG, both unwittingly or out of ignorance that they need to.

Less of an issue if you have total control of the operational environment and code base, but that is not always the case.


How does this happen? Are people implementing UUIDv4 themselves using rand() or equivalent? Or have widely used UUIDv4 libraries had such bugs?


It comes in a couple common flavors. Most commonly it is people just rolling their own implementation and using a PRNG or similar. Not every environment has a ready-made UUIDv4 implementation, and not all UUIDv4 implementations in the wild are strict. A rarer horror story I've heard a couple times is discovering that the strong RNG provided by their environment is broken in some way. Both of these cases are particularly problematic because they are difficult to detect operationally until something goes horribly wrong.

The main reason non-probabilistic UUID-like types are used for high-reliability environments is that it is easy to verify the correctness of the operational implementation. It isn't that difficult to deterministically generate globally unique keys in a distributed system unless you have extremely unusual requirements.


It adds a lot of overhead at any scale, it’s just that the overhead is hidden due to the absurd speed of modern hardware.

I’ll again point out (I said this elsewhere in a post today on UUIDs) that PlanetScale uses int PKs internally. [0] That is a MASSIVE distributed system, working flawlessly with integers as keys. They absolutely can scale, it just requires more thoughtful data modeling and queries.

[0]: https://github.com/planetscale/discussion/discussions/366


GitHub also uses int PKs and has over 100,000,000 users.


One reason you might not want to use integers for things like user ids is that you may leak the magnitude of your userbase.


Huh, why would using uuidv4 be a problem? Collisions?


My understanding is that they cause a lot of page fragmentation, which leads to excessive writes to the WAL
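A toy model of why: treat the index as a sorted array split into fixed-size pages and count how many pages an insert batch dirties. This is a simplification, not a real B-tree, but it shows the locality difference:

```python
import bisect
import random

PAGE = 64  # keys per page (toy number; real pages hold more)

def dirtied_pages(new_keys, existing_sorted):
    """Count distinct pages touched while inserting new_keys
    into a sorted array modelling a clustered index."""
    keys = list(existing_sorted)
    pages = set()
    for k in new_keys:
        pos = bisect.bisect(keys, k)
        pages.add(pos // PAGE)
        keys.insert(pos, k)
    return len(pages)

random.seed(1)
base = list(range(100_000))
# Sequential bigint-like keys: all inserts land on the rightmost pages
sequential = dirtied_pages(range(100_000, 100_512), base)
# UUIDv4-like keys: inserts scatter uniformly across the whole index
scattered = dirtied_pages([random.uniform(0, 100_000) for _ in range(512)], base)
assert scattered > sequential
```

Every dirtied page is a page that must be rewritten (and WAL-logged), which is where the write amplification comes from.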


You can't use them as cursors, because they are not inherently ordered like integer ids.


IDs are one thing, cursor-able fields/columns are a different thing.

You cursor on timestamps, or serial numbers, or etc., not IDs.


I have never wanted to use database cursors and I predict that I never will.


Sure, if you don't offer pagination or only have small tables, you can get away with offsets. I tend to go for cursors as a default because I like to build applications with performance in mind and it’s the same effort.


We may be talking about different things. I thought you were referring specifically to database cursors (https://en.wikipedia.org/wiki/Cursor_(databases)), so that's what I was talking about. If you're talking about something else, like the concept of so-called "cursor-based pagination" in general, then that is still an option even with randomly-generated primary keys, so long as there are other attributes that can be used to establish an order (which need not be visible to the user).


I have offered pagination over large tables, without database cursors or non-random keys, without offsets, while keeping performance in mind, with little effort.


How about the rest of us?

I don't have a resource off the top of my head to point you to, but at the least, keyset pagination is superior to offset pagination because it does not get invalidated by new inserts.
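A runnable sketch of keyset pagination using SQLite's row-value comparison (table and column names are made up; any RDBMS that supports row values works the same way). The cursor is the (created, id) tuple of the last row seen, so random primary keys are fine as long as a timestamp breaks the order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, created INTEGER NOT NULL)")
# Duplicate timestamps on purpose: the id column breaks ties deterministically
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(f"id-{n:03d}", n // 2) for n in range(10)],
)

def page(after=(-1, ""), size=4):
    """Fetch one page; `after` is the (created, id) keyset cursor."""
    return conn.execute(
        "SELECT id, created FROM events "
        "WHERE (created, id) > (?, ?) "
        "ORDER BY created, id LIMIT ?",
        (*after, size),
    ).fetchall()

first = page()
second = page(after=(first[-1][1], first[-1][0]))
assert [r[0] for r in first] == ["id-000", "id-001", "id-002", "id-003"]
```

Unlike OFFSET, rows inserted before the cursor can't shift a page's contents, and the query can use the index instead of scanning and discarding skipped rows.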


The rest of you also don't need database cursors or non-random keys to have keyset pagination.


When working with distributed systems, or compiling IDs from different systems, it is helpful to make sure the ID is unique.


That's especially true if you care about performance and do a lot of joins, the hit can be over 10%.


Aren't there already tons of apps answering that specific question? I think the strength of this approach is answering the non-obvious questions.

