More

oautholaf · on Aug 27, 2024

As a former Eazel employee, I can say that indeed Eazel did not get acquired by Apple.

As the Wikipedia page states, a sizable pool of people went to work on Safari 1.0 (and some are still working on Safari). Others went to Apple to work on the Finder or Core Graphics.

Another big chunk of people went to Danger to work on the T-Mobile Sidekick.

But the company shut down. No one was left besides the CFO.

oautholaf · on May 31, 2024

This is totally a thing. In the late 90s when I worked at Microsoft and then at startups, QA was made up of full time employees whose leadership had input into the product process.

Today at my large tech company, QA is mostly contract employees validating test plans that the normal engineers author. Zero autonomy or ownership offered.

JoshTriplett · on May 31, 2024

Also, there's a huge difference between teams where QA is seen as a stepping stone on the path to graduating into "real" development, versus teams where QA is viewed as a critically important role that's valuable in itself. There needs to be a role for senior QA engineers, because if all your junior QA folks are looking to "graduate" to development you won't get expert QA folks who really care about QA.

I've seen what great senior QA engineers can do. Proactive approaches to testing; integrating new approaches to testing; influence on architecture and design to make codebases more robust; optimizing testsuites so they can run more often; better capture of long-tail errors from production; design and implementation of scratch infrastructure to test more things before production...

hinkley · on June 1, 2024

QA people are very bimodally distributed. Excluding managers, 90% of the best and the worst coworkers I’ve ever had have been QA. If you have drive and focus you can be amazing. If you have neither then you’re an albatross around the team’s neck.

It’s a very passive aggressive way to root out a problem in an organization by just removing it.

To your comment about being a path to coding: if you can code well and test well, you should skip entirely over being a software dev and go into security consulting. Instead of a 40% pay bump you could be looking at an 80% pay bump. What is a red team member but a coder with the suspicious mind of a QA person?

oautholaf · on May 31, 2024

Yeah I completely agree. In the era where people could build a career in this area, they developed skills and brought insight that made the whole product better. Automated testing and SRE/DevOps reliability that I see focused on today do not fill in the gaps.

oautholaf · on May 24, 2024

They seem to be a common part of the Airforce One entourage. When the President is in the SF Bay Area, I see these commonly fly over residential areas. Given their history and the apparent unexplained failure, this has always struck me as unwise.

oautholaf · on April 26, 2024

Have to say, I bought a DJI drone that I enjoyed, but then the app stopped working after my Android phone upgraded to a new OS version. Even months later, it appeared that DJI had made no effort to fix this issue (which clearly did not affect only me). There are open source apps, but the existing open source apps at the time did not control a specific unique and important feature on the drone I bought.

Never buying DJI again.

oautholaf · on Jan 8, 2023

Not reset, reprioritize. Fred Vogelstein only talked to one source who was on the Android team for his book, and that source may have had their own agenda.

Other sources, such as Chet Haas's book, make it clear that what became the G1 was already on the roadmap. It was just prioritized.

oautholaf · on Sept 29, 2022

In my experience, a working strategy for handling signout or revocation for statically verifyable tokens like JWT is straightforward:

- Clear client side state where you can. - Write signed out/expired tokens to something with a cheap heavy read/eventual consistency model - Fail to signed in if unavailable - Acknowledge that you are gaining latency/availability/ lower costs by trading some precision

I am aware of a very large website most folks use every day that did this for more than a decade and it worked fine.

jdthedisciple · on Sept 29, 2022

great idea! I'm using JWT in one of my projects and still unsure how to fix the irrevocability of JWT while keeping them stateless. But this seems like a nice intermediate solution

oautholaf · on Dec 22, 2021

I worked for a while on a well-known product that used (and perhaps still uses) WebSockets for its core feature. I very much agree with the bulk of the arguments made in this blog post.

In particular, I found this:

- Our well-known cloud hosting provider's networks would occasionally (a few times a year) disconnect all long-lived TCP sockets in an availability zone in unison. That is, an incident that had no SLA promise would cause a large swath of our customers to reconnect all at once.

- On a smaller scale, but more frequently: office networks of large customers would do the same thing.

- Some customers had network equipment that capped the length of time of that a TCP connection could remain open, interfering with the preferred operation

- And of course, unless you do not want to upgrade your server software, you must at some point restart your servers (and again, your cloud hosting provider likely has no SLA on the uptime of an individual machine)

- As is pointed out in the article, a TCP connection can cease to transmit data even though it has not closed. So attention must be paid to this.

If you use WebSockets, you must make reconnects be completely free in the common case and you must employ people who are willing to become deeply knowledgeable in how TCP works.

WebSockets can be a tremendously powerful tool to help in making a great product, but in general they are almost always will add more complexity and toil with lower reliability.

(edited typos)

bri3d · on Dec 22, 2021

I built several large enterprise products over WebSockets. I didn't find it that bad.

Office networks that either blocked or killed WebSockets were annoying. For some customers they were a non-starter in the early 2010s, but by 2016 or so this seemed to be resolved.

Avoiding thundering herd on reconnect is a very explored problem and wasn't too bad.

We would see mass TCP issues from time to time as well, but they were pretty much no-ops as they would just trigger a timeout and reconnect the next time the user performed an operation. We would send an ACK back instantly (prior to execution) for any client requested operation, so if we didn't see the ACK within a fairly tight window, the client could proactively reap the WebSocket and try again - customers didn't have to wait long to learn a connection was alive and unclosed.

> If you use WebSockets, you must make reconnects be completely free in the common case

I agree with this, or at least "close to completely free." But in a normal web application you also need to make latency and failed requests "close to completely free" as well or your application will also die along with the network. This is the point I make in my sibling comment - I think distributed state management is a hard problem, but WebSockets are just a layer on top of that, not a solution or cause of the problem.

> you must employ people who are willing to become deeply knowledgeable in how TCP works.

I think this is true insofar as you probably want a TCP expert somewhere in your organization to start with, but we never found this particularly complicated. Understanding that the connection isn't trustworthy (that is, when it says it's open, that doesn't mean it works) is the only important fundamental for most engineers to be able to work with WebSockets.

playing_colours · on Dec 23, 2021

> Avoiding thundering herd on reconnect is a very explored problem and wasn't too bad.

Can you please share approaches to mitigate this issue?

dcuthbertson · on Dec 23, 2021

As rakoo said, exponential backoff mitigates the thundering herd. I was going to say add some jitter to the time before reconnecting, then I realized rakoo already said "after a random short time", which is exactly what jitter is. (edited for coffee kicking in)

malandrew · on Dec 23, 2021

Congestion avoidance algorithms such as TCP Reno and TCP Vegas. basically code clients to back off if they detect a situation where they may be a member of a thundering herd.

rakoo · on Dec 23, 2021

Exponential back off. Basically try to reconnect after a random short time, if that doesn't work try with a time twice longer, then twice again, etc..

slaymaker1907 · on Dec 24, 2021

Usually you want the 2x wait to be a random time between 1.5x and 2x longer or something.

apitman · on Dec 22, 2021

> Office networks that either blocked or killed WebSockets were annoying

Curious how did they detect WS usage? Were you running on HTTP or did they just kill any long-lived TCP connection? Root certs?

bri3d · on Dec 22, 2021

No, we always ran on TLS. There were a few classes of these:

* Filtering MITM application firewall solutions which installed a new trusted root CA on employee machines and looked at the raw traffic. These would usually be configured to wholesale kill the connection when they saw an UPGRADE because the filtering solutions couldn't understand the traffic format and they were considered a security risk.

* Oldschool HTTP proxy based systems which would blow up when CONNECT was kept alive for very long.

* Firewalls which killed long-lived TCP connections just at the TCP level. The worst here were where there was a mismatch somewhere and we never got a FIN. But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

We also tried running WebSockets on a different port for awhile, which was not a good idea as many organizations only allowed 443.

throwaway81523 · on Dec 22, 2021

> But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

I found the best way to handle this was with an application level heartbeat. That bypassed dealing with any weirdness of the client firewalls, TCP spoofing, etc.

mmis1000 · on Dec 23, 2021

Something like ping every 30 seconds and say goodbye to the socket if we don't receive 2 seems work reasonably well.

And it also prevent most idle killing base tcp disconnect from happening.

And even if some network is so dumb that decides to kill it under 30s, it is a non issue as that network won't be even usable in normal means. (How do you download any big file if it always disconnect instantly?)

xyzzy_plugh · on Dec 22, 2021

> disconnect all long-lived TCP sockets in an availability zone in unison

I don't know what this means, but it sounds ridiculous. This would cause havoc with any sort of persistent tunnel or stateful connection, such as most database clients. Do you perhaps mean this just happens at ingress? That is much more believable and not as big of a deal.

> office networks of large customers would do the same thing.

Sounds like a personal problem. In all seriousness, your clients should handly any sort of network disconnect gracefully. It's foolish to assume TCP connections are durable, or to assume that you won't be hit by a thundering herd.

Maybe I'm old fashioned but TCP hasn't changed much over the years, none of these problems are novel to me, it's well-trodden ground and there are many simple techniques to building durable clients.

Also, all of the things you mention also affect plain old HTTP, especially HTTP2. There shouldn't be a significant difference in how you treat them, other than the fact you cannot assume they're all short lived connections.

oautholaf · on Dec 22, 2021

Most applications written using HTTP, in my experience, do not have deep dependencies on the longevity of the HTTP2 connection. In my experience, TCP connections for HTTP2 are typically terminated at your load balancer or similar. So reconnections here happen completely unseen by either the client application in the field or the servers where the business logic is.

For us -- and I think this is common -- the persistent WebSocket connection allowed a set of assumptions around the shared state of the client and server that would have to be re-negotiated when reconnecting. The fact that this renegotiation was non-trivial was a major driver in selecting WebSockets in the first place. With HTTP, regardless of HTTP2 or QUIC, your application protocol very much is set up to re-negotiate things on a per-request basis. And so the issues I list don't tend to affect HTTP-based applications.

xyzzy_plugh · on Dec 22, 2021

> the persistent WebSocket connection allowed a set of assumptions around the shared state of the client and server that would have to be re-negotiated when reconnecting. The fact that this renegotiation was non-trivial was a major driver in selecting WebSockets in the first place. With HTTP, regardless of HTTP2 or QUIC, your application protocol very much is set up to re-negotiate things on a per-request basis. And so the issues I list don't tend to affect HTTP-based applications.

I think this describes a poor choice in technology. There's no silver bullet here, and it sounds like you made a lot of questionable tradeoffs. Assuming that "session" state persists beyond the lifetime of either the client or the server is generally problematic. It's always easier for one party to be stateless, but you can become stateful for the duration of the transaction.

Shared state is best used as communications optimization, and maybe sometimes useful for security reasons.

Dylan16807 · on Dec 23, 2021

> Assuming that "session" state persists beyond the lifetime of either the client or the server is generally problematic.

I don't think you're interpreting the problem right? The state is tied to the connection, not outliving client or server. But it outlives single requests, and would be uncomfortably expensive to re-establish per request.

xyzzy_plugh · on Dec 23, 2021

I'm saying is that it's unrealistic to expect to hold a persistent TCP connection for an extended period of time across networking environments you do not control.

Making things not uncomfortably expensive is a good idea.

Relying on websockets to solve this for you is a mistake. It's convenient, but not robust. How would you solve it without websockets using traditional HTTP? The same solution should be used with websockets, but unlocks tremendous opportunities for optimization.

Dylan16807 · on Dec 23, 2021

> How would you solve it without websockets using traditional HTTP?

You'd probably do the uncomfortably expensive setup, then give the client a token and store the settings in a database. And then do your best to cache it and have fast paths to reestablish from the cache on the same server or on different servers.

Not only could this add a lot of complication, now you've actually introduced the problem of state outliving your endpoints! You do unlock new ways to optimize, but you pay a high cost to get there. There's a very good chance this rearchitecture is a bad idea.

xyzzy_plugh · on Dec 23, 2021

Sure. Look I'm not advocating for any particular solution here, just trying to point out the hopefully obvious fact that websockets are not a silver bullet. You've basically described why websockets unlock optimizations, which was my point.

Nothing in the GP's post is novel to websockets. Session based resource management is difficult, doubly so for long lived sessions. Relying on websockets to magically make that easy is foolish.

> Not only could this add a lot of complication, now you've actually introduced the problem of state outliving your endpoints!

I only want to point out that this is true with websockets as well, so I find this argument unconvincing. For websockets, what do you do when re-establishing a connection? You start anew or find the existing session. What if the client suddenly disappears without actively closing the connection? You have some sort of TTL before abandoning the session.

It's the exact same problem either way.

tyingq · on Dec 22, 2021

>Sounds like a personal problem. In all seriousness, your clients should handly any sort of network disconnect gracefully

That can be complex. Corporate MITM filtering boxes, "intrusion detection" appliances, firewalls, etc, can just decide to drop NAT entries, drop packets, break MTU path discovery, etc. Yes, there are things you can do. But then customers restart/reload when things don't happen instantly, etc. I don't know that there's a simple playbook.

inopinatus · on Dec 22, 2021

None of this is particular to websockets, and in addition:

> you must employ people who are willing to become deeply knowledgeable in how TCP works

You already needed that for your HTTP based application; it's a fundamental of networked computing. Developers skipping out on mechanical sympathy are often duds, in my experience.

starik36 · on Dec 23, 2021

> employ people who are willing to become deeply knowledgeable in how TCP works

I used Microsoft's SignalR library. It knows TCP pretty well and handles most of the common pitfalls nearly automatically.

> customers to reconnect all at once.

That is definitely a problem. So we had to code it from the get go with the assumption that either the network will go down or the server will be bounced for an upgrade.

Actually most of the issues I encountered had to do with various iPad versions going to sleep and then handling WebSockets in different ways once it woke up.

tata71 · on Dec 23, 2021

Hosted? How are your costs? I hear this catches people sometimes.

Any other advice for SignalR?

api · on Dec 22, 2021

Quic is the right idea. Encrypt everything including state. Kill middle boxes.

History has shown that if you allow middle boxes they will ruin everything.

vlovich123 · on Dec 22, 2021

> - Our well-known cloud hosting provider's networks would occasionally (a few times a year) disconnect all long-lived TCP sockets in an availability zone in unison. That is, an incident that had no SLA promise would cause a large swath of our customers to reconnect all at once.

I’m kind of surprised that it was that infrequent. I would expect software upgrades should cause long-lived sockets to reset…

inopinatus · on Dec 22, 2021

or a scale-up of an ELB

alexellisuk · on Dec 23, 2021

> - Some customers had network equipment that capped the length of time of that a TCP connection could remain open, interfering with the preferred operation

What's the alternative that's going to work here?

oautholaf · on Oct 2, 2021

Sort of feel it would really be helpful to replace the narrative of the creation of the modern smartphone ("the iPhone sprung forth from the head of Jobs!") with a picture of the decade+ of efforts made by the computer industry to make consumer electronics.

Note also that this article mentions the Danger Hiptop and folks who worked on it. Some folks who worked on that worked on iPhone 1.0 and some even still work on iPhones!

bch · on Oct 3, 2021

Of course Andy Rubin[0] from Danger (previously also General Magic as mentioned in article) went on to found Android as well - and Android itself owes (in a way) it’s own name to Apple (Andy’s nickname at Apple was apparently “Android”.)

What’s also interesting to me is that the Danger used NetBSD in at least some it’s projects, which is an OS dear to my heart. Always nice to see some of the “alternate takes”, “near misses”, or “could have been…” technologies on the past road to where we happen to be now.

[0] https://en.wikipedia.org/wiki/Andy_Rubin

oautholaf · on Oct 3, 2021

The original Danger OS was home grown by some of the same folks that later started Zircon!

CharlesW · on Oct 3, 2021

> Sort of feel it would really be helpful to replace the narrative of the creation of the modern smartphone ("the iPhone sprung forth from the head of Jobs!") with a picture of the decade+ of efforts made by the computer industry to make consumer electronics.

The thing is that many other companies could've made an iPhone before a Jobs-led Apple, just like other people could've run a 4-minute mile before Roger Bannister. But as a hardware + software + marketplace system, it truly was a discontinuous leap.

If you want to read an article that goes the other way and maybe overstates General Magic's contributions, you might also enjoy this oral history: https://nymag.com/intelligencer/2018/08/general-magic-oral-h...

tpmx · on Oct 2, 2021

There are of course literally hundreds of products that led the way, in some way or another.

oautholaf · on Sept 30, 2021

They had to refuse. They needed to grow their talent pool. The others were focused on defending against msft.

oautholaf · on Sept 30, 2021

Sorry, Android was well underway before this agreement was in force. And I can tell you key members of the original Safari team are still at Apple 20 years on.