Hacker News | trishume's comments

Oooh nice! Your Kyria posts are actually where I first learned about how awesome and cheap SendCutSend is, and got some of the inspiration for the magnets.

I actually ordered a plain steel plate first, but I realized that, since I needed them in a more compact position for travel than the wide position I like for typing, I wanted them to snap into consistent positions so it wasn't finicky to line them up exactly how I liked.


The latency numbers they state seem achievable or beatable with Infiniband, Amazon's EFA, or TCPDirect. 2us round-trip is achievable for very simple systems. If this kind of networking sounds good to you, you can buy it today! It's even available on AWS, Azure and Oracle Cloud (but not GCP yet AFAIK).


Latency measurements are tricky; the usual benchmarks kind of suck and aren't predictive of actual performance in real systems under load.

Given that the entire Myrinet team went to work for Google, and the InfiniPath microarchitecture can be discovered by reading the device driver and some open source code, I'm pretty sure Google's team was well aware of what has been done in the recent past.


Thank you upvoters! I wonder why my other comment has so many downvotes, when it's just as relevant as this one.


This is a really cool example of tree diffing via path finding. This was the approach I used when I did tree diffing, and sure enough it looks like this was inspired by autochrome, which was in turn inspired by my post (https://thume.ca/2017/06/17/tree-diffing/).

I'm curious exactly why A* failed here. It worked great for me, as long as you design a good heuristic. I imagine it might have been complicated to design a good heuristic with an expanded move set. I see autochrome had to abandon A* and has an explanation of why, but that explanation shouldn't apply to difftastic I think.
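
For concreteness, here's a toy Python sketch of what I mean by diffing as path finding with A* (this isn't difftastic's or my actual code, just the general idea on flat token sequences): nodes are positions in both sequences, moves are insert/delete/match, and the heuristic is the remaining-length difference, which is an admissible lower bound on the edits left.

```python
import heapq

def astar_diff(a, b):
    """Minimal insert/delete diff cost via A* over the edit graph.

    Nodes are (i, j) positions; moves are: match equal tokens (cost 0),
    delete a[i] (cost 1), insert b[j] (cost 1). The heuristic
    |remaining_a - remaining_b| is admissible and consistent, since any
    leftover length difference must be paid for with one edit each.
    """
    goal = (len(a), len(b))
    h = lambda i, j: abs((len(a) - i) - (len(b) - j))
    frontier = [(h(0, 0), 0, (0, 0))]
    best = {(0, 0): 0}
    while frontier:
        _, cost, (i, j) = heapq.heappop(frontier)
        if (i, j) == goal:
            return cost  # minimal number of insertions + deletions
        if cost > best.get((i, j), float("inf")):
            continue  # stale heap entry
        moves = []
        if i < len(a) and j < len(b) and a[i] == b[j]:
            moves.append((0, (i + 1, j + 1)))  # free diagonal match
        if i < len(a):
            moves.append((1, (i + 1, j)))      # delete from a
        if j < len(b):
            moves.append((1, (i, j + 1)))      # insert from b
        for step, nxt in moves:
            if cost + step < best.get(nxt, float("inf")):
                best[nxt] = cost + step
                heapq.heappush(
                    frontier, (cost + step + h(*nxt), cost + step, nxt))

print(astar_diff("abcf", "acdf"))  # 2: delete 'b', insert 'd'
```

The better the heuristic lower-bounds the true remaining cost, the fewer nodes A* expands; with a zero heuristic this degrades to Dijkstra.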


I think (maybe I’m wrong) that your graph searches correspond to diffing single lists, where you can have an expensive diagonal step to recurse into two sublists, whereas the tool in this post has extra nodes for every token and extra edges for inserting/deleting delimiters. That seems to be the biggest difference to me, and I guess it’s what you mean by it being complicated to design a good heuristic for the expanded move set. I agree it sounds complicated. My guess was that bigger graphs would make things harder, but that isn’t a reason for A* to fail.


I really hope he can work with cloud vendors and Intel to make Processor Trace a more popular and easier to use capability.

It's unfortunate how https://github.com/janestreet/magic-trace and PMUs in general can't be used by lots of people using cloud VMs.


Yes, getting PMCs enabled in VMs was just the start, I think the next hardware capabilities to enable are:

  - PEBS (Precise/Processor event based sampling, so that we can accurately get instruction pointers on PMC events)
  - uncore PMCs (in a safe manner)
  - LBR (last branch record, to aid stack walking)
  - BTS (branch trace store, " ")
  - Processor trace (for cycle traces)
Processor trace may be the final boss. We've got through level 1, PMCs, now onto PEBS and beyond.


Can this be safely/efficiently virtualized? I love using these tools, but post-Spectre I could understand people being hesitant to expose more internal "state" (i.e. technically unique to a VM, but only one processor bug away from kaboom?).

Congrats on the job.


Thanks! We have to work through each capability carefully. Some won't be safe, and will be available on bare-metal instances only. That may be ok, as it fits with the following evolution of an application (this is something I did for some recent talks):

  1. FaaS
  2. Containers
  3. Lightweight VMs (e.g., Firecracker)
  4. Bare-metal instances
As (and if) an application grows, it migrates to platforms with greater performance and observability.

The ship has sailed on neighbor detection BTW. There are so many ways to know you're a VM with neighbors that disabling PMCs for that reason alone doesn't make sense.


> The ship has sailed on neighbor detection BTW.

In the crudest sense of "do I have a neighbour", sure. Of course, that's hardly secret -- if you're in EC2 you can just count your CPUs to figure that out.

But there's more questions you can ask:

1. Is my neighbour busy right now?

2. Is my neighbour a busy web server, a busy database, or a busy application server?

3. Is my neighbour hosting Brendan's website?

4. Is my neighbour hosting Brendan's website and he's logged in writing a blog post in vi right now?

5. What's Brendan writing right now?

It's not immediately clear which of these questions can be answered using certain capabilities! Few people would have guessed prior to 2005 that you could read text off someone's screen using hyperthreading, for example. (Pretty simple, although I don't know if anyone has published exploit code for it: just look at which cache lines are fetched when glyphs are fetched to render to the screen.)


Congrats man, it sounds like a dream job for you. It will be fun to follow your blog at your next job. Thanks again for sharing everything that you do, it is so incredibly humbling and such a great learning experience.


On AMD systems, many hardware performance counters are locked behind BIOS flags/configuration.

I admit that I don't know how Intel works, but disabling these performance counters at startup should be sufficient to address any potential security problem.

I'd expect that only development boxes (maybe staging?) would be interested in performance counters anyway. Maybe the occasional development box could be set up for performance sampling and collecting these counters, but not all production boxes need to run with performance counters on.


No, I want these performance counters everywhere. Obviously I know they can be disabled, but that doesn't really help.

I also really want them in CI but that might be a long way away.


Being able to collect performance data from production boxes is invaluable.


Yes, getting LBR data from production workloads is the whole ballgame for AutoFDO/SamplePGO and BOLT/Propeller. You cannot access the LBR on any EC2 machine short of a "metal" instance.


When it comes to PGO (vs. profiling the whole system) though it's worth noting that a lot of the speedup comes from things which are too trivial for us humans to consider.

When I profiled the D compiler with and without PGO enabled, it became obvious that a lot of the speedup of PGO comes basically just from running the program; the choice of test cases made almost no difference.


> not all production boxes need to be run with performance-counters on.

Production is exactly the place where you want full performance counter support, all the time, everywhere, on every machine.


Right. That's all good, but the important question is: what will your desk look like at Intel?[1]

1. Meta: https://twitter.com/brendangregg/status/1515482126871044098


One question: are you hiring?


Have you seen my Xi CRDT writeup from 2017 before? https://xi-editor.io/docs/crdt-details.html

It's a CRDT in Rust and it uses a lot of similar ideas. Raph and I had a plan for how to make it fast and memory efficient in very similar ways to your implementation. I think the piece I got working during my internship hits most of the memory efficiency goals like using a Rope and segment list representation. However we put off some of the speed optimizations you've done, like using a range tree instead of a Vec of ranges. I think it also uses a different style of algorithm without any parents.
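
To illustrate the segment-list idea (a toy Python sketch, not Xi's actual Rust representation): instead of storing per-character metadata like a deleted flag, you coalesce runs of characters in the same state into run-length segments, so memory scales with the number of edits rather than the number of characters.

```python
class SegmentList:
    """Run-length list of (length, deleted) segments over the full
    character history, instead of one flag per character."""

    def __init__(self, length):
        self.segs = [(length, False)]  # one live run covering everything

    def delete(self, start, count):
        """Mark absolute range [start, start+count) as deleted,
        splitting and re-coalescing runs as needed."""
        out, pos = [], 0
        for length, dead in self.segs:
            lo, hi = max(pos, start), min(pos + length, start + count)
            if lo < hi:  # this run overlaps the deleted range: split it
                for a, b, d in ((pos, lo, dead),
                                (lo, hi, True),
                                (hi, pos + length, dead)):
                    if b > a:
                        out.append((b - a, d))
            else:
                out.append((length, dead))
            pos += length
        # Coalesce adjacent runs that share the same state.
        self.segs = []
        for length, dead in out:
            if self.segs and self.segs[-1][1] == dead:
                self.segs[-1] = (self.segs[-1][0] + length, dead)
            else:
                self.segs.append((length, dead))

    def visible_len(self):
        return sum(l for l, dead in self.segs if not dead)

s = SegmentList(100)
s.delete(10, 5)
s.delete(15, 5)          # adjacent deletes coalesce into one run
print(s.segs)            # [(10, False), (10, True), (80, False)]
print(s.visible_len())   # 90
```

A real text CRDT carries more per-segment state than one bool (undo groups, which peers have seen the deletion, etc.), but the run-length compression trick is the same.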

We never finished the optimizing and polished it up, so it's awesome that there's now an optimized text CRDT in Rust people can use!


Oooohhhh no I haven’t read that - thanks for the link! I feel embarrassed to say this but I knew about Xi editor years ago but I totally forgot to go read & learn about your crdt implementation when I was learning about Yjs and automerge and others. I’ll have a read.

And thanks for writing such an in depth article. It’s really valuable going forward. Maybe it’s addressed in your write up but are there any plans for that code, or has everyone moved on? I’d love to have a zoom chat about it and hear about your experiences at some point if you’d be willing.


Out of curiosity, what do you use to make those diagrams?


https://www.figma.com/ and putting a lot of effort into them


This is awesome. In theory you could absolutely minimize the latency penalty to just the overhead of the gpu1->memory->gpu2 copy, if the display sync signals from the display the passthrough window was on were passed through to the GPU driver on Windows, and that was combined with fullscreen compositor bypass (available on many Linux WMs) or low-latency compositing (available on sway and now mutter https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1762 on Wayland).


I really hope we get more technical information on how Lumen and Nanite work, and additionally that Epic doesn't patent the techniques in either of them. A patent on either would make me so sad; 20 years is really long in software. Absent Epic's amazing work, I expect we would have something else like it within about 3 years, given what we've seen in things like http://dreams.mediamolecule.com/.


A lot of information has been released here: https://docs.unrealengine.com/5.0/en-US/RenderingFeatures/Na... and here https://docs.unrealengine.com/5.0/en-US/RenderingFeatures/Lu...

There is also a source code release if you want to dive into that level of detail: https://github.com/EpicGames/UnrealEngine/releases/tag/5.0.0...

I don't think the released details are that surprising to those working on realtime computer graphics, but the engineering details and tradeoffs are certainly interesting. Epic has the budget and business case to allocate a team, including some of the best graphics engineers in the industry, to do R&D for over a year to make this a reality.


So Nanite is just traditional LOD baking implemented in a holistic and automatic way?

The major difference seems to be they've done the work end to end to handle all the occlusion corner cases as well as a sophisticated mesh and texture streaming implementation that targets modern SSDs.


It's not traditional LOD baking. There's no LOD baking. It's the new rasterization system doing the whole work.

And most of us don't use fast SSDs like the PS5's, yet it works really well. The engineers have also said it works just fine with slower HDDs, because they don't stream meshes on every camera movement; it's a continuous setup.


I’m not an expert on this, but there seems to be a custom GPU renderer optimized for dense triangle meshes, with its own occlusion pass. The LOD is also calculated per cluster, multiple per mesh, with a way to fix seams between clusters at different levels. This works best with very dense meshes such as those from photogrammetry or ZBrush sculpts.


Standard Fenwick trees can only do prefix sums, which only get you general range queries on things with a subtraction operator, not operations like maximum.

The reddit comment I link contains an implementation that allegedly does arbitrary range queries, but it's nigh-incomprehensible so I can't tell how or why it uses 3 arrays.
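
To make the subtraction point concrete, here's a minimal Fenwick tree sketch in Python. `range_sum` only works because addition has an inverse; there's no "un-max", so the same trick can't give you range maximum.

```python
class Fenwick:
    """Standard Fenwick/binary-indexed tree: point update and
    prefix sum, both O(log n), in a single flat array."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)  # 1-indexed

    def add(self, i, delta):
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i  # step to the next node covering index i

    def prefix(self, i):
        """Sum of elements 1..i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # drop the lowest set bit
        return total

    def range_sum(self, lo, hi):
        # Only works because sum has an inverse: a range is the
        # difference of two prefixes. max() has no inverse, so a
        # standard Fenwick tree can't answer range-max this way.
        return self.prefix(hi) - self.prefix(lo - 1)

f = Fenwick(8)
for i, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6], start=1):
    f.add(i, v)
print(f.range_sum(3, 6))  # 4 + 1 + 5 + 9 = 19
```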


I see, yeah, I can't help you there either. I don't see how a tree-based approach would ever need more than twice the amount of space.


Cool! I thought about using skip lists a bunch before I settled on this, trying to think of various ways to reduce complexity and memory usage. My best skip lists designs still had some pointer overhead that the implicit approach avoids, but it was pretty small and they seemed reasonably simple. I briefly tried thinking of what an implicit skip list would be, but then just ended up thinking about implicit search trees.
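
As a sketch of the general implicit-tree idea (this is the textbook Eytzinger/BFS layout, not necessarily the exact design either of us landed on): node i's children live at indices 2i and 2i+1, so the tree structure is pure arithmetic and there are no pointers at all.

```python
def eytzinger(sorted_vals):
    """Lay out sorted values in implicit BFS (Eytzinger) order.
    An in-order walk of the implicit tree assigns the sorted values,
    making every node's left subtree smaller and right subtree larger."""
    out = [None] * (len(sorted_vals) + 1)  # 1-indexed; out[0] unused
    it = iter(sorted_vals)

    def fill(i):
        if i < len(out):
            fill(2 * i)        # left subtree gets the smaller values
            out[i] = next(it)
            fill(2 * i + 1)    # right subtree gets the larger values
    fill(1)
    return out

def search(tree, x):
    """Descend by index arithmetic; returns True if x is present."""
    i = 1
    while i < len(tree):
        if tree[i] == x:
            return True
        i = 2 * i + (x > tree[i])  # left child or right child
    return False

t = eytzinger([1, 3, 5, 7, 9, 11, 13])
print(t[1:])         # [7, 3, 11, 1, 5, 9, 13]
print(search(t, 9))  # True
```

Besides saving the pointer words, the breadth-first layout keeps the top of the tree packed into a few cache lines, which is most of the practical win over a skip list.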


Yeah, mipmaps are an N-dimensional generalization of the breadth-first layout of implicit aggregation, where the aggregation function is averaging.

It may in theory be possible to generalize the in-order layout I talk about in a similar way, but I'm not sure it would be that useful. Maybe it would allow you to append rows or columns to your mipmapped image more easily, but I don't know of any applications where that's useful.
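
As a 1D toy sketch of that breadth-first aggregation (assuming power-of-two sizes; averaging stands in for any aggregation function):

```python
def build_levels(pixels):
    """Build breadth-first aggregation levels a la mipmaps, in 1D:
    each level halves the previous one by averaging adjacent pairs.
    Assumes len(pixels) is a power of two."""
    levels = [list(pixels)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2
                       for i in range(0, len(prev), 2)])
    return levels

levels = build_levels([0, 2, 4, 6, 8, 10, 12, 14])
for lvl in levels:
    print(lvl)
# [0, 2, 4, 6, 8, 10, 12, 14]
# [1.0, 5.0, 9.0, 13.0]
# [3.0, 11.0]
# [7.0]
```

Swap the averaging for max (or sum) and the same layout answers coarse range queries, which is the implicit-aggregation connection.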

