In HFT, we typically pin processes to run on a single isolated core (on a multic...

suprjami · on Sept 18, 2022

I'm surprised how much crossing NUMA nodes can affect performance. We've seen NICs halve their throughput with (intentionally) wrong setups.

I think of NUMA nodes as multiple computers which just happen to share a common operating system.
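
As a rough illustration of checking for that kind of mismatch on Linux, the sketch below looks up which NUMA node a NIC is attached to and which CPUs are local to that node, using the sysfs files `/sys/class/net/<iface>/device/numa_node` and `/sys/devices/system/node/node<N>/cpulist`. The helper names and the interface name `eth0` are made up for the example, and it assumes a Linux machine that exposes those files.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def nic_numa_node(iface):
    """Return the NUMA node a NIC hangs off, or None if it cannot be determined."""
    path = Path("/sys/class/net") / iface / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # virtual interface, unknown interface, or non-Linux system
    return node if node >= 0 else None  # the kernel reports -1 when it has no NUMA info


def node_cpus(node):
    """Return the CPUs local to a NUMA node, or [] if the node is not present."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    try:
        return parse_cpulist(path.read_text())
    except OSError:
        return []


if __name__ == "__main__":
    iface = "eth0"  # hypothetical interface name, substitute your own
    node = nic_numa_node(iface)
    if node is None:
        print(f"{iface}: no NUMA information available")
    else:
        print(f"{iface} is on node {node}; local CPUs: {node_cpus(node)}")
```

Threads that handle traffic from that NIC would then be pinned to CPUs from the printed list rather than to CPUs on another node.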

ls65536 · on Sept 18, 2022

In general this makes sense, but I think you need to be careful in some cases where the lowest latency between two logical "cores" is likely to be between those which are SMT siblings on the same physical core (assuming you have an SMT-enabled system). These logical "cores" will be sharing much of the same physical core's resources (such as the low-latency L1/L2 and micro-op caches), so depending on the particular workload, pinning two threads to these two logical "cores" could very well result in worse performance overall.
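
To see which logical "cores" are SMT siblings on Linux, one option is to read each CPU's `topology/thread_siblings` mask from sysfs. The sketch below is only an illustration under that assumption; the function names are invented for the example.

```python
from pathlib import Path


def mask_to_cpus(mask_text):
    """Decode a sysfs hex CPU mask such as '00000000,00000011' into a list of CPU ids."""
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def smt_sibling_groups(cpu_root="/sys/devices/system/cpu"):
    """Return one tuple of logical CPU ids per physical core, sorted by first id."""
    groups = set()
    for topo in Path(cpu_root).glob("cpu[0-9]*/topology/thread_siblings"):
        try:
            groups.add(tuple(mask_to_cpus(topo.read_text())))
        except (OSError, ValueError):
            continue  # offline CPU or unreadable entry; skip it
    return sorted(groups)


if __name__ == "__main__":
    for group in smt_sibling_groups():
        kind = "SMT siblings" if len(group) > 1 else "no sibling"
        print(group, kind)
```

Two threads pinned to ids from the same tuple share one physical core; ids taken from different tuples land on different physical cores.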

slabity · on Sept 18, 2022

SMT is usually disabled in these situations to prevent it from being a concern.
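
For anyone wanting to confirm this on a given box: recent Linux kernels report the SMT state in `/sys/devices/system/cpu/smt/active`. A minimal sketch, assuming that file is present (the function name is made up):

```python
from pathlib import Path


def smt_active(path="/sys/devices/system/cpu/smt/active"):
    """Return True or False as reported by the kernel, or None if the state is unknown."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        return None  # older kernel, non-Linux system, or restricted sysfs


if __name__ == "__main__":
    state = smt_active()
    print({True: "SMT is active", False: "SMT is off", None: "SMT state unknown"}[state])
```

Disabling it is typically done in firmware settings, with the `nosmt` kernel parameter, or by writing `off` to `/sys/devices/system/cpu/smt/control`.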

nextaccountic · on Sept 18, 2022

Doesn't this leave some performance on the table? Each core has more ports than a single thread could reasonably use, exactly because two threads can run on a single core.

slabity · on Sept 18, 2022

In terms of throughput, technically yes, you are leaving performance on the table. However, in HFT the throughput is greatly limited by IO anyways, so you don't get much benefit with it enabled.

What you want is to minimize latency, which means you don't want to be waiting for anything before you start processing whatever information you need. To do this, you need to ensure that the correct things are cached where they need to be, and SMT means that you have multiple threads fighting each other for that precious cache space.

In non-FPGA systems I've worked with, I've seen dozens of microseconds of latency added with SMT enabled vs disabled.

jnordwick · on Sept 19, 2022

Maybe 10 years ago that was the common thing, but there are so many extra resources (esp. registers) that it is now giving up almost half the chip. If you can be cache friendly enough, the extra cycles will make up for it.

slabity · on Sept 19, 2022

No, this is not true at all. "The extra cycles" is the exact thing you want to avoid in HFT. It doesn't matter how much processing throughput you can push through a single core with SMT enabled, because somewhere in the path (either broker, exchange, or some switch in between) you will eventually be so limited in throughput that it becomes irrelevant.

The only thing that matters at that point is latency, and unless you are cache-friendly enough to store your entire program in a single core's cache twice over, you would be better off disabling SMT altogether. And even if you were able to do that, it would not matter as a single thread would be done processing a message by the time the next one comes in. At least at the currently standard 10-25Gbps that the exchanges can handle.

In HFT, we're fine giving up half the registers in a core if it means we get an extra few microseconds of latency back.

bitcharmer · on Sept 18, 2022

No one in the HFT space runs with SMT enabled.

lucb1e · on Sept 19, 2022

What kind of "talking" are we talking about? I thought most IPC works somehow via shared memory under the hood rather than CPU cores actually communicating, how would you even do that?

bee_rider · on Sept 19, 2022

"Shared memory" is really more of a description of the memory model that is exposed to the programmer, rather than the hardware.

Under the hood, there are caches -- sometimes memory addresses live in a cache above you because you put them there, sometimes they live in a cache above you because a neighboring core that shares your cache put them there, sometimes they live in RAM, sometimes they live in another cache on your chip and you have to ask for them through the on-chip network. The advice I have been given (as a non-HFT guy) is just to try not to mess around too much with the temporal locality, pin threads to cores, and let the hardware handle the rest.
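
The "pin threads to cores" part can be sketched on Linux with `os.sched_setaffinity`, which restricts the caller to a set of CPUs. This is an illustration only: the helper name is invented, and it assumes a Linux system where Python exposes the `sched_*` calls.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling thread to a single CPU and return the resulting affinity set."""
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the caller"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))  # CPUs we are currently allowed to run on
    target = allowed[-1]                       # arbitrary choice for the example
    print(f"pinned to {pin_to_cpu(target)}")
```

Pinning only stops the scheduler from moving the thread; keeping other work off the chosen core is a separate step (for example the `isolcpus` kernel parameter).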