In HFT, we typically pin processes to run on a single isolated core (on a multicore machine). That allows the process to avoid a lot of kernel and other interrupts which could cause the process to not operate in a low latency manner.
If we have two of these processes, each on separate cores, and they occasionally need to talk to each other, then knowing the best choice of process/core location can keep the system operating in the lowest latency setup.
So, an app like this could be very helpful for determining where to place pinned processes onto specific cores.
There's also some common rules-of-thumb such as, don't put pinned processes that need to communicate on cores that are separated by the QPI, that just adds latency. Make sure if you're communicating with a NIC to find out which socket has the shortest path on the PCI bus to that NIC and other fun stuff. I never even thought about NUMA until I started to work with folks in HFT. It really makes you dig into the internals of the hardware to squeeze the most out of it.
In general this makes sense, but I think you need to be careful in some cases where the lowest latency between two logical "cores" is likely to be between those which are SMT siblings on the same physical core (assuming you have an SMT-enabled system). These logical "cores" will be sharing much of the same physical core's resources (such as the low-latency L1/L2 and micro-op caches), so depending on the particular workload, pinning two threads to these two logical "cores" could very well result in worse performance overall.
Doesn't this leave some performance on the table? Each core has more ports than a single thread could reasonably use, exactly because two threads can run on a single core
In terms of throughput, technically yes, you are leaving performance on the table. However, in HFT the throughput is greatly limited by IO anyways, so you don't get much benefit with it enabled.
What you want is to minimize latency, which means you don't want to be waiting for anything before you start processing whatever information you need. To do this, you need to ensure that the correct things are cached where they need to be, and SMT means that you have multiple threads fighting each other for that precious cache space.
In non-FPGA systems I've worked with, I've seen dozens of microseconds of latency added with SMT enabled vs disabled.
Maybe 10 years ago that as the common things, but there are so many exta resources (esp registers) that is is now giving up almost half the chip. If you can be cache friendly enough, the extra cycles will make up for it.
No, this is not true at all. "The extra cycles" is the exact thing you want to avoid in HFT. It doesn't matter how much throughput of processing you can put through a single core if you enable SMT, because somewhere in the path (either broker, exchange, or some switch in between) you will eventually be limited in throughput that it becomes irrelevant.
The only thing that matters at that point is latency, and unless you are cache-friendly enough to store your entire program in a single core's cache twice over, you would be better off disabling SMT altogether. And even if you were able to do that, it would not matter as a single thread would be done processing a message by the time the next one comes in. At least at the currently standard 10-25Gbps that the exchanges can handle.
In HFT, we're fine giving up half the registers in a core if it means we get an extra few microseconds of latency back.
What kind of "talking" are we talking about? I thought most IPC works somehow via shared memory under the hood rather than CPU cores actually communicating, how would you even do that?
"Shared memory" is really more of a description of the memory model that is exposed to the programmer, rather than the hardware.
Under the hood, there are caches -- sometimes memory addresses live in a cache above you because you put them there, sometimes they live in a cache above you because a neighboring core that shares your cache put them there, sometimes they live in RAM, sometimes they live in another cache on your chip and you have to ask for them through the on-chip network. The advice I have been given (as a non-HFT guy) is just to try not to mess around to much with the temporal locality, pin threads to cores, and let the hardware handle the rest.
If we have two of these processes, each on separate cores, and they occasionally need to talk to each other, then knowing the best choice of process/core location can keep the system operating in the lowest latency setup.
So, an app like this could be very helpful for determining where to place pinned processes onto specific cores.
There's also some common rules-of-thumb such as, don't put pinned processes that need to communicate on cores that are separated by the QPI, that just adds latency. Make sure if you're communicating with a NIC to find out which socket has the shortest path on the PCI bus to that NIC and other fun stuff. I never even thought about NUMA until I started to work with folks in HFT. It really makes you dig into the internals of the hardware to squeeze the most out of it.