Indeed - 1ms was a nice round number, but it was also not that far from the real, sub-1ms number. Note that by "latency" here I mean not just the round trip, but the total measured time of executing a single request: from submission, through the time its task spends in Tokio (and its queues), until its response is fully processed. Since the program sent requests with configurable concurrency set to ~1024 or more, the overall throughput was still satisfactory.
On the one hand it was (and still is) concerning that the observed latency per simple request was that high; on the other, it never really came up in our distributed tests, since there the network latency imposes a few milliseconds anyway. Once we figure out the real source of this behavior, I'll be happy to describe the investigation process in a blog post (: