Could you be a little more specific about what's incredibly complex about writin...

sanxiyn · on Dec 28, 2014

I don't write TCP stacks, but Juho Snellman writes one for a living, and I found the following anecdote on writing an interoperable TCP stack interesting.

http://snellman.net/blog/archive/2014-11-11-tcp-is-harder-th...

TLDR: There are TCP implementations that can't handle SYN retransmission, and you have to interoperate with them if your TCP stack is the product.

colmmacc · on Dec 27, 2014

I'm not the OP, but I think it's fair to call it complex, and I'd pick three requirements out in particular.

1. Path reachability, MTU discovery and MSS interaction

When sending outbound packets, you have to correlate incoming ICMP error messages in case they signal a problem. If the problem is that the packet is too big, you have to figure out what the MTU really is (which can take repeated attempts), so that you know what MSS to use (for TCP, or fragmentation boundary for UDP). If the path is unreachable, you have to remember that too. In both cases, you need some kind of global book-keeping so that you can do the right thing across connections. Some protocols (like active FTP) implicitly rely on MTU discovery on one connection signaling the MSS for another connection, so everything has to be path based, rather than connection based. Messy.
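To make the "path based, rather than connection based" point concrete, here is a minimal sketch of that global book-keeping. Everything in it is hypothetical (`PathTable`, `on_frag_needed`, `mss_for` are illustrative names, not from any real stack), and it assumes IPv4 and TCP headers without options. The fallback values are a few entries from the plateau table in RFC 1191, used when an old router reports a next-hop MTU of 0 and you have to guess downwards over repeated attempts.

```python
from dataclasses import dataclass

IPV4_HEADER = 20    # bytes, assuming no IP options
TCP_HEADER = 20     # bytes, assuming no TCP options
MIN_IPV4_MTU = 68   # smallest MTU an IPv4 link may have
# A few plateau values from RFC 1191, largest first (illustrative subset).
PLATEAUS = (1492, 1006, 508, 296, 68)


@dataclass
class PathInfo:
    mtu: int = 1500
    reachable: bool = True


class PathTable:
    """Hypothetical per-destination cache shared by every connection."""

    def __init__(self, default_mtu=1500):
        self.default_mtu = default_mtu
        self.paths = {}

    def _entry(self, dst):
        return self.paths.setdefault(dst, PathInfo(mtu=self.default_mtu))

    def on_frag_needed(self, dst, next_hop_mtu):
        """Handle an ICMP 'fragmentation needed' error for this path."""
        entry = self._entry(dst)
        if next_hop_mtu >= MIN_IPV4_MTU:
            # The router told us the MTU; never raise the estimate here.
            entry.mtu = min(entry.mtu, next_hop_mtu)
        else:
            # No usable hint: step down one plateau and try again later.
            entry.mtu = next((p for p in PLATEAUS if p < entry.mtu), MIN_IPV4_MTU)

    def on_unreachable(self, dst):
        self._entry(dst).reachable = False

    def mss_for(self, dst):
        """MSS any connection to dst should use, old or new."""
        return self._entry(dst).mtu - IPV4_HEADER - TCP_HEADER
```

Because the table is keyed on the destination rather than on a connection, an error seen by one connection changes the MSS that the next connection to the same host picks up. A real stack would also age entries out so that it can probe for a larger MTU again; this sketch does not.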

2. State management for error correlation

O.k., so you've figured out how to fragment an outgoing datagram and know what boundary to use, but how do you handle incoming error messages related to the fragments? Even for UDP, or other "stateless" protocols, you actually do have to keep state so that you can correlate those error messages to the packets you sent. When the error message comes back, it will have the IP ID of the fragment, but nothing else is guaranteed.

This goes for (1.) too, but ICMP error messages can also be recursive and nested, and for a correct implementation you need to consider how to handle ICMP error messages that were themselves triggered by ICMP error messages. Several userspace stacks get this wrong, and can't correctly handle MTU discovery for UDP, or double-error correlation.
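A rough sketch of the state this implies, assuming ICMPv4: the error quotes the IP header of the packet that failed, so the sender can look the IP ID up in a log of what it sent. `SentLog`, `record` and `correlate` are hypothetical names for illustration; keying on the quoted destination address as well as the IP ID is my assumption, and a real stack would also expire entries.

```python
import struct

PROTO_ICMP = 1


def parse_icmp_error(icmp: bytes):
    """Pull the correlation fields out of an ICMPv4 error message."""
    if len(icmp) < 8 + 20:
        return None                      # too short to quote an IP header
    icmp_type, code, _cksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp[:8])
    quoted = icmp[8:]                    # IP header of the packet that failed
    if quoted[0] >> 4 != 4:
        return None                      # quoted packet is not IPv4
    (ip_id,) = struct.unpack("!H", quoted[4:6])
    proto = quoted[9]
    return {
        "type": icmp_type,
        "code": code,
        # Only type 3 / code 4 (fragmentation needed) carries an MTU hint.
        "mtu": next_hop_mtu if (icmp_type, code) == (3, 4) else None,
        "ip_id": ip_id,
        "proto": proto,
        "dst": bytes(quoted[16:20]),
        # True when the packet that failed was itself an ICMP message.
        "nested": proto == PROTO_ICMP,
    }


class SentLog:
    """Hypothetical record of recently sent datagrams and fragments."""

    def __init__(self):
        self._sent = {}

    def record(self, dst: bytes, ip_id: int, owner):
        self._sent[(dst, ip_id)] = owner

    def correlate(self, icmp: bytes):
        """Return (owner, error) for a known packet, else None."""
        err = parse_icmp_error(icmp)
        if err is None:
            return None
        owner = self._sent.get((err["dst"], err["ip_id"]))
        return (owner, err) if owner is not None else None
```

The lookup never touches UDP ports, because a non-first fragment doesn't carry them. The `nested` flag only marks the case where the failed packet was ICMP; deciding what to do about it is the hard part and is left out here.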

3. Heuristic and inconsistent caps on state

Many TCP implementations support selective acknowledgements and duplicate ack signalling, but what are their tolerances? Just how much data can be retransmitted or handled out of order before you have to resend the whole window? There's no way to know, and if you get it wrong you can end up stalling a TCP connection for a significant time. Unfortunately there are no simple limits, and in some cases the volumes are related to bandwidth delay products, necessitating some kind of integral control loop.
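For a sense of scale, the bandwidth delay product is just link rate times round-trip time. A back-of-the-envelope sketch, where the function names, the 100 Mbit/s link, the 80 ms RTT and the 1460-byte MSS are all assumptions picked for illustration:

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bytes that can be in flight on the path at once."""
    # bits/s * ms / 1000 gives bits; / 8 gives bytes.
    return bandwidth_bps * rtt_ms // 8000


def segments_in_flight(bandwidth_bps: int, rtt_ms: int, mss: int = 1460) -> int:
    """Full-sized segments needed to fill the path, rounded up."""
    return -(-bdp_bytes(bandwidth_bps, rtt_ms) // mss)


print(bdp_bytes(100_000_000, 80))           # 1000000
print(segments_in_flight(100_000_000, 80))  # 685
```

So on that assumed path, roughly a megabyte (several hundred segments) can be outstanding, which is the amount of out-of-order state a peer may or may not be willing to hold.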

The problem with all of these is that they only show up "sometimes" and with particular networks or TCP stacks. I've limited these to interoperability issues - but there are other tricky complexities. For example, when building a TCP stack, do you optimise for throughput and so batch reads/writes of many packets - or do you optimise for a correct RTT estimate, and do things more synchronously? It's not possible to have both (at least with today's NIC interfaces); sometimes RTT is critical (e.g. an NTP implementation, a real-time control system or just any system that needs to rapidly recover from packet loss), sometimes throughput is more important. Definitely complex.
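To see why the RTT estimate matters for loss recovery, here is a sketch of the standard smoothed-RTT calculation from RFC 6298 (alpha = 1/8, beta = 1/4, K = 4). The class name and the `min_rto` and `granularity` parameters are my own choices for illustration. Any delay that batching adds to a sample feeds straight into the retransmission timeout.

```python
class RttEstimator:
    """Smoothed RTT and retransmission timeout, following RFC 6298."""

    ALPHA = 1 / 8
    BETA = 1 / 4
    K = 4

    def __init__(self, min_rto=1.0, granularity=0.001):
        self.min_rto = min_rto          # RFC 6298 recommends a 1 s floor
        self.granularity = granularity  # assumed clock granularity, seconds
        self.srtt = None
        self.rttvar = None

    def sample(self, rtt):
        """Feed one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement seeds both values.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto()

    def rto(self):
        raw = self.srtt + max(self.granularity, self.K * self.rttvar)
        return max(self.min_rto, raw)
```

With `min_rto=0`, a steady 100 ms path gives an RTO of about 0.3 s after the first sample and about 0.25 s after the second. If batching makes the same samples jitter by tens of milliseconds, the variance term stays large and the timeout stays well above the real RTT, which is the throughput versus recovery-time trade-off described above.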

tptacek · on Dec 28, 2014

Getting a performant TCP is certainly hard. So, for that matter, is getting congestion control right --- TCP congestion control is devilishly hard. But you don't have to do either of those things to get an interoperable TCP!