Saturday, August 23, 2008

Sending small data segments over TCP with Winsock

When a Microsoft TCP stack receives a data packet, a 200-ms delay timer goes off. When an ACK is eventually sent, the delay timer is reset and will initiate another 200-ms delay when the next data packet is received. To increase the efficiency in both Internet and the intranet applications, Microsoft TCP stack uses the following criteria to decide when to send one ACK on received data packets:
If the second data packet is received before the delay timer expires, the ACK is sent.
If there are data to be sent in the same direction as the ACK before the second data packet is received and the delay timer expires, the ACK is piggybacked with the data segment and sent immediately.
When the delay timer expires, the ACK is sent.
To avoid having small data packets congest the network, Microsoft TCP stack enables the Nagle algorithm by default, which coalesces a small data buffer from multiple send calls and delays sending it until an ACK for the previous data packet sent is received from the remote host. The following are two exceptions to the Nagle algorithm:
If the stack has coalesced a data buffer larger than the Maximum Transmission Unit (MTU), a full-sized packet is sent immediately without waiting for the ACK from the remote host. On an Ethernet network, the MTU for TCP/IP is 1460 bytes.
The TCP_NODELAY socket option is applied to disable the Nagle algorithm so that the small data packets are delivered to the remote host without delay.
To optimize performance at the application layer, Winsock copies data buffers from application send calls to a Winsock kernel buffer. Then, the stack uses its own heuristics (such as Nagle algorithm) to determine when to actually put the packet on the wire. You can change the amount of Winsock kernel buffer allocated to the socket using the SO_SNDBUF option (it is 8K by default). If necessary, Winsock can buffer significantly more than the SO_SNDBUF buffer size. In most cases, the send completion in the application only indicates the data buffer in an application send call is copied to the Winsock kernel buffer and does not indicate that the data has hit the network medium. The only exception is when you disable the Winsock buffering by setting SO_SNDBUF to 0.

Winsock uses the following rules to indicate a send completion to the application (depending on how the send is invoked, the completion notification could be the function returning from a blocking call, signaling an event or calling a notification function, and so forth):
If the socket is still within SO_SNDBUF quota, Winsock copies the data from the application send and indicates the send completion to the application.
If the socket is beyond SO_SNDBUF quota and there is only one previously buffered send still in the stack kernel buffer, Winsock copies the data from the application send and indicates the send completion to the application.
If the socket is beyond SO_SNDBUF quota and there is more than one previously buffered send in the stack kernel buffer, Winsock copies the data from the application send. Winsock does not indicate the send completion to the application until the stack completes enough sends to put the socket back within SO_SNDBUF quota or only one outstanding send condition.