How TCP Actually Works: The Protocol That Pretends the Network Is Reliable

TCP is the protocol that makes the internet feel less broken than it really is. Applications call send() and recv() as if there is a clean, ordered byte stream between two machines. In reality the network underneath is chaotic. Packets are dropped, duplicated, reordered, delayed, truncated, or delivered in bursts. Routers fill queues and then throw data away. Links flap. NAT devices forget state. Middleboxes rewrite headers. TCP sits in the middle of all that and tries to provide one simple abstraction: a reliable, ordered stream of bytes.

That abstraction is expensive. TCP has to number bytes, infer loss, estimate round-trip time, control how much data is in flight, avoid overwhelming the receiver, avoid overwhelming the network, recover from missing segments, and shut down cleanly when the application is done. Most developers know a few pieces of this, usually "three-way handshake" and maybe "slow start", but the protocol is much more interesting than that. This post walks through how it actually works.

1. TCP Is a Byte Stream, Not a Message Protocol

This is the first thing people get wrong. TCP does not preserve application message boundaries. It preserves byte order.

If your application writes:

sock.send(b"HELLO")
sock.send(b"WORLD")

the receiver might observe:

sock.recv(10)   # b"HELLOWORLD"

Or:

sock.recv(3)    # b"HEL"
sock.recv(7)    # b"LOWORLD"

Or any other partitioning of the same ordered ten bytes. The application sees a stream, not packets. The sender might emit two TCP segments, one segment, or ten, depending on buffering, the congestion window, MSS, delayed ACK behaviour, and kernel heuristics like Nagle's algorithm.

This matters because application protocols must define their own framing. HTTP/1.1 uses Content-Length, chunked transfer encoding, or connection close. PostgreSQL uses a one-byte type field followed by a four-byte length. TLS has its own record layer on top of TCP precisely because TCP gives it a stream, not messages.
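As a minimal sketch of application-level framing, the helpers below prefix each message with a four-byte big-endian length. The names send_msg, recv_exact, and recv_msg are illustrative, not part of any standard API, and the four-byte prefix is just one possible convention.

```python
import socket
import struct


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one message as a 4-byte big-endian length followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock: socket.socket) -> bytes:
    """Read one length-prefixed message, however TCP happened to split it."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

With this in place, two send_msg() calls always come back as two messages from recv_msg(), regardless of how the bytes were segmented on the wire.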

At the wire level, TCP absolutely does send segments. But the application API is intentionally blind to them. This split between wire segments and application stream is one of the main reasons TCP is so useful and so often misunderstood.

2. The Three-Way Handshake: More Than Just "Hello"

Before data can flow, both sides need to agree on initial state. TCP is full-duplex, which means each direction has its own sequence space. Both sides must announce where their byte numbering begins, and both sides must confirm they can hear each other.

The classic three-way handshake looks like this:

Client                                           Server
 
SYN, seq = x
                        ----------------------->
 
                               SYN-ACK, seq = y, ack = x + 1
                        <-----------------------
 
ACK, seq = x + 1, ack = y + 1
                        ----------------------->

The client sends a SYN with an Initial Sequence Number (ISN) x. The server replies with its own SYN carrying ISN y, plus an ACK for x + 1. The client's final ACK confirms receipt of the server's SYN.

Why does the ACK advance by 1 when no application data has been sent yet? Because SYN consumes one sequence number. FIN does too. TCP numbers control events in the same sequence space as data so that ordering is unambiguous.
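The sequence arithmetic of the handshake can be sketched in a few lines. This is only a model of the numbers exchanged, not a packet generator; the dictionary layout and the function names are made up for illustration. Sequence numbers are 32-bit, so the addition wraps.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n, wrapping at 2**32."""
    return (seq + n) % SEQ_MOD


def handshake(client_isn: int, server_isn: int) -> list:
    """Return the three handshake segments as plain dictionaries."""
    syn = {"flags": "SYN", "seq": client_isn}
    # The SYN consumes one sequence number, so the server acknowledges x + 1.
    syn_ack = {"flags": "SYN-ACK", "seq": server_isn, "ack": seq_add(client_isn, 1)}
    # The server's SYN also consumes one, so the client acknowledges y + 1.
    ack = {"flags": "ACK", "seq": seq_add(client_isn, 1), "ack": seq_add(server_isn, 1)}
    return [syn, syn_ack, ack]
```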

The handshake has several jobs:

  • It proves two-way reachability.
  • It establishes initial sequence numbers for each direction.
  • It negotiates options such as MSS, window scaling, SACK support, and timestamps.
  • It allocates kernel state on both endpoints.

Those negotiated options matter more than most people realise. A modern SYN often carries:

MSS=1460
SACK_PERMITTED
WINDOW_SCALE=7
TIMESTAMP val=...

Without window scaling, TCP's 16-bit advertised receive window tops out at 65,535 bytes, which is tiny on modern long-fat networks. Without SACK, loss recovery is far less efficient. Without timestamps, RTT estimation and PAWS protection are weaker. A large amount of TCP performance is decided before the first payload byte is sent.

Why ISNs Must Be Unpredictable

In old TCP stacks, sequence numbers were often predictable, sometimes based on a simple timer. That enabled off-path spoofing attacks. If an attacker could guess the server's expected next sequence number, they could inject forged packets into a connection they could not even observe.

Modern stacks randomise ISNs. RFC 6528 formalised stronger ISN generation specifically because naive sequence number selection created real security problems.
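RFC 6528 describes the ISN as a timer plus a keyed hash of the connection 4-tuple. The sketch below follows that shape under several assumptions: HMAC-SHA256 stands in for the pseudo-random function, the secret is generated at import time, and the function name is illustrative. Real stacks choose their own hash and key management.

```python
import hashlib
import hmac
import os
import time

SECRET = os.urandom(16)  # assumption: a per-boot secret key


def initial_sequence_number(local_ip, local_port, remote_ip, remote_port,
                            secret=SECRET, now=None):
    """ISN = M + F(4-tuple, secret), reduced modulo 2**32."""
    if now is None:
        now = time.monotonic()
    # M: a counter that ticks every 4 microseconds.
    m = int(now * 1_000_000 / 4)
    four_tuple = f"{local_ip}:{local_port}-{remote_ip}:{remote_port}".encode()
    # F: keyed hash of the 4-tuple, truncated to 32 bits.
    digest = hmac.new(secret, four_tuple, hashlib.sha256).digest()
    f = int.from_bytes(digest[:4], "big")
    return (m + f) % 2 ** 32
```

Because the hash is keyed, an off-path attacker who knows the 4-tuple and the time still cannot compute the offset for a connection they are not part of.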

3. Sequence Numbers and ACKs: How TCP Knows What Arrived

TCP reliability is based on byte numbering. Every byte in the stream has a sequence number. The sequence number in a TCP header identifies the first byte carried in that segment.

Suppose the client starts at sequence number 1000 and sends 3000 bytes. If the Maximum Segment Size is 1460 bytes, the stream might be segmented like this:

Segment   Sequence Number   Payload Length
1         1000              1460
2         2460              1460
3         3920              80

The receiver acknowledges cumulatively. If it has received all bytes through 3999 inclusive, it sends:

ACK = 4000

That means "I have everything up to byte 3999, and the next byte I want is 4000."

The word cumulative is important. A TCP ACK does not say "I received segment 2." It says "the left edge of the missing data is here." If segment 2 is lost but segment 3 arrives, the receiver still sends ACK = 2460 because there is now a hole in the stream. It cannot advance the cumulative ACK past missing data.

This is why out-of-order delivery creates duplicate ACKs. If the receiver keeps getting data beyond the missing hole, it keeps repeating the same ACK value. Those duplicate ACKs are a signal to the sender that something is probably lost.

A Concrete Example

Sender sends:
  seq=1000 len=1000
  seq=2000 len=1000
  seq=3000 len=1000
 
Network drops:
  seq=2000 len=1000
 
Receiver gets:
  seq=1000 len=1000   -> sends ACK=2000
  seq=3000 len=1000   -> sends ACK=2000 again

That repeated ACK=2000 tells the sender: "I saw something beyond 2000, so 2000 is probably missing."
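The receiver side of this example can be modelled with a toy function that tracks the next expected byte. It assumes segments never partially overlap and ignores sequence-number wraparound; the function name is illustrative.

```python
def receiver_acks(initial_seq, arrivals):
    """Return the cumulative ACK sent after each arriving (seq, length) segment."""
    next_expected = initial_seq
    out_of_order = {}  # seq -> length for data that has arrived but is not yet contiguous
    acks = []
    for seq, length in arrivals:
        if seq >= next_expected:
            out_of_order[seq] = length
        # Advance the left edge across any data that is now contiguous.
        while next_expected in out_of_order:
            next_expected += out_of_order.pop(next_expected)
        acks.append(next_expected)
    return acks
```

Feeding it the arrivals above, then the retransmitted segment at 2000, shows the ACK jumping straight to 4000 once the hole is filled.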

4. The TCP Header: Small, Dense, and Full of Consequences

The fixed TCP header is only 20 bytes, but nearly every field has deep behavioural consequences:

 0                  15 16                  31
+---------------------+---------------------+
|     Source Port     |  Destination Port   |
+---------------------+---------------------+
|              Sequence Number              |
+-------------------------------------------+
|           Acknowledgment Number           |
+------+-----+--------+---------------------+
| HLEN | Res | Flags  |       Window        |
+------+-----+--------+---------------------+
|      Checksum       |   Urgent Pointer    |
+---------------------+---------------------+
|            Options (variable)             |
+-------------------------------------------+

The important flags are:

  • SYN: start connection
  • ACK: acknowledgment field is valid
  • FIN: sender has no more data
  • RST: abort connection immediately
  • PSH: hint to push buffered data upward
  • URG: urgent pointer is valid, now mostly irrelevant

The window field advertises how much receive buffer space is available. This is TCP flow control, not congestion control. The checksum covers a pseudo-header including source IP, destination IP, protocol number, plus the TCP header and payload. If the checksum fails, the segment is discarded silently.
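A sketch of the IPv4 checksum calculation follows. The pseudo-header also carries a zero byte and the TCP length alongside the addresses and protocol number. The function names are illustrative, and the caller is assumed to pass the segment with its checksum field zeroed when computing a fresh value.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum_ipv4(src_ip: str, dst_ip: str, tcp_segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the TCP header and payload."""
    pseudo_header = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(tcp_segment))
    )
    return ~ones_complement_sum(pseudo_header + tcp_segment) & 0xFFFF
```

A receiver can verify a segment by running the same function over the segment as received, checksum field included: a result of zero means the checksum matches.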

Options carry a surprising amount of modern TCP's intelligence:

  • MSS
  • Window Scale
  • SACK Permitted
  • Timestamps

The data offset field (HLEN) gives the header length in 32-bit words, which tells the receiver where the payload begins. It is needed because options make the header variable-length.
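To make the layout concrete, here is a small parser for the fixed header using the struct module. It is a sketch: the returned dictionary and the function name are illustrative, options are returned as raw bytes rather than decoded, and only the six classic flags are named.

```python
import struct

# Flag bits in byte 13 of the TCP header.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Split a raw TCP segment into header fields, options, and payload."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    (src_port, dst_port, seq, ack, offset_byte, flag_byte,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    # The data offset is the high nibble, counted in 32-bit words.
    header_len = (offset_byte >> 4) * 4
    if header_len < 20 or header_len > len(segment):
        raise ValueError("invalid data offset")
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "flags": [name for name, bit in FLAG_BITS.items() if flag_byte & bit],
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
        "options": segment[20:header_len],
        "payload": segment[header_len:],
    }
```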

5. Flow Control: Do Not Overrun the Receiver

Reliability is not enough. The sender also has to avoid overwhelming the receiver's memory.

Each side maintains a receive buffer. The receiver advertises the available free space in the TCP window field. This advertised window is often written as rwnd. The sender must keep the amount of unacknowledged data in flight at or below rwnd.

This creates a sliding window:

[acknowledged][sent but unacknowledged][allowed to send][not yet allowed]

If the receiver application is slow to read data from the socket buffer, rwnd shrinks. If it stops reading entirely, the window may drop to zero. At that point the sender must stop transmitting payload, except for occasional zero-window probes to check whether the receiver has opened the window again.

This is a pure endpoint problem. Flow control protects the receiver host. It says nothing about whether the network path can handle the traffic.

Window Scaling

The 16-bit TCP window field can represent at most 65,535 bytes. That was reasonable in the 1980s. It is terrible on modern paths.

Consider a 10 Gbit/s link with 40 ms RTT. The bandwidth-delay product is:

10,000,000,000 bits/s * 0.040 s = 400,000,000 bits
400,000,000 / 8 = 50,000,000 bytes

You need about 50 MB in flight to fill that path. A 64 KB window would cap throughput at:

65,535 bytes / 0.040 s ≈ 1.64 MB/s ≈ 13.1 Mbit/s

That is nearly three orders of magnitude (roughly 760 times) below the path capacity.

RFC 7323 window scaling fixes this by negotiating a scale factor during the handshake. The actual receive window becomes:

actual_window = advertised_window << scale_factor

Without window scaling, modern high-speed TCP would be crippled.
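The arithmetic in this section is easy to reproduce. The helper names below are illustrative. The upper limit of 14 on the shift count comes from RFC 7323.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Best-case throughput when at most one window can be sent per round trip."""
    return window_bytes * 8 / rtt_s


def scaled_window(advertised_window: int, scale_factor: int) -> int:
    """Actual receive window after applying the negotiated shift count."""
    if not 0 <= scale_factor <= 14:
        raise ValueError("RFC 7323 limits the shift count to 14")
    return advertised_window << scale_factor
```

For example, bdp_bytes(10e9, 0.040) gives the 50 MB figure above, and scaled_window(65535, 7) shows that a shift of 7 raises the ceiling to roughly 8 MB.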

6. Congestion Control: Do Not Overrun the Network

Flow control protects the receiver. Congestion control protects the network.

This distinction is fundamental. A receiver might happily advertise a 16 MB receive window, but the path between sender and receiver may only be able to sustain a tiny fraction of that without collapsing into queue overflow and loss.

TCP therefore keeps a second limit: the congestion window, cwnd. The amount of data in flight is constrained by:

flight_size <= min(rwnd, cwnd)

rwnd comes from the receiver. cwnd is inferred by the sender from network behaviour.

The historical reason for congestion control is painful. In the mid-1980s, parts of the internet experienced congestion collapse: offered load increased, throughput fell, routers filled with retransmissions, and useful work approached zero. Van Jacobson's 1988 paper introduced the core ideas that saved TCP from melting the network.

Slow Start

A new connection does not know the safe sending rate. It starts conservatively with a small cwnd, historically 1 MSS, now often around 10 MSS on modern systems.

During slow start, each ACK for new data increases cwnd by up to 1 MSS. Because a full window of segments is acknowledged every round trip, cwnd roughly doubles per RTT, which makes growth exponential:

1 MSS -> 2 MSS -> 4 MSS -> 8 MSS -> 16 MSS

This is not "slow" in the ordinary English sense. It is called slow start because it is slower than immediately blasting at line rate. Relative to the network, it is aggressive.

Slow start continues until either:

  • packet loss is inferred
  • cwnd reaches the slow start threshold, ssthresh

Congestion Avoidance

Once past ssthresh, growth becomes additive rather than exponential. A classic AIMD algorithm increases cwnd by roughly 1 MSS per RTT:

cwnd += MSS * MSS / cwnd   # applied on each ACK; adds up to about 1 MSS per RTT

This is the "AI" in AIMD, Additive Increase.

When loss occurs, the sender interprets it as a sign of congestion and cuts the window. That is the "MD", Multiplicative Decrease.
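A per-round-trip simulation makes the two growth phases and the multiplicative decrease visible. This is a deliberately coarse model: cwnd is counted in MSS units, a loss round halves the window without modelling recovery, and the parameter names and defaults are illustrative.

```python
def simulate_cwnd(rounds, initial_cwnd=10, ssthresh=64, loss_rounds=frozenset()):
    """Return cwnd (in MSS) at the start of each round trip."""
    cwnd = float(initial_cwnd)
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Multiplicative decrease: halve the window, with a floor of 2 MSS.
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double per round trip, capped at ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of 1 MSS per round trip.
            cwnd += 1
    return history
```

Calling simulate_cwnd(6, initial_cwnd=1, ssthresh=8) shows the doubling phase handing over to linear growth once the threshold is reached.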

Why Loss Means Congestion

Classic TCP assumes packet loss is usually caused by queues overflowing in routers. That assumption was very accurate for wired networks and still mostly useful today, though it breaks down on wireless paths where corruption and radio-layer behaviour also cause loss. This mismatch is one reason TCP over WiFi or cellular can behave strangely.

7. Retransmission: How TCP Recovers Lost Data

If TCP never retransmitted, reliability would be fiction. The hard part is deciding when a segment is truly lost rather than merely delayed.

TCP uses two main mechanisms:

  • Retransmission timeout (RTO)
  • Fast retransmit triggered by duplicate ACKs

Retransmission Timeout

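The sender measures round-trip time and maintains a smoothed estimate plus a variance term. For each new sample, RFC 6298 updates RTTVAR first, using the previous SRTT, and only then updates SRTT:

RTTVAR = (1 - beta)  * RTTVAR + beta  * |SRTT - RTT_sample|
SRTT   = (1 - alpha) * SRTT   + alpha * RTT_sample
RTO    = SRTT + max(G, K * RTTVAR)

Common values are alpha = 1/8, beta = 1/4, K = 4. G is the clock granularity.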

The point is not to predict the exact RTT. The point is to set a timeout that is long enough to avoid spurious retransmissions under jitter, but short enough to recover reasonably quickly from actual loss.

If the timer expires before the data is acknowledged, the sender retransmits the missing segment and treats this as strong congestion evidence.
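The estimator can be sketched as a small class that follows the update order in RFC 6298 (RTTVAR before SRTT). The class name and the granularity default are assumptions. The 1 second initial RTO and floor, and the doubling on timeout, are taken from that RFC, though many real stacks use a lower floor.

```python
class RtoEstimator:
    """RTT smoothing and retransmission timeout in the style of RFC 6298."""

    def __init__(self, alpha=1 / 8, beta=1 / 4, k=4,
                 granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.alpha, self.beta, self.k = alpha, beta, k
        self.granularity = granularity
        self.min_rto, self.max_rto = min_rto, max_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any sample exists

    def on_rtt_sample(self, sample: float) -> float:
        """Feed one RTT sample (seconds) from a segment that was sent exactly once."""
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - sample)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample
        raw = self.srtt + max(self.granularity, self.k * self.rttvar)
        self.rto = min(max(raw, self.min_rto), self.max_rto)
        return self.rto

    def on_timeout(self) -> float:
        """Exponential backoff: double the RTO after a retransmission timeout."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto
```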

Karn's Algorithm

There is an ambiguity problem. If a retransmitted segment is later acknowledged, which transmission instance did the ACK correspond to, the original or the retransmission? You cannot know.

Karn's algorithm solves this by refusing to take RTT samples from retransmitted segments. TCP only updates RTT estimators on segments that were transmitted exactly once.

Fast Retransmit

Waiting for an RTO is expensive. If the receiver sends three duplicate ACKs for the same missing byte, the sender infers a hole in the stream and retransmits immediately, without waiting for the timer.

Example:

ACK 5000
ACK 5000
ACK 5000
ACK 5000   -> 3 duplicate ACKs observed, retransmit segment starting at 5000

Why three? Fewer could be caused by mild reordering. Three was chosen as a practical heuristic: strong enough evidence of loss without waiting too long.
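The duplicate-ACK counter is simple enough to sketch directly. This toy version looks only at ACK numbers. Real stacks also check that the ACK carries no data and does not change the window, among other conditions. The function name is illustrative.

```python
def fast_retransmits(acks, threshold=3):
    """Return the sequence numbers that would be fast-retransmitted."""
    last_ack = None
    dup_count = 0
    retransmit = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            # Fire exactly once, on the third duplicate of the same ACK.
            if dup_count == threshold:
                retransmit.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return retransmit
```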

Fast Recovery

Classic TCP Reno treats loss detected by duplicate ACKs less severely than loss detected by a timeout. On the third duplicate ACK:

  • ssthresh = cwnd / 2
  • retransmit the missing segment
  • reduce sending rate, but do not collapse all the way back to 1 MSS

Timeouts are treated as worse because they indicate the ACK clock has likely broken completely.

8. Selective Acknowledgment: Fixing TCP's Original Blind Spot

Cumulative ACKs are simple, but inefficient when multiple segments are lost within one window.

Suppose the sender transmits segments 1 through 10, and segments 3 and 7 are lost. With only cumulative ACKs, the receiver can keep saying "I still need segment 3," but it cannot efficiently communicate which later blocks arrived. The sender may end up retransmitting more than necessary.

Selective Acknowledgment (SACK), standardised in RFC 2018, fixes this. The receiver still sends the cumulative ACK for the left edge of missing data, but also includes SACK blocks describing out-of-order data it has already received.

Example:

ACK = 3000
SACK = [4000, 7000), [8000, 11000)

This means:

  • bytes below 3000 are contiguous and delivered
  • bytes 3000 to 3999 are missing
  • bytes 4000 to 6999 have arrived
  • bytes 7000 to 7999 are missing
  • bytes 8000 to 10999 have arrived

Now the sender can retransmit only the actual holes.
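Computing the holes from a cumulative ACK and a set of SACK blocks is a short interval walk. The sketch below uses half-open [start, end) ranges as in the example, and ignores sequence wraparound. Data beyond the highest SACKed byte is not reported, since the sender cannot yet tell whether it is lost. The function name is illustrative.

```python
def sack_holes(cumulative_ack, sack_blocks):
    """Return the missing [start, end) ranges implied by an ACK plus SACK blocks."""
    holes = []
    left = cumulative_ack  # everything below this is already delivered
    for start, end in sorted(sack_blocks):
        if start > left:
            holes.append((left, start))
        left = max(left, end)
    return holes
```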

SACK dramatically improves performance under burst loss, which is common in real networks. Without SACK, TCP behaves much worse than many engineers realise.

9. Connection Teardown: FIN Is Not RST

TCP shutdown is another place where people compress too much detail into one sentence.

Because TCP is full-duplex, each direction closes independently. A normal graceful close is a four-step exchange:

Client                                           Server
 
FIN, seq = u
                        ----------------------->
 
                               ACK, ack = u + 1
                        <-----------------------
 
                               FIN, seq = v
                        <-----------------------
 
ACK, ack = v + 1
                        ----------------------->

The first FIN means "I am done sending." It does not mean "I am done receiving." This is called a half-close. The other side may still send data until it also transmits FIN.

TIME_WAIT

After sending the final ACK, the side that initiated the close enters TIME_WAIT for 2 * MSL, which in practice is commonly around 60 seconds on modern systems. This annoys developers who see many sockets stuck in TIME_WAIT, but it exists for a reason:

  • to allow retransmission of the final ACK if the peer's FIN is repeated
  • to ensure delayed duplicate segments from the old connection die before the same 4-tuple is reused

Without TIME_WAIT, sequence number overlap and stale packet confusion would become much more likely.

RST

RST is not a clean close. It is an abort. It tells the peer: "tear this connection down immediately; whatever state you thought existed is invalid."

You see RSTs when:

  • an application crashes or closes abruptly
  • a packet arrives for a nonexistent socket
  • a firewall or middlebox actively kills a connection
  • an application calls setsockopt(... SO_LINGER ...) with lingering enabled and a zero timeout, which makes close() send an RST instead of a FIN

A FIN is polite. An RST is a door slammed shut.
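For reference, the reset-on-close configuration looks like this in Python. The sketch assumes a POSIX-style struct linger made of two C ints, as on Linux; other platforms may lay the structure out differently. The function name is illustrative.

```python
import socket
import struct


def enable_abortive_close(sock: socket.socket) -> None:
    """Make close() abort the connection with an RST instead of a FIN."""
    # struct linger { int l_onoff; int l_linger; }: linger on, timeout zero.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
```

This discards any unsent data and skips TIME_WAIT, which is exactly why it should be a deliberate choice rather than a default.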

10. Why TCP Performs So Differently on Different Networks

TCP's visible behaviour is a product of several interacting variables:

  • RTT
  • loss rate
  • bandwidth-delay product
  • receive buffer size
  • congestion control algorithm
  • path reordering
  • middleboxes

One dropped packet on a low-latency fibre link may barely matter. The same drop on a high-RTT path can destroy throughput because the sender has far more data in flight and takes longer to recover.

The rough upper bound for classic Reno throughput under random loss is often approximated by the Mathis formula:

Throughput ≈ (MSS / RTT) * (1 / sqrt(p))

Where p is packet loss probability.

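This is only an approximation (the full formula includes a constant factor of order 1), but it captures the important part: throughput is inversely proportional to RTT and to the square root of the loss rate. Doubling the RTT halves the achievable throughput, and quadrupling the loss rate halves it again. A little loss on a long path hurts much more than many people expect.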
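Plugging numbers into the approximation shows how quickly the bound shrinks. The function below is a direct transcription of the formula as written here, with the constant factor exposed as a parameter c that defaults to 1. The name and the bits-per-second unit are choices made for this sketch.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_prob: float, c: float = 1.0) -> float:
    """Approximate Reno throughput bound in bits per second."""
    if not 0 < loss_prob <= 1:
        raise ValueError("loss_prob must be in (0, 1]")
    return c * (mss_bytes * 8 / rtt_s) / math.sqrt(loss_prob)
```

With a 1460-byte MSS, 100 ms RTT, and 0.01% loss, the bound comes out at under 12 Mbit/s, however fast the underlying link is.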

This is why transcontinental data transfer, satellite links, crowded WiFi, and mobile radio networks can make TCP feel "mysteriously slow" even when bandwidth on paper looks excellent.

11. MSS, MTU, and Why Packet Size Is a Performance Variable

Another quiet constraint on TCP performance is segment size.

The TCP layer talks about MSS, Maximum Segment Size. The IP layer talks about MTU, Maximum Transmission Unit. They are related but not identical.

If an Ethernet path supports a 1500-byte MTU, an IPv4 packet carrying a standard TCP header usually leaves:

1500 - 20 bytes IPv4 header - 20 bytes TCP header = 1460 bytes MSS

With IPv6, the base header is 40 bytes, so the common MSS becomes:

1500 - 40 bytes IPv6 header - 20 bytes TCP header = 1440 bytes MSS

As a result, 1460 shows up constantly in packet captures on IPv4 Ethernet networks.

Why Bigger Segments Usually Help

Every packet carries overhead:

  • Ethernet framing
  • IP header
  • TCP header
  • interrupt and DMA cost on the NIC
  • per-packet work in the kernel and router forwarding path

If you send 1 MiB (1,048,576 bytes) of application data using 1460-byte segments, you need 719 segments. If you send the same data in tiny 200-byte segments, you need over 5,000. The total payload is the same, but the per-packet overhead explodes.

Larger segments therefore improve efficiency:

  • fewer headers per byte of useful data
  • fewer ACKs
  • less CPU overhead
  • better wire efficiency
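The MSS arithmetic and segment counts in this section can be checked with a few helpers. The names are illustrative. The 719 figure corresponds to taking 1 MB as 1,048,576 bytes, and the 40 bytes of per-segment header assumes IPv4 and TCP with no options.

```python
def mss_from_mtu(mtu: int, ip_header: int, tcp_header: int = 20) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - ip_header - tcp_header


def segments_needed(total_bytes: int, mss: int) -> int:
    """Number of segments required, rounding up for the final partial one."""
    return -(-total_bytes // mss)


def header_bytes(total_bytes: int, mss: int, per_segment_headers: int = 40) -> int:
    """Total IP + TCP header bytes spent carrying total_bytes of payload."""
    return segments_needed(total_bytes, mss) * per_segment_headers
```

Comparing header_bytes(1048576, 1460) with header_bytes(1048576, 200) shows the header cost alone growing by roughly a factor of seven, before counting per-packet CPU work.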

Path MTU Discovery

The sender wants to avoid IP fragmentation because fragmentation is slow, fragile, and hostile to performance. Instead, it tries to discover the path MTU.

In classical IPv4 Path MTU Discovery:

  1. The sender transmits packets with DF, Don't Fragment, set.
  2. If a router along the path has a smaller outgoing MTU, it drops the packet and sends back ICMP "Fragmentation Needed".
  3. The sender lowers its packet size accordingly.

This is elegant when ICMP works and maddening when firewalls drop the ICMP messages. Then you get a Path MTU black hole: large packets disappear, no useful feedback comes back, and the connection stalls in ways that look mysterious at the application layer.

Modern stacks often use Packetization Layer PMTU Discovery, PLPMTUD, to probe packet sizes more defensively at the transport layer rather than blindly trusting ICMP behaviour.

Offload Complicates Captures

On a modern Linux server, packet captures can be deceptive because of TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO). The kernel may pass a very large chunk, say 32 KB, down the stack as a single unit, and the NIC (with TSO) or a late stage of the kernel (with GSO) then splits it into wire-sized segments.

So in a host-level capture you might see what looks like a giant TCP segment, even though the actual wire traffic is still normal MSS-sized frames. This has confused generations of engineers reading tcpdump traces for the first time on modern hardware.

12. The Sender and Receiver Are Full of Buffers

Application developers often think of a TCP connection as a direct pipe between two processes. In reality several buffers sit between the application and the wire.

On the send side, data usually passes through:

  1. user-space application buffer
  2. kernel socket send buffer
  3. kernel packet queues and qdisc
  4. NIC transmit rings
  5. actual wire

On the receive side:

  1. NIC receive rings
  2. kernel network stack
  3. kernel socket receive buffer
  4. user-space application buffer

This layering matters because latency and throughput problems often come from queueing, not raw link speed.

Send Buffer Behaviour

When an application calls send(), the data is typically copied into the kernel's send buffer. The call returning does not mean the peer has received the data. It usually means the kernel accepted responsibility for trying to send it.

If the send buffer fills because the network is slower than the application's write rate, later send() calls may:

  • block
  • return EAGAIN on a non-blocking socket
  • partially write the requested data

Applications that assume a single send() writes the entire buffer are subtly broken.
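A correct sender loops until everything has been handed to the kernel. The sketch below shows the loop explicitly for a blocking socket. In Python, socket.sendall() already does this, so the hand-written version (with an illustrative name) is only for exposition.

```python
import socket


def send_all(sock: socket.socket, data: bytes) -> int:
    """Keep calling send() until every byte has been accepted by the kernel."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # send() may accept only part of what it is given.
        sent = sock.send(view[total:])
        if sent == 0:
            raise ConnectionError("socket connection broken")
        total += sent
    return total
```

On a non-blocking socket the same loop would also need to handle BlockingIOError and wait for writability before retrying.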

Receive Buffer Behaviour

Likewise, recv() returns whatever is currently available up to the requested size. It does not wait politely until your "logical message" is complete unless your application protocol defines that logic on top.

If the application reads too slowly, the receive buffer fills, the advertised window shrinks, and eventually the sender is flow-controlled.

Bufferbloat

Big buffers are not automatically good. Excessively deep queues can keep throughput high while making latency terrible. This is bufferbloat.

Imagine a home router with a bloated outbound queue on an uplink in Athens. A large upload fills the queue with hundreds of milliseconds of packets. TCP still works. The throughput may even look fine. But interactive traffic such as SSH keystrokes, VoIP, and DNS lookups now sits behind a wall of queued packets.

This is why modern queue management schemes like fq_codel and CAKE matter. They try to control queueing delay rather than simply maximising buffer occupancy.

13. Nagle, Delayed ACKs, and the Tinygram Problem

TCP performance is not only about huge transfers. Small interactive writes have their own pathologies.

Nagle's Algorithm

John Nagle's algorithm tries to reduce the number of tiny segments. In simplified form:

  • if there is unacknowledged data in flight
  • and the application writes only a small amount
  • buffer that small write until either an ACK arrives or enough data accumulates for a full MSS-sized segment

This helps efficiency for workloads that would otherwise send a storm of tiny packets.

Delayed ACK

Receivers often do not ACK every segment immediately. They may intentionally wait a short time, commonly tens of milliseconds and on some stacks up to about 200 ms (RFC 1122 caps the delay at 500 ms), hoping to:

  • piggyback the ACK on outgoing data
  • acknowledge multiple segments together

This reduces pure ACK traffic.

The Bad Interaction

Now combine them:

  1. Application sends a tiny write.
  2. Nagle sees unacknowledged data and wants to wait.
  3. Receiver uses delayed ACK and also wants to wait.

Each side is waiting for the other, so nothing moves until the delayed-ACK timer fires. You can end up with avoidable latency spikes on interactive protocols. This is why applications like SSH, gaming servers, RPC systems, and databases often disable Nagle with:

setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, ...)

This is not because Nagle was a bad idea. It is because the optimal behaviour depends heavily on the traffic pattern. Bulk transfer and interactive command-response traffic want different things.
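In Python the same option is set through the socket module. The helper name is illustrative.

```python
import socket


def disable_nagle(sock: socket.socket) -> None:
    """Send small writes immediately instead of coalescing them."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

With Nagle off, the application takes on responsibility for batching: many tiny writes now become many tiny packets, so it usually pays to assemble a full request in user space and send it in one call.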

14. Connection Establishment Under Load: SYN Queues, Backlogs, and Floods

A quiet but important part of TCP is what happens before the connection is fully established.

When a server receives a SYN, it cannot yet trust that the client is real. If it allocated full connection state for every SYN immediately, an attacker could exhaust memory with forged connection attempts. This is the classic SYN flood problem.

Modern TCP stacks therefore split state into stages:

  • a SYN queue or half-open queue for connections in progress
  • an accept queue for fully established connections waiting for the application to accept()

SYN Backlog

If a server receives SYNs faster than it can complete handshakes, the SYN backlog fills. New connection attempts may be dropped or handled defensively.

SYN Cookies

One clever defence is the SYN cookie. Instead of storing full state on receipt of the SYN, the server encodes essential state into the sequence number of the SYN-ACK. When the client replies with the final ACK, the server reconstructs the state from the cookie and only then allocates the connection.

This is an elegant hack:

  • it protects against state exhaustion
  • it keeps the service reachable under flood conditions

But it also has tradeoffs. Some TCP options cannot be handled as flexibly under SYN cookie operation because the server intentionally avoided storing rich pre-connection state.
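The idea can be illustrated with a toy cookie. This is not the layout used by any particular kernel: the bit split (5 bits of a coarse time counter, 3 bits of MSS index, 24 bits of keyed hash), the MSS table, the use of HMAC-SHA256, and all names are assumptions made for the sketch.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(16)             # assumption: a server-side secret
MSS_TABLE = [536, 1300, 1440, 1460]  # hypothetical table of encodable MSS values


def _encode(four_tuple, client_isn, mss_index, counter, secret):
    msg = f"{four_tuple}|{client_isn}|{counter}".encode()
    mac24 = int.from_bytes(hmac.new(secret, msg, hashlib.sha256).digest()[:3], "big")
    return ((counter % 32) << 27) | (mss_index << 24) | mac24


def make_cookie(four_tuple, client_isn, client_mss, counter, secret=SECRET):
    """Build the server ISN for the SYN-ACK without storing any state."""
    candidates = [i for i, v in enumerate(MSS_TABLE) if v <= client_mss]
    mss_index = max(candidates) if candidates else 0
    return _encode(four_tuple, client_isn, mss_index, counter, secret)


def check_cookie(four_tuple, client_isn, ack_number, current_counter, secret=SECRET):
    """Validate the final ACK; return the recovered MSS, or None if invalid."""
    cookie = (ack_number - 1) % 2 ** 32  # the client acknowledges cookie + 1
    mss_index = (cookie >> 24) & 0x7
    if mss_index >= len(MSS_TABLE):
        return None
    # Accept cookies minted in the current or the previous time slot.
    for counter in (current_counter, current_counter - 1):
        if _encode(four_tuple, client_isn, mss_index, counter, secret) == cookie:
            return MSS_TABLE[mss_index]
    return None
```

The tradeoff mentioned above is visible here: only a coarse MSS survives the round trip, because three bits is all the cookie has room for.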

Accept Queue Pressure

Even if the TCP handshake succeeds, a busy application can still fall behind. If the kernel's accept queue fills because the process is not calling accept() fast enough, new fully established connections may be reset or dropped despite the network itself being healthy.

This is one reason overloaded services can show strange symptoms:

  • SYN packets arrive
  • SYN-ACKs go out
  • some clients still fail to connect

The bottleneck may be application accept backlog pressure rather than raw packet loss.

15. Reordering, Duplicate Packets, and Why TCP Must Be Conservative

Real networks do not always deliver segments in the order they were sent. Equal-cost multipath, load-balanced fabrics, route changes, retransmissions, and radio systems can all create packet reordering.

TCP therefore has to distinguish between:

  • a segment that is truly lost
  • a segment that is only late

That sounds easy until you realise the sender never sees "late". It only sees ACK patterns and timers.

If TCP retransmits too aggressively, it wastes bandwidth and may falsely interpret harmless jitter as congestion. If it retransmits too slowly, recovery drags on and the application stalls.

This is why mechanisms like:

  • duplicate ACK thresholds
  • SACK scoreboards
  • timestamps
  • RACK and related modern loss detection ideas

exist in modern stacks. The sender is doing statistical inference on an imperfect signal.

The network is not telling TCP what happened. TCP is guessing, using evidence.

16. Observability: How Engineers Actually See TCP Behaviour

When TCP misbehaves, engineers do not usually begin with application logs. They begin by asking:

  • what are the sequence numbers doing?
  • what is the RTT doing?
  • is cwnd collapsing?
  • is the receive window closing?
  • are there retransmissions, duplicate ACKs, or resets?

On Linux, common tools include:

ss -ti dst 203.0.113.10

which can show:

  • congestion control algorithm
  • RTT estimates
  • retransmission counters
  • cwnd
  • rto

And packet captures with:

tcpdump -i eth0 -nn host 203.0.113.10 and tcp

In Wireshark, the most useful first clues are often:

  • repeated duplicate ACKs
  • TCP retransmission markers
  • zero-window announcements
  • out-of-order segments
  • resets

At scale, operators also rely on kernel counters and telemetry:

  • retransmission rates
  • SYN backlog overflows
  • listen queue drops
  • tail RTT
  • throughput versus loss correlation

TCP debugging is often less about one brilliant insight and more about finding which of several feedback loops is going wrong.

17. Modern TCP Is Really a Family of Algorithms

When people say "TCP", they often mean the wire protocol. But performance depends heavily on which congestion control algorithm the operating system uses.

Reno

The classic mental model. AIMD, fast retransmit, fast recovery. Good for teaching. No longer state of the art.

CUBIC

The default on Linux for years. CUBIC grows the congestion window according to a cubic function of time since the last loss event. It is designed to perform better on high-bandwidth, high-latency links than Reno while remaining reasonably fair to Reno flows.

BBR

Google's BBR takes a very different approach. Instead of treating packet loss as the primary signal, it tries to model:

  • bottleneck bandwidth
  • minimum RTT

Then it paces traffic around the estimated bandwidth-delay product rather than simply probing with loss-driven AIMD. On some paths BBR massively improves throughput and latency. On others it creates fairness issues or interacts awkwardly with queueing disciplines and other algorithms.

This matters because two machines both "using TCP" may behave very differently depending on whether the sender is running Reno, CUBIC, BBR, or something else.

On Linux you can inspect the current algorithm with:

sysctl net.ipv4.tcp_congestion_control

And see what is available:

sysctl net.ipv4.tcp_available_congestion_control
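The algorithm can also be read per socket. The sketch below assumes a Linux-style TCP_CONGESTION socket option, which Python exposes only on platforms that have it, and returns None where it is missing or rejected. The function name is illustrative.

```python
import socket


def socket_congestion_control(sock: socket.socket):
    """Return the congestion control algorithm name for a socket, or None."""
    option = getattr(socket, "TCP_CONGESTION", None)
    if option is None:
        return None  # platform does not expose the option
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, option, 16)
    except OSError:
        return None
    # The kernel returns a NUL-padded name such as b"cubic".
    return raw.split(b"\x00", 1)[0].decode() or None
```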

18. The Limits of TCP

TCP solved a huge problem, but it carries design assumptions from another era.

It assumes the network mostly drops packets because of congestion. That is imperfect on wireless links.

It exposes head-of-line blocking at the byte-stream level. If one segment is missing, later bytes cannot be delivered to the application, even if they have arrived. This is part of why HTTP/3 moved to QUIC over UDP, where multiple streams can be managed in user space without transport-level head-of-line blocking between them.

It lives largely in the kernel, which makes rapid iteration harder. New TCP ideas often take years to spread through operating systems, NIC offload stacks, NAT devices, and firewalls.

And yet TCP is still everywhere because the abstraction is so useful. Databases, SSH, SMTP, HTTPS, VPNs, message queues, and countless internal services all depend on it. The protocol is old, but the engineering inside it is anything but simplistic.

19. Why QUIC Exists Without Making TCP Obsolete

It is tempting to look at QUIC and conclude that TCP is finished. That is not the right conclusion.

QUIC exists because it solves several problems TCP is structurally awkward at solving:

  • user-space deployability
  • connection migration across IP changes
  • stream multiplexing without cross-stream head-of-line blocking
  • tighter integration with modern TLS

But TCP still wins in many environments:

  • mature kernel implementations
  • extremely broad middlebox compatibility
  • ubiquitous operational tooling
  • decades of tuning in databases, storage, RPC systems, and internet services

QUIC is not a verdict that TCP failed. It is evidence that transport needs changed. TCP remains the default reliable transport for an enormous amount of infrastructure because it is stable, understood, and very hard to replace completely.

The more honest view is that TCP became the baseline that newer transports have to justify moving away from. That is a sign of how successful the design was, not a sign that it stopped mattering.

20. NAT, Firewalls, and the Middleboxes That Distort TCP

TCP was specified end to end. The modern internet is not.

Between client and server, packets often cross:

  • home NAT devices
  • carrier-grade NAT
  • stateful firewalls
  • L4 load balancers
  • proxies
  • intrusion-prevention appliances

Each of those devices maintains state about the flow. That state usually assumes a connection has a beginning, a middle, and an end. If the state expires too early, the TCP connection breaks even if both endpoints were perfectly happy.

Idle Timeouts

A common example is the idle timeout. An SSH connection might be quiet for several minutes while you read logs. The endpoints still consider the connection open. A NAT device in the middle may disagree and reclaim the mapping after 60 or 300 seconds of inactivity. The next packet from one endpoint then arrives at a middlebox that no longer remembers the flow, and the connection is reset or black-holed.

This is why keepalives exist:

  • TCP keepalive at the socket layer
  • application-layer heartbeats in protocols like AMQP, MQTT, and WebSocket stacks

Strictly speaking these are not part of the data transfer itself. They are maintenance packets sent to keep middleboxes convinced that the flow still exists.

Sequence Number Scrubbing and MSS Clamping

Some middleboxes also rewrite TCP header fields and options in flight. Some firewalls rewrite sequence numbers, for example to randomize the initial sequence numbers of hosts behind them. VPN concentrators and PPPoE gateways often clamp MSS values to avoid downstream fragmentation. Firewalls may strip options they do not understand. This can turn a well-tuned connection into a subtly degraded one without either endpoint being fully aware of what changed.
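The arithmetic behind MSS clamping is small enough to sketch. This assumes IPv4 with no IP or TCP options, and the function names are hypothetical. The 1492-byte figure is the usual PPPoE MTU.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def clamp_mss(advertised_mss, path_mtu):
    """What a clamping gateway would write into the SYN's MSS option."""
    return min(advertised_mss, mss_for_mtu(path_mtu))


ethernet_mss = mss_for_mtu(1500)           # 1460 on plain Ethernet
pppoe_mss = clamp_mss(ethernet_mss, 1492)  # 1452 once PPPoE takes its 8 bytes
```

The endpoint advertised 1460 and the peer saw 1452. Neither side was told, which is exactly the kind of silent change described above.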

The ugly truth is that a lot of transport engineering today is not just about how endpoints behave. It is about how much damage the path's middleboxes do to the transport assumptions.

21. Pacing, Offload, and Why the Wire Does Not Look Like the Socket API

On modern systems, the kernel and NIC try hard to make TCP efficient at very high throughput. That creates another gap between the abstract protocol and what engineers actually observe.

Pacing

Classic TCP often emitted bursts because ACKs opened the window in chunks. Modern stacks increasingly use pacing: instead of dumping a large batch of segments immediately when cwnd allows it, the sender spaces them over time according to an estimated rate.

Pacing matters because it:

  • smooths bursts
  • reduces queue spikes
  • works better with modern queue disciplines like fq
  • improves behaviour on high-bandwidth paths

BBR in particular depends heavily on pacing because its whole model is built around sending at the estimated bottleneck rate, not just probing with bursty window growth.
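As a rough illustration, the simplest pacing rule spreads one congestion window evenly over one round trip. The sketch below assumes exactly that rule, and the function names are hypothetical. Real stacks apply scaling factors on top, and BBR derives its rate from a bandwidth estimate rather than from cwnd.

```python
def pacing_interval(cwnd_bytes, rtt_seconds, mss_bytes=1460):
    """Seconds to wait between MSS-sized segments so one cwnd spans one RTT."""
    if cwnd_bytes <= 0 or rtt_seconds <= 0:
        raise ValueError("cwnd and RTT must be positive")
    pacing_rate = cwnd_bytes / rtt_seconds  # bytes per second
    return mss_bytes / pacing_rate


def paced_schedule(cwnd_bytes, rtt_seconds, mss_bytes=1460):
    """Send offsets (seconds from now) for each full segment in the window."""
    gap = pacing_interval(cwnd_bytes, rtt_seconds, mss_bytes)
    return [i * gap for i in range(cwnd_bytes // mss_bytes)]


# Ten segments of window on a 100 ms path: one segment roughly every 10 ms
# instead of ten segments back to back.
schedule = paced_schedule(cwnd_bytes=10 * 1460, rtt_seconds=0.1)
```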

Segmentation and Receive Offloads

NICs and the kernel's network stack also batch work, in hardware or in software depending on the feature:

  • TSO, TCP Segmentation Offload
  • GSO, Generic Segmentation Offload
  • GRO, Generic Receive Offload
  • LRO, Large Receive Offload on some systems

These features reduce per-packet CPU overhead by grouping work. On transmit, a large buffer may be handed to the NIC and split later into MSS-sized wire packets. On receive, multiple adjacent segments may be coalesced before the kernel hands data up the stack.

This is wonderful for throughput. It is also why packet captures can be misleading:

  • captures near the socket path may show giant pseudo-segments that never existed on the wire
  • captures on a mirror port may show the real wire segmentation

An engineer debugging TCP who does not know where in the stack the capture was taken can easily misunderstand what the sender actually transmitted.
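The split and merge can be pictured with a small sketch. It models only the payload slicing, with hypothetical function names. Real offloads also replicate headers, advance sequence numbers, and recompute checksums.

```python
def segment(buffer, mss):
    """Transmit side: slice one large buffer into MSS-sized wire payloads."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]


def coalesce(segments):
    """Receive side: merge adjacent in-order payloads into one large buffer."""
    return b"".join(segments)


big_write = bytes(64 * 1024)                 # what a capture near the socket may show
wire_segments = segment(big_write, 1460)     # what a mirror port would show
handed_up = coalesce(wire_segments)          # what the receiving stack may see again
```

A single 64 KiB buffer becomes 45 wire payloads here, the last one shorter than the rest. Which of those two views a capture shows depends on where it was taken.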

Why This Matters Operationally

At low speeds, these details hardly matter. At 10, 40, or 100 Gbit/s, they are the difference between one core being overwhelmed and the stack staying efficient. Modern TCP performance is partly a transport story and partly a kernel-plus-NIC co-design story.

24. Connection Reuse and Why Short-Lived TCP Is Expensive

A lot of application-level TCP performance comes down to one decision: do you keep connections around, or do you constantly create and destroy them?

Short-lived connections are expensive because each one pays for:

  • a handshake
  • kernel state allocation
  • congestion window warm-up
  • possible TLS setup on top
  • teardown and often TIME_WAIT

This is why modern systems lean so hard on:

  • HTTP keep-alive
  • HTTP/2 multiplexing
  • gRPC channel reuse
  • database connection pools

Connection reuse lowers latency, reduces control-plane churn, and makes better use of the transport state the connection has already learned.

Pooling Has Tradeoffs Too

Of course, long-lived connections create their own problems:

  • idle timeout interactions with NAT and firewalls
  • stale sessions after topology changes
  • uneven load if pools are too sticky
  • memory overhead from too many idle sockets

So the real question is not whether to reuse connections at all. The real question is how aggressively to reuse them while still handling idleness, rebalancing, and failure cleanly.
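One way to handle the idleness part is to have the pool discard connections that have been parked for too long, on the assumption that a middlebox may already have dropped their state. The sketch below is a minimal, single-threaded illustration. ConnectionPool, FakeConnection, and the 30-second limit are all hypothetical, and the injected clock exists only to make the behaviour easy to demonstrate.

```python
import time
from collections import deque


class ConnectionPool:
    """Reuses connections, but drops any that sat idle longer than max_idle_seconds."""

    def __init__(self, connect, max_idle_seconds=30.0, clock=time.monotonic):
        self._connect = connect        # callable that opens a new connection
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._idle = deque()           # (connection, time it was returned)
        self.created = 0

    def acquire(self):
        while self._idle:
            conn, returned_at = self._idle.pop()  # most recently used first
            if self._clock() - returned_at <= self._max_idle:
                return conn
            conn.close()  # idle too long, so do not trust the path state
        self.created += 1
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))


class FakeConnection:
    """Stand-in for a socket so the sketch runs without a network."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


now = [0.0]  # fake clock, advanced by hand
pool = ConnectionPool(FakeConnection, max_idle_seconds=30.0, clock=lambda: now[0])

a = pool.acquire()
pool.release(a)
now[0] = 10.0
b = pool.acquire()   # idle for 10 s, reused
pool.release(b)
now[0] = 100.0
c = pool.acquire()   # idle for 90 s, closed and replaced
```

With real sockets the connect callable would wrap something like socket.create_connection. A production pool would also need locking, a size cap, and a health check before handing a connection out.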

25. Ephemeral Ports and the Limits of Connection Churn

The client side does not have infinite source ports. Outbound connections usually come from the ephemeral port range. Under heavy churn, especially toward the same destination, that finite range becomes a real constraint.

This is one reason connection reuse matters operationally, not only for latency. Without reuse, a busy client can create avoidable pressure on:

  • ephemeral ports
  • TIME_WAIT
  • kernel bookkeeping
  • NAT state
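
A back-of-the-envelope calculation shows how quickly the range runs out. The sketch assumes the common Linux defaults of an ephemeral range of 32768 to 60999 and a 60-second TIME_WAIT, a single destination address and port, and no port reuse. The function name is hypothetical.

```python
def max_new_connections_per_second(port_low, port_high, time_wait_seconds):
    """Sustained connection rate to one destination before ports run out,
    if every closed connection holds its source port for the full TIME_WAIT."""
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_seconds


usable = 60999 - 32768 + 1                                      # 28232 ports
rate = max_new_connections_per_second(32768, 60999, 60)         # about 470 per second
```

A few hundred new connections per second toward one backend is not an exotic load, which is why churn-heavy clients hit this limit in practice.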

At scale, TCP behaviour is not just a property of the transport itself. It is a property of how the application manages connection lifecycle above it.

26. The ACK Clock and Why TCP Feels Self-Timed

One of the most elegant aspects of TCP is that the return path helps pace the forward path. ACKs do not only confirm delivery. They also act as a timing signal.

In a healthy flow:

  1. sender transmits data
  2. receiver generates ACKs
  3. ACKs arrive spaced by path behaviour and receiver behaviour
  4. the sender uses that rhythm to release more data

This is often called the ACK clock. It is one of the reasons TCP can stabilize itself on many paths without any central scheduler.

The mechanism matters because when the ACK clock is disrupted, TCP gets worse quickly. That can happen because of:

  • loss
  • ACK compression in the reverse path
  • middleboxes distorting timing
  • delayed ACK behaviour interacting badly with the sender

Once the sender loses a clean sense of pacing, bursts and inefficient recovery become much more likely.
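The self-timing effect can be shown with a toy model: a sender emits a burst, a bottleneck link serializes it, and each returning ACK releases one new segment. Everything here is an assumption made for illustration, including the function names, the fixed return delay, and the 1200-microsecond service time (roughly one 1500-byte packet at 10 Mbit/s).

```python
def bottleneck_departures(arrivals_us, service_us):
    """Times at which each packet finishes crossing a link that sends one at a time."""
    departures = []
    free_at = 0
    for arrival in arrivals_us:
        start = max(arrival, free_at)  # wait if the link is still busy
        free_at = start + service_us
        departures.append(free_at)
    return departures


def ack_clocked_sends(burst_size, service_us, return_delay_us):
    """Send times of the next round when each ACK releases exactly one segment."""
    burst = [0] * burst_size  # the whole burst hits the bottleneck at t=0
    departures = bottleneck_departures(burst, service_us)
    return [d + return_delay_us for d in departures]  # ACK arrival = next send


next_sends = ack_clocked_sends(burst_size=4, service_us=1200, return_delay_us=20000)
gaps = [b - a for a, b in zip(next_sends, next_sends[1:])]
```

The first round left the sender as a burst, but the second round leaves spaced at the bottleneck's service time. The path has imposed its own rhythm on the sender, and anything that bunches the ACKs together destroys that spacing.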

27. Silly Window Syndrome and Why Tiny Writes Hurt Twice

TCP has a classic pathological case known as Silly Window Syndrome. It appears when the system keeps sending or advertising tiny amounts of data repeatedly instead of waiting for a more useful chunk.

That is bad because:

  • header overhead rises sharply
  • ACK traffic becomes less efficient
  • CPU work per useful byte increases

This is another reminder that transport efficiency depends heavily on granularity. An application that emits endless tiny writes can create ugly transport behaviour even on an otherwise healthy network.
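The header overhead is easy to quantify. The sketch below assumes 40 bytes of IPv4 and TCP headers with no options, and assumes every write becomes its own segment, which is the worst case (for example with coalescing disabled). The function names are hypothetical.

```python
HEADER_BYTES = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def wire_efficiency(payload_bytes):
    """Fraction of each packet that is application data."""
    return payload_bytes / (payload_bytes + HEADER_BYTES)


def wire_bytes(total_payload, write_size):
    """Bytes on the wire if each write of write_size becomes one segment."""
    segments = -(-total_payload // write_size)  # ceiling division
    return total_payload + segments * HEADER_BYTES


tiny = wire_bytes(1460, 1)      # 1460 one-byte segments
full = wire_bytes(1460, 1460)   # one full-sized segment
```

Sending the same 1460 bytes one byte at a time costs 59860 bytes on the wire instead of 1500, before counting the extra ACKs and per-packet CPU work.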

28. Keepalives, Half-Open Connections, and the Problem of Silent Failure

TCP is good at handling explicit closes. It is worse at detecting peers that disappeared without a proper teardown.

Imagine a server process crashes, a firewall state entry vanishes, or a NAT device forgets the mapping. One endpoint may still believe the connection is open because no FIN or RST was received. The connection is now half-open from the application's perspective.

This is where keepalives matter.

TCP Keepalive

The kernel can periodically send probes on an otherwise idle connection. If the peer responds, the connection is still alive. If not, the kernel eventually marks it dead.

This is useful, but the defaults are often very conservative. On Linux the first probe is sent only after two hours of idleness by default (tcp_keepalive_time is 7200 seconds) unless tuned. For many application protocols that is far too slow.
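Per-socket tuning is usually a few setsockopt calls. The sketch below uses Python's socket module. The helper name and the 60/10/5 values are illustrative assumptions, not recommendations, and the three TCP_KEEP* options are platform-specific, so each one is guarded.

```python
import socket


def enable_keepalive(sock, idle_seconds=60, interval_seconds=10, probes=5):
    """Turn on TCP keepalive and, where the platform allows, tighten its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux exposes all three knobs; other platforms may expose only some of them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
sock.close()
```

With these example values a silent peer would be detected after roughly 60 + 10 × 5 = 110 seconds of idleness rather than hours.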

Application Heartbeats

For that reason, many systems implement their own liveness checks at a higher layer:

  • WebSocket ping/pong
  • AMQP heartbeats
  • MQTT keepalive intervals
  • custom RPC heartbeat frames

Application heartbeats are often faster and semantically richer than bare TCP keepalive because they can express not only "the socket is alive" but also "the remote service loop is still healthy."
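The bookkeeping behind an application heartbeat is simple. The sketch below is a hypothetical monitor that declares the peer dead after a configurable number of missed intervals. It assumes the surrounding protocol calls record_pong whenever a heartbeat reply arrives, and the injected clock is only there to make the example deterministic.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat reply and decides whether the peer still looks alive."""

    def __init__(self, interval_seconds, missed_allowed=2, clock=time.monotonic):
        # Tolerate a few lost heartbeats before giving up on the peer.
        self._timeout = interval_seconds * (missed_allowed + 1)
        self._clock = clock
        self._last_seen = clock()

    def record_pong(self):
        self._last_seen = self._clock()

    def is_alive(self):
        return self._clock() - self._last_seen < self._timeout


now = [0.0]  # fake clock, advanced by hand
monitor = HeartbeatMonitor(interval_seconds=5, missed_allowed=2, clock=lambda: now[0])

now[0] = 12.0
alive_at_12 = monitor.is_alive()   # 12 s of silence is within the 15 s budget
monitor.record_pong()
now[0] = 30.0
alive_at_30 = monitor.is_alive()   # 18 s of silence exceeds it
```

Choosing the interval is the design decision. It has to be shorter than the shortest idle timeout on the path, and short enough that the application notices failure as fast as it needs to.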

Why This Matters

At scale, silent half-open connections waste resources:

  • file descriptors
  • memory
  • connection-pool slots
  • load-balancer state

They also create confusing failure patterns where one side thinks the link is fine until it tries to use it again. TCP gives you the tools to detect this eventually. Good application design decides how quickly "eventually" needs to happen.

28. TCP Is Also an Operational Interface

One reason TCP lasted so long is that it is not only a wire protocol. It is also an operational interface that engineers understand deeply. People know how to:

  • capture it
  • graph it
  • tune it
  • debug it
  • reason about failures through it

That accumulated operational literacy matters. A technically better transport does not replace TCP automatically if it lacks the same depth of real-world tooling and intuition around it.

29. TCP Is Still the Default For a Reason

Engineers still choose TCP constantly because the tradeoffs are understood, the tooling is mature, and the abstraction is strong enough to support almost everything from databases to HTTPS to SSH. That maturity is itself a major technical advantage.

30. What Application Engineers Usually Get Wrong About TCP

Many application problems blamed on "the network" are really mismatches between application behaviour and TCP's model.

Common examples include:

  • assuming one send() becomes one recv()
  • emitting endless tiny writes and then wondering where latency came from
  • setting aggressive timeouts without understanding retransmission behaviour
  • pooling too few or too many connections
  • ignoring backpressure until memory usage spikes

TCP does not preserve message boundaries. It does not promise that latency is stable from write to write. It does not make slow readers harmless. It does not rescue an application that treats every request as if a brand new connection is free.
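The first mistake on that list has a standard fix: put an explicit frame around every message and loop on recv() until the whole frame has arrived. The sketch below assumes a four-byte big-endian length prefix, which is one common convention rather than the only one. The helper names are hypothetical, and socket.socketpair() stands in for a real TCP connection because it offers the same stream semantics.

```python
import socket
import struct


def send_message(sock, payload):
    """Write one frame: 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, however the stream happens to be chopped up."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read one frame and return its payload."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


left, right = socket.socketpair()  # connected stream sockets, no network needed
send_message(left, b"HELLO")
send_message(left, b"WORLD")
first = recv_message(right)
second = recv_message(right)
left.close()
right.close()
```

However the kernel batches or splits the bytes, the receiver gets HELLO and WORLD back as two messages. A real protocol would also cap the accepted length so a corrupt prefix cannot trigger a huge allocation.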

This matters because a lot of distributed-system design quietly depends on assumptions about transport. If you understand how TCP batches, delays, retransmits, and eventually gives up, your application behaviour becomes easier to explain. If you do not, the same production incident gets rediscovered under a different name every six months.

31. Reliability Is Never Free

TCP's biggest gift to applications is that it hides a lot of failure handling. Its biggest danger is that this makes people forget the cost of that hiding.

Reliable delivery means buffering, retransmission, timeout estimation, head-of-line blocking, and a lot of careful conservatism when the path looks strange. Those mechanisms are why TCP is so useful, but they are also why it can feel unexpectedly slow under loss, jitter, or queue buildup. The protocol is spending time and state to preserve the abstraction.

That is the deeper tradeoff. TCP does not remove network uncertainty. It absorbs some of that uncertainty so the application can pretend the socket is simpler than the network really is.

32. How Engineers Usually Diagnose TCP Problems

In practice, TCP debugging is often a search for the first place where the abstraction stops matching reality.

Engineers usually look for:

  • retransmissions rising faster than throughput
  • receive windows collapsing because the application is not reading fast enough
  • handshakes completing slowly, often visible as retransmitted SYNs, because a middlebox or remote service is overloaded
  • RTT inflation caused by queueing rather than path length
  • resets or idle timeouts introduced by firewalls or load balancers

Packet captures remain so useful for exactly this reason. A capture shows whether the problem is loss, delay, reordering, tiny writes, window pressure, or a peer that stopped responding cleanly. Metrics can tell you that a request got slow. TCP traces often tell you why.
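Before reaching for a capture, host-wide counters can give a quick first signal. On Linux, /proc/net/snmp exposes a pair of Tcp: lines, one with counter names and one with values. The sketch below parses that format and computes retransmitted segments as a share of segments sent. The function name is hypothetical, and the sample text contains made-up numbers purely to exercise the parser.

```python
def tcp_retransmit_ratio(snmp_text):
    """Return RetransSegs / OutSegs from text in /proc/net/snmp format."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(names, (int(v) for v in values)))
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0


# Illustrative sample only: the values below are invented, not measured.
SAMPLE = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 100 50 2 3 10 9000 10000 250 0 5 0\n"
)
ratio = tcp_retransmit_ratio(SAMPLE)
```

On a real host the input would come from reading /proc/net/snmp. The counters are cumulative since boot, so the useful number is the difference between two samples taken some seconds apart.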

This is also why transport literacy matters outside networking teams. Database engineers, backend engineers, and platform engineers all run into TCP behaviour whether they want to or not. The protocol is below the application, but its failure modes are often what shape the incident.

Once you have seen a few of these traces in production, a lot of "random network weirdness" stops looking random. It starts looking like a system with recognizable failure patterns, which is exactly what TCP has always been. Once you learn those patterns, transport problems become much easier to name and fix.

That familiarity is one of TCP's real advantages, and it compounds over time. Very few transports have accumulated that much shared operational memory, and that history matters in production.

What TCP Is Really Doing

When you reduce TCP to "reliable transport," you miss the real achievement. TCP is continuously solving several problems at once:

  • numbering a byte stream so order can be reconstructed
  • detecting which data did and did not arrive
  • retransmitting lost bytes without creating chaos
  • adapting send rate to receiver memory limits
  • adapting send rate to network congestion
  • estimating time in a system where clocks are noisy and delay is variable
  • shutting connections down without confusing old and new traffic

All of that happens so your application can pretend the network is just a file descriptor.

That pretense is one of the most successful lies in computer networking.