How Time Synchronization Actually Works
Every distributed system eventually runs into the same humiliating fact: clocks lie.
They do not lie because vendors are incompetent, or because quartz is mysterious, or because software teams forgot to call ntpd. They lie because physical clocks drift, networks add asymmetric delay, operating systems schedule work late, oscillators change frequency with temperature, and "what time is it?" turns out to be a much harder engineering question than most application developers assume.
If you have ever seen logs arrive out of order, Kerberos tickets fail after a VM resume, a distributed database reject a transaction because one node believed the future had already happened, or a telecom grandmaster advertise nanosecond precision while its antenna lost GNSS lock two hours earlier, you have seen time synchronisation as an operational problem rather than a textbook one.
Most people learn the simplified version first. A machine asks an NTP server for the time, the server replies, the machine adjusts its clock, and everyone more or less agrees. That summary is not wrong, but it hides the real machinery: stratum hierarchies, oscillator discipline loops, timestamp filtering, falseticker rejection, pulse-per-second inputs, hardware timestamping, boundary clocks, transparent clocks, grandmasters, holdover oscillators, and the quiet but decisive role GPS plays under nearly all of it.
The internet does not keep time because every machine owns a perfect clock. It keeps time because large numbers of imperfect clocks are continuously compared against a smaller number of better clocks, which are themselves tied to national metrology labs and, in many operational deployments, to GNSS receivers disciplined by atomic time scales. In Europe, institutions such as PTB in Braunschweig and the Observatoire de Paris sit close to the top of that chain. Downstream, data centres, exchanges, telco networks, and ordinary servers inherit their time from those references through multiple layers of protocol and hardware.
This article walks through that full stack. We will start with why quartz oscillators drift at all. Then we will follow the path from UTC and GPS time to NTP strata, the four-timestamp exchange, Marzullo-style interval reasoning, clock discipline, PPS wiring, and finally PTP and IEEE 1588, where sub-microsecond accuracy stops being a software trick and becomes a hardware design problem.
Timekeeping Starts with Bad Clocks
The awkward truth is that nearly every system starts from a mediocre local oscillator.
Your laptop, router, hypervisor, base station, and top-of-rack switch all have some kind of local timebase. In commodity equipment that timebase is usually a quartz crystal oscillator. Quartz is useful because it is cheap, small, stable enough for general electronics, and predictable enough to discipline. It is not useful because it is perfect.
A quartz crystal oscillates at a frequency determined by its physical dimensions and cut. That frequency shifts with temperature, age, mechanical stress, supply noise, and manufacturing tolerance. If a nominal 10 MHz oscillator is off by 20 parts per million, its actual frequency is:
10,000,000 Hz × 20 × 10^-6 = 200 Hz error
That sounds small until you convert it into time error. Twenty parts per million means the clock gains or loses about 20 microseconds per second. Over one minute that becomes 1.2 milliseconds. Over one hour it becomes 72 milliseconds. Over a day it becomes roughly 1.7 seconds.
For a human reading a wall clock, 1.7 seconds is trivial. For a TLS certificate validity check, high-frequency trading feed alignment, distributed trace correlation, or LTE frame timing reference, it is a disaster.
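The drift arithmetic is easy to reproduce. The following is a minimal sketch that assumes a constant frequency error, which real crystals do not have; the function name drift_seconds is illustrative.

```python
def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    """Accumulated time error of a free-running clock with a constant frequency error."""
    return ppm * 1e-6 * elapsed_seconds


# Reproduce the figures for a 20 ppm oscillator.
for label, seconds in [("second", 1), ("minute", 60), ("hour", 3600), ("day", 86400)]:
    error_ms = drift_seconds(20, seconds) * 1e3
    print(f"20 ppm over one {label}: {error_ms:.3f} ms")
```

Because the error grows linearly with elapsed time, halving the frequency error halves the drift over any interval.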
Higher-end systems improve on this with TCXOs, OCXOs, rubidium oscillators, or even cesium standards. A temperature-compensated crystal oscillator corrects part of the thermal drift. An oven-controlled oscillator keeps the crystal at a constant elevated temperature so environmental swings matter less. A rubidium clock uses atomic resonance to discipline a local oscillator and delivers far better long-term stability. But these are cost, size, and power tradeoffs, not magic escapes from physics.
Time synchronisation exists because free-running clocks diverge. The job of protocols such as NTP and PTP is not to create time from nowhere. Their job is to estimate how far a local clock has drifted from a better reference, then steer that local oscillator gently enough that the system becomes more correct instead of less stable.
UTC, TAI, GPS Time, and Why "True Time" Depends on Which Clock You Mean
Before talking about protocols, we need to define what the system is trying to follow.
There is no single universal timescale used by every engineering system in the same way. The most important ones here are TAI, UTC, and GPS time.
TAI, International Atomic Time, is the continuous atomic timescale produced by combining measurements from atomic clocks operated by national metrology institutes around the world. It does not contain leap seconds. It just counts forward uniformly.
UTC, Coordinated Universal Time, is the civil timescale people usually mean when they say "the current time". It is derived from TAI but adjusted with leap seconds so that it does not drift too far from mean solar time and the Earth’s actual rotation. UTC is therefore not perfectly continuous. Occasionally it inserts an extra second, historically at the end of June or December, when Earth rotation and atomic time diverge too much.
GPS time is another continuous atomic timescale. It started aligned with UTC on 1980-01-06, but unlike UTC it does not apply leap seconds. GPS time and UTC therefore differ by a whole number of seconds, and that number grows each time UTC inserts a leap second; it has stood at 18 seconds since the start of 2017. A GPS receiver usually provides both a continuous internal timescale and enough metadata to recover UTC.
This distinction matters operationally:
- Databases and logs often want UTC because it matches the civil world.
- Telecom equipment often prefers continuous timescales because leap seconds are annoying.
- GNSS receivers often track GPS time internally, then export UTC with an offset.
- PTP profiles in industrial and telecom environments may distribute TAI or UTC depending on the profile and deployment.
When someone says "synchronise to GPS", what they usually mean is "use a GNSS receiver as a traceable source of atomic time, then expose a PPS edge and time-of-day message from which UTC can be reconstructed".
That source still needs interpretation. A pulse on a coax cable has no timezone, no leap second semantics, and no date attached to it. The receiver, kernel, or grandmaster has to marry the second boundary to a timescale. Get that mapping wrong and you can have a beautifully stable clock that is exactly one leap-second error away from reality.
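The mapping from GPS time to UTC can be sketched in a few lines. This is an illustration, not receiver firmware: the function name gps_to_utc is hypothetical, the 18-second offset is hard-coded and only valid from 2017-01-01 until the next leap second, and a real receiver takes the current offset from the navigation message instead.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
GPS_MINUS_UTC_SECONDS = 18  # assumption: valid since 2017-01-01, changes at the next leap second


def gps_to_utc(gps_week: int, seconds_of_week: float,
               leap_offset: int = GPS_MINUS_UTC_SECONDS) -> datetime:
    """Convert a GPS week number and seconds-of-week to UTC using a fixed leap offset."""
    gps_elapsed = timedelta(weeks=gps_week, seconds=seconds_of_week)
    # GPS time runs ahead of UTC, so the leap offset is subtracted.
    return GPS_EPOCH + gps_elapsed - timedelta(seconds=leap_offset)


# GPS week 2048 began 18 seconds before midnight UTC on 2019-04-07.
print(gps_to_utc(2048, 0))
```

Passing the wrong leap_offset yields a timestamp that is perfectly stable and exactly one or more seconds wrong, which is the failure mode described above.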
National Labs Sit Above Most of the Internet's Time Tree
At the top of civil and scientific timekeeping sit national metrology institutes and observatories. In Europe, PTB in Braunschweig and the Observatoire de Paris are two of the institutions that matter.
PTB, Physikalisch-Technische Bundesanstalt, is Germany’s national metrology institute. It contributes atomic clock data into the international computation of TAI and UTC and maintains UTC(PTB), a real-time local realisation of UTC. The Observatoire de Paris plays a similar role in France with UTC(OP). These are not abstract acronyms. They are physical clock ensembles, measurement systems, transfer links, calibration procedures, and uncertainty budgets maintained by people whose entire job is to know how much time error remains after every correction.
No ordinary internet server talks directly to UTC(PTB) or UTC(OP). The hierarchy is more layered than that:
- National labs maintain local realisations of UTC.
- GNSS systems broadcast atomic timescales derived from satellite clocks and ground control.
- Reference clocks, timing receivers, and grandmasters in data centres and telecom networks lock to those signals.
- NTP or PTP servers distribute time downstream.
- Client machines discipline their local clocks from those servers.
The reason GPS becomes the practical root of so much internet time is that it is the most convenient globally available way to carry traceable atomic time to ordinary infrastructure. A laboratory can compare clocks over specialised links. A colocation rack in Frankfurt or Amsterdam usually cannot. What it can do is mount a GNSS antenna, run a cable to a timing receiver, and extract a PPS signal plus current time-of-day from the constellation overhead.
This is why "GPS is the real root of most internet time" is basically true in operations, even though the more formal chain ultimately points back to international atomic timescales and the labs that contribute to them.
NTP Was Built for the Real Internet, Not for Ideal Links
The Network Time Protocol was designed around a blunt fact: the open internet is noisy, asymmetric, bursty, and full of machines with very different quality clocks.
NTP does not assume a dedicated timing network. It assumes packet-switched links, changing delay, queueing noise, route asymmetry, and a broad population of clients. It therefore aims for robustness before absolute precision. On a good LAN it can deliver sub-millisecond time. Across a WAN it may land in the single-digit millisecond range. That is good enough for most servers, authentication systems, schedulers, and logs.
The classic mental model is the stratum hierarchy:
- Stratum 0: reference clocks. These are not network servers. They are atomic clocks, GNSS receivers, radio clocks, or other hardware references.
- Stratum 1: servers directly attached to a stratum 0 source, often via serial timecode, PPS, or a kernel PPS discipline path.
- Stratum 2: servers synchronised to stratum 1 servers.
- Stratum 3 and below: progressively further descendants.
Stratum is often misunderstood as accuracy. It is not a direct accuracy score. It is hop count from a reference clock. A bad stratum 1 server with a damaged antenna can be worse than a well-managed stratum 2 server receiving time from several healthy upstreams. NTP implementations therefore evaluate not only stratum but also root delay, root dispersion, peer jitter, and reachability.
The protocol tries to answer two separate questions:
- Which peers are believable?
- Given the believable peers, how should the local oscillator be steered?
Those are selection and discipline problems. They are related, but not identical.
The Four-Timestamp Exchange Is the Core Measurement
NTP’s basic measurement is elegant enough to fit on a whiteboard.
Suppose a client sends a request at local time t1. The server receives it at its time t2, transmits a reply at its time t3, and the client receives that reply at local time t4.
Those four timestamps allow the client to estimate two things:
round_trip_delay = (t4 - t1) - (t3 - t2)
clock_offset = ((t2 - t1) + (t3 - t4)) / 2
The delay expression subtracts the server processing time from the full elapsed client-side interval. The offset expression assumes the path delay is symmetric in both directions and then estimates how far the local clock differs from the server clock.
That symmetry assumption is the first place reality bites.
If the forward path delay and reverse path delay are different, the offset estimate is biased. Suppose the request takes 1 ms to travel outbound but the reply takes 5 ms to travel back because of queueing on the reverse path. The midpoint assumption silently interprets that asymmetry as clock error: the estimate is off by half the difference between the two one-way delays, here 2 ms.
This is why NTP accuracy depends not just on low latency but on stable, roughly symmetric latency. A transatlantic path with a very consistent 20 ms each way may produce better time than a nearby congested path oscillating between 1 ms and 9 ms asymmetrically.
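The two formulas and the asymmetry bias can be checked with a small simulation. This is a sketch: simulate_exchange is a hypothetical helper that fabricates the four timestamps from a known true offset, which a real client never knows.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard four-timestamp estimate: returns (clock_offset, round_trip_delay)."""
    delay = (t4 - t1) - (t3 - t2)
    offset = ((t2 - t1) + (t3 - t4)) / 2
    return offset, delay


def simulate_exchange(true_offset, forward_delay, reverse_delay, server_processing=0.0002):
    """Build t1..t4 for a server whose clock reads client time + true_offset (seconds)."""
    t1 = 100.0                                     # client clock at transmit
    t2 = t1 + forward_delay + true_offset          # server clock at receive
    t3 = t2 + server_processing                    # server clock at transmit
    t4 = t1 + forward_delay + server_processing + reverse_delay  # client clock at receive
    return ntp_offset_and_delay(t1, t2, t3, t4)


symmetric = simulate_exchange(true_offset=0.0024, forward_delay=0.003, reverse_delay=0.003)
asymmetric = simulate_exchange(true_offset=0.0024, forward_delay=0.001, reverse_delay=0.005)
print("symmetric  offset estimate:", symmetric[0])
print("asymmetric offset estimate:", asymmetric[0])
```

Both exchanges report the same 6 ms round trip, so the delay figure alone cannot reveal that the second estimate is 2 ms off.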
NTP does not solve asymmetry directly. It manages around it statistically. It polls repeatedly, keeps a sliding window of recent samples per peer, prefers the sample with the lowest delay, and treats high-delay or high-jitter samples as less trustworthy.
The protocol works because queueing delay is often additive noise while the underlying path minimum is relatively stable. If you keep taking samples, the minimum-delay samples tend to represent the moments with the least queueing contamination and therefore the least asymmetry error.
One Sample Is Worth Very Little, a Filtered History Is Worth Much More
A single NTP exchange is barely a measurement. It is one noisy observation.
Implementations therefore maintain a clock filter for each peer. In classical NTP this is an eight-stage shift register containing the most recent delay and offset samples. The peer process picks the sample with the lowest delay because that sample is most likely to reflect the least queueing distortion. It also computes peer jitter from the spread of the offsets.
This is a deceptively powerful trick. Network delay is usually positive-only noise. Routers and switches can add queueing delay, but they cannot create a packet that arrives earlier than the physical minimum path delay. So the minimum-delay sample acts as a lower envelope estimate of the real path.
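A minimal sketch of the minimum-delay idea follows. It is not the ntpd clock filter: the ClockFilter class, the sample values, and the simplified RMS jitter are assumptions for illustration. The samples model a true offset of 2.4 ms with queueing on the reverse path only, so each sample is biased by half its queueing delay.

```python
from collections import deque
from math import sqrt


class ClockFilter:
    """Keep the most recent (offset, delay) samples for one peer."""

    def __init__(self, stages=8):
        self.samples = deque(maxlen=stages)  # the oldest sample drops off automatically

    def add(self, offset, delay):
        self.samples.append((offset, delay))

    def best(self):
        # The lowest round-trip delay is assumed to carry the least queueing error.
        return min(self.samples, key=lambda sample: sample[1])

    def jitter(self):
        # Simplified RMS spread of the offsets around the chosen sample.
        best_offset, _ = self.best()
        offsets = [offset for offset, _ in self.samples]
        if len(offsets) < 2:
            return 0.0
        return sqrt(sum((o - best_offset) ** 2 for o in offsets) / (len(offsets) - 1))


TRUE_OFFSET_MS = 2.4   # hypothetical ground truth
BASE_DELAY_MS = 10.0   # hypothetical minimum round trip
reverse_queueing_ms = [3.0, 0.4, 6.5, 0.1, 1.2, 4.8, 0.9, 2.2]

clock_filter = ClockFilter()
for q in reverse_queueing_ms:
    clock_filter.add(offset=TRUE_OFFSET_MS - q / 2, delay=BASE_DELAY_MS + q)

best_offset, best_delay = clock_filter.best()
print(f"chosen sample: offset={best_offset:.3f} ms, delay={best_delay:.3f} ms")
print(f"jitter: {clock_filter.jitter():.3f} ms")
```

The chosen sample is the one with the least queueing, so its offset sits closest to the true value even though no individual sample is exact.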
From each peer, the implementation derives values such as:
- offset: current estimated difference between the local clock and the peer
- delay: measured round-trip path delay
- dispersion: error bound that grows with time since the last good sample
- jitter: short-term variability across recent samples
These numbers determine whether a peer remains credible, how much weight it gets, and whether the local clock should be nudged more by phase correction or frequency correction.
This is also where NTP begins to look less like a request-response protocol and more like a control system with statistical input filters. The packet exchange is only the sensor. The real product is a continuously updated confidence interval around the current offset estimate.
NTP Does Not Trust a Single Peer, It Builds a Consensus
The reason production NTP deployments use several upstreams is not redundancy in the casual sense. It is epistemology.
If one server says your clock is 2.4 ms slow, you still do not know whether your clock is wrong or that server is wrong. If four servers say you are 2.4 ms slow and a fifth says you are 12 ms fast, the outlier becomes identifiable.
This is where interval-based reasoning enters. Each peer can be represented not as one exact offset but as an interval around the offset estimate. The width of the interval reflects uncertainty contributed by delay and dispersion. A good sample from a low-jitter, low-delay upstream produces a narrow interval. A noisy path produces a wider one.
The classical NTP intersection algorithm is closely related in spirit to Marzullo’s algorithm, which takes multiple intervals from different sources and finds the region with maximum overlap. The lab that accompanies this article uses that framing because it makes the idea visible: each server says "the true offset is probably somewhere in this range", and the best estimate lies where the most credible ranges overlap.
Imagine these correction intervals in milliseconds:
PTB-backed server: [2.31, 2.47]
Observatoire de Paris peer: [2.28, 2.44]
Frankfurt stratum 2 peer: [2.22, 2.58]
Local ISP peer: [2.35, 2.51]
Broken WAN peer: [0.80, 4.20]
The first four peers overlap tightly near 2.35 to 2.44 ms. The broken peer overlaps almost everything because its uncertainty is huge, but it contributes very little real information. If another peer instead claimed [-5.0, -4.5], it would not overlap the majority at all and would be a clear falseticker.
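The overlap search can be sketched with a textbook form of Marzullo's algorithm. This is the plain interval version, not the refined intersection algorithm production daemons use, and the function name marzullo is illustrative.

```python
def marzullo(intervals):
    """Return (count, lo, hi): the region covered by the largest number of intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))  # interval start; -1 sorts before +1 when values tie
        events.append((hi, +1))  # interval end
    events.sort()

    best = count = 0
    best_lo = best_hi = None
    for i, (value, kind) in enumerate(events):
        count -= kind            # a start raises the overlap count, an end lowers it
        if count > best:
            best = count
            best_lo = value
            best_hi = events[i + 1][0]  # the overlap lasts until the next event
    return best, best_lo, best_hi


peers_ms = [
    (2.31, 2.47),  # PTB-backed server
    (2.28, 2.44),  # Observatoire de Paris peer
    (2.22, 2.58),  # Frankfurt stratum 2 peer
    (2.35, 2.51),  # Local ISP peer
    (0.80, 4.20),  # Broken WAN peer
]
print(marzullo(peers_ms))
print(marzullo(peers_ms + [(-5.0, -4.5)]))  # add a falseticker
```

Adding the falseticker does not move the result, because its interval never overlaps the region where the other sources agree.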
Production daemons then go further than pure interval intersection. They apply selection, clustering, and combining stages that down-weight noisy peers and reject outliers. The important operational takeaway is that NTP becomes trustworthy not because one remote clock is sacred, but because several independent observations agree within bounded uncertainty.
Root Delay, Root Dispersion, and Why Stratum Alone Is a Weak Signal
NTP carries ancestry information downstream so clients can estimate how uncertainty accumulates.
Two key fields are root delay and root dispersion.
Root delay is the total round-trip delay from the current server back to its reference source, accumulated over the synchronisation chain. Root dispersion is an estimate of the maximum error accumulated along that same chain. It grows as time passes, largely because clocks drift between corrections and because no previous measurement was perfect.
These values matter because a server can be close to the top of the stratum hierarchy but still have poor current quality. Consider two cases:
- A stratum 1 machine directly attached to a GNSS receiver with intermittent antenna problems.
- A stratum 2 machine in Amsterdam taking clean feeds from several healthy stratum 1 servers over low-jitter links.
Blindly preferring the lower stratum can be worse. The second machine may have better current stability and lower effective uncertainty. Mature NTP software therefore evaluates a synchronisation distance derived from delay and dispersion, not just stratum number.
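A simplified version of that comparison is sketched below. It uses a reduced distance of half the root delay plus the root dispersion. Real implementations add further terms, and the Source class and all numbers here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    stratum: int
    root_delay: float       # seconds, round trip back to the reference
    root_dispersion: float  # seconds, accumulated maximum error estimate

    def root_distance(self) -> float:
        # Simplified: half the round trip plus the accumulated dispersion.
        return self.root_delay / 2 + self.root_dispersion


candidates = [
    # Stratum 1 with a flaky antenna: no network hops, but dispersion has grown.
    Source("flaky-stratum1", stratum=1, root_delay=0.0, root_dispersion=0.050),
    # Stratum 2 with clean upstreams over low-jitter links.
    Source("ams-stratum2", stratum=2, root_delay=0.004, root_dispersion=0.001),
]

by_stratum = min(candidates, key=lambda s: s.stratum)
by_distance = min(candidates, key=lambda s: s.root_distance())
print("lowest stratum: ", by_stratum.name)
print("lowest distance:", by_distance.name)
```

The two rankings disagree, which is the point: the hop count says nothing about how stale or noisy the upstream reference currently is.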
This is one of the reasons serious deployments typically configure several upstreams from different administrative domains. You want diversity:
- different physical paths
- different facilities
- different operators
- ideally different reference chains
That does not make time perfect, but it makes catastrophic common-mode failure less likely.
Clock Discipline Is a Control Loop, Not a Periodic Set-Time Command
People often imagine NTP as something that occasionally asks for the time and then sets the system clock. Good implementations try very hard not to do that.
Stepping the clock is crude. If the local clock is suddenly jumped backward, timers break, logs reorder, leases look wrong, and applications that assumed monotonic progress become upset. Jumping it forward creates its own problems: scheduled tasks run early, timeouts expire instantly, and caches can misbehave.
So once a host is roughly correct, NTP usually disciplines the oscillator by slewing rather than stepping. Instead of saying "it is now 12:00:00.000", the daemon says "your local oscillator is running a bit too slow, increase its rate until the phase error shrinks". This is a phase-locked loop and frequency-locked loop problem dressed up as systems administration.
In Linux, the kernel keeps frequency and offset state for the system clock. User-space daemons such as ntpd, chronyd, or systemd-timesyncd feed new observations into that kernel discipline machinery. chronyd, in particular, is designed to converge quickly even on intermittently connected machines, laptops, and VMs that sleep and resume.
The basic control split looks like this:
- Phase correction fixes present offset.
- Frequency correction fixes persistent drift.
If a host consistently gains 18 ppm, the daemon learns that and asks the kernel to slow the local clock slightly all the time. Then future corrections can be smaller.
This distinction explains why a disciplined machine can stay accurate for some time even if it briefly loses all upstream servers. Once the frequency error is known, the system can coast in holdover mode. How well it coasts depends on oscillator quality. A laptop crystal may wander quickly. An OCXO or rubidium-backed appliance may stay excellent for hours.
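The phase and frequency split can be illustrated with a toy proportional-integral loop. This is not the algorithm used by ntpd or chronyd. The gains kp and ki, the poll interval, and the function name simulate_discipline are assumptions chosen so the example converges.

```python
def simulate_discipline(drift=18e-6, initial_offset=5e-3, poll=16.0,
                        steps=200, kp=0.5, ki=0.1):
    """Slew a clock that gains `drift` seconds per second until phase and frequency settle.

    Returns a list of (offset_seconds, frequency_correction) after each poll.
    """
    offset = initial_offset  # local clock minus reference, seconds
    freq_corr = 0.0          # learned long-term rate correction (integral term)
    phase_rate = 0.0         # temporary extra rate to remove the present offset
    history = []
    for _ in range(steps):
        # The clock is never stepped: it only runs faster or slower between polls.
        offset += (drift + freq_corr + phase_rate) * poll
        # Frequency correction accumulates, so persistent drift is learned.
        freq_corr -= ki * offset / poll
        # Phase correction removes a fraction of the measured offset per poll.
        phase_rate = -kp * offset / poll
        history.append((offset, freq_corr))
    return history


history = simulate_discipline()
final_offset, final_freq = history[-1]
print(f"final offset: {final_offset:.3e} s")
print(f"learned frequency correction: {final_freq * 1e6:.3f} ppm")
```

Once the loop settles, the learned correction cancels the 18 ppm drift. That learned value is what lets a host coast through a short loss of its upstream servers.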
Why PPS Changes Everything
Network protocols carry both time-of-day and uncertainty. PPS, pulse per second, carries only one thing, but it carries it exceptionally well: the exact second boundary.
A PPS signal is usually an electrical pulse, often on coax, that occurs once per second aligned to a reference timescale. GNSS receivers commonly output PPS alongside a serial or network message containing time-of-day. The serial message might tell you that the next pulse corresponds to 2026-04-20 12:34:56 UTC. The pulse tells you precisely when that second begins.
Why split the information?
Because time-of-day messages over serial links, USB, or general operating system paths suffer variable latency. A PPS edge, when wired into a proper hardware capture path, can be timestamped with much lower uncertainty. The typical arrangement is:
- GNSS receiver locks to satellites.
- Receiver emits PPS each second.
- Kernel captures that edge via a PPS API and associates it with the current second count.
- NTP or chronyd disciplines the local kernel clock from that precise edge.
With PPS, the system no longer depends on the latency of parsing a time string or USB frame to locate the second boundary. The time-of-day path tells you which second it is. The PPS path tells you exactly when it starts.
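The way the two paths combine can be sketched as follows. This is an illustration with a hypothetical function name and made-up numbers. It assumes the coarse time-of-day estimate is wrong by well under half a second, which is what allows the correct second to be identified.

```python
def pps_offset(edge_system_time: float, coarse_offset: float) -> float:
    """Offset of the system clock (system minus true time) measured at a PPS edge.

    edge_system_time: system clock reading captured when the pulse arrived.
    coarse_offset:    rough system-minus-true estimate from the time-of-day message.
    """
    # The pulse marks an exact second boundary; the coarse path only picks which one.
    true_edge_time = round(edge_system_time - coarse_offset)
    return edge_system_time - true_edge_time


# System clock is 1.2000123 s fast; the serial message is 30 ms off in its estimate.
edge_reading = 1001.2000123   # true time at the edge was exactly 1000 s
coarse_estimate = 1.23
print(f"{pps_offset(edge_reading, coarse_estimate):.7f}")
```

The 30 ms error in the coarse estimate disappears from the result, because the coarse path is only used to choose the second and the pulse supplies the precision.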
Many stratum 1 servers are not just "servers using GPS" for that reason. They are machines with a GNSS receiver, a PPS-capable kernel path, and local discipline software that fuses coarse absolute time with extremely precise second-edge timing.
Without PPS, a host might live in the sub-millisecond regime. With PPS and a good kernel path, tens of microseconds become feasible. With hardware timestamping and PTP, we go much further.
GPS Became the Operational Root Clock Because It Is Everywhere
Strictly speaking, GPS is not the philosophical origin of time. International atomic time and national lab ensembles sit above the public broadcast layer. Operationally, though, GPS or other GNSS systems became the root source for enormous amounts of deployed infrastructure because they solve the distribution problem.
You can buy a rack-mount timing receiver, run an antenna to the roof, and obtain:
- a stable PPS output
- UTC or GPS time-of-day
- quality indicators such as satellite count and holdover state
- sometimes 10 MHz reference output
That is enough to build a local stratum 1 source or a PTP grandmaster.
This convenience has consequences. It means a huge fraction of the internet and telecom world shares a similar dependency chain:
GNSS constellation -> rooftop antenna -> receiver -> PPS or grandmaster -> NTP/PTP distribution -> servers and devices
That is powerful, cheap, and globally available. It is also fragile in common-mode ways:
- antenna cable faults
- lightning damage
- jamming
- spoofing
- bad sky visibility
- firmware bugs in timing receivers
If a stratum 1 server says stratum 1 because it is physically attached to a GNSS receiver, but that receiver has silently entered bad holdover or is being spoofed, the whole downstream hierarchy can inherit bad time with an air of authority.
This is why serious operators monitor not just offset but source health: GNSS lock status, PPS presence, holdover duration, antenna current, satellite geometry, and disagreement against independent peers.
PTB Braunschweig, Observatoire de Paris, and European Timing Context
Europe’s timing infrastructure is not an abstract footnote. It appears in both metrology and operations.
PTB in Braunschweig, the Physikalisch-Technische Bundesanstalt, maintains atomic standards and contributes to UTC computation through BIPM processes. The Observatoire de Paris does the same in France with UTC(OP). These local realisations matter because UTC itself is computed retrospectively from global measurements, while operational systems need a real-time local approximation today, not a perfect answer next month.
In practical European deployments, you often see timing chains that reference or calibrate against national institutes, then distribute time through:
- GNSS-disciplined grandmasters in Frankfurt, Amsterdam, Paris, or London
- regional NTP pools and institutional servers
- PTP domains inside telecom or trading networks
- holdover oscillators for resilience against GNSS outages
For example, a carrier network in Germany may use grandmasters disciplined by GNSS and validated against national timing references, then distribute phase and frequency through PTP across boundary clocks. A research facility in Paris may compare local references against UTC(OP) and also distribute NTP for general-purpose hosts. A trading venue may run multiple independent grandmasters and monitor their divergence at microsecond scale because audit and sequencing rules demand evidence of traceable time.
The protocol names are global, but the infrastructure is local, physical, and institutional. Someone always owns the antenna, the oscillator, the fibre path, and the uncertainty budget.
Why NTP Accuracy Tops Out Before PTP Starts to Shine
NTP can do impressive work in software, but it is boxed in by one hard limitation: it usually timestamps packets in the operating system after network stack processing has already added uncertainty.
Even if the wire-level delay is stable, a software timestamp taken in the kernel or daemon can move around because of:
- interrupt moderation
- scheduler latency
- queueing inside the NIC or driver
- CPU power-state transitions
- virtualisation overhead
On a normal server this is acceptable. If your logs line up within a millisecond, you are happy. If you need the precise phase relationship between baseband units, substation event records, or packet captures across a trading plant, you are not.
This is where Precision Time Protocol, IEEE 1588, enters.
PTP uses a similar idea to NTP (exchange timestamped messages, then estimate offset), but it is designed for tightly controlled networks and high-precision hardware support. In good deployments, timestamps are taken in the network interface hardware at packet ingress and egress, not late in the software stack. That removes much of the variable latency that dominates software timestamping.
PTP also assumes the network itself may cooperate. Switches can become boundary clocks or transparent clocks.
- A boundary clock terminates PTP on one side, synchronises itself, then serves time downstream.
- A transparent clock measures how long a PTP event packet spent inside the switch and updates a correction field so downstream devices can subtract that residence time.
These features allow phase accuracy in the microsecond, sub-microsecond, or even tens-of-nanoseconds range inside a controlled LAN or metro environment.
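The transparent clock idea can be sketched with integer nanoseconds. This is a simplification: the SyncMessage class and all values are hypothetical, the two clocks are assumed to be already aligned, and the real correction field uses a scaled nanosecond encoding that is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncMessage:
    origin_timestamp_ns: int       # t1, taken by the master
    correction_field_ns: int = 0   # residence time accumulated along the path


def forward_through_transparent_clock(msg: SyncMessage, residence_ns: int) -> SyncMessage:
    """A transparent clock adds the time the packet spent inside the switch."""
    return SyncMessage(msg.origin_timestamp_ns, msg.correction_field_ns + residence_ns)


def apparent_path_delay_ns(msg: SyncMessage, t2_ns: int) -> int:
    """One-way delay seen by the receiver after subtracting the correction field."""
    return (t2_ns - msg.origin_timestamp_ns) - msg.correction_field_ns


LINK_DELAY_NS = 2_000                 # total cable and PHY delay, assumed constant
residence_times_ns = [3_700, 12_100]  # queueing inside two switches, varies per packet

sync = SyncMessage(origin_timestamp_ns=1_000_000)
for residence in residence_times_ns:
    sync = forward_through_transparent_clock(sync, residence)

t2_ns = sync.origin_timestamp_ns + LINK_DELAY_NS + sum(residence_times_ns)
uncorrected = t2_ns - sync.origin_timestamp_ns
corrected = apparent_path_delay_ns(sync, t2_ns)
print("uncorrected delay:", uncorrected, "ns")
print("corrected delay:  ", corrected, "ns")
```

Only the stable link delay survives the correction, so packet-to-packet queueing variation no longer leaks into the offset estimate.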
IEEE 1588 PTP Works Because the Hardware Participates
A minimal PTP exchange involves Sync, Follow_Up, Delay_Req, and Delay_Resp messages. In two-step mode, the grandmaster sends a Sync event and then a Follow_Up containing the exact transmit timestamp. The slave later sends Delay_Req, and the master replies with Delay_Resp containing the receipt timestamp.
The slave then has a four-timestamp dataset analogous to NTP, but the timestamps are often hardware-generated at the MAC or PHY boundary:
master sends Sync at t1
slave receives Sync at t2
slave sends Delay_Req at t3
master receives Delay_Req at t4
From these, the slave estimates:
mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2
offset_from_master = (t2 - t1) - mean_path_delay
Again, asymmetry still hurts. PTP is not immune to different forward and reverse path delays. The difference is that in engineered networks you can often control or at least characterise those delays much more tightly than on the public internet.
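The two estimates can be checked numerically. The sketch below uses made-up integer nanosecond timestamps for a slave that is 250 ns ahead of the master; the function name and values are illustrative.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Return (offset_from_master, mean_path_delay) from the four PTP timestamps."""
    mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2
    offset_from_master = (t2 - t1) - mean_path_delay
    return offset_from_master, mean_path_delay


TRUE_OFFSET_NS = 250  # slave clock minus master clock (hypothetical)


def make_timestamps(forward_ns, reverse_ns):
    t1 = 1_000_000                           # master clock: Sync sent
    t2 = t1 + forward_ns + TRUE_OFFSET_NS    # slave clock: Sync received
    t3 = 1_005_000                           # slave clock: Delay_Req sent
    t4 = t3 - TRUE_OFFSET_NS + reverse_ns    # master clock: Delay_Req received
    return t1, t2, t3, t4


symmetric = ptp_offset_and_delay(*make_timestamps(800, 800))
asymmetric = ptp_offset_and_delay(*make_timestamps(800, 1000))
print("symmetric: ", symmetric)
print("asymmetric:", asymmetric)
```

With a 200 ns difference between the two directions, the offset estimate is wrong by 100 ns, the same half-asymmetry bias that affects NTP.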
The hardware support is decisive. If a NIC timestamps a PTP frame the moment it crosses the wire, the uncertainty might be tens of nanoseconds. If the kernel timestamps it after interrupt handling and scheduling, the uncertainty can be thousands of times worse. This is why "PTP support" on a product sheet is ambiguous until you know whether it means software timestamping, hardware timestamping, transparent clock support, boundary clock mode, SyncE integration, and which profile is implemented.
Sub-Microsecond Accuracy Usually Means PTP Plus Good Network Design
When people hear that PTP can deliver sub-microsecond synchronisation, they often attribute all the credit to the protocol. That is not the right lesson.
Sub-microsecond results come from a whole design stack:
- a good grandmaster, often GNSS-disciplined, with decent oscillator holdover
- hardware timestamping in every relevant NIC
- switches operating as boundary or transparent clocks
- low and stable asymmetry across paths
- careful VLAN and QoS treatment for timing traffic
- sometimes Synchronous Ethernet, where frequency is distributed at the physical layer
PTP without this engineering can be disappointing. You can run IEEE 1588 over a random enterprise network and still end up with mediocre results because ordinary switches add uncontrolled residence time and queueing asymmetry.
By contrast, telecom and industrial profiles are strict about topology and behaviour. The grandmaster election process, announced quality levels, and failover logic are all tailored for environments where losing phase alignment is a service-impacting event.
This is the biggest conceptual difference between NTP and PTP:
- NTP is robust on messy networks and aims for good-enough software time.
- PTP is demanding on the network and rewards that discipline with much better phase accuracy.
Frequency Is Not the Same as Phase, and SyncE Shows Why
One subtle point in time engineering is that frequency and phase are related but not identical.
If two devices run at exactly the same frequency, their clocks tick at the same rate. But they may still disagree on the current time by a constant offset. If one device is 700 ns ahead and both run at identical frequency forever, that 700 ns phase difference remains.
Synchronous Ethernet, often used alongside PTP in telecom networks, distributes frequency through the Ethernet physical layer. It lets downstream devices recover a stable clock frequency from the line signal. That greatly improves oscillator stability across the network. But SyncE alone does not tell devices what the absolute time-of-day is or where the one-second boundary should be. PTP handles phase and time. SyncE handles frequency.
Together they are powerful:
- SyncE keeps everyone ticking at the same rate.
- PTP aligns the phase and absolute time.
This combination is one reason modern mobile networks can meet tight timing budgets for radio coordination.
Holdover Is the Difference Between a Good Clock and a Good Timing System
Any timing source can disappear.
GNSS can be jammed. Antennas can fill with water. Fibre can be cut. A top-of-rack switch can reboot. If your whole design assumes the reference is always available, you do not have a timing system, you have a timing dependency.
Holdover is the period during which a clock keeps time acceptably after losing its reference. The quality of holdover depends heavily on oscillator stability:
- a cheap quartz oscillator may drift badly within minutes
- a decent OCXO may hold acceptable accuracy for hours
- a rubidium clock may hold useful stability much longer
Protocols alone cannot create holdover. They can only feed corrections into the local oscillator while the reference exists. After loss of lock, the hardware quality becomes visible immediately.
This is why serious timing appliances advertise not just synchronised accuracy but holdover performance. "±100 ns while locked" sounds impressive. "How bad after 2 hours without GNSS?" is the more operational question.
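That question can be estimated with the usual first-order holdover model: an initial offset, a constant residual frequency error, and linear ageing. The oscillator figures below are illustrative orders of magnitude, not datasheet values, and holdover_error is a hypothetical helper.

```python
def holdover_error(elapsed_s, initial_offset=0.0, freq_error=0.0, aging_per_day=0.0):
    """Time error in seconds after coasting for elapsed_s without a reference.

    freq_error:    fractional frequency error at the moment lock was lost.
    aging_per_day: fractional frequency change per day (linear ageing).
    """
    aging_per_second = aging_per_day / 86400.0
    return (initial_offset
            + freq_error * elapsed_s
            + 0.5 * aging_per_second * elapsed_s ** 2)


TWO_HOURS = 2 * 3600
# Assumed residual frequency errors, for illustration only.
oscillators = {"cheap quartz": 1e-6, "OCXO": 1e-9, "rubidium": 1e-11}

for name, freq_error in oscillators.items():
    error = holdover_error(TWO_HOURS, initial_offset=100e-9, freq_error=freq_error)
    print(f"{name:>12}: {error * 1e6:10.3f} microseconds after 2 h")
```

The locked accuracy of 100 ns is identical for all three in this sketch; the holdover figure is what separates them.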
Good monitoring therefore tracks state transitions such as:
- locked to GNSS
- locked to PTP upstream
- holdover
- freerun
If you only graph current offset, you may miss the moment the system stopped being traceable and started coasting.
What Operators Actually Configure on Real Hosts
The theory is useful, but time systems fail in configuration files and service graphs, not in elegant diagrams.
On a normal Linux server, the most common serious setup today is chronyd talking to several upstream NTP sources, optionally with a local PPS input if the host is attached to a timing receiver. chronyd is preferred in many modern estates because it converges quickly, handles intermittent connectivity well, and learns frequency error aggressively after sleep, resume, or VM migration events.
A stripped-down but realistic chrony.conf for a European server might look like this:
server ntp1.example.net iburst
server ntp2.example.net iburst
server ntp3.example.net iburst
server ntp4.example.net iburst
makestep 0.5 3
rtcsync
driftfile /var/lib/chrony/drift
leapsectz right/UTC
# If a local GNSS receiver exposes PPS through the kernel PPS API
refclock PPS /dev/pps0 lock NMEA refid PPS
refclock SHM 0 offset 0.0 delay 0.2 refid NMEA
The four upstream servers give source diversity. iburst accelerates the first sync after boot. makestep 0.5 3 lets chronyd step the clock when the offset exceeds 0.5 seconds, but only during the first three clock updates; after that it slews, because repeated stepping is dangerous to running applications. rtcsync periodically copies disciplined system time back to the motherboard RTC so the next boot begins closer to reality. The PPS and NMEA lines show the classic split: the NMEA sentence gives coarse time-of-day, the PPS edge gives the precise second boundary.
Once the host is synchronised, the useful commands are not exotic:
chronyc tracking
chronyc sources -v
chronyc sourcestats -v
timedatectl timesync-status
chronyc tracking shows the system clock's current offset from its reference, the frequency correction, root delay, and root dispersion. chronyc sources -v shows whether the daemon trusts each peer, whether a source is selected, and how unstable each path is. chronyc sourcestats -v adds the per-source drift and offset statistics behind those judgements. timedatectl timesync-status is only meaningful on hosts where systemd-timesyncd, rather than chronyd, is the active daemon. In practice, operators watch these outputs the way they watch disk SMART data or BGP neighbour state: not because the numbers are beautiful, but because drift trends tell you what will break next.
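For trend monitoring, the human-readable output of chronyc tracking can be turned into numbers. The sketch below parses an embedded sample rather than running the command, so it works anywhere; the sample follows the layout shown in chrony's documentation, but the exact fields and wording may differ between versions, so treat the parsing rules as assumptions to verify against your own hosts.

```python
import re

# Sample text in the documented `chronyc tracking` layout (values illustrative).
SAMPLE = """\
Reference ID    : CB00710F (ntp1.example.net)
Stratum         : 3
Ref time (UTC)  : Fri Jan 27 09:49:17 2017
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
RMS offset      : 0.000035822 seconds
Frequency       : 3.225 ppm slow
Residual freq   : -0.000 ppm
Skew            : 0.129 ppm
Root delay      : 0.013639022 seconds
Root dispersion : 0.001100737 seconds
Update interval : 64.2 seconds
Leap status     : Normal
"""


def parse_tracking(text):
    """Split 'Key : value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def system_offset_s(fields):
    """Signed offset in seconds: positive means the system clock is fast."""
    match = re.match(r"([0-9.]+) seconds (fast|slow) of NTP time",
                     fields["System time"])
    if match is None:
        raise ValueError("unexpected 'System time' format")
    magnitude = float(match.group(1))
    return magnitude if match.group(2) == "fast" else -magnitude


def leading_float(value):
    """Return the first whitespace-separated token as a float."""
    return float(value.split()[0])


fields = parse_tracking(SAMPLE)
print("offset (s):         ", system_offset_s(fields))
print("root dispersion (s):", leading_float(fields["Root dispersion"]))
print("leap status:        ", fields["Leap status"])
```

On a live host the text would come from something like `subprocess.run(["chronyc", "tracking"], capture_output=True, text=True).stdout`, with the parsed values shipped to whatever metrics system the estate already uses.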
When sub-microsecond timing matters, host configuration grows another layer. A host might run ptp4l against a hardware-timestamping NIC, while its system clock is disciplined from the NIC's PHC (the PTP hardware clock) by phc2sys:
ptp4l -i eno1 -f /etc/ptp4l.conf -m
phc2sys -a -r -m
ptp4l keeps the network interface aligned to the PTP domain. phc2sys then copies that disciplined time into CLOCK_REALTIME. This split is easy to miss if you come from NTP land. A NIC can be beautifully synchronised in hardware while the Linux system clock is still mediocre until phc2sys bridges the gap. Many deployments that believe they are "running PTP" have only solved half the problem.
Failure Modes People Actually Debug
Most time incidents do not begin with "the clock is wrong". They begin with something adjacent that looks unrelated.
An authentication team sees Kerberos failures after a cluster resumes from a maintenance window. An SRE sees traces where a child span appears to finish before the parent begins. A mobile core engineer sees a base station raise a phase alarm after a GNSS antenna amplifier starts failing intermittently in rain. A database operator sees Raft elections happening too often because one node's clock keeps wandering during holdover. Time faults are usually discovered through side effects.
The failure patterns are consistent:
- Sudden large offset after boot or resume: the host started from a bad RTC, then stepped late because the time daemon came up after application services.
- Stable but wrong by a few milliseconds: the selected upstream has path asymmetry or a local PPS input is missing while the daemon still believes it is present.
- Excellent while locked, terrible during outages: the oscillator is cheap and the system has almost no holdover budget.
- Only one rack is bad: the problem is often physical, a failing antenna splitter, damaged coax, bad SFP latency behaviour on a boundary clock uplink, or one top-of-rack switch not applying the right PTP profile.
- Everything disagrees at a leap event: one fleet smeared, another inserted the leap literally, and a third had stale leap-second metadata.
The debugging sequence is usually physical first, logical second.
On an NTP host you check which sources are selected, whether reachability is decaying, whether offset and jitter widened before the fault, and whether the daemon is stepping or slewing. On a PTP segment you check grandmaster identity, path delay stability, hardware timestamp capability, servo state, and whether boundary clocks are still announcing the expected class and accuracy. On GNSS-backed infrastructure you also check the antenna system: lock status, satellite count, spoofing alarms, holdover duration, and whether the receiver is still disciplining PPS or merely replaying its last good estimate.
A representative telecom fault flow might look like this:
phase alarm on cell site
-> check grandmaster state
-> grandmaster in holdover for 97 minutes
-> GNSS antenna current abnormal
-> rooftop LNA failed after water ingress
-> rubidium holdover masked the problem for an hour
-> edge clocks slowly drifted out of budget
Nothing in that chain is conceptually difficult, but every step matters. The timing system failed long before the radio alarm appeared. It just took time for the drift to become operationally visible.
The same pattern appears in ordinary server estates at a less dramatic scale. A host can lose all four upstream peers, remain plausible for twenty minutes because the learned frequency estimate is decent, then slowly accumulate enough error to break certificate checks or log ordering. If you alert only on absolute offset, you see the incident late. If you alert on source state transitions and rising dispersion, you see it while the host is still functioning and before applications start telling you the clock is broken.
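One early signal is reachability. The Reach column in chronyc sources is an 8-bit shift register printed in octal, where 377 means the last eight polls all received valid replies and the lowest bit is the most recent poll. A minimal sketch of decoding it and flagging decay follows; the alert threshold is an arbitrary assumption for illustration.

```python
def recent_polls(reach_octal, width=8):
    """Decode an octal reach register into booleans, newest poll first."""
    value = int(reach_octal, 8)
    return [bool((value >> bit) & 1) for bit in range(width)]


def missed_in_a_row(reach_octal):
    """Count consecutive most-recent polls that got no valid reply."""
    count = 0
    for answered in recent_polls(reach_octal):
        if answered:
            break
        count += 1
    return count


def decaying_sources(reach_by_source, threshold=2):
    """Return the names of sources that missed at least `threshold` recent polls."""
    return sorted(name for name, reach in reach_by_source.items()
                  if missed_in_a_row(reach) >= threshold)


# Hypothetical snapshot of four upstreams.
snapshot = {
    "ntp1.example.net": "377",  # all eight recent polls answered
    "ntp2.example.net": "376",  # newest poll missed
    "ntp3.example.net": "370",  # three newest polls missed
    "ntp4.example.net": "0",    # nothing heard in eight polls
}

print(decaying_sources(snapshot))
```

A rule like this fires while the learned frequency estimate is still keeping the offset small, which is exactly the window in which the fault is cheap to fix.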
Leap Seconds, Smearing, and the Politics of a Single Second
Leap seconds are a perfect example of how timekeeping becomes messy when civil requirements meet computing systems.
Because UTC occasionally inserts leap seconds, a day can contain 86,401 seconds instead of 86,400. Some systems handle this literally by repeating 23:59:59 or exposing 23:59:60. Others smear the leap second over a longer interval by slightly slowing or speeding the clock so there is no abrupt discontinuity.
Both approaches have tradeoffs.
Literal leap insertion preserves UTC exactly but can break software that assumes every minute has 60 seconds. Smearing avoids abrupt jumps but means the clock is temporarily not exactly UTC. If different systems use different smear policies, cross-system comparisons around the leap event become awkward.
NTP and PTP deployments therefore need a policy, not just a protocol. Operators must know:
- whether their upstream announces leap seconds
- whether they smear
- which smear profile they use
- whether downstream systems expect UTC or continuous time
Many outages attributed vaguely to "time issues" are really policy mismatches. One fleet smears, another does not, and signatures or sequence checks cross the boundary.
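The size of such a mismatch is easy to estimate. The sketch below compares a fleet that applies the leap second literally with one that smears it linearly, and reports how far the two disagree at a given instant. The 24-hour window centred on the leap is one illustrative smear profile chosen for the example, not a statement about any particular provider.

```python
def applied_linear_smear(t, start, length, leap=1.0):
    """Leap seconds absorbed so far by a clock smearing linearly over a window."""
    fraction = min(max((t - start) / length, 0.0), 1.0)
    return fraction * leap


def applied_literal(t, leap_instant, leap=1.0):
    """Leap seconds applied so far by a clock that inserts the leap at once."""
    return leap if t >= leap_instant else 0.0


def disagreement_s(t, leap_instant=0.0, smear_len=86400.0):
    """Absolute difference between the two fleets at time t (seconds)."""
    smear_start = leap_instant - smear_len / 2  # assumed: window centred on leap
    smeared = applied_linear_smear(t, smear_start, smear_len)
    literal = applied_literal(t, leap_instant)
    return abs(smeared - literal)


# Times are seconds relative to the leap instant on a continuous axis.
for label, t in [("12 h before", -43200), ("6 h before", -21600),
                 ("1 s before", -1), ("6 h after", 21600), ("12 h after", 43200)]:
    print(f"{label:>12}: fleets differ by {disagreement_s(t):.3f} s")
```

With this profile the two fleets agree at the edges of the window and differ by roughly half a second around the leap itself, which is more than enough to upset signature validity windows or sequence checks that cross the boundary.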
Security Matters Because False Time Breaks More Than Logs
Time is a trust input.
If an attacker can distort time, they may not only reorder logs. They may invalidate certificate checks, break token expiry, disturb consensus protocols, disrupt telecom synchronisation, or shift forensic timelines. GNSS jamming and spoofing are obvious risks at the source layer, but ordinary NTP also has attack surfaces.
Classic unauthenticated NTP traffic can be spoofed or manipulated if the network path is hostile. Network Time Security, NTS, was introduced to address that: it adds modern cryptographic authentication around NTP associations while preserving the basic protocol shape for time transfer itself.
Even with authentication, asymmetry attacks remain possible. An adversary on path may be unable to forge timestamps but can still delay one direction more than the other. Protocols that assume symmetric delay cannot fully detect that without external help.
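The arithmetic behind that limitation is short. The standard NTP on-wire calculation derives offset and round-trip delay from four timestamps and implicitly assumes the forward and return paths take equal time. The sketch below simulates one exchange with chosen one-way delays; the simulation helper and its parameter values are assumptions for illustration.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation.

    t1 -- client transmit time (client clock)
    t2 -- server receive time  (server clock)
    t3 -- server transmit time (server clock)
    t4 -- client receive time  (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate_exchange(true_offset, fwd_delay, rev_delay,
                      t1=0.0, server_processing=0.0001):
    """Produce the four timestamps for given one-way delays (hypothetical helper).

    true_offset is server clock minus client clock.
    """
    t2 = t1 + fwd_delay + true_offset
    t3 = t2 + server_processing
    t4 = (t3 - true_offset) + rev_delay
    return t1, t2, t3, t4


TRUE_OFFSET = 0.002  # server is really 2 ms ahead of the client

symmetric = ntp_offset_and_delay(*simulate_exchange(TRUE_OFFSET, 0.010, 0.010))
delayed = ntp_offset_and_delay(*simulate_exchange(TRUE_OFFSET, 0.030, 0.010))

print(f"symmetric path: offset={symmetric[0] * 1e3:.3f} ms")
print(f"forward +20 ms: offset={delayed[0] * 1e3:.3f} ms")
```

The measured offset is wrong by half the difference between the two one-way delays. Every timestamp in the exchange is authentic, so cryptography has nothing to object to; only an independent reference or a second path can expose the bias.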
Operational mitigations include:
- multiple independent upstreams
- path diversity
- monitoring sudden offset divergence
- GNSS anti-spoofing and antenna supervision
- authenticated NTP or managed PTP domains
- sanity thresholds before accepting large steps
No time protocol can make the network honest. It can only give you better ways to detect dishonesty or refuse implausible corrections.
What Actually Happens on a Normal Linux Server
Most engineers do not run grandmasters. They run ordinary servers. So what does time synchronisation look like there in practice?
A typical Linux host boots with a real-time clock (RTC) that may already be a bit wrong. On startup, the kernel begins with that coarse time and a local oscillator frequency estimate. A daemon such as chronyd or systemd-timesyncd then starts polling configured NTP servers.
The daemon:
- Measures offset and delay against several peers.
- Filters noisy samples.
- Selects a consensus set.
- Estimates current phase and frequency error.
- Asks the kernel to adjust the clock.
If the error is enormous at boot, the system may step the clock once. After that, it usually slews.
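The measurement, filtering, and selection steps listed above can be sketched in a few lines. This is a deliberately simplified stand-in, not the algorithm chronyd or ntpd actually implements: it keeps the lowest-delay sample per source, discards sources that sit far from the median, and averages the survivors. The disagreement threshold is an arbitrary assumption.

```python
from statistics import mean, median


def best_sample(samples):
    """Pick the (offset_s, delay_s) sample with the lowest round-trip delay.

    Low-delay samples are the least likely to be distorted by queueing.
    """
    return min(samples, key=lambda sample: sample[1])


def combine_sources(per_source_samples, max_disagreement_s=0.010):
    """Return (combined_offset_s, surviving_source_names).

    Sources whose best offset is further than max_disagreement_s from the
    median are treated as falsetickers and dropped.
    """
    best = {name: best_sample(samples)
            for name, samples in per_source_samples.items()}
    centre = median(offset for offset, _delay in best.values())
    survivors = {name: offset for name, (offset, _delay) in best.items()
                 if abs(offset - centre) <= max_disagreement_s}
    return mean(survivors.values()), sorted(survivors)


# Hypothetical measurements: (offset_s, delay_s) per poll, per source.
measurements = {
    "ntp1": [(0.0040, 0.080), (0.0010, 0.012)],
    "ntp2": [(0.0020, 0.015), (0.0060, 0.095)],
    "ntp3": [(0.0030, 0.020)],
    "ntp4": [(0.5000, 0.010)],  # confidently wrong by half a second
}

offset, used = combine_sources(measurements)
print(f"combined offset {offset * 1e3:.2f} ms using {used}")
```

Even this toy version shows why several upstreams matter: with only two sources, there is no way to tell which one is the falseticker.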
Meanwhile, the kernel exposes two conceptually different clocks:
- CLOCK_REALTIME, which tracks wall-clock time and can be adjusted
- CLOCK_MONOTONIC, which only moves forward and is safer for measuring intervals
This distinction matters. Applications should use monotonic clocks for timeouts and elapsed durations precisely because wall-clock synchronisation may still nudge realtime. Good systems programming treats "what time is it?" and "how long has this taken?" as separate questions.
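In Python the same split appears as time.time() for wall-clock time and time.monotonic() for intervals; on Linux these are normally backed by CLOCK_REALTIME and CLOCK_MONOTONIC respectively. A minimal sketch of using each for its proper job, where the `timed` helper is this example's own invention:

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds).

    The duration comes from the monotonic clock, so a step or slew of the
    wall clock during the call cannot make it negative or wildly wrong.
    """
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    return result, elapsed


# "What time is it?" -> wall clock, suitable for log timestamps.
event_timestamp = time.time()

# "How long has this taken?" -> monotonic clock.
total, seconds = timed(sum, range(1_000_000))

print(f"event logged at epoch {event_timestamp:.3f}")
print(f"sum={total} computed in {seconds * 1e3:.2f} ms")
```

Subtracting two time.time() readings to measure a duration works most of the time, which is precisely what makes it a dangerous habit.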
Virtual machines complicate matters further. Hypervisors may offer paravirtual clocks or host-guest time sharing. If the guest pauses, migrates, or resumes, large time error can appear instantly. chronyd handles this kind of environment better than older daemon designs because it can relearn oscillator frequency quickly after discontinuities.
Accuracy Tiers: What Different Methods Actually Buy You
It helps to think in rough accuracy bands rather than mystical claims.
| Method | Typical Operational Range | What It Is Good For |
|---|---|---|
| Free-running quartz | seconds per day of drift | almost nothing requiring correlation |
| Public WAN NTP | 1 to 20 ms | logs, auth, ordinary servers |
| Well-managed LAN NTP | sub-ms to low ms | data centres, better observability |
| NTP with PPS discipline | tens of microseconds | local stratum 1 servers |
| Software PTP | microseconds to tens of microseconds | controlled LANs with modest needs |
| Hardware-timestamped PTP | sub-microsecond | telecom, industrial, trading, measurement |
| Lab-grade time transfer | far below that | metrology and calibration work |
These are not guarantees. They are realistic operating bands when the whole stack is engineered competently.
Notice what changes as you move down the table. The improvement does not come just from changing protocol acronyms. It comes from changing where timestamps are taken, how stable the oscillator is, how symmetric the path is, and whether the network cooperates.
The Most Important Mental Model: Time Synchronisation Is Error Budget Management
By this point, the simplest accurate summary is this:
Time synchronisation is the continuous management of several independent error sources:
- oscillator frequency error
- phase error
- packet delay variation
- path asymmetry
- source uncertainty
- timestamping latency
- source loss and holdover drift
Protocols such as NTP and PTP are merely the control surfaces through which those error budgets are measured and reduced.
NTP reduces them statistically over imperfect networks. PTP reduces them through hardware and controlled topology. PPS reduces second-boundary uncertainty by bypassing software delay. GPS and other GNSS constellations make atomic time available almost everywhere. Labs such as PTB and the Observatoire de Paris anchor those operational systems to national and international metrology.
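A budget of this kind can be written down explicitly. The sketch below uses one common convention, stated here as an assumption: independent random contributions combine as a root-sum-square, while fixed biases such as path asymmetry add linearly on top. All figures are hypothetical nanosecond values chosen for illustration.

```python
import math


def combined_error(random_terms, bias_terms):
    """Combine an error budget.

    random_terms -- independent random contributions, combined as RSS
    bias_terms   -- systematic offsets, summed by absolute value
    """
    rss = math.sqrt(sum(value * value for value in random_terms.values()))
    bias = sum(abs(value) for value in bias_terms.values())
    return rss + bias


# Hypothetical budget for a hardware-timestamped PTP path, in nanoseconds.
random_ns = {
    "timestamping latency": 20.0,
    "packet delay variation": 50.0,
    "oscillator wander between updates": 30.0,
}
bias_ns = {
    "uncompensated path asymmetry": 100.0,
}

total_ns = combined_error(random_ns, bias_ns)
print(f"estimated total error: about {total_ns:.0f} ns")
```

Laying the terms out this way makes it obvious which one to attack first. In this invented example the asymmetry bias outweighs all the random terms combined, so a better oscillator would buy almost nothing.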
Once you see the stack that way, many operational behaviours stop looking mysterious:
- A low-stratum server can still be bad.
- A PTP deployment can still be mediocre on a poor network.
- A GNSS-fed grandmaster can still lie if the antenna path is compromised.
- A host can remain pretty accurate for a while after losing sync because the frequency estimate was good.
- Logs can still look wrong when systems use different leap-second policies.
Virtualisation Adds Another Clock Layer You Still Have to Discipline
Virtual machines are a useful reminder that time synchronisation is not only about packets and oscillators. The guest is also living on a scheduler and clock abstraction supplied by the hypervisor.
That creates extra failure modes:
- the guest pauses and resumes with a sudden offset
- live migration changes timing characteristics
- host overload delays guest execution unevenly
- guest tools and NTP daemons disagree about who should adjust the clock
Good virtualised timekeeping is therefore partly protocol design and partly systems hygiene. The host clock must be sane, the guest must know which source it should trust, and discontinuities must be handled by software that can recover cleanly. This is one reason chronyd became popular in mixed physical and virtual fleets. It adapts well when the clock environment changes sharply.
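One simple piece of hygiene is to detect discontinuities from inside the guest. The sketch below compares how far the wall clock and the monotonic clock advanced between successive readings and reports any interval where the two disagree by more than a tolerance. The tolerance and the sampling approach are assumptions, and how each clock behaves across a pause or migration depends on the hypervisor and kernel.

```python
import time


def take_reading():
    """Return one (wall_clock_s, monotonic_s) pair."""
    return time.time(), time.monotonic()


def find_wall_clock_jumps(readings, tolerance_s=0.5):
    """Return (wall_time, discrepancy_s) for each suspicious interval.

    readings -- time-ordered list of (wall_clock_s, monotonic_s) pairs
    A positive discrepancy means the wall clock advanced further than the
    monotonic clock did over the same interval.
    """
    jumps = []
    for (wall0, mono0), (wall1, mono1) in zip(readings, readings[1:]):
        discrepancy = (wall1 - wall0) - (mono1 - mono0)
        if abs(discrepancy) > tolerance_s:
            jumps.append((wall1, discrepancy))
    return jumps


# Synthetic history: ten-second sampling, with the wall clock stepped
# forward by 42 seconds between the third and fourth readings.
history = [
    (1000.0, 50.0),
    (1010.0, 60.0),
    (1020.0, 70.0),
    (1072.0, 80.0),
    (1082.0, 90.0),
]

for wall_time, discrepancy in find_wall_clock_jumps(history):
    print(f"wall clock moved {discrepancy:+.1f} s relative to monotonic "
          f"at {wall_time}")
```

In a live guest the history would be built by calling `take_reading()` on a timer. A detector like this does not fix anything, but it turns "the VM felt odd after migration" into a timestamped event that can be correlated with hypervisor logs.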
Why GPS Is the Quiet Root of Internet Time
So is GPS really the root of most internet time?
In the practical operational sense, yes.
Not because GPS invented time, and not because every system talks directly to satellites, but because GPS and other GNSS services are the most common way operators obtain traceable atomic time in ordinary facilities. The root chain for a huge amount of deployed infrastructure looks like:
atomic timescale -> national labs and control segments -> GNSS broadcast -> local timing receiver -> PPS or grandmaster -> NTP/PTP distribution -> hosts
That is the chain behind your cloud VM logs, your switch timestamps, your enterprise Kerberos tickets, your LTE frame phase alignment, and many of the clocks in the internet core that never advertise their ancestry to you directly.
The internet feels like a purely software system until you follow time back to its source. Then you hit rooftop antennas, coax cables, oscillator ovens, boundary clocks, metrology institutes, and satellites carrying atomic clocks at orbital speed.
That is how time synchronisation actually works.