debug.htm [plain text]

<HTML><HEAD><TITLE>
NTP Debugging Techniques
</TITLE></HEAD><BODY><H3>
NTP Debugging Techniques
</H3>

<IMG align=left SRC="pic/pogo.gif"><I>Pogo Possum</I>, with toolkit
and bug, Walt Kelly
<br clear=left><hr>

<P>Once the NTP software distribution has been compiled and installed
and the configuration file constructed, the next step is to verify
correct operation and fix any bugs that may result. Usually, the command
line that starts the daemon is included in the system startup file, so
it is executed only at system boot time; however, the daemon can be
stopped and restarted from root at any time. Usually, no command-line
arguments are required, unless special actions described in the
<TT><A HREF="ntpd.htm">ntpd</A></TT> page are required. Once started,
the daemon will begin sending messages, as specified in the
configuration file, and interpreting received messages.

<P>The best way to verify correct operation is using the <TT><A
HREF="ntpq.htm">ntpq</A></TT> and <TT><A HREF="ntpdc.htm">ntpdc</A></TT>
utility programs, either on the server itself or from another machine
elsewhere in the network. The <TT>ntpq</TT> program implements the
management functions specified in Appendix A of the NTP specification <A
HREF="http://www.eecis.udel.edu/~mills/database/rfc/rfc1305/rfc1305c.ps"
>
RFC-1305, Appendix A</A>. The <TT>ntpdc</TT> program implements
additional functions not provided in the standard. Both programs can be
used to inspect the state variables defined in the specification and, in
the case of <TT>ntpdc</TT>, additional ones of interest. In addition,
the <TT>ntpdc</TT> program can be used to selectively enable and disable
some functions of the daemon while the daemon is running.

<P>In extreme cases with elusive bugs, the daemon can operate in two
modes, depending on the presence of the <TT>-d</TT> command-line debug
switch. If not present, the daemon detaches from the controlling
terminal and proceeds autonomously. If one or more <TT>-d</TT> switches
are present, the daemon does not detach and generates special output
useful for debugging. In general, interpretation of this output requires
reference to the sources. However, a single <TT>-d</TT> does produce
only mildly cryptic output and can be very useful in finding problems
with configuration and network troubles. With a little experience, the
volume of output can be reduced by piping the output to <TT>grep
</TT>and specifying the keyword of the trace you want to see.

<P>Some problems are immediately apparent when the daemon first starts
running. The most common of these are the lack of a ntp (UDP port 123)
in the host <TT>/etc/services</TT> file. Note that NTP does not use TCP
in any form. Other problems are apparent in the system log file. The log
file should show the startup banner, some cryptic initialization data,
and the computed precision value. The next most common problem is
incorrect DNS names. Check that each DNS name used in the configuration
file responds to the Unix <TT>ping</TT> command.

<P>When first started, the daemon normally polls the servers listed in
the configuration file at 64-second intervals. In order to allow a
sufficient number of samples for the NTP algorithms to reliably
discriminate between correctly operating servers and possible intruders,
at least four valid messages from at least one server is required before
the daemon can set the local clock. However, if the current local time
is greater than 1000 seconds in error from the server time, the daemon
will not set the local clock; instead, it will plant a message in the
system log and shut down. It is necessary to set the local clock to
within 1000 seconds first, either by a time-of-year hardware clock, by
first using the <A HREF="ntpdate.htm"><TT>ntpdate</TT> </A>program or
manually be eyeball and wristwatch.

<P>After starting the daemon, run the <TT>ntpq</TT> program using the
<TT>-n</TT> switch, which will avoid possible distractions due to name
resolution problems. Use the <TT>pe</TT> command to display a billboard
showing the status of configured peers and possibly other clients poking
the daemon. After operating for a few minutes, the display should be
something like:

<PRE>ntpq>pe
remote      refid       st t when poll reach   delay  offset   disp
===================================================================
+128.4.2.6  132.249.16.1 2 u  131  256   373    9.89   16.28  23.25
*128.4.1.20 .WWVB.       1 u  137  256   377  280.62   21.74  20.23
-128.8.2.88 128.8.10.1   2 u  49   128   376  294.14    5.94  17.47
+128.4.2.17 .WWVB.       1 u  173  256   377  279.95   20.56  16.40
</PRE>

The host addresses shown in the <TT>remote</TT> column should agree with
the DNS entries in the configuration file, plus any peers not mentioned
in the file at the same or lower than your stratum that happen to be
configured to peer with you. Be prepared for surprises in cases where
the peer has multiple addresses or multiple names. The <TT>refid</TT>
entry shows the current source of synchronization for each peer, while
the <TT>st</TT> reveals the stratum, <TT>t</TT> the type (<TT>u</TT> =
unicast, <TT>m</TT> = multicast, <TT>l</TT> = local, <TT>-</TT> = don't
know), and <TT>poll</TT> the polling interval in seconds. The
<TT>when</TT> entry shows the time since the peer was last heard,
normally in seconds, while the <TT>reach</TT> entry shows the status of
the reachability register (see RFC-1305) in octal. The remaining entries
show the latest delay, offset and dispersion computed for the peer in
milliseconds. Note that in NTP Version 4 the dispersion entry includes
only the RMS error component; earlier versions included all components.

<P>The tattletale character at the left margin displays the
synchronization status of each peer. The currently selected peer is
marked <TT>*</TT>, while additional peers designated acceptable for
synchronization, but not currently selected, are marked <TT>+</TT>.
Peers marked <TT>*</TT> and <TT>+</TT> are included in a weighted
average computation to set the local clock; the data produced by peers
marked with other symbols are discarded. See the <TT>ntpq</TT>
documentation for the meaning of these symbols.

<P>Additional details for each peer separately can be determined by the
following procedure. First, use the <TT>as</TT> command to display an
index of association identifiers, such as

<PRE>ntpq>as
ind assID status conf reach auth condition last_event cnt
=========================================================
 1  11670   7414   no   yes   ok candidate  reachable   1
 2  11673   7614   no   yes   ok sys.peer   reachable   1
 3  11833   7314   no   yes   ok outlyer    reachable   1
 4  11868   7414   no   yes   ok candidate  reachable   1
 </PRE>

Each line in this billboard is associated with the corresponding line
the <TT>pe</TT> billboard above. Next, use the <TT>rv</TT> command and
the respective identifier to display a detailed synopsis of the selected
peer, such as

<PRE>ntpq>rv 11670
status=7414 reach, auth, sel_sync, 1 event, event_reach
srcadr=128.4.2.6, srcport=123, dstadr=128.4.2.7, dstport=123, keyid=1,
stratum=2, precision=-10, rootdelay=362.00, rootdispersion=21.99,
refid=132.249.16.1,
reftime=af00bb44.849b0000 Fri, Jan 15 1993 4:25:40.517,
delay= 9.89, offset= 16.28,
dispersion=23.25, reach=373, valid=8,
hmode=2, pmode=1, hpoll=8, ppoll=10, leap=00, flash=0x0,
org=af00bb48.31a90000 Fri, Jan 15 1993 4:25:44.193,
rec=af00bb48.305e3000 Fri, Jan 15 1993 4:25:44.188,
xmt=af00bb1e.16689000 Fri, Jan 15 1993 4:25:02.087,
filtdelay=  16.40 9.89  140.08  9.63  9.72  9.22 10.79 122.99,
filtoffset= 13.24 16.28 -49.19 16.04 16.83 16.49 16.95 -39.43,
filterror=  16.27 20.17  27.98 31.89 35.80 39.70 43.61  47.52
</PRE>

A detailed explanation of the fields in this billboard are beyond the
scope of this discussion; however, most variables defined in the
specification RFC-1305 can be found. The most useful portion for
debugging is the last three lines, which give the roundtrip delay, clock
offset and dispersion for each of the last eight measurement rounds, all
in milliseconds. Note that the dispersion, which is an estimate of the
error, increases as the age of the sample increases. From these data, it
is usually possible to determine the incidence of severe packet loss,
network congestion, and unstable local clock oscillators. There are no
hard and fast rules here, since every case is unique; however, if one or
more of the rounds show zeros, or if the clock offset changes
dramatically in the same direction for each round, cause for alarm
exists.

<P>Finally, the state of the local clock can be determined using the
<TT>rv</TT> command (without the argument), such as

<PRE>ntpq>rv
status=0664 leap_none, sync_ntp, 6 events, event_peer/strat_chg
system="UNIX", leap=00, stratum=2, rootdelay=280.62,
rootdispersion=45.26, peer=11673, refid=128.4.1.20,
reftime=af00bb42.56111000 Fri, Jan 15 1993 4:25:38.336,
poll=8, clock=af00bbcd.8a5de000 Fri, Jan 15 1993 4:27:57.540,
phase=21.147, freq=13319.46, compliance=2
</PRE>

The most useful data in this billboard show when the clock was last
adjusted <TT>reftime</TT>, together with its status and most recent
exception event. An explanation of these data is in the specification
RFC-1305.

<P>When nothing seems to happen in the <TT>pe</TT> billboard after some
minutes, there may be a network problem. The most common network problem
is an access controlled router on the path to the selected peer. No
known public NTP time server selectively restricts access at this time,
although this may change in future; however, many private networks do.
It also may be the case that the server is down or running in
unsynchronized mode due to a local problem. Use the <TT>ntpq</TT>
program to spy on its own variables in the same way you can spy on your
own.

<P>Once the daemon has set the local clock, it will continuously track
the discrepancy between local time and NTP time and adjust the local
clock accordingly. There are two components of this adjustment, time and
frequency. These adjustments are automatically determined by the clock
discipline algorithm, which functions as a hybrid phase/frequency
feedback loop. The behavior of this algorithm is carefully controlled to
minimize residual errors due to network jitter and frequency variations
of the local clock hardware oscillator that normally occur in practice.
However, when started for the first time, the algorithm may take some
time to converge on the intrinsic frequency error of the host machine.

<P>It has sometimes been the experience that the local clock oscillator
frequency error is too large for the NTP discipline algorithm, which can
correct frequency errors as large as 43 seconds per day. There are two
possibilities that may result in this problem. First, the hardware time-
of-year clock chip must be disabled when using NTP, since this can
destabilize the discipline process. This is usually done using the
<TT><A HREF="tickadj.htm">tickadj</A></TT> program and the <TT>-s</TT>
command line argument, but other means may be necessary. For instance,
in the Sun Solaris kernel, this can be done using a command in the
system startup file.

<P>Normally, the daemon will adjust the local clock in small steps in
such a way that system and user programs are unaware of its operation.
The adjustment process operates continuously as long as the apparent
clock error exceeds 128 milliseconds, which for most Internet paths is a
quite rare event. If the event is simply an outlyer due to an occasional
network delay spike, the correction is simply discarded; however, if the
apparent time error persists for an interval of about 20 minutes, the
local clock is stepped to the new value (as an option, the daemon can be
compiled to slew at an accelerated rate to the new value, rather than be
stepped). This behavior is designed to resist errors due to severely
congested network paths, as well as errors due to confused radio clocks
upon the epoch of a leap second.

<H4>Debugging Checklist</H4>

If the <TT>ntpq</TT> or <TT>ntpdc</TT> programs do not show that
messages are being received by the daemon or that received messages do
not result in correct synchronization, verify the following:

<OL>

<P><LI>Verify the <TT>/etc/services</TT> file host machine is configured
to
accept UDP packets on the NTP port 123. NTP is specifically designed to
use UDP and does not respond to TCP.</LI>

<P><LI>Check the system log for <TT>ntpd</TT> messages about
configuration
errors, name-lookup failures or initialization problems.</LI>

<P><LI>Using the <TT>ntpdc</TT> program and <TT>iostats</TT> command,
verify that the received packets and packets sent counters are
incrementing. If the packets send counter does not increment and the
configuration file includes designated servers, something may be wrong
in the network configuration of the ntpd host. If this counter does
increment and packets are actually being sent to the network, but the
received packets counter does not increment, something may be wrong in
the network or the server may not be responding.</LI>

<P><LI>If both the packets sent counter and received packets counter do
increment, but the <TT>rec</TT> timestamp in the <TT>pe</TT> billboard
shows far from the current date, received packets are probably being
discarded for some reason. There is a handy, undocumented state variable
<TT>flash</TT> visible in the <TT>pe</TT>billboard. The value is in hex
and normally has the value zero (OK). However, if something is wrong,
the bits of this variable, reading from the right, correspond to the
sanity checks listed in Section 3.4.3 of the NTP specification <A
HREF="http://www.eecis.udel.edu/~mills/database/rfc/rfc1305/rfc1305b.ps"
>RFC-1305</A>. A bit other than zero indicates the associated sanity
check failed.</LI>

<P><LI>If the <TT>org, rec</TT> and <TT>xmt</TT> timestamps in the
<TT>pe</TT> billboard appear current, but the local clock is not set, as
indicated by a stratum number less than 16 in the <TT>rv</TT> command
without arguments, verify that valid clock offset, roundtrip delay and
dispersion are displayed for at least one peer. The clock offset should
be less than 1000 seconds, the roundtrip delay less than one second and
the dispersion less than one second.</LI>


<P><LI>While the algorithm can tolerate a relatively large frequency
error (up to 500 parts per million or 43 seconds per day), various
configuration errors (and in some cases kernel bugs) can exceed this
tolerance, leading to erratic behavior. This can result in frequent loss
of synchronization, together with wildly swinging offsets. Use the
<TT>ntpdc</TT> program (or temporary configuration file) and <TT>disable
pll</TT> command to prevent the <TT>ntpd</TT> daemon from setting the
clock. Using the <TT>ntpq</TT> or <TT>ntpdc</TT> programs, watch the
apparent offset as it varies over time to determine the intrinsic
frequency error. If the error increases by more than 22 milliseconds per
64-second poll interval, the intrinsic frequency must be reduced by some
means. The easiest way to do this is with the <TT><A
HREF="tickadj.htm">tickadj</A></TT> program and the <TT>-t</TT>
command line argument.</LI>

</OL>

<hr><a href=index.htm><img align=left src=pic/home.gif></a><address><a
href=mailto:mills@udel.edu> David L. Mills &lt;mills@udel.edu&gt;</a>
</address></a></body></html>