NTP slew in clusters

Q: Why is NTP slew important?

To keep the system time synchronized with other nodes in an HACMP cluster or across the enterprise, Network Time Protocol (NTP) should be implemented. In its default configuration, the NTP daemon may correct the system time by resetting (stepping) the clock on the node to match a reference clock. If the reference clock is behind the system clock, the system clock is set backwards, so the same time period passes twice. Internal timers in HACMP and Oracle databases that depend on the system time can then wait longer than intended. When that happens, HACMP may stop the node, or the Oracle instance may shut itself down.
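The effect of a backwards step on a timer can be illustrated with a small simulation. This is only a sketch: the function name, the one-second tick, and the starting clock value are assumptions made for illustration and do not reflect how HACMP or Oracle implement their timers.

```python
def real_wait_seconds(timeout, step_at, step_back):
    """Simulate a timer that waits until the wall clock reaches a deadline.

    timeout   -- seconds the timer intends to wait
    step_at   -- real seconds after timer start when the clock is stepped
    step_back -- seconds the wall clock is set backwards at that moment
    Returns the real time (in seconds) that passes before the timer fires.
    """
    wall = 1000                 # arbitrary wall-clock reading at timer start
    deadline = wall + timeout   # timer fires when the wall clock reaches this
    elapsed = 0                 # real time, unaffected by clock steps
    while wall < deadline:
        elapsed += 1
        wall += 1
        if elapsed == step_at:
            wall -= step_back   # the clock is reset backwards in one jump
    return elapsed


print(real_wait_seconds(10, 5, 0))    # no step: the timer waits as intended
print(real_wait_seconds(10, 5, 30))   # 30-second backwards step mid-wait
```

In this model a 10-second timer that sees a 30-second backwards step ends up waiting 40 real seconds, which is the kind of stretched wait that can trip a watchdog.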

Oracle logs an ORA-29740 error when it shuts down the instance due to inconsistent timers. The hatsd daemon used by HACMP logs a TS_THREAD_STUCK_ER error in the system error log just before HACMP stops a node due to an expired timer.

To avoid this issue, system managers should configure the NTP daemon to adjust the time on the node gradually, in small increments, until the system clock and the reference clock are in sync (this is called “slewing” the clock) instead of resetting the time in one large step. This behavior is configured with the -x flag of the xntpd daemon.
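A rough sketch of what slewing means in practice is shown below. The 500 ppm rate (0.5 ms of correction per second) is an assumption taken from the maximum slew rate commonly documented for ntpd; the actual rate used by xntpd on a given AIX level may differ, and both function names are hypothetical.

```python
MAX_SLEW_PPM = 500  # assumption: 500 microseconds of correction per real second


def slew_duration_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Real time needed to remove an offset when the clock is only slewed."""
    return abs(offset_seconds) * 1_000_000 / slew_ppm


def slewed_readings(start, ahead_by, real_seconds, slew_ppm=MAX_SLEW_PPM):
    """Clock readings, one per real second, for a clock that is ahead_by
    seconds fast and is slowed down slightly until the offset is gone."""
    rate = slew_ppm / 1_000_000
    readings = []
    clock = start
    remaining = ahead_by
    for _ in range(real_seconds):
        correction = min(rate, remaining)  # never correct more than is left
        clock += 1 - correction            # clock still moves forward
        remaining -= correction
        readings.append(clock)
    return readings


print(slew_duration_seconds(1.0))  # seconds needed to absorb a 1-second offset
```

Under this assumption a one-second offset takes 2000 seconds to absorb, but the clock readings never go backwards, so timers that compare system time are not stretched by a sudden jump.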

Reference:

http://www.aixhealthcheck.com/blog?id=286
Troubleshooting ORA-29740 in a RAC Environment (Doc ID 219361.1)
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

Tuning Inter-Instance Performance in RAC and OPS (Doc ID 181489.1)
The view X$KCLCRST (CR Statistics) may be helpful in debugging ‘global cache cr request’ wait issues. It will return the number of requests that were handled for data or undo header blocks, the number of requests resulting in the shipment of a block (cr or current), and the number of times a read from disk status is returned.
The GCS/GES may run out of tickets. When viewing the racdiag.sql output (Note 135714.1) or querying the gv$ges_traffic_controller or gv$dlm_traffic_controller views, you may find that TCKT_AVAIL shows ‘0’. Tickets are the concept used to track the available network buffer space. The maximum number of tickets available is a function of the network send buffer size. When no tickets are available, lmd and lmon always buffer their messages. A node relies on messages coming back from the remote node to release tickets for reuse.
