It is well known that poor performance on the standby server of a DataGuard pair can affect the performance of the primary database. This post shows an example of such a problem and how to use the view GV$EVENT_HISTOGRAM to track the issue down.
The databases were 11.2.0.1 on HPUX. I had been seeing alerts from OEM stating that the standby was seeing lag_apply delays when applying redo. Looking at the primary database alert log I could see the following entries:
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Errors in file /app/oracle/diag/rdbms/xxxprd1a/BSMPRD1A/trace/xxxPRD1A_lgwr_24722.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
Error 16198 for archive log file 1 to 'xxxPRD1B'
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
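Neither of the queries below appears in the original investigation, but they are a quick way of confirming what the alert log and OEM are reporting; dest_id 2 is assumed here purely because the errors refer to LOG_ARCHIVE_DEST_2.

-- On the PRIMARY: current state and last error recorded for the standby destination
select dest_id, status, error
from   v$archive_dest_status
where  dest_id = 2;

-- On the STANDBY: the lag figures that OEM's alert is based on
select name, value, time_computed
from   v$dataguard_stats
where  name in ('transport lag', 'apply lag');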
The error seemed to correct itself later on, but the timeout was indicative of a network problem; well, at least that was my original hypothesis.
However, I have a script I call rfs_writes.sql which I use on the standby database quite often, and once I had run that I was sent in a different direction.
A basic 101 now, just in case anybody does not know what the "RFS write" wait event means and how it occurs; a quick way to see the processes involved follows the list.
- The user commits a transaction, creating a redo record in the SGA; LGWR reads the redo record from the log buffer, writes it to the online redo log file and waits for confirmation from the LNS
- The LNS reads the same redo record from the log buffer and transmits it to the standby database using Oracle Net Services; the RFS receives the redo at the standby database and writes it to the SRL
- When the RFS receives a write complete from the disk, it transmits an acknowledgement back to the LNS process on the primary database, which in turn notifies the LGWR that the transmission is complete; the LGWR then sends a commit acknowledgement to the user
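The post itself does not query these processes directly, but a quick way to see them at work on the standby side is v$managed_standby; a minimal sketch:

-- Run on the STANDBY: lists the RFS (receiving) and MRP (applying) processes
-- and the redo thread/sequence/block each one is currently working on
select process, status, client_process, thread#, sequence#, block#
from   v$managed_standby
order  by process;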
The time it takes the RFS process to write the record into the standby redo log on the standby is captured under the wait event “RFS write” on the standby. Whilst this is happening, LGWR on the primary is waiting for the RFS to send its response back to the LNS, so it waits on “LGWR-LNS wait on channel”.
So if we determine how long LGWR waits on “LGWR-LNS wait on channel” and subtract how long the RFS write takes, we can roughly assume that the rest of the time is spent preparing and sending messages over the network.
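The post does not include a query for that arithmetic, but here is a minimal sketch of the primary-side half of it, assuming you collect the average "RFS write" time on the standby separately and supply it yourself; the bind variable :rfs_avg_ms is made up for this example.

-- Run on the PRIMARY: average 'LGWR-LNS wait on channel' time in ms, minus the
-- standby's average 'RFS write' time (gathered separately and supplied as
-- :rfs_avg_ms), leaving a rough estimate of the network/messaging component
select event,
       round(time_waited_micro / nullif(total_waits, 0) / 1000, 2)               avg_wait_ms,
       round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) - :rfs_avg_ms est_network_ms
from   v$system_event
where  event = 'LGWR-LNS wait on channel';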
However, due to various issues in the past with poorly performing disk, my standard practice is to immediately run my rfs_writes.sql script:
set lines 200
select inst_id, round(WAIT_TIME_MILLI/1000,2) wait_secs, last_update_time when, wait_count "How_many_times||since startup"
from GV$EVENT_HISTOGRAM where event like 'RFS write%'
--and round(WAIT_TIME_MILLI/1000,2) > 2
order by 2 desc
--It is useful to order by 3 as well to see the latest bucket first
/
Running that a couple of minutes apart made the issue pretty obvious:
   INST_ID  WAIT_SECS WHEN                                 How_many_times||since startup
---------- ---------- ------------------------------------ -----------------------------
         1      32.77 28-FEB-14 12.29.08.050637 AM +00:00                             10
         1      16.38 04-MAR-14 12.32.46.560177 AM +00:00                             41
         1       8.19 04-MAR-14 01.38.45.481812 AM +00:00                            215
         1        4.1 04-MAR-14 01.48.35.603204 AM +00:00                            548
         1       2.05 04-MAR-14 08.46.00.055955 AM +00:00                          16130
         1       1.02 04-MAR-14 08.47.07.311706 AM +00:00                          15620
         1        .51 04-MAR-14 08.45.54.182190 AM +00:00                          17805
         1        .26 04-MAR-14 08.44.28.031511 AM +00:00                          16137
         1        .13 04-MAR-14 08.46.10.214425 AM +00:00                          15786
         1        .06 04-MAR-14 08.47.13.869701 AM +00:00                           9826
         1        .03 04-MAR-14 08.44.33.950324 AM +00:00                           7256
         1        .02 04-MAR-14 08.47.19.916896 AM +00:00                          12607
         1        .01 04-MAR-14 08.46.46.507260 AM +00:00                          50712
         1          0 04-MAR-14 08.47.25.979131 AM +00:00                         148158
         1          0 04-MAR-14 08.47.26.990175 AM +00:00                         681294
         1          0 04-MAR-14 08.47.16.873984 AM +00:00                         939329

   INST_ID  WAIT_SECS WHEN                                 How_many_times||since startup
---------- ---------- ------------------------------------ -----------------------------
         1      32.77 28-FEB-14 12.29.08.048333 AM +00:00                             10
         1      16.38 04-MAR-14 12.32.46.557873 AM +00:00                             41
         1       8.19 04-MAR-14 01.38.45.479508 AM +00:00                            215
         1        4.1 04-MAR-14 01.48.35.600900 AM +00:00                            548
         1       2.05 04-MAR-14 08.49.12.605524 AM +00:00                          16137
         1       1.02 04-MAR-14 08.49.52.840409 AM +00:00                          15630
         1        .51 04-MAR-14 08.48.59.762030 AM +00:00                          17808
         1        .26 04-MAR-14 08.49.27.053124 AM +00:00                          16142
         1        .13 04-MAR-14 08.49.32.497106 AM +00:00                          15791
         1        .06 04-MAR-14 08.48.08.513296 AM +00:00                           9828
         1        .03 04-MAR-14 08.48.24.669325 AM +00:00                           7259
         1        .02 04-MAR-14 08.49.28.915320 AM +00:00                          12608
         1        .01 04-MAR-14 08.46.46.504956 AM +00:00                          50712
         1          0 04-MAR-14 08.49.50.806599 AM +00:00                         148173
         1          0 04-MAR-14 08.49.57.900828 AM +00:00                         681361
         1          0 04-MAR-14 08.49.58.928416 AM +00:00
From this report it is quite obvious that the counts in some of the bigger buckets have increased: in the two-minute window between the snapshots there were another 10 writes in the 1.02 second bucket and another 7 in the 2.05 second bucket. Glance shows that disk I/O is at 100%
and a Glance U looking at the HBA cards indicates massive service times on one HBA:
Idx Controller   Util %   Qlen  ServTm   IO/s  Read/s  Write/s  IO KB/s  Type
--------------------------------------------------------------------------------
  2 fcd1            0.0    0.0     0.0    0.0     0.0      0.0      0.0  HBA
  3 fcd2           97.3   87.0     2.7   17.1     5.0     12.1   2469.8  HBA
  4 fcd3           96.2    1.0   209.7   16.5     4.2     12.3   1506.1  HBA
A quick check on the light levels down the fibre did not indicate any problems, at least as far as fcd3 goes:
$ /shared/admin/bin/get-fibre-power.sh
/dev/fcd0 RX Power: 0.40 - ok
/dev/fcd1 RX Power: 0.00 - Unplugged or Disabled
/dev/fcd2 RX Power: 0.29 - ok
/dev/fcd3 RX Power: 0.34 - ok
So I was pretty certain that the original lag apply message was actually an indication of poor disk throughput on the standby database, and that this was being translated into issues on the primary database, which eventually became a P1 incident. Luckily, because I had started investigating before we knew we had production issues, we got ahead of the game and I had already asked the storage team to investigate. The performance problems on the standby were by then causing issues on production, as can be seen from the wait events:
EVENT                          WAIT_CLASS       EVENT_COUNT  PCT_TOTAL
------------------------------ ---------------- ----------- ----------
log file sync                  Commit                  8065      64.57
LNS wait on SENDREQ            Network                 1351      10.82
LGWR-LNS wait on channel       Other                   1348      10.79
db file sequential read        User I/O                 753       6.03
CPU + Wait for CPU             CPU                      718       5.75
log file switch completion     Configuration           145       1.16
direct path read               User I/O                110        .88
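For completeness, a breakdown like the one above can be produced from Active Session History; this is a minimal sketch, assuming the Diagnostics Pack is licensed (the original post does not say how its summary was generated):

-- Top wait events over the last 30 minutes from ASH, with CPU samples
-- rolled up into a pseudo-event, similar to the summary shown above
select nvl(event, 'CPU + Wait for CPU')                   event,
       nvl(wait_class, 'CPU')                             wait_class,
       count(*)                                           event_count,
       round(ratio_to_report(count(*)) over () * 100, 2)  pct_total
from   v$active_session_history
where  sample_time > sysdate - interval '30' minute
group  by event, wait_class
order  by event_count desc;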
I hope this post proves informative; I keep the rfs_writes.sql script as one of my main tools when investigating DataGuard issues associated with lag_apply errors.
I also think there is more research to be done into identifying the network component of the total transfer time using the formula given earlier.
