
The ‘rfs writes’ event on a DataGuard standby database


It is well known that poor performance on the standby server of a DataGuard pair can affect the performance of the primary database. This post shows an example, and how the view GV$EVENT_HISTOGRAM can be used to track down such an issue.

The databases were 11.2.0.1 on HPUX. I had been seeing alerts from OEM stating that the standby was experiencing lag_apply delays when applying redo. Looking at the primary database alert log I could see the following entries:

ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Errors in file /app/oracle/diag/rdbms/xxxprd1a/BSMPRD1A/trace/xxxPRD1A_lgwr_24722.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
Error 16198 for archive log file 1 to 'xxxPRD1B'
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED

That seemed to correct itself later on, but the timeout error looked indicative of a network problem, or at least that was my original hypothesis.

However, I have a script called rfs_writes.sql which I run on the standby database quite often, and once I had run it I was sent in a different direction.

First, a basic 101 in case anybody does not know what the 'RFS write' wait event means and how it occurs:

  • The user commits a transaction, creating a redo record in the SGA. LGWR reads the redo record from the log buffer, writes it to the online redo log file and waits for confirmation from the LNS process.
  • The LNS reads the same redo record from the buffer and transmits it to the standby database using Oracle Net Services. The RFS process receives the redo at the standby database and writes it to the standby redo log (SRL).
  • When the RFS receives a write complete from the disk, it transmits an acknowledgement back to the LNS process on the primary database, which in turn notifies LGWR that the transmission is complete. LGWR then sends a commit acknowledgement to the user (see the sketch just below for a quick way to confirm the destination is running in this synchronous mode).
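
The sketch below, run on the primary, confirms that the destination is configured for synchronous transport, which is what makes LGWR wait for the standby acknowledgement. I am assuming destination 2 is the standby, matching the LOG_ARCHIVE_DEST_2 messages in the alert log above.

set lines 200
col destination format a30
select dest_id, destination, transmit_mode, affirm, status
from   v$archive_dest
where  dest_id = 2;

For a SYNC AFFIRM destination TRANSMIT_MODE should come back as SYNCHRONOUS (or PARALLELSYNC) and AFFIRM as YES.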

[Diagram: LGWR SYNC redo transport between the primary (LGWR, LNS) and the standby (RFS, standby redo logs)]

The time it takes the RFS process to write the record into the standby redo log on the standby is captured under the wait event 'RFS write' on the standby. While this is happening, LGWR at the primary is waiting for the RFS to send its response back to the LNS, so it waits on 'LGWR-LNS wait on channel'.
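
On the standby you can see the RFS processes at work, and which standby redo log sequence and block they are currently writing, with a query along these lines (a sketch against V$MANAGED_STANDBY; column list as in 11.2):

set lines 200
select process, status, client_process, thread#, sequence#, block#
from   v$managed_standby
where  process like 'RFS%';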

So if we determine how long LGWR waits on 'LGWR-LNS wait on channel' and subtract the time RFS spends performing its write, we can roughly assume the remainder is time spent preparing and sending messages over the network.
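
A rough way to put numbers on that is to compare the cumulative wait times on each side (a sketch; V$SYSTEM_EVENT figures are totals since instance startup, so take the difference between two samples if you want a picture of a specific interval). On the primary:

select event, total_waits,
       round(time_waited_micro/1e6,1) total_secs,
       round(average_wait*10,1)       avg_ms
from   v$system_event
where  event in ('LGWR-LNS wait on channel','LNS wait on SENDREQ');

and on the standby:

select event, total_waits,
       round(time_waited_micro/1e6,1) total_secs,
       round(average_wait*10,1)       avg_ms
from   v$system_event
where  event = 'RFS write';

The 'LGWR-LNS wait on channel' total minus the 'RFS write' total then gives a rough estimate of the network component.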

However, due to various issues in the past with poorly performing disk, my standard practice is to immediately run my rfs_writes.sql script:

set lines 200
select inst_id,round(WAIT_TIME_MILLI/1000,2) wait_secs, last_update_time when, wait_count "How_many_times||since startup"
from GV$EVENT_HISTOGRAM where event like 'RFS write%'
--and round(WAIT_TIME_MILLI/1000,2) > 2
order by 2 desc
--It is useful to order by 3 as well to see the latest bucket first
/
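
If you would rather not eyeball two outputs and subtract the counts by hand, the per-bucket difference over a fixed window can be captured with something like the sketch below. The scratch table name rfs_hist_snap and the use of DBMS_LOCK.SLEEP (which needs an execute grant on DBMS_LOCK) are my own choices, not part of rfs_writes.sql.

-- snapshot the current histogram counts
create table rfs_hist_snap as
select inst_id, wait_time_milli, wait_count
from   gv$event_histogram
where  event like 'RFS write%';

-- wait two minutes
exec dbms_lock.sleep(120)

-- report how many waits landed in each bucket during the window
select h.inst_id,
       round(h.wait_time_milli/1000,2) bucket_secs,
       h.wait_count - s.wait_count     waits_in_window
from   gv$event_histogram h
join   rfs_hist_snap s
  on   s.inst_id = h.inst_id
 and   s.wait_time_milli = h.wait_time_milli
where  h.event like 'RFS write%'
order  by 2 desc;

drop table rfs_hist_snap;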

Running it twice, a couple of minutes apart, made the issue pretty obvious:

INST_ID  WAIT_SECS WHEN                                                                      How_many_times||since startup
 ---------- ---------- ------------------------------------------------------------------------- -----------------------------
 1      32.77 28-FEB-14 12.29.08.050637 AM +00:00                                                                  10
 1      16.38 04-MAR-14 12.32.46.560177 AM +00:00                                                                  41
 1       8.19 04-MAR-14 01.38.45.481812 AM +00:00                                                                 215
 1        4.1 04-MAR-14 01.48.35.603204 AM +00:00                                                                 548
 1       2.05 04-MAR-14 08.46.00.055955 AM +00:00                                                               16130
 1       1.02 04-MAR-14 08.47.07.311706 AM +00:00                                                               15620
 1        .51 04-MAR-14 08.45.54.182190 AM +00:00                                                               17805
 1        .26 04-MAR-14 08.44.28.031511 AM +00:00                                                               16137
 1        .13 04-MAR-14 08.46.10.214425 AM +00:00                                                               15786
 1        .06 04-MAR-14 08.47.13.869701 AM +00:00                                                                9826
 1        .03 04-MAR-14 08.44.33.950324 AM +00:00                                                                7256
 1        .02 04-MAR-14 08.47.19.916896 AM +00:00                                                               12607
 1        .01 04-MAR-14 08.46.46.507260 AM +00:00                                                               50712
 1          0 04-MAR-14 08.47.25.979131 AM +00:00                                                              148158
 1          0 04-MAR-14 08.47.26.990175 AM +00:00                                                              681294
 1          0 04-MAR-14 08.47.16.873984 AM +00:00                                                              939329
INST_ID  WAIT_SECS WHEN                                                                      How_many_times||since startup
 ---------- ---------- ------------------------------------------------------------------------- -----------------------------
 1      32.77 28-FEB-14 12.29.08.048333 AM +00:00                                                                  10
 1      16.38 04-MAR-14 12.32.46.557873 AM +00:00                                                                  41
 1       8.19 04-MAR-14 01.38.45.479508 AM +00:00                                                                 215
 1        4.1 04-MAR-14 01.48.35.600900 AM +00:00                                                                 548
 1       2.05 04-MAR-14 08.49.12.605524 AM +00:00                                                               16137
 1       1.02 04-MAR-14 08.49.52.840409 AM +00:00                                                               15630
 1        .51 04-MAR-14 08.48.59.762030 AM +00:00                                                               17808
 1        .26 04-MAR-14 08.49.27.053124 AM +00:00                                                               16142
 1        .13 04-MAR-14 08.49.32.497106 AM +00:00                                                               15791
 1        .06 04-MAR-14 08.48.08.513296 AM +00:00                                                                9828
 1        .03 04-MAR-14 08.48.24.669325 AM +00:00                                                                7259
 1        .02 04-MAR-14 08.49.28.915320 AM +00:00                                                               12608
 1        .01 04-MAR-14 08.46.46.504956 AM +00:00                                                               50712
 1          0 04-MAR-14 08.49.50.806599 AM +00:00                                                              148173
 1          0 04-MAR-14 08.49.57.900828 AM +00:00                                                              681361
 1          0 04-MAR-14 08.49.58.928416 AM +00:00

From this report it is quite obvious that the counts in some of the bigger buckets have increased: in the two-minute window shown there were 10 more writes in the 1.02 second bucket and 7 more in the 2.05 second bucket (each WAIT_TIME_MILLI bucket counts waits that took between half that value and the value itself, so these are writes of roughly 0.5 to 1 second and 1 to 2 seconds respectively). Glance shows that disk I/O is at 100%.

[Glance screenshot: disk utilisation at 100%]

A Glance 'u' report looking at the HBA cards indicates massive service times on one HBA:

[Glance screenshot: HBA controller utilisation and service times]

Idx Controller  Util %   Qlen  ServTm     IO/s   Read/s  Write/s  IO KB/s  Type
--------------------------------------------------------------------------------
   2 fcd1          0.0    0.0     0.0      0.0      0.0      0.0      0.0 HBA
   3 fcd2         97.3   87.0     2.7     17.1      5.0     12.1   2469.8 HBA
   4 fcd3         96.2    1.0   209.7     16.5      4.2     12.3   1506.1 HBA
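
The same histogram technique can also be pointed at the standby's own I/O events to corroborate what Glance is saying from inside the database (a sketch; the event list is my own choice and can be adjusted to taste):

set lines 200
select inst_id, event,
       round(wait_time_milli/1000,2) bucket_secs,
       wait_count, last_update_time
from   gv$event_histogram
where  event in ('db file sequential read','db file parallel write')
order  by event, wait_time_milli desc;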

A quick check on the light levels down the fibre did not indicate any problems, at least as far as fcd3 goes:

 $/shared/admin/bin/get-fibre-power.sh
/dev/fcd0 RX Power: 0.40 - ok
/dev/fcd1 RX Power: 0.00 - Unplugged or Disabled
/dev/fcd2 RX Power: 0.29 - ok
/dev/fcd3 RX Power: 0.34 - ok

So I was pretty certain that the original lag apply message was actually an indication of poor disk throughput on the standby, and that this was being translated into issues on the primary database, which eventually became a P1 incident. Luckily, because I had been investigating before I knew we had production issues, we got ahead of the game and I had already asked the storage team to investigate. The performance problems on the standby were then causing issues on production, as can be seen from the wait events:

EVENT                                    WAIT_CLASS                EVENT_COUNT  PCT_TOTAL
---------------------------------------- ------------------------- ----------- ----------
log file sync                            Commit                           8065      64.57
LNS wait on SENDREQ                      Network                          1351      10.82
LGWR-LNS wait on channel                 Other                            1348      10.79
db file sequential read                  User I/O                          753       6.03
CPU + Wait for CPU                       CPU                               718       5.75
log file switch completion               Configuration                     145       1.16
direct path read                         User I/O                          110        .88
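
For reference, a breakdown like this can be produced from ASH on the primary with something along these lines (a sketch rather than the exact query used above; V$ACTIVE_SESSION_HISTORY needs the Diagnostics Pack and the one-hour window is arbitrary):

select nvl(event,'CPU + Wait for CPU')                    event,
       nvl(wait_class,'CPU')                              wait_class,
       count(*)                                           event_count,
       round(100 * ratio_to_report(count(*)) over (), 2)  pct_total
from   v$active_session_history
where  sample_time > sysdate - 1/24
group  by event, wait_class
order  by 3 desc;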

I hope this post proves informative. I keep the rfs_writes.sql script as one of my main tools when investigating DataGuard issues associated with lag_apply errors.

I also think there is more research to be done into identifying the network component of the total transfer time using the formula given earlier.


