It is well known that poor performance on the standby server of a DataGuard pair can affect the performance of the primary database. This post shows an example of such a problem and how to use the view GV$EVENT_HISTOGRAM to track the issue down.
The databases were 11.2.0.1 on HPUX. I had been seeing alerts from OEM stating that the standby was seeing lag_apply delays when applying redo. Looking at the primary database alert log I could see the following entries:
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Errors in file /app/oracle/diag/rdbms/xxxprd1a/BSMPRD1A/trace/xxxPRD1A_lgwr_24722.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
Error 16198 for archive log file 1 to 'xxxPRD1B'
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
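Neither of the queries below appears in the original investigation, but they are a quick way of confirming what the alert log and OEM are reporting; dest_id 2 is assumed here purely because the errors refer to LOG_ARCHIVE_DEST_2.

-- On the PRIMARY: current state and last error recorded for the standby destination
select dest_id, status, error
from   v$archive_dest_status
where  dest_id = 2;

-- On the STANDBY: the lag figures that OEM's alert is based on
select name, value, time_computed
from   v$dataguard_stats
where  name in ('transport lag', 'apply lag');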
The error seemed to correct itself later on, but the timeout was indicative of a network problem; well, at least that was my original hypothesis.
However, I have a script I call rfs_writes.sql which I use on the standby database quite often, and once I had run that I was sent in a different direction.
A basic 101 now, just in case anybody does not know what the "RFS write" wait event means and how it occurs; a quick way to see the processes involved follows the list.
- The user commits a transaction, creating a redo record in the SGA; LGWR reads the redo record from the log buffer, writes it to the online redo log file and waits for confirmation from the LNS
- The LNS reads the same redo record from the log buffer and transmits it to the standby database using Oracle Net Services; the RFS receives the redo at the standby database and writes it to the SRL
- When the RFS receives a write complete from the disk, it transmits an acknowledgement back to the LNS process on the primary database, which in turn notifies the LGWR that the transmission is complete; the LGWR then sends a commit acknowledgement to the user
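The post itself does not query these processes directly, but a quick way to see them at work on the standby side is v$managed_standby; a minimal sketch:

-- Run on the STANDBY: lists the RFS (receiving) and MRP (applying) processes
-- and the redo thread/sequence/block each one is currently working on
select process, status, client_process, thread#, sequence#, block#
from   v$managed_standby
order  by process;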
The time it takes the RFS process to write the record into the standby redo log on the standby is captured under the wait event “RFS write” on the standby. Whilst this is happening, LGWR on the primary is waiting for the RFS to send its response back to the LNS, so it waits on “LGWR-LNS wait on channel”.
So if we determine how long LGWR waits on “LGWR-LNS wait on channel” and subtract how long the RFS write takes, we can roughly assume that the rest of the time is spent preparing and sending messages over the network.
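The post does not include a query for that arithmetic, but here is a minimal sketch of the primary-side half of it, assuming you collect the average "RFS write" time on the standby separately and supply it yourself; the bind variable :rfs_avg_ms is made up for this example.

-- Run on the PRIMARY: average 'LGWR-LNS wait on channel' time in ms, minus the
-- standby's average 'RFS write' time (gathered separately and supplied as
-- :rfs_avg_ms), leaving a rough estimate of the network/messaging component
select event,
       round(time_waited_micro / nullif(total_waits, 0) / 1000, 2)               avg_wait_ms,
       round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) - :rfs_avg_ms est_network_ms
from   v$system_event
where  event = 'LGWR-LNS wait on channel';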
However, due to various issues in the past with poorly performing disk, my standard practice is to immediately run my rfs_writes.sql script:
set lines 200
select inst_id, round(WAIT_TIME_MILLI/1000,2) wait_secs, last_update_time when, wait_count "How_many_times||since startup"
from GV$EVENT_HISTOGRAM where event like 'RFS write%'
--and round(WAIT_TIME_MILLI/1000,2) > 2
order by 2 desc
--It is useful to order by 3 as well to see the latest bucket first
/
Running that a couple of minutes apart made the issue pretty obvious:
   INST_ID  WAIT_SECS WHEN                                 How_many_times||since startup
---------- ---------- ------------------------------------ -----------------------------
         1      32.77 28-FEB-14 12.29.08.050637 AM +00:00                             10
         1      16.38 04-MAR-14 12.32.46.560177 AM +00:00                             41
         1       8.19 04-MAR-14 01.38.45.481812 AM +00:00                            215
         1        4.1 04-MAR-14 01.48.35.603204 AM +00:00                            548
         1       2.05 04-MAR-14 08.46.00.055955 AM +00:00                          16130
         1       1.02 04-MAR-14 08.47.07.311706 AM +00:00                          15620
         1        .51 04-MAR-14 08.45.54.182190 AM +00:00                          17805
         1        .26 04-MAR-14 08.44.28.031511 AM +00:00                          16137
         1        .13 04-MAR-14 08.46.10.214425 AM +00:00                          15786
         1        .06 04-MAR-14 08.47.13.869701 AM +00:00                           9826
         1        .03 04-MAR-14 08.44.33.950324 AM +00:00                           7256
         1        .02 04-MAR-14 08.47.19.916896 AM +00:00                          12607
         1        .01 04-MAR-14 08.46.46.507260 AM +00:00                          50712
         1          0 04-MAR-14 08.47.25.979131 AM +00:00                         148158
         1          0 04-MAR-14 08.47.26.990175 AM +00:00                         681294
         1          0 04-MAR-14 08.47.16.873984 AM +00:00                         939329

   INST_ID  WAIT_SECS WHEN                                 How_many_times||since startup
---------- ---------- ------------------------------------ -----------------------------
         1      32.77 28-FEB-14 12.29.08.048333 AM +00:00                             10
         1      16.38 04-MAR-14 12.32.46.557873 AM +00:00                             41
         1       8.19 04-MAR-14 01.38.45.479508 AM +00:00                            215
         1        4.1 04-MAR-14 01.48.35.600900 AM +00:00                            548
         1       2.05 04-MAR-14 08.49.12.605524 AM +00:00                          16137
         1       1.02 04-MAR-14 08.49.52.840409 AM +00:00                          15630
         1        .51 04-MAR-14 08.48.59.762030 AM +00:00                          17808
         1        .26 04-MAR-14 08.49.27.053124 AM +00:00                          16142
         1        .13 04-MAR-14 08.49.32.497106 AM +00:00                          15791
         1        .06 04-MAR-14 08.48.08.513296 AM +00:00                           9828
         1        .03 04-MAR-14 08.48.24.669325 AM +00:00                           7259
         1        .02 04-MAR-14 08.49.28.915320 AM +00:00                          12608
         1        .01 04-MAR-14 08.46.46.504956 AM +00:00                          50712
         1          0 04-MAR-14 08.49.50.806599 AM +00:00                         148173
         1          0 04-MAR-14 08.49.57.900828 AM +00:00                         681361
         1          0 04-MAR-14 08.49.58.928416 AM +00:00
From this report it is quite obvious that the counts in some of the bigger buckets have increased: in the two-minute window between the snapshots there were another 10 writes in the 1.02 second bucket and another 7 in the 2.05 second bucket. Glance shows that disk I/O is at 100%
and a Glance U looking at the HBA cards indicates massive service times on one HBA:
Idx Controller   Util %   Qlen  ServTm   IO/s  Read/s  Write/s  IO KB/s  Type
--------------------------------------------------------------------------------
  2 fcd1            0.0    0.0     0.0    0.0     0.0      0.0      0.0  HBA
  3 fcd2           97.3   87.0     2.7   17.1     5.0     12.1   2469.8  HBA
  4 fcd3           96.2    1.0   209.7   16.5     4.2     12.3   1506.1  HBA
A quick check on the light levels down the fibre did not indicate any problems, at least as far as fcd3 goes:
$ /shared/admin/bin/get-fibre-power.sh
/dev/fcd0 RX Power: 0.40 - ok
/dev/fcd1 RX Power: 0.00 - Unplugged or Disabled
/dev/fcd2 RX Power: 0.29 - ok
/dev/fcd3 RX Power: 0.34 - ok
So I was pretty certain that the original lag apply message was actually an indication of poor disk throughput on the standby database, and that this was being translated into issues on the primary database, which eventually became a P1 incident. Luckily, because I had started investigating before we knew we had production issues, we got ahead of the game and I had already asked the storage team to investigate. The performance problems on the standby were by then causing issues on production, as can be seen from the wait events:
EVENT                          WAIT_CLASS       EVENT_COUNT  PCT_TOTAL
------------------------------ ---------------- ----------- ----------
log file sync                  Commit                  8065      64.57
LNS wait on SENDREQ            Network                 1351      10.82
LGWR-LNS wait on channel       Other                   1348      10.79
db file sequential read        User I/O                 753       6.03
CPU + Wait for CPU             CPU                      718       5.75
log file switch completion     Configuration           145       1.16
direct path read               User I/O                110        .88
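For completeness, a breakdown like the one above can be produced from Active Session History; this is a minimal sketch, assuming the Diagnostics Pack is licensed (the original post does not say how its summary was generated):

-- Top wait events over the last 30 minutes from ASH, with CPU samples
-- rolled up into a pseudo-event, similar to the summary shown above
select nvl(event, 'CPU + Wait for CPU')                   event,
       nvl(wait_class, 'CPU')                             wait_class,
       count(*)                                           event_count,
       round(ratio_to_report(count(*)) over () * 100, 2)  pct_total
from   v$active_session_history
where  sample_time > sysdate - interval '30' minute
group  by event, wait_class
order  by event_count desc;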
I hope this post proves informative; I keep the rfs_writes.sql script as one of my main tools when investigating DataGuard issues associated with lag_apply errors.
I also think there is more research to be done into identifying the network component of the total transfer time using the formula given earlier.
