Troubleshooting 11.2 Clusterware Node Evictions (Note 1050693.1)

Starting 11.2.0.2, a node eviction may not actually reboot the machine.  This is called a rebootless restart.

To identify which process initiates a reboot, you need to review below are important files

  • Clusterware alert log in <GRID_HOME>/log/<nodename>alertnodename
  • The cssdagent log(s) in <GRID_HOME>/log/<nodename>/agent/ohasd/oracssdagent_root
  • The cssdmonitor log(s) in <GRID_HOME>/log/<nodename>/agent/ohasd/oracssdmonitor_root
  • The ocssd log(s) in <GRID_HOME>/log/<nodename>/cssd
  • The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
  • IPD/OS or OS Watcher data.  IPD/OS is an old name for the Cluster Health Monitor.  The names can be used interchaneably although Oracle now calls the tool Cluster Health Monitor
  • 'opatch lsinventory -detail' output for the GRID home
  • Message files /var/log/message
Common Causes of eviction:

OCSSD Eviction: 1) Network failure or latencies issue between nodes.  It takes 30 consecutive missed checkins to cause a node eviction.  2)  Problem writing / reading the voting disk  3) A member kill escallation like the LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanisim.  If this times out, it could escalate to a node evict.

CSSDAGENT or CSSDMONITOR Eviction:  1) OS Scheduler problem as a result of OS is locked upor execsive amounts of load on the server such as CPU utilization is as high as 100% 2) CSS process is hung 3) Oracle bug

3 comments:

  1. Gr8.. Thanks a lot for sharing, Really....

    ReplyDelete
  2. Very good article.

    I have also listed top 4 Reasons for node reboot or node eviction at http://www.dbas-oracle.com/2013/06/Top-4-Reasons-Node-Reboot-Node-Eviction-in-Real-Application-Cluster-RAC-Environment.html

    ReplyDelete
  3. It’s really helfull………… I really like this page so much, so better to keep on posting! Thanks…
    eviction law firm broward

    ReplyDelete