Troubleshooting Clusterware

1)      Make sure your nodes have exactly the same system time. The best practice is to synchronize the nodes using Network Time Protocol (NTP), modifying the NTP initialization file to add the -x (slewing) flag:
vi /etc/sysconfig/ntpd
OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"
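Before restarting ntpd, it is worth confirming the -x flag is actually in the OPTIONS line. This is a minimal sketch: a temp file stands in for /etc/sysconfig/ntpd so the snippet is self-contained, but on a real node you would grep the file itself.

```shell
# Sketch: confirm the ntpd init file enables slewed clock adjustment (-x).
# A temp file stands in for /etc/sysconfig/ntpd so this runs anywhere.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"
EOF

if grep -q '^OPTIONS=.*-x' "$cfg"; then
  slew_status="enabled"
else
  slew_status="disabled"
fi
echo "ntpd slewing: $slew_status"
rm -f "$cfg"
```

After confirming, restart ntpd (service ntpd restart) so the -x option takes effect; slewing avoids the abrupt clock steps that can contribute to false node evictions.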

2)      Run the diagnostics collection script (as root, from $GRID_HOME/bin):
./diagcollection.pl -collect

3)      Run the Cluster Verification Utility (cluvfy) to verify the Oracle Grid Infrastructure and RAC installation, configuration, and operation.

lltcind01.fnf.com{+ASM1}/apps/oracle/product/11.2.0/grid/bin> cluvfy comp -list

USAGE:
cluvfy comp  <component-name> <component-specific options>  [-verbose]

Valid components are:
        nodereach : checks reachability between nodes
        nodecon   : checks node connectivity
        cfs       : checks CFS integrity
        ssa       : checks shared storage accessibility
        space     : checks space availability
        sys       : checks minimum system requirements
        clu       : checks cluster integrity
        clumgr    : checks cluster manager integrity
        ocr       : checks OCR integrity
        olr       : checks OLR integrity
        ha        : checks HA integrity
        crs       : checks CRS integrity
        nodeapp   : checks node applications existence
        admprv    : checks administrative privileges
        peer      : compares properties with peers
        software  : checks software distribution
        asm       : checks ASM integrity
        acfs      : checks ACFS integrity
        gpnp      : checks GPnP integrity
        gns       : checks GNS integrity
        scan      : checks SCAN configuration
        ohasd     : checks OHASD integrity
        clocksync : checks Clock Synchronization
        vdisk     : checks Voting Disk Udev settings


Example:

/u01/app/oracle/product/11.2.0/grid/bin> cluvfy comp crs -n all -verbose

Verifying CRS integrity

Checking CRS integrity...
The Oracle clusterware is healthy on node "lltcind02"
The Oracle clusterware is healthy on node "lltcind01"

CRS integrity check passed

Verification of CRS integrity was successful.


/u01/app/oracle/product/11.2.0/grid/bin> cluvfy stage -list


USAGE:
cluvfy stage {-pre|-post} <stage-name> <stage-specific options>  [-verbose]

Valid stage options and stage names are:
        -post hwos    :  post-check for hardware and operating system
        -pre  cfs     :  pre-check for CFS setup
        -post cfs     :  post-check for CFS setup
        -pre  crsinst :  pre-check for CRS installation
        -post crsinst :  post-check for CRS installation
        -pre  hacfg   :  pre-check for HA configuration
        -post hacfg   :  post-check for HA configuration
        -pre  dbinst  :  pre-check for database installation
        -pre  acfscfg  :  pre-check for ACFS Configuration.
        -post acfscfg  :  post-check for ACFS Configuration.
        -pre  dbcfg   :  pre-check for database configuration
        -pre  nodeadd :  pre-check for node addition.
        -post nodeadd :  post-check for node addition.
        -post nodedel :  post-check for node deletion.

4)      Enable resource debugging to turn tracing on/off
sudo -u root crsctl set log res "ora.lltcind01.vip:1"
[sudo] password for oracle:
Set Resource ora.lltcind01.vip Log Level: 1

sudo -u root crsctl set log res "ora.lltcind01.vip:0"
Set Resource ora.lltcind01.vip Log Level: 0

5)      Set the environment variable SRVM_TRACE=TRUE to enable tracing for srvctl, cluvfy, netca, dbca, and dbua
srvctl config database -d TEST -a
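The variable only needs to be exported in the shell that launches the tool. A minimal sketch follows; the srvctl call itself is shown as a comment, since it requires a configured Grid environment to actually run.

```shell
# Enable Java-layer tracing for srvctl, cluvfy, netca, dbca, and dbua
# by exporting SRVM_TRACE before invoking the tool.
export SRVM_TRACE=TRUE

# With tracing on, the tool emits verbose trace output, e.g.:
#   srvctl config database -d TEST -a
# (commented out here: it requires a configured Grid environment)

echo "SRVM_TRACE=$SRVM_TRACE"
```

Unset the variable (unset SRVM_TRACE) when finished, or every subsequent tool invocation in that shell will produce trace output.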
6)      Check the following log files when a node eviction occurs:
$ORACLE_GRID/log/host_name/cssd/ocssd.log.  Look for "Begin Dump" or "End Dump" just before the reboot.

$ORACLE_GRID/log/host_name/client/oclskd.log
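Scanning ocssd.log for the dump markers can be scripted. This sketch runs against a fabricated two-line sample log so it is self-contained; on a real node you would point it at $ORACLE_GRID/log/host_name/cssd/ocssd.log instead.

```shell
# Sketch: find "Begin Dump" / "End Dump" markers in ocssd.log.
# The here-doc below is a fabricated sample; on a real node, set
# log=$ORACLE_GRID/log/<host_name>/cssd/ocssd.log instead.
log=$(mktemp)
cat > "$log" <<'EOF'
[    CSSD]clssscExit: CSSD aborting
[    CSSD]###### Begin Dump
[    CSSD]###### End Dump
EOF

dump_lines=$(grep -c 'Begin Dump\|End Dump' "$log")
echo "dump marker lines found: $dump_lines"
rm -f "$log"
```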
7)      Set the diagwait value
crsctl set css diagwait 13 -force

8)      Avoid false reboots
crsctl get css misscount (to determine the current setting. misscount must be greater than (timeout + margin) and greater than diagwait. The default of 30 seconds is recommended; do not change the value of misscount or disktimeout unless Oracle Support recommends doing so.)
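The relationship between misscount and diagwait can be sanity-checked with shell arithmetic. The values below are illustrative defaults (misscount 30, diagwait 13 as set in step 7); on a live cluster you would read them with crsctl get css misscount and crsctl get css diagwait.

```shell
# Sketch: check that misscount is greater than diagwait.  The values
# are illustrative; on a real cluster, read them with:
#   crsctl get css misscount
#   crsctl get css diagwait
misscount=30
diagwait=13

if [ "$misscount" -gt "$diagwait" ]; then
  check_result="ok: misscount ($misscount) > diagwait ($diagwait)"
else
  check_result="WARNING: misscount ($misscount) <= diagwait ($diagwait)"
fi
echo "$check_result"
```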
9)      Run ocrdump (as root) to dump the contents of the OCR (Oracle Cluster Registry) to a text file for review; with no arguments it writes OCRDUMPFILE in the current directory.
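Once the dump exists, it is plain text and can be searched for specific keys. The excerpt in the here-doc below is a fabricated, abbreviated approximation of ocrdump-style output (bracketed key names); on a real node you would grep the generated OCRDUMPFILE itself.

```shell
# Sketch: search an OCR dump for a key.  The here-doc is a fabricated,
# abbreviated excerpt in the ocrdump style; on a real node, run
# ocrdump as root and grep the generated OCRDUMPFILE instead.
dump=$(mktemp)
cat > "$dump" <<'EOF'
[SYSTEM]
[SYSTEM.version]
ORATEXT : 3
[SYSTEM.css]
[SYSTEM.css.misscount]
EOF

key='SYSTEM.css.misscount'
if grep -q "^\[$key\]" "$dump"; then
  key_found="yes"
else
  key_found="no"
fi
echo "key $key present: $key_found"
rm -f "$dump"
```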
