Update: run steps 5 a) through d) below and skip the log gathers, as this is a known bug in the Isilon 7.x firmware.
This is one of those cases where we would have needed debug logging and packet captures to reproduce the issue and "catch" it in the act. In the future, I strongly recommend allowing Isilon Support to engage over WebEx to collect the necessary data, so that we are not left with insufficient information. To do this, open a Severity 1 ticket and post the WebEx details directly into the SR so that an engineer can join promptly. Once the required data is collected, the affected processes can be restarted.
IF a WebEx is not possible and/or the business needs to bring the services back online immediately, I recommend running through the following procedure (copy/paste commands as shown):
1) Make the directory for data collection
# mkdir -p /ifs/data/Isilon_Support/$(date +%m%d%y)
2) Gather data
# isi_for_array '/usr/likewise/bin/lw-lsa ad-get-machine password > /ifs/data/Isilon_Support/$(date +%m%d%y)/`hostname`.lw-lsa_ad-get-machine_password.txt'
# isi_for_array '/usr/likewise/bin/lw-lsa get-status > /ifs/data/Isilon_Support/$(date +%m%d%y)/`hostname`.lw-lsa_get-status.txt'
# isi_for_array -s 'klist -K -k DYNAMIC:/usr/lib/kt_isi_pstore.so:cifs:FQDN.DOMAIN.COM > /ifs/data/Isilon_Support/$(date +%m%d%y)/`hostname`.klist.txt'
3) Gather cores (gcore)
# isi_for_array -s "pgrep -l \"lwio|lsass|srvsvc|netlogon|lwreg|lwsmd\" | awk '{system(\"gcore -s -c /ifs/data/Isilon_Support/\$(date +%m%d%y)/\`hostname\`.\$(date +%m%d%y_%H%M.%S).\"\$2\".core \"\$1)}'"
4) Use the following to turn on debug logging, and start packet captures
# isi_for_array 'isi smb log-level --set=debug; isi auth log-level --set=debug; for i in `ifconfig | grep -B2 ether | grep flags | cut -d: -f1`; do tcpdump -i ${i} -s0 -w /ifs/data/Isilon_Support/$(date +%m%d%y)/`hostname`.${i}_$(date +%m%d%Y_%H%M%S).pcap &; done' &; sleep 5; echo "Debug logging and captures running! Please reproduce issue, hit [ENTER] once reproduced to end capture and debug logging..."; read; echo "Stopping debug/pcaps"; isi_for_array 'isi smb log-level --set=warning; isi auth log-level --set=warning; pkill -INT tcpdump'
NOTE:
After running the above, debug logging and pcaps will be started on ALL interfaces. After the issue is reproduced (error encountered), hit ENTER to stop the captures and return logging to its default level. This will produce a LOT of data, so keep the capture window very brief, just long enough to reproduce a failure: seconds, not minutes.
5) Begin restarting processes in order of: netlogon, lsass, srvsvc, lwio - testing access in between each restart - STOPPING WHEN ACCESS IS RESTORED:
IF core.gz files already exist in /var/crash, copy them elsewhere before running the commands below, which will overwrite them.
a) isi_for_array 'killall -6 netlogon'
b) isi_for_array 'killall -6 lsass'
c) isi_for_array 'killall -6 srvsvc'
d) isi_for_array 'killall -6 lwio' NOTE: Potentially DISRUPTIVE --> This will close *ALL* active connections, only use this when SMB is 100% unavailable, and the above 3 processes did NOT resolve the issue
6) Collect two log gathers, one for the data we collected, and one for debug logs (normal gather):
# isi_gather_info --nologs --local-only -i isi_hw_status -f /ifs/data/Isilon_Support/$(date +%m%d%y)
# isi_gather_info -f "/var/crash/*.core.gz"
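The restart order in step 5 (netlogon, lsass, srvsvc, lwio, stopping as soon as access is restored) can be sketched as a small loop. This is only an illustration, not an Isilon-supplied tool: RUNNER, restart_in_order and ask_operator are made-up names, and RUNNER defaults to echo so that the sketch prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: RUNNER, restart_in_order and ask_operator are illustrative
# names, not OneFS features. RUNNER defaults to "echo" (dry run, commands are
# only printed); set RUNNER=isi_for_array on a cluster to send the signals.
RUNNER="${RUNNER:-echo}"

# restart_in_order CHECK_CMD
# Sends "killall -6" to each service in the order given in step 5 and stops
# as soon as CHECK_CMD exits 0 (meaning access is restored).
restart_in_order() {
    check_cmd="$1"
    for svc in netlogon lsass srvsvc lwio; do
        "$RUNNER" "killall -6 $svc"
        if $check_cmd; then
            echo "Access restored after restarting $svc; stopping."
            return 0
        fi
    done
    echo "Access still failing after all four restarts."
    return 1
}

# ask_operator
# Manual check: the operator tests SMB access by hand and answers y or n.
ask_operator() {
    printf 'Test SMB access now. Restored? [y/N] '
    read -r answer
    [ "$answer" = "y" ]
}
```

To use it interactively, run: restart_in_order ask_operator. Because lwio comes last in the list, the disruptive restart is only reached when the first three did not restore access.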
## consolidated commands:
isi_for_array -n1 "pgrep -l \"lwio|lsass|srvsvc|netlogon|lwreg|lwsmd\" | awk '{system(\"gcore -s -c /ifs/data/Isilon_Support/\$(date +%m%d%y)/\`hostname\`.\$(date +%m%d%y_%H%M.%S).\"\$2\".core \"\$1)}'"
isi_for_array 'killall -6 netlogon'; isi_for_array 'killall -6 lsass';isi_for_array 'killall -6 srvsvc'; isi_for_array 'killall -6 lwio'
isi_gather_info --nohttp --noemail --upload --ftp --ftp-host=ftp.isilon.com --ftp-user=anonymous\@ftp.isilon.com --ftp-pass=youremailaddress --ftp-path=/incoming --save
killall -6 netlogon; killall -6 lsass; killall -6 srvsvc
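Every collection command in this procedure recomputes the dated directory with $(date +%m%d%y), so a run that crosses midnight would split its output across two directories. A minimal sketch of computing the path once is shown here; SUPPORT_BASE and support_dir are illustrative names, not OneFS settings.

```shell
#!/bin/sh
# Sketch only: SUPPORT_BASE and support_dir are illustrative names.
# SUPPORT_BASE defaults to the base path used throughout this procedure.
SUPPORT_BASE="${SUPPORT_BASE:-/ifs/data/Isilon_Support}"

# support_dir [STAMP]
# Prints the collection directory for STAMP. With no argument, STAMP is
# today's date in the same mmddyy format the procedure uses.
support_dir() {
    stamp="${1:-$(date +%m%d%y)}"
    echo "$SUPPORT_BASE/$stamp"
}
```

For example, DIR="$(support_dir)" followed by mkdir -p "$DIR" creates the directory once, and later commands can reuse "$DIR" instead of calling date again.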