[olug] NEED to kill a process....

Tue May 22 17:29:41 UTC 2012

On 05/22/2012 12:07 PM, Christopher Cashell wrote:
> Run the following:
>
> $ ps auxww | egrep 'COMMAND|fdisk'
root      4816  0.0  0.0   3896   516 ?        D    07:53   0:00 fdisk /dev/mapper/....
>
>
> Did anything happen with the SAN while the fdisk was happening?  Did the
> connection fail, or the SAN control card/module go offline or reboot or
> failover?
Nope. It's a new pool on the SAN, and I was attempting to present it to the hosts... (ineffectively)
>
> Depending on exactly where the command was and where things failed, you
> might not be able to cleanly recover.  OS kernels don't handle it well when
> something that should be reliable isn't.  You can try unmounting all
> volumes from the SAN, if any are mounted.  You can try unloading the driver
> module for the SAN access.  If it isn't completely broken, you can try
> severing the connectivity between the server and the SAN.  If there are SAN
> LUNs that were visible and now aren't, you can try rescanning the SCSI bus.
>
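For reference, a minimal sketch of the SCSI bus rescan mentioned above,
assuming the standard sysfs "scan" attribute under /sys/class/scsi_host.
The function name and the SCSI_HOST_DIR override are hypothetical and only
there for illustration. It prints what it would do unless "run" is passed,
and a real run needs root.

```shell
# Rescan every SCSI host adapter for new or returned LUNs.
# Default is a dry run; pass "run" (as root) to actually trigger the scan.
rescan_scsi_hosts() {
    mode=${1:-dry-run}
    # SCSI_HOST_DIR is a hypothetical override, handy for trying this safely.
    base=${SCSI_HOST_DIR:-/sys/class/scsi_host}
    for scan in "$base"/host*/scan; do
        # Skip the unexpanded glob when no host adapters are present.
        [ -e "$scan" ] || continue
        if [ "$mode" = run ]; then
            # "- - -" is the wildcard for channel, target and LUN.
            echo "- - -" > "$scan"
        else
            echo "would run: echo '- - -' > $scan"
        fi
    done
}

rescan_scsi_hosts
```

On a box that already has a process stuck on the device, the rescan itself
may block, so it is probably worth running from a shell you can afford to
lose.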
> Worst case, you'll have to reboot.  Which may require a press on the reset
> button or power cycle from a Remote Access Card.  A hung process that can't
> be killed will tend to cause a clean shutdown to hang.  If that's the case,
> you can manually sync/unmount as many filesystems as possible to make the
> shutdown as clean as possible.
>
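A rough sketch of the manual sync/unmount step described above. The helper
names, the DO_IT switch, and the /data and /backup mount points are all
placeholders, not anything from this system. It only prints the commands
unless DO_IT=yes is set.

```shell
# Run a command for real only when DO_IT=yes; otherwise just print it.
run_or_echo() {
    if [ "${DO_IT:-no}" = yes ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Flush dirty data, then try to unmount each mount point given.
pre_reboot_cleanup() {
    run_or_echo sync
    for mnt in "$@"; do
        # A failed unmount (busy or hung device) is reported, not fatal.
        run_or_echo umount "$mnt" || echo "could not unmount $mnt" >&2
    done
}

# /data and /backup are placeholder mount points.
pre_reboot_cleanup /data /backup
```

Note that sync may itself block if there is dirty data waiting on the
unresponsive device, so unmounting the healthy filesystems first may be the
safer order in that situation.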
> Note, a process hung in IO Wait like this (ps status of 'D') generally
> won't affect the running system outside of itself, although additional
> processes trying to access the non-responsive or problematic device/disk
> may also hang in the same way.  I've had machines with a horked up process
> like this that I've left for weeks until the box could be rebooted without
> impact.
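To see whether other processes have piled up behind the same device, a
small sketch that lists everything in uninterruptible sleep along with the
kernel function it is waiting in. The function name is made up for
illustration, and the column layout assumes the procps version of ps.

```shell
# Keep the header line plus any row whose STAT (third column) starts with D.
filter_d_state() {
    awk 'NR == 1 || $3 ~ /^D/'
}

# wchan shows where in the kernel each process is sleeping.
ps -eo pid,ppid,stat,wchan:32,args | filter_d_state
```

With "ps auxww" output the STAT field is the eighth column rather than the
third, so the awk test would need adjusting.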
<snip>
>
> Final note, the older the kernel, the less gracefully it handled stuff like
> this (there was a longstanding issue with NFS mounts that would lead to
> hung processes in an IO Wait state that couldn't be killed if the NFS share
> dropped).  Recent kernels (last 4-5 years) have improved significantly.
Currently on RHEL 5.7.
> Also, take all of this with a bit of caution.  I'm not familiar with your
> environment or system specifics, so the above suggestions and information
> may not apply 100%.
Oddly, it's clustered w/ another box that accesses the SAN, and LOAD on
both is nearing 100 98 94, although performance hasn't "pooped the boat"
yet....

BTW, my kill attempts included killing the calling bash shell, so now the
processes are owned by "1".  :-/

reboot it may be, but it's not gonna be a happy time....

Thanks for the guidance!! OLUG is always the best, and would ALWAYS be
first to reply if I were so silly as to try multiple avenues. Been there,
done that.

Noel

-- 
#######################################################
#  Noel Leistad, CISSP                                #
#  noel at metc.net                                      #
#######################################################

A hypothetical paradox:
	What would happen in a battle between an Enterprise security team,
	who always get killed soon after appearing, and a squad of Imperial
	Stormtroopers, who can't hit the broad side of a planet?
		-- Tom Galloway