[olug] NEED to kill a process....

Tue May 22 17:07:35 UTC 2012

On Tue, May 22, 2012 at 11:12 AM, Noel Leistad <noel at metc.net> wrote:
> So, I've got an fdisk process attempting to work on a SAN device? Can't
> seem to get it killed??
>
> Possibly I started something I have little understanding of, or
> incorrect understanding of anyway.
>
> Anyone have any suggestions?

Run the following:

$ ps auxww | egrep 'COMMAND|fdisk'

You should get something that looks roughly like this:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2382  0.0  0.0   1740    16 ?        D    2011   2:06 /sbin/fdisk

What we're interested in is the STAT column.  If the line with fdisk
shows a 'D', you might be somewhat SOL.  A status of 'D' means
uninterruptible sleep, usually related to IO.  The process has made a
system call to read or write to an IO device (probably the SAN) and inside
the kernel, it's still waiting for that system call to complete.  Until
that system call completes, the KILL signal stays pending and is not acted
on, so the process can't be killed.  It's hung in the worst sort of way.
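
If you want to see every process stuck like this, not just fdisk, something
along these lines might help.  It's a rough sketch that assumes the Linux
procps version of ps (the wchan:32 width syntax is procps-specific), and
filter_d_state is just a name I made up for the helper.  The WCHAN column
shows where in the kernel the process is sleeping, which can hint at which
driver is stuck.

```shell
# Hypothetical helper: keep the header line plus any row whose STAT
# column (field 2) starts with D (uninterruptible sleep).
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Assumes procps ps on Linux; wchan is the kernel wait channel.
ps -eo pid,stat,wchan:32,args | filter_d_state
```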

Did anything happen with the SAN while the fdisk was happening?  Did the
connection fail, or the SAN control card/module go offline or reboot or
failover?

Depending on exactly where the command was and where things failed, you
might not be able to cleanly recover.  OS kernels don't handle it well when
something that should be reliable isn't.  You can try unmounting all
volumes from the SAN, if any are mounted.  You can try unloading the driver
module for the SAN access.  If it isn't completely broken, you can try
severing the connectivity between the server and the SAN.  If there are SAN
LUNs that were visible and now aren't, you can try rescanning the SCSI bus.

Worst case, you'll have to reboot, which may require a press of the reset
button or a power cycle from a Remote Access Card.  A hung process that
can't be killed will tend to cause a clean shutdown to hang.  If that's the
case, you can manually sync/unmount as many filesystems as possible to make
the shutdown as clean as possible.
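
As a sketch of the manual unmount step, the snippet below only prints the
umount commands for filesystems backed by a given device prefix, so you can
look them over before running anything.  The plan_unmounts name and the
/dev/mapper/san prefix are placeholders I made up; your SAN devices will be
named differently.

```shell
# Hypothetical helper: read a mount table (the /proc/mounts layout:
# device, mount point, type, ...) on stdin and print an umount command
# for each entry whose device starts with the given prefix.
plan_unmounts() {
    awk -v prefix="$1" 'index($1, prefix) == 1 { print "umount " $2 }'
}

# /dev/mapper/san is a placeholder; substitute your SAN device prefix.
if [ -r /proc/mounts ]; then
    plan_unmounts /dev/mapper/san < /proc/mounts
fi
```

Run sync before the unmounts, but keep in mind that sync may itself block
if it has to wait on the unresponsive device.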

Note, a process hung in IO Wait like this (ps status of 'D') generally
won't affect the running system outside of itself, although additional
processes trying to access the non-responsive or problematic device/disk
may also hang in the same way.  I've had machines with a horked up process
like this that I've left for weeks until the box could be rebooted without
impact.

If you don't see a ps status of 'D' in the result of the ps command,
there's a possibility you might see a 'T'.  I haven't seen this happen as
often, but I have seen a couple of times where a process was stopped (STOP
signal) and wasn't responding to signals (including KILL) until it was
woken back up.  You can try sending the CONT signal to the process and see
if that does the trick.
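
To get a feel for what the 'T' state and the CONT signal look like, here is
a small experiment on a throwaway sleep process rather than on the real
fdisk.  It's only a sketch, and the variable names are mine.  For the real
thing you would use the PID from the ps output above, e.g. kill -CONT 2382.

```shell
# Start a throwaway process to experiment on (stand-in for the hung one).
sleep 30 &
pid=$!

kill -STOP "$pid"                      # put it in the stopped state
sleep 1
stopped=$(ps -o stat= -p "$pid")       # STAT should now start with T

kill -CONT "$pid"                      # wake it back up
sleep 1
resumed=$(ps -o stat= -p "$pid")       # STAT should no longer start with T

echo "after STOP: $stopped, after CONT: $resumed"
kill "$pid"                            # clean up the test process
```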

Final note, the older the kernel, the less gracefully it handled stuff like
this (there was a longstanding issue with NFS mounts that would lead to
hung processes in an IO Wait state that couldn't be killed if the NFS share
dropped).  Recent kernels (last 4-5 years) have improved significantly.

Also, take all of this with a bit of caution.  I'm not familiar with your
environment or system specifics, so the above suggestions and information
may not apply 100%.

> have tried:

[Snip.]

> Noel

-- 
Christopher