[OLUG] Isolating flaky hardware problems

Dave Burchell burchell at inetnebr.com
Wed Feb 9 22:39:36 UTC 2000


I've got some hardware that may be flaky, and I need some advice on
narrowing down the problem.

In short, how do I isolate possible intermittent CPU or RAM
failures?

Here's why I'm asking.

One of my users has an NT box.  It died one day, and I decided its SCSI
card might be bad because it would try to boot from the SCSI disk but
wouldn't make it past a certain point in the NT boot sequence (where
I think it was trying to initialize the SCSI devices).  I got the BSOD
each time.

To test NT boxes I like to load up Linux.  Booting Linux (Debian 2.1
rescue floppy) went ok at first, but where it should have listed all
of the SCSI devices it hung.  It found the SCSI card but could not list
the devices on the SCSI bus.

Replacing the SCSI card with another allowed the machine to boot the
rescue floppy and install Linux (on a Jaz disk, because the SCSI HD
was, and still is, full of NTFS partitions).  I built a 2.2.14 kernel
with NTFS support, booted it, and backed up the NTFS partitions across
the LAN to a tape drive using ssh.

I should say I _tried_ to back it up, but before the backup was done
the machine hung with the error message

Kernel panic: VFS: LRU block list corrupted

I brought it up again, this time running a Perl script that computes
sines in a loop to stress the CPU some.  It hung again after a few
hours, this time with no error message.
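
For anyone who wants to try the same kind of CPU stress, here is a
minimal sketch of the idea.  It is not the Perl script mentioned above
(which isn't shown here); it is a Python stand-in, and the function
names and default values are made up for illustration.  It assumes that
a fixed floating-point calculation gives the same answer every time on
healthy hardware, so any mismatch against the first result is suspicious.

```python
import math
import time


def sine_checksum(n_terms: int) -> float:
    """Sum sin(i) for i in 0..n_terms-1; the result is deterministic."""
    total = 0.0
    for i in range(n_terms):
        total += math.sin(i)
    return total


def stress_cpu(duration_s: float, n_terms: int = 100_000) -> int:
    """Recompute the checksum for duration_s seconds.

    Returns how many runs disagreed with the first (reference) run.
    """
    reference = sine_checksum(n_terms)
    mismatches = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if sine_checksum(n_terms) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Hypothetical one-minute run; a flaky box may need hours.
    print("mismatches:", stress_cpu(60.0))
```

A result of zero proves little, since an intermittent fault may simply
not have struck during the run, and a hard hang will never get as far as
reporting a count.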

Finally last night it did the backup without crashing.

So, what's the problem?  I suspect, based on some traffic on the Linux
kernel mailing list, that it could be a SIMM or a CPU going out, perhaps
in part due to overheating.  What does everyone think of this theory?

Let's say we do think it may be RAM.  Can I boot Linux with a "mem=32M"
option to limit the memory that is used by Linux?  Or would I be better
off removing half the SIMMs and narrowing it down that way?

Is there any good software to run if I want to stress components of the
system to try to induce a failure?

How could I do a repetitive read/write test of all the RAM on the system?
Could I get the memory address of a bad piece of RAM?
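
To make the question concrete, this is roughly the kind of
write-then-read-back pattern test I have in mind, as a Python sketch.
The function name and the choice of patterns are just illustrative
assumptions.  Note the limits: a user-space program like this only
exercises whatever memory the OS hands the process, the CPU cache may
satisfy some of the reads, and the offsets it reports are positions
within its own buffer, not physical addresses.  It cannot cover all of
the RAM in the machine.

```python
def check_buffer(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each byte pattern and read it back.

    Returns a list of (offset, expected, actual) for every bad byte.
    Offsets are relative to the buffer, not physical addresses.
    """
    buf = bytearray(size_bytes)
    failures = []
    for pattern in patterns:
        # Write the pattern across the whole buffer.
        buf[:] = bytes([pattern]) * size_bytes
        # Fast check first: count how many bytes still hold the pattern.
        if buf.count(pattern) != size_bytes:
            # Slow path: locate each byte that read back wrong.
            for offset, actual in enumerate(buf):
                if actual != pattern:
                    failures.append((offset, pattern, actual))
    return failures


if __name__ == "__main__":
    # Hypothetical 16 MiB buffer; repeat in a loop for a longer soak.
    bad = check_buffer(16 * 1024 * 1024)
    print("bad bytes:", len(bad))
    for offset, expected, actual in bad[:10]:
        print(f"offset {offset:#x}: wrote {expected:#04x}, read {actual:#04x}")
```

The alternating patterns 0x55 and 0xAA are there so that every bit gets
written as both 0 and 1 with its neighbours holding the opposite value.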

I seem to remember doing this on a UnixWare 486 years back, but I
don't remember the utilities, and I don't think they ran under
UnixWare.

The machine in question here is an all-SCSI (on-board IDE disabled)
200 MHz Pentium with 128 MB RAM and video, sound, ethernet, and modem
cards.

(If I'm missing something obvious here I'm sure I'll be notified,
right? :-)

-- 
Dave Burchell                                          40.49'N, 96.41'W
Free your mind and your software will follow.              402-467-1619
http://incolor.inetnebr.com/burchell/                  burchell at acm.org     

