more RAM + faster disk -> slower box?!

Wed Nov 28 22:03:58 GMT 2001

On Thursday 29 November 2001 12:34 pm, Ian Pallfreeman wrote:

> RAID, which, frankly, sucks, and has been the cause of much sorrow. The
> replacement disk is a SCSI 160MB/s, replacing an 80 -- same size, same
> geometry. And I took the opportunity to increase the memory from 256MB to
> 768MB, in the hope that this might further compensate for the bandwidth
> problem in the IDE RAID.

But not the same speed. I see from your attached dmesg that you appear to 
have one UDMA drive which you mount the root partition from, and two SCSI 
disks at different speeds, on an Adaptec that will do 160MB/sec. I think 
mixing two disks of different speeds is going to confuse the RAID 
controller, and possibly BSD. More importantly, I think the faster drive is 
going to physically thrash whilst waiting for the slower disk to catch up.

A good friend told me about one RAID array he saw with a load of 7200rpm 
disks and one 5600 (or whatever), and the ensuing fun. You could *hear* the 
disks clunking whilst waiting for the slow drive to catch up. It completely 
confuses RAID controllers, and can cause serious hardware damage. I would 
imagine that much insanity is going on inside this box with one drive 
faster than the other. Think 'Mr. Toad syndrome' - going fast and shouting 
'poop, poop!' whilst those around you prefer a slower pace of life can get 
you into trouble. :-)

> popular groups into it OK. Now it ain't catching up at all.

I would normally start looking at the network, but if this has only been the 
case since the memory and disk got bumped up, that is unlikely to be the 
cause. To me, this reeks of disk thrash. I have, though, been famously and 
dramatically proven wrong in the past, so DYOR before taking my advice.

> Looking at ``systat -vm'' tells me none of the disks, even the poxy IDE
> RAID where the articles live, is terribly busy (whereas I'd be seeing
> 80-100% before the "upgrade"). The vinum-mirrored history/overview volume
> is practically idle. The load average has gone up from 1-2 to 4-5, and
> ``top'' shows me far more processes in RUN state, and for far longer,
> than I'd expect:
>
> 55248 news      64   0  4060K  3592K RUN    112:47 80.03% 80.03% fastrm

Yeah, this to me sounds more like disk thrash. Although [vm|sys]stat will 
show a quiet disk, that's because they show the number of operations from an 
OS point of view. They won't show that the machine is waiting on a disk or 
set of disks, or that there is a speed mismatch in the box. The processes 
sitting in RUN for ages are particularly interesting - they're waiting for 
something. I'll give you three guesses what that might be... :-)

> I'm not used to seeing a ``fastrm'' burning CPU, even on old 50MHz Suns,
> and it's been running for hours longer than normal. A quick ``truss''
> shows me the expected calls to unlink(2), and nothing else.

Now I'm getting worried. This is starting to sound like serious SCSI and disk 
problems. One suggestion - switch it off now. Try and lay your hands on 
another 160MB/sec disk, or another 80MB/sec disk, and try with both disks at 
the same speed and see where you go. Alternatively, drop the hardware RAID 
and see if you can get it to work with vinum. I still think you'll have 
problems though.

> Does anybody have any suggestions, please? Obviously I could remove some of
> the RAM and see what happens, but that won't help me understand...

Go for it, but if taking out memory fixes the problem, then there will be a 
public ceremony for Manchester BSDers in the Lass O' Gowrie where I will eat 
my BSD horns and tail, date and time to be announced. :-)

> da1 at ahc1 bus 0 target 0 lun 0
> da1: <IBM DDYS-T18350N S96H> Fixed Direct Access SCSI-3 device
> da1: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
> da1: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
> da2 at ahc1 bus 0 target 1 lun 0
> da2: <IBM DNES-318350W SAH0> Fixed Direct Access SCSI-3 device
> da2: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged Queueing Enabled
> da2: 17501MB (35843670 512 byte sectors: 255H 63S/T 2231C)
> Mounting root from ufs:/dev/ad0s1a

If those two are in the same RAID set, I think I would have cause for concern.
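
For what it's worth, the comparison I did by eye on the dmesg above can be 
sketched in a few lines of Python. This is only an illustration: the function 
names are my own invention, and the regex assumes the "daN: <rate>MB/s 
transfers" line format seen in the quoted output. Note that it only compares 
the negotiated bus rates, which says nothing about what each drive can 
actually sustain.

```python
import re

# Assumed line format, taken from the quoted dmesg:
#   da1: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), ...
RATE_RE = re.compile(r"(da\d+): (\d+(?:\.\d+)?)MB/s transfers")


def negotiated_rates(dmesg_text):
    """Return {device: MB/s} for every da disk that reports a transfer rate."""
    return {dev: float(rate) for dev, rate in RATE_RE.findall(dmesg_text)}


def mismatched(rates):
    """True if the disks did not all negotiate the same bus rate."""
    return len(set(rates.values())) > 1


# Sample input copied from the two "transfers" lines in the quoted dmesg.
SAMPLE = """\
da1: 160.000MB/s transfers (80.000MHz, offset 63, 16bit), Tagged Queueing Enabled
da2: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged Queueing Enabled
"""

rates = negotiated_rates(SAMPLE)
print(rates, "MISMATCH" if mismatched(rates) else "ok")
```

On a live box you would feed it the real dmesg text instead of SAMPLE.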

-- 
Paul Robinson