gui92 n00b
Joined: 23 Nov 2005 Posts: 8 Location: France
|
Posted: Tue Sep 05, 2006 12:43 pm Post subject: Gentoo server random crashes |
|
|
Hello
I run a Gentoo box as a busy web server (dual Xeon 2.6 HT, 6GB RAM, 72GB RAID 5 SCSI Adaptec 2000S).
Kernel 2.6.17-r6 with the latest glibc and gcc 4.1.1.
I'm facing random, severe crashes.
Sometimes, after a few hours or a few days, the services stop responding: no HTTP, no FTP, no SSH (stuck at the password prompt). Only ping continues to respond.
I managed to open a top and a dstat console before a crash happened. During the failure the two processes kept responding and showed this:
top - 14:13:12 up 1:47, 0 users, load average: 920.84, 902.29, 399.07
Tasks: 278 total, 1 running, 268 sleeping, 0 stopped, 9 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 24.8%id, 74.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 6232460k total, 2428772k used, 3803688k free, 51772k buffers
Swap: 4008208k total, 0k used, 4008208k free, 718260k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10168 root 16 0 2244 1236 836 R 0 0.0 0:08.96 top
1 root 16 0 1480 520 452 S 0 0.0 0:01.82 init
2 root RT 0 0 0 0 S 0 0.0 0:00.06 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.02 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.10 migration/1
5 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/1
6 root RT 0 0 0 0 S 0 0.0 0:00.10 migration/2
7 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/2
8 root RT 0 0 0 0 S 0 0.0 0:00.04 migration/3
9 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/3
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/0
11 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/1
12 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/2
13 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/3
14 root 10 -5 0 0 0 S 0 0.0 0:00.03 khelper
15 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
20 root 10 -5 0 0 0 S 0 0.0 0:00.11 kblockd/0
21 root 10 -5 0 0 0 S 0 0.0 0:00.03 kblockd/1
22 root 10 -5 0 0 0 S 0 0.0 0:00.02 kblockd/2
23 root 10 -5 0 0 0 S 0 0.0 0:00.04 kblockd/3
24 root 10 -5 0 0 0 S 0 0.0 0:00.00 kseriod
27 root 10 -5 0 0 0 S 0 0.0 0:00.00 khubd
And dstat:
---procs--- ------memory-usage----- ---paging-- -disk/total ---system-- ----total-cpu-usage----
run blk new|_used _buff _cach _free|__in_ _out_|_read write|_int_ _csw_|usr sys idl wai hiq siq
0 5 0|1618M 51M 701M 3717M| 0 0 | 0 0 | 317 27 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3717M| 0 0 | 0 0 | 308 13 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3717M| 0 0 | 0 0 | 319 21 | 0 0 25 75 0 0
0 5 6|1618M 51M 701M 3716M| 0 0 | 0 0 | 386 112 | 1 0 24 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 316 21 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 316 15 | 0 0 25 75 0 0
0 5 2|1618M 51M 701M 3716M| 0 0 | 0 0 | 334 39 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 325 25 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 319 33 | 0 0 25 75 0 0
After a hard reboot, I can see that the Apache log shows random errors like:
[Mon Sep 04 18:02:24 2006] [notice] child pid 2724 exit signal Segmentation fault (11)
*** glibc detected *** /usr/sbin/apache2: double free or corruption (out): 0xa7861a98 ***
[Tue Sep 05 06:38:04 2006] [notice] child pid 30660 exit signal Segmentation fault (11)
[Tue Sep 05 10:45:45 2006] [notice] child pid 3364 exit signal Segmentation fault (11)
[Tue Sep 05 11:59:00 2006] [notice] child pid 3916 exit signal Bus error (7)
Do you have any idea what is happening?
I could understand an apache2 failure caused by overloading, but why does the whole server crash?
Please excuse my poor English.
Thanks for your help. _________________ --
Guillaume |
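A load average above 900 with a single running task and about 75% I/O wait suggests that most tasks are stuck in uninterruptible sleep (state D), typically waiting on the disk. A minimal sketch for listing such tasks, assuming a procps-style `ps`; the helper name `blocked_tasks` is made up for illustration and the demo data is invented, not taken from the post:

```shell
#!/bin/sh
# Print the tasks that are in uninterruptible sleep (state D), which usually
# means they are blocked on disk or other I/O.
# Reads "STATE PID COMMAND" lines on stdin and prints "PID COMMAND".
# blocked_tasks is a hypothetical helper name, not a standard tool.
blocked_tasks() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Demo on invented sample data:
printf '%s\n' 'S 1 init' 'D 2724 apache2' 'R 10168 top' 'D 3364 apache2' | blocked_tasks
```

On a live system the input could come from `ps -eo stat=,pid=,comm= | blocked_tasks`. If nearly everything shows up in state D, the storage path (controller, driver, disks) is a more likely suspect than Apache itself.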
|
|
|
Kruegi Guru
Joined: 09 Feb 2005 Posts: 406 Location: Clausthal-Zellerfeld; DE
|
Posted: Tue Sep 05, 2006 1:28 pm Post subject: |
|
|
Could be a hardware error.
First, run a complete disk check (fsck) and a memory check (-> http://www.memtest.org).
Thomas |
|
|
|
Janne Pikkarainen Veteran
Joined: 29 Jul 2003 Posts: 1143 Location: Helsinki, Finland
|
Posted: Tue Sep 05, 2006 2:03 pm Post subject: |
|
|
I think there are two options (based on Apache errors): either you compiled the system with some über wicked CFLAGS or there is a hardware problem. I suspect the latter. _________________ Yes, I'm the man. Now it's your turn to decide if I meant "Yes, I'm the male." or "Yes, I am the Unix Manual Page.". |
|
|
|
gui92 n00b
Joined: 23 Nov 2005 Posts: 8 Location: France
|
Posted: Tue Sep 05, 2006 3:08 pm Post subject: |
|
|
Janne Pikkarainen wrote: | I think there are two options (based on Apache errors): either you compiled the system with some über wicked CFLAGS or there is a hardware problem. I suspect the latter. |
I think the CFLAGS are very standard:
CFLAGS="-O2 -march=pentium4 -pipe"
CHOST="i686-pc-linux-gnu"
I ran all the hardware tests, without errors... _________________ --
Guillaume |
|
|
|
Janne Pikkarainen Veteran
Joined: 29 Jul 2003 Posts: 1143 Location: Helsinki, Finland
|
Posted: Tue Sep 05, 2006 3:13 pm Post subject: |
|
|
OK. So it starts to sound like a hardware error. Over the years I've seen all kinds of odd errors that shouldn't exist in a perfect world: for example, one brand-new server was installed with a CPU that was originally meant for a server operating at a different front-side bus speed than ours (400 MHz vs 533 MHz or so).
As a result the server booted, and even ran its hardware tests OK. But problems started during the Gentoo installation (a nice stress test, by the way): over a couple of tries the symptoms varied from random compilation errors to total hangs. Ever since we replaced the CPU, the server has been trouble-free.
Apache shouldn't throw messages like "Bus error" if the hardware is OK. I suspect the CPU or the memory. _________________ Yes, I'm the man. Now it's your turn to decide if I meant "Yes, I'm the male." or "Yes, I am the Unix Manual Page.". |
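To see which signals are killing the Apache children, and how often, the error log can be tallied. A minimal sketch, assuming log lines in the format quoted in the first post; `count_exit_signals` is a hypothetical helper name:

```shell
#!/bin/sh
# Tally Apache child exits by signal.
# Reads error_log lines on stdin and prints "COUNT SIGNAL" lines,
# most frequent first. count_exit_signals is a hypothetical helper name.
count_exit_signals() {
    awk '/child pid [0-9]+ exit signal/ {
             sub(/.*exit signal /, "")   # keep only e.g. "Bus error (7)"
             count[$0]++
         }
         END { for (s in count) print count[s], s }' | sort -rn
}

# Demo with the log lines quoted in the first post:
count_exit_signals <<'EOF'
[Mon Sep 04 18:02:24 2006] [notice] child pid 2724 exit signal Segmentation fault (11)
*** glibc detected *** /usr/sbin/apache2: double free or corruption (out): 0xa7861a98 ***
[Tue Sep 05 06:38:04 2006] [notice] child pid 30660 exit signal Segmentation fault (11)
[Tue Sep 05 10:45:45 2006] [notice] child pid 3364 exit signal Segmentation fault (11)
[Tue Sep 05 11:59:00 2006] [notice] child pid 3916 exit signal Bus error (7)
EOF
```

For the quoted lines this reports three segmentation faults and one bus error. On a real system the input would be the Apache error log; its path (for example `/var/log/apache2/error_log`) is an assumption and depends on the configuration.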
|
|
|
gui92 n00b
Joined: 23 Nov 2005 Posts: 8 Location: France
|
Posted: Tue Sep 05, 2006 5:42 pm Post subject: |
|
|
Janne Pikkarainen wrote: | OK. So it starts to sound like a hardware error. Over the years I've seen all kinds of odd errors that shouldn't exist in a perfect world: for example, one brand-new server was installed with a CPU that was originally meant for a server operating at a different front-side bus speed than ours (400 MHz vs 533 MHz or so).
As a result the server booted, and even ran its hardware tests OK. But problems started during the Gentoo installation (a nice stress test, by the way): over a couple of tries the symptoms varied from random compilation errors to total hangs. Ever since we replaced the CPU, the server has been trouble-free.
Apache shouldn't throw messages like "Bus error" if the hardware is OK. I suspect the CPU or the memory. |
OK, thanks.
This kind of hardware failure will not be easy to detect and solve :-/
I will try lighttpd to see what happens with it. _________________ --
Guillaume |
|
|
|
Ast0r Guru
Joined: 11 Apr 2006 Posts: 404 Location: Dallas, Tx - USA
|
Posted: Tue Sep 05, 2006 10:42 pm Post subject: |
|
|
You aren't, by chance, overclocking that CPU are you?
Also, running memtest86 would be a good idea. |
|
|
|
gui92 n00b
Joined: 23 Nov 2005 Posts: 8 Location: France
|
Posted: Wed Sep 06, 2006 7:57 am Post subject: |
|
|
Ast0r wrote: | You aren't, by chance, overclocking that CPU are you?
Also, running memtest86 would be a good idea. |
No, this is a three-year-old stock production server.
It was quite stable during its first year, and then we began to have monthly, then weekly, random crashes.
I thought it was due to overloading.
But since last week we have been facing hourly crashes, and now, even without load (only named and postfix; I stopped apache and mysql), the server crashes after only half an hour.
I know the server's chipset (Intel) has memory limitations (it only accepts a few types of memory), but why this sudden instability after many months of (relative) stability??? _________________ --
Guillaume |
|
|
|
r4d1x Apprentice
Joined: 25 Nov 2003 Posts: 157 Location: Japan
|
Posted: Wed Sep 06, 2006 8:03 am Post subject: |
|
|
Sounds like hardware :/ The best part is, you get to go on a hunt to find out what's going bad! *cheers* _________________ Gentoo Linux 2.6.19.2-grsec
Dual Athlon-MP 1900
1024Mb PC2100 DDR
Radeon 9600 pro
1TB File Server / FTP |
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 12:42 am Post subject: |
|
|
I have a similar problem.
My server crashes after 10-12 days...
There is nothing in the logs and the hardware is new. I changed the memory to be sure, but it still crashes...
HELP me please...
I have an AMD Sempron 2800+ (64-bit)
Motherboard: Asus K8S-MX
512 MB Corsair memory |
|
|
|
Cinquero Apprentice
Joined: 24 Jun 2004 Posts: 249
|
Posted: Thu Sep 14, 2006 12:57 am Post subject: |
|
|
Can you determine when exactly the instabilities started to occur? If it is software-related, try to determine what relevant software you have changed just before that started (kernel? gcc? toolchain upgrade?).
Have you run memtest86? Most common problems are probably memory timing/error problems...
If there is a problem with the CPU/memory, do a stress test with one of the scripts listed at:
https://stier.dynu.com/~moinmoin/MarksWiki/LinuxKernel/KernelTests
Check the power supply. |
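The idea behind a build-based stress test is that the same job, repeated on healthy hardware, should succeed every time; failures that show up at random points hint at hardware. A minimal sketch of such a loop, which is not one of the scripts from the wiki page linked above; `repeat_until_fail` is a hypothetical helper name:

```shell
#!/bin/sh
# Run a command up to N times and stop at the first failure.
# Usage: repeat_until_fail N command [args...]
# repeat_until_fail is a hypothetical helper name.
repeat_until_fail() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! "$@"; then
            echo "run $i failed" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $runs runs passed"
}

# Harmless demo; replace "true" with the real workload.
repeat_until_fail 3 true
```

Assuming a configured kernel tree in `/usr/src/linux`, a kernel-build workload might look like `repeat_until_fail 20 sh -c 'make -C /usr/src/linux clean && make -C /usr/src/linux'`.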
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 1:18 am Post subject: |
|
|
Cinquero wrote: | Can you determine when exactly the instabilities started to occur? If it is software-related, try to determine what relevant software you have changed just before that started (kernel? gcc? toolchain upgrade?).
Have you run memtest86? Most common problems are probably memory timing/error problems...
If there is a problem with the CPU/memory, do a stress test with one of the scripts listed at:
https://stier.dynu.com/~moinmoin/MarksWiki/LinuxKernel/KernelTests
Check the power supply. |
Hi,
I changed the power supply and the memory, and I ran memtest86; all of those tests were fine. The computer is new... the instabilities started at the beginning of the PC's life. I installed Gentoo 2005.1 and had some problems with the SATA drive. I finally found the driver, but it kept crashing... The server is running apache2/pure-ftpd/ssh/mysql.
I'm from Quebec City in Canada... please excuse my poor English.
Thanks a lot
Joel |
|
|
|
Cinquero Apprentice
Joined: 24 Jun 2004 Posts: 249
|
Posted: Thu Sep 14, 2006 1:25 am Post subject: |
|
|
Run the kernel build stress test.
Is the crash time related to the room temperature? Or to the server load? |
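One way to answer the load question after the fact is to keep a small log that survives a hard reset. A minimal sketch that appends a timestamped load-average sample and flushes it to disk; `log_load_sample`, its arguments and the log path in the usage note are hypothetical names chosen for illustration:

```shell
#!/bin/sh
# Append one timestamped load-average sample to a log file and flush it to
# disk, so the last samples are still there after a hard reset.
# Usage: log_load_sample LOGFILE [SOURCE_FILE]
# SOURCE_FILE defaults to /proc/loadavg (Linux).
# log_load_sample is a hypothetical helper name.
log_load_sample() {
    logfile=$1
    source_file=${2:-/proc/loadavg}
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(cat "$source_file")" >> "$logfile"
    sync
}
```

It could be run as `while sleep 60; do log_load_sample /var/log/loadwatch.log; done`. A temperature reading could be appended in the same way if a sensor tool is installed, which is an assumption about the system. Note that `sync` may itself block if the disk is the part that hangs.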
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 1:28 am Post subject: |
|
|
Cinquero wrote: |
Run the kernel build stress test.
Is the crash time related to the room temperature? Or to the server load? |
I will run the kernel build stress test. I added a fan for the temperature, and the server load is not high when it crashes. |
|
|
|
Cinquero Apprentice
Joined: 24 Jun 2004 Posts: 249
|
Posted: Thu Sep 14, 2006 1:31 am Post subject: |
|
|
Hmmm... are you running an X server? If yes, switch to the VESA driver and/or prevent it from starting at all. There are some notoriously unstable graphics chips around...
Which gcc version do you use?
Do the fans on the graphics card still work? |
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 1:37 am Post subject: |
|
|
Cinquero wrote: | Hmmm... are you running an X server? If yes, switch to the VESA driver and/or prevent it from starting at all. There are some notoriously unstable graphics chips around... |
No, the X server is not running.
I read some posts about similar problems, and people are talking about DMA and SATA driver problems. When I'm copying a big file through the FTP server on the LAN, the server sometimes crashes... in the "top" output the "wa" value for the CPU goes to 100%...
Maybe this could help you: I have kernel 2.6.16-r9.
The SATA driver is the one from the kernel (SCSI low-level driver, SiS 964/180 SATA).
Thanks |
|
|
|
Cinquero Apprentice
Joined: 24 Jun 2004 Posts: 249
|
Posted: Thu Sep 14, 2006 1:39 am Post subject: |
|
|
Well, then try
hdparm -d0 /dev/sda
or so to disable DMA access. It won't give you an insane speed, but you will be able to check if it is related to DMA disk transfers.
You could also try bugzilla.kernel.org to see if the bug has been fixed in more recent kernel versions. |
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 1:43 am Post subject: |
|
|
Cinquero wrote: | Well, then try
hdparm -d0 /dev/sda
or so to disable DMA access. It won't give you an insane speed, but you will be able to check if it is related to DMA disk transfers.
You could also try bugzilla.kernel.org to see if the bug has been fixed in more recent kernel versions. |
hdparm -d0 /dev/sda gives me:
Quote: | jdhosts ~ # hdparm -d0 /dev/sda
/dev/sda:
setting using_dma to 0 (off)
HDIO_SET_DMA failed: Inappropriate ioctl for device |
Is that OK, or do I need to wait and see if it still crashes?
Thanks |
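The `HDIO_SET_DMA failed: Inappropriate ioctl for device` line means the driver behind /dev/sda rejected the request, so this command did not change the DMA setting. A minimal sketch that checks captured hdparm output for that failure; `dma_change_applied` is a hypothetical helper name:

```shell
#!/bin/sh
# Decide from captured `hdparm -d0` output whether the DMA change was
# rejected. Reads the output on stdin; returns 1 if the ioctl failed.
# dma_change_applied is a hypothetical helper name.
dma_change_applied() {
    if grep -q 'HDIO_SET_DMA failed'; then
        echo "DMA setting NOT changed (ioctl rejected by the driver)"
        return 1
    fi
    echo "no HDIO_SET_DMA failure reported"
}

# Demo with the output quoted in the post above:
dma_change_applied <<'EOF' || true
/dev/sda:
setting using_dma to 0 (off)
HDIO_SET_DMA failed: Inappropriate ioctl for device
EOF
```

On a live system the call would be `hdparm -d0 /dev/sda 2>&1 | dma_change_applied`; the `2>&1` matters because the failure message is written to standard error.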
|
|
|
Cinquero Apprentice
Joined: 24 Jun 2004 Posts: 249
|
Posted: Thu Sep 14, 2006 1:45 am Post subject: |
|
|
hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS? |
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 1:59 am Post subject: |
|
|
Cinquero wrote: |
use
hdparm /dev/sda
to see if using_dma is disabled. I'm not using SATA, so I cannot really be of any help here. Do you get any warnings/errors in /var/log/messages? (or when entering "dmesg"?) |
I just tried to transfer an 800 MB file with the FTP server. I started "top" over SSH, and the "wa" value in the CPU section went to 89%, and then the server crashed...
hdparm /dev/sda gives me:
Quote: |
/dev/sda:
IO_support = 0 (default 16-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 19457/255/63, sectors = 160041885696, start = 0
|
I don't see any warnings or errors in /var/log/messages or dmesg... |
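Reading the whole kernel log by eye is easy to get wrong, so a filter can help. A minimal sketch; the pattern list is an assumption chosen for illustration, not an authoritative or complete set, and `suspicious_kernel_lines` is a hypothetical helper name:

```shell
#!/bin/sh
# Filter kernel log lines that often accompany disk or memory trouble.
# Reads log text on stdin; exits non-zero if nothing matches.
# The patterns are illustrative assumptions, not an exhaustive list.
# suspicious_kernel_lines is a hypothetical helper name.
suspicious_kernel_lines() {
    grep -i -E 'i/o error|timeout|timed out|machine check|oops|abort|reset'
}

# Demo on invented sample lines:
printf '%s\n' 'eth0: link up' 'sda: command timed out' | suspicious_kernel_lines
```

Typical inputs would be `dmesg | suspicious_kernel_lines` or `suspicious_kernel_lines < /var/log/messages`. An empty result does not clear the disk path: if the disk itself hangs, the system may be unable to write its own errors to the log.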
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 2:00 am Post subject: |
|
|
Cinquero wrote: |
hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS? |
Ok I will look in the bios |
|
|
|
Joel D. n00b
Joined: 14 Sep 2006 Posts: 13
|
Posted: Thu Sep 14, 2006 2:32 am Post subject: |
|
|
Joel D. wrote: | Cinquero wrote: |
hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS? |
Ok I will look in the bios |
Nothing in the BIOS about SATA and DMA. I installed sdparm and I'm now looking at it.
I really think the problem is the access to the hard drive... I really don't know what I need to do to fix it.
I tried again to copy an 800 MB file to the server via FTP, and again the "wa" went to 100%, the load average became very high, and then everything crashed... |
|
|
|
gui92 n00b
Joined: 23 Nov 2005 Posts: 8 Location: France
|
Posted: Thu Sep 14, 2006 6:47 am Post subject: |
|
|
r4d1x wrote: | sounds like hardware :/ . The best part is, you get to go on a hunt to find out whats going bad! *cheers* |
As for my problem: it was caused by the Adaptec RAID ZCR card. A chipset clip broke and hit the card, destroying a small chip.
It's SOLVED for me, thanks all. _________________ --
Guillaume |
|
|
|
Cinquero Apprentice
Joined: 24 Jun 2004 Posts: 249
|
Posted: Thu Sep 14, 2006 11:13 am Post subject: |
|
|
You could disable DMA in libata.... as far as I have read elsewhere. But that sure ain't easy if you don't know C. Or try "ide=nodma" first.
Check if "local APIC" in the kernel config is disabled. I remember that that option caused problems on some systems. |
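Checking a kernel option by hand means searching the `.config` file of the running kernel. A minimal sketch; `config_option_state` is a hypothetical helper name, and the option names in the usage note are my assumption of what "local APIC" maps to in this kernel series:

```shell
#!/bin/sh
# Report how an option is set in a kernel .config file.
# Prints y, m, n (for "is not set") or absent.
# Usage: config_option_state CONFIG_FILE OPTION
# config_option_state is a hypothetical helper name.
config_option_state() {
    config_file=$1
    option=$2
    if grep -q "^${option}=y" "$config_file"; then
        echo y
    elif grep -q "^${option}=m" "$config_file"; then
        echo m
    elif grep -q "^# ${option} is not set" "$config_file"; then
        echo n
    else
        echo absent
    fi
}
```

For example, `config_option_state /usr/src/linux/.config CONFIG_X86_LOCAL_APIC`, or `CONFIG_X86_UP_APIC` on a uniprocessor kernel; both the path and the option names should be verified against the actual kernel source tree.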
|
|
|
Cinquero Apprentice
Joined: 24 Jun 2004 Posts: 249
|
Posted: Thu Sep 14, 2006 11:26 am Post subject: |
|
|
You always copy data from the network. Did you try copying data locally? |
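A local copy takes the network card and the FTP server out of the picture, so only the disk path is exercised. A minimal sketch of such a test; `local_copy_test` and the file names it uses are hypothetical:

```shell
#!/bin/sh
# Write a test file of SIZE_MB megabytes into DIR, copy it within DIR,
# compare the two files and clean up.
# Usage: local_copy_test DIR SIZE_MB
# local_copy_test is a hypothetical helper name.
local_copy_test() {
    dir=$1
    size_mb=$2
    src="$dir/copytest.src"
    dst="$dir/copytest.dst"
    dd if=/dev/zero of="$src" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    cp "$src" "$dst" || return 1
    if cmp -s "$src" "$dst"; then
        echo "local copy of ${size_mb} MB OK"
        result=0
    else
        echo "local copy of ${size_mb} MB MISMATCH"
        result=1
    fi
    rm -f "$src" "$dst"
    return "$result"
}
```

To mirror the 800 MB FTP transfer, something like `local_copy_test /home 800` could be run while watching "wa" in top; the target directory is an assumption. With only 512 MB of RAM the page cache cannot hold the whole file, so most of the data should actually reach the disk.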
|
|
|
|