Gentoo Forums
Gentoo server random crashes
gui92
n00b

Joined: 23 Nov 2005
Posts: 8
Location: France

Posted: Tue Sep 05, 2006 12:43 pm    Post subject: Gentoo server random crashes

Hello

I run a Gentoo box as a busy web server (dual Xeon 2.6 HT, 6GB RAM, 72GB RAID 5 SCSI Adaptec 2000S).
Kernel 2.6.17-r6 with the latest glibc and gcc 4.1.1.

I'm facing random hard crashes.
Sometimes, after a few hours or a few days, the services stop responding. No http, no ftp, no ssh (stuck at the password prompt); only ping keeps responding.

I managed to open a top and a dstat console before a crash happened, and during the failure the two processes kept responding and showed this:

top - 14:13:12 up 1:47, 0 users, load average: 920.84, 902.29, 399.07
Tasks: 278 total, 1 running, 268 sleeping, 0 stopped, 9 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 24.8%id, 74.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 6232460k total, 2428772k used, 3803688k free, 51772k buffers
Swap: 4008208k total, 0k used, 4008208k free, 718260k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10168 root 16 0 2244 1236 836 R 0 0.0 0:08.96 top
1 root 16 0 1480 520 452 S 0 0.0 0:01.82 init
2 root RT 0 0 0 0 S 0 0.0 0:00.06 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.02 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.10 migration/1
5 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/1
6 root RT 0 0 0 0 S 0 0.0 0:00.10 migration/2
7 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/2
8 root RT 0 0 0 0 S 0 0.0 0:00.04 migration/3
9 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/3
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/0
11 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/1
12 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/2
13 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/3
14 root 10 -5 0 0 0 S 0 0.0 0:00.03 khelper
15 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
20 root 10 -5 0 0 0 S 0 0.0 0:00.11 kblockd/0
21 root 10 -5 0 0 0 S 0 0.0 0:00.03 kblockd/1
22 root 10 -5 0 0 0 S 0 0.0 0:00.02 kblockd/2
23 root 10 -5 0 0 0 S 0 0.0 0:00.04 kblockd/3
24 root 10 -5 0 0 0 S 0 0.0 0:00.00 kseriod
27 root 10 -5 0 0 0 S 0 0.0 0:00.00 khubd

And

---procs--- ------memory-usage----- ---paging-- -disk/total ---system-- ----total-cpu-usage----
run blk new|_used _buff _cach _free|__in_ _out_|_read write|_int_ _csw_|usr sys idl wai hiq siq
0 5 0|1618M 51M 701M 3717M| 0 0 | 0 0 | 317 27 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3717M| 0 0 | 0 0 | 308 13 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3717M| 0 0 | 0 0 | 319 21 | 0 0 25 75 0 0
0 5 6|1618M 51M 701M 3716M| 0 0 | 0 0 | 386 112 | 1 0 24 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 316 21 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 316 15 | 0 0 25 75 0 0
0 5 2|1618M 51M 701M 3716M| 0 0 | 0 0 | 334 39 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 325 25 | 0 0 25 75 0 0
0 5 0|1618M 51M 701M 3716M| 0 0 | 0 0 | 319 33 | 0 0 25 75 0 0
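
Side note for readers with the same symptom: the top and dstat output above shows a huge load average while the CPUs are almost entirely idle or waiting on I/O, which usually points at processes blocked on storage rather than at CPU overload. A minimal Python sketch of that check, assuming the top summary format shown in this post; the function names and the 50% threshold are illustrative assumptions, not something from the thread:

```python
import re

def parse_top_summary(text):
    """Extract the 1-minute load average and the CPU percentages from top's header."""
    load = re.search(r"load average:\s*([\d.]+)", text)
    cpu = dict(
        (name, float(value))
        for value, name in re.findall(r"([\d.]+)%\s*(us|sy|ni|id|wa|hi|si|st)", text)
    )
    return (float(load.group(1)) if load else None), cpu

def looks_io_bound(text, wa_threshold=50.0):
    """True when most CPU time is I/O wait; the threshold is an arbitrary assumption."""
    _, cpu = parse_top_summary(text)
    return cpu.get("wa", 0.0) >= wa_threshold

# The two header lines are copied from the top output in this post.
sample = (
    "top - 14:13:12 up 1:47, 0 users, load average: 920.84, 902.29, 399.07\n"
    "Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 24.8%id, 74.9%wa, 0.0%hi, 0.1%si, 0.0%st\n"
)
load1, cpu = parse_top_summary(sample)
print(load1, cpu["wa"], looks_io_bound(sample))
```

A high "wa" together with a high load average only says that something is stuck waiting on I/O; it does not say which device or driver is at fault.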

After a hard reboot, I can see that the Apache log shows random errors like:
[Mon Sep 04 18:02:24 2006] [notice] child pid 2724 exit signal Segmentation fault (11)
*** glibc detected *** /usr/sbin/apache2: double free or corruption (out): 0xa7861a98 ***
[Tue Sep 05 06:38:04 2006] [notice] child pid 30660 exit signal Segmentation fault (11)
[Tue Sep 05 10:45:45 2006] [notice] child pid 3364 exit signal Segmentation fault (11)
[Tue Sep 05 11:59:00 2006] [notice] child pid 3916 exit signal Bus error (7)

Do you have any idea what is happening?
I can understand an apache2 failure due to overloading, but why does the whole server crash?

Please excuse my poor English.
Thanks for your help.
_________________
--
Guillaume

Kruegi
Guru

Joined: 09 Feb 2005
Posts: 406
Location: Clausthal-Zellerfeld; DE

Posted: Tue Sep 05, 2006 1:28 pm

Could be a hardware error.
First, run a complete disk check (fsck) and a memory check (-> http://www.memtest.org).

Thomas

Janne Pikkarainen
Veteran

Joined: 29 Jul 2003
Posts: 1143
Location: Helsinki, Finland

Posted: Tue Sep 05, 2006 2:03 pm

I think there are two options (based on Apache errors): either you compiled the system with some über wicked CFLAGS or there is a hardware problem. I suspect the latter.
_________________
Yes, I'm the man. Now it's your turn to decide if I meant "Yes, I'm the male." or "Yes, I am the Unix Manual Page.".

gui92
n00b

Joined: 23 Nov 2005
Posts: 8
Location: France

Posted: Tue Sep 05, 2006 3:08 pm

Janne Pikkarainen wrote:
I think there are two options (based on Apache errors): either you compiled the system with some über wicked CFLAGS or there is a hardware problem. I suspect the latter.


I think the CFLAGS are very standard:

CFLAGS="-O2 -march=pentium4 -pipe"
CHOST="i686-pc-linux-gnu"

I ran all the hardware tests, without errors...
_________________
--
Guillaume

Janne Pikkarainen
Veteran

Joined: 29 Jul 2003
Posts: 1143
Location: Helsinki, Finland

Posted: Tue Sep 05, 2006 3:13 pm

Ok. So it starts to sound like a hardware error. Over the years I've seen all kinds of odd errors that in a perfect world shouldn't exist: for example, one brand-new server was installed with a CPU that was originally meant for a server operating at a different front-side bus speed than ours (400 MHz vs 533 MHz or so).

As a result the server booted, and even ran its hardware tests ok. But problems started during Gentoo installation (a nice stress test, by the way :D): over a couple of tries the symptoms varied from random compilation errors to total hangups. Ever since we replaced the CPU, the server has been trouble-free.

Apache shouldn't throw messages like "Bus error" if the hardware is ok. I suspect the CPU or the memory.
_________________
Yes, I'm the man. Now it's your turn to decide if I meant "Yes, I'm the male." or "Yes, I am the Unix Manual Page.".

gui92
n00b

Joined: 23 Nov 2005
Posts: 8
Location: France

Posted: Tue Sep 05, 2006 5:42 pm

Janne Pikkarainen wrote:
Ok. So it starts to sound like a hardware error. Over the years I've seen all kinds of odd errors that in a perfect world shouldn't exist: for example, one brand-new server was installed with a CPU that was originally meant for a server operating at a different front-side bus speed than ours (400 MHz vs 533 MHz or so).

As a result the server booted, and even ran its hardware tests ok. But problems started during Gentoo installation (a nice stress test, by the way :D): over a couple of tries the symptoms varied from random compilation errors to total hangups. Ever since we replaced the CPU, the server has been trouble-free.

Apache shouldn't throw messages like "Bus error" if the hardware is ok. I suspect the CPU or the memory.


Ok, thanks.
This kind of hardware failure will not be easy to detect and solve :-/

I'll try lighttpd to see what happens with it.
_________________
--
Guillaume

Ast0r
Guru

Joined: 11 Apr 2006
Posts: 404
Location: Dallas, Tx - USA

Posted: Tue Sep 05, 2006 10:42 pm

You aren't, by chance, overclocking that CPU are you?

Also, running memtest86 would be a good idea.

gui92
n00b

Joined: 23 Nov 2005
Posts: 8
Location: France

Posted: Wed Sep 06, 2006 7:57 am

Ast0r wrote:
You aren't, by chance, overclocking that CPU are you?

Also, running memtest86 would be a good idea.


No, this is a 3-year-old stock production server.
It was quite stable during its first year, and then we began to have monthly, then weekly, random crashes.
I think it was due to overloading.
But since last week we have been facing hourly crashes, and now, without load (only named and postfix; I stopped apache and mysql), the server crashes after only half an hour.

I know the server's chipset (Intel) has memory limitations (it only accepts a few types of memory), but why this sudden instability after many months of (relative) stability???
_________________
--
Guillaume

r4d1x
Apprentice

Joined: 25 Nov 2003
Posts: 157
Location: Japan

Posted: Wed Sep 06, 2006 8:03 am

Sounds like hardware :/ The best part is, you get to go on a hunt to find out what's going bad! *cheers*
_________________
Gentoo Linux 2.6.19.2-grsec
Dual Athlon-MP 1900
1024Mb PC2100 DDR
Radeon 9600 pro
1TB File Server / FTP

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 12:42 am

I have a similar problem.

My server crashes after 10-12 days...

Nothing in the logs, and the hardware is new. I changed the memory to be sure, but it still crashes...

HELP me please...

I have an AMD Sempron 2800+ (64-bit)
Motherboard: Asus K8S-MX
512 MB Corsair memory

Cinquero
Apprentice

Joined: 24 Jun 2004
Posts: 249

Posted: Thu Sep 14, 2006 12:57 am

Can you determine when exactly the instabilities started to occur? If it is software-related, try to determine what relevant software you have changed just before that started (kernel? gcc? toolchain upgrade?).

Have you run memtest86? Most common problems are probably memory timing/error problems...

If there is a problem with the CPU/memory, do a stress test with one of the scripts listed at:

https://stier.dynu.com/~moinmoin/MarksWiki/LinuxKernel/KernelTests

Check the power supply.
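
The idea behind a stress test like a kernel build loop is to repeat a heavy, deterministic job and watch for results that differ between runs, since flaky CPU or memory tends to show up as random miscompares or crashes. A minimal Python sketch of that principle (this is not one of the scripts from the wiki linked above; the buffer size, round count and function name are assumptions for illustration):

```python
import hashlib
import os

def hash_stress(rounds=10, size_mb=16):
    """Hash copies of the same random buffer; return the rounds whose digest differed.

    On healthy hardware the list is empty. A non-empty list hints at CPU or
    memory corruption, but a clean run does not prove the hardware is good.
    """
    data = os.urandom(size_mb * 1024 * 1024)
    reference = hashlib.sha256(data).hexdigest()
    mismatches = []
    for i in range(rounds):
        # Copy the buffer so a fresh region of memory is touched each round.
        copy = bytearray(data)
        if hashlib.sha256(copy).hexdigest() != reference:
            mismatches.append(i)
    return mismatches

if __name__ == "__main__":
    bad = hash_stress()
    print("mismatching rounds:", bad if bad else "none")
```

A real stress run would use far more rounds and memory, and memtest86 or a long compile loop exercises the machine much harder than this does.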

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 1:18 am

Cinquero wrote:
Can you determine when exactly the instabilities started to occur? If it is software-related, try to determine what relevant software you have changed just before that started (kernel? gcc? toolchain upgrade?).

Have you run memtest86? Most common problems are probably memory timing/error problems...

If there is a problem with the CPU/memory, do a stress test with one of the scripts listed at:

https://stier.dynu.com/~moinmoin/MarksWiki/LinuxKernel/KernelTests

Check the power supply.



Hi,

I changed the power supply and the memory, and I ran memtest86; all of those tests were fine. The computer is new... the instabilities started at the beginning of the PC's life. I installed Gentoo 2005.1 and had some problems with the SATA drive. I finally found the driver, but it was always crashing... The server is running apache2/pure-ftpd/ssh/mysql.

I'm from Quebec City in Canada... excuse my poor English.

Thanks a lot

Joel

Cinquero
Apprentice

Joined: 24 Jun 2004
Posts: 249

Posted: Thu Sep 14, 2006 1:25 am

Joel D. wrote:
...

Joel


Run the kernel build stress test.

Is the crash time related to the room temperature? Or to the server load?

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 1:28 am

Cinquero wrote:
Joel D. wrote:
...

Joel


Run the kernel build stress test.

Is the crash time related to the room temperature? Or to the server load?


I will try the kernel build stress test. I added a fan for the temperature, and the server load is not high when it crashes.

Cinquero
Apprentice

Joined: 24 Jun 2004
Posts: 249

Posted: Thu Sep 14, 2006 1:31 am

Hmmm... are you running an X server? If yes, switch to the VESA driver and/or prevent it from starting at all. There are some notoriously unstable graphics chips around...

Which gcc version do you use?

Do the fans on the graphics card still work?

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 1:37 am

Cinquero wrote:
Hmmm... are you running an X server? If yes, switch to the VESA driver and/or prevent it from starting at all. There are some notoriously unstable graphics chips around...


No, the X server is not running.

I read some posts about similar problems, and people are talking about DMA and SATA driver problems. When I'm copying a big file through the FTP server on the LAN, the server sometimes crashes... in the "top" output, the "wa" value for the CPU is at 100%...

Maybe this could help you: I have kernel 2.6.16-r9.

The SATA drivers are the ones from the kernel (SCSI low-level driver, SiS 964/180 SATA).

Thanks

Cinquero
Apprentice

Joined: 24 Jun 2004
Posts: 249

Posted: Thu Sep 14, 2006 1:39 am

Well, then try

hdparm -d0 /dev/sda

or so to disable DMA access. It won't give you an insane speed, but you will be able to check if it is related to DMA disk transfers.

You could also try bugzilla.kernel.org to see if the bug has been fixed in more recent kernel versions.

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 1:43 am

Cinquero wrote:
Well, then try

hdparm -d0 /dev/sda

or so to disable DMA access. It won't give you an insane speed, but you will be able to check if it is related to DMA disk transfers.

You could also try bugzilla.kernel.org to see if the bug has been fixed in more recent kernel versions.


hdparm -d0 /dev/sda gives me:

Quote:
jdhosts ~ # hdparm -d0 /dev/sda

/dev/sda:
setting using_dma to 0 (off)
HDIO_SET_DMA failed: Inappropriate ioctl for device


Is that ok, or do I need to wait and see if it still crashes?

thank

Cinquero
Apprentice

Joined: 24 Jun 2004
Posts: 249

Posted: Thu Sep 14, 2006 1:45 am

Joel D. wrote:
...
thank


hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS?

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 1:59 am

Cinquero wrote:
Joel D. wrote:
...
thank


use

hdparm /dev/sda

to see if using_dma is disabled. I'm not using SATA, so I cannot really be of any help here. Do you get any warnings/errors in /var/log/messages? (or when entering "dmesg"?)


I just tried to transfer an 800 MB file with the FTP server. I started "top" over SSH, and the "wa" value in the CPU section went to 89% and then the server crashed...

hdparm /dev/sda gives me:
Quote:

/dev/sda:
IO_support = 0 (default 16-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 19457/255/63, sectors = 160041885696, start = 0


I don't see any warnings or errors in /var/log/messages or dmesg...
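
When nothing obvious shows up in the logs, it can help to filter them for storage-related keywords instead of reading them by eye. A rough Python sketch; the keyword list is only a guess at what disk and controller trouble tends to look like (not an official list), and the default path is the log file mentioned in this thread:

```python
import re

# Assumed keywords; adjust to whatever your controller/driver actually logs.
PATTERN = re.compile(
    r"(\bata\d+\b|sata|scsi|\bdma\b|i/o error|timeout|reset)", re.IGNORECASE
)

def storage_lines(lines):
    """Return the log lines that mention any of the assumed storage keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]

def scan_log(path="/var/log/messages"):
    """Read a log file (needs read permission) and filter it."""
    with open(path, errors="replace") as handle:
        return storage_lines(handle)

if __name__ == "__main__":
    try:
        for hit in scan_log():
            print(hit)
    except OSError as exc:
        print("could not read log:", exc)
```

If the machine hangs hard, the last messages may never reach the disk, so an empty result does not rule out a storage problem.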

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 2:00 am

Cinquero wrote:
Joel D. wrote:
...
thank


hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS?


Ok I will look in the bios

Joel D.
n00b

Joined: 14 Sep 2006
Posts: 13

Posted: Thu Sep 14, 2006 2:32 am

Joel D. wrote:
Cinquero wrote:
Joel D. wrote:
...
thank


hmmm... ok, for /dev/sd* you need to use sdparm, but I don't know how to enable/disable DMA for SATA devices... maybe in the BIOS?


Ok I will look in the bios


Nothing in the BIOS about SATA and DMA. I installed sdparm and I'm now looking at it.

I really think the problem is the access to the hard drive... I really don't know what I need to do to fix it.

I tried again to copy an 800 MB file via FTP to the server, and again the "wa" went to 100%, the load average got very high, and then everything crashed...

gui92
n00b

Joined: 23 Nov 2005
Posts: 8
Location: France

Posted: Thu Sep 14, 2006 6:47 am

r4d1x wrote:
Sounds like hardware :/ The best part is, you get to go on a hunt to find out what's going bad! *cheers*


As for my problem, it was due to the Adaptec RAID ZCR card. A chipset clip broke and hit the card, destroying a little chip.
It's SOLVED for me, thanks all.
_________________
--
Guillaume

Cinquero
Apprentice

Joined: 24 Jun 2004
Posts: 249

Posted: Thu Sep 14, 2006 11:13 am

Joel D. wrote:
....


You could disable DMA in libata.... as far as I have read elsewhere. But that sure ain't easy if you don't know C. Or try "ide=nodma" first.

Check if "local APIC" in the kernel config is disabled. I remember that that option caused problems on some systems.
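
To see how the running kernel was configured without opening menuconfig, you can look the option up in the kernel config file. A small Python sketch, assuming the option meant here is CONFIG_X86_LOCAL_APIC (that symbol name is my assumption about which setting is meant) and that the config is available in one of the usual places, which depends on how the kernel was built:

```python
import gzip
import os

def kernel_option(config_text, name):
    """Return the option's value ('y', 'm', ...) or None if it is unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# %s is not set" % name:
            return None
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return None

def read_kernel_config():
    """Try common locations; /proc/config.gz only exists if the kernel exports it."""
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt") as handle:
            return handle.read()
    for path in ("/usr/src/linux/.config", "/boot/config-" + os.uname().release):
        if os.path.exists(path):
            with open(path) as handle:
                return handle.read()
    return None

if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("no kernel config found")
    else:
        print("CONFIG_X86_LOCAL_APIC =", kernel_option(text, "CONFIG_X86_LOCAL_APIC"))
```

Note that /usr/src/linux/.config describes the source tree, which is not necessarily the kernel that is currently booted.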

Cinquero
Apprentice

Joined: 24 Jun 2004
Posts: 249

Posted: Thu Sep 14, 2006 11:26 am

You always copy data from the network. Did you try copying data locally?
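
One way to take the network out of the picture is to write a large file straight to the suspect disk and force it out with fsync while watching "wa" in top on another console. A minimal Python sketch under those assumptions; the function name, sizes and target directory are placeholders to adapt, and on a machine that hangs under disk load this may of course trigger the hang:

```python
import os
import tempfile
import time

def local_write_test(directory, size_mb=16, chunk_mb=4):
    """Write about size_mb of data to a temp file in directory, fsync it, return MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    written = 0
    start = time.time()
    # The temporary file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        while written < size_mb:
            handle.write(chunk)
            written += chunk_mb
        handle.flush()
        os.fsync(handle.fileno())  # push the data to the disk, not just the page cache
    elapsed = max(time.time() - start, 1e-9)
    return written / elapsed

if __name__ == "__main__":
    # Raise size_mb (e.g. to 800) and point at the affected disk to mimic the FTP transfer.
    print("%.1f MB/s" % local_write_test(tempfile.gettempdir(), size_mb=16))
```

If a local write of the same size also drives "wa" to 100% and hangs the box, the FTP server and the network card are unlikely to be the cause.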

All times are GMT
Page 1 of 2