unexpected server halt

knight77 · n00b Joined: 29 Jun 2009 Posts: 25

Hello.

We are managing 2 Gentoo servers and one of them just shut down without any root user telling it to do so.

Checking /var/log/everything/current we found the following lines:
[...]
Jun 29 10:21:39 [postfix/qmgr] 6EED11CAE7: removed
Jun 29 10:22:10 [vol_id] no device_
- Last output repeated 8 times -
Jun 29 10:22:13 [shutdown] shutting down for system halt
Jun 29 10:22:13 [init] Switching to runlevel: 0
Jun 29 10:22:16 [snmpd] Received TERM or STOP signal... shutting down...
[...]

As far as I can see, the server started to shut down right after the [vol_id] message. I have no idea why the vol_id entries were issued. All I could find on the Internet is that they may be related to udev, but nothing mentioning both vol_id and "no device_".
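To check what else was logged just before the halt, the entries in a short window ahead of the first [shutdown] line can be pulled out of the log. The following is only a minimal sketch in Python, assuming the metalog line format shown in the excerpt above; the function names, the 60-second window and the hard-coded year are illustrative choices, not anything from the server itself.

```python
import re
from datetime import datetime, timedelta

# Matches metalog-style lines such as:
#   Jun 29 10:22:13 [shutdown] shutting down for system halt
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \[([^\]]+)\] (.*)$")

# Lines copied from the log excerpt in this post.
SAMPLE = [
    "Jun 29 10:21:39 [postfix/qmgr] 6EED11CAE7: removed",
    "Jun 29 10:22:10 [vol_id] no device_",
    "- Last output repeated 8 times -",
    "Jun 29 10:22:13 [shutdown] shutting down for system halt",
    "Jun 29 10:22:13 [init] Switching to runlevel: 0",
]


def parse_line(line, year=2009):
    """Return (timestamp, program, message), or None for non-matching lines.

    The log carries no year, so one is assumed (hypothetical default: 2009).
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, m.group(2), m.group(3)


def events_before_shutdown(lines, window_seconds=60):
    """List entries logged in the window just before the first [shutdown] line."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    shutdown_times = [t for t, prog, _ in parsed if prog == "shutdown"]
    if not shutdown_times:
        return []
    first = shutdown_times[0]
    start = first - timedelta(seconds=window_seconds)
    return [(t, prog, msg) for t, prog, msg in parsed if start <= t < first]


if __name__ == "__main__":
    for stamp, prog, msg in events_before_shutdown(SAMPLE):
        print(stamp, prog, msg)
```

On the real server the lines would come from /var/log/everything/current instead of the SAMPLE list. Lines such as "- Last output repeated 8 times -" carry no timestamp and are skipped by this sketch.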

Has anybody ever run into a similar problem? Any information about the server is available on request.

The Gentoo server is fully up to date and serves as a mail and web server (postfix + apache (vhosts)).

Thank you for your time and any hints.

audiodef · Posted: Mon Jun 29, 2009 8:20 pm Post subject:

I've never seen this happen. Has it happened more than once?
_________________
decibel Linux: https://decibellinux.org
Github: https://github.com/Gentoo-Music-and-Audio-Technology
Facebook: https://www.facebook.com/decibellinux
Discord: https://discord.gg/73XV24dNPN

knight77 · n00b Joined: 29 Jun 2009 Posts: 25

Nope, it didn't happen before; it's the first time I have ever seen this behaviour. That's why I'm confused too, since I can't figure out why the server just decided to shut down by itself.

It's located in a provider's datacenter, so another possibility, as far as I can guess, is that somebody mistook it for another server and, maybe believing it was a Windows machine, hit Ctrl-Alt-Del, making it shut down. Still, if so, it should have rebooted, not halted.
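Whether Ctrl-Alt-Del reboots or halts the machine depends on the ctrlaltdel entry in /etc/inittab, so the point can be checked rather than guessed. The following is only a rough sketch in Python; the sample line is an assumption about what a typical sysvinit inittab contains, not a copy of the file on this server, and the helper names are made up for illustration.

```python
# Assumed example of a sysvinit entry (format: id:runlevels:action:process).
# The real /etc/inittab on the server may differ.
SAMPLE_INITTAB = """\
# What to do at the "Three Finger Salute".
ca:12345:ctrlaltdel:/sbin/shutdown -r now
"""


def ctrlaltdel_command(inittab_text):
    """Return the process field of the first ctrlaltdel entry, or None."""
    for raw in inittab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split into at most 4 fields so colons in the command are preserved.
        fields = line.split(":", 3)
        if len(fields) == 4 and fields[2] == "ctrlaltdel":
            return fields[3]
    return None


def classify_shutdown(command):
    """Classify a shutdown command line by its -r (reboot) or -h (halt) flag."""
    args = command.split()
    if "-r" in args:
        return "reboot"
    if "-h" in args:
        return "halt"
    return "unknown"


if __name__ == "__main__":
    cmd = ctrlaltdel_command(SAMPLE_INITTAB)
    print(cmd, "->", classify_shutdown(cmd) if cmd else "no ctrlaltdel entry")
```

If the entry on the server uses -r, a Ctrl-Alt-Del at the console would indeed have produced a reboot rather than the halt seen in the log.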

What I still can't explain is what caused the 9 [vol_id] messages. They may or may not be related to the shutdown, but so far they are the closest thing we have to an explanation of why it shut down.

Does anybody have any idea what might have caused the [vol_id] syslog messages? If any data about the server is needed, please ask.

Thank you for your time.

audiodef · Posted: Tue Jun 30, 2009 1:17 pm Post subject:

Has it happened since?

Also, any reason to believe someone might be trying to hack/crack in the data center?

knight77 · n00b Joined: 29 Jun 2009 Posts: 25

No, it hasn't happened since.

I have rechecked the server and the other Gentoo server we manage (similar role, but without an MTA), and the vol_id message hasn't shown up again in the last 5 days at least.

The datacenter provider keeps pretty much everything under lock, so we believe there is only a very slim possibility that somebody physically interfered with the server.

As far as we can tell, there are 2 possibilities:
1. The vol_id message is related to the shutdown being initiated, but we don't know how.
2. Something else, still unknown, caused the halt.

The only "strange" thing about that server is that the temperature of a hard drive was nearing its upper limit (it was running at 53 degrees C for a few days, with a maximum of 55 allowed in the vendor specs). Right now it's running at 47.
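To keep an eye on how close the drive gets to that limit, the temperature can be read from the SMART attribute table and compared with the vendor maximum. The following is only a sketch in Python, assuming the readings come from `smartctl -A` (smartmontools) and that the drive reports attribute 194, Temperature_Celsius. The sample line is illustrative, not output captured from this server, and the function names are made up.

```python
# Illustrative excerpt in the layout printed by `smartctl -A`; the values are
# made up to match the temperatures mentioned in this post.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8760
194 Temperature_Celsius     0x0022   047   053   000    Old_age   Always       -       47
"""

VENDOR_MAX_C = 55  # maximum from the vendor specs quoted in this post


def drive_temperature(smart_attributes_text):
    """Return the raw Temperature_Celsius value in degrees C, or None."""
    for line in smart_attributes_text.splitlines():
        fields = line.split()
        # The raw value is the 10th column; some drives append extra text
        # such as "(Min/Max ...)" after it, which is ignored here.
        if len(fields) >= 10 and fields[1] == "Temperature_Celsius":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def headroom(temp_c, limit_c=VENDOR_MAX_C):
    """Degrees C left before the vendor limit is reached."""
    return limit_c - temp_c


if __name__ == "__main__":
    temp = drive_temperature(SAMPLE_SMART)
    if temp is not None:
        print(f"{temp} C, {headroom(temp)} C below the vendor limit")
```

At 53 degrees C the headroom was only 2 degrees; at the current 47 it is 8.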

The rc-status on the self-halted server is the following:
Runlevel: default
apache2
coldplug
courier-authlib
courier-imapd
courier-pop3d-ssl
fcron
fwinit
local
lsa
metalog
mysql
named
net.eth0
net.eth1
netmount
postfix
pure-ftpd
rngd
saslauthd
snmpd
sshd
uptimed

Note: The lsa service is a hardware resource (CPU, RAM, disks) monitoring agent we use.

So far we're still in the dark as to what caused the halt. If nobody else updates this thread with any ideas, I believe the thread can be closed; we'll reopen it (if possible) or create a new one if another unexpected shutdown happens.

Thank you for your time.

audiodef · Posted: Wed Jul 01, 2009 6:05 pm Post subject:

Since you said the hard disk was running hot, do you have a temp sensor running that could shut down the machine if it gets too hot? Or perhaps there is a HW sensor that simply shuts the machine off under certain heat conditions without the need for software daemons.

Also, if you have a log daemon running, you should check the logs for clues.

unixbhaskar

Once, when I tried to emerge apache, I got this error:

checking if POSIX sems affect threads in the same process... no
checking if SysV sems affect threads in the same process... no
checking if fcntl locks affect threads in the same process... no
checking if flock locks affect threads in the same process... no
checking for entropy source... configure: error: /dev/urandom not found or unreadable.

!!! Please attach the following file when seeking support:
!!! /var/tmp/portage/dev-libs/apr-1.3.5/work/apr-1.3.5/config.log
*
* ERROR: dev-libs/apr-1.3.5 failed.
* Call stack:
* ebuild.sh, line 49: Called src_configure
* environment, line 2653: Called econf '--enable-layout=gentoo' '--enable-nonportable-atomics' '--enable-threads' '--with-devrandom=/dev/urandom'
* ebuild.sh, line 534: Called die
* The specific snippet of code:
* die "econf failed"
* The die message:
* econf failed
*
* If you need support, post the topmost build error, and the call stack if relevant.
* A complete build log is located at '/var/tmp/portage/dev-libs/apr-1.3.5/temp/build.log'.
* The ebuild environment file is located at '/var/tmp/portage/dev-libs/apr-1.3.5/temp/environment'.
*

>>> Failed to emerge dev-libs/apr-1.3.5, Log file:

>>> '/var/tmp/portage/dev-libs/apr-1.3.5/temp/build.log'

* Messages for package dev-libs/apr-1.3.5:

*
* ERROR: dev-libs/apr-1.3.5 failed.
* Call stack:
* ebuild.sh, line 49: Called src_configure
* environment, line 2653: Called econf '--enable-layout=gentoo' '--enable-nonportable-atomics' '--enable-threads' '--with-devrandom=/dev/urandom'
* ebuild.sh, line 534: Called die
* The specific snippet of code:
* die "econf failed"
* The die message:
* econf failed
*
* If you need support, post the topmost build error, and the call stack if relevant.
* A complete build log is located at '/var/tmp/portage/dev-libs/apr-1.3.5/temp/build.log'.
* The ebuild environment file is located at '/var/tmp/portage/dev-libs/apr-1.3.5/temp/environment'.
*

Any clear-cut solution would be appreciated. Thanks in advance.
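The configure error says /dev/urandom was "not found or unreadable", so a first step is to find out which of those cases applies. The following is only a minimal sketch in Python; the function name and the returned strings are made up for illustration, and it only diagnoses the state of the device node without fixing anything.

```python
import os
import stat


def check_entropy_device(path="/dev/urandom"):
    """Report whether path exists, is a character device and is readable."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISCHR(info.st_mode):
        # A regular file or directory here means the node was not created
        # as a device (on a udev system, /dev is normally populated by udev).
        return "not a character device"
    if not os.access(path, os.R_OK):
        return "unreadable"
    return "ok"


if __name__ == "__main__":
    print("/dev/urandom:", check_entropy_device())
```

Any result other than "ok" points at how /dev was populated rather than at the apr package itself.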

knight77 · n00b Joined: 29 Jun 2009 Posts: 25

Hello again.

We checked with the datacenter provider and they confirmed nobody had accessed the room where the server is located in the timeframe when the server started the halt. As such, accidental halt by somebody in the datacenter has been ruled out.

We'll try to stop the server for a few minutes in order to check the BIOS settings for any hardware temperature protection that might be enabled. Also, as a side note, we'll try removing the cover on the tower, hoping the A/C will cool it better than the fans already installed in the case (too few).

I will post here again in case we find out something new.

Thank you for your time.

PS. What does the apache emerge error from the previous post have to do with this thread? Doesn't unixbhaskar know how to open a new thread?

audiodef · Posted: Tue Jul 07, 2009 12:46 pm Post subject: Re: unexpected server halt

unixbhaskar · Posted: Tue Jul 07, 2009 2:54 pm Post subject:

Ignore the apache thing, I have rectified it.

Knight, have you read the subject line of my post???
_________________
Musing with GNU/Linux

Lenovo Thinkpad x250
x86_64 Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz GenuineIntel GNU/Linux
RAM : 8 GB
Kernel :Latest customized kernel
OS: Gentoo/Arch/Slackware/Debian/openSUSE/Fedora
Intel 965GM Chipset