Gentoo Forums
[solved] OCFS2 Causing super high load (1200+)
planet-admin
Apprentice

Joined: 27 Mar 2004
Posts: 213
Location: Boise, ID

Posted: Sun Mar 11, 2007 6:07 am    Post subject: [solved] OCFS2 Causing super high load (1200+)

I use OCFS2 in a production environment: Gentoo 2006.1, kernel 2.6.20, and QLogic Fibre Channel cards attached to a 16-drive RAID array. I have ocfs2-tools 1.2.1.

These machines are a cluster of web servers sharing the same data; all are attached to the same volume, formatted with OCFS2.

Randomly, with no immediately discernible reason, all the machines will show a load average of 1200, and all Apache processes will be in an uninterruptible sleep state. (This of course causes them to be taken out of the server pool, as I use ldirectord with health checks.)

I have looked at everything I know to look at: there is no I/O load and no CPU load, and yet load averages are 1200+. I can run lsof and grep for my OCFS2 mount point (/mnt/www) and see only a few (maybe 5) open files in that directory (let’s say they’re all JPGs).
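
A load average this high with idle CPUs and disks is consistent with how Linux computes load: tasks in uninterruptible sleep (state D) count toward the load average alongside runnable ones. A minimal sketch for counting them, assuming a procps-style `ps` that accepts `-eo stat=,pid=,comm=`; the helper name `count_d_state` is made up for illustration.

```shell
#!/bin/sh
# count_d_state: read "STAT PID COMMAND" lines on stdin and print how many
# are in uninterruptible sleep (state field starts with "D").
# Hypothetical helper, not part of any OCFS2 or procps tooling.
count_d_state() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Typical use on a live system (the "=" suffixes suppress the header line):
#   ps -eo stat=,pid=,comm= | count_d_state
# To list the stuck processes themselves:
#   ps -eo stat,pid,comm | awk '$1 ~ /^D/'
```

If the count roughly matches the load average, the load is coming from blocked processes (here, Apache workers waiting on the shared filesystem) rather than from CPU or disk activity.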

This is a 3-node cluster, not a 2-node one.

Strangely, when I type “mount” all the ocfs2 partitions are listed as “heartbeat=local”, on all 3 machines, but I know they are talking, since dmesg reports which nodes are in the cluster. (Nodes 0 1 2, etc)

OCFS2 is great for sharing storage, but when it locks up the machines (not fencing, just unable to serve files anymore), it defeats the purpose of high availability. I’m sure it’s something I’m doing wrong, as other people are using it successfully.

Any help would be greatly appreciated. I’d even be willing to give OCFS2 developers access to a machine to help me figure it out.

Thanks,
Michael
_________________
Michael S. Moody
Sr. Systems Engineer
Global Systems Consulting
Web: http://www.GlobalSystemsConsulting.com


Last edited by planet-admin on Mon Mar 19, 2007 11:05 pm; edited 1 time in total

Janne Pikkarainen
Veteran

Joined: 29 Jul 2003
Posts: 1143
Location: Helsinki, Finland

Posted: Mon Mar 19, 2007 12:35 pm

Anything suspicious in the logs / dmesg output?
_________________
Yes, I'm the man. Now it's your turn to decide if I meant "Yes, I'm the male." or "Yes, I am the Unix Manual Page.".

planet-admin
Posted: Mon Mar 19, 2007 11:05 pm

Janne Pikkarainen wrote:
Anything suspicious in the logs / dmesg output?


No, nothing, but I believe I have it figured out. It turns out that setting O2CB_HEARTBEAT_THRESHOLD to 900 is not a good idea.

A value of 900 certainly solved my self-fencing problem, but with a threshold that high, if a machine in the cluster goes down, the remaining nodes wait roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds (1798 seconds, about 30 minutes, at 900) before new file locks can be established.

15 turned out to be a much more reasonable value, and seems to have solved it.
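
For sizing the threshold, the OCFS2 FAQ gives the relation O2CB_HEARTBEAT_THRESHOLD = (timeout in seconds / 2) + 1, which means a node is declared dead after roughly (threshold - 1) * 2 seconds. A small sketch of that arithmetic, assuming the 2-second disk heartbeat interval the FAQ describes; both function names are illustrative only.

```shell
#!/bin/sh
# Convert between O2CB_HEARTBEAT_THRESHOLD and the disk heartbeat timeout,
# assuming the 2-second heartbeat interval described in the OCFS2 FAQ.
# Both function names are hypothetical helpers for this example.

threshold_to_seconds() {
    # (threshold - 1) heartbeat iterations, 2 seconds apart
    echo $(( ($1 - 1) * 2 ))
}

seconds_to_threshold() {
    # inverse relation: (seconds / 2) + 1
    echo $(( $1 / 2 + 1 ))
}

echo "threshold 900 -> $(threshold_to_seconds 900) seconds"
echo "threshold 15  -> $(threshold_to_seconds 15) seconds"
echo "60 seconds    -> threshold $(seconds_to_threshold 60)"
```

Under these assumptions, 900 works out to about 30 minutes of waiting and 15 to under half a minute, which fits the behaviour described in this post.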

Thanks,
Michael

olli.bo
Apprentice

Joined: 16 Jul 2003
Posts: 208
Location: Germany

Posted: Wed Jul 25, 2007 6:56 am

Hi,

I have a similar problem. In which file can I set the O2CB_HEARTBEAT_THRESHOLD variable?
Is it possible to check whether ocfs2 actually uses it?

thx
olli

planet-admin
Posted: Wed Jul 25, 2007 7:01 am

olli.bo wrote:
Hi,

I have a similar problem. In which file can I set the O2CB_HEARTBEAT_THRESHOLD variable?
Is it possible to check whether ocfs2 actually uses it?

thx
olli


You set it in /etc/default/o2cb.

Also, the latest kernel sources (2.6.21) and ocfs2-tools 1.2.6 are a good idea; they fix a few bugs and bad defaults, and make the default timeouts and other settings more realistic.

I don't know of a way to check what the timeout currently is. (And no, you can't change it one node at a time: all nodes need to be shut down, the value changed, and then o2cb started back up and the filesystem mounted.)
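
One place the live value may be visible is configfs: the o2cb init script is understood to write O2CB_HEARTBEAT_THRESHOLD into a `dead_threshold` attribute under the cluster's `heartbeat` directory. The sketch below rests on assumptions, namely that configfs is mounted at /sys/kernel/config (or at another path passed as the second argument, such as /config), that the kernel in use exposes this attribute, and that the cluster name is supplied by the caller. The function name is hypothetical.

```shell
#!/bin/sh
# show_dead_threshold CLUSTER [CONFIGFS_ROOT]
# Print the heartbeat dead threshold the kernel is using for CLUSTER, if the
# attribute exists. The attribute path is an assumption about how the o2cb
# init script applies O2CB_HEARTBEAT_THRESHOLD; older ocfs2 builds may
# expose the value elsewhere or not at all.
show_dead_threshold() {
    cluster=$1
    root=${2:-/sys/kernel/config}
    attr="$root/cluster/$cluster/heartbeat/dead_threshold"
    if [ -r "$attr" ]; then
        cat "$attr"
    else
        echo "not found: $attr" >&2
        return 1
    fi
}

# Example (cluster name is a placeholder):
#   show_dead_threshold mycluster
#   show_dead_threshold mycluster /config
```

Comparing the printed value on every node with the one in the o2cb configuration file is a quick way to see whether the setting was actually picked up when the cluster was started.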

We've been running ocfs2 stably for months in a very high-traffic environment (3.1 million hits per day), serving everything from images and videos to PHP/HTML.

Hope this helps.

Michael

planet-admin
Posted: Wed Jul 25, 2007 7:03 am

olli.bo wrote:
Hi,

I have a similar problem. In which file can I set the O2CB_HEARTBEAT_THRESHOLD variable?
Is it possible to check whether ocfs2 actually uses it?

thx
olli


Also, if you give me more details and changing the heartbeat doesn't solve it, maybe I can help further, as I'm probably one of the most experienced ocfs2 users on the planet (I've been inside and out, in the code, in everything, and have tested different schedulers, different hardware, etc.).

Michael

olli.bo
Posted: Wed Jul 25, 2007 8:09 am

Hi, thx for the fast answer...

planet-admin wrote:

Also, if you give me more details and changing the heartbeat doesn't solve it, maybe I can help further, as I'm probably one of the most experienced ocfs2 users on the planet (I've been inside and out, in the code, in everything, and have tested different schedulers, different hardware, etc.).
Michael


I have a three-node cluster with ocfs2. The same filesystem is mounted on every node. If I change a file on the mounted ocfs2 filesystem, it is changed on every other node too, so I think the cluster works fine.
In the background I have a working gnbd with drbd.

The problem is the following:
If I copy files into the ocfs2 filesystem on one node and, while the copy is running, power off one of the other nodes (to simulate a power cut), the remaining two nodes get a kernel panic after about 13 seconds.
If no copy or similar operation is running, the power cut is handled fine: after a few seconds I can write files on the remaining two nodes, and once the third node has been restarted and has mounted the ocfs2 filesystem, the files are there too.

I think I can solve this problem by setting HEARTBEAT_THRESHOLD to something like 121 to wait 60 seconds, but I don't know in which file to set it. I tried /etc/conf.d/ocfs2 and /etc/conf.d/o2cb, but I don't know whether ocfs2 is using this variable. Can I see this anywhere?

Here are some configuration files and command outputs (same on every node):
/etc/ocfs2/cluster.conf
Code:
node:
        ip_port = 7777
        ip_address = 184.1.72.201
        number = 0
        name = xentest01
        cluster = xencluster

node:
        ip_port = 7777
        ip_address = 184.1.72.202
        number = 1
        name = xentest02
        cluster = xencluster

node:
        ip_port = 7777
        ip_address = 184.1.72.203
        number = 2
        name = xentest03
        cluster = xencluster

cluster:
        node_count = 3
        name = xencluster


/sbin/mounted.ocfs2 /dev/gnbd/xen
Code:
Device                FS     Nodes
/dev/gnbd/xen         ocfs2  xentest01, xentest02, xentest03


/etc/fstab
Code:
none                    /config         configfs        defaults
none                    /dlm            ocfs2_dlmfs     defaults
/dev/gnbd/xen           /xen            ocfs2           noauto 0 0


/etc/default/o2cb
Code:
#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#

# O2CB_ENABELED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=xencluster

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=121


/etc/conf.d/ocfs2
Code:
# Copyright 1999-2006 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
# $Header: /var/cvsroot/gentoo-x86/sys-fs/ocfs2-tools/files/ocfs2.conf,v 1.1 2006/07/20 05:13:14 dberkholz Exp $

# Put your cluster names here, separated by space, ie.
OCFS2_CLUSTER="xencluster"
O2CB_HEARTBEAT_THRESHOLD=121


/etc/cluster/cluster.conf
Code:
<?xml version="1.0"?>
<cluster name="xencluster" config_version="5">
<cman>
</cman>
<clusternodes>
<clusternode name="xentest01">
        <fence>
                <method name="single">
                  <device name="human" nodename="xentest01"/>
                </method>
        </fence>
</clusternode>
<clusternode name="xentest02">
        <fence>
                <method name="single">
                  <device name="human" nodename="xentest02"/>
                </method>
        </fence>
</clusternode>
<clusternode name="xentest03">
        <fence>
                <method name="single">
                  <device name="human" nodename="xentest03"/>
                </method>
        </fence>
</clusternode>
</clusternodes>
<fencedevices>
        <fencedevice name="human" agent="fence_manual"/>
</fencedevices>
</cluster>

/etc/drbd.conf
Code:
global { usage-count yes; }
common { syncer { rate 10M; } }
resource xencluster {
        protocol C;
        net {
#               cram-hmac-alg sha1;
                shared-secret "XXXXXXXX";
                allow-two-primaries;
                after-sb-0pri discard-least-changes;
        }
        on xentest01 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   184.1.72.201:7789;
                meta-disk  internal;
        }
        on xentest02 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   184.1.72.202:7789;
                meta-disk  internal;
        }
}


uname -r
Code:
2.6.20-gentoo-r8

olli.bo
Posted: Wed Jul 25, 2007 8:16 am

Oh, I forgot:

The scheduler I'm using is deadline:
see http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html - point 74 for details.

dmesg | grep scheduler
Code:
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered (default)
io scheduler cfq registered
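
The dmesg lines above show which schedulers are registered and which one is the kernel default. The scheduler actually in use is selected per block device, and on 2.6 kernels it can be read from `/sys/block/<device>/queue/scheduler`, where the active entry appears in square brackets. A minimal sketch for extracting it; the helper name and the `sdb` device in the usage comment are placeholders.

```shell
#!/bin/sh
# active_scheduler: read the contents of a queue/scheduler file on stdin
# (for example "noop anticipatory [deadline] cfq") and print the entry
# shown in square brackets. Hypothetical helper name for this example.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example for one disk ("sdb" is a placeholder device name):
#   active_scheduler < /sys/block/sdb/queue/scheduler
```

Checking this on every node for the device backing the shared volume confirms that the default reported in dmesg is really what that device uses.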


I formatted the ocfs2 filesystem with
Code:
mkfs.ocfs2 -N 3 /dev/gnbd/xen

olli.bo
Posted: Thu Jul 26, 2007 7:44 am

Hi, I opened a new thread for my problem... See:
https://forums.gentoo.org/viewtopic-p-4160169.html#4160169
All times are GMT