mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Tue Sep 07, 2004 6:19 pm    Post subject: Need help tweaking performance for highly interactive server
This may not belong here, but the help I think I need relates to the kernel, so here goes...
I am trying to use linux as a file server with an application that is extremely sensitive to delay and jitter (think VOIP). The application needs to write thousands of files very quickly. I have tried several flavors of linux, and Gentoo is providing superior results to anything else I have tried. In fact, Gentoo running Samba seems to significantly outperform windows on the same hardware, as demonstrated by the two graphs linked below (these are enormous, I know, but I need the visual resolution so you can see what is going on):
Windows:
http://www.mciann.com/windozegraph.jpg
Gentoo:
http://www.mciann.com/linuxgraph.jpg
These are graphs of file write times. The items of interest are the dark blue line (file write time) and the red blocks (delay related error condition).
Although the Windows run went without error, the baseline is all over the place. This will not work in a production environment. The Gentoo baseline is rock solid, but slowly ramps up over time. I have been able to determine that this ramping effect is caused by the number of files in a directory. If I cause the application to write to a new subdirectory every 1000 files, for example, the ramp "resets". Although it would seem that this problem should be directly related to the filesystem, I have tried JFS, Reiser, XFS, and ext2, and all demonstrate this problem. Perhaps there is some sort of directory entry cache parameter somewhere that I could change?
I know that the "right" answer to this is to change my application so that it creates new subdirectories every few thousand items, but this isn't an option for me, mainly because I am not the developer.
The baseline would be golden if it just didn't ramp! Can anyone offer any insight that could help me? Thanks!

Tsonn (Guru)
Joined: 03 Jun 2004    Posts: 550
Posted: Tue Sep 07, 2004 8:29 pm
One alternative would be to have a cron job moving files into subdirectories every ten minutes, half hour, or whatever...
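Something like this, roughly (the paths and the interval are invented - point it at wherever the application actually writes):

# crontab entry - every ten minutes, run the sweep script
*/10 * * * * /usr/local/bin/sweep-files.sh

# /usr/local/bin/sweep-files.sh
#!/bin/sh
# move files older than a minute out of the hot directory into a
# fresh timestamped subdirectory (both paths are hypothetical)
DEST=/mnt/vol1/archive/$(date +%Y%m%d-%H%M)
mkdir -p "$DEST"
find /mnt/vol1/incoming -maxdepth 1 -type f -mmin +1 -exec mv {} "$DEST"/ \;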

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Wed Sep 08, 2004 5:05 pm
Na, that's not what he is looking for.
But as a side note: what you describe seems to match reiserfs perfectly, so even though that isn't the real issue, it's probably the filesystem you should choose.
Regarding the ramping effect itself, my guess would be Samba, not the kernel, as long as you see no swapping.
In any case, to rule out swapping hitting you at unexpected moments (2.6 kernels especially are, in my opinion, not ideal with their default in that respect), at least for the test I would do "echo 0 > /proc/sys/vm/swappiness". Check the exact spelling though, I'm writing this "offline".
If that nails the ramping, then make very sure you have enough RAM in the box, because with this setting, once the machine really does need to swap, it will definitely put a spike in response time, which then drops back to normal again. In short, the VM is not "prepared" for new requests, but at the same time it doesn't fiddle with memory as long as there is no real need. With a well-equipped box RAM-wise, and with RAM prices nowadays, that would normally be the better choice IMHO.
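For reference, the test would be something like this (the sysctl.conf line only if you want it to stick across reboots):

# check the current value (2.6 kernels default to 60)
cat /proc/sys/vm/swappiness
# discourage swapping for the duration of the test
echo 0 > /proc/sys/vm/swappiness
# make it permanent, if it turns out to help
echo "vm.swappiness = 0" >> /etc/sysctl.conf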

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Wed Sep 08, 2004 6:45 pm
I am running a twin Xeon 3.2 GHz machine with 3 GB of RAM. I am not swapping at all. I did try Reiser, and found that XFS just barely outperformed it for this application.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Thu Sep 09, 2004 8:13 am
Yep, 3 GB, fine. The point is that if you run a 2.6 kernel, "I am not swapping at all" might be a lie without you knowing it - it might still swap when you don't expect it. So try the tweak, or don't.
XFS outperforming reiser at this task is, I think, a bit odd.
The other thing, if you haven't done it already (it's not described), is that in this case I'd give the data area its own hard drive to work on, so that e.g. logging and the like don't interfere with the data writes. Since you're looking at milliseconds, seek time counts. It won't change the ramping (again, my guess is Samba), but since continuity is what you asked about most, it would shave off some ripple.

Redux (n00b)
Joined: 05 Sep 2004    Posts: 13    Location: South Dakota
Posted: Thu Sep 09, 2004 9:39 am
The most important factor here is probably your hard drives and the interface. The ramp-up that is occurring is probably the backlog of files being written to the disk; they are waiting to be written in the HD cache or in system memory.
You probably need to look at a high-speed RAID controller that supports on-board RAM for extra cache (up to 1 GB depending on the card) and some SCSI drives in RAID 0 to give you the high-speed writing that you desire.
It is an expensive solution. The RAID card alone will cost between $200 and $400 depending on the model. Adaptec and Promise are both good manufacturers that make cards that will do the job for you.

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Thu Sep 09, 2004 3:09 pm
I am using a PERC 4/Di caching RAID controller with 128 MB of cache and 15,000 RPM Ultra320 drives in a RAID 0+1 configuration.
smart:
Thanks - I'll double-check my reiser results and try your VM tweak.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Thu Sep 09, 2004 3:47 pm
You might try the deadline scheduler, possibly with writes_starved set to 1, instead of the default anticipatory scheduler.
I didn't find details about that controller... if it does write-through (does it have a battery on it?), then you might want to try pulling it and doing software RAID. Otherwise, if it's configurable, switch it to write-back if performance counts most.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Thu Sep 09, 2004 4:32 pm
BTW, how about the other end? Are you using gigabit Ethernet? Switched? Could you maybe even do a direct link? Have you set socket options in smb.conf? All those files are relatively small, I guess... have you tried tcp_low_latency? Is there a typical size for those files... maybe increase the packet size to match it nicely?
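For the smb.conf part, something along these lines (the buffer values are just a starting point, not gospel):

# smb.conf, [global] section - disable Nagle and pin the socket buffers
# to something close to the typical file size
socket options = TCP_NODELAY SO_SNDBUF=16384 SO_RCVBUF=16384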

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Thu Sep 09, 2004 4:56 pm
Okay - new information.
I repeated my XFS vs. Reiser tests. I got quite unexpected results.
I was able to confirm that XFS indeed outperforms Reiser for my application, but I also got spectacularly better results when I switched the volume back to XFS. I figured out that this is the first time I have tried XFS on a separate logical volume from the root. Me == bonehead for not catching that sooner. I keep forgetting what I have tried on Mandrake but not yet tried on Gentoo.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Thu Sep 09, 2004 5:05 pm
I still can't believe that XFS is better than reiser at many quick, small-file accesses. But you gave a hint, which is "logical volume".
Beware. This time it might not be your friend. Go with clean physical distinctions in these tests or you may get wildly wrong stats, because you never know where your logical volume sits, especially relative to the other stuff.
Have group one, a physical group, and put whatever you want on it, LVM or not.
Then have group two, a second physical group, with disks that are part of nothing else - ideally not even LVM - and make that your storage for the many-small-file data. If the controller does write-back, use only 0+1 done by the controller and a directly physical partition. Then smack reiser on it and compare it to XFS.
Until you've run the test that way, I give nought for the numbers, because as mentioned, you've got no clue where your HD heads are scrubbing around. In one test they may only have to move next to your logging partition, in the next they have to fly all over the disk, in the next they sit on different platters just below, or on a different drive. No clue. Physical separation is key in these comparisons.
So forget about all the measurements you've made so far; if I understood your setup right, they are all worthless.
Even worse, misleading.
Worse still, it seems you have two logical volumes at hand, one with reiser, one with XFS. That means completely different conditions for the two. You need to take the same thing and run it once with reiser, once with XFS.
Oh, just to make sure I don't forget to mention it: the same thing, physically.
I'm keen to say that until you get "reiser beats XFS on many small files, XFS beats reiser on few big files", you should consider your test method flawed, or something else wrong.

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Thu Sep 09, 2004 6:19 pm
I would not extrapolate anything from what I am doing and try to make judgements about the overall performance of one filesystem versus another (the quality of my data or testing practices notwithstanding). What I am doing is extremely specific and places a very narrow range of demands on the system.
That said...
Quote: | I still can't believe that XFS is better than reiser at many quick, small-file accesses. |
My test situation doesn't read files at all - I need to write many small files quickly.
Quote: | But you gave a hint, which is "logical volume". Beware. This time it might not be your friend. Go with clean physical distinctions in these tests or you may get wildly wrong stats, because you never know where your logical volume sits, especially relative to the other stuff. |
logical volume in this conversation != lvm. The drive arrangement consists of four drives in a hardware RAID 0+1 configuration. The operating system sees one physical disk, which stripes across two drives and is then mirrored to a second pair configured the same way. Drive redundancy is a production requirement of the application, so I can't plop a physical disk straight onto the operating system (unless I do software mirroring, and I don't think anyone can suggest that would work better than hardware RAID 1). Given that requirement, it made the most sense to span disks (you get twice the per-disk performance) and mirror the spans (RAID 0+1). I did try separate, unmirrored physical disks in Mandrake, just to test, but the performance was less than what I am seeing on the RAID 0+1 array, and since it won't work in production, it didn't make sense to pursue the issue. I would like to have separate physical disks presented to the operating system, but that would require 8 drives, and my cage only holds 6.
The physical disk has 4 partitions. sda1 is boot. sda2 is swap. sda3 is root. sda4 is /mnt/vol1 (my Novell background is showing, sorry).
The test results I described were derived in the following manner:
mkreiserfs /dev/sda4
mount /dev/sda4 /mnt/vol1
(test)
mkfs -t xfs -f /dev/sda4
mount /dev/sda4 /mnt/vol1
(repeat test)
So a more accurate way to describe what I was doing would be to use the word partition rather than "logical volume". Sorry for creating that confusion.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Thu Sep 09, 2004 7:12 pm
Quote: | unless I do software mirroring, and I don't think anyone can suggest that would work better than hardware RAID 1 |
That depends - for quite a few RAID controllers it has been proven to be done better in the OS than in the hardware. If the policy is write-back, though, the controller definitely wins.
What you call a span, I call a stripe, I guess...
I never suggested you use single-disk installations.
Quote: | The physical disk has 4 partitions. sda1 is boot. sda2 is swap. sda3 is root. sda4 is /mnt/vol1 (my Novell background is showing, sorry). |
That's expected to be close to the worst case, since you guarantee interference between OS activity and data writes in the worst possible manner as far as disk access goes. Since you don't know what the kernel and system are up to at any given moment (cron jobs cleaning up, swapping, whatever), you also can't compare the efficiency of sda4 between runs.
Quote: | I would like to have separate physical disks presented to the operating system, but that would require 8 drives, and my cage only holds 6. |
Six disks is perfect - that would have been my next question. Eight aren't really necessary for the task, since your speed concern is with the data, not with the OS.
My suggestion is to take 2 for system (mirrored, sda) and 4 for data (striped and mirrored, sdb).
sda1 boot
sda2 swap (if needed)
sda3 root
sda4 lvm whatever
Now, if you have an estimate of what your maximum amount of data will be, take sdb and partition it down to something like double that, and thus rule out the rest of the disk:
sdb1 highperf data
If you want to make use of the rest of the disk, you might use it for backup copies after hours, but don't use it for anything other than the data while you want it to give you the best performance.
If you MODIFY the data you write to that disk, then you might consider rearranging the data once a day by creating an sdb2 of the same size as sdb1. Once a night, when there is no more data activity, delete everything on sdb2, copy all the files from sdb1 to sdb2, clean sdb1, and copy everything back to sdb1.
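Roughly like this (the mount points are made up - substitute whatever sdb1 and sdb2 end up mounted as, and run it from cron when the application is idle):

#!/bin/sh
# nightly rearrangement: round-trip the data through the spare partition
# so sdb1 gets written back contiguously (mount points are hypothetical)
rm -rf /mnt/data2/*               # sdb2, the spare
cp -a /mnt/data/. /mnt/data2/     # sdb1 -> sdb2
rm -rf /mnt/data/*
cp -a /mnt/data2/. /mnt/data/     # and back again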
That seems to me to be the best you can do regarding the disk backend in your case. You could possibly do a bit on the networking side as well, and maybe at the last stage try the deadline scheduler.
Give a hint about the expected file size and the network connection over which the filesystem is accessed.

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Thu Sep 09, 2004 9:05 pm
Ok, just verified that the array is in write-back mode. I'll try your 6-disk configuration suggestion. That sounds like it would work better.
I fear, however, that all of this hardware/kernel tweaking is only going to delay the onset of the ramping, not eliminate it entirely. I think your initial suggestion that the problem is with Samba and not the OS carries a great deal of weight. I've asked about this on the Samba mailing list, but never got a response. I hate to cross-post, but would it be appropriate to ask the networking and security forum (here) for Samba-specific tuning hints?
P.S. - I've tried TCP_NODELAY and increasing SO_RCVBUF and SO_SNDBUF, plus a good many other Samba tweaks I can't remember just now. I've been working on this for weeks now.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Thu Sep 09, 2004 9:26 pm
If you've got a typical file size to go on, you can base your choice of RCV and SND buffer sizes on it. I myself have set them both to 16384, but again, it might be possible to do better.
What's in between your two machines, network-wise?
How much data do you want to move? How saturated is the network connection? Are the files roughly the same size all the time?
I guess nobody would object to you asking here what you asked on the Samba list, at least in my opinion - this is a different platform. Doubts would probably arise if you did it in several different topics or threads here on the Gentoo forums.

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Thu Sep 09, 2004 9:49 pm
There are four files that must be copied for every document transaction. Two are approximately 200K each, and two are less than 10K.
The network consists of Cat5e cabling with a Netgear GSM712 gigabit-over-copper switch (I can't get the company to spend money on networking gear). Intel Pro/1000 M cards are in all the hosts. I haven't tried jumbo frames yet, but I doubted that the Win2k client machines could deal with them. We're pumping right at 100 Mbit/s of traffic, so we aren't really taxing the network.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Fri Sep 10, 2004 12:38 pm
Switches may also know about two modes comparable to write-through and write-back. The latter they call store-and-forward; we want the other one (I don't remember what it's called). If the two machines are close to each other, maybe you can use a dedicated pair of NICs for a crossover connection.
You could try "echo 1 > /proc/sys/net/ipv4/tcp_low_latency" and compare measurements with and without it.
The 16384 should be OK for your case.
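For reference, the sysctl equivalent (second line only if you want it to survive a reboot):

# prefer low latency over throughput in the TCP stack
sysctl -w net.ipv4.tcp_low_latency=1
# persistent version
echo "net.ipv4.tcp_low_latency = 1" >> /etc/sysctl.conf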

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Fri Sep 10, 2004 1:22 pm
The comparison you are thinking about is cut-through vs. store-and-forward. It is only really meaningful in older Cisco switches, where the backplane latency was high enough for you to want to attempt cut-through. The idea of cut-through is that the switch makes a frame forwarding decision before the entire frame has been received. The downside to this is that if the frame is malformed, the switch can't do anything to deal with it, and the receiving station gets the workload of having to figure out that it is a bad frame.
There really isn't a reason to put a modern switch into cut-through mode, especially to resolve a latency problem. Modern switches have traffic management and hardware queueing features that will do that much better. Also, modern high performance switches, such as the Nortel 8600, actually have a multi-layer buffer design that implements cut-through WRT the central processor (where the mac forwarding table is) and pumps frames straight to x-mit ASIC, where bad frame detection can be done.
That said, we did attempt to use direct wiring (with a previous OS). The problem is that two hosts have to talk to the server. The overhead of driving the second network card eliminated the benefit, and things worked better when using the switch.
Besides all of that, if what we are trying to do is eliminate a ramp effect that directly relates to the number of files in a directory, how can network behavior be a factor?
/network engineer in a previous life

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Fri Sep 10, 2004 4:02 pm
Quote: | The comparison you are thinking about is cut-through vs. store-and-forward. It is only really meaningful in older Cisco switches, where the backplane latency was high enough for you to want to attempt cut-through. The idea of cut-through is that the switch makes a frame forwarding decision before the entire frame has been received. The downside to this is that if the frame is malformed, the switch can't do anything to deal with it, and the receiving station gets the workload of having to figure out that it is a bad frame. |
Well described. One of the good things about modern hardware is that broken frames almost never happen anymore, so you can usually just pass them straight on.
Quote: | There really isn't a reason to put a modern switch into cut-through mode, especially to resolve a latency problem. Modern switches have traffic management and hardware queueing features that will do that much better. Also, modern high performance switches, such as the Nortel 8600, actually have a multi-layer buffer design that implements cut-through WRT the central processor (where the mac forwarding table is) and pumps frames straight to x-mit ASIC, where bad frame detection can be done. |
No matter how they are designed, they can do one of two things:
- first receive the whole packet, verify its checksum, and then decide to forward or drop it,
or
- not wait until the whole packet has been received, and immediately forward what has arrived.
Quote: | There really isn't a reason to put a modern switch into cut-through mode, especially to resolve a latency problem. |
Yes, you would, for exactly that reason: to reduce latency. Performance is not an issue anymore - routers/switches have enough CPU to calculate the checksum "along" the packet as it comes in - but they can't make a decision based on it before the packet has been fully received either. So it's buffered and then sent.
Quote: | The overhead of driving the second network card eliminated the benefit, and things worked better when using the switch. |
That surprises me a bit. If those gigabit cards sat in ordinary PCI slots and the bus were quite saturated, I could see that. Are you sure the data traversed the direct link? Did you reconfigure the mount to go to the other IP on the other network? Did you really use a different network (if you use two IPs out of the same subnet, responses to requests will all go out the same NIC until you configure source-based routing)? No bad words intended, just trying to make sure we find the reason for the discrepancy between expectation and observation.
Quote: | Besides all of that, if what we are trying to do is eliminate a ramp effect that directly relates to the number of files in a directory, how can network behavior be a factor? |
Right - that was not meant to be related to the ramp effect, just to get the overall latency down. If there's enough headroom between the baseline at the start and the response time we need to stay under, the ramp effect is less likely to push us past the limit of acceptable response time.
Along the same line, we can also try the deadline scheduler...
/nutty dude, entire life

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Fri Sep 10, 2004 5:33 pm
I read about the deadline scheduler, but couldn't find any good information on how to implement it. How would I do so?

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Fri Sep 10, 2004 5:35 pm
Two things came to mind in the meantime.
If I remember right, the NIC you use is an active card with something like an i960 on it for offloading the CPU. I'm not fully aware of the current situation, but historically Linux didn't support it as an active card - it drove the interface hardware directly and didn't use the i960/active capability, due to lack of a license/support for the firmware. But I think Intel itself has changed that in the meantime. Try modinfo on the driver module you use for that card. It should offer module options to tune the card with respect to throughput vs. latency or CPU offloading... if so, make use of them; if not, check Intel's web pages to see whether you can get a better driver module.
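For example, to see what the installed driver actually offers (go by whatever it prints - the parameter names differ between e1000 versions):

# list the tunable module parameters and their descriptions
modinfo -p e1000
# or dump everything, including the driver version
modinfo e1000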
The other is the suggestion to make sure that you compiled the kernel with memory-mapped I/O support for networking.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Fri Sep 10, 2004 5:38 pm
You can decide about the scheduler in the kernel configuration, under
General Setup -> Configure standard kernel features
There you can switch off the no-op, anticipatory and CFQ schedulers; then you get deadline automatically.
With it you should get an option for the read/write preference ratio. Its default is 2:1 in favor of reads (the setting "2"). You could change that to "1" for a 1:1 read/write ratio.
I would have to check which sysctl that is...
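If you end up with more than one scheduler compiled in anyway, I believe you can also pick one at boot with the elevator= parameter (check Documentation/kernel-parameters.txt in your kernel tree to be sure), e.g. on the kernel line in grub.conf:

# /boot/grub/grub.conf - illustrative kernel line, keep your own image name and root=
kernel /boot/bzImage root=/dev/sda3 elevator=deadline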

mciann (Tux's lil' helper)
Joined: 02 Sep 2004    Posts: 102
Posted: Fri Sep 10, 2004 6:46 pm
Score.
Your e1000 recommendation about offloading checksums was right on the money. It gave me about a 15% performance improvement (in terms of delaying the ramp). Thanks!
What is the right way to automate module loading with command line arguments?
I've also recompiled with deadline scheduler support, but I can't find the tunable parameter in /proc.

smart (Guru)
Joined: 19 Nov 2002    Posts: 455
Posted: Fri Sep 10, 2004 8:11 pm
/etc/modules.conf is the place for the e1000 options.
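Something like this (the option name is only a placeholder - use whatever modinfo reports for your driver; on a 2.6 kernel with module-init-tools the file may be /etc/modprobe.conf instead, or a file under /etc/modules.d/ on Gentoo, followed by modules-update):

# /etc/modules.conf - load e1000 with options for eth0
alias eth0 e1000
# option name below is illustrative - take the real one from "modinfo -p e1000"
options e1000 XsumRX=1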
As for writes_starved: it seems it's not offered via sysctl, though. Just try the scheduler as it is, or modify
/usr/src/linux/drivers/block/deadline-iosched.c:
static int writes_starved = 2; /* max times reads can starve a write */
to
static int writes_starved = 1; /* max times reads can starve a write */
but then, maybe not. Try it as is and see if it helps; otherwise maybe just forget about it.

DrWilken (Apprentice)
Joined: 12 Dec 2003    Posts: 219    Location: Oelsted ("BeerPlace"), Denmark
Posted: Sat Sep 25, 2004 10:47 am
Have you got noatime and nodiratime in your /etc/fstab for the filesystem in use?
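For reference, that's just two extra mount options on the data filesystem's line - e.g. with the sda4/XFS setup described earlier:

# /etc/fstab - data volume without atime updates
/dev/sda4   /mnt/vol1   xfs   noatime,nodiratime   0 0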