yottabit Guru
Joined: 11 Nov 2002 Posts: 313 Location: Columbus, Ohio, US
Posted: Thu Mar 10, 2005 7:29 am Post subject: Prohibit Kernel File Caching? -- Legit Question, REALLY!!
First off: no yelling at me for asking the question. I have my purposes.
Is there a way (perhaps through /proc or with mount options), without hacking the kernel, to stop the kernel from caching reads & writes? Ideally I'd be able to specify on a mount-by-mount basis...
(Because some will want to know: the kernel is honestly slowing down disk reads to the point of degrading performance. The kernel seems to place a higher priority on reading as much of the file as possible into memory before actually serving the data to the requestor... It's crazy!)
Anyone? Bueller? Bueller? _________________ Play The Hitchhiker's Guide to the Galaxy!
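(For context, a hedged sketch of what a stock 2.6 kernel offered at the time: the sync mount option forces synchronous writes but does not disable read caching, and truly bypassing the page cache requires the application itself to open files with O_DIRECT - which is exactly what Iozone's -I flag, used later in this thread, asks for. The mount point is the one that appears further down.)
Code: | # Per-mount: force synchronous writes (does not stop the read cache):
mount -o remount,sync /mnt/bigarray
# There is no generic per-mount "no page cache" switch on 2.6; an application must
# request it itself by opening files with O_DIRECT (e.g. Iozone's -I option). |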
adaptr Watchman
Joined: 06 Oct 2002 Posts: 6730 Location: Rotterdam, Netherlands
Posted: Thu Mar 10, 2005 10:39 am
If your kernel is causing disk performance degradation - and the CPU is not a 486 or something that slow - then the problem is really with your kernel, not with Linux...
Or the disk controller is dodgy, or simply broken.
What kind of throughput do you get from the drive(s)?
EDIT: for one, your statement simply makes no practical sense - the kernel does not cache "anything it can" from a file; it caches blocks, whose size is determined by the maximum amount of data that can be transferred in one operation.
It is not possible to read faster by reading less, so reading one byte will not be any faster than reading the maximum block size (usually 128K for modern drives, i.e. 256 sectors).
Second, the notion that processing cached disk contents slows down the system is rather funny - the kernel disk cache is on the order of 1000 times as fast as the actual disks can ever be.
There are kernel (sysctl) settings for this, but your post implies that you have set these to non-standard values rather than that they could or should be improved from the defaults... _________________ >>> emerge (3 of 7) mcse/70-293 to /
Essential tools: gentoolkit eix profuse screen
Last edited by adaptr on Thu Mar 10, 2005 10:47 am; edited 1 time in total
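(A hedged sketch of how the throughput question could be answered, and of the sysctl settings alluded to above; the device name is assumed from later posts, and the exact knob names should be treated as typical for a 2.6 kernel rather than definitive.)
Code: | # Raw sequential read throughput, cached and uncached:
hdparm -tT /dev/sdd
# The page-cache / writeback tunables live under /proc/sys/vm, e.g.:
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.swappiness |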
yottabit Guru
Joined: 11 Nov 2002 Posts: 313 Location: Columbus, Ohio, US
Posted: Thu Mar 10, 2005 10:46 am
The degradation is general I/O, not limited specifically to the disk. The contributing factors could be highmem kernel support, the choice of kernel I/O scheduler (deadline at the moment, which seems reasonable), the interface driver (Promise), the controller (Promise S150 TX4), the drives (Hitachi 7K250 250 GB SATA w/ 8 MB cache), the striping size (no striping and 32k tested so far), and the filesystem (ReiserFS 3.6 and ext2 tested so far).
Yes, that is a lot of variables. But between watching top and the Iozone results, it is quite evident that the performance bottleneck disappears when passing the -I option to Iozone, which bypasses the kernel cache.
You can reference my on-going struggle in this thread, but here I'd like to concentrate on finding a way to disable caching on the relevant filesystem, in case my testing doesn't produce a clear winner that interoperates with the kernel's cache in a friendlier manner.
EDIT: After reading your edit I guess I should clarify. The kernel must know the I/O request is for sequential data in a mammoth file, and it seems to 'anticipate' the next requests (a logical thing to do, since they're 100% predictable in this case) by filling its cache at a higher priority than simply passing the data to the requesting application. I can see an interesting pattern in NIC utilization (to the workstation requesting the data): utilization starts very low while top shows the kernel cache growing like mad, then network utilization peaks much higher, and after a short while the pattern repeats. With Iozone's -I disabling the kernel caching mechanism (by I/O call type), the performance difference is dramatic. _________________ Play The Hitchhiker's Guide to the Galaxy!
Last edited by yottabit on Thu Mar 10, 2005 10:55 am; edited 1 time in total
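(A hedged aside: the 'anticipation' described above is largely the block layer's per-device readahead, which can be tuned or switched off without touching the kernel; the device name is assumed from later posts in the thread.)
Code: | # Readahead is set per block device, in 512-byte sectors (256 = 128 KB is the usual 2.6 default):
blockdev --getra /dev/sdd
blockdev --setra 0 /dev/sdd   # effectively disables readahead for this device
# hdparm exposes the same setting on most drives:
hdparm -a 0 /dev/sdd |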
adaptr Watchman
Joined: 06 Oct 2002 Posts: 6730 Location: Rotterdam, Netherlands
Posted: Thu Mar 10, 2005 10:53 am
Well, maybe you have already tested this to the limit (or think you have), but the issue is really quite simple:
There are two factors: pure disk I/O, and pure virtual (page) memory I/O. The latter has slightly more overhead than raw RAM access, but not so much that you'd notice - the entire kernel runs on the virtual memory subsystem, after all.
These two are bound together by the caching code in the kernel.
I am curious exactly what numbers you got, because if the kernel's memory caching of disk accesses really slows things down, then you have encountered a serious issue indeed: a relatively arbitrary piece of code right inside the kernel that can, under some circumstances, cause a 100-fold or greater throughput degradation.
Not good, and very, very unlikely.
Hence my questions - have you examined all the angles?
EDIT: having read your edit I understand a little better what you mean.
Still, the difference in access speed between any I/O system and pure memory I/O is more than 100-fold in any case - if the kernel's caching code were to dump the entire free RAM at once you would not even be able to see this, since modern DRAM easily achieves transfer speeds of gigabytes per second, virtual memory overhead included.
This means that there is more complex logic at work with the discrepancy (stair-stepping) you observe in top, and I think it indicates that either end of the caching process is actually waiting for something before taking the next step - which, like you said, is completely unnecessary when caching linear blocks from large files; just assign a DMA transfer and forget about it. _________________ >>> emerge (3 of 7) mcse/70-293 to /
Essential tools: gentoolkit eix profuse screen
Last edited by adaptr on Thu Mar 10, 2005 11:00 am; edited 1 time in total
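(For anyone wanting to watch the stair-stepping described above without relying on top alone, a minimal sketch using standard tools; iostat assumes the sysstat package is installed.)
Code: | # Memory, cache and block-I/O columns, one sample per second:
vmstat 1
# Per-device utilisation and throughput, if sysstat is available:
iostat -x 1 |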
yottabit Guru
Joined: 11 Nov 2002 Posts: 313 Location: Columbus, Ohio, US
Posted: Thu Mar 10, 2005 10:58 am
As soon as my current Iozone test on Reiser4 completes (15-30 minutes) I'll run an Iozone test with ext2 (format optimized for large files) with cache enabled and with cache disabled, and post the results.
The other testing based on NIC utilization and watching top is a little more difficult to put into reliable numbers. _________________ Play The Hitchhiker's Guide to the Galaxy!
yottabit Guru
Joined: 11 Nov 2002 Posts: 313 Location: Columbus, Ohio, US
Posted: Fri Mar 11, 2005 7:50 pm
Okay, sorry it was a little longer than 15-30 minutes (more like 2 days), hehe. But I had a disastrous experience with ReiserFS 4 Beta that warranted some documentation. See this thread if you're interested.
ext2 format options (I mounted to /mnt/bigarray but it's not an array, just FYI):
Code: | hal root # time mke2fs -T largefile4 -v /dev/sdd1
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
59648 inodes, 61049000 blocks
3052450 blocks (5.00%) reserved for the super user
First data block=0
1864 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 23 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
real 0m11.298s
user 0m0.035s
sys 0m0.148s
hal root # time mount -v /dev/sdd1 /mnt/bigarray -t ext2
/dev/sdd1 on /mnt/bigarray type ext2 (rw)
real 0m0.005s
user 0m0.000s
sys 0m0.002s
hal root # df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md1 77G 43G 34G 56% /
none 506M 0 506M 0% /dev/shm
/dev/sdd1 233G 20K 222G 1% /mnt/bigarray |
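(A hedged note on the -T largefile4 switch used above: it corresponds to roughly one inode per 4 MiB of data - the inode and block counts in the output bear this out - so an explicit equivalent would look something like this.)
Code: | # Approximate explicit equivalent of -T largefile4 (one inode per 4 MiB of data):
mke2fs -i 4194304 -v /dev/sdd1 |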
Now that the boring stuff is out of the way, onto the data gathered by Iozone.
The options used for Iozone were:
Code: | # With -I to disable kernel caching
iozone -a -i 0 -i 1 -I -M -o -p -g 2g -b ~/iozone/ext2-native.xls
# Without -I to enable default kernel caching:
iozone -a -i 0 -i 1 -M -o -p -g 2g -b ~/iozone/ext2-native-caching.xls |
With caching disabled, the 64 MB record-length first-write performance on the 2 GB file was 35.797 MB/s, and the first-read performance was 59.364 MB/s.
With caching enabled, the 64 MB record-length first-write performance on the 2 GB file was 18.057 MB/s - about half as fast! The first-read performance was 41.932 MB/s, also slower.
The full data is available for download here.
Comments? _________________ Play The Hitchhiker's Guide to the Galaxy!
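(A hedged way to sanity-check the Iozone numbers with nothing but dd: oflag=direct and iflag=direct need a reasonably recent coreutils and O_DIRECT support in the filesystem, and the file name below is made up for the example.)
Code: | # Buffered (page-cached) write and read of a 2 GB test file:
dd if=/dev/zero of=/mnt/bigarray/ddtest bs=64M count=32
dd if=/mnt/bigarray/ddtest of=/dev/null bs=64M
# The same transfers bypassing the page cache via O_DIRECT:
dd if=/dev/zero of=/mnt/bigarray/ddtest bs=64M count=32 oflag=direct
dd if=/mnt/bigarray/ddtest of=/dev/null bs=64M iflag=direct |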