Gentoo Forums
slow NFS degradation, 98% access calls
criacow
n00b


Joined: 08 Jan 2008
Posts: 1
Location: Vancouver, Canada

Posted: Tue Jan 08, 2008 10:36 pm    Post subject: slow NFS degradation, 98% access calls

Good afternoon!

We've got a cluster of Gentoo boxes. There's a central unit that load-balances out to four servers, each of which is an NFS client of a sixth box:

Code:

              load-balancer
      /        |        |        \      <- traffic forwarding
   box1      box2     box3      box4
      \        |        |        /      <- NFS mounts
            central fileshare


At first, everything's great -- the client does a GETATTR call, gets the reply, and the file goes out.

Over the course of a week or two, more and more ACCESS calls show up on every file grab, until a capture starts looking like this:

Code:

No.     Time        Source                Destination           Protocol Info
 367848 58.923514   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367849), FH:0xb85fb85d
 367849 58.923785   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367848)
 367850 58.923800   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367851), FH:0xb85fb85d
 367851 58.923915   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367850)
 367852 58.923928   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367853), FH:0x7bd67bd7
 367853 58.924045   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367852)
 367854 58.924059   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367855), FH:0xb85fb85d
 367855 58.924173   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367854)
 367856 58.924185   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367857), FH:0x7bd67bd7
 367857 58.924301   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367856)
 367858 58.924313   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367859), FH:0x637b7c7a
 367859 58.924430   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367858)
 367860 58.924444   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367861), FH:0xb85fb85d
 367861 58.924558   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367860)
 367862 58.924570   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367863), FH:0x7bd67bd7
 367863 58.924687   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367862)
 367864 58.924698   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367865), FH:0x637b7c7a
 367865 58.924814   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367864)
 367866 58.924826   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367867), FH:0xf3bdecbc
 367867 58.924944   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367866)
 367868 58.924964   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367869), FH:0xb85fb85d
 367869 58.925078   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367868)
 367870 58.925090   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367871), FH:0x7bd67bd7
 367871 58.925211   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367870)
 367872 58.925223   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367873), FH:0x637b7c7a
 367873 58.925339   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367872)
 367874 58.925351   192.168.1.13          192.168.1.1           NFS      V3 ACCESS Call (Reply In 367875), FH:0xf3bdecbc
 367875 58.925466   192.168.1.1           192.168.1.13          NFS      V3 ACCESS Reply (Call In 367874)
 367876 58.925478   192.168.1.13          192.168.1.1           NFS      V3 GETATTR Call (Reply In 367877), FH:0xdbbbc4ba
 367877 58.925593   192.168.1.1           192.168.1.13          NFS      V3 GETATTR Reply (Call In 367876)  Regular File mode:0644 uid:1000 gid:440


nfsstat -c starts to look like this:

Code:

Client rpc stats:
calls      retrans    authrefrsh
369630277   866766     0       
Client nfs v2:
null       getattr    setattr    root       lookup     readlink   
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
read       wrcache    write      create     remove     rename     
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%
link       symlink    mkdir      rmdir      readdir    fsstat     
0       0% 0       0% 0       0% 0       0% 0       0% 0       0%

Client nfs v3:
null       getattr    setattr    lookup     access     readlink   
0       0% 1075664587 23% 38260   0% 4181249  0% -711993663 76% 0       0%
read       write      create     mkdir      symlink    mknod     
192202  0% 402150  0% 324530  0% 135     0% 0       0% 0       0%
remove     rmdir      rename     link       readdir    readdirplus
327099  0% 0       0% 56      0% 0       0% 0       0% 225024  0%
fsstat     fsinfo     pathconf   commit     
4       0% 2       0% 0       0% 268642  0%


and will eventually get to be 98% access.
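
A side note on reading those numbers: the negative access count is what a 32-bit counter looks like once it passes 2^31 and is printed as a signed value, and the calls total appears to have wrapped past 2^32 once. The Python sketch below is only an illustration under that assumption (32-bit counters are not stated anywhere in the output, and the helper name as_unsigned32 is made up here). It reinterprets the access value and checks that the per-procedure counts add up to the reported calls figure modulo 2^32:

```python
def as_unsigned32(value):
    """Reinterpret a counter printed as a signed 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF

# NFSv3 client counters copied from the nfsstat -c output above.
getattr_calls = 1075664587
access = as_unsigned32(-711993663)  # assumption: 32-bit counter shown as signed
others = [
    38260,    # setattr
    4181249,  # lookup
    192202,   # read
    402150,   # write
    324530,   # create
    135,      # mkdir
    327099,   # remove
    56,       # rename
    225024,   # readdirplus
    4,        # fsstat
    2,        # fsinfo
    268642,   # commit
]

total = getattr_calls + access + sum(others)

print("access calls (unsigned):", access)
print("total v3 calls:", total)
print("total modulo 2**32:", total % 2**32)  # compare with "calls" in the rpc stats
print("access share: %d%%" % (access * 100 // total))
```

Under that assumption the real access count is about 3.58 billion out of about 4.66 billion calls, which agrees with the 76% that nfsstat prints, and the total modulo 2^32 equals the 369630277 shown under calls.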

I've searched through the forums and through Google, but can't find anything relevant. Any clue why this happens? Is there somewhere that filehandles get incorrectly cached, or something along those lines?

Thanks in advance!
-criacow
guruvan
Tux's lil' helper


Joined: 21 Aug 2007
Posts: 132

Posted: Thu Jan 10, 2008 9:00 am

Sounds like you're on the right track. I found one little discussion somewhere, http://www.scooter.cx/~mozbot/%23vesta-20050517-070000.xml, where someone mentions a similar problem (not much to go on). From your log excerpt, it would seem that the filehandles are not released. Maybe you can isolate it to certain types of files, or to files that are accessed while a certain operation takes place, or find some reason that the clients aren't closing the files? Can you plot the filehandles and what the access pattern is? Is it a growing number of files that are each being accessed over and over, or does the number of accesses for each original file operation grow over time?
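
One way to get at the access pattern is to tally the capture. The following is a rough Python sketch, assuming the trace is exported as text in the same layout as the excerpt in the first post; the regular expression, the function name summarize_calls, and the trimmed SAMPLE are all made up for illustration. It counts ACCESS calls per filehandle and how many ACCESS calls precede each GETATTR:

```python
import re
from collections import Counter

# Matches the Info column of a text export, e.g.
# "NFS      V3 ACCESS Call (Reply In 367849), FH:0xb85fb85d"
CALL_RE = re.compile(r"\bV3 (\w+) Call\b.*?FH:(0x[0-9a-fA-F]+)")

def summarize_calls(lines):
    """Return (ACCESS calls per filehandle, ACCESS calls seen before each GETATTR)."""
    per_handle = Counter()
    access_runs = []
    run = 0
    for line in lines:
        match = CALL_RE.search(line)
        if not match:
            continue  # replies, column headers, other traffic
        proc, handle = match.groups()
        if proc == "ACCESS":
            per_handle[handle] += 1
            run += 1
        elif proc == "GETATTR":
            access_runs.append(run)
            run = 0
    return per_handle, access_runs

# Trimmed excerpt of the capture from the first post (most packets omitted).
SAMPLE = """\
 367848 58.923514   192.168.1.13   192.168.1.1    NFS   V3 ACCESS Call (Reply In 367849), FH:0xb85fb85d
 367849 58.923785   192.168.1.1    192.168.1.13   NFS   V3 ACCESS Reply (Call In 367848)
 367850 58.923800   192.168.1.13   192.168.1.1    NFS   V3 ACCESS Call (Reply In 367851), FH:0xb85fb85d
 367852 58.923928   192.168.1.13   192.168.1.1    NFS   V3 ACCESS Call (Reply In 367853), FH:0x7bd67bd7
 367876 58.925478   192.168.1.13   192.168.1.1    NFS   V3 GETATTR Call (Reply In 367877), FH:0xdbbbc4ba
"""

per_handle, access_runs = summarize_calls(SAMPLE.splitlines())
for handle, count in per_handle.most_common():
    print(handle, count)
print("ACCESS calls before each GETATTR:", access_runs)
```

Run against a capture from a fresh client and one from a degraded client: if the set of handles keeps growing, that points one way; if the same few handles are simply hit more often per GETATTR, that points the other.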

Simpler to fix, sometimes easy to overlook:
Have you done a trace on the network to see the contents of the excess NFS packets? Do they have the right source addresses? In particular, is there any way the NFS server could be resolving a box's address to the wrong interface (e.g. via DNS)? That could be a possible explanation for files not closing.

Can you find a way to duplicate this on a test load?
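
To see whether a test load reproduces the drift, the ACCESS share could be sampled at intervals instead of waiting for the lifetime totals to shift. A minimal Python sketch, assuming a Linux client that exposes its counters in /proc/net/rpc/nfs as a line of the form "proc3 22 <22 counters>" in NFSv3 procedure order; the function names and the sample figures below are hypothetical:

```python
# NFSv3 procedure names in protocol order (procedures 0..21).
NFS3_PROCS = [
    "null", "getattr", "setattr", "lookup", "access", "readlink",
    "read", "write", "create", "mkdir", "symlink", "mknod",
    "remove", "rmdir", "rename", "link", "readdir", "readdirplus",
    "fsstat", "fsinfo", "pathconf", "commit",
]

def parse_proc3(text):
    """Extract the NFSv3 per-procedure counters from /proc/net/rpc/nfs-style text."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "proc3":
            # fields[1] is the number of counters that follow.
            counts = [int(f) for f in fields[2:2 + len(NFS3_PROCS)]]
            return dict(zip(NFS3_PROCS, counts))
    raise ValueError("no proc3 line found")

def access_share(before, after):
    """Fraction of the calls made between two samples that were ACCESS calls."""
    delta = {name: after[name] - before[name] for name in NFS3_PROCS}
    total = sum(delta.values())
    return delta["access"] / total if total else 0.0

def make_sample(getattr_calls, access_calls):
    """Build a synthetic proc3 line (made-up numbers, for demonstration only)."""
    counts = [0] * len(NFS3_PROCS)
    counts[NFS3_PROCS.index("getattr")] = getattr_calls
    counts[NFS3_PROCS.index("access")] = access_calls
    return "proc3 22 " + " ".join(str(c) for c in counts)

before = parse_proc3(make_sample(100, 50))
after = parse_proc3(make_sample(200, 450))
share = access_share(before, after)
print("ACCESS share over the interval: %.0f%%" % (share * 100))
```

On a real client the two samples would come from reading /proc/net/rpc/nfs before and after a run of the test load; a share that climbs from run to run would mean the problem has been reproduced.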

When in doubt, rip it out! Maybe roll up another fresh cluster box, with a freshly compiled toolchain, glibc, kernel, and NFS daemons (not necessarily upgraded).
_________________
Everything is broken......(b.dylan). 8)

guruvan
All times are GMT