Problems with locking, permanent 'lockd: server in grace period'

On Tue, Aug 09, 2011 at 12:51:14AM +1200, Malcolm Locke wrote:
> First off, apologies for bringing such mundane matters to the list, but
> we're at the end of our tethers and way out of our depth on this.  We
> have a problem on our production machine that we are unable to replicate
> on a test machine, and would greatly appreciate any pointers of where to
> look next.
> 
> We're in the process of upgrading a DRBD pair running Ubuntu hardy to
> Debian squeeze.  The first of the pair has been upgraded, and NFS works
> correctly except for locking.  Calls to flock() from any client on an
> NFS mount hang indefinitely.
> 
> We've installed a fresh Debian squeeze machine to test, but are
> completely unable to reproduce the issue.  Pertinent details about the
> set up:
> 
> Kernel on both machines:
>   Linux debian 2.6.32-5-openvz-amd64 #1 SMP Tue Jun 14 10:46:15 UTC 2011
>   x86_64 GNU/Linux
> 
>   Debian package versions:
>   nfs-common 1.2.2-4
>   nfs-kernel-server 1.2.2-4
>   rpcbind 0.2.0-4.1
> 
>   Filesystem is ext3 rw,relatime,errors=remount-ro,data=ordered
>   /etc/exports has rw,no_root_squash,async,no_subtree_check
> 
> On both the working and failing hosts, the NFS is mounted with default
> options, e.g. mount host:/home /mnt
> 
> Below is the nlm debug from the working host (hostname debian on the
> left) and the failing host (itchy on the right).  Apologies for the wide
> text, I've aligned the log messages from a single flock() attempt so the
> corresponding lines match up for each host.  In both cases, the NFS
> client and server are the same host.
> 
> Points I note from this are:
> 
> - xdr_dec_stat_res doesn't get called on the failing host
> - nlm_lookup_host reports 'found host' on the failing host, and
>   'created host' on the working host.
> - vfs_lock_file returned 0 doesn't log on the failing host.  I think
>   this is because one of the following checks is returning true:
> 
>     // fs/lockd/svclock.c:411
>     if (locks_in_grace() && !reclaim) {
>             ret = nlm_lck_denied_grace_period;
>             goto out;
>     }
>     if (reclaim && !locks_in_grace()) {
>             ret = nlm_lck_denied_grace_period;
>             goto out;
>     }
>     
>   I've come to this conclusion because of the 'lockd: server in grace
>   period'.  The failing host has been up for several days, and on both
>   machines /proc/sys/fs/nfs/nlm_grace_period is 0.
> 
> Any help on this would be greatly appreciated, including where to go
> next.  If you require any more info let me know.  Thanks for your time.

It might be worth trying this in addition to the recoverydir fixes
previously posted.

--b.

commit c52560f10794b9fb8c050532d27ff999d8f5c23c
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Fri Aug 12 11:59:44 2011 -0400

    some grace period fixes and debugging

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Problems with locking, permanent 'lockd: server in grace period'

Commit Message

Comments

Patch