diff mbox

Live lock in silly-rename.

Message ID 20140529164521.02324559@notabene.brown (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown May 29, 2014, 6:45 a.m. UTC
The program below (provided by a customer) demonstrates a livelock that can
trigger in NFS.

"/mnt" should be an NFSv4 mount from a server which will hand out READ
delegations (e.g. the Linux NFS server) and should contain a subdirectory
"/mnt/data".

The program forks off some worker threads which poll a particular file in
that directory until it disappears.  Then each worker will exit.
The main program waits a few seconds and then unlinks the file.

The number of threads can be set with the first arg (4 is good). The number of
seconds with the second.  Both have useful defaults.

The unlink should happen promptly and then all the workers should  exit.  But
they don't.

What happens is that between when the "unlink" returns the delegation that it
will inevitably have due to all those "open"s, and when it performs the
required silly-rename (because some thread will have the file open), some
other thread opens the file and gets a delegation.
So the NFSv4 RENAME returns NFS4ERR_DELAY while it tries to reclaim the
delegation.  15 seconds later the rename will be retried, but there will still
(or again) be an active delegation.  So the pattern repeats indefinitely.
All this time the i_mutex on the directory and file are held so "ls" on the
directory will also hang.

As an interesting twist, if you remove the file on the server, the main
process will duly notice when it next tries to rename it, and so will exit.
The clients will continue to successfully open and close the file, even
though "ls -l" shows that it doesn't exist.
If you then rm the file on the client, it will tell you that it doesn't
exist, and the workers will suddenly notice that too.

I haven't really looked into the cause of this second issue, but I can work
around the original problem with the patch below.  It effectively serialised
'open' with 'unlink' (and with anything else that grabs the file's mutex).

I think some sort of serialisation is needed.  I don't know whether i_mutex
is suitable or if we should find (or make) some other locks.

Thoughts?

Thanks,
NeilBrown

Patch:



Program:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>


// nfsv4 mount point /mnt
const char check[] = "/mnt/data/checkTST";
const char data[] = "/mnt/data/dummy.data";

int num_client = 4;
int wait_sec   = 3;

// call open/close in endless loop until open fails
void client (void)
{
    for (;;)
    {
        int f = open (check, O_RDONLY);

        if (f == -1)
        {
            printf ("client: done\n");
            _exit(0);
        }
        close (f);
    }
}

int main (int argc, char **argv)
{
    int i, fd;
    FILE *f_dummy;

    if (argc > 1)
        num_client = atoi (argv[1]);

    if (argc > 2)
        wait_sec = atoi (argv[2]);

    fd = open (check, O_RDWR|O_CREAT, S_IRWXU);

    if (fd == -1)
    {
        perror ("open failed:\n");
        _exit (1);
    }

    close (fd);

    for (i=0; i<num_client; i++)
    {
        // fork childs to poll the 'checkTST' file
        pid_t p = fork ();
        if (p==0)
            client();
        else if (p==-1)
        {
            perror ("fork failed");
            _exit (1);
        }
    }

      // parent process
      // - wait a few seconds and unlink 'checkTST' file
      // - all childs are polling the 'checkTST' and should
      //   end now
      sleep (wait_sec);
      unlink (check);
      printf ("server: done\n");
}

Comments

Trond Myklebust May 29, 2014, 4:38 p.m. UTC | #1
Apologies Neil. Resending due to gmail defaulting to html formatting
which gets rejected by vger.kernel.org...

On Thu, May 29, 2014 at 2:45 AM, NeilBrown <neilb@suse.de> wrote:
>
> The program below (provided by a customer) demonstrates a livelock that can
> trigger in NFS.
>
> "/mnt" should be an NFSv4 mount from a server which will hand out READ
> delegations (e.g. the Linux NFS server) and should contain a subdirectory
> "/mnt/data".
>
> The program forks off some worker threads which poll a particular file in
> that directory until it disappears.  Then each worker will exit.
> The main program waits a few seconds and then unlinks the file.
>
> The number of threads can be set with the first arg (4 is good). The number of
> seconds with the second.  Both have useful defaults.
>
> The unlink should happen promptly and then all the workers should  exit.  But
> they don't.
>
> What happens is that between when the "unlink" returns the delegation that it
> will inevitably have due to all those "open"s, and when it performs the
> required silly-rename (because some thread will have the file open), some
> other thread opens the file and gets a delegation.
> So the NFSv4 RENAME returns NFS4ERR_DELAY while it tries to reclaim the
> delegation.  15 seconds later the rename will be retried, but there will still
> (or again) be an active delegation.  So the pattern repeats indefinitely.
> All this time the i_mutex on the directory and file are held so "ls" on the
> directory will also hang.

Why would the server hand out another delegation just moments after it
recalled the last one? That sounds like a nasty server bug.

You can invent variants of this problem with a second client trying to
open() the file while the first one is trying to unlink(), where the
i_mutex hack will not suffice to prevent client 2 from getting the
delegation back.

> As an interesting twist, if you remove the file on the server, the main
> process will duly notice when it next tries to rename it, and so will exit.
> The clients will continue to successfully open and close the file, even
> though "ls -l" shows that it doesn't exist.
> If you then rm the file on the client, it will tell you that it doesn't
> exist, and the workers will suddenly notice that too.
>
> I haven't really looked into the cause of this second issue, but I can work
> around the original problem with the patch below.  It effectively serialised
> 'open' with 'unlink' (and with anything else that grabs the file's mutex).
>
> I think some sort of serialisation is needed.  I don't know whether i_mutex
> is suitable or if we should find (or make) some other locks.
>
> Thoughts?
>
> Thanks,
> NeilBrown
>
> Patch:
>
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 8de3407e0360..96108f88d3f9 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -33,6 +33,10 @@ nfs4_file_open(struct inode *inode, struct file *filp)
>
>         dprintk("NFS: open file(%pd2)\n", dentry);
>
> +       // hack - if being unlinked, pause for it to complete
> +       mutex_lock(&inode->i_mutex);
> +       mutex_unlock(&inode->i_mutex);
> +
>         if ((openflags & O_ACCMODE) == 3)
>                 openflags--;
>
>
>
> Program:
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <errno.h>
>
>
> // nfsv4 mount point /mnt
> const char check[] = "/mnt/data/checkTST";
> const char data[] = "/mnt/data/dummy.data";
>
> int num_client = 4;
> int wait_sec   = 3;
>
> // call open/close in endless loop until open fails
> void client (void)
> {
>     for (;;)
>     {
>         int f = open (check, O_RDONLY);
>
>         if (f == -1)
>         {
>             printf ("client: done\n");
>             _exit(0);
>         }
>         close (f);
>     }
> }
>
> int main (int argc, char **argv)
> {
>     int i, fd;
>     FILE *f_dummy;
>
>     if (argc > 1)
>         num_client = atoi (argv[1]);
>
>     if (argc > 2)
>         wait_sec = atoi (argv[2]);
>
>     fd = open (check, O_RDWR|O_CREAT, S_IRWXU);
>
>     if (fd == -1)
>     {
>         perror ("open failed:\n");
>         _exit (1);
>     }
>
>     close (fd);
>
>     for (i=0; i<num_client; i++)
>     {
>         // fork childs to poll the 'checkTST' file
>         pid_t p = fork ();
>         if (p==0)
>             client();
>         else if (p==-1)
>         {
>             perror ("fork failed");
>             _exit (1);
>         }
>     }
>
>       // parent process
>       // - wait a few seconds and unlink 'checkTST' file
>       // - all childs are polling the 'checkTST' and should
>       //   end now
>       sleep (wait_sec);
>       unlink (check);
>       printf ("server: done\n");
> }
J. Bruce Fields May 30, 2014, 12:44 a.m. UTC | #2
On Fri, May 30, 2014 at 07:51:35AM +1000, NeilBrown wrote:
> On Thu, 29 May 2014 12:38:19 -0400 Trond Myklebust
> <trond.myklebust@primarydata.com> wrote:
> 
> > Apologies Neil. Resending due to gmail defaulting to html formatting
> > which gets rejected by vger.kernel.org...
> > 
> > On Thu, May 29, 2014 at 2:45 AM, NeilBrown <neilb@suse.de> wrote:
> > >
> > > The program below (provided by a customer) demonstrates a livelock that can
> > > trigger in NFS.
> > >
> > > "/mnt" should be an NFSv4 mount from a server which will hand out READ
> > > delegations (e.g. the Linux NFS server) and should contain a subdirectory
> > > "/mnt/data".
> > >
> > > The program forks off some worker threads which poll a particular file in
> > > that directory until it disappears.  Then each worker will exit.
> > > The main program waits a few seconds and then unlinks the file.
> > >
> > > The number of threads can be set with the first arg (4 is good). The number of
> > > seconds with the second.  Both have useful defaults.
> > >
> > > The unlink should happen promptly and then all the workers should  exit.  But
> > > they don't.
> > >
> > > What happens is that between when the "unlink" returns the delegation that it
> > > will inevitably have due to all those "open"s, and when it performs the
> > > required silly-rename (because some thread will have the file open), some
> > > other thread opens the file and gets a delegation.
> > > So the NFSv4 RENAME returns NFS4ERR_DELAY while it tries to reclaim the
> > > delegation.  15 seconds later the rename will be retried, but there will still
> > > (or again) be an active delegation.  So the pattern repeats indefinitely.
> > > All this time the i_mutex on the directory and file are held so "ls" on the
> > > directory will also hang.
> > 
> > Why would the server hand out another delegation just moments after it
> > recalled the last one? That sounds like a nasty server bug.
> 
> Exactly how long should it wait?
> Bruce: do you have any thoughts on whether the server should hold off giving
> out new delegations after trying to recall one (e.g. due to a lease-break
> caused by rename)??
> I don't suppose the RFC addresses this?

Yes, it's a known server bug.

As a first attempt I was thinking of just sticking a timestamp in struct
inode to record the time of the most recent conflicting access and deny
delegations if the timestamp is too recent, for some definition of too
recent.

--b.

> 
> > 
> > You can invent variants of this problem with a second client trying to
> > open() the file while the first one is trying to unlink(), where the
> > i_mutex hack will not suffice to prevent client 2 from getting the
> > delegation back.
> 
> True.  So we do need support from the server.
> 
> Thanks,
> NeilBrown
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown May 30, 2014, 3:44 a.m. UTC | #3
On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Fri, May 30, 2014 at 07:51:35AM +1000, NeilBrown wrote:
> > On Thu, 29 May 2014 12:38:19 -0400 Trond Myklebust
> > <trond.myklebust@primarydata.com> wrote:
> > 
> > > Apologies Neil. Resending due to gmail defaulting to html formatting
> > > which gets rejected by vger.kernel.org...
> > > 
> > > On Thu, May 29, 2014 at 2:45 AM, NeilBrown <neilb@suse.de> wrote:
> > > >
> > > > The program below (provided by a customer) demonstrates a livelock that can
> > > > trigger in NFS.
> > > >
> > > > "/mnt" should be an NFSv4 mount from a server which will hand out READ
> > > > delegations (e.g. the Linux NFS server) and should contain a subdirectory
> > > > "/mnt/data".
> > > >
> > > > The program forks off some worker threads which poll a particular file in
> > > > that directory until it disappears.  Then each worker will exit.
> > > > The main program waits a few seconds and then unlinks the file.
> > > >
> > > > The number of threads can be set with the first arg (4 is good). The number of
> > > > seconds with the second.  Both have useful defaults.
> > > >
> > > > The unlink should happen promptly and then all the workers should  exit.  But
> > > > they don't.
> > > >
> > > > What happens is that between when the "unlink" returns the delegation that it
> > > > will inevitably have due to all those "open"s, and when it performs the
> > > > required silly-rename (because some thread will have the file open), some
> > > > other thread opens the file and gets a delegation.
> > > > So the NFSv4 RENAME returns NFS4ERR_DELAY while it tries to reclaim the
> > > > delegation.  15 seconds later the rename will be retried, but there will still
> > > > (or again) be an active delegation.  So the pattern repeats indefinitely.
> > > > All this time the i_mutex on the directory and file are held so "ls" on the
> > > > directory will also hang.
> > > 
> > > Why would the server hand out another delegation just moments after it
> > > recalled the last one? That sounds like a nasty server bug.
> > 
> > Exactly how long should it wait?
> > Bruce: do you have any thoughts on whether the server should hold off giving
> > out new delegations after trying to recall one (e.g. due to a lease-break
> > caused by rename)??
> > I don't suppose the RFC addresses this?
> 
> Yes, it's a known server bug.
> 
> As a first attempt I was thinking of just sticking a timestamp in struct
> inode to record the time of the most recent conflicting access and deny
> delegations if the timestamp is too recent, for some definition of too
> recent.
> 

Hmmm... I'll have a look next week and see what I can come up with.

Thanks,
NeilBrown
J. Bruce Fields May 30, 2014, 9:55 p.m. UTC | #4
On Fri, May 30, 2014 at 01:44:42PM +1000, NeilBrown wrote:
> On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
> 
> > Yes, it's a known server bug.
> > 
> > As a first attempt I was thinking of just sticking a timestamp in struct
> > inode to record the time of the most recent conflicting access and deny
> > delegations if the timestamp is too recent, for some definition of too
> > recent.
> > 
> 
> Hmmm... I'll have a look next week and see what I can come up with.

Thanks!

If we didn't think it was worth another struct inode field, we could
probably get away with global state.  Even just refusing to give out any
delegations for a few seconds after any delegation break would be enough
to fix this bug.

Or you could make it a little less harsh with a small hash table: "don't
give out a delegation on any inode whose inode number hashes to X for a
few seconds."

As long as the delegations can be turned down at the whim of the server,
we've got a lot of leeway.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown May 30, 2014, 10:13 p.m. UTC | #5
On Fri, 30 May 2014 17:55:23 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Fri, May 30, 2014 at 01:44:42PM +1000, NeilBrown wrote:
> > On Thu, 29 May 2014 20:44:23 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> > wrote:
> > 
> > > Yes, it's a known server bug.
> > > 
> > > As a first attempt I was thinking of just sticking a timestamp in struct
> > > inode to record the time of the most recent conflicting access and deny
> > > delegations if the timestamp is too recent, for some definition of too
> > > recent.
> > > 
> > 
> > Hmmm... I'll have a look next week and see what I can come up with.
> 
> Thanks!
> 
> If we didn't think it was worth another struct inode field, we could
> probably get away with global state.  Even just refusing to give out any
> delegations for a few seconds after any delegation break would be enough
> to fix this bug.
> 
> Or you could make it a little less harsh with a small hash table: "don't
> give out a delegation on any inode whose inode number hashes to X for a
> few seconds."

I was thinking of using a bloom filter - or possibly two.
- avoid handing out delegations if either bloom filter reports a match
- when reclaiming a delegation add the inode to the second bloom filter
- every so-often zero-out the older filter and swap them.

Might be a bit of overkill, but I won't know until I implement it.

NeilBrown

> 
> As long as the delegations can be turned down at the whim of the server,
> we've got a lot of leeway.
> 
> --b.
diff mbox

Patch

diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 8de3407e0360..96108f88d3f9 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -33,6 +33,10 @@  nfs4_file_open(struct inode *inode, struct file *filp)
 
 	dprintk("NFS: open file(%pd2)\n", dentry);
 
+	// hack - if being unlinked, pause for it to complete
+	mutex_lock(&inode->i_mutex);
+	mutex_unlock(&inode->i_mutex);
+
 	if ((openflags & O_ACCMODE) == 3)
 		openflags--;