Message ID | 20181126161240.GH3065@bombadil.infradead.org
---|---
State | New, archived |
Series | dax_lock_mapping_entry was never safe
On Mon 26-11-18 08:12:40, Matthew Wilcox wrote:
>
> I noticed this path while I was doing the 4.19 backport of
> dax: Avoid losing wakeup in dax_lock_mapping_entry
>
>         xa_unlock_irq(&mapping->i_pages);
>         revalidate = wait_fn();
>         finish_wait(wq, &ewait.wait);
>         xa_lock_irq(&mapping->i_pages);

I guess this is a snippet from get_unlocked_entry(), isn't it?

> It's not safe to call xa_lock_irq() if mapping can have been freed while
> we slept. We'll probably get away with it; most filesystems use a unique
> slab for their inodes, so you'll likely get either a freed inode or an
> inode which is now the wrong inode. But if that page has been freed back
> to the page allocator, that pointer could now be pointing at anything.

Correct. Thanks for catching this bug!

> Fixing this in the current codebase is no easier than fixing it in the
> 4.19 codebase. This is the best I've come up with. Could we do better
> by not using the _exclusive form of prepare_to_wait()? I'm not familiar
> with all the things that need to be considered when using this family
> of interfaces.
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 9bcce89ea18e..154b592b18eb 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -232,6 +232,24 @@ static void *get_unlocked_entry(struct xa_state *xas)
>          }
>  }
>
> +static void wait_unlocked_entry(struct xa_state *xas, void *entry)
> +{
> +        struct wait_exceptional_entry_queue ewait;
> +        wait_queue_head_t *wq;
> +
> +        init_wait(&ewait.wait);
> +        ewait.wait.func = wake_exceptional_entry_func;
> +
> +        wq = dax_entry_waitqueue(xas, entry, &ewait.key);
> +        prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
> +        xas_unlock_irq(xas);
> +        /* We can no longer look at xas */
> +        schedule();
> +        finish_wait(wq, &ewait.wait);
> +        if (waitqueue_active(wq))
> +                __wake_up(wq, TASK_NORMAL, 1, &ewait.key);
> +}
> +

The code looks good. Maybe we can call this wait_entry_unlocked() to stress
that entry is not really usable after this function returns? And comment
before the function that this is safe to call even if we don't have a
reference keeping mapping alive?

>  static void put_unlocked_entry(struct xa_state *xas, void *entry)
>  {
>          /* If we were the only waiter woken, wake the next one */
> @@ -389,9 +407,7 @@ bool dax_lock_mapping_entry(struct page *page)
>                  entry = xas_load(&xas);
>                  if (dax_is_locked(entry)) {
>                          rcu_read_unlock();
> -                        entry = get_unlocked_entry(&xas);
> -                        xas_unlock_irq(&xas);
> -                        put_unlocked_entry(&xas, entry);
> +                        wait_unlocked_entry(&xas, entry);
>                          rcu_read_lock();
>                          continue;

The continue here actually is not safe either because if the mapping got
freed, page->mapping will be NULL and we oops at the beginning of the loop.
So that !dax_mapping() check should also check for mapping != NULL.

                                                                Honza
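To make Jan's last point concrete, here is a sketch of the extra check he is
asking for at the top of the dax_lock_mapping_entry() loop. This is an
illustrative shape only, not text from any posted patch:

```c
/*
 * Illustrative sketch (not from a posted patch): once the inode can die
 * while we wait, page->mapping may be NULL when the loop comes back
 * around, so re-read it and test for NULL before dax_mapping().
 */
struct address_space *mapping = READ_ONCE(page->mapping);

if (!mapping || !dax_mapping(mapping))
        break;  /* page freed, truncated, or never a DAX page */
```

Willy picks this up below as a separate patch so it can go to 4.19 as well.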
On Mon, Nov 26, 2018 at 9:11 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 26-11-18 08:12:40, Matthew Wilcox wrote:
> >
> > I noticed this path while I was doing the 4.19 backport of
> > dax: Avoid losing wakeup in dax_lock_mapping_entry
> >
> >         xa_unlock_irq(&mapping->i_pages);
> >         revalidate = wait_fn();
> >         finish_wait(wq, &ewait.wait);
> >         xa_lock_irq(&mapping->i_pages);
>
> I guess this is a snippet from get_unlocked_entry(), isn't it?
>
> > It's not safe to call xa_lock_irq() if mapping can have been freed while
> > we slept. We'll probably get away with it; most filesystems use a unique
> > slab for their inodes, so you'll likely get either a freed inode or an
> > inode which is now the wrong inode. But if that page has been freed back
> > to the page allocator, that pointer could now be pointing at anything.
>
> Correct. Thanks for catching this bug!

Yes, nice catch!

> > Fixing this in the current codebase is no easier than fixing it in the
> > 4.19 codebase. This is the best I've come up with. Could we do better
> > by not using the _exclusive form of prepare_to_wait()? I'm not familiar
> > with all the things that need to be considered when using this family
> > of interfaces.
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 9bcce89ea18e..154b592b18eb 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -232,6 +232,24 @@ static void *get_unlocked_entry(struct xa_state *xas)
> >          }
> >  }
> >
> > +static void wait_unlocked_entry(struct xa_state *xas, void *entry)
> > +{
> > +        struct wait_exceptional_entry_queue ewait;
> > +        wait_queue_head_t *wq;
> > +
> > +        init_wait(&ewait.wait);
> > +        ewait.wait.func = wake_exceptional_entry_func;
> > +
> > +        wq = dax_entry_waitqueue(xas, entry, &ewait.key);
> > +        prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
> > +        xas_unlock_irq(xas);
> > +        /* We can no longer look at xas */
> > +        schedule();
> > +        finish_wait(wq, &ewait.wait);
> > +        if (waitqueue_active(wq))
> > +                __wake_up(wq, TASK_NORMAL, 1, &ewait.key);
> > +}
> > +
>
> The code looks good. Maybe we can call this wait_entry_unlocked() to stress
> that entry is not really usable after this function returns? And comment
> before the function that this is safe to call even if we don't have a
> reference keeping mapping alive?

Yes, maybe even something more ambiguous like "wait_entry_event()",
because there's no guarantee the entry is unlocked, just that now is a
good time to try to interrogate the entry again.

> >  static void put_unlocked_entry(struct xa_state *xas, void *entry)
> >  {
> >          /* If we were the only waiter woken, wake the next one */
> > @@ -389,9 +407,7 @@ bool dax_lock_mapping_entry(struct page *page)
> >                  entry = xas_load(&xas);
> >                  if (dax_is_locked(entry)) {
> >                          rcu_read_unlock();
> > -                        entry = get_unlocked_entry(&xas);
> > -                        xas_unlock_irq(&xas);
> > -                        put_unlocked_entry(&xas, entry);
> > +                        wait_unlocked_entry(&xas, entry);
> >                          rcu_read_lock();
> >                          continue;
>
> The continue here actually is not safe either because if the mapping got
> freed, page->mapping will be NULL and we oops at the beginning of the loop.
> So that !dax_mapping() check should also check for mapping != NULL.

Yes.
On Mon, Nov 26, 2018 at 12:36:26PM -0800, Dan Williams wrote:
> On Mon, Nov 26, 2018 at 9:11 AM Jan Kara <jack@suse.cz> wrote:
> > The code looks good. Maybe we can call this wait_entry_unlocked() to stress
> > that entry is not really usable after this function returns? And comment
> > before the function that this is safe to call even if we don't have a
> > reference keeping mapping alive?
>
> Yes, maybe even something more ambiguous like "wait_entry_event()",
> because there's no guarantee the entry is unlocked, just that now is a
> good time to try to interrogate the entry again.

It _became_ unlocked ... it might be locked again, or have disappeared
by the time we get to it, but somebody called dax_wake_entry() for this
exact entry.

I mean, we could call it wait_entry_wake(), but I think that's a little
generic. I'm going with Jan's version ;-)

> > The continue here actually is not safe either because if the mapping got
> > freed, page->mapping will be NULL and we oops at the beginning of the loop.
> > So that !dax_mapping() check should also check for mapping != NULL.
>
> Yes.

Sigh. I'll make that a separate patch so it applies cleanly to 4.19.
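Willy's "this exact entry" claim rests on how DAX waiters are keyed. The
waitqueues live in a small global hash table rather than in the inode, which
is also what makes it legal for the new helper to touch wq after the mapping
may be gone, and each wakeup carries a key naming the (xarray, index) pair so
waiters that collide in the same queue are not woken spuriously. Condensed
from fs/dax.c of this era (paraphrased from memory; see the tree for the
authoritative version):

```c
/* Each waiter embeds a key identifying the entry it sleeps on. */
struct exceptional_entry_key {
        struct xarray *xa;
        pgoff_t entry_start;
};

struct wait_exceptional_entry_queue {
        wait_queue_entry_t wait;
        struct exceptional_entry_key key;
};

static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
                unsigned int mode, int sync, void *keyp)
{
        struct exceptional_entry_key *key = keyp;
        struct wait_exceptional_entry_queue *ewait =
                container_of(wait, struct wait_exceptional_entry_queue, wait);

        /* Ignore wakeups addressed to a different entry in this queue. */
        if (key->xa != ewait->key.xa ||
            key->entry_start != ewait->key.entry_start)
                return 0;
        return autoremove_wake_function(wait, mode, sync, NULL);
}
```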
diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..154b592b18eb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -232,6 +232,24 @@ static void *get_unlocked_entry(struct xa_state *xas)
         }
 }
 
+static void wait_unlocked_entry(struct xa_state *xas, void *entry)
+{
+        struct wait_exceptional_entry_queue ewait;
+        wait_queue_head_t *wq;
+
+        init_wait(&ewait.wait);
+        ewait.wait.func = wake_exceptional_entry_func;
+
+        wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+        prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
+        xas_unlock_irq(xas);
+        /* We can no longer look at xas */
+        schedule();
+        finish_wait(wq, &ewait.wait);
+        if (waitqueue_active(wq))
+                __wake_up(wq, TASK_NORMAL, 1, &ewait.key);
+}
+
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
         /* If we were the only waiter woken, wake the next one */
@@ -389,9 +407,7 @@ bool dax_lock_mapping_entry(struct page *page)
                 entry = xas_load(&xas);
                 if (dax_is_locked(entry)) {
                         rcu_read_unlock();
-                        entry = get_unlocked_entry(&xas);
-                        xas_unlock_irq(&xas);
-                        put_unlocked_entry(&xas, entry);
+                        wait_unlocked_entry(&xas, entry);
                         rcu_read_lock();
                         continue;
                 }
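For contrast, the shape this patch removes, condensed from the
get_unlocked_entry() snippet quoted at the top of the thread. The bug is the
final lock retake after the sleep:

```c
/*
 * Unsafe pre-patch pattern (condensed): xas->xa points into the
 * inode's i_pages, so if the inode died while we were in schedule(),
 * the final xas_lock_irq() dereferences freed memory.
 */
prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
xas_unlock_irq(xas);
schedule();
finish_wait(wq, &ewait.wait);
xas_lock_irq(xas);      /* use-after-free if the inode is gone */
```

The replacement never touches xas after xas_unlock_irq(). And because it
sleeps as an exclusive waiter, consuming a wakeup it does not act on, it hands
the wakeup along via __wake_up() so other waiters for the same entry are not
left stranded.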