diff mbox

[4/8] mm, dax: truncate dax mappings at bdev or fs shutdown

Message ID 1447892533.13153.8.camel@intel.com (mailing list archive)
State Superseded
Headers show

Commit Message

Dan Williams Nov. 19, 2015, 12:22 a.m. UTC
On Wed, 2015-11-18 at 16:09 +0100, Jan Kara wrote:
> Hum, I don't get this. truncate_inode_pages_final() gets called when inode

> has no more users. So there are no mappings of the inode. So how could

> truncate_pagecache() possibly make a difference?


True.  I confirmed with more focus testing that the change to
truncate_inode_pages_final() is not necessary.  After
invalidate_inodes() does unmap_mapping_range() we are protected by
future calls to get_block() and blk_queue_enter() failing when there
are attempts to re-establish a mapping after the block device has been
torn down.

Here's a revised patch.  Note that the call truncate_pagecache() is
replaced with a call to unmap_mapping_range() since it is fine to
access zero pages that might still be in the page cache.

8<----
Subject: mm, dax: unmap dax mappings at bdev or fs shutdown

From: Dan Williams <dan.j.williams@intel.com>


Currently dax mappings leak past / survive block_device shutdown.  While
page cache pages are permitted to be read/written after the block_device
is torn down this is not acceptable in the dax case as all media access
must end when the device is disabled.  The pfn backing a dax mapping is
permitted to be invalidated after bdev shutdown and this is indeed the
case with brd.

When a dax capable block_device driver calls del_gendisk() in its
shutdown path del_gendisk() needs to ensure that all DAX pfns are
unmapped.  This is different than the pagecache backed case where the
disk is protected by the queue being torn down which ends I/O to the
device.  Since dax bypasses the page cache we need to unconditionally
unmap the inode.

Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[honza: drop changes to truncate_inode_pages_final]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

---
 fs/inode.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Comments

Jan Kara Nov. 19, 2015, 12:55 p.m. UTC | #1
> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown
> From: Dan Williams <dan.j.williams@intel.com>
> 
> Currently dax mappings leak past / survive block_device shutdown.  While
> page cache pages are permitted to be read/written after the block_device
> is torn down this is not acceptable in the dax case as all media access
> must end when the device is disabled.  The pfn backing a dax mapping is
> permitted to be invalidated after bdev shutdown and this is indeed the
> case with brd.
> 
> When a dax capable block_device driver calls del_gendisk() in its
> shutdown path del_gendisk() needs to ensure that all DAX pfns are
> unmapped.  This is different than the pagecache backed case where the
> disk is protected by the queue being torn down which ends I/O to the
> device.  Since dax bypasses the page cache we need to unconditionally
> unmap the inode.
...
> +static void unmap_list(struct list_head *head)
> +{
> +	struct inode *inode, *_i;
> +
> +	list_for_each_entry_safe(inode, _i, head, i_lru) {
> +		list_del_init(&inode->i_lru);
> +		unmap_mapping_range(&inode->i_data, 0, 0, 1);
> +		iput(inode);
> +		cond_resched();
> +	}
> +}
> +
>  /**
>   * evict_inodes	- evict all evictable inodes for a superblock
>   * @sb:		superblock to operate on
> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  	int busy = 0;
>  	struct inode *inode, *next;
>  	LIST_HEAD(dispose);
> +	LIST_HEAD(unmap);
>  
>  	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  			busy = 1;
>  			continue;
>  		}
> +		if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
> +			/*
> +			 * dax mappings can't live past this invalidation event
> +			 * as there is no page cache present to allow the data
> +			 * to remain accessible.
> +			 */
> +			__iget(inode);
> +			inode_lru_list_del(inode);
> +			spin_unlock(&inode->i_lock);
> +			list_add(&inode->i_lru, &unmap);
> +			busy = 1;
> +			continue;
> +		}

I'm somewhat concerned about the abuse of i_lru list here. The inodes have
i_count > 0 and so the code generally assumes such inodes should be removed
from LRU list if they are there. Now I didn't find a place where this could
really hit you but it is fragile. And when that happens, you have a
corruption of your unmap list (since you access it without any locks) and
also you will not unmap your inode.

Also are you sure that your unmapping cannot race with other process
mapping the pfns again?

BTW what happens when you access a DAX pfn that got removed?

								Honza
Dan Williams Nov. 19, 2015, 4:55 p.m. UTC | #2
On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote:
>> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Currently dax mappings leak past / survive block_device shutdown.  While
>> page cache pages are permitted to be read/written after the block_device
>> is torn down this is not acceptable in the dax case as all media access
>> must end when the device is disabled.  The pfn backing a dax mapping is
>> permitted to be invalidated after bdev shutdown and this is indeed the
>> case with brd.
>>
>> When a dax capable block_device driver calls del_gendisk() in its
>> shutdown path del_gendisk() needs to ensure that all DAX pfns are
>> unmapped.  This is different than the pagecache backed case where the
>> disk is protected by the queue being torn down which ends I/O to the
>> device.  Since dax bypasses the page cache we need to unconditionally
>> unmap the inode.
> ...
>> +static void unmap_list(struct list_head *head)
>> +{
>> +     struct inode *inode, *_i;
>> +
>> +     list_for_each_entry_safe(inode, _i, head, i_lru) {
>> +             list_del_init(&inode->i_lru);
>> +             unmap_mapping_range(&inode->i_data, 0, 0, 1);
>> +             iput(inode);
>> +             cond_resched();
>> +     }
>> +}
>> +
>>  /**
>>   * evict_inodes      - evict all evictable inodes for a superblock
>>   * @sb:              superblock to operate on
>> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>>       int busy = 0;
>>       struct inode *inode, *next;
>>       LIST_HEAD(dispose);
>> +     LIST_HEAD(unmap);
>>
>>       spin_lock(&sb->s_inode_list_lock);
>>       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
>> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>>                       busy = 1;
>>                       continue;
>>               }
>> +             if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
>> +                     /*
>> +                      * dax mappings can't live past this invalidation event
>> +                      * as there is no page cache present to allow the data
>> +                      * to remain accessible.
>> +                      */
>> +                     __iget(inode);
>> +                     inode_lru_list_del(inode);
>> +                     spin_unlock(&inode->i_lock);
>> +                     list_add(&inode->i_lru, &unmap);
>> +                     busy = 1;
>> +                     continue;
>> +             }
>
> I'm somewhat concerned about the abuse of i_lru list here. The inodes have
> i_count > 0 and so the code generally assumes such inodes should be removed
> from LRU list if they are there. Now I didn't find a place where this could
> really hit you but it is fragile. And when that happens, you have a
> corruption of your unmap list (since you access it without any locks) and
> also you will not unmap your inode.

"i_lru" list abuse was my main concern with this patch, I'll look into
a different way.

> Also are you sure that your unmapping cannot race with other process
> mapping the pfns again?

You're right, there is indeed a gap between the unmap and when
get_blocks() will start returning errors in the fault path.  I think I
need to defer the unmapping until after blk_cleanup_queue() where we
know that no new I/o to the device is possible.

> BTW what happens when you access a DAX pfn that got removed?

SIGBUS.  I don't see a way to be any kinder to the application.  After
the ->remove() method for the block_device is complete we can't be
sure that the pfn is valid or even present in the system (think brd,
or VM hot provisioning).
Jan Kara Nov. 19, 2015, 5:12 p.m. UTC | #3
On Thu 19-11-15 08:55:48, Dan Williams wrote:
> On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote:
> > Also are you sure that your unmapping cannot race with other process
> > mapping the pfns again?
> 
> You're right, there is indeed a gap between the unmap and when
> get_blocks() will start returning errors in the fault path.  I think I
> need to defer the unmapping until after blk_cleanup_queue() where we
> know that no new I/o to the device is possible.

Yeah, you need to squeeze it somewhere after the point where get_blocks()
start returning errors and before the point where pfn can go away.

> > BTW what happens when you access a DAX pfn that got removed?
> 
> SIGBUS.  I don't see a way to be any kinder to the application.  After
> the ->remove() method for the block_device is complete we can't be
> sure that the pfn is valid or even present in the system (think brd,
> or VM hot provisioning).

I see. So if we kept the PFN mapped, it could result e.g. in memory
corruption (at least in case of brd). So we really need this to be 100%
reliable. That's what I was interested in.

									Honza
Dave Chinner Nov. 19, 2015, 11:17 p.m. UTC | #4
On Thu, Nov 19, 2015 at 08:55:48AM -0800, Dan Williams wrote:
> On Thu, Nov 19, 2015 at 4:55 AM, Jan Kara <jack@suse.cz> wrote:
> >> Subject: mm, dax: unmap dax mappings at bdev or fs shutdown
> >> From: Dan Williams <dan.j.williams@intel.com>
> >>
> >> Currently dax mappings leak past / survive block_device shutdown.  While
> >> page cache pages are permitted to be read/written after the block_device
> >> is torn down this is not acceptable in the dax case as all media access
> >> must end when the device is disabled.  The pfn backing a dax mapping is
> >> permitted to be invalidated after bdev shutdown and this is indeed the
> >> case with brd.
> >>
> >> When a dax capable block_device driver calls del_gendisk() in its
> >> shutdown path del_gendisk() needs to ensure that all DAX pfns are
> >> unmapped.  This is different than the pagecache backed case where the
> >> disk is protected by the queue being torn down which ends I/O to the
> >> device.  Since dax bypasses the page cache we need to unconditionally
> >> unmap the inode.
> > ...
> >> +static void unmap_list(struct list_head *head)
> >> +{
> >> +     struct inode *inode, *_i;
> >> +
> >> +     list_for_each_entry_safe(inode, _i, head, i_lru) {
> >> +             list_del_init(&inode->i_lru);
> >> +             unmap_mapping_range(&inode->i_data, 0, 0, 1);
> >> +             iput(inode);
> >> +             cond_resched();
> >> +     }
> >> +}
> >> +
> >>  /**
> >>   * evict_inodes      - evict all evictable inodes for a superblock
> >>   * @sb:              superblock to operate on
> >> @@ -642,6 +654,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >>       int busy = 0;
> >>       struct inode *inode, *next;
> >>       LIST_HEAD(dispose);
> >> +     LIST_HEAD(unmap);
> >>
> >>       spin_lock(&sb->s_inode_list_lock);
> >>       list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> >> @@ -655,6 +668,19 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >>                       busy = 1;
> >>                       continue;
> >>               }
> >> +             if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
> >> +                     /*
> >> +                      * dax mappings can't live past this invalidation event
> >> +                      * as there is no page cache present to allow the data
> >> +                      * to remain accessible.
> >> +                      */
> >> +                     __iget(inode);
> >> +                     inode_lru_list_del(inode);
> >> +                     spin_unlock(&inode->i_lock);
> >> +                     list_add(&inode->i_lru, &unmap);
> >> +                     busy = 1;
> >> +                     continue;
> >> +             }
> >
> > I'm somewhat concerned about the abuse of i_lru list here. The inodes have
> > i_count > 0 and so the code generally assumes such inodes should be removed
> > from LRU list if they are there. Now I didn't find a place where this could
> > really hit you but it is fragile. And when that happens, you have a
> > corruption of your unmap list (since you access it without any locks) and
> > also you will not unmap your inode.
> 
> "i_lru" list abuse was my main concern with this patch, I'll look into
> a different way.

Yeah, you can't use i_lru at all for purposes like this - even if
there are active references to the inode, the shrinker could be
walking the LRU list and accessing this inode (e.g.
inode_lru_isolate()) at the same time this code is removing it from
the LRU. Then things will go real bad....

> > Also are you sure that your unmapping cannot race with other process
> > mapping the pfns again?
> 
> You're right, there is indeed a gap between the unmap and when
> get_blocks() will start returning errors in the fault path.

get_blocks() won't start returning errors until the filesystem has
had an IO error. Given they cache metadata and do async
transactions, it could be some time before the filesystem notices
that the device has gone away in a fatal way.

> I think I
> need to defer the unmapping until after blk_cleanup_queue() where we
> know that no new I/o to the device is possible.

Actually, I think we need to trigger a filesystem shutdown before
doing anything else (e.g. before unmapping the inodes). That way the
filesystem will error out new calls to get_blocks() and prevent any
new IO submission while the block device teardown and inode
unmapping is done. i.e. solving the problem at the block device
level is hard, but we already have all the necessary infrastructure
to shut off IO and new block mappings at the filesystem level....

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 1be5f9003eb3..dcb31d2c15e6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -579,6 +579,18 @@  static void dispose_list(struct list_head *head)
 	}
 }
 
+static void unmap_list(struct list_head *head)
+{
+	struct inode *inode, *_i;
+
+	list_for_each_entry_safe(inode, _i, head, i_lru) {
+		list_del_init(&inode->i_lru);
+		unmap_mapping_range(&inode->i_data, 0, 0, 1);
+		iput(inode);
+		cond_resched();
+	}
+}
+
 /**
  * evict_inodes	- evict all evictable inodes for a superblock
  * @sb:		superblock to operate on
@@ -642,6 +654,7 @@  int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	int busy = 0;
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
+	LIST_HEAD(unmap);
 
 	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
@@ -655,6 +668,19 @@  int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 			busy = 1;
 			continue;
 		}
+		if (IS_DAX(inode) && atomic_read(&inode->i_count)) {
+			/*
+			 * dax mappings can't live past this invalidation event
+			 * as there is no page cache present to allow the data
+			 * to remain accessible.
+			 */
+			__iget(inode);
+			inode_lru_list_del(inode);
+			spin_unlock(&inode->i_lock);
+			list_add(&inode->i_lru, &unmap);
+			busy = 1;
+			continue;
+		}
 		if (atomic_read(&inode->i_count)) {
 			spin_unlock(&inode->i_lock);
 			busy = 1;
@@ -669,6 +695,7 @@  int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
+	unmap_list(&unmap);
 
 	return busy;
 }