
[v6,4/7] dax: add support for fsync/msync

Message ID 1450899560-26708-5-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State Superseded

Commit Message

Ross Zwisler Dec. 23, 2015, 7:39 p.m. UTC
To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 196 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |   2 +
 mm/filemap.c        |   6 ++
 3 files changed, 198 insertions(+), 6 deletions(-)
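
The dirty-tracking entries this patch stores are exceptional radix tree
entries that pack the block sector and a PTE-vs-PMD type tag directly into
the slot value.  The RADIX_DAX_TYPE(), RADIX_DAX_SECTOR() and
RADIX_DAX_ENTRY() helpers used in the diff below are defined by an earlier
patch in this series rather than in this hunk; their encoding looks roughly
like the following sketch (approximate, not necessarily the exact
definitions that were merged):

        /* low bits of the slot value carry the type, the rest the sector */
        #define RADIX_DAX_MASK          0xf
        #define RADIX_DAX_SHIFT         4
        #define RADIX_DAX_PTE           (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
        #define RADIX_DAX_PMD           (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
        #define RADIX_DAX_TYPE(entry)   ((unsigned long)entry & RADIX_DAX_MASK)
        #define RADIX_DAX_SECTOR(entry) ((unsigned long)entry >> RADIX_DAX_SHIFT)
        #define RADIX_DAX_ENTRY(sector, pmd) \
                ((void *)((unsigned long)sector << RADIX_DAX_SHIFT | \
                          (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))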

Comments

Dan Williams Jan. 3, 2016, 6:13 p.m. UTC | #1
On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
>
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

I'm hitting the following report with the ndctl dax test [1] on
next-20151231.  I bisected it to
 commit 3cb108f941de "dax-add-support-for-fsync-sync-v6".  I'll take a
closer look tomorrow, but in case someone can beat me to it, here's
the back-trace:

------------[ cut here ]------------
kernel BUG at fs/inode.c:497!
[..]
CPU: 1 PID: 3001 Comm: umount Tainted: G           O    4.4.0-rc7+ #2412
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000
RIP: 0010:[<ffffffff81280171>]  [<ffffffff81280171>] clear_inode+0x71/0x80
RSP: 0018:ffff880307797d50  EFLAGS: 00010002
RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28
RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530
RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000
R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530
R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0
FS:  00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0
Stack:
ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af
ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0
ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328
Call Trace:
[<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs]
[<ffffffff81281038>] evict+0xb8/0x180
[<ffffffff8128113b>] dispose_list+0x3b/0x50
[<ffffffff81282014>] evict_inodes+0x144/0x170
[<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0
[<ffffffff81264837>] kill_block_super+0x27/0x70
[<ffffffff81264a53>] deactivate_locked_super+0x43/0x70
[<ffffffff81264e9c>] deactivate_super+0x5c/0x60
[<ffffffff81285aff>] cleanup_mnt+0x3f/0x90
[<ffffffff81285b92>] __cleanup_mnt+0x12/0x20
[<ffffffff810c4f26>] task_work_run+0x76/0x90
[<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280
[<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f
Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20 74 1a a8 40 75 18 48 c7 83 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 55
RIP  [<ffffffff81280171>] clear_inode+0x71/0x80
RSP <ffff880307797d50>
---[ end trace 3b1d8898a94a4fc1 ]---

[1]: git://git@github.com:pmem/ndctl.git pending
make TESTS="test/dax.sh" check
Jan Kara Jan. 5, 2016, 11:13 a.m. UTC | #2
On Sun 03-01-16 10:13:06, Dan Williams wrote:
> On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> I'm hitting the following report with the ndctl dax test [1] on
> next-20151231.  I bisected it to
>  commit 3cb108f941de "dax-add-support-for-fsync-sync-v6".  I'll take a
> closer look tomorrow, but in case someone can beat me to it, here's
> the back-trace:
> 
> ------------[ cut here ]------------
> kernel BUG at fs/inode.c:497!

I suppose this is the check that mapping->nrexceptional is zero, isn't it?
Hum, I don't see how that could happen given we call
truncate_inode_pages_final() just before the clear_inode() call which
removes all the exceptional entries from the radix tree.  And there's not
much room for a race during umount... Does the radix tree really contain
any entry or is it an accounting bug?

								Honza
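
For reference, the BUG at fs/inode.c:497 lands in the sanity checks that
clear_inode() runs before an inode is freed; with the nrexceptional
accounting from this series applied they look roughly like this (paraphrased
sketch, exact line numbers and ordering may differ in next-20151231):

        void clear_inode(struct inode *inode)
        {
                might_sleep();
                spin_lock_irq(&inode->i_data.tree_lock);
                BUG_ON(inode->i_data.nrpages);
                BUG_ON(inode->i_data.nrexceptional);  /* the suspected check */
                spin_unlock_irq(&inode->i_data.tree_lock);
                BUG_ON(!list_empty(&inode->i_data.private_list));
                BUG_ON(!(inode->i_state & I_FREEING));
                BUG_ON(inode->i_state & I_CLEAR);
                inode->i_state = I_FREEING | I_CLEAR;
        }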

> [..]
> CPU: 1 PID: 3001 Comm: umount Tainted: G           O    4.4.0-rc7+ #2412
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000
> RIP: 0010:[<ffffffff81280171>]  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP: 0018:ffff880307797d50  EFLAGS: 00010002
> RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28
> RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530
> RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000
> R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530
> R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0
> FS:  00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0
> Stack:
> ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af
> ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0
> ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328
> Call Trace:
> [<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs]
> [<ffffffff81281038>] evict+0xb8/0x180
> [<ffffffff8128113b>] dispose_list+0x3b/0x50
> [<ffffffff81282014>] evict_inodes+0x144/0x170
> [<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0
> [<ffffffff81264837>] kill_block_super+0x27/0x70
> [<ffffffff81264a53>] deactivate_locked_super+0x43/0x70
> [<ffffffff81264e9c>] deactivate_super+0x5c/0x60
> [<ffffffff81285aff>] cleanup_mnt+0x3f/0x90
> [<ffffffff81285b92>] __cleanup_mnt+0x12/0x20
> [<ffffffff810c4f26>] task_work_run+0x76/0x90
> [<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280
> [<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f
> Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20 74 1a a8 40 75 18 48 c7 83 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 55
> RIP  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP <ffff880307797d50>
> ---[ end trace 3b1d8898a94a4fc1 ]---
> 
> [1]: git://git@github.com:pmem/ndctl.git pending
> make TESTS="test/dax.sh" check
>
Jan Kara Jan. 5, 2016, 11:13 a.m. UTC | #3
On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
> 
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
...
> +static int dax_writeback_one(struct block_device *bdev,
> +		struct address_space *mapping, pgoff_t index, void *entry)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type = RADIX_DAX_TYPE(entry);
> +	struct radix_tree_node *node;
> +	struct blk_dax_ctl dax;
> +	void **slot;
> +	int ret = 0;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * Regular page slots are stabilized by the page lock even
> +	 * without the tree itself locked.  These unlocked entries
> +	 * need verification under the tree lock.
> +	 */
> +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +		goto unlock;
> +	if (*slot != entry)
> +		goto unlock;
> +
> +	/* another fsync thread may have already written back this entry */
> +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +		goto unlock;
> +
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +
> +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	dax.sector = RADIX_DAX_SECTOR(entry);
> +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	/*
> +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> +	 * eventually calls cond_resched().
> +	 */
> +	ret = dax_map_atomic(bdev, &dax);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (WARN_ON_ONCE(ret < dax.size)) {
> +		ret = -EIO;
> +		dax_unmap_atomic(bdev, &dax);
> +		return ret;
> +	}
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * We need to revalidate our radix entry while holding tree_lock
> +	 * before we do the writeback.
> +	 */

Do we really need to revalidate here? dax_map_atomic() makes sure the addr
& size is still part of the device. I guess you are concerned that due to
truncate or similar operation those sectors needn't belong to the same file
anymore but we don't really care about flushing sectors for someone else,
do we?

Otherwise the patch looks good to me.

> +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +		goto unmap;
> +	if (*slot != entry)
> +		goto unmap;
> +
> +	wb_cache_pmem(dax.addr, dax.size);
> + unmap:
> +	dax_unmap_atomic(bdev, &dax);
> + unlock:
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}

								Honza
Ross Zwisler Jan. 5, 2016, 3:50 p.m. UTC | #4
On Tue, Jan 05, 2016 at 12:13:46PM +0100, Jan Kara wrote:
> On Sun 03-01-16 10:13:06, Dan Williams wrote:
> > On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > > pages so it is able to flush them durably to media on demand.
> > >
> > > The tracking of dirty pages is done via the radix tree in struct
> > > address_space.  This radix tree is already used by the page writeback
> > > infrastructure for tracking dirty pages associated with an open file, and
> > > it already has support for exceptional (non struct page*) entries.  We
> > > build upon these features to add exceptional entries to the radix tree for
> > > DAX dirty PMD or PTE pages at fault time.
> > >
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > 
> > I'm hitting the following report with the ndctl dax test [1] on
> > next-20151231.  I bisected it to
> >  commit 3cb108f941de "dax-add-support-for-fsync-sync-v6".  I'll take a
> > closer look tomorrow, but in case someone can beat me to it, here's
> > the back-trace:
> > 
> > ------------[ cut here ]------------
> > kernel BUG at fs/inode.c:497!
> 
> I suppose this is the check that mapping->nrexceptional is zero, isn't it?
> Hum, I don't see how that could happen given we call
> truncate_inode_pages_final() just before the clear_inode() call which
> removes all the exceptional entries from the radix tree.  And there's not
> much room for a race during umount... Does the radix tree really contain
> any entry or is it an accounting bug?
> 
> 								Honza

I think this is a bug with the existing way that we handle PMD faults.  The
issue is that the PMD path doesn't properly remove radix tree entries for zero
pages covering holes.  The PMD path calls unmap_mapping_range() to unmap the
range out of the struct address_space, but it is missing a call to
truncate_inode_pages_range() or similar to clear out those entries in the
radix tree.  Up until now we didn't notice; we just had an orphaned entry in
the radix tree.  But with my code we find the page entry in the radix tree
when handling a PMD fault, remove it, and add in a PMD entry.  This causes us
to be off on both our mapping->nrpages and mapping->nrexceptional counts.

In the PTE path we properly remove the pages from the radix tree when
upgrading from a hole to a real DAX entry via the delete_from_page_cache()
call, which eventually calls page_cache_tree_delete().

I'm working on a fix now (and making sure all the above is correct).

- Ross
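
For illustration only, the shape of the cleanup Ross describes as missing,
dropping the zero-page (hole) entries from the radix tree in addition to
unmapping them before a PMD entry is installed, could look roughly like the
sketch below.  This is not the fix that eventually landed in v7, and the
PMD-aligned pgoff is an assumption:

        /*
         * Sketch: when upgrading a PMD-sized range of hole pages to a DAX
         * PMD mapping, unmapping alone leaves stale page entries behind;
         * they also need to be truncated out of the radix tree so that
         * mapping->nrpages and mapping->nrexceptional stay accurate.
         */
        unmap_mapping_range(mapping, (loff_t)pgoff << PAGE_CACHE_SHIFT,
                        PMD_SIZE, 0);
        truncate_inode_pages_range(mapping, (loff_t)pgoff << PAGE_CACHE_SHIFT,
                        ((loff_t)pgoff << PAGE_CACHE_SHIFT) + PMD_SIZE - 1);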
Ross Zwisler Jan. 5, 2016, 5:12 p.m. UTC | #5
On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote:
> On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ...
> > +static int dax_writeback_one(struct block_device *bdev,
> > +		struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	struct radix_tree_node *node;
> > +	struct blk_dax_ctl dax;
> > +	void **slot;
> > +	int ret = 0;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * Regular page slots are stabilized by the page lock even
> > +	 * without the tree itself locked.  These unlocked entries
> > +	 * need verification under the tree lock.
> > +	 */
> > +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > +		goto unlock;
> > +	if (*slot != entry)
> > +		goto unlock;
> > +
> > +	/* another fsync thread may have already written back this entry */
> > +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto unlock;
> > +
> > +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > +		ret = -EIO;
> > +		goto unlock;
> > +	}
> > +
> > +	dax.sector = RADIX_DAX_SECTOR(entry);
> > +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +
> > +	/*
> > +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> > +	 * eventually calls cond_resched().
> > +	 */
> > +	ret = dax_map_atomic(bdev, &dax);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	if (WARN_ON_ONCE(ret < dax.size)) {
> > +		ret = -EIO;
> > +		dax_unmap_atomic(bdev, &dax);
> > +		return ret;
> > +	}
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * We need to revalidate our radix entry while holding tree_lock
> > +	 * before we do the writeback.
> > +	 */
> 
> Do we really need to revalidate here? dax_map_atomic() makes sure the addr
> & size is still part of the device. I guess you are concerned that due to
> truncate or similar operation those sectors needn't belong to the same file
> anymore but we don't really care about flushing sectors for someone else,
> do we?
> 
> Otherwise the patch looks good to me.

Yep, the concern is that we could have somehow raced against a truncate
operation while we weren't holding the tree_lock, and that now the address we
are about to flush belongs to another file or is unallocated by the
filesystem.

I agree that this should be non-destructive - if you think the additional
check and locking isn't worth the overhead, I'm happy to take it out.  I don't
have a strong opinion either way.

Thanks for the review!
Dan Williams Jan. 5, 2016, 5:20 p.m. UTC | #6
On Tue, Jan 5, 2016 at 9:12 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote:
>> On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
>> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
>> > pages so it is able to flush them durably to media on demand.
>> >
>> > The tracking of dirty pages is done via the radix tree in struct
>> > address_space.  This radix tree is already used by the page writeback
>> > infrastructure for tracking dirty pages associated with an open file, and
>> > it already has support for exceptional (non struct page*) entries.  We
>> > build upon these features to add exceptional entries to the radix tree for
>> > DAX dirty PMD or PTE pages at fault time.
>> >
>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> ...
>> > +static int dax_writeback_one(struct block_device *bdev,
>> > +           struct address_space *mapping, pgoff_t index, void *entry)
>> > +{
>> > +   struct radix_tree_root *page_tree = &mapping->page_tree;
>> > +   int type = RADIX_DAX_TYPE(entry);
>> > +   struct radix_tree_node *node;
>> > +   struct blk_dax_ctl dax;
>> > +   void **slot;
>> > +   int ret = 0;
>> > +
>> > +   spin_lock_irq(&mapping->tree_lock);
>> > +   /*
>> > +    * Regular page slots are stabilized by the page lock even
>> > +    * without the tree itself locked.  These unlocked entries
>> > +    * need verification under the tree lock.
>> > +    */
>> > +   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
>> > +           goto unlock;
>> > +   if (*slot != entry)
>> > +           goto unlock;
>> > +
>> > +   /* another fsync thread may have already written back this entry */
>> > +   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
>> > +           goto unlock;
>> > +
>> > +   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
>> > +
>> > +   if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
>> > +           ret = -EIO;
>> > +           goto unlock;
>> > +   }
>> > +
>> > +   dax.sector = RADIX_DAX_SECTOR(entry);
>> > +   dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
>> > +   spin_unlock_irq(&mapping->tree_lock);
>> > +
>> > +   /*
>> > +    * We cannot hold tree_lock while calling dax_map_atomic() because it
>> > +    * eventually calls cond_resched().
>> > +    */
>> > +   ret = dax_map_atomic(bdev, &dax);
>> > +   if (ret < 0)
>> > +           return ret;
>> > +
>> > +   if (WARN_ON_ONCE(ret < dax.size)) {
>> > +           ret = -EIO;
>> > +           dax_unmap_atomic(bdev, &dax);
>> > +           return ret;
>> > +   }
>> > +
>> > +   spin_lock_irq(&mapping->tree_lock);
>> > +   /*
>> > +    * We need to revalidate our radix entry while holding tree_lock
>> > +    * before we do the writeback.
>> > +    */
>>
>> Do we really need to revalidate here? dax_map_atomic() makes sure the addr
>> & size is still part of the device. I guess you are concerned that due to
>> truncate or similar operation those sectors needn't belong to the same file
>> anymore but we don't really care about flushing sectors for someone else,
>> do we?
>>
>> Otherwise the patch looks good to me.
>
> Yep, the concern is that we could have somehow raced against a truncate
> operation while we weren't holding the tree_lock, and that now the address we
> are about to flush belongs to another file or is unallocated by the
> filesystem.
>
> I agree that this should be non-destructive - if you think the additional
> check and locking isn't worth the overhead, I'm happy to take it out.  I don't
> have a strong opinion either way.
>

My concern is whether flushing potentially invalid virtual addresses
is problematic on some architectures.  Maybe it's just FUD, but it's
less work in my opinion to just revalidate the address versus auditing
each arch for this concern.

At a minimum we can change the comment to not say "We need to" and
instead say "TODO: are all archs ok with flushing potentially invalid
addresses?"
Ross Zwisler Jan. 5, 2016, 6:14 p.m. UTC | #7
On Tue, Jan 05, 2016 at 09:20:47AM -0800, Dan Williams wrote:
> On Tue, Jan 5, 2016 at 9:12 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote:
> >> On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
> >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> >> > pages so it is able to flush them durably to media on demand.
> >> >
> >> > The tracking of dirty pages is done via the radix tree in struct
> >> > address_space.  This radix tree is already used by the page writeback
> >> > infrastructure for tracking dirty pages associated with an open file, and
> >> > it already has support for exceptional (non struct page*) entries.  We
> >> > build upon these features to add exceptional entries to the radix tree for
> >> > DAX dirty PMD or PTE pages at fault time.
> >> >
> >> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> ...
> >> > +static int dax_writeback_one(struct block_device *bdev,
> >> > +           struct address_space *mapping, pgoff_t index, void *entry)
> >> > +{
> >> > +   struct radix_tree_root *page_tree = &mapping->page_tree;
> >> > +   int type = RADIX_DAX_TYPE(entry);
> >> > +   struct radix_tree_node *node;
> >> > +   struct blk_dax_ctl dax;
> >> > +   void **slot;
> >> > +   int ret = 0;
> >> > +
> >> > +   spin_lock_irq(&mapping->tree_lock);
> >> > +   /*
> >> > +    * Regular page slots are stabilized by the page lock even
> >> > +    * without the tree itself locked.  These unlocked entries
> >> > +    * need verification under the tree lock.
> >> > +    */
> >> > +   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> >> > +           goto unlock;
> >> > +   if (*slot != entry)
> >> > +           goto unlock;
> >> > +
> >> > +   /* another fsync thread may have already written back this entry */
> >> > +   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> >> > +           goto unlock;
> >> > +
> >> > +   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> >> > +
> >> > +   if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> >> > +           ret = -EIO;
> >> > +           goto unlock;
> >> > +   }
> >> > +
> >> > +   dax.sector = RADIX_DAX_SECTOR(entry);
> >> > +   dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> >> > +   spin_unlock_irq(&mapping->tree_lock);
> >> > +
> >> > +   /*
> >> > +    * We cannot hold tree_lock while calling dax_map_atomic() because it
> >> > +    * eventually calls cond_resched().
> >> > +    */
> >> > +   ret = dax_map_atomic(bdev, &dax);
> >> > +   if (ret < 0)
> >> > +           return ret;
> >> > +
> >> > +   if (WARN_ON_ONCE(ret < dax.size)) {
> >> > +           ret = -EIO;
> >> > +           dax_unmap_atomic(bdev, &dax);
> >> > +           return ret;
> >> > +   }
> >> > +
> >> > +   spin_lock_irq(&mapping->tree_lock);
> >> > +   /*
> >> > +    * We need to revalidate our radix entry while holding tree_lock
> >> > +    * before we do the writeback.
> >> > +    */
> >>
> >> Do we really need to revalidate here? dax_map_atomic() makes sure the addr
> >> & size is still part of the device. I guess you are concerned that due to
> >> truncate or similar operation those sectors needn't belong to the same file
> >> anymore but we don't really care about flushing sectors for someone else,
> >> do we?
> >>
> >> Otherwise the patch looks good to me.
> >
> > Yep, the concern is that we could have somehow raced against a truncate
> > operation while we weren't holding the tree_lock, and that now the address we
> > are about to flush belongs to another file or is unallocated by the
> > filesystem.
> >
> > I agree that this should be non-destructive - if you think the additional
> > check and locking isn't worth the overhead, I'm happy to take it out.  I don't
> > have a strong opinion either way.
> >
> 
> My concern is whether flushing potentially invalid virtual addresses
> is problematic on some architectures.  Maybe it's just FUD, but it's
> less work in my opinion to just revalidate the address versus auditing
> each arch for this concern.

I don't think that the addresses have the potential of being invalid from the
driver's point of view - we are still holding a reference on the block queue
via dax_map_atomic(), so we should be protected against races vs block device
removal.  I think the only question is whether it is okay to flush an address
that we know to be valid from the block device's point of view, but which the
filesystem may have truncated from being allocated to our inode.

Does that all make sense?

> At a minimum we can change the comment to not say "We need to" and
> instead say "TODO: are all archs ok with flushing potentially invalid
> addresses?"
Dan Williams Jan. 5, 2016, 6:22 p.m. UTC | #8
On Tue, Jan 5, 2016 at 10:14 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Jan 05, 2016 at 09:20:47AM -0800, Dan Williams wrote:
[..]
>> My concern is whether flushing potentially invalid virtual addresses
>> is problematic on some architectures.  Maybe it's just FUD, but it's
>> less work in my opinion to just revalidate the address versus auditing
>> each arch for this concern.
>
> I don't think that the addresses have the potential of being invalid from the
> driver's point of view - we are still holding a reference on the block queue
> via dax_map_atomic(), so we should be protected against races vs block device
> removal.  I think the only question is whether it is okay to flush an address
> that we know to be valid from the block device's point of view, but which the
> filesystem may have truncated from being allocated to our inode.
>
> Does that all make sense?

Yes, I was confusing which revalidation we were talking about.  As
long as the dax_map_atomic() is there I don't think we need any
further revalidation.
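
For context on why the flush itself is harmless, wb_cache_pmem() on x86 at
the time reduced to a loop of cache-line write-backs over the kernel mapping
of the range, roughly as in the simplified sketch below (not a verbatim copy
of the arch code):

        static void wb_cache_pmem_sketch(void *addr, size_t size)
        {
                unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
                unsigned long clflush_mask = clflush_size - 1;
                void *vend = addr + size;
                void *p;

                /* clwb writes dirty lines back without invalidating them */
                for (p = (void *)((unsigned long)addr & ~clflush_mask);
                                p < vend; p += clflush_size)
                        clwb(p);
        }

Since this only writes back whatever happens to be cached for an address the
driver still has mapped, flushing a range the filesystem has since truncated
away should be non-destructive, which matches the conclusion above.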
Ross Zwisler Jan. 6, 2016, 6:10 p.m. UTC | #9
On Sun, Jan 03, 2016 at 10:13:06AM -0800, Dan Williams wrote:
> On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> I'm hitting the following report with the ndctl dax test [1] on
> next-20151231.  I bisected it to
>  commit 3cb108f941de "dax-add-support-for-fsync-sync-v6".  I'll take a
> closer look tomorrow, but in case someone can beat me to it, here's
> the back-trace:
> 
> ------------[ cut here ]------------
> kernel BUG at fs/inode.c:497!
> [..]
> CPU: 1 PID: 3001 Comm: umount Tainted: G           O    4.4.0-rc7+ #2412
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000
> RIP: 0010:[<ffffffff81280171>]  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP: 0018:ffff880307797d50  EFLAGS: 00010002
> RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28
> RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530
> RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000
> R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530
> R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0
> FS:  00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0
> Stack:
> ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af
> ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0
> ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328
> Call Trace:
> [<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs]
> [<ffffffff81281038>] evict+0xb8/0x180
> [<ffffffff8128113b>] dispose_list+0x3b/0x50
> [<ffffffff81282014>] evict_inodes+0x144/0x170
> [<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0
> [<ffffffff81264837>] kill_block_super+0x27/0x70
> [<ffffffff81264a53>] deactivate_locked_super+0x43/0x70
> [<ffffffff81264e9c>] deactivate_super+0x5c/0x60
> [<ffffffff81285aff>] cleanup_mnt+0x3f/0x90
> [<ffffffff81285b92>] __cleanup_mnt+0x12/0x20
> [<ffffffff810c4f26>] task_work_run+0x76/0x90
> [<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280
> [<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f
> Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20 74 1a a8 40 75 18 48 c7 83 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 55
> RIP  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP <ffff880307797d50>
> ---[ end trace 3b1d8898a94a4fc1 ]---
> 
> [1]: git://git@github.com:pmem/ndctl.git pending
> make TESTS="test/dax.sh" check

This issue is fixed with patch 2 of v7:

https://lists.01.org/pipermail/linux-nvdimm/2016-January/003888.html

Thanks for the report!

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 82d0bff..050610d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,6 +24,7 @@ 
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
@@ -323,6 +324,176 @@  static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
+#define NO_SECTOR -1
+
+static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, index);
+
+	if (entry) {
+		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+			goto dirty;
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrexceptional--;
+	}
+
+	if (sector == NO_SECTOR) {
+		/*
+		 * This can happen during correct operation if our pfn_mkwrite
+		 * fault raced against a hole punch operation.  If this
+		 * happens the pte that was hole punched will have been
+		 * unmapped and the radix tree entry will have been removed by
+		 * the time we are called, but the call will still happen.  We
+		 * will return all the way up to wp_pfn_shared(), where the
+		 * pte_same() check will fail, eventually causing page fault
+		 * to be retried by the CPU.
+		 */
+		goto unlock;
+	}
+
+	error = radix_tree_insert(page_tree, index,
+			RADIX_DAX_ENTRY(sector, pmd_entry));
+	if (error)
+		goto unlock;
+
+	mapping->nrexceptional++;
+ dirty:
+	if (dirty)
+		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
+static int dax_writeback_one(struct block_device *bdev,
+		struct address_space *mapping, pgoff_t index, void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	struct radix_tree_node *node;
+	struct blk_dax_ctl dax;
+	void **slot;
+	int ret = 0;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+
+	/* another fsync thread may have already written back this entry */
+	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto unlock;
+
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	dax.sector = RADIX_DAX_SECTOR(entry);
+	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * We cannot hold tree_lock while calling dax_map_atomic() because it
+	 * eventually calls cond_resched().
+	 */
+	ret = dax_map_atomic(bdev, &dax);
+	if (ret < 0)
+		return ret;
+
+	if (WARN_ON_ONCE(ret < dax.size)) {
+		ret = -EIO;
+		dax_unmap_atomic(bdev, &dax);
+		return ret;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * We need to revalidate our radix entry while holding tree_lock
+	 * before we do the writeback.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unmap;
+	if (*slot != entry)
+		goto unmap;
+
+	wb_cache_pmem(dax.addr, dax.size);
+ unmap:
+	dax_unmap_atomic(bdev, &dax);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of [start,
+ * end]. This is required by data integrity operations to ensure file data is
+ * on persistent storage prior to completion of the operation.
+ */
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end)
+{
+	struct inode *inode = mapping->host;
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t start_page, end_page;
+	struct pagevec pvec;
+	void *entry;
+	int i, ret = 0;
+
+	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
+		return -EIO;
+
+	rcu_read_lock();
+	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	rcu_read_unlock();
+
+	/* see if the start of our range is covered by a PMD entry */
+	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		start &= PMD_MASK;
+
+	start_page = start >> PAGE_CACHE_SHIFT;
+	end_page = end >> PAGE_CACHE_SHIFT;
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++) {
+			ret = dax_writeback_one(bdev, mapping, indices[i],
+					pvec.pages[i]);
+			if (ret < 0)
+				return ret;
+		}
+	}
+	wmb_pmem();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -362,6 +533,11 @@  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
+	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
+			vmf->flags & FAULT_FLAG_WRITE);
+	if (error)
+		goto out;
+
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
@@ -486,6 +662,7 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -579,7 +756,7 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	struct block_device *bdev = NULL;
 	pgoff_t size, pgoff;
 	sector_t block;
-	int result = 0;
+	int error, result = 0;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -721,6 +898,16 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		if (write) {
+			error = dax_radix_entry(mapping, pgoff, dax.sector,
+					true, true);
+			if (error) {
+				dax_pmd_dbg(bdev, address,
+						"PMD radix insertion failed");
+				goto fallback;
+			}
+		}
+
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
 				__func__, current->comm, address,
@@ -779,15 +966,12 @@  EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..8204c3d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,6 @@  static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e215fc..2e7c8d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,6 +482,12 @@  int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		err = dax_writeback_mapping_range(mapping, lstart, lend);
+		if (err)
+			return err;
+	}
+
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);