Message ID | 1450899560-26708-5-git-send-email-ross.zwisler@linux.intel.com (mailing list archive) |
---|---|
State | Superseded |
On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
>
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

I'm hitting the following report with the ndctl dax test [1] on
next-20151231.  I bisected it to commit 3cb108f941de
"dax-add-support-for-fsync-sync-v6".  I'll take a closer look tomorrow,
but in case someone can beat me to it, here's the back-trace:

------------[ cut here ]------------
kernel BUG at fs/inode.c:497!
[..]
CPU: 1 PID: 3001 Comm: umount Tainted: G           O    4.4.0-rc7+ #2412
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000
RIP: 0010:[<ffffffff81280171>]  [<ffffffff81280171>] clear_inode+0x71/0x80
RSP: 0018:ffff880307797d50  EFLAGS: 00010002
RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28
RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530
RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000
R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530
R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0
FS:  00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0
Stack:
 ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af
 ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0
 ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328
Call Trace:
 [<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs]
 [<ffffffff81281038>] evict+0xb8/0x180
 [<ffffffff8128113b>] dispose_list+0x3b/0x50
 [<ffffffff81282014>] evict_inodes+0x144/0x170
 [<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0
 [<ffffffff81264837>] kill_block_super+0x27/0x70
 [<ffffffff81264a53>] deactivate_locked_super+0x43/0x70
 [<ffffffff81264e9c>] deactivate_super+0x5c/0x60
 [<ffffffff81285aff>] cleanup_mnt+0x3f/0x90
 [<ffffffff81285b92>] __cleanup_mnt+0x12/0x20
 [<ffffffff810c4f26>] task_work_run+0x76/0x90
 [<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280
 [<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f
Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20 74 1a a8 40 75 18 48 c7 83 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 55
RIP  [<ffffffff81280171>] clear_inode+0x71/0x80
 RSP <ffff880307797d50>
---[ end trace 3b1d8898a94a4fc1 ]---

[1]: git://git@github.com:pmem/ndctl.git pending
     make TESTS="test/dax.sh" check
On Sun 03-01-16 10:13:06, Dan Williams wrote: > On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > I'm hitting the following report with the ndctl dax test [1] on > next-20151231. I bisected it to > commit 3cb108f941de "dax-add-support-for-fsync-sync-v6". I'll take a > closer look tomorrow, but in case someone can beat me to it, here's > the back-trace: > > ------------[ cut here ]------------ > kernel BUG at fs/inode.c:497! I suppose this is the check that mapping->nr_exceptional is zero, isn't it? Hum, I don't see how that could happen given we call truncate_inode_pages_final() just before the clear_inode() call which removes all the exceptional entries from the radix tree. And there's not much room for a race during umount... Does the radix tree really contain any entry or is it an accounting bug? Honza > [..] > CPU: 1 PID: 3001 Comm: umount Tainted: G O 4.4.0-rc7+ #2412 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000 > RIP: 0010:[<ffffffff81280171>] [<ffffffff81280171>] clear_inode+0x71/0x80 > RSP: 0018:ffff880307797d50 EFLAGS: 00010002 > RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28 > RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530 > RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000 > R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530 > R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0 > FS: 00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0 > Stack: > ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af > ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0 > ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328 > Call Trace: > [<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs] > [<ffffffff81281038>] evict+0xb8/0x180 > [<ffffffff8128113b>] dispose_list+0x3b/0x50 > [<ffffffff81282014>] evict_inodes+0x144/0x170 > [<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0 > [<ffffffff81264837>] kill_block_super+0x27/0x70 > [<ffffffff81264a53>] deactivate_locked_super+0x43/0x70 > [<ffffffff81264e9c>] deactivate_super+0x5c/0x60 > [<ffffffff81285aff>] cleanup_mnt+0x3f/0x90 > [<ffffffff81285b92>] __cleanup_mnt+0x12/0x20 > [<ffffffff810c4f26>] task_work_run+0x76/0x90 > [<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280 > [<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f > Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20 > 74 1a a8 40 75 18 48 c7 8 > 3 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f > 0b 0f 1f 44 00 00 0f 1f > 44 00 00 55 > RIP [<ffffffff81280171>] clear_inode+0x71/0x80 > RSP <ffff880307797d50> > ---[ end trace 
3b1d8898a94a4fc1 ]--- > > [1]: git://git@github.com:pmem/ndctl.git pending > make TESTS="test/dax.sh" check >
On Wed 23-12-15 12:39:17, Ross Zwisler wrote: > To properly handle fsync/msync in an efficient way DAX needs to track dirty > pages so it is able to flush them durably to media on demand. > > The tracking of dirty pages is done via the radix tree in struct > address_space. This radix tree is already used by the page writeback > infrastructure for tracking dirty pages associated with an open file, and > it already has support for exceptional (non struct page*) entries. We > build upon these features to add exceptional entries to the radix tree for > DAX dirty PMD or PTE pages at fault time. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> ... > +static int dax_writeback_one(struct block_device *bdev, > + struct address_space *mapping, pgoff_t index, void *entry) > +{ > + struct radix_tree_root *page_tree = &mapping->page_tree; > + int type = RADIX_DAX_TYPE(entry); > + struct radix_tree_node *node; > + struct blk_dax_ctl dax; > + void **slot; > + int ret = 0; > + > + spin_lock_irq(&mapping->tree_lock); > + /* > + * Regular page slots are stabilized by the page lock even > + * without the tree itself locked. These unlocked entries > + * need verification under the tree lock. > + */ > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unlock; > + if (*slot != entry) > + goto unlock; > + > + /* another fsync thread may have already written back this entry */ > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > + goto unlock; > + > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > + > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) { > + ret = -EIO; > + goto unlock; > + } > + > + dax.sector = RADIX_DAX_SECTOR(entry); > + dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE); > + spin_unlock_irq(&mapping->tree_lock); > + > + /* > + * We cannot hold tree_lock while calling dax_map_atomic() because it > + * eventually calls cond_resched(). > + */ > + ret = dax_map_atomic(bdev, &dax); > + if (ret < 0) > + return ret; > + > + if (WARN_ON_ONCE(ret < dax.size)) { > + ret = -EIO; > + dax_unmap_atomic(bdev, &dax); > + return ret; > + } > + > + spin_lock_irq(&mapping->tree_lock); > + /* > + * We need to revalidate our radix entry while holding tree_lock > + * before we do the writeback. > + */ Do we really need to revalidate here? dax_map_atomic() makes sure the addr & size is still part of the device. I guess you are concerned that due to truncate or similar operation those sectors needn't belong to the same file anymore but we don't really care about flushing sectors for someone else, do we? Otherwise the patch looks good to me. > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > + goto unmap; > + if (*slot != entry) > + goto unmap; > + > + wb_cache_pmem(dax.addr, dax.size); > + unmap: > + dax_unmap_atomic(bdev, &dax); > + unlock: > + spin_unlock_irq(&mapping->tree_lock); > + return ret; > +} Honza
On Tue, Jan 05, 2016 at 12:13:46PM +0100, Jan Kara wrote:
> On Sun 03-01-16 10:13:06, Dan Williams wrote:
> > On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > > pages so it is able to flush them durably to media on demand.
> > >
> > > The tracking of dirty pages is done via the radix tree in struct
> > > address_space.  This radix tree is already used by the page writeback
> > > infrastructure for tracking dirty pages associated with an open file, and
> > > it already has support for exceptional (non struct page*) entries.  We
> > > build upon these features to add exceptional entries to the radix tree for
> > > DAX dirty PMD or PTE pages at fault time.
> > >
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >
> > I'm hitting the following report with the ndctl dax test [1] on
> > next-20151231.  I bisected it to commit 3cb108f941de
> > "dax-add-support-for-fsync-sync-v6".  I'll take a closer look tomorrow,
> > but in case someone can beat me to it, here's the back-trace:
> >
> > ------------[ cut here ]------------
> > kernel BUG at fs/inode.c:497!
>
> I suppose this is the check that mapping->nr_exceptional is zero, isn't it?
> Hum, I don't see how that could happen given we call
> truncate_inode_pages_final() just before the clear_inode() call which
> removes all the exceptional entries from the radix tree. And there's not
> much room for a race during umount... Does the radix tree really contain
> any entry or is it an accounting bug?
>
> 								Honza

I think this is a bug with the existing way that we handle PMD faults.

The issue is that the PMD path doesn't properly remove radix tree entries
for zero pages covering holes.  The PMD path calls unmap_mapping_range()
to unmap the range out of the struct address_space, but it is missing a
call to truncate_inode_pages_range() or similar to clear out those entries
in the radix tree.

Up until now we didn't notice, we just had an orphaned entry in the radix
tree, but with my code we then find the page entry in the radix tree when
handling a PMD fault, we remove it and add in a PMD entry.  This causes us
to be off on both our mapping->nrpages and mapping->nrexceptional counts.

In the PTE path we properly remove the pages from the radix tree when
upgrading from a hole to a real DAX entry via the delete_from_page_cache()
call, which eventually calls page_cache_tree_delete().

I'm working on a fix now (and making sure all the above is correct).

- Ross
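For readers following the thread, here is a minimal sketch of the approach Ross describes: drop the zero pages that backed the hole from both the page tables and the radix tree before a PMD entry is installed. This is not the actual v7 fix; the helper name and its exact call site in the PMD fault path are hypothetical.

```c
/*
 * Hypothetical sketch only, not the code that was merged.  The point
 * made above: unmap_mapping_range() only tears down the page table
 * entries for the zero pages that were backing the hole; the struct
 * page entries themselves stay in the mapping's radix tree, so they
 * must also be truncated, or mapping->nrpages / mapping->nrexceptional
 * go out of sync once dax_radix_entry() replaces them with a PMD entry.
 */
static void dax_pmd_clear_hole_pages(struct address_space *mapping,
				     pgoff_t pgoff)
{
	loff_t start = (loff_t)pgoff << PAGE_CACHE_SHIFT;

	/* remove any PTE mappings of the zero pages in the PMD range */
	unmap_mapping_range(mapping, start, PMD_SIZE, 0);

	/* and drop the zero pages from the radix tree as well */
	truncate_inode_pages_range(mapping, start, start + PMD_SIZE - 1);
}
```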
On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote: > On Wed 23-12-15 12:39:17, Ross Zwisler wrote: > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > ... > > +static int dax_writeback_one(struct block_device *bdev, > > + struct address_space *mapping, pgoff_t index, void *entry) > > +{ > > + struct radix_tree_root *page_tree = &mapping->page_tree; > > + int type = RADIX_DAX_TYPE(entry); > > + struct radix_tree_node *node; > > + struct blk_dax_ctl dax; > > + void **slot; > > + int ret = 0; > > + > > + spin_lock_irq(&mapping->tree_lock); > > + /* > > + * Regular page slots are stabilized by the page lock even > > + * without the tree itself locked. These unlocked entries > > + * need verification under the tree lock. > > + */ > > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > > + goto unlock; > > + if (*slot != entry) > > + goto unlock; > > + > > + /* another fsync thread may have already written back this entry */ > > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > > + goto unlock; > > + > > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > > + > > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) { > > + ret = -EIO; > > + goto unlock; > > + } > > + > > + dax.sector = RADIX_DAX_SECTOR(entry); > > + dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE); > > + spin_unlock_irq(&mapping->tree_lock); > > + > > + /* > > + * We cannot hold tree_lock while calling dax_map_atomic() because it > > + * eventually calls cond_resched(). > > + */ > > + ret = dax_map_atomic(bdev, &dax); > > + if (ret < 0) > > + return ret; > > + > > + if (WARN_ON_ONCE(ret < dax.size)) { > > + ret = -EIO; > > + dax_unmap_atomic(bdev, &dax); > > + return ret; > > + } > > + > > + spin_lock_irq(&mapping->tree_lock); > > + /* > > + * We need to revalidate our radix entry while holding tree_lock > > + * before we do the writeback. > > + */ > > Do we really need to revalidate here? dax_map_atomic() makes sure the addr > & size is still part of the device. I guess you are concerned that due to > truncate or similar operation those sectors needn't belong to the same file > anymore but we don't really care about flushing sectors for someone else, > do we? > > Otherwise the patch looks good to me. Yep, the concern is that we could have somehow raced against a truncate operation while we weren't holding the tree_lock, and that now the address we are about to flush belongs to another file or is unallocated by the filesystem. I agree that this should be non-destructive - if you think the additional check and locking isn't worth the overhead, I'm happy to take it out. I don't have a strong opinion either way. Thanks for the review!
On Tue, Jan 5, 2016 at 9:12 AM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote: >> On Wed 23-12-15 12:39:17, Ross Zwisler wrote: >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty >> > pages so it is able to flush them durably to media on demand. >> > >> > The tracking of dirty pages is done via the radix tree in struct >> > address_space. This radix tree is already used by the page writeback >> > infrastructure for tracking dirty pages associated with an open file, and >> > it already has support for exceptional (non struct page*) entries. We >> > build upon these features to add exceptional entries to the radix tree for >> > DAX dirty PMD or PTE pages at fault time. >> > >> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> >> ... >> > +static int dax_writeback_one(struct block_device *bdev, >> > + struct address_space *mapping, pgoff_t index, void *entry) >> > +{ >> > + struct radix_tree_root *page_tree = &mapping->page_tree; >> > + int type = RADIX_DAX_TYPE(entry); >> > + struct radix_tree_node *node; >> > + struct blk_dax_ctl dax; >> > + void **slot; >> > + int ret = 0; >> > + >> > + spin_lock_irq(&mapping->tree_lock); >> > + /* >> > + * Regular page slots are stabilized by the page lock even >> > + * without the tree itself locked. These unlocked entries >> > + * need verification under the tree lock. >> > + */ >> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) >> > + goto unlock; >> > + if (*slot != entry) >> > + goto unlock; >> > + >> > + /* another fsync thread may have already written back this entry */ >> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) >> > + goto unlock; >> > + >> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); >> > + >> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) { >> > + ret = -EIO; >> > + goto unlock; >> > + } >> > + >> > + dax.sector = RADIX_DAX_SECTOR(entry); >> > + dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE); >> > + spin_unlock_irq(&mapping->tree_lock); >> > + >> > + /* >> > + * We cannot hold tree_lock while calling dax_map_atomic() because it >> > + * eventually calls cond_resched(). >> > + */ >> > + ret = dax_map_atomic(bdev, &dax); >> > + if (ret < 0) >> > + return ret; >> > + >> > + if (WARN_ON_ONCE(ret < dax.size)) { >> > + ret = -EIO; >> > + dax_unmap_atomic(bdev, &dax); >> > + return ret; >> > + } >> > + >> > + spin_lock_irq(&mapping->tree_lock); >> > + /* >> > + * We need to revalidate our radix entry while holding tree_lock >> > + * before we do the writeback. >> > + */ >> >> Do we really need to revalidate here? dax_map_atomic() makes sure the addr >> & size is still part of the device. I guess you are concerned that due to >> truncate or similar operation those sectors needn't belong to the same file >> anymore but we don't really care about flushing sectors for someone else, >> do we? >> >> Otherwise the patch looks good to me. > > Yep, the concern is that we could have somehow raced against a truncate > operation while we weren't holding the tree_lock, and that now the address we > are about to flush belongs to another file or is unallocated by the > filesystem. > > I agree that this should be non-destructive - if you think the additional > check and locking isn't worth the overhead, I'm happy to take it out. I don't > have a strong opinion either way. 
> My concern is whether flushing potentially invalid virtual addresses is problematic on some architectures. Maybe it's just FUD, but it's less work in my opinion to just revalidate the address versus auditing each arch for this concern. At a minimum we can change the comment to not say "We need to" and instead say "TODO: are all archs ok with flushing potentially invalid addresses?"
On Tue, Jan 05, 2016 at 09:20:47AM -0800, Dan Williams wrote: > On Tue, Jan 5, 2016 at 9:12 AM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote: > >> On Wed 23-12-15 12:39:17, Ross Zwisler wrote: > >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty > >> > pages so it is able to flush them durably to media on demand. > >> > > >> > The tracking of dirty pages is done via the radix tree in struct > >> > address_space. This radix tree is already used by the page writeback > >> > infrastructure for tracking dirty pages associated with an open file, and > >> > it already has support for exceptional (non struct page*) entries. We > >> > build upon these features to add exceptional entries to the radix tree for > >> > DAX dirty PMD or PTE pages at fault time. > >> > > >> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > >> ... > >> > +static int dax_writeback_one(struct block_device *bdev, > >> > + struct address_space *mapping, pgoff_t index, void *entry) > >> > +{ > >> > + struct radix_tree_root *page_tree = &mapping->page_tree; > >> > + int type = RADIX_DAX_TYPE(entry); > >> > + struct radix_tree_node *node; > >> > + struct blk_dax_ctl dax; > >> > + void **slot; > >> > + int ret = 0; > >> > + > >> > + spin_lock_irq(&mapping->tree_lock); > >> > + /* > >> > + * Regular page slots are stabilized by the page lock even > >> > + * without the tree itself locked. These unlocked entries > >> > + * need verification under the tree lock. > >> > + */ > >> > + if (!__radix_tree_lookup(page_tree, index, &node, &slot)) > >> > + goto unlock; > >> > + if (*slot != entry) > >> > + goto unlock; > >> > + > >> > + /* another fsync thread may have already written back this entry */ > >> > + if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)) > >> > + goto unlock; > >> > + > >> > + radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE); > >> > + > >> > + if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) { > >> > + ret = -EIO; > >> > + goto unlock; > >> > + } > >> > + > >> > + dax.sector = RADIX_DAX_SECTOR(entry); > >> > + dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE); > >> > + spin_unlock_irq(&mapping->tree_lock); > >> > + > >> > + /* > >> > + * We cannot hold tree_lock while calling dax_map_atomic() because it > >> > + * eventually calls cond_resched(). > >> > + */ > >> > + ret = dax_map_atomic(bdev, &dax); > >> > + if (ret < 0) > >> > + return ret; > >> > + > >> > + if (WARN_ON_ONCE(ret < dax.size)) { > >> > + ret = -EIO; > >> > + dax_unmap_atomic(bdev, &dax); > >> > + return ret; > >> > + } > >> > + > >> > + spin_lock_irq(&mapping->tree_lock); > >> > + /* > >> > + * We need to revalidate our radix entry while holding tree_lock > >> > + * before we do the writeback. > >> > + */ > >> > >> Do we really need to revalidate here? dax_map_atomic() makes sure the addr > >> & size is still part of the device. I guess you are concerned that due to > >> truncate or similar operation those sectors needn't belong to the same file > >> anymore but we don't really care about flushing sectors for someone else, > >> do we? > >> > >> Otherwise the patch looks good to me. > > > > Yep, the concern is that we could have somehow raced against a truncate > > operation while we weren't holding the tree_lock, and that now the address we > > are about to flush belongs to another file or is unallocated by the > > filesystem. 
> > > > I agree that this should be non-destructive - if you think the additional > > check and locking isn't worth the overhead, I'm happy to take it out. I don't > > have a strong opinion either way. > > > > My concern is whether flushing potentially invalid virtual addresses > is problematic on some architectures. Maybe it's just FUD, but it's > less work in my opinion to just revalidate the address versus auditing > each arch for this concern. I don't think that the addresses have the potential of being invalid from the driver's point of view - we are still holding a reference on the block queue via dax_map_atomic(), so we should be protected against races vs block device removal. I think the only question is whether it is okay to flush an address that we know to be valid from the block device's point of view, but which the filesystem may have truncated from being allocated to our inode. Does that all make sense? > At a minimum we can change the comment to not say "We need to" and > instead say "TODO: are all archs ok with flushing potentially invalid > addresses?"
On Tue, Jan 5, 2016 at 10:14 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Jan 05, 2016 at 09:20:47AM -0800, Dan Williams wrote:
[..]
>> My concern is whether flushing potentially invalid virtual addresses
>> is problematic on some architectures.  Maybe it's just FUD, but it's
>> less work in my opinion to just revalidate the address versus auditing
>> each arch for this concern.
>
> I don't think that the addresses have the potential of being invalid from the
> driver's point of view - we are still holding a reference on the block queue
> via dax_map_atomic(), so we should be protected against races vs block device
> removal.  I think the only question is whether it is okay to flush an address
> that we know to be valid from the block device's point of view, but which the
> filesystem may have truncated from being allocated to our inode.
>
> Does that all make sense?

Yes, I was confusing which revalidation we were talking about.  As long as
the dax_map_atomic() is there I don't think we need any further
revalidation.
On Sun, Jan 03, 2016 at 10:13:06AM -0800, Dan Williams wrote: > On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > pages so it is able to flush them durably to media on demand. > > > > The tracking of dirty pages is done via the radix tree in struct > > address_space. This radix tree is already used by the page writeback > > infrastructure for tracking dirty pages associated with an open file, and > > it already has support for exceptional (non struct page*) entries. We > > build upon these features to add exceptional entries to the radix tree for > > DAX dirty PMD or PTE pages at fault time. > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > I'm hitting the following report with the ndctl dax test [1] on > next-20151231. I bisected it to > commit 3cb108f941de "dax-add-support-for-fsync-sync-v6". I'll take a > closer look tomorrow, but in case someone can beat me to it, here's > the back-trace: > > ------------[ cut here ]------------ > kernel BUG at fs/inode.c:497! > [..] > CPU: 1 PID: 3001 Comm: umount Tainted: G O 4.4.0-rc7+ #2412 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000 > RIP: 0010:[<ffffffff81280171>] [<ffffffff81280171>] clear_inode+0x71/0x80 > RSP: 0018:ffff880307797d50 EFLAGS: 00010002 > RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28 > RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530 > RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000 > R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530 > R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0 > FS: 00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0 > Stack: > ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af > ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0 > ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328 > Call Trace: > [<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs] > [<ffffffff81281038>] evict+0xb8/0x180 > [<ffffffff8128113b>] dispose_list+0x3b/0x50 > [<ffffffff81282014>] evict_inodes+0x144/0x170 > [<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0 > [<ffffffff81264837>] kill_block_super+0x27/0x70 > [<ffffffff81264a53>] deactivate_locked_super+0x43/0x70 > [<ffffffff81264e9c>] deactivate_super+0x5c/0x60 > [<ffffffff81285aff>] cleanup_mnt+0x3f/0x90 > [<ffffffff81285b92>] __cleanup_mnt+0x12/0x20 > [<ffffffff810c4f26>] task_work_run+0x76/0x90 > [<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280 > [<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f > Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20 > 74 1a a8 40 75 18 48 c7 8 > 3 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f > 0b 0f 1f 44 00 00 0f 1f > 44 00 00 55 > RIP [<ffffffff81280171>] clear_inode+0x71/0x80 > RSP <ffff880307797d50> > ---[ end trace 3b1d8898a94a4fc1 ]--- > > [1]: git://git@github.com:pmem/ndctl.git pending > make TESTS="test/dax.sh" check This issue is fixed with patch 2 of v7: https://lists.01.org/pipermail/linux-nvdimm/2016-January/003888.html Thanks for the report!
diff --git a/fs/dax.c b/fs/dax.c
index 82d0bff..050610d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,6 +24,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
@@ -323,6 +324,176 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
+#define NO_SECTOR -1
+
+static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, index);
+
+	if (entry) {
+		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+			goto dirty;
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrexceptional--;
+	}
+
+	if (sector == NO_SECTOR) {
+		/*
+		 * This can happen during correct operation if our pfn_mkwrite
+		 * fault raced against a hole punch operation.  If this
+		 * happens the pte that was hole punched will have been
+		 * unmapped and the radix tree entry will have been removed by
+		 * the time we are called, but the call will still happen.  We
+		 * will return all the way up to wp_pfn_shared(), where the
+		 * pte_same() check will fail, eventually causing page fault
+		 * to be retried by the CPU.
+		 */
+		goto unlock;
+	}
+
+	error = radix_tree_insert(page_tree, index,
+			RADIX_DAX_ENTRY(sector, pmd_entry));
+	if (error)
+		goto unlock;
+
+	mapping->nrexceptional++;
+ dirty:
+	if (dirty)
+		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
+static int dax_writeback_one(struct block_device *bdev,
+		struct address_space *mapping, pgoff_t index, void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	struct radix_tree_node *node;
+	struct blk_dax_ctl dax;
+	void **slot;
+	int ret = 0;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+
+	/* another fsync thread may have already written back this entry */
+	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto unlock;
+
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	dax.sector = RADIX_DAX_SECTOR(entry);
+	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * We cannot hold tree_lock while calling dax_map_atomic() because it
+	 * eventually calls cond_resched().
+	 */
+	ret = dax_map_atomic(bdev, &dax);
+	if (ret < 0)
+		return ret;
+
+	if (WARN_ON_ONCE(ret < dax.size)) {
+		ret = -EIO;
+		dax_unmap_atomic(bdev, &dax);
+		return ret;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * We need to revalidate our radix entry while holding tree_lock
+	 * before we do the writeback.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unmap;
+	if (*slot != entry)
+		goto unmap;
+
+	wb_cache_pmem(dax.addr, dax.size);
+ unmap:
+	dax_unmap_atomic(bdev, &dax);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of [start,
+ * end]. This is required by data integrity operations to ensure file data is
+ * on persistent storage prior to completion of the operation.
+ */
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end)
+{
+	struct inode *inode = mapping->host;
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t start_page, end_page;
+	struct pagevec pvec;
+	void *entry;
+	int i, ret = 0;
+
+	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
+		return -EIO;
+
+	rcu_read_lock();
+	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	rcu_read_unlock();
+
+	/* see if the start of our range is covered by a PMD entry */
+	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		start &= PMD_MASK;
+
+	start_page = start >> PAGE_CACHE_SHIFT;
+	end_page = end >> PAGE_CACHE_SHIFT;
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++) {
+			ret = dax_writeback_one(bdev, mapping, indices[i],
+					pvec.pages[i]);
+			if (ret < 0)
+				return ret;
+		}
+	}
+	wmb_pmem();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -362,6 +533,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
+	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
+			vmf->flags & FAULT_FLAG_WRITE);
+	if (error)
+		goto out;
+
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
@@ -486,6 +662,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -579,7 +756,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	struct block_device *bdev = NULL;
 	pgoff_t size, pgoff;
 	sector_t block;
-	int result = 0;
+	int error, result = 0;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -721,6 +898,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		if (write) {
+			error = dax_radix_entry(mapping, pgoff, dax.sector,
+					true, true);
+			if (error) {
+				dax_pmd_dbg(bdev, address,
+						"PMD radix insertion failed");
+				goto fallback;
+			}
+		}
+
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
 				__func__, current->comm, address,
@@ -779,15 +966,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..8204c3d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e215fc..2e7c8d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,6 +482,12 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		err = dax_writeback_mapping_range(mapping, lstart, lend);
+		if (err)
+			return err;
+	}
+
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
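The RADIX_DAX_* helpers used throughout the diff are defined elsewhere in the series. The definitions below are a plausible reconstruction of that encoding, consistent with how dax_radix_entry() and dax_writeback_one() use it; treat the exact values as an assumption rather than the header contents. The low bits of the exceptional entry carry the entry type (PTE vs PMD) plus the radix tree's exceptional-entry marker, and the remaining bits carry the sector.

```c
/*
 * Plausible sketch of the entry encoding, not necessarily the exact
 * definitions from the rest of the series.  An exceptional radix tree
 * entry packs the sector into the high bits and the DAX entry type
 * into the low bits alongside RADIX_TREE_EXCEPTIONAL_ENTRY.
 */
#define RADIX_DAX_MASK		0xf
#define RADIX_DAX_SHIFT		4
#define RADIX_DAX_PTE		(0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
#define RADIX_DAX_PMD		(0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
#define RADIX_DAX_TYPE(entry)	((unsigned long)(entry) & RADIX_DAX_MASK)
#define RADIX_DAX_SECTOR(entry)	((unsigned long)(entry) >> RADIX_DAX_SHIFT)
#define RADIX_DAX_ENTRY(sector, pmd) \
	((void *)(((unsigned long)(sector) << RADIX_DAX_SHIFT) | \
		  ((pmd) ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
```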
To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 196 ++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |   2 +
 mm/filemap.c        |   6 ++
 3 files changed, 198 insertions(+), 6 deletions(-)
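To see how the new writeback path gets exercised end to end, here is a rough sketch of a filesystem fsync handler. It is a simplified, hypothetical stand-in (not code from this series): with the mm/filemap.c hunk above, filemap_write_and_wait_range() now calls dax_writeback_mapping_range() for DAX mappings, which tags and walks the TOWRITE entries, flushes each one with wb_cache_pmem(), and ends with wmb_pmem() to make the flushes durable.

```c
/*
 * Hypothetical, simplified fsync handler showing where the DAX
 * writeback path slots in; real filesystems (ext2/ext4/xfs) do more
 * work here.  The call chain this illustrates:
 *
 *   ->fsync()
 *     filemap_write_and_wait_range()
 *       dax_writeback_mapping_range()     <- added by this patch
 *         dax_writeback_one()             <- wb_cache_pmem() per entry
 *         wmb_pmem()                      <- make the flushes durable
 */
#include <linux/fs.h>

static int example_dax_fsync(struct file *file, loff_t start, loff_t end,
			     int datasync)
{
	struct inode *inode = file_inode(file);
	int ret;

	/* flushes dirty DAX radix entries for a DAX mapping */
	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (ret)
		return ret;

	/* ...filesystem-specific metadata sync would follow here... */
	return 0;
}
```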