Message ID | 1445529945.17208.4.camel@intel.com |
---|---|
State | Superseded |
On Thu 22-10-15 16:05:46, Williams, Dan J wrote:
> On Thu, 2015-10-22 at 11:35 +0200, Jan Kara wrote:
> > On Thu 22-10-15 02:42:11, Dan Williams wrote:
> > > If an application wants exclusive access to all of the persistent memory
> > > provided by an NVDIMM namespace it can use this raw-block-dax facility
> > > to forgo establishing a filesystem. This capability is targeted
> > > primarily to hypervisors wanting to provision persistent memory for
> > > guests.
> > >
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Jeff Moyer <jmoyer@redhat.com>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > > ---
> > >  fs/block_dev.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 53 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 3255dcec96b4..c27cd1a21a13 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = {
> > >  	.is_dirty_writeback = buffer_check_dirty_writeback,
> > >  };
> > >
> > > +#ifdef CONFIG_FS_DAX
> > > +/*
> > > + * In the raw block case we do not need to contend with truncation nor
> > > + * unwritten file extents. Without those concerns there is no need for
> > > + * additional locking beyond the mmap_sem context that these routines
> > > + * are already executing under.
> > > + *
> > > + * Note, there is no protection if the block device is dynamically
> > > + * resized (partition grow/shrink) during a fault. A stable block device
> > > + * size is already not enforced in the blkdev_direct_IO path.
> > > + *
> > > + * For DAX, it is the responsibility of the block device driver to
> > > + * ensure the whole-disk device size is stable while requests are in
> > > + * flight.
> > > + *
> > > + * Finally, these paths do not synchronize against freezing
> > > + * (sb_start_pagefault(), etc...) since bdev_sops does not support
> > > + * freezing.
> >
> > Well, for devices freezing is handled directly in the block layer code
> > (blk_stop_queue()) since there's no need to put some metadata structures
> > into a consistent state. So the comment about bdev_sops is somewhat
> > strange.
>
> This text was aimed at the request from Ross to document the differences
> vs the generic_file_mmap() path. Is the following incremental change
> more clear?

Well, not really. I thought you'd just delete that paragraph :) The thing
is: When doing IO directly to the block device, it makes no sense to look
at a filesystem on top of it - hopefully there is none since you'd be
corrupting it. So the paragraph that would make sense to me would be:

 * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling
 * sb_start_pagefault(). There is no filesystem which could be frozen here
 * and when bdev gets frozen, IO gets blocked in the request queue.

But when spelled out like this, I've realized that with DAX, this blocking
of requests in the request queue doesn't really block the IO to the device.
So block device freezing (aka blk_queue_stop()) doesn't work reliably with
DAX. That should be fixed but it's not easy as the only way to do that
would be to hook into blk_stop_queue() and unmap (or at least
write-protect) all the mappings of the device. Ugh...

Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
filesystems since there's nothing which writeprotects pages that are
writeably mapped. In normal path, page writeback does this but that doesn't
happen for DAX. I remember we once talked about this but it got lost.
We need something like walk all filesystem inodes during fs freeze and
writeprotect all pages that are mapped. But that's going to be slow...

								Honza

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 840acd4380d4..4ae8fa55bd1e 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1702,9 +1702,15 @@ static const struct address_space_operations def_blk_aops = {
>   * ensure the whole-disk device size is stable while requests are in
>   * flight.
>   *
> - * Finally, these paths do not synchronize against freezing
> - * (sb_start_pagefault(), etc...) since bdev_sops does not support
> - * freezing.
> + * Finally, in contrast to the generic_file_mmap() path, there are no
> + * calls to sb_start_pagefault(). That is meant to synchronize write
> + * faults against requests to freeze the contents of the filesystem
> + * hosting vma->vm_file. However, in the case of a block device special
> + * file, it is a 0-sized device node usually hosted on devtmpfs, i.e.
> + * nothing to do with the super_block for bdev_file_inode(vma->vm_file).
> + * We could call get_super() in this path to retrieve the right
> + * super_block, but the generic_file_mmap() path does not do this for
> + * the CONFIG_FS_DAX=n case.
>   */
>  static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
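For context on the freeze protection being discussed: the sb_start_pagefault() / sb_end_pagefault() bracketing is what the regular filesystem write-fault path performs, so a write fault blocks while the hosting filesystem is frozen. Below is a simplified sketch of that pattern (not the exact mm/filemap.c code), shown for contrast with blkdev_dax_fault(), which skips it:

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Simplified sketch of the generic write-fault path's freeze bracketing. */
static int fs_page_mkwrite_sketch(struct vm_area_struct *vma,
				  struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);
	int ret = VM_FAULT_LOCKED;

	/* Blocks here while the hosting filesystem is frozen. */
	sb_start_pagefault(inode->i_sb);
	file_update_time(vma->vm_file);

	lock_page(page);
	if (page->mapping != inode->i_mapping) {
		/* The page was truncated away while we waited. */
		unlock_page(page);
		ret = VM_FAULT_NOPAGE;
		goto out;
	}
	set_page_dirty(page);
	wait_for_stable_page(page);
out:
	sb_end_pagefault(inode->i_sb);
	return ret;
}
```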
On Thu, Oct 22, 2015 at 2:08 PM, Jan Kara <jack@suse.cz> wrote:
> On Thu 22-10-15 16:05:46, Williams, Dan J wrote:
[..]
>> This text was aimed at the request from Ross to document the differences
>> vs the generic_file_mmap() path. Is the following incremental change
>> more clear?
>
> Well, not really. I thought you'd just delete that paragraph :) The thing
> is: When doing IO directly to the block device, it makes no sense to look
> at a filesystem on top of it - hopefully there is none since you'd be
> corrupting it. So the paragraph that would make sense to me would be:
>
> * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling
> * sb_start_pagefault(). There is no filesystem which could be frozen here
> * and when bdev gets frozen, IO gets blocked in the request queue.

I'm not following this assertion that "IO gets blocked in the request
queue" when the device is frozen in the code. As far as I can see
outside of tracking the freeze depth count the request_queue does not
check if the device is frozen. freeze_bdev() is moot when no
filesystem is present.

> But when spelled out like this, I've realized that with DAX, this blocking
> of requests in the request queue doesn't really block the IO to the device.
> So block device freezing (aka blk_queue_stop()) doesn't work reliably with
> DAX. That should be fixed but it's not easy as the only way to do that
> would be to hook into blk_stop_queue() and unmap (or at least
> write-protect) all the mappings of the device. Ugh...

Again I'm missing how this is guaranteed in the non-DAX case.
freeze_bdev() will sync_blockdev(), but it does nothing to prevent
re-dirtying through raw device mmaps while the fs is frozen. Should
it? That's at least a separate patch.

> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
> filesystems since there's nothing which writeprotects pages that are
> writeably mapped. In normal path, page writeback does this but that doesn't
> happen for DAX. I remember we once talked about this but it got lost.
> We need something like walk all filesystem inodes during fs freeze and
> writeprotect all pages that are mapped. But that's going to be slow...

This is what I'm attempting to tackle with the next revision of this
series...
On Fri 23-10-15 16:32:57, Dan Williams wrote:
> On Thu, Oct 22, 2015 at 2:08 PM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 22-10-15 16:05:46, Williams, Dan J wrote:
> [..]
> >> This text was aimed at the request from Ross to document the differences
> >> vs the generic_file_mmap() path. Is the following incremental change
> >> more clear?
> >
> > Well, not really. I thought you'd just delete that paragraph :) The thing
> > is: When doing IO directly to the block device, it makes no sense to look
> > at a filesystem on top of it - hopefully there is none since you'd be
> > corrupting it. So the paragraph that would make sense to me would be:
> >
> > * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling
> > * sb_start_pagefault(). There is no filesystem which could be frozen here
> > * and when bdev gets frozen, IO gets blocked in the request queue.
>
> I'm not following this assertion that "IO gets blocked in the request
> queue" when the device is frozen in the code. As far as I can see
> outside of tracking the freeze depth count the request_queue does not
> check if the device is frozen. freeze_bdev() is moot when no
> filesystem is present.

Yes - the way e.g. dm freezes a device when it wants to take a snapshot is
that it first calls freeze_bdev() (to stop the fs when there is one) and
then calls blk_stop_queue() to block all the IO requests in the request
queue. In this sense freeze_bdev() is somewhat a misnomer, since it doesn't
make sure no IO is submitted to the bdev.

> > But when spelled out like this, I've realized that with DAX, this blocking
> > of requests in the request queue doesn't really block the IO to the device.
> > So block device freezing (aka blk_queue_stop()) doesn't work reliably with
> > DAX. That should be fixed but it's not easy as the only way to do that
> > would be to hook into blk_stop_queue() and unmap (or at least
> > write-protect) all the mappings of the device. Ugh...
>
> Again I'm missing how this is guaranteed in the non-DAX case.
> freeze_bdev() will sync_blockdev(), but it does nothing to prevent
> re-dirtying through raw device mmaps while the fs is frozen. Should
> it? That's at least a separate patch.

It doesn't have to - after blk_stop_queue() is called no IO is submitted
to the device, and snapshotting happens at the level below the bdev page
cache, so we don't care about modifications happening there. But with DAX
things are different, as we directly map device pages into the page cache,
so we have to make sure no modifications of the page cache happen.

								Honza
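A rough sketch of the snapshot-freeze sequence Jan describes (quiesce any filesystem first, then stop the request queue); this is illustrative only, not device-mapper's actual code, and as noted it does nothing about existing DAX mmaps of the device:

```c
#include <linux/blkdev.h>
#include <linux/fs.h>

/* Illustrative only: quiesce a block device the way Jan outlines above. */
static struct super_block *freeze_whole_bdev(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);
	struct super_block *sb;
	unsigned long flags;

	/* Step 1: quiesce a mounted filesystem, if any (no-op otherwise). */
	sb = freeze_bdev(bdev);

	/* Step 2: block further requests at the queue level. */
	spin_lock_irqsave(q->queue_lock, flags);
	blk_stop_queue(q);
	spin_unlock_irqrestore(q->queue_lock, flags);

	/*
	 * Note: with DAX, neither step write-protects existing mmap()s of
	 * the device, which is the gap discussed in this thread.
	 */
	return sb;
}
```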
On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
>> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
>> filesystems since there's nothing which writeprotects pages that are
>> writeably mapped. In normal path, page writeback does this but that doesn't
>> happen for DAX. I remember we once talked about this but it got lost.
>> We need something like walk all filesystem inodes during fs freeze and
>> writeprotect all pages that are mapped. But that's going to be slow...
>
> fsync() has the same problem - we have no record of the pages that
> need to be committed and then write protected when fsync() is called
> after write()...

I know Ross is still working on that implementation. However, I had a
thought on the flight to ksummit that maybe we shouldn't worry about
tracking dirty state on a per-page basis. For small / frequent
synchronizations an application really should be using the nvml
library [1] to issue cache flushes and pcommit from userspace on a
per-cacheline basis. That leaves unmodified apps that want to be
correct in the presence of dax mappings. Two things we can do to
mitigate that case:

1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
flag. Applications shouldn't silently become incorrect simply because
the fs is mounted with -o dax. If an app doesn't understand DAX
mappings it should get page-cache semantics. This also protects apps
that are not expecting DAX semantics on raw block device mappings.

2/ Even if we get a new flag that lets the kernel know the app
understands DAX mappings, we shouldn't leave fsync broken. Can we
instead get by with a simple / big hammer solution? I.e.

    on_each_cpu(sync_cache, ...);

...where sync_cache is something like:

    cache_disable();
    wbinvd();
    pcommit();
    cache_enable();

Disruptive, yes, but if an app cares about efficient persistent memory
synchronization fsync is already the wrong api.
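Spelled out a little further, the big hammer above could look like the sketch below. on_each_cpu() and wbinvd() are real kernel interfaces; cache_disable(), cache_enable(), and pcommit() are placeholders carried over from the outline, not existing exported APIs:

```c
#include <linux/smp.h>
#include <asm/special_insns.h>

/* Placeholders from the outline above -- not existing kernel APIs. */
void cache_disable(void);
void cache_enable(void);
void pcommit(void);

static void sync_cache(void *unused)
{
	cache_disable();	/* placeholder: stop new cache fills */
	wbinvd();		/* write back and invalidate this CPU's caches */
	pcommit();		/* placeholder: commit queued writes to media */
	cache_enable();		/* placeholder: re-enable caching */
}

static void pmem_sync_big_hammer(void)
{
	/* Run the flush on every online CPU and wait for completion. */
	on_each_cpu(sync_cache, NULL, 1);
}
```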
On Mon 26-10-15 17:23:19, Dave Chinner wrote:
> On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
> > 2/ Even if we get a new flag that lets the kernel know the app
> > understands DAX mappings, we shouldn't leave fsync broken. Can we
> > instead get by with a simple / big hammer solution? I.e.
>
> Because we don't physically have to write back data the problem is
> both simpler and more complex. The simplest solution is for the
> underlying block device to implement blkdev_issue_flush() correctly.
>
> i.e. if blkdev_issue_flush() behaves according to it's required
> semantics - that all volatile cached data is flushed to stable
> storage - then fsync-on-DAX will work appropriately. As it is, this is
> needed for journal based filesystems to work correctly, as they are
> assuming that their journal writes are being treated correctly as
> REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
> ordering is maintained....
>
> So, to begin with, this problem needs to be solved at the block
> device level. That's the simple, brute-force, big hammer solution to
> the problem, and it requires no changes at the filesystem level at
> all.

Completely agreed. Just make sure REQ_FLUSH / REQ_FUA work correctly for
pmem and the fsync(2) / sync(2) issues go away. Fs freezing is a different
story; that will likely need some coordination from the filesystem layer
(although with some luck we could keep it hidden in fs/super.c and
fs/block_dev.c). I can have a look at that once ext4 DAX support works,
unless someone beats me to it...

								Honza
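As a concrete illustration of the brute-force approach (a sketch only, assuming the pmem driver honors REQ_FLUSH; dax_fsync_big_hammer() is a hypothetical name, not an existing kernel function): fsync on a DAX file essentially reduces to flushing the backing device, since there are no dirty page-cache pages to write back:

```c
#include <linux/blkdev.h>
#include <linux/fs.h>

/* Hypothetical brute-force fsync for a DAX file; sketch only. */
static int dax_fsync_big_hammer(struct file *file, loff_t start, loff_t end,
				int datasync)
{
	struct inode *inode = file_inode(file);
	int err;

	/* With DAX there are no dirty page-cache pages, so this is cheap. */
	err = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (err)
		return err;

	/* The real work: ask the device to flush any volatile buffers. */
	return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
}
```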
On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Oct 26, 2015 at 11:48:06AM +0900, Dan Williams wrote:
>> On Mon, Oct 26, 2015 at 6:22 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Oct 22, 2015 at 11:08:18PM +0200, Jan Kara wrote:
>> >> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
>> >> filesystems since there's nothing which writeprotects pages that are
>> >> writeably mapped. In normal path, page writeback does this but that doesn't
>> >> happen for DAX. I remember we once talked about this but it got lost.
>> >> We need something like walk all filesystem inodes during fs freeze and
>> >> writeprotect all pages that are mapped. But that's going to be slow...
>> >
>> > fsync() has the same problem - we have no record of the pages that
>> > need to be committed and then write protected when fsync() is called
>> > after write()...
>>
>> I know Ross is still working on that implementation. However, I had a
>> thought on the flight to ksummit that maybe we shouldn't worry about
>> tracking dirty state on a per-page basis. For small / frequent
>> synchronizations an application really should be using the nvml
>> library [1] to issue cache flushes and pcommit from userspace on a
>> per-cacheline basis. That leaves unmodified apps that want to be
>> correct in the presence of dax mappings. Two things we can do to
>> mitigate that case:
>>
>> 1/ Make DAX mappings opt-in with a new MMAP_DAX (page-cache bypass)
>> flag. Applications shouldn't silently become incorrect simply because
>> the fs is mounted with -o dax. If an app doesn't understand DAX
>> mappings it should get page-cache semantics. This also protects apps
>> that are not expecting DAX semantics on raw block device mappings.
>
> Which is the complete opposite of what we are trying to acehive with
> DAX. i.e. that existing applications "just work" with DAX without
> modification. So this is a non-starter.

The list of things DAX breaks is getting shorter, but certainly the
page-less paths will not be without surprises for quite a while yet...

> Also, DAX access isn't a property of mmap - it's a property
> of the inode. We cannot do DAX access via mmap while mixing page
> cache based access through file descriptor based interfaces. This
> I why I'm adding an inode attribute (on disk) to enable per-file DAX
> capabilities - either everything is via the DAX paths, or nothing
> is.
>

Per-inode control sounds very useful, I'll look at a similar mechanism
for the raw block case.

However, still not quite convinced page-cache control is an inode-only
property, especially when direct-i/o is not an inode-property. That
said, I agree the complexity of handling mixed mappings of the same
file is prohibitive.

>> 2/ Even if we get a new flag that lets the kernel know the app
>> understands DAX mappings, we shouldn't leave fsync broken. Can we
>> instead get by with a simple / big hammer solution? I.e.
>
> Because we don't physically have to write back data the problem is
> both simpler and more complex. The simplest solution is for the
> underlying block device to implement blkdev_issue_flush() correctly.
>
> i.e. if blkdev_issue_flush() behaves according to it's required
> semantics - that all volatile cached data is flushed to stable
> storage - then fsync-on-DAX will work appropriately. As it is, this is
> needed for journal based filesystems to work correctly, as they are
> assuming that their journal writes are being treated correctly as
> REQ_FLUSH | REQ_FUA to ensure correct data/metadata/journal
> ordering is maintained....
>
> So, to begin with, this problem needs to be solved at the block
> device level. That's the simple, brute-force, big hammer solution to
> the problem, and it requires no changes at the filesystem level at
> all.
>
> However, to avoid having to flush the entire block device range on
> fsync we need a much more complex solution that tracks the dirty
> ranges of the file and hence what needs committing when fsync is
> run....
>
>> Disruptive, yes, but if an app cares about efficient persistent memory
>> synchronization fsync is already the wrong api.
>
> I don't really care about efficiency right now - correctness comes
> first. Fundamentally, the app should not care whether it is writing to
> persistent memory or spinning rust - the filesystem needs to
> provide the application with exactly the same integrity guarantees
> regardless of the underlying storage.

Sounds good, get blkdev_issue_flush() functional first and then worry
about building a more efficient solution on top.
On Mon, Oct 26, 2015 at 05:56:30PM +0900, Dan Williams wrote:
> On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > Also, DAX access isn't a property of mmap - it's a property
> > of the inode. We cannot do DAX access via mmap while mixing page
> > cache based access through file descriptor based interfaces. This
> > I why I'm adding an inode attribute (on disk) to enable per-file DAX
> > capabilities - either everything is via the DAX paths, or nothing
> > is.
> >
>
> Per-inode control sounds very useful, I'll look at a similar mechanism
> for the raw block case.
>
> However, still not quite convinced page-cache control is an inode-only
> property, especially when direct-i/o is not an inode-property. That
> said, I agree the complexity of handling mixed mappings of the same
> file is prohibitive.

We didn't get that choice with direct IO - support via O_DIRECT was
kinda inherited from other OS's(*). We still have all sorts of
coherency problems between buffered/mmap/direct IO on the same file,
and I'd really, really like to avoid making that same mistake again
with DAX.

i.e. We have a choice with DAX right now that will allow us to avoid
coherency problems that we know exist and can't solve right now.
Making DAX an inode property rather than an application context
property avoids those coherence problems as all access will play by
the same rules....

(*)That said, some other OS's did O_DIRECT as an inode property (e.g.
solaris) where O_DIRECT was only done if no other cached operations
were required (e.g. mmap), and so the fd would transparently shift
between buffered and O_DIRECT depending on external accesses to the
inode. This was not liked because of its unpredictable effect on
CPU usage and IO latency....

> Sounds good, get blkdev_issue_flush() functional first and then worry
> about building a more efficient solution on top.

*nod*

Cheers,

Dave.
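For reference, a per-inode DAX gate along the lines Dave describes could branch in the fault path roughly as sketched below. IS_DAX()/S_DAX already exist in this era; the on-disk inode attribute he mentions does not yet, and my_fs_get_block() is a placeholder for a filesystem's block-mapping callback:

```c
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Placeholder for the filesystem's block-mapping callback. */
int my_fs_get_block(struct inode *inode, sector_t iblock,
		    struct buffer_head *bh_result, int create);

/* Sketch: route faults through DAX or the page cache per inode. */
static int my_fs_fault_sketch(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);

	if (IS_DAX(inode))
		/* Page-cache bypass: fault directly on the pmem mapping. */
		return dax_fault(vma, vmf, my_fs_get_block, NULL);

	/* Otherwise, a normal page-cache backed fault. */
	return filemap_fault(vma, vmf);
}
```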
On Tue, Oct 27, 2015 at 09:19:30AM +1100, Dave Chinner wrote:
> On Mon, Oct 26, 2015 at 05:56:30PM +0900, Dan Williams wrote:
> > On Mon, Oct 26, 2015 at 3:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > Also, DAX access isn't a property of mmap - it's a property
> > > of the inode. We cannot do DAX access via mmap while mixing page
> > > cache based access through file descriptor based interfaces. This
> > > I why I'm adding an inode attribute (on disk) to enable per-file DAX
> > > capabilities - either everything is via the DAX paths, or nothing
> > > is.
> > >
> >
> > Per-inode control sounds very useful, I'll look at a similar mechanism
> > for the raw block case.
> >
> > However, still not quite convinced page-cache control is an inode-only
> > property, especially when direct-i/o is not an inode-property. That
> > said, I agree the complexity of handling mixed mappings of the same
> > file is prohibitive.
>
> We didn't get that choice with direct IO - support via O_DIRECT was
> kinda inherited from other OS's(*). We still have all sorts of
> coherency problems between buffered/mmap/direct IO on the same file,
> and I'd really, really like to avoid making that same mistake again
> with DAX.
>
> i.e. We have a choice with DAX right now that will allow us to avoid
> coherency problems that we know exist and can't solve right now.
> Making DAX an inode property rather than an application context
> property avoids those coherence problems as all access will play by
> the same rules....
>
> (*)That said, some other OS's did O_DIRECT as an inode property (e.g.
> solaris) where O_DIRECT was only done if no other cached operations
> were required (e.g. mmap), and so the fd would transparently shift
> between buffered and O_DIRECT depending on external accesses to the
> inode. This was not liked because of its unpredictable effect on
> CPU usage and IO latency....
>
> > Sounds good, get blkdev_issue_flush() functional first and then worry
> > about building a more efficient solution on top.
>
> *nod*

Okay, I'll get this sent out this week. I've been working furiously on
the fsync/msync solution which tracks dirty pages via the radix tree -
I guess I'll send out an RFC version of those patches tomorrow so that
we can begin the review process and any glaring issues can be addressed
soon. That set has grown rather large, though, and I do worry that
making it into v4.4 would be a stretch, although I guess I'm still
holding out hope.
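The radix-tree approach being discussed works roughly like the sketch below (illustrative only, not the actual patch set): write faults tag the DAX entry dirty in the mapping's radix tree, and fsync/msync walks that tag and flushes only the affected ranges. It assumes an entry already exists for the index, and flush_dax_range() is a placeholder for the cache-line writeback step:

```c
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

/* Placeholder for cache-line writeback of the pmem backing this index. */
void flush_dax_range(struct address_space *mapping, pgoff_t index);

/* Called from the DAX write-fault path: remember what got dirtied. */
static void dax_tag_dirty(struct address_space *mapping, pgoff_t index)
{
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}

/* Called from fsync/msync: flush only the ranges tagged dirty. */
static void dax_writeback_sketch(struct address_space *mapping)
{
	struct radix_tree_iter iter;
	void **slot;

	rcu_read_lock();
	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
				   PAGECACHE_TAG_DIRTY)
		flush_dax_range(mapping, iter.index);
	rcu_read_unlock();
}
```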
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 840acd4380d4..4ae8fa55bd1e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1702,9 +1702,15 @@ static const struct address_space_operations def_blk_aops = {
  * ensure the whole-disk device size is stable while requests are in
  * flight.
  *
- * Finally, these paths do not synchronize against freezing
- * (sb_start_pagefault(), etc...) since bdev_sops does not support
- * freezing.
+ * Finally, in contrast to the generic_file_mmap() path, there are no
+ * calls to sb_start_pagefault(). That is meant to synchronize write
+ * faults against requests to freeze the contents of the filesystem
+ * hosting vma->vm_file. However, in the case of a block device special
+ * file, it is a 0-sized device node usually hosted on devtmpfs, i.e.
+ * nothing to do with the super_block for bdev_file_inode(vma->vm_file).
+ * We could call get_super() in this path to retrieve the right
+ * super_block, but the generic_file_mmap() path does not do this for
+ * the CONFIG_FS_DAX=n case.
  */
 static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {