
[v3,14/15] dax: dirty extent notification

Message ID 20151102043058.6610.15559.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State Superseded

Commit Message

Dan Williams Nov. 2, 2015, 4:30 a.m. UTC
DAX-enabled block device drivers can use hints from fs/dax.c to
optimize their internal tracking of potentially dirty CPU cache lines.
If a DAX mapping is being used for synchronous operations, dax_do_io(),
a DAX-enabled block driver knows that fs/dax.c will handle immediate
flushing.  For asynchronous mappings, i.e. those returned to userspace
via mmap, the driver can track active extents of the media for flushing.

We can later extend the DAX paths to indicate when an async mapping is
"closed", allowing the active extents to be marked clean.

Because this capability requires adding two new parameters to
->direct_access ('size' and 'flags'), convert the function to take a
control parameter block.  As a result this also cleans up
dax_map_atomic() usage: there is no longer a need for a separate
__dax_map_atomic(), and the return value can match bdev_direct_access().
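As a rough sketch of the new calling convention (distilled from the
fs/dax.c conversion in the patch below; no API beyond what the patch
adds), a caller now does:

	struct blk_dax_ctl dax = {
		.sector = to_sector(bh, inode),
		.size = bh->b_size,
		.flags = 0,	/* BLKDAX_F_DIRTY when mapping writable to userspace */
	};
	long len;

	len = dax_map_atomic(bdev, &dax);	/* fills dax.addr and dax.pfn */
	if (len < 0)
		return len;			/* negative errno, as from bdev_direct_access() */
	/* ... access up to len bytes at dax.addr / dax.pfn ... */
	dax_unmap_atomic(bdev, &dax);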

No functional change results from this patch, just code movement to the
new parameter scheme.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/sysdev/axonram.c |    9 +-
 drivers/block/brd.c           |   10 +-
 drivers/nvdimm/pmem.c         |   10 +-
 drivers/s390/block/dcssblk.c  |    9 +-
 fs/block_dev.c                |   17 ++--
 fs/dax.c                      |  167 ++++++++++++++++++++++-------------------
 include/linux/blkdev.h        |   24 +++++-
 7 files changed, 136 insertions(+), 110 deletions(-)

Comments

Dave Chinner Nov. 3, 2015, 1:16 a.m. UTC | #1
On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
> DAX-enabled block device drivers can use hints from fs/dax.c to
> optimize their internal tracking of potentially dirty cpu cache lines.
> If a DAX mapping is being used for synchronous operations, dax_do_io(),
> a dax-enabled block-driver knows that fs/dax.c will handle immediate
> flushing.  For asynchronous mappings, i.e.  returned to userspace via
> mmap, the driver can track active extents of the media for flushing.

So, essentially, you are marking the calls into the mapping code
with BLKDAX_F_DIRTY when the mapping is requested for a write page
fault?  Hence allowing the block device to track "dirty pages"
exactly?

But, really, if we're going to use Ross's mapping tree patches that
use exceptional entries to track dirty pfns, why do we need this
special interface from DAX to the block device? Ross's changes will
track mmap'd ranges that are dirtied at the filesystem inode level,
and the fsync/writeback will trigger CPU cache writeback of those
dirty ranges. This will work for block devices that are mapped by
DAX, too, because they have an inode+mapping tree, too.

And if we are going to use Ross's infrastructure (which, once we
work the kinks out, I think we will), we really should change
dax_do_io() to track pfns that are dirtied this way, too. That will
allow us to get rid of all the cache flushing from the DAX layer
(they'll get pushed into fsync/writeback) and so we only take the
CPU cache flushing penalties when synchronous operations are
requested by userspace...

> We can later extend the DAX paths to indicate when an async mapping is
> "closed" allowing the active extents to be marked clean.

Yes, that's a basic feature of Ross's patches. Hence I think this
special case DAX<->bdev interface is the wrong direction to be
taking.

Cheers,

Dave.
Dan Williams Nov. 3, 2015, 4:56 a.m. UTC | #2
On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
>> DAX-enabled block device drivers can use hints from fs/dax.c to
>> optimize their internal tracking of potentially dirty cpu cache lines.
>> If a DAX mapping is being used for synchronous operations, dax_do_io(),
>> a dax-enabled block-driver knows that fs/dax.c will handle immediate
>> flushing.  For asynchronous mappings, i.e.  returned to userspace via
>> mmap, the driver can track active extents of the media for flushing.
>
> So, essentially, you are marking the calls into the mapping calls
> with BLKDAX_F_DIRTY when the mapping is requested for a write page
> fault?  Hence allowing the block device to track "dirty pages"
> exactly?

Not pages, but larger extents (1 extent = 1/NUM_DAX_EXTENTS of the
total storage capacity), because tracking dirty mappings should be a
temporary compatibility hack and not a first class citizen.
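
A rough sketch of that bookkeeping (illustrative only: NUM_DAX_EXTENTS
is the constant referenced above, while the helper and the
dirty_extents bitmap are hypothetical):

/* Illustrative only: map a sector to one of NUM_DAX_EXTENTS coarse
 * regions spanning the device and mark that region dirty.
 * (pmem->dirty_extents is a hypothetical per-device bitmap.) */
static void pmem_mark_dirty(struct pmem_device *pmem, sector_t sector)
{
	sector_t per_extent = DIV_ROUND_UP(pmem->size >> 9, NUM_DAX_EXTENTS);

	set_bit(sector / per_extent, pmem->dirty_extents);
}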

> But, really, if we're going to use Ross's mapping tree patches that
> use exceptional entries to track dirty pfns, why do we need to this
> special interface from DAX to the block device? Ross's changes will
> track mmap'd ranges that are dirtied at the filesytem inode level,
> and the fsync/writeback will trigger CPU cache writeback of those
> dirty ranges. This will work for block devices that are mapped by
> DAX, too, because they have a inode+mapping tree, too.
>
> And if we are going to use Ross's infrastructure (which, when we
> work the kinks out of, I think we will), we really should change
> dax_do_io() to track pfns that are dirtied this way, too. That will
> allow us to get rid of all the cache flushing from the DAX layer
> (they'll get pushed into fsync/writeback) and so we only take the
> CPU cache flushing penalties when synchronous operations are
> requested by userspace...

No, we definitely can't do that.  I think your mental model of the
cache flushing is similar to the disk model where a small buffer is
flushed after a large streaming write.  Both Ross' patches and my
approach suffer from the same horror that the cache flushing is O(N)
currently, so we don't want to make it responsible for more data
ranges than is strictly necessary.
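
Concretely, flushing a range on x86 today means walking it cacheline by
cacheline, roughly what the kernel's clflush_cache_range() helper does
(a sketch, x86 kernel context assumed):

/* Sketch (x86 kernel context): essentially what clflush_cache_range()
 * does -- one flush instruction per cacheline in the range. */
static void flush_dax_range(void *vaddr, size_t size)
{
	unsigned int line = boot_cpu_data.x86_clflush_size;
	void *end = vaddr + size;
	void *p = (void *)((unsigned long)vaddr & ~(unsigned long)(line - 1));

	mb();
	for (; p < end; p += line)
		clflush(p);	/* or clflushopt/clwb where available */
	mb();
}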

>> We can later extend the DAX paths to indicate when an async mapping is
>> "closed" allowing the active extents to be marked clean.
>
> Yes, that's a basic feature of Ross's patches. Hence I think this
> special case DAX<->bdev interface is the wrong direction to be
> taking.

So here's my problem with the "track dirty mappings" in the core
mm/vfs approach: it's harder to unwind and delete when it turns out no
application actually needs it, or the platform gives us an O(1) flush
method that is independent of dirty pte tracking.

We have the NVML [1] library as the recommended method for
applications to interact with persistent memory, and it is not using
fsync/msync for its synchronization primitives; it's managing the
cache directly.  The *only* user for tracking dirty DAX mappings is
unmodified legacy applications that do mmap I/O and call fsync/msync.
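
For reference, the NVML/libpmem model from userspace looks roughly
like this (a sketch; pmem_persist() is the libpmem call, the
surrounding helper is illustrative):

#include <string.h>
#include <libpmem.h>

/* Userspace sketch: with a DAX mmap, the library flushes CPU caches
 * itself -- no fsync()/msync() round trip into the kernel. */
static void store_record(void *pmem_dst, const void *src, size_t len)
{
	memcpy(pmem_dst, src, len);
	pmem_persist(pmem_dst, len);	/* cache flush + fence from userspace */
}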

DAX in my opinion is not a transparent accelerator of all existing
apps; it's a targeted mechanism for applications ready to take
advantage of byte addressable persistent memory.  This is why I'm a
big supporter of your per-inode DAX control proposal.  The fact that
fsync is painful for large amounts of dirty data is a feature.  It
detects inodes that should have had DAX disabled in the first
instance.  The only advantage of the radix approach is that the second
fsync after the big hit may be faster, but that still can't beat
either targeted disabling of DAX or updating the app to use NVML.

So, again, I remain unconvinced that we need to carry complexity
in the core kernel when we have the page cache to cover those cases.
The driver solution is a minimal extension of the data
bdev_direct_access() is already sending down to the driver, and covers
the gap without mm/fs entanglements while we figure out a longer-term
solution.

[1]: https://github.com/pmem/nvml
Dave Chinner Nov. 3, 2015, 5:40 a.m. UTC | #3
On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
> >> DAX-enabled block device drivers can use hints from fs/dax.c to
> >> optimize their internal tracking of potentially dirty cpu cache lines.
> >> If a DAX mapping is being used for synchronous operations, dax_do_io(),
> >> a dax-enabled block-driver knows that fs/dax.c will handle immediate
> >> flushing.  For asynchronous mappings, i.e.  returned to userspace via
> >> mmap, the driver can track active extents of the media for flushing.
> >
> > So, essentially, you are marking the calls into the mapping calls
> > with BLKDAX_F_DIRTY when the mapping is requested for a write page
> > fault?  Hence allowing the block device to track "dirty pages"
> > exactly?
> 
> Not pages, but larger extents (1 extent = 1/NUM_DAX_EXTENTS of the
> total storage capacity), because tracking dirty mappings should be
> temporary compatibility hack and not a first class citizen.
> 
> > But, really, if we're going to use Ross's mapping tree patches that
> > use exceptional entries to track dirty pfns, why do we need to this
> > special interface from DAX to the block device? Ross's changes will
> > track mmap'd ranges that are dirtied at the filesytem inode level,
> > and the fsync/writeback will trigger CPU cache writeback of those
> > dirty ranges. This will work for block devices that are mapped by
> > DAX, too, because they have a inode+mapping tree, too.
> >
> > And if we are going to use Ross's infrastructure (which, when we
> > work the kinks out of, I think we will), we really should change
> > dax_do_io() to track pfns that are dirtied this way, too. That will
> > allow us to get rid of all the cache flushing from the DAX layer
> > (they'll get pushed into fsync/writeback) and so we only take the
> > CPU cache flushing penalties when synchronous operations are
> > requested by userspace...
> 
> No, we definitely can't do that.   I think your mental model of the
> cache flushing is similar to the disk model where a small buffer is
> flushed after a large streaming write.  Both Ross' patches and my
> approach suffer from the same horror that the cache flushing is O(N)
> currently, so we don't want to make it responsible for more data
> ranges areas than is strictly necessary.

I didn't see anything that was O(N) in Ross's patches. What part of
the fsync algorithm that Ross proposed are you referring to here?

> >> We can later extend the DAX paths to indicate when an async mapping is
> >> "closed" allowing the active extents to be marked clean.
> >
> > Yes, that's a basic feature of Ross's patches. Hence I think this
> > special case DAX<->bdev interface is the wrong direction to be
> > taking.
> 
> So here's my problem with the "track dirty mappings" in the core
> mm/vfs approach, it's harder to unwind and delete when it turns out no
> application actually needs it, or the platform gives us an O(1) flush
> method that is independent of dirty pte tracking.
> 
> We have the NVML [1] library as the recommended method for
> applications to interact with persistent memory and it is not using
> fsync/msync for its synchronization primitives, it's managing the
> cache directly.  The *only* user for tracking dirty DAX mappings is
> unmodified legacy applications that do mmap I/O and call fsync/msync.

I'm pretty sure there are going to be many people still writing new
applications that use POSIX APIs they expect to work correctly on
pmem because, well, it's going to take 10 years before persistent
memory is common enough for most application developers to only
target storage via NVML.

The whole world is not crazy HFT applications that need to bypass
the kernel for *everything* because even a few nanoseconds of extra
latency matters.

> DAX in my opinion is not a transparent accelerator of all existing
> apps, it's a targeted mechanism for applications ready to take
> advantage of byte addressable persistent memory. 

And this is where we disagree. DAX is a method of allowing POSIX
compliant applications to get the best of both worlds - portability
with existing storage and filesystems, yet with the speed and byte
addressability of persistent storage through the use of mmap.

Applications designed specifically for persistent memory don't want
a general purpose, POSIX compatible filesystem underneath them. They
should be interacting directly with, and only with, your NVML
library. If the NVML library is implemented by using DAX on a POSIX
compatible, general purpose filesystem, then you're just going to
have to live with everything we need to do to make DAX work with
general purpose POSIX compatible applications.

DAX has always been intended as a *stopgap measure* designed to
bridge the gap between existing POSIX based storage APIs and PMEM
native filesystem implementations. You're advocating that DAX should
only be used by PMEM native applications using NVML and then saying
anything that might be needed for POSIX compatible behaviour is
unacceptable overhead...

> This is why I'm a
> big supporter of your per-inode DAX control proposal.  The fact that
> fsync is painful for large amounts of dirty data is a feature.  It
> detects inodes that should have had DAX-disabled in the first
> instance.

fsync is painful for any storage when there are large amounts of
dirty data. DAX is no different, and it's not a reason for saying
"don't use DAX". DAX + fsync should be faster than "buffered IO
through the page cache on pmem + fsync" because there is only one
memory copy being done in the DAX case.

The buffered IO case has all that per-page radix tree tracking in it,
writeback, etc. Yet:

# mount -o dax /dev/ram0 /mnt/scratch
# time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
wrote 3221225472/3221225472 bytes at offset 0
3.000 GiB, 384 ops; 0:00:10.00 (305.746 MiB/sec and 38.2182 ops/sec)
0.00user 10.05system 0:10.05elapsed 100%CPU (0avgtext+0avgdata 10512maxresident)k
0inputs+0outputs (0major+2156minor)pagefaults 0swaps
# umount /mnt/scratch
# mount /dev/ram0 /mnt/scratch
# time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
wrote 3221225472/3221225472 bytes at offset 0
3.000 GiB, 384 ops; 0:00:02.00 (1.218 GiB/sec and 155.9046 ops/sec)
0.00user 2.83system 0:02.86elapsed 99%CPU (0avgtext+0avgdata 10468maxresident)k
0inputs+0outputs (0major+2154minor)pagefaults 0swaps
#

So don't tell me that tracking dirty pages in the radix tree is too
slow for DAX and that DAX should not be used for POSIX IO based
applications - it should be as fast as buffered IO, if not faster,
and if it isn't then we've screwed up real bad. And right now, we're
screwing up real bad.

Cheers,

Dave.
Dan Williams Nov. 3, 2015, 7:20 a.m. UTC | #4
On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> No, we definitely can't do that.   I think your mental model of the
>> cache flushing is similar to the disk model where a small buffer is
>> flushed after a large streaming write.  Both Ross' patches and my
>> approach suffer from the same horror that the cache flushing is O(N)
>> currently, so we don't want to make it responsible for more data
>> ranges areas than is strictly necessary.
>
> I didn't see anything that was O(N) in Ross's patches. What part of
> the fsync algorithm that Ross proposed are you refering to here?

We have to issue clflush per touched virtual address, rather than over
a constant number of physical cache ways or via a flush-all instruction.

>> >> We can later extend the DAX paths to indicate when an async mapping is
>> >> "closed" allowing the active extents to be marked clean.
>> >
>> > Yes, that's a basic feature of Ross's patches. Hence I think this
>> > special case DAX<->bdev interface is the wrong direction to be
>> > taking.
>>
>> So here's my problem with the "track dirty mappings" in the core
>> mm/vfs approach, it's harder to unwind and delete when it turns out no
>> application actually needs it, or the platform gives us an O(1) flush
>> method that is independent of dirty pte tracking.
>>
>> We have the NVML [1] library as the recommended method for
>> applications to interact with persistent memory and it is not using
>> fsync/msync for its synchronization primitives, it's managing the
>> cache directly.  The *only* user for tracking dirty DAX mappings is
>> unmodified legacy applications that do mmap I/O and call fsync/msync.
>
> I'm pretty sure there are going to be many people still writing new
> applications that use POSIX APIs they expect to work correctly on
> pmem because, well, it's going to take 10 years before persistent
> memory is common enough for most application developers to only
> target storage via NVML.
>
> The whole world is not crazy HFT applications that need to bypass
> the kernel for *everything* because even a few nanoseconds of extra
> latency matters.

I agree with all of that...

>> DAX in my opinion is not a transparent accelerator of all existing
>> apps, it's a targeted mechanism for applications ready to take
>> advantage of byte addressable persistent memory.
>
> And this is where we disagree. DAX is a method of allowing POSIX
> compliant applications get the best of both worlds - portability
> with existing storage and filesystems, yet with the speed and byte
> addressiblity of persistent storage through the use of mmap.
>
> Applications designed specifically for persistent memory don't want
> a general purpose, POSIX compatible filesystem underneath them. The
> should be interacting directly with, and only with, your NVML
> library. If the NVML library is implemented by using DAX on a POSIX
> compatible, general purpose filesystem, then you're just going to
> have to live with everything we need to do to make DAX work with
> general purpose POSIX compatible applications.
>
> DAX has always been intended as a *stopgap measure* designed to
> bridge the gap between existing POSIX based storage APIs and PMEM
> native filesystem implementations. You're advocating that DAX should
> only be used by PMEM native applications using NVML and then saying
> anything that might be needed for POSIX compatible behaviour is
> unacceptible overhead...

Also agreed, up until this last sentence, which is not what I am
saying at all.  I didn't say it is unacceptable overhead; my solution
in the driver has the exact same overhead.

Where I instead think we disagree is on the acceptable cost of the
"flush cache" operation before the recommended solution is to locally
disable DAX, or to require help from the platform to do this operation
more efficiently.  What I submit is unacceptable is to have the CPU
loop over every address heading out to storage.  The radix solution
only makes the second fsync after the first potentially less costly
over time.

I don't think we'll need it long term, or so I hope. The question
becomes: do we want to carry this complexity in the core, or push
selectively disabling DAX in the interim and keep the simple driver
approach for cases where it's not feasible to disable DAX?  For 4.4 we
have the practical matter of not having the time to get mm folks to
review the radix approach.

I'm not opposed to ripping out the driver solution in 4.5 when we have
the time to get Ross' implementation reviewed.  I'm also holding back
the get_user_page() patches until 4.5, and given the big fat comment in
write_protect_page() about gup-fast interactions, we'll need to think
through similar implications.

>
>> This is why I'm a
>> big supporter of your per-inode DAX control proposal.  The fact that
>> fsync is painful for large amounts of dirty data is a feature.  It
>> detects inodes that should have had DAX-disabled in the first
>> instance.
>
> fsync is painful for any storage when there is large amounts of
> dirty data. DAX is no different, and it's not a reason for saying
> "don't use DAX". DAX + fsync should be faster than "buffered IO
> through the page cache on pmem + fsync" because there is only one
> memory copy being done in the DAX case.
>
> The buffered IO case has all that per-page radix tree tracking in it,
> writeback, etc. Yet:
>
> # mount -o dax /dev/ram0 /mnt/scratch
> # time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
> wrote 3221225472/3221225472 bytes at offset 0
> 3.000 GiB, 384 ops; 0:00:10.00 (305.746 MiB/sec and 38.2182 ops/sec)
> 0.00user 10.05system 0:10.05elapsed 100%CPU (0avgtext+0avgdata 10512maxresident)k
> 0inputs+0outputs (0major+2156minor)pagefaults 0swaps
> # umount /mnt/scratch
> # mount /dev/ram0 /mnt/scratch
> # time xfs_io -fc "truncate 0" -c "pwrite -b 8m 0 3g" -c fsync /mnt/scratch/file
> wrote 3221225472/3221225472 bytes at offset 0
> 3.000 GiB, 384 ops; 0:00:02.00 (1.218 GiB/sec and 155.9046 ops/sec)
> 0.00user 2.83system 0:02.86elapsed 99%CPU (0avgtext+0avgdata 10468maxresident)k
> 0inputs+0outputs (0major+2154minor)pagefaults 0swaps
> #
>
> So don't tell me that tracking dirty pages in the radix tree too
> slow for DAX and that DAX should not be used for POSIX IO based
> applications - it should be as fast as buffered IO, if not faster,
> and if it isn't then we've screwed up real bad. And right now, we're
> screwing up real bad.

Again, it's not the dirty tracking in the radix I'm worried about, it's
looping through all the virtual addresses within those pages.
Dave Chinner Nov. 3, 2015, 8:51 p.m. UTC | #5
On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> >> No, we definitely can't do that.   I think your mental model of the
> >> cache flushing is similar to the disk model where a small buffer is
> >> flushed after a large streaming write.  Both Ross' patches and my
> >> approach suffer from the same horror that the cache flushing is O(N)
> >> currently, so we don't want to make it responsible for more data
> >> ranges areas than is strictly necessary.
> >
> > I didn't see anything that was O(N) in Ross's patches. What part of
> > the fsync algorithm that Ross proposed are you refering to here?
> 
> We have to issue clflush per touched virtual address rather than a
> constant number of physical ways, or a flush-all instruction.
.....
> > So don't tell me that tracking dirty pages in the radix tree too
> > slow for DAX and that DAX should not be used for POSIX IO based
> > applications - it should be as fast as buffered IO, if not faster,
> > and if it isn't then we've screwed up real bad. And right now, we're
> > screwing up real bad.
> 
> Again, it's not the dirty tracking in the radix I'm worried about it's
> looping through all the virtual addresses within those pages..

So, let me summarise what I think you've just said. You are

1. fine with looping through the virtual addresses doing cache flushes
   synchronously when doing IO despite it having significant
   latency and performance costs.

2. Happy to hack a method into DAX to bypass the filesystems by
   pushing information to the block device for it to track regions that
   need cache flushes, then add infrastructure to the block device to
   track those dirty regions and then walk those addresses and issue
   cache flushes when the filesystem issues a REQ_FLUSH IO regardless
   of whether the filesystem actually needs those cachelines flushed
   for that specific IO?

3. Not happy to use the generic mm/vfs level infrastructure
   architected specifically to provide the exact asynchronous
   cache flushing/writeback semantics we require because it will
   cause too many cache flushes, even though the number of cache
   flushes will be, at worst, the same as in 2).


1) will work, but as we can see it is *slow*. 3) is what Ross is
implementing - it's a tried and tested architecture that all mm/fs
developers understand, and his explanation of why it will work for
pmem is pretty solid and completely platform/hardware architecture
independent.

Which leaves this question: How does 2) save us anything in terms of
avoiding iterating virtual addresses and issuing cache flushes
over 3)? And is it sufficient to justify hacking a bypass into DAX
and the additional driver level complexity of having to add dirty
region tracking, flushing and cleaning to REQ_FLUSH operations?

Cheers,

Dave.
Ross Zwisler Nov. 3, 2015, 9:18 p.m. UTC | #6
On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
<>
> > Yes, that's a basic feature of Ross's patches. Hence I think this
> > special case DAX<->bdev interface is the wrong direction to be
> > taking.
> 
> So here's my problem with the "track dirty mappings" in the core
> mm/vfs approach, it's harder to unwind and delete when it turns out no
> application actually needs it, or the platform gives us an O(1) flush
> method that is independent of dirty pte tracking.

I don't think that we'll ever be able to "unwind and delete" the dirty page
tracking.  Even if *some* platform gives us an O(1) flush method independent
of PTE tracking, that will be an optimized path that will need to live
alongside the path that we'll need to keep for other architectures (like all
the ones that exist today).
Dan Williams Nov. 3, 2015, 9:19 p.m. UTC | #7
On Tue, Nov 3, 2015 at 12:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> >> No, we definitely can't do that.   I think your mental model of the
>> >> cache flushing is similar to the disk model where a small buffer is
>> >> flushed after a large streaming write.  Both Ross' patches and my
>> >> approach suffer from the same horror that the cache flushing is O(N)
>> >> currently, so we don't want to make it responsible for more data
>> >> ranges areas than is strictly necessary.
>> >
>> > I didn't see anything that was O(N) in Ross's patches. What part of
>> > the fsync algorithm that Ross proposed are you refering to here?
>>
>> We have to issue clflush per touched virtual address rather than a
>> constant number of physical ways, or a flush-all instruction.
> .....
>> > So don't tell me that tracking dirty pages in the radix tree too
>> > slow for DAX and that DAX should not be used for POSIX IO based
>> > applications - it should be as fast as buffered IO, if not faster,
>> > and if it isn't then we've screwed up real bad. And right now, we're
>> > screwing up real bad.
>>
>> Again, it's not the dirty tracking in the radix I'm worried about it's
>> looping through all the virtual addresses within those pages..
>
> So, let me summarise what I think you've just said. You are
>
> 1. fine with looping through the virtual addresses doing cache flushes
>    synchronously when doing IO despite it having significant
>    latency and performance costs.

No, like I said in the blkdev_issue_zeroout thread, we need to replace
looping flushes with non-temporal stores and delayed wmb_pmem()
wherever possible.
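
In other words, something like the following pattern (a sketch:
memcpy_to_pmem() and wmb_pmem() are the existing pmem API, the
batching helper itself is illustrative):

/* Sketch: write a batch with non-temporal stores and issue a single
 * delayed barrier, instead of a clflush loop after each cached copy. */
static void pmem_write_batch(void __pmem *dst, void **bufs, size_t *lens,
		int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		memcpy_to_pmem(dst, bufs[i], lens[i]);	/* movnt-based on x86 */
		dst += lens[i];
	}
	wmb_pmem();	/* one drain/ordering point for the whole batch */
}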

> 2. Happy to hack a method into DAX to bypass the filesystems by
>    pushing information to the block device for it to track regions that
>    need cache flushes, then add infrastructure to the block device to
>    track those dirty regions and then walk those addresses and issue
>    cache flushes when the filesystem issues a REQ_FLUSH IO regardless
>    of whether the filesystem actually needs those cachelines flushed
>    for that specific IO?

I'm happier with a temporary driver level hack than a temporary core
kernel change.  This requirement to flush by virtual address is
something that, in my opinion, must be addressed by the platform with
a reliable global flush or by walking a small constant number of
physical-cache-ways.  I think we're getting ahead of ourselves jumping
to solving this in the core kernel while the question of how to do
efficient large flushes is still pending.

> 3. Not happy to use the generic mm/vfs level infrastructure
>    architectected specifically to provide the exact asynchronous
>    cache flushing/writeback semantics we require because it will
>    cause too many cache flushes, even though the number of cache
>    flushes will be, at worst, the same as in 2).

Correct, because if/when a platform solution arrives, the need to track
dirty pfns evaporates.

> 1) will work, but as we can see it is *slow*. 3) is what Ross is
> implementing - it's a tried and tested architecture that all mm/fs
> developers understand, and his explanation of why it will work for
> pmem is pretty solid and completely platform/hardware architecture
> independent.
>
> Which leaves this question: How does 2) save us anything in terms of
> avoiding iterating virtual addresses and issuing cache flushes
> over 3)? And is it sufficient to justify hacking a bypass into DAX
> and the additional driver level complexity of having to add dirty
> region tracking, flushing and cleaning to REQ_FLUSH operations?
>

Given that what we are talking about amounts to a hardware workaround, I
think that kind of logic belongs in a driver.  If the cache flushing
gets fixed and we stop needing to track individual cachelines, the
flush implementation will look and feel much more like existing
storage drivers.
Dan Williams Nov. 3, 2015, 9:34 p.m. UTC | #8
On Tue, Nov 3, 2015 at 1:18 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sun, Nov 01, 2015 at 11:30:58PM -0500, Dan Williams wrote:
> <>
>> > Yes, that's a basic feature of Ross's patches. Hence I think this
>> > special case DAX<->bdev interface is the wrong direction to be
>> > taking.
>>
>> So here's my problem with the "track dirty mappings" in the core
>> mm/vfs approach, it's harder to unwind and delete when it turns out no
>> application actually needs it, or the platform gives us an O(1) flush
>> method that is independent of dirty pte tracking.
>
> I don't think that we'll ever be able to "unwind and delete" the dirty page
> tracking.  Even if *some* platform gives us an 0(1) flush method independent
> of PTE tracking, that will be an optimized path that will need to live along
> side the path that we'll need to keep for other architectures (like all the
> ones that exist today).

Other architectures are in worse shape when we start talking about
virtually tagged caches.  If an architecture can't support DAX
efficiently, I don't know why we would jump through hoops in the core
to allow it to use DAX.  In the interim, to make forward progress, we
have a safe workaround in the driver, and when the reports come in
about fsync taking too much time, the response is "switch from fsync to
NVML" or "turn off DAX (ideally at the inode)".  What's broken about
those mitigation options?
Ross Zwisler Nov. 3, 2015, 9:37 p.m. UTC | #9
On Wed, Nov 04, 2015 at 07:51:31AM +1100, Dave Chinner wrote:
> On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
> > On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
> > >> No, we definitely can't do that.   I think your mental model of the
> > >> cache flushing is similar to the disk model where a small buffer is
> > >> flushed after a large streaming write.  Both Ross' patches and my
> > >> approach suffer from the same horror that the cache flushing is O(N)
> > >> currently, so we don't want to make it responsible for more data
> > >> ranges areas than is strictly necessary.
> > >
> > > I didn't see anything that was O(N) in Ross's patches. What part of
> > > the fsync algorithm that Ross proposed are you refering to here?
> > 
> > We have to issue clflush per touched virtual address rather than a
> > constant number of physical ways, or a flush-all instruction.
> .....
> > > So don't tell me that tracking dirty pages in the radix tree too
> > > slow for DAX and that DAX should not be used for POSIX IO based
> > > applications - it should be as fast as buffered IO, if not faster,
> > > and if it isn't then we've screwed up real bad. And right now, we're
> > > screwing up real bad.
> > 
> > Again, it's not the dirty tracking in the radix I'm worried about it's
> > looping through all the virtual addresses within those pages..
> 
> So, let me summarise what I think you've just said. You are
> 
> 1. fine with looping through the virtual addresses doing cache flushes
>    synchronously when doing IO despite it having significant
>    latency and performance costs.
> 
> 2. Happy to hack a method into DAX to bypass the filesystems by
>    pushing information to the block device for it to track regions that
>    need cache flushes, then add infrastructure to the block device to
>    track those dirty regions and then walk those addresses and issue
>    cache flushes when the filesystem issues a REQ_FLUSH IO regardless
>    of whether the filesystem actually needs those cachelines flushed
>    for that specific IO?
> 
> 3. Not happy to use the generic mm/vfs level infrastructure
>    architectected specifically to provide the exact asynchronous
>    cache flushing/writeback semantics we require because it will
>    cause too many cache flushes, even though the number of cache
>    flushes will be, at worst, the same as in 2).
> 
> 
> 1) will work, but as we can see it is *slow*. 3) is what Ross is
> implementing - it's a tried and tested architecture that all mm/fs
> developers understand, and his explanation of why it will work for
> pmem is pretty solid and completely platform/hardware architecture
> independent.
> 
> Which leaves this question: How does 2) save us anything in terms of
> avoiding iterating virtual addresses and issuing cache flushes
> over 3)? And is it sufficient to justify hacking a bypass into DAX
> and the additional driver level complexity of having to add dirty
> region tracking, flushing and cleaning to REQ_FLUSH operations?

I also don't see a benefit of pushing this into the driver.  The generic
writeback infrastructure that is already in place seems to fit perfectly with
what we are trying to do.  I feel like putting the flushing infrastructure
into the driver, as with my first failed attempt at msync support, ends up
solving one aspect of the problem in a non-generic way that is ultimately
fatally flawed.

The driver inherently doesn't have enough information to solve this problem -
we really do need to involve the filesystem and mm layers.  For example:

1) The driver can't easily mark regions as clean once they have been flushed,
meaning that every time you dirty data you add to an ever increasing list of
things that will be flushed on the next REQ_FLUSH.

2) The driver doesn't know how inodes map to blocks, so when you get a
REQ_FLUSH for an fsync you end up flushing the dirty regions for *the entire
block device*, not just the one inode.

3) The driver doesn't understand how mmap ranges map to block regions, so if
someone msyncs a single page (causing a REQ_FLUSH) on a single mmap you will
once again flush every region that has ever been dirtied on the entire block
device.

Each of these cases is handled by the existing writeback infrastructure.  I'm
strongly in favor of waiting and solving this issue with the radix tree
patches.
Dan Williams Nov. 3, 2015, 9:43 p.m. UTC | #10
On Tue, Nov 3, 2015 at 1:37 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Nov 04, 2015 at 07:51:31AM +1100, Dave Chinner wrote:
>> On Mon, Nov 02, 2015 at 11:20:49PM -0800, Dan Williams wrote:
>> > On Mon, Nov 2, 2015 at 9:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > > On Mon, Nov 02, 2015 at 08:56:24PM -0800, Dan Williams wrote:
>> > >> No, we definitely can't do that.   I think your mental model of the
>> > >> cache flushing is similar to the disk model where a small buffer is
>> > >> flushed after a large streaming write.  Both Ross' patches and my
>> > >> approach suffer from the same horror that the cache flushing is O(N)
>> > >> currently, so we don't want to make it responsible for more data
>> > >> ranges areas than is strictly necessary.
>> > >
>> > > I didn't see anything that was O(N) in Ross's patches. What part of
>> > > the fsync algorithm that Ross proposed are you refering to here?
>> >
>> > We have to issue clflush per touched virtual address rather than a
>> > constant number of physical ways, or a flush-all instruction.
>> .....
>> > > So don't tell me that tracking dirty pages in the radix tree too
>> > > slow for DAX and that DAX should not be used for POSIX IO based
>> > > applications - it should be as fast as buffered IO, if not faster,
>> > > and if it isn't then we've screwed up real bad. And right now, we're
>> > > screwing up real bad.
>> >
>> > Again, it's not the dirty tracking in the radix I'm worried about it's
>> > looping through all the virtual addresses within those pages..
>>
>> So, let me summarise what I think you've just said. You are
>>
>> 1. fine with looping through the virtual addresses doing cache flushes
>>    synchronously when doing IO despite it having significant
>>    latency and performance costs.
>>
>> 2. Happy to hack a method into DAX to bypass the filesystems by
>>    pushing information to the block device for it to track regions that
>>    need cache flushes, then add infrastructure to the block device to
>>    track those dirty regions and then walk those addresses and issue
>>    cache flushes when the filesystem issues a REQ_FLUSH IO regardless
>>    of whether the filesystem actually needs those cachelines flushed
>>    for that specific IO?
>>
>> 3. Not happy to use the generic mm/vfs level infrastructure
>>    architectected specifically to provide the exact asynchronous
>>    cache flushing/writeback semantics we require because it will
>>    cause too many cache flushes, even though the number of cache
>>    flushes will be, at worst, the same as in 2).
>>
>>
>> 1) will work, but as we can see it is *slow*. 3) is what Ross is
>> implementing - it's a tried and tested architecture that all mm/fs
>> developers understand, and his explanation of why it will work for
>> pmem is pretty solid and completely platform/hardware architecture
>> independent.
>>
>> Which leaves this question: How does 2) save us anything in terms of
>> avoiding iterating virtual addresses and issuing cache flushes
>> over 3)? And is it sufficient to justify hacking a bypass into DAX
>> and the additional driver level complexity of having to add dirty
>> region tracking, flushing and cleaning to REQ_FLUSH operations?
>
> I also don't see a benefit of pushing this into the driver.  The generic
> writeback infrastructure that is already in place seems to fit perfectly with
> what we are trying to do.  I feel like putting the flushing infrastructure
> into the driver, as with my first failed attempt at msync support, ends up
> solving one aspect of the problem in a non-generic way that is ultimately
> fatally flawed.
>
> The driver inherently doesn't have enough information to solve this problem -
> we really do need to involve the filesystem and mm layers.  For example:
>
> 1) The driver can't easily mark regions as clean once they have been flushed,
> meaning that every time you dirty data you add to an ever increasing list of
> things that will be flushed on the next REQ_FLUSH.
>
> 2) The driver doesn't know how inodes map to blocks, so when you get a
> REQ_FLUSH for an fsync you end up flushing the dirty regions for *the entire
> block device*, not just the one inode.
>
> 3) The driver doesn't understand how mmap ranges map to block regions, so if
> someone msyncs a single page (causing a REQ_FLUSH) on a single mmap you will
> once again flush every region that has ever been dirtied on the entire block
> device.
>
> Each of these cases is handled by the existing writeback infrastructure.  I'm
> strongly in favor of waiting and solving this issue with the radix tree
> patches.

Again, all of these holes are mitigated by turning off DAX or fixing
the app.  The radix solution does nothing to address the worst-case
flushing and will spin, single-threaded, flushing the world.  So I
remain strongly against the core change, but that's ultimately not my
call.

Looks like we're leaving this broken for 4.4...

Patch

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 59ca4c0ab529..11aeb47a6540 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -140,14 +140,13 @@  axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  * @device, @sector, @data: see block_device_operations method
  */
 static long
-axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void __pmem **kaddr, pfn_t *pfn)
+axon_ram_direct_access(struct block_device *device, struct blk_dax_ctl *dax)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
-	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
+	loff_t offset = (loff_t)dax->sector << AXON_RAM_SECTOR_SHIFT;
 
-	*kaddr = (void __pmem __force *) bank->io_addr + offset;
-	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
+	dax->addr = (void __pmem __force *) bank->io_addr + offset;
+	dax->pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 	return bank->size - offset;
 }
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0bbc60463779..686e1e7a5973 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -373,19 +373,19 @@  static int brd_rw_page(struct block_device *bdev, sector_t sector,
 }
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
-static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **kaddr, pfn_t *pfn)
+static long brd_direct_access(struct block_device *bdev,
+		struct blk_dax_ctl *dax)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
 
 	if (!brd)
 		return -ENODEV;
-	page = brd_insert_page(brd, sector);
+	page = brd_insert_page(brd, dax->sector);
 	if (!page)
 		return -ENOSPC;
-	*kaddr = (void __pmem *)page_address(page);
-	*pfn = page_to_pfn_t(page);
+	dax->addr = (void __pmem *)page_address(page);
+	dax->pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index aa2f1292120a..3d83f3079602 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -103,14 +103,14 @@  static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
-		      void __pmem **kaddr, pfn_t *pfn)
+static long pmem_direct_access(struct block_device *bdev,
+		struct blk_dax_ctl *dax)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
-	resource_size_t offset = sector * 512 + pmem->data_offset;
+	resource_size_t offset = dax->sector * 512 + pmem->data_offset;
 
-	*kaddr = pmem->virt_addr + offset;
-	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
+	dax->addr = pmem->virt_addr + offset;
+	dax->pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
 	return pmem->size - offset;
 }
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index e2b2839e4de5..6b01f56373e0 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -880,8 +880,7 @@  fail:
 }
 
 static long
-dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void __pmem **kaddr, pfn_t *pfn)
+dcssblk_direct_access (struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
@@ -890,9 +889,9 @@  dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
 	if (!dev_info)
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
-	offset = secnum * 512;
-	*kaddr = (void __pmem *) (dev_info->start + offset);
-	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
+	offset = dax->sector * 512;
+	dax->addr = (void __pmem *) (dev_info->start + offset);
+	dax->pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/block_dev.c b/fs/block_dev.c
index ee34a31e6fa4..d1b0bbf00bd3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -453,10 +453,7 @@  EXPORT_SYMBOL_GPL(bdev_write_page);
 /**
  * bdev_direct_access() - Get the address for directly-accessibly memory
  * @bdev: The device containing the memory
- * @sector: The offset within the device
- * @addr: Where to put the address of the memory
- * @pfn: The Page Frame Number for the memory
- * @size: The number of bytes requested
+ * @ctl: control and output parameters for ->direct_access
  *
  * If a block device is made up of directly addressable memory, this function
  * will tell the caller the PFN and the address of the memory.  The address
@@ -467,10 +464,10 @@  EXPORT_SYMBOL_GPL(bdev_write_page);
  * Return: negative errno if an error occurs, otherwise the number of bytes
  * accessible at this address.
  */
-long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, pfn_t *pfn, long size)
+long bdev_direct_access(struct block_device *bdev, struct blk_dax_ctl *ctl)
 {
-	long avail;
+	sector_t sector, save;
+	long avail, size = ctl->size;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	/*
@@ -479,6 +476,8 @@  long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	 */
 	might_sleep();
 
+	save = ctl->sector;
+	sector = ctl->sector;
 	if (size < 0)
 		return size;
 	if (!ops->direct_access)
@@ -489,7 +488,9 @@  long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	avail = ops->direct_access(bdev, sector, addr, pfn);
+	ctl->sector = sector;
+	avail = ops->direct_access(bdev, ctl);
+	ctl->sector = save;
 	if (!avail)
 		return -ERANGE;
 	return min(avail, size);
diff --git a/fs/dax.c b/fs/dax.c
index ac8992e86779..f5835c4a7e1f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,36 +30,28 @@ 
 #include <linux/vmstat.h>
 #include <linux/sizes.h>
 
-static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
-		long size, pfn_t *pfn, long *len)
+static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
-	long rc;
-	void __pmem *addr;
 	struct request_queue *q = bdev->bd_queue;
+	long rc = -EIO;
 
+	dax->addr = (void __pmem *) ERR_PTR(-EIO);
 	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
-		return (void __pmem *) ERR_PTR(-EIO);
-	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
-	if (len)
-		*len = rc;
+		return rc;
+
+	rc = bdev_direct_access(bdev, dax);
 	if (rc < 0) {
+		dax->addr = (void __pmem *) ERR_PTR(rc);
 		blk_queue_exit(q);
-		return (void __pmem *) ERR_PTR(rc);
+		return rc;
 	}
-	return addr;
-}
-
-static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
-		long size)
-{
-	pfn_t pfn;
-
-	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
+	return rc;
 }
 
-static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
+static void dax_unmap_atomic(struct block_device *bdev,
+		const struct blk_dax_ctl *dax)
 {
-	if (IS_ERR(addr))
+	if (IS_ERR(dax->addr))
 		return;
 	blk_queue_exit(bdev->bd_queue);
 }
@@ -67,28 +59,29 @@  static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
 	struct block_device *bdev = inode->i_sb->s_bdev;
-	sector_t sector = block << (inode->i_blkbits - 9);
+	struct blk_dax_ctl dax;
 
 	might_sleep();
+	dax.sector = block << (inode->i_blkbits - 9),
+	dax.flags = 0;
+	dax.size = size;
 	do {
-		void __pmem *addr;
 		long count, sz;
-		pfn_t pfn;
 
 		sz = min_t(long, size, SZ_1M);
-		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
-		if (IS_ERR(addr))
-			return PTR_ERR(addr);
+		count = dax_map_atomic(bdev, &dax);
+		if (count < 0)
+			return count;
 		if (count < sz)
 			sz = count;
-		clear_pmem(addr, sz);
-		addr += sz;
-		size -= sz;
+		clear_pmem(dax.addr, sz);
+		dax_unmap_atomic(bdev, &dax);
+		dax.addr += sz;
+		dax.size -= sz;
 		BUG_ON(sz & 511);
-		sector += sz / 512;
-		dax_unmap_atomic(bdev, addr);
+		dax.sector += sz / 512;
 		cond_resched();
-	} while (size);
+	} while (dax.size);
 
 	wmb_pmem();
 	return 0;
@@ -141,9 +134,11 @@  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	struct block_device *bdev = NULL;
 	int rw = iov_iter_rw(iter), rc;
 	long map_len = 0;
-	pfn_t pfn;
 	void __pmem *addr = NULL;
-	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
+	struct blk_dax_ctl dax = {
+		.addr = (void __pmem *) ERR_PTR(-EIO),
+		.flags = 0,
+	};
 	bool hole = false;
 	bool need_wmb = false;
 
@@ -181,15 +176,15 @@  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				dax_unmap_atomic(bdev, kmap);
-				kmap = __dax_map_atomic(bdev,
-						to_sector(bh, inode),
-						bh->b_size, &pfn, &map_len);
-				if (IS_ERR(kmap)) {
-					rc = PTR_ERR(kmap);
+				dax_unmap_atomic(bdev, &dax);
+				dax.sector = to_sector(bh, inode);
+				dax.size = bh->b_size;
+				map_len = dax_map_atomic(bdev, &dax);
+				if (map_len < 0) {
+					rc = map_len;
 					break;
 				}
-				addr = kmap;
+				addr = dax.addr;
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
 					dax_new_buf(addr, map_len, first, pos,
 							end);
@@ -219,7 +214,7 @@  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 
 	if (need_wmb)
 		wmb_pmem();
-	dax_unmap_atomic(bdev, kmap);
+	dax_unmap_atomic(bdev, &dax);
 
 	return (pos == start) ? rc : pos - start;
 }
@@ -313,17 +308,20 @@  static int dax_load_hole(struct address_space *mapping, struct page *page,
 static int copy_user_bh(struct page *to, struct inode *inode,
 		struct buffer_head *bh, unsigned long vaddr)
 {
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+		.flags = 0,
+	};
 	struct block_device *bdev = bh->b_bdev;
-	void __pmem *vfrom;
 	void *vto;
 
-	vfrom = dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size);
-	if (IS_ERR(vfrom))
-		return PTR_ERR(vfrom);
+	if (dax_map_atomic(bdev, &dax) < 0)
+		return PTR_ERR(dax.addr);
 	vto = kmap_atomic(to);
-	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
+	copy_user_page(vto, (void __force *)dax.addr, vaddr, to);
 	kunmap_atomic(vto);
-	dax_unmap_atomic(bdev, vfrom);
+	dax_unmap_atomic(bdev, &dax);
 	return 0;
 }
 
@@ -344,15 +342,25 @@  static void dax_account_mapping(struct block_device *bdev, pfn_t pfn,
 	}
 }
 
+static unsigned long vm_fault_to_dax_flags(struct vm_fault *vmf)
+{
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE))
+		return BLKDAX_F_DIRTY;
+	return 0;
+}
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
-	void __pmem *addr;
+	struct blk_dax_ctl dax = {
+		.sector = to_sector(bh, inode),
+		.size = bh->b_size,
+		.flags = vm_fault_to_dax_flags(vmf),
+	};
 	pgoff_t size;
-	pfn_t pfn;
 	int error;
 
 	i_mmap_lock_read(mapping);
@@ -370,22 +378,20 @@  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	addr = __dax_map_atomic(bdev, to_sector(bh, inode), bh->b_size,
-			&pfn, NULL);
-	if (IS_ERR(addr)) {
-		error = PTR_ERR(addr);
+	if (dax_map_atomic(bdev, &dax) < 0) {
+		error = PTR_ERR(dax.addr);
 		goto out;
 	}
 
 	if (buffer_unwritten(bh) || buffer_new(bh)) {
-		clear_pmem(addr, PAGE_SIZE);
+		clear_pmem(dax.addr, PAGE_SIZE);
 		wmb_pmem();
 	}
 
-	dax_account_mapping(bdev, pfn, mapping);
-	dax_unmap_atomic(bdev, addr);
+	dax_account_mapping(bdev, dax.pfn, mapping);
+	dax_unmap_atomic(bdev, &dax);
 
-	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
+	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(dax.pfn));
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -674,33 +680,35 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		pfn_t pfn;
-		long length;
-		void __pmem *kaddr = __dax_map_atomic(bdev,
-				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
-				&length);
-
-		if (IS_ERR(kaddr)) {
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = HPAGE_SIZE,
+			.flags = flags,
+		};
+		long length = dax_map_atomic(bdev, &dax);
+
+		if (length < 0) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)) {
-			dax_unmap_atomic(bdev, kaddr);
+		if ((length < HPAGE_SIZE)
+				|| (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) {
+			dax_unmap_atomic(bdev, &dax);
 			goto fallback;
 		}
 
 		if (buffer_unwritten(&bh) || buffer_new(&bh)) {
-			clear_pmem(kaddr, HPAGE_SIZE);
+			clear_pmem(dax.addr, HPAGE_SIZE);
 			wmb_pmem();
 			count_vm_event(PGMAJFAULT);
 			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 			result |= VM_FAULT_MAJOR;
 		}
-		dax_account_mapping(bdev, pfn, mapping);
-		dax_unmap_atomic(bdev, kaddr);
+		dax_account_mapping(bdev, dax.pfn, mapping);
+		dax_unmap_atomic(bdev, &dax);
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd,
-				pfn_t_to_pfn(pfn), write);
+				pfn_t_to_pfn(dax.pfn), write);
 	}
 
  out:
@@ -803,14 +811,17 @@  int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 		return err;
 	if (buffer_written(&bh)) {
 		struct block_device *bdev = bh.b_bdev;
-		void __pmem *addr = dax_map_atomic(bdev, to_sector(&bh, inode),
-				PAGE_CACHE_SIZE);
-
-		if (IS_ERR(addr))
-			return PTR_ERR(addr);
-		clear_pmem(addr + offset, length);
+		struct blk_dax_ctl dax = {
+			.sector = to_sector(&bh, inode),
+			.size = PAGE_CACHE_SIZE,
+			.flags = 0,
+		};
+
+		if (dax_map_atomic(bdev, &dax) < 0)
+			return PTR_ERR(dax.addr);
+		clear_pmem(dax.addr + offset, length);
 		wmb_pmem();
-		dax_unmap_atomic(bdev, addr);
+		dax_unmap_atomic(bdev, &dax);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e121e5e0c6ac..663e9974820f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1615,14 +1615,31 @@  static inline bool integrity_req_gap_front_merge(struct request *req,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+#define BLKDAX_F_DIRTY (1UL << 0) /* range is mapped writable to userspace */
+
+/**
+ * struct blk_dax_ctl - control and output parameters for ->direct_access
+ * @sector: (input) offset relative to a block_device
+ * @addr: (output) kernel virtual address for @sector populated by driver
+ * @flags: (input) BLKDAX_F_*
+ * @pfn: (output) page frame number for @addr populated by driver
+ * @size: (input) number of bytes requested
+ */
+struct blk_dax_ctl {
+	sector_t sector;
+	void __pmem *addr;
+	unsigned long flags;
+	long size;
+	pfn_t pfn;
+};
+
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
-	long (*direct_access)(struct block_device *, sector_t, void __pmem **,
-			pfn_t *);
+	long (*direct_access)(struct block_device *, struct blk_dax_ctl *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1640,8 +1657,7 @@  extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, pfn_t *pfn, long size);
+extern long bdev_direct_access(struct block_device *, struct blk_dax_ctl *);
 #else /* CONFIG_BLOCK */
 
 struct block_device;