dax: remove the pmem_dax_ops->flush abstraction

Message ID alpine.LRH.2.02.1708312138390.32309@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State New, archived

Commit Message

Mikulas Patocka Sept. 1, 2017, 1:47 a.m. UTC
The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is 
buggy. A dm device may be composed of multiple underlying devices and all 
of them need to be flushed. The patch just routes the flush request to the 
first device and ignores the other devices.

It could be fixed by adding more complex logic to the device mapper. But 
there is only one implementation of the method pmem_dax_ops->flush - that 
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we 
don't need the pmem_dax_ops->flush abstraction at all; we can call 
arch_wb_cache_pmem() directly from dax_flush(), because dax_dev->ops->flush 
can never reach anything other than arch_wb_cache_pmem().

It should also be pointed out that some uses of persistent memory need to 
flush only a very small amount of data (such as one cacheline), and it 
would be overkill to go through the device mapper machinery for a single 
flushed cache line.

This patch removes the abstraction pmem_dax_ops->flush and calls 
arch_wb_cache_pmem() directly from dax_flush(). It also removes device 
mapper code that forwards the flushes.
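
With the abstraction gone, dax_flush() reduces to the following (this is 
the drivers/dax/super.c hunk from the patch below, reproduced here for 
clarity, with a comment added):

#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
{
        if (unlikely(!dax_alive(dax_dev)))
                return;

        /* no-op when the platform flushes CPU caches on power failure */
        if (unlikely(!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags)))
                return;

        arch_wb_cache_pmem(addr, size);
}
#else
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
{
}
#endif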

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/dax/super.c           |   21 ++++++++++++++-------
 drivers/md/dm-linear.c        |   15 ---------------
 drivers/md/dm-stripe.c        |   20 --------------------
 drivers/md/dm.c               |   19 -------------------
 drivers/nvdimm/pmem.c         |    7 -------
 fs/dax.c                      |    4 ++--
 include/linux/dax.h           |    5 +----
 include/linux/device-mapper.h |    3 ---
 8 files changed, 17 insertions(+), 77 deletions(-)

Comments

Dan Williams Sept. 1, 2017, 2:38 a.m. UTC | #1
On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
> buggy. A dm device may be composed of multiple underlying devices and all
> of them need to be flushed. The patch just routes the flush request to the
> first device and ignores the other devices.
>
> It could be fixed by adding more complex logic to the device mapper. But
> there is only one implementation of the method pmem_dax_ops->flush - that
> is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
> don't need the pmem_dax_ops->flush abstraction at all, we can call
> arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
> couldn't ever reach anything different from arch_wb_cache_pmem().

Unfortunately, this is not true; see the usage of DAXDEV_WRITE_CACHE. This
is for platforms that arrange for cpu caches to be flushed on
power-fail, like standalone storage appliances, where it would be a
waste for the kernel to track and flush dirty cachelines for dax.
Theoretically this could be done on a per-address-range basis (think of a
battery backing a subset of the system memory). I think we need to fix
the routing to flush the same dax_device that ->direct_access() was
invoked on.
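
(For reference, a minimal sketch of how a driver on such a platform could
opt out of the writeback path, using the existing dax_write_cache() helper
from include/linux/dax.h; the function and parameter names here are
hypothetical.)

#include <linux/dax.h>

/*
 * Sketch only: when the platform guarantees that CPU caches reach
 * persistence on power failure, clearing the write-cache flag makes
 * dax_flush() a no-op for this dax_device.
 */
static void example_set_dax_write_cache(struct dax_device *dax_dev,
                                        bool cpu_cache_is_persistent)
{
        dax_write_cache(dax_dev, !cpu_cache_is_persistent);
}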


> It should be also pointed out that for some uses of persistent memory it
> is needed to flush only very small amount of data (such as 1 cacheline),
> and it would be overkill if we go through that device mapper machinery for
> a single flushed cache line.

I think this is more an argument to not enable DAX on that
device-mapper topology if operation routing impacts performance. DAX
is meant to get the kernel out of the way most of the time.
Mikulas Patocka Sept. 1, 2017, 2:49 a.m. UTC | #2
On Thu, 31 Aug 2017, Dan Williams wrote:

> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
> > buggy. A dm device may be composed of multiple underlying devices and all
> > of them need to be flushed. The patch just routes the flush request to the
> > first device and ignores the other devices.
> >
> > It could be fixed by adding more complex logic to the device mapper. But
> > there is only one implementation of the method pmem_dax_ops->flush - that
> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
> > don't need the pmem_dax_ops->flush abstraction at all, we can call
> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
> > couldn't ever reach anything different from arch_wb_cache_pmem().
> 
> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
> is for platforms that arrange for cpu caches to be flushed on
> power-fail, like standalone storage appliances, where it would be a
> waste for the kernel to track and flush dirty cachelines for dax.
> Theoretically this could be done on a per-address range basis (think
> battery backing a subset of the system memory). I think we need to fix
> the routing to flush the same dax_device that ->direct_access() was
> invoked.

With my patch, dax_flush checks DAXDEV_WRITE_CACHE and doesn't do anything 
if it's not set; it just doesn't go through the device mapper.

> > It should be also pointed out that for some uses of persistent memory it
> > is needed to flush only very small amount of data (such as 1 cacheline),
> > and it would be overkill if we go through that device mapper machinery for
> > a single flushed cache line.
> 
> I think this is more an argument to not enable DAX on that
> device-mapper topology if operation routing impacts performance. DAX
> is meant to get the kernel out of the way most of the time.

I don't understand what you mean. I am writing a device mapper target 
that uses persistent memory as a cache. When flushing the metadata, 
it is sometimes necessary to flush single cache lines. Ideally, I would 
use the clwb instruction for this purpose, but clwb is hidden behind a 
complicated pmem_dax_ops->flush abstraction (and that abstraction will get 
even more complicated if implemented correctly). How do you think I should 
write that target without DAX?

Mikulas
Mikulas Patocka Sept. 1, 2017, 3:04 a.m. UTC | #3
On Thu, 31 Aug 2017, Dan Williams wrote:

> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
> > buggy. A dm device may be composed of multiple underlying devices and all
> > of them need to be flushed. The patch just routes the flush request to the
> > first device and ignores the other devices.
> >
> > It could be fixed by adding more complex logic to the device mapper. But
> > there is only one implementation of the method pmem_dax_ops->flush - that
> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
> > don't need the pmem_dax_ops->flush abstraction at all, we can call
> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
> > couldn't ever reach anything different from arch_wb_cache_pmem().
> 
> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
> is for platforms that arrange for cpu caches to be flushed on
> power-fail, like standalone storage appliances, where it would be a

BTW, if the system is capable of flushing caches on power failure, that is 
a system-wide feature, not a per-device feature. The CPU caches may 
store data for any persistent memory device.

You either have enough remaining power to flush the CPU caches or not.

Mikulas

> waste for the kernel to track and flush dirty cachelines for dax.
> Theoretically this could be done on a per-address range basis (think
> battery backing a subset of the system memory). I think we need to fix
> the routing to flush the same dax_device that ->direct_access() was
> invoked.
> 
> 
> > It should be also pointed out that for some uses of persistent memory it
> > is needed to flush only very small amount of data (such as 1 cacheline),
> > and it would be overkill if we go through that device mapper machinery for
> > a single flushed cache line.
> 
> I think this is more an argument to not enable DAX on that
> device-mapper topology if operation routing impacts performance. DAX
> is meant to get the kernel out of the way most of the time.
>
Dan Williams Sept. 1, 2017, 3:11 a.m. UTC | #4
On Thu, Aug 31, 2017 at 7:49 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 31 Aug 2017, Dan Williams wrote:
>
>> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
>> > buggy. A dm device may be composed of multiple underlying devices and all
>> > of them need to be flushed. The patch just routes the flush request to the
>> > first device and ignores the other devices.
>> >
>> > It could be fixed by adding more complex logic to the device mapper. But
>> > there is only one implementation of the method pmem_dax_ops->flush - that
>> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
>> > don't need the pmem_dax_ops->flush abstraction at all, we can call
>> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
>> > couldn't ever reach anything different from arch_wb_cache_pmem().
>>
>> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
>> is for platforms that arrange for cpu caches to be flushed on
>> power-fail, like standalone storage appliances, where it would be a
>> waste for the kernel to track and flush dirty cachelines for dax.
>> Theoretically this could be done on a per-address range basis (think
>> battery backing a subset of the system memory). I think we need to fix
>> the routing to flush the same dax_device that ->direct_access() was
>> invoked.
>
> With my patch, dax_flush checks DAXDEV_WRITE_CACHE and doesn't do anything
> if it's not set - it just doesn't go through device mapper.

True, but there's no guarantee that arch_wb_cache_pmem() will be
sufficient going forward, and the reason we routed this to the driver in
the first place was to allow driver / platform-specific flush
implementations.

>
>> > It should be also pointed out that for some uses of persistent memory it
>> > is needed to flush only very small amount of data (such as 1 cacheline),
>> > and it would be overkill if we go through that device mapper machinery for
>> > a single flushed cache line.
>>
>> I think this is more an argument to not enable DAX on that
>> device-mapper topology if operation routing impacts performance. DAX
>> is meant to get the kernel out of the way most of the time.
>
> I don't understand what do you mean? I am writing a device mapper target
> that uses persistent memory as a cache. When flushing the metadata,
> sometimes it is needed to flush single cache lines. Ideally, I would use
> the clwb instruction for this purpose, but that clwb is hidden behind a
> complicated pmem_dax_ops->flush abstraction (and that abstraction will get
> even more complicated if implemented correctly). How do you think I should
> write that target without DAX?

I don't think you want DAX for this; I think you want direct control
of a persistent-memory address range.  For example, the brd driver
supports DAX, but I wouldn't recommend using it as the basis for a
directly managed cache.

So I think we need to dive deeper into what an in-kernel interface to
access a contiguous persistent memory range looks like. DAX was not
designed with that use case in mind.
Dan Williams Sept. 1, 2017, 3:13 a.m. UTC | #5
On Thu, Aug 31, 2017 at 8:04 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 31 Aug 2017, Dan Williams wrote:
>
>> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
>> > buggy. A dm device may be composed of multiple underlying devices and all
>> > of them need to be flushed. The patch just routes the flush request to the
>> > first device and ignores the other devices.
>> >
>> > It could be fixed by adding more complex logic to the device mapper. But
>> > there is only one implementation of the method pmem_dax_ops->flush - that
>> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
>> > don't need the pmem_dax_ops->flush abstraction at all, we can call
>> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
>> > couldn't ever reach anything different from arch_wb_cache_pmem().
>>
>> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
>> is for platforms that arrange for cpu caches to be flushed on
>> power-fail, like standalone storage appliances, where it would be a
>
> BTW. if the system is capable of flushing caches on power failure, it is
> system-wide feature, it is not per-device feature. The CPU caches may
> store data for any persistent memory device.
>
> You either have enough remaining power to flush the CPU caches or not.

Whether you need to flush caches or not can be a per-address range
attribute and can change over time.
Dan Williams Sept. 1, 2017, 3:34 a.m. UTC | #6
On Thu, Aug 31, 2017 at 8:11 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Aug 31, 2017 at 7:49 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>>
>> On Thu, 31 Aug 2017, Dan Williams wrote:
>>
>>> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>>> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
>>> > buggy. A dm device may be composed of multiple underlying devices and all
>>> > of them need to be flushed. The patch just routes the flush request to the
>>> > first device and ignores the other devices.
>>> >
>>> > It could be fixed by adding more complex logic to the device mapper. But
>>> > there is only one implementation of the method pmem_dax_ops->flush - that
>>> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
>>> > don't need the pmem_dax_ops->flush abstraction at all, we can call
>>> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
>>> > couldn't ever reach anything different from arch_wb_cache_pmem().
>>>
>>> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
>>> is for platforms that arrange for cpu caches to be flushed on
>>> power-fail, like standalone storage appliances, where it would be a
>>> waste for the kernel to track and flush dirty cachelines for dax.
>>> Theoretically this could be done on a per-address range basis (think
>>> battery backing a subset of the system memory). I think we need to fix
>>> the routing to flush the same dax_device that ->direct_access() was
>>> invoked.
>>
>> With my patch, dax_flush checks DAXDEV_WRITE_CACHE and doesn't do anything
>> if it's not set - it just doesn't go through device mapper.
>
> True, but there's no guarantee that arch_wb_pmem() will be sufficient
> going forward and the reason we routed this to the driver in the first
> place was to allow driver / platform specific flush implementations.
>
>>
>>> > It should be also pointed out that for some uses of persistent memory it
>>> > is needed to flush only very small amount of data (such as 1 cacheline),
>>> > and it would be overkill if we go through that device mapper machinery for
>>> > a single flushed cache line.
>>>
>>> I think this is more an argument to not enable DAX on that
>>> device-mapper topology if operation routing impacts performance. DAX
>>> is meant to get the kernel out of the way most of the time.
>>
>> I don't understand what do you mean? I am writing a device mapper target
>> that uses persistent memory as a cache. When flushing the metadata,
>> sometimes it is needed to flush single cache lines. Ideally, I would use
>> the clwb instruction for this purpose, but that clwb is hidden behind a
>> complicated pmem_dax_ops->flush abstraction (and that abstraction will get
>> even more complicated if implemented correctly). How do you think I should
>> write that target without DAX?
>
> I don't think you want DAX for this, I think you want direct control
> of a persistent-memory address range.  For example the brd driver
> supports DAX but I wouldn't recommend using it as the basis for a
> directly managed cache.
>
> So I think we need to dive deeper into what an in kernel interface to
> access a contiguous persistent memory range looks like. DAX was not
> designed with that use case in mind.

For a cache driver using pmem, a strawman / starting point would be to do
a ->direct_access() call for the full capacity of the device and then
(sketched after the list):

1/ validate that it returns the full capacity (this will preclude
drivers like brd, and make sure the dax range is contiguous)

2/ use the returned pfn to walk the iomem resource tree and validate
that you are talking to persistent memory and not some special
dax-device range.
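
A rough sketch of those two steps (illustrative only; locking lifetime,
reference counting and error handling are abbreviated, and the function
name is made up):

#include <linux/blkdev.h>
#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/ioport.h>
#include <linux/pfn.h>
#include <linux/pfn_t.h>

static int example_claim_pmem(struct block_device *bdev,
                              struct dax_device *dax_dev,
                              void **kaddr, pfn_t *pfn)
{
        long expected = i_size_read(bdev->bd_inode) >> PAGE_SHIFT;
        long avail;
        int id;

        /* 1/ ask for the whole capacity in a single call ... */
        id = dax_read_lock();
        avail = dax_direct_access(dax_dev, 0, expected, kaddr, pfn);
        dax_read_unlock(id);
        if (avail < expected)
                return -EINVAL; /* discontiguous or short range (e.g. brd) */

        /* 2/ ... and check that the pfn range really is persistent memory */
        if (region_intersects(PFN_PHYS(pfn_t_to_pfn(*pfn)),
                              (size_t)expected << PAGE_SHIFT,
                              IORESOURCE_MEM, IORES_DESC_PERSISTENT_MEMORY)
                        != REGION_INTERSECTS)
                return -EINVAL;

        return 0;
}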
Mikulas Patocka Sept. 6, 2017, 6:29 a.m. UTC | #7
On Thu, 31 Aug 2017, Dan Williams wrote:

> On Thu, Aug 31, 2017 at 8:04 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> > On Thu, 31 Aug 2017, Dan Williams wrote:
> >
> >> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
> >> > buggy. A dm device may be composed of multiple underlying devices and all
> >> > of them need to be flushed. The patch just routes the flush request to the
> >> > first device and ignores the other devices.
> >> >
> >> > It could be fixed by adding more complex logic to the device mapper. But
> >> > there is only one implementation of the method pmem_dax_ops->flush - that
> >> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
> >> > don't need the pmem_dax_ops->flush abstraction at all, we can call
> >> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
> >> > couldn't ever reach anything different from arch_wb_cache_pmem().
> >>
> >> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
> >> is for platforms that arrange for cpu caches to be flushed on
> >> power-fail, like standalone storage appliances, where it would be a
> >
> > BTW. if the system is capable of flushing caches on power failure, it is
> > system-wide feature, it is not per-device feature. The CPU caches may
> > store data for any persistent memory device.
> >
> > You either have enough remaining power to flush the CPU caches or not.
> 
> Whether you need to flush caches or not can be a per-address range
> attribute and can change over time.

There is only one way to flush the cache (the clwb instruction), so you 
don't need a method that abstracts over this instruction. A simple flag or 
static key that skips clwb() is sufficient.
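
For illustration, such a gate could look like this (hypothetical names;
arch_wb_cache_pmem() is declared the same way the patch does it in
drivers/dax/super.c):

#include <linux/jump_label.h>
#include <linux/types.h>

void arch_wb_cache_pmem(void *addr, size_t size);

/* default true: flush unless the platform is known to be protected */
static DEFINE_STATIC_KEY_TRUE(pmem_flush_needed);

static inline void example_flush_range(void *addr, size_t size)
{
        /* patched to a no-op at runtime on power-fail-protected systems */
        if (static_branch_likely(&pmem_flush_needed))
                arch_wb_cache_pmem(addr, size);
}

/* e.g. during platform setup: static_branch_disable(&pmem_flush_needed); */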

The Linux kernel API is changeable, so you don't need to overdesign the 
API for the future. If we ever have a system with two types of persistent 
memory that need different instructions to flush them, we can design a 
proper abstraction. But there is no reason to design the abstraction in 
advance for a situation that may never happen.

That dm_dax_ops->flush abstraction slows down cache flushing by a factor 
of almost two (when used on device mapper), which is just unacceptable. 
Unfortunately, there is no other architecture-independent function to 
flush the cache.

Mikulas
Mikulas Patocka Sept. 6, 2017, 6:30 a.m. UTC | #8
On Thu, 31 Aug 2017, Dan Williams wrote:

> On Thu, Aug 31, 2017 at 8:11 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Thu, Aug 31, 2017 at 7:49 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >>
> >>
> >> On Thu, 31 Aug 2017, Dan Williams wrote:
> >>
> >>> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> >>> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
> >>> > buggy. A dm device may be composed of multiple underlying devices and all
> >>> > of them need to be flushed. The patch just routes the flush request to the
> >>> > first device and ignores the other devices.
> >>> >
> >>> > It could be fixed by adding more complex logic to the device mapper. But
> >>> > there is only one implementation of the method pmem_dax_ops->flush - that
> >>> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
> >>> > don't need the pmem_dax_ops->flush abstraction at all, we can call
> >>> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
> >>> > couldn't ever reach anything different from arch_wb_cache_pmem().
> >>>
> >>> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
> >>> is for platforms that arrange for cpu caches to be flushed on
> >>> power-fail, like standalone storage appliances, where it would be a
> >>> waste for the kernel to track and flush dirty cachelines for dax.
> >>> Theoretically this could be done on a per-address range basis (think
> >>> battery backing a subset of the system memory). I think we need to fix
> >>> the routing to flush the same dax_device that ->direct_access() was
> >>> invoked.
> >>
> >> With my patch, dax_flush checks DAXDEV_WRITE_CACHE and doesn't do anything
> >> if it's not set - it just doesn't go through device mapper.
> >
> > True, but there's no guarantee that arch_wb_pmem() will be sufficient
> > going forward and the reason we routed this to the driver in the first
> > place was to allow driver / platform specific flush implementations.
> >
> >>
> >>> > It should be also pointed out that for some uses of persistent memory it
> >>> > is needed to flush only very small amount of data (such as 1 cacheline),
> >>> > and it would be overkill if we go through that device mapper machinery for
> >>> > a single flushed cache line.
> >>>
> >>> I think this is more an argument to not enable DAX on that
> >>> device-mapper topology if operation routing impacts performance. DAX
> >>> is meant to get the kernel out of the way most of the time.
> >>
> >> I don't understand what do you mean? I am writing a device mapper target
> >> that uses persistent memory as a cache. When flushing the metadata,
> >> sometimes it is needed to flush single cache lines. Ideally, I would use
> >> the clwb instruction for this purpose, but that clwb is hidden behind a
> >> complicated pmem_dax_ops->flush abstraction (and that abstraction will get
> >> even more complicated if implemented correctly). How do you think I should
> >> write that target without DAX?
> >
> > I don't think you want DAX for this, I think you want direct control
> > of a persistent-memory address range.  For example the brd driver
> > supports DAX but I wouldn't recommend using it as the basis for a
> > directly managed cache.
> >
> > So I think we need to dive deeper into what an in kernel interface to
> > access a contiguous persistent memory range looks like. DAX was not
> > designed with that use case in mind.
> 
> For a cache driver using pmem, a strawman / starting point would be
> trying to do a ->direct_access() call for the full capacity of the
> device and then:
> 
> 1/ validate that it returns the full capacity (this will preclude
> drivers like brd, and make sure the dax range is contiguous)
> 
> 2/ use the returned pfn to walk the iomem resource tree and validate
> that you are talking to persistent memory and not some special
> dax-device range.

This is exactly what I am doing (except that if direct_access returns 
less than the size of the device, I call direct_access again and remap the 
discontiguous pages to a contiguous linear region).
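
(A sketch of that remapping step, purely illustrative: it assumes the pfns
returned by dax_direct_access() are page-backed, i.e. pfn_t_has_page() is
true, and it omits dax_read_lock() and most error handling; the function
name is made up.)

#include <linux/dax.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>
#include <linux/vmalloc.h>

static void *example_map_whole_device(struct dax_device *dax_dev, long nr_pages)
{
        struct page **pages;
        void *kaddr, *vaddr = NULL;
        pfn_t pfn;
        long i = 0, n, j;

        pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;

        while (i < nr_pages) {
                n = dax_direct_access(dax_dev, i, nr_pages - i, &kaddr, &pfn);
                if (n <= 0 || !pfn_t_has_page(pfn))
                        goto out;
                n = min(n, nr_pages - i);       /* may report more than asked */
                for (j = 0; j < n; j++)
                        pages[i + j] = pfn_to_page(pfn_t_to_pfn(pfn) + j);
                i += n;
        }

        /* stitch the discontiguous chunks into one linear virtual range */
        vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
out:
        kvfree(pages);
        return vaddr;
}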

But the question is - how should I flush cache lines in my persistent 
memory driver?

* there is the clwb() function, but it is specific to the x86 architecture
* there is dax_flush(), but it has excessive overhead because it tries to 
  walk the device mapper tree (I measured it on a system with real 
  persistent memory and it is 1.9 times slower than clwb when used on 
  device mapper) - see the sketch below
* in kernel 4.12 there used to be wb_cache_pmem() - that function would be 
  perfect for this purpose, but commit 
  4e4f00a9b51a1c52ebdd728a1caeb3b9fe48c39d removed it
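
For concreteness, this is a sketch of the metadata call site in question
(hypothetical structure and function names; it uses the dax_flush()
signature introduced by this patch, which then boils down to clwb via
arch_wb_cache_pmem()):

#include <linux/dax.h>
#include <linux/types.h>
#include <asm/byteorder.h>

/* one cacheline worth of on-pmem metadata (hypothetical layout) */
struct example_sb {
        __le64 seq;
        __le64 root_block;
        u8 pad[48];
};

static void example_commit_sb(struct dax_device *dax_dev,
                              struct example_sb *sb, u64 seq, u64 root_block)
{
        sb->seq = cpu_to_le64(seq);
        sb->root_block = cpu_to_le64(root_block);

        /* write back a single dirty cacheline of metadata */
        dax_flush(dax_dev, sb, sizeof(*sb));
}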

Mikulas
Dan Williams Sept. 6, 2017, 2:22 p.m. UTC | #9
On Tue, Sep 5, 2017 at 11:29 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:

>
>
> On Thu, 31 Aug 2017, Dan Williams wrote:
>
> > On Thu, Aug 31, 2017 at 8:04 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > >
> > > On Thu, 31 Aug 2017, Dan Williams wrote:
> > >
> > >> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
> > >> > buggy. A dm device may be composed of multiple underlying devices and all
> > >> > of them need to be flushed. The patch just routes the flush request to the
> > >> > first device and ignores the other devices.
> > >> >
> > >> > It could be fixed by adding more complex logic to the device mapper. But
> > >> > there is only one implementation of the method pmem_dax_ops->flush - that
> > >> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
> > >> > don't need the pmem_dax_ops->flush abstraction at all, we can call
> > >> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
> > >> > couldn't ever reach anything different from arch_wb_cache_pmem().
> > >>
> > >> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
> > >> is for platforms that arrange for cpu caches to be flushed on
> > >> power-fail, like standalone storage appliances, where it would be a
> > >
> > > BTW. if the system is capable of flushing caches on power failure, it is
> > > system-wide feature, it is not per-device feature. The CPU caches may
> > > store data for any persistent memory device.
> > >
> > > You either have enough remaining power to flush the CPU caches or not.
> >
> > Whether you need to flush caches or not can be a per-address range
> > attribute and can change over time.
>
> There is only one way to flush the cache (the clwb instruction), you don't
> need a method that abstracts over this instruction. A simple flag or
> static key that skips clwb() is sufficient.
>
> Linux kernel API is changeable, so you don't need to overdesign the API
> for future. If it ever happens that we have a system with two types of
> persistent memory that need different instructions to flush them, we can
> design a proper abstraction. But there is no reason to design the
> abstraction in advance for a situation that will perhaps never happen.
>

True, it is overdesigned and we can cross that complexity bridge in the
future when / if it arises.

Since dm is already checking dax_write_cache_enabled() to set its own
flush-enabled bit, I think your patch looks good. You can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams Sept. 6, 2017, 3:46 p.m. UTC | #10
[ resend in plain text ]

On Tue, Sep 5, 2017 at 11:29 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 31 Aug 2017, Dan Williams wrote:
>
>> On Thu, Aug 31, 2017 at 8:04 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >
>> >
>> > On Thu, 31 Aug 2017, Dan Williams wrote:
>> >
>> >> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
>> >> > buggy. A dm device may be composed of multiple underlying devices and all
>> >> > of them need to be flushed. The patch just routes the flush request to the
>> >> > first device and ignores the other devices.
>> >> >
>> >> > It could be fixed by adding more complex logic to the device mapper. But
>> >> > there is only one implementation of the method pmem_dax_ops->flush - that
>> >> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
>> >> > don't need the pmem_dax_ops->flush abstraction at all, we can call
>> >> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
>> >> > couldn't ever reach anything different from arch_wb_cache_pmem().
>> >>
>> >> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
>> >> is for platforms that arrange for cpu caches to be flushed on
>> >> power-fail, like standalone storage appliances, where it would be a
>> >
>> > BTW. if the system is capable of flushing caches on power failure, it is
>> > system-wide feature, it is not per-device feature. The CPU caches may
>> > store data for any persistent memory device.
>> >
>> > You either have enough remaining power to flush the CPU caches or not.
>>
>> Whether you need to flush caches or not can be a per-address range
>> attribute and can change over time.
>
> There is only one way to flush the cache (the clwb instruction), you don't
> need a method that abstracts over this instruction. A simple flag or
> static key that skips clwb() is sufficient.
>
> Linux kernel API is changeable, so you don't need to overdesign the API
> for future. If it ever happens that we have a system with two types of
> persistent memory that need different instructions to flush them, we can
> design a proper abstraction. But there is no reason to design the
> abstraction in advance for a situation that will perhaps never happen.

True, it is overdesigned and we can cross that complexity bridge in
the future when / if it arises.

Since dm is already checking dax_write_cache_enabled() to set its own
flush-enabled bit, I think your patch looks good. You can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Patch

Index: linux-2.6/drivers/md/dm-linear.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-linear.c
+++ linux-2.6/drivers/md/dm-linear.c
@@ -184,20 +184,6 @@  static size_t linear_dax_copy_from_iter(
 	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
-static void linear_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
-		size_t size)
-{
-	struct linear_c *lc = ti->private;
-	struct block_device *bdev = lc->dev->bdev;
-	struct dax_device *dax_dev = lc->dev->dax_dev;
-	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
-
-	dev_sector = linear_map_sector(ti, sector);
-	if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
-		return;
-	dax_flush(dax_dev, pgoff, addr, size);
-}
-
 static struct target_type linear_target = {
 	.name   = "linear",
 	.version = {1, 4, 0},
@@ -212,7 +198,6 @@  static struct target_type linear_target
 	.iterate_devices = linear_iterate_devices,
 	.direct_access = linear_dax_direct_access,
 	.dax_copy_from_iter = linear_dax_copy_from_iter,
-	.dax_flush = linear_dax_flush,
 };
 
 int __init dm_linear_init(void)
Index: linux-2.6/drivers/md/dm-stripe.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-stripe.c
+++ linux-2.6/drivers/md/dm-stripe.c
@@ -351,25 +351,6 @@  static size_t stripe_dax_copy_from_iter(
 	return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
 }
 
-static void stripe_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
-		size_t size)
-{
-	sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
-	struct stripe_c *sc = ti->private;
-	struct dax_device *dax_dev;
-	struct block_device *bdev;
-	uint32_t stripe;
-
-	stripe_map_sector(sc, sector, &stripe, &dev_sector);
-	dev_sector += sc->stripe[stripe].physical_start;
-	dax_dev = sc->stripe[stripe].dev->dax_dev;
-	bdev = sc->stripe[stripe].dev->bdev;
-
-	if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
-		return;
-	dax_flush(dax_dev, pgoff, addr, size);
-}
-
 /*
  * Stripe status:
  *
@@ -491,7 +472,6 @@  static struct target_type stripe_target
 	.io_hints = stripe_io_hints,
 	.direct_access = stripe_dax_direct_access,
 	.dax_copy_from_iter = stripe_dax_copy_from_iter,
-	.dax_flush = stripe_dax_flush,
 };
 
 int __init dm_stripe_init(void)
Index: linux-2.6/drivers/nvdimm/pmem.c
===================================================================
--- linux-2.6.orig/drivers/nvdimm/pmem.c
+++ linux-2.6/drivers/nvdimm/pmem.c
@@ -243,16 +243,9 @@  static size_t pmem_copy_from_iter(struct
 	return copy_from_iter_flushcache(addr, bytes, i);
 }
 
-static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
-		void *addr, size_t size)
-{
-	arch_wb_cache_pmem(addr, size);
-}
-
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
 	.copy_from_iter = pmem_copy_from_iter,
-	.flush = pmem_dax_flush,
 };
 
 static const struct attribute_group *pmem_attribute_groups[] = {
Index: linux-2.6/include/linux/device-mapper.h
===================================================================
--- linux-2.6.orig/include/linux/device-mapper.h
+++ linux-2.6/include/linux/device-mapper.h
@@ -134,8 +134,6 @@  typedef long (*dm_dax_direct_access_fn)
 		long nr_pages, void **kaddr, pfn_t *pfn);
 typedef size_t (*dm_dax_copy_from_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i);
-typedef void (*dm_dax_flush_fn)(struct dm_target *ti, pgoff_t pgoff, void *addr,
-		size_t size);
 #define PAGE_SECTORS (PAGE_SIZE / 512)
 
 void dm_error(const char *message);
@@ -186,7 +184,6 @@  struct target_type {
 	dm_io_hints_fn io_hints;
 	dm_dax_direct_access_fn direct_access;
 	dm_dax_copy_from_iter_fn dax_copy_from_iter;
-	dm_dax_flush_fn dax_flush;
 
 	/* For internal device-mapper use. */
 	struct list_head list;
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c
+++ linux-2.6/drivers/md/dm.c
@@ -997,24 +997,6 @@  static size_t dm_dax_copy_from_iter(stru
 	return ret;
 }
 
-static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-		size_t size)
-{
-	struct mapped_device *md = dax_get_private(dax_dev);
-	sector_t sector = pgoff * PAGE_SECTORS;
-	struct dm_target *ti;
-	int srcu_idx;
-
-	ti = dm_dax_get_live_target(md, sector, &srcu_idx);
-
-	if (!ti)
-		goto out;
-	if (ti->type->dax_flush)
-		ti->type->dax_flush(ti, pgoff, addr, size);
- out:
-	dm_put_live_table(md, srcu_idx);
-}
-
 /*
  * A target may call dm_accept_partial_bio only from the map routine.  It is
  * allowed for all bio types except REQ_PREFLUSH.
@@ -3002,7 +2984,6 @@  static const struct block_device_operati
 static const struct dax_operations dm_dax_ops = {
 	.direct_access = dm_dax_direct_access,
 	.copy_from_iter = dm_dax_copy_from_iter,
-	.flush = dm_dax_flush,
 };
 
 /*
Index: linux-2.6/fs/dax.c
===================================================================
--- linux-2.6.orig/fs/dax.c
+++ linux-2.6/fs/dax.c
@@ -783,7 +783,7 @@  static int dax_writeback_one(struct bloc
 	}
 
 	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
-	dax_flush(dax_dev, pgoff, kaddr, size);
+	dax_flush(dax_dev, kaddr, size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -978,7 +978,7 @@  int __dax_zero_page_range(struct block_d
 			return rc;
 		}
 		memset(kaddr + offset, 0, size);
-		dax_flush(dax_dev, pgoff, kaddr + offset, size);
+		dax_flush(dax_dev, kaddr + offset, size);
 		dax_read_unlock(id);
 	}
 	return 0;
Index: linux-2.6/drivers/dax/super.c
===================================================================
--- linux-2.6.orig/drivers/dax/super.c
+++ linux-2.6/drivers/dax/super.c
@@ -189,8 +189,10 @@  static umode_t dax_visible(struct kobjec
 	if (!dax_dev)
 		return 0;
 
-	if (a == &dev_attr_write_cache.attr && !dax_dev->ops->flush)
+#ifndef CONFIG_ARCH_HAS_PMEM_API
+	if (a == &dev_attr_write_cache.attr)
 		return 0;
+#endif
 	return a->mode;
 }
 
@@ -255,18 +257,23 @@  size_t dax_copy_from_iter(struct dax_dev
 }
 EXPORT_SYMBOL_GPL(dax_copy_from_iter);
 
-void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-		size_t size)
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+void arch_wb_cache_pmem(void *addr, size_t size);
+void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-	if (!dax_alive(dax_dev))
+	if (unlikely(!dax_alive(dax_dev)))
 		return;
 
-	if (!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags))
+	if (unlikely(!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags)))
 		return;
 
-	if (dax_dev->ops->flush)
-		dax_dev->ops->flush(dax_dev, pgoff, addr, size);
+	arch_wb_cache_pmem(addr, size);
+}
+#else
+void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
+{
 }
+#endif
 EXPORT_SYMBOL_GPL(dax_flush);
 
 void dax_write_cache(struct dax_device *dax_dev, bool wc)
Index: linux-2.6/include/linux/dax.h
===================================================================
--- linux-2.6.orig/include/linux/dax.h
+++ linux-2.6/include/linux/dax.h
@@ -19,8 +19,6 @@  struct dax_operations {
 	/* copy_from_iter: required operation for fs-dax direct-i/o */
 	size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
 			struct iov_iter *);
-	/* flush: optional driver-specific cache management after writes */
-	void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
 };
 
 extern struct attribute_group dax_attribute_group;
@@ -84,8 +82,7 @@  long dax_direct_access(struct dax_device
 		void **kaddr, pfn_t *pfn);
 size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
-void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
-		size_t size);
+void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);