
[v2,03/11] pmem: enable REQ_FUA/REQ_FLUSH handling

Message ID 1447459610-14259-4-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State Changes Requested

Commit Message

Ross Zwisler Nov. 14, 2015, 12:06 a.m. UTC
Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
are sent down via blkdev_issue_flush() in response to an fsync() or msync()
and are used by filesystems to order their metadata, among other things.

When we get an msync() or fsync() it is the responsibility of the DAX code
to flush all dirty pages to media.  The PMEM driver then just has to issue
a wmb_pmem() in response to the REQ_FLUSH to ensure that, before we return,
all the flushed data has been durably stored on the media.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/nvdimm/pmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Dan Williams Nov. 14, 2015, 12:20 a.m. UTC | #1
On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> and are used by filesystems to order their metadata, among other things.
>
> When we get an msync() or fsync() it is the responsibility of the DAX code
> to flush all dirty pages to media.  The PMEM driver then just has issue a
> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> the flushed data has been durably stored on the media.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Hmm, I'm not seeing why we need this patch.  If the actual flushing of
the cache is done by the core, why does the driver need to support
REQ_FLUSH?  Especially since it's just a couple of instructions.  REQ_FUA
only makes sense if individual writes can bypass the "drive" cache,
but no I/O submitted to the driver proper is ever cached; we always
flush it through to media.
Dan Williams Nov. 14, 2015, 2:32 a.m. UTC | #2
On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
>>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
>>> and are used by filesystems to order their metadata, among other things.
>>>
>>> When we get an msync() or fsync() it is the responsibility of the DAX code
>>> to flush all dirty pages to media.  The PMEM driver then just has issue a
>>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
>>> the flushed data has been durably stored on the media.
>>>
>>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>>
>> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
>> the cache is done by the core why does the driver need support
>> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
>> only makes sense if individual writes can bypass the "drive" cache,
>> but no I/O submitted to the driver proper is ever cached we always
>> flush it through to media.
>
> If the upper level filesystem gets an error when submitting a flush
> request, then it assumes the underlying hardware is broken and cannot
> be as aggressive in IO submission, but instead has to wait for in-flight
> IO to complete.

Upper level filesystems won't get errors when the driver does not
support flush.  Those requests are ended cleanly in
generic_make_request_checks().  Yes, the fs still needs to wait for
outstanding I/O to complete but in the case of pmem all I/O is
synchronous.  There's never anything to await when flushing at the
pmem driver level.

> Since FUA/FLUSH is basically a no-op for pmem devices,
> it doesn't make sense _not_ to support this functionality.

Seems to be a nop either way.  Given that DAX may leave dirty data for
the device pending in the cpu cache, which a REQ_FLUSH request will
not touch, it's better to leave it all to the mm core to handle.  I.e.
it doesn't make sense to call the driver just for two instructions
(sfence + pcommit) when the mm core is taking on the cache flushing.
Either handle it all in the mm or the driver, not a mixture.
Dan Williams Nov. 16, 2015, 5:28 p.m. UTC | #3
On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 16-11-15 14:37:14, Jan Kara wrote:
[..]
> But a question: Won't it be better to do sfence + pcommit only in response
> to REQ_FLUSH request and don't do it after each write? I'm not sure how
> expensive these instructions are but in theory it could be a performance
> win, couldn't it? For filesystems this is enough wrt persistency
> guarantees...

We would need to gather the performance data...  The expectation is
that the cache flushing is more expensive than the sfence + pcommit.
Ross Zwisler Nov. 16, 2015, 7:48 p.m. UTC | #4
On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> [..]
> > But a question: Won't it be better to do sfence + pcommit only in response
> > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > expensive these instructions are but in theory it could be a performance
> > win, couldn't it? For filesystems this is enough wrt persistency
> > guarantees...
> 
> We would need to gather the performance data...  The expectation is
> that the cache flushing is more expensive than the sfence + pcommit.

I think we should revisit the idea of removing wmb_pmem() from the I/O path in
both the PMEM driver and in DAX, and just relying on the REQ_FUA/REQ_FLUSH
path to do wmb_pmem() for all cases.  This was brought up in the thread
dealing with the "big hammer" fsync/msync patches as well.

https://lkml.org/lkml/2015/11/3/730

I think we can all agree from the start that wmb_pmem() will have a nonzero
cost, both because of the PCOMMIT and because of the ordering caused by the
sfence.  If it's possible to avoid doing it on each I/O, I think that would be
a win.

So, here would be our new flows:

PMEM I/O:
	write I/O(s) to the driver
		PMEM I/O writes the data using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX I/O:
	write I/O(s) to the DAX layer
		write the data using regular stores (eventually to be replaced
		with non-temporal stores)

		flush the data with wb_cache_pmem() (removed when we use
		non-temporal stores)

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX msync/fsync:
	writes happen to DAX mmaps from userspace

	DAX fsync/msync
		all dirty pages are written back using wb_cache_pmem()

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs
	
DAX/PMEM zeroing (suggested by Dave: https://lkml.org/lkml/2015/11/2/772):
	PMEM driver receives zeroing request
		writes a bunch of zeroes using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

Having all these flows wait to do wmb_pmem() in the PMEM driver in response to
REQ_FUA/REQ_FLUSH has several advantages:

1) The work done and guarantees provided after each step closely match the
normal block I/O to disk case.  This means that the existing algorithms used
by filesystems to make sure that their metadata is ordered properly and synced
at a known time should all work the same.

2) By delaying wmb_pmem() until REQ_FUA/REQ_FLUSH time we can potentially do
many I/Os at different levels, and order them all with a single wmb_pmem().
This should result in a performance win.

Is there any reason why this wouldn't work or wouldn't be a good idea?
Ross Zwisler Nov. 16, 2015, 8:09 p.m. UTC | #5
On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >>
> >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> >>> and are used by filesystems to order their metadata, among other things.
> >>>
> >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> >>> the flushed data has been durably stored on the media.
> >>>
> >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >>
> >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> >> the cache is done by the core why does the driver need support
> >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> >> only makes sense if individual writes can bypass the "drive" cache,
> >> but no I/O submitted to the driver proper is ever cached we always
> >> flush it through to media.
> >
> > If the upper level filesystem gets an error when submitting a flush
> > request, then it assumes the underlying hardware is broken and cannot
> > be as aggressive in IO submission, but instead has to wait for in-flight
> > IO to complete.
> 
> Upper level filesystems won't get errors when the driver does not
> support flush.  Those requests are ended cleanly in
> generic_make_request_checks().  Yes, the fs still needs to wait for
> outstanding I/O to complete but in the case of pmem all I/O is
> synchronous.  There's never anything to await when flushing at the
> pmem driver level.
> 
> > Since FUA/FLUSH is basically a no-op for pmem devices,
> > it doesn't make sense _not_ to support this functionality.
> 
> Seems to be a nop either way.  Given that DAX may lead to dirty data
> pending to the device in the cpu cache that a REQ_FLUSH request will
> not touch, its better to leave it all to the mm core to handle.  I.e.
> it doesn't make sense to call the driver just for two instructions
> (sfence + pcommit) when the mm core is taking on the cache flushing.
> Either handle it all in the mm or the driver, not a mixture.

Does anyone know if ext4 and/or XFS alter their algorithms based on whether
the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystem behave more
efficiently with respect to their internal I/O ordering, etc., if PMEM
advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
at the DAX layer?
Dan Williams Nov. 16, 2015, 8:34 p.m. UTC | #6
On Mon, Nov 16, 2015 at 11:48 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
>> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
[..]
> Is there any reason why this wouldn't work or wouldn't be a good idea?

We don't have numbers to support the claim that pcommit is so
expensive as to need to be deferred, especially if the upper layers are
already taking the hit on doing the flushes.

REQ_FLUSH means "flush your volatile write cache".  Currently no I/O
through the driver ever hits a volatile cache, so there's no need to
tell the block layer that we have a volatile write cache, especially
when you have the core mm taking responsibility for doing cache
maintenance for dax-mmap ranges.

We also don't have numbers on if/when wbinvd is a more performant solution.

tl;dr Now that we have a baseline implementation can we please use
data to make future arch decisions?
Dave Chinner Nov. 16, 2015, 10:14 p.m. UTC | #7
On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > >>
> > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > >> <ross.zwisler@linux.intel.com> wrote:
> > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > >>> and are used by filesystems to order their metadata, among other things.
> > > >>>
> > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > >>> the flushed data has been durably stored on the media.
> > > >>>
> > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > >>
> > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > >> the cache is done by the core why does the driver need support
> > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > >> but no I/O submitted to the driver proper is ever cached we always
> > > >> flush it through to media.
> > > >
> > > > If the upper level filesystem gets an error when submitting a flush
> > > > request, then it assumes the underlying hardware is broken and cannot
> > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > IO to complete.
> > > 
> > > Upper level filesystems won't get errors when the driver does not
> > > support flush.  Those requests are ended cleanly in
> > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > outstanding I/O to complete but in the case of pmem all I/O is
> > > synchronous.  There's never anything to await when flushing at the
> > > pmem driver level.
> > > 
> > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > it doesn't make sense _not_ to support this functionality.
> > > 
> > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > not touch, its better to leave it all to the mm core to handle.  I.e.
> > > it doesn't make sense to call the driver just for two instructions
> > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > Either handle it all in the mm or the driver, not a mixture.
> > 
> > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > e.g. journal writes going through block layer or writes done through
> > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > and the way driver does IO doesn't guarantee this, does it?
> 
> Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> does, I'm indeed wrong because they both do wmb_pmem() after each write
> which seems to include sfence + pcommit. Sorry for confusion.

Which I want to remove, because it makes DAX IO 3x slower than
buffered IO on ramdisk based testing.

> But a question: Won't it be better to do sfence + pcommit only in response
> to REQ_FLUSH request and don't do it after each write? I'm not sure how
> expensive these instructions are but in theory it could be a performance
> win, couldn't it? For filesystems this is enough wrt persistency
> guarantees...

I'm pretty sure it would be, because all of the overhead (and
therefore latency) I measured is in the cache flushing instructions.
But before we can remove the wmb_pmem() from dax_do_io(), we need
the underlying device to support REQ_FLUSH correctly...

Cheers,

Dave.
Ross Zwisler Nov. 16, 2015, 11:29 p.m. UTC | #8
On Tue, Nov 17, 2015 at 09:14:12AM +1100, Dave Chinner wrote:
> On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > > >>
> > > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > > >> <ross.zwisler@linux.intel.com> wrote:
> > > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > > >>> and are used by filesystems to order their metadata, among other things.
> > > > >>>
> > > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > > >>> the flushed data has been durably stored on the media.
> > > > >>>
> > > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > >>
> > > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > > >> the cache is done by the core why does the driver need support
> > > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > > >> but no I/O submitted to the driver proper is ever cached we always
> > > > >> flush it through to media.
> > > > >
> > > > > If the upper level filesystem gets an error when submitting a flush
> > > > > request, then it assumes the underlying hardware is broken and cannot
> > > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > > IO to complete.
> > > > 
> > > > Upper level filesystems won't get errors when the driver does not
> > > > support flush.  Those requests are ended cleanly in
> > > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > > outstanding I/O to complete but in the case of pmem all I/O is
> > > > synchronous.  There's never anything to await when flushing at the
> > > > pmem driver level.
> > > > 
> > > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > > it doesn't make sense _not_ to support this functionality.
> > > > 
> > > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > > not touch, its better to leave it all to the mm core to handle.  I.e.
> > > > it doesn't make sense to call the driver just for two instructions
> > > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > > Either handle it all in the mm or the driver, not a mixture.
> > > 
> > > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > > e.g. journal writes going through block layer or writes done through
> > > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > > and the way driver does IO doesn't guarantee this, does it?
> > 
> > Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> > does, I'm indeed wrong because they both do wmb_pmem() after each write
> > which seems to include sfence + pcommit. Sorry for confusion.
> 
> Which I want to remove, because it makes DAX IO 3x slower than
> buffered IO on ramdisk based testing.
> 
> > But a question: Won't it be better to do sfence + pcommit only in response
> > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > expensive these instructions are but in theory it could be a performance
> > win, couldn't it? For filesystems this is enough wrt persistency
> > guarantees...
> 
> I'm pretty sure it would be, because all of the overhead (and
> therefore latency) I measured is in the cache flushing instructions.
> But before we can remove the wmb_pmem() from  dax_do_io(), we need
> the underlying device to support REQ_FLUSH correctly...

By "support REQ_FLUSH correctly" do you mean call wmb_pmem() as I do in my
set?  Or do you mean something that also involves cache flushing such as the
"big hammer" that flushes everything or something like WBINVD?
Dave Chinner Nov. 16, 2015, 11:42 p.m. UTC | #9
On Mon, Nov 16, 2015 at 04:29:27PM -0700, Ross Zwisler wrote:
> On Tue, Nov 17, 2015 at 09:14:12AM +1100, Dave Chinner wrote:
> > On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> > > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > > > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > >>
> > > > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > > > >> <ross.zwisler@linux.intel.com> wrote:
> > > > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > > > >>> and are used by filesystems to order their metadata, among other things.
> > > > > >>>
> > > > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > > > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > > > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > > > >>> the flushed data has been durably stored on the media.
> > > > > >>>
> > > > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > > >>
> > > > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > > > >> the cache is done by the core why does the driver need support
> > > > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > > > >> but no I/O submitted to the driver proper is ever cached we always
> > > > > >> flush it through to media.
> > > > > >
> > > > > > If the upper level filesystem gets an error when submitting a flush
> > > > > > request, then it assumes the underlying hardware is broken and cannot
> > > > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > > > IO to complete.
> > > > > 
> > > > > Upper level filesystems won't get errors when the driver does not
> > > > > support flush.  Those requests are ended cleanly in
> > > > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > > > outstanding I/O to complete but in the case of pmem all I/O is
> > > > > synchronous.  There's never anything to await when flushing at the
> > > > > pmem driver level.
> > > > > 
> > > > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > > > it doesn't make sense _not_ to support this functionality.
> > > > > 
> > > > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > > > not touch, its better to leave it all to the mm core to handle.  I.e.
> > > > > it doesn't make sense to call the driver just for two instructions
> > > > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > > > Either handle it all in the mm or the driver, not a mixture.
> > > > 
> > > > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > > > e.g. journal writes going through block layer or writes done through
> > > > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > > > and the way driver does IO doesn't guarantee this, does it?
> > > 
> > > Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> > > does, I'm indeed wrong because they both do wmb_pmem() after each write
> > > which seems to include sfence + pcommit. Sorry for confusion.
> > 
> > Which I want to remove, because it makes DAX IO 3x slower than
> > buffered IO on ramdisk based testing.
> > 
> > > But a question: Won't it be better to do sfence + pcommit only in response
> > > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > > expensive these instructions are but in theory it could be a performance
> > > win, couldn't it? For filesystems this is enough wrt persistency
> > > guarantees...
> > 
> > I'm pretty sure it would be, because all of the overhead (and
> > therefore latency) I measured is in the cache flushing instructions.
> > But before we can remove the wmb_pmem() from  dax_do_io(), we need
> > the underlying device to support REQ_FLUSH correctly...
> 
> By "support REQ_FLUSH correctly" do you mean call wmb_pmem() as I do in my
> set?  Or do you mean something that also involves cache flushing such as the
> "big hammer" that flushes everything or something like WBINVD?

Either.  Both solve the problem of deferring the cache flush penalty
to the context that needs it.

Cheers,

Dave.
Ross Zwisler Nov. 16, 2015, 11:57 p.m. UTC | #10
On Mon, Nov 16, 2015 at 12:34:55PM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 11:48 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
> >> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> [..]
> > Is there any reason why this wouldn't work or wouldn't be a good idea?
> 
> We don't have numbers to support the claim that pcommit is so
> expensive as to need be deferred, especially if the upper layers are
> already taking the hit on doing the flushes.
> 
> REQ_FLUSH, means flush your volatile write cache.  Currently all I/O
> through the driver never hits a volatile cache so there's no need to
> tell the block layer that we have a volatile write cache, especially
> when you have the core mm taking responsibility for doing cache
> maintenance for dax-mmap ranges.
> 
> We also don't have numbers on if/when wbinvd is a more performant solution.
> 
> tl;dr Now that we have a baseline implementation can we please use
> data to make future arch decisions?

Sure, fair enough.
Jan Kara Nov. 18, 2015, 10:40 a.m. UTC | #11
On Mon 16-11-15 13:09:50, Ross Zwisler wrote:
> On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > >>
> > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > >> <ross.zwisler@linux.intel.com> wrote:
> > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > >>> and are used by filesystems to order their metadata, among other things.
> > >>>
> > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > >>> the flushed data has been durably stored on the media.
> > >>>
> > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > >>
> > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > >> the cache is done by the core why does the driver need support
> > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > >> only makes sense if individual writes can bypass the "drive" cache,
> > >> but no I/O submitted to the driver proper is ever cached we always
> > >> flush it through to media.
> > >
> > > If the upper level filesystem gets an error when submitting a flush
> > > request, then it assumes the underlying hardware is broken and cannot
> > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > IO to complete.
> > 
> > Upper level filesystems won't get errors when the driver does not
> > support flush.  Those requests are ended cleanly in
> > generic_make_request_checks().  Yes, the fs still needs to wait for
> > outstanding I/O to complete but in the case of pmem all I/O is
> > synchronous.  There's never anything to await when flushing at the
> > pmem driver level.
> > 
> > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > it doesn't make sense _not_ to support this functionality.
> > 
> > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > pending to the device in the cpu cache that a REQ_FLUSH request will
> > not touch, its better to leave it all to the mm core to handle.  I.e.
> > it doesn't make sense to call the driver just for two instructions
> > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > Either handle it all in the mm or the driver, not a mixture.
> 
> Does anyone know if ext4 and/or XFS alter their algorithms based on whether
> the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystem behave more
> efficiently with respect to their internal I/O ordering, etc., if PMEM
> advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
> at the DAX layer?

So the information whether the driver supports FLUSH / FUA is generally
ignored by filesystems. We issue REQ_FLUSH / REQ_FUA requests to achieve
required ordering for fs consistency and expect that block layer does the
right thing - i.e., if the device has volatile write cache, it will be
flushed, if it doesn't have it, the request will be ignored. So the
difference between supporting and not supporting REQ_FLUSH / REQ_FUA is
only in how block layer handles such requests.

								Honza
Ross Zwisler Nov. 18, 2015, 4:16 p.m. UTC | #12
On Wed, Nov 18, 2015 at 11:40:55AM +0100, Jan Kara wrote:
> On Mon 16-11-15 13:09:50, Ross Zwisler wrote:
> > On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > >>
> > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > >> <ross.zwisler@linux.intel.com> wrote:
> > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > >>> and are used by filesystems to order their metadata, among other things.
> > > >>>
> > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > >>> the flushed data has been durably stored on the media.
> > > >>>
> > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > >>
> > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > >> the cache is done by the core why does the driver need support
> > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > >> but no I/O submitted to the driver proper is ever cached we always
> > > >> flush it through to media.
> > > >
> > > > If the upper level filesystem gets an error when submitting a flush
> > > > request, then it assumes the underlying hardware is broken and cannot
> > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > IO to complete.
> > > 
> > > Upper level filesystems won't get errors when the driver does not
> > > support flush.  Those requests are ended cleanly in
> > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > outstanding I/O to complete but in the case of pmem all I/O is
> > > synchronous.  There's never anything to await when flushing at the
> > > pmem driver level.
> > > 
> > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > it doesn't make sense _not_ to support this functionality.
> > > 
> > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > not touch, its better to leave it all to the mm core to handle.  I.e.
> > > it doesn't make sense to call the driver just for two instructions
> > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > Either handle it all in the mm or the driver, not a mixture.
> > 
> > Does anyone know if ext4 and/or XFS alter their algorithms based on whether
> > the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystem behave more
> > efficiently with respect to their internal I/O ordering, etc., if PMEM
> > advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
> > at the DAX layer?
> 
> So the information whether the driver supports FLUSH / FUA is generally
> ignored by filesystems. We issue REQ_FLUSH / REQ_FUA requests to achieve
> required ordering for fs consistency and expect that block layer does the
> right thing - i.e., if the device has volatile write cache, it will be
> flushed, if it doesn't have it, the request will be ignored. So the
> difference between supporting and not supporting REQ_FLUSH / REQ_FUA is
> only in how block layer handles such requests.

Cool, thank you for the info.  Based on this I'll pull out the
REQ_FLUSH/REQ_FUA patch for v3 of this series and move the wmb_pmem() call up
to DAX as Dan suggests.  If performance data shows that we can get a benefit
from centralizing wmb_pmem() behind REQ_FUA/REQ_FLUSH, I'll add it back in
later as part of that series.

Patch

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a97..b914d66 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -80,7 +80,7 @@  static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-	if (bio_data_dir(bio))
+	if (bio_data_dir(bio) || (bio->bi_rw & (REQ_FLUSH|REQ_FUA)))
 		wmb_pmem();
 
 	bio_endio(bio);
@@ -189,6 +189,7 @@  static int pmem_attach_disk(struct device *dev,
 	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH|REQ_FUA);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
 
 	disk = alloc_disk(0);