Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

Message ID	20181002100531.GC4135@quack2.suse.cz (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of jack@suse.cz designates 195.135.220.15 as permitted sender) client-ip=195.135.220.15; Date: Tue, 2 Oct 2018 12:05:31 +0200 From: Jan Kara <jack@suse.cz> To: Dan Williams <dan.j.williams@intel.com>, Dave Jiang <dave.jiang@intel.com> Cc: linux-nvdimm@lists.01.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, jthumshirn@suse.de Subject: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps Message-ID: <20181002100531.GC4135@quack2.suse.cz> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="YZ5djTAD1cGYuMQK" Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps \| expand Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

Jan Kara Oct. 2, 2018, 10:05 a.m. UTC

Hello,

commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
mean time certain customer of ours started poking into /proc/<pid>/smaps
and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
flags, the application just fails to start complaining that DAX support is
missing in the kernel. The question now is how do we go about this?

Strictly speaking, this is a userspace visible regression (as much as I
think that application poking into VMA flags at this level is just too
bold). Is there any precedens in handling similar issues with smaps which
really exposes a lot of information that is dependent on kernel
implementation details?

I have attached a patch that is an obvious "fix" for the issue - just fake
VM_MIXEDMAP flag in smaps. But I'm open to other suggestions...

								Honza

Michal Hocko Oct. 2, 2018, 10:50 a.m. UTC | #1

On Tue 02-10-18 12:05:31, Jan Kara wrote:
> Hello,
> 
> commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> mean time certain customer of ours started poking into /proc/<pid>/smaps
> and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> flags, the application just fails to start complaining that DAX support is
> missing in the kernel. The question now is how do we go about this?

Do they need to check for a general DAX support or do they need a per
mapping granularity?

> Strictly speaking, this is a userspace visible regression (as much as I
> think that application poking into VMA flags at this level is just too
> bold). Is there any precedens in handling similar issues with smaps which
> really exposes a lot of information that is dependent on kernel
> implementation details?

Yeah, exposing all the vma flags was just a terrible idea. We have had a
similar issue recently [1] for other flag that is no longer set while
the implementation of the feature is still in place. I guess we really
want to document that those flags are for debugging only and no stable
and long term API should rely on it.

Considering how new the thing really is (does anybody do anything
production like out there?) I would tend to try a better interface
rather than chasing after random vma flags. E.g. what prevents a
completely unrelated usage of VM_MIXEDMAP?

Johannes Thumshirn Oct. 2, 2018, 12:10 p.m. UTC | #2

On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> Hello,
> 
> commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> mean time certain customer of ours started poking into /proc/<pid>/smaps
> and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> flags, the application just fails to start complaining that DAX support is
> missing in the kernel. The question now is how do we go about this?

OK naive question from me, how do we want an application to be able to
check if it is running on a DAX mapping?

AFAIU DAX is always associated with a file descriptor of some kind (be
it a real file with filesystem dax or the /dev/dax device file for
device dax). So could a new fcntl() be of any help here? IS_DAX() only
checks for the S_DAX flag in inode::i_flags, so this should be doable
for both fsdax and devdax.

I haven't tried it yet but it should be fairly easy to come up with
something like this.

Byte,
	Johannes

Jan Kara Oct. 2, 2018, 1:32 p.m. UTC | #3

On Tue 02-10-18 12:50:58, Michal Hocko wrote:
> On Tue 02-10-18 12:05:31, Jan Kara wrote:
> > Hello,
> > 
> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > mean time certain customer of ours started poking into /proc/<pid>/smaps
> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > flags, the application just fails to start complaining that DAX support is
> > missing in the kernel. The question now is how do we go about this?
> 
> Do they need to check for a general DAX support or do they need a per
> mapping granularity?

General DAX support per filesystem is OK for them, at least for now. So
they might be OK with just checking for 'dax' mount option in /proc/mounts.
But I agree this is cumbersome.

> > Strictly speaking, this is a userspace visible regression (as much as I
> > think that application poking into VMA flags at this level is just too
> > bold). Is there any precedens in handling similar issues with smaps which
> > really exposes a lot of information that is dependent on kernel
> > implementation details?
> 
> Yeah, exposing all the vma flags was just a terrible idea. We have had a
> similar issue recently [1] for other flag that is no longer set while
> the implementation of the feature is still in place. I guess we really
> want to document that those flags are for debugging only and no stable
> and long term API should rely on it.

Yeah, I have some doubts about usefulness of such documentation but I guess
it's better than nothing.

> Considering how new the thing really is (does anybody do anything
> production like out there?) I would tend to try a better interface
> rather than chasing after random vma flags. E.g. what prevents a
> completely unrelated usage of VM_MIXEDMAP?

Nothing checking that flag is in production AFAIK but DAX as such is in
active use for some limited usecases already. I'll reply regarding a better
interface for checking DAX, in an email to Johannes.

								Honza

Johannes Thumshirn Oct. 2, 2018, 2:20 p.m. UTC | #4

On Tue, Oct 02, 2018 at 02:10:39PM +0200, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > mean time certain customer of ours started poking into /proc/<pid>/smaps
> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > flags, the application just fails to start complaining that DAX support is
> > missing in the kernel. The question now is how do we go about this?
> 
> OK naive question from me, how do we want an application to be able to
> check if it is running on a DAX mapping?
> 
> AFAIU DAX is always associated with a file descriptor of some kind (be
> it a real file with filesystem dax or the /dev/dax device file for
> device dax). So could a new fcntl() be of any help here? IS_DAX() only
> checks for the S_DAX flag in inode::i_flags, so this should be doable
> for both fsdax and devdax.
> 
> I haven't tried it yet but it should be fairly easy to come up with
> something like this.

OK now I did on a normal file on BTFS (without DAX obviously) and on a
file on XFS with the -o dax mount option.

Here's the RFC:

commit 3a8f0d23c421e8c91bc9d8bd3a956e1ffe3f754b
Author: Johannes Thumshirn <jthumshirn@suse.de>
Date:   Tue Oct 2 14:51:33 2018 +0200

    fcntl: provide F_GETDAX for applications to query DAX capabilities
    
    Provide a F_GETDAX fcntl(2) command so an application can query
    whether it can make use of DAX or not.
    
    Both file-system DAX as well as device DAX mark the DAX capability in
    struct inode::i_flags using the S_DAX flag, so we can query it using
    the IS_DAX() macro on a struct file's inode.
    
    If the file descriptor is either device DAX or on a DAX capable
    file-system '1' is returned back to user-space, if DAX isn't usable
    for some reason '0' is returned back.
    
    This patch can be tested with the following small C program:
    
     #include <stdio.h>
     #include <stdlib.h>
     #include <unistd.h>
     #include <fcntl.h>
     #include <libgen.h>
    
     #ifndef F_LINUX_SPECIFIC_BASE
     #define F_LINUX_SPECIFIC_BASE 1024
     #endif
    
     #define F_GETDAX               (F_LINUX_SPECIFIC_BASE + 15)
    
     int main(int argc, char **argv)
     {
            int dax;
            int fd;
            int rc;
    
            if (argc != 2) {
                    printf("Usage: %s file\n", basename(argv[0]));
                    exit(EXIT_FAILURE);
            }
    
            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    exit(EXIT_FAILURE);
            }
    
            rc = fcntl(fd, F_GETDAX, &dax);
            if (rc < 0) {
                    perror("fcntl");
                    close(fd);
                    exit(EXIT_FAILURE);
            }
    
            if (dax) {
                    printf("fd %d is dax capable\n", fd);
                    exit(EXIT_FAILURE);
            } else {
                    printf("fd %d is not dax capable\n", fd);
                    exit(EXIT_SUCCESS);
            }
     }
    
    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Dan Williams <dan.j.williams@intel.com>

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..0b53f968f569 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -32,6 +32,22 @@
 
 #define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
 
+static int fcntl_get_dax(struct file *filp, unsigned long arg)
+{
+	struct inode *inode = file_inode(filp);
+	u64 *argp = (u64 __user *)arg;
+	u64 dax;
+
+	if (IS_DAX(inode))
+		dax = 1;
+	else
+		dax = 0;
+
+	if (copy_to_user(argp, &dax, sizeof(*argp)))
+		return -EFAULT;
+	return 0;
+}
+
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
 	struct inode * inode = file_inode(filp);
@@ -426,6 +442,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_FILE_RW_HINT:
 		err = fcntl_rw_hint(filp, cmd, arg);
 		break;
+	case F_GETDAX:
+		err = fcntl_get_dax(filp, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..65a59c3cc46d 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -52,6 +52,7 @@
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
+#define F_GETDAX		(F_LINUX_SPECIFIC_BASE + 15)
 
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be

Jan Kara Oct. 2, 2018, 2:29 p.m. UTC | #5

[Added ext4, xfs, and linux-api folks to CC for the interface discussion]

On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > mean time certain customer of ours started poking into /proc/<pid>/smaps
> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > flags, the application just fails to start complaining that DAX support is
> > missing in the kernel. The question now is how do we go about this?
> 
> OK naive question from me, how do we want an application to be able to
> check if it is running on a DAX mapping?

The question from me is: Should application really care? After all DAX is
just a caching decision. Sure it affects performance characteristics and
memory usage of the kernel but it is not a correctness issue (in particular
we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
supported for current mapping). And in the future the details of what we do
with DAX mapping can change - e.g. I could imagine we might decide to cache
writes in DRAM but do direct PMEM access on reads. And all this could be
auto-tuned based on media properties. And we don't want to tie our hands by
specifying too narrowly how the kernel is going to behave.

OTOH I understand that e.g. for a large database application the difference
between DAX and non-DAX mapping can be a difference between performs fine
and performs terribly / kills the machine so such application might want to
determine / force caching policy to save sysadmin from debugging why the
application is misbehaving.

> AFAIU DAX is always associated with a file descriptor of some kind (be
> it a real file with filesystem dax or the /dev/dax device file for
> device dax). So could a new fcntl() be of any help here? IS_DAX() only
> checks for the S_DAX flag in inode::i_flags, so this should be doable
> for both fsdax and devdax.

So fcntl() to query DAX usage is one option. Another option is the GETFLAGS
ioctl with which you can query the state of S_DAX flag (works only for XFS
currently). But that inode flag was meant more as a hint "use DAX if
available" AFAIK so that's probably not really suitable for querying
whether DAX is really in use or not. Since DAX is really about caching
policy, I was also thinking that we could use madvise / fadvise for this.
I.e., something like MADV_DIRECT_ACCESS which would return with success if
DAX is in use, with error if not. Later, kernel could use it as a hint to
really force DAX on a mapping and not try clever caching policies...
Thoughts?

								Honza

Christoph Hellwig Oct. 2, 2018, 2:37 p.m. UTC | #6

On Tue, Oct 02, 2018 at 04:29:59PM +0200, Jan Kara wrote:
> > OK naive question from me, how do we want an application to be able to
> > check if it is running on a DAX mapping?
> 
> The question from me is: Should application really care?

No, it should not.  DAX is an implementation detail thay may change
or go away at any time.

Johannes Thumshirn Oct. 2, 2018, 2:44 p.m. UTC | #7

On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> No, it should not.  DAX is an implementation detail thay may change
> or go away at any time.

Well we had an issue with an application checking for dax, this is how
we landed here in the first place.

It's not that I want them to do it, it's more that they're actually
doing it in all kinds of interesting ways and then complaining when it
doesn't work anymore.

So it's less of an "API beauty price problem" but more of a "provide a
documented way which we won't break" way.

Byte,
	   Johannes

Christoph Hellwig Oct. 2, 2018, 2:45 p.m. UTC | #8

On Tue, Oct 02, 2018 at 04:20:10PM +0200, Johannes Thumshirn wrote:
>     Provide a F_GETDAX fcntl(2) command so an application can query
>     whether it can make use of DAX or not.

How does an application "make use of DAX"?  What actual user visible
semantics are associated with a file that has this flag set?

Christoph Hellwig Oct. 2, 2018, 2:52 p.m. UTC | #9

On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > No, it should not.  DAX is an implementation detail thay may change
> > or go away at any time.
> 
> Well we had an issue with an application checking for dax, this is how
> we landed here in the first place.

So what exacty is that "DAX" they are querying about (and no, I'm not
joking, nor being philosophical).

Johannes Thumshirn Oct. 2, 2018, 3:01 p.m. UTC | #10

On Tue, Oct 02, 2018 at 07:45:47AM -0700, Christoph Hellwig wrote:
> How does an application "make use of DAX"?  What actual user visible
> semantics are associated with a file that has this flag set?

There may not be any user visible semantics of DAX, but there are
promises we gave to application developers praising DAX as _the_
method to map data on persistent memory and get around "the penalty of
the page cache" (however big this is).

As I said in another mail to this thread, applications have started to
poke in procfs to see whether they can use DAX or not.

Party A has promised party B something and they started checking for
it, then commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and
device dax" removed the way they checked if the kernel can keep up
with this promise.

So technically e1fb4a086495 is a user visible regression and in the
past we have reverted patches introducing these, even if the patch is
generally correct and poking in /proc/self/smaps is a bad idea.

I just wanted to give them a documented way to check for this
promise. Being neutral if this promise is right or wrong, good or bad,
or whatever. That's not my call, but I prefer not having angry users,
yelling at me because of broken applications.

Byte,
	Johannes

Christoph Hellwig Oct. 2, 2018, 3:06 p.m. UTC | #11

On Tue, Oct 02, 2018 at 05:01:24PM +0200, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 07:45:47AM -0700, Christoph Hellwig wrote:
> > How does an application "make use of DAX"?  What actual user visible
> > semantics are associated with a file that has this flag set?
> 
> There may not be any user visible semantics of DAX, but there are
> promises we gave to application developers praising DAX as _the_
> method to map data on persistent memory and get around "the penalty of
> the page cache" (however big this is).

Who is "we"?  As someone involved with DAX code I think it is a steaming
pile of *****, and we are still looking for cases where it actually
works without bugs.  That's why the experimental tag still is on it
for example.

> As I said in another mail to this thread, applications have started to
> poke in procfs to see whether they can use DAX or not.

And what are they actually doing with that?

> 
> Party A has promised party B

We have never promised anyone anything.

> So technically e1fb4a086495 is a user visible regression and in the
> past we have reverted patches introducing these, even if the patch is
> generally correct and poking in /proc/self/smaps is a bad idea.

What actually stops working here and why?  If some stupid app doesn't work
without mixedmap and we want to apply the don't break userspace mantra
hard we should just always expose it.

> I just wanted to give them a documented way to check for this
> promise. Being neutral if this promise is right or wrong, good or bad,
> or whatever. That's not my call, but I prefer not having angry users,
> yelling at me because of broken applications.

There is no promise, sorry.

Jan Kara Oct. 2, 2018, 3:07 p.m. UTC | #12

On Tue 02-10-18 07:37:13, Christoph Hellwig wrote:
> On Tue, Oct 02, 2018 at 04:29:59PM +0200, Jan Kara wrote:
> > > OK naive question from me, how do we want an application to be able to
> > > check if it is running on a DAX mapping?
> > 
> > The question from me is: Should application really care?
> 
> No, it should not.  DAX is an implementation detail thay may change
> or go away at any time.

I agree that whether / how pagecache is used for filesystem access is an
implementation detail of the kernel.  OTOH for some workloads it is about
whether kernel needs gigabytes of RAM to cache files or not, which is not a
detail anymore if you want to fully utilize the machine. So people will be
asking for this and will be finding odd ways to determine whether DAX is
used or not (such as poking in smaps). And once there is some widely enough
used application doing this, it is not "stupid application" problem anymore
but the kernel's problem of not maintaining backward compatibility.

So I think we would be better off providing *some* API which applications
can use to determine whether pagecache is used or not and make sure this
API will convey the right information even if we change DAX
implementation or remove it altogether.

								Honza

Jan Kara Oct. 2, 2018, 3:31 p.m. UTC | #13

On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > No, it should not.  DAX is an implementation detail thay may change
> > > or go away at any time.
> > 
> > Well we had an issue with an application checking for dax, this is how
> > we landed here in the first place.
> 
> So what exacty is that "DAX" they are querying about (and no, I'm not
> joking, nor being philosophical).

I believe the application we are speaking about is mostly concerned about
the memory overhead of the page cache. Think of a machine that has ~ 1TB of
DRAM, the database running on it is about that size as well and they want
database state stored somewhere persistently - which they may want to do by
modifying mmaped database files if they do small updates... So they really
want to be able to use close to all DRAM for the DB and not leave slack
space for the kernel page cache to cache 1TB of database files.

								Honza

Dan Williams Oct. 2, 2018, 8:18 p.m. UTC | #14

On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > No, it should not.  DAX is an implementation detail thay may change
> > > > or go away at any time.
> > >
> > > Well we had an issue with an application checking for dax, this is how
> > > we landed here in the first place.
> >
> > So what exacty is that "DAX" they are querying about (and no, I'm not
> > joking, nor being philosophical).
>
> I believe the application we are speaking about is mostly concerned about
> the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> DRAM, the database running on it is about that size as well and they want
> database state stored somewhere persistently - which they may want to do by
> modifying mmaped database files if they do small updates... So they really
> want to be able to use close to all DRAM for the DB and not leave slack
> space for the kernel page cache to cache 1TB of database files.

VM_MIXEDMAP was never a reliable indication of DAX because it could be
set for random other device-drivers that use vm_insert_mixed(). The
MAP_SYNC flag positively indicates that page cache is disabled for a
given mapping, although whether that property is due to "dax" or some
other kernel mechanics is purely an internal detail.

I'm not opposed to faking out VM_MIXEDMAP if this broken check has
made it into production, but again, it's unreliable.

Jan Kara Oct. 3, 2018, 12:50 p.m. UTC | #15

On Tue 02-10-18 13:18:54, Dan Williams wrote:
> On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > No, it should not.  DAX is an implementation detail thay may change
> > > > > or go away at any time.
> > > >
> > > > Well we had an issue with an application checking for dax, this is how
> > > > we landed here in the first place.
> > >
> > > So what exacty is that "DAX" they are querying about (and no, I'm not
> > > joking, nor being philosophical).
> >
> > I believe the application we are speaking about is mostly concerned about
> > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > DRAM, the database running on it is about that size as well and they want
> > database state stored somewhere persistently - which they may want to do by
> > modifying mmaped database files if they do small updates... So they really
> > want to be able to use close to all DRAM for the DB and not leave slack
> > space for the kernel page cache to cache 1TB of database files.
> 
> VM_MIXEDMAP was never a reliable indication of DAX because it could be
> set for random other device-drivers that use vm_insert_mixed(). The
> MAP_SYNC flag positively indicates that page cache is disabled for a
> given mapping, although whether that property is due to "dax" or some
> other kernel mechanics is purely an internal detail.
> 
> I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> made it into production, but again, it's unreliable.

So luckily this particular application wasn't widely deployed yet so we
will likely get away with the vendor asking customers to update to a
version not looking into smaps and parsing /proc/mounts instead.

But I don't find parsing /proc/mounts that beautiful either and I'd prefer
if we had a better interface for applications to query whether they can
avoid page cache for mmaps or not.

								Honza

Dan Williams Oct. 3, 2018, 2:38 p.m. UTC | #16

On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > No, it should not.  DAX is an implementation detail thay may change
> > > > > > or go away at any time.
> > > > >
> > > > > Well we had an issue with an application checking for dax, this is how
> > > > > we landed here in the first place.
> > > >
> > > > So what exacty is that "DAX" they are querying about (and no, I'm not
> > > > joking, nor being philosophical).
> > >
> > > I believe the application we are speaking about is mostly concerned about
> > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > DRAM, the database running on it is about that size as well and they want
> > > database state stored somewhere persistently - which they may want to do by
> > > modifying mmaped database files if they do small updates... So they really
> > > want to be able to use close to all DRAM for the DB and not leave slack
> > > space for the kernel page cache to cache 1TB of database files.
> >
> > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > set for random other device-drivers that use vm_insert_mixed(). The
> > MAP_SYNC flag positively indicates that page cache is disabled for a
> > given mapping, although whether that property is due to "dax" or some
> > other kernel mechanics is purely an internal detail.
> >
> > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > made it into production, but again, it's unreliable.
>
> So luckily this particular application wasn't widely deployed yet so we
> will likely get away with the vendor asking customers to update to a
> version not looking into smaps and parsing /proc/mounts instead.
>
> But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> if we had a better interface for applications to query whether they can
> avoid page cache for mmaps or not.

Yeah, the mount flag is not a good indicator either. I think we need
to follow through on the per-inode property of DAX. Darrick and I
discussed just allowing the property to be inherited from the parent
directory at file creation time. That avoids the dynamic set-up /
teardown races that seem intractable at this point.

What's wrong with MAP_SYNC as a page-cache detector in the meantime?

Jan Kara Oct. 3, 2018, 3:06 p.m. UTC | #17

On Wed 03-10-18 07:38:50, Dan Williams wrote:
> On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > > No, it should not.  DAX is an implementation detail thay may change
> > > > > > > or go away at any time.
> > > > > >
> > > > > > Well we had an issue with an application checking for dax, this is how
> > > > > > we landed here in the first place.
> > > > >
> > > > > So what exacty is that "DAX" they are querying about (and no, I'm not
> > > > > joking, nor being philosophical).
> > > >
> > > > I believe the application we are speaking about is mostly concerned about
> > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > > DRAM, the database running on it is about that size as well and they want
> > > > database state stored somewhere persistently - which they may want to do by
> > > > modifying mmaped database files if they do small updates... So they really
> > > > want to be able to use close to all DRAM for the DB and not leave slack
> > > > space for the kernel page cache to cache 1TB of database files.
> > >
> > > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > > set for random other device-drivers that use vm_insert_mixed(). The
> > > MAP_SYNC flag positively indicates that page cache is disabled for a
> > > given mapping, although whether that property is due to "dax" or some
> > > other kernel mechanics is purely an internal detail.
> > >
> > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > > made it into production, but again, it's unreliable.
> >
> > So luckily this particular application wasn't widely deployed yet so we
> > will likely get away with the vendor asking customers to update to a
> > version not looking into smaps and parsing /proc/mounts instead.
> >
> > But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> > if we had a better interface for applications to query whether they can
> > avoid page cache for mmaps or not.
> 
> Yeah, the mount flag is not a good indicator either. I think we need
> to follow through on the per-inode property of DAX. Darrick and I
> discussed just allowing the property to be inherited from the parent
> directory at file creation time. That avoids the dynamic set-up /
> teardown races that seem intractable at this point.
> 
> What's wrong with MAP_SYNC as a page-cache detector in the meantime?

So IMHO checking for MAP_SYNC is about as reliable as checking for 'dax'
mount option. It works now but nobody promises it will reliably detect DAX in
future - e.g. there's nothing that prevents MAP_SYNC to work for mappings
using pagecache if we find a sensible usecase for that.

WRT per-inode DAX property, AFAIU that inode flag is just going to be
advisory thing - i.e., use DAX if possible. If you mount a filesystem with
these inode flags set in a configuration which does not allow DAX to be
used, you will still be able to access such inodes but the access will use
page cache instead. And querying these flags should better show real
on-disk status and not just whether DAX is used as that would result in an
even bigger mess. So this feature seems to be somewhat orthogonal to the
API I'm looking for.

								Honza

Dan Williams Oct. 3, 2018, 3:13 p.m. UTC | #18

On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 03-10-18 07:38:50, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > > > No, it should not.  DAX is an implementation detail thay may change
> > > > > > > > or go away at any time.
> > > > > > >
> > > > > > > Well we had an issue with an application checking for dax, this is how
> > > > > > > we landed here in the first place.
> > > > > >
> > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not
> > > > > > joking, nor being philosophical).
> > > > >
> > > > > I believe the application we are speaking about is mostly concerned about
> > > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > > > DRAM, the database running on it is about that size as well and they want
> > > > > database state stored somewhere persistently - which they may want to do by
> > > > > modifying mmaped database files if they do small updates... So they really
> > > > > want to be able to use close to all DRAM for the DB and not leave slack
> > > > > space for the kernel page cache to cache 1TB of database files.
> > > >
> > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > > > set for random other device-drivers that use vm_insert_mixed(). The
> > > > MAP_SYNC flag positively indicates that page cache is disabled for a
> > > > given mapping, although whether that property is due to "dax" or some
> > > > other kernel mechanics is purely an internal detail.
> > > >
> > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > > > made it into production, but again, it's unreliable.
> > >
> > > So luckily this particular application wasn't widely deployed yet so we
> > > will likely get away with the vendor asking customers to update to a
> > > version not looking into smaps and parsing /proc/mounts instead.
> > >
> > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> > > if we had a better interface for applications to query whether they can
> > > avoid page cache for mmaps or not.
> >
> > Yeah, the mount flag is not a good indicator either. I think we need
> > to follow through on the per-inode property of DAX. Darrick and I
> > discussed just allowing the property to be inherited from the parent
> > directory at file creation time. That avoids the dynamic set-up /
> > teardown races that seem intractable at this point.
> >
> > What's wrong with MAP_SYNC as a page-cache detector in the meantime?
>
> So IMHO checking for MAP_SYNC is about as reliable as checking for 'dax'
> mount option. It works now but nobody promises it will reliably detect DAX in
> future - e.g. there's nothing that prevents MAP_SYNC to work for mappings
> using pagecache if we find a sensible usecase for that.

Fair enough.

> WRT per-inode DAX property, AFAIU that inode flag is just going to be
> advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> these inode flags set in a configuration which does not allow DAX to be
> used, you will still be able to access such inodes but the access will use
> page cache instead. And querying these flags should better show real
> on-disk status and not just whether DAX is used as that would result in an
> even bigger mess. So this feature seems to be somewhat orthogonal to the
> API I'm looking for.

True, I imagine once we have that flag we will be able to distinguish
the "saved" property and the "effective / live" property of DAX...
Also it's really not DAX that applications care about as much as "is
there page-cache indirection / overhead for this mapping?". That seems
to be a narrower guarantee that we can make than what "DAX" might
imply.

Jan Kara Oct. 3, 2018, 4:44 p.m. UTC | #19

On Wed 03-10-18 08:13:37, Dan Williams wrote:
> On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
> > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > these inode flags set in a configuration which does not allow DAX to be
> > used, you will still be able to access such inodes but the access will use
> > page cache instead. And querying these flags should better show real
> > on-disk status and not just whether DAX is used as that would result in an
> > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > API I'm looking for.
> 
> True, I imagine once we have that flag we will be able to distinguish
> the "saved" property and the "effective / live" property of DAX...
> Also it's really not DAX that applications care about as much as "is
> there page-cache indirection / overhead for this mapping?". That seems
> to be a narrower guarantee that we can make than what "DAX" might
> imply.

Right. So what do people think about my suggestion earlier in the thread to
use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
when DAX is in use, failure otherwise. Later we could extend it to be also
used as a hint for caching policy for the inode...

								Honza

Dan Williams Oct. 3, 2018, 9:13 p.m. UTC | #20

On Wed, Oct 3, 2018 at 9:46 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 03-10-18 08:13:37, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
> > > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > > these inode flags set in a configuration which does not allow DAX to be
> > > used, you will still be able to access such inodes but the access will use
> > > page cache instead. And querying these flags should better show real
> > > on-disk status and not just whether DAX is used as that would result in an
> > > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > > API I'm looking for.
> >
> > True, I imagine once we have that flag we will be able to distinguish
> > the "saved" property and the "effective / live" property of DAX...
> > Also it's really not DAX that applications care about as much as "is
> > there page-cache indirection / overhead for this mapping?". That seems
> > to be a narrower guarantee that we can make than what "DAX" might
> > imply.
>
> Right. So what do people think about my suggestion earlier in the thread to
> use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
> when DAX is in use, failure otherwise. Later we could extend it to be also
> used as a hint for caching policy for the inode...

The only problem is that you can't use it purely as a query. If we
ever did plumb it to be a hint you could not read the state without
writing the state.

mincore(2) seems to be close the intent of discovering whether RAM is
being consumed for a given address range, but it currently is
implemented to only indicate if *any* mapping is established, not
whether RAM is consumed. I can see an argument that a dax mapped file
should always report an empty mincore vector.

Johannes Thumshirn Oct. 4, 2018, 10:04 a.m. UTC | #21

On Wed, Oct 03, 2018 at 06:44:07PM +0200, Jan Kara wrote:
> On Wed 03-10-18 08:13:37, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
> > > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > > these inode flags set in a configuration which does not allow DAX to be
> > > used, you will still be able to access such inodes but the access will use
> > > page cache instead. And querying these flags should better show real
> > > on-disk status and not just whether DAX is used as that would result in an
> > > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > > API I'm looking for.
> > 
> > True, I imagine once we have that flag we will be able to distinguish
> > the "saved" property and the "effective / live" property of DAX...
> > Also it's really not DAX that applications care about as much as "is
> > there page-cache indirection / overhead for this mapping?". That seems
> > to be a narrower guarantee that we can make than what "DAX" might
> > imply.
> 
> Right. So what do people think about my suggestion earlier in the thread to
> use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
> when DAX is in use, failure otherwise. Later we could extend it to be also
> used as a hint for caching policy for the inode...

Hmm apart from Dan's objection that it can't really be used for a
query, isn't madvise(2) for mmap(2)?

But AFAIU (from looking at the xfs code, so please correct me if I',
wrong), DAX can be used for the traditional read(2)/write(2) interface
as well.

There is at least:

xfs_file_read_iter()
`-> if (IS_DAX(inode))
    `-> xfs_file_dax_read()
        `->dax_iomap_rw()

So IMHO something on an inode granularity would make more sens to me.

Byte,
	Johannes

Johannes Thumshirn Oct. 4, 2018, 10:09 a.m. UTC | #22

On Tue, Oct 02, 2018 at 08:06:34AM -0700, Christoph Hellwig wrote:
> There is no promise, sorry.

Well there have been lot's of articles on for instance lwn.net [1] [2]
[3] describing how to avoid the "overhead" of the page cache when
running on persistent memory.

So if I would be a database developer, I'd look into them and see how
I could exploit this for my needs.

So even if we don't want to call it a promise, it was at least an
advertisement and people are now taking our word for it.

[1] https://lwn.net/Articles/610174/
[2] https://lwn.net/Articles/717953/
[3] https://lwn.net/Articles/684828/ 

Byte,
	      Johannes

Christoph Hellwig Oct. 5, 2018, 6:25 a.m. UTC | #23

On Thu, Oct 04, 2018 at 12:09:49PM +0200, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 08:06:34AM -0700, Christoph Hellwig wrote:
> > There is no promise, sorry.
> 
> Well there have been lot's of articles on for instance lwn.net [1] [2]
> [3] describing how to avoid the "overhead" of the page cache when
> running on persistent memory.

Since when is an article on some website a promise (of what exactly)
by linux kernel developers?

Johannes Thumshirn Oct. 5, 2018, 6:35 a.m. UTC | #24

On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote:
> Since when is an article on some website a promise (of what exactly)
> by linux kernel developers?

Let's stop it here, this doesn't make any sort of forward progress.

Dan Williams Oct. 6, 2018, 1:17 a.m. UTC | #25

On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote:
> > Since when is an article on some website a promise (of what exactly)
> > by linux kernel developers?
>
> Let's stop it here, this doesn't make any sort of forward progress.
>

I do think there is some progress we can make if we separate DAX as an
access mechanism vs DAX as a resource utilization contract. My attempt
at representing Christoph's position is that the kernel should not be
advertising / making access mechanism guarantees. That makes sense.
Even with MAP_SYNC+DAX the kernel reserves the right to write-protect
mappings at will and trap access into a kernel handler. Additionally,
whether read(2) / write(2) does anything different behind the scenes
in DAX mode, or not should be irrelevant to the application.

That said what is certainly not irrelevant is a kernel giving
userspace visibility and control into resource utilization. Jan's
MADV_DIRECT_ACCESS let's the application make assumptions about page
cache utilization, we just need to another mechanism to read if a
mapping is effectively already in that state.

Jeff Moyer Oct. 9, 2018, 7:43 p.m. UTC | #26

Jan Kara <jack@suse.cz> writes:

> Hello,
>
> commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> mean time certain customer of ours started poking into /proc/<pid>/smaps
> and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> flags, the application just fails to start complaining that DAX support is
> missing in the kernel. The question now is how do we go about this?
>
> Strictly speaking, this is a userspace visible regression (as much as I
> think that application poking into VMA flags at this level is just too
> bold). Is there any precedens in handling similar issues with smaps which
> really exposes a lot of information that is dependent on kernel
> implementation details?
>
> I have attached a patch that is an obvious "fix" for the issue - just fake
> VM_MIXEDMAP flag in smaps. But I'm open to other suggestions...

Hi, Jan,

I'm intrigued by the use case.  Do I understand you correctly that the
database in question does not intend to make data persistent from
userspace?  In other words, fsync/msync system calls are being issued by
the database?

I guess what I'm really after is a statement of requirements or
expectations.  It would be great if you could convince the database
developer to engage in this discussion directly.

Cheers,
Jeff

Dan Williams Oct. 14, 2018, 3:47 p.m. UTC | #27

On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
> >
> > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote:
> > > Since when is an article on some website a promise (of what exactly)
> > > by linux kernel developers?
> >
> > Let's stop it here, this doesn't make any sort of forward progress.
> >
>
> I do think there is some progress we can make if we separate DAX as an
> access mechanism vs DAX as a resource utilization contract. My attempt
> at representing Christoph's position is that the kernel should not be
> advertising / making access mechanism guarantees. That makes sense.
> Even with MAP_SYNC+DAX the kernel reserves the right to write-protect
> mappings at will and trap access into a kernel handler. Additionally,
> whether read(2) / write(2) does anything different behind the scenes
> in DAX mode, or not should be irrelevant to the application.
>
> That said what is certainly not irrelevant is a kernel giving
> userspace visibility and control into resource utilization. Jan's
> MADV_DIRECT_ACCESS let's the application make assumptions about page
> cache utilization, we just need to another mechanism to read if a
> mapping is effectively already in that state.

I thought more about this today while reviewing the virtio-pmem driver
that will behave mostly like a DAX-capable pmem device except it will
be implemented by passing host page cache through to the guest as a
pmem device with a paravirtualized / asynchronous flush interface.
MAP_SYNC obviously needs to be disabled for this case, but still need
allow to some semblance of DAX operation to save allocating page cache
in the guest. The need to explicitly clarify the state of DAX is
growing with the different nuances of DAX operation.

Lets use a new MAP_DIRECT flag to positively assert that a given
mmap() call is setting up a memory mapping without page-cache or
buffered indirection. To be clear not my original MAP_DIRECT proposal
from a while back, instead just a flag to mmap() that causes the
mapping attempt to fail if there is any software buffering fronting
the memory mapping, or any requirement for software to manage flushing
outside of pushing writes through the cpu cache. This way, if we ever
extend MAP_SYNC for a buffered use case we can still definitely assert
that the mapping is "direct". So, MAP_DIRECT would fail for
traditional non-DAX block devices, and for this new virtio-pmem case.
It would also fail for any pmem device where we cannot assert that the
platform will take care of flushing write-pending-queues on power-loss
events.

Jan Kara Oct. 16, 2018, 8:25 a.m. UTC | #28

Hi Jeff,

On Tue 09-10-18 15:43:41, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > mean time certain customer of ours started poking into /proc/<pid>/smaps
> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > flags, the application just fails to start complaining that DAX support is
> > missing in the kernel. The question now is how do we go about this?
> >
> > Strictly speaking, this is a userspace visible regression (as much as I
> > think that application poking into VMA flags at this level is just too
> > bold). Is there any precedens in handling similar issues with smaps which
> > really exposes a lot of information that is dependent on kernel
> > implementation details?
> >
> > I have attached a patch that is an obvious "fix" for the issue - just fake
> > VM_MIXEDMAP flag in smaps. But I'm open to other suggestions...
> 
> I'm intrigued by the use case.  Do I understand you correctly that the
> database in question does not intend to make data persistent from
> userspace?  In other words, fsync/msync system calls are being issued by
> the database?

Yes, at least at the initial stage, they use fsync / msync to persist data.

> I guess what I'm really after is a statement of requirements or
> expectations.  It would be great if you could convince the database
> developer to engage in this discussion directly.

So I talked to them and what they really look after is the control over the
amount of memory needed by the kernel. And they are right that if your
storage needs page cache, the amount of memory you need to set aside for the
kernel is larger.

								Honza

Jeff Moyer Oct. 16, 2018, 12:35 p.m. UTC | #29

Jan Kara <jack@suse.cz> writes:

> Hi Jeff,
>
> On Tue 09-10-18 15:43:41, Jeff Moyer wrote:
>> I'm intrigued by the use case.  Do I understand you correctly that the
>> database in question does not intend to make data persistent from
>> userspace?  In other words, fsync/msync system calls are being issued by
>> the database?
>
> Yes, at least at the initial stage, they use fsync / msync to persist data.

OK.

>> I guess what I'm really after is a statement of requirements or
>> expectations.  It would be great if you could convince the database
>> developer to engage in this discussion directly.
>
> So I talked to them and what they really look after is the control over the
> amount of memory needed by the kernel. And they are right that if your
> storage needs page cache, the amount of memory you need to set aside for the
> kernel is larger.

OK, thanks a lot for following up, Jan!

-Jeff

Dan Williams Oct. 17, 2018, 8:01 p.m. UTC | #30

On Sun, Oct 14, 2018 at 8:47 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
> > >
> > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote:
> > > > Since when is an article on some website a promise (of what exactly)
> > > > by linux kernel developers?
> > >
> > > Let's stop it here, this doesn't make any sort of forward progress.
> > >
> >
> > I do think there is some progress we can make if we separate DAX as an
> > access mechanism vs DAX as a resource utilization contract. My attempt
> > at representing Christoph's position is that the kernel should not be
> > advertising / making access mechanism guarantees. That makes sense.
> > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect
> > mappings at will and trap access into a kernel handler. Additionally,
> > whether read(2) / write(2) does anything different behind the scenes
> > in DAX mode, or not should be irrelevant to the application.
> >
> > That said what is certainly not irrelevant is a kernel giving
> > userspace visibility and control into resource utilization. Jan's
> > MADV_DIRECT_ACCESS let's the application make assumptions about page
> > cache utilization, we just need to another mechanism to read if a
> > mapping is effectively already in that state.
>
> I thought more about this today while reviewing the virtio-pmem driver
> that will behave mostly like a DAX-capable pmem device except it will
> be implemented by passing host page cache through to the guest as a
> pmem device with a paravirtualized / asynchronous flush interface.
> MAP_SYNC obviously needs to be disabled for this case, but still need
> allow to some semblance of DAX operation to save allocating page cache
> in the guest. The need to explicitly clarify the state of DAX is
> growing with the different nuances of DAX operation.
>
> Lets use a new MAP_DIRECT flag to positively assert that a given
> mmap() call is setting up a memory mapping without page-cache or
> buffered indirection. To be clear not my original MAP_DIRECT proposal
> from a while back, instead just a flag to mmap() that causes the
> mapping attempt to fail if there is any software buffering fronting
> the memory mapping, or any requirement for software to manage flushing
> outside of pushing writes through the cpu cache. This way, if we ever
> extend MAP_SYNC for a buffered use case we can still definitely assert
> that the mapping is "direct". So, MAP_DIRECT would fail for
> traditional non-DAX block devices, and for this new virtio-pmem case.
> It would also fail for any pmem device where we cannot assert that the
> platform will take care of flushing write-pending-queues on power-loss
> events.

After letting this set for a few days I think I'm back to liking
MADV_DIRECT_ACCESS more since madvise() is more closely related to the
page-cache management than mmap. It does not solve the query vs enable
problem, but it's still a step towards giving applications what they
want with respect to resource expectations.

Perhaps a new syscall to retrieve the effective advice for a range?

     int madvice(void *addr, size_t length, int *advice);

Jeff Moyer Oct. 17, 2018, 8:23 p.m. UTC | #31

Jan Kara <jack@suse.cz> writes:

> [Added ext4, xfs, and linux-api folks to CC for the interface discussion]
>
> On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
>> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
>> > Hello,
>> > 
>> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
>> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
>> > mean time certain customer of ours started poking into /proc/<pid>/smaps
>> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
>> > flags, the application just fails to start complaining that DAX support is
>> > missing in the kernel. The question now is how do we go about this?
>> 
>> OK naive question from me, how do we want an application to be able to
>> check if it is running on a DAX mapping?
>
> The question from me is: Should application really care? After all DAX is
> just a caching decision. Sure it affects performance characteristics and
> memory usage of the kernel but it is not a correctness issue (in particular
> we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
> supported for current mapping). And in the future the details of what we do
> with DAX mapping can change - e.g. I could imagine we might decide to cache
> writes in DRAM but do direct PMEM access on reads. And all this could be
> auto-tuned based on media properties. And we don't want to tie our hands by
> specifying too narrowly how the kernel is going to behave.

For read and write, I would expect the O_DIRECT open flag to still work,
even for dax-capable persistent memory.  Is that a contentious opinion?

So, what we're really discussing is the behavior for mmap.  MAP_SYNC
will certainly ensure that the page cache is not used for writes.  It
would also be odd for us to decide to cache reads.  The only issue I can
see is that perhaps the application doesn't want to take a performance
hit on write faults.  I haven't heard that concern expressed in this
thread, though.

Just to be clear, this is my understanding of the world:

MAP_SYNC
- file system guarantees that metadata required to reach faulted-in file
  data is consistent on media before a write fault is completed.  A
  side-effect is that the page cache will not be used for
  writably-mapped pages.

and what I think Dan had proposed:

mmap flag, MAP_DIRECT
- file system guarantees that page cache will not be used to front storage.
  storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
  The subtle difference is that a write fault /may/ not result in metadata
  being written back to media.

and this is what I think you were proposing, Jan:

madvise flag, MADV_DIRECT_ACCESS
- same semantics as MAP_DIRECT, but specified via the madvise system call

Cheers,
Jeff

Dave Chinner Oct. 18, 2018, 12:25 a.m. UTC | #32

On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > [Added ext4, xfs, and linux-api folks to CC for the interface discussion]
> >
> > On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
> >> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> >> > Hello,
> >> > 
> >> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> >> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> >> > mean time certain customer of ours started poking into /proc/<pid>/smaps
> >> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> >> > flags, the application just fails to start complaining that DAX support is
> >> > missing in the kernel. The question now is how do we go about this?
> >> 
> >> OK naive question from me, how do we want an application to be able to
> >> check if it is running on a DAX mapping?
> >
> > The question from me is: Should application really care? After all DAX is
> > just a caching decision. Sure it affects performance characteristics and
> > memory usage of the kernel but it is not a correctness issue (in particular
> > we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
> > supported for current mapping). And in the future the details of what we do
> > with DAX mapping can change - e.g. I could imagine we might decide to cache
> > writes in DRAM but do direct PMEM access on reads. And all this could be
> > auto-tuned based on media properties. And we don't want to tie our hands by
> > specifying too narrowly how the kernel is going to behave.
> 
> For read and write, I would expect the O_DIRECT open flag to still work,
> even for dax-capable persistent memory.  Is that a contentious opinion?

Not contentious at all, because that's the way it currently works.
FYI, XFS decides what to do with read (and similarly writes) like
this:

        if (IS_DAX(inode))
                ret = xfs_file_dax_read(iocb, to);
        else if (iocb->ki_flags & IOCB_DIRECT)
                ret = xfs_file_dio_aio_read(iocb, to);
        else
                ret = xfs_file_buffered_aio_read(iocb, to);

Neither DAX or O_DIRECT on pmem use the page cache - the only difference
between the DAX read/write path and the O_DIRECT read/write path
is where the memcpy() into the user buffer is done. For DAX
it's done in the fsdax layer, for O_DIRECT it's done in the pmem
block driver.

> So, what we're really discussing is the behavior for mmap.

Yes.

> MAP_SYNC
> will certainly ensure that the page cache is not used for writes.  It
> would also be odd for us to decide to cache reads.  The only issue I can
> see is that perhaps the application doesn't want to take a performance
> hit on write faults.  I haven't heard that concern expressed in this
> thread, though.
> 
> Just to be clear, this is my understanding of the world:
> 
> MAP_SYNC
> - file system guarantees that metadata required to reach faulted-in file
>   data is consistent on media before a write fault is completed.  A
>   side-effect is that the page cache will not be used for
>   writably-mapped pages.

I think you are conflating current implementation with API
requirements - MAP_SYNC doesn't guarantee anything about page cache
use. The man page definition simply says "supported only for files
supporting DAX" and that it provides certain data integrity
guarantees. It does not define the implementation.

We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
because that's the only way we can currently provide the required
behaviour to userspace. However, if a filesystem can use the page
cache to provide the required functionality, then it's free to do
so.

i.e. if someone implements a pmem-based page cache, MAP_SYNC data
integrity could be provided /without DAX/ by any filesystem using
that persistent page cache. i.e. MAP_SYNC really only requires
mmap() of CPU addressable persistent memory - it does not require
DAX. Right now, however, the only way to get this functionality is
through a DAX capable filesystem on dax capable storage.

And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
COWs new pages in pmem and attaches them a special per-inode cache
on clean->dirty transition. Then on data sync, background writeback
or crash recovery, it migrates them from the cache into the file map
proper via atomic metadata pointer swaps.

IOWs, NOVA provides the correct MAP_SYNC semantics by using a
separate persistent per-inode write cache to provide the correct
crash recovery semantics for MAP_SYNC.

> and what I think Dan had proposed:
> 
> mmap flag, MAP_DIRECT
> - file system guarantees that page cache will not be used to front storage.
>   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
>   The subtle difference is that a write fault /may/ not result in metadata
>   being written back to media.

SIimilar to O_DIRECT, these semantics do not allow userspace apps to
replace msync/fsync with CPU cache flush operations. So any
application that uses this mode still needs to use either MAP_SYNC
or issue msync/fsync for data integrity.

If the app is using MAP_DIRECT, the what do we do if the filesystem
can't provide the required semantics for that specific operation? In
the case of O_DIRECT, we fall back to buffered IO because it has the
same data integrity semantics as O_DIRECT and will always work. It's
just slower and consumes more memory, but the app continues to work
just fine.

Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
without using the pagecache seems extremely problematic to me.  e.g.
an app already has an open MAP_DIRECT file, and a third party
reflinks it or dedupes it and the fs has to fall back to buffered IO
to do COW operations. This isn't the app's fault - the kernel should
just fall back transparently to using the page cache for the
MAP_DIRECT app and just keep working, just like it would if it was
using O_DIRECT read/write.

The point I'm trying to make here is that O_DIRECT is a /hint/, not
a guarantee, and it's done that way to prevent applications from
being presented with transient, potentially fatal error cases
because a filesystem implementation can't do a specific operation
through the direct IO path.

IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
guarantee. Over time we'll end up with filesystems that can
guarantee that MAP_DIRECT is always going to use DAX, in the same
way we have filesystems that guarantee O_DIRECT will always be
O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
no page cache will ever be used, then we are basically saying
"filesystems won't provide MAP_DIRECT even in common, useful cases
because they can't provide MAP_DIRECT in all cases." And that
doesn't seem like a very good solution to me.

> and this is what I think you were proposing, Jan:
> 
> madvise flag, MADV_DIRECT_ACCESS
> - same semantics as MAP_DIRECT, but specified via the madvise system call

Seems to be the equivalent of fcntl(F_SETFL, O_DIRECT). Makes sense
to have both MAP_DIRECT and MADV_DIRECT_ACCESS to me - one is an
init time flag, the other is a run time flag.

Cheers,

Dave.

Jan Kara Oct. 18, 2018, 2:55 p.m. UTC | #33

On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > MAP_SYNC
> > - file system guarantees that metadata required to reach faulted-in file
> >   data is consistent on media before a write fault is completed.  A
> >   side-effect is that the page cache will not be used for
> >   writably-mapped pages.
> 
> I think you are conflating current implementation with API
> requirements - MAP_SYNC doesn't guarantee anything about page cache
> use. The man page definition simply says "supported only for files
> supporting DAX" and that it provides certain data integrity
> guarantees. It does not define the implementation.
> 
> We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> because that's the only way we can currently provide the required
> behaviour to userspace. However, if a filesystem can use the page
> cache to provide the required functionality, then it's free to do
> so.
> 
> i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> integrity could be provided /without DAX/ by any filesystem using
> that persistent page cache. i.e. MAP_SYNC really only requires
> mmap() of CPU addressable persistent memory - it does not require
> DAX. Right now, however, the only way to get this functionality is
> through a DAX capable filesystem on dax capable storage.
> 
> And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> COWs new pages in pmem and attaches them a special per-inode cache
> on clean->dirty transition. Then on data sync, background writeback
> or crash recovery, it migrates them from the cache into the file map
> proper via atomic metadata pointer swaps.
> 
> IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> separate persistent per-inode write cache to provide the correct
> crash recovery semantics for MAP_SYNC.

Corect. NOVA would be able to provide MAP_SYNC semantics without DAX. But
effectively it will be also able to provide MAP_DIRECT semantics, right?
Because there won't be DRAM between app and persistent storage and I don't
think COW tricks or other data integrity methods are that interesting for
the application. Most users of O_DIRECT are concerned about getting close
to media speed performance and low DRAM usage...

> > and what I think Dan had proposed:
> > 
> > mmap flag, MAP_DIRECT
> > - file system guarantees that page cache will not be used to front storage.
> >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> >   The subtle difference is that a write fault /may/ not result in metadata
> >   being written back to media.
> 
> SIimilar to O_DIRECT, these semantics do not allow userspace apps to
> replace msync/fsync with CPU cache flush operations. So any
> application that uses this mode still needs to use either MAP_SYNC
> or issue msync/fsync for data integrity.
> 
> If the app is using MAP_DIRECT, the what do we do if the filesystem
> can't provide the required semantics for that specific operation? In
> the case of O_DIRECT, we fall back to buffered IO because it has the
> same data integrity semantics as O_DIRECT and will always work. It's
> just slower and consumes more memory, but the app continues to work
> just fine.
> 
> Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> without using the pagecache seems extremely problematic to me.  e.g.
> an app already has an open MAP_DIRECT file, and a third party
> reflinks it or dedupes it and the fs has to fall back to buffered IO
> to do COW operations. This isn't the app's fault - the kernel should
> just fall back transparently to using the page cache for the
> MAP_DIRECT app and just keep working, just like it would if it was
> using O_DIRECT read/write.

There's another option of failing reflink / dedupe with EBUSY if the file
is mapped with MAP_DIRECT and the filesystem cannot support relink &
MAP_DIRECT together. But there are downsides to that as well.

> The point I'm trying to make here is that O_DIRECT is a /hint/, not
> a guarantee, and it's done that way to prevent applications from
> being presented with transient, potentially fatal error cases
> because a filesystem implementation can't do a specific operation
> through the direct IO path.
> 
> IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> guarantee. Over time we'll end up with filesystems that can
> guarantee that MAP_DIRECT is always going to use DAX, in the same
> way we have filesystems that guarantee O_DIRECT will always be
> O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> no page cache will ever be used, then we are basically saying
> "filesystems won't provide MAP_DIRECT even in common, useful cases
> because they can't provide MAP_DIRECT in all cases." And that
> doesn't seem like a very good solution to me.

These are good points. I'm just somewhat wary of the situation where users
will map files with MAP_DIRECT and then the machine starts thrashing
because the file got reflinked and thus pagecache gets used suddently.
With O_DIRECT the fallback to buffered IO is quite rare (at least for major
filesystems) so usually people just won't notice. If fallback for
MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.

								Honza

Jan Kara Oct. 18, 2018, 5:43 p.m. UTC | #34

On Wed 17-10-18 13:01:15, Dan Williams wrote:
> On Sun, Oct 14, 2018 at 8:47 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
> > > >
> > > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote:
> > > > > Since when is an article on some website a promise (of what exactly)
> > > > > by linux kernel developers?
> > > >
> > > > Let's stop it here, this doesn't make any sort of forward progress.
> > > >
> > >
> > > I do think there is some progress we can make if we separate DAX as an
> > > access mechanism vs DAX as a resource utilization contract. My attempt
> > > at representing Christoph's position is that the kernel should not be
> > > advertising / making access mechanism guarantees. That makes sense.
> > > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect
> > > mappings at will and trap access into a kernel handler. Additionally,
> > > whether read(2) / write(2) does anything different behind the scenes
> > > in DAX mode, or not should be irrelevant to the application.
> > >
> > > That said what is certainly not irrelevant is a kernel giving
> > > userspace visibility and control into resource utilization. Jan's
> > > MADV_DIRECT_ACCESS let's the application make assumptions about page
> > > cache utilization, we just need to another mechanism to read if a
> > > mapping is effectively already in that state.
> >
> > I thought more about this today while reviewing the virtio-pmem driver
> > that will behave mostly like a DAX-capable pmem device except it will
> > be implemented by passing host page cache through to the guest as a
> > pmem device with a paravirtualized / asynchronous flush interface.
> > MAP_SYNC obviously needs to be disabled for this case, but still need
> > allow to some semblance of DAX operation to save allocating page cache
> > in the guest. The need to explicitly clarify the state of DAX is
> > growing with the different nuances of DAX operation.
> >
> > Lets use a new MAP_DIRECT flag to positively assert that a given
> > mmap() call is setting up a memory mapping without page-cache or
> > buffered indirection. To be clear not my original MAP_DIRECT proposal
> > from a while back, instead just a flag to mmap() that causes the
> > mapping attempt to fail if there is any software buffering fronting
> > the memory mapping, or any requirement for software to manage flushing
> > outside of pushing writes through the cpu cache. This way, if we ever
> > extend MAP_SYNC for a buffered use case we can still definitely assert
> > that the mapping is "direct". So, MAP_DIRECT would fail for
> > traditional non-DAX block devices, and for this new virtio-pmem case.
> > It would also fail for any pmem device where we cannot assert that the
> > platform will take care of flushing write-pending-queues on power-loss
> > events.
> 
> After letting this set for a few days I think I'm back to liking
> MADV_DIRECT_ACCESS more since madvise() is more closely related to the
> page-cache management than mmap. It does not solve the query vs enable
> problem, but it's still a step towards giving applications what they
> want with respect to resource expectations.

Yeah, I don't have a strong opinion wrt mmap flag vs madvise flag.

> Perhaps a new syscall to retrieve the effective advice for a range?
> 
>      int madvice(void *addr, size_t length, int *advice);

After some thought, I'm not 100% sure this is really needed. I know about
apps that want to make sure DRAM is not consumed - for those mmap / madvise
flag is fine if it returns error in case the feature cannot be provided.
Most other apps don't care whether DAX is on or off. So this call would be
needed only if someone wanted to behave differently depending on whether
DAX is used or not. And although I can imagine some application like that,
I'm not sure how real that is...

								Honza

Dan Williams Oct. 18, 2018, 7:10 p.m. UTC | #35

On Thu, Oct 18, 2018 at 10:43 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 17-10-18 13:01:15, Dan Williams wrote:
> > On Sun, Oct 14, 2018 at 8:47 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > >
> > > > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
> > > > >
> > > > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote:
> > > > > > Since when is an article on some website a promise (of what exactly)
> > > > > > by linux kernel developers?
> > > > >
> > > > > Let's stop it here, this doesn't make any sort of forward progress.
> > > > >
> > > >
> > > > I do think there is some progress we can make if we separate DAX as an
> > > > access mechanism vs DAX as a resource utilization contract. My attempt
> > > > at representing Christoph's position is that the kernel should not be
> > > > advertising / making access mechanism guarantees. That makes sense.
> > > > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect
> > > > mappings at will and trap access into a kernel handler. Additionally,
> > > > whether read(2) / write(2) does anything different behind the scenes
> > > > in DAX mode, or not should be irrelevant to the application.
> > > >
> > > > That said what is certainly not irrelevant is a kernel giving
> > > > userspace visibility and control into resource utilization. Jan's
> > > > MADV_DIRECT_ACCESS let's the application make assumptions about page
> > > > cache utilization, we just need to another mechanism to read if a
> > > > mapping is effectively already in that state.
> > >
> > > I thought more about this today while reviewing the virtio-pmem driver
> > > that will behave mostly like a DAX-capable pmem device except it will
> > > be implemented by passing host page cache through to the guest as a
> > > pmem device with a paravirtualized / asynchronous flush interface.
> > > MAP_SYNC obviously needs to be disabled for this case, but still need
> > > allow to some semblance of DAX operation to save allocating page cache
> > > in the guest. The need to explicitly clarify the state of DAX is
> > > growing with the different nuances of DAX operation.
> > >
> > > Lets use a new MAP_DIRECT flag to positively assert that a given
> > > mmap() call is setting up a memory mapping without page-cache or
> > > buffered indirection. To be clear not my original MAP_DIRECT proposal
> > > from a while back, instead just a flag to mmap() that causes the
> > > mapping attempt to fail if there is any software buffering fronting
> > > the memory mapping, or any requirement for software to manage flushing
> > > outside of pushing writes through the cpu cache. This way, if we ever
> > > extend MAP_SYNC for a buffered use case we can still definitely assert
> > > that the mapping is "direct". So, MAP_DIRECT would fail for
> > > traditional non-DAX block devices, and for this new virtio-pmem case.
> > > It would also fail for any pmem device where we cannot assert that the
> > > platform will take care of flushing write-pending-queues on power-loss
> > > events.
> >
> > After letting this set for a few days I think I'm back to liking
> > MADV_DIRECT_ACCESS more since madvise() is more closely related to the
> > page-cache management than mmap. It does not solve the query vs enable
> > problem, but it's still a step towards giving applications what they
> > want with respect to resource expectations.
>
> Yeah, I don't have a strong opinion wrt mmap flag vs madvise flag.

MADV_DIRECT_ACCESS seems more flexible as the agent setting up the
mapping does not need to be the one concerned with the DAX-state of
the mapping. It's also the canonical interface for affecting page
cache behavior.

> > Perhaps a new syscall to retrieve the effective advice for a range?
> >
> >      int madvice(void *addr, size_t length, int *advice);
>
> After some thought, I'm not 100% sure this is really needed. I know about
> apps that want to make sure DRAM is not consumed - for those mmap / madvise
> flag is fine if it returns error in case the feature cannot be provided.
> Most other apps don't care whether DAX is on or off. So this call would be
> needed only if someone wanted to behave differently depending on whether
> DAX is used or not. And although I can imagine some application like that,
> I'm not sure how real that is...

True, yes, if an application wants the behavior just ask.

The only caveat to address all the use cases for applications making
decisions based on the presence of DAX is to make MADV_DIRECT_ACCESS
fail if the mapping was not established with MAP_SYNC. That way we
have both a way to assert that page cache resources are not being
consumed, and that the kernel is handling metadata synchronization for
any write-faults.

Jeff Moyer Oct. 18, 2018, 9:05 p.m. UTC | #36

Dave,

Thanks for the detailed response!  I hadn't considered the NOVA use case
at all.

Cheers,
Jeff

Dave Chinner Oct. 19, 2018, 12:43 a.m. UTC | #37

On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > MAP_SYNC
> > > - file system guarantees that metadata required to reach faulted-in file
> > >   data is consistent on media before a write fault is completed.  A
> > >   side-effect is that the page cache will not be used for
> > >   writably-mapped pages.
> > 
> > I think you are conflating current implementation with API
> > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > use. The man page definition simply says "supported only for files
> > supporting DAX" and that it provides certain data integrity
> > guarantees. It does not define the implementation.
> > 
> > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > because that's the only way we can currently provide the required
> > behaviour to userspace. However, if a filesystem can use the page
> > cache to provide the required functionality, then it's free to do
> > so.
> > 
> > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > integrity could be provided /without DAX/ by any filesystem using
> > that persistent page cache. i.e. MAP_SYNC really only requires
> > mmap() of CPU addressable persistent memory - it does not require
> > DAX. Right now, however, the only way to get this functionality is
> > through a DAX capable filesystem on dax capable storage.
> > 
> > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > COWs new pages in pmem and attaches them a special per-inode cache
> > on clean->dirty transition. Then on data sync, background writeback
> > or crash recovery, it migrates them from the cache into the file map
> > proper via atomic metadata pointer swaps.
> > 
> > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > separate persistent per-inode write cache to provide the correct
> > crash recovery semantics for MAP_SYNC.
> 
> Corect. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> effectively it will be also able to provide MAP_DIRECT semantics, right?

Yes, I think so. It still needs to do COW on first write fault,
but then the app has direct access to the data buffer until it is
cleaned and put back in place. The "put back in place" is just an
atomic swap of metadata pointers, so it doesn't need the page cache
at all...

> Because there won't be DRAM between app and persistent storage and I don't
> think COW tricks or other data integrity methods are that interesting for
> the application.

Not for the application, but the filesystem still wants to support
snapshots and other such functionality that requires COW. And NOVA
doesn't have write-in-place functionality at all - it always COWs
on the clean->dirty transition.

> Most users of O_DIRECT are concerned about getting close
> to media speed performance and low DRAM usage...

*nod*

> > > and what I think Dan had proposed:
> > > 
> > > mmap flag, MAP_DIRECT
> > > - file system guarantees that page cache will not be used to front storage.
> > >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> > >   The subtle difference is that a write fault /may/ not result in metadata
> > >   being written back to media.
> > 
> > SIimilar to O_DIRECT, these semantics do not allow userspace apps to
> > replace msync/fsync with CPU cache flush operations. So any
> > application that uses this mode still needs to use either MAP_SYNC
> > or issue msync/fsync for data integrity.
> > 
> > If the app is using MAP_DIRECT, the what do we do if the filesystem
> > can't provide the required semantics for that specific operation? In
> > the case of O_DIRECT, we fall back to buffered IO because it has the
> > same data integrity semantics as O_DIRECT and will always work. It's
> > just slower and consumes more memory, but the app continues to work
> > just fine.
> > 
> > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > without using the pagecache seems extremely problematic to me.  e.g.
> > an app already has an open MAP_DIRECT file, and a third party
> > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > to do COW operations. This isn't the app's fault - the kernel should
> > just fall back transparently to using the page cache for the
> > MAP_DIRECT app and just keep working, just like it would if it was
> > using O_DIRECT read/write.
> 
> There's another option of failing reflink / dedupe with EBUSY if the file
> is mapped with MAP_DIRECT and the filesystem cannot support relink &
> MAP_DIRECT together. But there are downsides to that as well.

Yup, not the least that setting MAP_DIRECT can race with a
reflink....

> > The point I'm trying to make here is that O_DIRECT is a /hint/, not
> > a guarantee, and it's done that way to prevent applications from
> > being presented with transient, potentially fatal error cases
> > because a filesystem implementation can't do a specific operation
> > through the direct IO path.
> > 
> > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> > guarantee. Over time we'll end up with filesystems that can
> > guarantee that MAP_DIRECT is always going to use DAX, in the same
> > way we have filesystems that guarantee O_DIRECT will always be
> > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> > no page cache will ever be used, then we are basically saying
> > "filesystems won't provide MAP_DIRECT even in common, useful cases
> > because they can't provide MAP_DIRECT in all cases." And that
> > doesn't seem like a very good solution to me.
> 
> These are good points. I'm just somewhat wary of the situation where users
> will map files with MAP_DIRECT and then the machine starts thrashing
> because the file got reflinked and thus pagecache gets used suddently.

It's still better than apps randomly getting SIGBUS.

FWIW, this suggests that we probably need to be able to host both
DAX pages and page cache pages on the same file at the same time,
and be able to handle page faults based on the type of page being
mapped (different sets of fault ops for different page types?)
and have fallback paths when the page type needs to be changed
between direct and cached during the fault....

> With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> filesystems) so usually people just won't notice. If fallback for
> MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.

Which is just like the situation where O_DIRECT on ext3 was not very
useful, but on other filesystems like XFS it was fully functional.

IMO, the fact that a specific filesytem has a suboptimal fallback
path for an uncommon behaviour isn't an argument against MAP_DIRECT
as a hint - it's actually a feature. If MAP_DIRECT can't be used
until it's always direct access, then most filesystems wouldn't be
able to provide any faster paths at all. It's much better to have
partial functionality now than it is to never have the functionality
at all, and so we need to design in the flexibility we need to
iteratively improve implementations without needing API changes that
will break applications.

Cheers,

Dave.

Dave Chinner Oct. 19, 2018, 3:01 a.m. UTC | #38

On Thu, Oct 18, 2018 at 12:10:13PM -0700, Dan Williams wrote:
> The only caveat to address all the use cases for applications making
> decisions based on the presence of DAX

And that's how we've got into this mess.

Applications need to focus on the functionality they require, not
the technology that provides it. That's the root of the we are
trying to solve here and really I don't care if we have to break
existing applications to do it. i.e. we've made no promises about
API/ABI stability and the functionality is still experimental.

Fundamentally, DAX is a technology, not an API property. The two
"DAX" API properties that matter to applications are:

	1. does mmap allow us to use CPU flush instructions for data
	integrity operations safely? And
	2. can mmap directly access the backing store without
	incurring any additional overhead?

MAP_SYNC provides #1, MAP_DIRECT provides #2, and DAX provides both.
However, they do not define DAX, nor does DAX define them. e.g.

	MAP_SYNC can be provided by a persistent memory page cache.
	But a persistent memory page cache does not provide
	MAP_DIRECT.

	MAP_SYNC can be provided by filesystem DAX, but *only* when
	direct access is used. i.e. MAP_SYNC | MAP_DIRECT

	MAP_DIRECT can be provided by filesystem DAX, but it does
	not imply or require MAP_SYNC behaviour.

IOWs, using MAP_SYNC and/or MAP_DIRECT to answering an "is DAX
present" question ties the API to a technology rather than to the
functionality the technology provides applications.

i.e. If the requested behaviour/property is not available from the
underlying technology, then the app needs to handle that error and
use a different access method.

> applications making
> decisions based on the presence of DAX
> is to make MADV_DIRECT_ACCESS
> fail if the mapping was not established with MAP_SYNC.

And so this is wrong - MADV_DIRECT_ACCESS does not require MAP_SYNC.

It is perfectly legal for MADV_DIRECT_ACCESS to be used without
MAP_SYNC - the app just needs to use msync/fsync instead.

Wanting to enable full userspace CPU data sync semantics via
madvise() implies we also need MADV_SYNC in addition to
MADV_DIRECT_ACCESS.

i.e. Apps that are currently testing for dax should use
mmap(MAP_SYNC|MAP_DIRECT) or madvise(MADV_SYNC|MADV_DIRECT) and they
will fail if the underlying storage is not DAX capable. The app
doesn't need to poke at anything else to see if DAX is enabled - if
the functionality is there then it will work, otherwise they need to
handle the error and do something else.

> That way we
> have both a way to assert that page cache resources are not being
> consumed, and that the kernel is handling metadata synchronization for
> any write-faults.

Yes, we need to do that, but not at the cost of having the API
prevent apps from ever being able to use direct access + msync/fsync
data integrity operations.

Cheers,

Dave.

Dan Williams Oct. 30, 2018, 6:30 a.m. UTC | #39

On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > MAP_SYNC
> > > > - file system guarantees that metadata required to reach faulted-in file
> > > >   data is consistent on media before a write fault is completed.  A
> > > >   side-effect is that the page cache will not be used for
> > > >   writably-mapped pages.
> > >
> > > I think you are conflating current implementation with API
> > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > use. The man page definition simply says "supported only for files
> > > supporting DAX" and that it provides certain data integrity
> > > guarantees. It does not define the implementation.
> > >
> > > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > > because that's the only way we can currently provide the required
> > > behaviour to userspace. However, if a filesystem can use the page
> > > cache to provide the required functionality, then it's free to do
> > > so.
> > >
> > > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > > integrity could be provided /without DAX/ by any filesystem using
> > > that persistent page cache. i.e. MAP_SYNC really only requires
> > > mmap() of CPU addressable persistent memory - it does not require
> > > DAX. Right now, however, the only way to get this functionality is
> > > through a DAX capable filesystem on dax capable storage.
> > >
> > > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > > COWs new pages in pmem and attaches them a special per-inode cache
> > > on clean->dirty transition. Then on data sync, background writeback
> > > or crash recovery, it migrates them from the cache into the file map
> > > proper via atomic metadata pointer swaps.
> > >
> > > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > > separate persistent per-inode write cache to provide the correct
> > > crash recovery semantics for MAP_SYNC.
> >
> > Corect. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> > effectively it will be also able to provide MAP_DIRECT semantics, right?
>
> Yes, I think so. It still needs to do COW on first write fault,
> but then the app has direct access to the data buffer until it is
> cleaned and put back in place. The "put back in place" is just an
> atomic swap of metadata pointers, so it doesn't need the page cache
> at all...
>
> > Because there won't be DRAM between app and persistent storage and I don't
> > think COW tricks or other data integrity methods are that interesting for
> > the application.
>
> Not for the application, but the filesystem still wants to support
> snapshots and other such functionality that requires COW. And NOVA
> doesn't have write-in-place functionality at all - it always COWs
> on the clean->dirty transition.
>
> > Most users of O_DIRECT are concerned about getting close
> > to media speed performance and low DRAM usage...
>
> *nod*
>
> > > > and what I think Dan had proposed:
> > > >
> > > > mmap flag, MAP_DIRECT
> > > > - file system guarantees that page cache will not be used to front storage.
> > > >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> > > >   The subtle difference is that a write fault /may/ not result in metadata
> > > >   being written back to media.
> > >
> > > SIimilar to O_DIRECT, these semantics do not allow userspace apps to
> > > replace msync/fsync with CPU cache flush operations. So any
> > > application that uses this mode still needs to use either MAP_SYNC
> > > or issue msync/fsync for data integrity.
> > >
> > > If the app is using MAP_DIRECT, the what do we do if the filesystem
> > > can't provide the required semantics for that specific operation? In
> > > the case of O_DIRECT, we fall back to buffered IO because it has the
> > > same data integrity semantics as O_DIRECT and will always work. It's
> > > just slower and consumes more memory, but the app continues to work
> > > just fine.
> > >
> > > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > > without using the pagecache seems extremely problematic to me.  e.g.
> > > an app already has an open MAP_DIRECT file, and a third party
> > > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > > to do COW operations. This isn't the app's fault - the kernel should
> > > just fall back transparently to using the page cache for the
> > > MAP_DIRECT app and just keep working, just like it would if it was
> > > using O_DIRECT read/write.
> >
> > There's another option of failing reflink / dedupe with EBUSY if the file
> > is mapped with MAP_DIRECT and the filesystem cannot support relink &
> > MAP_DIRECT together. But there are downsides to that as well.
>
> Yup, not the least that setting MAP_DIRECT can race with a
> reflink....
>
> > > The point I'm trying to make here is that O_DIRECT is a /hint/, not
> > > a guarantee, and it's done that way to prevent applications from
> > > being presented with transient, potentially fatal error cases
> > > because a filesystem implementation can't do a specific operation
> > > through the direct IO path.
> > >
> > > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> > > guarantee. Over time we'll end up with filesystems that can
> > > guarantee that MAP_DIRECT is always going to use DAX, in the same
> > > way we have filesystems that guarantee O_DIRECT will always be
> > > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> > > no page cache will ever be used, then we are basically saying
> > > "filesystems won't provide MAP_DIRECT even in common, useful cases
> > > because they can't provide MAP_DIRECT in all cases." And that
> > > doesn't seem like a very good solution to me.
> >
> > These are good points. I'm just somewhat wary of the situation where users
> > will map files with MAP_DIRECT and then the machine starts thrashing
> > because the file got reflinked and thus pagecache gets used suddently.
>
> It's still better than apps randomly getting SIGBUS.
>
> FWIW, this suggests that we probably need to be able to host both
> DAX pages and page cache pages on the same file at the same time,
> and be able to handle page faults based on the type of page being
> mapped (different sets of fault ops for different page types?)
> and have fallback paths when the page type needs to be changed
> between direct and cached during the fault....
>
> > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > filesystems) so usually people just won't notice. If fallback for
> > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
>
> Which is just like the situation where O_DIRECT on ext3 was not very
> useful, but on other filesystems like XFS it was fully functional.
>
> IMO, the fact that a specific filesytem has a suboptimal fallback
> path for an uncommon behaviour isn't an argument against MAP_DIRECT
> as a hint - it's actually a feature. If MAP_DIRECT can't be used
> until it's always direct access, then most filesystems wouldn't be
> able to provide any faster paths at all. It's much better to have
> partial functionality now than it is to never have the functionality
> at all, and so we need to design in the flexibility we need to
> iteratively improve implementations without needing API changes that
> will break applications.

The hard guarantee requirement still remains though because an
application that expects combined MAP_SYNC|MAP_DIRECT semantics will
be surprised if the MAP_DIRECT property silently disappears. I think
it still makes some sense as a hint for apps that want to minimize
page cache, but for the applications with a flush from userspace model
I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
still gives the filesystem the option to inject page-cache at will,
but with an application coordination point.

Dave Chinner Oct. 30, 2018, 10:49 p.m. UTC | #40

On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > MAP_SYNC
> > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > >   data is consistent on media before a write fault is completed.  A
> > > > >   side-effect is that the page cache will not be used for
> > > > >   writably-mapped pages.
> > > >
> > > > I think you are conflating current implementation with API
> > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > use. The man page definition simply says "supported only for files
> > > > supporting DAX" and that it provides certain data integrity
> > > > guarantees. It does not define the implementation.
....
> > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > filesystems) so usually people just won't notice. If fallback for
> > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> >
> > Which is just like the situation where O_DIRECT on ext3 was not very
> > useful, but on other filesystems like XFS it was fully functional.
> >
> > IMO, the fact that a specific filesytem has a suboptimal fallback
> > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > until it's always direct access, then most filesystems wouldn't be
> > able to provide any faster paths at all. It's much better to have
> > partial functionality now than it is to never have the functionality
> > at all, and so we need to design in the flexibility we need to
> > iteratively improve implementations without needing API changes that
> > will break applications.
> 
> The hard guarantee requirement still remains though because an
> application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> be surprised if the MAP_DIRECT property silently disappears.

Why would they be surprised? They won't even notice it if the
filesystem can provide MAP_SYNC without MAP_DIRECT.

And that's the whole point.

MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
visible to the filesystem and the filesystem does nothing to enforce
them. If someone does something that requires the page cache (e.g.
calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
heap of new work to do. And, in some cases, the filesystem may not
be able to provide MAP_DIRECT as a result..

IOWs, the filesystem cannot guarantee MAP_DIRECT and the
circumstances under which MAP_DIRECT will and will not work are
dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
applications randomly segfaulting in production as things like
backups, indexers, etc run over the filesystem and do their work.

This is why MAP_DIRECT needs to be an optimisation, not a
requirement - things will still work if MAP_DIRECT is not used. What
matters to these applications is MAP_SYNC - if we break MAP_SYNC,
then the application data integrity model is violated. That's not an
acceptible outcome.

The problem, it seems to me, is that people are unable to separate
MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
provide MAP_SYNC functionality on DAX mappings. However, that's a
/filesystem implementation issue/, not an API guarantee we need to
provide to userspace.

If we implement a persistent page cache (e.g. allocate page cache
pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
could provide applications with the MAP_SYNC data integrity model
without MAP_DIRECT. Indeed, those filesystems would not even be able
to provide MAP_DIRECT semantics because they aren't backed by pmem.

Hence if applications that want MAP_SYNC are hard coded
MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
those applications are going to fail on a filesystem that provides
only MAP_SYNC. This is despite the fact the applications would
function correctly and the data integrity model would be maintained.
i.e. the failure is because applications have assumed MAP_SYNC can
only be provided by a DAX implementation, not because MAP_SYNC is
not supported.

MAP_SYNC really isn't about DAX at all. It's about enabling a data
integrity model that requires the filesystem to provide userspace
access to CPU addressable persistent memory.  DAX+MAP_DIRECT is just
one method of providing this functionality, but it's not the only
method. Our API needs to be future proof rather than an encoding of
the existing implementation limitations, otherwise apps will have to
be re-written as every new MAP_SYNC capable technology comes along.

In summary:

	MAP_DIRECT is an access hint.

	MAP_SYNC provides a data integrity model guarantee.

	MAP_SYNC may imply MAP_DIRECT for specific implementations,
	but it does not require or guarantee MAP_DIRECT.

Let's compare that with O_DIRECT:

	O_DIRECT in an access hint.

	O_DSYNC provides a data integrity model guarantee.

	O_DSYNC may imply O_DIRECT for specific implementations, but
	it does not require or guarantee O_DIRECT.

Consistency in access and data integrity models is a good thing. DAX
and pmem is not an exception. We need to use a model we know works
and has proven itself over a long period of time.

> I think
> it still makes some sense as a hint for apps that want to minimize
> page cache, but for the applications with a flush from userspace model
> I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> still gives the filesystem the option to inject page-cache at will,
> but with an application coordination point.

Why make it more complex for applications than it needs to be? 

Cheers,

Dave.

Dan Williams Oct. 30, 2018, 10:59 p.m. UTC | #41

On Tue, Oct 30, 2018 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > > MAP_SYNC
> > > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > > >   data is consistent on media before a write fault is completed.  A
> > > > > >   side-effect is that the page cache will not be used for
> > > > > >   writably-mapped pages.
> > > > >
> > > > > I think you are conflating current implementation with API
> > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > > use. The man page definition simply says "supported only for files
> > > > > supporting DAX" and that it provides certain data integrity
> > > > > guarantees. It does not define the implementation.
> ....
> > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > > filesystems) so usually people just won't notice. If fallback for
> > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> > >
> > > Which is just like the situation where O_DIRECT on ext3 was not very
> > > useful, but on other filesystems like XFS it was fully functional.
> > >
> > > IMO, the fact that a specific filesytem has a suboptimal fallback
> > > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > > until it's always direct access, then most filesystems wouldn't be
> > > able to provide any faster paths at all. It's much better to have
> > > partial functionality now than it is to never have the functionality
> > > at all, and so we need to design in the flexibility we need to
> > > iteratively improve implementations without needing API changes that
> > > will break applications.
> >
> > The hard guarantee requirement still remains though because an
> > application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> > be surprised if the MAP_DIRECT property silently disappears.
>
> Why would they be surprised? They won't even notice it if the
> filesystem can provide MAP_SYNC without MAP_DIRECT.
>
> And that's the whole point.
>
> MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
> visible to the filesystem and the filesystem does nothing to enforce
> them. If someone does something that requires the page cache (e.g.
> calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
> heap of new work to do. And, in some cases, the filesystem may not
> be able to provide MAP_DIRECT as a result..
>
> IOWs, the filesystem cannot guarantee MAP_DIRECT and the
> circumstances under which MAP_DIRECT will and will not work are
> dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
> applications randomly segfaulting in production as things like
> backups, indexers, etc run over the filesystem and do their work.
>
> This is why MAP_DIRECT needs to be an optimisation, not a
> requirement - things will still work if MAP_DIRECT is not used. What
> matters to these applications is MAP_SYNC - if we break MAP_SYNC,
> then the application data integrity model is violated. That's not an
> acceptible outcome.
>
> The problem, it seems to me, is that people are unable to separate
> MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
> MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
> provide MAP_SYNC functionality on DAX mappings. However, that's a
> /filesystem implementation issue/, not an API guarantee we need to
> provide to userspace.
>
> If we implement a persistent page cache (e.g. allocate page cache
> pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
> could provide applications with the MAP_SYNC data integrity model
> without MAP_DIRECT. Indeed, those filesystems would not even be able
> to provide MAP_DIRECT semantics because they aren't backed by pmem.
>
> Hence if applications that want MAP_SYNC are hard coded
> MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
> those applications are going to fail on a filesystem that provides
> only MAP_SYNC. This is despite the fact the applications would
> function correctly and the data integrity model would be maintained.
> i.e. the failure is because applications have assumed MAP_SYNC can
> only be provided by a DAX implementation, not because MAP_SYNC is
> not supported.
>
> MAP_SYNC really isn't about DAX at all. It's about enabling a data
> integrity model that requires the filesystem to provide userspace
> access to CPU addressable persistent memory.  DAX+MAP_DIRECT is just
> one method of providing this functionality, but it's not the only
> method. Our API needs to be future proof rather than an encoding of
> the existing implementation limitations, otherwise apps will have to
> be re-written as every new MAP_SYNC capable technology comes along.
>
> In summary:
>
>         MAP_DIRECT is an access hint.
>
>         MAP_SYNC provides a data integrity model guarantee.
>
>         MAP_SYNC may imply MAP_DIRECT for specific implementations,
>         but it does not require or guarantee MAP_DIRECT.
>
> Let's compare that with O_DIRECT:
>
>         O_DIRECT in an access hint.
>
>         O_DSYNC provides a data integrity model guarantee.
>
>         O_DSYNC may imply O_DIRECT for specific implementations, but
>         it does not require or guarantee O_DIRECT.
>
> Consistency in access and data integrity models is a good thing. DAX
> and pmem is not an exception. We need to use a model we know works
> and has proven itself over a long period of time.
>
> > I think
> > it still makes some sense as a hint for apps that want to minimize
> > page cache, but for the applications with a flush from userspace model
> > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> > still gives the filesystem the option to inject page-cache at will,
> > but with an application coordination point.
>
> Why make it more complex for applications than it needs to be?

With the clarification that MAP_SYNC implies "cpu cache flush to
persistent memory page-cache *or* dax to persistent memory" I think
all of the concerns are addressed. I was conflating MAP_DIRECT as "no
page cache indirection", but the indirection does not matter if the
page cache itself is persisted.

Yasunori Gotou (Fujitsu) Oct. 31, 2018, 5:59 a.m. UTC | #42

Hello,

> On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > > MAP_SYNC
> > > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > > >   data is consistent on media before a write fault is completed.  A
> > > > > >   side-effect is that the page cache will not be used for
> > > > > >   writably-mapped pages.
> > > > >
> > > > > I think you are conflating current implementation with API
> > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > > use. The man page definition simply says "supported only for files
> > > > > supporting DAX" and that it provides certain data integrity
> > > > > guarantees. It does not define the implementation.
> ....
> > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > > filesystems) so usually people just won't notice. If fallback for
> > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> > >
> > > Which is just like the situation where O_DIRECT on ext3 was not very
> > > useful, but on other filesystems like XFS it was fully functional.
> > >
> > > IMO, the fact that a specific filesytem has a suboptimal fallback
> > > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > > until it's always direct access, then most filesystems wouldn't be
> > > able to provide any faster paths at all. It's much better to have
> > > partial functionality now than it is to never have the functionality
> > > at all, and so we need to design in the flexibility we need to
> > > iteratively improve implementations without needing API changes that
> > > will break applications.
> >
> > The hard guarantee requirement still remains though because an
> > application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> > be surprised if the MAP_DIRECT property silently disappears.
> 
> Why would they be surprised? They won't even notice it if the
> filesystem can provide MAP_SYNC without MAP_DIRECT.
> 
> And that's the whole point.
> 
> MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
> visible to the filesystem and the filesystem does nothing to enforce
> them. If someone does something that requires the page cache (e.g.
> calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
> heap of new work to do. And, in some cases, the filesystem may not
> be able to provide MAP_DIRECT as a result..
> 
> IOWs, the filesystem cannot guarantee MAP_DIRECT and the
> circumstances under which MAP_DIRECT will and will not work are
> dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
> applications randomly segfaulting in production as things like
> backups, indexers, etc run over the filesystem and do their work.
> 
> This is why MAP_DIRECT needs to be an optimisation, not a
> requirement - things will still work if MAP_DIRECT is not used. What
> matters to these applications is MAP_SYNC - if we break MAP_SYNC,
> then the application data integrity model is violated. That's not an
> acceptible outcome.
> 
> The problem, it seems to me, is that people are unable to separate
> MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
> MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
> provide MAP_SYNC functionality on DAX mappings. However, that's a
> /filesystem implementation issue/, not an API guarantee we need to
> provide to userspace.
> 
> If we implement a persistent page cache (e.g. allocate page cache
> pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
> could provide applications with the MAP_SYNC data integrity model
> without MAP_DIRECT. Indeed, those filesystems would not even be able
> to provide MAP_DIRECT semantics because they aren't backed by pmem.
> 
> Hence if applications that want MAP_SYNC are hard coded
> MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
> those applications are going to fail on a filesystem that provides
> only MAP_SYNC. This is despite the fact the applications would
> function correctly and the data integrity model would be maintained.
> i.e. the failure is because applications have assumed MAP_SYNC can
> only be provided by a DAX implementation, not because MAP_SYNC is
> not supported.
> 
> MAP_SYNC really isn't about DAX at all. It's about enabling a data
> integrity model that requires the filesystem to provide userspace
> access to CPU addressable persistent memory.  DAX+MAP_DIRECT is just
> one method of providing this functionality, but it's not the only
> method. Our API needs to be future proof rather than an encoding of
> the existing implementation limitations, otherwise apps will have to
> be re-written as every new MAP_SYNC capable technology comes along.
> 
> In summary:
> 
> 	MAP_DIRECT is an access hint.
> 
> 	MAP_SYNC provides a data integrity model guarantee.
> 
> 	MAP_SYNC may imply MAP_DIRECT for specific implementations,
> 	but it does not require or guarantee MAP_DIRECT.
> 
> Let's compare that with O_DIRECT:
> 
> 	O_DIRECT in an access hint.
> 
> 	O_DSYNC provides a data integrity model guarantee.
> 
> 	O_DSYNC may imply O_DIRECT for specific implementations, but
> 	it does not require or guarantee O_DIRECT.
> 
> Consistency in access and data integrity models is a good thing. DAX
> and pmem is not an exception. We need to use a model we know works
> and has proven itself over a long period of time.

Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT.
(I'm not opposed to your opinion, but I need to know it.)

In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is 
"wrong alignment is specified by application", right?

When filesystem can not use O_DIRECT and it uses page cache instead,
then system uses more memory resource than user's expectation.
So, there is a side effect, and it may cause other trouble.
(memory pressure, expected performance can not be gained, and so on ..)
 
In such case its administrator (or technical support engineer) needs to struggle to
investigate what is the reason.
If the reason of the breakage is clear, then it is helpful to find the root cause,
and they can require the developer of wrong application to fix the problem.
"Please fix the alignment!".

So, I would like to know in MAP_DIRECT case, what is the reasons? 
I think it will be helpful for users.
Only splice?

(Maybe such document will be necessary....)

Thanks,

> 
> > I think
> > it still makes some sense as a hint for apps that want to minimize
> > page cache, but for the applications with a flush from userspace model
> > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> > still gives the filesystem the option to inject page-cache at will,
> > but with an application coordination point.
> 
> Why make it more complex for applications than it needs to be?
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave Chinner Nov. 1, 2018, 11 p.m. UTC | #43

On Wed, Oct 31, 2018 at 05:59:17AM +0000, y-goto@fujitsu.com wrote:
> > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > In summary:
> > 
> > 	MAP_DIRECT is an access hint.
> > 
> > 	MAP_SYNC provides a data integrity model guarantee.
> > 
> > 	MAP_SYNC may imply MAP_DIRECT for specific implementations,
> > 	but it does not require or guarantee MAP_DIRECT.
> > 
> > Let's compare that with O_DIRECT:
> > 
> > 	O_DIRECT in an access hint.
> > 
> > 	O_DSYNC provides a data integrity model guarantee.
> > 
> > 	O_DSYNC may imply O_DIRECT for specific implementations, but
> > 	it does not require or guarantee O_DIRECT.
> > 
> > Consistency in access and data integrity models is a good thing. DAX
> > and pmem is not an exception. We need to use a model we know works
> > and has proven itself over a long period of time.
> 
> Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT.
> (I'm not opposed to your opinion, but I need to know it.)
> 
> In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is 
> "wrong alignment is specified by application", right?

O_DIRECT has defined memory and offset alignment restrictions, and
will return an error to userspace when they are violated. It does
not fall back to buffered IO in this case. MAP_DIRECT has no
equivalent restriction, so IO alignment of O_DIRECT is largely
irrelevant here.

What we are talking about here is that some filesystems can only do
certain operations through buffered IO, such as block allocation or
file extension, and so silently fall back to doing them via buffered
IO even when O_DIRECT is specified. The old direct IO code used to
be full of conditionals to allow this - I think DIO_SKIP_HOLES is
only one remaining:

                /*
                 * For writes that could fill holes inside i_size on a
                 * DIO_SKIP_HOLES filesystem we forbid block creations: only
                 * overwrites are permitted. We will return early to the caller
                 * once we see an unmapped buffer head returned, and the caller
                 * will fall back to buffered I/O.
                 *
                 * Otherwise the decision is left to the get_blocks method,
                 * which may decide to handle it or also return an unmapped
                 * buffer head.
                 */
                create = dio->op == REQ_OP_WRITE;
                if (dio->flags & DIO_SKIP_HOLES) {
                        if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
                                                        i_blkbits))
                                create = 0;
                }

Other cases like file extension cases are caught by the filesystems
before calling into the DIO code itself, so there's multiple avenues
for O_DIRECT transparently falling back to buffered IO.

This means the applications don't fail just because the filesystem
can't do a specific operation via O_DIRECT. The data writes still
succeed because they fall back to buffered IO, and the application
is blissfully unaware that the filesystem behaved that way.

> When filesystem can not use O_DIRECT and it uses page cache instead,
> then system uses more memory resource than user's expectation.

That's far better than failing unexpectedly because the app
unexpectedly came across a hole in the file (e.g. someone ran
sparsify across the filesystem).

> So, there is a side effect, and it may cause other trouble.
> (memory pressure, expected performance can not be gained, and so on ..)

Which is why people are supposed to test their systems before they
put them into production.

I've lost count of the number of times I've heard "but O_DIRECT is
supposed to make things faster!" because people don't understand
exactly what it does or means. Bypassing the page cache does not
magically make applications go faster - it puts the responsibility
for doing optimal IO on the application, not the kernel.

MAP_DIRECT will be no different. It's no guarantee that it will make
things faster, or that everything will just work as users expect
them to. It specifically places the responsibility for performing IO
in an optimal fashion on the application and the user for making
sure that it is fit for their purposes. Like O_DIRECT, using
MAP_DIRECT means "I, the application, know exactly what I'm doing,
so get out of the way as much as possible because I'm taking
responsibility for issuing IO in the most optimal manner now".

> In such case its administrator (or technical support engineer) needs to struggle to
> investigate what is the reason.

That's no different to performance problems that arise from
inappropriate use of O_DIRECT. It requires a certain level of
expertise to be able to understand and diagnose such issues.

> So, I would like to know in MAP_DIRECT case, what is the reasons? 
> I think it will be helpful for users.
> Only splice?

The filesystem can ignore MAP_DIRECT for any reason it needs to. I'm
certain that filesystem developers will try to maintain MAP_DIRECT
semantics as much as possible, but it's not going to be possible in
/all situations/ on XFS and ext4 because they simply haven't been
designed with DAX in mind. Filesystems designed specifically for
pmem and DAX might be able to provide MAP_DIRECT in all situations,
but those filesystems don't really exist yet.

This is no different to the early days of O_DIRECT. e.g.  ext3
couldn't do O_DIRECT for all operations when it was first
introduced, but over time the functionality improved as the
underlying issues were solved. If O_DIRECT was a guarantee, then
ext3 would have never supported O_DIRECT at all...

> (Maybe such document will be necessary....)

The semantics will need to be documented in the relevant man pages.

Cheers,

Dave.

Yasunori Gotou (Fujitsu) Nov. 2, 2018, 1:43 a.m. UTC | #44

> > > 	MAP_DIRECT is an access hint.
> > >
> > > 	MAP_SYNC provides a data integrity model guarantee.
> > >
> > > 	MAP_SYNC may imply MAP_DIRECT for specific implementations,
> > > 	but it does not require or guarantee MAP_DIRECT.
> > >
> > > Let's compare that with O_DIRECT:
> > >
> > > 	O_DIRECT in an access hint.
> > >
> > > 	O_DSYNC provides a data integrity model guarantee.
> > >
> > > 	O_DSYNC may imply O_DIRECT for specific implementations, but
> > > 	it does not require or guarantee O_DIRECT.
> > >
> > > Consistency in access and data integrity models is a good thing. DAX
> > > and pmem is not an exception. We need to use a model we know works
> > > and has proven itself over a long period of time.
> >
> > Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT.
> > (I'm not opposed to your opinion, but I need to know it.)
> >
> > In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is
> > "wrong alignment is specified by application", right?
> 
> O_DIRECT has defined memory and offset alignment restrictions, and
> will return an error to userspace when they are violated. It does
> not fall back to buffered IO in this case. MAP_DIRECT has no
> equivalent restriction, so IO alignment of O_DIRECT is largely
> irrelevant here.
> 
> What we are talking about here is that some filesystems can only do
> certain operations through buffered IO, such as block allocation or
> file extension, and so silently fall back to doing them via buffered
> IO even when O_DIRECT is specified. The old direct IO code used to
> be full of conditionals to allow this - I think DIO_SKIP_HOLES is
> only one remaining:
> 
>                 /*
>                  * For writes that could fill holes inside i_size on a
>                  * DIO_SKIP_HOLES filesystem we forbid block creations: only
>                  * overwrites are permitted. We will return early to the caller
>                  * once we see an unmapped buffer head returned, and the caller
>                  * will fall back to buffered I/O.
>                  *
>                  * Otherwise the decision is left to the get_blocks method,
>                  * which may decide to handle it or also return an unmapped
>                  * buffer head.
>                  */
>                 create = dio->op == REQ_OP_WRITE;
>                 if (dio->flags & DIO_SKIP_HOLES) {
>                         if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
>                                                         i_blkbits))
>                                 create = 0;
>                 }
> 
> Other cases like file extension cases are caught by the filesystems
> before calling into the DIO code itself, so there's multiple avenues
> for O_DIRECT transparently falling back to buffered IO.
> 
> This means the applications don't fail just because the filesystem
> can't do a specific operation via O_DIRECT. The data writes still
> succeed because they fall back to buffered IO, and the application
> is blissfully unaware that the filesystem behaved that way.
> 
> > When filesystem can not use O_DIRECT and it uses page cache instead,
> > then system uses more memory resource than user's expectation.
> 
> That's far better than failing unexpectedly because the app
> unexpectedly came across a hole in the file (e.g. someone ran
> sparsify across the filesystem).
> 
> > So, there is a side effect, and it may cause other trouble.
> > (memory pressure, expected performance can not be gained, and so on ..)
> 
> Which is why people are supposed to test their systems before they
> put them into production.
> 
> I've lost count of the number of times I've heard "but O_DIRECT is
> supposed to make things faster!" because people don't understand
> exactly what it does or means. Bypassing the page cache does not
> magically make applications go faster - it puts the responsibility
> for doing optimal IO on the application, not the kernel.
> 
> MAP_DIRECT will be no different. It's no guarantee that it will make
> things faster, or that everything will just work as users expect
> them to. It specifically places the responsibility for performing IO
> in an optimal fashion on the application and the user for making
> sure that it is fit for their purposes. Like O_DIRECT, using
> MAP_DIRECT means "I, the application, know exactly what I'm doing,
> so get out of the way as much as possible because I'm taking
> responsibility for issuing IO in the most optimal manner now".
> 
> > In such case its administrator (or technical support engineer) needs to struggle to
> > investigate what is the reason.
> 
> That's no different to performance problems that arise from
> inappropriate use of O_DIRECT. It requires a certain level of
> expertise to be able to understand and diagnose such issues.
> 
> > So, I would like to know in MAP_DIRECT case, what is the reasons?
> > I think it will be helpful for users.
> > Only splice?
> 
> The filesystem can ignore MAP_DIRECT for any reason it needs to. I'm
> certain that filesystem developers will try to maintain MAP_DIRECT
> semantics as much as possible, but it's not going to be possible in
> /all situations/ on XFS and ext4 because they simply haven't been
> designed with DAX in mind. Filesystems designed specifically for
> pmem and DAX might be able to provide MAP_DIRECT in all situations,
> but those filesystems don't really exist yet.
> 
> This is no different to the early days of O_DIRECT. e.g.  ext3
> couldn't do O_DIRECT for all operations when it was first
> introduced, but over time the functionality improved as the
> underlying issues were solved. If O_DIRECT was a guarantee, then
> ext3 would have never supported O_DIRECT at all...

Hmm, Ok. I see.
Thank you very much for your detail explanation.

> 
> > (Maybe such document will be necessary....)
> 
> The semantics will need to be documented in the relevant man pages.

I agree.

Thanks, again.
----
Yasunori Goto

Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

Commit Message

Comments

Patch