Message ID | 20181002100531.GC4135@quack2.suse.cz (mailing list archive)
---|---
State | New, archived
Series | Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
On Tue 02-10-18 12:05:31, Jan Kara wrote: > Hello, > > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the > mean time certain customer of ours started poking into /proc/<pid>/smaps > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA > flags, the application just fails to start complaining that DAX support is > missing in the kernel. The question now is how do we go about this? Do they need to check for a general DAX support or do they need a per mapping granularity? > Strictly speaking, this is a userspace visible regression (as much as I > think that application poking into VMA flags at this level is just too > bold). Is there any precedens in handling similar issues with smaps which > really exposes a lot of information that is dependent on kernel > implementation details? Yeah, exposing all the vma flags was just a terrible idea. We have had a similar issue recently [1] for other flag that is no longer set while the implementation of the feature is still in place. I guess we really want to document that those flags are for debugging only and no stable and long term API should rely on it. Considering how new the thing really is (does anybody do anything production like out there?) I would tend to try a better interface rather than chasing after random vma flags. E.g. what prevents a completely unrelated usage of VM_MIXEDMAP?
On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote: > Hello, > > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the > mean time certain customer of ours started poking into /proc/<pid>/smaps > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA > flags, the application just fails to start complaining that DAX support is > missing in the kernel. The question now is how do we go about this? OK naive question from me, how do we want an application to be able to check if it is running on a DAX mapping? AFAIU DAX is always associated with a file descriptor of some kind (be it a real file with filesystem dax or the /dev/dax device file for device dax). So could a new fcntl() be of any help here? IS_DAX() only checks for the S_DAX flag in inode::i_flags, so this should be doable for both fsdax and devdax. I haven't tried it yet but it should be fairly easy to come up with something like this. Byte, Johannes
On Tue 02-10-18 12:50:58, Michal Hocko wrote: > On Tue 02-10-18 12:05:31, Jan Kara wrote: > > Hello, > > > > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has > > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the > > mean time certain customer of ours started poking into /proc/<pid>/smaps > > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA > > flags, the application just fails to start complaining that DAX support is > > missing in the kernel. The question now is how do we go about this? > > Do they need to check for a general DAX support or do they need a per > mapping granularity? General DAX support per filesystem is OK for them, at least for now. So they might be OK with just checking for 'dax' mount option in /proc/mounts. But I agree this is cumbersome. > > Strictly speaking, this is a userspace visible regression (as much as I > > think that application poking into VMA flags at this level is just too > > bold). Is there any precedens in handling similar issues with smaps which > > really exposes a lot of information that is dependent on kernel > > implementation details? > > Yeah, exposing all the vma flags was just a terrible idea. We have had a > similar issue recently [1] for other flag that is no longer set while > the implementation of the feature is still in place. I guess we really > want to document that those flags are for debugging only and no stable > and long term API should rely on it. Yeah, I have some doubts about usefulness of such documentation but I guess it's better than nothing. > Considering how new the thing really is (does anybody do anything > production like out there?) I would tend to try a better interface > rather than chasing after random vma flags. E.g. what prevents a > completely unrelated usage of VM_MIXEDMAP? Nothing checking that flag is in production AFAIK but DAX as such is in active use for some limited usecases already. I'll reply regarding a better interface for checking DAX, in an email to Johannes. Honza
On Tue, Oct 02, 2018 at 02:10:39PM +0200, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> > Hello,
> >
> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > mean time certain customer of ours started poking into /proc/<pid>/smaps
> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > flags, the application just fails to start complaining that DAX support is
> > missing in the kernel. The question now is how do we go about this?
>
> OK naive question from me, how do we want an application to be able to
> check if it is running on a DAX mapping?
>
> AFAIU DAX is always associated with a file descriptor of some kind (be
> it a real file with filesystem dax or the /dev/dax device file for
> device dax). So could a new fcntl() be of any help here? IS_DAX() only
> checks for the S_DAX flag in inode::i_flags, so this should be doable
> for both fsdax and devdax.
>
> I haven't tried it yet but it should be fairly easy to come up with
> something like this.

OK, now I did, on a normal file on BTRFS (without DAX obviously) and on a
file on XFS with the -o dax mount option. Here's the RFC:

commit 3a8f0d23c421e8c91bc9d8bd3a956e1ffe3f754b
Author: Johannes Thumshirn <jthumshirn@suse.de>
Date:   Tue Oct 2 14:51:33 2018 +0200

    fcntl: provide F_GETDAX for applications to query DAX capabilities

    Provide a F_GETDAX fcntl(2) command so an application can query
    whether it can make use of DAX or not.

    Both file-system DAX as well as device DAX mark the DAX capability
    in struct inode::i_flags using the S_DAX flag, so we can query it
    using the IS_DAX() macro on a struct file's inode.

    If the file descriptor is either device DAX or on a DAX capable
    file-system '1' is returned back to user-space, if DAX isn't usable
    for some reason '0' is returned back.

    This patch can be tested with the following small C program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <libgen.h>

    #ifndef F_LINUX_SPECIFIC_BASE
    #define F_LINUX_SPECIFIC_BASE 1024
    #endif
    #define F_GETDAX (F_LINUX_SPECIFIC_BASE + 15)

    int main(int argc, char **argv)
    {
            int dax;
            int fd;
            int rc;

            if (argc != 2) {
                    printf("Usage: %s file\n", basename(argv[0]));
                    exit(EXIT_FAILURE);
            }

            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    exit(EXIT_FAILURE);
            }

            rc = fcntl(fd, F_GETDAX, &dax);
            if (rc < 0) {
                    perror("fcntl");
                    close(fd);
                    exit(EXIT_FAILURE);
            }

            if (dax) {
                    printf("fd %d is dax capable\n", fd);
                    exit(EXIT_FAILURE);
            } else {
                    printf("fd %d is not dax capable\n", fd);
                    exit(EXIT_SUCCESS);
            }
    }

    Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Dan Williams <dan.j.williams@intel.com>

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 4137d96534a6..0b53f968f569 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -32,6 +32,22 @@
 
 #define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
 
+static int fcntl_get_dax(struct file *filp, unsigned long arg)
+{
+	struct inode *inode = file_inode(filp);
+	u64 *argp = (u64 __user *)arg;
+	u64 dax;
+
+	if (IS_DAX(inode))
+		dax = 1;
+	else
+		dax = 0;
+
+	if (copy_to_user(argp, &dax, sizeof(*argp)))
+		return -EFAULT;
+	return 0;
+}
+
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
 	struct inode * inode = file_inode(filp);
@@ -426,6 +442,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_FILE_RW_HINT:
 		err = fcntl_rw_hint(filp, cmd, arg);
 		break;
+	case F_GETDAX:
+		err = fcntl_get_dax(filp, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..65a59c3cc46d 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -52,6 +52,7 @@
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
+#define F_GETDAX		(F_LINUX_SPECIFIC_BASE + 15)
 
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
[Added ext4, xfs, and linux-api folks to CC for the interface discussion] On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote: > On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote: > > Hello, > > > > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has > > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the > > mean time certain customer of ours started poking into /proc/<pid>/smaps > > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA > > flags, the application just fails to start complaining that DAX support is > > missing in the kernel. The question now is how do we go about this? > > OK naive question from me, how do we want an application to be able to > check if it is running on a DAX mapping? The question from me is: Should application really care? After all DAX is just a caching decision. Sure it affects performance characteristics and memory usage of the kernel but it is not a correctness issue (in particular we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be supported for current mapping). And in the future the details of what we do with DAX mapping can change - e.g. I could imagine we might decide to cache writes in DRAM but do direct PMEM access on reads. And all this could be auto-tuned based on media properties. And we don't want to tie our hands by specifying too narrowly how the kernel is going to behave. OTOH I understand that e.g. for a large database application the difference between DAX and non-DAX mapping can be a difference between performs fine and performs terribly / kills the machine so such application might want to determine / force caching policy to save sysadmin from debugging why the application is misbehaving. > AFAIU DAX is always associated with a file descriptor of some kind (be > it a real file with filesystem dax or the /dev/dax device file for > device dax). So could a new fcntl() be of any help here? IS_DAX() only > checks for the S_DAX flag in inode::i_flags, so this should be doable > for both fsdax and devdax. So fcntl() to query DAX usage is one option. Another option is the GETFLAGS ioctl with which you can query the state of S_DAX flag (works only for XFS currently). But that inode flag was meant more as a hint "use DAX if available" AFAIK so that's probably not really suitable for querying whether DAX is really in use or not. Since DAX is really about caching policy, I was also thinking that we could use madvise / fadvise for this. I.e., something like MADV_DIRECT_ACCESS which would return with success if DAX is in use, with error if not. Later, kernel could use it as a hint to really force DAX on a mapping and not try clever caching policies... Thoughts? Honza
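For illustration, a minimal user-space sketch of the madvise() query Jan suggests above might look like the following. MADV_DIRECT_ACCESS is only a proposal at this point in the thread, so the advice value below is purely hypothetical, and the success/failure convention is taken from the description above.

/*
 * Sketch only: MADV_DIRECT_ACCESS is a proposed advice value, not part of
 * any released kernel UAPI; the numeric value below is invented.
 */
#include <sys/mman.h>
#include <stdbool.h>

#ifndef MADV_DIRECT_ACCESS
#define MADV_DIRECT_ACCESS	32	/* hypothetical advice value */
#endif

/* True if the kernel reports the mapping as directly accessed (no page cache). */
static bool mapping_is_direct(void *addr, size_t len)
{
	/*
	 * Per the proposal: success when DAX is in use; failure otherwise
	 * (e.g. EINVAL on kernels that do not know the advice value, or a
	 * mapping that is not direct).
	 */
	return madvise(addr, len, MADV_DIRECT_ACCESS) == 0;
}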
On Tue, Oct 02, 2018 at 04:29:59PM +0200, Jan Kara wrote: > > OK naive question from me, how do we want an application to be able to > > check if it is running on a DAX mapping? > > The question from me is: Should application really care? No, it should not. DAX is an implementation detail thay may change or go away at any time.
On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> No, it should not. DAX is an implementation detail thay may change
> or go away at any time.

Well we had an issue with an application checking for dax, this is how
we landed here in the first place.

It's not that I want them to do it, it's more that they're actually
doing it in all kinds of interesting ways and then complaining when it
doesn't work anymore.

So it's less of an "API beauty prize problem" and more of a "provide a
documented way which we won't break" problem.

Byte,
	Johannes
On Tue, Oct 02, 2018 at 04:20:10PM +0200, Johannes Thumshirn wrote: > Provide a F_GETDAX fcntl(2) command so an application can query > whether it can make use of DAX or not. How does an application "make use of DAX"? What actual user visible semantics are associated with a file that has this flag set?
On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote: > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote: > > No, it should not. DAX is an implementation detail thay may change > > or go away at any time. > > Well we had an issue with an application checking for dax, this is how > we landed here in the first place. So what exacty is that "DAX" they are querying about (and no, I'm not joking, nor being philosophical).
On Tue, Oct 02, 2018 at 07:45:47AM -0700, Christoph Hellwig wrote: > How does an application "make use of DAX"? What actual user visible > semantics are associated with a file that has this flag set? There may not be any user visible semantics of DAX, but there are promises we gave to application developers praising DAX as _the_ method to map data on persistent memory and get around "the penalty of the page cache" (however big this is). As I said in another mail to this thread, applications have started to poke in procfs to see whether they can use DAX or not. Party A has promised party B something and they started checking for it, then commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" removed the way they checked if the kernel can keep up with this promise. So technically e1fb4a086495 is a user visible regression and in the past we have reverted patches introducing these, even if the patch is generally correct and poking in /proc/self/smaps is a bad idea. I just wanted to give them a documented way to check for this promise. Being neutral if this promise is right or wrong, good or bad, or whatever. That's not my call, but I prefer not having angry users, yelling at me because of broken applications. Byte, Johannes
On Tue, Oct 02, 2018 at 05:01:24PM +0200, Johannes Thumshirn wrote: > On Tue, Oct 02, 2018 at 07:45:47AM -0700, Christoph Hellwig wrote: > > How does an application "make use of DAX"? What actual user visible > > semantics are associated with a file that has this flag set? > > There may not be any user visible semantics of DAX, but there are > promises we gave to application developers praising DAX as _the_ > method to map data on persistent memory and get around "the penalty of > the page cache" (however big this is). Who is "we"? As someone involved with DAX code I think it is a steaming pile of *****, and we are still looking for cases where it actually works without bugs. That's why the experimental tag still is on it for example. > As I said in another mail to this thread, applications have started to > poke in procfs to see whether they can use DAX or not. And what are they actually doing with that? > > Party A has promised party B We have never promised anyone anything. > So technically e1fb4a086495 is a user visible regression and in the > past we have reverted patches introducing these, even if the patch is > generally correct and poking in /proc/self/smaps is a bad idea. What actually stops working here and why? If some stupid app doesn't work without mixedmap and we want to apply the don't break userspace mantra hard we should just always expose it. > I just wanted to give them a documented way to check for this > promise. Being neutral if this promise is right or wrong, good or bad, > or whatever. That's not my call, but I prefer not having angry users, > yelling at me because of broken applications. There is no promise, sorry.
On Tue 02-10-18 07:37:13, Christoph Hellwig wrote: > On Tue, Oct 02, 2018 at 04:29:59PM +0200, Jan Kara wrote: > > > OK naive question from me, how do we want an application to be able to > > > check if it is running on a DAX mapping? > > > > The question from me is: Should application really care? > > No, it should not. DAX is an implementation detail thay may change > or go away at any time. I agree that whether / how pagecache is used for filesystem access is an implementation detail of the kernel. OTOH for some workloads it is about whether kernel needs gigabytes of RAM to cache files or not, which is not a detail anymore if you want to fully utilize the machine. So people will be asking for this and will be finding odd ways to determine whether DAX is used or not (such as poking in smaps). And once there is some widely enough used application doing this, it is not "stupid application" problem anymore but the kernel's problem of not maintaining backward compatibility. So I think we would be better off providing *some* API which applications can use to determine whether pagecache is used or not and make sure this API will convey the right information even if we change DAX implementation or remove it altogether. Honza
On Tue 02-10-18 07:52:06, Christoph Hellwig wrote: > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote: > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote: > > > No, it should not. DAX is an implementation detail thay may change > > > or go away at any time. > > > > Well we had an issue with an application checking for dax, this is how > > we landed here in the first place. > > So what exacty is that "DAX" they are querying about (and no, I'm not > joking, nor being philosophical). I believe the application we are speaking about is mostly concerned about the memory overhead of the page cache. Think of a machine that has ~ 1TB of DRAM, the database running on it is about that size as well and they want database state stored somewhere persistently - which they may want to do by modifying mmaped database files if they do small updates... So they really want to be able to use close to all DRAM for the DB and not leave slack space for the kernel page cache to cache 1TB of database files. Honza
On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote: > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote: > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote: > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote: > > > > No, it should not. DAX is an implementation detail thay may change > > > > or go away at any time. > > > > > > Well we had an issue with an application checking for dax, this is how > > > we landed here in the first place. > > > > So what exacty is that "DAX" they are querying about (and no, I'm not > > joking, nor being philosophical). > > I believe the application we are speaking about is mostly concerned about > the memory overhead of the page cache. Think of a machine that has ~ 1TB of > DRAM, the database running on it is about that size as well and they want > database state stored somewhere persistently - which they may want to do by > modifying mmaped database files if they do small updates... So they really > want to be able to use close to all DRAM for the DB and not leave slack > space for the kernel page cache to cache 1TB of database files. VM_MIXEDMAP was never a reliable indication of DAX because it could be set for random other device-drivers that use vm_insert_mixed(). The MAP_SYNC flag positively indicates that page cache is disabled for a given mapping, although whether that property is due to "dax" or some other kernel mechanics is purely an internal detail. I'm not opposed to faking out VM_MIXEDMAP if this broken check has made it into production, but again, it's unreliable.
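As a concrete illustration of Dan's point, a user-space probe can simply attempt a MAP_SYNC mapping and treat success as "no page cache fronts this mapping". MAP_SHARED_VALIDATE and MAP_SYNC are existing mmap(2) flags (since Linux 4.15); the fallback defines below use the common UAPI values and are only needed with older libc headers. This is a sketch, not a guarantee of future semantics.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdbool.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE	0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC		0x080000
#endif

/*
 * Returns true if a MAP_SYNC mapping of fd succeeds, i.e. the kernel
 * currently gives this file a DAX-style, page-cache-less mapping
 * (see the caveats discussed later in the thread).
 */
static bool probe_map_sync(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

	if (p == MAP_FAILED)
		return false;	/* typically EOPNOTSUPP on non-DAX files */
	munmap(p, len);
	return true;
}

As Jan notes in the reply below, this only reflects the current implementation; MAP_SYNC itself promises data-integrity semantics, not the absence of a page cache.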
On Tue 02-10-18 13:18:54, Dan Williams wrote: > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote: > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote: > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote: > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote: > > > > > No, it should not. DAX is an implementation detail thay may change > > > > > or go away at any time. > > > > > > > > Well we had an issue with an application checking for dax, this is how > > > > we landed here in the first place. > > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not > > > joking, nor being philosophical). > > > > I believe the application we are speaking about is mostly concerned about > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of > > DRAM, the database running on it is about that size as well and they want > > database state stored somewhere persistently - which they may want to do by > > modifying mmaped database files if they do small updates... So they really > > want to be able to use close to all DRAM for the DB and not leave slack > > space for the kernel page cache to cache 1TB of database files. > > VM_MIXEDMAP was never a reliable indication of DAX because it could be > set for random other device-drivers that use vm_insert_mixed(). The > MAP_SYNC flag positively indicates that page cache is disabled for a > given mapping, although whether that property is due to "dax" or some > other kernel mechanics is purely an internal detail. > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has > made it into production, but again, it's unreliable. So luckily this particular application wasn't widely deployed yet so we will likely get away with the vendor asking customers to update to a version not looking into smaps and parsing /proc/mounts instead. But I don't find parsing /proc/mounts that beautiful either and I'd prefer if we had a better interface for applications to query whether they can avoid page cache for mmaps or not. Honza
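For completeness, the /proc/mounts workaround Jan mentions amounts to something like the sketch below (simplified: it matches the mount point exactly and only looks for a plain "dax" mount option, which is what existed at the time).

#include <stdio.h>
#include <mntent.h>
#include <string.h>
#include <stdbool.h>

/* Crude check: does the mount at 'mnt_dir' list "dax" among its options? */
static bool mount_has_dax_option(const char *mnt_dir)
{
	FILE *f = setmntent("/proc/mounts", "r");
	struct mntent *m;
	bool dax = false;

	if (!f)
		return false;
	while ((m = getmntent(f)) != NULL) {
		if (strcmp(m->mnt_dir, mnt_dir) == 0 && hasmntopt(m, "dax")) {
			dax = true;
			break;
		}
	}
	endmntent(f);
	return dax;
}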
On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote: > > On Tue 02-10-18 13:18:54, Dan Williams wrote: > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote: > > > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote: > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote: > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote: > > > > > > No, it should not. DAX is an implementation detail thay may change > > > > > > or go away at any time. > > > > > > > > > > Well we had an issue with an application checking for dax, this is how > > > > > we landed here in the first place. > > > > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not > > > > joking, nor being philosophical). > > > > > > I believe the application we are speaking about is mostly concerned about > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of > > > DRAM, the database running on it is about that size as well and they want > > > database state stored somewhere persistently - which they may want to do by > > > modifying mmaped database files if they do small updates... So they really > > > want to be able to use close to all DRAM for the DB and not leave slack > > > space for the kernel page cache to cache 1TB of database files. > > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be > > set for random other device-drivers that use vm_insert_mixed(). The > > MAP_SYNC flag positively indicates that page cache is disabled for a > > given mapping, although whether that property is due to "dax" or some > > other kernel mechanics is purely an internal detail. > > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has > > made it into production, but again, it's unreliable. > > So luckily this particular application wasn't widely deployed yet so we > will likely get away with the vendor asking customers to update to a > version not looking into smaps and parsing /proc/mounts instead. > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer > if we had a better interface for applications to query whether they can > avoid page cache for mmaps or not. Yeah, the mount flag is not a good indicator either. I think we need to follow through on the per-inode property of DAX. Darrick and I discussed just allowing the property to be inherited from the parent directory at file creation time. That avoids the dynamic set-up / teardown races that seem intractable at this point. What's wrong with MAP_SYNC as a page-cache detector in the meantime?
On Wed 03-10-18 07:38:50, Dan Williams wrote: > On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote: > > > > On Tue 02-10-18 13:18:54, Dan Williams wrote: > > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote: > > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote: > > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote: > > > > > > > No, it should not. DAX is an implementation detail thay may change > > > > > > > or go away at any time. > > > > > > > > > > > > Well we had an issue with an application checking for dax, this is how > > > > > > we landed here in the first place. > > > > > > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not > > > > > joking, nor being philosophical). > > > > > > > > I believe the application we are speaking about is mostly concerned about > > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of > > > > DRAM, the database running on it is about that size as well and they want > > > > database state stored somewhere persistently - which they may want to do by > > > > modifying mmaped database files if they do small updates... So they really > > > > want to be able to use close to all DRAM for the DB and not leave slack > > > > space for the kernel page cache to cache 1TB of database files. > > > > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be > > > set for random other device-drivers that use vm_insert_mixed(). The > > > MAP_SYNC flag positively indicates that page cache is disabled for a > > > given mapping, although whether that property is due to "dax" or some > > > other kernel mechanics is purely an internal detail. > > > > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has > > > made it into production, but again, it's unreliable. > > > > So luckily this particular application wasn't widely deployed yet so we > > will likely get away with the vendor asking customers to update to a > > version not looking into smaps and parsing /proc/mounts instead. > > > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer > > if we had a better interface for applications to query whether they can > > avoid page cache for mmaps or not. > > Yeah, the mount flag is not a good indicator either. I think we need > to follow through on the per-inode property of DAX. Darrick and I > discussed just allowing the property to be inherited from the parent > directory at file creation time. That avoids the dynamic set-up / > teardown races that seem intractable at this point. > > What's wrong with MAP_SYNC as a page-cache detector in the meantime? So IMHO checking for MAP_SYNC is about as reliable as checking for 'dax' mount option. It works now but nobody promises it will reliably detect DAX in future - e.g. there's nothing that prevents MAP_SYNC to work for mappings using pagecache if we find a sensible usecase for that. WRT per-inode DAX property, AFAIU that inode flag is just going to be advisory thing - i.e., use DAX if possible. If you mount a filesystem with these inode flags set in a configuration which does not allow DAX to be used, you will still be able to access such inodes but the access will use page cache instead. And querying these flags should better show real on-disk status and not just whether DAX is used as that would result in an even bigger mess. 
So this feature seems to be somewhat orthogonal to the API I'm looking for. Honza
On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote: > > On Wed 03-10-18 07:38:50, Dan Williams wrote: > > On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote: > > > > > > On Tue 02-10-18 13:18:54, Dan Williams wrote: > > > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote: > > > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote: > > > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote: > > > > > > > > No, it should not. DAX is an implementation detail thay may change > > > > > > > > or go away at any time. > > > > > > > > > > > > > > Well we had an issue with an application checking for dax, this is how > > > > > > > we landed here in the first place. > > > > > > > > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not > > > > > > joking, nor being philosophical). > > > > > > > > > > I believe the application we are speaking about is mostly concerned about > > > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of > > > > > DRAM, the database running on it is about that size as well and they want > > > > > database state stored somewhere persistently - which they may want to do by > > > > > modifying mmaped database files if they do small updates... So they really > > > > > want to be able to use close to all DRAM for the DB and not leave slack > > > > > space for the kernel page cache to cache 1TB of database files. > > > > > > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be > > > > set for random other device-drivers that use vm_insert_mixed(). The > > > > MAP_SYNC flag positively indicates that page cache is disabled for a > > > > given mapping, although whether that property is due to "dax" or some > > > > other kernel mechanics is purely an internal detail. > > > > > > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has > > > > made it into production, but again, it's unreliable. > > > > > > So luckily this particular application wasn't widely deployed yet so we > > > will likely get away with the vendor asking customers to update to a > > > version not looking into smaps and parsing /proc/mounts instead. > > > > > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer > > > if we had a better interface for applications to query whether they can > > > avoid page cache for mmaps or not. > > > > Yeah, the mount flag is not a good indicator either. I think we need > > to follow through on the per-inode property of DAX. Darrick and I > > discussed just allowing the property to be inherited from the parent > > directory at file creation time. That avoids the dynamic set-up / > > teardown races that seem intractable at this point. > > > > What's wrong with MAP_SYNC as a page-cache detector in the meantime? > > So IMHO checking for MAP_SYNC is about as reliable as checking for 'dax' > mount option. It works now but nobody promises it will reliably detect DAX in > future - e.g. there's nothing that prevents MAP_SYNC to work for mappings > using pagecache if we find a sensible usecase for that. Fair enough. > WRT per-inode DAX property, AFAIU that inode flag is just going to be > advisory thing - i.e., use DAX if possible. If you mount a filesystem with > these inode flags set in a configuration which does not allow DAX to be > used, you will still be able to access such inodes but the access will use > page cache instead. 
And querying these flags should better show real > on-disk status and not just whether DAX is used as that would result in an > even bigger mess. So this feature seems to be somewhat orthogonal to the > API I'm looking for. True, I imagine once we have that flag we will be able to distinguish the "saved" property and the "effective / live" property of DAX... Also it's really not DAX that applications care about as much as "is there page-cache indirection / overhead for this mapping?". That seems to be a narrower guarantee that we can make than what "DAX" might imply.
On Wed 03-10-18 08:13:37, Dan Williams wrote: > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote: > > WRT per-inode DAX property, AFAIU that inode flag is just going to be > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with > > these inode flags set in a configuration which does not allow DAX to be > > used, you will still be able to access such inodes but the access will use > > page cache instead. And querying these flags should better show real > > on-disk status and not just whether DAX is used as that would result in an > > even bigger mess. So this feature seems to be somewhat orthogonal to the > > API I'm looking for. > > True, I imagine once we have that flag we will be able to distinguish > the "saved" property and the "effective / live" property of DAX... > Also it's really not DAX that applications care about as much as "is > there page-cache indirection / overhead for this mapping?". That seems > to be a narrower guarantee that we can make than what "DAX" might > imply. Right. So what do people think about my suggestion earlier in the thread to use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success when DAX is in use, failure otherwise. Later we could extend it to be also used as a hint for caching policy for the inode... Honza
On Wed, Oct 3, 2018 at 9:46 AM Jan Kara <jack@suse.cz> wrote: > > On Wed 03-10-18 08:13:37, Dan Williams wrote: > > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote: > > > WRT per-inode DAX property, AFAIU that inode flag is just going to be > > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with > > > these inode flags set in a configuration which does not allow DAX to be > > > used, you will still be able to access such inodes but the access will use > > > page cache instead. And querying these flags should better show real > > > on-disk status and not just whether DAX is used as that would result in an > > > even bigger mess. So this feature seems to be somewhat orthogonal to the > > > API I'm looking for. > > > > True, I imagine once we have that flag we will be able to distinguish > > the "saved" property and the "effective / live" property of DAX... > > Also it's really not DAX that applications care about as much as "is > > there page-cache indirection / overhead for this mapping?". That seems > > to be a narrower guarantee that we can make than what "DAX" might > > imply. > > Right. So what do people think about my suggestion earlier in the thread to > use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success > when DAX is in use, failure otherwise. Later we could extend it to be also > used as a hint for caching policy for the inode... The only problem is that you can't use it purely as a query. If we ever did plumb it to be a hint you could not read the state without writing the state. mincore(2) seems to be close the intent of discovering whether RAM is being consumed for a given address range, but it currently is implemented to only indicate if *any* mapping is established, not whether RAM is consumed. I can see an argument that a dax mapped file should always report an empty mincore vector.
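For reference, the mincore(2) probe Dan describes would look roughly like this; note his caveat that current kernels report a page as resident once a mapping is established, so today this cannot distinguish a DAX mapping from page cache.

#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>

/*
 * Count pages of [addr, addr + len) that mincore() reports as resident.
 * addr must be page aligned. Returns -1 on error.
 */
static long resident_pages(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages = (len + page - 1) / page;
	unsigned char *vec = malloc(npages);
	long resident = -1;

	if (!vec)
		return -1;
	if (mincore(addr, len, vec) == 0) {
		resident = 0;
		for (size_t i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}
	free(vec);
	return resident;
}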
On Wed, Oct 03, 2018 at 06:44:07PM +0200, Jan Kara wrote:
> On Wed 03-10-18 08:13:37, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
> > > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > > these inode flags set in a configuration which does not allow DAX to be
> > > used, you will still be able to access such inodes but the access will use
> > > page cache instead. And querying these flags should better show real
> > > on-disk status and not just whether DAX is used as that would result in an
> > > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > > API I'm looking for.
> >
> > True, I imagine once we have that flag we will be able to distinguish
> > the "saved" property and the "effective / live" property of DAX...
> > Also it's really not DAX that applications care about as much as "is
> > there page-cache indirection / overhead for this mapping?". That seems
> > to be a narrower guarantee that we can make than what "DAX" might
> > imply.
>
> Right. So what do people think about my suggestion earlier in the thread to
> use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
> when DAX is in use, failure otherwise. Later we could extend it to be also
> used as a hint for caching policy for the inode...

Hmm, apart from Dan's objection that it can't really be used for a query,
isn't madvise(2) for mmap(2)? But AFAIU (from looking at the xfs code, so
please correct me if I'm wrong), DAX can be used for the traditional
read(2)/write(2) interface as well. There is at least:

xfs_file_read_iter()
`-> if (IS_DAX(inode))
    `-> xfs_file_dax_read()
        `-> dax_iomap_rw()

So IMHO something on an inode granularity would make more sense to me.

Byte,
	Johannes
On Tue, Oct 02, 2018 at 08:06:34AM -0700, Christoph Hellwig wrote:
> There is no promise, sorry.
Well there have been lot's of articles on for instance lwn.net [1] [2]
[3] describing how to avoid the "overhead" of the page cache when
running on persistent memory.
So if I would be a database developer, I'd look into them and see how
I could exploit this for my needs.
So even if we don't want to call it a promise, it was at least an
advertisement and people are now taking our word for it.
[1] https://lwn.net/Articles/610174/
[2] https://lwn.net/Articles/717953/
[3] https://lwn.net/Articles/684828/
Byte,
Johannes
On Thu, Oct 04, 2018 at 12:09:49PM +0200, Johannes Thumshirn wrote: > On Tue, Oct 02, 2018 at 08:06:34AM -0700, Christoph Hellwig wrote: > > There is no promise, sorry. > > Well there have been lot's of articles on for instance lwn.net [1] [2] > [3] describing how to avoid the "overhead" of the page cache when > running on persistent memory. Since when is an article on some website a promise (of what exactly) by linux kernel developers?
On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote: > Since when is an article on some website a promise (of what exactly) > by linux kernel developers? Let's stop it here, this doesn't make any sort of forward progress.
On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote: > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote: > > Since when is an article on some website a promise (of what exactly) > > by linux kernel developers? > > Let's stop it here, this doesn't make any sort of forward progress. > I do think there is some progress we can make if we separate DAX as an access mechanism vs DAX as a resource utilization contract. My attempt at representing Christoph's position is that the kernel should not be advertising / making access mechanism guarantees. That makes sense. Even with MAP_SYNC+DAX the kernel reserves the right to write-protect mappings at will and trap access into a kernel handler. Additionally, whether read(2) / write(2) does anything different behind the scenes in DAX mode, or not should be irrelevant to the application. That said what is certainly not irrelevant is a kernel giving userspace visibility and control into resource utilization. Jan's MADV_DIRECT_ACCESS let's the application make assumptions about page cache utilization, we just need to another mechanism to read if a mapping is effectively already in that state.
Jan Kara <jack@suse.cz> writes: > Hello, > > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the > mean time certain customer of ours started poking into /proc/<pid>/smaps > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA > flags, the application just fails to start complaining that DAX support is > missing in the kernel. The question now is how do we go about this? > > Strictly speaking, this is a userspace visible regression (as much as I > think that application poking into VMA flags at this level is just too > bold). Is there any precedens in handling similar issues with smaps which > really exposes a lot of information that is dependent on kernel > implementation details? > > I have attached a patch that is an obvious "fix" for the issue - just fake > VM_MIXEDMAP flag in smaps. But I'm open to other suggestions... Hi, Jan, I'm intrigued by the use case. Do I understand you correctly that the database in question does not intend to make data persistent from userspace? In other words, fsync/msync system calls are being issued by the database? I guess what I'm really after is a statement of requirements or expectations. It would be great if you could convince the database developer to engage in this discussion directly. Cheers, Jeff
On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote: > > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote: > > > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote: > > > Since when is an article on some website a promise (of what exactly) > > > by linux kernel developers? > > > > Let's stop it here, this doesn't make any sort of forward progress. > > > > I do think there is some progress we can make if we separate DAX as an > access mechanism vs DAX as a resource utilization contract. My attempt > at representing Christoph's position is that the kernel should not be > advertising / making access mechanism guarantees. That makes sense. > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect > mappings at will and trap access into a kernel handler. Additionally, > whether read(2) / write(2) does anything different behind the scenes > in DAX mode, or not should be irrelevant to the application. > > That said what is certainly not irrelevant is a kernel giving > userspace visibility and control into resource utilization. Jan's > MADV_DIRECT_ACCESS let's the application make assumptions about page > cache utilization, we just need to another mechanism to read if a > mapping is effectively already in that state. I thought more about this today while reviewing the virtio-pmem driver that will behave mostly like a DAX-capable pmem device except it will be implemented by passing host page cache through to the guest as a pmem device with a paravirtualized / asynchronous flush interface. MAP_SYNC obviously needs to be disabled for this case, but still need allow to some semblance of DAX operation to save allocating page cache in the guest. The need to explicitly clarify the state of DAX is growing with the different nuances of DAX operation. Lets use a new MAP_DIRECT flag to positively assert that a given mmap() call is setting up a memory mapping without page-cache or buffered indirection. To be clear not my original MAP_DIRECT proposal from a while back, instead just a flag to mmap() that causes the mapping attempt to fail if there is any software buffering fronting the memory mapping, or any requirement for software to manage flushing outside of pushing writes through the cpu cache. This way, if we ever extend MAP_SYNC for a buffered use case we can still definitely assert that the mapping is "direct". So, MAP_DIRECT would fail for traditional non-DAX block devices, and for this new virtio-pmem case. It would also fail for any pmem device where we cannot assert that the platform will take care of flushing write-pending-queues on power-loss events.
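A sketch of how the MAP_DIRECT flag proposed above might look from user space. MAP_DIRECT does not exist in the kernel; the flag value here is invented purely for illustration, and the failure cases in the comment are the ones Dan lists (non-DAX block devices, virtio-pmem, platforms that cannot guarantee flushing of write-pending queues on power loss).

#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE	0x03
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT	0x400000	/* hypothetical flag value */
#endif

/*
 * Ask for a mapping with no software buffering in front of it. Under the
 * proposal, the mmap() itself fails if the kernel cannot guarantee that,
 * rather than silently falling back to the page cache.
 */
static void *map_direct(int fd, size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);
}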
Hi Jeff, On Tue 09-10-18 15:43:41, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has > > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the > > mean time certain customer of ours started poking into /proc/<pid>/smaps > > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA > > flags, the application just fails to start complaining that DAX support is > > missing in the kernel. The question now is how do we go about this? > > > > Strictly speaking, this is a userspace visible regression (as much as I > > think that application poking into VMA flags at this level is just too > > bold). Is there any precedens in handling similar issues with smaps which > > really exposes a lot of information that is dependent on kernel > > implementation details? > > > > I have attached a patch that is an obvious "fix" for the issue - just fake > > VM_MIXEDMAP flag in smaps. But I'm open to other suggestions... > > I'm intrigued by the use case. Do I understand you correctly that the > database in question does not intend to make data persistent from > userspace? In other words, fsync/msync system calls are being issued by > the database? Yes, at least at the initial stage, they use fsync / msync to persist data. > I guess what I'm really after is a statement of requirements or > expectations. It would be great if you could convince the database > developer to engage in this discussion directly. So I talked to them and what they really look after is the control over the amount of memory needed by the kernel. And they are right that if your storage needs page cache, the amount of memory you need to set aside for the kernel is larger. Honza
Jan Kara <jack@suse.cz> writes: > Hi Jeff, > > On Tue 09-10-18 15:43:41, Jeff Moyer wrote: >> I'm intrigued by the use case. Do I understand you correctly that the >> database in question does not intend to make data persistent from >> userspace? In other words, fsync/msync system calls are being issued by >> the database? > > Yes, at least at the initial stage, they use fsync / msync to persist data. OK. >> I guess what I'm really after is a statement of requirements or >> expectations. It would be great if you could convince the database >> developer to engage in this discussion directly. > > So I talked to them and what they really look after is the control over the > amount of memory needed by the kernel. And they are right that if your > storage needs page cache, the amount of memory you need to set aside for the > kernel is larger. OK, thanks a lot for following up, Jan! -Jeff
On Sun, Oct 14, 2018 at 8:47 AM Dan Williams <dan.j.williams@intel.com> wrote: > > On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote: > > > > > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote: > > > > Since when is an article on some website a promise (of what exactly) > > > > by linux kernel developers? > > > > > > Let's stop it here, this doesn't make any sort of forward progress. > > > > > > > I do think there is some progress we can make if we separate DAX as an > > access mechanism vs DAX as a resource utilization contract. My attempt > > at representing Christoph's position is that the kernel should not be > > advertising / making access mechanism guarantees. That makes sense. > > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect > > mappings at will and trap access into a kernel handler. Additionally, > > whether read(2) / write(2) does anything different behind the scenes > > in DAX mode, or not should be irrelevant to the application. > > > > That said what is certainly not irrelevant is a kernel giving > > userspace visibility and control into resource utilization. Jan's > > MADV_DIRECT_ACCESS let's the application make assumptions about page > > cache utilization, we just need to another mechanism to read if a > > mapping is effectively already in that state. > > I thought more about this today while reviewing the virtio-pmem driver > that will behave mostly like a DAX-capable pmem device except it will > be implemented by passing host page cache through to the guest as a > pmem device with a paravirtualized / asynchronous flush interface. > MAP_SYNC obviously needs to be disabled for this case, but still need > allow to some semblance of DAX operation to save allocating page cache > in the guest. The need to explicitly clarify the state of DAX is > growing with the different nuances of DAX operation. > > Lets use a new MAP_DIRECT flag to positively assert that a given > mmap() call is setting up a memory mapping without page-cache or > buffered indirection. To be clear not my original MAP_DIRECT proposal > from a while back, instead just a flag to mmap() that causes the > mapping attempt to fail if there is any software buffering fronting > the memory mapping, or any requirement for software to manage flushing > outside of pushing writes through the cpu cache. This way, if we ever > extend MAP_SYNC for a buffered use case we can still definitely assert > that the mapping is "direct". So, MAP_DIRECT would fail for > traditional non-DAX block devices, and for this new virtio-pmem case. > It would also fail for any pmem device where we cannot assert that the > platform will take care of flushing write-pending-queues on power-loss > events. After letting this set for a few days I think I'm back to liking MADV_DIRECT_ACCESS more since madvise() is more closely related to the page-cache management than mmap. It does not solve the query vs enable problem, but it's still a step towards giving applications what they want with respect to resource expectations. Perhaps a new syscall to retrieve the effective advice for a range? int madvice(void *addr, size_t length, int *advice);
Jan Kara <jack@suse.cz> writes:

> [Added ext4, xfs, and linux-api folks to CC for the interface discussion]
>
> On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
>> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
>> > Hello,
>> >
>> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
>> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
>> > mean time certain customer of ours started poking into /proc/<pid>/smaps
>> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
>> > flags, the application just fails to start complaining that DAX support is
>> > missing in the kernel. The question now is how do we go about this?
>>
>> OK naive question from me, how do we want an application to be able to
>> check if it is running on a DAX mapping?
>
> The question from me is: Should application really care? After all DAX is
> just a caching decision. Sure it affects performance characteristics and
> memory usage of the kernel but it is not a correctness issue (in particular
> we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
> supported for current mapping). And in the future the details of what we do
> with DAX mapping can change - e.g. I could imagine we might decide to cache
> writes in DRAM but do direct PMEM access on reads. And all this could be
> auto-tuned based on media properties. And we don't want to tie our hands by
> specifying too narrowly how the kernel is going to behave.

For read and write, I would expect the O_DIRECT open flag to still work,
even for dax-capable persistent memory. Is that a contentious opinion?

So, what we're really discussing is the behavior for mmap. MAP_SYNC will
certainly ensure that the page cache is not used for writes. It would also
be odd for us to decide to cache reads. The only issue I can see is that
perhaps the application doesn't want to take a performance hit on write
faults. I haven't heard that concern expressed in this thread, though.

Just to be clear, this is my understanding of the world:

MAP_SYNC
- file system guarantees that metadata required to reach faulted-in file
  data is consistent on media before a write fault is completed. A
  side-effect is that the page cache will not be used for
  writably-mapped pages.

and what I think Dan had proposed:

mmap flag, MAP_DIRECT
- file system guarantees that page cache will not be used to front storage.
  storage MUST be directly addressable. This *almost* implies MAP_SYNC.
  The subtle difference is that a write fault /may/ not result in metadata
  being written back to media.

and this is what I think you were proposing, Jan:

madvise flag, MADV_DIRECT_ACCESS
- same semantics as MAP_DIRECT, but specified via the madvise system call

Cheers,
Jeff
On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > Jan Kara <jack@suse.cz> writes: > > > [Added ext4, xfs, and linux-api folks to CC for the interface discussion] > > > > On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote: > >> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote: > >> > Hello, > >> > > >> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has > >> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the > >> > mean time certain customer of ours started poking into /proc/<pid>/smaps > >> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA > >> > flags, the application just fails to start complaining that DAX support is > >> > missing in the kernel. The question now is how do we go about this? > >> > >> OK naive question from me, how do we want an application to be able to > >> check if it is running on a DAX mapping? > > > > The question from me is: Should application really care? After all DAX is > > just a caching decision. Sure it affects performance characteristics and > > memory usage of the kernel but it is not a correctness issue (in particular > > we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be > > supported for current mapping). And in the future the details of what we do > > with DAX mapping can change - e.g. I could imagine we might decide to cache > > writes in DRAM but do direct PMEM access on reads. And all this could be > > auto-tuned based on media properties. And we don't want to tie our hands by > > specifying too narrowly how the kernel is going to behave. > > For read and write, I would expect the O_DIRECT open flag to still work, > even for dax-capable persistent memory. Is that a contentious opinion? Not contentious at all, because that's the way it currently works. FYI, XFS decides what to do with read (and similarly writes) like this: if (IS_DAX(inode)) ret = xfs_file_dax_read(iocb, to); else if (iocb->ki_flags & IOCB_DIRECT) ret = xfs_file_dio_aio_read(iocb, to); else ret = xfs_file_buffered_aio_read(iocb, to); Neither DAX or O_DIRECT on pmem use the page cache - the only difference between the DAX read/write path and the O_DIRECT read/write path is where the memcpy() into the user buffer is done. For DAX it's done in the fsdax layer, for O_DIRECT it's done in the pmem block driver. > So, what we're really discussing is the behavior for mmap. Yes. > MAP_SYNC > will certainly ensure that the page cache is not used for writes. It > would also be odd for us to decide to cache reads. The only issue I can > see is that perhaps the application doesn't want to take a performance > hit on write faults. I haven't heard that concern expressed in this > thread, though. > > Just to be clear, this is my understanding of the world: > > MAP_SYNC > - file system guarantees that metadata required to reach faulted-in file > data is consistent on media before a write fault is completed. A > side-effect is that the page cache will not be used for > writably-mapped pages. I think you are conflating current implementation with API requirements - MAP_SYNC doesn't guarantee anything about page cache use. The man page definition simply says "supported only for files supporting DAX" and that it provides certain data integrity guarantees. It does not define the implementation. We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour, because that's the only way we can currently provide the required behaviour to userspace. 
However, if a filesystem can use the page cache to provide the required functionality, then it's free to do so. i.e. if someone implements a pmem-based page cache, MAP_SYNC data integrity could be provided /without DAX/ by any filesystem using that persistent page cache. i.e. MAP_SYNC really only requires mmap() of CPU addressable persistent memory - it does not require DAX. Right now, however, the only way to get this functionality is through a DAX capable filesystem on dax capable storage. And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it COWs new pages in pmem and attaches them a special per-inode cache on clean->dirty transition. Then on data sync, background writeback or crash recovery, it migrates them from the cache into the file map proper via atomic metadata pointer swaps. IOWs, NOVA provides the correct MAP_SYNC semantics by using a separate persistent per-inode write cache to provide the correct crash recovery semantics for MAP_SYNC. > and what I think Dan had proposed: > > mmap flag, MAP_DIRECT > - file system guarantees that page cache will not be used to front storage. > storage MUST be directly addressable. This *almost* implies MAP_SYNC. > The subtle difference is that a write fault /may/ not result in metadata > being written back to media. SIimilar to O_DIRECT, these semantics do not allow userspace apps to replace msync/fsync with CPU cache flush operations. So any application that uses this mode still needs to use either MAP_SYNC or issue msync/fsync for data integrity. If the app is using MAP_DIRECT, the what do we do if the filesystem can't provide the required semantics for that specific operation? In the case of O_DIRECT, we fall back to buffered IO because it has the same data integrity semantics as O_DIRECT and will always work. It's just slower and consumes more memory, but the app continues to work just fine. Sending SIGBUS to apps when we can't perform MAP_DIRECT operations without using the pagecache seems extremely problematic to me. e.g. an app already has an open MAP_DIRECT file, and a third party reflinks it or dedupes it and the fs has to fall back to buffered IO to do COW operations. This isn't the app's fault - the kernel should just fall back transparently to using the page cache for the MAP_DIRECT app and just keep working, just like it would if it was using O_DIRECT read/write. The point I'm trying to make here is that O_DIRECT is a /hint/, not a guarantee, and it's done that way to prevent applications from being presented with transient, potentially fatal error cases because a filesystem implementation can't do a specific operation through the direct IO path. IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a guarantee. Over time we'll end up with filesystems that can guarantee that MAP_DIRECT is always going to use DAX, in the same way we have filesystems that guarantee O_DIRECT will always be O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee no page cache will ever be used, then we are basically saying "filesystems won't provide MAP_DIRECT even in common, useful cases because they can't provide MAP_DIRECT in all cases." And that doesn't seem like a very good solution to me. > and this is what I think you were proposing, Jan: > > madvise flag, MADV_DIRECT_ACCESS > - same semantics as MAP_DIRECT, but specified via the madvise system call Seems to be the equivalent of fcntl(F_SETFL, O_DIRECT). 
Makes sense to have both MAP_DIRECT and MADV_DIRECT_ACCESS to me - one is an init time flag, the other is a run time flag. Cheers, Dave.
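For concreteness, here is a minimal userspace sketch (not code from this thread) of the MAP_SYNC model described above: the application asks for the guarantee at mmap() time and, if the filesystem cannot provide it, the call fails (EOPNOTSUPP) and the application falls back to a plain shared mapping plus msync(). The helper name and the fallback policy are illustrative only; the flag values match the kernel ABI and are defined locally only for older libc headers.

/* Minimal sketch, assuming a Linux system with MAP_SYNC support. */
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

static void *map_pmem_file(int fd, size_t len, bool *have_map_sync)
{
	/* Ask for O_DSYNC-like fault behaviour: metadata needed to reach
	 * the faulted pages is durable before the write fault completes. */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (addr != MAP_FAILED) {
		*have_map_sync = true;	/* CPU cache flushes suffice for persistence */
		return addr;
	}

	/* No MAP_SYNC here: plain mapping, msync()-based data integrity. */
	*have_map_sync = false;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}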
On Thu 18-10-18 11:25:10, Dave Chinner wrote: > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > > MAP_SYNC > > - file system guarantees that metadata required to reach faulted-in file > > data is consistent on media before a write fault is completed. A > > side-effect is that the page cache will not be used for > > writably-mapped pages. > > I think you are conflating current implementation with API > requirements - MAP_SYNC doesn't guarantee anything about page cache > use. The man page definition simply says "supported only for files > supporting DAX" and that it provides certain data integrity > guarantees. It does not define the implementation. > > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour, > because that's the only way we can currently provide the required > behaviour to userspace. However, if a filesystem can use the page > cache to provide the required functionality, then it's free to do > so. > > i.e. if someone implements a pmem-based page cache, MAP_SYNC data > integrity could be provided /without DAX/ by any filesystem using > that persistent page cache. i.e. MAP_SYNC really only requires > mmap() of CPU addressable persistent memory - it does not require > DAX. Right now, however, the only way to get this functionality is > through a DAX capable filesystem on dax capable storage. > > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it > COWs new pages in pmem and attaches them a special per-inode cache > on clean->dirty transition. Then on data sync, background writeback > or crash recovery, it migrates them from the cache into the file map > proper via atomic metadata pointer swaps. > > IOWs, NOVA provides the correct MAP_SYNC semantics by using a > separate persistent per-inode write cache to provide the correct > crash recovery semantics for MAP_SYNC. Corect. NOVA would be able to provide MAP_SYNC semantics without DAX. But effectively it will be also able to provide MAP_DIRECT semantics, right? Because there won't be DRAM between app and persistent storage and I don't think COW tricks or other data integrity methods are that interesting for the application. Most users of O_DIRECT are concerned about getting close to media speed performance and low DRAM usage... > > and what I think Dan had proposed: > > > > mmap flag, MAP_DIRECT > > - file system guarantees that page cache will not be used to front storage. > > storage MUST be directly addressable. This *almost* implies MAP_SYNC. > > The subtle difference is that a write fault /may/ not result in metadata > > being written back to media. > > SIimilar to O_DIRECT, these semantics do not allow userspace apps to > replace msync/fsync with CPU cache flush operations. So any > application that uses this mode still needs to use either MAP_SYNC > or issue msync/fsync for data integrity. > > If the app is using MAP_DIRECT, the what do we do if the filesystem > can't provide the required semantics for that specific operation? In > the case of O_DIRECT, we fall back to buffered IO because it has the > same data integrity semantics as O_DIRECT and will always work. It's > just slower and consumes more memory, but the app continues to work > just fine. > > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations > without using the pagecache seems extremely problematic to me. e.g. > an app already has an open MAP_DIRECT file, and a third party > reflinks it or dedupes it and the fs has to fall back to buffered IO > to do COW operations. 
> This isn't the app's fault - the kernel should > just fall back transparently to using the page cache for the > MAP_DIRECT app and just keep working, just like it would if it was > using O_DIRECT read/write. There's another option of failing reflink / dedupe with EBUSY if the file is mapped with MAP_DIRECT and the filesystem cannot support reflink & MAP_DIRECT together. But there are downsides to that as well. > The point I'm trying to make here is that O_DIRECT is a /hint/, not > a guarantee, and it's done that way to prevent applications from > being presented with transient, potentially fatal error cases > because a filesystem implementation can't do a specific operation > through the direct IO path. > > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a > guarantee. Over time we'll end up with filesystems that can > guarantee that MAP_DIRECT is always going to use DAX, in the same > way we have filesystems that guarantee O_DIRECT will always be > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee > no page cache will ever be used, then we are basically saying > "filesystems won't provide MAP_DIRECT even in common, useful cases > because they can't provide MAP_DIRECT in all cases." And that > doesn't seem like a very good solution to me. These are good points. I'm just somewhat wary of the situation where users will map files with MAP_DIRECT and then the machine starts thrashing because the file got reflinked and thus pagecache gets used suddenly. With O_DIRECT the fallback to buffered IO is quite rare (at least for major filesystems) so usually people just won't notice. If fallback for MAP_DIRECT is easy to hit, I'm not sure it would be very useful. Honza
On Wed 17-10-18 13:01:15, Dan Williams wrote: > On Sun, Oct 14, 2018 at 8:47 AM Dan Williams <dan.j.williams@intel.com> wrote: > > > > On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > > > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote: > > > > > > > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote: > > > > > Since when is an article on some website a promise (of what exactly) > > > > > by linux kernel developers? > > > > > > > > Let's stop it here, this doesn't make any sort of forward progress. > > > > > > > > > > I do think there is some progress we can make if we separate DAX as an > > > access mechanism vs DAX as a resource utilization contract. My attempt > > > at representing Christoph's position is that the kernel should not be > > > advertising / making access mechanism guarantees. That makes sense. > > > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect > > > mappings at will and trap access into a kernel handler. Additionally, > > > whether read(2) / write(2) does anything different behind the scenes > > > in DAX mode, or not should be irrelevant to the application. > > > > > > That said what is certainly not irrelevant is a kernel giving > > > userspace visibility and control into resource utilization. Jan's > > > MADV_DIRECT_ACCESS let's the application make assumptions about page > > > cache utilization, we just need to another mechanism to read if a > > > mapping is effectively already in that state. > > > > I thought more about this today while reviewing the virtio-pmem driver > > that will behave mostly like a DAX-capable pmem device except it will > > be implemented by passing host page cache through to the guest as a > > pmem device with a paravirtualized / asynchronous flush interface. > > MAP_SYNC obviously needs to be disabled for this case, but still need > > allow to some semblance of DAX operation to save allocating page cache > > in the guest. The need to explicitly clarify the state of DAX is > > growing with the different nuances of DAX operation. > > > > Lets use a new MAP_DIRECT flag to positively assert that a given > > mmap() call is setting up a memory mapping without page-cache or > > buffered indirection. To be clear not my original MAP_DIRECT proposal > > from a while back, instead just a flag to mmap() that causes the > > mapping attempt to fail if there is any software buffering fronting > > the memory mapping, or any requirement for software to manage flushing > > outside of pushing writes through the cpu cache. This way, if we ever > > extend MAP_SYNC for a buffered use case we can still definitely assert > > that the mapping is "direct". So, MAP_DIRECT would fail for > > traditional non-DAX block devices, and for this new virtio-pmem case. > > It would also fail for any pmem device where we cannot assert that the > > platform will take care of flushing write-pending-queues on power-loss > > events. > > After letting this set for a few days I think I'm back to liking > MADV_DIRECT_ACCESS more since madvise() is more closely related to the > page-cache management than mmap. It does not solve the query vs enable > problem, but it's still a step towards giving applications what they > want with respect to resource expectations. Yeah, I don't have a strong opinion wrt mmap flag vs madvise flag. > Perhaps a new syscall to retrieve the effective advice for a range? 
> > int madvice(void *addr, size_t length, int *advice); After some thought, I'm not 100% sure this is really needed. I know about apps that want to make sure DRAM is not consumed - for those an mmap / madvise flag is fine if it returns an error in case the feature cannot be provided. Most other apps don't care whether DAX is on or off. So this call would be needed only if someone wanted to behave differently depending on whether DAX is used or not. And although I can imagine some application like that, I'm not sure how real that is... Honza
On Thu, Oct 18, 2018 at 10:43 AM Jan Kara <jack@suse.cz> wrote: > > On Wed 17-10-18 13:01:15, Dan Williams wrote: > > On Sun, Oct 14, 2018 at 8:47 AM Dan Williams <dan.j.williams@intel.com> wrote: > > > > > > On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > > > > > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote: > > > > > > > > > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote: > > > > > > Since when is an article on some website a promise (of what exactly) > > > > > > by linux kernel developers? > > > > > > > > > > Let's stop it here, this doesn't make any sort of forward progress. > > > > > > > > > > > > > I do think there is some progress we can make if we separate DAX as an > > > > access mechanism vs DAX as a resource utilization contract. My attempt > > > > at representing Christoph's position is that the kernel should not be > > > > advertising / making access mechanism guarantees. That makes sense. > > > > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect > > > > mappings at will and trap access into a kernel handler. Additionally, > > > > whether read(2) / write(2) does anything different behind the scenes > > > > in DAX mode, or not should be irrelevant to the application. > > > > > > > > That said what is certainly not irrelevant is a kernel giving > > > > userspace visibility and control into resource utilization. Jan's > > > > MADV_DIRECT_ACCESS let's the application make assumptions about page > > > > cache utilization, we just need to another mechanism to read if a > > > > mapping is effectively already in that state. > > > > > > I thought more about this today while reviewing the virtio-pmem driver > > > that will behave mostly like a DAX-capable pmem device except it will > > > be implemented by passing host page cache through to the guest as a > > > pmem device with a paravirtualized / asynchronous flush interface. > > > MAP_SYNC obviously needs to be disabled for this case, but still need > > > allow to some semblance of DAX operation to save allocating page cache > > > in the guest. The need to explicitly clarify the state of DAX is > > > growing with the different nuances of DAX operation. > > > > > > Lets use a new MAP_DIRECT flag to positively assert that a given > > > mmap() call is setting up a memory mapping without page-cache or > > > buffered indirection. To be clear not my original MAP_DIRECT proposal > > > from a while back, instead just a flag to mmap() that causes the > > > mapping attempt to fail if there is any software buffering fronting > > > the memory mapping, or any requirement for software to manage flushing > > > outside of pushing writes through the cpu cache. This way, if we ever > > > extend MAP_SYNC for a buffered use case we can still definitely assert > > > that the mapping is "direct". So, MAP_DIRECT would fail for > > > traditional non-DAX block devices, and for this new virtio-pmem case. > > > It would also fail for any pmem device where we cannot assert that the > > > platform will take care of flushing write-pending-queues on power-loss > > > events. > > > > After letting this set for a few days I think I'm back to liking > > MADV_DIRECT_ACCESS more since madvise() is more closely related to the > > page-cache management than mmap. It does not solve the query vs enable > > problem, but it's still a step towards giving applications what they > > want with respect to resource expectations. 
> > Yeah, I don't have a strong opinion wrt mmap flag vs madvise flag. MADV_DIRECT_ACCESS seems more flexible as the agent setting up the mapping does not need to be the one concerned with the DAX-state of the mapping. It's also the canonical interface for affecting page cache behavior. > > Perhaps a new syscall to retrieve the effective advice for a range? > > > > int madvice(void *addr, size_t length, int *advice); > > After some thought, I'm not 100% sure this is really needed. I know about > apps that want to make sure DRAM is not consumed - for those mmap / madvise > flag is fine if it returns error in case the feature cannot be provided. > Most other apps don't care whether DAX is on or off. So this call would be > needed only if someone wanted to behave differently depending on whether > DAX is used or not. And although I can imagine some application like that, > I'm not sure how real that is... True, yes, if an application wants the behavior just ask. The only caveat to address all the use cases for applications making decisions based on the presence of DAX is to make MADV_DIRECT_ACCESS fail if the mapping was not established with MAP_SYNC. That way we have both a way to assert that page cache resources are not being consumed, and that the kernel is handling metadata synchronization for any write-faults.
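As a sketch of what that combination could look like from userspace: establish the mapping with MAP_SYNC, then request the direct-access advice and treat failure as "the kernel may use page cache / DRAM for this mapping". MADV_DIRECT_ACCESS is only a proposal in this thread; the constant and its value below are placeholders, not kernel ABI, and the helper name is illustrative.

/* Sketch of the proposed madvise() interface discussed above. */
#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_DIRECT_ACCESS
#define MADV_DIRECT_ACCESS 100	/* hypothetical advice number, not kernel ABI */
#endif

/* Returns 0 if the kernel asserts that no page cache / DRAM fronts the
 * mapping (per the refinement above, only for MAP_SYNC-established
 * mappings), -errno otherwise. */
static int assert_direct_access(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_DIRECT_ACCESS) == 0)
		return 0;	/* no buffered indirection for this range */

	return -errno;	/* e.g. not MAP_SYNC, or backed by page cache */
}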
Dave, Thanks for the detailed response! I hadn't considered the NOVA use case at all. Cheers, Jeff
On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote: > On Thu 18-10-18 11:25:10, Dave Chinner wrote: > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > > > MAP_SYNC > > > - file system guarantees that metadata required to reach faulted-in file > > > data is consistent on media before a write fault is completed. A > > > side-effect is that the page cache will not be used for > > > writably-mapped pages. > > > > I think you are conflating current implementation with API > > requirements - MAP_SYNC doesn't guarantee anything about page cache > > use. The man page definition simply says "supported only for files > > supporting DAX" and that it provides certain data integrity > > guarantees. It does not define the implementation. > > > > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour, > > because that's the only way we can currently provide the required > > behaviour to userspace. However, if a filesystem can use the page > > cache to provide the required functionality, then it's free to do > > so. > > > > i.e. if someone implements a pmem-based page cache, MAP_SYNC data > > integrity could be provided /without DAX/ by any filesystem using > > that persistent page cache. i.e. MAP_SYNC really only requires > > mmap() of CPU addressable persistent memory - it does not require > > DAX. Right now, however, the only way to get this functionality is > > through a DAX capable filesystem on dax capable storage. > > > > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it > > COWs new pages in pmem and attaches them a special per-inode cache > > on clean->dirty transition. Then on data sync, background writeback > > or crash recovery, it migrates them from the cache into the file map > > proper via atomic metadata pointer swaps. > > > > IOWs, NOVA provides the correct MAP_SYNC semantics by using a > > separate persistent per-inode write cache to provide the correct > > crash recovery semantics for MAP_SYNC. > > Corect. NOVA would be able to provide MAP_SYNC semantics without DAX. But > effectively it will be also able to provide MAP_DIRECT semantics, right? Yes, I think so. It still needs to do COW on first write fault, but then the app has direct access to the data buffer until it is cleaned and put back in place. The "put back in place" is just an atomic swap of metadata pointers, so it doesn't need the page cache at all... > Because there won't be DRAM between app and persistent storage and I don't > think COW tricks or other data integrity methods are that interesting for > the application. Not for the application, but the filesystem still wants to support snapshots and other such functionality that requires COW. And NOVA doesn't have write-in-place functionality at all - it always COWs on the clean->dirty transition. > Most users of O_DIRECT are concerned about getting close > to media speed performance and low DRAM usage... *nod* > > > and what I think Dan had proposed: > > > > > > mmap flag, MAP_DIRECT > > > - file system guarantees that page cache will not be used to front storage. > > > storage MUST be directly addressable. This *almost* implies MAP_SYNC. > > > The subtle difference is that a write fault /may/ not result in metadata > > > being written back to media. > > > > SIimilar to O_DIRECT, these semantics do not allow userspace apps to > > replace msync/fsync with CPU cache flush operations. So any > > application that uses this mode still needs to use either MAP_SYNC > > or issue msync/fsync for data integrity. 
> > > > If the app is using MAP_DIRECT, the what do we do if the filesystem > > can't provide the required semantics for that specific operation? In > > the case of O_DIRECT, we fall back to buffered IO because it has the > > same data integrity semantics as O_DIRECT and will always work. It's > > just slower and consumes more memory, but the app continues to work > > just fine. > > > > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations > > without using the pagecache seems extremely problematic to me. e.g. > > an app already has an open MAP_DIRECT file, and a third party > > reflinks it or dedupes it and the fs has to fall back to buffered IO > > to do COW operations. This isn't the app's fault - the kernel should > > just fall back transparently to using the page cache for the > > MAP_DIRECT app and just keep working, just like it would if it was > > using O_DIRECT read/write. > > There's another option of failing reflink / dedupe with EBUSY if the file > is mapped with MAP_DIRECT and the filesystem cannot support relink & > MAP_DIRECT together. But there are downsides to that as well. Yup, not the least that setting MAP_DIRECT can race with a reflink.... > > The point I'm trying to make here is that O_DIRECT is a /hint/, not > > a guarantee, and it's done that way to prevent applications from > > being presented with transient, potentially fatal error cases > > because a filesystem implementation can't do a specific operation > > through the direct IO path. > > > > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a > > guarantee. Over time we'll end up with filesystems that can > > guarantee that MAP_DIRECT is always going to use DAX, in the same > > way we have filesystems that guarantee O_DIRECT will always be > > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee > > no page cache will ever be used, then we are basically saying > > "filesystems won't provide MAP_DIRECT even in common, useful cases > > because they can't provide MAP_DIRECT in all cases." And that > > doesn't seem like a very good solution to me. > > These are good points. I'm just somewhat wary of the situation where users > will map files with MAP_DIRECT and then the machine starts thrashing > because the file got reflinked and thus pagecache gets used suddently. It's still better than apps randomly getting SIGBUS. FWIW, this suggests that we probably need to be able to host both DAX pages and page cache pages on the same file at the same time, and be able to handle page faults based on the type of page being mapped (different sets of fault ops for different page types?) and have fallback paths when the page type needs to be changed between direct and cached during the fault.... > With O_DIRECT the fallback to buffered IO is quite rare (at least for major > filesystems) so usually people just won't notice. If fallback for > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful. Which is just like the situation where O_DIRECT on ext3 was not very useful, but on other filesystems like XFS it was fully functional. IMO, the fact that a specific filesytem has a suboptimal fallback path for an uncommon behaviour isn't an argument against MAP_DIRECT as a hint - it's actually a feature. If MAP_DIRECT can't be used until it's always direct access, then most filesystems wouldn't be able to provide any faster paths at all. 
It's much better to have partial functionality now than it is to never have the functionality at all, and so we need to design in the flexibility we need to iteratively improve implementations without needing API changes that will break applications. Cheers, Dave.
On Thu, Oct 18, 2018 at 12:10:13PM -0700, Dan Williams wrote: > The only caveat to address all the use cases for applications making > decisions based on the presence of DAX And that's how we've got into this mess. Applications need to focus on the functionality they require, not the technology that provides it. That's the root of the problem we are trying to solve here and really I don't care if we have to break existing applications to do it. i.e. we've made no promises about API/ABI stability and the functionality is still experimental. Fundamentally, DAX is a technology, not an API property. The two "DAX" API properties that matter to applications are: 1. does mmap allow us to use CPU flush instructions for data integrity operations safely? And 2. can mmap directly access the backing store without incurring any additional overhead? MAP_SYNC provides #1, MAP_DIRECT provides #2, and DAX provides both. However, they do not define DAX, nor does DAX define them. e.g. MAP_SYNC can be provided by a persistent memory page cache. But a persistent memory page cache does not provide MAP_DIRECT. MAP_SYNC can be provided by filesystem DAX, but *only* when direct access is used. i.e. MAP_SYNC | MAP_DIRECT MAP_DIRECT can be provided by filesystem DAX, but it does not imply or require MAP_SYNC behaviour. IOWs, using MAP_SYNC and/or MAP_DIRECT to answer an "is DAX present" question ties the API to a technology rather than to the functionality the technology provides applications. i.e. If the requested behaviour/property is not available from the underlying technology, then the app needs to handle that error and use a different access method. > applications making > decisions based on the presence of DAX > is to make MADV_DIRECT_ACCESS > fail if the mapping was not established with MAP_SYNC. And so this is wrong - MADV_DIRECT_ACCESS does not require MAP_SYNC. It is perfectly legal for MADV_DIRECT_ACCESS to be used without MAP_SYNC - the app just needs to use msync/fsync instead. Wanting to enable full userspace CPU data sync semantics via madvise() implies we also need MADV_SYNC in addition to MADV_DIRECT_ACCESS. i.e. Apps that are currently testing for dax should use mmap(MAP_SYNC|MAP_DIRECT) or madvise(MADV_SYNC|MADV_DIRECT) and they will fail if the underlying storage is not DAX capable. The app doesn't need to poke at anything else to see if DAX is enabled - if the functionality is there then it will work, otherwise they need to handle the error and do something else. > That way we > have both a way to assert that page cache resources are not being > consumed, and that the kernel is handling metadata synchronization for > any write-faults. Yes, we need to do that, but not at the cost of having the API prevent apps from ever being able to use direct access + msync/fsync data integrity operations. Cheers, Dave.
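The "ask for the functionality and handle the error" pattern described above might look like the following from an application's point of view. MAP_SYNC and MAP_SHARED_VALIDATE are real kernel ABI; MAP_DIRECT is only a proposal in this thread, so its value here is a stand-in, and the fallback policy is just one possible choice.

/* Sketch only: MAP_DIRECT is a proposed flag, not kernel ABI. */
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT 0x100000	/* placeholder value for the proposal */
#endif

struct pmem_mapping {
	void *addr;
	bool  needs_msync;	/* true: use msync/fsync, not just CPU flushes */
};

static struct pmem_mapping map_ask_then_fallback(int fd, size_t len)
{
	struct pmem_mapping m = { .needs_msync = false };

	/* Ask for both properties; MAP_SHARED_VALIDATE makes unknown or
	 * unsupported flags fail instead of being silently ignored. */
	m.addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		      MAP_SHARED_VALIDATE | MAP_SYNC | MAP_DIRECT, fd, 0);
	if (m.addr != MAP_FAILED)
		return m;

	/* Functionality not available: plain mapping + msync() still works. */
	m.addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	m.needs_msync = true;
	return m;
}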
On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote: > > On Thu 18-10-18 11:25:10, Dave Chinner wrote: > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > > > > MAP_SYNC > > > > - file system guarantees that metadata required to reach faulted-in file > > > > data is consistent on media before a write fault is completed. A > > > > side-effect is that the page cache will not be used for > > > > writably-mapped pages. > > > > > > I think you are conflating current implementation with API > > > requirements - MAP_SYNC doesn't guarantee anything about page cache > > > use. The man page definition simply says "supported only for files > > > supporting DAX" and that it provides certain data integrity > > > guarantees. It does not define the implementation. > > > > > > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour, > > > because that's the only way we can currently provide the required > > > behaviour to userspace. However, if a filesystem can use the page > > > cache to provide the required functionality, then it's free to do > > > so. > > > > > > i.e. if someone implements a pmem-based page cache, MAP_SYNC data > > > integrity could be provided /without DAX/ by any filesystem using > > > that persistent page cache. i.e. MAP_SYNC really only requires > > > mmap() of CPU addressable persistent memory - it does not require > > > DAX. Right now, however, the only way to get this functionality is > > > through a DAX capable filesystem on dax capable storage. > > > > > > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it > > > COWs new pages in pmem and attaches them a special per-inode cache > > > on clean->dirty transition. Then on data sync, background writeback > > > or crash recovery, it migrates them from the cache into the file map > > > proper via atomic metadata pointer swaps. > > > > > > IOWs, NOVA provides the correct MAP_SYNC semantics by using a > > > separate persistent per-inode write cache to provide the correct > > > crash recovery semantics for MAP_SYNC. > > > > Corect. NOVA would be able to provide MAP_SYNC semantics without DAX. But > > effectively it will be also able to provide MAP_DIRECT semantics, right? > > Yes, I think so. It still needs to do COW on first write fault, > but then the app has direct access to the data buffer until it is > cleaned and put back in place. The "put back in place" is just an > atomic swap of metadata pointers, so it doesn't need the page cache > at all... > > > Because there won't be DRAM between app and persistent storage and I don't > > think COW tricks or other data integrity methods are that interesting for > > the application. > > Not for the application, but the filesystem still wants to support > snapshots and other such functionality that requires COW. And NOVA > doesn't have write-in-place functionality at all - it always COWs > on the clean->dirty transition. > > > Most users of O_DIRECT are concerned about getting close > > to media speed performance and low DRAM usage... > > *nod* > > > > > and what I think Dan had proposed: > > > > > > > > mmap flag, MAP_DIRECT > > > > - file system guarantees that page cache will not be used to front storage. > > > > storage MUST be directly addressable. This *almost* implies MAP_SYNC. > > > > The subtle difference is that a write fault /may/ not result in metadata > > > > being written back to media. 
> > > > > > SIimilar to O_DIRECT, these semantics do not allow userspace apps to > > > replace msync/fsync with CPU cache flush operations. So any > > > application that uses this mode still needs to use either MAP_SYNC > > > or issue msync/fsync for data integrity. > > > > > > If the app is using MAP_DIRECT, the what do we do if the filesystem > > > can't provide the required semantics for that specific operation? In > > > the case of O_DIRECT, we fall back to buffered IO because it has the > > > same data integrity semantics as O_DIRECT and will always work. It's > > > just slower and consumes more memory, but the app continues to work > > > just fine. > > > > > > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations > > > without using the pagecache seems extremely problematic to me. e.g. > > > an app already has an open MAP_DIRECT file, and a third party > > > reflinks it or dedupes it and the fs has to fall back to buffered IO > > > to do COW operations. This isn't the app's fault - the kernel should > > > just fall back transparently to using the page cache for the > > > MAP_DIRECT app and just keep working, just like it would if it was > > > using O_DIRECT read/write. > > > > There's another option of failing reflink / dedupe with EBUSY if the file > > is mapped with MAP_DIRECT and the filesystem cannot support relink & > > MAP_DIRECT together. But there are downsides to that as well. > > Yup, not the least that setting MAP_DIRECT can race with a > reflink.... > > > > The point I'm trying to make here is that O_DIRECT is a /hint/, not > > > a guarantee, and it's done that way to prevent applications from > > > being presented with transient, potentially fatal error cases > > > because a filesystem implementation can't do a specific operation > > > through the direct IO path. > > > > > > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a > > > guarantee. Over time we'll end up with filesystems that can > > > guarantee that MAP_DIRECT is always going to use DAX, in the same > > > way we have filesystems that guarantee O_DIRECT will always be > > > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee > > > no page cache will ever be used, then we are basically saying > > > "filesystems won't provide MAP_DIRECT even in common, useful cases > > > because they can't provide MAP_DIRECT in all cases." And that > > > doesn't seem like a very good solution to me. > > > > These are good points. I'm just somewhat wary of the situation where users > > will map files with MAP_DIRECT and then the machine starts thrashing > > because the file got reflinked and thus pagecache gets used suddently. > > It's still better than apps randomly getting SIGBUS. > > FWIW, this suggests that we probably need to be able to host both > DAX pages and page cache pages on the same file at the same time, > and be able to handle page faults based on the type of page being > mapped (different sets of fault ops for different page types?) > and have fallback paths when the page type needs to be changed > between direct and cached during the fault.... > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major > > filesystems) so usually people just won't notice. If fallback for > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful. > > Which is just like the situation where O_DIRECT on ext3 was not very > useful, but on other filesystems like XFS it was fully functional. 
> > IMO, the fact that a specific filesytem has a suboptimal fallback > path for an uncommon behaviour isn't an argument against MAP_DIRECT > as a hint - it's actually a feature. If MAP_DIRECT can't be used > until it's always direct access, then most filesystems wouldn't be > able to provide any faster paths at all. It's much better to have > partial functionality now than it is to never have the functionality > at all, and so we need to design in the flexibility we need to > iteratively improve implementations without needing API changes that > will break applications. The hard guarantee requirement still remains though because an application that expects combined MAP_SYNC|MAP_DIRECT semantics will be surprised if the MAP_DIRECT property silently disappears. I think it still makes some sense as a hint for apps that want to minimize page cache, but for the applications with a flush from userspace model I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This still gives the filesystem the option to inject page-cache at will, but with an application coordination point.
On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote: > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote: > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote: > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > > > > > MAP_SYNC > > > > > - file system guarantees that metadata required to reach faulted-in file > > > > > data is consistent on media before a write fault is completed. A > > > > > side-effect is that the page cache will not be used for > > > > > writably-mapped pages. > > > > > > > > I think you are conflating current implementation with API > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache > > > > use. The man page definition simply says "supported only for files > > > > supporting DAX" and that it provides certain data integrity > > > > guarantees. It does not define the implementation. .... > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major > > > filesystems) so usually people just won't notice. If fallback for > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful. > > > > Which is just like the situation where O_DIRECT on ext3 was not very > > useful, but on other filesystems like XFS it was fully functional. > > > > IMO, the fact that a specific filesytem has a suboptimal fallback > > path for an uncommon behaviour isn't an argument against MAP_DIRECT > > as a hint - it's actually a feature. If MAP_DIRECT can't be used > > until it's always direct access, then most filesystems wouldn't be > > able to provide any faster paths at all. It's much better to have > > partial functionality now than it is to never have the functionality > > at all, and so we need to design in the flexibility we need to > > iteratively improve implementations without needing API changes that > > will break applications. > > The hard guarantee requirement still remains though because an > application that expects combined MAP_SYNC|MAP_DIRECT semantics will > be surprised if the MAP_DIRECT property silently disappears. Why would they be surprised? They won't even notice it if the filesystem can provide MAP_SYNC without MAP_DIRECT. And that's the whole point. MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not visible to the filesystem and the filesystem does nothing to enforce them. If someone does something that requires the page cache (e.g. calls do_splice_direct()) then that MAP_DIRECT mapping has a whole heap of new work to do. And, in some cases, the filesystem may not be able to provide MAP_DIRECT as a result.. IOWs, the filesystem cannot guarantee MAP_DIRECT and the circumstances under which MAP_DIRECT will and will not work are dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have applications randomly segfaulting in production as things like backups, indexers, etc run over the filesystem and do their work. This is why MAP_DIRECT needs to be an optimisation, not a requirement - things will still work if MAP_DIRECT is not used. What matters to these applications is MAP_SYNC - if we break MAP_SYNC, then the application data integrity model is violated. That's not an acceptible outcome. The problem, it seems to me, is that people are unable to separate MAP_DIRECT and MAP_SYNC. I suspect that is because, at present, MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only provide MAP_SYNC functionality on DAX mappings. 
However, that's a /filesystem implementation issue/, not an API guarantee we need to provide to userspace. If we implement a persistent page cache (e.g. allocate page cache pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4 could provide applications with the MAP_SYNC data integrity model without MAP_DIRECT. Indeed, those filesystems would not even be able to provide MAP_DIRECT semantics because they aren't backed by pmem. Hence if applications that want MAP_SYNC are hard coded to use MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then those applications are going to fail on a filesystem that provides only MAP_SYNC. This is despite the fact that the applications would function correctly and the data integrity model would be maintained. i.e. the failure is because applications have assumed MAP_SYNC can only be provided by a DAX implementation, not because MAP_SYNC is not supported. MAP_SYNC really isn't about DAX at all. It's about enabling a data integrity model that requires the filesystem to provide userspace access to CPU addressable persistent memory. DAX+MAP_DIRECT is just one method of providing this functionality, but it's not the only method. Our API needs to be future proof rather than an encoding of the existing implementation limitations, otherwise apps will have to be re-written as every new MAP_SYNC capable technology comes along. In summary: MAP_DIRECT is an access hint. MAP_SYNC provides a data integrity model guarantee. MAP_SYNC may imply MAP_DIRECT for specific implementations, but it does not require or guarantee MAP_DIRECT. Let's compare that with O_DIRECT: O_DIRECT is an access hint. O_DSYNC provides a data integrity model guarantee. O_DSYNC may imply O_DIRECT for specific implementations, but it does not require or guarantee O_DIRECT. Consistency in access and data integrity models is a good thing. DAX and pmem are not an exception. We need to use a model we know works and has proven itself over a long period of time. > I think > it still makes some sense as a hint for apps that want to minimize > page cache, but for the applications with a flush from userspace model > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This > still gives the filesystem the option to inject page-cache at will, > but with an application coordination point. Why make it more complex for applications than it needs to be? Cheers, Dave.
On Tue, Oct 30, 2018 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote: > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote: > > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote: > > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote: > > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > > > > > > MAP_SYNC > > > > > > - file system guarantees that metadata required to reach faulted-in file > > > > > > data is consistent on media before a write fault is completed. A > > > > > > side-effect is that the page cache will not be used for > > > > > > writably-mapped pages. > > > > > > > > > > I think you are conflating current implementation with API > > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache > > > > > use. The man page definition simply says "supported only for files > > > > > supporting DAX" and that it provides certain data integrity > > > > > guarantees. It does not define the implementation. > .... > > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major > > > > filesystems) so usually people just won't notice. If fallback for > > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful. > > > > > > Which is just like the situation where O_DIRECT on ext3 was not very > > > useful, but on other filesystems like XFS it was fully functional. > > > > > > IMO, the fact that a specific filesytem has a suboptimal fallback > > > path for an uncommon behaviour isn't an argument against MAP_DIRECT > > > as a hint - it's actually a feature. If MAP_DIRECT can't be used > > > until it's always direct access, then most filesystems wouldn't be > > > able to provide any faster paths at all. It's much better to have > > > partial functionality now than it is to never have the functionality > > > at all, and so we need to design in the flexibility we need to > > > iteratively improve implementations without needing API changes that > > > will break applications. > > > > The hard guarantee requirement still remains though because an > > application that expects combined MAP_SYNC|MAP_DIRECT semantics will > > be surprised if the MAP_DIRECT property silently disappears. > > Why would they be surprised? They won't even notice it if the > filesystem can provide MAP_SYNC without MAP_DIRECT. > > And that's the whole point. > > MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not > visible to the filesystem and the filesystem does nothing to enforce > them. If someone does something that requires the page cache (e.g. > calls do_splice_direct()) then that MAP_DIRECT mapping has a whole > heap of new work to do. And, in some cases, the filesystem may not > be able to provide MAP_DIRECT as a result.. > > IOWs, the filesystem cannot guarantee MAP_DIRECT and the > circumstances under which MAP_DIRECT will and will not work are > dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have > applications randomly segfaulting in production as things like > backups, indexers, etc run over the filesystem and do their work. > > This is why MAP_DIRECT needs to be an optimisation, not a > requirement - things will still work if MAP_DIRECT is not used. What > matters to these applications is MAP_SYNC - if we break MAP_SYNC, > then the application data integrity model is violated. That's not an > acceptible outcome. > > The problem, it seems to me, is that people are unable to separate > MAP_DIRECT and MAP_SYNC. 
I suspect that is because, at present, > MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only > provide MAP_SYNC functionality on DAX mappings. However, that's a > /filesystem implementation issue/, not an API guarantee we need to > provide to userspace. > > If we implement a persistent page cache (e.g. allocate page cache > pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4 > could provide applications with the MAP_SYNC data integrity model > without MAP_DIRECT. Indeed, those filesystems would not even be able > to provide MAP_DIRECT semantics because they aren't backed by pmem. > > Hence if applications that want MAP_SYNC are hard coded > MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then > those applications are going to fail on a filesystem that provides > only MAP_SYNC. This is despite the fact the applications would > function correctly and the data integrity model would be maintained. > i.e. the failure is because applications have assumed MAP_SYNC can > only be provided by a DAX implementation, not because MAP_SYNC is > not supported. > > MAP_SYNC really isn't about DAX at all. It's about enabling a data > integrity model that requires the filesystem to provide userspace > access to CPU addressable persistent memory. DAX+MAP_DIRECT is just > one method of providing this functionality, but it's not the only > method. Our API needs to be future proof rather than an encoding of > the existing implementation limitations, otherwise apps will have to > be re-written as every new MAP_SYNC capable technology comes along. > > In summary: > > MAP_DIRECT is an access hint. > > MAP_SYNC provides a data integrity model guarantee. > > MAP_SYNC may imply MAP_DIRECT for specific implementations, > but it does not require or guarantee MAP_DIRECT. > > Let's compare that with O_DIRECT: > > O_DIRECT in an access hint. > > O_DSYNC provides a data integrity model guarantee. > > O_DSYNC may imply O_DIRECT for specific implementations, but > it does not require or guarantee O_DIRECT. > > Consistency in access and data integrity models is a good thing. DAX > and pmem is not an exception. We need to use a model we know works > and has proven itself over a long period of time. > > > I think > > it still makes some sense as a hint for apps that want to minimize > > page cache, but for the applications with a flush from userspace model > > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This > > still gives the filesystem the option to inject page-cache at will, > > but with an application coordination point. > > Why make it more complex for applications than it needs to be? With the clarification that MAP_SYNC implies "cpu cache flush to persistent memory page-cache *or* dax to persistent memory" I think all of the concerns are addressed. I was conflating MAP_DIRECT as "no page cache indirection", but the indirection does not matter if the page cache itself is persisted.
Hello, > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote: > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote: > > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote: > > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote: > > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > > > > > > MAP_SYNC > > > > > > - file system guarantees that metadata required to reach faulted-in file > > > > > > data is consistent on media before a write fault is completed. A > > > > > > side-effect is that the page cache will not be used for > > > > > > writably-mapped pages. > > > > > > > > > > I think you are conflating current implementation with API > > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache > > > > > use. The man page definition simply says "supported only for files > > > > > supporting DAX" and that it provides certain data integrity > > > > > guarantees. It does not define the implementation. > .... > > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major > > > > filesystems) so usually people just won't notice. If fallback for > > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful. > > > > > > Which is just like the situation where O_DIRECT on ext3 was not very > > > useful, but on other filesystems like XFS it was fully functional. > > > > > > IMO, the fact that a specific filesytem has a suboptimal fallback > > > path for an uncommon behaviour isn't an argument against MAP_DIRECT > > > as a hint - it's actually a feature. If MAP_DIRECT can't be used > > > until it's always direct access, then most filesystems wouldn't be > > > able to provide any faster paths at all. It's much better to have > > > partial functionality now than it is to never have the functionality > > > at all, and so we need to design in the flexibility we need to > > > iteratively improve implementations without needing API changes that > > > will break applications. > > > > The hard guarantee requirement still remains though because an > > application that expects combined MAP_SYNC|MAP_DIRECT semantics will > > be surprised if the MAP_DIRECT property silently disappears. > > Why would they be surprised? They won't even notice it if the > filesystem can provide MAP_SYNC without MAP_DIRECT. > > And that's the whole point. > > MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not > visible to the filesystem and the filesystem does nothing to enforce > them. If someone does something that requires the page cache (e.g. > calls do_splice_direct()) then that MAP_DIRECT mapping has a whole > heap of new work to do. And, in some cases, the filesystem may not > be able to provide MAP_DIRECT as a result.. > > IOWs, the filesystem cannot guarantee MAP_DIRECT and the > circumstances under which MAP_DIRECT will and will not work are > dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have > applications randomly segfaulting in production as things like > backups, indexers, etc run over the filesystem and do their work. > > This is why MAP_DIRECT needs to be an optimisation, not a > requirement - things will still work if MAP_DIRECT is not used. What > matters to these applications is MAP_SYNC - if we break MAP_SYNC, > then the application data integrity model is violated. That's not an > acceptible outcome. > > The problem, it seems to me, is that people are unable to separate > MAP_DIRECT and MAP_SYNC. 
I suspect that is because, at present, > MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only > provide MAP_SYNC functionality on DAX mappings. However, that's a > /filesystem implementation issue/, not an API guarantee we need to > provide to userspace. > > If we implement a persistent page cache (e.g. allocate page cache > pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4 > could provide applications with the MAP_SYNC data integrity model > without MAP_DIRECT. Indeed, those filesystems would not even be able > to provide MAP_DIRECT semantics because they aren't backed by pmem. > > Hence if applications that want MAP_SYNC are hard coded > MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then > those applications are going to fail on a filesystem that provides > only MAP_SYNC. This is despite the fact the applications would > function correctly and the data integrity model would be maintained. > i.e. the failure is because applications have assumed MAP_SYNC can > only be provided by a DAX implementation, not because MAP_SYNC is > not supported. > > MAP_SYNC really isn't about DAX at all. It's about enabling a data > integrity model that requires the filesystem to provide userspace > access to CPU addressable persistent memory. DAX+MAP_DIRECT is just > one method of providing this functionality, but it's not the only > method. Our API needs to be future proof rather than an encoding of > the existing implementation limitations, otherwise apps will have to > be re-written as every new MAP_SYNC capable technology comes along. > > In summary: > > MAP_DIRECT is an access hint. > > MAP_SYNC provides a data integrity model guarantee. > > MAP_SYNC may imply MAP_DIRECT for specific implementations, > but it does not require or guarantee MAP_DIRECT. > > Let's compare that with O_DIRECT: > > O_DIRECT in an access hint. > > O_DSYNC provides a data integrity model guarantee. > > O_DSYNC may imply O_DIRECT for specific implementations, but > it does not require or guarantee O_DIRECT. > > Consistency in access and data integrity models is a good thing. DAX > and pmem is not an exception. We need to use a model we know works > and has proven itself over a long period of time. Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT. (I'm not opposed to your opinion, but I need to know it.) In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is "wrong alignment is specified by application", right? When filesystem can not use O_DIRECT and it uses page cache instead, then system uses more memory resource than user's expectation. So, there is a side effect, and it may cause other trouble. (memory pressure, expected performance can not be gained, and so on ..) In such case its administrator (or technical support engineer) needs to struggle to investigate what is the reason. If the reason of the breakage is clear, then it is helpful to find the root cause, and they can require the developer of wrong application to fix the problem. "Please fix the alignment!". So, I would like to know in MAP_DIRECT case, what is the reasons? I think it will be helpful for users. Only splice? (Maybe such document will be necessary....) Thanks, > > > I think > > it still makes some sense as a hint for apps that want to minimize > > page cache, but for the applications with a flush from userspace model > > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. 
This > > still gives the filesystem the option to inject page-cache at will, > > but with an application coordination point. > > Why make it more complex for applications than it needs to be? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Wed, Oct 31, 2018 at 05:59:17AM +0000, y-goto@fujitsu.com wrote: > > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote: > > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote: > > In summary: > > > > MAP_DIRECT is an access hint. > > > > MAP_SYNC provides a data integrity model guarantee. > > > > MAP_SYNC may imply MAP_DIRECT for specific implementations, > > but it does not require or guarantee MAP_DIRECT. > > > > Let's compare that with O_DIRECT: > > > > O_DIRECT in an access hint. > > > > O_DSYNC provides a data integrity model guarantee. > > > > O_DSYNC may imply O_DIRECT for specific implementations, but > > it does not require or guarantee O_DIRECT. > > > > Consistency in access and data integrity models is a good thing. DAX > > and pmem is not an exception. We need to use a model we know works > > and has proven itself over a long period of time. > > Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT. > (I'm not opposed to your opinion, but I need to know it.) > > In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is > "wrong alignment is specified by application", right? O_DIRECT has defined memory and offset alignment restrictions, and will return an error to userspace when they are violated. It does not fall back to buffered IO in this case. MAP_DIRECT has no equivalent restriction, so IO alignment of O_DIRECT is largely irrelevant here. What we are talking about here is that some filesystems can only do certain operations through buffered IO, such as block allocation or file extension, and so silently fall back to doing them via buffered IO even when O_DIRECT is specified. The old direct IO code used to be full of conditionals to allow this - I think DIO_SKIP_HOLES is only one remaining: /* * For writes that could fill holes inside i_size on a * DIO_SKIP_HOLES filesystem we forbid block creations: only * overwrites are permitted. We will return early to the caller * once we see an unmapped buffer head returned, and the caller * will fall back to buffered I/O. * * Otherwise the decision is left to the get_blocks method, * which may decide to handle it or also return an unmapped * buffer head. */ create = dio->op == REQ_OP_WRITE; if (dio->flags & DIO_SKIP_HOLES) { if (fs_startblk <= ((i_size_read(dio->inode) - 1) >> i_blkbits)) create = 0; } Other cases like file extension cases are caught by the filesystems before calling into the DIO code itself, so there's multiple avenues for O_DIRECT transparently falling back to buffered IO. This means the applications don't fail just because the filesystem can't do a specific operation via O_DIRECT. The data writes still succeed because they fall back to buffered IO, and the application is blissfully unaware that the filesystem behaved that way. > When filesystem can not use O_DIRECT and it uses page cache instead, > then system uses more memory resource than user's expectation. That's far better than failing unexpectedly because the app unexpectedly came across a hole in the file (e.g. someone ran sparsify across the filesystem). > So, there is a side effect, and it may cause other trouble. > (memory pressure, expected performance can not be gained, and so on ..) Which is why people are supposed to test their systems before they put them into production. I've lost count of the number of times I've heard "but O_DIRECT is supposed to make things faster!" because people don't understand exactly what it does or means. 
Bypassing the page cache does not magically make applications go faster - it puts the responsibility for doing optimal IO on the application, not the kernel. MAP_DIRECT will be no different. It's no guarantee that it will make things faster, or that everything will just work as users expect them to. It specifically places the responsibility for performing IO in an optimal fashion on the application and the user for making sure that it is fit for their purposes. Like O_DIRECT, using MAP_DIRECT means "I, the application, know exactly what I'm doing, so get out of the way as much as possible because I'm taking responsibility for issuing IO in the most optimal manner now". > In such case its administrator (or technical support engineer) needs to struggle to > investigate what is the reason. That's no different to performance problems that arise from inappropriate use of O_DIRECT. It requires a certain level of expertise to be able to understand and diagnose such issues. > So, I would like to know in MAP_DIRECT case, what is the reasons? > I think it will be helpful for users. > Only splice? The filesystem can ignore MAP_DIRECT for any reason it needs to. I'm certain that filesystem developers will try to maintain MAP_DIRECT semantics as much as possible, but it's not going to be possible in /all situations/ on XFS and ext4 because they simply haven't been designed with DAX in mind. Filesystems designed specifically for pmem and DAX might be able to provide MAP_DIRECT in all situations, but those filesystems don't really exist yet. This is no different to the early days of O_DIRECT. e.g. ext3 couldn't do O_DIRECT for all operations when it was first introduced, but over time the functionality improved as the underlying issues were solved. If O_DIRECT was a guarantee, then ext3 would have never supported O_DIRECT at all... > (Maybe such document will be necessary....) The semantics will need to be documented in the relevant man pages. Cheers, Dave.
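For comparison, the O_DIRECT alignment contract referred to above looks like this from userspace: the error comes straight back to the application rather than being silently fixed up. This is an illustrative sketch, not code from the thread; the 4096-byte alignment is an assumption, since the real requirement depends on the device logical block size and the filesystem.

/* Sketch: O_DIRECT puts the alignment burden on the application. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define DIO_ALIGN 4096UL	/* assumed; device/filesystem dependent */

int main(int argc, char **argv)
{
	void *buf;
	ssize_t n;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN))
		return 1;

	/* Buffer address, length and file offset must all be aligned. */
	n = pread(fd, buf, DIO_ALIGN, 0);
	if (n < 0)
		perror("pread");	/* e.g. EINVAL on a misaligned request */

	free(buf);
	close(fd);
	return n < 0;
}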
> > > MAP_DIRECT is an access hint.
> > >
> > > MAP_SYNC provides a data integrity model guarantee.
> > >
> > > MAP_SYNC may imply MAP_DIRECT for specific implementations, but it does not require or guarantee MAP_DIRECT.
> > >
> > > Let's compare that with O_DIRECT:
> > >
> > > O_DIRECT is an access hint.
> > >
> > > O_DSYNC provides a data integrity model guarantee.
> > >
> > > O_DSYNC may imply O_DIRECT for specific implementations, but it does not require or guarantee O_DIRECT.
> > >
> > > Consistency in access and data integrity models is a good thing. DAX and pmem are not an exception. We need to use a model we know works and has proven itself over a long period of time.
> >
> > Hmmm, then, I would like to know all of the reasons MAP_DIRECT can break. (I'm not opposed to your opinion, but I need to know it.)
> >
> > In the O_DIRECT case, in my understanding, the reason O_DIRECT breaks is that "wrong alignment is specified by the application", right?
>
> O_DIRECT has defined memory and offset alignment restrictions, and will return an error to userspace when they are violated. It does not fall back to buffered IO in this case. MAP_DIRECT has no equivalent restriction, so the IO alignment of O_DIRECT is largely irrelevant here.
>
> What we are talking about here is that some filesystems can only do certain operations through buffered IO, such as block allocation or file extension, and so silently fall back to doing them via buffered IO even when O_DIRECT is specified. The old direct IO code used to be full of conditionals to allow this - I think DIO_SKIP_HOLES is the only one remaining:
>
> 	/*
> 	 * For writes that could fill holes inside i_size on a
> 	 * DIO_SKIP_HOLES filesystem we forbid block creations: only
> 	 * overwrites are permitted. We will return early to the caller
> 	 * once we see an unmapped buffer head returned, and the caller
> 	 * will fall back to buffered I/O.
> 	 *
> 	 * Otherwise the decision is left to the get_blocks method,
> 	 * which may decide to handle it or also return an unmapped
> 	 * buffer head.
> 	 */
> 	create = dio->op == REQ_OP_WRITE;
> 	if (dio->flags & DIO_SKIP_HOLES) {
> 		if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
> 					i_blkbits))
> 			create = 0;
> 	}
>
> Other cases, like file extension, are caught by the filesystems before calling into the DIO code itself, so there are multiple avenues for O_DIRECT transparently falling back to buffered IO.
>
> This means the applications don't fail just because the filesystem can't do a specific operation via O_DIRECT. The data writes still succeed because they fall back to buffered IO, and the application is blissfully unaware that the filesystem behaved that way.
>
> > When the filesystem cannot use O_DIRECT and uses the page cache instead, the system uses more memory than the user expects.
>
> That's far better than failing unexpectedly because the app came across a hole in the file (e.g. someone ran sparsify across the filesystem).
>
> > So there is a side effect, and it may cause other trouble (memory pressure, the expected performance cannot be reached, and so on).
>
> Which is why people are supposed to test their systems before they put them into production.
>
> I've lost count of the number of times I've heard "but O_DIRECT is supposed to make things faster!" because people don't understand exactly what it does or means. Bypassing the page cache does not magically make applications go faster - it puts the responsibility for doing optimal IO on the application, not the kernel.
>
> MAP_DIRECT will be no different. It's no guarantee that it will make things faster, or that everything will just work as users expect it to. It specifically places the responsibility for performing IO in an optimal fashion on the application, and on the user for making sure that it is fit for their purposes. Like O_DIRECT, using MAP_DIRECT means "I, the application, know exactly what I'm doing, so get out of the way as much as possible because I'm taking responsibility for issuing IO in the most optimal manner now".
>
> > In such a case the administrator (or a technical support engineer) has to struggle to investigate the reason.
>
> That's no different to performance problems that arise from inappropriate use of O_DIRECT. It requires a certain level of expertise to be able to understand and diagnose such issues.
>
> > So I would like to know: in the MAP_DIRECT case, what are the reasons? I think it will be helpful for users. Only splice?
>
> The filesystem can ignore MAP_DIRECT for any reason it needs to. I'm certain that filesystem developers will try to maintain MAP_DIRECT semantics as much as possible, but it's not going to be possible in /all situations/ on XFS and ext4 because they simply haven't been designed with DAX in mind. Filesystems designed specifically for pmem and DAX might be able to provide MAP_DIRECT in all situations, but those filesystems don't really exist yet.
>
> This is no different to the early days of O_DIRECT. e.g. ext3 couldn't do O_DIRECT for all operations when it was first introduced, but over time the functionality improved as the underlying issues were solved. If O_DIRECT was a guarantee, then ext3 would have never supported O_DIRECT at all...

Hmm, OK, I see. Thank you very much for your detailed explanation.

> > (Maybe such a document will be necessary...)
>
> The semantics will need to be documented in the relevant man pages.

I agree. Thanks again.

----
Yasunori Goto
From a3bfac5e1582d9c31e67090b306efedf7c392d36 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 2 Oct 2018 11:58:22 +0200
Subject: [PATCH] proc: Show DAX mappings as having VM_MIXEDMAP flag

Until commit e1fb4a08649 "dax: remove VM_MIXEDMAP for fsdax and device
dax", userspace could distinguish DAX and non-DAX mappings in
/proc/<pid>/smaps by checking whether the VMA has the VM_MIXEDMAP flag
("mm" in smaps). And indeed some applications started doing this.
Maintain backwards compatibility by faking the VM_MIXEDMAP flag in smaps
for DAX mappings.

Fixes: e1fb4a0864958fac2fb1b23f9f4562a9f90e3e8f
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/proc/task_mmu.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5ea1d64cb0b4..9cd8e84610a7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/fs.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -654,12 +655,18 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 	};
 	size_t i;
+	unsigned long flags = vma->vm_flags;
 
+	/*
+	 * We fake VM_MIXEDMAP for DAX mappings here as some apps depend on it.
+	 */
+	if (vma_is_dax(vma))
+		flags |= VM_MIXEDMAP;
 	seq_puts(m, "VmFlags: ");
 	for (i = 0; i < BITS_PER_LONG; i++) {
 		if (!mnemonics[i][0])
 			continue;
-		if (vma->vm_flags & (1UL << i)) {
+		if (flags & (1UL << i)) {
 			seq_putc(m, mnemonics[i][0]);
 			seq_putc(m, mnemonics[i][1]);
 			seq_putc(m, ' ');
-- 
2.16.4
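As a usage illustration of what this patch restores (not part of the patch itself), the sketch below shows the kind of check the customer application described earlier in the thread performs: parse /proc/self/smaps, locate the VMA containing a given address, and look for the "mm" mnemonic on its VmFlags line. The function name and structure are mine; as discussed above, VmFlags is a debugging interface, so a check like this should not be treated as a stable ABI.

/*
 * Minimal sketch of the userspace check this patch keeps working: scan
 * /proc/self/smaps for the mapping that contains a given address and test
 * whether its VmFlags line carries the "mm" (VM_MIXEDMAP) mnemonic.
 */
#include <stdio.h>
#include <string.h>

static int mapping_has_mixedmap(unsigned long addr)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[512];
	int in_vma = 0;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		/* VMA header lines look like "start-end perms offset ...". */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = (addr >= start && addr < end);
			continue;
		}
		/* Inside the right VMA, inspect its VmFlags mnemonics. */
		if (in_vma && !strncmp(line, "VmFlags:", 8)) {
			fclose(f);
			return strstr(line, " mm ") != NULL;
		}
	}
	fclose(f);
	return 0;
}

int main(void)
{
	int x = 0;

	/*
	 * The stack mapping will not be DAX; a real caller would pass an
	 * address returned by mmap() on a DAX file.
	 */
	printf("mm flag present: %d\n",
	       mapping_has_mixedmap((unsigned long)&x));
	return 0;
}

With the patch applied, running this against an mmap() of a file on a DAX filesystem reports the "mm" flag again, matching the behaviour before commit e1fb4a086495.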