Message ID | 20200227052442.22524-1-ira.weiny@intel.com (mailing list archive) |
---|---|
Headers | show |
Series | Enable per-file/per-directory DAX operations V5 | expand |
FYI, I still will fully NAK any series that adds additional locks and thus atomic instructions to basically every fs call, and grows the inode by a rw_semaphore plus and atomic64_t. I also think the whole idea of switching operation vectors at runtime is fatally flawed and we should never add such code, nevermind just for a fringe usecase of a fringe feature. On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > Changes from V4: > * Open code the aops lock rather than add it to the xfs_ilock() > subsystem (Darrick's comments were obsoleted by this change) > * Fix lkp build suggestions and bugs > > Changes from V3: > * Remove global locking... :-D > * put back per inode locking and remove pre-mature optimizations > * Fix issues with Directories having IS_DAX() set > * Fix kernel crash issues reported by Jeff > * Add some clean up patches > * Consolidate diflags to iflags functions > * Update/add documentation > * Reorder/rename patches quite a bit > > Changes from V2: > > * Move i_dax_sem to be a global percpu_rw_sem rather than per inode > Internal discussions with Dan determined this would be easier, > just as performant, and slightly less overhead that having it > in the SB as suggested by Jan > * Fix locking order in comments and throughout code > * Change "mode" to "state" throughout commits > * Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not > configured > * Add static branch for which is activated by a device which supports > DAX in XFS > * Change "lock/unlock" to up/down read/write as appropriate > Previous names were over simplified > * Update comments/documentation > > * Remove the xfs specific lock to the vfs (global) layer. > * Fix i_dax_sem locking order and comments > > * Move 'i_mapped' count from struct inode to struct address_space and > rename it to mmap_count > * Add inode_has_mappings() call > > * Fix build issues > * Clean up syntax spacing and minor issues > * Update man page text for STATX_ATTR_DAX > * Add reviewed-by's > * Rebase to 5.6 > > Rename patch: > from: fs/xfs: Add lock/unlock state to xfs > to: fs/xfs: Add write DAX lock to xfs layer > Add patch: > fs/xfs: Clarify lockdep dependency for xfs_isilocked() > Drop patch: > fs/xfs: Fix truncate up > > > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory > consumption due to their inability to detect whether the kernel will > instantiate page cache for a file, and cases where a global dax enable via a > mount option is too coarse. > > The following patch series enables selecting the use of DAX on individual files > and/or directories on xfs, and lays some groundwork to do so in ext4. In this > scheme the dax mount option can be omitted to allow the per-file property to > take effect. > > The insight at LSF/MM was to separate the per-mount or per-file "physical" > capability switch from an "effective" attribute for the file. > > At LSF/MM we discussed the difficulties of switching the DAX state of a file > with active mappings / page cache. It was thought the races could be avoided > by limiting DAX state flips to 0-length files. > > However, this turns out to not be true.[3] This is because address space > operations (a_ops) may be in use at any time the inode is referenced and users > have expressed a desire to be able to change the DAX state on a file with data > in it. For those reasons this patch set allows changing the DAX state flag on > a file as long as it is not current mapped. > > Details of when and how DAX state can be changed on a file is included in a > documentation patch. > > It should be noted that the physical DAX flag inheritance is not shown in this > patch set as it was maintained from previous work on XFS. The physical DAX > flag and it's inheritance will need to be added to other file systems for user > control. > > As submitted this works on real hardware testing. > > > [1] https://lwn.net/Articles/787973/ > [2] https://lwn.net/Articles/787233/ > [3] https://lkml.org/lkml/2019/10/20/96 > [4] https://patchwork.kernel.org/patch/11310511/ > > > To: linux-kernel@vger.kernel.org > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Cc: "Darrick J. Wong" <darrick.wong@oracle.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: "Theodore Y. Ts'o" <tytso@mit.edu> > Cc: Jan Kara <jack@suse.cz> > Cc: linux-ext4@vger.kernel.org > Cc: linux-xfs@vger.kernel.org > Cc: linux-fsdevel@vger.kernel.org > > > Ira Weiny (12): > fs/xfs: Remove unnecessary initialization of i_rwsem > fs: Remove unneeded IS_DAX() check > fs/stat: Define DAX statx attribute > fs/xfs: Isolate the physical DAX flag from enabled > fs/xfs: Create function xfs_inode_enable_dax() > fs: Add locking for a dynamic address space operations state > fs: Prevent DAX state change if file is mmap'ed > fs/xfs: Hold off aops users while changing DAX state > fs/xfs: Clean up locking in dax invalidate > fs/xfs: Allow toggle of effective DAX flag > fs/xfs: Remove xfs_diflags_to_linux() > Documentation/dax: Update Usage section > > Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++- > Documentation/filesystems/vfs.rst | 16 +++++ > fs/attr.c | 1 + > fs/inode.c | 16 ++++- > fs/iomap/buffered-io.c | 1 + > fs/open.c | 4 ++ > fs/stat.c | 5 ++ > fs/xfs/xfs_icache.c | 5 +- > fs/xfs/xfs_inode.h | 2 + > fs/xfs/xfs_ioctl.c | 98 +++++++++++++++---------------- > fs/xfs/xfs_iops.c | 69 +++++++++++++++------- > include/linux/fs.h | 73 ++++++++++++++++++++++- > include/uapi/linux/stat.h | 1 + > mm/fadvise.c | 7 ++- > mm/filemap.c | 4 ++ > mm/huge_memory.c | 1 + > mm/khugepaged.c | 2 + > mm/mmap.c | 19 +++++- > mm/util.c | 9 ++- > 19 files changed, 328 insertions(+), 89 deletions(-) > > -- > 2.21.0 ---end quoted text---
On Thu, Mar 05, 2020 at 04:51:44PM +0100, Christoph Hellwig wrote: > FYI, I still will fully NAK any series that adds additional locks > and thus atomic instructions to basically every fs call, and grows > the inode by a rw_semaphore plus and atomic64_t. I also think the > whole idea of switching operation vectors at runtime is fatally flawed > and we should never add such code, nevermind just for a fringe usecase > of a fringe feature. Being new to this area of the kernel I'm not clear on the history... It was my understanding that the per-file flag support was a requirement to removing the experimental designation from DAX. Is this still the case? Ira > > On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > Changes from V4: > > * Open code the aops lock rather than add it to the xfs_ilock() > > subsystem (Darrick's comments were obsoleted by this change) > > * Fix lkp build suggestions and bugs > > > > Changes from V3: > > * Remove global locking... :-D > > * put back per inode locking and remove pre-mature optimizations > > * Fix issues with Directories having IS_DAX() set > > * Fix kernel crash issues reported by Jeff > > * Add some clean up patches > > * Consolidate diflags to iflags functions > > * Update/add documentation > > * Reorder/rename patches quite a bit > > > > Changes from V2: > > > > * Move i_dax_sem to be a global percpu_rw_sem rather than per inode > > Internal discussions with Dan determined this would be easier, > > just as performant, and slightly less overhead that having it > > in the SB as suggested by Jan > > * Fix locking order in comments and throughout code > > * Change "mode" to "state" throughout commits > > * Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not > > configured > > * Add static branch for which is activated by a device which supports > > DAX in XFS > > * Change "lock/unlock" to up/down read/write as appropriate > > Previous names were over simplified > > * Update comments/documentation > > > > * Remove the xfs specific lock to the vfs (global) layer. > > * Fix i_dax_sem locking order and comments > > > > * Move 'i_mapped' count from struct inode to struct address_space and > > rename it to mmap_count > > * Add inode_has_mappings() call > > > > * Fix build issues > > * Clean up syntax spacing and minor issues > > * Update man page text for STATX_ATTR_DAX > > * Add reviewed-by's > > * Rebase to 5.6 > > > > Rename patch: > > from: fs/xfs: Add lock/unlock state to xfs > > to: fs/xfs: Add write DAX lock to xfs layer > > Add patch: > > fs/xfs: Clarify lockdep dependency for xfs_isilocked() > > Drop patch: > > fs/xfs: Fix truncate up > > > > > > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory > > consumption due to their inability to detect whether the kernel will > > instantiate page cache for a file, and cases where a global dax enable via a > > mount option is too coarse. > > > > The following patch series enables selecting the use of DAX on individual files > > and/or directories on xfs, and lays some groundwork to do so in ext4. In this > > scheme the dax mount option can be omitted to allow the per-file property to > > take effect. > > > > The insight at LSF/MM was to separate the per-mount or per-file "physical" > > capability switch from an "effective" attribute for the file. > > > > At LSF/MM we discussed the difficulties of switching the DAX state of a file > > with active mappings / page cache. It was thought the races could be avoided > > by limiting DAX state flips to 0-length files. > > > > However, this turns out to not be true.[3] This is because address space > > operations (a_ops) may be in use at any time the inode is referenced and users > > have expressed a desire to be able to change the DAX state on a file with data > > in it. For those reasons this patch set allows changing the DAX state flag on > > a file as long as it is not current mapped. > > > > Details of when and how DAX state can be changed on a file is included in a > > documentation patch. > > > > It should be noted that the physical DAX flag inheritance is not shown in this > > patch set as it was maintained from previous work on XFS. The physical DAX > > flag and it's inheritance will need to be added to other file systems for user > > control. > > > > As submitted this works on real hardware testing. > > > > > > [1] https://lwn.net/Articles/787973/ > > [2] https://lwn.net/Articles/787233/ > > [3] https://lkml.org/lkml/2019/10/20/96 > > [4] https://patchwork.kernel.org/patch/11310511/ > > > > > > To: linux-kernel@vger.kernel.org > > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > > Cc: "Darrick J. Wong" <darrick.wong@oracle.com> > > Cc: Dan Williams <dan.j.williams@intel.com> > > Cc: Dave Chinner <david@fromorbit.com> > > Cc: Christoph Hellwig <hch@lst.de> > > Cc: "Theodore Y. Ts'o" <tytso@mit.edu> > > Cc: Jan Kara <jack@suse.cz> > > Cc: linux-ext4@vger.kernel.org > > Cc: linux-xfs@vger.kernel.org > > Cc: linux-fsdevel@vger.kernel.org > > > > > > Ira Weiny (12): > > fs/xfs: Remove unnecessary initialization of i_rwsem > > fs: Remove unneeded IS_DAX() check > > fs/stat: Define DAX statx attribute > > fs/xfs: Isolate the physical DAX flag from enabled > > fs/xfs: Create function xfs_inode_enable_dax() > > fs: Add locking for a dynamic address space operations state > > fs: Prevent DAX state change if file is mmap'ed > > fs/xfs: Hold off aops users while changing DAX state > > fs/xfs: Clean up locking in dax invalidate > > fs/xfs: Allow toggle of effective DAX flag > > fs/xfs: Remove xfs_diflags_to_linux() > > Documentation/dax: Update Usage section > > > > Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++- > > Documentation/filesystems/vfs.rst | 16 +++++ > > fs/attr.c | 1 + > > fs/inode.c | 16 ++++- > > fs/iomap/buffered-io.c | 1 + > > fs/open.c | 4 ++ > > fs/stat.c | 5 ++ > > fs/xfs/xfs_icache.c | 5 +- > > fs/xfs/xfs_inode.h | 2 + > > fs/xfs/xfs_ioctl.c | 98 +++++++++++++++---------------- > > fs/xfs/xfs_iops.c | 69 +++++++++++++++------- > > include/linux/fs.h | 73 ++++++++++++++++++++++- > > include/uapi/linux/stat.h | 1 + > > mm/fadvise.c | 7 ++- > > mm/filemap.c | 4 ++ > > mm/huge_memory.c | 1 + > > mm/khugepaged.c | 2 + > > mm/mmap.c | 19 +++++- > > mm/util.c | 9 ++- > > 19 files changed, 328 insertions(+), 89 deletions(-) > > > > -- > > 2.21.0 > ---end quoted text---
On Mon, Mar 09, 2020 at 10:04:37AM -0700, Ira Weiny wrote: > On Thu, Mar 05, 2020 at 04:51:44PM +0100, Christoph Hellwig wrote: > > FYI, I still will fully NAK any series that adds additional locks > > and thus atomic instructions to basically every fs call, and grows > > the inode by a rw_semaphore plus and atomic64_t. I also think the > > whole idea of switching operation vectors at runtime is fatally flawed > > and we should never add such code, nevermind just for a fringe usecase > > of a fringe feature. > > Being new to this area of the kernel I'm not clear on the history... I /think/ the TLDR version in no particular order is: - Some people expressed interest in being able to control page cache vs. direct access on pmem hardware at a higher granularity than the entire fs. - Dave Chinner(?) added the per-inode flag intending it to be the sign that would flip on DAX. - Someone (I forget who) made it a mount option that would enable it for all files regardless of inode flags and whatnot. - Eric Sandeen(?) complained that the behavior of the dax inode flag was weird, particularly the part where you set the iflag and at some point if and when the inode gets reclaimed and then reconstituted then the change will finally take place. - Christoph Hellwig objected on various grounds (the kernel is responsible for selecting the most appropriate hardware abstraction for the usage pattern; a binary flag doesn't capture enough detail for potential future pmem hardware; and now additional locking overhead). - There's been (I hope) a long term understanding that the mount option will go away eventually, and not after we remove the EXPERIMENTAL tags. (FWIW I tend to agree with Eric and Christoph, but I also thought it would be useful at least to see what changeable file operations would be like; if there were other users who had already implemented it; and how much of an apetite there was for revoke().) Hopefully I summarized that more or less accurately... > It was my understanding that the per-file flag support was a requirement to > removing the experimental designation from DAX. Is this still the case? Nailing down the meaning of the per-file dax flag is/was the requirement, even if we kill it off entirely in the end. Given Christoph's veto threat, I suppose that leaves the following options? 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX status via statx. Document that changes to FS_XFLAG_DAX do not take effect immediately and that one must check statx to find out the real mode. If we choose this, I would also deprecate the dax mount option; send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on all files at mkfs time; and we can finally lay this whole thing to rest. This is the closest to what we have today. 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on usage patterns, hardware heuristics, or spiteful arbitrariness. Can we please pick (1) and just be done with this? I want to move on. --D There are still other things that need to be ironed out WRT pmem: a) reflink and page/pfn/whatever sharing -- fix the mm or (ab)use the xfs buffer cache, or something worse? b) getting our stories straight on how to clear poison, and whether or not we can come up with a better story for ZERO_FILE_RANGE on pmem. In the ideal world I'd love to see Z_F_R actually memset(0) the pmem and clear poison, at least if the file->pmem mappings were contiguous. c) wiring up xfs to hwpoison, or wiring up hwpoison to xfs, or otherwise figuring out how to get storage to tell xfs that it lost something so that maybe xfs can fix it quickly > Ira > > > > > On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > Changes from V4: > > > * Open code the aops lock rather than add it to the xfs_ilock() > > > subsystem (Darrick's comments were obsoleted by this change) > > > * Fix lkp build suggestions and bugs > > > > > > Changes from V3: > > > * Remove global locking... :-D > > > * put back per inode locking and remove pre-mature optimizations > > > * Fix issues with Directories having IS_DAX() set > > > * Fix kernel crash issues reported by Jeff > > > * Add some clean up patches > > > * Consolidate diflags to iflags functions > > > * Update/add documentation > > > * Reorder/rename patches quite a bit > > > > > > Changes from V2: > > > > > > * Move i_dax_sem to be a global percpu_rw_sem rather than per inode > > > Internal discussions with Dan determined this would be easier, > > > just as performant, and slightly less overhead that having it > > > in the SB as suggested by Jan > > > * Fix locking order in comments and throughout code > > > * Change "mode" to "state" throughout commits > > > * Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not > > > configured > > > * Add static branch for which is activated by a device which supports > > > DAX in XFS > > > * Change "lock/unlock" to up/down read/write as appropriate > > > Previous names were over simplified > > > * Update comments/documentation > > > > > > * Remove the xfs specific lock to the vfs (global) layer. > > > * Fix i_dax_sem locking order and comments > > > > > > * Move 'i_mapped' count from struct inode to struct address_space and > > > rename it to mmap_count > > > * Add inode_has_mappings() call > > > > > > * Fix build issues > > > * Clean up syntax spacing and minor issues > > > * Update man page text for STATX_ATTR_DAX > > > * Add reviewed-by's > > > * Rebase to 5.6 > > > > > > Rename patch: > > > from: fs/xfs: Add lock/unlock state to xfs > > > to: fs/xfs: Add write DAX lock to xfs layer > > > Add patch: > > > fs/xfs: Clarify lockdep dependency for xfs_isilocked() > > > Drop patch: > > > fs/xfs: Fix truncate up > > > > > > > > > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory > > > consumption due to their inability to detect whether the kernel will > > > instantiate page cache for a file, and cases where a global dax enable via a > > > mount option is too coarse. > > > > > > The following patch series enables selecting the use of DAX on individual files > > > and/or directories on xfs, and lays some groundwork to do so in ext4. In this > > > scheme the dax mount option can be omitted to allow the per-file property to > > > take effect. > > > > > > The insight at LSF/MM was to separate the per-mount or per-file "physical" > > > capability switch from an "effective" attribute for the file. > > > > > > At LSF/MM we discussed the difficulties of switching the DAX state of a file > > > with active mappings / page cache. It was thought the races could be avoided > > > by limiting DAX state flips to 0-length files. > > > > > > However, this turns out to not be true.[3] This is because address space > > > operations (a_ops) may be in use at any time the inode is referenced and users > > > have expressed a desire to be able to change the DAX state on a file with data > > > in it. For those reasons this patch set allows changing the DAX state flag on > > > a file as long as it is not current mapped. > > > > > > Details of when and how DAX state can be changed on a file is included in a > > > documentation patch. > > > > > > It should be noted that the physical DAX flag inheritance is not shown in this > > > patch set as it was maintained from previous work on XFS. The physical DAX > > > flag and it's inheritance will need to be added to other file systems for user > > > control. > > > > > > As submitted this works on real hardware testing. > > > > > > > > > [1] https://lwn.net/Articles/787973/ > > > [2] https://lwn.net/Articles/787233/ > > > [3] https://lkml.org/lkml/2019/10/20/96 > > > [4] https://patchwork.kernel.org/patch/11310511/ > > > > > > > > > To: linux-kernel@vger.kernel.org > > > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > > > Cc: "Darrick J. Wong" <darrick.wong@oracle.com> > > > Cc: Dan Williams <dan.j.williams@intel.com> > > > Cc: Dave Chinner <david@fromorbit.com> > > > Cc: Christoph Hellwig <hch@lst.de> > > > Cc: "Theodore Y. Ts'o" <tytso@mit.edu> > > > Cc: Jan Kara <jack@suse.cz> > > > Cc: linux-ext4@vger.kernel.org > > > Cc: linux-xfs@vger.kernel.org > > > Cc: linux-fsdevel@vger.kernel.org > > > > > > > > > Ira Weiny (12): > > > fs/xfs: Remove unnecessary initialization of i_rwsem > > > fs: Remove unneeded IS_DAX() check > > > fs/stat: Define DAX statx attribute > > > fs/xfs: Isolate the physical DAX flag from enabled > > > fs/xfs: Create function xfs_inode_enable_dax() > > > fs: Add locking for a dynamic address space operations state > > > fs: Prevent DAX state change if file is mmap'ed > > > fs/xfs: Hold off aops users while changing DAX state > > > fs/xfs: Clean up locking in dax invalidate > > > fs/xfs: Allow toggle of effective DAX flag > > > fs/xfs: Remove xfs_diflags_to_linux() > > > Documentation/dax: Update Usage section > > > > > > Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++- > > > Documentation/filesystems/vfs.rst | 16 +++++ > > > fs/attr.c | 1 + > > > fs/inode.c | 16 ++++- > > > fs/iomap/buffered-io.c | 1 + > > > fs/open.c | 4 ++ > > > fs/stat.c | 5 ++ > > > fs/xfs/xfs_icache.c | 5 +- > > > fs/xfs/xfs_inode.h | 2 + > > > fs/xfs/xfs_ioctl.c | 98 +++++++++++++++---------------- > > > fs/xfs/xfs_iops.c | 69 +++++++++++++++------- > > > include/linux/fs.h | 73 ++++++++++++++++++++++- > > > include/uapi/linux/stat.h | 1 + > > > mm/fadvise.c | 7 ++- > > > mm/filemap.c | 4 ++ > > > mm/huge_memory.c | 1 + > > > mm/khugepaged.c | 2 + > > > mm/mmap.c | 19 +++++- > > > mm/util.c | 9 ++- > > > 19 files changed, 328 insertions(+), 89 deletions(-) > > > > > > -- > > > 2.21.0 > > ---end quoted text---
On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote: > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX > status via statx. Document that changes to FS_XFLAG_DAX do not take > effect immediately and that one must check statx to find out the real > mode. If we choose this, I would also deprecate the dax mount option; > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on > all files at mkfs time; and we can finally lay this whole thing to rest. > This is the closest to what we have today. > > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on > usage patterns, hardware heuristics, or spiteful arbitrariness. 3) Only allow changing FS_XFLAG_DAX on directories or files that do not have blocks allocated to them yet, and side step all the hard problems. Which of course still side steps the hard question of what it actually is supposed to mean..
On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote: > There are still other things that need to be ironed out WRT pmem: > > a) reflink and page/pfn/whatever sharing -- fix the mm or (ab)use the > xfs buffer cache, or something worse? I don't think we need either. We just need to remove the DAX page association for hwpoison that requires the struct page to store the mapping and index. Get rid of that and we should be able to safely map the same page into different inode address spaces at the same time. When we write a shared page, we COW it immediately and replace the page in the inode's mapping tree, so we can't actually write to a shared page... IOWs, the dax_associate_page() related functionality probably needs to be a filesystem callout - part of the aops vector, I think, so that device dax can still use it. That way XFS can go it's own way, while ext4 and device dax can continue to use the existing mechanism mechanisn that is currently implemented.... XFS can then make use of rmapbt to find the owners on a bad page notification, and run the "kill userspace dead dead dead" lookup on each mapping/index tuple rather than pass it around on a struct page. i.e. we'll do a kill scan for each mapping/index owner tuple we find, not just one. That requires converting all the current vma killer code to pass mapping/index tuples around rather than the struct page. That kill code doesn't actually need the struct page, it just needs the mapping/index tuple to match to the vmas that have it mapped into userspace. > b) getting our stories straight on how to clear poison, and whether or > not we can come up with a better story for ZERO_FILE_RANGE on pmem. In > the ideal world I'd love to see Z_F_R actually memset(0) the pmem and > clear poison, at least if the file->pmem mappings were contiguous. Are you talking about ZFR from userspace through the filesystem (how do you clear poison in free space?) or ZFR on the dax device fro either userspace or the kernel filesystem code? > c) wiring up xfs to hwpoison, or wiring up hwpoison to xfs, or otherwise > figuring out how to get storage to tell xfs that it lost something so > that maybe xfs can fix it quickly Yup, I think that's a dax device callback into the filesystem. i.e the hwpoison gets delivered to the dax device, which then calls the fs function rather than do it's current "dax_lock_page(), kill userspace dead dead dead, dax_unlock_page()" dance. The filesystem can do a much more intricate dance and wreak far more havoc on userspace than what the dax device can do..... Copious amounts of unused time are things I don't have, unfortunately. Only having 7 fingers to type with right now doesn't help, either. Cheers, Dave.
On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote: > IOWs, the dax_associate_page() related functionality probably needs > to be a filesystem callout - part of the aops vector, I think, so > that device dax can still use it. That way XFS can go it's own way, > while ext4 and device dax can continue to use the existing mechanism > mechanisn that is currently implemented.... s/XFS/XFS with rmap/, as most XFS file systems currently don't have that enabled we'll also need to keep the legacy path around.
On Tue, Mar 10, 2020 at 11:30 PM Christoph Hellwig <hch@lst.de> wrote: > > On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote: > > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX > > status via statx. Document that changes to FS_XFLAG_DAX do not take > > effect immediately and that one must check statx to find out the real > > mode. If we choose this, I would also deprecate the dax mount option; > > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on > > all files at mkfs time; and we can finally lay this whole thing to rest. > > This is the closest to what we have today. > > > > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on > > usage patterns, hardware heuristics, or spiteful arbitrariness. > > 3) Only allow changing FS_XFLAG_DAX on directories or files that do > not have blocks allocated to them yet, and side step all the hard > problems. This sounds reasonable to me. As for deprecating the mount option, I think at a minimum it needs to continue be accepted as an option even if it is ignored to not break existing setups. We're currently going through the prolonged flag day of people discovering that if they update xfsprogs they need to specify "-m reflink=0" to mkfs.xfs. That pain seems to have only been a road bump not a showstopper based on the bug reports I've seen. If anything it has added helpful pressure towards getting reflink support bumped up in the priority. Hopefully the xfs position that the dax mount option can be ignored makes it possible to implement the same policy on ext4, and we can just move on... > Which of course still side steps the hard question of what it actually > is supposed to mean.. If we have statx to indicate the effective dax-state that addresses the pain for applications that want to account for dax in their page cache pressure estimates, and lets FS_XFLAG_DAX not need to specify precise semantics.
On Tue, Mar 10, 2020 at 11:44 PM Christoph Hellwig <hch@lst.de> wrote: > > On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote: > > IOWs, the dax_associate_page() related functionality probably needs > > to be a filesystem callout - part of the aops vector, I think, so > > that device dax can still use it. That way XFS can go it's own way, > > while ext4 and device dax can continue to use the existing mechanism > > mechanisn that is currently implemented.... > > s/XFS/XFS with rmap/, as most XFS file systems currently don't have > that enabled we'll also need to keep the legacy path around. Agree, it needs to be an opt-in capability as ext4 and xfs w/o rmap will stay on the legacy path for the foreseeable future.
On Wed, Mar 11, 2020 at 07:44:12AM +0100, Christoph Hellwig wrote: > On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote: > > IOWs, the dax_associate_page() related functionality probably needs > > to be a filesystem callout - part of the aops vector, I think, so > > that device dax can still use it. That way XFS can go it's own way, > > while ext4 and device dax can continue to use the existing mechanism > > mechanisn that is currently implemented.... > > s/XFS/XFS with rmap/, as most XFS file systems currently don't have > that enabled we'll also need to keep the legacy path around. Sure, that's trivially easy to handle in the XFS code once the callouts are in place. But, quite frankly, we can enforce rmap to be enabled enabled because nobody is using a reflink enabled FS w/ DAX right now. Everyone will have to mkfs their filesystems anyway to enable reflink+dax, so we simply don't allow reflink+dax to be enabled unless rmap is also enabled. Simple, easy, trivial. Cheers, Dave.
On Thu, Mar 12, 2020 at 11:49:32AM +1100, Dave Chinner wrote: > On Wed, Mar 11, 2020 at 07:44:12AM +0100, Christoph Hellwig wrote: > > On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote: > > > IOWs, the dax_associate_page() related functionality probably needs > > > to be a filesystem callout - part of the aops vector, I think, so > > > that device dax can still use it. That way XFS can go it's own way, > > > while ext4 and device dax can continue to use the existing mechanism > > > mechanisn that is currently implemented.... > > > > s/XFS/XFS with rmap/, as most XFS file systems currently don't have > > that enabled we'll also need to keep the legacy path around. > > Sure, that's trivially easy to handle in the XFS code once the > callouts are in place. > > But, quite frankly, we can enforce rmap to be enabled > enabled because nobody is using a reflink enabled FS w/ DAX right > now. Everyone will have to mkfs their filesystems anyway to enable > reflink+dax, so we simply don't allow reflink+dax to be enabled > unless rmap is also enabled. Simple, easy, trivial. Heh, this reminds me that I need to get that rmap performance analysis report out to the list... it does have a fairly substantial performance impact (in its current not-terribly-optimized state) but otoh enables self-repair. --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Thu, Mar 12, 2020 at 11:49:32AM +1100, Dave Chinner wrote: > > > IOWs, the dax_associate_page() related functionality probably needs > > > to be a filesystem callout - part of the aops vector, I think, so > > > that device dax can still use it. That way XFS can go it's own way, > > > while ext4 and device dax can continue to use the existing mechanism > > > mechanisn that is currently implemented.... > > > > s/XFS/XFS with rmap/, as most XFS file systems currently don't have > > that enabled we'll also need to keep the legacy path around. > > Sure, that's trivially easy to handle in the XFS code once the > callouts are in place. > > But, quite frankly, we can enforce rmap to be enabled > enabled because nobody is using a reflink enabled FS w/ DAX right > now. Everyone will have to mkfs their filesystems anyway to enable > reflink+dax, so we simply don't allow reflink+dax to be enabled > unless rmap is also enabled. Simple, easy, trivial. True, I think rmap will be required for DAX+reflink. But we still need the legacy infrastructure for error reporting.
On Wed 11-03-20 10:07:18, Dan Williams wrote: > On Tue, Mar 10, 2020 at 11:30 PM Christoph Hellwig <hch@lst.de> wrote: > > > > On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote: > > > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX > > > status via statx. Document that changes to FS_XFLAG_DAX do not take > > > effect immediately and that one must check statx to find out the real > > > mode. If we choose this, I would also deprecate the dax mount option; > > > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on > > > all files at mkfs time; and we can finally lay this whole thing to rest. > > > This is the closest to what we have today. > > > > > > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on > > > usage patterns, hardware heuristics, or spiteful arbitrariness. > > > > 3) Only allow changing FS_XFLAG_DAX on directories or files that do > > not have blocks allocated to them yet, and side step all the hard > > problems. > > This sounds reasonable to me. > > As for deprecating the mount option, I think at a minimum it needs to > continue be accepted as an option even if it is ignored to not break > existing setups. Agreed. But that's how we usually deprecate mount options. Also I'd say that statx() support for reporting DAX state and some education of programmers using DAX is required before we deprecate the mount option since currently applications check 'dax' mount option to determine how much memory they need to set aside for page cache before they consume everything else on the machine... Honza
On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote: > > This sounds reasonable to me. > > > > As for deprecating the mount option, I think at a minimum it needs to > > continue be accepted as an option even if it is ignored to not break > > existing setups. > > Agreed. But that's how we usually deprecate mount options. Also I'd say > that statx() support for reporting DAX state and some education of > programmers using DAX is required before we deprecate the mount option > since currently applications check 'dax' mount option to determine how much > memory they need to set aside for page cache before they consume everything > else on the machine... I don't even think we should deprecate it. It isn't painful to maintain and actually useful for testing. Instead we should expand it into a tristate: dax=off dax=flag dax=always where the existing "dax" option maps to "dax=always" and nodax maps to "dax=off". and dax=flag becomes the default for DAX capable devices.
On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote: > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote: > > > This sounds reasonable to me. > > > > > > As for deprecating the mount option, I think at a minimum it needs to > > > continue be accepted as an option even if it is ignored to not break > > > existing setups. > > > > Agreed. But that's how we usually deprecate mount options. Also I'd say > > that statx() support for reporting DAX state and some education of > > programmers using DAX is required before we deprecate the mount option > > since currently applications check 'dax' mount option to determine how much > > memory they need to set aside for page cache before they consume everything > > else on the machine... > > I don't even think we should deprecate it. It isn't painful to maintain > and actually useful for testing. Instead we should expand it into a > tristate: > > dax=off > dax=flag > dax=always > > where the existing "dax" option maps to "dax=always" and nodax maps > to "dax=off". and dax=flag becomes the default for DAX capable devices. That works for me. In summary: - Applications must call statx to discover the current S_DAX state. - There exists an advisory file inode flag FS_XFLAG_DAX that can be changed on files that have no blocks allocated to them. Changing this flag does not necessarily change the S_DAX state immediately but programs can query the S_DAX state via statx. If FS_XFLAG_DAX is set and the fs is on pmem then it will always enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will never enable S_DAX. Unless overridden... - There exists a dax= mount option. dax=off means "never set S_DAX, ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX" and is the default. "dax" by itself means "dax=always". "nodax" means "dax=off". - There exists an advisory directory inode flag FS_XFLAG_DAX that can be changed at any time. The flag state is copied into any files or subdirectories created within that directory. If programs require that file access runs in S_DAX mode, they'll have to create those files themselves inside a directory with FS_XFLAG_DAX set, or mount the fs with dax=always. Ok? Let's please get this part finished for 5.8, then we can get back to arguing about fs-rmap and reflink and dax and whatnot. --D
On Wed 01-04-20 04:00:21, Darrick J. Wong wrote: > On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote: > > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote: > > > > This sounds reasonable to me. > > > > > > > > As for deprecating the mount option, I think at a minimum it needs to > > > > continue be accepted as an option even if it is ignored to not break > > > > existing setups. > > > > > > Agreed. But that's how we usually deprecate mount options. Also I'd say > > > that statx() support for reporting DAX state and some education of > > > programmers using DAX is required before we deprecate the mount option > > > since currently applications check 'dax' mount option to determine how much > > > memory they need to set aside for page cache before they consume everything > > > else on the machine... > > > > I don't even think we should deprecate it. It isn't painful to maintain > > and actually useful for testing. Instead we should expand it into a > > tristate: > > > > dax=off > > dax=flag > > dax=always > > > > where the existing "dax" option maps to "dax=always" and nodax maps > > to "dax=off". and dax=flag becomes the default for DAX capable devices. > > That works for me. In summary: > > - Applications must call statx to discover the current S_DAX state. > > - There exists an advisory file inode flag FS_XFLAG_DAX that can be > changed on files that have no blocks allocated to them. Changing > this flag does not necessarily change the S_DAX state immediately > but programs can query the S_DAX state via statx. I generally like the proposal but I think the fact that toggling FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some confusion and ultimately bug reports. I'm thinking whether we could somehow improve this... For example an ioctl that would try to make set inode flags effective by evicting the inode (and returning EBUSY if the eviction is impossible for some reason)? Honza > > If FS_XFLAG_DAX is set and the fs is on pmem then it will always > enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will > never enable S_DAX. Unless overridden... > > - There exists a dax= mount option. dax=off means "never set S_DAX, > ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on > pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX" > and is the default. "dax" by itself means "dax=always". "nodax" > means "dax=off". > > - There exists an advisory directory inode flag FS_XFLAG_DAX that can > be changed at any time. The flag state is copied into any files or > subdirectories created within that directory. If programs require > that file access runs in S_DAX mode, they'll have to create those > files themselves inside a directory with FS_XFLAG_DAX set, or mount > the fs with dax=always. > > Ok? Let's please get this part finished for 5.8, then we can get back > to arguing about fs-rmap and reflink and dax and whatnot. > > --D
On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote: > > - Applications must call statx to discover the current S_DAX state. > > > > - There exists an advisory file inode flag FS_XFLAG_DAX that can be > > changed on files that have no blocks allocated to them. Changing > > this flag does not necessarily change the S_DAX state immediately > > but programs can query the S_DAX state via statx. > > I generally like the proposal but I think the fact that toggling > FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some > confusion and ultimately bug reports. I'm thinking whether we could somehow > improve this... For example an ioctl that would try to make set inode flags > effective by evicting the inode (and returning EBUSY if the eviction is > impossible for some reason)? I'd just return an error for that case, don't play silly games like evicting the inode.
On Thu, Apr 02, 2020 at 10:53:27AM +0200, Christoph Hellwig wrote: > On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote: > > > - Applications must call statx to discover the current S_DAX state. > > > > > > - There exists an advisory file inode flag FS_XFLAG_DAX that can be > > > changed on files that have no blocks allocated to them. Changing > > > this flag does not necessarily change the S_DAX state immediately > > > but programs can query the S_DAX state via statx. > > > > I generally like the proposal but I think the fact that toggling > > FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some > > confusion and ultimately bug reports. I'm thinking whether we could somehow > > improve this... For example an ioctl that would try to make set inode flags > > effective by evicting the inode (and returning EBUSY if the eviction is > > impossible for some reason)? > > I'd just return an error for that case, don't play silly games like > evicting the inode. I think I agree with Christoph here. But I want to clarify. I was heading in a direction of failing the ioctl completely. But we could have the flag change with an appropriate error which could let the user know the change has been delayed. But I don't immediately see what error code is appropriate for such an indication. Candidates I can envision: EAGAIN ERESTART EUSERS EINPROGRESS None are perfect but I'm leaning toward EINPROGRESS. Ira
On Wed, Apr 01, 2020 at 04:00:21AM +0000, Darrick J. Wong wrote: > On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote: > > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote: > > > > This sounds reasonable to me. > > > > > > > > As for deprecating the mount option, I think at a minimum it needs to > > > > continue be accepted as an option even if it is ignored to not break > > > > existing setups. > > > > > > Agreed. But that's how we usually deprecate mount options. Also I'd say > > > that statx() support for reporting DAX state and some education of > > > programmers using DAX is required before we deprecate the mount option > > > since currently applications check 'dax' mount option to determine how much > > > memory they need to set aside for page cache before they consume everything > > > else on the machine... > > > > I don't even think we should deprecate it. It isn't painful to maintain > > and actually useful for testing. Instead we should expand it into a > > tristate: > > > > dax=off > > dax=flag > > dax=always > > > > where the existing "dax" option maps to "dax=always" and nodax maps > > to "dax=off". and dax=flag becomes the default for DAX capable devices. > > That works for me. In summary: > > - Applications must call statx to discover the current S_DAX state. > > - There exists an advisory file inode flag FS_XFLAG_DAX that can be > changed on files that have no blocks allocated to them. Changing > this flag does not necessarily change the S_DAX state immediately > but programs can query the S_DAX state via statx. > > If FS_XFLAG_DAX is set and the fs is on pmem then it will always > enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will > never enable S_DAX. Unless overridden... > > - There exists a dax= mount option. dax=off means "never set S_DAX, > ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on > pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX" > and is the default. "dax" by itself means "dax=always". "nodax" > means "dax=off". > > - There exists an advisory directory inode flag FS_XFLAG_DAX that can > be changed at any time. The flag state is copied into any files or > subdirectories created within that directory. If programs require > that file access runs in S_DAX mode, they'll have to create those > files themselves inside a directory with FS_XFLAG_DAX set, or mount > the fs with dax=always. One other thing to add here. They _can_ set the FS_XFLAG_DAX on a file with data and force an eviction to get S_DAX to change. I think that is a nice reason to have a different error code returned. > > Ok? Let's please get this part finished for 5.8, then we can get back > to arguing about fs-rmap and reflink and dax and whatnot. I'm happy to see you motivated to get this in. I'm starting with a new xfstest to make sure we agree on the semantics prior to more patches. I hope to have the xfstest patch sent tomorrow sometime. Ira > > --D
On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote: > > I'd just return an error for that case, don't play silly games like > > evicting the inode. > > I think I agree with Christoph here. But I want to clarify. I was heading in > a direction of failing the ioctl completely. But we could have the flag change > with an appropriate error which could let the user know the change has been > delayed. > > But I don't immediately see what error code is appropriate for such an > indication. Candidates I can envision: > > EAGAIN > ERESTART > EUSERS > EINPROGRESS > > None are perfect but I'm leaning toward EINPROGRESS. I really, really dislike that idea. The whole point of not forcing evictions is to make it clear - no this inode is "busy" you can't do that. A reasonably smart application can try to evict itself. But returning an error and doing a lazy change anyway is straight from the playbook for arcane and confusing API designs.
On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote: > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote: > > > I'd just return an error for that case, don't play silly games like > > > evicting the inode. > > > > I think I agree with Christoph here. But I want to clarify. I was heading in > > a direction of failing the ioctl completely. But we could have the flag change > > with an appropriate error which could let the user know the change has been > > delayed. > > > > But I don't immediately see what error code is appropriate for such an > > indication. Candidates I can envision: > > > > EAGAIN > > ERESTART > > EUSERS > > EINPROGRESS > > > > None are perfect but I'm leaning toward EINPROGRESS. > > I really, really dislike that idea. The whole point of not forcing > evictions is to make it clear - no this inode is "busy" you can't > do that. A reasonably smart application can try to evict itself. I don't understand. What Darrick proposed would never need any evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can not be changed. So I don't see what good eviction would do at all. > > But returning an error and doing a lazy change anyway is straight from > the playbook for arcane and confusing API designs. Jan countered with a proposal that the FS_XFLAG_DAX does change with blocks allocated. But that S_DAX would change on eviction. Adding that some eviction ioctl could be added. You then proposed just returning an error for that case. (This lead me to believe that you were ok with an eviction based change of S_DAX.) So I agreed that changing S_DAX could be delayed until an explicit eviction. But, to aid the 'smart application', a different error code could be used to indicate that the FS_XFLAG_DAX had been changed but that until that explicit eviction occurs S_DAX would remain. So I don't fully follow what you mean by 'lazy change'? Do you still really, really dislike an explicit eviction method for changing the S_DAX flag? If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the user wants to change the mode of operations on their 'data'; they would have to create a new file with the proper setting and move the data there. For example copy the file into a directory marked FS_XFLAG_DAX==true? I'm ok with either interface as I think both could be clear if documented. Jan? Ira
On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote: > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote: > > > I'd just return an error for that case, don't play silly games like > > > evicting the inode. > > > > I think I agree with Christoph here. But I want to clarify. I was heading in > > a direction of failing the ioctl completely. But we could have the flag change > > with an appropriate error which could let the user know the change has been > > delayed. > > > > But I don't immediately see what error code is appropriate for such an > > indication. Candidates I can envision: > > > > EAGAIN > > ERESTART > > EUSERS > > EINPROGRESS > > > > None are perfect but I'm leaning toward EINPROGRESS. > > I really, really dislike that idea. The whole point of not forcing > evictions is to make it clear - no this inode is "busy" you can't > do that. A reasonably smart application can try to evict itself. > > But returning an error and doing a lazy change anyway is straight from > the playbook for arcane and confusing API designs. Agreed. That's why I wrote that applications can set FS_XFLAG_DAX and then query statx for STATX_ATTR_DAX to find out if it actually took effect, and that if applications require it immediately they can either create a file in a FS_XFLAG_DAX directory, or the admin can mount with dax=always. No magic return values required or desired anywhere. I don't know what "try to evict the inode" magic means, but I'm fairly sure I don't want to. ;) --D
On Fri 03-04-20 08:48:29, Ira Weiny wrote: > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote: > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote: > > > > I'd just return an error for that case, don't play silly games like > > > > evicting the inode. > > > > > > I think I agree with Christoph here. But I want to clarify. I was heading in > > > a direction of failing the ioctl completely. But we could have the flag change > > > with an appropriate error which could let the user know the change has been > > > delayed. > > > > > > But I don't immediately see what error code is appropriate for such an > > > indication. Candidates I can envision: > > > > > > EAGAIN > > > ERESTART > > > EUSERS > > > EINPROGRESS > > > > > > None are perfect but I'm leaning toward EINPROGRESS. > > > > I really, really dislike that idea. The whole point of not forcing > > evictions is to make it clear - no this inode is "busy" you can't > > do that. A reasonably smart application can try to evict itself. > > I don't understand. What Darrick proposed would never need any > evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can > not be changed. So I don't see what good eviction would do at all. I guess there's some confusion here (may well be than on my side). Darrick propose that we can switch FS_XFLAG_DAX only when file has no blocks allocated - fine by me. But that still does not mean than we can switch S_DAX immediately, does it? Because that would still mean we need to switch aops on living inode and that's ... difficult and Christoph didn't want to clutter the code with it. So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag, S_DAX flag will magically switch when inode gets evicted and the inode gets reloaded from the disk again. Did I misunderstand anything? And my thinking was that this is surprising behavior for the user and so it will likely generate lots of bug reports along the lines of "DAX inode flag does not work!". So I was pondering how to make the behavior less confusing... The ioctl I've suggested was just a poor attempt at that. > > But returning an error and doing a lazy change anyway is straight from > > the playbook for arcane and confusing API designs. > > Jan countered with a proposal that the FS_XFLAG_DAX does change with > blocks allocated. But that S_DAX would change on eviction. Adding that > some eviction ioctl could be added. No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I was still speaking about the case without blocks allocated. > You then proposed just returning an error for that case. (This lead me to > believe that you were ok with an eviction based change of S_DAX.) > > So I agreed that changing S_DAX could be delayed until an explicit eviction. > But, to aid the 'smart application', a different error code could be used to > indicate that the FS_XFLAG_DAX had been changed but that until that explicit > eviction occurs S_DAX would remain. > > So I don't fully follow what you mean by 'lazy change'? > > Do you still really, really dislike an explicit eviction method for changing > the S_DAX flag? > > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the > user wants to change the mode of operations on their 'data'; they would have to > create a new file with the proper setting and move the data there. For example > copy the file into a directory marked FS_XFLAG_DAX==true? > > I'm ok with either interface as I think both could be clear if documented. I agree that what Darrick suggested is technically easily doable and can be documented. But it is not natural behavior (i.e., different than all inode flags we have) and we know how careful people are when reading documentation... Honza
On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote: > On Fri 03-04-20 08:48:29, Ira Weiny wrote: > > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote: > > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote: > > > > > I'd just return an error for that case, don't play silly games like > > > > > evicting the inode. > > > > > > > > I think I agree with Christoph here. But I want to clarify. I was heading in > > > > a direction of failing the ioctl completely. But we could have the flag change > > > > with an appropriate error which could let the user know the change has been > > > > delayed. > > > > > > > > But I don't immediately see what error code is appropriate for such an > > > > indication. Candidates I can envision: > > > > > > > > EAGAIN > > > > ERESTART > > > > EUSERS > > > > EINPROGRESS > > > > > > > > None are perfect but I'm leaning toward EINPROGRESS. > > > > > > I really, really dislike that idea. The whole point of not forcing > > > evictions is to make it clear - no this inode is "busy" you can't > > > do that. A reasonably smart application can try to evict itself. > > > > I don't understand. What Darrick proposed would never need any > > evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can > > not be changed. So I don't see what good eviction would do at all. > > I guess there's some confusion here (may well be than on my side). Darrick > propose that we can switch FS_XFLAG_DAX only when file has no blocks > allocated - fine by me. But that still does not mean than we can switch > S_DAX immediately, does it? Because that would still mean we need to switch > aops on living inode and that's ... difficult and Christoph didn't want to > clutter the code with it. > > So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag, > S_DAX flag will magically switch when inode gets evicted and the inode gets > reloaded from the disk again. Did I misunderstand anything? > > And my thinking was that this is surprising behavior for the user and so it > will likely generate lots of bug reports along the lines of "DAX inode flag > does not work!". So I was pondering how to make the behavior less > confusing... The ioctl I've suggested was just a poor attempt at that. Ok but then I don't understand Christophs comment to "just return an error for that case"? Which case? > > > > But returning an error and doing a lazy change anyway is straight from > > > the playbook for arcane and confusing API designs. > > > > Jan countered with a proposal that the FS_XFLAG_DAX does change with > > blocks allocated. But that S_DAX would change on eviction. Adding that > > some eviction ioctl could be added. > > No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I > was still speaking about the case without blocks allocated. Ah ok good point. But again what 'error' do we return when FS_XFLAG_DAX changed but S_DAX did not? > > > You then proposed just returning an error for that case. (This lead me to > > believe that you were ok with an eviction based change of S_DAX.) > > > > So I agreed that changing S_DAX could be delayed until an explicit eviction. > > But, to aid the 'smart application', a different error code could be used to > > indicate that the FS_XFLAG_DAX had been changed but that until that explicit > > eviction occurs S_DAX would remain. > > > > So I don't fully follow what you mean by 'lazy change'? > > > > Do you still really, really dislike an explicit eviction method for changing > > the S_DAX flag? > > > > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the > > user wants to change the mode of operations on their 'data'; they would have to > > create a new file with the proper setting and move the data there. For example > > copy the file into a directory marked FS_XFLAG_DAX==true? > > > > I'm ok with either interface as I think both could be clear if documented. > > I agree that what Darrick suggested is technically easily doable and can be > documented. But it is not natural behavior (i.e., different than all inode > flags we have) and we know how careful people are when reading > documentation... > Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_ _all_... In summary: - Applications must call statx to discover the current S_DAX state. - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on the parent directory FS_XFLAG_DAX inode flag. (There is no way to change this flag after file creation.) If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX. Unless overridden... - There exists a dax= mount option. "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX" "-o nodax" means "dax=off" "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX" "-o dax" by itself means "dax=always" "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default - There exists an advisory directory inode flag FS_XFLAG_DAX that can be changed at any time. The flag state is copied into any files or subdirectories when they are created within that directory. If programs require file access runs in S_DAX mode, they'll have to create those files inside a directory with FS_XFLAG_DAX set, or mount the fs with an appropriate dax mount option. ??? Ira
On Fri, Apr 03, 2020 at 11:18:43AM -0700, 'Ira Weiny' wrote: [snip] > > Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_ > _all_... To be fair, I believe Christoph advocated for this at one time or another... Ira > > In summary: > > - Applications must call statx to discover the current S_DAX state. > > - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on > the parent directory FS_XFLAG_DAX inode flag. (There is no way to change > this flag after file creation.) > > If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at > inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX. > Unless overridden... > > - There exists a dax= mount option. > > "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX" > "-o nodax" means "dax=off" > "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX" > "-o dax" by itself means "dax=always" > "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default > > - There exists an advisory directory inode flag FS_XFLAG_DAX that can be > changed at any time. The flag state is copied into any files or > subdirectories when they are created within that directory. If programs > require file access runs in S_DAX mode, they'll have to create those files > inside a directory with FS_XFLAG_DAX set, or mount the fs with an > appropriate dax mount option. > > > ??? > > Ira >
On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote: > On Fri 03-04-20 08:48:29, Ira Weiny wrote: > > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote: > > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote: > > > > > I'd just return an error for that case, don't play silly games like > > > > > evicting the inode. > > > > > > > > I think I agree with Christoph here. But I want to clarify. I was heading in > > > > a direction of failing the ioctl completely. But we could have the flag change > > > > with an appropriate error which could let the user know the change has been > > > > delayed. > > > > > > > > But I don't immediately see what error code is appropriate for such an > > > > indication. Candidates I can envision: > > > > > > > > EAGAIN > > > > ERESTART > > > > EUSERS > > > > EINPROGRESS > > > > > > > > None are perfect but I'm leaning toward EINPROGRESS. > > > > > > I really, really dislike that idea. The whole point of not forcing > > > evictions is to make it clear - no this inode is "busy" you can't > > > do that. A reasonably smart application can try to evict itself. > > > > I don't understand. What Darrick proposed would never need any > > evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can > > not be changed. So I don't see what good eviction would do at all. > > I guess there's some confusion here (may well be than on my side). Darrick > propose that we can switch FS_XFLAG_DAX only when file has no blocks > allocated - fine by me. But that still does not mean than we can switch > S_DAX immediately, does it? Because that would still mean we need to switch > aops on living inode and that's ... difficult and Christoph didn't want to > clutter the code with it. IIRC, the reason Ira was trying to introduce this file operations lock is because there isn't any other safe way to change the operations dynamically, because some of those operations don't start taking any locks at all until after we've accessed the file operations pointer. Now that I think about it some more, that's also means we can't change S_DAX even on files without blocks allocated. Look at fallocate, it doesn't even take i_rwsem until we've called into (say) xfs_file_fallocate. Ok, so with that in mind, I think we simply have to say that FS_XFLAG_DAX is 100% advisory, and S_DAX will magically switch some time in the future or after the next umount/mount cycle. FSSETXATTR can't evict an inode it has just set FS_XFLAG_DAX on, because it applies that change to the fd passed into ioctl(), which means that the caller has a reference to the fd -> file -> inode, which means the inode is unevictable until after the call completes. IOWs, > So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag, > S_DAX flag will magically switch when inode gets evicted and the inode gets > reloaded from the disk again. Did I misunderstand anything? That was /my/ understanding. :) > And my thinking was that this is surprising behavior for the user and so it > will likely generate lots of bug reports along the lines of "DAX inode flag > does not work!". So I was pondering how to make the behavior less > confusing... The ioctl I've suggested was just a poor attempt at that. At best, userspace uses FSSETXATTR to change FS_XFLAG_DAX, and then calls some as-yet-undefined ioctl to try to evict the inode from memory. I'm not sure how you'd actually do that, though, considering that you'd have to close the fd and as soon as that happens the inode can disappear permanently. Ok, that's not sane. Forget I ever wrote that. > > > But returning an error and doing a lazy change anyway is straight from > > > the playbook for arcane and confusing API designs. > > > > Jan countered with a proposal that the FS_XFLAG_DAX does change with > > blocks allocated. But that S_DAX would change on eviction. Adding that > > some eviction ioctl could be added. > > No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I > was still speaking about the case without blocks allocated. > > > You then proposed just returning an error for that case. (This lead me to > > believe that you were ok with an eviction based change of S_DAX.) > > > > So I agreed that changing S_DAX could be delayed until an explicit eviction. > > But, to aid the 'smart application', a different error code could be used to > > indicate that the FS_XFLAG_DAX had been changed but that until that explicit > > eviction occurs S_DAX would remain. There's no point in returning a magic error code from FSSETXATTR, as I realized above. To restate: FSSETXATTR can't evict the inode, so it would always have to return the magic error code. The best we can do is tell userspace that they can set the advisory FS_XFLAG_DAX flag and then check STATX_ATTR_DAX immediately afterwards. If some day in the future we get smarter and can change it immediately, the statx output will reflect that. > > So I don't fully follow what you mean by 'lazy change'? > > > > Do you still really, really dislike an explicit eviction method for changing > > the S_DAX flag? > > > > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the > > user wants to change the mode of operations on their 'data'; they would have to > > create a new file with the proper setting and move the data there. For example > > copy the file into a directory marked FS_XFLAG_DAX==true? > > > > I'm ok with either interface as I think both could be clear if documented. > > I agree that what Darrick suggested is technically easily doable and can be > documented. But it is not natural behavior (i.e., different than all inode > flags we have) and we know how careful people are when reading > documentation... To reflect all that I've rambled in this thread, I withdraw the previous paragraph and submit this one for consideration: - There exists an advisory file inode flag FS_XFLAG_DAX. If FS_XFLAG_DAX is set and the fs is on pmem then it will always enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will never enable S_DAX. The advice can be overridden by mount option. Changing this flag does not necessarily change the S_DAX state immediately but programs can query the S_DAX state via statx to detect when the new advice has gone into effect. Consider this from the perspective of minimizing changes to userspace programs between now and the future. If your program really wants S_DAX, you can write: fd = open(...); ioctl(fd, FSGETXATTR, &fsx); if (fsx.xflags & FS_XFLAG_DAX) return 0; fsx.xflags |= FS_XFLAG_DAX; ioctl(fd, FSSETXATTR, &fsx); do { statx(fd, STATX_GIVE_ME_EVERYTHING, &statx); if (statx.attrs & STATX_ATTR_DAX) return 0; /* do stupid magic to evict things */ } while (we haven't gotten bored and wandered away); This code snippet will work with the current limitations of the kernel, and it'll continue working even on Linux v5000 where we finally figure out how to change S_DAX on the fly. --D > > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Fri, Apr 03, 2020 at 11:18:43AM -0700, Ira Weiny wrote: > On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote: > > On Fri 03-04-20 08:48:29, Ira Weiny wrote: > > > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote: > > > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote: > > > > > > I'd just return an error for that case, don't play silly games like > > > > > > evicting the inode. > > > > > > > > > > I think I agree with Christoph here. But I want to clarify. I was heading in > > > > > a direction of failing the ioctl completely. But we could have the flag change > > > > > with an appropriate error which could let the user know the change has been > > > > > delayed. > > > > > > > > > > But I don't immediately see what error code is appropriate for such an > > > > > indication. Candidates I can envision: > > > > > > > > > > EAGAIN > > > > > ERESTART > > > > > EUSERS > > > > > EINPROGRESS > > > > > > > > > > None are perfect but I'm leaning toward EINPROGRESS. > > > > > > > > I really, really dislike that idea. The whole point of not forcing > > > > evictions is to make it clear - no this inode is "busy" you can't > > > > do that. A reasonably smart application can try to evict itself. > > > > > > I don't understand. What Darrick proposed would never need any > > > evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can > > > not be changed. So I don't see what good eviction would do at all. > > > > I guess there's some confusion here (may well be than on my side). Darrick > > propose that we can switch FS_XFLAG_DAX only when file has no blocks > > allocated - fine by me. But that still does not mean than we can switch > > S_DAX immediately, does it? Because that would still mean we need to switch > > aops on living inode and that's ... difficult and Christoph didn't want to > > clutter the code with it. > > > > So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag, > > S_DAX flag will magically switch when inode gets evicted and the inode gets > > reloaded from the disk again. Did I misunderstand anything? > > > > And my thinking was that this is surprising behavior for the user and so it > > will likely generate lots of bug reports along the lines of "DAX inode flag > > does not work!". So I was pondering how to make the behavior less > > confusing... The ioctl I've suggested was just a poor attempt at that. > > Ok but then I don't understand Christophs comment to "just return an error for > that case"? Which case? > > > > > > > But returning an error and doing a lazy change anyway is straight from > > > > the playbook for arcane and confusing API designs. > > > > > > Jan countered with a proposal that the FS_XFLAG_DAX does change with > > > blocks allocated. But that S_DAX would change on eviction. Adding that > > > some eviction ioctl could be added. > > > > No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I > > was still speaking about the case without blocks allocated. > > Ah ok good point. But again what 'error' do we return when FS_XFLAG_DAX > changed but S_DAX did not? > > > > > > You then proposed just returning an error for that case. (This lead me to > > > believe that you were ok with an eviction based change of S_DAX.) > > > > > > So I agreed that changing S_DAX could be delayed until an explicit eviction. > > > But, to aid the 'smart application', a different error code could be used to > > > indicate that the FS_XFLAG_DAX had been changed but that until that explicit > > > eviction occurs S_DAX would remain. > > > > > > So I don't fully follow what you mean by 'lazy change'? > > > > > > Do you still really, really dislike an explicit eviction method for changing > > > the S_DAX flag? > > > > > > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the > > > user wants to change the mode of operations on their 'data'; they would have to > > > create a new file with the proper setting and move the data there. For example > > > copy the file into a directory marked FS_XFLAG_DAX==true? > > > > > > I'm ok with either interface as I think both could be clear if documented. > > > > I agree that what Darrick suggested is technically easily doable and can be > > documented. But it is not natural behavior (i.e., different than all inode > > flags we have) and we know how careful people are when reading > > documentation... > > > > Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_ > _all_... > > In summary: > > - Applications must call statx to discover the current S_DAX state. Ok. > - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on > the parent directory FS_XFLAG_DAX inode flag. (There is no way to change > this flag after file creation.) > > If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at > inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX. > Unless overridden... Ok, fine with me. :) > - There exists a dax= mount option. > > "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX" > "-o nodax" means "dax=off" I surveyed the three fses that support dax and found that none of the filesystems actually have a 'nodax' flag. Now would be the time not to add such a thing, and make people specify dax=off instead. It would be handy if we could have a single fsparam_enum for figuring out the dax mount options. > "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX" > "-o dax" by itself means "dax=always" > "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default > > - There exists an advisory directory inode flag FS_XFLAG_DAX that can be > changed at any time. The flag state is copied into any files or > subdirectories when they are created within that directory. If programs > require file access runs in S_DAX mode, they'll have to create those files "...they must create..." > inside a directory with FS_XFLAG_DAX set, or mount the fs with an > appropriate dax mount option. Otherwise seems ok to me. --D > > > ??? > > Ira >
> > > > In summary: > > > > - Applications must call statx to discover the current S_DAX state. > > Ok. > > > - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on > > the parent directory FS_XFLAG_DAX inode flag. (There is no way to change > > this flag after file creation.) > > > > If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at > > inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX. > > Unless overridden... > > Ok, fine with me. :) :-D > > > - There exists a dax= mount option. > > > > "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX" > > "-o nodax" means "dax=off" > > I surveyed the three fses that support dax and found that none of the > filesystems actually have a 'nodax' flag. Now would be the time not to > add such a thing, and make people specify dax=off instead. It would > be handy if we could have a single fsparam_enum for figuring out the dax > mount options. yes good point. I'm working on updating the documentation patch and I think this might also be better as: -o dax=never Which is the opposite of 'always'. > > > "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX" > > "-o dax" by itself means "dax=always" > > "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default > > > > - There exists an advisory directory inode flag FS_XFLAG_DAX that can be > > changed at any time. The flag state is copied into any files or > > subdirectories when they are created within that directory. If programs > > require file access runs in S_DAX mode, they'll have to create those files > > "...they must create..." yes > > > inside a directory with FS_XFLAG_DAX set, or mount the fs with an > > appropriate dax mount option. > > Otherwise seems ok to me. Thanks! Ira
On Fri 03-04-20 11:18:43, Ira Weiny wrote: > Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_ > _all_... > > In summary: > > - Applications must call statx to discover the current S_DAX state. > > - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on > the parent directory FS_XFLAG_DAX inode flag. (There is no way to change > this flag after file creation.) > > If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at > inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX. > Unless overridden... OK, after considering all the options we were hashing out here, I think this is the best API. There isn't the confusing "S_DAX will magically switch on inode eviction" and although the functionality is limited, I think 90% of users would end up using the functionality like this anyway. > - There exists a dax= mount option. > > "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX" > "-o nodax" means "dax=off" > "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX" > "-o dax" by itself means "dax=always" > "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default > > - There exists an advisory directory inode flag FS_XFLAG_DAX that can be > changed at any time. The flag state is copied into any files or > subdirectories when they are created within that directory. If programs > require file access runs in S_DAX mode, they'll have to create those files > inside a directory with FS_XFLAG_DAX set, or mount the fs with an > appropriate dax mount option. Honza