Message ID: 20200221004134.30599-8-ira.weiny@intel.com (mailing list archive)
State: Superseded, archived
Series: Enable per-file/per-directory DAX operations V4
On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> DAX requires special address space operations (aops). Changing DAX state therefore requires changing those aops.
>
> However, many functions require aops to remain consistent through a deep call stack.
>
> Define a vfs level inode rwsem to protect aops throughout call stacks which require them.
>
> Finally, define calls to be used in subsequent patches when aops usage needs to be quiesced by the file system.

I am very much against this. There is a reason why we don't support changes of ops vectors at runtime anywhere else, because it is horribly complicated and impossible to get right. IFF we ever want to change the DAX vs non-DAX mode (which I'm still not sold on) the right way is to just add a few simple conditionals and merge the aops, which is much easier to reason about, and less costly in overall overhead.
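For illustration, a minimal sketch of what "merging the aops" could look like: a single aops vector whose methods branch on IS_DAX() at call time instead of the inode swapping between two vectors. The two helpers are hypothetical stand-ins, not functions from this series:

	/*
	 * Sketch only: one merged aops method that picks the DAX or
	 * page-cache path per call; fs_dax_writepages() and
	 * fs_buffered_writepages() are hypothetical.
	 */
	static int merged_writepages(struct address_space *mapping,
				     struct writeback_control *wbc)
	{
		if (IS_DAX(mapping->host))
			return fs_dax_writepages(mapping, wbc);
		return fs_buffered_writepages(mapping, wbc);
	}

The conditional itself is trivial; as the replies below argue, the open question is what keeps IS_DAX() stable across two such calls that depend on each other.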
On Fri, Feb 21, 2020 at 06:44:49PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > DAX requires special address space operations (aops). Changing DAX state therefore requires changing those aops.
> >
> > However, many functions require aops to remain consistent through a deep call stack.
> >
> > Define a vfs level inode rwsem to protect aops throughout call stacks which require them.
> >
> > Finally, define calls to be used in subsequent patches when aops usage needs to be quiesced by the file system.
>
> I am very much against this. There is a reason why we don't support changes of ops vectors at runtime anywhere else, because it is horribly complicated and impossible to get right. IFF we ever want to change the DAX vs non-DAX mode (which I'm still not sold on) the right way is to just add a few simple conditionals and merge the aops, which is much easier to reason about, and less costly in overall overhead.

*cough*

That's exactly what the original code did. And it was broken because page faults call multiple aops that are dependent on the result of the previous aop calls setting up the state correctly for the latter calls. And when S_DAX changes between those calls, shit breaks.

It's exactly the same problem as switching aops between two dependent aops method calls - we don't solve anything by merging aops and checking IS_DAX in each method because the race condition is still there.

/me throws his hands in the air and walks away

-Dave.
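The race Dave describes, spelled out as a schematic (not actual kernel code; the prepare/finish helpers are hypothetical): two dependent S_DAX checks within a single fault can disagree if the flag flips between them.

	/*
	 * Schematic of the race: check #2 can observe a different S_DAX
	 * value than check #1, so the completion step runs against state
	 * set up for the other mode.
	 */
	static vm_fault_t example_fault(struct vm_fault *vmf)
	{
		struct inode *inode = file_inode(vmf->vma->vm_file);

		if (IS_DAX(inode))			/* check #1 */
			prepare_dax_fault(vmf);		/* hypothetical */
		else
			prepare_page_cache_fault(vmf);	/* hypothetical */

		/* ...S_DAX may be switched here by another task... */

		if (IS_DAX(inode))			/* check #2: may disagree */
			return finish_dax_fault(vmf);	/* hypothetical */
		return finish_page_cache_fault(vmf);	/* hypothetical */
	}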
On Fri, Feb 21, 2020 at 2:44 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Feb 21, 2020 at 06:44:49PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >
> > > DAX requires special address space operations (aops). Changing DAX state therefore requires changing those aops.
> > >
> > > However, many functions require aops to remain consistent through a deep call stack.
> > >
> > > Define a vfs level inode rwsem to protect aops throughout call stacks which require them.
> > >
> > > Finally, define calls to be used in subsequent patches when aops usage needs to be quiesced by the file system.
> >
> > I am very much against this. There is a reason why we don't support changes of ops vectors at runtime anywhere else, because it is horribly complicated and impossible to get right. IFF we ever want to change the DAX vs non-DAX mode (which I'm still not sold on) the right way is to just add a few simple conditionals and merge the aops, which is much easier to reason about, and less costly in overall overhead.
>
> *cough*
>
> That's exactly what the original code did. And it was broken because page faults call multiple aops that are dependent on the result of the previous aop calls setting up the state correctly for the latter calls. And when S_DAX changes between those calls, shit breaks.
>
> It's exactly the same problem as switching aops between two dependent aops method calls - we don't solve anything by merging aops and checking IS_DAX in each method because the race condition is still there.
>
> /me throws his hands in the air and walks away

Please come back, because I think it's also clear that the "we don't support changes of ops vectors at runtime" assertion is already being violated by ext4 [1].

So that leaves "IFF we ever want to change the dax vs non-dax mode", which I thought was already consensus given the lingering hopes about having some future where the kernel is able to dynamically optimize for dax vs non-dax based on memory media performance characteristics. I thought the only thing missing from the conclusion of the last conversation [2] was the "physical" vs "effective" split that we identified at LSF'19 [3].

Christoph, that split allows for your concern about application intent to be considered / overridden by kernel policy, and it allows for communication of the effective state which applications need for resource planning [4] independent of MAP_SYNC and other dax semantics.

The status quo of globally enabling dax for all files is empirically the wrong choice for page-cache friendly workloads running on slower-than-DRAM pmem media. I am struggling to see how we address the above items without first having a dynamic / less-than-global-filesystem-scope facility to control dax.

[1]: https://lore.kernel.org/linux-fsdevel/20191108131238.GK20863@quack2.suse.cz
[2]: https://lore.kernel.org/linux-fsdevel/20170927064001.GA27601@infradead.org
[3]: https://lwn.net/Articles/787973/
[4]: https://lwn.net/Articles/787233/
On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > DAX requires special address space operations (aops). Changing DAX > state therefore requires changing those aops. > > However, many functions require aops to remain consistent through a deep > call stack. > > Define a vfs level inode rwsem to protect aops throughout call stacks > which require them. > > Finally, define calls to be used in subsequent patches when aops usage > needs to be quiesced by the file system. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > --- > Changes from V3: > Convert from global to a per-inode rwsem > Remove pre-optimization > Remove static branch stuff > Change function names from inode_dax_state_* to inode_aops_* > I kept 'inode' as the synchronization is at the inode > level now (probably where it belongs)... > > and I still prefer *_[down|up]_[read|write] as those > names better reflect the use and interaction between > users (readers) and writers better. 'XX_start_aop()' > would have to be matched with something like > 'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or > something which does not make sense on the 'writer' > side. > > Changes from V2 > > Rebase on linux-next-08-02-2020 > > Fix locking order > Change all references from mode to state where appropriate > add CONFIG_FS_DAX requirement for state change > Use a static branch to enable locking only when a dax capable > device has been seen. > > Move the lock to a global vfs lock > > this does a few things > 1) preps us better for ext4 support > 2) removes funky callbacks from inode ops > 3) remove complexity from XFS and probably from > ext4 later > > We can do this because > 1) the locking order is required to be at the > highest level anyway, so why complicate xfs > 2) We had to move the sem to the super_block > because it is too heavy for the inode. > 3) After internal discussions with Dan we > decided that this would be easier, just as > performant, and with slightly less overhead > than in the VFS SB. > > We also change the functions names to up/down; > read/write as appropriate. Previous names were over > simplified. > > Update comments and documentation > > squash: add locking > --- > Documentation/filesystems/vfs.rst | 16 ++++++++ > fs/attr.c | 1 + > fs/inode.c | 15 +++++-- > fs/iomap/buffered-io.c | 1 + > fs/open.c | 4 ++ > fs/stat.c | 2 + > fs/xfs/xfs_icache.c | 1 + > include/linux/fs.h | 66 ++++++++++++++++++++++++++++++- > mm/fadvise.c | 7 +++- > mm/filemap.c | 4 ++ > mm/huge_memory.c | 1 + > mm/khugepaged.c | 2 + > mm/util.c | 9 ++++- > 13 files changed, 121 insertions(+), 8 deletions(-) > > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst > index 7d4d09dd5e6d..4a10a232f8e2 100644 > --- a/Documentation/filesystems/vfs.rst > +++ b/Documentation/filesystems/vfs.rst > @@ -934,6 +934,22 @@ cache in your filesystem. The following members are defined: > Called during swapoff on files where swap_activate was > successful. > > +Changing DAX 'state' dynamically > +---------------------------------- > + > +Some file systems which support DAX want to be able to change the DAX state > +dyanically. To switch the state safely we lock the inode state in all "normal" > +file system operations and restrict state changes to those operations. The > +specific rules are. > + > + 1) the direct_IO address_space_operation must be supported in all > + potential a_ops vectors for any state suported by the inode. 
> + > + 3) DAX state changes shall not be allowed while the file is mmap'ed > + 4) For non-mmaped operations the VFS layer must take the read lock for any > + use of IS_DAX() > + 5) Filesystems take the write lock when changing DAX states. > + > > The File Object > =============== > diff --git a/fs/attr.c b/fs/attr.c > index b4bbdbd4c8ca..9b15f73d1079 100644 > --- a/fs/attr.c > +++ b/fs/attr.c > @@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de > if (error) > return error; > > + /* DAX read state should already be held here */ > if (inode->i_op->setattr) > error = inode->i_op->setattr(dentry, attr); > else > diff --git a/fs/inode.c b/fs/inode.c > index 7d57068b6b7a..6e4f1cc872f2 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) > #endif > inode->i_flctx = NULL; > this_cpu_inc(nr_inodes); > + init_rwsem(&inode->i_aops_sem); > > return 0; > out: > @@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput); > */ > int bmap(struct inode *inode, sector_t *block) > { > - if (!inode->i_mapping->a_ops->bmap) > - return -EINVAL; > + int ret = 0; > + > + inode_aops_down_read(inode); > + if (!inode->i_mapping->a_ops->bmap) { > + ret = -EINVAL; > + goto err; > + } > > *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > - return 0; > + > +err: > + inode_aops_up_read(inode); > + return ret; > } > EXPORT_SYMBOL(bmap); > #endif > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 7c84c4c027c4..e313a34d5fa6 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count, > offset = offset_in_page(pos); > bytes = min_t(loff_t, PAGE_SIZE - offset, count); > > + /* DAX state read should already be held here */ > if (IS_DAX(inode)) > status = iomap_dax_zero(pos, offset, bytes, iomap); > else > diff --git a/fs/open.c b/fs/open.c > index 0788b3715731..3abf0bfac462 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs, > if (ret) > newattrs.ia_valid |= ret | ATTR_FORCE; > > + inode_aops_down_read(dentry->d_inode); > inode_lock(dentry->d_inode); > /* Note any delegations or leases have already been broken: */ > ret = notify_change(dentry, &newattrs, NULL); > inode_unlock(dentry->d_inode); > + inode_aops_up_read(dentry->d_inode); > return ret; > } > > @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > return -EOPNOTSUPP; > > file_start_write(file); > + inode_aops_down_read(inode); > ret = file->f_op->fallocate(file, mode, offset, len); > + inode_aops_up_read(inode); > > /* > * Create inotify and fanotify events. 
> diff --git a/fs/stat.c b/fs/stat.c > index 894699c74dde..274b3ccc82b1 100644 > --- a/fs/stat.c > +++ b/fs/stat.c > @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat, > if (IS_AUTOMOUNT(inode)) > stat->attributes |= STATX_ATTR_AUTOMOUNT; > > + inode_aops_down_read(inode); > if (IS_DAX(inode)) > stat->attributes |= STATX_ATTR_DAX; > + inode_aops_up_read(inode); > > if (inode->i_op->getattr) > return inode->i_op->getattr(path, stat, request_mask, > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 836a1f09be03..3e83a97dc047 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -420,6 +420,7 @@ xfs_iget_cache_hit( > rcu_read_unlock(); > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > + ASSERT(!rwsem_is_locked(&inode->i_aops_sem)); > error = xfs_reinit_inode(mp, inode); > if (error) { > bool wake; > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 63d1e533a07d..ad0f2368031b 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -40,6 +40,7 @@ > #include <linux/fs_types.h> > #include <linux/build_bug.h> > #include <linux/stddef.h> > +#include <linux/percpu-rwsem.h> > > #include <asm/byteorder.h> > #include <uapi/linux/fs.h> > @@ -359,6 +360,11 @@ typedef struct { > typedef int (*read_actor_t)(read_descriptor_t *, struct page *, > unsigned long, unsigned long); > > +/** > + * NOTE: DO NOT define new functions in address_space_operations without first > + * considering how dynamic DAX states are to be supported. See the > + * inode_aops_*_read() functions > + */ > struct address_space_operations { > int (*writepage)(struct page *page, struct writeback_control *wbc); > int (*readpage)(struct file *, struct page *); > @@ -735,6 +741,9 @@ struct inode { > #endif > > void *i_private; /* fs or device private pointer */ > +#if defined(CONFIG_FS_DAX) > + struct rw_semaphore i_aops_sem; > +#endif > } __randomize_layout; > > struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode); > @@ -1817,6 +1826,11 @@ struct block_device_operations; > > struct iov_iter; > > +/** > + * NOTE: DO NOT define new functions in file_operations without first > + * considering how dynamic address_space operations are to be supported. See > + * the inode_aops_*_read() functions in this file. > + */ > struct file_operations { > struct module *owner; > loff_t (*llseek) (struct file *, loff_t, int); > @@ -1889,16 +1903,64 @@ struct inode_operations { > int (*set_acl)(struct inode *, struct posix_acl *, int); > } ____cacheline_aligned; > > +#if defined(CONFIG_FS_DAX) > +/* > + * Filesystems wishing to support dynamic DAX states must do the following. > + * > + * 1) the direct_IO address_space_operation must be supported in all potential > + * a_ops vectors for any state suported by the inode. This is because the > + * direct_IO function is used as a flag long before the function is called. > + > + * 3) DAX state changes shall not be allowed while the file is mmap'ed > + * 4) For non-mmaped operations the VFS layer must take the read lock for any > + * use of IS_DAX() > + * 5) Filesystems take the write lock when changing DAX states. Do you envision ext4 migrating their aops changing code to use i_aops_sem in the future? 
--D > + */ > +static inline void inode_aops_down_read(struct inode *inode) > +{ > + down_read(&inode->i_aops_sem); > +} > +static inline void inode_aops_up_read(struct inode *inode) > +{ > + up_read(&inode->i_aops_sem); > +} > +static inline void inode_aops_down_write(struct inode *inode) > +{ > + down_write(&inode->i_aops_sem); > +} > +static inline void inode_aops_up_write(struct inode *inode) > +{ > + up_write(&inode->i_aops_sem); > +} > +#else /* !CONFIG_FS_DAX */ > +#define inode_aops_down_read(inode) do { (void)(inode); } while (0) > +#define inode_aops_up_read(inode) do { (void)(inode); } while (0) > +#define inode_aops_down_write(inode) do { (void)(inode); } while (0) > +#define inode_aops_up_write(inode) do { (void)(inode); } while (0) > +#endif /* CONFIG_FS_DAX */ > + > static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio, > struct iov_iter *iter) > { > - return file->f_op->read_iter(kio, iter); > + struct inode *inode = file_inode(kio->ki_filp); > + ssize_t ret; > + > + inode_aops_down_read(inode); > + ret = file->f_op->read_iter(kio, iter); > + inode_aops_up_read(inode); > + return ret; > } > > static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, > struct iov_iter *iter) > { > - return file->f_op->write_iter(kio, iter); > + struct inode *inode = file_inode(kio->ki_filp); > + ssize_t ret; > + > + inode_aops_down_read(inode); > + ret = file->f_op->write_iter(kio, iter); > + inode_aops_up_read(inode); > + return ret; > } > > static inline int call_mmap(struct file *file, struct vm_area_struct *vma) > diff --git a/mm/fadvise.c b/mm/fadvise.c > index 4f17c83db575..6a30febb11e0 100644 > --- a/mm/fadvise.c > +++ b/mm/fadvise.c > @@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) > bdi = inode_to_bdi(mapping->host); > > if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) { > + int ret = 0; > + > switch (advice) { > case POSIX_FADV_NORMAL: > case POSIX_FADV_RANDOM: > @@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) > /* no bad return value, but ignore advice */ > break; > default: > - return -EINVAL; > + ret = -EINVAL; > } > - return 0; > + > + return ret; > } > > /* > diff --git a/mm/filemap.c b/mm/filemap.c > index 1784478270e1..3a7863ba51b9 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) > * and return. Otherwise fallthrough to buffered io for > * the rest of the read. Buffered reads will not work for > * DAX files, so don't bother trying. > + * > + * IS_DAX is protected under ->read_iter lock > */ > if (retval < 0 || !count || iocb->ki_pos >= size || > IS_DAX(inode)) > @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from) > * holes, for example. For DAX files, a buffered write will > * not succeed (even if it did, DAX does not handle dirty > * page-cache pages correctly). 
> + * > + * IS_DAX is protected under ->write_iter lock > */ > if (written < 0 || !iov_iter_count(from) || IS_DAX(inode)) > goto out; > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index b08b199f9a11..3d05bd10d83e 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, > unsigned long ret; > loff_t off = (loff_t)pgoff << PAGE_SHIFT; > > + /* Should not need locking here because mmap is not allowed */ > if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD)) > goto out; > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index b679908743cb..f048178e2b93 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm, > } else { /* !is_shmem */ > if (!page || xa_is_value(page)) { > xas_unlock_irq(&xas); > + inode_aops_down_read(file->f_inode); > page_cache_sync_readahead(mapping, &file->f_ra, > file, index, > PAGE_SIZE); > + inode_aops_up_read(file->f_inode); > /* drain pagevecs to help isolate_lru_page() */ > lru_add_drain(); > page = find_lock_page(mapping, index); > diff --git a/mm/util.c b/mm/util.c > index 988d11e6c17c..a4fb0670137d 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, > > ret = security_mmap_file(file, prot, flag); > if (!ret) { > - if (down_write_killable(&mm->mmap_sem)) > + if (file) > + inode_aops_down_read(file_inode(file)); > + if (down_write_killable(&mm->mmap_sem)) { > + if (file) > + inode_aops_up_read(file_inode(file)); > return -EINTR; > + } > ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, > &populate, &uf); > up_write(&mm->mmap_sem); > + if (file) > + inode_aops_up_read(file_inode(file)); > userfaultfd_unmap_complete(mm, &uf); > if (populate) > mm_populate(ret, populate); > -- > 2.21.0 >
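To make rule 5 in the patch above concrete, here is a hedged sketch (not part of the posted patch) of how a filesystem could use the write side of the proposed rwsem to switch an inode's state once all the VFS-side inode_aops_down_read() users have drained; the two aops vectors are hypothetical:

	static int example_switch_dax(struct inode *inode, bool enable)
	{
		struct address_space *mapping = inode->i_mapping;
		int error;

		inode_aops_down_write(inode);	/* quiesce all aops readers */
		error = filemap_write_and_wait(mapping);
		if (!error)
			error = invalidate_inode_pages2(mapping);
		if (!error) {
			if (enable) {
				inode->i_flags |= S_DAX;
				mapping->a_ops = &example_dax_aops;
			} else {
				inode->i_flags &= ~S_DAX;
				mapping->a_ops = &example_page_cache_aops;
			}
		}
		inode_aops_up_write(inode);
		return error;
	}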
On Fri, Feb 21, 2020 at 04:33:17PM -0800, Darrick J. Wong wrote: > On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > DAX requires special address space operations (aops). Changing DAX > > state therefore requires changing those aops. > > > > However, many functions require aops to remain consistent through a deep > > call stack. > > > > Define a vfs level inode rwsem to protect aops throughout call stacks > > which require them. > > > > Finally, define calls to be used in subsequent patches when aops usage > > needs to be quiesced by the file system. > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > --- > > Changes from V3: > > Convert from global to a per-inode rwsem > > Remove pre-optimization > > Remove static branch stuff > > Change function names from inode_dax_state_* to inode_aops_* > > I kept 'inode' as the synchronization is at the inode > > level now (probably where it belongs)... > > > > and I still prefer *_[down|up]_[read|write] as those > > names better reflect the use and interaction between > > users (readers) and writers better. 'XX_start_aop()' > > would have to be matched with something like > > 'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or > > something which does not make sense on the 'writer' > > side. > > > > Changes from V2 > > > > Rebase on linux-next-08-02-2020 > > > > Fix locking order > > Change all references from mode to state where appropriate > > add CONFIG_FS_DAX requirement for state change > > Use a static branch to enable locking only when a dax capable > > device has been seen. > > > > Move the lock to a global vfs lock > > > > this does a few things > > 1) preps us better for ext4 support > > 2) removes funky callbacks from inode ops > > 3) remove complexity from XFS and probably from > > ext4 later > > > > We can do this because > > 1) the locking order is required to be at the > > highest level anyway, so why complicate xfs > > 2) We had to move the sem to the super_block > > because it is too heavy for the inode. > > 3) After internal discussions with Dan we > > decided that this would be easier, just as > > performant, and with slightly less overhead > > than in the VFS SB. > > > > We also change the functions names to up/down; > > read/write as appropriate. Previous names were over > > simplified. > > > > Update comments and documentation > > > > squash: add locking > > --- > > Documentation/filesystems/vfs.rst | 16 ++++++++ > > fs/attr.c | 1 + > > fs/inode.c | 15 +++++-- > > fs/iomap/buffered-io.c | 1 + > > fs/open.c | 4 ++ > > fs/stat.c | 2 + > > fs/xfs/xfs_icache.c | 1 + > > include/linux/fs.h | 66 ++++++++++++++++++++++++++++++- > > mm/fadvise.c | 7 +++- > > mm/filemap.c | 4 ++ > > mm/huge_memory.c | 1 + > > mm/khugepaged.c | 2 + > > mm/util.c | 9 ++++- > > 13 files changed, 121 insertions(+), 8 deletions(-) > > > > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst > > index 7d4d09dd5e6d..4a10a232f8e2 100644 > > --- a/Documentation/filesystems/vfs.rst > > +++ b/Documentation/filesystems/vfs.rst > > @@ -934,6 +934,22 @@ cache in your filesystem. The following members are defined: > > Called during swapoff on files where swap_activate was > > successful. > > > > +Changing DAX 'state' dynamically > > +---------------------------------- > > + > > +Some file systems which support DAX want to be able to change the DAX state > > +dyanically. 
To switch the state safely we lock the inode state in all "normal" > > +file system operations and restrict state changes to those operations. The > > +specific rules are. > > + > > + 1) the direct_IO address_space_operation must be supported in all > > + potential a_ops vectors for any state suported by the inode. > > + > > + 3) DAX state changes shall not be allowed while the file is mmap'ed > > + 4) For non-mmaped operations the VFS layer must take the read lock for any > > + use of IS_DAX() > > + 5) Filesystems take the write lock when changing DAX states. > > + > > > > The File Object > > =============== > > diff --git a/fs/attr.c b/fs/attr.c > > index b4bbdbd4c8ca..9b15f73d1079 100644 > > --- a/fs/attr.c > > +++ b/fs/attr.c > > @@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de > > if (error) > > return error; > > > > + /* DAX read state should already be held here */ > > if (inode->i_op->setattr) > > error = inode->i_op->setattr(dentry, attr); > > else > > diff --git a/fs/inode.c b/fs/inode.c > > index 7d57068b6b7a..6e4f1cc872f2 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) > > #endif > > inode->i_flctx = NULL; > > this_cpu_inc(nr_inodes); > > + init_rwsem(&inode->i_aops_sem); > > > > return 0; > > out: > > @@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput); > > */ > > int bmap(struct inode *inode, sector_t *block) > > { > > - if (!inode->i_mapping->a_ops->bmap) > > - return -EINVAL; > > + int ret = 0; > > + > > + inode_aops_down_read(inode); > > + if (!inode->i_mapping->a_ops->bmap) { > > + ret = -EINVAL; > > + goto err; > > + } > > > > *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > - return 0; > > + > > +err: > > + inode_aops_up_read(inode); > > + return ret; > > } > > EXPORT_SYMBOL(bmap); > > #endif > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index 7c84c4c027c4..e313a34d5fa6 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count, > > offset = offset_in_page(pos); > > bytes = min_t(loff_t, PAGE_SIZE - offset, count); > > > > + /* DAX state read should already be held here */ > > if (IS_DAX(inode)) > > status = iomap_dax_zero(pos, offset, bytes, iomap); > > else > > diff --git a/fs/open.c b/fs/open.c > > index 0788b3715731..3abf0bfac462 100644 > > --- a/fs/open.c > > +++ b/fs/open.c > > @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs, > > if (ret) > > newattrs.ia_valid |= ret | ATTR_FORCE; > > > > + inode_aops_down_read(dentry->d_inode); > > inode_lock(dentry->d_inode); > > /* Note any delegations or leases have already been broken: */ > > ret = notify_change(dentry, &newattrs, NULL); > > inode_unlock(dentry->d_inode); > > + inode_aops_up_read(dentry->d_inode); > > return ret; > > } > > > > @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > > return -EOPNOTSUPP; > > > > file_start_write(file); > > + inode_aops_down_read(inode); > > ret = file->f_op->fallocate(file, mode, offset, len); > > + inode_aops_up_read(inode); > > > > /* > > * Create inotify and fanotify events. 
> > diff --git a/fs/stat.c b/fs/stat.c > > index 894699c74dde..274b3ccc82b1 100644 > > --- a/fs/stat.c > > +++ b/fs/stat.c > > @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat, > > if (IS_AUTOMOUNT(inode)) > > stat->attributes |= STATX_ATTR_AUTOMOUNT; > > > > + inode_aops_down_read(inode); > > if (IS_DAX(inode)) > > stat->attributes |= STATX_ATTR_DAX; > > + inode_aops_up_read(inode); > > > > if (inode->i_op->getattr) > > return inode->i_op->getattr(path, stat, request_mask, > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index 836a1f09be03..3e83a97dc047 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -420,6 +420,7 @@ xfs_iget_cache_hit( > > rcu_read_unlock(); > > > > ASSERT(!rwsem_is_locked(&inode->i_rwsem)); > > + ASSERT(!rwsem_is_locked(&inode->i_aops_sem)); > > error = xfs_reinit_inode(mp, inode); > > if (error) { > > bool wake; > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 63d1e533a07d..ad0f2368031b 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -40,6 +40,7 @@ > > #include <linux/fs_types.h> > > #include <linux/build_bug.h> > > #include <linux/stddef.h> > > +#include <linux/percpu-rwsem.h> > > > > #include <asm/byteorder.h> > > #include <uapi/linux/fs.h> > > @@ -359,6 +360,11 @@ typedef struct { > > typedef int (*read_actor_t)(read_descriptor_t *, struct page *, > > unsigned long, unsigned long); > > > > +/** > > + * NOTE: DO NOT define new functions in address_space_operations without first > > + * considering how dynamic DAX states are to be supported. See the > > + * inode_aops_*_read() functions > > + */ > > struct address_space_operations { > > int (*writepage)(struct page *page, struct writeback_control *wbc); > > int (*readpage)(struct file *, struct page *); > > @@ -735,6 +741,9 @@ struct inode { > > #endif > > > > void *i_private; /* fs or device private pointer */ > > +#if defined(CONFIG_FS_DAX) > > + struct rw_semaphore i_aops_sem; > > +#endif > > } __randomize_layout; > > > > struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode); > > @@ -1817,6 +1826,11 @@ struct block_device_operations; > > > > struct iov_iter; > > > > +/** > > + * NOTE: DO NOT define new functions in file_operations without first > > + * considering how dynamic address_space operations are to be supported. See > > + * the inode_aops_*_read() functions in this file. > > + */ > > struct file_operations { > > struct module *owner; > > loff_t (*llseek) (struct file *, loff_t, int); > > @@ -1889,16 +1903,64 @@ struct inode_operations { > > int (*set_acl)(struct inode *, struct posix_acl *, int); > > } ____cacheline_aligned; > > > > +#if defined(CONFIG_FS_DAX) > > +/* > > + * Filesystems wishing to support dynamic DAX states must do the following. > > + * > > + * 1) the direct_IO address_space_operation must be supported in all potential > > + * a_ops vectors for any state suported by the inode. This is because the > > + * direct_IO function is used as a flag long before the function is called. > > + > > + * 3) DAX state changes shall not be allowed while the file is mmap'ed > > + * 4) For non-mmaped operations the VFS layer must take the read lock for any > > + * use of IS_DAX() > > + * 5) Filesystems take the write lock when changing DAX states. > > Do you envision ext4 migrating their aops changing code to use > i_aops_sem in the future? Ah... yea, which, at that time, this comment would be completely wrong. 
I tried to remove references to dax as per Dave but I missed this one. Thanks, Ira > > --D > > > + */ > > +static inline void inode_aops_down_read(struct inode *inode) > > +{ > > + down_read(&inode->i_aops_sem); > > +} > > +static inline void inode_aops_up_read(struct inode *inode) > > +{ > > + up_read(&inode->i_aops_sem); > > +} > > +static inline void inode_aops_down_write(struct inode *inode) > > +{ > > + down_write(&inode->i_aops_sem); > > +} > > +static inline void inode_aops_up_write(struct inode *inode) > > +{ > > + up_write(&inode->i_aops_sem); > > +} > > +#else /* !CONFIG_FS_DAX */ > > +#define inode_aops_down_read(inode) do { (void)(inode); } while (0) > > +#define inode_aops_up_read(inode) do { (void)(inode); } while (0) > > +#define inode_aops_down_write(inode) do { (void)(inode); } while (0) > > +#define inode_aops_up_write(inode) do { (void)(inode); } while (0) > > +#endif /* CONFIG_FS_DAX */ > > + > > static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio, > > struct iov_iter *iter) > > { > > - return file->f_op->read_iter(kio, iter); > > + struct inode *inode = file_inode(kio->ki_filp); > > + ssize_t ret; > > + > > + inode_aops_down_read(inode); > > + ret = file->f_op->read_iter(kio, iter); > > + inode_aops_up_read(inode); > > + return ret; > > } > > > > static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, > > struct iov_iter *iter) > > { > > - return file->f_op->write_iter(kio, iter); > > + struct inode *inode = file_inode(kio->ki_filp); > > + ssize_t ret; > > + > > + inode_aops_down_read(inode); > > + ret = file->f_op->write_iter(kio, iter); > > + inode_aops_up_read(inode); > > + return ret; > > } > > > > static inline int call_mmap(struct file *file, struct vm_area_struct *vma) > > diff --git a/mm/fadvise.c b/mm/fadvise.c > > index 4f17c83db575..6a30febb11e0 100644 > > --- a/mm/fadvise.c > > +++ b/mm/fadvise.c > > @@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) > > bdi = inode_to_bdi(mapping->host); > > > > if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) { > > + int ret = 0; > > + > > switch (advice) { > > case POSIX_FADV_NORMAL: > > case POSIX_FADV_RANDOM: > > @@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) > > /* no bad return value, but ignore advice */ > > break; > > default: > > - return -EINVAL; > > + ret = -EINVAL; > > } > > - return 0; > > + > > + return ret; > > } > > > > /* > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 1784478270e1..3a7863ba51b9 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) > > * and return. Otherwise fallthrough to buffered io for > > * the rest of the read. Buffered reads will not work for > > * DAX files, so don't bother trying. > > + * > > + * IS_DAX is protected under ->read_iter lock > > */ > > if (retval < 0 || !count || iocb->ki_pos >= size || > > IS_DAX(inode)) > > @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from) > > * holes, for example. For DAX files, a buffered write will > > * not succeed (even if it did, DAX does not handle dirty > > * page-cache pages correctly). 
> > + * > > + * IS_DAX is protected under ->write_iter lock > > */ > > if (written < 0 || !iov_iter_count(from) || IS_DAX(inode)) > > goto out; > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index b08b199f9a11..3d05bd10d83e 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, > > unsigned long ret; > > loff_t off = (loff_t)pgoff << PAGE_SHIFT; > > > > + /* Should not need locking here because mmap is not allowed */ > > if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD)) > > goto out; > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > > index b679908743cb..f048178e2b93 100644 > > --- a/mm/khugepaged.c > > +++ b/mm/khugepaged.c > > @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm, > > } else { /* !is_shmem */ > > if (!page || xa_is_value(page)) { > > xas_unlock_irq(&xas); > > + inode_aops_down_read(file->f_inode); > > page_cache_sync_readahead(mapping, &file->f_ra, > > file, index, > > PAGE_SIZE); > > + inode_aops_up_read(file->f_inode); > > /* drain pagevecs to help isolate_lru_page() */ > > lru_add_drain(); > > page = find_lock_page(mapping, index); > > diff --git a/mm/util.c b/mm/util.c > > index 988d11e6c17c..a4fb0670137d 100644 > > --- a/mm/util.c > > +++ b/mm/util.c > > @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, > > > > ret = security_mmap_file(file, prot, flag); > > if (!ret) { > > - if (down_write_killable(&mm->mmap_sem)) > > + if (file) > > + inode_aops_down_read(file_inode(file)); > > + if (down_write_killable(&mm->mmap_sem)) { > > + if (file) > > + inode_aops_up_read(file_inode(file)); > > return -EINTR; > > + } > > ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, > > &populate, &uf); > > up_write(&mm->mmap_sem); > > + if (file) > > + inode_aops_up_read(file_inode(file)); > > userfaultfd_unmap_complete(mm, &uf); > > if (populate) > > mm_populate(ret, populate); > > -- > > 2.21.0 > >
On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > I am very much against this. There is a reason why we don't support changes of ops vectors at runtime anywhere else, because it is horribly complicated and impossible to get right. IFF we ever want to change the DAX vs non-DAX mode (which I'm still not sold on) the right way is to just add a few simple conditionals and merge the aops, which is much easier to reason about, and less costly in overall overhead.
>
> *cough*
>
> That's exactly what the original code did. And it was broken because page faults call multiple aops that are dependent on the result of the previous aop calls setting up the state correctly for the latter calls. And when S_DAX changes between those calls, shit breaks.

No, the original code was broken because it didn't serialize between DAX and buffer access.

Take a step back and look where the problems are, and they are not mostly with the aops. In fact the only aop useful for DAX is ->writepages, and how it uses ->writepages is actually a bit of an abuse of that interface.

So what we really need is just a way to prevent switching the flag when a file is mapped, and in the normal read/write path ensure the flag can't be switched while I/O is going on, which could either be done by ensuring it is only switched under i_rwsem or equivalent protection, or by setting the DAX flag once in the iocb similar to IOCB_DIRECT.

And the easiest way to get all this done is as a first step to just allow switching the flag when no blocks are allocated at all, similar to how the rt flag works. Once that works properly and use cases show up we can relax the requirements, and we need to do that in a way without bloating the inode and various VFS code paths, as DAX is a complete fringe feature, and switching the flag at runtime is the fringe of a fringe. It just isn't worth bloating the inode and wasting tons of developer time on it.
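Christoph's "DAX flag once in the iocb" idea, sketched under stated assumptions: IOCB_DAX is a hypothetical flag modeled on IOCB_DIRECT, and example_iomap_ops is a stand-in for the filesystem's iomap ops. S_DAX is sampled exactly once under i_rwsem, and only the latched iocb flag is consulted afterwards, so a concurrent switch cannot change the path mid-I/O:

	#define IOCB_DAX	(1 << 30)	/* hypothetical flag bit */

	extern const struct iomap_ops example_iomap_ops;	/* stand-in */

	static ssize_t example_read_iter(struct kiocb *iocb, struct iov_iter *to)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		ssize_t ret;

		inode_lock_shared(inode);	/* flag only changes under i_rwsem */
		if (IS_DAX(inode))
			iocb->ki_flags |= IOCB_DAX;	/* latched once */
		if (iocb->ki_flags & IOCB_DAX)
			ret = dax_iomap_rw(iocb, to, &example_iomap_ops);
		else
			ret = generic_file_read_iter(iocb, to);
		inode_unlock_shared(inode);
		return ret;
	}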
On Mon, Feb 24, 2020 at 06:56:03PM +0100, Christoph Hellwig wrote:
> On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > > I am very much against this. There is a reason why we don't support changes of ops vectors at runtime anywhere else, because it is horribly complicated and impossible to get right. IFF we ever want to change the DAX vs non-DAX mode (which I'm still not sold on) the right way is to just add a few simple conditionals and merge the aops, which is much easier to reason about, and less costly in overall overhead.
> >
> > *cough*
> >
> > That's exactly what the original code did. And it was broken because page faults call multiple aops that are dependent on the result of the previous aop calls setting up the state correctly for the latter calls. And when S_DAX changes between those calls, shit breaks.
>
> No, the original code was broken because it didn't serialize between DAX and buffer access.
>
> Take a step back and look where the problems are, and they are not mostly with the aops. In fact the only aop useful for DAX is ->writepages, and how it uses ->writepages is actually a bit of an abuse of that interface.

The races are all through the fops, too, which is one of the reasons Darrick mentioned we should probably move this up to file ops level...

> So what we really need is just a way to prevent switching the flag when a file is mapped,

That's not sufficient.

We also have to prevent the file from being mapped *while we are switching*. Nothing in the mmap() path actually serialises against filesystem operations, and the initial behavioural checks in the page fault path are similarly unserialised against changing the S_DAX flag.

e.g. there's a race against ->mmap() with switching the S_DAX flag. In xfs_file_mmap():

	file_accessed(file);
	vma->vm_ops = &xfs_file_vm_ops;
	if (IS_DAX(inode))
		vma->vm_flags |= VM_HUGEPAGE;

So if this runs while a switch from true to false is occurring, we've got a non-dax VMA with huge pages enabled, and the non-dax page fault path doesn't support this. If we are really lucky, we'll have IS_DAX() set just long enough to get through this check in xfs_filemap_huge_fault():

	if (!IS_DAX(file_inode(vmf->vma->vm_file)))
		return VM_FAULT_FALLBACK;

and then we'll get into __xfs_filemap_fault() and block on the MMAPLOCK. When we actually get that lock, S_DAX will now be false and we have a huge page fault running through a path (page cache IO) that doesn't support huge pages...

This is the sort of landmine switching S_DAX without serialisation causes. The methods themselves look sane, but it's the cross-method dependencies on a stable S_DAX flag that are the real problem.

Yes, we can probably fix this by adding XFS_MMAPLOCK_SHARED here, but that means every filesystem has to solve the same race conditions itself. That's hard to get right and easy to get wrong. If the VFS/mm subsystem provides the infrastructure needed to use this functionality safely, then that is hard for filesystem developers to get wrong...

> and in the normal read/write path ensure the flag can't be switched while I/O is going on, which could either be done by ensuring it is only switched under i_rwsem or equivalent protection, or by setting the DAX flag once in the iocb similar to IOCB_DIRECT.

The iocb path is not the problem - that's entirely serialised against S_DAX changes by the i_rwsem. The problem is that we have no equivalent filesystem level serialisation for the entire mmap/page fault path, and it checks S_DAX all over the place.

It's basically the same limitation that we have with mmap vs direct IO - we can't lock out mmap when we do direct IO, so we cannot guarantee coherency between the two. Similar here - we cannot lock out mmap in any sane way, so we cannot guarantee coherency between mmap and changing the S_DAX flag. That's the underlying problem we need to solve here.

/me wonders if the best thing to do is to add a ->fault callout to tell the filesystem to lock/unlock the inode right up at the top of the page fault path, outside even the mmap_sem. That means all the methods that the page fault calls are protected against S_DAX changes, and it gives us a low cost method of serialising page faults against DIO (e.g. via inode_dio_wait())...

> And the easiest way to get all this done is as a first step to just allow switching the flag when no blocks are allocated at all, similar to how the rt flag works.

False equivalence - it is not similar because the RT flag changes and their associated state checks are *already fully serialised* by the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and that's the problem we need to solve...

In reality, I don't really care how we solve this problem, but we've been down the path you are suggesting at least twice before and each time we've ended up in a "that doesn't work" corner and had to pursue other solutions. I don't like going around in circles.

Cheers,

Dave.
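The per-filesystem band-aid Dave mentions, sketched as an assumption of what "adding XFS_MMAPLOCK_SHARED here" might look like (this is not a posted patch): holding the MMAPLOCK across ->mmap keeps S_DAX stable from the vm_ops assignment through the VM_HUGEPAGE decision.

	STATIC int
	xfs_file_mmap(
		struct file		*file,
		struct vm_area_struct	*vma)
	{
		struct xfs_inode	*ip = XFS_I(file_inode(file));

		file_accessed(file);
		xfs_ilock(ip, XFS_MMAPLOCK_SHARED);	/* pin S_DAX for the setup */
		vma->vm_ops = &xfs_file_vm_ops;
		if (IS_DAX(VFS_I(ip)))
			vma->vm_flags |= VM_HUGEPAGE;
		xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
		return 0;
	}

As the text above notes, this only closes the ->mmap window, leaves the fault-time checks exposed, and would have to be replicated in every filesystem.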
On Tue, Feb 25, 2020 at 11:09:37AM +1100, Dave Chinner wrote:
> > No, the original code was broken because it didn't serialize between DAX and buffer access.
> >
> > Take a step back and look where the problems are, and they are not mostly with the aops. In fact the only aop useful for DAX is ->writepages, and how it uses ->writepages is actually a bit of an abuse of that interface.
>
> The races are all through the fops, too, which is one of the reasons Darrick mentioned we should probably move this up to file ops level...

But the file ops are very simple to use. Pass the flag in the iocb, and make sure the flag can only be changed with i_rwsem held. That part is pretty trivial, the interesting case is mmap because it is so spread out.

> > So what we really need is just a way to prevent switching the flag when a file is mapped,
>
> That's not sufficient.
>
> We also have to prevent the file from being mapped *while we are switching*. Nothing in the mmap() path actually serialises against filesystem operations, and the initial behavioural checks in the page fault path are similarly unserialised against changing the S_DAX flag.

And the important part here is ->mmap. If ->mmap doesn't get through we are not going to see page faults.

> > and in the normal read/write path ensure the flag can't be switched while I/O is going on, which could either be done by ensuring it is only switched under i_rwsem or equivalent protection, or by setting the DAX flag once in the iocb similar to IOCB_DIRECT.
>
> The iocb path is not the problem - that's entirely serialised against S_DAX changes by the i_rwsem. The problem is that we have no equivalent filesystem level serialisation for the entire mmap/page fault path, and it checks S_DAX all over the place.

Not right now. We have various IS_DAX checks outside it. But it is easily fixable indeed.

> /me wonders if the best thing to do is to add a ->fault callout to tell the filesystem to lock/unlock the inode right up at the top of the page fault path, outside even the mmap_sem. That means all the methods that the page fault calls are protected against S_DAX changes, and it gives us a low cost method of serialising page faults against DIO (e.g. via inode_dio_wait())...

Maybe. Especially if it solves real problems and isn't just new overhead to add an esoteric feature.

> > And the easiest way to get all this done is as a first step to just allow switching the flag when no blocks are allocated at all, similar to how the rt flag works.
>
> False equivalence - it is not similar because the RT flag changes and their associated state checks are *already fully serialised* by the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and that's the problem we need to solve...

And my point is that if we ensure S_DAX can only be changed if there are no blocks on the file, it is fairly easy to provide the same guarantee. And I've not heard any argument that we really need more flexibility than that. In fact I think just being able to change it on the parent directory and inheriting the flag might be more than plenty, which would lead to a very simple implementation without any of the crazy overhead in this series.
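A sketch of the inheritance model Christoph suggests (illustrative only; dir_wants_dax() stands in for whatever on-disk per-directory attribute a filesystem would define, analogous to how XFS inherits the realtime flag from the parent at inode allocation): the effective DAX mode is decided once, before the new inode is visible, so nothing ever switches on a live inode and no extra locking is needed.

	static void example_dax_inherit(struct inode *dir, struct inode *inode)
	{
		/* called at inode creation, before anyone can open/mmap it */
		if (dir_wants_dax(dir) && S_ISREG(inode->i_mode))
			inode->i_flags |= S_DAX;
	}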
Christoph Hellwig <hch@lst.de> writes:

> And my point is that if we ensure S_DAX can only be changed if there are no blocks on the file, it is fairly easy to provide the same guarantee. And I've not heard any argument that we really need more flexibility than that. In fact I think just being able to change it on the parent directory and inheriting the flag might be more than plenty, which would lead to a very simple implementation without any of the crazy overhead in this series.

I know of one user who had at least mentioned it to me, so I cc'd him. Jonathan, can you describe your use case for being able to change a file between dax and non-dax modes? Or, if I'm misremembering, just correct me?

Thanks!
Jeff
On Tue, Feb 25, 2020 at 06:36:33PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 25, 2020 at 11:09:37AM +1100, Dave Chinner wrote:
> > > No, the original code was broken because it didn't serialize between DAX and buffer access.
> > >
> > > Take a step back and look where the problems are, and they are not mostly with the aops. In fact the only aop useful for DAX is ->writepages, and how it uses ->writepages is actually a bit of an abuse of that interface.
> >
> > The races are all through the fops, too, which is one of the reasons Darrick mentioned we should probably move this up to file ops level...
>
> But the file ops are very simple to use. Pass the flag in the iocb, and make sure the flag can only be changed with i_rwsem held. That part is pretty trivial, the interesting case is mmap because it is so spread out.
>
> > > So what we really need is just a way to prevent switching the flag when a file is mapped,
> >
> > That's not sufficient.
> >
> > We also have to prevent the file from being mapped *while we are switching*. Nothing in the mmap() path actually serialises against filesystem operations, and the initial behavioural checks in the page fault path are similarly unserialised against changing the S_DAX flag.
>
> And the important part here is ->mmap. If ->mmap doesn't get through we are not going to see page faults.

The series already blocks mmap'ed files from a switch. That was not crazy hard. I don't see a use case to support changing the mode while the file is mapped.

> > > and in the normal read/write path ensure the flag can't be switched while I/O is going on, which could either be done by ensuring it is only switched under i_rwsem or equivalent protection, or by setting the DAX flag once in the iocb similar to IOCB_DIRECT.
> >
> > The iocb path is not the problem - that's entirely serialised against S_DAX changes by the i_rwsem. The problem is that we have no equivalent filesystem level serialisation for the entire mmap/page fault path, and it checks S_DAX all over the place.
>
> Not right now. We have various IS_DAX checks outside it. But it is easily fixable indeed.
>
> > /me wonders if the best thing to do is to add a ->fault callout to tell the filesystem to lock/unlock the inode right up at the top of the page fault path, outside even the mmap_sem. That means all the methods that the page fault calls are protected against S_DAX changes, and it gives us a low cost method of serialising page faults against DIO (e.g. via inode_dio_wait())...
>
> Maybe. Especially if it solves real problems and isn't just new overhead to add an esoteric feature.

I thought about doing something like this but it seemed easier to block the change while the file was mapped.

> > > And the easiest way to get all this done is as a first step to just allow switching the flag when no blocks are allocated at all, similar to how the rt flag works.
> >
> > False equivalence - it is not similar because the RT flag changes and their associated state checks are *already fully serialised* by the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and that's the problem we need to solve...
>
> And my point is that if we ensure S_DAX can only be changed if there are no blocks on the file, it is fairly easy to provide the same guarantee. And I've not heard any argument that we really need more flexibility than that. In fact I think just being able to change it on the parent directory and inheriting the flag might be more than plenty, which would lead to a very simple implementation without any of the crazy overhead in this series.

Inherit on file creation or also if a file was moved under the directory? Do we need to consider hard or soft links?

Ira
Hi All

I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem. Initially the target is apps that need a fast binary log for fault tolerance: the classic database WAL use case; transaction coordination systems; enterprise message bus persistence and suchlike. Critically, there are cases where we use log based storage, i.e. it's not the strict 'read rarely, only on recovery' model that a classic db may have, but more of an 'append only, read many times' event stream model.

Think of the log oriented data storage as having logical segments (let's implement them as files), of which the most recent is being appended to (read_write) and the remaining N-1 older segments are full and sealed, so effectively immutable (read_only) until discarded. The tail segment needs to be in DAX mode for optimal write performance, as the size of the append may be sub-block and we don't want the overhead of the kernel call anyhow. So that's clearly a good fit for putting on a DAX fs mount and using mmap with MAP_SYNC.

However, we want fast read access into the segments, to retrieve stored records. The small access index can be built in volatile RAM (assuming we're willing to take the startup overhead of a full file scan at recovery time) but the data itself is big and we don't want to move it all off pmem. Which means the requirements are now different: we want the O/S cache to pull hot data into fast volatile RAM for us, which DAX explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather than app-direct mode, except here we're using the O/S rather than the hardware memory controller to do the cache management for us.

Currently this requires closing the full (read_write) file, then copying it to a non-DAX device and reopening it (read_only) there. Clearly that's expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode that will instead go via the O/S cache in the traditional manner. Bonus points if I can do it over non-overlapping ranges in a file without closing the DAX mode mmap, since then the segments are entirely logical instead of needing separate physical files.

I note a comment below regarding a per-directory setting, but don't have the background to fully understand what's being suggested. However, I'll note here that I can live with a per-directory granularity, as relinking a file into a new dir is a constant time operation, whilst the move described above isn't. So if a per-directory granularity is easier than a per-file one that's fine, though as a person with only passing knowledge of filesystem design I don't see how having multiple links to a file can work cleanly in that case.

Hope that helps.

Jonathan

P.S. I'll cheekily take the opportunity of having your attention to tack on one minor gripe about the current system: The only way to know if an mmap with MAP_SYNC will work is to try it and catch the error. Which would be reasonable if it were free of side effects. However, the process requires first expanding the file to at least the size of the desired map, which is done non-atomically, i.e. is user visible. There are thus nasty race conditions in the cleanup, where after a failed mmap attempt (e.g. the device doesn't support DAX), we try to shrink the file back to its original size, but something else has already opened it at its new, larger size. This is not theoretical: I got caught by it whilst adapting some of our middleware to use pmem. Therefore, some way to probe the file path for its capability would be nice, much the same as I can e.g. inspect file permissions to (more or less) evaluate if I can write it without actually mutating it. Thanks!

On 25/02/2020 19:37, Jeff Moyer wrote:
> Christoph Hellwig <hch@lst.de> writes:
>
>> And my point is that if we ensure S_DAX can only be changed if there are no blocks on the file, it is fairly easy to provide the same guarantee. And I've not heard any argument that we really need more flexibility than that. In fact I think just being able to change it on the parent directory and inheriting the flag might be more than plenty, which would lead to a very simple implementation without any of the crazy overhead in this series.
>
> I know of one user who had at least mentioned it to me, so I cc'd him. Jonathan, can you describe your use case for being able to change a file between dax and non-dax modes? Or, if I'm misremembering, just correct me?
>
> Thanks!
> Jeff
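For concreteness, a minimal userspace sketch of the grow-then-map sequence Jonathan describes (error handling trimmed; not taken from his middleware). The ftruncate() happens before the mmap() can fail, which is exactly the user-visible side effect behind the cleanup race:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Map a log tail segment with MAP_SYNC; must grow the file first. */
	static void *map_tail_segment(const char *path, size_t len)
	{
		int fd = open(path, O_CREAT | O_RDWR, 0600);
		void *p;

		if (fd < 0)
			return MAP_FAILED;
		if (ftruncate(fd, len) < 0) {	/* user-visible size change */
			close(fd);
			return MAP_FAILED;
		}
		/*
		 * MAP_SHARED_VALIDATE makes the kernel reject flags it cannot
		 * honour, so a non-DAX mapping fails here instead of silently
		 * dropping MAP_SYNC.
		 */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
		close(fd);
		return p;	/* on MAP_FAILED the file is already longer */
	}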
On Tue 25-02-20 11:09:37, Dave Chinner wrote:
> /me wonders if the best thing to do is to add a ->fault callout to tell the filesystem to lock/unlock the inode right up at the top of the page fault path, outside even the mmap_sem. That means all the methods that the page fault calls are protected against S_DAX changes, and it gives us a low cost method of serialising page faults against DIO (e.g. via inode_dio_wait())...

Well, that's going to be pretty hard. The main problem is: you cannot look up the VMA until you hold mmap_sem, and the inode is inside the VMA. And this is a fundamental problem because until you hold mmap_sem, the address space can change and thus the virtual address you are faulting on can change which inode it is mapped to. So you would have to do some dance like:

	lock mmap_sem
	lookup vma
	get inode reference
	drop mmap_sem
	tell fs about page fault
	lock mmap_sem
	is the vma still the same?

And I'm pretty confident the overhead will be visible in page fault intensive workloads...

Honza
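The dance above, spelled out as a schematic (fs_fault_lock()/fs_fault_unlock() are hypothetical hooks, and same_vma() hand-waves the revalidation, which is the genuinely hard part since the old vma pointer may have been freed and reused). The two mmap_sem round trips plus the inode reference per fault are where the overhead Jan predicts would come from:

	retry:
		down_read(&mm->mmap_sem);
		vma = find_vma(mm, address);
		inode = (vma && vma->vm_file) ? file_inode(vma->vm_file) : NULL;
		if (inode)
			ihold(inode);			/* pin across the unlock */
		up_read(&mm->mmap_sem);

		if (inode)
			fs_fault_lock(inode);		/* hypothetical fs callout */

		down_read(&mm->mmap_sem);
		if (!same_vma(mm, address, vma)) {	/* revalidate: hand-waved */
			up_read(&mm->mmap_sem);
			if (inode) {
				fs_fault_unlock(inode);
				iput(inode);
			}
			goto retry;
		}
		/* ...handle the fault with the fs lock held... */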
Hello,

On Wed 26-02-20 09:28:57, Jonathan Halliday wrote:
> I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem. Initially the target is apps that need a fast binary log for fault tolerance: the classic database WAL use case; transaction coordination systems; enterprise message bus persistence and suchlike. Critically, there are cases where we use log based storage, i.e. it's not the strict 'read rarely, only on recovery' model that a classic db may have, but more of an 'append only, read many times' event stream model.
>
> Think of the log oriented data storage as having logical segments (let's implement them as files), of which the most recent is being appended to (read_write) and the remaining N-1 older segments are full and sealed, so effectively immutable (read_only) until discarded. The tail segment needs to be in DAX mode for optimal write performance, as the size of the append may be sub-block and we don't want the overhead of the kernel call anyhow. So that's clearly a good fit for putting on a DAX fs mount and using mmap with MAP_SYNC.
>
> However, we want fast read access into the segments, to retrieve stored records. The small access index can be built in volatile RAM (assuming we're willing to take the startup overhead of a full file scan at recovery time) but the data itself is big and we don't want to move it all off pmem. Which means the requirements are now different: we want the O/S cache to pull hot data into fast volatile RAM for us, which DAX explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather than app-direct mode, except here we're using the O/S rather than the hardware memory controller to do the cache management for us.
>
> Currently this requires closing the full (read_write) file, then copying it to a non-DAX device and reopening it (read_only) there. Clearly that's expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode that will instead go via the O/S cache in the traditional manner. Bonus points if I can do it over non-overlapping ranges in a file without closing the DAX mode mmap, since then the segments are entirely logical instead of needing separate physical files.
>
> I note a comment below regarding a per-directory setting, but don't have the background to fully understand what's being suggested. However, I'll note here that I can live with a per-directory granularity, as relinking a file into a new dir is a constant time operation, whilst the move described above isn't. So if a per-directory granularity is easier than a per-file one that's fine, though as a person with only passing knowledge of filesystem design I don't see how having multiple links to a file can work cleanly in that case.

Well, with a per-directory setting, relinking the file will not magically make it stop using DAX. So your situation would be very similar to the current one, except "copy to non-DAX device" can be replaced by "copy to non-DAX directory". Maybe the "copy" part could actually be a reflink, which would make it faster.

> P.S. I'll cheekily take the opportunity of having your attention to tack on one minor gripe about the current system: The only way to know if an mmap with MAP_SYNC will work is to try it and catch the error. Which would be reasonable if it were free of side effects. However, the process requires first expanding the file to at least the size of the desired map, which is done non-atomically, i.e. is user visible. There are thus nasty race conditions in the cleanup, where after a failed mmap attempt (e.g. the device doesn't support DAX), we try to shrink the file back to its original size, but something else has already opened it at its new, larger size. This is not theoretical: I got caught by it whilst adapting some of our middleware to use pmem. Therefore, some way to probe the file path for its capability would be nice, much the same as I can e.g. inspect file permissions to (more or less) evaluate if I can write it without actually mutating it. Thanks!

Well, reporting an error on mmap(2) is the only way to avoid time-of-check-to-time-of-use races. And these are very important when we are speaking about data integrity guarantees. So that's not going to change. But with Ira's patches you could use statx(2) to check whether the file at least supports DAX and so avoid doing the mmap check with its side effects in the common case where it's hopeless... I'd also think that you could currently do the mmap check with the current file size and, if it succeeds, expand the file to the desired size and mmap again. It's not ideal but it should work.

								Honza
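The statx(2) probe Jan suggests would look roughly like the sketch below. STATX_ATTR_DAX is the attribute the series reports from vfs_getattr_nosec() (see the fs/stat.c hunk in the patch later in this thread); the fallback #define value and the helper name are assumptions for illustration:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>		/* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>		/* statx(), glibc >= 2.28 */

#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX	0x00200000	/* assumption: value proposed by the series */
#endif

/*
 * Best-effort DAX probe: returns 1 (DAX), 0 (not DAX), or -1 with errno
 * set (ENOTSUP when the fs/kernel cannot report the attribute at all).
 */
static int file_is_dax(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
		return -1;
	if (!(stx.stx_attributes_mask & STATX_ATTR_DAX)) {
		errno = ENOTSUP;
		return -1;
	}
	return !!(stx.stx_attributes & STATX_ATTR_DAX);
}

int main(int argc, char **argv)
{
	int dax = argc > 1 ? file_is_dax(argv[1]) : -1;

	if (dax < 0)
		perror("file_is_dax");
	else
		printf("%s: %s\n", argv[1], dax ? "DAX" : "not DAX");
	return 0;
}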
On 26/02/2020 11:31, Jan Kara wrote:
> Hello,
>
> On Wed 26-02-20 09:28:57, Jonathan Halliday wrote:
>> I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem. [...]
>>
>> I note a comment below regarding a per-directory setting, but don't have the background to fully understand what's being suggested. However, I'll note here that I can live with a per-directory granularity, as relinking a file into a new dir is a constant time operation, whilst the move described above isn't. So if a per-directory granularity is easier than a per-file one that's fine, though as a person with only passing knowledge of filesystem design I don't see how having multiple links to a file can work cleanly in that case.
>
> Well, with a per-directory setting, relinking the file will not magically make it stop using DAX. So your situation would be very similar to the current one, except "copy to non-DAX device" can be replaced by "copy to non-DAX directory". Maybe the "copy" part could actually be a reflink, which would make it faster.

Indeed. The requirement is for 'change mode in constant time' rather than the current 'change mode in time proportional to file size'. That seems to imply an approach that changes only fs metadata, without relocating the file data bytes. Beyond that I'm largely indifferent to the implementation details.

>> P.S. I'll cheekily take the opportunity of having your attention to tack on one minor gripe about the current system: The only way to know if an mmap with MAP_SYNC will work is to try it and catch the error. [...] Therefore, some way to probe the file path for its capability would be nice, much the same as I can e.g. inspect file permissions to (more or less) evaluate if I can write it without actually mutating it. Thanks!
>
> Well, reporting an error on mmap(2) is the only way to avoid time-of-check-to-time-of-use races. And these are very important when we are speaking about data integrity guarantees. So that's not going to change. But with Ira's patches you could use statx(2) to check whether the file at least supports DAX and so avoid doing the mmap check with its side effects in the common case where it's hopeless... I'd also think that you could currently do the mmap check with the current file size and, if it succeeds, expand the file to the desired size and mmap again. It's not ideal but it should work.

Sure. Best effort is fine here, just as with the example of looking at the permission bits on a file - even in the absence of racing permission changes it's not definitive because of additional quota or selinux checks, but it's a reasonable approximation. That's a sufficiently useful improvement for my purposes, given the impractical nature of a 100% solution.

Jonathan
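For the reflink variant Jan mentions, the constant-time "copy" could in principle use the existing FICLONE ioctl, which shares extents instead of moving bytes. A minimal sketch, assuming a reflink-capable filesystem and leaving open whether a clone may cross a DAX/non-DAX boundary (expect EINVAL/EXDEV where it may not):

#include <fcntl.h>
#include <linux/fs.h>		/* FICLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * "Move" a sealed segment into a directory with a different DAX default
 * by cloning extents: O(1) in file size, no data copy.
 */
static int clone_into(const char *src, const char *dst)
{
	int ret = -1;
	int sfd = open(src, O_RDONLY);
	int dfd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);

	if (sfd >= 0 && dfd >= 0)
		ret = ioctl(dfd, FICLONE, sfd);	/* share extents with src */
	if (dfd >= 0)
		close(dfd);
	if (sfd >= 0)
		close(sfd);
	return ret;
}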
On Wed, Feb 26, 2020 at 12:17:40PM +0100, Jan Kara wrote:
> On Tue 25-02-20 11:09:37, Dave Chinner wrote:
> > /me wonders if the best thing to do is to add a ->fault callout to tell the filesystem to lock/unlock the inode right up at the top of the page fault path, outside even the mmap_sem. That means all the methods that the page fault calls are protected against S_DAX changes, and it gives us a low cost method of serialising page faults against DIO (e.g. via inode_dio_wait())....
>
> Well, that's going to be pretty hard. The main problem is: you cannot look up the VMA until you hold mmap_sem, and the inode is only reachable through the VMA. This is a fundamental problem because until you hold mmap_sem the address space can change, and thus the virtual address you are faulting can change which inode it is mapped to. So you would have to do some dance like:
>
>	lock mmap_sem
>	lookup vma
>	get inode reference
>	drop mmap_sem
>	tell fs about page fault
>	lock mmap_sem
>	is the vma still the same?
>
> And I'm pretty confident the overhead will be visible in page fault intensive workloads...

I did not get to this level of detail... Rather, I looked at it from a high level perspective and thought "does the mode need to change while someone has the mmap?" My thought is that it does not make a lot of sense. Generally the user has mmapped with some use case in mind (either DAX or non-DAX) and it seems reasonable to keep that mode consistent while the map is in place. So I punted and restricted the change.

Ira

> 								Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Wed, Feb 26, 2020 at 09:28:57AM +0000, Jonathan Halliday wrote:
>
> Hi All
>
> I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem. [...]
>
> Currently this requires closing the full (read_write) file, then copying it to a non-DAX device and reopening it (read_only) there. Clearly that's expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode that will instead go via the O/S cache in the traditional manner.

This patch set implements this capability.

> Bonus points if I can do it over non-overlapping ranges in a file without closing the DAX mode mmap, since then the segments are entirely logical instead of needing separate physical files.

But it is too hard to do while an mmap is in place, so it can't do this. So no bonus points...

> I note a comment below regarding a per-directory setting, but don't have the background to fully understand what's being suggested. However, I'll note here that I can live with a per-directory granularity, as relinking a file into a new dir is a constant time operation, whilst the move described above isn't. So if a per-directory granularity is easier than a per-file one that's fine, though as a person with only passing knowledge of filesystem design I don't see how having multiple links to a file can work cleanly in that case.

The more I think about it the more I'm not comfortable with a directory option. Soft links and hard links complicate that, IMO. The current inheritance model is at file creation time and the file sticks with that state (mode). That is easy enough and carries with the file, without the complexity of possibly looking at the wrong parent directory.

Ira

> Hope that helps.
>
> Jonathan
>
> P.S. I'll cheekily take the opportunity of having your attention to tack on one minor gripe about the current system: The only way to know if an mmap with MAP_SYNC will work is to try it and catch the error. [...]
>
> On 25/02/2020 19:37, Jeff Moyer wrote:
> > Christoph Hellwig <hch@lst.de> writes:
> >
> > > And my point is that if we ensure S_DAX can only be checked if there are no blocks on the file, it is fairly easy to provide the same guarantee. And I've not heard any argument that we really need more flexibility than that. In fact I think just being able to change it on the parent directory and inheriting the flag might be more than plenty, which would lead to a very simple implementation without any of the crazy overhead in this series.
> >
> > I know of one user who had at least mentioned it to me, so I cc'd him. Jonathan, can you describe your use case for being able to change a file between dax and non-dax modes? Or, if I'm misremembering, just correct me?
> >
> > Thanks!
> > Jeff
>
> --
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
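The create-time inheritance model Ira describes is driven by a per-inode DAX flag. A userspace sketch using the existing FS_IOC_FSSETXATTR interface, assuming a filesystem (e.g. XFS) that honours FS_XFLAG_DAX on directories so that newly created files inherit it:

#include <fcntl.h>
#include <linux/fs.h>		/* FS_IOC_FSGETXATTR, FS_XFLAG_DAX */
#include <sys/ioctl.h>
#include <unistd.h>

/* Mark a directory so files created inside it start life in DAX mode. */
static int dir_set_dax(const char *dirpath)
{
	struct fsxattr fsx;
	int ret = -1;
	int fd = open(dirpath, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == 0) {
		fsx.fsx_xflags |= FS_XFLAG_DAX;	/* inherited at create time */
		ret = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
	}
	close(fd);
	return ret;
}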
On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday <jonathan.halliday@redhat.com> wrote:
>
> Hi All
>
> I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem. [...]
>
> Currently this requires closing the full (read_write) file, then copying it to a non-DAX device and reopening it (read_only) there. Clearly that's expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode that will instead go via the O/S cache in the traditional manner. Bonus points if I can do it over non-overlapping ranges in a file without closing the DAX mode mmap, since then the segments are entirely logical instead of needing separate physical files.

Hi John,

IIRC we chatted about this at PIRL, right?

At the time it sounded more like mixed mode dax, i.e. dax writes, but cached reads. To me that's an optimization to optionally use dax for direct-I/O writes, with its existing set of page-cache coherence warts, and not a capability to dynamically switch the dax-mode. mmap+MAP_SYNC seems the wrong interface for this. This writeup mentions bypassing kernel call overhead, but I don't see how a dax-write syscall is cheaper than an mmap syscall plus fault. If direct-I/O to a dax-capable file bypasses the block layer, isn't that about the maximum of kernel overhead that can be cut out of this use case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block update-in-place writes, not append writes.
On Wed 26-02-20 08:46:42, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday <jonathan.halliday@redhat.com> wrote:
> > I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem. [...]
>
> Hi John,
>
> IIRC we chatted about this at PIRL, right?
>
> At the time it sounded more like mixed mode dax, i.e. dax writes, but cached reads. To me that's an optimization to optionally use dax for direct-I/O writes, with its existing set of page-cache coherence warts, and not a capability to dynamically switch the dax-mode. mmap+MAP_SYNC seems the wrong interface for this. This writeup mentions bypassing kernel call overhead, but I don't see how a dax-write syscall is cheaper than an mmap syscall plus fault. If direct-I/O to a dax-capable file bypasses the block layer, isn't that about the maximum of kernel overhead that can be cut out of this use case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block update-in-place writes, not append writes.

Well, even for appends you'll pay the cost only once per page (or maybe even once per huge page) when using MAP_SYNC. With a syscall you'll pay once per write. So although it would be good to check real numbers, the design isn't nonsensical to me.

								Honza
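To make the amortisation concrete: with 64-byte records and 4KiB pages, one minor fault covers 4096/64 = 64 appends that would otherwise each be a write(2). A userspace sketch of the append pattern under discussion, assuming the file is already sized/preallocated (which is exactly the probe gripe above); the cache flush is x86-only (_mm_clwb, compile with -mclwb) and the fallback defines are assumptions:

#define _GNU_SOURCE
#include <fcntl.h>
#include <immintrin.h>		/* _mm_clwb, _mm_sfence */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE	0x03	/* assumption: uapi value */
#endif
#ifndef MAP_SYNC
#define MAP_SYNC		0x80000	/* assumption: uapi value */
#endif

#define REC_SZ	64	/* 4096 / 64 = 64 appends amortise one 4KiB fault */

static int append_records(const char *path, size_t len)
{
	char *log;
	size_t off;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;
	log = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	close(fd);
	if (log == MAP_FAILED)
		return -1;	/* e.g. fs/device cannot honour MAP_SYNC */

	for (off = 0; off + REC_SZ <= len; off += REC_SZ) {
		memset(log + off, 0xab, REC_SZ);	/* stand-in for a record */
		_mm_clwb(log + off);			/* write back the cache line */
		_mm_sfence();				/* order flush vs. later stores */
	}
	return munmap(log, len);
}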
On Wed, Feb 26, 2020 at 9:20 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 26-02-20 08:46:42, Dan Williams wrote:
> > On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday <jonathan.halliday@redhat.com> wrote:
> > > I'm a middleware developer, focused on how Java (JVM) workloads can benefit from app-direct mode pmem. [...]
> >
> > [...] Otherwise MAP_SYNC is a facility to achieve efficient sub-block update-in-place writes, not append writes.
>
> Well, even for appends you'll pay the cost only once per page (or maybe even once per huge page) when using MAP_SYNC. With a syscall you'll pay once per write. So although it would be good to check real numbers, the design isn't nonsensical to me.

True. Jonathan, how many writes per page are we talking about in this case?
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..4a10a232f8e2 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -934,6 +934,22 @@ cache in your filesystem.  The following members are defined:
 	Called during swapoff on files where swap_activate was
 	successful.
 
+Changing DAX 'state' dynamically
+--------------------------------
+
+Some file systems which support DAX want to be able to change the DAX state
+dynamically. To switch the state safely we lock the inode state in all "normal"
+file system operations and restrict state changes to those operations. The
+specific rules are:
+
+	1) the direct_IO address_space_operation must be supported in all
+	   potential a_ops vectors for any state supported by the inode.
+
+	2) DAX state changes shall not be allowed while the file is mmap'ed
+	3) For non-mmapped operations the VFS layer must take the read lock for any
+	   use of IS_DAX()
+	4) Filesystems take the write lock when changing DAX states.
+
 The File Object
 ===============
 
diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..9b15f73d1079 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (error)
 		return error;
 
+	/* DAX read state should already be held here */
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..6e4f1cc872f2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 #endif
 	inode->i_flctx = NULL;
 	this_cpu_inc(nr_inodes);
+	init_rwsem(&inode->i_aops_sem);
 
 	return 0;
 out:
@@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput);
  */
 int bmap(struct inode *inode, sector_t *block)
 {
-	if (!inode->i_mapping->a_ops->bmap)
-		return -EINVAL;
+	int ret = 0;
+
+	inode_aops_down_read(inode);
+	if (!inode->i_mapping->a_ops->bmap) {
+		ret = -EINVAL;
+		goto err;
+	}
 
 	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
-	return 0;
+
+err:
+	inode_aops_up_read(inode);
+	return ret;
 }
 EXPORT_SYMBOL(bmap);
 #endif
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..e313a34d5fa6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
+		/* DAX state read should already be held here */
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..3abf0bfac462 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	inode_aops_down_read(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	inode_aops_up_read(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	inode_aops_down_read(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	inode_aops_up_read(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/fs/stat.c b/fs/stat.c
index 894699c74dde..274b3ccc82b1 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	inode_aops_down_read(inode);
 	if (IS_DAX(inode))
 		stat->attributes |= STATX_ATTR_DAX;
+	inode_aops_up_read(inode);
 
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 836a1f09be03..3e83a97dc047 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -420,6 +420,7 @@ xfs_iget_cache_hit(
 	rcu_read_unlock();
 
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+	ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
 	error = xfs_reinit_inode(mp, inode);
 	if (error) {
 		bool wake;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d1e533a07d..ad0f2368031b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -40,6 +40,7 @@
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/percpu-rwsem.h>
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
 
@@ -359,6 +360,11 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/**
+ * NOTE: DO NOT define new functions in address_space_operations without first
+ * considering how dynamic DAX states are to be supported.  See the
+ * inode_aops_*_read() functions.
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -735,6 +741,9 @@ struct inode {
 #endif
 
 	void			*i_private; /* fs or device private pointer */
+#if defined(CONFIG_FS_DAX)
+	struct rw_semaphore	i_aops_sem;
+#endif
 } __randomize_layout;
 
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
@@ -1817,6 +1826,11 @@ struct block_device_operations;
 
 struct iov_iter;
 
+/**
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic address_space operations are to be supported.  See
+ * the inode_aops_*_read() functions in this file.
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1889,16 +1903,64 @@ struct inode_operations {
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
 } ____cacheline_aligned;
 
+#if defined(CONFIG_FS_DAX)
+/*
+ * Filesystems wishing to support dynamic DAX states must do the following.
+ *
+ * 1) the direct_IO address_space_operation must be supported in all potential
+ *    a_ops vectors for any state supported by the inode.  This is because the
+ *    direct_IO function is used as a flag long before the function is called.
+ *
+ * 2) DAX state changes shall not be allowed while the file is mmap'ed
+ * 3) For non-mmapped operations the VFS layer must take the read lock for any
+ *    use of IS_DAX()
+ * 4) Filesystems take the write lock when changing DAX states.
+ */
+static inline void inode_aops_down_read(struct inode *inode)
+{
+	down_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_read(struct inode *inode)
+{
+	up_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_down_write(struct inode *inode)
+{
+	down_write(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_write(struct inode *inode)
+{
+	up_write(&inode->i_aops_sem);
+}
+#else /* !CONFIG_FS_DAX */
+#define inode_aops_down_read(inode)	do { (void)(inode); } while (0)
+#define inode_aops_up_read(inode)	do { (void)(inode); } while (0)
+#define inode_aops_down_write(inode)	do { (void)(inode); } while (0)
+#define inode_aops_up_write(inode)	do { (void)(inode); } while (0)
+#endif /* CONFIG_FS_DAX */
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode *inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode *inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..6a30febb11e0 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 	bdi = inode_to_bdi(mapping->host);
 
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		return ret;
 	}
 
 	/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..3a7863ba51b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		 * and return.  Otherwise fallthrough to buffered io for
 		 * the rest of the read.  Buffered reads will not work for
 		 * DAX files, so don't bother trying.
+		 *
+		 * IS_DAX is protected under ->read_iter lock
 		 */
 		if (retval < 0 || !count || iocb->ki_pos >= size ||
 		    IS_DAX(inode))
@@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * holes, for example.  For DAX files, a buffered write will
 		 * not succeed (even if it did, DAX does not handle dirty
 		 * page-cache pages correctly).
+		 *
+		 * IS_DAX is protected under ->write_iter lock
 		 */
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..3d05bd10d83e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
+	/* Should not need locking here because mmap is not allowed */
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..f048178e2b93 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				inode_aops_down_read(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				inode_aops_up_read(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..a4fb0670137d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (down_write_killable(&mm->mmap_sem))
+		if (file)
+			inode_aops_down_read(file_inode(file));
+		if (down_write_killable(&mm->mmap_sem)) {
+			if (file)
+				inode_aops_up_read(file_inode(file));
 			return -EINTR;
+		}
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
 				    &populate, &uf);
 		up_write(&mm->mmap_sem);
+		if (file)
+			inode_aops_up_read(file_inode(file));
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);
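For completeness, a sketch of how a filesystem might use the write side of the new semaphore when actually flipping the DAX state. The example aops vectors and the -EBUSY policy are illustrative only; the real switching logic lands in later patches of the series for XFS:

/*
 * Illustration: quiesce a_ops users (rule 4), refuse while mmap'ed
 * (rule 2), then swap S_DAX and the aops vector atomically w.r.t. the
 * readers shown in the patch above.
 */
static int example_switch_dax(struct inode *inode, bool dax)
{
	int error = 0;

	inode_aops_down_write(inode);		/* block all a_ops users */

	if (mapping_mapped(inode->i_mapping)) {
		error = -EBUSY;			/* no change under an mmap */
		goto out;
	}

	/* Flush and drop the page cache; DAX and page cache don't mix. */
	error = filemap_write_and_wait(inode->i_mapping);
	if (error)
		goto out;
	truncate_inode_pages(inode->i_mapping, 0);

	if (dax) {
		inode->i_flags |= S_DAX;
		inode->i_mapping->a_ops = &example_dax_aops;	/* hypothetical */
	} else {
		inode->i_flags &= ~S_DAX;
		inode->i_mapping->a_ops = &example_aops;	/* hypothetical */
	}
out:
	inode_aops_up_write(inode);
	return error;
}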