Message ID | bfae590045c7fc37b7ccef10b9cec318012979fd.1736488799.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs/dax: Fix ZONE_DEVICE page reference counts | expand |
On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote: > FS DAX requires file systems to call into the DAX layout prior to unlinking > inodes to ensure there is no ongoing DMA or other remote access to the > direct mapped page. The fuse file system implements > fuse_dax_break_layouts() to do this which includes a comment indicating > that passing dmap_end == 0 leads to unmapping of the whole file. > > However this is not true - passing dmap_end == 0 will not unmap anything > before dmap_start, and further more dax_layout_busy_page_range() will not > scan any of the range to see if there maybe ongoing DMA access to the > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts() > which will invalidate the entire file range to > dax_layout_busy_page_range(). Hi Alistair, Thanks for fixing DAX related issues for virtiofs. I am wondering how are you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust virtiofsd. C version of virtiofsd used to have out of the tree patches for DAX. But C version got deprecated long time ago. Do you have another implementation of virtiofsd somewhere else which supports DAX and allows for testing DAX related changes? Thanks Vivek > > Signed-off-by: Alistair Popple <apopple@nvidia.com> > Co-developed-by: Dan Williams <dan.j.williams@intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault path") > Cc: Vivek Goyal <vgoyal@redhat.com> > > --- > > Changes for v6: > > - Original patch had a misplaced hunk due to a bad rebase. > - Reworked fix based on Dan's comments. > --- > fs/fuse/dax.c | 1 - > fs/fuse/dir.c | 2 +- > fs/fuse/file.c | 4 ++-- > 3 files changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c > index 9abbc2f..455c4a1 100644 > --- a/fs/fuse/dax.c > +++ b/fs/fuse/dax.c > @@ -681,7 +681,6 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, > 0, 0, fuse_wait_dax_page(inode)); > } > > -/* dmap_end == 0 leads to unmapping of whole file */ > int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start, > u64 dmap_end) > { > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 0b2f856..bc6c893 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -1936,7 +1936,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, > if (FUSE_IS_DAX(inode) && is_truncate) { > filemap_invalidate_lock(mapping); > fault_blocked = true; > - err = fuse_dax_break_layouts(inode, 0, 0); > + err = fuse_dax_break_layouts(inode, 0, -1); > if (err) { > filemap_invalidate_unlock(mapping); > return err; > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index 082ee37..cef7a8f 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -253,7 +253,7 @@ static int fuse_open(struct inode *inode, struct file *file) > > if (dax_truncate) { > filemap_invalidate_lock(inode->i_mapping); > - err = fuse_dax_break_layouts(inode, 0, 0); > + err = fuse_dax_break_layouts(inode, 0, -1); > if (err) > goto out_inode_unlock; > } > @@ -2890,7 +2890,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, > inode_lock(inode); > if (block_faults) { > filemap_invalidate_lock(inode->i_mapping); > - err = fuse_dax_break_layouts(inode, 0, 0); > + err = fuse_dax_break_layouts(inode, 0, -1); > if (err) > goto out; > } > -- > git-series 0.9.1 >
Vivek Goyal wrote: > On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote: > > FS DAX requires file systems to call into the DAX layout prior to unlinking > > inodes to ensure there is no ongoing DMA or other remote access to the > > direct mapped page. The fuse file system implements > > fuse_dax_break_layouts() to do this which includes a comment indicating > > that passing dmap_end == 0 leads to unmapping of the whole file. > > > > However this is not true - passing dmap_end == 0 will not unmap anything > > before dmap_start, and further more dax_layout_busy_page_range() will not > > scan any of the range to see if there maybe ongoing DMA access to the > > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts() > > which will invalidate the entire file range to > > dax_layout_busy_page_range(). > > Hi Alistair, > > Thanks for fixing DAX related issues for virtiofs. I am wondering how are > you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust > virtiofsd. C version of virtiofsd used to have out of the tree patches > for DAX. But C version got deprecated long time ago. > > Do you have another implementation of virtiofsd somewhere else which > supports DAX and allows for testing DAX related changes? I have personally never seen a virtiofs-dax test. It sounds like you are saying we can deprecate that support if there are no longer any users. Or, do you expect that C-virtiofsd is alive in the ecosystem?
Hi, On February 6, 2025 1:10:15 AM GMT+01:00, Dan Williams <dan.j.williams@intel.com> wrote: >Vivek Goyal wrote: >> On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote: >> > FS DAX requires file systems to call into the DAX layout prior to unlinking >> > inodes to ensure there is no ongoing DMA or other remote access to the >> > direct mapped page. The fuse file system implements >> > fuse_dax_break_layouts() to do this which includes a comment indicating >> > that passing dmap_end == 0 leads to unmapping of the whole file. >> > >> > However this is not true - passing dmap_end == 0 will not unmap anything >> > before dmap_start, and further more dax_layout_busy_page_range() will not >> > scan any of the range to see if there maybe ongoing DMA access to the >> > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts() >> > which will invalidate the entire file range to >> > dax_layout_busy_page_range(). >> >> Hi Alistair, >> >> Thanks for fixing DAX related issues for virtiofs. I am wondering how are >> you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust >> virtiofsd. C version of virtiofsd used to have out of the tree patches >> for DAX. But C version got deprecated long time ago. >> >> Do you have another implementation of virtiofsd somewhere else which >> supports DAX and allows for testing DAX related changes? > >I have personally never seen a virtiofs-dax test. It sounds like you are >saying we can deprecate that support if there are no longer any users. >Or, do you expect that C-virtiofsd is alive in the ecosystem? I accidentally replied offlist, but I wanted to mention that libkrun supports DAX and we use it in muvm. It's a critical part of x11bridge functionality, since it uses DAX to share X11 shm fences between X11 clients in the VM and the XWayland server on the host, which only works if the mmaps are coherent. (Sorry for the unwrapped reply, I'm on mobile right now.) ~~ Lina
On Wed, Feb 05, 2025 at 04:10:15PM -0800, Dan Williams wrote: > Vivek Goyal wrote: > > On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote: > > > FS DAX requires file systems to call into the DAX layout prior to unlinking > > > inodes to ensure there is no ongoing DMA or other remote access to the > > > direct mapped page. The fuse file system implements > > > fuse_dax_break_layouts() to do this which includes a comment indicating > > > that passing dmap_end == 0 leads to unmapping of the whole file. > > > > > > However this is not true - passing dmap_end == 0 will not unmap anything > > > before dmap_start, and further more dax_layout_busy_page_range() will not > > > scan any of the range to see if there maybe ongoing DMA access to the > > > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts() > > > which will invalidate the entire file range to > > > dax_layout_busy_page_range(). > > > > Hi Alistair, > > > > Thanks for fixing DAX related issues for virtiofs. I am wondering how are > > you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust > > virtiofsd. C version of virtiofsd used to have out of the tree patches > > for DAX. But C version got deprecated long time ago. > > > > Do you have another implementation of virtiofsd somewhere else which > > supports DAX and allows for testing DAX related changes? > > I have personally never seen a virtiofs-dax test. It sounds like you are > saying we can deprecate that support if there are no longer any users. > Or, do you expect that C-virtiofsd is alive in the ecosystem? Ashai Lina responded that they need and test DAX using libkrun. C version of virtiofsd is now gone. We are actively working and testing Rust version of virtiofsd. We have not been able to add DAX support to it yet for various reasons. Biggest unsolved problem with viritofsd DAX mode is guest process should get a SIGBUS if it tries to access a file beyond the file. This can happen if file has been truncated on the host (while it is still mapped inside the guest). I had tried to summarize the problem in this presentation in the section "KVM Page fault error handling". https://kvm-forum.qemu.org/2020/KVMForum2020_APF.pdf This is a tricky problem to handle. Once this gets handled, it becomes safer to use DAX with virtiofs. Otherwise you can't share the filesystem with other guests in DAX mode and use cases are limited. And then there are challenges at QEMU level. virtiofsd needs additional vhost-user commands to implement DAX and these never went upstream in QEMU. I hope these challenges are sorted at some point of time. I think virtiofs DAX is a very cool piece of technology. I would not like to deprecate it. It has its own problems and challenges and once we are able to solve these, it might see wider usage/adoption. Thanks Vivek
On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote: > And then there are challenges at QEMU level. virtiofsd needs additional > vhost-user commands to implement DAX and these never went upstream in > QEMU. I hope these challenges are sorted at some point of time. Albert Esteve has been working on QEMU support: https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/ He has a viable solution. I think the remaining issue is how to best structure the memory regions. The reason for slow progress is not because it can't be done, it's probably just because this is a background task. Please discuss with Albert if QEMU support is urgent. Stefan
Hi! On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote: > > And then there are challenges at QEMU level. virtiofsd needs additional > > vhost-user commands to implement DAX and these never went upstream in > > QEMU. I hope these challenges are sorted at some point of time. > > Albert Esteve has been working on QEMU support: > https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/ > > He has a viable solution. I think the remaining issue is how to best > structure the memory regions. The reason for slow progress is not > because it can't be done, it's probably just because this is a > background task. It is partially that, indeed. But what has me blocked for now on posting the next version is that I was reworking a bit the MMAP strategy. Following David comments, I am relying more on RAMBlocks and subregions for mmaps. But this turned out more difficult than anticipated. I hope I can make it work this month and then post the next version. If there are no major blockers/reworks, further iterations on the patch shall go smoother. I have a separate patch for the vhost-user spec which could iterate faster, if that'd help. BR, Albert. > > Please discuss with Albert if QEMU support is urgent. > > Stefan
On Thu, Feb 06, 2025 at 03:59:03PM +0100, Albert Esteve wrote: > Hi! > > On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote: > > > And then there are challenges at QEMU level. virtiofsd needs additional > > > vhost-user commands to implement DAX and these never went upstream in > > > QEMU. I hope these challenges are sorted at some point of time. > > > > Albert Esteve has been working on QEMU support: > > https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/ > > > > He has a viable solution. I think the remaining issue is how to best > > structure the memory regions. The reason for slow progress is not > > because it can't be done, it's probably just because this is a > > background task. > > It is partially that, indeed. But what has me blocked for now on posting the > next version is that I was reworking a bit the MMAP strategy. > Following David comments, I am relying more on RAMBlocks and > subregions for mmaps. But this turned out more difficult than anticipated. > > I hope I can make it work this month and then post the next version. > If there are no major blockers/reworks, further iterations on the > patch shall go smoother. > > I have a separate patch for the vhost-user spec which could > iterate faster, if that'd help. Let's see if anyone needs the vhost-user spec extension now. Otherwise it seems fine to merge it together with the implementation of that spec. Stefan
On 06.02.25 15:59, Albert Esteve wrote: > Hi! > > On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >> >> On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote: >>> And then there are challenges at QEMU level. virtiofsd needs additional >>> vhost-user commands to implement DAX and these never went upstream in >>> QEMU. I hope these challenges are sorted at some point of time. >> >> Albert Esteve has been working on QEMU support: >> https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/ >> >> He has a viable solution. I think the remaining issue is how to best >> structure the memory regions. The reason for slow progress is not >> because it can't be done, it's probably just because this is a >> background task. > > It is partially that, indeed. But what has me blocked for now on posting the > next version is that I was reworking a bit the MMAP strategy. > Following David comments, I am relying more on RAMBlocks and > subregions for mmaps. But this turned out more difficult than anticipated. Yeah, if that turns out to be too painful, we could start with the previous approach and work on that later. I also did not expect that to become that complicated.
Asahi Lina wrote: > Hi, > > On February 6, 2025 1:10:15 AM GMT+01:00, Dan Williams <dan.j.williams@intel.com> wrote: > >Vivek Goyal wrote: > >> On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote: > >> > FS DAX requires file systems to call into the DAX layout prior to unlinking > >> > inodes to ensure there is no ongoing DMA or other remote access to the > >> > direct mapped page. The fuse file system implements > >> > fuse_dax_break_layouts() to do this which includes a comment indicating > >> > that passing dmap_end == 0 leads to unmapping of the whole file. > >> > > >> > However this is not true - passing dmap_end == 0 will not unmap anything > >> > before dmap_start, and further more dax_layout_busy_page_range() will not > >> > scan any of the range to see if there maybe ongoing DMA access to the > >> > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts() > >> > which will invalidate the entire file range to > >> > dax_layout_busy_page_range(). > >> > >> Hi Alistair, > >> > >> Thanks for fixing DAX related issues for virtiofs. I am wondering how are > >> you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust > >> virtiofsd. C version of virtiofsd used to have out of the tree patches > >> for DAX. But C version got deprecated long time ago. > >> > >> Do you have another implementation of virtiofsd somewhere else which > >> supports DAX and allows for testing DAX related changes? > > > >I have personally never seen a virtiofs-dax test. It sounds like you are > >saying we can deprecate that support if there are no longer any users. > >Or, do you expect that C-virtiofsd is alive in the ecosystem? > > I accidentally replied offlist, but I wanted to mention that libkrun > supports DAX and we use it in muvm. It's a critical part of x11bridge > functionality, since it uses DAX to share X11 shm fences between X11 > clients in the VM and the XWayland server on the host, which only > works if the mmaps are coherent. Ah, good to hear. It would be lovely to integrate an muvm smoketest somewhere in https://github.com/pmem/ndctl/tree/main/test so that we have early warning on potential breakage.
On 2/7/25 4:44 AM, Dan Williams wrote: > Asahi Lina wrote: >> Hi, >> >> On February 6, 2025 1:10:15 AM GMT+01:00, Dan Williams <dan.j.williams@intel.com> wrote: >>> Vivek Goyal wrote: >>>> On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote: >>>>> FS DAX requires file systems to call into the DAX layout prior to unlinking >>>>> inodes to ensure there is no ongoing DMA or other remote access to the >>>>> direct mapped page. The fuse file system implements >>>>> fuse_dax_break_layouts() to do this which includes a comment indicating >>>>> that passing dmap_end == 0 leads to unmapping of the whole file. >>>>> >>>>> However this is not true - passing dmap_end == 0 will not unmap anything >>>>> before dmap_start, and further more dax_layout_busy_page_range() will not >>>>> scan any of the range to see if there maybe ongoing DMA access to the >>>>> range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts() >>>>> which will invalidate the entire file range to >>>>> dax_layout_busy_page_range(). >>>> >>>> Hi Alistair, >>>> >>>> Thanks for fixing DAX related issues for virtiofs. I am wondering how are >>>> you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust >>>> virtiofsd. C version of virtiofsd used to have out of the tree patches >>>> for DAX. But C version got deprecated long time ago. >>>> >>>> Do you have another implementation of virtiofsd somewhere else which >>>> supports DAX and allows for testing DAX related changes? >>> >>> I have personally never seen a virtiofs-dax test. It sounds like you are >>> saying we can deprecate that support if there are no longer any users. >>> Or, do you expect that C-virtiofsd is alive in the ecosystem? >> >> I accidentally replied offlist, but I wanted to mention that libkrun >> supports DAX and we use it in muvm. It's a critical part of x11bridge >> functionality, since it uses DAX to share X11 shm fences between X11 >> clients in the VM and the XWayland server on the host, which only >> works if the mmaps are coherent. > > Ah, good to hear. It would be lovely to integrate an muvm smoketest > somewhere in https://github.com/pmem/ndctl/tree/main/test so that we > have early warning on potential breakage. I think you'll probably want a smoke test using libkrun directly, since muvm is quite application-specific. It's really easy to write a quick C file to call into libkrun and spin up a VM. If it's supposed to test an arbitrary kernel though, I'm not sure what the test setup would look like. You'd need to run it on a host (whose kernel is mostly irrelevant) and then use libkrun to spin up a VM with a guest, which then runs the test. libkrun normally uses a bundled kernel though (shipped as libkrunfw), we'd need to add an API to specify an external kernel binary I guess? I'm happy to help with that, but I'll need to know a bit more about the intended usage first. I *think* most of the scaffolding for running arbitrary kernels is already planned, since there was some talk of running the host kernel as the guest kernel, so this wouldn't add much work on top of that. I definitely have a few tests in mind if we do put this together, since I know of one or two things that are definitely broken in DAX upstream right now (which I *think* this series fixes but I never got around to testing it...). Cc: slp for libkrun. ~~ Lina
On Thu, Feb 6, 2025 at 7:22 PM David Hildenbrand <david@redhat.com> wrote: > > On 06.02.25 15:59, Albert Esteve wrote: > > Hi! > > > > On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >> > >> On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote: > >>> And then there are challenges at QEMU level. virtiofsd needs additional > >>> vhost-user commands to implement DAX and these never went upstream in > >>> QEMU. I hope these challenges are sorted at some point of time. > >> > >> Albert Esteve has been working on QEMU support: > >> https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/ > >> > >> He has a viable solution. I think the remaining issue is how to best > >> structure the memory regions. The reason for slow progress is not > >> because it can't be done, it's probably just because this is a > >> background task. > > > > It is partially that, indeed. But what has me blocked for now on posting the > > next version is that I was reworking a bit the MMAP strategy. > > Following David comments, I am relying more on RAMBlocks and > > subregions for mmaps. But this turned out more difficult than anticipated. > > Yeah, if that turns out to be too painful, we could start with the > previous approach and work on that later. I also did not expect that to > become that complicated. Thanks. I'd like to do it properly, so I think will try a bit more to get it to work. Maybe another week. If I do not manage, I may do what you suggested (I'll align with you first) to move the patch forward. That said, if I end up doing that, I will definitively revisit it later. BR, Albert. > > -- > Cheers, > > David / dhildenb >
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 9abbc2f..455c4a1 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -681,7 +681,6 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, 0, 0, fuse_wait_dax_page(inode)); } -/* dmap_end == 0 leads to unmapping of whole file */ int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start, u64 dmap_end) { diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 0b2f856..bc6c893 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1936,7 +1936,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry, if (FUSE_IS_DAX(inode) && is_truncate) { filemap_invalidate_lock(mapping); fault_blocked = true; - err = fuse_dax_break_layouts(inode, 0, 0); + err = fuse_dax_break_layouts(inode, 0, -1); if (err) { filemap_invalidate_unlock(mapping); return err; diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 082ee37..cef7a8f 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -253,7 +253,7 @@ static int fuse_open(struct inode *inode, struct file *file) if (dax_truncate) { filemap_invalidate_lock(inode->i_mapping); - err = fuse_dax_break_layouts(inode, 0, 0); + err = fuse_dax_break_layouts(inode, 0, -1); if (err) goto out_inode_unlock; } @@ -2890,7 +2890,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, inode_lock(inode); if (block_faults) { filemap_invalidate_lock(inode->i_mapping); - err = fuse_dax_break_layouts(inode, 0, 0); + err = fuse_dax_break_layouts(inode, 0, -1); if (err) goto out; }