Message ID | cover.fe275e9819458a4bbb9451b888cafb88af8867d4.1712796818.git-series.apopple@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | fs/dax: Fix FS DAX page reference counts | expand |
Alistair Popple wrote: > FS DAX pages have always maintained their own page reference counts > without following the normal rules for page reference counting. In > particular pages are considered free when the refcount hits one rather > than zero and refcounts are not added when mapping the page. > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary > mechanism for allowing GUP to hold references on the page (see > get_dev_pagemap). However there doesn't seem to be any reason why FS > DAX pages need their own reference counting scheme. This is fair. However, for anyone coming in fresh to this situation maybe some more "how we get here" history helps. That longer story is here: http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/ > This RFC is an initial attempt at removing the special reference > counting and instead refcount FS DAX pages the same as normal pages. > > There are still a couple of rough edges - in particular I haven't > completely removed the devmap PTE bit references from arch specific > code and there is probably some more cleanup of dev_pagemap reference > counting that could be done, particular in mm/gup.c. I also haven't > yet compiled on anything other than x86_64. > > Before continuing further with this clean-up though I would appreciate > some feedback on the viability of this approach and any issues I may > have overlooked, as I am not intimately familiar with FS DAX code (or > for that matter the FS layer in general). > > I have of course run some basic testing which didn't reveal any > problems. FWIW I see the following with the ndctl/dax test-suite (double-checked that vanilla v6.6 passes). I will take a look at the patches, but in the meantime... # meson test -C build --suite ndctl:dax ninja: no work to do. ninja: Entering directory `/root/git/ndctl/build' [1/70] Generating version.h with a custom command 1/13 ndctl:dax / daxdev-errors.sh OK 14.46s 2/13 ndctl:dax / multi-dax.sh OK 2.70s 3/13 ndctl:dax / sub-section.sh OK 7.21s 4/13 ndctl:dax / dax-dev OK 0.08s [5/13]
On Thu, Apr 11, 2024 at 10:28:56AM -0700, Dan Williams wrote: > Alistair Popple wrote: > > FS DAX pages have always maintained their own page reference counts > > without following the normal rules for page reference counting. In > > particular pages are considered free when the refcount hits one rather > > than zero and refcounts are not added when mapping the page. > > > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary > > mechanism for allowing GUP to hold references on the page (see > > get_dev_pagemap). However there doesn't seem to be any reason why FS > > DAX pages need their own reference counting scheme. > > This is fair. However, for anyone coming in fresh to this situation > maybe some more "how we get here" history helps. That longer story is > here: > > http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/ This never got merged? :( Jason
Jason Gunthorpe wrote: > On Thu, Apr 11, 2024 at 10:28:56AM -0700, Dan Williams wrote: > > Alistair Popple wrote: > > > FS DAX pages have always maintained their own page reference counts > > > without following the normal rules for page reference counting. In > > > particular pages are considered free when the refcount hits one rather > > > than zero and refcounts are not added when mapping the page. > > > > > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary > > > mechanism for allowing GUP to hold references on the page (see > > > get_dev_pagemap). However there doesn't seem to be any reason why FS > > > DAX pages need their own reference counting scheme. > > > > This is fair. However, for anyone coming in fresh to this situation > > maybe some more "how we get here" history helps. That longer story is > > here: > > > > http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/ > > This never got merged? :( Yeah... I got hung up on the "allocation" API to take ZONE_DEVICE pages from recfount 0 to 1 and then never circled back.
Dan Williams <dan.j.williams@intel.com> writes: > Alistair Popple wrote: >> FS DAX pages have always maintained their own page reference counts >> without following the normal rules for page reference counting. In >> particular pages are considered free when the refcount hits one rather >> than zero and refcounts are not added when mapping the page. > >> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary >> mechanism for allowing GUP to hold references on the page (see >> get_dev_pagemap). However there doesn't seem to be any reason why FS >> DAX pages need their own reference counting scheme. > > This is fair. However, for anyone coming in fresh to this situation > maybe some more "how we get here" history helps. That longer story is > here: > > http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/ Good idea. >> This RFC is an initial attempt at removing the special reference >> counting and instead refcount FS DAX pages the same as normal pages. >> >> There are still a couple of rough edges - in particular I haven't >> completely removed the devmap PTE bit references from arch specific >> code and there is probably some more cleanup of dev_pagemap reference >> counting that could be done, particular in mm/gup.c. I also haven't >> yet compiled on anything other than x86_64. >> >> Before continuing further with this clean-up though I would appreciate >> some feedback on the viability of this approach and any issues I may >> have overlooked, as I am not intimately familiar with FS DAX code (or >> for that matter the FS layer in general). >> >> I have of course run some basic testing which didn't reveal any >> problems. > > FWIW I see the following with the ndctl/dax test-suite (double-checked > that vanilla v6.6 passes). I will take a look at the patches, but in the > meantime... Hmmm... > # meson test -C build --suite ndctl:dax > ninja: no work to do. > ninja: Entering directory `/root/git/ndctl/build' > [1/70] Generating version.h with a custom command > 1/13 ndctl:dax / daxdev-errors.sh OK 14.46s > 2/13 ndctl:dax / multi-dax.sh OK 2.70s > 3/13 ndctl:dax / sub-section.sh OK 7.21s > 4/13 ndctl:dax / dax-dev OK 0.08s > [5/13]
Alistair Popple <apopple@nvidia.com> writes: > Dan Williams <dan.j.williams@intel.com> writes: > >> Alistair Popple wrote: >>> FS DAX pages have always maintained their own page reference counts >>> without following the normal rules for page reference counting. In >>> particular pages are considered free when the refcount hits one rather >>> than zero and refcounts are not added when mapping the page. >> >>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary >>> mechanism for allowing GUP to hold references on the page (see >>> get_dev_pagemap). However there doesn't seem to be any reason why FS >>> DAX pages need their own reference counting scheme. >> >> This is fair. However, for anyone coming in fresh to this situation >> maybe some more "how we get here" history helps. That longer story is >> here: >> >> http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/ > > Good idea. > >>> This RFC is an initial attempt at removing the special reference >>> counting and instead refcount FS DAX pages the same as normal pages. >>> >>> There are still a couple of rough edges - in particular I haven't >>> completely removed the devmap PTE bit references from arch specific >>> code and there is probably some more cleanup of dev_pagemap reference >>> counting that could be done, particular in mm/gup.c. I also haven't >>> yet compiled on anything other than x86_64. >>> >>> Before continuing further with this clean-up though I would appreciate >>> some feedback on the viability of this approach and any issues I may >>> have overlooked, as I am not intimately familiar with FS DAX code (or >>> for that matter the FS layer in general). >>> >>> I have of course run some basic testing which didn't reveal any >>> problems. >> >> FWIW I see the following with the ndctl/dax test-suite (double-checked >> that vanilla v6.6 passes). I will take a look at the patches, but in the >> meantime... > > Hmmm... > >> # meson test -C build --suite ndctl:dax >> ninja: no work to do. >> ninja: Entering directory `/root/git/ndctl/build' >> [1/70] Generating version.h with a custom command >> 1/13 ndctl:dax / daxdev-errors.sh OK 14.46s >> 2/13 ndctl:dax / multi-dax.sh OK 2.70s >> 3/13 ndctl:dax / sub-section.sh OK 7.21s >> 4/13 ndctl:dax / dax-dev OK 0.08s >> [5/13]
On Fri, Apr 12, 2024 at 04:55:31PM +1000, Alistair Popple wrote: > Ok, I think I found the dragons you were talking about earlier for > device-dax. I completely broke that because as you've already pointed > out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX > (because the pages are essentially normal pages now anyway), but we > don't have a PMD equivalent of vm_normal_page() which leads to all sorts > of issues for DEVDAX. What about vm_normal_page() depends on the radix level ? Doesn't DEVDAX memory have struct page too? > So I will probably have to add something like that unless we only need > to support large (pmd/pud) mappings of DEVDAX pages on systems with > CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter > based on pte_special(). pte_special should only be used by memory without a struct page, is that what DEVDAX is? Jason
Jason Gunthorpe wrote: > On Fri, Apr 12, 2024 at 04:55:31PM +1000, Alistair Popple wrote: > > > Ok, I think I found the dragons you were talking about earlier for > > device-dax. I completely broke that because as you've already pointed > > out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX > > (because the pages are essentially normal pages now anyway), but we > > don't have a PMD equivalent of vm_normal_page() which leads to all sorts > > of issues for DEVDAX. > > What about vm_normal_page() depends on the radix level ? > > Doesn't DEVDAX memory have struct page too? Yes. > > So I will probably have to add something like that unless we only need > > to support large (pmd/pud) mappings of DEVDAX pages on systems with > > CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter > > based on pte_special(). > > pte_special should only be used by memory without a struct page, is > that what DEVDAX is? Right, I don't think pte_special is applicable for any DAX pages.
FS DAX pages have always maintained their own page reference counts without following the normal rules for page reference counting. In particular pages are considered free when the refcount hits one rather than zero and refcounts are not added when mapping the page. Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary mechanism for allowing GUP to hold references on the page (see get_dev_pagemap). However there doesn't seem to be any reason why FS DAX pages need their own reference counting scheme. This RFC is an initial attempt at removing the special reference counting and instead refcount FS DAX pages the same as normal pages. There are still a couple of rough edges - in particular I haven't completely removed the devmap PTE bit references from arch specific code and there is probably some more cleanup of dev_pagemap reference counting that could be done, particular in mm/gup.c. I also haven't yet compiled on anything other than x86_64. Before continuing further with this clean-up though I would appreciate some feedback on the viability of this approach and any issues I may have overlooked, as I am not intimately familiar with FS DAX code (or for that matter the FS layer in general). I have of course run some basic testing which didn't reveal any problems. Signed-off-by: Alistair Popple <apopple@nvidia.com> Alistair Popple (10): mm/gup.c: Remove redundant check for PCI P2PDMA page mm/hmm: Remove dead check for HugeTLB and FS DAX pci/p2pdma: Don't initialise page refcount to one fs/dax: Don't track page mapping/index fs/dax: Refactor wait for dax idle page fs/dax: Add dax_page_free callback mm: Allow compound zone device pages fs/dax: Properly refcount fs dax pages mm/khugepage.c: Warn if trying to scan devmap pmd mm: Remove pXX_devmap Documentation/mm/arch_pgtable_helpers.rst | 6 +- arch/arm64/include/asm/pgtable.h | 24 +--- arch/powerpc/include/asm/book3s/64/pgtable.h | 42 +----- arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +- arch/powerpc/mm/book3s64/pgtable.c | 8 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +- arch/powerpc/mm/pgtable.c | 2 +- arch/x86/include/asm/pgtable.h | 31 +--- drivers/dax/super.c | 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- drivers/nvdimm/pmem.c | 10 +- drivers/pci/p2pdma.c | 4 +- fs/dax.c | 158 +++++++----------- fs/ext4/inode.c | 5 +- fs/fuse/dax.c | 4 +- fs/fuse/virtio_fs.c | 8 +- fs/userfaultfd.c | 2 +- fs/xfs/xfs_file.c | 4 +- include/linux/dax.h | 16 ++- include/linux/huge_mm.h | 11 +- include/linux/memremap.h | 12 +- include/linux/migrate.h | 2 +- include/linux/mm.h | 41 +----- include/linux/page-flags.h | 6 +- include/linux/pgtable.h | 17 +-- lib/test_hmm.c | 2 +- mm/debug_vm_pgtable.c | 51 +------ mm/gup.c | 165 +------------------ mm/hmm.c | 40 +---- mm/huge_memory.c | 180 +++++++++----------- mm/internal.h | 2 +- mm/khugepaged.c | 2 +- mm/mapping_dirty_helpers.c | 4 +- mm/memory-failure.c | 6 +- mm/memory.c | 109 ++++++++---- mm/memremap.c | 36 +--- mm/migrate_device.c | 6 +- mm/mm_init.c | 5 +- mm/mprotect.c | 2 +- mm/mremap.c | 5 +- mm/page_vma_mapped.c | 5 +- mm/pgtable-generic.c | 7 +- mm/swap.c | 2 +- mm/vmscan.c | 5 +- 44 files changed, 338 insertions(+), 721 deletions(-) base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa