Message ID: 20200914224509.17699-1-rcampbell@nvidia.com (mailing list archive)
State: New, archived
Series: mm: remove extra ZONE_DEVICE struct page refcount
On Mon, Sep 14, 2020 at 3:45 PM Ralph Campbell <rcampbell@nvidia.com> wrote: > > ZONE_DEVICE struct pages have an extra reference count that complicates the > code for put_page() and several places in the kernel that need to check the > reference count to see that a page is not being used (gup, compaction, > migration, etc.). Clean up the code so the reference count doesn't need to > be treated specially for ZONE_DEVICE. > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> > --- > > Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE > struct page reference counting is ugly/broken. This is my attempt to > fix it and it works for the HMM migration self-tests. Can you link to a technical description of what's broken? Or better yet, summarize that argument in the changelog? > I'm only sending this out as an RFC since I'm not that familiar with the > DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated > with devm_memremap_pages() or memremap_pages(), but from my best reading of > the code it looks like it might be OK. I could use help testing these > configurations. Back in the 4.15 days I could not convince myself that there were no code paths that blindly assumed pages with refcount==0 were on an lru list. Since then, struct page has been reorganized to not collide the ->pgmap back pointer with the ->lru list and there have been other cleanups for page pinning that might make this incremental cleanup viable. You also need to fix up ext4_break_layouts() and xfs_break_dax_layouts() to expect ->_refcount is 0 instead of 1. This also needs some fstests exposure. > I have a modified THP migration patch series that applies on top of > this one and is cleaner since I don't have to add code to handle the > +1 reference count. The link below is for the earlier v2: > ("mm/hmm/nouveau: add THP migration to migrate_vma_*") > https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampbell@nvidia.com > > > arch/powerpc/kvm/book3s_hv_uvmem.c | 1 - > drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 - > include/linux/memremap.h | 6 +-- > include/linux/mm.h | 39 --------------- > lib/test_hmm.c | 1 - > mm/gup.c | 44 ----------------- > mm/memremap.c | 20 ++++---- > mm/migrate.c | 5 -- > mm/swap.c | 66 +++++++++++--------------- > 9 files changed, 41 insertions(+), 142 deletions(-) This diffstat is indeed appealing. > > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c > index 84e5a2dc8be5..00d97050d7ff 100644 > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c > @@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) > > dpage = pfn_to_page(uvmem_pfn); > dpage->zone_device_data = pvt; > - get_page(dpage); > lock_page(dpage); > return dpage; > out_clear: > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c > index a13c6215bba8..2a4bbe01a455 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c > @@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) > return NULL; > } > > - get_page(page); > lock_page(page); > return page; > } > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 4e9c738f4b31..7dd9802d2612 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -67,9 +67,9 @@ enum memory_type { > > struct dev_pagemap_ops { > /* > - * Called once the page refcount reaches 1.
(ZONE_DEVICE pages never > - * reach 0 refcount unless there is a refcount bug. This allows the > - * device driver to implement its own memory management.) > + * Called once the page refcount reaches 0. The reference count is > + * reset to 1 before calling page_free(). This allows the > + * device driver to implement its own memory management. I'd clarify the order of events / responsibility of the common core page_free() and the device-specific page_free(). At the same time, why not update drivers to expect that the page is already refcount==0 on entry? Seems odd to go through all this trouble to make the reference count appear to be zero to the wider kernel but expect that drivers get a fake reference on entry to their ->page_free() callbacks.
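For readers skimming the thread: condensed from the mm/memremap.c hunk of the patch at the bottom of this page, the proposed free path looks roughly like this (the comments here are annotations on the ordering Dan is asking about, not part of the patch):

/* Sketch only; see the full patch below for the complete listing. */
void free_devmap_managed_page(struct page *page)
{
        /* Common core cleanup runs first, while _refcount is still 0. */
        __ClearPageWaiters(page);
        mem_cgroup_uncharge(page);

        /*
         * Reset the refcount to 1 so the page carries a real reference
         * while it sits on the driver's free list, matching the initial
         * reference taken when memremap_pages() created the pages.
         */
        set_page_refcounted(page);

        page->mapping = NULL;

        /* The device-specific callback runs last and now owns the page. */
        page->pgmap->ops->page_free(page);
}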
On 9/14/20 4:10 PM, Dan Williams wrote: > On Mon, Sep 14, 2020 at 3:45 PM Ralph Campbell <rcampbell@nvidia.com> wrote: >> >> ZONE_DEVICE struct pages have an extra reference count that complicates the >> code for put_page() and several places in the kernel that need to check the >> reference count to see that a page is not being used (gup, compaction, >> migration, etc.). Clean up the code so the reference count doesn't need to >> be treated specially for ZONE_DEVICE. >> >> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> >> --- >> >> Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE >> struct page reference counting is ugly/broken. This is my attempt to >> fix it and it works for the HMM migration self-tests. > > Can you link to a technical description of what's broken? Or better > yet, summarize that argument in the changelog? > >> I'm only sending this out as an RFC since I'm not that familiar with the >> DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated >> with devm_memremap_pages() or memremap_pages(), but from my best reading of >> the code it looks like it might be OK. I could use help testing these >> configurations. > > Back in the 4.15 days I could not convince myself that there were no > code paths that blindly assumed pages with refcount==0 were on an lru list. Since > then, struct page has been reorganized to not collide the ->pgmap back > pointer with the ->lru list and there have been other cleanups for > page pinning that might make this incremental cleanup viable. > > You also need to fix up ext4_break_layouts() and > xfs_break_dax_layouts() to expect ->_refcount is 0 instead of 1. This > also needs some fstests exposure. Got it. Thanks! >> I have a modified THP migration patch series that applies on top of >> this one and is cleaner since I don't have to add code to handle the >> +1 reference count. The link below is for the earlier v2: >> ("mm/hmm/nouveau: add THP migration to migrate_vma_*") >> https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampbell@nvidia.com >> >> >> arch/powerpc/kvm/book3s_hv_uvmem.c | 1 - >> drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 - >> include/linux/memremap.h | 6 +-- >> include/linux/mm.h | 39 --------------- >> lib/test_hmm.c | 1 - >> mm/gup.c | 44 ----------------- >> mm/memremap.c | 20 ++++---- >> mm/migrate.c | 5 -- >> mm/swap.c | 66 +++++++++++--------------- >> 9 files changed, 41 insertions(+), 142 deletions(-) > > This diffstat is indeed appealing.
> >> >> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c >> index 84e5a2dc8be5..00d97050d7ff 100644 >> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c >> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c >> @@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) >> >> dpage = pfn_to_page(uvmem_pfn); >> dpage->zone_device_data = pvt; >> - get_page(dpage); >> lock_page(dpage); >> return dpage; >> out_clear: >> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c >> index a13c6215bba8..2a4bbe01a455 100644 >> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c >> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c >> @@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) >> return NULL; >> } >> >> - get_page(page); >> lock_page(page); >> return page; >> } >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >> index 4e9c738f4b31..7dd9802d2612 100644 >> --- a/include/linux/memremap.h >> +++ b/include/linux/memremap.h >> @@ -67,9 +67,9 @@ enum memory_type { >> >> struct dev_pagemap_ops { >> /* >> - * Called once the page refcount reaches 1. (ZONE_DEVICE pages never >> - * reach 0 refcount unless there is a refcount bug. This allows the >> - * device driver to implement its own memory management.) >> + * Called once the page refcount reaches 0. The reference count is >> + * reset to 1 before calling page_free(). This allows the >> + * device driver to implement its own memory management. > > I'd clarify the order of events / responsibility of the common core > page_free() and the device-specific page_free(). At the same time, why > not update drivers to expect that the page is already refcount==0 on > entry? Seems odd to go through all this trouble to make the reference > count appear to be zero to the wider kernel but expect that drivers > get a fake reference on entry to their ->page_free() callbacks. Good point. Since set_page_refcounted() is defined in mm/internal.h I would have to move the definition to someplace like page_ref.h or have the drivers call init_page_count() or set_page_count() since get_page() calls VM_BUG_ON_PAGE() if refcount == 0. I'll move set_page_refcounted() since that is what the page allocator uses and seems named for the purpose.
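For reference, set_page_refcounted() is currently defined in mm/internal.h roughly as follows (quoted here from a v5.9-era tree); note the VM_BUG_ON_PAGE() that fires if the count is already non-zero, which is why it, unlike get_page(), is usable for the 0 -> 1 transition:

/*
 * Turn a non-refcounted page (->_refcount == 0) into refcounted with
 * a count of one.
 */
static inline void set_page_refcounted(struct page *page)
{
        VM_BUG_ON_PAGE(PageTail(page), page);
        VM_BUG_ON_PAGE(page_ref_count(page), page);
        set_page_count(page, 1);
}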
On Mon, Sep 14, 2020 at 04:53:25PM -0700, Ralph Campbell wrote: > Since set_page_refcounted() is defined in mm/internal.h I would have to > move the definition to someplace like page_ref.h or have the drivers > call init_page_count() or set_page_count() since get_page() calls > VM_BUG_ON_PAGE() if refcount == 0. > I'll move set_page_refcounted() since that is what the page allocator > uses and seems named for the purpose. I don't think any of the three ->page_free instances even cares about the page refcount.
On 9/15/20 9:29 AM, Christoph Hellwig wrote: > On Mon, Sep 14, 2020 at 04:53:25PM -0700, Ralph Campbell wrote: >> Since set_page_refcounted() is defined in mm/internal.h I would have to >> move the definition to someplace like page_ref.h or have the drivers >> call init_page_count() or set_page_count() since get_page() calls >> VM_BUG_ON_PAGE() if refcount == 0. >> I'll move set_page_refcounted() since that is what the page allocator >> uses and seems named for the purpose. > > I don't think any of the three ->page_free instances even cares about > the page refcount. > Not true. The page_free() callback records that the page is free by setting a bit or putting the page on a free list, but when the driver allocates a free device-private struct page to be used with migrate_vma_setup(), it needs to increment the refcount. For the ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA struct pages, I think you are correct because they don't define page_free() and, from what I can see, don't decrement the page refcount to zero.
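As a concrete example of the free-list pattern described above, lib/test_hmm.c's ->page_free() callback looks roughly like this (lightly paraphrased from the v5.9-era source, so treat it as a sketch); free device-private pages are chained through page->zone_device_data:

static void dmirror_devmem_free(struct page *page)
{
        struct page *rpage = page->zone_device_data;
        struct dmirror_device *mdevice;

        /* Release the system-memory page backing this device page. */
        if (rpage)
                __free_page(rpage);

        mdevice = dmirror_page_to_device(page);

        /* Record that the device page is free by pushing it on a list. */
        spin_lock(&mdevice->lock);
        mdevice->cfree++;
        page->zone_device_data = mdevice->free_pages;
        mdevice->free_pages = page;
        spin_unlock(&mdevice->lock);
}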
On Tue, Sep 15, 2020 at 09:39:47AM -0700, Ralph Campbell wrote: >> I don't think any of the three ->page_free instances even cares about >> the page refcount. >> > Not true. The page_free() callback records that the page is free by setting > a bit or putting the page on a free list, but when the driver allocates a free > device-private struct page to be used with migrate_vma_setup(), it needs to > increment the refcount. > > For the ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA > struct pages, I think you are correct because they don't define page_free() > and, from what I can see, don't decrement the page refcount to zero. Umm, the whole point of ZONE_DEVICE is to have a struct page for something that is not system memory. For both the ppc kvm case (magic hypervisor pool) and Nouveau (device internal) memory that is clearly the case. But it looks like test_hmm uses normal pages to fake this up, so I was wrong about the third caller. But I think we can just call set_page_count just before freeing the page there with a comment explaining what is going on.
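Concretely, the suggestion for test_hmm seems to be something like the following hypothetical sketch (the function name is invented; this is not the actual lib/test_hmm.c code): once pages are managed under the "free means refcount == 0" convention, the count has to be re-armed just before a backing page goes back to the page allocator, which expects to be handed a reference:

static void example_release_backing_page(struct page *page)
{
        /*
         * This page sat on our free list with _refcount == 0. The page
         * allocator frees via put_page() semantics, so restore the
         * reference it expects before handing the page back.
         */
        set_page_count(page, 1);
        __free_page(page);
}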
> diff --git a/include/linux/mm.h b/include/linux/mm.h > index 517751310dd2..5a82037a4b26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page *page) > #ifdef CONFIG_DEV_PAGEMAP_OPS > void free_devmap_managed_page(struct page *page); > DECLARE_STATIC_KEY_FALSE(devmap_managed_key); The export for devmap_managed_key can be dropped now. In fact I think we can remove devmap_managed_key entirely now - it is only checked in the actual page free path instead of for each refcount manipulation, so a good old unlikely is probably enough. Also free_devmap_managed_page can move to mm/internal.h. > +#ifdef CONFIG_DEV_PAGEMAP_OPS > +static void __put_devmap_managed_page(struct page *page) > +{ > + if (!static_branch_unlikely(&devmap_managed_key)) > + return; > + > + switch (page->pgmap->type) { > + case MEMORY_DEVICE_PRIVATE: > + case MEMORY_DEVICE_FS_DAX: > + free_devmap_managed_page(page); > + break; > + default: > + break; > + } > +} > +#else > +static inline void __put_devmap_managed_page(struct page *page) > +{ > +} > +#endif I think this should be moved to mm/memremap.c or even better actually be folded into free_devmap_managed_page, which would need a new name like free_zone_device_page(). Something like this incremental patch: diff --git a/include/linux/mm.h b/include/linux/mm.h index 7bb9e93cf86cde..29350dc27cd0cd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1090,11 +1090,6 @@ static inline bool is_zone_device_page(const struct page *page) } #endif -#ifdef CONFIG_DEV_PAGEMAP_OPS -void free_devmap_managed_page(struct page *page); -DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -#endif /* CONFIG_DEV_PAGEMAP_OPS */ - static inline bool is_device_private_page(const struct page *page) { return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && diff --git a/mm/internal.h b/mm/internal.h index 6345b08ce86ccf..629959a6f26d7c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -618,4 +618,12 @@ struct migration_target_control { gfp_t gfp_mask; }; +#ifdef CONFIG_DEV_PAGEMAP_OPS +void free_zone_device_page(struct page *page); +#else +static inline void free_zone_device_page(struct page *page) +{ +} +#endif + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memremap.c b/mm/memremap.c index d549e3733f4098..b15ad2264a4f1c 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -12,6 +12,7 @@ #include <linux/types.h> #include <linux/wait_bit.h> #include <linux/xarray.h> +#include "internal.h" static DEFINE_XARRAY(pgmap_array); @@ -37,36 +38,6 @@ unsigned long memremap_compat_align(void) EXPORT_SYMBOL_GPL(memremap_compat_align); #endif -#ifdef CONFIG_DEV_PAGEMAP_OPS -DEFINE_STATIC_KEY_FALSE(devmap_managed_key); -EXPORT_SYMBOL(devmap_managed_key); - -static void devmap_managed_enable_put(void) -{ - static_branch_dec(&devmap_managed_key); -} - -static int devmap_managed_enable_get(struct dev_pagemap *pgmap) -{ - if (pgmap->type == MEMORY_DEVICE_PRIVATE && - (!pgmap->ops || !pgmap->ops->page_free)) { - WARN(1, "Missing page_free method\n"); - return -EINVAL; - } - - static_branch_inc(&devmap_managed_key); - return 0; -} -#else -static int devmap_managed_enable_get(struct dev_pagemap *pgmap) -{ - return -EINVAL; -} -static void devmap_managed_enable_put(void) -{ -} -#endif /* CONFIG_DEV_PAGEMAP_OPS */ - static void pgmap_array_delete(struct range *range) { xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end), @@ -181,7 +152,6 @@ void memunmap_pages(struct dev_pagemap *pgmap) pageunmap_range(pgmap, i); 
WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); - devmap_managed_enable_put(); } EXPORT_SYMBOL_GPL(memunmap_pages); @@ -319,7 +289,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) .pgprot = PAGE_KERNEL, }; const int nr_range = pgmap->nr_range; - bool need_devmap_managed = true; int error, i; if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) @@ -331,8 +300,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) WARN(1, "Device private memory not supported\n"); return ERR_PTR(-EINVAL); } - if (!pgmap->ops || !pgmap->ops->migrate_to_ram) { - WARN(1, "Missing migrate_to_ram method\n"); + if (!pgmap->ops || + !pgmap->ops->migrate_to_ram || !pgmap->ops->page_free) { + WARN(1, "Missing ops\n"); return ERR_PTR(-EINVAL); } if (!pgmap->owner) { @@ -348,11 +318,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } break; case MEMORY_DEVICE_GENERIC: - need_devmap_managed = false; break; case MEMORY_DEVICE_PCI_P2PDMA: params.pgprot = pgprot_noncached(params.pgprot); - need_devmap_managed = false; break; default: WARN(1, "Invalid pgmap type %d\n", pgmap->type); @@ -376,12 +344,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } } - if (need_devmap_managed) { - error = devmap_managed_enable_get(pgmap); - if (error) - return ERR_PTR(error); - } - /* * Clear the pgmap nr_range as it will be incremented for each * successfully processed range. This communicates how many @@ -496,16 +458,9 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, EXPORT_SYMBOL_GPL(get_dev_pagemap); #ifdef CONFIG_DEV_PAGEMAP_OPS -void free_devmap_managed_page(struct page *page) +static void free_device_private_page(struct page *page) { - /* notify page idle for dax */ - if (!is_device_private_page(page)) { - wake_up_var(&page->_refcount); - return; - } - __ClearPageWaiters(page); - mem_cgroup_uncharge(page); /* @@ -540,4 +495,19 @@ void free_devmap_managed_page(struct page *page) page->mapping = NULL; page->pgmap->ops->page_free(page); } + +void free_zone_device_page(struct page *page) +{ + switch (page->pgmap->type) { + case MEMORY_DEVICE_FS_DAX: + /* notify page idle */ + wake_up_var(&page->_refcount); + return; + case MEMORY_DEVICE_PRIVATE: + free_device_private_page(page); + return; + default: + return; + } +} #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/swap.c b/mm/swap.c index bcab5db351184a..83451ac70d0f05 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -113,36 +113,14 @@ static void __put_compound_page(struct page *page) destroy_compound_page(page); } -#ifdef CONFIG_DEV_PAGEMAP_OPS -static void __put_devmap_managed_page(struct page *page) -{ - if (!static_branch_unlikely(&devmap_managed_key)) - return; - - switch (page->pgmap->type) { - case MEMORY_DEVICE_PRIVATE: - case MEMORY_DEVICE_FS_DAX: - free_devmap_managed_page(page); - break; - default: - break; - } -} -#else -static inline void __put_devmap_managed_page(struct page *page) -{ -} -#endif - void __put_page(struct page *page) { if (is_zone_device_page(page)) { - __put_devmap_managed_page(page); - /* * The page belongs to the device that created pgmap. Do * not return it to page allocator. */ + free_zone_device_page(page); return; } @@ -923,7 +901,7 @@ void release_pages(struct page **pages, int nr) flags); locked_pgdat = NULL; } - __put_devmap_managed_page(page); + free_zone_device_page(page); return; }
On Mon, Sep 14, 2020 at 04:10:38PM -0700, Dan Williams wrote: > You also need to fix up ext4_break_layouts() and > xfs_break_dax_layouts() to expect ->_refcount is 0 instead of 1. This > also needs some fstests exposure. While we're at it, can we add a wait_fsdax_unref helper macro that hides the _refcount access from the file systems?
On 9/15/20 10:36 PM, Christoph Hellwig wrote: > On Tue, Sep 15, 2020 at 09:39:47AM -0700, Ralph Campbell wrote: >>> I don't think any of the three ->page_free instances even cares about >>> the page refcount. >>> >> Not true. The page_free() callback records that the page is free by setting >> a bit or putting the page on a free list, but when the driver allocates a free >> device-private struct page to be used with migrate_vma_setup(), it needs to >> increment the refcount. >> >> For the ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA >> struct pages, I think you are correct because they don't define page_free() >> and, from what I can see, don't decrement the page refcount to zero. > > Umm, the whole point of ZONE_DEVICE is to have a struct page for > something that is not system memory. For both the ppc kvm case (magic > hypervisor pool) and Nouveau (device internal) memory that is clearly the > case. But it looks like test_hmm uses normal pages to fake this up, so > I was wrong about the third caller. But I think we can just call > set_page_count just before freeing the page there with a comment > explaining what is going on. Dan Williams thought that having the ZONE_DEVICE struct pages be on a free list with a refcount of one was a bit strange and that the driver should handle the zero-to-one transition. But that would mean a somewhat more invasive change to the three drivers: set the reference count to zero after calling memremap_pages() and set it to one when allocating a struct page. What you are suggesting is what I also proposed in v1.
On Wed, Sep 16, 2020 at 5:29 PM Ralph Campbell <rcampbell@nvidia.com> wrote: > > > On 9/15/20 10:36 PM, Christoph Hellwig wrote: > > On Tue, Sep 15, 2020 at 09:39:47AM -0700, Ralph Campbell wrote: > >>> I don't think any of the three ->page_free instances even cares about > >>> the page refcount. > >>> > >> Not true. The page_free() callback records that the page is free by setting > >> a bit or putting the page on a free list, but when the driver allocates a free > >> device-private struct page to be used with migrate_vma_setup(), it needs to > >> increment the refcount. > >> > >> For the ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA > >> struct pages, I think you are correct because they don't define page_free() > >> and, from what I can see, don't decrement the page refcount to zero. > > > > Umm, the whole point of ZONE_DEVICE is to have a struct page for > > something that is not system memory. For both the ppc kvm case (magic > > hypervisor pool) and Nouveau (device internal) memory that is clearly the > > case. But it looks like test_hmm uses normal pages to fake this up, so > > I was wrong about the third caller. But I think we can just call > > set_page_count just before freeing the page there with a comment > > explaining what is going on. > > Dan Williams thought that having the ZONE_DEVICE struct pages > be on a free list with a refcount of one was a bit strange and > that the driver should handle the zero-to-one transition. > But that would mean a somewhat more invasive change to the three > drivers: set the reference count to zero after calling memremap_pages() > and set it to one when allocating a struct page. What you are suggesting > is what I also proposed in v1. IIUC, isn't Christoph recommending that drivers handle set_page_count() directly rather than the core, since some are prepared for it to be zero on entry?
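Under that variant, the allocation side of a driver would own the zero-to-one transition itself, roughly like this (a hypothetical sketch modeled on lib/test_hmm.c's allocator; the struct and function names are illustrative):

static struct page *example_devmem_alloc_page(struct example_device *dev)
{
        struct page *dpage;

        /* Pop a free device-private page, chained via zone_device_data. */
        spin_lock(&dev->lock);
        dpage = dev->free_pages;
        if (dpage)
                dev->free_pages = dpage->zone_device_data;
        spin_unlock(&dev->lock);
        if (!dpage)
                return NULL;

        /*
         * ->page_free() left this page at _refcount == 0, so take the
         * first reference here instead of relying on the core to reset
         * it. get_page() would trip VM_BUG_ON_PAGE() on a zero count,
         * hence set_page_count().
         */
        set_page_count(dpage, 1);
        lock_page(dpage);
        return dpage;
}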
On 9/15/20 11:09 PM, Christoph Hellwig wrote: >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 517751310dd2..5a82037a4b26 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page *page) >> #ifdef CONFIG_DEV_PAGEMAP_OPS >> void free_devmap_managed_page(struct page *page); >> DECLARE_STATIC_KEY_FALSE(devmap_managed_key); > > The export for devmap_managed_key can be dropped now. In fact I think > we can remove devmap_managed_key entirely now - it is only checked in > the actual page free path instead of for each refcount manipulation, > so a good old unlikely is probably enough. > > Also free_devmap_managed_page can move to mm/internal.h. Good suggestion. >> +#ifdef CONFIG_DEV_PAGEMAP_OPS >> +static void __put_devmap_managed_page(struct page *page) >> +{ >> + if (!static_branch_unlikely(&devmap_managed_key)) >> + return; >> + >> + switch (page->pgmap->type) { >> + case MEMORY_DEVICE_PRIVATE: >> + case MEMORY_DEVICE_FS_DAX: >> + free_devmap_managed_page(page); >> + break; >> + default: >> + break; >> + } >> +} >> +#else >> +static inline void __put_devmap_managed_page(struct page *page) >> +{ >> +} >> +#endif > > I think this should be moved to mm/memremap.c or even better > actually be folded into free_devmap_managed_page, which would need > a new name like free_zone_device_page(). > > Something like this incremental patch: > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 7bb9e93cf86cde..29350dc27cd0cd 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1090,11 +1090,6 @@ static inline bool is_zone_device_page(const struct page *page) > } > #endif > > -#ifdef CONFIG_DEV_PAGEMAP_OPS > -void free_devmap_managed_page(struct page *page); > -DECLARE_STATIC_KEY_FALSE(devmap_managed_key); > -#endif /* CONFIG_DEV_PAGEMAP_OPS */ > - > static inline bool is_device_private_page(const struct page *page) > { > return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && > diff --git a/mm/internal.h b/mm/internal.h > index 6345b08ce86ccf..629959a6f26d7c 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -618,4 +618,12 @@ struct migration_target_control { > gfp_t gfp_mask; > }; > > +#ifdef CONFIG_DEV_PAGEMAP_OPS > +void free_zone_device_page(struct page *page); > +#else > +static inline void free_zone_device_page(struct page *page) > +{ > +} > +#endif > + > #endif /* __MM_INTERNAL_H */ > diff --git a/mm/memremap.c b/mm/memremap.c > index d549e3733f4098..b15ad2264a4f1c 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -12,6 +12,7 @@ > #include <linux/types.h> > #include <linux/wait_bit.h> > #include <linux/xarray.h> > +#include "internal.h" > > static DEFINE_XARRAY(pgmap_array); > > @@ -37,36 +38,6 @@ unsigned long memremap_compat_align(void) > EXPORT_SYMBOL_GPL(memremap_compat_align); > #endif > > -#ifdef CONFIG_DEV_PAGEMAP_OPS > -DEFINE_STATIC_KEY_FALSE(devmap_managed_key); > -EXPORT_SYMBOL(devmap_managed_key); > - > -static void devmap_managed_enable_put(void) > -{ > - static_branch_dec(&devmap_managed_key); > -} > - > -static int devmap_managed_enable_get(struct dev_pagemap *pgmap) > -{ > - if (pgmap->type == MEMORY_DEVICE_PRIVATE && > - (!pgmap->ops || !pgmap->ops->page_free)) { > - WARN(1, "Missing page_free method\n"); > - return -EINVAL; > - } > - > - static_branch_inc(&devmap_managed_key); > - return 0; > -} > -#else > -static int devmap_managed_enable_get(struct dev_pagemap *pgmap) > -{ > - return -EINVAL; > -} > -static void devmap_managed_enable_put(void) > -{ > -} 
> -#endif /* CONFIG_DEV_PAGEMAP_OPS */ > - > static void pgmap_array_delete(struct range *range) > { > xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end), > @@ -181,7 +152,6 @@ void memunmap_pages(struct dev_pagemap *pgmap) > pageunmap_range(pgmap, i); > > WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); > - devmap_managed_enable_put(); > } > EXPORT_SYMBOL_GPL(memunmap_pages); > > @@ -319,7 +289,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) > .pgprot = PAGE_KERNEL, > }; > const int nr_range = pgmap->nr_range; > - bool need_devmap_managed = true; > int error, i; > > if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) > @@ -331,8 +300,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) > WARN(1, "Device private memory not supported\n"); > return ERR_PTR(-EINVAL); > } > - if (!pgmap->ops || !pgmap->ops->migrate_to_ram) { > - WARN(1, "Missing migrate_to_ram method\n"); > + if (!pgmap->ops || > + !pgmap->ops->migrate_to_ram || !pgmap->ops->page_free) { > + WARN(1, "Missing ops\n"); > return ERR_PTR(-EINVAL); > } > if (!pgmap->owner) { > @@ -348,11 +318,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) > } > break; > case MEMORY_DEVICE_GENERIC: > - need_devmap_managed = false; > break; > case MEMORY_DEVICE_PCI_P2PDMA: > params.pgprot = pgprot_noncached(params.pgprot); > - need_devmap_managed = false; > break; > default: > WARN(1, "Invalid pgmap type %d\n", pgmap->type); > @@ -376,12 +344,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) > } > } > > - if (need_devmap_managed) { > - error = devmap_managed_enable_get(pgmap); > - if (error) > - return ERR_PTR(error); > - } > - > /* > * Clear the pgmap nr_range as it will be incremented for each > * successfully processed range. 
This communicates how many > @@ -496,16 +458,9 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, > EXPORT_SYMBOL_GPL(get_dev_pagemap); > > #ifdef CONFIG_DEV_PAGEMAP_OPS > -void free_devmap_managed_page(struct page *page) > +static void free_device_private_page(struct page *page) > { > - /* notify page idle for dax */ > - if (!is_device_private_page(page)) { > - wake_up_var(&page->_refcount); > - return; > - } > - > __ClearPageWaiters(page); > - > mem_cgroup_uncharge(page); > > /* > @@ -540,4 +495,19 @@ void free_devmap_managed_page(struct page *page) > page->mapping = NULL; > page->pgmap->ops->page_free(page); > } > + > +void free_zone_device_page(struct page *page) > +{ > + switch (page->pgmap->type) { > + case MEMORY_DEVICE_FS_DAX: > + /* notify page idle */ > + wake_up_var(&page->_refcount); > + return; > + case MEMORY_DEVICE_PRIVATE: > + free_device_private_page(page); > + return; > + default: > + return; > + } > +} > #endif /* CONFIG_DEV_PAGEMAP_OPS */ > diff --git a/mm/swap.c b/mm/swap.c > index bcab5db351184a..83451ac70d0f05 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -113,36 +113,14 @@ static void __put_compound_page(struct page *page) > destroy_compound_page(page); > } > > -#ifdef CONFIG_DEV_PAGEMAP_OPS > -static void __put_devmap_managed_page(struct page *page) > -{ > - if (!static_branch_unlikely(&devmap_managed_key)) > - return; > - > - switch (page->pgmap->type) { > - case MEMORY_DEVICE_PRIVATE: > - case MEMORY_DEVICE_FS_DAX: > - free_devmap_managed_page(page); > - break; > - default: > - break; > - } > -} > -#else > -static inline void __put_devmap_managed_page(struct page *page) > -{ > -} > -#endif > - > void __put_page(struct page *page) > { > if (is_zone_device_page(page)) { > - __put_devmap_managed_page(page); > - > /* > * The page belongs to the device that created pgmap. Do > * not return it to page allocator. > */ > + free_zone_device_page(page); > return; > } > > @@ -923,7 +901,7 @@ void release_pages(struct page **pages, int nr) > flags); > locked_pgdat = NULL; > } > - __put_devmap_managed_page(page); > + free_zone_device_page(page); > return; > } > > Thanks for the review! I will apply the above in v2. I found a couple more reference count checks in fs/dax.c so I need to run fstests with dax before sending v2 out.
On 9/15/20 11:10 PM, Christoph Hellwig wrote: > On Mon, Sep 14, 2020 at 04:10:38PM -0700, Dan Williams wrote: >> You also need to fix up ext4_break_layouts() and >> xfs_break_dax_layouts() to expect ->_refcount is 0 instead of 1. This >> also needs some fstests exposure. > > While we're at it, can we add a wait_fsdax_unref helper macro that hides > the _refcount access from the file systems? Sure. I'll add a separate patch for it in v2.
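Such a helper might look something like this under the new convention, where an idle DAX page has _refcount == 0 (dax_page_unused() and dax_wait_page() are names invented for this sketch; ___wait_var_event() is the wait machinery that ext4_break_layouts() and xfs_break_dax_layouts() already open-code):

/* Hypothetical helpers, e.g. for include/linux/dax.h, hiding _refcount. */
static inline bool dax_page_unused(struct page *page)
{
        return page_ref_count(page) == 0;
}

#define dax_wait_page(_inode, _page, _wait_cb)                          \
        ___wait_var_event(&(_page)->_refcount,                          \
                          dax_page_unused(_page),                       \
                          TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))

xfs_break_dax_layouts() would then call dax_wait_page(inode, page, xfs_wait_dax_page) instead of testing page->_refcount directly.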
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 84e5a2dc8be5..00d97050d7ff 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) dpage = pfn_to_page(uvmem_pfn); dpage->zone_device_data = pvt; - get_page(dpage); lock_page(dpage); return dpage; out_clear: diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index a13c6215bba8..2a4bbe01a455 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) return NULL; } - get_page(page); lock_page(page); return page; } diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 4e9c738f4b31..7dd9802d2612 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -67,9 +67,9 @@ enum memory_type { struct dev_pagemap_ops { /* - * Called once the page refcount reaches 1. (ZONE_DEVICE pages never - * reach 0 refcount unless there is a refcount bug. This allows the - * device driver to implement its own memory management.) + * Called once the page refcount reaches 0. The reference count is + * reset to 1 before calling page_free(). This allows the + * device driver to implement its own memory management. */ void (*page_free)(struct page *page); diff --git a/include/linux/mm.h b/include/linux/mm.h index 517751310dd2..5a82037a4b26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page *page) #ifdef CONFIG_DEV_PAGEMAP_OPS void free_devmap_managed_page(struct page *page); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); - -static inline bool page_is_devmap_managed(struct page *page) -{ - if (!static_branch_unlikely(&devmap_managed_key)) - return false; - if (!is_zone_device_page(page)) - return false; - switch (page->pgmap->type) { - case MEMORY_DEVICE_PRIVATE: - case MEMORY_DEVICE_FS_DAX: - return true; - default: - break; - } - return false; -} - -void put_devmap_managed_page(struct page *page); - -#else /* CONFIG_DEV_PAGEMAP_OPS */ -static inline bool page_is_devmap_managed(struct page *page) -{ - return false; -} - -static inline void put_devmap_managed_page(struct page *page) -{ -} #endif /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool is_device_private_page(const struct page *page) @@ -1169,17 +1141,6 @@ static inline void put_page(struct page *page) { page = compound_head(page); - /* - * For devmap managed pages we need to catch refcount transition from - * 2 to 1, when refcount reach one it means the page is free and we - * need to inform the device driver through callback. See - * include/linux/memremap.h and HMM for details. 
- */ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); - return; - } - if (put_page_testzero(page)) __put_page(page); } diff --git a/lib/test_hmm.c b/lib/test_hmm.c index c8133f50160b..5410f7328577 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -561,7 +561,6 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) } dpage->zone_device_data = rpage; - get_page(dpage); lock_page(dpage); return dpage; diff --git a/mm/gup.c b/mm/gup.c index b9b1cd8bcd6a..3c54f0c190a3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -177,41 +177,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) return true; } -#ifdef CONFIG_DEV_PAGEMAP_OPS -static bool __unpin_devmap_managed_user_page(struct page *page) -{ - int count, refs = 1; - - if (!page_is_devmap_managed(page)) - return false; - - if (hpage_pincount_available(page)) - hpage_pincount_sub(page, 1); - else - refs = GUP_PIN_COUNTING_BIAS; - - count = page_ref_sub_return(page, refs); - - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1); - /* - * devmap page refcounts are 1-based, rather than 0-based: if - * refcount is 1, then the page is free and the refcount is - * stable because nobody holds a reference on the page. - */ - if (count == 1) - free_devmap_managed_page(page); - else if (!count) - __put_page(page); - - return true; -} -#else -static bool __unpin_devmap_managed_user_page(struct page *page) -{ - return false; -} -#endif /* CONFIG_DEV_PAGEMAP_OPS */ - /** * unpin_user_page() - release a dma-pinned page * @page: pointer to page to be released @@ -227,15 +192,6 @@ void unpin_user_page(struct page *page) page = compound_head(page); - /* - * For devmap managed pages we need to catch refcount transition from - * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the - * page is free and we need to inform the device driver through - * callback. See include/linux/memremap.h and HMM for details. - */ - if (__unpin_devmap_managed_user_page(page)) - return; - if (hpage_pincount_available(page)) hpage_pincount_sub(page, 1); else diff --git a/mm/memremap.c b/mm/memremap.c index 9c951bbfee91..f09562789079 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -91,13 +91,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id) return (range->start + range_len(range)) >> PAGE_SHIFT; } -static unsigned long pfn_next(unsigned long pfn) -{ - if (pfn % 1024 == 0) - cond_resched(); - return pfn + 1; -} - /* * This returns true if the page is reserved by ZONE_DEVICE driver. */ @@ -176,13 +169,12 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id) void memunmap_pages(struct dev_pagemap *pgmap) { - unsigned long pfn; int i; dev_pagemap_kill(pgmap); for (i = 0; i < pgmap->nr_range; i++) - for_each_device_pfn(pfn, pgmap, i) - put_page(pfn_to_page(pfn)); + percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) - + pfn_first(pgmap, i)); dev_pagemap_cleanup(pgmap); for (i = 0; i < pgmap->nr_range; i++) @@ -516,6 +508,14 @@ void free_devmap_managed_page(struct page *page) mem_cgroup_uncharge(page); + /* + * ZONE_DEVICE drivers keep a reference to the page while it is on + * the driver's free list so we reset the reference count here. + * This matches the initial reference when the struct pages are + * created by memremap_pages(). + */ + set_page_refcounted(page); + /* * When a device_private page is freed, the page->mapping field * may still contain a (stale) mapping value. 
For example, the diff --git a/mm/migrate.c b/mm/migrate.c index 4f89360d9e77..be1586582b52 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -380,11 +380,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page) { int expected_count = 1; - /* - * Device private pages have an extra refcount as they are - * ZONE_DEVICE pages. - */ - expected_count += is_device_private_page(page); if (mapping) expected_count += thp_nr_pages(page) + page_has_private(page); diff --git a/mm/swap.c b/mm/swap.c index feff680d3de9..8468e72e397f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -113,10 +113,31 @@ static void __put_compound_page(struct page *page) destroy_compound_page(page); } +#ifdef CONFIG_DEV_PAGEMAP_OPS +static void __put_devmap_managed_page(struct page *page) +{ + if (!static_branch_unlikely(&devmap_managed_key)) + return; + + switch (page->pgmap->type) { + case MEMORY_DEVICE_PRIVATE: + case MEMORY_DEVICE_FS_DAX: + free_devmap_managed_page(page); + break; + default: + break; + } +} +#else +static inline void __put_devmap_managed_page(struct page *page) +{ +} +#endif + void __put_page(struct page *page) { if (is_zone_device_page(page)) { - put_dev_pagemap(page->pgmap); + __put_devmap_managed_page(page); /* * The page belongs to the device that created pgmap. Do @@ -851,27 +872,19 @@ void release_pages(struct page **pages, int nr) if (is_huge_zero_page(page)) continue; + page = compound_head(page); + if (!put_page_testzero(page)) + continue; + if (is_zone_device_page(page)) { if (locked_pgdat) { spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags); locked_pgdat = NULL; } - /* - * ZONE_DEVICE pages that return 'false' from - * put_devmap_managed_page() do not require special - * processing, and instead, expect a call to - * put_page_testzero(). - */ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); - continue; - } - } - - page = compound_head(page); - if (!put_page_testzero(page)) + __put_devmap_managed_page(page); continue; + } if (PageCompound(page)) { if (locked_pgdat) { @@ -1149,26 +1162,3 @@ void __init swap_setup(void) * _really_ don't want to cluster much more */ } - -#ifdef CONFIG_DEV_PAGEMAP_OPS -void put_devmap_managed_page(struct page *page) -{ - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - - /* - * devmap page refcounts are 1-based, rather than 0-based: if - * refcount is 1, then the page is free and the refcount is - * stable because nobody holds a reference on the page. - */ - if (count == 1) - free_devmap_managed_page(page); - else if (!count) - __put_page(page); -} -EXPORT_SYMBOL(put_devmap_managed_page); -#endif
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> --- Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE struct page reference counting is ugly/broken. This is my attempt to fix it and it works for the HMM migration self-tests. I'm only sending this out as an RFC since I'm not that familiar with the DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated with devm_memremap_pages() or memremap_pages(), but from my best reading of the code it looks like it might be OK. I could use help testing these configurations. I have a modified THP migration patch series that applies on top of this one and is cleaner since I don't have to add code to handle the +1 reference count. The link below is for the earlier v2: ("mm/hmm/nouveau: add THP migration to migrate_vma_*") https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampbell@nvidia.com arch/powerpc/kvm/book3s_hv_uvmem.c | 1 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 - include/linux/memremap.h | 6 +-- include/linux/mm.h | 39 --------------- lib/test_hmm.c | 1 - mm/gup.c | 44 ----------------- mm/memremap.c | 20 ++++---- mm/migrate.c | 5 -- mm/swap.c | 66 +++++++++++--------------- 9 files changed, 41 insertions(+), 142 deletions(-)