[04/12] mm: Allow compound zone device pages

Message ID	c7026449473790e2844bb82012216c57047c7639.1725941415.git-series.apopple@nvidia.com
State	New
Headers	show Received: from NAM11-BN8-obe.outbound.protection.outlook.com (mail-bn8nam11on2084.outbound.protection.outlook.com [40.107.236.84]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7AF9C18C32E; Tue, 10 Sep 2024 04:15:17 +0000 (UTC) From: Alistair Popple <apopple@nvidia.com> To: dan.j.williams@intel.com, linux-mm@kvack.org Cc: Alistair Popple <apopple@nvidia.com>, vishal.l.verma@intel.com, dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com, jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com, will@kernel.org, mpe@ellerman.id.au, npiggin@gmail.com, dave.hansen@linux.intel.com, ira.weiny@intel.com, willy@infradead.org, djwong@kernel.org, tytso@mit.edu, linmiaohe@huawei.com, david@redhat.com, peterx@redhat.com, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linuxppc-dev@lists.ozlabs.org, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org, jhubbard@nvidia.com, hch@lst.de, david@fromorbit.com, Jason Gunthorpe <jgg@nvidia.com> Subject: [PATCH 04/12] mm: Allow compound zone device pages Date: Tue, 10 Sep 2024 14:14:29 +1000 Message-ID: <c7026449473790e2844bb82012216c57047c7639.1725941415.git-series.apopple@nvidia.com> In-Reply-To: <cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apopple@nvidia.com> References: <cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apopple@nvidia.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk MIME-Version: 1.0
Series	fs/dax: Fix FS DAX page reference counts \| expand [00/12] fs/dax: Fix FS DAX page reference counts [01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page [02/12] pci/p2pdma: Don't initialise page refcount to one [03/12] fs/dax: Refactor wait for dax idle page [04/12] mm: Allow compound zone device pages [05/12] mm/memory: Add dax_insert_pfn [06/12] huge_memory: Allow mappings of PUD sized pages [07/12] huge_memory: Allow mappings of PMD sized pages [08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages [09/12] mm: Update vm_normal_page() callers to accept FS DAX pages [10/12] fs/dax: Properly refcount fs dax pages [11/12] mm: Remove pXX_devmap callers [12/12] mm: Remove devmap related functions and page table bits

Alistair Popple Sept. 10, 2024, 4:14 a.m. UTC

Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.

A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.

Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.

A tail page is distinguished by having bit zero being set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages page->compound_head is shared with
page->pgmap.

The page->pgmap field is common to all pages within a memory section.
Therefore pgmap is the same for both head and tail pages and can be
moved into the folio and we can use the standard scheme to find
compound_head from a tail page.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

---

Changes since v1:

 - Move pgmap to the folio as suggested by Matthew Wilcox
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  3 ++-
 drivers/pci/p2pdma.c                   |  6 +++---
 include/linux/memremap.h               |  6 +++---
 include/linux/migrate.h                |  4 ++--
 include/linux/mm_types.h               |  9 +++++++--
 include/linux/mmzone.h                 |  8 +++++++-
 lib/test_hmm.c                         |  3 ++-
 mm/hmm.c                               |  2 +-
 mm/memory.c                            |  4 +++-
 mm/memremap.c                          | 14 +++++++-------
 mm/migrate_device.c                    |  7 +++++--
 mm/mm_init.c                           |  2 +-
 12 files changed, 43 insertions(+), 25 deletions(-)

Matthew Wilcox Sept. 10, 2024, 4:47 a.m. UTC | #1

On Tue, Sep 10, 2024 at 02:14:29PM +1000, Alistair Popple wrote:
> @@ -337,6 +341,7 @@ struct folio {
>  	/* private: */
>  				};
>  	/* public: */
> +			struct dev_pagemap *pgmap;

Shouldn't that be indented by one more tab stop?

And for ease of reading, perhaps it should be placed either immediately
before or after 'struct list_head lru;'?

> +++ b/include/linux/mmzone.h
> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
>  	return page_zonenum(page) == ZONE_DEVICE;
>  }
>  
> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
> +{
> +	WARN_ON(!is_zone_device_page(page));
> +	return page_folio(page)->pgmap;
> +}

I haven't read to the end yet, but presumably we'll eventually want:

static inline struct dev_pagemap *folio_dev_pagemap(const struct folio *folio)
{
	WARN_ON(!folio_is_zone_device(folio))
	return folio->pgmap;
}

and since we'll want it eventually, maybe now is the time to add it,
and make page_dev_pagemap() simply call it?

Alistair Popple Sept. 10, 2024, 6:57 a.m. UTC | #2

Matthew Wilcox <willy@infradead.org> writes:

> On Tue, Sep 10, 2024 at 02:14:29PM +1000, Alistair Popple wrote:
>> @@ -337,6 +341,7 @@ struct folio {
>>  	/* private: */
>>  				};
>>  	/* public: */
>> +			struct dev_pagemap *pgmap;
>
> Shouldn't that be indented by one more tab stop?
>
> And for ease of reading, perhaps it should be placed either immediately
> before or after 'struct list_head lru;'?
>
>> +++ b/include/linux/mmzone.h
>> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
>>  	return page_zonenum(page) == ZONE_DEVICE;
>>  }
>>  
>> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
>> +{
>> +	WARN_ON(!is_zone_device_page(page));
>> +	return page_folio(page)->pgmap;
>> +}
>
> I haven't read to the end yet, but presumably we'll eventually want:
>
> static inline struct dev_pagemap *folio_dev_pagemap(const struct folio *folio)
> {
> 	WARN_ON(!folio_is_zone_device(folio))
> 	return folio->pgmap;
> }
>
> and since we'll want it eventually, maybe now is the time to add it,
> and make page_dev_pagemap() simply call it?

Sounds reasonable. I had open-coded folio->pgmap where it's needed
because at those points it's "obviously" a ZONE_DEVICE folio. Will add
it.

Matthew Wilcox Sept. 10, 2024, 1:41 p.m. UTC | #3

On Tue, Sep 10, 2024 at 04:57:41PM +1000, Alistair Popple wrote:
> 
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Tue, Sep 10, 2024 at 02:14:29PM +1000, Alistair Popple wrote:
> >> @@ -337,6 +341,7 @@ struct folio {
> >>  	/* private: */
> >>  				};
> >>  	/* public: */
> >> +			struct dev_pagemap *pgmap;
> >
> > Shouldn't that be indented by one more tab stop?
> >
> > And for ease of reading, perhaps it should be placed either immediately
> > before or after 'struct list_head lru;'?
> >
> >> +++ b/include/linux/mmzone.h
> >> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
> >>  	return page_zonenum(page) == ZONE_DEVICE;
> >>  }
> >>  
> >> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
> >> +{
> >> +	WARN_ON(!is_zone_device_page(page));
> >> +	return page_folio(page)->pgmap;
> >> +}
> >
> > I haven't read to the end yet, but presumably we'll eventually want:
> >
> > static inline struct dev_pagemap *folio_dev_pagemap(const struct folio *folio)
> > {
> > 	WARN_ON(!folio_is_zone_device(folio))
> > 	return folio->pgmap;
> > }
> >
> > and since we'll want it eventually, maybe now is the time to add it,
> > and make page_dev_pagemap() simply call it?
> 
> Sounds reasonable. I had open-coded folio->pgmap where it's needed
> because at those points it's "obviously" a ZONE_DEVICE folio. Will add
> it.

Oh, if it's obvious then just do the dereference.

kernel test robot Sept. 12, 2024, 12:44 p.m. UTC | #4

Hi Alistair,

kernel test robot noticed the following build errors:

[auto build test ERROR on 6f1833b8208c3b9e59eff10792667b6639365146]

url:    https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240910-121806
base:   6f1833b8208c3b9e59eff10792667b6639365146
patch link:    https://lore.kernel.org/r/c7026449473790e2844bb82012216c57047c7639.1725941415.git-series.apopple%40nvidia.com
patch subject: [PATCH 04/12] mm: Allow compound zone device pages
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20240912/202409122055.AMlMSljd-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240912/202409122055.AMlMSljd-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409122055.AMlMSljd-lkp@intel.com/

All errors (new ones prefixed by >>):

         |         ^
   In file included from mm/memory.c:44:
   In file included from include/linux/mm.h:1106:
   In file included from include/linux/huge_mm.h:8:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:163:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
     163 | _SIG_SET_BINOP(sigandnsets, _sig_andn)
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:141:3: note: expanded from macro '_SIG_SET_BINOP'
     141 |                 r->sig[2] = op(a2, b2);                                 \
         |                 ^      ~
   arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
      24 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from mm/memory.c:44:
   In file included from include/linux/mm.h:1106:
   In file included from include/linux/huge_mm.h:8:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:187:1: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
     187 | _SIG_SET_OP(signotset, _sig_not)
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:174:27: note: expanded from macro '_SIG_SET_OP'
     174 |         case 4: set->sig[3] = op(set->sig[3]);                          \
         |                                  ^        ~
   include/linux/signal.h:186:24: note: expanded from macro '_sig_not'
     186 | #define _sig_not(x)     (~(x))
         |                            ^
   arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
      24 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from mm/memory.c:44:
   In file included from include/linux/mm.h:1106:
   In file included from include/linux/huge_mm.h:8:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:187:1: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
     187 | _SIG_SET_OP(signotset, _sig_not)
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:174:10: note: expanded from macro '_SIG_SET_OP'
     174 |         case 4: set->sig[3] = op(set->sig[3]);                          \
         |                 ^        ~
   arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
      24 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from mm/memory.c:44:
   In file included from include/linux/mm.h:1106:
   In file included from include/linux/huge_mm.h:8:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:187:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
     187 | _SIG_SET_OP(signotset, _sig_not)
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:175:20: note: expanded from macro '_SIG_SET_OP'
     175 |                 set->sig[2] = op(set->sig[2]);                          \
         |                                  ^        ~
   include/linux/signal.h:186:24: note: expanded from macro '_sig_not'
     186 | #define _sig_not(x)     (~(x))
         |                            ^
   arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
      24 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from mm/memory.c:44:
   In file included from include/linux/mm.h:1106:
   In file included from include/linux/huge_mm.h:8:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:187:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
     187 | _SIG_SET_OP(signotset, _sig_not)
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:175:3: note: expanded from macro '_SIG_SET_OP'
     175 |                 set->sig[2] = op(set->sig[2]);                          \
         |                 ^        ~
   arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
      24 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from mm/memory.c:51:
   include/linux/mman.h:158:9: warning: division by zero is undefined [-Wdivision-by-zero]
     158 |                _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
         |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/mman.h:136:21: note: expanded from macro '_calc_vm_trans'
     136 |    : ((x) & (bit1)) / ((bit1) / (bit2))))
         |                     ^ ~~~~~~~~~~~~~~~~~
   include/linux/mman.h:159:9: warning: division by zero is undefined [-Wdivision-by-zero]
     159 |                _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
         |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/mman.h:136:21: note: expanded from macro '_calc_vm_trans'
     136 |    : ((x) & (bit1)) / ((bit1) / (bit2))))
         |                     ^ ~~~~~~~~~~~~~~~~~
>> mm/memory.c:4052:12: error: call to undeclared function 'page_dev_pagemap'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    4052 |                         pgmap = page_dev_pagemap(vmf->page);
         |                                 ^
>> mm/memory.c:4052:10: error: incompatible integer to pointer conversion assigning to 'struct dev_pagemap *' from 'int' [-Wint-conversion]
    4052 |                         pgmap = page_dev_pagemap(vmf->page);
         |                               ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
   42 warnings and 8 errors generated.


vim +/page_dev_pagemap +4052 mm/memory.c

  3988	
  3989	/*
  3990	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  3991	 * but allow concurrent faults), and pte mapped but not yet locked.
  3992	 * We return with pte unmapped and unlocked.
  3993	 *
  3994	 * We return with the mmap_lock locked or unlocked in the same cases
  3995	 * as does filemap_fault().
  3996	 */
  3997	vm_fault_t do_swap_page(struct vm_fault *vmf)
  3998	{
  3999		struct vm_area_struct *vma = vmf->vma;
  4000		struct folio *swapcache, *folio = NULL;
  4001		struct page *page;
  4002		struct swap_info_struct *si = NULL;
  4003		rmap_t rmap_flags = RMAP_NONE;
  4004		bool need_clear_cache = false;
  4005		bool exclusive = false;
  4006		swp_entry_t entry;
  4007		pte_t pte;
  4008		vm_fault_t ret = 0;
  4009		void *shadow = NULL;
  4010		int nr_pages;
  4011		unsigned long page_idx;
  4012		unsigned long address;
  4013		pte_t *ptep;
  4014	
  4015		if (!pte_unmap_same(vmf))
  4016			goto out;
  4017	
  4018		entry = pte_to_swp_entry(vmf->orig_pte);
  4019		if (unlikely(non_swap_entry(entry))) {
  4020			if (is_migration_entry(entry)) {
  4021				migration_entry_wait(vma->vm_mm, vmf->pmd,
  4022						     vmf->address);
  4023			} else if (is_device_exclusive_entry(entry)) {
  4024				vmf->page = pfn_swap_entry_to_page(entry);
  4025				ret = remove_device_exclusive_entry(vmf);
  4026			} else if (is_device_private_entry(entry)) {
  4027				struct dev_pagemap *pgmap;
  4028				if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
  4029					/*
  4030					 * migrate_to_ram is not yet ready to operate
  4031					 * under VMA lock.
  4032					 */
  4033					vma_end_read(vma);
  4034					ret = VM_FAULT_RETRY;
  4035					goto out;
  4036				}
  4037	
  4038				vmf->page = pfn_swap_entry_to_page(entry);
  4039				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4040						vmf->address, &vmf->ptl);
  4041				if (unlikely(!vmf->pte ||
  4042					     !pte_same(ptep_get(vmf->pte),
  4043								vmf->orig_pte)))
  4044					goto unlock;
  4045	
  4046				/*
  4047				 * Get a page reference while we know the page can't be
  4048				 * freed.
  4049				 */
  4050				get_page(vmf->page);
  4051				pte_unmap_unlock(vmf->pte, vmf->ptl);
> 4052				pgmap = page_dev_pagemap(vmf->page);
  4053				ret = pgmap->ops->migrate_to_ram(vmf);
  4054				put_page(vmf->page);
  4055			} else if (is_hwpoison_entry(entry)) {
  4056				ret = VM_FAULT_HWPOISON;
  4057			} else if (is_pte_marker_entry(entry)) {
  4058				ret = handle_pte_marker(vmf);
  4059			} else {
  4060				print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
  4061				ret = VM_FAULT_SIGBUS;
  4062			}
  4063			goto out;
  4064		}
  4065	
  4066		/* Prevent swapoff from happening to us. */
  4067		si = get_swap_device(entry);
  4068		if (unlikely(!si))
  4069			goto out;
  4070	
  4071		folio = swap_cache_get_folio(entry, vma, vmf->address);
  4072		if (folio)
  4073			page = folio_file_page(folio, swp_offset(entry));
  4074		swapcache = folio;
  4075	
  4076		if (!folio) {
  4077			if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
  4078			    __swap_count(entry) == 1) {
  4079				/*
  4080				 * Prevent parallel swapin from proceeding with
  4081				 * the cache flag. Otherwise, another thread may
  4082				 * finish swapin first, free the entry, and swapout
  4083				 * reusing the same entry. It's undetectable as
  4084				 * pte_same() returns true due to entry reuse.
  4085				 */
  4086				if (swapcache_prepare(entry, 1)) {
  4087					/* Relax a bit to prevent rapid repeated page faults */
  4088					schedule_timeout_uninterruptible(1);
  4089					goto out;
  4090				}
  4091				need_clear_cache = true;
  4092	
  4093				/* skip swapcache */
  4094				folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
  4095							vma, vmf->address, false);
  4096				if (folio) {
  4097					__folio_set_locked(folio);
  4098					__folio_set_swapbacked(folio);
  4099	
  4100					if (mem_cgroup_swapin_charge_folio(folio,
  4101								vma->vm_mm, GFP_KERNEL,
  4102								entry)) {
  4103						ret = VM_FAULT_OOM;
  4104						goto out_page;
  4105					}
  4106					mem_cgroup_swapin_uncharge_swap(entry);
  4107	
  4108					shadow = get_shadow_from_swap_cache(entry);
  4109					if (shadow)
  4110						workingset_refault(folio, shadow);
  4111	
  4112					folio_add_lru(folio);
  4113	
  4114					/* To provide entry to swap_read_folio() */
  4115					folio->swap = entry;
  4116					swap_read_folio(folio, NULL);
  4117					folio->private = NULL;
  4118				}
  4119			} else {
  4120				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
  4121							vmf);
  4122				swapcache = folio;
  4123			}
  4124	
  4125			if (!folio) {
  4126				/*
  4127				 * Back out if somebody else faulted in this pte
  4128				 * while we released the pte lock.
  4129				 */
  4130				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4131						vmf->address, &vmf->ptl);
  4132				if (likely(vmf->pte &&
  4133					   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
  4134					ret = VM_FAULT_OOM;
  4135				goto unlock;
  4136			}
  4137	
  4138			/* Had to read the page from swap area: Major fault */
  4139			ret = VM_FAULT_MAJOR;
  4140			count_vm_event(PGMAJFAULT);
  4141			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
  4142			page = folio_file_page(folio, swp_offset(entry));
  4143		} else if (PageHWPoison(page)) {
  4144			/*
  4145			 * hwpoisoned dirty swapcache pages are kept for killing
  4146			 * owner processes (which may be unknown at hwpoison time)
  4147			 */
  4148			ret = VM_FAULT_HWPOISON;
  4149			goto out_release;
  4150		}
  4151	
  4152		ret |= folio_lock_or_retry(folio, vmf);
  4153		if (ret & VM_FAULT_RETRY)
  4154			goto out_release;
  4155	
  4156		if (swapcache) {
  4157			/*
  4158			 * Make sure folio_free_swap() or swapoff did not release the
  4159			 * swapcache from under us.  The page pin, and pte_same test
  4160			 * below, are not enough to exclude that.  Even if it is still
  4161			 * swapcache, we need to check that the page's swap has not
  4162			 * changed.
  4163			 */
  4164			if (unlikely(!folio_test_swapcache(folio) ||
  4165				     page_swap_entry(page).val != entry.val))
  4166				goto out_page;
  4167	
  4168			/*
  4169			 * KSM sometimes has to copy on read faults, for example, if
  4170			 * page->index of !PageKSM() pages would be nonlinear inside the
  4171			 * anon VMA -- PageKSM() is lost on actual swapout.
  4172			 */
  4173			folio = ksm_might_need_to_copy(folio, vma, vmf->address);
  4174			if (unlikely(!folio)) {
  4175				ret = VM_FAULT_OOM;
  4176				folio = swapcache;
  4177				goto out_page;
  4178			} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
  4179				ret = VM_FAULT_HWPOISON;
  4180				folio = swapcache;
  4181				goto out_page;
  4182			}
  4183			if (folio != swapcache)
  4184				page = folio_page(folio, 0);
  4185	
  4186			/*
  4187			 * If we want to map a page that's in the swapcache writable, we
  4188			 * have to detect via the refcount if we're really the exclusive
  4189			 * owner. Try removing the extra reference from the local LRU
  4190			 * caches if required.
  4191			 */
  4192			if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
  4193			    !folio_test_ksm(folio) && !folio_test_lru(folio))
  4194				lru_add_drain();
  4195		}
  4196	
  4197		folio_throttle_swaprate(folio, GFP_KERNEL);
  4198	
  4199		/*
  4200		 * Back out if somebody else already faulted in this pte.
  4201		 */
  4202		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
  4203				&vmf->ptl);
  4204		if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
  4205			goto out_nomap;
  4206	
  4207		if (unlikely(!folio_test_uptodate(folio))) {
  4208			ret = VM_FAULT_SIGBUS;
  4209			goto out_nomap;
  4210		}
  4211	
  4212		nr_pages = 1;
  4213		page_idx = 0;
  4214		address = vmf->address;
  4215		ptep = vmf->pte;
  4216		if (folio_test_large(folio) && folio_test_swapcache(folio)) {
  4217			int nr = folio_nr_pages(folio);
  4218			unsigned long idx = folio_page_idx(folio, page);
  4219			unsigned long folio_start = address - idx * PAGE_SIZE;
  4220			unsigned long folio_end = folio_start + nr * PAGE_SIZE;
  4221			pte_t *folio_ptep;
  4222			pte_t folio_pte;
  4223	
  4224			if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
  4225				goto check_folio;
  4226			if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
  4227				goto check_folio;
  4228	
  4229			folio_ptep = vmf->pte - idx;
  4230			folio_pte = ptep_get(folio_ptep);
  4231			if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
  4232			    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
  4233				goto check_folio;
  4234	
  4235			page_idx = idx;
  4236			address = folio_start;
  4237			ptep = folio_ptep;
  4238			nr_pages = nr;
  4239			entry = folio->swap;
  4240			page = &folio->page;
  4241		}
  4242	
  4243	check_folio:
  4244		/*
  4245		 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
  4246		 * must never point at an anonymous page in the swapcache that is
  4247		 * PG_anon_exclusive. Sanity check that this holds and especially, that
  4248		 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
  4249		 * check after taking the PT lock and making sure that nobody
  4250		 * concurrently faulted in this page and set PG_anon_exclusive.
  4251		 */
  4252		BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
  4253		BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
  4254	
  4255		/*
  4256		 * Check under PT lock (to protect against concurrent fork() sharing
  4257		 * the swap entry concurrently) for certainly exclusive pages.
  4258		 */
  4259		if (!folio_test_ksm(folio)) {
  4260			exclusive = pte_swp_exclusive(vmf->orig_pte);
  4261			if (folio != swapcache) {
  4262				/*
  4263				 * We have a fresh page that is not exposed to the
  4264				 * swapcache -> certainly exclusive.
  4265				 */
  4266				exclusive = true;
  4267			} else if (exclusive && folio_test_writeback(folio) &&
  4268				  data_race(si->flags & SWP_STABLE_WRITES)) {
  4269				/*
  4270				 * This is tricky: not all swap backends support
  4271				 * concurrent page modifications while under writeback.
  4272				 *
  4273				 * So if we stumble over such a page in the swapcache
  4274				 * we must not set the page exclusive, otherwise we can
  4275				 * map it writable without further checks and modify it
  4276				 * while still under writeback.
  4277				 *
  4278				 * For these problematic swap backends, simply drop the
  4279				 * exclusive marker: this is perfectly fine as we start
  4280				 * writeback only if we fully unmapped the page and
  4281				 * there are no unexpected references on the page after
  4282				 * unmapping succeeded. After fully unmapped, no
  4283				 * further GUP references (FOLL_GET and FOLL_PIN) can
  4284				 * appear, so dropping the exclusive marker and mapping
  4285				 * it only R/O is fine.
  4286				 */
  4287				exclusive = false;
  4288			}
  4289		}
  4290	
  4291		/*
  4292		 * Some architectures may have to restore extra metadata to the page
  4293		 * when reading from swap. This metadata may be indexed by swap entry
  4294		 * so this must be called before swap_free().
  4295		 */
  4296		arch_swap_restore(folio_swap(entry, folio), folio);
  4297	
  4298		/*
  4299		 * Remove the swap entry and conditionally try to free up the swapcache.
  4300		 * We're already holding a reference on the page but haven't mapped it
  4301		 * yet.
  4302		 */
  4303		swap_free_nr(entry, nr_pages);
  4304		if (should_try_to_free_swap(folio, vma, vmf->flags))
  4305			folio_free_swap(folio);
  4306	
  4307		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
  4308		add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
  4309		pte = mk_pte(page, vma->vm_page_prot);
  4310		if (pte_swp_soft_dirty(vmf->orig_pte))
  4311			pte = pte_mksoft_dirty(pte);
  4312		if (pte_swp_uffd_wp(vmf->orig_pte))
  4313			pte = pte_mkuffd_wp(pte);
  4314	
  4315		/*
  4316		 * Same logic as in do_wp_page(); however, optimize for pages that are
  4317		 * certainly not shared either because we just allocated them without
  4318		 * exposing them to the swapcache or because the swap entry indicates
  4319		 * exclusivity.
  4320		 */
  4321		if (!folio_test_ksm(folio) &&
  4322		    (exclusive || folio_ref_count(folio) == 1)) {
  4323			if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
  4324			    !pte_needs_soft_dirty_wp(vma, pte)) {
  4325				pte = pte_mkwrite(pte, vma);
  4326				if (vmf->flags & FAULT_FLAG_WRITE) {
  4327					pte = pte_mkdirty(pte);
  4328					vmf->flags &= ~FAULT_FLAG_WRITE;
  4329				}
  4330			}
  4331			rmap_flags |= RMAP_EXCLUSIVE;
  4332		}
  4333		folio_ref_add(folio, nr_pages - 1);
  4334		flush_icache_pages(vma, page, nr_pages);
  4335		vmf->orig_pte = pte_advance_pfn(pte, page_idx);
  4336	
  4337		/* ksm created a completely new copy */
  4338		if (unlikely(folio != swapcache && swapcache)) {
  4339			folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
  4340			folio_add_lru_vma(folio, vma);
  4341		} else if (!folio_test_anon(folio)) {
  4342			/*
  4343			 * We currently only expect small !anon folios, which are either
  4344			 * fully exclusive or fully shared. If we ever get large folios
  4345			 * here, we have to be careful.
  4346			 */
  4347			VM_WARN_ON_ONCE(folio_test_large(folio));
  4348			VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
  4349			folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
  4350		} else {
  4351			folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
  4352						rmap_flags);
  4353		}
  4354	
  4355		VM_BUG_ON(!folio_test_anon(folio) ||
  4356				(pte_write(pte) && !PageAnonExclusive(page)));
  4357		set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
  4358		arch_do_swap_page_nr(vma->vm_mm, vma, address,
  4359				pte, pte, nr_pages);
  4360	
  4361		folio_unlock(folio);
  4362		if (folio != swapcache && swapcache) {
  4363			/*
  4364			 * Hold the lock to avoid the swap entry to be reused
  4365			 * until we take the PT lock for the pte_same() check
  4366			 * (to avoid false positives from pte_same). For
  4367			 * further safety release the lock after the swap_free
  4368			 * so that the swap count won't change under a
  4369			 * parallel locked swapcache.
  4370			 */
  4371			folio_unlock(swapcache);
  4372			folio_put(swapcache);
  4373		}
  4374	
  4375		if (vmf->flags & FAULT_FLAG_WRITE) {
  4376			ret |= do_wp_page(vmf);
  4377			if (ret & VM_FAULT_ERROR)
  4378				ret &= VM_FAULT_ERROR;
  4379			goto out;
  4380		}
  4381	
  4382		/* No need to invalidate - it was non-present before */
  4383		update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
  4384	unlock:
  4385		if (vmf->pte)
  4386			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4387	out:
  4388		/* Clear the swap cache pin for direct swapin after PTL unlock */
  4389		if (need_clear_cache)
  4390			swapcache_clear(si, entry, 1);
  4391		if (si)
  4392			put_swap_device(si);
  4393		return ret;
  4394	out_nomap:
  4395		if (vmf->pte)
  4396			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4397	out_page:
  4398		folio_unlock(folio);
  4399	out_release:
  4400		folio_put(folio);
  4401		if (folio != swapcache && swapcache) {
  4402			folio_unlock(swapcache);
  4403			folio_put(swapcache);
  4404		}
  4405		if (need_clear_cache)
  4406			swapcache_clear(si, entry, 1);
  4407		if (si)
  4408			put_swap_device(si);
  4409		return ret;
  4410	}
  4411

kernel test robot Sept. 12, 2024, 12:44 p.m. UTC | #5

Hi Alistair,

kernel test robot noticed the following build errors:

[auto build test ERROR on 6f1833b8208c3b9e59eff10792667b6639365146]

url:    https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240910-121806
base:   6f1833b8208c3b9e59eff10792667b6639365146
patch link:    https://lore.kernel.org/r/c7026449473790e2844bb82012216c57047c7639.1725941415.git-series.apopple%40nvidia.com
patch subject: [PATCH 04/12] mm: Allow compound zone device pages
config: csky-defconfig (https://download.01.org/0day-ci/archive/20240912/202409122024.PPIwP6vb-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240912/202409122024.PPIwP6vb-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409122024.PPIwP6vb-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/mm.h:32,
                    from mm/gup.c:7:
   include/linux/memremap.h: In function 'is_device_private_page':
   include/linux/memremap.h:164:17: error: implicit declaration of function 'page_dev_pagemap' [-Wimplicit-function-declaration]
     164 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
         |                 ^~~~~~~~~~~~~~~~
   include/linux/memremap.h:164:39: error: invalid type argument of '->' (have 'int')
     164 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
         |                                       ^~
   include/linux/memremap.h: In function 'is_pci_p2pdma_page':
   include/linux/memremap.h:176:39: error: invalid type argument of '->' (have 'int')
     176 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
         |                                       ^~
   include/linux/memremap.h: In function 'is_device_coherent_page':
   include/linux/memremap.h:182:39: error: invalid type argument of '->' (have 'int')
     182 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
         |                                       ^~
   include/linux/memremap.h: In function 'is_pci_p2pdma_page':
>> include/linux/memremap.h:177:1: warning: control reaches end of non-void function [-Wreturn-type]
     177 | }
         | ^
   include/linux/memremap.h: In function 'is_device_coherent_page':
   include/linux/memremap.h:183:1: warning: control reaches end of non-void function [-Wreturn-type]
     183 | }
         | ^
--
   In file included from include/linux/mm.h:32,
                    from mm/memory.c:44:
   include/linux/memremap.h: In function 'is_device_private_page':
   include/linux/memremap.h:164:17: error: implicit declaration of function 'page_dev_pagemap' [-Wimplicit-function-declaration]
     164 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
         |                 ^~~~~~~~~~~~~~~~
   include/linux/memremap.h:164:39: error: invalid type argument of '->' (have 'int')
     164 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
         |                                       ^~
   include/linux/memremap.h: In function 'is_pci_p2pdma_page':
   include/linux/memremap.h:176:39: error: invalid type argument of '->' (have 'int')
     176 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
         |                                       ^~
   include/linux/memremap.h: In function 'is_device_coherent_page':
   include/linux/memremap.h:182:39: error: invalid type argument of '->' (have 'int')
     182 |                 page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
         |                                       ^~
   mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:4052:31: error: assignment to 'struct dev_pagemap *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
    4052 |                         pgmap = page_dev_pagemap(vmf->page);
         |                               ^
   include/linux/memremap.h: In function 'is_device_private_page':
   include/linux/memremap.h:165:1: warning: control reaches end of non-void function [-Wreturn-type]
     165 | }
         | ^


vim +4052 mm/memory.c

  3988	
  3989	/*
  3990	 * We enter with non-exclusive mmap_lock (to exclude vma changes,
  3991	 * but allow concurrent faults), and pte mapped but not yet locked.
  3992	 * We return with pte unmapped and unlocked.
  3993	 *
  3994	 * We return with the mmap_lock locked or unlocked in the same cases
  3995	 * as does filemap_fault().
  3996	 */
  3997	vm_fault_t do_swap_page(struct vm_fault *vmf)
  3998	{
  3999		struct vm_area_struct *vma = vmf->vma;
  4000		struct folio *swapcache, *folio = NULL;
  4001		struct page *page;
  4002		struct swap_info_struct *si = NULL;
  4003		rmap_t rmap_flags = RMAP_NONE;
  4004		bool need_clear_cache = false;
  4005		bool exclusive = false;
  4006		swp_entry_t entry;
  4007		pte_t pte;
  4008		vm_fault_t ret = 0;
  4009		void *shadow = NULL;
  4010		int nr_pages;
  4011		unsigned long page_idx;
  4012		unsigned long address;
  4013		pte_t *ptep;
  4014	
  4015		if (!pte_unmap_same(vmf))
  4016			goto out;
  4017	
  4018		entry = pte_to_swp_entry(vmf->orig_pte);
  4019		if (unlikely(non_swap_entry(entry))) {
  4020			if (is_migration_entry(entry)) {
  4021				migration_entry_wait(vma->vm_mm, vmf->pmd,
  4022						     vmf->address);
  4023			} else if (is_device_exclusive_entry(entry)) {
  4024				vmf->page = pfn_swap_entry_to_page(entry);
  4025				ret = remove_device_exclusive_entry(vmf);
  4026			} else if (is_device_private_entry(entry)) {
  4027				struct dev_pagemap *pgmap;
  4028				if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
  4029					/*
  4030					 * migrate_to_ram is not yet ready to operate
  4031					 * under VMA lock.
  4032					 */
  4033					vma_end_read(vma);
  4034					ret = VM_FAULT_RETRY;
  4035					goto out;
  4036				}
  4037	
  4038				vmf->page = pfn_swap_entry_to_page(entry);
  4039				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4040						vmf->address, &vmf->ptl);
  4041				if (unlikely(!vmf->pte ||
  4042					     !pte_same(ptep_get(vmf->pte),
  4043								vmf->orig_pte)))
  4044					goto unlock;
  4045	
  4046				/*
  4047				 * Get a page reference while we know the page can't be
  4048				 * freed.
  4049				 */
  4050				get_page(vmf->page);
  4051				pte_unmap_unlock(vmf->pte, vmf->ptl);
> 4052				pgmap = page_dev_pagemap(vmf->page);
  4053				ret = pgmap->ops->migrate_to_ram(vmf);
  4054				put_page(vmf->page);
  4055			} else if (is_hwpoison_entry(entry)) {
  4056				ret = VM_FAULT_HWPOISON;
  4057			} else if (is_pte_marker_entry(entry)) {
  4058				ret = handle_pte_marker(vmf);
  4059			} else {
  4060				print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
  4061				ret = VM_FAULT_SIGBUS;
  4062			}
  4063			goto out;
  4064		}
  4065	
  4066		/* Prevent swapoff from happening to us. */
  4067		si = get_swap_device(entry);
  4068		if (unlikely(!si))
  4069			goto out;
  4070	
  4071		folio = swap_cache_get_folio(entry, vma, vmf->address);
  4072		if (folio)
  4073			page = folio_file_page(folio, swp_offset(entry));
  4074		swapcache = folio;
  4075	
  4076		if (!folio) {
  4077			if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
  4078			    __swap_count(entry) == 1) {
  4079				/*
  4080				 * Prevent parallel swapin from proceeding with
  4081				 * the cache flag. Otherwise, another thread may
  4082				 * finish swapin first, free the entry, and swapout
  4083				 * reusing the same entry. It's undetectable as
  4084				 * pte_same() returns true due to entry reuse.
  4085				 */
  4086				if (swapcache_prepare(entry, 1)) {
  4087					/* Relax a bit to prevent rapid repeated page faults */
  4088					schedule_timeout_uninterruptible(1);
  4089					goto out;
  4090				}
  4091				need_clear_cache = true;
  4092	
  4093				/* skip swapcache */
  4094				folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
  4095							vma, vmf->address, false);
  4096				if (folio) {
  4097					__folio_set_locked(folio);
  4098					__folio_set_swapbacked(folio);
  4099	
  4100					if (mem_cgroup_swapin_charge_folio(folio,
  4101								vma->vm_mm, GFP_KERNEL,
  4102								entry)) {
  4103						ret = VM_FAULT_OOM;
  4104						goto out_page;
  4105					}
  4106					mem_cgroup_swapin_uncharge_swap(entry);
  4107	
  4108					shadow = get_shadow_from_swap_cache(entry);
  4109					if (shadow)
  4110						workingset_refault(folio, shadow);
  4111	
  4112					folio_add_lru(folio);
  4113	
  4114					/* To provide entry to swap_read_folio() */
  4115					folio->swap = entry;
  4116					swap_read_folio(folio, NULL);
  4117					folio->private = NULL;
  4118				}
  4119			} else {
  4120				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
  4121							vmf);
  4122				swapcache = folio;
  4123			}
  4124	
  4125			if (!folio) {
  4126				/*
  4127				 * Back out if somebody else faulted in this pte
  4128				 * while we released the pte lock.
  4129				 */
  4130				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4131						vmf->address, &vmf->ptl);
  4132				if (likely(vmf->pte &&
  4133					   pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
  4134					ret = VM_FAULT_OOM;
  4135				goto unlock;
  4136			}
  4137	
  4138			/* Had to read the page from swap area: Major fault */
  4139			ret = VM_FAULT_MAJOR;
  4140			count_vm_event(PGMAJFAULT);
  4141			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
  4142			page = folio_file_page(folio, swp_offset(entry));
  4143		} else if (PageHWPoison(page)) {
  4144			/*
  4145			 * hwpoisoned dirty swapcache pages are kept for killing
  4146			 * owner processes (which may be unknown at hwpoison time)
  4147			 */
  4148			ret = VM_FAULT_HWPOISON;
  4149			goto out_release;
  4150		}
  4151	
  4152		ret |= folio_lock_or_retry(folio, vmf);
  4153		if (ret & VM_FAULT_RETRY)
  4154			goto out_release;
  4155	
  4156		if (swapcache) {
  4157			/*
  4158			 * Make sure folio_free_swap() or swapoff did not release the
  4159			 * swapcache from under us.  The page pin, and pte_same test
  4160			 * below, are not enough to exclude that.  Even if it is still
  4161			 * swapcache, we need to check that the page's swap has not
  4162			 * changed.
  4163			 */
  4164			if (unlikely(!folio_test_swapcache(folio) ||
  4165				     page_swap_entry(page).val != entry.val))
  4166				goto out_page;
  4167	
  4168			/*
  4169			 * KSM sometimes has to copy on read faults, for example, if
  4170			 * page->index of !PageKSM() pages would be nonlinear inside the
  4171			 * anon VMA -- PageKSM() is lost on actual swapout.
  4172			 */
  4173			folio = ksm_might_need_to_copy(folio, vma, vmf->address);
  4174			if (unlikely(!folio)) {
  4175				ret = VM_FAULT_OOM;
  4176				folio = swapcache;
  4177				goto out_page;
  4178			} else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
  4179				ret = VM_FAULT_HWPOISON;
  4180				folio = swapcache;
  4181				goto out_page;
  4182			}
  4183			if (folio != swapcache)
  4184				page = folio_page(folio, 0);
  4185	
  4186			/*
  4187			 * If we want to map a page that's in the swapcache writable, we
  4188			 * have to detect via the refcount if we're really the exclusive
  4189			 * owner. Try removing the extra reference from the local LRU
  4190			 * caches if required.
  4191			 */
  4192			if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
  4193			    !folio_test_ksm(folio) && !folio_test_lru(folio))
  4194				lru_add_drain();
  4195		}
  4196	
  4197		folio_throttle_swaprate(folio, GFP_KERNEL);
  4198	
  4199		/*
  4200		 * Back out if somebody else already faulted in this pte.
  4201		 */
  4202		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
  4203				&vmf->ptl);
  4204		if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
  4205			goto out_nomap;
  4206	
  4207		if (unlikely(!folio_test_uptodate(folio))) {
  4208			ret = VM_FAULT_SIGBUS;
  4209			goto out_nomap;
  4210		}
  4211	
  4212		nr_pages = 1;
  4213		page_idx = 0;
  4214		address = vmf->address;
  4215		ptep = vmf->pte;
  4216		if (folio_test_large(folio) && folio_test_swapcache(folio)) {
  4217			int nr = folio_nr_pages(folio);
  4218			unsigned long idx = folio_page_idx(folio, page);
  4219			unsigned long folio_start = address - idx * PAGE_SIZE;
  4220			unsigned long folio_end = folio_start + nr * PAGE_SIZE;
  4221			pte_t *folio_ptep;
  4222			pte_t folio_pte;
  4223	
  4224			if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
  4225				goto check_folio;
  4226			if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
  4227				goto check_folio;
  4228	
  4229			folio_ptep = vmf->pte - idx;
  4230			folio_pte = ptep_get(folio_ptep);
  4231			if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
  4232			    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
  4233				goto check_folio;
  4234	
  4235			page_idx = idx;
  4236			address = folio_start;
  4237			ptep = folio_ptep;
  4238			nr_pages = nr;
  4239			entry = folio->swap;
  4240			page = &folio->page;
  4241		}
  4242	
  4243	check_folio:
  4244		/*
  4245		 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
  4246		 * must never point at an anonymous page in the swapcache that is
  4247		 * PG_anon_exclusive. Sanity check that this holds and especially, that
  4248		 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
  4249		 * check after taking the PT lock and making sure that nobody
  4250		 * concurrently faulted in this page and set PG_anon_exclusive.
  4251		 */
  4252		BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
  4253		BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
  4254	
  4255		/*
  4256		 * Check under PT lock (to protect against concurrent fork() sharing
  4257		 * the swap entry concurrently) for certainly exclusive pages.
  4258		 */
  4259		if (!folio_test_ksm(folio)) {
  4260			exclusive = pte_swp_exclusive(vmf->orig_pte);
  4261			if (folio != swapcache) {
  4262				/*
  4263				 * We have a fresh page that is not exposed to the
  4264				 * swapcache -> certainly exclusive.
  4265				 */
  4266				exclusive = true;
  4267			} else if (exclusive && folio_test_writeback(folio) &&
  4268				  data_race(si->flags & SWP_STABLE_WRITES)) {
  4269				/*
  4270				 * This is tricky: not all swap backends support
  4271				 * concurrent page modifications while under writeback.
  4272				 *
  4273				 * So if we stumble over such a page in the swapcache
  4274				 * we must not set the page exclusive, otherwise we can
  4275				 * map it writable without further checks and modify it
  4276				 * while still under writeback.
  4277				 *
  4278				 * For these problematic swap backends, simply drop the
  4279				 * exclusive marker: this is perfectly fine as we start
  4280				 * writeback only if we fully unmapped the page and
  4281				 * there are no unexpected references on the page after
  4282				 * unmapping succeeded. After fully unmapped, no
  4283				 * further GUP references (FOLL_GET and FOLL_PIN) can
  4284				 * appear, so dropping the exclusive marker and mapping
  4285				 * it only R/O is fine.
  4286				 */
  4287				exclusive = false;
  4288			}
  4289		}
  4290	
  4291		/*
  4292		 * Some architectures may have to restore extra metadata to the page
  4293		 * when reading from swap. This metadata may be indexed by swap entry
  4294		 * so this must be called before swap_free().
  4295		 */
  4296		arch_swap_restore(folio_swap(entry, folio), folio);
  4297	
  4298		/*
  4299		 * Remove the swap entry and conditionally try to free up the swapcache.
  4300		 * We're already holding a reference on the page but haven't mapped it
  4301		 * yet.
  4302		 */
  4303		swap_free_nr(entry, nr_pages);
  4304		if (should_try_to_free_swap(folio, vma, vmf->flags))
  4305			folio_free_swap(folio);
  4306	
  4307		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
  4308		add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
  4309		pte = mk_pte(page, vma->vm_page_prot);
  4310		if (pte_swp_soft_dirty(vmf->orig_pte))
  4311			pte = pte_mksoft_dirty(pte);
  4312		if (pte_swp_uffd_wp(vmf->orig_pte))
  4313			pte = pte_mkuffd_wp(pte);
  4314	
  4315		/*
  4316		 * Same logic as in do_wp_page(); however, optimize for pages that are
  4317		 * certainly not shared either because we just allocated them without
  4318		 * exposing them to the swapcache or because the swap entry indicates
  4319		 * exclusivity.
  4320		 */
  4321		if (!folio_test_ksm(folio) &&
  4322		    (exclusive || folio_ref_count(folio) == 1)) {
  4323			if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
  4324			    !pte_needs_soft_dirty_wp(vma, pte)) {
  4325				pte = pte_mkwrite(pte, vma);
  4326				if (vmf->flags & FAULT_FLAG_WRITE) {
  4327					pte = pte_mkdirty(pte);
  4328					vmf->flags &= ~FAULT_FLAG_WRITE;
  4329				}
  4330			}
  4331			rmap_flags |= RMAP_EXCLUSIVE;
  4332		}
  4333		folio_ref_add(folio, nr_pages - 1);
  4334		flush_icache_pages(vma, page, nr_pages);
  4335		vmf->orig_pte = pte_advance_pfn(pte, page_idx);
  4336	
  4337		/* ksm created a completely new copy */
  4338		if (unlikely(folio != swapcache && swapcache)) {
  4339			folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
  4340			folio_add_lru_vma(folio, vma);
  4341		} else if (!folio_test_anon(folio)) {
  4342			/*
  4343			 * We currently only expect small !anon folios, which are either
  4344			 * fully exclusive or fully shared. If we ever get large folios
  4345			 * here, we have to be careful.
  4346			 */
  4347			VM_WARN_ON_ONCE(folio_test_large(folio));
  4348			VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
  4349			folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
  4350		} else {
  4351			folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
  4352						rmap_flags);
  4353		}
  4354	
  4355		VM_BUG_ON(!folio_test_anon(folio) ||
  4356				(pte_write(pte) && !PageAnonExclusive(page)));
  4357		set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
  4358		arch_do_swap_page_nr(vma->vm_mm, vma, address,
  4359				pte, pte, nr_pages);
  4360	
  4361		folio_unlock(folio);
  4362		if (folio != swapcache && swapcache) {
  4363			/*
  4364			 * Hold the lock to avoid the swap entry to be reused
  4365			 * until we take the PT lock for the pte_same() check
  4366			 * (to avoid false positives from pte_same). For
  4367			 * further safety release the lock after the swap_free
  4368			 * so that the swap count won't change under a
  4369			 * parallel locked swapcache.
  4370			 */
  4371			folio_unlock(swapcache);
  4372			folio_put(swapcache);
  4373		}
  4374	
  4375		if (vmf->flags & FAULT_FLAG_WRITE) {
  4376			ret |= do_wp_page(vmf);
  4377			if (ret & VM_FAULT_ERROR)
  4378				ret &= VM_FAULT_ERROR;
  4379			goto out;
  4380		}
  4381	
  4382		/* No need to invalidate - it was non-present before */
  4383		update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
  4384	unlock:
  4385		if (vmf->pte)
  4386			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4387	out:
  4388		/* Clear the swap cache pin for direct swapin after PTL unlock */
  4389		if (need_clear_cache)
  4390			swapcache_clear(si, entry, 1);
  4391		if (si)
  4392			put_swap_device(si);
  4393		return ret;
  4394	out_nomap:
  4395		if (vmf->pte)
  4396			pte_unmap_unlock(vmf->pte, vmf->ptl);
  4397	out_page:
  4398		folio_unlock(folio);
  4399	out_release:
  4400		folio_put(folio);
  4401		if (folio != swapcache && swapcache) {
  4402			folio_unlock(swapcache);
  4403			folio_put(swapcache);
  4404		}
  4405		if (need_clear_cache)
  4406			swapcache_clear(si, entry, 1);
  4407		if (si)
  4408			put_swap_device(si);
  4409		return ret;
  4410	}
  4411

Dan Williams Sept. 22, 2024, 1:01 a.m. UTC | #6

Alistair Popple wrote:
> Zone device pages are used to represent various type of device memory
> managed by device drivers. Currently compound zone device pages are
> not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> user of higher order zone device pages and have their own page
> reference counting.
> 
> A future change will unify FS DAX reference counting with normal page
> reference counting rules and remove the special FS DAX reference
> counting. Supporting that requires compound zone device pages.
> 
> Supporting compound zone device pages requires compound_head() to
> distinguish between head and tail pages whilst still preserving the
> special struct page fields that are specific to zone device pages.
> 
> A tail page is distinguished by having bit zero being set in
> page->compound_head, with the remaining bits pointing to the head
> page. For zone device pages page->compound_head is shared with
> page->pgmap.
> 
> The page->pgmap field is common to all pages within a memory section.
> Therefore pgmap is the same for both head and tail pages and can be
> moved into the folio and we can use the standard scheme to find
> compound_head from a tail page.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> ---
> 
> Changes since v1:
> 
>  - Move pgmap to the folio as suggested by Matthew Wilcox
> ---
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  3 ++-
>  drivers/pci/p2pdma.c                   |  6 +++---
>  include/linux/memremap.h               |  6 +++---
>  include/linux/migrate.h                |  4 ++--
>  include/linux/mm_types.h               |  9 +++++++--
>  include/linux/mmzone.h                 |  8 +++++++-
>  lib/test_hmm.c                         |  3 ++-
>  mm/hmm.c                               |  2 +-
>  mm/memory.c                            |  4 +++-
>  mm/memremap.c                          | 14 +++++++-------
>  mm/migrate_device.c                    |  7 +++++--
>  mm/mm_init.c                           |  2 +-
>  12 files changed, 43 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 6fb65b0..58d308c 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -88,7 +88,8 @@ struct nouveau_dmem {
>  
>  static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
>  {
> -	return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
> +	return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk,

page_dev_pagemap() feels like a mouthful. I would be ok with
page_pgmap() since that is the most common idenifier for struct
struct dev_pagemap instances.

> +			    pagemap);
>  }
>  
>  static struct nouveau_drm *page_to_drm(struct page *page)
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 210b9f4..a58f2c1 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -199,7 +199,7 @@ static const struct attribute_group p2pmem_group = {
>  
>  static void p2pdma_page_free(struct page *page)
>  {
> -	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> +	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page));
>  	/* safe to dereference while a reference is held to the percpu ref */
>  	struct pci_p2pdma *p2pdma =
>  		rcu_dereference_protected(pgmap->provider->p2pdma, 1);
> @@ -1022,8 +1022,8 @@ enum pci_p2pdma_map_type
>  pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
>  		       struct scatterlist *sg)
>  {
> -	if (state->pgmap != sg_page(sg)->pgmap) {
> -		state->pgmap = sg_page(sg)->pgmap;
> +	if (state->pgmap != page_dev_pagemap(sg_page(sg))) {
> +		state->pgmap = page_dev_pagemap(sg_page(sg));
>  		state->map = pci_p2pdma_map_type(state->pgmap, dev);
>  		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
>  	}
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 3f7143a..14273e6 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page *page)
>  {
>  	return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
>  		is_zone_device_page(page) &&
> -		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
> +		page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
>  }
>  
>  static inline bool folio_is_device_private(const struct folio *folio)
> @@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>  {
>  	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
>  		is_zone_device_page(page) &&
> -		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
> +		page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>  
>  static inline bool is_device_coherent_page(const struct page *page)
>  {
>  	return is_zone_device_page(page) &&
> -		page->pgmap->type == MEMORY_DEVICE_COHERENT;
> +		page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
>  }
>  
>  static inline bool folio_is_device_coherent(const struct folio *folio)
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 002e49b..9a85a82 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -207,8 +207,8 @@ struct migrate_vma {
>  	unsigned long		end;
>  
>  	/*
> -	 * Set to the owner value also stored in page->pgmap->owner for
> -	 * migrating out of device private memory. The flags also need to
> +	 * Set to the owner value also stored in page_dev_pagemap(page)->owner
> +	 * for migrating out of device private memory. The flags also need to
>  	 * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
>  	 * The caller should always set this field when using mmu notifier
>  	 * callbacks to avoid device MMU invalidations for device private
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8..c2f1d53 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -129,8 +129,11 @@ struct page {
>  			unsigned long compound_head;	/* Bit zero is set */
>  		};
>  		struct {	/* ZONE_DEVICE pages */
> -			/** @pgmap: Points to the hosting device page map. */
> -			struct dev_pagemap *pgmap;
> +			/*
> +			 * The first word is used for compound_head or folio
> +			 * pgmap
> +			 */
> +			void *_unused;

I would feel better with "_unused_pgmap_compound_head", similar to how
_unused_slab_obj_exts in 'struct foio' indicates the placeholer
contents.

>  			void *zone_device_data;
>  			/*
>  			 * ZONE_DEVICE private pages are counted as being
> @@ -299,6 +302,7 @@ typedef struct {
>   * @_refcount: Do not access this member directly.  Use folio_ref_count()
>   *    to find how many references there are to this folio.
>   * @memcg_data: Memory Control Group data.
> + * @pgmap: Metadata for ZONE_DEVICE mappings
>   * @virtual: Virtual address in the kernel direct map.
>   * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
>   * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> @@ -337,6 +341,7 @@ struct folio {
>  	/* private: */
>  				};
>  	/* public: */
> +			struct dev_pagemap *pgmap;
>  			};
>  			struct address_space *mapping;
>  			pgoff_t index;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 17506e4..e191434 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
>  	return page_zonenum(page) == ZONE_DEVICE;
>  }
>  
> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
> +{
> +	WARN_ON(!is_zone_device_page(page));

VM_WARN_ON()?

With the above fixups:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

[04/12] mm: Allow compound zone device pages

Commit Message

Comments

Patch