[v3,0/3] Encapsulate PTE contents from non-arch code

Message ID	20230612151545.3317766-1-ryan.roberts@arm.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Ryan Roberts <ryan.roberts@arm.com> To: Andrew Morton <akpm@linux-foundation.org>, SeongJae Park <sj@kernel.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Mike Rapoport <rppt@kernel.org>, Yu Zhao <yuzhao@google.com>, Jason Gunthorpe <jgg@ziepe.ca>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, Dimitri Sivanich <dimitri.sivanich@hpe.com>, Alex Williamson <alex.williamson@redhat.com>, Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, Mike Kravetz <mike.kravetz@oracle.com>, Muchun Song <muchun.song@linux.dev>, Mark Rutland <mark.rutland@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, =?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= <jglisse@redhat.com>, Andrey Ryabinin <ryabinin.a.a@gmail.com>, Alexander Potapenko <glider@google.com>, Andrey Konovalov <andreyknvl@gmail.com>, Dmitry Vyukov <dvyukov@google.com>, Vincenzo Frascino <vincenzo.frascino@arm.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Miaohe Lin <linmiaohe@huawei.com>, Pasha Tatashin <pasha.tatashin@soleen.com>, Uladzislau Rezki <urezki@gmail.com>, Christoph Hellwig <hch@infradead.org>, Lorenzo Stoakes <lstoakes@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, damon@lists.linux.dev Subject: [PATCH v3 0/3] Encapsulate PTE contents from non-arch code Date: Mon, 12 Jun 2023 16:15:42 +0100 Message-Id: <20230612151545.3317766-1-ryan.roberts@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Encapsulate PTE contents from non-arch code \| expand [v3,0/3] Encapsulate PTE contents from non-arch code [v3,1/3] mm: ptdump should use ptep_get_lockless() [v3,2/3] mm: Move ptep_get() and pmdp_get() helpers [v3,3/3] mm: ptep_get() conversion

Message ID

20230612151545.3317766-1-ryan.roberts@arm.com (mailing list archive)

Headers

From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>, SeongJae Park <sj@kernel.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>,
 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
 Mike Rapoport <rppt@kernel.org>, Yu Zhao <yuzhao@google.com>,
 Jason Gunthorpe <jgg@ziepe.ca>, David Airlie <airlied@gmail.com>,
 Daniel Vetter <daniel@ffwll.ch>, Dimitri Sivanich <dimitri.sivanich@hpe.com>,
 Alex Williamson <alex.williamson@redhat.com>,
 Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
 Alexander Viro <viro@zeniv.linux.org.uk>,
 Christian Brauner <brauner@kernel.org>,
 Mike Kravetz <mike.kravetz@oracle.com>, Muchun Song <muchun.song@linux.dev>,
 Mark Rutland <mark.rutland@arm.com>,
 Alexander Shishkin <alexander.shishkin@linux.intel.com>,
 Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>,
 Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>,
	=?utf-8?b?SsOpcsO0bWUgR2xpc3Nl?= <jglisse@redhat.com>,
 Andrey Ryabinin <ryabinin.a.a@gmail.com>,
 Alexander Potapenko <glider@google.com>,
 Andrey Konovalov <andreyknvl@gmail.com>, Dmitry Vyukov <dvyukov@google.com>,
 Vincenzo Frascino <vincenzo.frascino@arm.com>,
 Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>,
 Roman Gushchin <roman.gushchin@linux.dev>,
 Shakeel Butt <shakeelb@google.com>,
 Naoya Horiguchi <naoya.horiguchi@nec.com>, Miaohe Lin <linmiaohe@huawei.com>,
 Pasha Tatashin <pasha.tatashin@soleen.com>,
 Uladzislau Rezki <urezki@gmail.com>, Christoph Hellwig <hch@infradead.org>,
 Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-kernel@vger.kernel.org,
	linux-mm@kvack.org,
	damon@lists.linux.dev
Subject: [PATCH v3 0/3] Encapsulate PTE contents from non-arch code
Date: Mon, 12 Jun 2023 16:15:42 +0100
Message-Id: <20230612151545.3317766-1-ryan.roberts@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

Encapsulate PTE contents from non-arch code | expand

Message

Ryan Roberts June 12, 2023, 3:15 p.m. UTC

Hi All,

(Including wider audience this time since changes touch a fair few subsystems)

This is the second half of v3 of a series to improve the encapsulation of pte
entries by disallowing non-arch code from directly dereferencing pte_t pointers.
Based on earlier feedback, I split the series in 2; the first part, fixes for
existing bugs, was already posted at [3] and merged into mm-stable. This second
part contains the conversion from direct dereferences to instead use
ptep_get()/ptep_get_lockless().

See the v1 cover letter at [1] for rationale for this work.

Based on feedback at v2, I've removed the new ptep_deref() helper I originally
added, and am now using the existing ptep_get() and ptep_get_lockless() helpers.
Testing on Ampere Altra (arm64) showed no difference in performance when using
ptep_deref() (*pte) vs ptep_get() (READ_ONCE(*pte)).

Patches are based on mm-unstable (49e038b1919e) and a branch is available at [4]
(Let me know if this is the wrong branch to target - I'm still not familiar with
the details of the mm- dev process!). Note that Hugh Dickins's "mm: allow
pte_offset_map[_lock]() to fail" (now in mm-unstable) patch set caused a number
of conflicts which I've resolved. But due to that, you won't be able to apply
these patches on top of Linus's tree. I have an alternate branch on top of
v6.4-rc6 at [5].

Changes since v2 [2]:
   - Removed ptep_deref() helper
   - Converted ptep_deref() callsites to use ptep_get[_lockless]()

Changes since v1 [1]:
   - Fixed sh build bug reported by 0-day CI

[1] https://lore.kernel.org/linux-mm/20230511132113.80196-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230518110727.2106156-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20230602092949.545577-1-ryan.roberts@arm.com/
[4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/ptep_get-mm-unstable-lkml_v3
[5] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/ptep_get-v6.4-rc6-lkml_v3

Thanks,
Ryan

Ryan Roberts (3):
  mm: ptdump should use ptep_get_lockless()
  mm: Move ptep_get() and pmdp_get() helpers
  mm: ptep_get() conversion

 .../drm/i915/gem/selftests/i915_gem_mman.c    |   8 +-
 drivers/misc/sgi-gru/grufault.c               |   2 +-
 drivers/vfio/vfio_iommu_type1.c               |   7 +-
 drivers/xen/privcmd.c                         |   2 +-
 fs/proc/task_mmu.c                            |  33 +++---
 fs/userfaultfd.c                              |   6 +-
 include/linux/hugetlb.h                       |   4 +
 include/linux/mm_inline.h                     |   2 +-
 include/linux/pgtable.h                       |  34 +++---
 kernel/events/uprobes.c                       |   2 +-
 mm/damon/ops-common.c                         |   2 +-
 mm/damon/paddr.c                              |   2 +-
 mm/damon/vaddr.c                              |  10 +-
 mm/filemap.c                                  |   2 +-
 mm/gup.c                                      |  21 ++--
 mm/highmem.c                                  |  12 ++-
 mm/hmm.c                                      |   2 +-
 mm/huge_memory.c                              |   4 +-
 mm/hugetlb.c                                  |   2 +-
 mm/hugetlb_vmemmap.c                          |   6 +-
 mm/kasan/init.c                               |   9 +-
 mm/kasan/shadow.c                             |  10 +-
 mm/khugepaged.c                               |  22 ++--
 mm/ksm.c                                      |  22 ++--
 mm/madvise.c                                  |   6 +-
 mm/mapping_dirty_helpers.c                    |   4 +-
 mm/memcontrol.c                               |   4 +-
 mm/memory-failure.c                           |  26 ++---
 mm/memory.c                                   | 100 ++++++++++--------
 mm/mempolicy.c                                |   6 +-
 mm/migrate.c                                  |  14 +--
 mm/migrate_device.c                           |  15 +--
 mm/mincore.c                                  |   2 +-
 mm/mlock.c                                    |   6 +-
 mm/mprotect.c                                 |   8 +-
 mm/mremap.c                                   |   2 +-
 mm/page_table_check.c                         |   4 +-
 mm/page_vma_mapped.c                          |  27 +++--
 mm/pgtable-generic.c                          |   2 +-
 mm/ptdump.c                                   |   2 +-
 mm/rmap.c                                     |  34 +++---
 mm/sparse-vmemmap.c                           |   8 +-
 mm/swap_state.c                               |   8 +-
 mm/swapfile.c                                 |  20 ++--
 mm/userfaultfd.c                              |   4 +-
 mm/vmalloc.c                                  |   6 +-
 mm/vmscan.c                                   |  14 +--
 virt/kvm/kvm_main.c                           |  11 +-
 48 files changed, 316 insertions(+), 243 deletions(-)

--
2.25.1

Comments

Andrew Morton June 12, 2023, 8:16 p.m. UTC | #1

On Mon, 12 Jun 2023 16:15:42 +0100 Ryan Roberts <ryan.roberts@arm.com> wrote:

> Hi All,
> 
> (Including wider audience this time since changes touch a fair few subsystems)
> 
> This is the second half of v3 of a series to improve the encapsulation of pte
> entries by disallowing non-arch code from directly dereferencing pte_t pointers.

That's basically all we have here for [0/N] cover letter content.  I
stole some words from the [3/3] changelog, so we now have:

: A series to improve the encapsulation of pte entries by disallowing
: non-arch code from directly dereferencing pte_t pointers.
: 
: This means that by default, the accesses change from a C dereference to a
: READ_ONCE().  This is technically the correct thing to do since where
: pgtables are modified by HW (for access/dirty) they are volatile and
: therefore we should always ensure READ_ONCE() semantics.
: 
: But more importantly, by always using the helper, it can be overridden by
: the architecture to fully encapsulate the contents of the pte.  Arch code
: is deliberately not converted, as the arch code knows best.  It is
: intended that arch code (arm64) will override the default with its own
: implementation that can (e.g.) hide certain bits from the core code, or
: determine young/dirty status by mixing in state from another source.

> Based on earlier feedback, I split the series in 2; the first part, fixes for
> existing bugs, was already posted at [3] and merged into mm-stable. This second
> part contains the conversion from direct dereferences to instead use
> ptep_get()/ptep_get_lockless().
> 
> See the v1 cover letter at [1] for rationale for this work.
> 
> Based on feedback at v2, I've removed the new ptep_deref() helper I originally
> added, and am now using the existing ptep_get() and ptep_get_lockless() helpers.
> Testing on Ampere Altra (arm64) showed no difference in performance when using
> ptep_deref() (*pte) vs ptep_get() (READ_ONCE(*pte)).
> 
> Patches are based on mm-unstable (49e038b1919e) and a branch is available at [4]
> (Let me know if this is the wrong branch to target - I'm still not familiar with
> the details of the mm- dev process!). Note that Hugh Dickins's "mm: allow
> pte_offset_map[_lock]() to fail" (now in mm-unstable) patch set caused a number
> of conflicts which I've resolved. But due to that, you won't be able to apply
> these patches on top of Linus's tree. I have an alternate branch on top of
> v6.4-rc6 at [5].

Yep, that's all great, thanks.

Is there some clever trick we can do to prevent new open-coded derefs
of pte_t* from being introduced?

I suppose we could convert pte_t to a single-member struct to force a
compile error.  That struct will get passed by value to ptep_get() so
that's OK.  But this isn't viable unless/until all architectures are
converted :(

Or we rely upon Ryan to grep the tree occasionally ;)

Muchun Song June 13, 2023, 2:16 a.m. UTC | #2

> On Jun 12, 2023, at 23:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
> Hi All,
> 
> (Including wider audience this time since changes touch a fair few subsystems)
> 
> This is the second half of v3 of a series to improve the encapsulation of pte
> entries by disallowing non-arch code from directly dereferencing pte_t pointers.
> Based on earlier feedback, I split the series in 2; the first part, fixes for
> existing bugs, was already posted at [3] and merged into mm-stable. This second
> part contains the conversion from direct dereferences to instead use
> ptep_get()/ptep_get_lockless().
> 
> See the v1 cover letter at [1] for rationale for this work.
> 
> Based on feedback at v2, I've removed the new ptep_deref() helper I originally
> added, and am now using the existing ptep_get() and ptep_get_lockless() helpers.

When I first saw the name of ptep_get()/ptep_get_lockless(), I thought
the pte seems like to be protected by the refcount mechanism (Why I have
this though? Because Qi Zheng has proposed a approach to free pte page tables
by using the refcount mechanism [1]). And your proposed name of ptep_deref()
is intuitive for me, so I have another thought, should we rename ptep_get()
to ptep_deref()? Just a thought from me, I'd like to hear if others object.

Thanks.

[1] https://lore.kernel.org/lkml/20211110105428.32458-7-zhengqi.arch@bytedance.com/

Ryan Roberts June 13, 2023, 8:43 a.m. UTC | #3

Hi Andrew,

Thanks for taking the patches!

On 12/06/2023 21:16, Andrew Morton wrote:
> On Mon, 12 Jun 2023 16:15:42 +0100 Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
>> Hi All,
>>
>> (Including wider audience this time since changes touch a fair few subsystems)
>>
>> This is the second half of v3 of a series to improve the encapsulation of pte
>> entries by disallowing non-arch code from directly dereferencing pte_t pointers.
> 
> That's basically all we have here for [0/N] cover letter content.  I
> stole some words from the [3/3] changelog, so we now have:
> 
> : A series to improve the encapsulation of pte entries by disallowing
> : non-arch code from directly dereferencing pte_t pointers.
> : 
> : This means that by default, the accesses change from a C dereference to a
> : READ_ONCE().  This is technically the correct thing to do since where
> : pgtables are modified by HW (for access/dirty) they are volatile and
> : therefore we should always ensure READ_ONCE() semantics.
> : 
> : But more importantly, by always using the helper, it can be overridden by
> : the architecture to fully encapsulate the contents of the pte.  Arch code
> : is deliberately not converted, as the arch code knows best.  It is
> : intended that arch code (arm64) will override the default with its own
> : implementation that can (e.g.) hide certain bits from the core code, or
> : determine young/dirty status by mixing in state from another source.

I'm not sure I've understood what you are saying here? Is there some expectation
for the content of the cover letter for mm patches, which I'm not aware of? If
you can point me to some docs I'll make sure I'm conformant in future.

> 
> 
>> Based on earlier feedback, I split the series in 2; the first part, fixes for
>> existing bugs, was already posted at [3] and merged into mm-stable. This second
>> part contains the conversion from direct dereferences to instead use
>> ptep_get()/ptep_get_lockless().
>>
>> See the v1 cover letter at [1] for rationale for this work.
>>
>> Based on feedback at v2, I've removed the new ptep_deref() helper I originally
>> added, and am now using the existing ptep_get() and ptep_get_lockless() helpers.
>> Testing on Ampere Altra (arm64) showed no difference in performance when using
>> ptep_deref() (*pte) vs ptep_get() (READ_ONCE(*pte)).
>>
>> Patches are based on mm-unstable (49e038b1919e) and a branch is available at [4]
>> (Let me know if this is the wrong branch to target - I'm still not familiar with
>> the details of the mm- dev process!). Note that Hugh Dickins's "mm: allow
>> pte_offset_map[_lock]() to fail" (now in mm-unstable) patch set caused a number
>> of conflicts which I've resolved. But due to that, you won't be able to apply
>> these patches on top of Linus's tree. I have an alternate branch on top of
>> v6.4-rc6 at [5].
> 
> Yep, that's all great, thanks.
> 
> 
> Is there some clever trick we can do to prevent new open-coded derefs
> of pte_t* from being introduced?

I've been thinking about this myself but don't have a good answer at the moment.
I used 2 techniques to find them all, but neither are fullproof so required hand
editing:

1) global find/replace of "pte_t *" to "pte_handle_t ", and define pte_handle_t
as void *. Then the compiler will throw an error for an attempted dereference.
This obbiously only checks code that is actually being compiled, and it doesn't
prevent anyone from declaring a pte_t* and directly dereferencing in future.

2) Using the coccinelle script. This worked pretty well, but finds a few places
where the pte_t pointer is not pointing to a page table, but a variable on the
stack, and there it is legitimate to dereference (we don't want to use
ptep_get()/ptep_set_at() there). And also flags up lots of arch code that needs
to be ignored.

There are coccinelle scripts kept in scripts/coccinelle that apprarently get run
semi-regularly [1]. So could submit something there, but would need to add
support for excluding the arch directory and for marking known false-positives,
neither of which I think are currently available. Does anyone know better?

> 
> I suppose we could convert pte_t to a single-member struct to force a
> compile error.  That struct will get passed by value to ptep_get() so
> that's OK.  But this isn't viable unless/until all architectures are
> converted :(

It already is defined that way for arm64 at least:

  typedef struct { pteval_t pte; } pte_t;

But I'm not sure how that helps? You can still legally deref it in C.

> 
> Or we rely upon Ryan to grep the tree occasionally ;)

This is my least favourite option ;-) But it wouldn't be the end of the world in
the short term. Once I achieve the full objective of using arm64's contpte
capability, open coded derefs will manifest as bugs (at least for cases where
the accuracy of the access/dirty bits matter), which will certainly make sure
they get fixed.

Thanks,
Ryan

Ryan Roberts June 13, 2023, 8:52 a.m. UTC | #4

On 13/06/2023 03:16, Muchun Song wrote:
> 
> 
>> On Jun 12, 2023, at 23:15, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
>> (Including wider audience this time since changes touch a fair few subsystems)
>>
>> This is the second half of v3 of a series to improve the encapsulation of pte
>> entries by disallowing non-arch code from directly dereferencing pte_t pointers.
>> Based on earlier feedback, I split the series in 2; the first part, fixes for
>> existing bugs, was already posted at [3] and merged into mm-stable. This second
>> part contains the conversion from direct dereferences to instead use
>> ptep_get()/ptep_get_lockless().
>>
>> See the v1 cover letter at [1] for rationale for this work.
>>
>> Based on feedback at v2, I've removed the new ptep_deref() helper I originally
>> added, and am now using the existing ptep_get() and ptep_get_lockless() helpers.
> 
> When I first saw the name of ptep_get()/ptep_get_lockless(), I thought
> the pte seems like to be protected by the refcount mechanism (Why I have
> this though? Because Qi Zheng has proposed a approach to free pte page tables
> by using the refcount mechanism [1]). And your proposed name of ptep_deref()
> is intuitive for me, so I have another thought, should we rename ptep_get()
> to ptep_deref()? Just a thought from me, I'd like to hear if others object.

I see your point, but I think any renaming exercise should be discussed and
applied in the context of a separate patch series, given that ptep_get() and
ptep_get_lockless() already exist in the code base. This would be a much bigger
job, since it would need to cover all the arch code too.

> 
> Thanks.
> 
> [1] https://lore.kernel.org/lkml/20211110105428.32458-7-zhengqi.arch@bytedance.com/