diff mbox series

[v4] mm: Add PM_HUGE_THP_MAPPING to /proc/pid/pagemap

Message ID 20211107235754.1395488-1-almasrymina@google.com (mailing list archive)
State New
Headers show
Series [v4] mm: Add PM_HUGE_THP_MAPPING to /proc/pid/pagemap | expand

Commit Message

Mina Almasry Nov. 7, 2021, 11:57 p.m. UTC
Add PM_HUGE_THP MAPPING to allow userspace to detect whether a given virt
address is currently mapped by a transparent huge page or not.

Example use case is a process requesting THPs from the kernel (via
a huge tmpfs mount for example), for a performance critical region of
memory.  The userspace may want to query whether the kernel is actually
backing this memory by hugepages or not.

PM_HUGE_THP_MAPPING bit is set if the virt address is mapped at the PMD
level and the underlying page is a transparent huge page.

Tested manually by adding logging into transhuge-stress, and by
allocating THP and querying the PM_HUGE_THP_MAPPING flag at those
virtual addresses.

Signed-off-by: Mina Almasry <almasrymina@google.com>

Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes rientjes@google.com
Cc: Paul E. McKenney <paulmckrcu@fb.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Florian Schmidt <florian.schmidt@nutanix.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org


---

Changes in v4:
- Removed unnecessary moving of flags variable declaration

Changes in v3:
- Renamed PM_THP to PM_HUGE_THP_MAPPING
- Fixed checks to set PM_HUGE_THP_MAPPING
- Added PM_HUGE_THP_MAPPING docs

---
 Documentation/admin-guide/mm/pagemap.rst      |  3 ++-
 fs/proc/task_mmu.c                            |  3 +++
 tools/testing/selftests/vm/transhuge-stress.c | 21 +++++++++++++++----
 3 files changed, 22 insertions(+), 5 deletions(-)

--
2.34.0.rc0.344.g81b53c2807-goog

Comments

Peter Xu Nov. 10, 2021, 7:03 a.m. UTC | #1
Hi, Mina,

Sorry to comment late.

On Sun, Nov 07, 2021 at 03:57:54PM -0800, Mina Almasry wrote:
> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
> index fdc19fbc10839..8a0f0064ff336 100644
> --- a/Documentation/admin-guide/mm/pagemap.rst
> +++ b/Documentation/admin-guide/mm/pagemap.rst
> @@ -23,7 +23,8 @@ There are four components to pagemap:
>      * Bit  56    page exclusively mapped (since 4.2)
>      * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
>        :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
> -    * Bits 57-60 zero
> +    * Bit  58    page is a huge (PMD size) THP mapping
> +    * Bits 59-60 zero
>      * Bit  61    page is file-page or shared-anon (since 3.5)
>      * Bit  62    page swapped
>      * Bit  63    page present
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ad667dbc96f5c..6f1403f83b310 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1302,6 +1302,7 @@ struct pagemapread {
>  #define PM_SOFT_DIRTY		BIT_ULL(55)
>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
>  #define PM_UFFD_WP		BIT_ULL(57)
> +#define PM_HUGE_THP_MAPPING	BIT_ULL(58)

The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
"PM_HUGE" (as THP also means HUGE already)?

IMHO the core problem is about permission controls, and it seems to me we're
actually trying to workaround it by duplicating some information we have.. so
it's kind of a pity.  Totally not against this patch, but imho it'll be nicer
if it's the permission part that to be enhanced, rather than a new but slightly
duplicated interface.

Thanks,
David Hildenbrand Nov. 10, 2021, 8:14 a.m. UTC | #2
On 10.11.21 08:03, Peter Xu wrote:
> Hi, Mina,
> 
> Sorry to comment late.
> 
> On Sun, Nov 07, 2021 at 03:57:54PM -0800, Mina Almasry wrote:
>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>> index fdc19fbc10839..8a0f0064ff336 100644
>> --- a/Documentation/admin-guide/mm/pagemap.rst
>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>> @@ -23,7 +23,8 @@ There are four components to pagemap:
>>      * Bit  56    page exclusively mapped (since 4.2)
>>      * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
>>        :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
>> -    * Bits 57-60 zero
>> +    * Bit  58    page is a huge (PMD size) THP mapping
>> +    * Bits 59-60 zero
>>      * Bit  61    page is file-page or shared-anon (since 3.5)
>>      * Bit  62    page swapped
>>      * Bit  63    page present
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index ad667dbc96f5c..6f1403f83b310 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1302,6 +1302,7 @@ struct pagemapread {
>>  #define PM_SOFT_DIRTY		BIT_ULL(55)
>>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
>>  #define PM_UFFD_WP		BIT_ULL(57)
>> +#define PM_HUGE_THP_MAPPING	BIT_ULL(58)
> 
> The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
> "PM_HUGE" (as THP also means HUGE already)?
> 
> IMHO the core problem is about permission controls, and it seems to me we're
> actually trying to workaround it by duplicating some information we have.. so
> it's kind of a pity.  Totally not against this patch, but imho it'll be nicer
> if it's the permission part that to be enhanced, rather than a new but slightly
> duplicated interface.

It's not a permission problem AFAIKS: even with permissions "changed",
any attempt to use /proc/kpageflags is just racy. Let's not go down that
path, it's really the wrong mechanism to export to random userspace.

We do have an interface to access this information from userspace
already: /proc/self/smaps IIRC. Mina commented that they are seeing
performance issues with that approach.

It would be valuable to add these details to the patch description,
including a performance difference when using both interfaces we have
available. As the patch description stands, there is no explanation
"why" we want this change.
Peter Xu Nov. 10, 2021, 8:27 a.m. UTC | #3
On Wed, Nov 10, 2021 at 09:14:42AM +0100, David Hildenbrand wrote:
> On 10.11.21 08:03, Peter Xu wrote:
> > Hi, Mina,
> > 
> > Sorry to comment late.
> > 
> > On Sun, Nov 07, 2021 at 03:57:54PM -0800, Mina Almasry wrote:
> >> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
> >> index fdc19fbc10839..8a0f0064ff336 100644
> >> --- a/Documentation/admin-guide/mm/pagemap.rst
> >> +++ b/Documentation/admin-guide/mm/pagemap.rst
> >> @@ -23,7 +23,8 @@ There are four components to pagemap:
> >>      * Bit  56    page exclusively mapped (since 4.2)
> >>      * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
> >>        :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
> >> -    * Bits 57-60 zero
> >> +    * Bit  58    page is a huge (PMD size) THP mapping
> >> +    * Bits 59-60 zero
> >>      * Bit  61    page is file-page or shared-anon (since 3.5)
> >>      * Bit  62    page swapped
> >>      * Bit  63    page present
> >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >> index ad667dbc96f5c..6f1403f83b310 100644
> >> --- a/fs/proc/task_mmu.c
> >> +++ b/fs/proc/task_mmu.c
> >> @@ -1302,6 +1302,7 @@ struct pagemapread {
> >>  #define PM_SOFT_DIRTY		BIT_ULL(55)
> >>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
> >>  #define PM_UFFD_WP		BIT_ULL(57)
> >> +#define PM_HUGE_THP_MAPPING	BIT_ULL(58)
> > 
> > The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
> > "PM_HUGE" (as THP also means HUGE already)?
> > 
> > IMHO the core problem is about permission controls, and it seems to me we're
> > actually trying to workaround it by duplicating some information we have.. so
> > it's kind of a pity.  Totally not against this patch, but imho it'll be nicer
> > if it's the permission part that to be enhanced, rather than a new but slightly
> > duplicated interface.
> 
> It's not a permission problem AFAIKS: even with permissions "changed",
> any attempt to use /proc/kpageflags is just racy. Let's not go down that
> path, it's really the wrong mechanism to export to random userspace.

I agree it's racy, but IMHO that's fine.  These are hints for userspace to make
decisions, they cannot be always right.  Even if we fetch atomically and seeing
that this pte is swapped out, it can be quickly accessed at the same time and
it'll be in-memory again.  Only if we can freeze the whole pgtable but we
can't, so they can only be used as hints.

> 
> We do have an interface to access this information from userspace
> already: /proc/self/smaps IIRC. Mina commented that they are seeing
> performance issues with that approach.
> 
> It would be valuable to add these details to the patch description,
> including a performance difference when using both interfaces we have
> available. As the patch description stands, there is no explanation
> "why" we want this change.

I didn't notice Mina mention about performance issues with kpageflags, if so
then I agree this solution helps.  I doubt the performance is an issue, though,
as THP info shouldn't be something changing rapidly so it should be some hint
to do sanity checks only (e.g., to make sure no unwanted split of THP
happening, but the scanning should not require to be super fast; it could be
done with a relatively long scanning period).  If there's a performance
concern, yes it would be great to mention it too in the commit message.
David Hildenbrand Nov. 10, 2021, 8:30 a.m. UTC | #4
On 10.11.21 09:27, Peter Xu wrote:
> On Wed, Nov 10, 2021 at 09:14:42AM +0100, David Hildenbrand wrote:
>> On 10.11.21 08:03, Peter Xu wrote:
>>> Hi, Mina,
>>>
>>> Sorry to comment late.
>>>
>>> On Sun, Nov 07, 2021 at 03:57:54PM -0800, Mina Almasry wrote:
>>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>>> index fdc19fbc10839..8a0f0064ff336 100644
>>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>> @@ -23,7 +23,8 @@ There are four components to pagemap:
>>>>      * Bit  56    page exclusively mapped (since 4.2)
>>>>      * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
>>>>        :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
>>>> -    * Bits 57-60 zero
>>>> +    * Bit  58    page is a huge (PMD size) THP mapping
>>>> +    * Bits 59-60 zero
>>>>      * Bit  61    page is file-page or shared-anon (since 3.5)
>>>>      * Bit  62    page swapped
>>>>      * Bit  63    page present
>>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>>> index ad667dbc96f5c..6f1403f83b310 100644
>>>> --- a/fs/proc/task_mmu.c
>>>> +++ b/fs/proc/task_mmu.c
>>>> @@ -1302,6 +1302,7 @@ struct pagemapread {
>>>>  #define PM_SOFT_DIRTY		BIT_ULL(55)
>>>>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
>>>>  #define PM_UFFD_WP		BIT_ULL(57)
>>>> +#define PM_HUGE_THP_MAPPING	BIT_ULL(58)
>>>
>>> The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
>>> "PM_HUGE" (as THP also means HUGE already)?
>>>
>>> IMHO the core problem is about permission controls, and it seems to me we're
>>> actually trying to workaround it by duplicating some information we have.. so
>>> it's kind of a pity.  Totally not against this patch, but imho it'll be nicer
>>> if it's the permission part that to be enhanced, rather than a new but slightly
>>> duplicated interface.
>>
>> It's not a permission problem AFAIKS: even with permissions "changed",
>> any attempt to use /proc/kpageflags is just racy. Let's not go down that
>> path, it's really the wrong mechanism to export to random userspace.
> 
> I agree it's racy, but IMHO that's fine.  These are hints for userspace to make
> decisions, they cannot be always right.  Even if we fetch atomically and seeing
> that this pte is swapped out, it can be quickly accessed at the same time and
> it'll be in-memory again.  Only if we can freeze the whole pgtable but we
> can't, so they can only be used as hints.

Sorry, I don't think /proc/kpageflags (or exporting the PFNs to random
users via /proc/self/pagemap) is the way to go.

"Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get
PFNs. In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting
from 4.2 the PFN field is zeroed if the user does not have
CAP_SYS_ADMIN. Reason: information about PFNs helps in exploiting
Rowhammer vulnerability."

> 
>>
>> We do have an interface to access this information from userspace
>> already: /proc/self/smaps IIRC. Mina commented that they are seeing
>> performance issues with that approach.
>>
>> It would be valuable to add these details to the patch description,
>> including a performance difference when using both interfaces we have
>> available. As the patch description stands, there is no explanation
>> "why" we want this change.
> 
> I didn't notice Mina mention about performance issues with kpageflags, if so
> then I agree this solution helps. 
The performance issue seems to be with /proc/self/smaps.
Peter Xu Nov. 10, 2021, 8:57 a.m. UTC | #5
On Wed, Nov 10, 2021 at 09:30:50AM +0100, David Hildenbrand wrote:
> On 10.11.21 09:27, Peter Xu wrote:
> > On Wed, Nov 10, 2021 at 09:14:42AM +0100, David Hildenbrand wrote:
> >> On 10.11.21 08:03, Peter Xu wrote:
> >>> Hi, Mina,
> >>>
> >>> Sorry to comment late.
> >>>
> >>> On Sun, Nov 07, 2021 at 03:57:54PM -0800, Mina Almasry wrote:
> >>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
> >>>> index fdc19fbc10839..8a0f0064ff336 100644
> >>>> --- a/Documentation/admin-guide/mm/pagemap.rst
> >>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
> >>>> @@ -23,7 +23,8 @@ There are four components to pagemap:
> >>>>      * Bit  56    page exclusively mapped (since 4.2)
> >>>>      * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
> >>>>        :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
> >>>> -    * Bits 57-60 zero
> >>>> +    * Bit  58    page is a huge (PMD size) THP mapping
> >>>> +    * Bits 59-60 zero
> >>>>      * Bit  61    page is file-page or shared-anon (since 3.5)
> >>>>      * Bit  62    page swapped
> >>>>      * Bit  63    page present
> >>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>>> index ad667dbc96f5c..6f1403f83b310 100644
> >>>> --- a/fs/proc/task_mmu.c
> >>>> +++ b/fs/proc/task_mmu.c
> >>>> @@ -1302,6 +1302,7 @@ struct pagemapread {
> >>>>  #define PM_SOFT_DIRTY		BIT_ULL(55)
> >>>>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
> >>>>  #define PM_UFFD_WP		BIT_ULL(57)
> >>>> +#define PM_HUGE_THP_MAPPING	BIT_ULL(58)
> >>>
> >>> The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
> >>> "PM_HUGE" (as THP also means HUGE already)?
> >>>
> >>> IMHO the core problem is about permission controls, and it seems to me we're
> >>> actually trying to workaround it by duplicating some information we have.. so
> >>> it's kind of a pity.  Totally not against this patch, but imho it'll be nicer
> >>> if it's the permission part that to be enhanced, rather than a new but slightly
> >>> duplicated interface.
> >>
> >> It's not a permission problem AFAIKS: even with permissions "changed",
> >> any attempt to use /proc/kpageflags is just racy. Let's not go down that
> >> path, it's really the wrong mechanism to export to random userspace.
> > 
> > I agree it's racy, but IMHO that's fine.  These are hints for userspace to make
> > decisions, they cannot be always right.  Even if we fetch atomically and seeing
> > that this pte is swapped out, it can be quickly accessed at the same time and
> > it'll be in-memory again.  Only if we can freeze the whole pgtable but we
> > can't, so they can only be used as hints.
> 
> Sorry, I don't think /proc/kpageflags (or exporting the PFNs to random
> users via /proc/self/pagemap) is the way to go.
> 
> "Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get
> PFNs. In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting
> from 4.2 the PFN field is zeroed if the user does not have
> CAP_SYS_ADMIN. Reason: information about PFNs helps in exploiting
> Rowhammer vulnerability."

IMHO these are two problems that you mentioned.  That's also what I was
wondering about: could the app be granted with CAP_SYS_ADMIN then?

I am not sure whether that'll work well with /proc/kpage* though, as it's by
default 0400.  So perhaps we need to manual adjust the file permission too to
make sure the app can both access PFNs (with SYS_ADMIN) and the flags.  Totally
no expert on the permissions..

> 
> > 
> >>
> >> We do have an interface to access this information from userspace
> >> already: /proc/self/smaps IIRC. Mina commented that they are seeing
> >> performance issues with that approach.
> >>
> >> It would be valuable to add these details to the patch description,
> >> including a performance difference when using both interfaces we have
> >> available. As the patch description stands, there is no explanation
> >> "why" we want this change.
> > 
> > I didn't notice Mina mention about performance issues with kpageflags, if so
> > then I agree this solution helps. 
> The performance issue seems to be with /proc/self/smaps.

This also reminded me that we've got issue with smaps being too slow, and in
many cases we're only interested in a small portion of the whole memory.  This
made me wonder how about a new smaps interface taking memory range as input.

Thanks,
David Hildenbrand Nov. 10, 2021, 10:24 a.m. UTC | #6
On 10.11.21 09:57, Peter Xu wrote:
> On Wed, Nov 10, 2021 at 09:30:50AM +0100, David Hildenbrand wrote:
>> On 10.11.21 09:27, Peter Xu wrote:
>>> On Wed, Nov 10, 2021 at 09:14:42AM +0100, David Hildenbrand wrote:
>>>> On 10.11.21 08:03, Peter Xu wrote:
>>>>> Hi, Mina,
>>>>>
>>>>> Sorry to comment late.
>>>>>
>>>>> On Sun, Nov 07, 2021 at 03:57:54PM -0800, Mina Almasry wrote:
>>>>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>>>>> index fdc19fbc10839..8a0f0064ff336 100644
>>>>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>>>>> @@ -23,7 +23,8 @@ There are four components to pagemap:
>>>>>>      * Bit  56    page exclusively mapped (since 4.2)
>>>>>>      * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
>>>>>>        :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
>>>>>> -    * Bits 57-60 zero
>>>>>> +    * Bit  58    page is a huge (PMD size) THP mapping
>>>>>> +    * Bits 59-60 zero
>>>>>>      * Bit  61    page is file-page or shared-anon (since 3.5)
>>>>>>      * Bit  62    page swapped
>>>>>>      * Bit  63    page present
>>>>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>>>>> index ad667dbc96f5c..6f1403f83b310 100644
>>>>>> --- a/fs/proc/task_mmu.c
>>>>>> +++ b/fs/proc/task_mmu.c
>>>>>> @@ -1302,6 +1302,7 @@ struct pagemapread {
>>>>>>  #define PM_SOFT_DIRTY		BIT_ULL(55)
>>>>>>  #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
>>>>>>  #define PM_UFFD_WP		BIT_ULL(57)
>>>>>> +#define PM_HUGE_THP_MAPPING	BIT_ULL(58)
>>>>>
>>>>> The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
>>>>> "PM_HUGE" (as THP also means HUGE already)?
>>>>>
>>>>> IMHO the core problem is about permission controls, and it seems to me we're
>>>>> actually trying to workaround it by duplicating some information we have.. so
>>>>> it's kind of a pity.  Totally not against this patch, but imho it'll be nicer
>>>>> if it's the permission part that to be enhanced, rather than a new but slightly
>>>>> duplicated interface.
>>>>
>>>> It's not a permission problem AFAIKS: even with permissions "changed",
>>>> any attempt to use /proc/kpageflags is just racy. Let's not go down that
>>>> path, it's really the wrong mechanism to export to random userspace.
>>>
>>> I agree it's racy, but IMHO that's fine.  These are hints for userspace to make
>>> decisions, they cannot be always right.  Even if we fetch atomically and seeing
>>> that this pte is swapped out, it can be quickly accessed at the same time and
>>> it'll be in-memory again.  Only if we can freeze the whole pgtable but we
>>> can't, so they can only be used as hints.
>>
>> Sorry, I don't think /proc/kpageflags (or exporting the PFNs to random
>> users via /proc/self/pagemap) is the way to go.
>>
>> "Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get
>> PFNs. In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting
>> from 4.2 the PFN field is zeroed if the user does not have
>> CAP_SYS_ADMIN. Reason: information about PFNs helps in exploiting
>> Rowhammer vulnerability."
> 
> IMHO these are two problems that you mentioned.  That's also what I was
> wondering about: could the app be granted with CAP_SYS_ADMIN then?
> 
> I am not sure whether that'll work well with /proc/kpage* though, as it's by
> default 0400.  So perhaps we need to manual adjust the file permission too to
> make sure the app can both access PFNs (with SYS_ADMIN) and the flags.  Totally
> no expert on the permissions..

Me too :)

IIRC changing permissions that was not an option -- which is why the
first approach suggested a new /proc/self/pageflags. But I guess Mina
can remind us (and eventually document all that in the patch description
:) ).
Mina Almasry Nov. 10, 2021, 5:42 p.m. UTC | #7
On Wed, Nov 10, 2021 at 12:57 AM Peter Xu <peterx@redhat.com> wrote:
>
>
> This also reminded me that we've got issue with smaps being too slow, and in
> many cases we're only interested in a small portion of the whole memory.  This
> made me wonder how about a new smaps interface taking memory range as input.
>

Does a patch like I'm providing here address the perf issues you're seeing?

> Thanks,
>
> --
> Peter Xu
>

On Wed, Nov 10, 2021 at 2:24 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.11.21 09:57, Peter Xu wrote:
> > On Wed, Nov 10, 2021 at 09:30:50AM +0100, David Hildenbrand wrote:
> >> On 10.11.21 09:27, Peter Xu wrote:
> >>> On Wed, Nov 10, 2021 at 09:14:42AM +0100, David Hildenbrand wrote:
> >>>> On 10.11.21 08:03, Peter Xu wrote:
> >>>>> Hi, Mina,
> >>>>>
> >>>>> Sorry to comment late.
> >>>>>
> >>>>> On Sun, Nov 07, 2021 at 03:57:54PM -0800, Mina Almasry wrote:
> >>>>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
> >>>>>> index fdc19fbc10839..8a0f0064ff336 100644
> >>>>>> --- a/Documentation/admin-guide/mm/pagemap.rst
> >>>>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
> >>>>>> @@ -23,7 +23,8 @@ There are four components to pagemap:
> >>>>>>      * Bit  56    page exclusively mapped (since 4.2)
> >>>>>>      * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
> >>>>>>        :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
> >>>>>> -    * Bits 57-60 zero
> >>>>>> +    * Bit  58    page is a huge (PMD size) THP mapping
> >>>>>> +    * Bits 59-60 zero
> >>>>>>      * Bit  61    page is file-page or shared-anon (since 3.5)
> >>>>>>      * Bit  62    page swapped
> >>>>>>      * Bit  63    page present
> >>>>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>>>>> index ad667dbc96f5c..6f1403f83b310 100644
> >>>>>> --- a/fs/proc/task_mmu.c
> >>>>>> +++ b/fs/proc/task_mmu.c
> >>>>>> @@ -1302,6 +1302,7 @@ struct pagemapread {
> >>>>>>  #define PM_SOFT_DIRTY           BIT_ULL(55)
> >>>>>>  #define PM_MMAP_EXCLUSIVE       BIT_ULL(56)
> >>>>>>  #define PM_UFFD_WP              BIT_ULL(57)
> >>>>>> +#define PM_HUGE_THP_MAPPING     BIT_ULL(58)
> >>>>>
> >>>>> The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
> >>>>> "PM_HUGE" (as THP also means HUGE already)?
> >>>>>
> >>>>> IMHO the core problem is about permission controls, and it seems to me we're
> >>>>> actually trying to workaround it by duplicating some information we have.. so
> >>>>> it's kind of a pity.  Totally not against this patch, but imho it'll be nicer
> >>>>> if it's the permission part that to be enhanced, rather than a new but slightly
> >>>>> duplicated interface.
> >>>>
> >>>> It's not a permission problem AFAIKS: even with permissions "changed",
> >>>> any attempt to use /proc/kpageflags is just racy. Let's not go down that
> >>>> path, it's really the wrong mechanism to export to random userspace.
> >>>
> >>> I agree it's racy, but IMHO that's fine.  These are hints for userspace to make
> >>> decisions, they cannot be always right.  Even if we fetch atomically and seeing
> >>> that this pte is swapped out, it can be quickly accessed at the same time and
> >>> it'll be in-memory again.  Only if we can freeze the whole pgtable but we
> >>> can't, so they can only be used as hints.
> >>
> >> Sorry, I don't think /proc/kpageflags (or exporting the PFNs to random
> >> users via /proc/self/pagemap) is the way to go.
> >>
> >> "Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get
> >> PFNs. In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting
> >> from 4.2 the PFN field is zeroed if the user does not have
> >> CAP_SYS_ADMIN. Reason: information about PFNs helps in exploiting
> >> Rowhammer vulnerability."
> >
> > IMHO these are two problems that you mentioned.  That's also what I was
> > wondering about: could the app be granted with CAP_SYS_ADMIN then?
> >
> > I am not sure whether that'll work well with /proc/kpage* though, as it's by
> > default 0400.  So perhaps we need to manual adjust the file permission too to
> > make sure the app can both access PFNs (with SYS_ADMIN) and the flags.  Totally
> > no expert on the permissions..
>
> Me too :)
>
> IIRC changing permissions that was not an option -- which is why the
> first approach suggested a new /proc/self/pageflags. But I guess Mina
> can remind us (and eventually document all that in the patch description
> :) ).
>

Sorry, yes I should update the commit message with this info. The
issues with smaps are:
1. Performance: I've pinged our network service folks to obtain a
rough perf comparison but I haven't been able to get one. I can try to
get a performance measurement myself but Peter seems to be also seeing
this.
2. smaps output is human readable and a bit convoluted for userspace to parse.

>
> --
> Thanks,
>
> David / dhildenb
>
Mina Almasry Nov. 10, 2021, 5:50 p.m. UTC | #8
On Tue, Nov 9, 2021 at 11:03 PM Peter Xu <peterx@redhat.com> wrote:
>
> The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
> "PM_HUGE" (as THP also means HUGE already)?
>

So I want to make it clear that the flag is set only when the page is
PMD mappend and is a THP (not hugetlbfs or some other PMD device
mapping). PM_THP would imply the flag is set only if the underlying
page is THP without regard to whether it's actually PMD mapped or not.
Peter Xu Nov. 12, 2021, 7:41 a.m. UTC | #9
On Wed, Nov 10, 2021 at 09:42:25AM -0800, Mina Almasry wrote:
> Sorry, yes I should update the commit message with this info. The
> issues with smaps are:
> 1. Performance: I've pinged our network service folks to obtain a
> rough perf comparison but I haven't been able to get one. I can try to
> get a performance measurement myself but Peter seems to be also seeing
> this.

No I was not seeing any real issues in my environment, but I remembered people
complaining about it because smaps needs to walk the whole memory of the
process, then if one program is only interested in some small portion of the
whole memory, it'll be slow because smaps will still need to walk all the
memory anyway.

> 2. smaps output is human readable and a bit convoluted for userspace to parse.

IMHO this is not a major issue.  AFAIK lots of programs will still try to parse
human readable output like smaps to get some solid numbers.  It's just that
it'll be indeed an perf issue if it's only a part of the memory that is of
interest.

Could we consider exporting a new smaps interface that:

  1. allows to specify a range of memory, and,
  2. expose information as "struct mem_size_stats" in binary format
     (we may want to replace "unsigned long" with "u64", then also have some
      versioning or having a "size" field for the struct, though; seems doable)

I'm wondering whether this could be helpful in even more scenarios.
Peter Xu Nov. 12, 2021, 7:43 a.m. UTC | #10
On Wed, Nov 10, 2021 at 09:50:13AM -0800, Mina Almasry wrote:
> On Tue, Nov 9, 2021 at 11:03 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
> > "PM_HUGE" (as THP also means HUGE already)?
> >
> 
> So I want to make it clear that the flag is set only when the page is
> PMD mappend and is a THP (not hugetlbfs or some other PMD device
> mapping). PM_THP would imply the flag is set only if the underlying
> page is THP without regard to whether it's actually PMD mapped or not.

I see, that's fine.

However as I mentioned I still think HUGE and THP dup with each other.
Meanwhile, "MAPPING" does not sound like a boolean status on whether it's thp
mapped..

If you still prefer this approach, how about PM_THP_MAPPED?
Mina Almasry Nov. 15, 2021, 10:50 p.m. UTC | #11
On Thu, Nov 11, 2021 at 11:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 10, 2021 at 09:42:25AM -0800, Mina Almasry wrote:
> > Sorry, yes I should update the commit message with this info. The
> > issues with smaps are:
> > 1. Performance: I've pinged our network service folks to obtain a
> > rough perf comparison but I haven't been able to get one. I can try to
> > get a performance measurement myself but Peter seems to be also seeing
> > this.
>
> No I was not seeing any real issues in my environment, but I remembered people
> complaining about it because smaps needs to walk the whole memory of the
> process, then if one program is only interested in some small portion of the
> whole memory, it'll be slow because smaps will still need to walk all the
> memory anyway.
>
> > 2. smaps output is human readable and a bit convoluted for userspace to parse.
>
> IMHO this is not a major issue.  AFAIK lots of programs will still try to parse
> human readable output like smaps to get some solid numbers.  It's just that
> it'll be indeed an perf issue if it's only a part of the memory that is of
> interest.
>
> Could we consider exporting a new smaps interface that:
>
>   1. allows to specify a range of memory, and,
>   2. expose information as "struct mem_size_stats" in binary format
>      (we may want to replace "unsigned long" with "u64", then also have some
>       versioning or having a "size" field for the struct, though; seems doable)
>
> I'm wondering whether this could be helpful in even more scenarios.
>

Sorry could you elaborate more? How do we allow users to specify the
range? Are you envisioning a new added syscall? Or is there some way
for users to pass the range to the existing /proc/pid/smaps that I'm
missing?

On Thu, Nov 11, 2021 at 11:43 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 10, 2021 at 09:50:13AM -0800, Mina Almasry wrote:
> > On Tue, Nov 9, 2021 at 11:03 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > The ending "_MAPPING" seems redundant to me, how about just call it "PM_THP" or
> > > "PM_HUGE" (as THP also means HUGE already)?
> > >
> >
> > So I want to make it clear that the flag is set only when the page is
> > PMD mappend and is a THP (not hugetlbfs or some other PMD device
> > mapping). PM_THP would imply the flag is set only if the underlying
> > page is THP without regard to whether it's actually PMD mapped or not.
>
> I see, that's fine.
>
> However as I mentioned I still think HUGE and THP dup with each other.
> Meanwhile, "MAPPING" does not sound like a boolean status on whether it's thp
> mapped..
>
> If you still prefer this approach, how about PM_THP_MAPPED?

PM_THP_MAPPED sounds good to me.

TBH I think I still prefer this approach because it's a very simple 2
line patch which addresses the concrete use case I have well. I'm not
too familiar with the smaps code to be honest but I think adding a
range-based smaps API will be a sizeable patch to add a syscall,
handle a stable interface, and handle cases where the memory range
doesn't match a VMA boundary. I'm not sure the performance benefit
would justify this patch and I'm not sure the extra info from smaps
would be widely useful. However if you insist and folks believe this
is the better approach I can prototype a range-based smaps and test
its performance to see if it works for us as well, just let me know
what kind of API you're envisioning.

>
> --
> Peter Xu
>
Peter Xu Nov. 16, 2021, 1:59 a.m. UTC | #12
On Mon, Nov 15, 2021 at 02:50:26PM -0800, Mina Almasry wrote:
> PM_THP_MAPPED sounds good to me.
> 
> TBH I think I still prefer this approach because it's a very simple 2
> line patch which addresses the concrete use case I have well. I'm not
> too familiar with the smaps code to be honest but I think adding a
> range-based smaps API will be a sizeable patch to add a syscall,
> handle a stable interface, and handle cases where the memory range
> doesn't match a VMA boundary. I'm not sure the performance benefit
> would justify this patch and I'm not sure the extra info from smaps
> would be widely useful. However if you insist and folks believe this
> is the better approach I can prototype a range-based smaps and test
> its performance to see if it works for us as well, just let me know
> what kind of API you're envisioning.

Yeah indeed I haven't yet thought enough on such a new interface, it's just
that I think it'll be something that solves a broader range of requests
including the thp-aware issue, so I raised it up.

That shouldn't require a lot code change either afaiu, as smap_gather_stats()
already takes a "start" and I think what's missing is another end where we just
pass in 0 when we want the default vma->vm_end as the end of range.

I don't have a solid clue on other use case to ask for that more generic
interface, so please feel free to move on with it.  If you'll need a repost to
address the comment from Andrew on removing the debugging lines, please also
consider using the shorter PM_THP_MAPPED then it looks good to me too.

Thanks!
Mina Almasry Nov. 17, 2021, 7:50 p.m. UTC | #13
On Mon, Nov 15, 2021 at 5:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 10 Nov 2021 14:11:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
>
> > Add PM_HUGE_THP MAPPING to allow userspace to detect whether a given virt
> > address is currently mapped by a transparent huge page or not.  Example
> > use case is a process requesting THPs from the kernel (via a huge tmpfs
> > mount for example), for a performance critical region of memory.  The
> > userspace may want to query whether the kernel is actually backing this
> > memory by hugepages or not.
> >
> > PM_HUGE_THP_MAPPING bit is set if the virt address is mapped at the PMD
> > level and the underlying page is a transparent huge page.
> >
> > A few options were considered:
> > 1. Add /proc/pid/pageflags that exports the same info as
> >    /proc/kpageflags.  This is not appropriate because many kpageflags are
> >    inappropriate to expose to userspace processes.
> > 2. Simply get this info from the existing /proc/pid/smaps interface.
> >    There are a couple of issues with that:
> >    1. /proc/pid/smaps output is human readable and unfriendly to
> >       programatically parse.
> >    2. /proc/pid/smaps is slow.  The cost of reading /proc/pid/smaps into
> >       userspace buffers is about ~800us per call, and this doesn't
> >       include parsing the output to get the information you need. The
> >       cost of querying 1 virt address in /proc/pid/pagemaps however is
> >       around 5-7us.
> >
> > Tested manually by adding logging into transhuge-stress, and by
> > allocating THP and querying the PM_HUGE_THP_MAPPING flag at those
> > virtual addresses.
> >
> > --- a/tools/testing/selftests/vm/transhuge-stress.c
> > +++ b/tools/testing/selftests/vm/transhuge-stress.c
> > @@ -16,6 +16,12 @@
> >  #include <string.h>
> >  #include <sys/mman.h>
> >
> > +/*
> > + * We can use /proc/pid/pagemap to detect whether the kernel was able to find
> > + * hugepages or no. This can be very noisy, so is disabled by default.
> > + */
> > +#define NO_DETECT_HUGEPAGES
> > +
> >
> > ...
> >
> > +#ifndef NO_DETECT_HUGEPAGES
> > +             if (!PAGEMAP_THP(ent[0]))
> > +                     fprintf(stderr, "WARNING: detected non THP page\n");
> > +#endif
>
> This looks like a developer thing.  Is there any point in leaving it in
> the mainline code?

I used this to test locally and I thought it may be useful, but on
second thought probably not worth it. Removed in v6 I just sent.

On Mon, Nov 15, 2021 at 5:59 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Nov 15, 2021 at 02:50:26PM -0800, Mina Almasry wrote:
> > PM_THP_MAPPED sounds good to me.
> >
> > TBH I think I still prefer this approach because it's a very simple 2
> > line patch which addresses the concrete use case I have well. I'm not
> > too familiar with the smaps code to be honest but I think adding a
> > range-based smaps API will be a sizeable patch to add a syscall,
> > handle a stable interface, and handle cases where the memory range
> > doesn't match a VMA boundary. I'm not sure the performance benefit
> > would justify this patch and I'm not sure the extra info from smaps
> > would be widely useful. However if you insist and folks believe this
> > is the better approach I can prototype a range-based smaps and test
> > its performance to see if it works for us as well, just let me know
> > what kind of API you're envisioning.
>
> Yeah indeed I haven't yet thought enough on such a new interface, it's just
> that I think it'll be something that solves a broader range of requests
> including the thp-aware issue, so I raised it up.
>
> That shouldn't require a lot code change either afaiu, as smap_gather_stats()
> already takes a "start" and I think what's missing is another end where we just
> pass in 0 when we want the default vma->vm_end as the end of range.
>
> I don't have a solid clue on other use case to ask for that more generic
> interface, so please feel free to move on with it.  If you'll need a repost to
> address the comment from Andrew on removing the debugging lines, please also
> consider using the shorter PM_THP_MAPPED then it looks good to me too.
>

Awesome, thanks! PM_THP_MAPPED sounds good to me and I just sent v6
with these changes.

> Thanks!
>
> --
> Peter Xu
>
Peter Xu Nov. 18, 2021, 12:35 a.m. UTC | #14
On Wed, Nov 17, 2021 at 11:50:23AM -0800, Mina Almasry wrote:
> Awesome, thanks! PM_THP_MAPPED sounds good to me and I just sent v6
> with these changes.

Sorry I just noticed one paragraph of the new commit message that may need some
amending. I commented in the new version, please have a look.  Thanks,
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index fdc19fbc10839..8a0f0064ff336 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -23,7 +23,8 @@  There are four components to pagemap:
     * Bit  56    page exclusively mapped (since 4.2)
     * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
       :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
-    * Bits 57-60 zero
+    * Bit  58    page is a huge (PMD size) THP mapping
+    * Bits 59-60 zero
     * Bit  61    page is file-page or shared-anon (since 3.5)
     * Bit  62    page swapped
     * Bit  63    page present
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ad667dbc96f5c..6f1403f83b310 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1302,6 +1302,7 @@  struct pagemapread {
 #define PM_SOFT_DIRTY		BIT_ULL(55)
 #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
 #define PM_UFFD_WP		BIT_ULL(57)
+#define PM_HUGE_THP_MAPPING	BIT_ULL(58)
 #define PM_FILE			BIT_ULL(61)
 #define PM_SWAP			BIT_ULL(62)
 #define PM_PRESENT		BIT_ULL(63)
@@ -1456,6 +1457,8 @@  static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

 		if (page && page_mapcount(page) == 1)
 			flags |= PM_MMAP_EXCLUSIVE;
+		if (page && is_transparent_hugepage(page))
+			flags |= PM_HUGE_THP_MAPPING;

 		for (; addr != end; addr += PAGE_SIZE) {
 			pagemap_entry_t pme = make_pme(frame, flags);
diff --git a/tools/testing/selftests/vm/transhuge-stress.c b/tools/testing/selftests/vm/transhuge-stress.c
index fd7f1b4a96f94..7dce18981fff5 100644
--- a/tools/testing/selftests/vm/transhuge-stress.c
+++ b/tools/testing/selftests/vm/transhuge-stress.c
@@ -16,6 +16,12 @@ 
 #include <string.h>
 #include <sys/mman.h>

+/*
+ * We can use /proc/pid/pagemap to detect whether the kernel was able to find
+ * hugepages or no. This can be very noisy, so is disabled by default.
+ */
+#define NO_DETECT_HUGEPAGES
+
 #define PAGE_SHIFT 12
 #define HPAGE_SHIFT 21

@@ -23,6 +29,7 @@ 
 #define HPAGE_SIZE (1 << HPAGE_SHIFT)

 #define PAGEMAP_PRESENT(ent)	(((ent) & (1ull << 63)) != 0)
+#define PAGEMAP_THP(ent)	(((ent) & (1ull << 58)) != 0)
 #define PAGEMAP_PFN(ent)	((ent) & ((1ull << 55) - 1))

 int pagemap_fd;
@@ -47,10 +54,16 @@  int64_t allocate_transhuge(void *ptr)
 			(uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent))
 		err(2, "read pagemap");

-	if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1]) &&
-	    PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) &&
-	    !(PAGEMAP_PFN(ent[0]) & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1)))
-		return PAGEMAP_PFN(ent[0]);
+	if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1])) {
+#ifndef NO_DETECT_HUGEPAGES
+		if (!PAGEMAP_THP(ent[0]))
+			fprintf(stderr, "WARNING: detected non THP page\n");
+#endif
+		if (PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) &&
+		    !(PAGEMAP_PFN(ent[0]) &
+		      ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1)))
+			return PAGEMAP_PFN(ent[0]);
+	}

 	return -1;
 }