[v5,0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

Message ID	20190919222421.27408-1-almasrymina@google.com (mailing list archive)
Headers	show Return-Path: <SRS0=+Iky=XO=vger.kernel.org=linux-kselftest-owner@kernel.org> Date: Thu, 19 Sep 2019 15:24:14 -0700 Message-Id: <20190919222421.27408-1-almasrymina@google.com> Mime-Version: 1.0 Subject: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits From: Mina Almasry <almasrymina@google.com> To: mike.kravetz@oracle.com Cc: shuah@kernel.org, almasrymina@google.com, rientjes@google.com, shakeelb@google.com, gthelen@google.com, akpm@linux-foundation.org, khalid.aziz@oracle.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com, mkoutny@suse.com Content-Type: text/plain; charset="UTF-8" Sender: linux-kselftest-owner@vger.kernel.org Precedence: bulk
Series	hugetlb_cgroup: Add hugetlb_cgroup reservation limits \| expand [v5,0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits [v5,1/7] hugetlb_cgroup: Add hugetlb_cgroup reservation counter [v5,2/7] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations [v5,3/7] hugetlb_cgroup: add reservation accounting for private mappings [v5,4/7] hugetlb: disable region_add file_region coalescing [v5,5/7] hugetlb_cgroup: add accounting for shared mappings [v5,6/7] hugetlb_cgroup: Add hugetlb_cgroup reservation tests [v5,7/7] hugetlb_cgroup: Add hugetlb_cgroup reservation docs

Mina Almasry Sept. 19, 2019, 10:24 p.m. UTC

Patch series implements hugetlb_cgroup reservation usage and limits, which
track hugetlb reservations rather than hugetlb memory faulted in. Details of
the approach is 1/7.

Changes in v5:
- Moved the bulk of the description to the first patch in the series.
- Clang formatted the entire series.
- Split off 'hugetlb: remove duplicated code' and 'hugetlb: region_chg provides
  only cache entry' into their own patch series.
- Added comments to HUGETLB_RES enum.
- Fixed bug in 'hugetlb: disable region_add file_region coalescing' calculating
  the wrong number of regions_needed in some cases.
- Changed sleeps in test to proper conditions.
- Misc fixes in test based on shuah@ review.

Changes in v4:
- Split up 'hugetlb_cgroup: add accounting for shared mappings' into 4 patches
  for better isolation and context on the indvidual changes:
  - hugetlb_cgroup: add accounting for shared mappings
  - hugetlb: disable region_add file_region coalescing
  - hugetlb: remove duplicated code
  - hugetlb: region_chg provides only cache entry
- Fixed resv->adds_in_progress accounting.
- Retained behavior that region_add never fails, in earlier patchsets region_add
  could return failure.
- Fixed libhugetlbfs failure.
- Minor fix to the added tests that was preventing them from running on some
  environments.

Changes in v3:
- Addressed comments of Hillf Danton:
  - Added docs.
  - cgroup_files now uses enum.
  - Various readability improvements.
- Addressed comments of Mike Kravetz.
  - region_* functions no longer coalesce file_region entries in the resv_map.
  - region_add() and region_chg() refactored to make them much easier to
    understand and remove duplicated code so this patch doesn't add too much
    complexity.
  - Refactored common functionality into helpers.

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (7):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  hugetlb_cgroup: Add hugetlb_cgroup reservation docs

 .../admin-guide/cgroup-v1/hugetlb.rst         |  85 +++-
 include/linux/hugetlb.h                       |  31 +-
 include/linux/hugetlb_cgroup.h                |  33 +-
 mm/hugetlb.c                                  | 423 +++++++++++-----
 mm/hugetlb_cgroup.c                           | 190 ++++++--
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   1 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 461 ++++++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh      |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 250 ++++++++++
 10 files changed, 1306 insertions(+), 191 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.351.gc4317032e6-goog

Mike Kravetz Sept. 23, 2019, 5:47 p.m. UTC | #1

On 9/19/19 3:24 PM, Mina Almasry wrote:
> Patch series implements hugetlb_cgroup reservation usage and limits, which
> track hugetlb reservations rather than hugetlb memory faulted in. Details of
> the approach is 1/7.

Thanks for your continued efforts Mina.

One thing that has bothered me with this approach from the beginning is that
hugetlb reservations are related to, but somewhat distinct from hugetlb
allocations.  The original (existing) huegtlb cgroup implementation does not
take reservations into account.  This is an issue you are trying to address
by adding a cgroup support for hugetlb reservations.  However, this new
reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
of hugetlb pages.  Both the existing cgroup code and the reservation approach
have what I think are some serious flaws.  Consider a system with 100 hugetlb
pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of
100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
nobody in group B can allocate ANY huge pages.  This is true even though
no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE
and allocate all 100 pages without taking any reservations.

As mentioned in your documentation, it would be possible to use both the
existing (allocation) and new reservation cgroups together.  Perhaps if both
are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation crgoup how about adding
reservation support to the existing allocation cgroup support.  One could
even argue that a reservation is an allocation as it sets aside huge pages
that can only be used for a specific purpose.  Here is something that
may work.

Starting with the existing allocation cgroup.
- When hugetlb pages are reserved, the cgroup of the task making the
  reservations is charged.  Tracking for the charged cgroup is done in the
  reservation map in the same way proposed by this patch set.
- At page fault time,
  - If a reservation already exists for that specific area do not charge the
    faulting task.  No tracking in page, just the reservation map.
  - If no reservation exists, charge the group of the faulting task.  Tracking
    of this information is in the page itself as implemented today.
- When the hugetlb object is removed, compare the reservation map with any
  allocated pages.  If cgroup tracking information exists in page, uncharge
  that group.  Otherwise, unharge the group (if any) in the reservation map.

One of the advantages of a separate reservation cgroup is that the existing
code is unmodified.  Combining the two provides a more complete/accurate
solution IMO.  But, it has the potential to break existing users.

I really would like to get feedback from anyone that knows how the existing
hugetlb cgroup controller may be used today.  Comments from Aneesh would
be very welcome to know if reservations were considered in development of the
existing code.

Mina Almasry Sept. 23, 2019, 7:18 p.m. UTC | #2

On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 9/19/19 3:24 PM, Mina Almasry wrote:
> > Patch series implements hugetlb_cgroup reservation usage and limits, which
> > track hugetlb reservations rather than hugetlb memory faulted in. Details of
> > the approach is 1/7.
>
> Thanks for your continued efforts Mina.
>

And thanks for your reviews so far.

> One thing that has bothered me with this approach from the beginning is that
> hugetlb reservations are related to, but somewhat distinct from hugetlb
> allocations.  The original (existing) huegtlb cgroup implementation does not
> take reservations into account.  This is an issue you are trying to address
> by adding a cgroup support for hugetlb reservations.  However, this new
> reservation cgroup ignores hugetlb allocations at fault time.
>
> I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
> of hugetlb pages.  Both the existing cgroup code and the reservation approach
> have what I think are some serious flaws.  Consider a system with 100 hugetlb
> pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
> usage to 50 pages each.
>
> With the existing implementation, a task in group A could create a mmap of
> 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
> nobody in group B can allocate ANY huge pages.  This is true even though
> no pages have been allocated in A (or B).
>
> With the reservation implementation, a task in group A could use MAP_NORESERVE
> and allocate all 100 pages without taking any reservations.
>
> As mentioned in your documentation, it would be possible to use both the
> existing (allocation) and new reservation cgroups together.  Perhaps if both
> are setup for the 50/50 split things would work a little better.
>
> However, instead of creating a new reservation crgoup how about adding
> reservation support to the existing allocation cgroup support.  One could
> even argue that a reservation is an allocation as it sets aside huge pages
> that can only be used for a specific purpose.  Here is something that
> may work.
>
> Starting with the existing allocation cgroup.
> - When hugetlb pages are reserved, the cgroup of the task making the
>   reservations is charged.  Tracking for the charged cgroup is done in the
>   reservation map in the same way proposed by this patch set.
> - At page fault time,
>   - If a reservation already exists for that specific area do not charge the
>     faulting task.  No tracking in page, just the reservation map.
>   - If no reservation exists, charge the group of the faulting task.  Tracking
>     of this information is in the page itself as implemented today.
> - When the hugetlb object is removed, compare the reservation map with any
>   allocated pages.  If cgroup tracking information exists in page, uncharge
>   that group.  Otherwise, unharge the group (if any) in the reservation map.
>
> One of the advantages of a separate reservation cgroup is that the existing
> code is unmodified.  Combining the two provides a more complete/accurate
> solution IMO.  But, it has the potential to break existing users.
>
> I really would like to get feedback from anyone that knows how the existing
> hugetlb cgroup controller may be used today.  Comments from Aneesh would
> be very welcome to know if reservations were considered in development of the
> existing code.
> --

FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:

AFAICT, the 2 counter approach we have here is strictly superior to
the 1 upgraded counter approach. Consider these points:

- From what I can tell so far, everything you can do with the 1
counter approach, you can do with the two counter approach by setting
both limit_in_bytes and reservation_limit_in_bytes to the limit value.
That will limit both reservations and at fault allocations.

- The 2 counter approach preserves existing usage of hugetlb cgroups,
so no need to muck around with reverting the feature some time from
now because of broken users. No existing users of hugetlb cgroups need
to worry about the effect of this on their usage.

- Users that use hugetlb memory strictly through reservations can use
only reservation_limit_in_bytes and enjoy cgroup limits that never
SIGBUS the application. This is our usage for example.

- The 2 counter approach provides more info to the sysadmin. The
sysadmin knows exactly how much reserved bytes there are via
reservation_usage_in_bytes, and how much actually in use bytes there
are via usage_in_bytes. They can even detect NORESERVE usage if
usage_in_bytes > reservation_usage_in_bytes. failcnt shows failed
reservations *and* failed allocations at fault, etc. All around better
debuggability when things go wrong. I think this is particularly
troubling for the 1 upgraded counter approach. That counter's
usage_in_bytes doesn't tell you if the usage came from reservations or
allocations at fault time.

- Honestly, I think the 2 counter approach is easier to document and
understand by the userspace? 1 counter that vaguely tracks both the
reservations and usage and decides whether or not to charge at fault
time seems hard to understand what really happened after something
goes wrong. 1 counter that tracks reservations and 1 counter that
tracks actual usage seem much simpler to digest, and provide better
visibility to what the cgroup is doing as I mentioned above.

I think it may be better if I keep the 2 counter approach but
thoroughly document the interaction between the existing counters and
NORESERVE. What do you think?

FWIW, it may be prudent to consider deprecating MAP_NORESERVE, if
that's an option. I'm not sure what that benefit that provides
applications, and on the other hand it makes it hard for the kernel to
guarantee the hugetlb memory is available to the application that
requested it, and makes it harder for the cgroups to police hugetlb
usage without SIGBUSing something. But that may be a discussion for
another proposal.

> Mike Kravetz

Mike Kravetz Sept. 23, 2019, 9:27 p.m. UTC | #3

On 9/23/19 12:18 PM, Mina Almasry wrote:
> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 9/19/19 3:24 PM, Mina Almasry wrote:
>>> Patch series implements hugetlb_cgroup reservation usage and limits, which
>>> track hugetlb reservations rather than hugetlb memory faulted in. Details of
>>> the approach is 1/7.
>>
>> Thanks for your continued efforts Mina.
>>
> 
> And thanks for your reviews so far.
> 
>> One thing that has bothered me with this approach from the beginning is that
>> hugetlb reservations are related to, but somewhat distinct from hugetlb
>> allocations.  The original (existing) huegtlb cgroup implementation does not
>> take reservations into account.  This is an issue you are trying to address
>> by adding a cgroup support for hugetlb reservations.  However, this new
>> reservation cgroup ignores hugetlb allocations at fault time.
>>
>> I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
>> of hugetlb pages.  Both the existing cgroup code and the reservation approach
>> have what I think are some serious flaws.  Consider a system with 100 hugetlb
>> pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
>> usage to 50 pages each.
>>
>> With the existing implementation, a task in group A could create a mmap of
>> 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
>> nobody in group B can allocate ANY huge pages.  This is true even though
>> no pages have been allocated in A (or B).
>>
>> With the reservation implementation, a task in group A could use MAP_NORESERVE
>> and allocate all 100 pages without taking any reservations.
>>
>> As mentioned in your documentation, it would be possible to use both the
>> existing (allocation) and new reservation cgroups together.  Perhaps if both
>> are setup for the 50/50 split things would work a little better.
>>
>> However, instead of creating a new reservation crgoup how about adding
>> reservation support to the existing allocation cgroup support.  One could
>> even argue that a reservation is an allocation as it sets aside huge pages
>> that can only be used for a specific purpose.  Here is something that
>> may work.
>>
>> Starting with the existing allocation cgroup.
>> - When hugetlb pages are reserved, the cgroup of the task making the
>>   reservations is charged.  Tracking for the charged cgroup is done in the
>>   reservation map in the same way proposed by this patch set.
>> - At page fault time,
>>   - If a reservation already exists for that specific area do not charge the
>>     faulting task.  No tracking in page, just the reservation map.
>>   - If no reservation exists, charge the group of the faulting task.  Tracking
>>     of this information is in the page itself as implemented today.
>> - When the hugetlb object is removed, compare the reservation map with any
>>   allocated pages.  If cgroup tracking information exists in page, uncharge
>>   that group.  Otherwise, unharge the group (if any) in the reservation map.
>>
>> One of the advantages of a separate reservation cgroup is that the existing
>> code is unmodified.  Combining the two provides a more complete/accurate
>> solution IMO.  But, it has the potential to break existing users.
>>
>> I really would like to get feedback from anyone that knows how the existing
>> hugetlb cgroup controller may be used today.  Comments from Aneesh would
>> be very welcome to know if reservations were considered in development of the
>> existing code.
>> --
> 
> FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:
> 
> AFAICT, the 2 counter approach we have here is strictly superior to
> the 1 upgraded counter approach. Consider these points:
> 
> - From what I can tell so far, everything you can do with the 1
> counter approach, you can do with the two counter approach by setting
> both limit_in_bytes and reservation_limit_in_bytes to the limit value.
> That will limit both reservations and at fault allocations.
> 
> - The 2 counter approach preserves existing usage of hugetlb cgroups,
> so no need to muck around with reverting the feature some time from
> now because of broken users. No existing users of hugetlb cgroups need
> to worry about the effect of this on their usage.
> 
> - Users that use hugetlb memory strictly through reservations can use
> only reservation_limit_in_bytes and enjoy cgroup limits that never
> SIGBUS the application. This is our usage for example.
> 
> - The 2 counter approach provides more info to the sysadmin. The
> sysadmin knows exactly how much reserved bytes there are via
> reservation_usage_in_bytes, and how much actually in use bytes there
> are via usage_in_bytes. They can even detect NORESERVE usage if
> usage_in_bytes > reservation_usage_in_bytes. failcnt shows failed
> reservations *and* failed allocations at fault, etc. All around better
> debuggability when things go wrong. I think this is particularly
> troubling for the 1 upgraded counter approach. That counter's
> usage_in_bytes doesn't tell you if the usage came from reservations or
> allocations at fault time.
> 
> - Honestly, I think the 2 counter approach is easier to document and
> understand by the userspace? 1 counter that vaguely tracks both the
> reservations and usage and decides whether or not to charge at fault
> time seems hard to understand what really happened after something
> goes wrong. 1 counter that tracks reservations and 1 counter that
> tracks actual usage seem much simpler to digest, and provide better
> visibility to what the cgroup is doing as I mentioned above.
> 
> I think it may be better if I keep the 2 counter approach but
> thoroughly document the interaction between the existing counters and
> NORESERVE. What do you think?

I personally prefer the one counter approach only for the reason that it
exposes less information about hugetlb reservations.  I was not around
for the introduction of hugetlb reservations, but I have fixed several
issues having to do with reservations.  IMO, reservations should be hidden
from users as much as possible.  Others may disagree.

I really hope that Aneesh will comment.  He added the existing hugetlb
cgroup code.  I was not involved in that effort, but it looks like there
might have been some thought given to reservations in early versions of
that code.  It would be interesting to get his perspective.

Changes included in patch 4 (disable region_add file_region coalescing)
would be needed in a one counter approach as well, so I do plan to
review those changes.

Mina Almasry Sept. 24, 2019, 10:42 p.m. UTC | #4

On Mon, Sep 23, 2019 at 2:27 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 9/23/19 12:18 PM, Mina Almasry wrote:
> > On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> On 9/19/19 3:24 PM, Mina Almasry wrote:
> >>> Patch series implements hugetlb_cgroup reservation usage and limits, which
> >>> track hugetlb reservations rather than hugetlb memory faulted in. Details of
> >>> the approach is 1/7.
> >>
> >> Thanks for your continued efforts Mina.
> >>
> >
> > And thanks for your reviews so far.
> >
> >> One thing that has bothered me with this approach from the beginning is that
> >> hugetlb reservations are related to, but somewhat distinct from hugetlb
> >> allocations.  The original (existing) huegtlb cgroup implementation does not
> >> take reservations into account.  This is an issue you are trying to address
> >> by adding a cgroup support for hugetlb reservations.  However, this new
> >> reservation cgroup ignores hugetlb allocations at fault time.
> >>
> >> I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
> >> of hugetlb pages.  Both the existing cgroup code and the reservation approach
> >> have what I think are some serious flaws.  Consider a system with 100 hugetlb
> >> pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
> >> usage to 50 pages each.
> >>
> >> With the existing implementation, a task in group A could create a mmap of
> >> 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
> >> nobody in group B can allocate ANY huge pages.  This is true even though
> >> no pages have been allocated in A (or B).
> >>
> >> With the reservation implementation, a task in group A could use MAP_NORESERVE
> >> and allocate all 100 pages without taking any reservations.
> >>
> >> As mentioned in your documentation, it would be possible to use both the
> >> existing (allocation) and new reservation cgroups together.  Perhaps if both
> >> are setup for the 50/50 split things would work a little better.
> >>
> >> However, instead of creating a new reservation crgoup how about adding
> >> reservation support to the existing allocation cgroup support.  One could
> >> even argue that a reservation is an allocation as it sets aside huge pages
> >> that can only be used for a specific purpose.  Here is something that
> >> may work.
> >>
> >> Starting with the existing allocation cgroup.
> >> - When hugetlb pages are reserved, the cgroup of the task making the
> >>   reservations is charged.  Tracking for the charged cgroup is done in the
> >>   reservation map in the same way proposed by this patch set.
> >> - At page fault time,
> >>   - If a reservation already exists for that specific area do not charge the
> >>     faulting task.  No tracking in page, just the reservation map.
> >>   - If no reservation exists, charge the group of the faulting task.  Tracking
> >>     of this information is in the page itself as implemented today.
> >> - When the hugetlb object is removed, compare the reservation map with any
> >>   allocated pages.  If cgroup tracking information exists in page, uncharge
> >>   that group.  Otherwise, unharge the group (if any) in the reservation map.
> >>
> >> One of the advantages of a separate reservation cgroup is that the existing
> >> code is unmodified.  Combining the two provides a more complete/accurate
> >> solution IMO.  But, it has the potential to break existing users.
> >>
> >> I really would like to get feedback from anyone that knows how the existing
> >> hugetlb cgroup controller may be used today.  Comments from Aneesh would
> >> be very welcome to know if reservations were considered in development of the
> >> existing code.
> >> --
> >
> > FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:
> >
> > AFAICT, the 2 counter approach we have here is strictly superior to
> > the 1 upgraded counter approach. Consider these points:
> >
> > - From what I can tell so far, everything you can do with the 1
> > counter approach, you can do with the two counter approach by setting
> > both limit_in_bytes and reservation_limit_in_bytes to the limit value.
> > That will limit both reservations and at fault allocations.
> >
> > - The 2 counter approach preserves existing usage of hugetlb cgroups,
> > so no need to muck around with reverting the feature some time from
> > now because of broken users. No existing users of hugetlb cgroups need
> > to worry about the effect of this on their usage.
> >
> > - Users that use hugetlb memory strictly through reservations can use
> > only reservation_limit_in_bytes and enjoy cgroup limits that never
> > SIGBUS the application. This is our usage for example.
> >
> > - The 2 counter approach provides more info to the sysadmin. The
> > sysadmin knows exactly how much reserved bytes there are via
> > reservation_usage_in_bytes, and how much actually in use bytes there
> > are via usage_in_bytes. They can even detect NORESERVE usage if
> > usage_in_bytes > reservation_usage_in_bytes. failcnt shows failed
> > reservations *and* failed allocations at fault, etc. All around better
> > debuggability when things go wrong. I think this is particularly
> > troubling for the 1 upgraded counter approach. That counter's
> > usage_in_bytes doesn't tell you if the usage came from reservations or
> > allocations at fault time.
> >
> > - Honestly, I think the 2 counter approach is easier to document and
> > understand by the userspace? 1 counter that vaguely tracks both the
> > reservations and usage and decides whether or not to charge at fault
> > time seems hard to understand what really happened after something
> > goes wrong. 1 counter that tracks reservations and 1 counter that
> > tracks actual usage seem much simpler to digest, and provide better
> > visibility to what the cgroup is doing as I mentioned above.
> >
> > I think it may be better if I keep the 2 counter approach but
> > thoroughly document the interaction between the existing counters and
> > NORESERVE. What do you think?
>
> I personally prefer the one counter approach only for the reason that it
> exposes less information about hugetlb reservations.  I was not around
> for the introduction of hugetlb reservations, but I have fixed several
> issues having to do with reservations.  IMO, reservations should be hidden
> from users as much as possible.  Others may disagree.
>
> I really hope that Aneesh will comment.  He added the existing hugetlb
> cgroup code.  I was not involved in that effort, but it looks like there
> might have been some thought given to reservations in early versions of
> that code.  It would be interesting to get his perspective.
>
> Changes included in patch 4 (disable region_add file_region coalescing)
> would be needed in a one counter approach as well, so I do plan to
> review those changes.

OK, FWIW, the 1 counter approach should be sufficient for us, so I'm
not really opposed. David, maybe chime in if you see a problem here?
From the perspective of hiding reservations from the user as much as
possible, it is an improvement.

I'm only wary about changing the behavior of the current and having
that regress applications. I'm hoping you and Aneesh can shed light on
this.

> --
> Mike Kravetz

David Rientjes Sept. 26, 2019, 7:28 p.m. UTC | #5

On Tue, 24 Sep 2019, Mina Almasry wrote:

> > I personally prefer the one counter approach only for the reason that it
> > exposes less information about hugetlb reservations.  I was not around
> > for the introduction of hugetlb reservations, but I have fixed several
> > issues having to do with reservations.  IMO, reservations should be hidden
> > from users as much as possible.  Others may disagree.
> >
> > I really hope that Aneesh will comment.  He added the existing hugetlb
> > cgroup code.  I was not involved in that effort, but it looks like there
> > might have been some thought given to reservations in early versions of
> > that code.  It would be interesting to get his perspective.
> >
> > Changes included in patch 4 (disable region_add file_region coalescing)
> > would be needed in a one counter approach as well, so I do plan to
> > review those changes.
> 
> OK, FWIW, the 1 counter approach should be sufficient for us, so I'm
> not really opposed. David, maybe chime in if you see a problem here?
> From the perspective of hiding reservations from the user as much as
> possible, it is an improvement.
> 
> I'm only wary about changing the behavior of the current and having
> that regress applications. I'm hoping you and Aneesh can shed light on
> this.
> 

I think neither Aneesh nor myself are going to be able to provide a 
complete answer on the use of hugetlb cgroup today, anybody could be using 
it without our knowledge and that opens up the possibility that combining 
the limits would adversely affect a real system configuration.

If that is a possibility, I think we need to do some due diligence and try 
to deprecate allocation limits if possible.  One of the benefits to 
separate limits is that we can make reasonable steps to deprecating the 
actual allocation limits, if possible: we could add warnings about the 
deprecation of allocation limits and see if anybody complains.

That could take the form of two separate limits or a tunable in the root 
hugetlb cgroup that defines whether the limits are for allocation or 
reservation.

Combining them in the first pass seems to be very risky and could cause 
pain for users that will not detect this during an rc cycle and will 
report the issue only when their distro gets it.  Then we are left with no 
alternative other than stable backports and the separation of the limits 
anyway.

Mike Kravetz Sept. 26, 2019, 9:23 p.m. UTC | #6

On 9/26/19 12:28 PM, David Rientjes wrote:
> On Tue, 24 Sep 2019, Mina Almasry wrote:
> 
>>> I personally prefer the one counter approach only for the reason that it
>>> exposes less information about hugetlb reservations.  I was not around
>>> for the introduction of hugetlb reservations, but I have fixed several
>>> issues having to do with reservations.  IMO, reservations should be hidden
>>> from users as much as possible.  Others may disagree.
>>>
>>> I really hope that Aneesh will comment.  He added the existing hugetlb
>>> cgroup code.  I was not involved in that effort, but it looks like there
>>> might have been some thought given to reservations in early versions of
>>> that code.  It would be interesting to get his perspective.
>>>
>>> Changes included in patch 4 (disable region_add file_region coalescing)
>>> would be needed in a one counter approach as well, so I do plan to
>>> review those changes.
>>
>> OK, FWIW, the 1 counter approach should be sufficient for us, so I'm
>> not really opposed. David, maybe chime in if you see a problem here?
>> From the perspective of hiding reservations from the user as much as
>> possible, it is an improvement.
>>
>> I'm only wary about changing the behavior of the current and having
>> that regress applications. I'm hoping you and Aneesh can shed light on
>> this.
>>
> 
> I think neither Aneesh nor myself are going to be able to provide a 
> complete answer on the use of hugetlb cgroup today, anybody could be using 
> it without our knowledge and that opens up the possibility that combining 
> the limits would adversely affect a real system configuration.

I agree that nobody can provide complete information on hugetlb cgroup usage
today.  My interest was in anything Aneesh could remember about development
of the current cgroup code.  It 'appears' that the idea of including
reservations or mmap ranges was considered or at least discussed.  But, those
discussions happened more than 7 years old and my searches are not providing
a complete picture.  My hope was that Aneesh may remember those discussions.

> If that is a possibility, I think we need to do some due diligence and try 
> to deprecate allocation limits if possible.  One of the benefits to 
> separate limits is that we can make reasonable steps to deprecating the 
> actual allocation limits, if possible: we could add warnings about the 
> deprecation of allocation limits and see if anybody complains.
> 
> That could take the form of two separate limits or a tunable in the root 
> hugetlb cgroup that defines whether the limits are for allocation or 
> reservation.
> 
> Combining them in the first pass seems to be very risky and could cause 
> pain for users that will not detect this during an rc cycle and will 
> report the issue only when their distro gets it.  Then we are left with no 
> alternative other than stable backports and the separation of the limits 
> anyway.

I agree that changing behavior of the existing controller is too risky.
Such a change is likely to break someone.  The more I think about it, the
best way forward will be to retain the existing controller and create a
new controller that satisfies the new use cases.  The question remains as
to what that new controller will be.  Does it control reservations only?
Is it a combination of reservations and allocations?  If a combined
controller will work for new use cases, that would be my preference.  Of
course, I have not prototyped such a controller so there may be issues when
we get into the details.  For a reservation only or combined controller,
the region_* changes proposed by Mina would be used.

Mina Almasry Sept. 27, 2019, 12:55 a.m. UTC | #7

On Thu, Sep 26, 2019 at 2:23 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 9/26/19 12:28 PM, David Rientjes wrote:
> > On Tue, 24 Sep 2019, Mina Almasry wrote:
> >
> >>> I personally prefer the one counter approach only for the reason that it
> >>> exposes less information about hugetlb reservations.  I was not around
> >>> for the introduction of hugetlb reservations, but I have fixed several
> >>> issues having to do with reservations.  IMO, reservations should be hidden
> >>> from users as much as possible.  Others may disagree.
> >>>
> >>> I really hope that Aneesh will comment.  He added the existing hugetlb
> >>> cgroup code.  I was not involved in that effort, but it looks like there
> >>> might have been some thought given to reservations in early versions of
> >>> that code.  It would be interesting to get his perspective.
> >>>
> >>> Changes included in patch 4 (disable region_add file_region coalescing)
> >>> would be needed in a one counter approach as well, so I do plan to
> >>> review those changes.
> >>
> >> OK, FWIW, the 1 counter approach should be sufficient for us, so I'm
> >> not really opposed. David, maybe chime in if you see a problem here?
> >> From the perspective of hiding reservations from the user as much as
> >> possible, it is an improvement.
> >>
> >> I'm only wary about changing the behavior of the current and having
> >> that regress applications. I'm hoping you and Aneesh can shed light on
> >> this.
> >>
> >
> > I think neither Aneesh nor myself are going to be able to provide a
> > complete answer on the use of hugetlb cgroup today, anybody could be using
> > it without our knowledge and that opens up the possibility that combining
> > the limits would adversely affect a real system configuration.
>
> I agree that nobody can provide complete information on hugetlb cgroup usage
> today.  My interest was in anything Aneesh could remember about development
> of the current cgroup code.  It 'appears' that the idea of including
> reservations or mmap ranges was considered or at least discussed.  But, those
> discussions happened more than 7 years old and my searches are not providing
> a complete picture.  My hope was that Aneesh may remember those discussions.
>
> > If that is a possibility, I think we need to do some due diligence and try
> > to deprecate allocation limits if possible.  One of the benefits to
> > separate limits is that we can make reasonable steps to deprecating the
> > actual allocation limits, if possible: we could add warnings about the
> > deprecation of allocation limits and see if anybody complains.
> >
> > That could take the form of two separate limits or a tunable in the root
> > hugetlb cgroup that defines whether the limits are for allocation or
> > reservation.
> >
> > Combining them in the first pass seems to be very risky and could cause
> > pain for users that will not detect this during an rc cycle and will
> > report the issue only when their distro gets it.  Then we are left with no
> > alternative other than stable backports and the separation of the limits
> > anyway.
>
> I agree that changing behavior of the existing controller is too risky.
> Such a change is likely to break someone.

I'm glad we're converging on keeping the existing behavior unchanged.

> The more I think about it, the
> best way forward will be to retain the existing controller and create a
> new controller that satisfies the new use cases.

My guess is that a new controller needs to support cgroups-v2, which
is fine. But can a new controller also support v1? Or is there a
requirement that new controllers support *only* v2? I need whatever
solution here to work on v1. Added Tejun to hopefully comment on this.

>The question remains as
> to what that new controller will be.  Does it control reservations only?
> Is it a combination of reservations and allocations?  If a combined
> controller will work for new use cases, that would be my preference.  Of
> course, I have not prototyped such a controller so there may be issues when
> we get into the details.  For a reservation only or combined controller,
> the region_* changes proposed by Mina would be used.

Provided we keep the existing controller untouched, should the new
controller track:

1. only reservations, or
2. both reservations and allocations for which no reservations exist
(such as the MAP_NORESERVE case)?

I like the 'both' approach. Seems to me a counter like that would work
automatically regardless of whether the application is allocating
hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
into reserved hugetlb pages, correct? If so, then applications that
allocate with NORESERVE will get sigbused when they hit their limit,
and applications that allocate without NORESERVE may get an error at
mmap time but will always be within their limits while they access the
mmap'd memory, correct? So the 'both' counter seems like a one size
fits all.

I think the only sticking point left is whether an added controller
can support both cgroup-v2 and cgroup-v1. If I could get confirmation
on that I'll provide a patchset.

> --
> Mike Kravetz

Mike Kravetz Sept. 27, 2019, 9:59 p.m. UTC | #8

On 9/26/19 5:55 PM, Mina Almasry wrote:
> Provided we keep the existing controller untouched, should the new
> controller track:
> 
> 1. only reservations, or
> 2. both reservations and allocations for which no reservations exist
> (such as the MAP_NORESERVE case)?
> 
> I like the 'both' approach. Seems to me a counter like that would work
> automatically regardless of whether the application is allocating
> hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
> into reserved hugetlb pages, correct?

Correct.  One other easy way to allocate huge pages without reserves
(that I know is used today) is via the fallocate system call.

>                                       If so, then applications that
> allocate with NORESERVE will get sigbused when they hit their limit,
> and applications that allocate without NORESERVE may get an error at
> mmap time but will always be within their limits while they access the
> mmap'd memory, correct?

Correct.  At page allocation time we can easily check to see if a reservation
exists and not charge.  For any specific page within a hugetlbfs file,
a charge would happen at mmap time or allocation time.

One exception (that I can think of) to this mmap(RESERVE) will not cause
a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
a file, not only do they remove pages associated with the file but the
reservation information as well.  Therefore, a subsequent fault will be
the same as an allocation without reservation.

I 'think' the code to remove/truncate a file will work corrctly as it
is today, but I need to think about this some more.

> mmap'd memory, correct? So the 'both' counter seems like a one size
> fits all.
> 
> I think the only sticking point left is whether an added controller
> can support both cgroup-v2 and cgroup-v1. If I could get confirmation
> on that I'll provide a patchset.

Sorry, but I can not provide cgroup expertise.

Mina Almasry Sept. 27, 2019, 10:51 p.m. UTC | #9

On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 9/26/19 5:55 PM, Mina Almasry wrote:
> > Provided we keep the existing controller untouched, should the new
> > controller track:
> >
> > 1. only reservations, or
> > 2. both reservations and allocations for which no reservations exist
> > (such as the MAP_NORESERVE case)?
> >
> > I like the 'both' approach. Seems to me a counter like that would work
> > automatically regardless of whether the application is allocating
> > hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
> > into reserved hugetlb pages, correct?
>
> Correct.  One other easy way to allocate huge pages without reserves
> (that I know is used today) is via the fallocate system call.
>
> >                                       If so, then applications that
> > allocate with NORESERVE will get sigbused when they hit their limit,
> > and applications that allocate without NORESERVE may get an error at
> > mmap time but will always be within their limits while they access the
> > mmap'd memory, correct?
>
> Correct.  At page allocation time we can easily check to see if a reservation
> exists and not charge.  For any specific page within a hugetlbfs file,
> a charge would happen at mmap time or allocation time.
>
> One exception (that I can think of) to this mmap(RESERVE) will not cause
> a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
> a file, not only do they remove pages associated with the file but the
> reservation information as well.  Therefore, a subsequent fault will be
> the same as an allocation without reservation.
>

I don't think it causes a sigbus. This is the scenario, right:

1. Make cgroup with limit X bytes.
2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged
3. A hole of size Y is punched in the file, causing the cgroup to get
uncharged Y bytes.
4. The task faults in memory from the hole, getting charged up to Y
bytes again. But they will be still within their limits.

IIUC userspace only gets sigbus'd if the limit is lowered between
steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.

> I 'think' the code to remove/truncate a file will work corrctly as it
> is today, but I need to think about this some more.
>
> > mmap'd memory, correct? So the 'both' counter seems like a one size
> > fits all.
> >
> > I think the only sticking point left is whether an added controller
> > can support both cgroup-v2 and cgroup-v1. If I could get confirmation
> > on that I'll provide a patchset.
>
> Sorry, but I can not provide cgroup expertise.
> --
> Mike Kravetz

Mike Kravetz Sept. 27, 2019, 10:56 p.m. UTC | #10

On 9/27/19 3:51 PM, Mina Almasry wrote:
> On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 9/26/19 5:55 PM, Mina Almasry wrote:
>>> Provided we keep the existing controller untouched, should the new
>>> controller track:
>>>
>>> 1. only reservations, or
>>> 2. both reservations and allocations for which no reservations exist
>>> (such as the MAP_NORESERVE case)?
>>>
>>> I like the 'both' approach. Seems to me a counter like that would work
>>> automatically regardless of whether the application is allocating
>>> hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
>>> into reserved hugetlb pages, correct?
>>
>> Correct.  One other easy way to allocate huge pages without reserves
>> (that I know is used today) is via the fallocate system call.
>>
>>>                                       If so, then applications that
>>> allocate with NORESERVE will get sigbused when they hit their limit,
>>> and applications that allocate without NORESERVE may get an error at
>>> mmap time but will always be within their limits while they access the
>>> mmap'd memory, correct?
>>
>> Correct.  At page allocation time we can easily check to see if a reservation
>> exists and not charge.  For any specific page within a hugetlbfs file,
>> a charge would happen at mmap time or allocation time.
>>
>> One exception (that I can think of) to this mmap(RESERVE) will not cause
>> a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
>> a file, not only do they remove pages associated with the file but the
>> reservation information as well.  Therefore, a subsequent fault will be
>> the same as an allocation without reservation.
>>
> 
> I don't think it causes a sigbus. This is the scenario, right:
> 
> 1. Make cgroup with limit X bytes.
> 2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged
> 3. A hole of size Y is punched in the file, causing the cgroup to get
> uncharged Y bytes.
> 4. The task faults in memory from the hole, getting charged up to Y
> bytes again. But they will be still within their limits.
> 
> IIUC userspace only gets sigbus'd if the limit is lowered between
> steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.

You are correct.  That was my mistake.  I was still thinking of behavior
for a reservation only cgroup model.  It would behave as you describe
above (no SIGBUS) for a combined reservation/allocate model.

Michal Koutný Sept. 30, 2019, 3:12 p.m. UTC | #11

Hi.

On Thu, Sep 26, 2019 at 05:55:29PM -0700, Mina Almasry <almasrymina@google.com> wrote:
> My guess is that a new controller needs to support cgroups-v2, which
> is fine. But can a new controller also support v1? Or is there a
> requirement that new controllers support *only* v2? I need whatever
> solution here to work on v1.
Here is my view of important criteria:

	1) stability, v1 APIs and semantics should not be changed,
	2) futureproofness, v1 uses should be convertible to v2 uses,
	3) maintainability, the less (similar) code the better.

And here is my braindump of some approaches:

A) new controller, v2 only
- 1) ok
- 2) may be ok
- 3) separate v1 and v2 implementations
- exclusion must be ensured on hybrid hierarchies

B) new controller, version oblivious (see e.g. pid)
- 1) sort of ok
- 2) partially ok
- 3) two v1 implementations
- exclusion must be ensured even on pure v1 hierarchies

C) extending the existing controller, w/out v2 counterpart
- 1) ok with workarounds (new option switching behavior)
- 2) not ok
- 3) likely OK

D) extending the existing controller, with v2 counterpart
- 1) ok with workarounds (new option switching behavior, see cpuset)
- 2) may be ok
- 3) likely OK

AFAIU, the current patchset is variation of C). Personally, I think
something like D) could work, I'm not convinced about A) and B) based on
the breakdown above. But it may induce some other ideas.


My two cents,
Michal

Mina Almasry Oct. 11, 2019, 7:10 p.m. UTC | #12

On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 9/19/19 3:24 PM, Mina Almasry wrote:
> > Patch series implements hugetlb_cgroup reservation usage and limits, which
> > track hugetlb reservations rather than hugetlb memory faulted in. Details of
> > the approach is 1/7.
>
> Thanks for your continued efforts Mina.
>
> One thing that has bothered me with this approach from the beginning is that
> hugetlb reservations are related to, but somewhat distinct from hugetlb
> allocations.  The original (existing) huegtlb cgroup implementation does not
> take reservations into account.  This is an issue you are trying to address
> by adding a cgroup support for hugetlb reservations.  However, this new
> reservation cgroup ignores hugetlb allocations at fault time.
>
> I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
> of hugetlb pages.  Both the existing cgroup code and the reservation approach
> have what I think are some serious flaws.  Consider a system with 100 hugetlb
> pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
> usage to 50 pages each.
>
> With the existing implementation, a task in group A could create a mmap of
> 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
> nobody in group B can allocate ANY huge pages.  This is true even though
> no pages have been allocated in A (or B).
>
> With the reservation implementation, a task in group A could use MAP_NORESERVE
> and allocate all 100 pages without taking any reservations.
>
> As mentioned in your documentation, it would be possible to use both the
> existing (allocation) and new reservation cgroups together.  Perhaps if both
> are setup for the 50/50 split things would work a little better.
>
> However, instead of creating a new reservation crgoup how about adding
> reservation support to the existing allocation cgroup support.  One could
> even argue that a reservation is an allocation as it sets aside huge pages
> that can only be used for a specific purpose.  Here is something that
> may work.
>
> Starting with the existing allocation cgroup.
> - When hugetlb pages are reserved, the cgroup of the task making the
>   reservations is charged.  Tracking for the charged cgroup is done in the
>   reservation map in the same way proposed by this patch set.
> - At page fault time,
>   - If a reservation already exists for that specific area do not charge the
>     faulting task.  No tracking in page, just the reservation map.
>   - If no reservation exists, charge the group of the faulting task.  Tracking
>     of this information is in the page itself as implemented today.
> - When the hugetlb object is removed, compare the reservation map with any
>   allocated pages.  If cgroup tracking information exists in page, uncharge
>   that group.  Otherwise, unharge the group (if any) in the reservation map.
>

Sorry for the late response here. I've been prototyping the
suggestions from this conversation:

1. Supporting cgroup-v2 on the current controller seems trivial.
Basically just specifying the dfl files seems to do it, and my tests
on top of cgroup-v2 don't see any problems so far at least. In light
of this I'm not sure it's best to create a new controller per say.
Seems like it would duplicate a lot of code with the current
controller, so I've tentatively just stuck to the plan in my current
patchset, a new counter on the existing controller.

2. I've been working on transitioning the new counter to the behavior
Mike specified in the email I'm responding to. So far I have a flow
that works for shared mappings but not private mappings:

- On reservation, charge the new counter and store the info in the
resv_map. The counter gets uncharged when the resv_map entry gets
removed (works fine).
- On alloc_huge_page(), check if there is a reservation for the page
being allocated. If not, charge the new counter and store the
information in resv_map. The counter still gets uncharged when the
resv_map entry gets removed.

The above works for all shared mappings and reserved private mappings,
but I' having trouble supporting private NORESERVE mappings. Charging
can work the same as for shared mappings: charge the new counter on
reservation and on allocations that do not have a reservation. But the
question still comes up: where to store the counter to uncharge this
page? I thought of a couple of things that don't seem to work:

1. I thought of putting the counter in resv_map->reservation_counter,
so that it gets uncharged on vm_op_close. But, private NORESERVE
mappings don't even have a resv_map allocated for them.

2. I thought of detecting on free_huge_page that the page being freed
belonged to a private NORESERVE mapping, and uncharging the
hugetlb_cgroup in the page itself, but free_huge_page only gets a
struct page* and I can't seem to find a way to detect that that the
page comes from a private NORESERVE mapping from only the struct
page*.

Mike, note your suggestion above to check if the page hugetlb_cgroup
is null doesn't work if we want to keep the current counter working
the same: the page will always have a hugetlb_cgroup that points that
contains the old counter. Any ideas how to apply this new counter
behavior to a private NORESERVE mappings? Is there maybe a flag I can
set on the pages at allocation time that I can read on free time to
know whether to uncharge the hugetlb_cgroup or not?

Mina Almasry Oct. 11, 2019, 8:41 p.m. UTC | #13

On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 9/19/19 3:24 PM, Mina Almasry wrote:
> > > Patch series implements hugetlb_cgroup reservation usage and limits, which
> > > track hugetlb reservations rather than hugetlb memory faulted in. Details of
> > > the approach is 1/7.
> >
> > Thanks for your continued efforts Mina.
> >
> > One thing that has bothered me with this approach from the beginning is that
> > hugetlb reservations are related to, but somewhat distinct from hugetlb
> > allocations.  The original (existing) huegtlb cgroup implementation does not
> > take reservations into account.  This is an issue you are trying to address
> > by adding a cgroup support for hugetlb reservations.  However, this new
> > reservation cgroup ignores hugetlb allocations at fault time.
> >
> > I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
> > of hugetlb pages.  Both the existing cgroup code and the reservation approach
> > have what I think are some serious flaws.  Consider a system with 100 hugetlb
> > pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
> > usage to 50 pages each.
> >
> > With the existing implementation, a task in group A could create a mmap of
> > 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
> > nobody in group B can allocate ANY huge pages.  This is true even though
> > no pages have been allocated in A (or B).
> >
> > With the reservation implementation, a task in group A could use MAP_NORESERVE
> > and allocate all 100 pages without taking any reservations.
> >
> > As mentioned in your documentation, it would be possible to use both the
> > existing (allocation) and new reservation cgroups together.  Perhaps if both
> > are setup for the 50/50 split things would work a little better.
> >
> > However, instead of creating a new reservation crgoup how about adding
> > reservation support to the existing allocation cgroup support.  One could
> > even argue that a reservation is an allocation as it sets aside huge pages
> > that can only be used for a specific purpose.  Here is something that
> > may work.
> >
> > Starting with the existing allocation cgroup.
> > - When hugetlb pages are reserved, the cgroup of the task making the
> >   reservations is charged.  Tracking for the charged cgroup is done in the
> >   reservation map in the same way proposed by this patch set.
> > - At page fault time,
> >   - If a reservation already exists for that specific area do not charge the
> >     faulting task.  No tracking in page, just the reservation map.
> >   - If no reservation exists, charge the group of the faulting task.  Tracking
> >     of this information is in the page itself as implemented today.
> > - When the hugetlb object is removed, compare the reservation map with any
> >   allocated pages.  If cgroup tracking information exists in page, uncharge
> >   that group.  Otherwise, unharge the group (if any) in the reservation map.
> >
>
> Sorry for the late response here. I've been prototyping the
> suggestions from this conversation:
>
> 1. Supporting cgroup-v2 on the current controller seems trivial.
> Basically just specifying the dfl files seems to do it, and my tests
> on top of cgroup-v2 don't see any problems so far at least. In light
> of this I'm not sure it's best to create a new controller per say.
> Seems like it would duplicate a lot of code with the current
> controller, so I've tentatively just stuck to the plan in my current
> patchset, a new counter on the existing controller.
>
> 2. I've been working on transitioning the new counter to the behavior
> Mike specified in the email I'm responding to. So far I have a flow
> that works for shared mappings but not private mappings:
>
> - On reservation, charge the new counter and store the info in the
> resv_map. The counter gets uncharged when the resv_map entry gets
> removed (works fine).
> - On alloc_huge_page(), check if there is a reservation for the page
> being allocated. If not, charge the new counter and store the
> information in resv_map. The counter still gets uncharged when the
> resv_map entry gets removed.
>
> The above works for all shared mappings and reserved private mappings,
> but I' having trouble supporting private NORESERVE mappings. Charging
> can work the same as for shared mappings: charge the new counter on
> reservation and on allocations that do not have a reservation. But the
> question still comes up: where to store the counter to uncharge this
> page? I thought of a couple of things that don't seem to work:
>
> 1. I thought of putting the counter in resv_map->reservation_counter,
> so that it gets uncharged on vm_op_close. But, private NORESERVE
> mappings don't even have a resv_map allocated for them.
>
> 2. I thought of detecting on free_huge_page that the page being freed
> belonged to a private NORESERVE mapping, and uncharging the
> hugetlb_cgroup in the page itself, but free_huge_page only gets a
> struct page* and I can't seem to find a way to detect that that the
> page comes from a private NORESERVE mapping from only the struct
> page*.
>
> Mike, note your suggestion above to check if the page hugetlb_cgroup
> is null doesn't work if we want to keep the current counter working
> the same: the page will always have a hugetlb_cgroup that points that
> contains the old counter. Any ideas how to apply this new counter
> behavior to a private NORESERVE mappings? Is there maybe a flag I can
> set on the pages at allocation time that I can read on free time to
> know whether to uncharge the hugetlb_cgroup or not?

Reading the code and asking around a bit, it seems the pointer to the
hugetlb_cgroup is in page[2].private. Is it reasonable to use
page[3].private to store the hugetlb_cgroup to uncharge for the new
counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that
would solve my problem. When allocating a private NORESERVE page, set
page[3].private to the hugetlb_cgroup to uncharge, then on
free_huge_page, check page[3].private, if it is non-NULL, uncharge the
new counter on it.

Mike Kravetz Oct. 14, 2019, 5:33 p.m. UTC | #14

On 10/11/19 1:41 PM, Mina Almasry wrote:
> On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry <almasrymina@google.com> wrote:
>>
>> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>> On 9/19/19 3:24 PM, Mina Almasry wrote:
>>
>> Mike, note your suggestion above to check if the page hugetlb_cgroup
>> is null doesn't work if we want to keep the current counter working
>> the same: the page will always have a hugetlb_cgroup that points that
>> contains the old counter. Any ideas how to apply this new counter
>> behavior to a private NORESERVE mappings? Is there maybe a flag I can
>> set on the pages at allocation time that I can read on free time to
>> know whether to uncharge the hugetlb_cgroup or not?
> 
> Reading the code and asking around a bit, it seems the pointer to the
> hugetlb_cgroup is in page[2].private. Is it reasonable to use
> page[3].private to store the hugetlb_cgroup to uncharge for the new
> counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that
> would solve my problem. When allocating a private NORESERVE page, set
> page[3].private to the hugetlb_cgroup to uncharge, then on
> free_huge_page, check page[3].private, if it is non-NULL, uncharge the
> new counter on it.

Sorry for not responding sooner.  This approach should work, and it looks like
you have a v6 of the series.  I'll take a look.

Mina Almasry Oct. 14, 2019, 6:01 p.m. UTC | #15

On Mon, Oct 14, 2019 at 10:33 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 10/11/19 1:41 PM, Mina Almasry wrote:
> > On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>>
> >>> On 9/19/19 3:24 PM, Mina Almasry wrote:
> >>
> >> Mike, note your suggestion above to check if the page hugetlb_cgroup
> >> is null doesn't work if we want to keep the current counter working
> >> the same: the page will always have a hugetlb_cgroup that points that
> >> contains the old counter. Any ideas how to apply this new counter
> >> behavior to a private NORESERVE mappings? Is there maybe a flag I can
> >> set on the pages at allocation time that I can read on free time to
> >> know whether to uncharge the hugetlb_cgroup or not?
> >
> > Reading the code and asking around a bit, it seems the pointer to the
> > hugetlb_cgroup is in page[2].private. Is it reasonable to use
> > page[3].private to store the hugetlb_cgroup to uncharge for the new
> > counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that
> > would solve my problem. When allocating a private NORESERVE page, set
> > page[3].private to the hugetlb_cgroup to uncharge, then on
> > free_huge_page, check page[3].private, if it is non-NULL, uncharge the
> > new counter on it.
>
> Sorry for not responding sooner.  This approach should work, and it looks like
> you have a v6 of the series.  I'll take a look.
>

Great! Thanks! That's the approach I went with in v6.

> --
> Mike Kravetz

[v5,0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

Message

Comments