Message ID: 20190808231340.53601-1-almasrymina@google.com
Series: hugetlb_cgroup: Add hugetlb_cgroup reservation limits
(+CC Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar) On 8/8/19 4:13 PM, Mina Almasry wrote: > Problem: > Currently tasks attempting to allocate more hugetlb memory than is available get > a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. > However, if a task attempts to allocate hugetlb memory only more than its > hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, > but will SIGBUS the task when it attempts to fault the memory in. > > We have developers interested in using hugetlb_cgroups, and they have expressed > dissatisfaction regarding this behavior. We'd like to improve this > behavior such that tasks violating the hugetlb_cgroup limits get an error on > mmap/shmget time, rather than getting SIGBUS'd when they try to fault > the excess memory in. > > The underlying problem is that today's hugetlb_cgroup accounting happens > at hugetlb memory *fault* time, rather than at *reservation* time. > Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and > the offending task gets SIGBUS'd. > > Proposed Solution: > A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This > counter has slightly different semantics than > hugetlb.xMB.[limit|usage]_in_bytes: > > - While usage_in_bytes tracks all *faulted* hugetlb memory, > reservation_usage_in_bytes tracks all *reserved* hugetlb memory. > > - If a task attempts to reserve more memory than limit_in_bytes allows, > the kernel will allow it to do so. But if a task attempts to reserve > more memory than reservation_limit_in_bytes, the kernel will fail this > reservation. > > This proposal is implemented in this patch, with tests to verify > functionality and show the usage. Thanks for taking on this effort Mina. Before looking at the details of the code, it might be helpful to discuss the expected semantics of the proposed reservation limits. I see you took into account the differences between private and shared mappings. This is good, as the reservation behavior is different for each of these cases. First let's look at private mappings. For private mappings, the reservation usage will be the size of the mapping. This should be fairly simple. As reservations are consumed in the hugetlbfs code, reservations in the resv_map are removed. I see you have a hook into region_del. So, the expectation is that as reservations are consumed the reservation usage will drop for the cgroup. Correct? The only tricky thing about private mappings is COW because of fork. Current reservation semantics specify that all reservations stay with the parent. If child faults and can not get page, SIGBUS. I assume the new reservation limits will work the same. I believe tracking reservations for shared mappings can get quite complicated. The hugetlbfs reservation code around shared mappings 'works' on the basis that shared mapping reservations are global. As a result, reservations are more associated with the inode than with the task making the reservation. For example, consider a file of size 4 hugetlb pages. Task A maps the first 2 pages, and 2 reservations are taken. Task B maps all 4 pages, and 2 additional reservations are taken. I am not really sure of the desired semantics here for reservation limits if A and B are in separate cgroups. Should B be charged for 4 or 2 reservations? Also in the example above, after both tasks create their mappings suppose Task B faults in the first page. Does the reservation usage of Task A go down as it originally had the reservation? 
It should also be noted that when hugetlbfs reservations are 'consumed' for shared mappings there are no changes to the resv_map. Rather, the unmap code compares the contents of the page cache to the resv_map to determine how many reservations were actually consumed. I did not look closely enough to determine whether the code drops reservation usage counts as pages are added to shared mappings.
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > (+CC Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar) > > On 8/8/19 4:13 PM, Mina Almasry wrote: > > Problem: > > Currently tasks attempting to allocate more hugetlb memory than is available get > > a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. > > However, if a task attempts to allocate hugetlb memory only more than its > > hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, > > but will SIGBUS the task when it attempts to fault the memory in. > > > > We have developers interested in using hugetlb_cgroups, and they have expressed > > dissatisfaction regarding this behavior. We'd like to improve this > > behavior such that tasks violating the hugetlb_cgroup limits get an error on > > mmap/shmget time, rather than getting SIGBUS'd when they try to fault > > the excess memory in. > > > > The underlying problem is that today's hugetlb_cgroup accounting happens > > at hugetlb memory *fault* time, rather than at *reservation* time. > > Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and > > the offending task gets SIGBUS'd. > > > > Proposed Solution: > > A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This > > counter has slightly different semantics than > > hugetlb.xMB.[limit|usage]_in_bytes: > > > > - While usage_in_bytes tracks all *faulted* hugetlb memory, > > reservation_usage_in_bytes tracks all *reserved* hugetlb memory. > > > > - If a task attempts to reserve more memory than limit_in_bytes allows, > > the kernel will allow it to do so. But if a task attempts to reserve > > more memory than reservation_limit_in_bytes, the kernel will fail this > > reservation. > > > > This proposal is implemented in this patch, with tests to verify > > functionality and show the usage. > > Thanks for taking on this effort Mina. > No problem! Thanks for reviewing! > Before looking at the details of the code, it might be helpful to discuss > the expected semantics of the proposed reservation limits. > > I see you took into account the differences between private and shared > mappings. This is good, as the reservation behavior is different for each > of these cases. First let's look at private mappings. > > For private mappings, the reservation usage will be the size of the mapping. > This should be fairly simple. As reservations are consumed in the hugetlbfs > code, reservations in the resv_map are removed. I see you have a hook into > region_del. So, the expectation is that as reservations are consumed the > reservation usage will drop for the cgroup. Correct? I assume by 'reservations are consumed' you mean when a reservation goes from just 'reserved' to actually in use (as in the task is writing to the hugetlb page or something). If so, then the answer is no, that is not correct. When reservations are consumed, the reservation usage stays the same. I.e. the reservation usage tracks hugetlb memory (reserved + used) you could say. This is 100% the intention, as we want to know on mmap time if there is enough 'free' (that is unreserved and unused) memory left over in the cgroup to satisfy the mmap call. The hooks in region_add and region_del are to account shared mappings only. There is a check in those code blocks that makes sure the code is only engaged in shared mappings. The commit messages of patches 3/5 and 4/5 go into more details regarding this. > The only tricky thing about private mappings is COW because of fork. 
Current > reservation semantics specify that all reservations stay with the parent. > If child faults and can not get page, SIGBUS. I assume the new reservation > limits will work the same. > Although I did not explicitly try it, yes. It should work the same. The additional reservation due to the COW will get charged to whatever cgroup the fork is in. If the task can't get a page it gets SIGBUS'd. If there is not enough room to charge the cgroup it's in, then the charge will fail, which I assume will trigger error path that also leads to SIGBUS. > I believe tracking reservations for shared mappings can get quite complicated. > The hugetlbfs reservation code around shared mappings 'works' on the basis > that shared mapping reservations are global. As a result, reservations are > more associated with the inode than with the task making the reservation. FWIW, I found it not too bad. And my tests at least don't detect an anomaly around shared mappings. The key I think is that I'm tracking cgroup to uncharge on the file_region entry inside the resv_map, so we know who allocated each file_region entry exactly and we can uncharge them when the entry is region_del'd. > For example, consider a file of size 4 hugetlb pages. > Task A maps the first 2 pages, and 2 reservations are taken. Task B maps > all 4 pages, and 2 additional reservations are taken. I am not really sure > of the desired semantics here for reservation limits if A and B are in separate > cgroups. Should B be charged for 4 or 2 reservations? Task A's cgroup is charged 2 pages to its reservation usage. Task B's cgroup is charged 2 pages to its reservation usage. This is analogous to how shared memory accounting is done for things like tmpfs, and I see no strong reason right now to deviate. I.e. the task that made the reservation is charged with it, and others use it without getting charged. > Also in the example above, after both tasks create their mappings suppose > Task B faults in the first page. Does the reservation usage of Task A go > down as it originally had the reservation? > Reservation usage never goes down when pages are consumed. Yes, I would have this problem if I was planning to decrement reservation usage when pages are put into use, but, the goal is to find out if there is 'free' memory (unreserved + unused) in the cgroup at mmap time, so we want a counter that tracks (reserved + used). > It should also be noted that when hugetlbfs reservations are 'consumed' for > shared mappings there are no changes to the resv_map. Rather the unmap code > compares the contents of the page cache to the resv_map to determine how > many reservations were actually consumed. I did not look close enough to > determine the code drops reservation usage counts as pages are added to shared > mappings. I think this concern also goes away if reservation usage doesn't go down when pages are consumed, but let me know if you still have doubts. Thanks for taking a look so far! > > -- > Mike Kravetz
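A minimal userspace sketch of the semantics Mina describes above for a private mapping. Everything cgroup-related here is an assumption: the cgroup path and the 2MB page size are placeholders, the reservation_usage_in_bytes file is the counter this series proposes (it does not exist in released kernels), and the program is assumed to already be running inside that cgroup.

/* Sketch only: observe the proposed reservation counter around a private
 * hugetlb mapping. Per the proposal, the counter tracks (reserved + in use),
 * so it should rise at mmap time and not drop when the pages are faulted in.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define CG "/sys/fs/cgroup/hugetlb/test/hugetlb.2MB.reservation_usage_in_bytes"
#define SZ (4UL * 2 * 1024 * 1024)	/* four 2MB huge pages */

static long read_counter(void)
{
	long v = -1;
	FILE *f = fopen(CG, "r");

	if (f) {
		if (fscanf(f, "%ld", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}

int main(void)
{
	char *p;

	printf("before mmap: %ld\n", read_counter());

	/* A private hugetlb mapping reserves all SZ bytes up front, so the
	 * proposed counter should jump by SZ here (charged at mmap time). */
	p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("after mmap:  %ld\n", read_counter());

	/* Faulting the pages in consumes the reservation, but reservation
	 * usage tracks (reserved + used), so the counter should not drop. */
	memset(p, 0, SZ);
	printf("after fault: %ld\n", read_counter());

	munmap(p, SZ);
	return 0;
}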
On 8/9/19 12:42 PM, Mina Almasry wrote: > On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: >> On 8/8/19 4:13 PM, Mina Almasry wrote: >>> Problem: >>> Currently tasks attempting to allocate more hugetlb memory than is available get >>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. >>> However, if a task attempts to allocate hugetlb memory only more than its >>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, >>> but will SIGBUS the task when it attempts to fault the memory in. <snip> >> I believe tracking reservations for shared mappings can get quite complicated. >> The hugetlbfs reservation code around shared mappings 'works' on the basis >> that shared mapping reservations are global. As a result, reservations are >> more associated with the inode than with the task making the reservation. > > FWIW, I found it not too bad. And my tests at least don't detect an > anomaly around shared mappings. The key I think is that I'm tracking > cgroup to uncharge on the file_region entry inside the resv_map, so we > know who allocated each file_region entry exactly and we can uncharge > them when the entry is region_del'd. > >> For example, consider a file of size 4 hugetlb pages. >> Task A maps the first 2 pages, and 2 reservations are taken. Task B maps >> all 4 pages, and 2 additional reservations are taken. I am not really sure >> of the desired semantics here for reservation limits if A and B are in separate >> cgroups. Should B be charged for 4 or 2 reservations? > > Task A's cgroup is charged 2 pages to its reservation usage. > Task B's cgroup is charged 2 pages to its reservation usage. OK, Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages allocation. The mmap would succeed, but Task B could potentially need to allocate more than 2 huge pages. So, when faulting in more than 2 huge pages B would get a SIGBUS. Correct? Or, am I missing something? Perhaps reservation charge should always be the same as map size/maximum allocation size?
On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 8/9/19 12:42 PM, Mina Almasry wrote: > > On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > >> On 8/8/19 4:13 PM, Mina Almasry wrote: > >>> Problem: > >>> Currently tasks attempting to allocate more hugetlb memory than is available get > >>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. > >>> However, if a task attempts to allocate hugetlb memory only more than its > >>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, > >>> but will SIGBUS the task when it attempts to fault the memory in. > <snip> > >> I believe tracking reservations for shared mappings can get quite complicated. > >> The hugetlbfs reservation code around shared mappings 'works' on the basis > >> that shared mapping reservations are global. As a result, reservations are > >> more associated with the inode than with the task making the reservation. > > > > FWIW, I found it not too bad. And my tests at least don't detect an > > anomaly around shared mappings. The key I think is that I'm tracking > > cgroup to uncharge on the file_region entry inside the resv_map, so we > > know who allocated each file_region entry exactly and we can uncharge > > them when the entry is region_del'd. > > > >> For example, consider a file of size 4 hugetlb pages. > >> Task A maps the first 2 pages, and 2 reservations are taken. Task B maps > >> all 4 pages, and 2 additional reservations are taken. I am not really sure > >> of the desired semantics here for reservation limits if A and B are in separate > >> cgroups. Should B be charged for 4 or 2 reservations? > > > > Task A's cgroup is charged 2 pages to its reservation usage. > > Task B's cgroup is charged 2 pages to its reservation usage. > > OK, > Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages > allocation. The mmap would succeed, but Task B could potentially need to > allocate more than 2 huge pages. So, when faulting in more than 2 huge > pages B would get a SIGBUS. Correct? Or, am I missing something? > > Perhaps reservation charge should always be the same as map size/maximum > allocation size? I'm thinking this would work similar to how other shared memory like tmpfs is accounted for right now. I.e. if a task conducts an operation that causes memory to be allocated then that task is charged for that memory, and if another task uses memory that has already been allocated and charged by another task, then it can use the memory without being charged. So in case of hugetlb memory, if a task is mmaping memory that causes a new reservation to be made, and new entries to be created in the resv_map for the shared mapping, then that task gets charged. If the task is mmaping memory that is already reserved or faulted, then it reserves or faults it without getting charged. In the example above, in chronological order: - Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations. - Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb reservations because the first 2 are charged already and can be used without incurring a charge. - Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults, since none of the 4 pages are faulted in yet. If the task is only allowed 2 hugetlb page faults then it will actually get a SIGBUS. - Task A accesses 4 hugetlb pages, gets charged no faults, since all the hugetlb faults is charged to Task B. 
So, yes, I can see a scenario where userspace still gets SIGBUS'd, but I think that's fine because:
1. Notice that the SIGBUS is due to the faulting limit, and not the reservation limit, so we're not regressing the status quo per se. Folks using the fault limit today understand the SIGBUS risk.
2. The way I expect folks to use this is to use 'reservation limits' to partition the available hugetlb memory on the machine, and to forgo using the existing fault limits. Using both at the same time would, I think, be a superuser feature for folks that really know what they are doing, and understand the risk of SIGBUS that comes with using the existing fault limits.
3. I expect userspace in general to handle this correctly because there are similar challenges with all shared memory and accounting of it, even in tmpfs, I think.

I would not like to charge the full reservation to every process that does the mmap. Think of this much more common scenario: Task A and B are supposed to collaborate on 10 hugetlb pages of data. Task B should not access any hugetlb memory other than the memory it is working on with Task A, so:
1. Task A is put in a cgroup with a 10 hugetlb page reservation limit.
2. Task B is put in a cgroup with a 0 hugetlb page reservation limit.
3. Task A mmaps 10 hugetlb pages of hugetlb memory, and notifies Task B that it is done.
4. Task B, due to programmer error, tries to mmap hugetlb memory beyond what Task A set up for it, and gets denied at mmap time by the cgroup reservation limit.
5. Task B mmaps the same 10 hugetlb pages of memory and starts working on them. The mmap succeeds because Task B is not charged anything.

If we were charging the full reservation to both Tasks A and B, then both A and B would have to be in cgroups that allow 10 pages of hugetlb reservations, the mmap in step 4 would succeed, and we would accidentally overcommit the amount of hugetlb memory available.
> --
> Mike Kravetz
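A sketch of the earlier Task A / Task B shared-mapping example, written as one program for brevity. The hugetlbfs mount point is an assumption; in the real scenario the two mmaps would be issued by tasks in different cgroups, and under the proposed semantics only the mmap that creates new file_region entries in the resv_map is charged reservations.

/* Sketch only: "Task A" maps the first 2 pages of a 4-page hugetlbfs file,
 * "Task B" then maps all 4. Per the proposal, A's cgroup is charged 2
 * reservations and B's cgroup only the 2 additional ones; fault charges go
 * to whoever faults the pages via the existing fault-based counter.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE (2UL * 1024 * 1024)
#define FILE_PATH "/mnt/hugetlbfs/shared_data"	/* assumed hugetlbfs mount */

int main(void)
{
	int fd = open(FILE_PATH, O_CREAT | O_RDWR, 0644);
	char *a, *b;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ftruncate(fd, 4 * HPAGE)) {
		perror("ftruncate");
		return 1;
	}

	/* "Task A": reservations for these 2 pages do not exist yet, so
	 * A's cgroup would be charged 2 reservations here. */
	a = mmap(NULL, 2 * HPAGE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* "Task B": only pages 3 and 4 need new reservations, so B's cgroup
	 * would be charged 2, not 4. */
	b = mmap(NULL, 4 * HPAGE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (a == MAP_FAILED || b == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	munmap(a, 2 * HPAGE);
	munmap(b, 4 * HPAGE);
	close(fd);
	return 0;
}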
On 8/10/19 3:01 PM, Mina Almasry wrote: > On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: >> >> On 8/9/19 12:42 PM, Mina Almasry wrote: >>> On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: >>>> On 8/8/19 4:13 PM, Mina Almasry wrote: >>>>> Problem: >>>>> Currently tasks attempting to allocate more hugetlb memory than is available get >>>>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. >>>>> However, if a task attempts to allocate hugetlb memory only more than its >>>>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, >>>>> but will SIGBUS the task when it attempts to fault the memory in. >> <snip> >>>> I believe tracking reservations for shared mappings can get quite complicated. >>>> The hugetlbfs reservation code around shared mappings 'works' on the basis >>>> that shared mapping reservations are global. As a result, reservations are >>>> more associated with the inode than with the task making the reservation. >>> >>> FWIW, I found it not too bad. And my tests at least don't detect an >>> anomaly around shared mappings. The key I think is that I'm tracking >>> cgroup to uncharge on the file_region entry inside the resv_map, so we >>> know who allocated each file_region entry exactly and we can uncharge >>> them when the entry is region_del'd. >>> >>>> For example, consider a file of size 4 hugetlb pages. >>>> Task A maps the first 2 pages, and 2 reservations are taken. Task B maps >>>> all 4 pages, and 2 additional reservations are taken. I am not really sure >>>> of the desired semantics here for reservation limits if A and B are in separate >>>> cgroups. Should B be charged for 4 or 2 reservations? >>> >>> Task A's cgroup is charged 2 pages to its reservation usage. >>> Task B's cgroup is charged 2 pages to its reservation usage. >> >> OK, >> Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages >> allocation. The mmap would succeed, but Task B could potentially need to >> allocate more than 2 huge pages. So, when faulting in more than 2 huge >> pages B would get a SIGBUS. Correct? Or, am I missing something? >> >> Perhaps reservation charge should always be the same as map size/maximum >> allocation size? > > I'm thinking this would work similar to how other shared memory like > tmpfs is accounted for right now. I.e. if a task conducts an operation > that causes memory to be allocated then that task is charged for that > memory, and if another task uses memory that has already been > allocated and charged by another task, then it can use the memory > without being charged. > > So in case of hugetlb memory, if a task is mmaping memory that causes > a new reservation to be made, and new entries to be created in the > resv_map for the shared mapping, then that task gets charged. If the > task is mmaping memory that is already reserved or faulted, then it > reserves or faults it without getting charged. > > In the example above, in chronological order: > - Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations. > - Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb > reservations because the first 2 are charged already and can be used > without incurring a charge. > - Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults, > since none of the 4 pages are faulted in yet. If the task is only > allowed 2 hugetlb page faults then it will actually get a SIGBUS. 
> - Task A accesses 4 hugetlb pages, gets charged no faults, since all > the hugetlb faults is charged to Task B. > > So, yes, I can see a scenario where userspace still gets SIGBUS'd, but > I think that's fine because: > 1. Notice that the SIGBUS is due to the faulting limit, and not the > reservation limit, so we're not regressing the status quo per say. > Folks using the fault limit today understand the SIGBUS risk. > 2. the way I expect folks to use this is to use 'reservation limits' > to partition the available hugetlb memory on the machine using it and > forgo using the existing fault limits. Using both at the same time I > think would be a superuser feature for folks that really know what > they are doing, and understand the risk of SIGBUS that comes with > using the existing fault limits. > 3. I expect userspace to in general handle this correctly because > there are similar challenges with all shared memory and accounting of > it, even in tmpfs, I think. Ok, that helps explain your use case. I agree that it would be difficult to use both fault and reservation limits together. Especially in the case of shared mappings.
On Thu, 8 Aug 2019 16:13:36 -0700 Mina Almasry wrote: > > These counters will track hugetlb reservations rather than hugetlb > memory faulted in. This patch only adds the counter, following patches > add the charging and uncharging of the counter. > --- !?! > include/linux/hugetlb.h | 2 +- > mm/hugetlb_cgroup.c | 86 +++++++++++++++++++++++++++++++++++++---- > 2 files changed, 80 insertions(+), 8 deletions(-) > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > index edfca42783192..6777b3013345d 100644 > --- a/include/linux/hugetlb.h > +++ b/include/linux/hugetlb.h > @@ -340,7 +340,7 @@ struct hstate { > unsigned int surplus_huge_pages_node[MAX_NUMNODES]; > #ifdef CONFIG_CGROUP_HUGETLB > /* cgroup control files */ > - struct cftype cgroup_files[5]; > + struct cftype cgroup_files[9]; Move that enum in this header file and replace numbers with characters to easy both reading and maintaining. > #endif > char name[HSTATE_NAME_LEN]; > }; > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c > index 68c2f2f3c05b7..708103663988a 100644 > --- a/mm/hugetlb_cgroup.c > +++ b/mm/hugetlb_cgroup.c > @@ -25,6 +25,10 @@ struct hugetlb_cgroup { > * the counter to account for hugepages from hugetlb. > */ > struct page_counter hugepage[HUGE_MAX_HSTATE]; > + /* > + * the counter to account for hugepage reservations from hugetlb. > + */ > + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; > }; > > #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) > @@ -33,6 +37,15 @@ struct hugetlb_cgroup { > > static struct hugetlb_cgroup *root_h_cgroup __read_mostly; > > +static inline > +struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx, > + bool reserved) s/get_/hugetlb_cgroup_get_/ to make it not too generic. > +{ > + if (reserved) > + return &h_cg->reserved_hugepage[idx]; > + return &h_cg->hugepage[idx]; > +} > + > static inline > struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s) > { > @@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, > > enum { > RES_USAGE, > + RES_RESERVATION_USAGE, > RES_LIMIT, > + RES_RESERVATION_LIMIT, > RES_MAX_USAGE, > + RES_RESERVATION_MAX_USAGE, > RES_FAILCNT, > + RES_RESERVATION_FAILCNT, > }; > > static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, > struct cftype *cft) > { > struct page_counter *counter; > + struct page_counter *reserved_counter; > struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css); > > counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)]; > + reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)]; > > switch (MEMFILE_ATTR(cft->private)) { > case RES_USAGE: > return (u64)page_counter_read(counter) * PAGE_SIZE; > + case RES_RESERVATION_USAGE: > + return (u64)page_counter_read(reserved_counter) * PAGE_SIZE; > case RES_LIMIT: > return (u64)counter->max * PAGE_SIZE; > + case RES_RESERVATION_LIMIT: > + return (u64)reserved_counter->max * PAGE_SIZE; > case RES_MAX_USAGE: > return (u64)counter->watermark * PAGE_SIZE; > + case RES_RESERVATION_MAX_USAGE: > + return (u64)reserved_counter->watermark * PAGE_SIZE; > case RES_FAILCNT: > return counter->failcnt; > + case RES_RESERVATION_FAILCNT: > + return reserved_counter->failcnt; > default: > BUG(); > } > @@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, > int ret, idx; > unsigned long nr_pages; > struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); > + bool reserved = false; > > if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on 
root */ > return -EINVAL; > @@ -303,10 +331,16 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, > idx = MEMFILE_IDX(of_cft(of)->private); > nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx])); > > + if (MEMFILE_ATTR(of_cft(of)->private) == RES_RESERVATION_LIMIT) { > + reserved = true; > + } > + > switch (MEMFILE_ATTR(of_cft(of)->private)) { > + case RES_RESERVATION_LIMIT: reserved = true; /* fall thru */ > case RES_LIMIT: > mutex_lock(&hugetlb_limit_mutex); > - ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages); > + ret = page_counter_set_max(get_counter(h_cg, idx, reserved), > + nr_pages); > mutex_unlock(&hugetlb_limit_mutex); > break; > default: > @@ -320,18 +354,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, > char *buf, size_t nbytes, loff_t off) > { > int ret = 0; > - struct page_counter *counter; > + struct page_counter *counter, *reserved_counter; > struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); > > counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)]; > + reserved_counter = &h_cg->reserved_hugepage[ > + MEMFILE_IDX(of_cft(of)->private)]; > > switch (MEMFILE_ATTR(of_cft(of)->private)) { > case RES_MAX_USAGE: > page_counter_reset_watermark(counter); > break; > + case RES_RESERVATION_MAX_USAGE: > + page_counter_reset_watermark(reserved_counter); > + break; > case RES_FAILCNT: > counter->failcnt = 0; > break; > + case RES_RESERVATION_FAILCNT: > + reserved_counter->failcnt = 0; > + break; > default: > ret = -EINVAL; > break; > @@ -357,7 +399,7 @@ static void __init __hugetlb_cgroup_file_init(int idx) > struct hstate *h = &hstates[idx]; > > /* format the size */ > - mem_fmt(buf, 32, huge_page_size(h)); > + mem_fmt(buf, sizeof(buf), huge_page_size(h)); > > /* Add the limit file */ > cft = &h->cgroup_files[0]; > @@ -366,28 +408,58 @@ static void __init __hugetlb_cgroup_file_init(int idx) > cft->read_u64 = hugetlb_cgroup_read_u64; > cft->write = hugetlb_cgroup_write; > > - /* Add the usage file */ > + /* Add the reservation limit file */ > cft = &h->cgroup_files[1]; > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes", > + buf); > + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT); > + cft->read_u64 = hugetlb_cgroup_read_u64; > + cft->write = hugetlb_cgroup_write; > + > + /* Add the usage file */ > + cft = &h->cgroup_files[2]; > snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf); > cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); > cft->read_u64 = hugetlb_cgroup_read_u64; > > + /* Add the reservation usage file */ > + cft = &h->cgroup_files[3]; > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes", > + buf); > + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE); > + cft->read_u64 = hugetlb_cgroup_read_u64; > + > /* Add the MAX usage file */ > - cft = &h->cgroup_files[2]; > + cft = &h->cgroup_files[4]; > snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf); > cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE); > cft->write = hugetlb_cgroup_reset; > cft->read_u64 = hugetlb_cgroup_read_u64; > > + /* Add the MAX reservation usage file */ > + cft = &h->cgroup_files[5]; > + snprintf(cft->name, MAX_CFTYPE_NAME, > + "%s.reservation_max_usage_in_bytes", buf); > + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_MAX_USAGE); > + cft->write = hugetlb_cgroup_reset; > + cft->read_u64 = hugetlb_cgroup_read_u64; > + > /* Add the failcntfile */ > - cft = &h->cgroup_files[3]; > + cft = &h->cgroup_files[6]; > snprintf(cft->name, 
MAX_CFTYPE_NAME, "%s.failcnt", buf); > cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); > cft->write = hugetlb_cgroup_reset; > cft->read_u64 = hugetlb_cgroup_read_u64; > > + /* Add the reservation failcntfile */ > + cft = &h->cgroup_files[7]; > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf); > + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); > + cft->write = hugetlb_cgroup_reset; > + cft->read_u64 = hugetlb_cgroup_read_u64; > + > /* NULL terminate the last cft */ > - cft = &h->cgroup_files[4]; > + cft = &h->cgroup_files[8]; > memset(cft, 0, sizeof(*cft)); Replace numbers with characters. > > WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, > -- > 2.23.0.rc1.153.gdeed80330f-goog
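One possible reading of the reviewer's "replace numbers with characters" suggestion, sketched as a standalone userspace mock rather than the actual kernel code: index cgroup_files[] with named enum slots so that adding the reservation files cannot silently reuse or miscount an index. The slot names and the cftype stand-in below are illustrative, not the kernel's definitions.

/* Sketch only: enum-named slots for the per-hstate control files. The count
 * works out to 9 entries (8 files plus the NULL terminator), matching the
 * cgroup_files[9] sizing in the quoted patch.
 */
#include <stdio.h>
#include <string.h>

#define MAX_CFTYPE_NAME 64

struct cftype {			/* stand-in for the kernel's struct cftype */
	char name[MAX_CFTYPE_NAME];
	int priv;
};

enum {				/* one slot per control file, plus terminator */
	HUGETLB_FILE_LIMIT,
	HUGETLB_FILE_RSVD_LIMIT,
	HUGETLB_FILE_USAGE,
	HUGETLB_FILE_RSVD_USAGE,
	HUGETLB_FILE_MAX_USAGE,
	HUGETLB_FILE_RSVD_MAX_USAGE,
	HUGETLB_FILE_FAILCNT,
	HUGETLB_FILE_RSVD_FAILCNT,
	HUGETLB_FILE_NULL,	/* zeroed NULL-terminator slot */
	HUGETLB_FILE_COUNT,	/* array size: 9 */
};

static struct cftype cgroup_files[HUGETLB_FILE_COUNT];

int main(void)
{
	snprintf(cgroup_files[HUGETLB_FILE_RSVD_LIMIT].name, MAX_CFTYPE_NAME,
		 "%s.reservation_limit_in_bytes", "hugetlb.2MB");

	/* Mirrors the kernel's explicit NULL termination of the last cft. */
	memset(&cgroup_files[HUGETLB_FILE_NULL], 0,
	       sizeof(cgroup_files[HUGETLB_FILE_NULL]));

	printf("%s\n", cgroup_files[HUGETLB_FILE_RSVD_LIMIT].name);
	return 0;
}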
On Wed, Aug 14, 2019 at 8:54 PM Hillf Danton <hdanton@sina.com> wrote: > > > On Thu, 8 Aug 2019 16:13:36 -0700 Mina Almasry wrote: > > > > These counters will track hugetlb reservations rather than hugetlb > > memory faulted in. This patch only adds the counter, following patches > > add the charging and uncharging of the counter. > > --- > > !?! > Thanks for reviewing. I'm not sure what you're referring to though. What's wrong here? > > include/linux/hugetlb.h | 2 +- > > mm/hugetlb_cgroup.c | 86 +++++++++++++++++++++++++++++++++++++---- > > 2 files changed, 80 insertions(+), 8 deletions(-) > > > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > index edfca42783192..6777b3013345d 100644 > > --- a/include/linux/hugetlb.h > > +++ b/include/linux/hugetlb.h > > @@ -340,7 +340,7 @@ struct hstate { > > unsigned int surplus_huge_pages_node[MAX_NUMNODES]; > > #ifdef CONFIG_CGROUP_HUGETLB > > /* cgroup control files */ > > - struct cftype cgroup_files[5]; > > + struct cftype cgroup_files[9]; > > Move that enum in this header file and replace numbers with characters > to easy both reading and maintaining. > > #endif > > char name[HSTATE_NAME_LEN]; > > }; > > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c > > index 68c2f2f3c05b7..708103663988a 100644 > > --- a/mm/hugetlb_cgroup.c > > +++ b/mm/hugetlb_cgroup.c > > @@ -25,6 +25,10 @@ struct hugetlb_cgroup { > > * the counter to account for hugepages from hugetlb. > > */ > > struct page_counter hugepage[HUGE_MAX_HSTATE]; > > + /* > > + * the counter to account for hugepage reservations from hugetlb. > > + */ > > + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; > > }; > > > > #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) > > @@ -33,6 +37,15 @@ struct hugetlb_cgroup { > > > > static struct hugetlb_cgroup *root_h_cgroup __read_mostly; > > > > +static inline > > +struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx, > > + bool reserved) > > s/get_/hugetlb_cgroup_get_/ to make it not too generic. 
> > +{ > > + if (reserved) > > + return &h_cg->reserved_hugepage[idx]; > > + return &h_cg->hugepage[idx]; > > +} > > + > > static inline > > struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s) > > { > > @@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, > > > > enum { > > RES_USAGE, > > + RES_RESERVATION_USAGE, > > RES_LIMIT, > > + RES_RESERVATION_LIMIT, > > RES_MAX_USAGE, > > + RES_RESERVATION_MAX_USAGE, > > RES_FAILCNT, > > + RES_RESERVATION_FAILCNT, > > }; > > > > static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, > > struct cftype *cft) > > { > > struct page_counter *counter; > > + struct page_counter *reserved_counter; > > struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css); > > > > counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)]; > > + reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)]; > > > > switch (MEMFILE_ATTR(cft->private)) { > > case RES_USAGE: > > return (u64)page_counter_read(counter) * PAGE_SIZE; > > + case RES_RESERVATION_USAGE: > > + return (u64)page_counter_read(reserved_counter) * PAGE_SIZE; > > case RES_LIMIT: > > return (u64)counter->max * PAGE_SIZE; > > + case RES_RESERVATION_LIMIT: > > + return (u64)reserved_counter->max * PAGE_SIZE; > > case RES_MAX_USAGE: > > return (u64)counter->watermark * PAGE_SIZE; > > + case RES_RESERVATION_MAX_USAGE: > > + return (u64)reserved_counter->watermark * PAGE_SIZE; > > case RES_FAILCNT: > > return counter->failcnt; > > + case RES_RESERVATION_FAILCNT: > > + return reserved_counter->failcnt; > > default: > > BUG(); > > } > > @@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, > > int ret, idx; > > unsigned long nr_pages; > > struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); > > + bool reserved = false; > > > > if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */ > > return -EINVAL; > > @@ -303,10 +331,16 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, > > idx = MEMFILE_IDX(of_cft(of)->private); > > nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx])); > > > > + if (MEMFILE_ATTR(of_cft(of)->private) == RES_RESERVATION_LIMIT) { > > + reserved = true; > > + } > > + > > switch (MEMFILE_ATTR(of_cft(of)->private)) { > > + case RES_RESERVATION_LIMIT: > reserved = true; > /* fall thru */ > > > case RES_LIMIT: > > mutex_lock(&hugetlb_limit_mutex); > > - ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages); > > + ret = page_counter_set_max(get_counter(h_cg, idx, reserved), > > + nr_pages); > > mutex_unlock(&hugetlb_limit_mutex); > > break; > > default: > > @@ -320,18 +354,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, > > char *buf, size_t nbytes, loff_t off) > > { > > int ret = 0; > > - struct page_counter *counter; > > + struct page_counter *counter, *reserved_counter; > > struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); > > > > counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)]; > > + reserved_counter = &h_cg->reserved_hugepage[ > > + MEMFILE_IDX(of_cft(of)->private)]; > > > > switch (MEMFILE_ATTR(of_cft(of)->private)) { > > case RES_MAX_USAGE: > > page_counter_reset_watermark(counter); > > break; > > + case RES_RESERVATION_MAX_USAGE: > > + page_counter_reset_watermark(reserved_counter); > > + break; > > case RES_FAILCNT: > > counter->failcnt = 0; > > break; > > + case RES_RESERVATION_FAILCNT: > > + reserved_counter->failcnt = 0; > > + break; > > default: > > ret = -EINVAL; > 
> break; > > @@ -357,7 +399,7 @@ static void __init __hugetlb_cgroup_file_init(int idx) > > struct hstate *h = &hstates[idx]; > > > > /* format the size */ > > - mem_fmt(buf, 32, huge_page_size(h)); > > + mem_fmt(buf, sizeof(buf), huge_page_size(h)); > > > > /* Add the limit file */ > > cft = &h->cgroup_files[0]; > > @@ -366,28 +408,58 @@ static void __init __hugetlb_cgroup_file_init(int idx) > > cft->read_u64 = hugetlb_cgroup_read_u64; > > cft->write = hugetlb_cgroup_write; > > > > - /* Add the usage file */ > > + /* Add the reservation limit file */ > > cft = &h->cgroup_files[1]; > > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes", > > + buf); > > + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT); > > + cft->read_u64 = hugetlb_cgroup_read_u64; > > + cft->write = hugetlb_cgroup_write; > > + > > + /* Add the usage file */ > > + cft = &h->cgroup_files[2]; > > snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf); > > cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); > > cft->read_u64 = hugetlb_cgroup_read_u64; > > > > + /* Add the reservation usage file */ > > + cft = &h->cgroup_files[3]; > > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes", > > + buf); > > + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE); > > + cft->read_u64 = hugetlb_cgroup_read_u64; > > + > > /* Add the MAX usage file */ > > - cft = &h->cgroup_files[2]; > > + cft = &h->cgroup_files[4]; > > snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf); > > cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE); > > cft->write = hugetlb_cgroup_reset; > > cft->read_u64 = hugetlb_cgroup_read_u64; > > > > + /* Add the MAX reservation usage file */ > > + cft = &h->cgroup_files[5]; > > + snprintf(cft->name, MAX_CFTYPE_NAME, > > + "%s.reservation_max_usage_in_bytes", buf); > > + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_MAX_USAGE); > > + cft->write = hugetlb_cgroup_reset; > > + cft->read_u64 = hugetlb_cgroup_read_u64; > > + > > /* Add the failcntfile */ > > - cft = &h->cgroup_files[3]; > > + cft = &h->cgroup_files[6]; > > snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf); > > cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); > > cft->write = hugetlb_cgroup_reset; > > cft->read_u64 = hugetlb_cgroup_read_u64; > > > > + /* Add the reservation failcntfile */ > > + cft = &h->cgroup_files[7]; > > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf); > > + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); > > + cft->write = hugetlb_cgroup_reset; > > + cft->read_u64 = hugetlb_cgroup_read_u64; > > + > > /* NULL terminate the last cft */ > > - cft = &h->cgroup_files[4]; > > + cft = &h->cgroup_files[8]; > > memset(cft, 0, sizeof(*cft)); > > Replace numbers with characters. > > > > WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, > > -- > > 2.23.0.rc1.153.gdeed80330f-goog >
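For reference, a small standalone sketch of how the MEMFILE_* packing used throughout the quoted patch behaves. MEMFILE_PRIVATE() is copied from the patch; the MEMFILE_IDX()/MEMFILE_ATTR() decoders are written here as assumptions mirroring how the existing hugetlb_cgroup code unpacks cft->private.

/* Sketch only: cft->private packs the hstate index into the high 16 bits and
 * the RES_* attribute into the low 16 bits; read_u64/write/reset handlers
 * then recover both halves from the single value.
 */
#include <assert.h>
#include <stdio.h>

#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))	/* from the patch */
#define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)	/* assumed decoder */
#define MEMFILE_ATTR(val)	((val) & 0xffff)		/* assumed decoder */

enum {
	RES_USAGE,
	RES_RESERVATION_USAGE,
	RES_LIMIT,
	RES_RESERVATION_LIMIT,
	RES_MAX_USAGE,
	RES_RESERVATION_MAX_USAGE,
	RES_FAILCNT,
	RES_RESERVATION_FAILCNT,
};

int main(void)
{
	int idx = 1;	/* e.g. the second hstate (1GB pages on x86) */
	int priv = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT);

	assert(MEMFILE_IDX(priv) == idx);
	assert(MEMFILE_ATTR(priv) == RES_RESERVATION_LIMIT);
	printf("private=0x%x idx=%d attr=%d\n", priv,
	       MEMFILE_IDX(priv), MEMFILE_ATTR(priv));
	return 0;
}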
Problem:

Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate more hugetlb memory than its hugetlb_cgroup limit allows (but within what is actually available), the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error at mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.

Proposed Solution:

A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory, reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows, the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.

This proposal is implemented in this patch series, with tests to verify functionality and show the usage.

Alternatives considered:

1. A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb-related page counters under hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better, as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via reservation_limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and reservation_limit_in_bytes concurrently, and this approach gives them the option to do so.

Caveats:

1. This support is implemented for cgroups-v1. I have not tried hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet. This is largely because we use cgroups-v1 for now. If required, I can add hugetlb_cgroup support to cgroups v2 in this patch or a follow-up.

2. The most complicated bit of this patch, I believe, is: where to store the pointer to the hugetlb_cgroup to uncharge at unreservation time? Normally the cgroup pointers hang off the struct page. But, with hugetlb_cgroup reservations, one task can reserve a specific page and another task may fault it in (I believe), so storing the pointer in struct page is not appropriate. The proposed approach here is to store the pointer in the resv_map. See patches for details.

Signed-off-by: Mina Almasry <almasrymina@google.com>

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Changes in v2:
- Split the patch into a 5-patch series.
- Fixed patch subject.
Mina Almasry (5):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests

 include/linux/hugetlb.h                        |  10 +-
 include/linux/hugetlb_cgroup.h                 |  19 +-
 mm/hugetlb.c                                   | 256 ++++++++--
 mm/hugetlb_cgroup.c                            | 153 +++++-
 tools/testing/selftests/vm/.gitignore          |   1 +
 tools/testing/selftests/vm/Makefile            |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh    | 438 ++++++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh       |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c  | 252 ++++++++++
 9 files changed, 1087 insertions(+), 68 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.rc1.153.gdeed80330f-goog
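A sketch of the userspace-visible behavior the series is after, assuming the calling task has already been placed in a hugetlb cgroup whose limit is smaller than SZ: with only today's fault-based limit the mmap below succeeds and the task is killed with SIGBUS at fault time, whereas with the proposed reservation limit the mmap itself fails.

/* Sketch only: over-allocate hugetlb memory relative to an (assumed) cgroup
 * limit and see which of the two failure modes occurs.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ (8UL * 2 * 1024 * 1024)	/* more than the cgroup limit allows */

int main(void)
{
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		/* Expected outcome with the proposed reservation limit. */
		printf("mmap failed up front: %s\n", strerror(errno));
		return 0;
	}

	/* With only the existing fault-time limit we get here, and this write
	 * is where the task would be killed with SIGBUS instead. */
	memset(p, 0, SZ);
	printf("faulted %lu bytes without hitting a limit\n", SZ);
	munmap(p, SZ);
	return 0;
}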