Message ID | 20250407124706.2688092-1-tujinjiang@huawei.com (mailing list archive) |
---|---|
State | New |
Series | [v3] mm/hugetlb: fix set_max_huge_pages() when there are surplus pages |
On Mon, Apr 07, 2025 at 08:47:06PM +0800, Jinjiang Tu wrote:
> In set_max_huge_pages(), min_count should mean the acquired persistent
> huge pages, but it contains surplus huge pages. This leads to failing
> to free free huge pages for a node.
>
> Steps to reproduce:
> 1) create 5 hugetlb folios in Node0
> 2) run a program to use all the hugetlb folios
> 3) echo 0 > nr_hugepages for Node0 to free the hugetlb folios. Thus the 5
>    hugetlb folios in Node0 are accounted as surplus.
> 4) create 5 hugetlb folios in Node1
> 5) echo 0 > nr_hugepages for Node1 to free the hugetlb folios
>
> The result:
>        Node0    Node1
> Total    5        5
> Free     0        5
> Surp     5        5

I would put this after the explanation, as otherwise it is a bit hard to
follow.

> We couldn't subtract surplus_huge_pages from min_count, since free hugetlb
> folios may be surplus due to HVO. In __update_and_free_hugetlb_folio(),
> hugetlb_vmemmap_restore_folio() may fail, and we then add the folio back to
> the pool and treat it as surplus. If we directly subtract surplus_huge_pages
> from min_count, some free folios will be subtracted twice.
>
> To fix it, check if count is less than the number of free huge pages that
> could be destroyed (i.e., available_huge_pages(h)), and remove hugetlb
> folios if so.

But this is not true, you are no longer comparing against
available_huge_pages(h) as you did in v2.

I would go with something along these lines as the changelog:

"In set_max_huge_pages(), min_count is computed taking into account also
surplus huge pages, which might in some cases lead to not being able to free
huge pages and ending up accounting them as surplus instead.

One way to solve it is to subtract surplus_huge_pages directly, but we
cannot do it blindly because there might be surplus pages that are also
free pages, which can happen when we fail to restore the vmemmap for
HVO-optimized pages. So we could be subtracting the same page twice.

In order to work this around, let us first compute the number of free
persistent pages, and use that along with the surplus pages to compute
min_count."

And then put the PoC.

> Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes")
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>

Acked-by: Oscar Salvador <osalvador@suse.de>
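To make the "subtracting the same page twice" point above concrete, here is a
minimal stand-alone user-space sketch (illustration only; the counter values
are made up and nothing here is kernel code). It assumes a pool with 2
reserved pages, 4 free persistent pages, and 2 surplus pages that are also
free because their vmemmap could not be restored. Blindly subtracting
surplus_huge_pages on top of free_huge_pages removes each free surplus page
twice, while counting the persistent free pages first does not:

#include <stdio.h>

int main(void)
{
	/* Hypothetical counters: the 2 surplus pages are still free because
	 * hugetlb_vmemmap_restore_folio() failed when they were released. */
	unsigned long resv_huge_pages    = 2;
	unsigned long nr_huge_pages      = 6;	/* 4 persistent + 2 surplus */
	unsigned long surplus_huge_pages = 2;
	unsigned long free_huge_pages    = 6;	/* 4 persistent free + 2 surplus free */
	unsigned long persistent         = nr_huge_pages - surplus_huge_pages;

	/* Blind subtraction: each free surplus page is subtracted twice,
	 * once inside free_huge_pages and once via surplus_huge_pages. */
	unsigned long blind = resv_huge_pages + nr_huge_pages
			      - free_huge_pages - surplus_huge_pages;

	/* Suggested approach: compute the persistent free pages first. */
	unsigned long persistent_free = free_huge_pages;
	if (free_huge_pages > persistent)
		persistent_free = free_huge_pages > surplus_huge_pages ?
				  free_huge_pages - surplus_huge_pages : 0;
	unsigned long fixed = resv_huge_pages + persistent - persistent_free;

	printf("blind min_count = %lu\n", blind);	/* 0, below the 2 reserved pages */
	printf("fixed min_count = %lu\n", fixed);	/* 2, the reserved pages are kept */
	return 0;
}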
On 2025/4/8 21:10, Oscar Salvador wrote:
> On Mon, Apr 07, 2025 at 08:47:06PM +0800, Jinjiang Tu wrote:
> [...]
> I would go with something along these lines as the changelog:
>
> "In set_max_huge_pages(), min_count is computed taking into account also
> surplus huge pages, which might in some cases lead to not being able to free
> huge pages and ending up accounting them as surplus instead.
>
> One way to solve it is to subtract surplus_huge_pages directly, but we
> cannot do it blindly because there might be surplus pages that are also
> free pages, which can happen when we fail to restore the vmemmap for
> HVO-optimized pages. So we could be subtracting the same page twice.
>
> In order to work this around, let us first compute the number of free
> persistent pages, and use that along with the surplus pages to compute
> min_count."
>
> And then put the PoC.

Thanks very much, I will update the changelog and send a new version later.

>> Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes")
>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> Acked-by: Oscar Salvador <osalvador@suse.de>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 39f92aad7bd1..e4aed3557339 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3825,6 +3825,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
+	unsigned long persistent_free_count;
 	unsigned long min_count;
 	unsigned long allocated;
 	struct folio *folio;
@@ -3959,8 +3960,24 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * though, we'll note that we're not allowed to exceed surplus
 	 * and won't grow the pool anywhere else. Not until one of the
 	 * sysctls are changed, or the surplus pages go out of use.
+	 *
+	 * min_count is the expected number of persistent pages, we
+	 * shouldn't calculate min_count by using
+	 * resv_huge_pages + persistent_huge_pages() - free_huge_pages,
+	 * because there may exist free surplus huge pages, and this will
+	 * lead to subtracting twice. Free surplus huge pages come from HVO
+	 * failing to restore vmemmap, see comments in the callers of
+	 * hugetlb_vmemmap_restore_folio(). Thus, we should calculate
+	 * persistent free count first.
 	 */
-	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
+	persistent_free_count = h->free_huge_pages;
+	if (h->free_huge_pages > persistent_huge_pages(h)) {
+		if (h->free_huge_pages > h->surplus_huge_pages)
+			persistent_free_count -= h->surplus_huge_pages;
+		else
+			persistent_free_count = 0;
+	}
+	min_count = h->resv_huge_pages + persistent_huge_pages(h) - persistent_free_count;
 	min_count = max(count, min_count);
 	try_to_free_low(h, min_count, nodes_allowed);
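As a rough sanity check of the hunk above, the following stand-alone
user-space sketch evaluates the removed and the added min_count expressions
with the global counter values implied by the reproducer right before step 5
is run for Node1 (5 in-use surplus folios on Node0, 5 free persistent folios
on Node1). The numbers come from the changelog, not from a real kernel run:

#include <stdio.h>

int main(void)
{
	/* Global counters implied by the reproducer right before
	 * "echo 0 > nr_hugepages" is issued for Node1. */
	unsigned long resv_huge_pages    = 0;
	unsigned long nr_huge_pages      = 10;	/* 5 surplus (Node0) + 5 persistent (Node1) */
	unsigned long free_huge_pages    = 5;	/* Node1's unused folios */
	unsigned long surplus_huge_pages = 5;	/* Node0's in-use folios */
	unsigned long persistent         = nr_huge_pages - surplus_huge_pages;

	/* Removed line: surplus pages inflate min_count. */
	unsigned long old_min = resv_huge_pages + nr_huge_pages - free_huge_pages;

	/* Added code: only persistent free pages are subtracted. */
	unsigned long persistent_free = free_huge_pages;
	if (free_huge_pages > persistent)
		persistent_free = free_huge_pages > surplus_huge_pages ?
				  free_huge_pages - surplus_huge_pages : 0;
	unsigned long new_min = resv_huge_pages + persistent - persistent_free;

	printf("old min_count = %lu\n", old_min);	/* 5 */
	printf("new min_count = %lu\n", new_min);	/* 0 */
	return 0;
}

With the old formula min_count stays at 5, so none of Node1's free folios may
be released; with the new one it drops to 0, which is consistent with the
before/after result tables in the changelog below.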
In set_max_huge_pages(), min_count should mean the acquired persistent
huge pages, but it contains surplus huge pages. This leads to failing
to free free huge pages for a node.

Steps to reproduce:
1) create 5 hugetlb folios in Node0
2) run a program to use all the hugetlb folios
3) echo 0 > nr_hugepages for Node0 to free the hugetlb folios. Thus the 5
   hugetlb folios in Node0 are accounted as surplus.
4) create 5 hugetlb folios in Node1
5) echo 0 > nr_hugepages for Node1 to free the hugetlb folios

The result:
       Node0    Node1
Total    5        5
Free     0        5
Surp     5        5

We couldn't subtract surplus_huge_pages from min_count, since free hugetlb
folios may be surplus due to HVO. In __update_and_free_hugetlb_folio(),
hugetlb_vmemmap_restore_folio() may fail, and we then add the folio back to
the pool and treat it as surplus. If we directly subtract surplus_huge_pages
from min_count, some free folios will be subtracted twice.

To fix it, check if count is less than the number of free huge pages that
could be destroyed (i.e., available_huge_pages(h)), and remove hugetlb
folios if so. Since there may exist free surplus hugetlb folios, we should
remove surplus folios first to keep the surplus count correct.

The result with this patch:
       Node0    Node1
Total    5        0
Free     0        0
Surp     5        0

Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
Changelog since v2:
 * Fix this issue by calculating the free surplus count, and add comments,
   as suggested by Oscar Salvador

 mm/hugetlb.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)