Message ID | 20210511151016.2310627-2-nao.horiguchi@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | hwpoison: fix race with compound page allocation | expand |
On Wed, May 12, 2021 at 12:10:15AM +0900, Naoya Horiguchi wrote: > @@ -1095,30 +1095,43 @@ static int __get_hwpoison_page(struct page *page) > { > struct page *head = compound_head(page); > > - if (!PageHuge(head) && PageTransHuge(head)) { > - /* > - * Non anonymous thp exists only in allocation/free time. We > - * can't handle such a case correctly, so let's give it up. > - * This should be better than triggering BUG_ON when kernel > - * tries to touch the "partially handled" page. > - */ > - if (!PageAnon(head)) { > - pr_err("Memory failure: %#lx: non anonymous thp\n", > - page_to_pfn(page)); > - return 0; > + if (PageCompound(page)) { > + if (PageSlab(page)) { > + return get_page_unless_zero(page); > + } else if (PageHuge(head)) { > + int ret = 0; > + > + spin_lock(&hugetlb_lock); > + if (!PageHuge(head)) > + ret = -EBUSY; > + else if (HPageFreed(head) || HPageMigratable(head)) > + ret = get_page_unless_zero(head); > + spin_unlock(&hugetlb_lock); > + return ret; Uhm, I am having a hard time with that -EBUSY. At this stage, we expect __get_hwpoison_page() to either return true or false, depending on whether it could grab a page's refcount or not. Returning -EBUSY here seems wrong (plus it is inconsistent with the comment above the function). It might be useful for the latter patch, I do not know as I yet have to check that one, but if anything, let us stay consistent here in this one. So, if hugetlb vanished under us, let us return "we could not grab the refcount". Does it make sense?
On Wed 12-05-21 00:10:15, Naoya Horiguchi wrote: > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > When hugetlb page fault (under overcommiting situation) and > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race: > > CPU0: CPU1: > > gather_surplus_pages() > page = alloc_surplus_huge_page() > memory_failure_hugetlb() > get_hwpoison_page(page) > __get_hwpoison_page(page) > get_page_unless_zero(page) > zero = put_page_testzero(page) > VM_BUG_ON_PAGE(!zero, page) > enqueue_huge_page(h, page) > put_page(page) > > __get_hwpoison_page() only checks page refcount before taking additional > one for memory error handling, which is wrong because there's time > windows where compound pages have non-zero refcount during initialization. > > So makes __get_hwpoison_page() check page status a bit more for a few > types of compound pages. PageSlab() check is added because otherwise > "non anonymous thp" path is wrongly chosen. This should really describe the fix in more details. E.g. [...] > @@ -1095,30 +1095,43 @@ static int __get_hwpoison_page(struct page *page) > { > struct page *head = compound_head(page); > > - if (!PageHuge(head) && PageTransHuge(head)) { > - /* > - * Non anonymous thp exists only in allocation/free time. We > - * can't handle such a case correctly, so let's give it up. > - * This should be better than triggering BUG_ON when kernel > - * tries to touch the "partially handled" page. > - */ > - if (!PageAnon(head)) { > - pr_err("Memory failure: %#lx: non anonymous thp\n", > - page_to_pfn(page)); > - return 0; > + if (PageCompound(page)) { So you do rely on PageCompound to be true. Which is prone to races as well. All you need is to check before prep_compound_page and run into get_page_unless_zero (down in this function) before hugetlb reaches put_page_testzero. Or do I miss something that would prevent from that?
On Wed, May 12, 2021 at 02:19:32PM +0200, Michal Hocko wrote: > On Wed 12-05-21 00:10:15, Naoya Horiguchi wrote: > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > When hugetlb page fault (under overcommiting situation) and > > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race: > > > > CPU0: CPU1: > > > > gather_surplus_pages() > > page = alloc_surplus_huge_page() > > memory_failure_hugetlb() > > get_hwpoison_page(page) > > __get_hwpoison_page(page) > > get_page_unless_zero(page) > > zero = put_page_testzero(page) > > VM_BUG_ON_PAGE(!zero, page) > > enqueue_huge_page(h, page) > > put_page(page) > > > > __get_hwpoison_page() only checks page refcount before taking additional > > one for memory error handling, which is wrong because there's time > > windows where compound pages have non-zero refcount during initialization. > > > > So makes __get_hwpoison_page() check page status a bit more for a few > > types of compound pages. PageSlab() check is added because otherwise > > "non anonymous thp" path is wrongly chosen. > > This should really describe the fix in more details. E.g. > [...] > > @@ -1095,30 +1095,43 @@ static int __get_hwpoison_page(struct page *page) > > { > > struct page *head = compound_head(page); > > > > - if (!PageHuge(head) && PageTransHuge(head)) { > > - /* > > - * Non anonymous thp exists only in allocation/free time. We > > - * can't handle such a case correctly, so let's give it up. > > - * This should be better than triggering BUG_ON when kernel > > - * tries to touch the "partially handled" page. > > - */ > > - if (!PageAnon(head)) { > > - pr_err("Memory failure: %#lx: non anonymous thp\n", > > - page_to_pfn(page)); > > - return 0; > > + if (PageCompound(page)) { > > So you do rely on PageCompound to be true. Which is prone to races as > well. All you need is to check before prep_compound_page and run into > get_page_unless_zero (down in this function) before hugetlb reaches > put_page_testzero. Or do I miss something that would prevent from that? No, you're right, the race can still happen (I only considered the case when a compound page is already allocated, that was insufficient...). Any check outside locking seems harmful, so does the checking like below make more sense? static int __get_hwpoison_page(struct page *page) { struct page *head = compound_head(page); int ret = 0; #ifdef CONFIG_HUGETLB_PAGE // this #ifdef is needed for hugetlb_lock spin_lock(&hugetlb_lock); if (PageHuge(head) && (HPageFreed(head) || HPageMigratable(head))) ret = get_page_unless_zero(head); spin_unlock(&hugetlb_lock); if (ret > 0) return ret; #endif // other types of compound page. ... return get_page_unless_zero(page); } - Naoya
On Wed, May 12, 2021 at 10:33:24AM +0200, Oscar Salvador wrote: > On Wed, May 12, 2021 at 12:10:15AM +0900, Naoya Horiguchi wrote: > > @@ -1095,30 +1095,43 @@ static int __get_hwpoison_page(struct page *page) > > { > > struct page *head = compound_head(page); > > > > - if (!PageHuge(head) && PageTransHuge(head)) { > > - /* > > - * Non anonymous thp exists only in allocation/free time. We > > - * can't handle such a case correctly, so let's give it up. > > - * This should be better than triggering BUG_ON when kernel > > - * tries to touch the "partially handled" page. > > - */ > > - if (!PageAnon(head)) { > > - pr_err("Memory failure: %#lx: non anonymous thp\n", > > - page_to_pfn(page)); > > - return 0; > > + if (PageCompound(page)) { > > + if (PageSlab(page)) { > > + return get_page_unless_zero(page); > > + } else if (PageHuge(head)) { > > + int ret = 0; > > + > > + spin_lock(&hugetlb_lock); > > + if (!PageHuge(head)) > > + ret = -EBUSY; > > + else if (HPageFreed(head) || HPageMigratable(head)) > > + ret = get_page_unless_zero(head); > > + spin_unlock(&hugetlb_lock); > > + return ret; > > Uhm, I am having a hard time with that -EBUSY. > At this stage, we expect __get_hwpoison_page() to either return true or false, > depending on whether it could grab a page's refcount or not. Returning -EBUSY > here seems wrong (plus it is inconsistent with the comment above the function). > It might be useful for the latter patch, I do not know as I yet have to check > that one, but if anything, let us stay consistent here in this one. > So, if hugetlb vanished under us, let us return "we could not grab the > refcount". Does it make sense? Yes, you are totally right. I failed to properly split the patch. -EBUSY is non-zero, so it's considererd as "successfully pinned", which is not true. I should've set ret to 0. - Naoya
diff --git v5.12/mm/memory-failure.c v5.12_patched/mm/memory-failure.c index a3659619d293..02668b24e512 100644 --- v5.12/mm/memory-failure.c +++ v5.12_patched/mm/memory-failure.c @@ -1095,30 +1095,43 @@ static int __get_hwpoison_page(struct page *page) { struct page *head = compound_head(page); - if (!PageHuge(head) && PageTransHuge(head)) { - /* - * Non anonymous thp exists only in allocation/free time. We - * can't handle such a case correctly, so let's give it up. - * This should be better than triggering BUG_ON when kernel - * tries to touch the "partially handled" page. - */ - if (!PageAnon(head)) { - pr_err("Memory failure: %#lx: non anonymous thp\n", - page_to_pfn(page)); - return 0; + if (PageCompound(page)) { + if (PageSlab(page)) { + return get_page_unless_zero(page); + } else if (PageHuge(head)) { + int ret = 0; + + spin_lock(&hugetlb_lock); + if (!PageHuge(head)) + ret = -EBUSY; + else if (HPageFreed(head) || HPageMigratable(head)) + ret = get_page_unless_zero(head); + spin_unlock(&hugetlb_lock); + return ret; + } else if (PageTransHuge(head)) { + /* + * Non anonymous thp exists only in allocation/free time. We + * can't handle such a case correctly, so let's give it up. + * This should be better than triggering BUG_ON when kernel + * tries to touch the "partially handled" page. + */ + if (!PageAnon(head)) { + pr_err("Memory failure: %#lx: non anonymous thp\n", + page_to_pfn(page)); + return 0; + } + if (get_page_unless_zero(head)) { + if (head == compound_head(page)) + return 1; + pr_info("Memory failure: %#lx cannot catch tail\n", + page_to_pfn(page)); + put_page(head); + } } + return 0; } - if (get_page_unless_zero(head)) { - if (head == compound_head(page)) - return 1; - - pr_info("Memory failure: %#lx cannot catch tail\n", - page_to_pfn(page)); - put_page(head); - } - - return 0; + return get_page_unless_zero(page); } /*