Message ID | 20200109143054.13203-1-richardw.yang@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] mm: thp: grab the lock before manipulation defer list | expand |
On Thu, 9 Jan 2020, Wei Yang wrote: > As all the other places, we grab the lock before manipulate the defer list. > Current implementation may face a race condition. > > For example, the potential race would be: > > CPU1 CPU2 > mem_cgroup_move_account split_huge_page_to_list > !list_empty > lock > !list_empty > list_del > unlock > lock > # !list_empty might not hold anymore > list_del_init > unlock > > When this sequence happens, the list_del_init() in > mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the > page is already been removed by list_del in split_huge_page_to_list(). > > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") > > Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> > Acked-by: David Rientjes <rientjes@google.com> Thanks Wei! Andrew, I'd also suggest: Cc: stable@vger.kernel.org # 5.4+
On Thu 09-01-20 22:30:54, Wei Yang wrote: > As all the other places, we grab the lock before manipulate the defer list. > Current implementation may face a race condition. > > For example, the potential race would be: > > CPU1 CPU2 > mem_cgroup_move_account split_huge_page_to_list > !list_empty > lock > !list_empty > list_del > unlock > lock > # !list_empty might not hold anymore > list_del_init > unlock > > When this sequence happens, the list_del_init() in > mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the > page is already been removed by list_del in split_huge_page_to_list(). > > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") > > Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> > Acked-by: David Rientjes <rientjes@google.com> Thanks a lot for the changelog approvements! Acked-by: Michal Hocko <mhocko@suse.com> > > --- > v2: > * move check on compound outside suggested by Alexander > * an example of the race condition, suggested by Michal > --- > mm/memcontrol.c | 18 +++++++++++------- > 1 file changed, 11 insertions(+), 7 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index bc01423277c5..1492eefe4f3c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5368,10 +5368,12 @@ static int mem_cgroup_move_account(struct page *page, > } > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > - if (compound && !list_empty(page_deferred_list(page))) { > + if (compound) { > spin_lock(&from->deferred_split_queue.split_queue_lock); > - list_del_init(page_deferred_list(page)); > - from->deferred_split_queue.split_queue_len--; > + if (!list_empty(page_deferred_list(page))) { > + list_del_init(page_deferred_list(page)); > + from->deferred_split_queue.split_queue_len--; > + } > spin_unlock(&from->deferred_split_queue.split_queue_lock); > } > #endif > @@ -5385,11 +5387,13 @@ static int mem_cgroup_move_account(struct page *page, > page->mem_cgroup = to; > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > - if (compound && list_empty(page_deferred_list(page))) { > + if (compound) { > spin_lock(&to->deferred_split_queue.split_queue_lock); > - list_add_tail(page_deferred_list(page), > - &to->deferred_split_queue.split_queue); > - to->deferred_split_queue.split_queue_len++; > + if (list_empty(page_deferred_list(page))) { > + list_add_tail(page_deferred_list(page), > + &to->deferred_split_queue.split_queue); > + to->deferred_split_queue.split_queue_len++; > + } > spin_unlock(&to->deferred_split_queue.split_queue_lock); > } > #endif > -- > 2.17.1
On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: > As all the other places, we grab the lock before manipulate the defer list. > Current implementation may face a race condition. > > For example, the potential race would be: > > CPU1 CPU2 > mem_cgroup_move_account split_huge_page_to_list > !list_empty > lock > !list_empty > list_del > unlock > lock > # !list_empty might not hold anymore > list_del_init > unlock I don't think this particular race is possible. Both parties take page lock before messing with deferred queue, but anytway: Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > When this sequence happens, the list_del_init() in > mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the > page is already been removed by list_del in split_huge_page_to_list(). > > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") > > Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> > Acked-by: David Rientjes <rientjes@google.com> > > --- > v2: > * move check on compound outside suggested by Alexander > * an example of the race condition, suggested by Michal > --- > mm/memcontrol.c | 18 +++++++++++------- > 1 file changed, 11 insertions(+), 7 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index bc01423277c5..1492eefe4f3c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5368,10 +5368,12 @@ static int mem_cgroup_move_account(struct page *page, > } > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > - if (compound && !list_empty(page_deferred_list(page))) { > + if (compound) { > spin_lock(&from->deferred_split_queue.split_queue_lock); > - list_del_init(page_deferred_list(page)); > - from->deferred_split_queue.split_queue_len--; > + if (!list_empty(page_deferred_list(page))) { > + list_del_init(page_deferred_list(page)); > + from->deferred_split_queue.split_queue_len--; > + } > spin_unlock(&from->deferred_split_queue.split_queue_lock); > } > #endif > @@ -5385,11 +5387,13 @@ static int mem_cgroup_move_account(struct page *page, > page->mem_cgroup = to; > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > - if (compound && list_empty(page_deferred_list(page))) { > + if (compound) { > spin_lock(&to->deferred_split_queue.split_queue_lock); > - list_add_tail(page_deferred_list(page), > - &to->deferred_split_queue.split_queue); > - to->deferred_split_queue.split_queue_len++; > + if (list_empty(page_deferred_list(page))) { > + list_add_tail(page_deferred_list(page), > + &to->deferred_split_queue.split_queue); > + to->deferred_split_queue.split_queue_len++; > + } > spin_unlock(&to->deferred_split_queue.split_queue_lock); > } > #endif > -- > 2.17.1 > >
On Sat, Jan 11, 2020 at 03:03:52AM +0300, Kirill A. Shutemov wrote: >On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: >> As all the other places, we grab the lock before manipulate the defer list. >> Current implementation may face a race condition. >> >> For example, the potential race would be: >> >> CPU1 CPU2 >> mem_cgroup_move_account split_huge_page_to_list >> !list_empty >> lock >> !list_empty >> list_del >> unlock >> lock >> # !list_empty might not hold anymore >> list_del_init >> unlock > >I don't think this particular race is possible. Both parties take page >lock before messing with deferred queue, but anytway: I am afraid not. Page lock is per page, while defer queue is per pgdate or memcg. It is possible two page in the same pgdate or memcg grab page lock respectively and then access the same defer queue concurrently. > >Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > >> >> When this sequence happens, the list_del_init() in >> mem_cgroup_move_account() would crash if CONFIG_DEBUG_LIST since the >> page is already been removed by list_del in split_huge_page_to_list(). >> >> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") >> >> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> >> Acked-by: David Rientjes <rientjes@google.com> >> >> --- >> v2: >> * move check on compound outside suggested by Alexander >> * an example of the race condition, suggested by Michal >> --- >> mm/memcontrol.c | 18 +++++++++++------- >> 1 file changed, 11 insertions(+), 7 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index bc01423277c5..1492eefe4f3c 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -5368,10 +5368,12 @@ static int mem_cgroup_move_account(struct page *page, >> } >> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> - if (compound && !list_empty(page_deferred_list(page))) { >> + if (compound) { >> spin_lock(&from->deferred_split_queue.split_queue_lock); >> - list_del_init(page_deferred_list(page)); >> - from->deferred_split_queue.split_queue_len--; >> + if (!list_empty(page_deferred_list(page))) { >> + list_del_init(page_deferred_list(page)); >> + from->deferred_split_queue.split_queue_len--; >> + } >> spin_unlock(&from->deferred_split_queue.split_queue_lock); >> } >> #endif >> @@ -5385,11 +5387,13 @@ static int mem_cgroup_move_account(struct page *page, >> page->mem_cgroup = to; >> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> - if (compound && list_empty(page_deferred_list(page))) { >> + if (compound) { >> spin_lock(&to->deferred_split_queue.split_queue_lock); >> - list_add_tail(page_deferred_list(page), >> - &to->deferred_split_queue.split_queue); >> - to->deferred_split_queue.split_queue_len++; >> + if (list_empty(page_deferred_list(page))) { >> + list_add_tail(page_deferred_list(page), >> + &to->deferred_split_queue.split_queue); >> + to->deferred_split_queue.split_queue_len++; >> + } >> spin_unlock(&to->deferred_split_queue.split_queue_lock); >> } >> #endif >> -- >> 2.17.1 >> >> > >-- > Kirill A. Shutemov
On Sun, Jan 12, 2020 at 10:28:58AM +0800, Wei Yang wrote: > On Sat, Jan 11, 2020 at 03:03:52AM +0300, Kirill A. Shutemov wrote: > >On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: > >> As all the other places, we grab the lock before manipulate the defer list. > >> Current implementation may face a race condition. > >> > >> For example, the potential race would be: > >> > >> CPU1 CPU2 > >> mem_cgroup_move_account split_huge_page_to_list > >> !list_empty > >> lock > >> !list_empty > >> list_del > >> unlock > >> lock > >> # !list_empty might not hold anymore > >> list_del_init > >> unlock > > > >I don't think this particular race is possible. Both parties take page > >lock before messing with deferred queue, but anytway: > > I am afraid not. Page lock is per page, while defer queue is per pgdate or > memcg. > > It is possible two page in the same pgdate or memcg grab page lock > respectively and then access the same defer queue concurrently. Look closer on the list_empty() argument. It's list_head local to the page. Too different pages can be handled in parallel without any problem in this particular scenario. As long as we as we modify it under the lock. Said that, page lock here was somewhat accidential and I still belive we need to move the check under the lock anyway.
On Mon, Jan 13, 2020 at 01:57:18AM +0300, Kirill A. Shutemov wrote: >On Sun, Jan 12, 2020 at 10:28:58AM +0800, Wei Yang wrote: >> On Sat, Jan 11, 2020 at 03:03:52AM +0300, Kirill A. Shutemov wrote: >> >On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: >> >> As all the other places, we grab the lock before manipulate the defer list. >> >> Current implementation may face a race condition. >> >> >> >> For example, the potential race would be: >> >> >> >> CPU1 CPU2 >> >> mem_cgroup_move_account split_huge_page_to_list >> >> !list_empty >> >> lock >> >> !list_empty >> >> list_del >> >> unlock >> >> lock >> >> # !list_empty might not hold anymore >> >> list_del_init >> >> unlock >> > >> >I don't think this particular race is possible. Both parties take page >> >lock before messing with deferred queue, but anytway: >> >> I am afraid not. Page lock is per page, while defer queue is per pgdate or >> memcg. >> >> It is possible two page in the same pgdate or memcg grab page lock >> respectively and then access the same defer queue concurrently. > >Look closer on the list_empty() argument. It's list_head local to the >page. Too different pages can be handled in parallel without any problem >in this particular scenario. As long as we as we modify it under the lock. > >Said that, page lock here was somewhat accidential and I still belive we >need to move the check under the lock anyway. > If my understanding is correct, you agree with my statement? >-- > Kirill A. Shutemov
On Mon, Jan 13, 2020 at 08:44:57AM +0800, Wei Yang wrote: > >> It is possible two page in the same pgdate or memcg grab page lock > >> respectively and then access the same defer queue concurrently. > > If my understanding is correct, you agree with my statement? Which one? If the one above then no. list_empty only accesses list_head for the struct page, not the queue.
On Mon, Jan 13, 2020 at 10:36:14AM +0300, Kirill A. Shutemov wrote: >On Mon, Jan 13, 2020 at 08:44:57AM +0800, Wei Yang wrote: >> >> It is possible two page in the same pgdate or memcg grab page lock >> >> respectively and then access the same defer queue concurrently. >> >> If my understanding is correct, you agree with my statement? > >Which one? If the one above then no. list_empty only accesses list_head >for the struct page, not the queue. > Ah, I get your point. >-- > Kirill A. Shutemov
On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote: > On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: > > As all the other places, we grab the lock before manipulate the defer list. > > Current implementation may face a race condition. > > > > For example, the potential race would be: > > > > CPU1 CPU2 > > mem_cgroup_move_account split_huge_page_to_list > > !list_empty > > lock > > !list_empty > > list_del > > unlock > > lock > > # !list_empty might not hold anymore > > list_del_init > > unlock > > I don't think this particular race is possible. Both parties take page > lock before messing with deferred queue, but anytway: > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> I am confused, if the above race is not possible then what would be a real race? We really do not want to have a patch with a misleading changelog, do we?
On Tue, Jan 14, 2020 at 10:31:22AM +0100, Michal Hocko wrote: > On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote: > > On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: > > > As all the other places, we grab the lock before manipulate the defer list. > > > Current implementation may face a race condition. > > > > > > For example, the potential race would be: > > > > > > CPU1 CPU2 > > > mem_cgroup_move_account split_huge_page_to_list > > > !list_empty > > > lock > > > !list_empty > > > list_del > > > unlock > > > lock > > > # !list_empty might not hold anymore > > > list_del_init > > > unlock > > > > I don't think this particular race is possible. Both parties take page > > lock before messing with deferred queue, but anytway: > > > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > I am confused, if the above race is not possible then what would be a > real race? We really do not want to have a patch with a misleading > changelog, do we? The alternative is to make sure that all page_deferred_list() called with page lock taken. I'll look into it.
On Tue, Jan 14, 2020 at 01:31:12PM +0300, Kirill A. Shutemov wrote: > On Tue, Jan 14, 2020 at 10:31:22AM +0100, Michal Hocko wrote: > > On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote: > > > On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: > > > > As all the other places, we grab the lock before manipulate the defer list. > > > > Current implementation may face a race condition. > > > > > > > > For example, the potential race would be: > > > > > > > > CPU1 CPU2 > > > > mem_cgroup_move_account split_huge_page_to_list > > > > !list_empty > > > > lock > > > > !list_empty > > > > list_del > > > > unlock > > > > lock > > > > # !list_empty might not hold anymore > > > > list_del_init > > > > unlock > > > > > > I don't think this particular race is possible. Both parties take page > > > lock before messing with deferred queue, but anytway: > > > > > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > > > I am confused, if the above race is not possible then what would be a > > real race? We really do not want to have a patch with a misleading > > changelog, do we? > > The alternative is to make sure that all page_deferred_list() called with > page lock taken. > > I'll look into it. split_huge_page_to_list() has page lock taken. free_transhuge_page() is in the free path and doesn't susceptible to the race. deferred_split_scan() is trickier. list_move() should be safe against list_empty() as it will not produce false-positive list_empty(). list_del_init() *should* (correct me if I'm wrong) be safe because the page is freeing and memcg will not touch the page anymore. deferred_split_huge_page() is a problematic one. It called from page_remove_rmap() path witch does require page lock. I don't see any obvious way to exclude race with mem_cgroup_move_account() here. Anybody else? Wei, could you rewrite the commit message with deferred_split_huge_page() as a race source instead of split_huge_page_to_list()?
On Tue, 14 Jan 2020, Kirill A. Shutemov wrote: > split_huge_page_to_list() has page lock taken. > > free_transhuge_page() is in the free path and doesn't susceptible to the > race. > > deferred_split_scan() is trickier. list_move() should be safe against > list_empty() as it will not produce false-positive list_empty(). > list_del_init() *should* (correct me if I'm wrong) be safe because the page > is freeing and memcg will not touch the page anymore. > > deferred_split_huge_page() is a problematic one. It called from > page_remove_rmap() path witch does require page lock. I don't see any > obvious way to exclude race with mem_cgroup_move_account() here. > Anybody else? > > Wei, could you rewrite the commit message with deferred_split_huge_page() > as a race source instead of split_huge_page_to_list()? > I think describing the race in terms of deferred_split_huge_page() makes the most sense and I'd prefer a cc to stable for 5.4+. Even getting the split_queue_len, which is unsigned long, to underflow because of a list_empty(page_deferred_list()) check that is no longer accurate after the lock is taken would be a significant issue for shrinkers.
On Tue, Jan 14, 2020 at 01:59:21PM +0300, Kirill A. Shutemov wrote: >On Tue, Jan 14, 2020 at 01:31:12PM +0300, Kirill A. Shutemov wrote: >> On Tue, Jan 14, 2020 at 10:31:22AM +0100, Michal Hocko wrote: >> > On Sat 11-01-20 03:03:52, Kirill A. Shutemov wrote: >> > > On Thu, Jan 09, 2020 at 10:30:54PM +0800, Wei Yang wrote: >> > > > As all the other places, we grab the lock before manipulate the defer list. >> > > > Current implementation may face a race condition. >> > > > >> > > > For example, the potential race would be: >> > > > >> > > > CPU1 CPU2 >> > > > mem_cgroup_move_account split_huge_page_to_list >> > > > !list_empty >> > > > lock >> > > > !list_empty >> > > > list_del >> > > > unlock >> > > > lock >> > > > # !list_empty might not hold anymore >> > > > list_del_init >> > > > unlock >> > > >> > > I don't think this particular race is possible. Both parties take page >> > > lock before messing with deferred queue, but anytway: >> > > >> > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> > >> > I am confused, if the above race is not possible then what would be a >> > real race? We really do not want to have a patch with a misleading >> > changelog, do we? >> >> The alternative is to make sure that all page_deferred_list() called with >> page lock taken. >> >> I'll look into it. > >split_huge_page_to_list() has page lock taken. > >free_transhuge_page() is in the free path and doesn't susceptible to the >race. > >deferred_split_scan() is trickier. list_move() should be safe against >list_empty() as it will not produce false-positive list_empty(). >list_del_init() *should* (correct me if I'm wrong) be safe because the page >is freeing and memcg will not touch the page anymore. > >deferred_split_huge_page() is a problematic one. It called from >page_remove_rmap() path witch does require page lock. I don't see any >obvious way to exclude race with mem_cgroup_move_account() here. >Anybody else? If my understanding is correct, the reason is deferred_split_huge_page() doesn't has page lock taken, right? > >Wei, could you rewrite the commit message with deferred_split_huge_page() >as a race source instead of split_huge_page_to_list()? > >-- > Kirill A. Shutemov
On Tue, Jan 14, 2020 at 12:57:22PM -0800, David Rientjes wrote: >On Tue, 14 Jan 2020, Kirill A. Shutemov wrote: > >> split_huge_page_to_list() has page lock taken. >> >> free_transhuge_page() is in the free path and doesn't susceptible to the >> race. >> >> deferred_split_scan() is trickier. list_move() should be safe against >> list_empty() as it will not produce false-positive list_empty(). >> list_del_init() *should* (correct me if I'm wrong) be safe because the page >> is freeing and memcg will not touch the page anymore. >> >> deferred_split_huge_page() is a problematic one. It called from >> page_remove_rmap() path witch does require page lock. I don't see any >> obvious way to exclude race with mem_cgroup_move_account() here. >> Anybody else? >> >> Wei, could you rewrite the commit message with deferred_split_huge_page() >> as a race source instead of split_huge_page_to_list()? >> > >I think describing the race in terms of deferred_split_huge_page() makes >the most sense and I'd prefer a cc to stable for 5.4+. Even getting the >split_queue_len, which is unsigned long, to underflow because of a >list_empty(page_deferred_list()) check that is no longer accurate after >the lock is taken would be a significant issue for shrinkers. Oh, you are right. Even the list is not corrupted between deferred_split_scan() and mem_cgroup_move_account(), split_queue_len would be in a wrong state. Hmm... to some extend, the page lock complicates the picture a little here, even it helps in some cases.
On Wed, 15 Jan 2020, Wei Yang wrote: > >split_huge_page_to_list() has page lock taken. > > > >free_transhuge_page() is in the free path and doesn't susceptible to the > >race. > > > >deferred_split_scan() is trickier. list_move() should be safe against > >list_empty() as it will not produce false-positive list_empty(). > >list_del_init() *should* (correct me if I'm wrong) be safe because the page > >is freeing and memcg will not touch the page anymore. > > > >deferred_split_huge_page() is a problematic one. It called from > >page_remove_rmap() path witch does require page lock. I don't see any > >obvious way to exclude race with mem_cgroup_move_account() here. > >Anybody else? > > If my understanding is correct, the reason is deferred_split_huge_page() > doesn't has page lock taken, right? > I think the fix that you have proposed has inspired some deeper looks at the locking around the deferred split queue and the hope was that perhaps this could be protected by the page lock but it was found that at least in one path that isn't taken. So I believe your fix is still needed and any possible optimizations in this area can be proposed on top.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bc01423277c5..1492eefe4f3c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5368,10 +5368,12 @@ static int mem_cgroup_move_account(struct page *page, } #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (compound && !list_empty(page_deferred_list(page))) { + if (compound) { spin_lock(&from->deferred_split_queue.split_queue_lock); - list_del_init(page_deferred_list(page)); - from->deferred_split_queue.split_queue_len--; + if (!list_empty(page_deferred_list(page))) { + list_del_init(page_deferred_list(page)); + from->deferred_split_queue.split_queue_len--; + } spin_unlock(&from->deferred_split_queue.split_queue_lock); } #endif @@ -5385,11 +5387,13 @@ static int mem_cgroup_move_account(struct page *page, page->mem_cgroup = to; #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (compound && list_empty(page_deferred_list(page))) { + if (compound) { spin_lock(&to->deferred_split_queue.split_queue_lock); - list_add_tail(page_deferred_list(page), - &to->deferred_split_queue.split_queue); - to->deferred_split_queue.split_queue_len++; + if (list_empty(page_deferred_list(page))) { + list_add_tail(page_deferred_list(page), + &to->deferred_split_queue.split_queue); + to->deferred_split_queue.split_queue_len++; + } spin_unlock(&to->deferred_split_queue.split_queue_lock); } #endif