Message ID | 20240304103757.235352-1-21cnbao@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [RFC] mm: hold PTL from the first PTE while reclaiming a large folio |
Hi Barry, On 04/03/2024 10:37, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > page_vma_mapped_walk() within try_to_unmap_one() races with other > PTEs modification such as break-before-make, while iterating PTEs > of a large folio, it will only begin to acquire PTL after it gets > a valid(present) PTE. break-before-make intermediately sets PTEs > to pte_none. Thus, a large folio's PTEs might be partially skipped > in try_to_unmap_one(). I just want to check my understanding here - I think the problem occurs for PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now that I've had a look at the code and have a better understanding, I think that must be the case? And therefore this problem exists independently of my work to support swap-out of mTHP? (From your previous report I was under the impression that it only affected mTHP). Its just that the problem is becoming more pronounced because with mTHP, PTE-mapped large folios are much more common? > For example, for an anon folio, after try_to_unmap_one(), we may > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > So folio will be still mapped, the folio fails to be reclaimed. > What’s even more worrying is, its PTEs are no longer in a unified > state. This might lead to accident folio_split() afterwards. And > since a part of PTEs are now swap entries, accessing them will > incur page fault - do_swap_page. > It creates both anxiety and more expense. While we can't avoid > userspace's unmap to break up unified PTEs such as CONT-PTE for > a large folio, we can indeed keep away from kernel's breaking up > them due to its code design. > This patch is holding PTL from PTE0, thus, the folio will either > be entirely reclaimed or entirely kept. On the other hand, this > approach doesn't increase PTL contention. Even w/o the patch, > page_vma_mapped_walk() will always get PTL after it sometimes > skips one or two PTEs because intermediate break-before-makes > are short, according to test. Of course, even w/o this patch, > the vast majority of try_to_unmap_one still can get PTL from > PTE0. This patch makes the number 100%. > The other option is that we can give up in try_to_unmap_one > once we find PTE0 is not the first entry we get PTL, we call > page_vma_mapped_walk_done() to end the iteration at this case. > This will keep the unified PTEs while the folio isn't reclaimed. > The result is quite similar with small folios with one PTE - > either entirely reclaimed or entirely kept. > Reclaiming large folios by holding PTL from PTE0 seems a better > option comparing to giving up after detecting PTL begins from > non-PTE0. > > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> Do we need a Fixes tag? 
> --- > mm/vmscan.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 0b888a2afa58..e4722fbbcd0c 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > if (folio_test_pmd_mappable(folio)) > flags |= TTU_SPLIT_HUGE_PMD; > + /* > + * if page table lock is not held from the first PTE of > + * a large folio, some PTEs might be skipped because of > + * races with break-before-make, for example, PTEs can > + * be pte_none intermediately, thus one or more PTEs > + * might be skipped in try_to_unmap_one, we might result > + * in a large folio is partially mapped and partially > + * unmapped after try_to_unmap > + */ > + if (folio_test_large(folio)) > + flags |= TTU_SYNC; This looks sensible to me after thinking about it for a while. But I also have a gut feeling that there might be some more subtleties that are going over my head, since I'm not expert in this area. So will leave others to provide R-b :) Thanks, Ryan > > try_to_unmap(folio, flags); > if (folio_mapped(folio)) {
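For reference, a heavily simplified sketch of the mechanism under discussion (not the exact upstream code): try_to_unmap_one() translates TTU_SYNC into PVMW_SYNC, and with PVMW_SYNC set, page_vma_mapped_walk()'s map_pte() takes the PTL before reading the first PTE instead of skipping non-present entries locklessly, which is how part of a large folio can otherwise be missed.

```c
/* Heavily simplified sketch; unrelated cases and error handling omitted. */

/* mm/rmap.c, try_to_unmap_one(): forward the new flag to the walker */
if (flags & TTU_SYNC)
	pvmw.flags |= PVMW_SYNC;	/* request a strict, locked lookup */

/* mm/page_vma_mapped.c, map_pte(), called from page_vma_mapped_walk() */
static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
{
	if (pvmw->flags & PVMW_SYNC) {
		/* Take the PTL before inspecting anything, starting at PTE0. */
		pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
						pvmw->address, ptlp);
		return !!pvmw->pte;
	}

	/*
	 * Lockless path: a PTE that is transiently pte_none (e.g. mid
	 * break-before-make) is skipped here without the lock being held,
	 * so the walk may only take the PTL from PTE1, PTE2, ... onwards.
	 */
	pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
					  pvmw->address, ptlp);
	if (!pvmw->pte || !pte_present(ptep_get(pvmw->pte)))
		return false;

	spin_lock(*ptlp);
	return true;
}
```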
On 04.03.24 13:20, Ryan Roberts wrote: > Hi Barry, > > On 04/03/2024 10:37, Barry Song wrote: >> From: Barry Song <v-songbaohua@oppo.com> >> >> page_vma_mapped_walk() within try_to_unmap_one() races with other >> PTEs modification such as break-before-make, while iterating PTEs >> of a large folio, it will only begin to acquire PTL after it gets >> a valid(present) PTE. break-before-make intermediately sets PTEs >> to pte_none. Thus, a large folio's PTEs might be partially skipped >> in try_to_unmap_one(). > > I just want to check my understanding here - I think the problem occurs for > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > that I've had a look at the code and have a better understanding, I think that > must be the case? And therefore this problem exists independently of my work to > support swap-out of mTHP? (From your previous report I was under the impression > that it only affected mTHP). > > Its just that the problem is becoming more pronounced because with mTHP, > PTE-mapped large folios are much more common? That is my understanding. > >> For example, for an anon folio, after try_to_unmap_one(), we may >> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. >> So folio will be still mapped, the folio fails to be reclaimed. >> What’s even more worrying is, its PTEs are no longer in a unified >> state. This might lead to accident folio_split() afterwards. And >> since a part of PTEs are now swap entries, accessing them will >> incur page fault - do_swap_page. >> It creates both anxiety and more expense. While we can't avoid >> userspace's unmap to break up unified PTEs such as CONT-PTE for >> a large folio, we can indeed keep away from kernel's breaking up >> them due to its code design. >> This patch is holding PTL from PTE0, thus, the folio will either >> be entirely reclaimed or entirely kept. On the other hand, this >> approach doesn't increase PTL contention. Even w/o the patch, >> page_vma_mapped_walk() will always get PTL after it sometimes >> skips one or two PTEs because intermediate break-before-makes >> are short, according to test. Of course, even w/o this patch, >> the vast majority of try_to_unmap_one still can get PTL from >> PTE0. This patch makes the number 100%. >> The other option is that we can give up in try_to_unmap_one >> once we find PTE0 is not the first entry we get PTL, we call >> page_vma_mapped_walk_done() to end the iteration at this case. >> This will keep the unified PTEs while the folio isn't reclaimed. >> The result is quite similar with small folios with one PTE - >> either entirely reclaimed or entirely kept. >> Reclaiming large folios by holding PTL from PTE0 seems a better >> option comparing to giving up after detecting PTL begins from >> non-PTE0. >> I'm sure that wall of text can be formatted in a better way :) . Also, I think we can drop some of the details, If you need some inspiration, I can give it a shot. >> Cc: Hugh Dickins <hughd@google.com> >> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > Do we need a Fixes tag? > What would be the description of the problem we are fixing? 1) failing to unmap? That can happen with small folios as well IIUC. 2) Putting the large folio on the deferred split queue? That sounds more reasonable. 
>> --- >> mm/vmscan.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 0b888a2afa58..e4722fbbcd0c 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, >> >> if (folio_test_pmd_mappable(folio)) >> flags |= TTU_SPLIT_HUGE_PMD; >> + /* >> + * if page table lock is not held from the first PTE of >> + * a large folio, some PTEs might be skipped because of >> + * races with break-before-make, for example, PTEs can >> + * be pte_none intermediately, thus one or more PTEs >> + * might be skipped in try_to_unmap_one, we might result >> + * in a large folio is partially mapped and partially >> + * unmapped after try_to_unmap >> + */ >> + if (folio_test_large(folio)) >> + flags |= TTU_SYNC; > > This looks sensible to me after thinking about it for a while. But I also have a > gut feeling that there might be some more subtleties that are going over my > head, since I'm not expert in this area. So will leave others to provide R-b :) > As we are seeing more such problems with lockless PT walks, maybe we really want some other special value (nonswap entry?) to indicate that a PTE this is currently ondergoing protection changes. So we'd avoid the pte_none() temporarily, if possible. Without that, TTU_SYNC feels like the right thing to do.
On 04/03/2024 12:41, David Hildenbrand wrote: > On 04.03.24 13:20, Ryan Roberts wrote: >> Hi Barry, >> >> On 04/03/2024 10:37, Barry Song wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> page_vma_mapped_walk() within try_to_unmap_one() races with other >>> PTEs modification such as break-before-make, while iterating PTEs >>> of a large folio, it will only begin to acquire PTL after it gets >>> a valid(present) PTE. break-before-make intermediately sets PTEs >>> to pte_none. Thus, a large folio's PTEs might be partially skipped >>> in try_to_unmap_one(). >> >> I just want to check my understanding here - I think the problem occurs for >> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now >> that I've had a look at the code and have a better understanding, I think that >> must be the case? And therefore this problem exists independently of my work to >> support swap-out of mTHP? (From your previous report I was under the impression >> that it only affected mTHP). >> >> Its just that the problem is becoming more pronounced because with mTHP, >> PTE-mapped large folios are much more common? > > That is my understanding. > >> >>> For example, for an anon folio, after try_to_unmap_one(), we may >>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. >>> So folio will be still mapped, the folio fails to be reclaimed. >>> What’s even more worrying is, its PTEs are no longer in a unified >>> state. This might lead to accident folio_split() afterwards. And >>> since a part of PTEs are now swap entries, accessing them will >>> incur page fault - do_swap_page. >>> It creates both anxiety and more expense. While we can't avoid >>> userspace's unmap to break up unified PTEs such as CONT-PTE for >>> a large folio, we can indeed keep away from kernel's breaking up >>> them due to its code design. >>> This patch is holding PTL from PTE0, thus, the folio will either >>> be entirely reclaimed or entirely kept. On the other hand, this >>> approach doesn't increase PTL contention. Even w/o the patch, >>> page_vma_mapped_walk() will always get PTL after it sometimes >>> skips one or two PTEs because intermediate break-before-makes >>> are short, according to test. Of course, even w/o this patch, >>> the vast majority of try_to_unmap_one still can get PTL from >>> PTE0. This patch makes the number 100%. >>> The other option is that we can give up in try_to_unmap_one >>> once we find PTE0 is not the first entry we get PTL, we call >>> page_vma_mapped_walk_done() to end the iteration at this case. >>> This will keep the unified PTEs while the folio isn't reclaimed. >>> The result is quite similar with small folios with one PTE - >>> either entirely reclaimed or entirely kept. >>> Reclaiming large folios by holding PTL from PTE0 seems a better >>> option comparing to giving up after detecting PTL begins from >>> non-PTE0. >>> > > I'm sure that wall of text can be formatted in a better way :) . Also, I think > we can drop some of the details, > > If you need some inspiration, I can give it a shot. > >>> Cc: Hugh Dickins <hughd@google.com> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >> >> Do we need a Fixes tag? >> > > What would be the description of the problem we are fixing? > > 1) failing to unmap? > > That can happen with small folios as well IIUC. > > 2) Putting the large folio on the deferred split queue? > > That sounds more reasonable. 
Isn't the real problem today that we can end up writng a THP to the swap file (so 2M more IO and space used) but we can't remove it from memory, so no actual reclaim happens? Although I guess your (2) is really just another way of saying that. > >>> --- >>> mm/vmscan.c | 11 +++++++++++ >>> 1 file changed, 11 insertions(+) >>> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>> index 0b888a2afa58..e4722fbbcd0c 100644 >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head >>> *folio_list, >>> if (folio_test_pmd_mappable(folio)) >>> flags |= TTU_SPLIT_HUGE_PMD; >>> + /* >>> + * if page table lock is not held from the first PTE of >>> + * a large folio, some PTEs might be skipped because of >>> + * races with break-before-make, for example, PTEs can >>> + * be pte_none intermediately, thus one or more PTEs >>> + * might be skipped in try_to_unmap_one, we might result >>> + * in a large folio is partially mapped and partially >>> + * unmapped after try_to_unmap >>> + */ >>> + if (folio_test_large(folio)) >>> + flags |= TTU_SYNC; >> >> This looks sensible to me after thinking about it for a while. But I also have a >> gut feeling that there might be some more subtleties that are going over my >> head, since I'm not expert in this area. So will leave others to provide R-b :) >> > > As we are seeing more such problems with lockless PT walks, maybe we really want > some other special value (nonswap entry?) to indicate that a PTE this is > currently ondergoing protection changes. So we'd avoid the pte_none() > temporarily, if possible. > > Without that, TTU_SYNC feels like the right thing to do. >
On 04.03.24 14:03, Ryan Roberts wrote: > On 04/03/2024 12:41, David Hildenbrand wrote: >> On 04.03.24 13:20, Ryan Roberts wrote: >>> Hi Barry, >>> >>> On 04/03/2024 10:37, Barry Song wrote: >>>> From: Barry Song <v-songbaohua@oppo.com> >>>> >>>> page_vma_mapped_walk() within try_to_unmap_one() races with other >>>> PTEs modification such as break-before-make, while iterating PTEs >>>> of a large folio, it will only begin to acquire PTL after it gets >>>> a valid(present) PTE. break-before-make intermediately sets PTEs >>>> to pte_none. Thus, a large folio's PTEs might be partially skipped >>>> in try_to_unmap_one(). >>> >>> I just want to check my understanding here - I think the problem occurs for >>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now >>> that I've had a look at the code and have a better understanding, I think that >>> must be the case? And therefore this problem exists independently of my work to >>> support swap-out of mTHP? (From your previous report I was under the impression >>> that it only affected mTHP). >>> >>> Its just that the problem is becoming more pronounced because with mTHP, >>> PTE-mapped large folios are much more common? >> >> That is my understanding. >> >>> >>>> For example, for an anon folio, after try_to_unmap_one(), we may >>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. >>>> So folio will be still mapped, the folio fails to be reclaimed. >>>> What’s even more worrying is, its PTEs are no longer in a unified >>>> state. This might lead to accident folio_split() afterwards. And >>>> since a part of PTEs are now swap entries, accessing them will >>>> incur page fault - do_swap_page. >>>> It creates both anxiety and more expense. While we can't avoid >>>> userspace's unmap to break up unified PTEs such as CONT-PTE for >>>> a large folio, we can indeed keep away from kernel's breaking up >>>> them due to its code design. >>>> This patch is holding PTL from PTE0, thus, the folio will either >>>> be entirely reclaimed or entirely kept. On the other hand, this >>>> approach doesn't increase PTL contention. Even w/o the patch, >>>> page_vma_mapped_walk() will always get PTL after it sometimes >>>> skips one or two PTEs because intermediate break-before-makes >>>> are short, according to test. Of course, even w/o this patch, >>>> the vast majority of try_to_unmap_one still can get PTL from >>>> PTE0. This patch makes the number 100%. >>>> The other option is that we can give up in try_to_unmap_one >>>> once we find PTE0 is not the first entry we get PTL, we call >>>> page_vma_mapped_walk_done() to end the iteration at this case. >>>> This will keep the unified PTEs while the folio isn't reclaimed. >>>> The result is quite similar with small folios with one PTE - >>>> either entirely reclaimed or entirely kept. >>>> Reclaiming large folios by holding PTL from PTE0 seems a better >>>> option comparing to giving up after detecting PTL begins from >>>> non-PTE0. >>>> >> >> I'm sure that wall of text can be formatted in a better way :) . Also, I think >> we can drop some of the details, >> >> If you need some inspiration, I can give it a shot. >> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> >>> Do we need a Fixes tag? >>> >> >> What would be the description of the problem we are fixing? >> >> 1) failing to unmap? >> >> That can happen with small folios as well IIUC. >> >> 2) Putting the large folio on the deferred split queue? 
>> >> That sounds more reasonable. > > Isn't the real problem today that we can end up writng a THP to the swap file > (so 2M more IO and space used) but we can't remove it from memory, so no actual > reclaim happens? Although I guess your (2) is really just another way of saying > that. The same could happen with small folios I believe? We might end up running into the folio_mapped() after the try_to_unmap(). Note that the actual I/O does not happen during add_to_swap(), but during the pageout() call when we find the folio to be dirty. So there would not actually be more I/O. Only swap space would be reserved, that would be used later when not running into the race.
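To make the ordering David describes concrete, here is a heavily abridged sketch of the relevant steps in shrink_folio_list() (mm/vmscan.c); the mainline code has many more branches than shown:

```c
/* Heavily abridged sketch of shrink_folio_list(); many branches omitted. */
if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
    !folio_test_swapcache(folio)) {
	if (!add_to_swap(folio))	/* only reserves swap space, no I/O */
		goto activate_locked;
}

if (folio_mapped(folio)) {
	enum ttu_flags flags = TTU_BATCH_FLUSH;

	if (folio_test_pmd_mappable(folio))
		flags |= TTU_SPLIT_HUGE_PMD;
	if (folio_test_large(folio))	/* the proposed change */
		flags |= TTU_SYNC;

	try_to_unmap(folio, flags);
	if (folio_mapped(folio))	/* lost the race: keep the folio */
		goto activate_locked;
}

if (folio_test_dirty(folio))
	pageout(folio, mapping, &plug);	/* the actual swap write happens here */
```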
On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote: > > On 04.03.24 14:03, Ryan Roberts wrote: > > On 04/03/2024 12:41, David Hildenbrand wrote: > >> On 04.03.24 13:20, Ryan Roberts wrote: > >>> Hi Barry, > >>> > >>> On 04/03/2024 10:37, Barry Song wrote: > >>>> From: Barry Song <v-songbaohua@oppo.com> > >>>> > >>>> page_vma_mapped_walk() within try_to_unmap_one() races with other > >>>> PTEs modification such as break-before-make, while iterating PTEs > >>>> of a large folio, it will only begin to acquire PTL after it gets > >>>> a valid(present) PTE. break-before-make intermediately sets PTEs > >>>> to pte_none. Thus, a large folio's PTEs might be partially skipped > >>>> in try_to_unmap_one(). > >>> > >>> I just want to check my understanding here - I think the problem occurs for > >>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > >>> that I've had a look at the code and have a better understanding, I think that > >>> must be the case? And therefore this problem exists independently of my work to > >>> support swap-out of mTHP? (From your previous report I was under the impression > >>> that it only affected mTHP). > >>> > >>> Its just that the problem is becoming more pronounced because with mTHP, > >>> PTE-mapped large folios are much more common? > >> > >> That is my understanding. > >> > >>> > >>>> For example, for an anon folio, after try_to_unmap_one(), we may > >>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > >>>> So folio will be still mapped, the folio fails to be reclaimed. > >>>> What’s even more worrying is, its PTEs are no longer in a unified > >>>> state. This might lead to accident folio_split() afterwards. And > >>>> since a part of PTEs are now swap entries, accessing them will > >>>> incur page fault - do_swap_page. > >>>> It creates both anxiety and more expense. While we can't avoid > >>>> userspace's unmap to break up unified PTEs such as CONT-PTE for > >>>> a large folio, we can indeed keep away from kernel's breaking up > >>>> them due to its code design. > >>>> This patch is holding PTL from PTE0, thus, the folio will either > >>>> be entirely reclaimed or entirely kept. On the other hand, this > >>>> approach doesn't increase PTL contention. Even w/o the patch, > >>>> page_vma_mapped_walk() will always get PTL after it sometimes > >>>> skips one or two PTEs because intermediate break-before-makes > >>>> are short, according to test. Of course, even w/o this patch, > >>>> the vast majority of try_to_unmap_one still can get PTL from > >>>> PTE0. This patch makes the number 100%. > >>>> The other option is that we can give up in try_to_unmap_one > >>>> once we find PTE0 is not the first entry we get PTL, we call > >>>> page_vma_mapped_walk_done() to end the iteration at this case. > >>>> This will keep the unified PTEs while the folio isn't reclaimed. > >>>> The result is quite similar with small folios with one PTE - > >>>> either entirely reclaimed or entirely kept. > >>>> Reclaiming large folios by holding PTL from PTE0 seems a better > >>>> option comparing to giving up after detecting PTL begins from > >>>> non-PTE0. > >>>> > >> > >> I'm sure that wall of text can be formatted in a better way :) . Also, I think > >> we can drop some of the details, > >> > >> If you need some inspiration, I can give it a shot. > >> > >>>> Cc: Hugh Dickins <hughd@google.com> > >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>> > >>> Do we need a Fixes tag? 
> >>> > >> > >> What would be the description of the problem we are fixing? > >> > >> 1) failing to unmap? > >> > >> That can happen with small folios as well IIUC. > >> > >> 2) Putting the large folio on the deferred split queue? > >> > >> That sounds more reasonable. > > > > Isn't the real problem today that we can end up writng a THP to the swap file > > (so 2M more IO and space used) but we can't remove it from memory, so no actual > > reclaim happens? Although I guess your (2) is really just another way of saying > > that. > > The same could happen with small folios I believe? We might end up > running into the > > folio_mapped() > > after the try_to_unmap(). > > Note that the actual I/O does not happen during add_to_swap(), but > during the pageout() call when we find the folio to be dirty. > > So there would not actually be more I/O. Only swap space would be > reserved, that would be used later when not running into the race. I am not worried about small folios at all as they have only one PTE. so the PTE is either completely unmapped or completely mapped. In terms of large folios, it is a different story. for example, a large folio with 16 PTEs with CONT-PTE, we will have 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries 2. page faults on PTE1-PTE15 after try_to_unmap if we access them. This is totally useless PF and can be avoided if we can try_to_unmap properly at the beginning. 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT, MADV_FREE might split it after finding it is not completely mapped. For small folios, we don't have any concern on the above issues. > > -- > Cheers, > > David / dhildenb > Thanks Barry
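A minimal fragment (simplified from handle_pte_fault() in mm/memory.c) showing why point 2 above costs one fault per subpage: after a partial unmap, PTE0 still maps page 0 while PTE1..PTE(nr_pages - 1) are swap entries, so touching any of those subpages takes the swap path even though the folio was never actually reclaimed.

```c
/*
 * Simplified fragment of the fault path: each PTE that try_to_unmap_one()
 * turned into a swap entry now funnels through do_swap_page() on access,
 * one fault per subpage.
 */
if (!pte_present(vmf->orig_pte))
	return do_swap_page(vmf);
```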
On 04.03.24 21:42, Barry Song wrote: > On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 04.03.24 14:03, Ryan Roberts wrote: >>> On 04/03/2024 12:41, David Hildenbrand wrote: >>>> On 04.03.24 13:20, Ryan Roberts wrote: >>>>> Hi Barry, >>>>> >>>>> On 04/03/2024 10:37, Barry Song wrote: >>>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>>> >>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other >>>>>> PTEs modification such as break-before-make, while iterating PTEs >>>>>> of a large folio, it will only begin to acquire PTL after it gets >>>>>> a valid(present) PTE. break-before-make intermediately sets PTEs >>>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped >>>>>> in try_to_unmap_one(). >>>>> >>>>> I just want to check my understanding here - I think the problem occurs for >>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now >>>>> that I've had a look at the code and have a better understanding, I think that >>>>> must be the case? And therefore this problem exists independently of my work to >>>>> support swap-out of mTHP? (From your previous report I was under the impression >>>>> that it only affected mTHP). >>>>> >>>>> Its just that the problem is becoming more pronounced because with mTHP, >>>>> PTE-mapped large folios are much more common? >>>> >>>> That is my understanding. >>>> >>>>> >>>>>> For example, for an anon folio, after try_to_unmap_one(), we may >>>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. >>>>>> So folio will be still mapped, the folio fails to be reclaimed. >>>>>> What’s even more worrying is, its PTEs are no longer in a unified >>>>>> state. This might lead to accident folio_split() afterwards. And >>>>>> since a part of PTEs are now swap entries, accessing them will >>>>>> incur page fault - do_swap_page. >>>>>> It creates both anxiety and more expense. While we can't avoid >>>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for >>>>>> a large folio, we can indeed keep away from kernel's breaking up >>>>>> them due to its code design. >>>>>> This patch is holding PTL from PTE0, thus, the folio will either >>>>>> be entirely reclaimed or entirely kept. On the other hand, this >>>>>> approach doesn't increase PTL contention. Even w/o the patch, >>>>>> page_vma_mapped_walk() will always get PTL after it sometimes >>>>>> skips one or two PTEs because intermediate break-before-makes >>>>>> are short, according to test. Of course, even w/o this patch, >>>>>> the vast majority of try_to_unmap_one still can get PTL from >>>>>> PTE0. This patch makes the number 100%. >>>>>> The other option is that we can give up in try_to_unmap_one >>>>>> once we find PTE0 is not the first entry we get PTL, we call >>>>>> page_vma_mapped_walk_done() to end the iteration at this case. >>>>>> This will keep the unified PTEs while the folio isn't reclaimed. >>>>>> The result is quite similar with small folios with one PTE - >>>>>> either entirely reclaimed or entirely kept. >>>>>> Reclaiming large folios by holding PTL from PTE0 seems a better >>>>>> option comparing to giving up after detecting PTL begins from >>>>>> non-PTE0. >>>>>> >>>> >>>> I'm sure that wall of text can be formatted in a better way :) . Also, I think >>>> we can drop some of the details, >>>> >>>> If you need some inspiration, I can give it a shot. 
>>>> >>>>>> Cc: Hugh Dickins <hughd@google.com> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>>>> >>>>> Do we need a Fixes tag? >>>>> >>>> >>>> What would be the description of the problem we are fixing? >>>> >>>> 1) failing to unmap? >>>> >>>> That can happen with small folios as well IIUC. >>>> >>>> 2) Putting the large folio on the deferred split queue? >>>> >>>> That sounds more reasonable. >>> >>> Isn't the real problem today that we can end up writng a THP to the swap file >>> (so 2M more IO and space used) but we can't remove it from memory, so no actual >>> reclaim happens? Although I guess your (2) is really just another way of saying >>> that. >> >> The same could happen with small folios I believe? We might end up >> running into the >> >> folio_mapped() >> >> after the try_to_unmap(). >> >> Note that the actual I/O does not happen during add_to_swap(), but >> during the pageout() call when we find the folio to be dirty. >> >> So there would not actually be more I/O. Only swap space would be >> reserved, that would be used later when not running into the race. > > I am not worried about small folios at all as they have only one PTE. > so the PTE is either completely unmapped or completely mapped. > > In terms of large folios, it is a different story. for example, a large > folio with 16 PTEs with CONT-PTE, we will have > > 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries > > 2. page faults on PTE1-PTE15 after try_to_unmap if we access them. > > This is totally useless PF and can be avoided if we can try_to_unmap > properly at the beginning. > > 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT, > MADV_FREE might split it after finding it is not completely mapped. > > For small folios, we don't have any concern on the above issues. Right, but when we talk about "Fixes:", what exactly are we consider "really broken" above and what is "undesired"? (a) is there a correctness issue? I don't think so. (b) is there a real performance issue? I'd like to understand. After all, we've been living with that ever since we supported THP_SWAP, correct? "something does not work ideally in some corner cases" might be reasonable to handle here (and I really think we should), but might not be worth a "Fixes:". So if we could clarify that, it would be great.
On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@redhat.com> wrote: > > On 04.03.24 13:20, Ryan Roberts wrote: > > Hi Barry, > > > > On 04/03/2024 10:37, Barry Song wrote: > >> From: Barry Song <v-songbaohua@oppo.com> > >> > >> page_vma_mapped_walk() within try_to_unmap_one() races with other > >> PTEs modification such as break-before-make, while iterating PTEs > >> of a large folio, it will only begin to acquire PTL after it gets > >> a valid(present) PTE. break-before-make intermediately sets PTEs > >> to pte_none. Thus, a large folio's PTEs might be partially skipped > >> in try_to_unmap_one(). > > > > I just want to check my understanding here - I think the problem occurs for > > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > > that I've had a look at the code and have a better understanding, I think that > > must be the case? And therefore this problem exists independently of my work to > > support swap-out of mTHP? (From your previous report I was under the impression > > that it only affected mTHP). > > > > Its just that the problem is becoming more pronounced because with mTHP, > > PTE-mapped large folios are much more common? > > That is my understanding. > > > > >> For example, for an anon folio, after try_to_unmap_one(), we may > >> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > >> So folio will be still mapped, the folio fails to be reclaimed. > >> What’s even more worrying is, its PTEs are no longer in a unified > >> state. This might lead to accident folio_split() afterwards. And > >> since a part of PTEs are now swap entries, accessing them will > >> incur page fault - do_swap_page. > >> It creates both anxiety and more expense. While we can't avoid > >> userspace's unmap to break up unified PTEs such as CONT-PTE for > >> a large folio, we can indeed keep away from kernel's breaking up > >> them due to its code design. > >> This patch is holding PTL from PTE0, thus, the folio will either > >> be entirely reclaimed or entirely kept. On the other hand, this > >> approach doesn't increase PTL contention. Even w/o the patch, > >> page_vma_mapped_walk() will always get PTL after it sometimes > >> skips one or two PTEs because intermediate break-before-makes > >> are short, according to test. Of course, even w/o this patch, > >> the vast majority of try_to_unmap_one still can get PTL from > >> PTE0. This patch makes the number 100%. > >> The other option is that we can give up in try_to_unmap_one > >> once we find PTE0 is not the first entry we get PTL, we call > >> page_vma_mapped_walk_done() to end the iteration at this case. > >> This will keep the unified PTEs while the folio isn't reclaimed. > >> The result is quite similar with small folios with one PTE - > >> either entirely reclaimed or entirely kept. > >> Reclaiming large folios by holding PTL from PTE0 seems a better > >> option comparing to giving up after detecting PTL begins from > >> non-PTE0. > >> > > I'm sure that wall of text can be formatted in a better way :) . Also, I > think we can drop some of the details, > > If you need some inspiration, I can give it a shot. > > >> Cc: Hugh Dickins <hughd@google.com> > >> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > > > Do we need a Fixes tag? I am not quite sure which commit should be here for a fixes tag. I think it's more of an optimization. > > > > What would be the description of the problem we are fixing? > > 1) failing to unmap? > > That can happen with small folios as well IIUC. 
> > 2) Putting the large folio on the deferred split queue? > > That sounds more reasonable. I don't feel it is reasonable. Avoiding this kind of accident splitting from the kernel's improper code is a more reasonable approach as there is always a price to pay for splitting and unfolding PTEs etc. While we can't avoid splitting coming from userspace's MADV_DONTNEED, munmap, mprotect, we have a way to ensure the kernel itself doesn't accidently break up a large folio. In OPPO's phones, we ran into some weird bugs due to skipped PTEs in try_to_unmap_one. hardly could we fix it from the root cause. with various races, figuring out their timings was really a big pain :-) But we did "resolve" those bugs by entirely untouching all PTEs if we found some PTEs were skipped in try_to_unmap_one [1]. While we find we only get the PTL from 2nd, 3rd but not 1st PTE, we entirely give up on try_to_unmap_one, and leave all PTEs untouched. /* we are not starting from head */ if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { ret = false; atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); set_pte_at(mm, address, pvmw.pte, pteval); page_vma_mapped_walk_done(&pvmw); break; } This will ensure all PTEs still have a unified state such as CONT-PTE after try_to_unmap fails. I feel this could have some false postive because when racing with unmap, 1st PTE might really become pte_none. So explicitly holding PTL from 1st PTE seems a better way. [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c#L1730 > > >> --- > >> mm/vmscan.c | 11 +++++++++++ > >> 1 file changed, 11 insertions(+) > >> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c > >> index 0b888a2afa58..e4722fbbcd0c 100644 > >> --- a/mm/vmscan.c > >> +++ b/mm/vmscan.c > >> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > >> > >> if (folio_test_pmd_mappable(folio)) > >> flags |= TTU_SPLIT_HUGE_PMD; > >> + /* > >> + * if page table lock is not held from the first PTE of > >> + * a large folio, some PTEs might be skipped because of > >> + * races with break-before-make, for example, PTEs can > >> + * be pte_none intermediately, thus one or more PTEs > >> + * might be skipped in try_to_unmap_one, we might result > >> + * in a large folio is partially mapped and partially > >> + * unmapped after try_to_unmap > >> + */ > >> + if (folio_test_large(folio)) > >> + flags |= TTU_SYNC; > > > > This looks sensible to me after thinking about it for a while. But I also have a > > gut feeling that there might be some more subtleties that are going over my > > head, since I'm not expert in this area. So will leave others to provide R-b :) > > > > As we are seeing more such problems with lockless PT walks, maybe we > really want some other special value (nonswap entry?) to indicate that a > PTE this is currently ondergoing protection changes. So we'd avoid the > pte_none() temporarily, if possible. > > Without that, TTU_SYNC feels like the right thing to do. > > -- > Cheers, > > David / dhildenb > Thanks Barry
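For readers unfamiliar with that vendor tree, here is the same workaround with comments added; CONT_PTES and perf_stat are OPPO-tree specifics, not mainline symbols.

```c
/*
 * Annotated restatement of the OPPO workaround quoted above.  CONT_PTES and
 * perf_stat are specific to that vendor tree, not mainline.
 */
/* we are not starting from head */
if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
	/*
	 * The walk only managed to take the PTL from the 2nd/3rd/... PTE of
	 * the CONT-PTE block, so earlier PTEs were skipped.  Give up on this
	 * folio entirely and restore the PTE that was already cleared, so the
	 * whole block keeps a uniform (e.g. CONT-PTE) state.
	 */
	ret = false;
	atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
	set_pte_at(mm, address, pvmw.pte, pteval);
	page_vma_mapped_walk_done(&pvmw);
	break;
}
```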
>>> Do we need a Fixes tag? > > I am not quite sure which commit should be here for a fixes tag. > I think it's more of an optimization. Good, that helps! > >>> >> >> What would be the description of the problem we are fixing? >> >> 1) failing to unmap? >> >> That can happen with small folios as well IIUC. >> >> 2) Putting the large folio on the deferred split queue? >> >> That sounds more reasonable. > > I don't feel it is reasonable. Avoiding this kind of accident splitting > from the kernel's improper code is a more reasonable approach > as there is always a price to pay for splitting and unfolding PTEs > etc. > > While we can't avoid splitting coming from userspace's > MADV_DONTNEED, munmap, mprotect, we have a way > to ensure the kernel itself doesn't accidently break up a > large folio. Note that on the next vmscan we would retry, find the remaining present entries and swapout that thing completely :) > > In OPPO's phones, we ran into some weird bugs due to skipped PTEs > in try_to_unmap_one. hardly could we fix it from the root cause. with > various races, figuring out their timings was really a big pain :-) > I can imagine. I assume, though, that it might be related to the way the cont-pte bit was handled. Ryan's implementation should be able to cope with that. > But we did "resolve" those bugs by entirely untouching all PTEs if we > found some PTEs were skipped in try_to_unmap_one [1]. > > While we find we only get the PTL from 2nd, 3rd but not > 1st PTE, we entirely give up on try_to_unmap_one, and leave > all PTEs untouched. > > /* we are not starting from head */ > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { > ret = false; > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); > set_pte_at(mm, address, pvmw.pte, pteval); > page_vma_mapped_walk_done(&pvmw); > break; > } > This will ensure all PTEs still have a unified state such as CONT-PTE > after try_to_unmap fails. > I feel this could have some false postive because when racing > with unmap, 1st PTE might really become pte_none. So explicitly > holding PTL from 1st PTE seems a better way. Can we estimate the "cost" of holding the PTL?
On Tue, Mar 5, 2024 at 10:02 AM David Hildenbrand <david@redhat.com> wrote: > > On 04.03.24 21:42, Barry Song wrote: > > On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 04.03.24 14:03, Ryan Roberts wrote: > >>> On 04/03/2024 12:41, David Hildenbrand wrote: > >>>> On 04.03.24 13:20, Ryan Roberts wrote: > >>>>> Hi Barry, > >>>>> > >>>>> On 04/03/2024 10:37, Barry Song wrote: > >>>>>> From: Barry Song <v-songbaohua@oppo.com> > >>>>>> > >>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other > >>>>>> PTEs modification such as break-before-make, while iterating PTEs > >>>>>> of a large folio, it will only begin to acquire PTL after it gets > >>>>>> a valid(present) PTE. break-before-make intermediately sets PTEs > >>>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped > >>>>>> in try_to_unmap_one(). > >>>>> > >>>>> I just want to check my understanding here - I think the problem occurs for > >>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > >>>>> that I've had a look at the code and have a better understanding, I think that > >>>>> must be the case? And therefore this problem exists independently of my work to > >>>>> support swap-out of mTHP? (From your previous report I was under the impression > >>>>> that it only affected mTHP). > >>>>> > >>>>> Its just that the problem is becoming more pronounced because with mTHP, > >>>>> PTE-mapped large folios are much more common? > >>>> > >>>> That is my understanding. > >>>> > >>>>> > >>>>>> For example, for an anon folio, after try_to_unmap_one(), we may > >>>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > >>>>>> So folio will be still mapped, the folio fails to be reclaimed. > >>>>>> What’s even more worrying is, its PTEs are no longer in a unified > >>>>>> state. This might lead to accident folio_split() afterwards. And > >>>>>> since a part of PTEs are now swap entries, accessing them will > >>>>>> incur page fault - do_swap_page. > >>>>>> It creates both anxiety and more expense. While we can't avoid > >>>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for > >>>>>> a large folio, we can indeed keep away from kernel's breaking up > >>>>>> them due to its code design. > >>>>>> This patch is holding PTL from PTE0, thus, the folio will either > >>>>>> be entirely reclaimed or entirely kept. On the other hand, this > >>>>>> approach doesn't increase PTL contention. Even w/o the patch, > >>>>>> page_vma_mapped_walk() will always get PTL after it sometimes > >>>>>> skips one or two PTEs because intermediate break-before-makes > >>>>>> are short, according to test. Of course, even w/o this patch, > >>>>>> the vast majority of try_to_unmap_one still can get PTL from > >>>>>> PTE0. This patch makes the number 100%. > >>>>>> The other option is that we can give up in try_to_unmap_one > >>>>>> once we find PTE0 is not the first entry we get PTL, we call > >>>>>> page_vma_mapped_walk_done() to end the iteration at this case. > >>>>>> This will keep the unified PTEs while the folio isn't reclaimed. > >>>>>> The result is quite similar with small folios with one PTE - > >>>>>> either entirely reclaimed or entirely kept. > >>>>>> Reclaiming large folios by holding PTL from PTE0 seems a better > >>>>>> option comparing to giving up after detecting PTL begins from > >>>>>> non-PTE0. > >>>>>> > >>>> > >>>> I'm sure that wall of text can be formatted in a better way :) . 
Also, I think > >>>> we can drop some of the details, > >>>> > >>>> If you need some inspiration, I can give it a shot. > >>>> > >>>>>> Cc: Hugh Dickins <hughd@google.com> > >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>>>> > >>>>> Do we need a Fixes tag? > >>>>> > >>>> > >>>> What would be the description of the problem we are fixing? > >>>> > >>>> 1) failing to unmap? > >>>> > >>>> That can happen with small folios as well IIUC. > >>>> > >>>> 2) Putting the large folio on the deferred split queue? > >>>> > >>>> That sounds more reasonable. > >>> > >>> Isn't the real problem today that we can end up writng a THP to the swap file > >>> (so 2M more IO and space used) but we can't remove it from memory, so no actual > >>> reclaim happens? Although I guess your (2) is really just another way of saying > >>> that. > >> > >> The same could happen with small folios I believe? We might end up > >> running into the > >> > >> folio_mapped() > >> > >> after the try_to_unmap(). > >> > >> Note that the actual I/O does not happen during add_to_swap(), but > >> during the pageout() call when we find the folio to be dirty. > >> > >> So there would not actually be more I/O. Only swap space would be > >> reserved, that would be used later when not running into the race. > > > > I am not worried about small folios at all as they have only one PTE. > > so the PTE is either completely unmapped or completely mapped. > > > > In terms of large folios, it is a different story. for example, a large > > folio with 16 PTEs with CONT-PTE, we will have > > > > 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries > > > > 2. page faults on PTE1-PTE15 after try_to_unmap if we access them. > > > > This is totally useless PF and can be avoided if we can try_to_unmap > > properly at the beginning. > > > > 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT, > > MADV_FREE might split it after finding it is not completely mapped. > > > > For small folios, we don't have any concern on the above issues. > > Right, but when we talk about "Fixes:", what exactly are we consider > "really broken" above and what is "undesired"? > > (a) is there a correctness issue? I don't think so. > > (b) is there a real performance issue? I'd like to understand. > > After all, we've been living with that ever since we supported THP_SWAP, > correct? "something does not work ideally in some corner cases" might be > reasonable to handle here (and I really think we should), but might not > be worth a "Fixes:". > > So if we could clarify that, it would be great. I don't think this needs a fixes tag, and I would think it is an optimization on corner cases. And I don't think we will see noticeable performance improvement as this isn't happening quite often though I believe it does improve the performance of corner cases. but if the corner case only takes 0.1% of all try_to_unmap_one, no noticeable performance improvement will be seen. I feel it is more of a behavior to kick flies out while flies don't kill people but it can be sometimes quite annoying :-) I ran into another bug in large-folio swapin series due to this problem, 16 PTEs had contiguous swap entries from Ryan's add_to_swap(), but they were splitted in MADV_PAGEOUT because the folio was not completely mapped after try_to_unmap_one. Then after some time, some of the 16 pages were in swapcache, while others were not in. 
In this case, I couldn't swap them in together and had to handle the page faults one by one, but I was incorrectly handling them and trying to swap the pages in together by reading the swapfile. If they had been handled atomically in try_to_unmap_one, we could have avoided this. We may have to cope with this kind of problem from time to time in future work. > > -- > Cheers, > > David / dhildenb Thanks Barry
On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi Barry, > > On 04/03/2024 10:37, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > page_vma_mapped_walk() within try_to_unmap_one() races with other > > PTEs modification such as break-before-make, while iterating PTEs > > of a large folio, it will only begin to acquire PTL after it gets > > a valid(present) PTE. break-before-make intermediately sets PTEs > > to pte_none. Thus, a large folio's PTEs might be partially skipped > > in try_to_unmap_one(). > > I just want to check my understanding here - I think the problem occurs for > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > that I've had a look at the code and have a better understanding, I think that > must be the case? And therefore this problem exists independently of my work to > support swap-out of mTHP? (From your previous report I was under the impression > that it only affected mTHP). I think this affects all large folios with PTEs entries more than 1. but hugeTLB is handled as a whole in try_to_unmap_one and its rmap is removed all together, i feel hugeTLB doesn't have this problem. > > Its just that the problem is becoming more pronounced because with mTHP, > PTE-mapped large folios are much more common? right. as now large folios become a more common case, and it is my case running in millions of phones. BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim all PTEs all together rather than iterating PTEs one by one. This will improve performance. for example, a batched set_ptes_to_swap_entries() { } then we only need to loop once for a large folio, right now we are looping nr_pages times. > > > For example, for an anon folio, after try_to_unmap_one(), we may > > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > > So folio will be still mapped, the folio fails to be reclaimed. > > What’s even more worrying is, its PTEs are no longer in a unified > > state. This might lead to accident folio_split() afterwards. And > > since a part of PTEs are now swap entries, accessing them will > > incur page fault - do_swap_page. > > It creates both anxiety and more expense. While we can't avoid > > userspace's unmap to break up unified PTEs such as CONT-PTE for > > a large folio, we can indeed keep away from kernel's breaking up > > them due to its code design. > > This patch is holding PTL from PTE0, thus, the folio will either > > be entirely reclaimed or entirely kept. On the other hand, this > > approach doesn't increase PTL contention. Even w/o the patch, > > page_vma_mapped_walk() will always get PTL after it sometimes > > skips one or two PTEs because intermediate break-before-makes > > are short, according to test. Of course, even w/o this patch, > > the vast majority of try_to_unmap_one still can get PTL from > > PTE0. This patch makes the number 100%. > > The other option is that we can give up in try_to_unmap_one > > once we find PTE0 is not the first entry we get PTL, we call > > page_vma_mapped_walk_done() to end the iteration at this case. > > This will keep the unified PTEs while the folio isn't reclaimed. > > The result is quite similar with small folios with one PTE - > > either entirely reclaimed or entirely kept. > > Reclaiming large folios by holding PTL from PTE0 seems a better > > option comparing to giving up after detecting PTL begins from > > non-PTE0. 
> > > > Cc: Hugh Dickins <hughd@google.com> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > Do we need a Fixes tag? I don't feel a strong need for this as this doesn't cause a crash, memory leak or whatever serious. > > > --- > > mm/vmscan.c | 11 +++++++++++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 0b888a2afa58..e4722fbbcd0c 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > if (folio_test_pmd_mappable(folio)) > > flags |= TTU_SPLIT_HUGE_PMD; > > + /* > > + * if page table lock is not held from the first PTE of > > + * a large folio, some PTEs might be skipped because of > > + * races with break-before-make, for example, PTEs can > > + * be pte_none intermediately, thus one or more PTEs > > + * might be skipped in try_to_unmap_one, we might result > > + * in a large folio is partially mapped and partially > > + * unmapped after try_to_unmap > > + */ > > + if (folio_test_large(folio)) > > + flags |= TTU_SYNC; > > This looks sensible to me after thinking about it for a while. But I also have a > gut feeling that there might be some more subtleties that are going over my > head, since I'm not expert in this area. So will leave others to provide R-b :) > ok, thanks :-) > Thanks, > Ryan > > > > > try_to_unmap(folio, flags); > > if (folio_mapped(folio)) { > Thanks Barry
On 04/03/2024 21:04, Barry Song wrote: > On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 04.03.24 13:20, Ryan Roberts wrote: >>> Hi Barry, >>> >>> On 04/03/2024 10:37, Barry Song wrote: >>>> From: Barry Song <v-songbaohua@oppo.com> >>>> >>>> page_vma_mapped_walk() within try_to_unmap_one() races with other >>>> PTEs modification such as break-before-make, while iterating PTEs >>>> of a large folio, it will only begin to acquire PTL after it gets >>>> a valid(present) PTE. break-before-make intermediately sets PTEs >>>> to pte_none. Thus, a large folio's PTEs might be partially skipped >>>> in try_to_unmap_one(). >>> >>> I just want to check my understanding here - I think the problem occurs for >>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now >>> that I've had a look at the code and have a better understanding, I think that >>> must be the case? And therefore this problem exists independently of my work to >>> support swap-out of mTHP? (From your previous report I was under the impression >>> that it only affected mTHP). >>> >>> Its just that the problem is becoming more pronounced because with mTHP, >>> PTE-mapped large folios are much more common? >> >> That is my understanding. >> >>> >>>> For example, for an anon folio, after try_to_unmap_one(), we may >>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. >>>> So folio will be still mapped, the folio fails to be reclaimed. >>>> What’s even more worrying is, its PTEs are no longer in a unified >>>> state. This might lead to accident folio_split() afterwards. And >>>> since a part of PTEs are now swap entries, accessing them will >>>> incur page fault - do_swap_page. >>>> It creates both anxiety and more expense. While we can't avoid >>>> userspace's unmap to break up unified PTEs such as CONT-PTE for >>>> a large folio, we can indeed keep away from kernel's breaking up >>>> them due to its code design. >>>> This patch is holding PTL from PTE0, thus, the folio will either >>>> be entirely reclaimed or entirely kept. On the other hand, this >>>> approach doesn't increase PTL contention. Even w/o the patch, >>>> page_vma_mapped_walk() will always get PTL after it sometimes >>>> skips one or two PTEs because intermediate break-before-makes >>>> are short, according to test. Of course, even w/o this patch, >>>> the vast majority of try_to_unmap_one still can get PTL from >>>> PTE0. This patch makes the number 100%. >>>> The other option is that we can give up in try_to_unmap_one >>>> once we find PTE0 is not the first entry we get PTL, we call >>>> page_vma_mapped_walk_done() to end the iteration at this case. >>>> This will keep the unified PTEs while the folio isn't reclaimed. >>>> The result is quite similar with small folios with one PTE - >>>> either entirely reclaimed or entirely kept. >>>> Reclaiming large folios by holding PTL from PTE0 seems a better >>>> option comparing to giving up after detecting PTL begins from >>>> non-PTE0. >>>> >> >> I'm sure that wall of text can be formatted in a better way :) . Also, I >> think we can drop some of the details, >> >> If you need some inspiration, I can give it a shot. >> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> >>> Do we need a Fixes tag? It seems my original question has snowballed a bit. I was conflating this change with other reports Barry has made where the kernel was panicking (I think?). 
Given we are not seeing any incorrect functional behaviour that this change fixes, I agree we don't need a Fixes tag here.
On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote: > > > >>> Do we need a Fixes tag? > > > > I am not quite sure which commit should be here for a fixes tag. > > I think it's more of an optimization. > > Good, that helps! > > > > >>> > >> > >> What would be the description of the problem we are fixing? > >> > >> 1) failing to unmap? > >> > >> That can happen with small folios as well IIUC. > >> > >> 2) Putting the large folio on the deferred split queue? > >> > >> That sounds more reasonable. > > > > I don't feel it is reasonable. Avoiding this kind of accident splitting > > from the kernel's improper code is a more reasonable approach > > as there is always a price to pay for splitting and unfolding PTEs > > etc. > > > > While we can't avoid splitting coming from userspace's > > MADV_DONTNEED, munmap, mprotect, we have a way > > to ensure the kernel itself doesn't accidently break up a > > large folio. > > Note that on the next vmscan we would retry, find the remaining present > entries and swapout that thing completely :) This is true, but since we can finish the job the first time, it seems second retry is a cost :-) > > > > > In OPPO's phones, we ran into some weird bugs due to skipped PTEs > > in try_to_unmap_one. hardly could we fix it from the root cause. with > > various races, figuring out their timings was really a big pain :-) > > > > I can imagine. I assume, though, that it might be related to the way the > cont-pte bit was handled. Ryan's implementation should be able to cope > with that. I guess you are probably right. Ryan's implementation decouples CONT-PTE from mm core. nice to have it. > > > But we did "resolve" those bugs by entirely untouching all PTEs if we > > found some PTEs were skipped in try_to_unmap_one [1]. > > > > While we find we only get the PTL from 2nd, 3rd but not > > 1st PTE, we entirely give up on try_to_unmap_one, and leave > > all PTEs untouched. > > > > /* we are not starting from head */ > > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { > > ret = false; > > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); > > set_pte_at(mm, address, pvmw.pte, pteval); > > page_vma_mapped_walk_done(&pvmw); > > break; > > } > > This will ensure all PTEs still have a unified state such as CONT-PTE > > after try_to_unmap fails. > > I feel this could have some false postive because when racing > > with unmap, 1st PTE might really become pte_none. So explicitly > > holding PTL from 1st PTE seems a better way. > > Can we estimate the "cost" of holding the PTL? > This is just moving PTL acquisition one or two PTE earlier in those corner cases. In normal cases, it doesn't affect when PTL is held. In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold PTL immediately. in corner cases, page_vma_mapped_walk races with break- before-make, after skipping one or two PTEs whose states are transferring, it will find a present pte then acquire lock. > -- > Cheers, > > David / dhildenb Thanks Barry
Barry Song <21cnbao@gmail.com> writes: > From: Barry Song <v-songbaohua@oppo.com> > > page_vma_mapped_walk() within try_to_unmap_one() races with other > PTEs modification such as break-before-make, while iterating PTEs Sorry, I don't know what is "break-before-make", can you elaborate? IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear PTE temporarily, which may cause race with page_vma_mapped_walk(). Is that the issue that you try to fix? -- Best Regards, Huang, Ying > of a large folio, it will only begin to acquire PTL after it gets > a valid(present) PTE. break-before-make intermediately sets PTEs > to pte_none. Thus, a large folio's PTEs might be partially skipped > in try_to_unmap_one(). > For example, for an anon folio, after try_to_unmap_one(), we may > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > So folio will be still mapped, the folio fails to be reclaimed. > What’s even more worrying is, its PTEs are no longer in a unified > state. This might lead to accident folio_split() afterwards. And > since a part of PTEs are now swap entries, accessing them will > incur page fault - do_swap_page. > It creates both anxiety and more expense. While we can't avoid > userspace's unmap to break up unified PTEs such as CONT-PTE for > a large folio, we can indeed keep away from kernel's breaking up > them due to its code design. > This patch is holding PTL from PTE0, thus, the folio will either > be entirely reclaimed or entirely kept. On the other hand, this > approach doesn't increase PTL contention. Even w/o the patch, > page_vma_mapped_walk() will always get PTL after it sometimes > skips one or two PTEs because intermediate break-before-makes > are short, according to test. Of course, even w/o this patch, > the vast majority of try_to_unmap_one still can get PTL from > PTE0. This patch makes the number 100%. > The other option is that we can give up in try_to_unmap_one > once we find PTE0 is not the first entry we get PTL, we call > page_vma_mapped_walk_done() to end the iteration at this case. > This will keep the unified PTEs while the folio isn't reclaimed. > The result is quite similar with small folios with one PTE - > either entirely reclaimed or entirely kept. > Reclaiming large folios by holding PTL from PTE0 seems a better > option comparing to giving up after detecting PTL begins from > non-PTE0. > > Cc: Hugh Dickins <hughd@google.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > mm/vmscan.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 0b888a2afa58..e4722fbbcd0c 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > if (folio_test_pmd_mappable(folio)) > flags |= TTU_SPLIT_HUGE_PMD; > + /* > + * if page table lock is not held from the first PTE of > + * a large folio, some PTEs might be skipped because of > + * races with break-before-make, for example, PTEs can > + * be pte_none intermediately, thus one or more PTEs > + * might be skipped in try_to_unmap_one, we might result > + * in a large folio is partially mapped and partially > + * unmapped after try_to_unmap > + */ > + if (folio_test_large(folio)) > + flags |= TTU_SYNC; > > try_to_unmap(folio, flags); > if (folio_mapped(folio)) {
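To illustrate the transient window Ying is asking about, here is a minimal sketch modelled on the change_pte_range() pattern in mm/mprotect.c; it is only one example of a break-before-make-style update, and the variable names are assumed from that context.

```c
/*
 * Minimal sketch of the transient-pte_none window (modelled on the
 * ptep_modify_prot_start()/commit() pattern used by mprotect).
 */
oldpte = ptep_modify_prot_start(vma, addr, pte);
/*
 * Between start and commit, the PTE can read back as pte_none().  A
 * concurrent page_vma_mapped_walk() peeking here without holding the PTL
 * sees "not present" and skips the entry - the race described in the patch.
 */
newpte = pte_modify(oldpte, newprot);
ptep_modify_prot_commit(vma, addr, pte, oldpte, newpte);
```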
David Hildenbrand <david@redhat.com> writes: > > As we are seeing more such problems with lockless PT walks, maybe we > really want some other special value (nonswap entry?) to indicate that > a PTE this is currently ondergoing protection changes. So we'd avoid > the pte_none() temporarily, if possible. This sounds like a good idea. It could also solve other issues caused by the temporary pte_none() state, like the following: https://lore.kernel.org/linux-mm/20240229060907.836589-1-zhangpeng362@huawei.com/ -- Best Regards, Huang, Ying
Barry Song <21cnbao@gmail.com> writes: > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote: >> > But we did "resolve" those bugs by entirely untouching all PTEs if we >> > found some PTEs were skipped in try_to_unmap_one [1]. >> > >> > While we find we only get the PTL from 2nd, 3rd but not >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave >> > all PTEs untouched. >> > >> > /* we are not starting from head */ >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { >> > ret = false; >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); >> > set_pte_at(mm, address, pvmw.pte, pteval); >> > page_vma_mapped_walk_done(&pvmw); >> > break; >> > } >> > This will ensure all PTEs still have a unified state such as CONT-PTE >> > after try_to_unmap fails. >> > I feel this could have some false postive because when racing >> > with unmap, 1st PTE might really become pte_none. So explicitly >> > holding PTL from 1st PTE seems a better way. >> >> Can we estimate the "cost" of holding the PTL? >> > > This is just moving PTL acquisition one or two PTE earlier in those corner > cases. In normal cases, it doesn't affect when PTL is held. The mTHP may be mapped at the end of page table. In that case, the PTL will be held longer. Or am I missing something? -- Best Regards, Huang, Ying > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold > PTL immediately. in corner cases, page_vma_mapped_walk races with break- > before-make, after skipping one or two PTEs whose states are transferring, > it will find a present pte then acquire lock. > >> -- >> Cheers, >> >> David / dhildenb > > Thanks > Barry
On 04/03/2024 21:57, Barry Song wrote: > On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi Barry, >> >> On 04/03/2024 10:37, Barry Song wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> page_vma_mapped_walk() within try_to_unmap_one() races with other >>> PTEs modification such as break-before-make, while iterating PTEs >>> of a large folio, it will only begin to acquire PTL after it gets >>> a valid(present) PTE. break-before-make intermediately sets PTEs >>> to pte_none. Thus, a large folio's PTEs might be partially skipped >>> in try_to_unmap_one(). >> >> I just want to check my understanding here - I think the problem occurs for >> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now >> that I've had a look at the code and have a better understanding, I think that >> must be the case? And therefore this problem exists independently of my work to >> support swap-out of mTHP? (From your previous report I was under the impression >> that it only affected mTHP). > > I think this affects all large folios with PTEs entries more than 1. but hugeTLB > is handled as a whole in try_to_unmap_one and its rmap is removed all > together, i feel hugeTLB doesn't have this problem. > >> >> Its just that the problem is becoming more pronounced because with mTHP, >> PTE-mapped large folios are much more common? > > right. as now large folios become a more common case, and it is my case > running in millions of phones. > > BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim > all PTEs all together rather than iterating PTEs one by one. This will improve > performance. for example, a batched > set_ptes_to_swap_entries() > { > } > then we only need to loop once for a large folio, right now we are looping > nr_pages times. You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes things a bit easier too, whereas in the regular mm, they are now a variable size. David and I introduced folio_pte_batch() to help gather batches of ptes, and it uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding swap_pte_batch() which does a similar thing for swap entry batching in v4 of my swap-out series. For your set_ptes_to_swap_entries() example, I'm not sure what it would do other than loop over the PTEs setting an incremented swap entry to each one? How is that more performant? Thanks, Ryan
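As a rough illustration of the batching Ryan describes (the exact folio_pte_batch() signature has changed across kernel versions, and addr/start_ptep/max_nr are assumed to come from the caller's PTE walk, so treat this as a sketch rather than the real interface):

	/* peek at the first PTE of the run, then ask how many consecutive
	 * PTEs map consecutive pages of this folio */
	ptent = ptep_get(start_ptep);
	nr = folio_pte_batch(folio, addr, start_ptep, ptent, max_nr,
			     FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY, NULL);
	if (nr == folio_nr_pages(folio)) {
		/* the whole large folio is mapped contiguously here, so the
		 * caller can act on all nr PTEs as one batch */
	}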
On Tue, Mar 5, 2024 at 8:30 PM Huang, Ying <ying.huang@intel.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > > From: Barry Song <v-songbaohua@oppo.com> > > > > page_vma_mapped_walk() within try_to_unmap_one() races with other > > PTEs modification such as break-before-make, while iterating PTEs > > Sorry, I don't know what is "break-before-make", can you elaborate? > IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear PTE > temporarily, which may cause race with page_vma_mapped_walk(). Is that > the issue that you try to fix? We are writing the PTE to zero (break) before writing a new value (make). While that sequence runs under the PTL in another thread, page_vma_mapped_walk() in the try_to_unmap_one() thread won't take the PTL until it meets a present PTE. For example, if another thread is modifying nr_pages PTEs under the PTL while we don't hold it, we might skip one or two PTEs at the beginning of a large folio. So after try_to_unmap_one() we may end up with PTE0 and PTE1 untouched but PTE2 ~ PTE(nr_pages - 1) set to swap entries. By holding the PTL from PTE0 for large folios, we never observe these intermediate values: at the moment we get the PTL, the other thread has already finished its update. (A rough sketch of this window follows this message.) > > -- > Best Regards, > Huang, Ying > > > of a large folio, it will only begin to acquire PTL after it gets > > a valid(present) PTE. break-before-make intermediately sets PTEs > > to pte_none. Thus, a large folio's PTEs might be partially skipped > > in try_to_unmap_one(). > > For example, for an anon folio, after try_to_unmap_one(), we may > > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > > So folio will be still mapped, the folio fails to be reclaimed. > > What’s even more worrying is, its PTEs are no longer in a unified > > state. This might lead to accident folio_split() afterwards. And > > since a part of PTEs are now swap entries, accessing them will > > incur page fault - do_swap_page. > > It creates both anxiety and more expense. While we can't avoid > > userspace's unmap to break up unified PTEs such as CONT-PTE for > > a large folio, we can indeed keep away from kernel's breaking up > > them due to its code design. > > This patch is holding PTL from PTE0, thus, the folio will either > > be entirely reclaimed or entirely kept. On the other hand, this > > approach doesn't increase PTL contention. Even w/o the patch, > > page_vma_mapped_walk() will always get PTL after it sometimes > > skips one or two PTEs because intermediate break-before-makes > > are short, according to test. Of course, even w/o this patch, > > the vast majority of try_to_unmap_one still can get PTL from > > PTE0. This patch makes the number 100%. > > The other option is that we can give up in try_to_unmap_one > > once we find PTE0 is not the first entry we get PTL, we call > > page_vma_mapped_walk_done() to end the iteration at this case. > > This will keep the unified PTEs while the folio isn't reclaimed. > > The result is quite similar with small folios with one PTE - > > either entirely reclaimed or entirely kept. > > Reclaiming large folios by holding PTL from PTE0 seems a better > > option comparing to giving up after detecting PTL begins from > > non-PTE0. 
> > > > Cc: Hugh Dickins <hughd@google.com> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > mm/vmscan.c | 11 +++++++++++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 0b888a2afa58..e4722fbbcd0c 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > if (folio_test_pmd_mappable(folio)) > > flags |= TTU_SPLIT_HUGE_PMD; > > + /* > > + * if page table lock is not held from the first PTE of > > + * a large folio, some PTEs might be skipped because of > > + * races with break-before-make, for example, PTEs can > > + * be pte_none intermediately, thus one or more PTEs > > + * might be skipped in try_to_unmap_one, we might result > > + * in a large folio is partially mapped and partially > > + * unmapped after try_to_unmap > > + */ > > + if (folio_test_large(folio)) > > + flags |= TTU_SYNC; > > > > try_to_unmap(folio, flags); > > if (folio_mapped(folio)) { Thanks Barry
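A minimal sketch of the transient window Barry describes, assuming a generic clear-then-write PTE update (ptep_get_and_clear()/set_pte_at()); this is an illustration, not the code of any particular caller:

	/* Thread A: rewrites one PTE of the large folio while holding the PTL */
	spin_lock(pte_lockptr(mm, pmd));
	old = ptep_get_and_clear(mm, addr, ptep);	/* PTE is transiently pte_none() */
	new = pte_modify(old, newprot);
	set_pte_at(mm, addr, ptep, new);		/* PTE is present again */
	spin_unlock(pte_lockptr(mm, pmd));

	/*
	 * Thread B: page_vma_mapped_walk() called from try_to_unmap_one()
	 * without PVMW_SYNC peeks at PTEs locklessly and only takes the PTL
	 * at the first entry that is not pte_none(). If the peek lands in
	 * the window above, PTE0 (and perhaps PTE1) are skipped and the
	 * folio ends up only partially unmapped.
	 */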
On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote: > >> > But we did "resolve" those bugs by entirely untouching all PTEs if we > >> > found some PTEs were skipped in try_to_unmap_one [1]. > >> > > >> > While we find we only get the PTL from 2nd, 3rd but not > >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave > >> > all PTEs untouched. > >> > > >> > /* we are not starting from head */ > >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { > >> > ret = false; > >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); > >> > set_pte_at(mm, address, pvmw.pte, pteval); > >> > page_vma_mapped_walk_done(&pvmw); > >> > break; > >> > } > >> > This will ensure all PTEs still have a unified state such as CONT-PTE > >> > after try_to_unmap fails. > >> > I feel this could have some false postive because when racing > >> > with unmap, 1st PTE might really become pte_none. So explicitly > >> > holding PTL from 1st PTE seems a better way. > >> > >> Can we estimate the "cost" of holding the PTL? > >> > > > > This is just moving PTL acquisition one or two PTE earlier in those corner > > cases. In normal cases, it doesn't affect when PTL is held. > > The mTHP may be mapped at the end of page table. In that case, the PTL > will be held longer. Or am I missing something? No. This patch doesn't change when we release the PTL, only when we acquire it. When the original code iterates over a large folio's nr_pages PTEs, it skips invalid PTEs and only acquires the PTL once it meets a valid one. So if it observes intermediate PTE values that other threads are modifying, it might skip PTE0, or sometimes PTE0 and PTE1 according to my tests; by the time it reaches PTE2, the other threads have likely written the new value, so we take the PTL there and iterate to the end of the large folio. The proposal is to take the PTL directly at PTE0, so we never see intermediate values for the head of the nr_pages PTEs. This ensures a large folio is either completely unmapped or completely kept mapped, never partially mapped and partially unmapped. > > -- > Best Regards, > Huang, Ying > > > > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold > > PTL immediately. in corner cases, page_vma_mapped_walk races with break- > > before-make, after skipping one or two PTEs whose states are transferring, > > it will find a present pte then acquire lock. > > > >> -- > >> Cheers, > >> > >> David / dhildenb > > Thanks > Barry
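For reference, this is roughly how TTU_SYNC gives the "PTL from the first PTE" behaviour; it is paraphrased from try_to_unmap_one() and page_vma_mapped_walk(), so the exact code in any given kernel version will differ:

	/* try_to_unmap_one(): forward the flag to the page table walker */
	if (flags & TTU_SYNC)
		pvmw.flags = PVMW_SYNC;

	/* page_vma_mapped_walk(), simplified: */
	if (pvmw->flags & PVMW_SYNC)
		/* map the first PTE with the PTL already held */
		pvmw->pte = pte_offset_map_lock(mm, pvmw->pmd, pvmw->address, &ptl);
	else
		/* lockless peek: pte_none() entries are skipped and the PTL
		 * is only taken once a PTE of interest is found */
		pvmw->pte = pte_offset_map_nolock(mm, pvmw->pmd, pvmw->address, &ptl);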
Barry Song <21cnbao@gmail.com> writes: > On Tue, Mar 5, 2024 at 8:30 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Barry Song <21cnbao@gmail.com> writes: >> >> > From: Barry Song <v-songbaohua@oppo.com> >> > >> > page_vma_mapped_walk() within try_to_unmap_one() races with other >> > PTEs modification such as break-before-make, while iterating PTEs >> >> Sorry, I don't know what is "break-before-make", can you elaborate? >> IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear PTE >> temporarily, which may cause race with page_vma_mapped_walk(). Is that >> the issue that you try to fix? > > we are writing pte to zero(break) before writing a new value(make). OK. Is break and make is commonly used terminology in kernel? If not, it's better to explain a little (e.g., ptep_get_and_clear() / modify / set_pte_at()). > while > this behavior is within PTL in another thread, page_vma_mapped_walk() > of try_to_unmap_one thread won't take PTL till it meets a present PTE. IIUC, !pte_none() should be enough? > for example, if another threads are modifying nr_pages PTEs under PTL, > but we don't hold PTL, we might skip one or two PTEs at the beginning of > a large folio. > For a large folio, after try_to_unmap_one(), we may result in PTE0 and PTE1 > untouched but PTE2~nr_pages-1 are set to swap entries. > > by holding PTL from PTE0 for large folios, we won't get these intermediate > values. At the moment we get PTL, other threads have done. Got it! Thanks! -- Best Regards, Huang, Ying >> >> -- >> Best Regards, >> Huang, Ying >> >> > of a large folio, it will only begin to acquire PTL after it gets >> > a valid(present) PTE. break-before-make intermediately sets PTEs >> > to pte_none. Thus, a large folio's PTEs might be partially skipped >> > in try_to_unmap_one(). >> > For example, for an anon folio, after try_to_unmap_one(), we may >> > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. >> > So folio will be still mapped, the folio fails to be reclaimed. >> > What’s even more worrying is, its PTEs are no longer in a unified >> > state. This might lead to accident folio_split() afterwards. And >> > since a part of PTEs are now swap entries, accessing them will >> > incur page fault - do_swap_page. >> > It creates both anxiety and more expense. While we can't avoid >> > userspace's unmap to break up unified PTEs such as CONT-PTE for >> > a large folio, we can indeed keep away from kernel's breaking up >> > them due to its code design. >> > This patch is holding PTL from PTE0, thus, the folio will either >> > be entirely reclaimed or entirely kept. On the other hand, this >> > approach doesn't increase PTL contention. Even w/o the patch, >> > page_vma_mapped_walk() will always get PTL after it sometimes >> > skips one or two PTEs because intermediate break-before-makes >> > are short, according to test. Of course, even w/o this patch, >> > the vast majority of try_to_unmap_one still can get PTL from >> > PTE0. This patch makes the number 100%. >> > The other option is that we can give up in try_to_unmap_one >> > once we find PTE0 is not the first entry we get PTL, we call >> > page_vma_mapped_walk_done() to end the iteration at this case. >> > This will keep the unified PTEs while the folio isn't reclaimed. >> > The result is quite similar with small folios with one PTE - >> > either entirely reclaimed or entirely kept. >> > Reclaiming large folios by holding PTL from PTE0 seems a better >> > option comparing to giving up after detecting PTL begins from >> > non-PTE0. 
>> > >> > Cc: Hugh Dickins <hughd@google.com> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> >> > --- >> > mm/vmscan.c | 11 +++++++++++ >> > 1 file changed, 11 insertions(+) >> > >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >> > index 0b888a2afa58..e4722fbbcd0c 100644 >> > --- a/mm/vmscan.c >> > +++ b/mm/vmscan.c >> > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, >> > >> > if (folio_test_pmd_mappable(folio)) >> > flags |= TTU_SPLIT_HUGE_PMD; >> > + /* >> > + * if page table lock is not held from the first PTE of >> > + * a large folio, some PTEs might be skipped because of >> > + * races with break-before-make, for example, PTEs can >> > + * be pte_none intermediately, thus one or more PTEs >> > + * might be skipped in try_to_unmap_one, we might result >> > + * in a large folio is partially mapped and partially >> > + * unmapped after try_to_unmap >> > + */ >> > + if (folio_test_large(folio)) >> > + flags |= TTU_SYNC; >> > >> > try_to_unmap(folio, flags); >> > if (folio_mapped(folio)) { > > Thanks > Barry
On 05/03/2024 08:56, Barry Song wrote: > are writing pte to zero(break) before writing a new value(make). while As an aside, "break-before-make" as defined in the Arm architecture would also require a TLBI, which usually isn't done for these write-0-modify-prots-write-back operations. Arm doesn't require "break-before-make" in these situations, so it's legal (as long as only certain bits are changed). To my understanding, the purpose of doing this is to avoid races with HW access/dirty flag updates: if the MMU wants to set either flag and finds the PTE is 0 (invalid), it will raise an exception, which will be queued waiting for the PTL. So I don't think you really mean break-before-make here. > this behavior is within PTL in another thread, page_vma_mapped_walk() > of try_to_unmap_one thread won't take PTL till it meets a present PTE. > for example, if another threads are modifying nr_pages PTEs under PTL, > but we don't hold PTL, we might skip one or two PTEs at the beginning of > a large folio. > For a large folio, after try_to_unmap_one(), we may result in PTE0 and PTE1 > untouched but PTE2~nr_pages-1 are set to swap entries. > > by holding PTL from PTE0 for large folios, we won't get these intermediate > values. At the moment we get PTL, other threads have done.
On Tue, Mar 5, 2024 at 9:54 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/03/2024 21:57, Barry Song wrote: > > On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Hi Barry, > >> > >> On 04/03/2024 10:37, Barry Song wrote: > >>> From: Barry Song <v-songbaohua@oppo.com> > >>> > >>> page_vma_mapped_walk() within try_to_unmap_one() races with other > >>> PTEs modification such as break-before-make, while iterating PTEs > >>> of a large folio, it will only begin to acquire PTL after it gets > >>> a valid(present) PTE. break-before-make intermediately sets PTEs > >>> to pte_none. Thus, a large folio's PTEs might be partially skipped > >>> in try_to_unmap_one(). > >> > >> I just want to check my understanding here - I think the problem occurs for > >> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > >> that I've had a look at the code and have a better understanding, I think that > >> must be the case? And therefore this problem exists independently of my work to > >> support swap-out of mTHP? (From your previous report I was under the impression > >> that it only affected mTHP). > > > > I think this affects all large folios with PTEs entries more than 1. but hugeTLB > > is handled as a whole in try_to_unmap_one and its rmap is removed all > > together, i feel hugeTLB doesn't have this problem. > > > >> > >> Its just that the problem is becoming more pronounced because with mTHP, > >> PTE-mapped large folios are much more common? > > > > right. as now large folios become a more common case, and it is my case > > running in millions of phones. > > > > BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim > > all PTEs all together rather than iterating PTEs one by one. This will improve > > performance. for example, a batched > > set_ptes_to_swap_entries() > > { > > } > > then we only need to loop once for a large folio, right now we are looping > > nr_pages times. > > You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch > implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes > things a bit easier too, whereas in the regular mm, they are now a variable size. > > David and I introduced folio_pte_batch() to help gather batches of ptes, and it > uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding > swap_pte_batch() which does a similar thing for swap entry batching in v4 of my > swap-out series. > > For your set_ptes_to_swap_entries() example, I'm not sure what it would do other > than loop over the PTEs setting an incremented swap entry to each one? How is > that more performant? right now, while (page_vma_mapped_walk(&pvmw)) will loop nr_pages for each PTE, if each PTE, we do lots of checks within the loop. by implementing set_ptes_to_swap_entries(), we can iterate once for page_vma_mapped_walk(), after folio_pte_batch() has confirmed the large folio is completely mapped, we set nr_pages swap entries all together. we are replacing for(i=0;i<nr_pages;i++) /* page_vma_mapped_walk */ { lots of checks; clear PTEn set PTEn to swap } by if (large folio && folio_pte_batch() == nr_pages) set_ptes_to_swap_entries(). > Thanks, Ryan
Barry Song <21cnbao@gmail.com> writes: > On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Barry Song <21cnbao@gmail.com> writes: >> >> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote: >> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we >> >> > found some PTEs were skipped in try_to_unmap_one [1]. >> >> > >> >> > While we find we only get the PTL from 2nd, 3rd but not >> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave >> >> > all PTEs untouched. >> >> > >> >> > /* we are not starting from head */ >> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { >> >> > ret = false; >> >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); >> >> > set_pte_at(mm, address, pvmw.pte, pteval); >> >> > page_vma_mapped_walk_done(&pvmw); >> >> > break; >> >> > } >> >> > This will ensure all PTEs still have a unified state such as CONT-PTE >> >> > after try_to_unmap fails. >> >> > I feel this could have some false postive because when racing >> >> > with unmap, 1st PTE might really become pte_none. So explicitly >> >> > holding PTL from 1st PTE seems a better way. >> >> >> >> Can we estimate the "cost" of holding the PTL? >> >> >> > >> > This is just moving PTL acquisition one or two PTE earlier in those corner >> > cases. In normal cases, it doesn't affect when PTL is held. >> >> The mTHP may be mapped at the end of page table. In that case, the PTL >> will be held longer. Or am I missing something? > > no. this patch doesn't change when we release PTL but change when we > get PTL. > > when the original code iterates nr_pages PTEs in a large folio, it will skip > invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets > intermediate PTE values some other threads are modifying, it might > skip PTE0, or sometimes PTE0 and PTE1 according to my test. but > arriving at PTE2, likely other threads have written a new value, so we > will begin to hold PTL and iterate till the end of the large folio. Is there any guarantee that the mTHP will always be mapped at the beginning of the page table (PTE0)? IIUC, mTHP can be mapped at PTE496. If so, with your patch, PTL will be held from PTE0 instead of PTE496 in some cases. -- Best Regards, Huang, Ying > The proposal is that we directly get PTL from PTE0, thus we don't get > intermediate values for the head of nr_pages PTEs. this will ensure > a large folio is either completely unmapped or completely mapped. > but not partially mapped and partially unmapped. > >> >> -- >> Best Regards, >> Huang, Ying >> >> >> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold >> > PTL immediately. in corner cases, page_vma_mapped_walk races with break- >> > before-make, after skipping one or two PTEs whose states are transferring, >> > it will find a present pte then acquire lock. >> > >> >> -- >> >> Cheers, >> >> >> >> David / dhildenb >> > > Thanks > Barry
On 05/03/2024 09:08, Barry Song wrote: > On Tue, Mar 5, 2024 at 9:54 PM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 04/03/2024 21:57, Barry Song wrote: >>> On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> Hi Barry, >>>> >>>> On 04/03/2024 10:37, Barry Song wrote: >>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>> >>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other >>>>> PTEs modification such as break-before-make, while iterating PTEs >>>>> of a large folio, it will only begin to acquire PTL after it gets >>>>> a valid(present) PTE. break-before-make intermediately sets PTEs >>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped >>>>> in try_to_unmap_one(). >>>> >>>> I just want to check my understanding here - I think the problem occurs for >>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now >>>> that I've had a look at the code and have a better understanding, I think that >>>> must be the case? And therefore this problem exists independently of my work to >>>> support swap-out of mTHP? (From your previous report I was under the impression >>>> that it only affected mTHP). >>> >>> I think this affects all large folios with PTEs entries more than 1. but hugeTLB >>> is handled as a whole in try_to_unmap_one and its rmap is removed all >>> together, i feel hugeTLB doesn't have this problem. >>> >>>> >>>> Its just that the problem is becoming more pronounced because with mTHP, >>>> PTE-mapped large folios are much more common? >>> >>> right. as now large folios become a more common case, and it is my case >>> running in millions of phones. >>> >>> BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim >>> all PTEs all together rather than iterating PTEs one by one. This will improve >>> performance. for example, a batched >>> set_ptes_to_swap_entries() >>> { >>> } >>> then we only need to loop once for a large folio, right now we are looping >>> nr_pages times. >> >> You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch >> implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes >> things a bit easier too, whereas in the regular mm, they are now a variable size. >> >> David and I introduced folio_pte_batch() to help gather batches of ptes, and it >> uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding >> swap_pte_batch() which does a similar thing for swap entry batching in v4 of my >> swap-out series. >> >> For your set_ptes_to_swap_entries() example, I'm not sure what it would do other >> than loop over the PTEs setting an incremented swap entry to each one? How is >> that more performant? > > right now, while (page_vma_mapped_walk(&pvmw)) will loop nr_pages for each > PTE, if each PTE, we do lots of checks within the loop. > > by implementing set_ptes_to_swap_entries(), we can iterate once for > page_vma_mapped_walk(), after folio_pte_batch() has confirmed > the large folio is completely mapped, we set nr_pages swap entries > all together. > > we are replacing > > for(i=0;i<nr_pages;i++) /* page_vma_mapped_walk */ > { > lots of checks; > clear PTEn > set PTEn to swap > } OK so you are effectively hoisting "lots of checks" out of the loop? > > by > > if (large folio && folio_pte_batch() == nr_pages) > set_ptes_to_swap_entries(). > >> > > Thanks, > Ryan
On Tue, Mar 5, 2024 at 10:08 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 05/03/2024 08:56, Barry Song wrote: > > are writing pte to zero(break) before writing a new value(make). while > > As an aside, "break-before-make" as defined in the Arm architecture would also > require a TLBI, which usually isn't done for these > write-0-modify-prots-write-back operations. Arm doesn't require > "break-before-make" in these situations so its legal (as long as only certain > bits are changed). To my understanding purpose of doing this is to avoid races > with HW access/dirty flag updates; if the MMU wants to set either flag and finds > the PTE is 0 (invalid) it will cause an exception which will be queued waiting > for the PTL. > > So I don't think you really mean break-before-make here. I agree, I used a stronger term; I will change it to something lighter in v2. > > > this behavior is within PTL in another thread, page_vma_mapped_walk() > > of try_to_unmap_one thread won't take PTL till it meets a present PTE. > > for example, if another threads are modifying nr_pages PTEs under PTL, > > but we don't hold PTL, we might skip one or two PTEs at the beginning of > > a large folio. > > For a large folio, after try_to_unmap_one(), we may result in PTE0 and PTE1 > > untouched but PTE2~nr_pages-1 are set to swap entries. > > > > by holding PTL from PTE0 for large folios, we won't get these intermediate > > values. At the moment we get PTL, other threads have done. > Thanks Barry
On Tue, Mar 5, 2024 at 10:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 05/03/2024 09:08, Barry Song wrote: > > On Tue, Mar 5, 2024 at 9:54 PM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 04/03/2024 21:57, Barry Song wrote: > >>> On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>> > >>>> Hi Barry, > >>>> > >>>> On 04/03/2024 10:37, Barry Song wrote: > >>>>> From: Barry Song <v-songbaohua@oppo.com> > >>>>> > >>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other > >>>>> PTEs modification such as break-before-make, while iterating PTEs > >>>>> of a large folio, it will only begin to acquire PTL after it gets > >>>>> a valid(present) PTE. break-before-make intermediately sets PTEs > >>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped > >>>>> in try_to_unmap_one(). > >>>> > >>>> I just want to check my understanding here - I think the problem occurs for > >>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now > >>>> that I've had a look at the code and have a better understanding, I think that > >>>> must be the case? And therefore this problem exists independently of my work to > >>>> support swap-out of mTHP? (From your previous report I was under the impression > >>>> that it only affected mTHP). > >>> > >>> I think this affects all large folios with PTEs entries more than 1. but hugeTLB > >>> is handled as a whole in try_to_unmap_one and its rmap is removed all > >>> together, i feel hugeTLB doesn't have this problem. > >>> > >>>> > >>>> Its just that the problem is becoming more pronounced because with mTHP, > >>>> PTE-mapped large folios are much more common? > >>> > >>> right. as now large folios become a more common case, and it is my case > >>> running in millions of phones. > >>> > >>> BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim > >>> all PTEs all together rather than iterating PTEs one by one. This will improve > >>> performance. for example, a batched > >>> set_ptes_to_swap_entries() > >>> { > >>> } > >>> then we only need to loop once for a large folio, right now we are looping > >>> nr_pages times. > >> > >> You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch > >> implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes > >> things a bit easier too, whereas in the regular mm, they are now a variable size. > >> > >> David and I introduced folio_pte_batch() to help gather batches of ptes, and it > >> uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding > >> swap_pte_batch() which does a similar thing for swap entry batching in v4 of my > >> swap-out series. > >> > >> For your set_ptes_to_swap_entries() example, I'm not sure what it would do other > >> than loop over the PTEs setting an incremented swap entry to each one? How is > >> that more performant? > > > > right now, while (page_vma_mapped_walk(&pvmw)) will loop nr_pages for each > > PTE, if each PTE, we do lots of checks within the loop. > > > > by implementing set_ptes_to_swap_entries(), we can iterate once for > > page_vma_mapped_walk(), after folio_pte_batch() has confirmed > > the large folio is completely mapped, we set nr_pages swap entries > > all together. > > > > we are replacing > > > > for(i=0;i<nr_pages;i++) /* page_vma_mapped_walk */ > > { > > lots of checks; > > clear PTEn > > set PTEn to swap > > } > > OK so you are effectively hoisting "lots of checks" out of the loop? no. 
page_vma_mapped_walk() returns nr_pages times, and we repeat the same checks each time; each time we also do a TLBI and set one PTE. > > > > > by > > > > if (large folio && folio_pte_batch() == nr_pages) > > set_ptes_to_swap_entries(). For this, we do the checks only once, and we issue far fewer TLBIs. A rough sketch of such a helper follows this message. > > > >> > > > > Thanks, > > Ryan Thanks Barry
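A rough sketch of what such a helper could look like. set_ptes_to_swap_entries() does not exist in the kernel; it is the hypothetical batched helper discussed above, and this sketch assumes the caller has already cleared the old PTEs, flushed the TLB once for the whole range, and handled dirty/soft-dirty/uffd-wp propagation:

	static void set_ptes_to_swap_entries(struct vm_area_struct *vma,
					     unsigned long addr, pte_t *ptep,
					     swp_entry_t entry, unsigned int nr)
	{
		unsigned int i;

		for (i = 0; i < nr; i++) {
			/* consecutive subpages of the folio get consecutive swap slots */
			set_pte_at(vma->vm_mm, addr, ptep, swp_entry_to_pte(entry));

			entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
			addr += PAGE_SIZE;
			ptep++;
		}
	}

The per-PTE store is of course still there, but the page_vma_mapped_walk() bookkeeping, the per-PTE checks and the TLB flushing would run once per folio rather than once per PTE.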
On Tue, Mar 5, 2024 at 10:12 PM Huang, Ying <ying.huang@intel.com> wrote: > > Barry Song <21cnbao@gmail.com> writes: > > > On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Barry Song <21cnbao@gmail.com> writes: > >> > >> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote: > >> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we > >> >> > found some PTEs were skipped in try_to_unmap_one [1]. > >> >> > > >> >> > While we find we only get the PTL from 2nd, 3rd but not > >> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave > >> >> > all PTEs untouched. > >> >> > > >> >> > /* we are not starting from head */ > >> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { > >> >> > ret = false; > >> >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); > >> >> > set_pte_at(mm, address, pvmw.pte, pteval); > >> >> > page_vma_mapped_walk_done(&pvmw); > >> >> > break; > >> >> > } > >> >> > This will ensure all PTEs still have a unified state such as CONT-PTE > >> >> > after try_to_unmap fails. > >> >> > I feel this could have some false postive because when racing > >> >> > with unmap, 1st PTE might really become pte_none. So explicitly > >> >> > holding PTL from 1st PTE seems a better way. > >> >> > >> >> Can we estimate the "cost" of holding the PTL? > >> >> > >> > > >> > This is just moving PTL acquisition one or two PTE earlier in those corner > >> > cases. In normal cases, it doesn't affect when PTL is held. > >> > >> The mTHP may be mapped at the end of page table. In that case, the PTL > >> will be held longer. Or am I missing something? > > > > no. this patch doesn't change when we release PTL but change when we > > get PTL. > > > > when the original code iterates nr_pages PTEs in a large folio, it will skip > > invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets > > intermediate PTE values some other threads are modifying, it might > > skip PTE0, or sometimes PTE0 and PTE1 according to my test. but > > arriving at PTE2, likely other threads have written a new value, so we > > will begin to hold PTL and iterate till the end of the large folio. > > Is there any guarantee that the mTHP will always be mapped at the > beginning of the page table (PTE0)? IIUC, mTHP can be mapped at PTE496. > If so, with your patch, PTL will be held from PTE0 instead of PTE496 in > some cases. I agree. but in another discussion[1], the plan is if we find a large folio has been deferred split, we split it before try_to_unmap and pageout. otherwise, we may result in lots of redundant I/O, because PTE0-495 will still be pageout()-ed. [1] https://lore.kernel.org/linux-mm/a4a9054f-2040-4f70-8d10-a5af4972e5aa@arm.com/ > > -- > Best Regards, > Huang, Ying > > > The proposal is that we directly get PTL from PTE0, thus we don't get > > intermediate values for the head of nr_pages PTEs. this will ensure > > a large folio is either completely unmapped or completely mapped. > > but not partially mapped and partially unmapped. > > > >> > >> -- > >> Best Regards, > >> Huang, Ying > >> > >> > >> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold > >> > PTL immediately. in corner cases, page_vma_mapped_walk races with break- > >> > before-make, after skipping one or two PTEs whose states are transferring, > >> > it will find a present pte then acquire lock. > >> > > >> >> -- > >> >> Cheers, > >> >> > >> >> David / dhildenb > >> > Thanks Barry
On Tue, Mar 5, 2024 at 10:21 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 10:12 PM Huang, Ying <ying.huang@intel.com> wrote: > > > > Barry Song <21cnbao@gmail.com> writes: > > > > > On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote: > > >> > > >> Barry Song <21cnbao@gmail.com> writes: > > >> > > >> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote: > > >> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we > > >> >> > found some PTEs were skipped in try_to_unmap_one [1]. > > >> >> > > > >> >> > While we find we only get the PTL from 2nd, 3rd but not > > >> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave > > >> >> > all PTEs untouched. > > >> >> > > > >> >> > /* we are not starting from head */ > > >> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) { > > >> >> > ret = false; > > >> >> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head); > > >> >> > set_pte_at(mm, address, pvmw.pte, pteval); > > >> >> > page_vma_mapped_walk_done(&pvmw); > > >> >> > break; > > >> >> > } > > >> >> > This will ensure all PTEs still have a unified state such as CONT-PTE > > >> >> > after try_to_unmap fails. > > >> >> > I feel this could have some false postive because when racing > > >> >> > with unmap, 1st PTE might really become pte_none. So explicitly > > >> >> > holding PTL from 1st PTE seems a better way. > > >> >> > > >> >> Can we estimate the "cost" of holding the PTL? > > >> >> > > >> > > > >> > This is just moving PTL acquisition one or two PTE earlier in those corner > > >> > cases. In normal cases, it doesn't affect when PTL is held. > > >> > > >> The mTHP may be mapped at the end of page table. In that case, the PTL > > >> will be held longer. Or am I missing something? > > > > > > no. this patch doesn't change when we release PTL but change when we > > > get PTL. > > > > > > when the original code iterates nr_pages PTEs in a large folio, it will skip > > > invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets > > > intermediate PTE values some other threads are modifying, it might > > > skip PTE0, or sometimes PTE0 and PTE1 according to my test. but > > > arriving at PTE2, likely other threads have written a new value, so we > > > will begin to hold PTL and iterate till the end of the large folio. > > > > Is there any guarantee that the mTHP will always be mapped at the > > beginning of the page table (PTE0)? IIUC, mTHP can be mapped at PTE496. > > If so, with your patch, PTL will be held from PTE0 instead of PTE496 in > > some cases. > > I agree. but in another discussion[1], the plan is if we find a large folio has > been deferred split, we split it before try_to_unmap and pageout. otherwise, > we may result in lots of redundant I/O, because PTE0-495 will still be > pageout()-ed. > > [1] https://lore.kernel.org/linux-mm/a4a9054f-2040-4f70-8d10-a5af4972e5aa@arm.com/ I thought about this again, seems we can cope with it even w/o the above plan by: + if (folio_test_large(folio) && list_empty(&folio->_deferred_list)) + flags |= TTU_SYNC; if a folio has been deferred split, it seems no sense to have the optimization for the corner cases this patch wants to provide. Only while we know this folio is still entirely mapped, we have this optimization. This should have reduced the chance to be quite small though we still have a bit. 
> > > > > -- > > Best Regards, > > Huang, Ying > > > > > The proposal is that we directly get PTL from PTE0, thus we don't get > > > intermediate values for the head of nr_pages PTEs. this will ensure > > > a large folio is either completely unmapped or completely mapped. > > > but not partially mapped and partially unmapped. > > > > > >> Thanks Barry
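A sketch of how the refinement Barry floats in the previous message might fold into the original hunk. This is an untested illustration, and whether list_empty(&folio->_deferred_list) can be read safely at this point without extra locking is an assumption that review would need to confirm:

	if (folio_test_pmd_mappable(folio))
		flags |= TTU_SPLIT_HUGE_PMD;
	/*
	 * Only insist on taking the PTL from the first PTE when the large
	 * folio is believed to be fully mapped; once it sits on the
	 * deferred-split list it is already partially unmapped anyway.
	 */
	if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
		flags |= TTU_SYNC;

	try_to_unmap(folio, flags);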
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b888a2afa58..e4722fbbcd0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
+			/*
+			 * if page table lock is not held from the first PTE of
+			 * a large folio, some PTEs might be skipped because of
+			 * races with break-before-make, for example, PTEs can
+			 * be pte_none intermediately, thus one or more PTEs
+			 * might be skipped in try_to_unmap_one, we might result
+			 * in a large folio is partially mapped and partially
+			 * unmapped after try_to_unmap
+			 */
+			if (folio_test_large(folio))
+				flags |= TTU_SYNC;
 
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {