Message ID | 20220315104741.63071-2-david@redhat.com (mailing list archive)
---|---
State | New
Series | mm: COW fixes part 2: reliable GUP pins of anonymous pages
On 15.03.22 11:47, David Hildenbrand wrote:
> In case arch_unmap_one() fails, we already did a swap_duplicate(). Let's
> undo that properly via swap_free().
>
> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/rmap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6a1e8c7f6213..f825aeef61ca 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1625,6 +1625,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			break;
>  		}
>  		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
> +			swap_free(entry);
>  			set_pte_at(mm, address, pvmw.pte, pteval);
>  			ret = false;
>  			page_vma_mapped_walk_done(&pvmw);

Hi Khalid,

I'm a bit confused about the semantics of arch_unmap_one(); I hope you can
clarify.

See patch #11 in this series, where we can fail unmapping after
arch_unmap_one() succeeded. E.g.,

@@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			page_vma_mapped_walk_done(&pvmw);
 			break;
 		}
+		if (anon_exclusive &&
+		    page_try_share_anon_rmap(subpage)) {
+			swap_free(entry);
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+		/*
+		 * Note: We don't remember yet if the page was mapped
+		 * exclusively in the swap entry, so swapin code has
+		 * to re-determine that manually and might detect the
+		 * page as possibly shared, for example, if there are
+		 * other references on the page or if the page is under
+		 * writeback. We made sure that there are no GUP pins
+		 * on the page that would rely on it, so for GUP pins
+		 * this is fine.
+		 */
 		if (list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			if (list_empty(&mm->mmlist))

For now, I was under the impression that we don't have to undo anything after
arch_unmap_one() succeeded, because we seem to not do anything for the two
cases below. But looking into arch_unmap_one() and how it allocates stuff, I do
wonder what we would actually want to do here -- I'd assume we'd want to
trigger the del_tag_store() somehow?

arch_unmap_one() calls adi_save_tags(), which allocates memory.
adi_restore_tags()->del_tag_store() reverts that operation and ends up freeing
memory conditionally; however, it's only called from arch_do_swap_page().

Here is why I have to scratch my head:

a) arch_do_swap_page() is only called from do_swap_page(). We don't do
   anything similar for mm/swapfile.c:unuse_pte(), aren't we missing
   something?

b) try_to_migrate_one() does the arch_unmap_one(), but who will do the
   restore+free after migration succeeded or failed, aren't we missing
   something?

I assume that we won't be properly freeing the tag space in case of a), and
won't be properly restoring/migrating the tags in case of a) and b).

I'll send out v3 of this series today, and I'll keep ignoring arch_unmap_one()
for now, because I have no clue what to do and it looks incomplete already,
unless I am missing something important.
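For context, both hooks discussed here default to no-ops in generic code. The
fallbacks in include/linux/pgtable.h look roughly like the following at the
time of this thread (treat this as an approximation, not verbatim; sparc is
the only architecture overriding them, to save/restore ADI tags across
swap-out/swap-in):

```c
/* Roughly the generic fallbacks from include/linux/pgtable.h (~v5.17). */
#ifndef __HAVE_ARCH_DO_SWAP_PAGE
static inline void arch_do_swap_page(struct mm_struct *mm,
				     struct vm_area_struct *vma,
				     unsigned long addr,
				     pte_t pte, pte_t oldpte)
{
}
#endif

#ifndef __HAVE_ARCH_UNMAP_ONE
/* A negative return aborts unmapping this pte, as in the hunk above. */
static inline int arch_unmap_one(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 unsigned long addr, pte_t orig_pte)
{
	return 0;
}
#endif
```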
On 3/29/22 07:59, David Hildenbrand wrote:
> On 15.03.22 11:47, David Hildenbrand wrote:
>> In case arch_unmap_one() fails, we already did a swap_duplicate(). Let's
>> undo that properly via swap_free().
>>
>> [...]
>
> Hi Khalid,
>
> I'm a bit confused about the semantics of arch_unmap_one(); I hope you can
> clarify.
>
> See patch #11 in this series, where we can fail unmapping after
> arch_unmap_one() succeeded.
>
> [...]
>
> For now, I was under the impression that we don't have to undo anything after
> arch_unmap_one() succeeded, because we seem to not do anything for the two
> cases below. But looking into arch_unmap_one() and how it allocates stuff, I do
> wonder what we would actually want to do here -- I'd assume we'd want to
> trigger the del_tag_store() somehow?

Hi David,

Currently, once arch_unmap_one() completes successfully, we are at the point
of no return for this pte. It will be replaced by a swap pte soon thereafter.
Patch 11 adds another case where we may return without replacing the current
pte with a swap pte. For now, could you resolve this by moving the above code
block in patch 11 to before the call to arch_unmap_one()? That still leaves
open the issue of having the flexibility to undo what arch_unmap_one() does
for some other reason in the future. That will require coming up with a
properly architected way to do it.

> arch_unmap_one() calls adi_save_tags(), which allocates memory.
> adi_restore_tags()->del_tag_store() reverts that operation and ends up
> freeing memory conditionally; however, it's only called from
> arch_do_swap_page().
>
> Here is why I have to scratch my head:
>
> a) arch_do_swap_page() is only called from do_swap_page(). We don't do
>    anything similar for mm/swapfile.c:unuse_pte(), aren't we missing
>    something?

My understanding of this code path may be flawed, so do correct me if this
does not sound right. unuse_pte() is called when the user turns off swap on a
device. unuse_pte() is called by unuse_pte_range(), which swaps the page back
in from the swap device before calling unuse_pte(). Once the page is read back
in from swap, an access to the va for the page will ultimately result in a
call to __handle_mm_fault(), which in turn will call handle_pte_fault() to
insert a new pte for this mapping, and handle_pte_fault() will call
arch_do_swap_page(), which will restore the tags.

> b) try_to_migrate_one() does the arch_unmap_one(), but who will do the
>    restore+free after migration succeeded or failed, aren't we missing
>    something?

try_to_migrate_one() replaces the current pte with a migration pte after
calling arch_unmap_one(). This causes __handle_mm_fault() to be called when a
reference to the va covered by the migration pte is made. This will in turn
finally result in a call to arch_do_swap_page(), which restores the tags.

> I assume that we won't be properly freeing the tag space in case of a), and
> won't be properly restoring/migrating the tags in case of a) and b).

I think both cases are covered correctly, unless I am misunderstanding the
code.

--
Khalid
On 29.03.22 22:42, Khalid Aziz wrote:
> On 3/29/22 07:59, David Hildenbrand wrote:
>> [...]
>>
>> For now, I was under the impression that we don't have to undo anything after
>> arch_unmap_one() succeeded, because we seem to not do anything for the two
>> cases below. But looking into arch_unmap_one() and how it allocates stuff, I do
>> wonder what we would actually want to do here -- I'd assume we'd want to
>> trigger the del_tag_store() somehow?
>
> Hi David,

Hi,

thanks for your fast reply.

> Currently, once arch_unmap_one() completes successfully, we are at the point
> of no return for this pte. It will be replaced by a swap pte soon thereafter.
> Patch 11 adds another case where we may return without replacing the current
> pte with a swap pte. For now, could you resolve this by moving the above code
> block in patch 11 to before the call to arch_unmap_one()? That still leaves
> open the issue of having the flexibility to undo what arch_unmap_one() does
> for some other reason in the future. That will require coming up with a
> properly architected way to do it.

I really want clearing PG_anon_exclusive to be the last action, without
eventually having to set it again and overcomplicating PG_anon_exclusive/rmap
handling. Ideally, we'd have an "arch_remap_one()" that just reverts what
arch_unmap_one() did.

>> a) arch_do_swap_page() is only called from do_swap_page(). We don't do
>>    anything similar for mm/swapfile.c:unuse_pte(), aren't we missing
>>    something?
>
> My understanding of this code path may be flawed, so do correct me if this
> does not sound right. unuse_pte() is called when the user turns off swap on a
> device. [...] handle_pte_fault() will call arch_do_swap_page(), which will
> restore the tags.

unuse_pte() will replace a swap pte directly by a proper, present pte, just
like do_swap_page() would. You won't end up in do_swap_page() anymore and
arch_do_swap_page() won't be called, because there is no swap PTE anymore.

>> b) try_to_migrate_one() does the arch_unmap_one(), but who will do the
>>    restore+free after migration succeeded or failed, aren't we missing
>>    something?
>
> try_to_migrate_one() replaces the current pte with a migration pte after
> calling arch_unmap_one(). This causes __handle_mm_fault() to be called when a
> reference to the va covered by the migration pte is made. This will in turn
> finally result in a call to arch_do_swap_page(), which restores the tags.

Migration PTEs are restored via mm/migrate.c:remove_migration_ptes().
arch_do_swap_page() won't be called.

What you mention is if someone accesses the migration PTE while migration is
active and the migration PTEs have not been removed yet. While we'll end up in
do_swap_page(), we'll do a migration_entry_wait(), followed by an effective
immediate "return 0;". arch_do_swap_page() won't get called.

So in essence, I think this doesn't work as expected yet. In the best case we
don't immediately free memory. In the worst case we lose the tags.
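A condensed sketch makes this point concrete. The following is abridged from
mm/swapfile.c:unuse_pte() of roughly that era (locking, error handling, and
the ksm-copy case elided; the function is renamed to make clear it is an
illustration, not the verbatim code):

```c
/*
 * Abridged sketch of unuse_pte() (mm/swapfile.c, ~v5.17). The swap pte is
 * overwritten with a present pte right here, so no later swap-in fault --
 * and therefore no arch_do_swap_page() -- ever happens for this address.
 */
static int unuse_pte_sketch(struct vm_area_struct *vma, pte_t *pte,
			    unsigned long addr, swp_entry_t entry,
			    struct page *page)
{
	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
	get_page(page);
	/* Install a present pte directly, replacing the swap pte. */
	set_pte_at(vma->vm_mm, addr, pte,
		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
	page_add_anon_rmap(page, vma, addr, false);
	swap_free(entry);
	return 0;
}
```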
On 3/29/22 14:55, David Hildenbrand wrote:
> On 29.03.22 22:42, Khalid Aziz wrote:
>> Currently, once arch_unmap_one() completes successfully, we are at the point
>> of no return for this pte. [...] That will require coming up with a properly
>> architected way to do it.
>
> I really want clearing PG_anon_exclusive to be the last action, without
> eventually having to set it again and overcomplicating PG_anon_exclusive/rmap
> handling. Ideally, we'd have an "arch_remap_one()" that just reverts what
> arch_unmap_one() did.

Hi David,

arch_remap_one() sounds reasonable. Would you like to add that in your patch
series, or would you like me to create a separate patch for adding this on top
of your patch series?

> [...]
>
> So in essence, I think this doesn't work as expected yet. In the best case we
> don't immediately free memory. In the worst case we lose the tags.

I see what you mean. I can work on fixing these issues up.

Thanks,
Khalid
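For reference, a minimal sketch of what such a hook could look like -- the
name, signature, and placement are assumptions that simply mirror the existing
arch_unmap_one() fallback, not code from any tree -- together with how a
failure path in try_to_unmap_one() might then use it:

```c
/* Hypothetical undo hook; shape assumed, mirroring arch_unmap_one(). */
#ifndef __HAVE_ARCH_REMAP_ONE
static inline void arch_remap_one(struct mm_struct *mm,
				  struct vm_area_struct *vma,
				  unsigned long addr, pte_t orig_pte)
{
	/* Revert whatever arch_unmap_one() set aside, e.g. free saved tags. */
}
#endif

/*
 * Sketch: a failure after a successful arch_unmap_one() could then undo
 * it before restoring the original pte.
 */
	if (anon_exclusive && page_try_share_anon_rmap(subpage)) {
		arch_remap_one(mm, vma, address, pteval);
		swap_free(entry);
		set_pte_at(mm, address, pvmw.pte, pteval);
		ret = false;
		page_vma_mapped_walk_done(&pvmw);
		break;
	}
```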
> Hi David,

Hi,

> arch_remap_one() sounds reasonable. Would you like to add that in your patch
> series, or would you like me to create a separate patch for adding this on
> top of your patch series?

Let's handle it separately, because I think there is still a lot to be
clarified. I also feel like the terminology ("arch_unmap_one()") is a little
too generic, if we really only care about anonymous pages (do we?).

>> [...]
>
> I see what you mean. I can work on fixing these issues up.

Ideally, I think we'd handle migration differently: migrate the tags directly
instead of temporarily storing them.

I also wonder how to handle migration of THP (which uses migration PMDs) and
THP splitting (which also uses migration PTEs); maybe they don't apply on
sparc?

Further, I wonder if you have to handle zapping of swap/migration PTEs, and if
you'd also have to clean up and free any allocated tag memory. E.g.,
zap_pte_range() can simply zap swap and migration entries -- wouldn't that
leave the saved tags behind?

Last but not least, I do wonder if mremap() -- e.g., move_pte() -- and fork()
-- e.g., copy_pte_range() -- are handled properly, where we can end up
moving/copying swap+migration PTEs around.
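To make the zapping concern concrete, here is an abridged sketch of the
non-present-entry path in mm/memory.c:zap_pte_range() of roughly that era
(checks, rss-accounting details, and device-private handling elided; treat it
as an approximation, not verbatim kernel code):

```c
/*
 * Abridged sketch of zap_pte_range() handling a non-present pte
 * (mm/memory.c, ~v5.17). Nothing on this path calls back into the
 * architecture, so tag storage allocated earlier by arch_unmap_one()
 * would not be released here.
 */
entry = pte_to_swp_entry(ptent);
if (!non_swap_entry(entry))
	rss[MM_SWAPENTS]--;			/* plain swap entry */
if (unlikely(!free_swap_and_cache(entry)))
	print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
```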
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..f825aeef61ca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1625,6 +1625,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			break;
 		}
 		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+			swap_free(entry);
 			set_pte_at(mm, address, pvmw.pte, pteval);
 			ret = false;
 			page_vma_mapped_walk_done(&pvmw);