Message ID | 20240531104819.140218-1-21cnbao@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC] mm: swap: reuse exclusive folio directly instead of wp page faults | expand |
On 31.05.24 12:48, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > After swapping out, we perform a swap-in operation. If we first read > and then write, we encounter a major fault in do_swap_page for reading, > along with additional minor faults in do_wp_page for writing. However, > the latter appears to be unnecessary and inefficient. Instead, we can > directly reuse in do_swap_page and completely eliminate the need for > do_wp_page. > > This patch achieves that optimization specifically for exclusive folios. > The following microbenchmark demonstrates the significant reduction in > minor faults. > > #define DATA_SIZE (2UL * 1024 * 1024) > #define PAGE_SIZE (4UL * 1024) > > static void *read_write_data(char *addr) > { > char tmp; > > for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { > tmp = *(volatile char *)(addr + i); > *(volatile char *)(addr + i) = tmp; > } > } > > int main(int argc, char **argv) > { > struct rusage ru; > > char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, > MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > memset(addr, 0x11, DATA_SIZE); > > do { > long old_ru_minflt, old_ru_majflt; > long new_ru_minflt, new_ru_majflt; > > madvise(addr, DATA_SIZE, MADV_PAGEOUT); > > getrusage(RUSAGE_SELF, &ru); > old_ru_minflt = ru.ru_minflt; > old_ru_majflt = ru.ru_majflt; > > read_write_data(addr); > getrusage(RUSAGE_SELF, &ru); > new_ru_minflt = ru.ru_minflt; > new_ru_majflt = ru.ru_majflt; > > printf("minor faults:%ld major faults:%ld\n", > new_ru_minflt - old_ru_minflt, > new_ru_majflt - old_ru_majflt); > } while(0); > > return 0; > } > > w/o patch, > / # ~/a.out > minor faults:512 major faults:512 > > w/ patch, > / # ~/a.out > minor faults:0 major faults:512 > > Minor faults decrease to 0! > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > mm/memory.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index eef4e482c0c2..e1d2e339958e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > */ > if (!folio_test_ksm(folio) && > (exclusive || folio_ref_count(folio) == 1)) { > - if (vmf->flags & FAULT_FLAG_WRITE) { > - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > - vmf->flags &= ~FAULT_FLAG_WRITE; > + if (vma->vm_flags & VM_WRITE) { > + pte = pte_mkwrite(pte_mkdirty(pte), vma); > + if (vmf->flags & FAULT_FLAG_WRITE) > + vmf->flags &= ~FAULT_FLAG_WRITE; This implies, that even on a read fault, you would mark the pte dirty and it would have to be written back to swap if still in the swap cache and only read. That is controversial. What is less controversial is doing what mprotect() via change_pte_range()/can_change_pte_writable() would do: mark the PTE writable but not dirty. I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set.
On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@redhat.com> wrote: > > On 31.05.24 12:48, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > After swapping out, we perform a swap-in operation. If we first read > > and then write, we encounter a major fault in do_swap_page for reading, > > along with additional minor faults in do_wp_page for writing. However, > > the latter appears to be unnecessary and inefficient. Instead, we can > > directly reuse in do_swap_page and completely eliminate the need for > > do_wp_page. > > > > This patch achieves that optimization specifically for exclusive folios. > > The following microbenchmark demonstrates the significant reduction in > > minor faults. > > > > #define DATA_SIZE (2UL * 1024 * 1024) > > #define PAGE_SIZE (4UL * 1024) > > > > static void *read_write_data(char *addr) > > { > > char tmp; > > > > for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { > > tmp = *(volatile char *)(addr + i); > > *(volatile char *)(addr + i) = tmp; > > } > > } > > > > int main(int argc, char **argv) > > { > > struct rusage ru; > > > > char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, > > MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > > memset(addr, 0x11, DATA_SIZE); > > > > do { > > long old_ru_minflt, old_ru_majflt; > > long new_ru_minflt, new_ru_majflt; > > > > madvise(addr, DATA_SIZE, MADV_PAGEOUT); > > > > getrusage(RUSAGE_SELF, &ru); > > old_ru_minflt = ru.ru_minflt; > > old_ru_majflt = ru.ru_majflt; > > > > read_write_data(addr); > > getrusage(RUSAGE_SELF, &ru); > > new_ru_minflt = ru.ru_minflt; > > new_ru_majflt = ru.ru_majflt; > > > > printf("minor faults:%ld major faults:%ld\n", > > new_ru_minflt - old_ru_minflt, > > new_ru_majflt - old_ru_majflt); > > } while(0); > > > > return 0; > > } > > > > w/o patch, > > / # ~/a.out > > minor faults:512 major faults:512 > > > > w/ patch, > > / # ~/a.out > > minor faults:0 major faults:512 > > > > Minor faults decrease to 0! > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > mm/memory.c | 7 ++++--- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/mm/memory.c b/mm/memory.c > > index eef4e482c0c2..e1d2e339958e 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > */ > > if (!folio_test_ksm(folio) && > > (exclusive || folio_ref_count(folio) == 1)) { > > - if (vmf->flags & FAULT_FLAG_WRITE) { > > - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > - vmf->flags &= ~FAULT_FLAG_WRITE; > > + if (vma->vm_flags & VM_WRITE) { > > + pte = pte_mkwrite(pte_mkdirty(pte), vma); > > + if (vmf->flags & FAULT_FLAG_WRITE) > > + vmf->flags &= ~FAULT_FLAG_WRITE; > > This implies, that even on a read fault, you would mark the pte dirty > and it would have to be written back to swap if still in the swap cache > and only read. > > That is controversial. > > What is less controversial is doing what mprotect() via > change_pte_range()/can_change_pte_writable() would do: mark the PTE > writable but not dirty. > > I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set. Thanks! I assume you mean something as below? diff --git a/mm/memory.c b/mm/memory.c index eef4e482c0c2..dbf1ba8ccfd6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); pte = mk_pte(page, vma->vm_page_prot); + if (pte_swp_soft_dirty(vmf->orig_pte)) + pte = pte_mksoft_dirty(pte); + if (pte_swp_uffd_wp(vmf->orig_pte)) + pte = pte_mkuffd_wp(pte); /* * Same logic as in do_wp_page(); however, optimize for pages that are * certainly not shared either because we just allocated them without @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ if (!folio_test_ksm(folio) && (exclusive || folio_ref_count(folio) == 1)) { - if (vmf->flags & FAULT_FLAG_WRITE) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); - vmf->flags &= ~FAULT_FLAG_WRITE; + if (vma->vm_flags & VM_WRITE) { + if (vmf->flags & FAULT_FLAG_WRITE) { + pte = pte_mkwrite(pte_mkdirty(pte), vma); + vmf->flags &= ~FAULT_FLAG_WRITE; + } else if ((!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte)) + && !userfaultfd_pte_wp(vma, pte)) { + pte = pte_mkwrite(pte, vma); + } } rmap_flags |= RMAP_EXCLUSIVE; } folio_ref_add(folio, nr_pages - 1); flush_icache_pages(vma, page, nr_pages); - if (pte_swp_soft_dirty(vmf->orig_pte)) - pte = pte_mksoft_dirty(pte); - if (pte_swp_uffd_wp(vmf->orig_pte)) - pte = pte_mkuffd_wp(pte); vmf->orig_pte = pte_advance_pfn(pte, page_idx); /* ksm created a completely new copy */ > > -- > Cheers, > > David / dhildenb > Thanks Barry
On 31.05.24 13:55, Barry Song wrote: > On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 31.05.24 12:48, Barry Song wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> After swapping out, we perform a swap-in operation. If we first read >>> and then write, we encounter a major fault in do_swap_page for reading, >>> along with additional minor faults in do_wp_page for writing. However, >>> the latter appears to be unnecessary and inefficient. Instead, we can >>> directly reuse in do_swap_page and completely eliminate the need for >>> do_wp_page. >>> >>> This patch achieves that optimization specifically for exclusive folios. >>> The following microbenchmark demonstrates the significant reduction in >>> minor faults. >>> >>> #define DATA_SIZE (2UL * 1024 * 1024) >>> #define PAGE_SIZE (4UL * 1024) >>> >>> static void *read_write_data(char *addr) >>> { >>> char tmp; >>> >>> for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { >>> tmp = *(volatile char *)(addr + i); >>> *(volatile char *)(addr + i) = tmp; >>> } >>> } >>> >>> int main(int argc, char **argv) >>> { >>> struct rusage ru; >>> >>> char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, >>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>> memset(addr, 0x11, DATA_SIZE); >>> >>> do { >>> long old_ru_minflt, old_ru_majflt; >>> long new_ru_minflt, new_ru_majflt; >>> >>> madvise(addr, DATA_SIZE, MADV_PAGEOUT); >>> >>> getrusage(RUSAGE_SELF, &ru); >>> old_ru_minflt = ru.ru_minflt; >>> old_ru_majflt = ru.ru_majflt; >>> >>> read_write_data(addr); >>> getrusage(RUSAGE_SELF, &ru); >>> new_ru_minflt = ru.ru_minflt; >>> new_ru_majflt = ru.ru_majflt; >>> >>> printf("minor faults:%ld major faults:%ld\n", >>> new_ru_minflt - old_ru_minflt, >>> new_ru_majflt - old_ru_majflt); >>> } while(0); >>> >>> return 0; >>> } >>> >>> w/o patch, >>> / # ~/a.out >>> minor faults:512 major faults:512 >>> >>> w/ patch, >>> / # ~/a.out >>> minor faults:0 major faults:512 >>> >>> Minor faults decrease to 0! >>> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> --- >>> mm/memory.c | 7 ++++--- >>> 1 file changed, 4 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/memory.c b/mm/memory.c >>> index eef4e482c0c2..e1d2e339958e 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>> */ >>> if (!folio_test_ksm(folio) && >>> (exclusive || folio_ref_count(folio) == 1)) { >>> - if (vmf->flags & FAULT_FLAG_WRITE) { >>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); >>> - vmf->flags &= ~FAULT_FLAG_WRITE; >>> + if (vma->vm_flags & VM_WRITE) { >>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); >>> + if (vmf->flags & FAULT_FLAG_WRITE) >>> + vmf->flags &= ~FAULT_FLAG_WRITE; >> >> This implies, that even on a read fault, you would mark the pte dirty >> and it would have to be written back to swap if still in the swap cache >> and only read. >> >> That is controversial. >> >> What is less controversial is doing what mprotect() via >> change_pte_range()/can_change_pte_writable() would do: mark the PTE >> writable but not dirty. >> >> I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set. > > Thanks! > > I assume you mean something as below? It raises an important point: uffd-wp must be handled accordingly. > > diff --git a/mm/memory.c b/mm/memory.c > index eef4e482c0c2..dbf1ba8ccfd6 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); > pte = mk_pte(page, vma->vm_page_prot); > > + if (pte_swp_soft_dirty(vmf->orig_pte)) > + pte = pte_mksoft_dirty(pte); > + if (pte_swp_uffd_wp(vmf->orig_pte)) > + pte = pte_mkuffd_wp(pte); > /* > * Same logic as in do_wp_page(); however, optimize for pages that are > * certainly not shared either because we just allocated them without > @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > */ > if (!folio_test_ksm(folio) && > (exclusive || folio_ref_count(folio) == 1)) { > - if (vmf->flags & FAULT_FLAG_WRITE) { > - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > - vmf->flags &= ~FAULT_FLAG_WRITE; > + if (vma->vm_flags & VM_WRITE) { > + if (vmf->flags & FAULT_FLAG_WRITE) { > + pte = pte_mkwrite(pte_mkdirty(pte), vma); > + vmf->flags &= ~FAULT_FLAG_WRITE; > + } else if ((!vma_soft_dirty_enabled(vma) || > pte_soft_dirty(pte)) > + && !userfaultfd_pte_wp(vma, pte)) { > + pte = pte_mkwrite(pte, vma); Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a pte_mkwrite(pte). So we have to catch and handle that earlier (I could have sworn we handle that somehow). Note that the existing pte = pte_mkuffd_wp(pte); Will fix that up because it does an implicit pte_wrprotect(). So maybe what would work is if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && !vma_soft_dirty_enabled(vma)) { pte = pte_mkwrite(pte); /* Only set the PTE dirty on write fault. */ if (vmf->flags & FAULT_FLAG_WRITE) { pte = pte_mkdirty(pte); vmf->flags &= ~FAULT_FLAG_WRITE; } }
On Sat, Jun 1, 2024 at 12:10 AM David Hildenbrand <david@redhat.com> wrote: > > On 31.05.24 13:55, Barry Song wrote: > > On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 31.05.24 12:48, Barry Song wrote: > >>> From: Barry Song <v-songbaohua@oppo.com> > >>> > >>> After swapping out, we perform a swap-in operation. If we first read > >>> and then write, we encounter a major fault in do_swap_page for reading, > >>> along with additional minor faults in do_wp_page for writing. However, > >>> the latter appears to be unnecessary and inefficient. Instead, we can > >>> directly reuse in do_swap_page and completely eliminate the need for > >>> do_wp_page. > >>> > >>> This patch achieves that optimization specifically for exclusive folios. > >>> The following microbenchmark demonstrates the significant reduction in > >>> minor faults. > >>> > >>> #define DATA_SIZE (2UL * 1024 * 1024) > >>> #define PAGE_SIZE (4UL * 1024) > >>> > >>> static void *read_write_data(char *addr) > >>> { > >>> char tmp; > >>> > >>> for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { > >>> tmp = *(volatile char *)(addr + i); > >>> *(volatile char *)(addr + i) = tmp; > >>> } > >>> } > >>> > >>> int main(int argc, char **argv) > >>> { > >>> struct rusage ru; > >>> > >>> char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, > >>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > >>> memset(addr, 0x11, DATA_SIZE); > >>> > >>> do { > >>> long old_ru_minflt, old_ru_majflt; > >>> long new_ru_minflt, new_ru_majflt; > >>> > >>> madvise(addr, DATA_SIZE, MADV_PAGEOUT); > >>> > >>> getrusage(RUSAGE_SELF, &ru); > >>> old_ru_minflt = ru.ru_minflt; > >>> old_ru_majflt = ru.ru_majflt; > >>> > >>> read_write_data(addr); > >>> getrusage(RUSAGE_SELF, &ru); > >>> new_ru_minflt = ru.ru_minflt; > >>> new_ru_majflt = ru.ru_majflt; > >>> > >>> printf("minor faults:%ld major faults:%ld\n", > >>> new_ru_minflt - old_ru_minflt, > >>> new_ru_majflt - old_ru_majflt); > >>> } while(0); > >>> > >>> return 0; > >>> } > >>> > >>> w/o patch, > >>> / # ~/a.out > >>> minor faults:512 major faults:512 > >>> > >>> w/ patch, > >>> / # ~/a.out > >>> minor faults:0 major faults:512 > >>> > >>> Minor faults decrease to 0! > >>> > >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>> --- > >>> mm/memory.c | 7 ++++--- > >>> 1 file changed, 4 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/mm/memory.c b/mm/memory.c > >>> index eef4e482c0c2..e1d2e339958e 100644 > >>> --- a/mm/memory.c > >>> +++ b/mm/memory.c > >>> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>> */ > >>> if (!folio_test_ksm(folio) && > >>> (exclusive || folio_ref_count(folio) == 1)) { > >>> - if (vmf->flags & FAULT_FLAG_WRITE) { > >>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > >>> - vmf->flags &= ~FAULT_FLAG_WRITE; > >>> + if (vma->vm_flags & VM_WRITE) { > >>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); > >>> + if (vmf->flags & FAULT_FLAG_WRITE) > >>> + vmf->flags &= ~FAULT_FLAG_WRITE; > >> > >> This implies, that even on a read fault, you would mark the pte dirty > >> and it would have to be written back to swap if still in the swap cache > >> and only read. > >> > >> That is controversial. > >> > >> What is less controversial is doing what mprotect() via > >> change_pte_range()/can_change_pte_writable() would do: mark the PTE > >> writable but not dirty. > >> > >> I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set. > > > > Thanks! > > > > I assume you mean something as below? > > It raises an important point: uffd-wp must be handled accordingly. > > > > > diff --git a/mm/memory.c b/mm/memory.c > > index eef4e482c0c2..dbf1ba8ccfd6 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); > > pte = mk_pte(page, vma->vm_page_prot); > > > > + if (pte_swp_soft_dirty(vmf->orig_pte)) > > + pte = pte_mksoft_dirty(pte); > > + if (pte_swp_uffd_wp(vmf->orig_pte)) > > + pte = pte_mkuffd_wp(pte); > > /* > > * Same logic as in do_wp_page(); however, optimize for pages that are > > * certainly not shared either because we just allocated them without > > @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > */ > > if (!folio_test_ksm(folio) && > > (exclusive || folio_ref_count(folio) == 1)) { > > - if (vmf->flags & FAULT_FLAG_WRITE) { > > - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > - vmf->flags &= ~FAULT_FLAG_WRITE; > > + if (vma->vm_flags & VM_WRITE) { > > + if (vmf->flags & FAULT_FLAG_WRITE) { > > + pte = pte_mkwrite(pte_mkdirty(pte), vma); > > + vmf->flags &= ~FAULT_FLAG_WRITE; > > + } else if ((!vma_soft_dirty_enabled(vma) || > > pte_soft_dirty(pte)) > > + && !userfaultfd_pte_wp(vma, pte)) { > > + pte = pte_mkwrite(pte, vma); > > Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a > pte_mkwrite(pte). So we have to catch and handle that earlier (I could > have sworn we handle that somehow). > > Note that the existing > pte = pte_mkuffd_wp(pte); > > Will fix that up because it does an implicit pte_wrprotect(). This is exactly what I have missed as I am struggling with why WRITE_FAULT blindly does mkwrite without checking userfaultfd_pte_wp(). > > > So maybe what would work is > > > if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && > !vma_soft_dirty_enabled(vma)) { > pte = pte_mkwrite(pte); > > /* Only set the PTE dirty on write fault. */ > if (vmf->flags & FAULT_FLAG_WRITE) { > pte = pte_mkdirty(pte); > vmf->flags &= ~FAULT_FLAG_WRITE; > } > } > looks good! > -- > Cheers, > > David / dhildenb > Thanks Barry
On Sat, Jun 1, 2024 at 12:20 AM Barry Song <21cnbao@gmail.com> wrote: > > On Sat, Jun 1, 2024 at 12:10 AM David Hildenbrand <david@redhat.com> wrote: > > > > On 31.05.24 13:55, Barry Song wrote: > > > On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@redhat.com> wrote: > > >> > > >> On 31.05.24 12:48, Barry Song wrote: > > >>> From: Barry Song <v-songbaohua@oppo.com> > > >>> > > >>> After swapping out, we perform a swap-in operation. If we first read > > >>> and then write, we encounter a major fault in do_swap_page for reading, > > >>> along with additional minor faults in do_wp_page for writing. However, > > >>> the latter appears to be unnecessary and inefficient. Instead, we can > > >>> directly reuse in do_swap_page and completely eliminate the need for > > >>> do_wp_page. > > >>> > > >>> This patch achieves that optimization specifically for exclusive folios. > > >>> The following microbenchmark demonstrates the significant reduction in > > >>> minor faults. > > >>> > > >>> #define DATA_SIZE (2UL * 1024 * 1024) > > >>> #define PAGE_SIZE (4UL * 1024) > > >>> > > >>> static void *read_write_data(char *addr) > > >>> { > > >>> char tmp; > > >>> > > >>> for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { > > >>> tmp = *(volatile char *)(addr + i); > > >>> *(volatile char *)(addr + i) = tmp; > > >>> } > > >>> } > > >>> > > >>> int main(int argc, char **argv) > > >>> { > > >>> struct rusage ru; > > >>> > > >>> char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, > > >>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > > >>> memset(addr, 0x11, DATA_SIZE); > > >>> > > >>> do { > > >>> long old_ru_minflt, old_ru_majflt; > > >>> long new_ru_minflt, new_ru_majflt; > > >>> > > >>> madvise(addr, DATA_SIZE, MADV_PAGEOUT); > > >>> > > >>> getrusage(RUSAGE_SELF, &ru); > > >>> old_ru_minflt = ru.ru_minflt; > > >>> old_ru_majflt = ru.ru_majflt; > > >>> > > >>> read_write_data(addr); > > >>> getrusage(RUSAGE_SELF, &ru); > > >>> new_ru_minflt = ru.ru_minflt; > > >>> new_ru_majflt = ru.ru_majflt; > > >>> > > >>> printf("minor faults:%ld major faults:%ld\n", > > >>> new_ru_minflt - old_ru_minflt, > > >>> new_ru_majflt - old_ru_majflt); > > >>> } while(0); > > >>> > > >>> return 0; > > >>> } > > >>> > > >>> w/o patch, > > >>> / # ~/a.out > > >>> minor faults:512 major faults:512 > > >>> > > >>> w/ patch, > > >>> / # ~/a.out > > >>> minor faults:0 major faults:512 > > >>> > > >>> Minor faults decrease to 0! > > >>> > > >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > >>> --- > > >>> mm/memory.c | 7 ++++--- > > >>> 1 file changed, 4 insertions(+), 3 deletions(-) > > >>> > > >>> diff --git a/mm/memory.c b/mm/memory.c > > >>> index eef4e482c0c2..e1d2e339958e 100644 > > >>> --- a/mm/memory.c > > >>> +++ b/mm/memory.c > > >>> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > >>> */ > > >>> if (!folio_test_ksm(folio) && > > >>> (exclusive || folio_ref_count(folio) == 1)) { > > >>> - if (vmf->flags & FAULT_FLAG_WRITE) { > > >>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > >>> - vmf->flags &= ~FAULT_FLAG_WRITE; > > >>> + if (vma->vm_flags & VM_WRITE) { > > >>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); > > >>> + if (vmf->flags & FAULT_FLAG_WRITE) > > >>> + vmf->flags &= ~FAULT_FLAG_WRITE; > > >> > > >> This implies, that even on a read fault, you would mark the pte dirty > > >> and it would have to be written back to swap if still in the swap cache > > >> and only read. > > >> > > >> That is controversial. > > >> > > >> What is less controversial is doing what mprotect() via > > >> change_pte_range()/can_change_pte_writable() would do: mark the PTE > > >> writable but not dirty. > > >> > > >> I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set. > > > > > > Thanks! > > > > > > I assume you mean something as below? > > > > It raises an important point: uffd-wp must be handled accordingly. > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index eef4e482c0c2..dbf1ba8ccfd6 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); > > > pte = mk_pte(page, vma->vm_page_prot); > > > > > > + if (pte_swp_soft_dirty(vmf->orig_pte)) > > > + pte = pte_mksoft_dirty(pte); > > > + if (pte_swp_uffd_wp(vmf->orig_pte)) > > > + pte = pte_mkuffd_wp(pte); > > > /* > > > * Same logic as in do_wp_page(); however, optimize for pages that are > > > * certainly not shared either because we just allocated them without > > > @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > */ > > > if (!folio_test_ksm(folio) && > > > (exclusive || folio_ref_count(folio) == 1)) { > > > - if (vmf->flags & FAULT_FLAG_WRITE) { > > > - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > > > - vmf->flags &= ~FAULT_FLAG_WRITE; > > > + if (vma->vm_flags & VM_WRITE) { > > > + if (vmf->flags & FAULT_FLAG_WRITE) { > > > + pte = pte_mkwrite(pte_mkdirty(pte), vma); > > > + vmf->flags &= ~FAULT_FLAG_WRITE; > > > + } else if ((!vma_soft_dirty_enabled(vma) || > > > pte_soft_dirty(pte)) > > > + && !userfaultfd_pte_wp(vma, pte)) { > > > + pte = pte_mkwrite(pte, vma); > > > > Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a > > pte_mkwrite(pte). So we have to catch and handle that earlier (I could > > have sworn we handle that somehow). > > > > Note that the existing > > pte = pte_mkuffd_wp(pte); > > > > Will fix that up because it does an implicit pte_wrprotect(). > > This is exactly what I have missed as I am struggling with why WRITE_FAULT > blindly does mkwrite without checking userfaultfd_pte_wp(). > > > > > > > So maybe what would work is > > > > > > if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && > > !vma_soft_dirty_enabled(vma)) { > > pte = pte_mkwrite(pte); > > > > /* Only set the PTE dirty on write fault. */ > > if (vmf->flags & FAULT_FLAG_WRITE) { > > pte = pte_mkdirty(pte); > > vmf->flags &= ~FAULT_FLAG_WRITE; > > } WRITE_FAULT has a pte_mkdirty, so it doesn't need to check "!vma_soft_dirty_enabled(vma)"? Maybe I thought too much, just the simple code below should work? if (!folio_test_ksm(folio) && (exclusive || folio_ref_count(folio) == 1)) { if (vma->vm_flags & VM_WRITE) { if (vmf->flags & FAULT_FLAG_WRITE) { pte = pte_mkwrite(pte_mkdirty(pte), vma); vmf->flags &= ~FAULT_FLAG_WRITE; } else { pte = pte_mkwrite(pte, vma); } } rmap_flags |= RMAP_EXCLUSIVE; } if (pte_swp_soft_dirty(vmf->orig_pte)) pte = pte_mksoft_dirty(pte); if (pte_swp_uffd_wp(vmf->orig_pte)) pte = pte_mkuffd_wp(pte); This still uses the implicit wrprotect of pte_mkuffd_wp. > > } > > > > looks good! > > > -- > > Cheers, > > > > David / dhildenb > > > > Thanks > Barry
On 31.05.24 14:30, Barry Song wrote: > On Sat, Jun 1, 2024 at 12:20 AM Barry Song <21cnbao@gmail.com> wrote: >> >> On Sat, Jun 1, 2024 at 12:10 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 31.05.24 13:55, Barry Song wrote: >>>> On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 31.05.24 12:48, Barry Song wrote: >>>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>>> >>>>>> After swapping out, we perform a swap-in operation. If we first read >>>>>> and then write, we encounter a major fault in do_swap_page for reading, >>>>>> along with additional minor faults in do_wp_page for writing. However, >>>>>> the latter appears to be unnecessary and inefficient. Instead, we can >>>>>> directly reuse in do_swap_page and completely eliminate the need for >>>>>> do_wp_page. >>>>>> >>>>>> This patch achieves that optimization specifically for exclusive folios. >>>>>> The following microbenchmark demonstrates the significant reduction in >>>>>> minor faults. >>>>>> >>>>>> #define DATA_SIZE (2UL * 1024 * 1024) >>>>>> #define PAGE_SIZE (4UL * 1024) >>>>>> >>>>>> static void *read_write_data(char *addr) >>>>>> { >>>>>> char tmp; >>>>>> >>>>>> for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { >>>>>> tmp = *(volatile char *)(addr + i); >>>>>> *(volatile char *)(addr + i) = tmp; >>>>>> } >>>>>> } >>>>>> >>>>>> int main(int argc, char **argv) >>>>>> { >>>>>> struct rusage ru; >>>>>> >>>>>> char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, >>>>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>>>>> memset(addr, 0x11, DATA_SIZE); >>>>>> >>>>>> do { >>>>>> long old_ru_minflt, old_ru_majflt; >>>>>> long new_ru_minflt, new_ru_majflt; >>>>>> >>>>>> madvise(addr, DATA_SIZE, MADV_PAGEOUT); >>>>>> >>>>>> getrusage(RUSAGE_SELF, &ru); >>>>>> old_ru_minflt = ru.ru_minflt; >>>>>> old_ru_majflt = ru.ru_majflt; >>>>>> >>>>>> read_write_data(addr); >>>>>> getrusage(RUSAGE_SELF, &ru); >>>>>> new_ru_minflt = ru.ru_minflt; >>>>>> new_ru_majflt = ru.ru_majflt; >>>>>> >>>>>> printf("minor faults:%ld major faults:%ld\n", >>>>>> new_ru_minflt - old_ru_minflt, >>>>>> new_ru_majflt - old_ru_majflt); >>>>>> } while(0); >>>>>> >>>>>> return 0; >>>>>> } >>>>>> >>>>>> w/o patch, >>>>>> / # ~/a.out >>>>>> minor faults:512 major faults:512 >>>>>> >>>>>> w/ patch, >>>>>> / # ~/a.out >>>>>> minor faults:0 major faults:512 >>>>>> >>>>>> Minor faults decrease to 0! >>>>>> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>>>>> --- >>>>>> mm/memory.c | 7 ++++--- >>>>>> 1 file changed, 4 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>> index eef4e482c0c2..e1d2e339958e 100644 >>>>>> --- a/mm/memory.c >>>>>> +++ b/mm/memory.c >>>>>> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>> */ >>>>>> if (!folio_test_ksm(folio) && >>>>>> (exclusive || folio_ref_count(folio) == 1)) { >>>>>> - if (vmf->flags & FAULT_FLAG_WRITE) { >>>>>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); >>>>>> - vmf->flags &= ~FAULT_FLAG_WRITE; >>>>>> + if (vma->vm_flags & VM_WRITE) { >>>>>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); >>>>>> + if (vmf->flags & FAULT_FLAG_WRITE) >>>>>> + vmf->flags &= ~FAULT_FLAG_WRITE; >>>>> >>>>> This implies, that even on a read fault, you would mark the pte dirty >>>>> and it would have to be written back to swap if still in the swap cache >>>>> and only read. >>>>> >>>>> That is controversial. >>>>> >>>>> What is less controversial is doing what mprotect() via >>>>> change_pte_range()/can_change_pte_writable() would do: mark the PTE >>>>> writable but not dirty. >>>>> >>>>> I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set. >>>> >>>> Thanks! >>>> >>>> I assume you mean something as below? >>> >>> It raises an important point: uffd-wp must be handled accordingly. >>> >>>> >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index eef4e482c0c2..dbf1ba8ccfd6 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); >>>> pte = mk_pte(page, vma->vm_page_prot); >>>> >>>> + if (pte_swp_soft_dirty(vmf->orig_pte)) >>>> + pte = pte_mksoft_dirty(pte); >>>> + if (pte_swp_uffd_wp(vmf->orig_pte)) >>>> + pte = pte_mkuffd_wp(pte); >>>> /* >>>> * Same logic as in do_wp_page(); however, optimize for pages that are >>>> * certainly not shared either because we just allocated them without >>>> @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>> */ >>>> if (!folio_test_ksm(folio) && >>>> (exclusive || folio_ref_count(folio) == 1)) { >>>> - if (vmf->flags & FAULT_FLAG_WRITE) { >>>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); >>>> - vmf->flags &= ~FAULT_FLAG_WRITE; >>>> + if (vma->vm_flags & VM_WRITE) { >>>> + if (vmf->flags & FAULT_FLAG_WRITE) { >>>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); >>>> + vmf->flags &= ~FAULT_FLAG_WRITE; >>>> + } else if ((!vma_soft_dirty_enabled(vma) || >>>> pte_soft_dirty(pte)) >>>> + && !userfaultfd_pte_wp(vma, pte)) { >>>> + pte = pte_mkwrite(pte, vma); >>> >>> Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a >>> pte_mkwrite(pte). So we have to catch and handle that earlier (I could >>> have sworn we handle that somehow). >>> >>> Note that the existing >>> pte = pte_mkuffd_wp(pte); >>> >>> Will fix that up because it does an implicit pte_wrprotect(). >> >> This is exactly what I have missed as I am struggling with why WRITE_FAULT >> blindly does mkwrite without checking userfaultfd_pte_wp(). >> >>> >>> >>> So maybe what would work is >>> >>> >>> if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && >>> !vma_soft_dirty_enabled(vma)) { >>> pte = pte_mkwrite(pte); >>> >>> /* Only set the PTE dirty on write fault. */ >>> if (vmf->flags & FAULT_FLAG_WRITE) { >>> pte = pte_mkdirty(pte); >>> vmf->flags &= ~FAULT_FLAG_WRITE; >>> } > > WRITE_FAULT has a pte_mkdirty, so it doesn't need to check > "!vma_soft_dirty_enabled(vma)"? > Maybe I thought too much, just the simple code below should work? That would likely not handle softdirty correctly in case we end up in pte_mkwrite(pte, vma); note that pte_mksoft_dirty() will not wrprotect ... (note that we shouldn't optimize for softdirty handling) > > if (!folio_test_ksm(folio) && > (exclusive || folio_ref_count(folio) == 1)) { > if (vma->vm_flags & VM_WRITE) { > if (vmf->flags & FAULT_FLAG_WRITE) { > pte = pte_mkwrite(pte_mkdirty(pte), vma); > vmf->flags &= ~FAULT_FLAG_WRITE; > } else { > pte = pte_mkwrite(pte, vma); > } > } > rmap_flags |= RMAP_EXCLUSIVE; > } > > if (pte_swp_soft_dirty(vmf->orig_pte)) > pte = pte_mksoft_dirty(pte); > if (pte_swp_uffd_wp(vmf->orig_pte)) > pte = pte_mkuffd_wp(pte); > > This still uses the implicit wrprotect of pte_mkuffd_wp. But the wrprotected->writable->wrprotected path really is confusing. I'd prefer to set these bits ahead of time instead, so we can properly rely on them -- like we do in other code.
On Sat, Jun 1, 2024 at 12:35 AM David Hildenbrand <david@redhat.com> wrote: > > On 31.05.24 14:30, Barry Song wrote: > > On Sat, Jun 1, 2024 at 12:20 AM Barry Song <21cnbao@gmail.com> wrote: > >> > >> On Sat, Jun 1, 2024 at 12:10 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 31.05.24 13:55, Barry Song wrote: > >>>> On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@redhat.com> wrote: > >>>>> > >>>>> On 31.05.24 12:48, Barry Song wrote: > >>>>>> From: Barry Song <v-songbaohua@oppo.com> > >>>>>> > >>>>>> After swapping out, we perform a swap-in operation. If we first read > >>>>>> and then write, we encounter a major fault in do_swap_page for reading, > >>>>>> along with additional minor faults in do_wp_page for writing. However, > >>>>>> the latter appears to be unnecessary and inefficient. Instead, we can > >>>>>> directly reuse in do_swap_page and completely eliminate the need for > >>>>>> do_wp_page. > >>>>>> > >>>>>> This patch achieves that optimization specifically for exclusive folios. > >>>>>> The following microbenchmark demonstrates the significant reduction in > >>>>>> minor faults. > >>>>>> > >>>>>> #define DATA_SIZE (2UL * 1024 * 1024) > >>>>>> #define PAGE_SIZE (4UL * 1024) > >>>>>> > >>>>>> static void *read_write_data(char *addr) > >>>>>> { > >>>>>> char tmp; > >>>>>> > >>>>>> for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { > >>>>>> tmp = *(volatile char *)(addr + i); > >>>>>> *(volatile char *)(addr + i) = tmp; > >>>>>> } > >>>>>> } > >>>>>> > >>>>>> int main(int argc, char **argv) > >>>>>> { > >>>>>> struct rusage ru; > >>>>>> > >>>>>> char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, > >>>>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > >>>>>> memset(addr, 0x11, DATA_SIZE); > >>>>>> > >>>>>> do { > >>>>>> long old_ru_minflt, old_ru_majflt; > >>>>>> long new_ru_minflt, new_ru_majflt; > >>>>>> > >>>>>> madvise(addr, DATA_SIZE, MADV_PAGEOUT); > >>>>>> > >>>>>> getrusage(RUSAGE_SELF, &ru); > >>>>>> old_ru_minflt = ru.ru_minflt; > >>>>>> old_ru_majflt = ru.ru_majflt; > >>>>>> > >>>>>> read_write_data(addr); > >>>>>> getrusage(RUSAGE_SELF, &ru); > >>>>>> new_ru_minflt = ru.ru_minflt; > >>>>>> new_ru_majflt = ru.ru_majflt; > >>>>>> > >>>>>> printf("minor faults:%ld major faults:%ld\n", > >>>>>> new_ru_minflt - old_ru_minflt, > >>>>>> new_ru_majflt - old_ru_majflt); > >>>>>> } while(0); > >>>>>> > >>>>>> return 0; > >>>>>> } > >>>>>> > >>>>>> w/o patch, > >>>>>> / # ~/a.out > >>>>>> minor faults:512 major faults:512 > >>>>>> > >>>>>> w/ patch, > >>>>>> / # ~/a.out > >>>>>> minor faults:0 major faults:512 > >>>>>> > >>>>>> Minor faults decrease to 0! > >>>>>> > >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>>>>> --- > >>>>>> mm/memory.c | 7 ++++--- > >>>>>> 1 file changed, 4 insertions(+), 3 deletions(-) > >>>>>> > >>>>>> diff --git a/mm/memory.c b/mm/memory.c > >>>>>> index eef4e482c0c2..e1d2e339958e 100644 > >>>>>> --- a/mm/memory.c > >>>>>> +++ b/mm/memory.c > >>>>>> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>>>> */ > >>>>>> if (!folio_test_ksm(folio) && > >>>>>> (exclusive || folio_ref_count(folio) == 1)) { > >>>>>> - if (vmf->flags & FAULT_FLAG_WRITE) { > >>>>>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > >>>>>> - vmf->flags &= ~FAULT_FLAG_WRITE; > >>>>>> + if (vma->vm_flags & VM_WRITE) { > >>>>>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); > >>>>>> + if (vmf->flags & FAULT_FLAG_WRITE) > >>>>>> + vmf->flags &= ~FAULT_FLAG_WRITE; > >>>>> > >>>>> This implies, that even on a read fault, you would mark the pte dirty > >>>>> and it would have to be written back to swap if still in the swap cache > >>>>> and only read. > >>>>> > >>>>> That is controversial. > >>>>> > >>>>> What is less controversial is doing what mprotect() via > >>>>> change_pte_range()/can_change_pte_writable() would do: mark the PTE > >>>>> writable but not dirty. > >>>>> > >>>>> I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set. > >>>> > >>>> Thanks! > >>>> > >>>> I assume you mean something as below? > >>> > >>> It raises an important point: uffd-wp must be handled accordingly. > >>> > >>>> > >>>> diff --git a/mm/memory.c b/mm/memory.c > >>>> index eef4e482c0c2..dbf1ba8ccfd6 100644 > >>>> --- a/mm/memory.c > >>>> +++ b/mm/memory.c > >>>> @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); > >>>> pte = mk_pte(page, vma->vm_page_prot); > >>>> > >>>> + if (pte_swp_soft_dirty(vmf->orig_pte)) > >>>> + pte = pte_mksoft_dirty(pte); > >>>> + if (pte_swp_uffd_wp(vmf->orig_pte)) > >>>> + pte = pte_mkuffd_wp(pte); > >>>> /* > >>>> * Same logic as in do_wp_page(); however, optimize for pages that are > >>>> * certainly not shared either because we just allocated them without > >>>> @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >>>> */ > >>>> if (!folio_test_ksm(folio) && > >>>> (exclusive || folio_ref_count(folio) == 1)) { > >>>> - if (vmf->flags & FAULT_FLAG_WRITE) { > >>>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); > >>>> - vmf->flags &= ~FAULT_FLAG_WRITE; > >>>> + if (vma->vm_flags & VM_WRITE) { > >>>> + if (vmf->flags & FAULT_FLAG_WRITE) { > >>>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); > >>>> + vmf->flags &= ~FAULT_FLAG_WRITE; > >>>> + } else if ((!vma_soft_dirty_enabled(vma) || > >>>> pte_soft_dirty(pte)) > >>>> + && !userfaultfd_pte_wp(vma, pte)) { > >>>> + pte = pte_mkwrite(pte, vma); > >>> > >>> Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a > >>> pte_mkwrite(pte). So we have to catch and handle that earlier (I could > >>> have sworn we handle that somehow). > >>> > >>> Note that the existing > >>> pte = pte_mkuffd_wp(pte); > >>> > >>> Will fix that up because it does an implicit pte_wrprotect(). > >> > >> This is exactly what I have missed as I am struggling with why WRITE_FAULT > >> blindly does mkwrite without checking userfaultfd_pte_wp(). > >> > >>> > >>> > >>> So maybe what would work is > >>> > >>> > >>> if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && > >>> !vma_soft_dirty_enabled(vma)) { > >>> pte = pte_mkwrite(pte); > >>> > >>> /* Only set the PTE dirty on write fault. */ > >>> if (vmf->flags & FAULT_FLAG_WRITE) { > >>> pte = pte_mkdirty(pte); > >>> vmf->flags &= ~FAULT_FLAG_WRITE; > >>> } > > > > WRITE_FAULT has a pte_mkdirty, so it doesn't need to check > > "!vma_soft_dirty_enabled(vma)"? > > Maybe I thought too much, just the simple code below should work? > > That would likely not handle softdirty correctly in case we end up in > pte_mkwrite(pte, vma); note that pte_mksoft_dirty() will not wrprotect ... if SOFTDIRTY has been set, we shouldn't do wrprotect? till the dirty is cleared, we don't rely on further write fault to set soft dirty, right? so we should better do pte_mkwrite if pte_soft_dirty(pte) == true? if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && (!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte))) > > (note that we shouldn't optimize for softdirty handling) > > > > > if (!folio_test_ksm(folio) && > > (exclusive || folio_ref_count(folio) == 1)) { > > if (vma->vm_flags & VM_WRITE) { > > if (vmf->flags & FAULT_FLAG_WRITE) { > > pte = pte_mkwrite(pte_mkdirty(pte), vma); > > vmf->flags &= ~FAULT_FLAG_WRITE; > > } else { > > pte = pte_mkwrite(pte, vma); > > } > > } > > rmap_flags |= RMAP_EXCLUSIVE; > > } > > > > if (pte_swp_soft_dirty(vmf->orig_pte)) > > pte = pte_mksoft_dirty(pte); > > if (pte_swp_uffd_wp(vmf->orig_pte)) > > pte = pte_mkuffd_wp(pte); > > > > This still uses the implicit wrprotect of pte_mkuffd_wp. > > But the wrprotected->writable->wrprotected path really is confusing. I'd > prefer to set these bits ahead of time instead, so we can properly rely > on them -- like we do in other code. I agree. we are setting WRITE then reverting the WRITE. It is confusing. So in conclusion, we get the belows? if (pte_swp_soft_dirty(vmf->orig_pte)) pte = pte_mksoft_dirty(pte); if (pte_swp_uffd_wp(vmf->orig_pte)) pte = pte_mkuffd_wp(pte); /* * Same logic as in do_wp_page(); however, optimize for pages that are * certainly not shared either because we just allocated them without * exposing them to the swapcache or because the swap entry indicates * exclusivity. */ if (!folio_test_ksm(folio) && (exclusive || folio_ref_count(folio) == 1)) { if (vma->vm_flags & VM_WRITE && !userfaultfd_pte_wp(vma, pte) && (!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte))) { if (vmf->flags & FAULT_FLAG_WRITE) { pte = pte_mkwrite(pte_mkdirty(pte), vma); vmf->flags &= ~FAULT_FLAG_WRITE; } else { pte = pte_mkwrite(pte, vma); } } rmap_flags |= RMAP_EXCLUSIVE; } > > -- > Cheers, > > David / dhildenb > Thanks Barry
On 01.06.24 02:48, Barry Song wrote: > On Sat, Jun 1, 2024 at 12:35 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 31.05.24 14:30, Barry Song wrote: >>> On Sat, Jun 1, 2024 at 12:20 AM Barry Song <21cnbao@gmail.com> wrote: >>>> >>>> On Sat, Jun 1, 2024 at 12:10 AM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 31.05.24 13:55, Barry Song wrote: >>>>>> On Fri, May 31, 2024 at 11:08 PM David Hildenbrand <david@redhat.com> wrote: >>>>>>> >>>>>>> On 31.05.24 12:48, Barry Song wrote: >>>>>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>>>>> >>>>>>>> After swapping out, we perform a swap-in operation. If we first read >>>>>>>> and then write, we encounter a major fault in do_swap_page for reading, >>>>>>>> along with additional minor faults in do_wp_page for writing. However, >>>>>>>> the latter appears to be unnecessary and inefficient. Instead, we can >>>>>>>> directly reuse in do_swap_page and completely eliminate the need for >>>>>>>> do_wp_page. >>>>>>>> >>>>>>>> This patch achieves that optimization specifically for exclusive folios. >>>>>>>> The following microbenchmark demonstrates the significant reduction in >>>>>>>> minor faults. >>>>>>>> >>>>>>>> #define DATA_SIZE (2UL * 1024 * 1024) >>>>>>>> #define PAGE_SIZE (4UL * 1024) >>>>>>>> >>>>>>>> static void *read_write_data(char *addr) >>>>>>>> { >>>>>>>> char tmp; >>>>>>>> >>>>>>>> for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) { >>>>>>>> tmp = *(volatile char *)(addr + i); >>>>>>>> *(volatile char *)(addr + i) = tmp; >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> int main(int argc, char **argv) >>>>>>>> { >>>>>>>> struct rusage ru; >>>>>>>> >>>>>>>> char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, >>>>>>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>>>>>>> memset(addr, 0x11, DATA_SIZE); >>>>>>>> >>>>>>>> do { >>>>>>>> long old_ru_minflt, old_ru_majflt; >>>>>>>> long new_ru_minflt, new_ru_majflt; >>>>>>>> >>>>>>>> madvise(addr, DATA_SIZE, MADV_PAGEOUT); >>>>>>>> >>>>>>>> getrusage(RUSAGE_SELF, &ru); >>>>>>>> old_ru_minflt = ru.ru_minflt; >>>>>>>> old_ru_majflt = ru.ru_majflt; >>>>>>>> >>>>>>>> read_write_data(addr); >>>>>>>> getrusage(RUSAGE_SELF, &ru); >>>>>>>> new_ru_minflt = ru.ru_minflt; >>>>>>>> new_ru_majflt = ru.ru_majflt; >>>>>>>> >>>>>>>> printf("minor faults:%ld major faults:%ld\n", >>>>>>>> new_ru_minflt - old_ru_minflt, >>>>>>>> new_ru_majflt - old_ru_majflt); >>>>>>>> } while(0); >>>>>>>> >>>>>>>> return 0; >>>>>>>> } >>>>>>>> >>>>>>>> w/o patch, >>>>>>>> / # ~/a.out >>>>>>>> minor faults:512 major faults:512 >>>>>>>> >>>>>>>> w/ patch, >>>>>>>> / # ~/a.out >>>>>>>> minor faults:0 major faults:512 >>>>>>>> >>>>>>>> Minor faults decrease to 0! >>>>>>>> >>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>>>>>>> --- >>>>>>>> mm/memory.c | 7 ++++--- >>>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-) >>>>>>>> >>>>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>>>> index eef4e482c0c2..e1d2e339958e 100644 >>>>>>>> --- a/mm/memory.c >>>>>>>> +++ b/mm/memory.c >>>>>>>> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>>>> */ >>>>>>>> if (!folio_test_ksm(folio) && >>>>>>>> (exclusive || folio_ref_count(folio) == 1)) { >>>>>>>> - if (vmf->flags & FAULT_FLAG_WRITE) { >>>>>>>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); >>>>>>>> - vmf->flags &= ~FAULT_FLAG_WRITE; >>>>>>>> + if (vma->vm_flags & VM_WRITE) { >>>>>>>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); >>>>>>>> + if (vmf->flags & FAULT_FLAG_WRITE) >>>>>>>> + vmf->flags &= ~FAULT_FLAG_WRITE; >>>>>>> >>>>>>> This implies, that even on a read fault, you would mark the pte dirty >>>>>>> and it would have to be written back to swap if still in the swap cache >>>>>>> and only read. >>>>>>> >>>>>>> That is controversial. >>>>>>> >>>>>>> What is less controversial is doing what mprotect() via >>>>>>> change_pte_range()/can_change_pte_writable() would do: mark the PTE >>>>>>> writable but not dirty. >>>>>>> >>>>>>> I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set. >>>>>> >>>>>> Thanks! >>>>>> >>>>>> I assume you mean something as below? >>>>> >>>>> It raises an important point: uffd-wp must be handled accordingly. >>>>> >>>>>> >>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>> index eef4e482c0c2..dbf1ba8ccfd6 100644 >>>>>> --- a/mm/memory.c >>>>>> +++ b/mm/memory.c >>>>>> @@ -4317,6 +4317,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>> add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); >>>>>> pte = mk_pte(page, vma->vm_page_prot); >>>>>> >>>>>> + if (pte_swp_soft_dirty(vmf->orig_pte)) >>>>>> + pte = pte_mksoft_dirty(pte); >>>>>> + if (pte_swp_uffd_wp(vmf->orig_pte)) >>>>>> + pte = pte_mkuffd_wp(pte); >>>>>> /* >>>>>> * Same logic as in do_wp_page(); however, optimize for pages that are >>>>>> * certainly not shared either because we just allocated them without >>>>>> @@ -4325,18 +4329,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >>>>>> */ >>>>>> if (!folio_test_ksm(folio) && >>>>>> (exclusive || folio_ref_count(folio) == 1)) { >>>>>> - if (vmf->flags & FAULT_FLAG_WRITE) { >>>>>> - pte = maybe_mkwrite(pte_mkdirty(pte), vma); >>>>>> - vmf->flags &= ~FAULT_FLAG_WRITE; >>>>>> + if (vma->vm_flags & VM_WRITE) { >>>>>> + if (vmf->flags & FAULT_FLAG_WRITE) { >>>>>> + pte = pte_mkwrite(pte_mkdirty(pte), vma); >>>>>> + vmf->flags &= ~FAULT_FLAG_WRITE; >>>>>> + } else if ((!vma_soft_dirty_enabled(vma) || >>>>>> pte_soft_dirty(pte)) >>>>>> + && !userfaultfd_pte_wp(vma, pte)) { >>>>>> + pte = pte_mkwrite(pte, vma); >>>>> >>>>> Even with FAULT_FLAG_WRITE we must respect uffd-wp and *not* do a >>>>> pte_mkwrite(pte). So we have to catch and handle that earlier (I could >>>>> have sworn we handle that somehow). >>>>> >>>>> Note that the existing >>>>> pte = pte_mkuffd_wp(pte); >>>>> >>>>> Will fix that up because it does an implicit pte_wrprotect(). >>>> >>>> This is exactly what I have missed as I am struggling with why WRITE_FAULT >>>> blindly does mkwrite without checking userfaultfd_pte_wp(). >>>> >>>>> >>>>> >>>>> So maybe what would work is >>>>> >>>>> >>>>> if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && >>>>> !vma_soft_dirty_enabled(vma)) { >>>>> pte = pte_mkwrite(pte); >>>>> >>>>> /* Only set the PTE dirty on write fault. */ >>>>> if (vmf->flags & FAULT_FLAG_WRITE) { >>>>> pte = pte_mkdirty(pte); >>>>> vmf->flags &= ~FAULT_FLAG_WRITE; >>>>> } >>> >>> WRITE_FAULT has a pte_mkdirty, so it doesn't need to check >>> "!vma_soft_dirty_enabled(vma)"? >>> Maybe I thought too much, just the simple code below should work? >> >> That would likely not handle softdirty correctly in case we end up in >> pte_mkwrite(pte, vma); note that pte_mksoft_dirty() will not wrprotect ... > > if SOFTDIRTY has been set, we shouldn't do wrprotect? till the dirty > is cleared, we don't rely on further write fault to set soft dirty, right? If softidrty is enabled for the VMA and the PTE is softdirty, we can (not should) map it writable. My point is that softdirty tracking is so underused that optimizing it here is likely not required. > > so we should better do pte_mkwrite if pte_soft_dirty(pte) == true? > > if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && > (!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte))) > >> >> (note that we shouldn't optimize for softdirty handling) >> >>> >>> if (!folio_test_ksm(folio) && >>> (exclusive || folio_ref_count(folio) == 1)) { >>> if (vma->vm_flags & VM_WRITE) { >>> if (vmf->flags & FAULT_FLAG_WRITE) { >>> pte = pte_mkwrite(pte_mkdirty(pte), vma); >>> vmf->flags &= ~FAULT_FLAG_WRITE; >>> } else { >>> pte = pte_mkwrite(pte, vma); >>> } >>> } >>> rmap_flags |= RMAP_EXCLUSIVE; >>> } >>> >>> if (pte_swp_soft_dirty(vmf->orig_pte)) >>> pte = pte_mksoft_dirty(pte); >>> if (pte_swp_uffd_wp(vmf->orig_pte)) >>> pte = pte_mkuffd_wp(pte); >>> >>> This still uses the implicit wrprotect of pte_mkuffd_wp. >> >> But the wrprotected->writable->wrprotected path really is confusing. I'd >> prefer to set these bits ahead of time instead, so we can properly rely >> on them -- like we do in other code. > > I agree. we are setting WRITE then reverting the WRITE. It is confusing. > > So in conclusion, we get the belows? > > if (pte_swp_soft_dirty(vmf->orig_pte)) > pte = pte_mksoft_dirty(pte); > if (pte_swp_uffd_wp(vmf->orig_pte)) > pte = pte_mkuffd_wp(pte); > > /* > * Same logic as in do_wp_page(); however, optimize for pages that are > * certainly not shared either because we just allocated them without > * exposing them to the swapcache or because the swap entry indicates > * exclusivity. > */ > if (!folio_test_ksm(folio) && > (exclusive || folio_ref_count(folio) == 1)) { > if (vma->vm_flags & VM_WRITE && !userfaultfd_pte_wp(vma, pte) > && (!vma_soft_dirty_enabled(vma) || pte_soft_dirty(pte))) { And here the code gets ugly. Just do if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && !vma_soft_dirty_enabled(vma)) { ... } and don't optimize softdirty. Or if you really want to, have a helper function like userfaultfd_pte_wp() that wraps both checks. > if (vmf->flags & FAULT_FLAG_WRITE) { > pte = pte_mkwrite(pte_mkdirty(pte), vma); > vmf->flags &= ~FAULT_FLAG_WRITE; > } else { > pte = pte_mkwrite(pte, vma); > } > } why not pte = pte_mkwrite(pte, vma); if (vmf->flags & FAULT_FLAG_WRITE) { pte = pte_mkdirty(pte); vmf->flags &= ~FAULT_FLAG_WRITE; } Conceptually, I think this should be fine.
diff --git a/mm/memory.c b/mm/memory.c index eef4e482c0c2..e1d2e339958e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ if (!folio_test_ksm(folio) && (exclusive || folio_ref_count(folio) == 1)) { - if (vmf->flags & FAULT_FLAG_WRITE) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); - vmf->flags &= ~FAULT_FLAG_WRITE; + if (vma->vm_flags & VM_WRITE) { + pte = pte_mkwrite(pte_mkdirty(pte), vma); + if (vmf->flags & FAULT_FLAG_WRITE) + vmf->flags &= ~FAULT_FLAG_WRITE; } rmap_flags |= RMAP_EXCLUSIVE; }