Message ID | 20241212073711.82300-1-21cnbao@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [RFC] mm: map zero-filled pages to zero_pfn while doing swap-in |
On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> While developing the zeromap series, Usama observed that certain
> workloads may contain over 10% zero-filled pages. This may present
> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> in do_swap_page(). If a write occurs later, do_wp_page() can
> allocate a new page using the Copy-on-Write mechanism.

Shouldn't this be done during, or rather instead of swap out instead?
Swapping all zero pages out just to optimize the in-memory
representation on seems rather backwards.

On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > While developing the zeromap series, Usama observed that certain
> > workloads may contain over 10% zero-filled pages. This may present
> > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > in do_swap_page(). If a write occurs later, do_wp_page() can
> > allocate a new page using the Copy-on-Write mechanism.
>
> Shouldn't this be done during, or rather instead of swap out instead?
> Swapping all zero pages out just to optimize the in-memory
> representation on seems rather backwards.

I’m having trouble understanding your point; it seems like you might
not have fully read the code. :-)

The situation is as follows: for a zero-filled page, we are currently
allocating a new page unconditionally. By mapping this zero-filled page
to zero_pfn, we could save the memory used by this page.

We don't need to allocate the memory until the page is written (which
may never happen).

Thanks
Barry

On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > While developing the zeromap series, Usama observed that certain
> > > workloads may contain over 10% zero-filled pages. This may present
> > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > allocate a new page using the Copy-on-Write mechanism.
> >
> > Shouldn't this be done during, or rather instead of swap out instead?
> > Swapping all zero pages out just to optimize the in-memory
> > representation on seems rather backwards.
>
> I’m having trouble understanding your point; it seems like you might
> not have fully read the code. :-)

I've not read the code at all, I've read your commit log.

> The situation is as follows: for a zero-filled page, we are currently
> allocating a new page unconditionally. By mapping this zero-filled page
> to zero_pfn, we could save the memory used by this page.

Yes. But why do that in swap-in and not swap-out?

On 12.12.24 09:46, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> While developing the zeromap series, Usama observed that certain
>>> workloads may contain over 10% zero-filled pages. This may present
>>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
>>> in do_swap_page(). If a write occurs later, do_wp_page() can
>>> allocate a new page using the Copy-on-Write mechanism.
>>
>> Shouldn't this be done during, or rather instead of swap out instead?
>> Swapping all zero pages out just to optimize the in-memory
>> representation on seems rather backwards.
>
> I’m having trouble understanding your point; it seems like you might
> not have fully read the code. :-)
>
> The situation is as follows: for a zero-filled page, we are currently
> allocating a new page unconditionally. By mapping this zero-filled page
> to zero_pfn, we could save the memory used by this page.
>
> We don't need to allocate the memory until the page is written (which
> may never happen).

I think what Christoph means is that you would determine that at PTE
unmap time, and directly place the zero page in there. So there would be
no need to have the page fault at all.

I suspect at PTE unmap time might be problematic, because we might still
have other (i.e., GUP) references modifying that page, and we can only
rely on the page content being stable after we flushed the TLB as well.
(I recall some deferred flushing optimizations)

On Thu, Dec 12, 2024 at 9:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > While developing the zeromap series, Usama observed that certain
> > > > workloads may contain over 10% zero-filled pages. This may present
> > > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > > allocate a new page using the Copy-on-Write mechanism.
> > >
> > > Shouldn't this be done during, or rather instead of swap out instead?
> > > Swapping all zero pages out just to optimize the in-memory
> > > representation on seems rather backwards.
> >
> > I’m having trouble understanding your point; it seems like you might
> > not have fully read the code. :-)
>
> I've not read the code at all, I've read your commit log.
>
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new page unconditionally. By mapping this zero-filled page
> > to zero_pfn, we could save the memory used by this page.
>
> Yes. But why do that in swap-in and not swap-out?

Usama implemented this in swap-out, where no I/O occurs after his
zeromap series. A bit is set in the swap->zeromap bitmap if the
swapped-out page is zero-filled, and all swap-out I/O is skipped.

Now, the situation is that when we re-access a swapped-out page,
we don’t always need to allocate a new page. Instead, we can map
it to zero_pfn and defer the allocation until the page is written.

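The swap-out side described in the reply above can be sketched in plain userspace C. This is only an illustration of the idea (one bit per swap slot, set when the page is all zeroes, so the write can be skipped); it is not the kernel implementation. `struct toy_swap`, `toy_swap_out()`, `toy_slot_is_zero()`, `PAGE_SIZE_BYTES` and `NR_SLOTS` are hypothetical names invented for this sketch, and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u	/* assumed page size for this sketch */
#define NR_SLOTS        64u	/* hypothetical number of swap slots */
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a swap device carrying a zeromap bitmap. */
struct toy_swap {
	unsigned long zeromap[(NR_SLOTS + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

/* Return true if every byte of the page is zero. */
static bool page_is_zero_filled(const unsigned char *page)
{
	for (size_t i = 0; i < PAGE_SIZE_BYTES; i++)
		if (page[i])
			return false;
	return true;
}

/*
 * Toy swap-out: record zero-filled pages in the bitmap.
 * Returns true if the caller would still have to write the page out,
 * false if the I/O can be skipped.
 */
static bool toy_swap_out(struct toy_swap *si, size_t slot,
			 const unsigned char *page)
{
	size_t word = slot / BITS_PER_WORD;
	unsigned long mask = 1UL << (slot % BITS_PER_WORD);

	assert(slot < NR_SLOTS);
	if (page_is_zero_filled(page)) {
		si->zeromap[word] |= mask;
		return false;
	}
	si->zeromap[word] &= ~mask;
	return true;
}

/* Toy swap-in query: was this slot recorded as zero-filled? */
static bool toy_slot_is_zero(const struct toy_swap *si, size_t slot)
{
	assert(slot < NR_SLOTS);
	return ((si->zeromap[slot / BITS_PER_WORD] >>
		 (slot % BITS_PER_WORD)) & 1UL) != 0;
}
```

On swap-in, a query like `toy_slot_is_zero()` is the point where the RFC would map the shared zero page instead of allocating a new one.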
On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.12.24 09:46, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> While developing the zeromap series, Usama observed that certain
> >>> workloads may contain over 10% zero-filled pages. This may present
> >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> >>> allocate a new page using the Copy-on-Write mechanism.
> >>
> >> Shouldn't this be done during, or rather instead of swap out instead?
> >> Swapping all zero pages out just to optimize the in-memory
> >> representation on seems rather backwards.
> >
> > I’m having trouble understanding your point; it seems like you might
> > not have fully read the code. :-)
> >
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new page unconditionally. By mapping this zero-filled page
> > to zero_pfn, we could save the memory used by this page.
> >
> > We don't need to allocate the memory until the page is written (which
> > may never happen).
>
> I think what Christoph means is that you would determine that at PTE
> unmap time, and directly place the zero page in there. So there would be
> no need to have the page fault at all.
>
> I suspect at PTE unmap time might be problematic, because we might still
> have other (i.e., GUP) references modifying that page, and we can only
> rely on the page content being stable after we flushed the TLB as well.
> (I recall some deferred flushing optimizations)

Yes, we need to follow a strict sequence:

1. try_to_unmap - unmap PTEs in all processes;
2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
3. pageout - zeromap will set 1 in bitmap if page is zero-filled

At the moment of pageout(), we can be confident that the page is
zero-filled.

Mapping to zeropage during unmap seems quite risky.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

diff --git a/mm/memory.c b/mm/memory.c
index 2bacebbf4cf6..b37f0f61d0bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4294,6 +4294,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool need_clear_cache = false;
+	bool map_zero_pfn = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
@@ -4364,6 +4365,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swapcache = folio;
 
 	if (!folio) {
+		/* Use the zero-page for reads */
+		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
+		    !mm_forbids_zeropage(vma->vm_mm) &&
+		    __swap_count(entry) == 1) {
+			swap_zeromap_batch(entry, 1, &map_zero_pfn);
+			if (map_zero_pfn) {
+				if (swapcache_prepare(entry, 1)) {
+					add_wait_queue(&swapcache_wq, &wait);
+					schedule_timeout_uninterruptible(1);
+					remove_wait_queue(&swapcache_wq, &wait);
+					goto out;
+				}
+				nr_pages = 1;
+				need_clear_cache = true;
+				pte = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
+						vma->vm_page_prot));
+				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+						&vmf->ptl);
+				if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte),
+						vmf->orig_pte)))
+					goto unlock;
+
+				page = pfn_to_page(my_zero_pfn(vmf->address));
+				arch_swap_restore(entry, page_folio(page));
+				swap_free_nr(entry, 1);
+				add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
+				set_ptes(vma->vm_mm, vmf->address, vmf->pte, pte, 1);
+				arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address, pte, pte, 1);
+				update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+				goto unlock;
+			}
+		}
+
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */
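
The conditions the hunk above checks before taking the zero-page path can be restated as a small standalone predicate. This is a userspace sketch for reading along, not kernel code: `struct toy_fault`, `enum toy_action` and `toy_swapin_action()` are hypothetical names, and each field only models the kernel expression named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the inputs the patch consults; not a kernel type. */
struct toy_fault {
	bool write_fault;	/* models vmf->flags & FAULT_FLAG_WRITE */
	bool mm_forbids_zero;	/* models mm_forbids_zeropage(vma->vm_mm) */
	int  swap_count;	/* models __swap_count(entry) */
	bool zeromap_bit;	/* models the flag filled in by swap_zeromap_batch() */
};

enum toy_action {
	TOY_MAP_ZERO_PFN,	/* map the shared zero page, allocate nothing */
	TOY_ALLOCATE_FOLIO	/* fall through to the regular swap-in path */
};

/*
 * Mirror the order of checks in the hunk: only a read fault on a
 * singly-referenced, zero-filled swap entry takes the zero-page path.
 */
static enum toy_action toy_swapin_action(const struct toy_fault *f)
{
	assert(f);
	if (!f->write_fault && !f->mm_forbids_zero &&
	    f->swap_count == 1 && f->zeromap_bit)
		return TOY_MAP_ZERO_PFN;
	return TOY_ALLOCATE_FOLIO;
}
```

A later write to a page mapped this way would then go through the usual copy-on-write handling in do_wp_page(), as the commit message describes.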