diff mbox series

[RFC] mm: map zero-filled pages to zero_pfn while doing swap-in

Message ID 20241212073711.82300-1-21cnbao@gmail.com (mailing list archive)
State New
Series [RFC] mm: map zero-filled pages to zero_pfn while doing swap-in

Commit Message

Barry Song Dec. 12, 2024, 7:37 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

While developing the zeromap series, Usama observed that certain
workloads may contain over 10% zero-filled pages. This may present
an opportunity to save memory by mapping zero-filled pages to zero_pfn
in do_swap_page(). If a write occurs later, do_wp_page() can
allocate a new page using the Copy-on-Write mechanism.

For workloads with numerous zero-filled pages, this can greatly
reduce the RSS.

For example:
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>

 #define SIZE (20 * 1024 * 1024)
 int main()
 {
 	volatile char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	volatile char data;

 	if (buffer == MAP_FAILED) {
 		perror("mmap failed");
 		exit(EXIT_FAILURE);
 	}

 	memset((void *)buffer, 0, SIZE);

 	if (madvise((void *)buffer, SIZE, MADV_PAGEOUT) != 0)
 		perror("madvise MADV_PAGEOUT failed");

 	for (size_t i = 0; i < SIZE; i++)
 		data = buffer[i];
 	sleep(1000);

 	return 0;
 }

~ # ./a.out &

w/o patch:
~ # ps aux | head -n 1; ps aux | grep '[a]\.out'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       101  2.9 10.6  22540 21268 ttyAMA0  S    06:50   0:00 ./a.out

w/ patch:
~ # ps aux | head -n 1; ps aux | grep '[a]\.out'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       141  0.1  0.3  22540   792 ttyAMA0  S    06:38   0:00 ./a.out

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

Comments

Christoph Hellwig Dec. 12, 2024, 8:29 a.m. UTC | #1
On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> While developing the zeromap series, Usama observed that certain
> workloads may contain over 10% zero-filled pages. This may present
> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> in do_swap_page(). If a write occurs later, do_wp_page() can
> allocate a new page using the Copy-on-Write mechanism.

Shouldn't this be done during, or rather instead of, swap-out?
Swapping all zero pages out just to optimize the in-memory
representation seems rather backwards.
Barry Song Dec. 12, 2024, 8:46 a.m. UTC | #2
On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > While developing the zeromap series, Usama observed that certain
> > workloads may contain over 10% zero-filled pages. This may present
> > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > in do_swap_page(). If a write occurs later, do_wp_page() can
> > allocate a new page using the Copy-on-Write mechanism.
>
> Shouldn't this be done during, or rather instead of swap out instead?
> Swapping all zero pages out just to optimize the in-memory
> representation on seems rather backwards.

I’m having trouble understanding your point; it seems like you might
not have fully read the code. :-)

The situation is as follows: for a zero-filled page, we currently
allocate a new page unconditionally. By mapping this zero-filled page
to zero_pfn, we can save the memory used by this page.

We don't need to allocate the memory until the page is written (which
may never happen).

>

Thanks
Barry
Christoph Hellwig Dec. 12, 2024, 8:50 a.m. UTC | #3
On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > While developing the zeromap series, Usama observed that certain
> > > workloads may contain over 10% zero-filled pages. This may present
> > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > allocate a new page using the Copy-on-Write mechanism.
> >
> > Shouldn't this be done during, or rather instead of swap out instead?
> > Swapping all zero pages out just to optimize the in-memory
> > representation on seems rather backwards.
> 
> I’m having trouble understanding your point—it seems like you might
> not have fully read the code. :-)

I've not read the code at all, I've read your commit log.

> The situation is as follows: for a zero-filled page, we are currently
> allocating a new
> page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> save the memory used by this page.

Yes.  But why do that in swap-in and not swap-out?
David Hildenbrand Dec. 12, 2024, 8:50 a.m. UTC | #4
On 12.12.24 09:46, Barry Song wrote:
> On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> While developing the zeromap series, Usama observed that certain
>>> workloads may contain over 10% zero-filled pages. This may present
>>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
>>> in do_swap_page(). If a write occurs later, do_wp_page() can
>>> allocate a new page using the Copy-on-Write mechanism.
>>
>> Shouldn't this be done during, or rather instead of swap out instead?
>> Swapping all zero pages out just to optimize the in-memory
>> representation on seems rather backwards.
> 
> I’m having trouble understanding your point—it seems like you might
> not have fully read the code. :-)
> 
> The situation is as follows: for a zero-filled page, we are currently
> allocating a new
> page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> save the memory used by this page.
> 
> We don't need to allocate the memory until the page is written(which may never
> happen).

I think what Christoph means is that you would determine that at PTE 
unmap time, and directly place the zero page in there. So there would be 
no need to have the page fault at all.

I suspect doing it at PTE unmap time might be problematic, because we 
might still have other (e.g., GUP) references modifying that page, and 
we can only rely on the page content being stable after we have flushed 
the TLB as well. (I recall some deferred flushing optimizations.)
Barry Song Dec. 12, 2024, 8:54 a.m. UTC | #5
On Thu, Dec 12, 2024 at 9:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Dec 12, 2024 at 09:46:03PM +1300, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > While developing the zeromap series, Usama observed that certain
> > > > workloads may contain over 10% zero-filled pages. This may present
> > > > an opportunity to save memory by mapping zero-filled pages to zero_pfn
> > > > in do_swap_page(). If a write occurs later, do_wp_page() can
> > > > allocate a new page using the Copy-on-Write mechanism.
> > >
> > > Shouldn't this be done during, or rather instead of swap out instead?
> > > Swapping all zero pages out just to optimize the in-memory
> > > representation on seems rather backwards.
> >
> > I’m having trouble understanding your point—it seems like you might
> > not have fully read the code. :-)
>
> I've not read the code at all, I've read your commit log.
>
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new
> > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > save the memory used by this page.
>
> Yes.  But why do that in swap-in and not swap-out?

Usama implemented this on the swap-out side: after his zeromap
series, no I/O occurs for zero-filled pages. A bit is set in the
swap->zeromap bitmap if the swapped-out page is zero-filled, and
the swap-out I/O is skipped.

Now, the situation is that when we re-access a swapped-out
page, we don’t always need to allocate a new page. Instead,
we can map it to zero_pfn and defer the allocation until the
page is written.

>
Barry Song Dec. 12, 2024, 9:16 a.m. UTC | #6
On Thu, Dec 12, 2024 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.12.24 09:46, Barry Song wrote:
> > On Thu, Dec 12, 2024 at 9:29 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> On Thu, Dec 12, 2024 at 08:37:11PM +1300, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> While developing the zeromap series, Usama observed that certain
> >>> workloads may contain over 10% zero-filled pages. This may present
> >>> an opportunity to save memory by mapping zero-filled pages to zero_pfn
> >>> in do_swap_page(). If a write occurs later, do_wp_page() can
> >>> allocate a new page using the Copy-on-Write mechanism.
> >>
> >> Shouldn't this be done during, or rather instead of swap out instead?
> >> Swapping all zero pages out just to optimize the in-memory
> >> representation on seems rather backwards.
> >
> > I’m having trouble understanding your point—it seems like you might
> > not have fully read the code. :-)
> >
> > The situation is as follows: for a zero-filled page, we are currently
> > allocating a new
> > page unconditionally. By mapping this zero-filled page to zero_pfn, we could
> > save the memory used by this page.
> >
> > We don't need to allocate the memory until the page is written(which may never
> > happen).
>
> I think what Christoph means is that you would determine that at PTE
> unmap time, and directly place the zero page in there. So there would be
> no need to have the page fault at all.
>
> I suspect at PTE unmap time might be problematic, because we might still
> have other (i.e., GUP) references modifying that page, and we can only
> rely on the page content being stable after we flushed the TLB as well.
> (I recall some deferred flushing optimizations)

Yes, we need to follow a strict sequence:

1. try_to_unmap - unmap PTEs in all processes;
2. try_to_unmap_flush_dirty - flush deferred TLB shootdown;
3. pageout - zeromap sets a bit in the bitmap if the page is zero-filled

At the moment of pageout(), we can be confident that the page content is
stable, so a page that tests as zero-filled really is zero-filled.

Mapping to the zero page during unmap seems quite risky.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 2bacebbf4cf6..b37f0f61d0bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4294,6 +4294,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct swap_info_struct *si = NULL;
 	rmap_t rmap_flags = RMAP_NONE;
 	bool need_clear_cache = false;
+	bool map_zero_pfn = false;
 	bool exclusive = false;
 	swp_entry_t entry;
 	pte_t pte;
@@ -4364,6 +4365,39 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swapcache = folio;
 
 	if (!folio) {
+		/* Use the zero-page for reads */
+		if (!(vmf->flags & FAULT_FLAG_WRITE) &&
+		    !mm_forbids_zeropage(vma->vm_mm) &&
+		    __swap_count(entry) == 1)  {
+			swap_zeromap_batch(entry, 1, &map_zero_pfn);
+			if (map_zero_pfn) {
+				if (swapcache_prepare(entry, 1)) {
+					add_wait_queue(&swapcache_wq, &wait);
+					schedule_timeout_uninterruptible(1);
+					remove_wait_queue(&swapcache_wq, &wait);
+					goto out;
+				}
+				nr_pages = 1;
+				need_clear_cache = true;
+				pte = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
+						vma->vm_page_prot));
+				vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+						&vmf->ptl);
+				if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte),
+						vmf->orig_pte)))
+					goto unlock;
+
+				page = pfn_to_page(my_zero_pfn(vmf->address));
+				arch_swap_restore(entry, page_folio(page));
+				swap_free_nr(entry, 1);
+				add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
+				set_ptes(vma->vm_mm, vmf->address, vmf->pte, pte, 1);
+				arch_do_swap_page_nr(vma->vm_mm, vma, vmf->address, pte, pte, 1);
+				update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+				goto unlock;
+			}
+		}
+
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */