[RFC,0/4] kvm: Report unused guest pages to host

Message ID 20190204181118.12095.38300.stgit@localhost.localdomain (mailing list archive)

Message

Alexander Duyck Feb. 4, 2019, 6:15 p.m. UTC
This patch set provides a mechanism by which guests can notify the host of
pages that are not currently in use. Using this data a KVM host can more
easily balance memory workloads between guests and improve overall system
performance by avoiding unnecessary writing of unused pages to swap.

In order to support this I have added a new hypercall to provide unused
page hints and made use of mechanisms currently used by PowerPC and s390
architectures to provide those hints. To reduce the overhead of this call
I am only using it per huge page instead of doing a notification per 4K
page. By doing this we can avoid the expense of fragmenting higher order
pages, and reduce overall cost for the hypercall as it will only be
performed once per huge page.
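
To put rough numbers on that saving, the sketch below counts how many hint
calls would be needed to cover a completely free 8GB guest at each
granularity. The 2 MiB huge page size is an assumption (the usual x86-64
value; the cover letter does not state it), and `hints_needed` is a
hypothetical helper, not something from the patch set.

```c
#include <stdint.h>

/*
 * Number of hint calls needed to cover `bytes` of free memory when one
 * call is issued per block of `block_bytes` (rounded up).
 * `block_bytes` must be nonzero.
 */
static uint64_t hints_needed(uint64_t bytes, uint64_t block_bytes)
{
    return (bytes + block_bytes - 1) / block_bytes;
}
```

Under those assumptions, `hints_needed(8ULL << 30, 4096)` gives 2097152
calls, while `hints_needed(8ULL << 30, 2ULL << 20)` gives 4096, a factor
of 512 fewer.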

Because we are limiting this to huge pages it was necessary to add a
secondary location where we make the call as the buddy allocator can merge
smaller pages into a higher order huge page.
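
The need for a second call site can be illustrated with a toy buddy
allocator. This is a simplified userspace simulation, not the code from
the patches: `sim_zone`, `sim_free`, `HINT_ORDER` and `MAX_ORDER_SIM` are
hypothetical names, and order 9 (512 x 4K = 2 MiB) is an assumed huge page
order. A hint is counted whenever a free operation leaves behind a block
of at least huge page order, whether it was freed at that order directly
or reached it by merging.

```c
#include <stddef.h>

#define MAX_ORDER_SIM 10u                 /* hypothetical: largest block order */
#define HINT_ORDER    9u                  /* assumption: 2 MiB huge page */
#define NPAGES        (1u << MAX_ORDER_SIM)

struct sim_zone {
    /* free_order[pfn] == order + 1 if a free block of that order starts
     * at pfn, 0 otherwise. */
    unsigned char free_order[NPAGES];
    unsigned int hints;                   /* hints that would be sent to the host */
};

/* Free a block of 2^order pages at pfn (pfn must be aligned to the order). */
static void sim_free(struct sim_zone *z, unsigned int pfn, unsigned int order)
{
    /* Merge with the buddy for as long as the buddy is free at the same order. */
    while (order < MAX_ORDER_SIM) {
        unsigned int buddy = pfn ^ (1u << order);

        if (z->free_order[buddy] != order + 1)
            break;
        z->free_order[buddy] = 0;         /* buddy is absorbed */
        pfn &= ~(1u << order);            /* merged block starts at the lower pfn */
        order++;
    }
    z->free_order[pfn] = (unsigned char)(order + 1);

    /* One hint per free operation that results in a huge-page-sized block. */
    if (order >= HINT_ORDER)
        z->hints++;
}
```

Freeing pages 0 through 510 one at a time in a zero-initialized zone
produces no hint; freeing page 511 completes the order-9 block through
merging and produces exactly one. Without a check on the merge path, a
huge page assembled from 4K frees would never be reported.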

This approach is not usable in all cases. Specifically, when KVM direct
device assignment is used, the memory for a guest is permanently assigned
to physical pages in order to support DMA from the assigned device. In
this case we cannot give the pages back, so the hypercall is disabled by
the host.

Another situation that can lead to issues is if the page were accessed
immediately after free. For example, if page poisoning is enabled the
guest will populate the page *after* freeing it. In this case it does not
make sense to provide a hint about the page being freed so we do not
perform the hypercalls from the guest if this functionality is enabled.

My testing up till now has consisted of setting up 4 8GB VMs on a system
with 32GB of memory and 4GB of swap. To stress the memory on the system I
would run "memhog 8G" sequentially on each of the guests and observe how
long it took to complete the run. The observed behavior is that on the
systems with these patches applied in both the guest and on the host I was
able to complete the test with a time of 5 to 7 seconds per guest. On a
system without these patches the time ranged from 7 to 49 seconds per
guest. I am assuming the variability is due to time being spent writing
pages out to disk in order to free up space for the guest.

---

Alexander Duyck (4):
      madvise: Expose ability to set dontneed from kernel
      kvm: Add host side support for free memory hints
      kvm: Add guest side support for free memory hints
      mm: Add merge page notifier
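
For readers unfamiliar with the "dontneed" advice named in the first
patch, here is a userspace illustration of its effect on Linux. It is only
an analogy: the patch exposes the operation to in-kernel callers, and its
internals are not shown here. `dontneed_demo` is a hypothetical helper. It
dirties a private anonymous mapping, discards it with `MADV_DONTNEED`, and
reads it back.

```c
#define _DEFAULT_SOURCE                   /* for MAP_ANONYMOUS and madvise() */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Returns the first byte of the mapping as read after MADV_DONTNEED,
 * or -1 on error. `len` should be a multiple of the page size.
 */
static int dontneed_demo(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int value;

    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);                 /* back the range with real pages */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    value = p[0];                         /* faults in a fresh zero-filled page */
    munmap(p, len);
    return value;
}
```

On Linux the read after the discard returns 0 rather than 0xAA: the old
contents are gone and the kernel supplies a new zero page on the next
access. This is also why hinting a page that the guest immediately writes
again, as with page poisoning, gains nothing.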


 Documentation/virtual/kvm/cpuid.txt      |    4 ++
 Documentation/virtual/kvm/hypercalls.txt |   14 ++++++++
 arch/x86/include/asm/page.h              |   25 +++++++++++++++
 arch/x86/include/uapi/asm/kvm_para.h     |    3 ++
 arch/x86/kernel/kvm.c                    |   51 ++++++++++++++++++++++++++++++
 arch/x86/kvm/cpuid.c                     |    6 +++-
 arch/x86/kvm/x86.c                       |   35 +++++++++++++++++++++
 include/linux/gfp.h                      |    4 ++
 include/linux/mm.h                       |    2 +
 include/uapi/linux/kvm_para.h            |    1 +
 mm/madvise.c                             |   13 +++++++-
 mm/page_alloc.c                          |    2 +
 12 files changed, 158 insertions(+), 2 deletions(-)

--

Comments

Nitesh Narayan Lal Feb. 5, 2019, 5:25 p.m. UTC | #1
On 2/4/19 1:15 PM, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.
>
> In order to support this I have added a new hypercall to provide unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.

Hi Alexander,

Can you share the host memory usage before and after your run, both with
and without your patch set applied?

>
> ---
>
> Alexander Duyck (4):
>       madvise: Expose ability to set dontneed from kernel
>       kvm: Add host side support for free memory hints
>       kvm: Add guest side support for free memory hints
>       mm: Add merge page notifier
>
>
>  Documentation/virtual/kvm/cpuid.txt      |    4 ++
>  Documentation/virtual/kvm/hypercalls.txt |   14 ++++++++
>  arch/x86/include/asm/page.h              |   25 +++++++++++++++
>  arch/x86/include/uapi/asm/kvm_para.h     |    3 ++
>  arch/x86/kernel/kvm.c                    |   51 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/cpuid.c                     |    6 +++-
>  arch/x86/kvm/x86.c                       |   35 +++++++++++++++++++++
>  include/linux/gfp.h                      |    4 ++
>  include/linux/mm.h                       |    2 +
>  include/uapi/linux/kvm_para.h            |    1 +
>  mm/madvise.c                             |   13 +++++++-
>  mm/page_alloc.c                          |    2 +
>  12 files changed, 158 insertions(+), 2 deletions(-)
>
> --
Alexander Duyck Feb. 5, 2019, 6:43 p.m. UTC | #2
On Tue, 2019-02-05 at 12:25 -0500, Nitesh Narayan Lal wrote:
> On 2/4/19 1:15 PM, Alexander Duyck wrote:
> > This patch set provides a mechanism by which guests can notify the host of
> > pages that are not currently in use. Using this data a KVM host can more
> > easily balance memory workloads between guests and improve overall system
> > performance by avoiding unnecessary writing of unused pages to swap.
> > 
> > In order to support this I have added a new hypercall to provide unused
> > page hints and made use of mechanisms currently used by PowerPC and s390
> > architectures to provide those hints. To reduce the overhead of this call
> > I am only using it per huge page instead of doing a notification per 4K
> > page. By doing this we can avoid the expense of fragmenting higher order
> > pages, and reduce overall cost for the hypercall as it will only be
> > performed once per huge page.
> > 
> > Because we are limiting this to huge pages it was necessary to add a
> > secondary location where we make the call as the buddy allocator can merge
> > smaller pages into a higher order huge page.
> > 
> > This approach is not usable in all cases. Specifically, when KVM direct
> > device assignment is used, the memory for a guest is permanently assigned
> > to physical pages in order to support DMA from the assigned device. In
> > this case we cannot give the pages back, so the hypercall is disabled by
> > the host.
> > 
> > Another situation that can lead to issues is if the page were accessed
> > immediately after free. For example, if page poisoning is enabled the
> > guest will populate the page *after* freeing it. In this case it does not
> > make sense to provide a hint about the page being freed so we do not
> > perform the hypercalls from the guest if this functionality is enabled.
> > 
> > My testing up till now has consisted of setting up 4 8GB VMs on a system
> > with 32GB of memory and 4GB of swap. To stress the memory on the system I
> > would run "memhog 8G" sequentially on each of the guests and observe how
> > long it took to complete the run. The observed behavior is that on the
> > systems with these patches applied in both the guest and on the host I was
> > able to complete the test with a time of 5 to 7 seconds per guest. On a
> > system without these patches the time ranged from 7 to 49 seconds per
> > guest. I am assuming the variability is due to time being spent writing
> > pages out to disk in order to free up space for the guest.
> 
> Hi Alexander,
> 
> Can you share the host memory usage before and after your run, both with
> and without your patch set applied?

Here are some snippets from the /proc/meminfo for the system both
before and after the test.

Without patches
-- Before --
MemTotal:       32881396 kB
MemFree:        21363724 kB
MemAvailable:   25891228 kB
Buffers:            2276 kB
Cached:          4760280 kB
SwapCached:            0 kB
Active:          7166952 kB
Inactive:        1474980 kB
Active(anon):    3893308 kB
Inactive(anon):     8776 kB
Active(file):    3273644 kB
Inactive(file):  1466204 kB
Unevictable:       16756 kB
Mlocked:           16756 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:             29812 kB
Writeback:             0 kB
AnonPages:       3896540 kB
Mapped:            75568 kB
Shmem:             10044 kB

-- After --
MemTotal:       32881396 kB
MemFree:          194668 kB
MemAvailable:      51356 kB
Buffers:              24 kB
Cached:           129036 kB
SwapCached:       224396 kB
Active:         27223304 kB
Inactive:        2589736 kB
Active(anon):   27220360 kB
Inactive(anon):  2481592 kB
Active(file):       2944 kB
Inactive(file):   108144 kB
Unevictable:       16756 kB
Mlocked:           16756 kB
SwapTotal:       4194300 kB
SwapFree:          35616 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:      29476628 kB
Mapped:            22820 kB
Shmem:              5516 kB

With patches
-- Before --
MemTotal:       32881396 kB
MemFree:        26618880 kB
MemAvailable:   27056004 kB
Buffers:            2276 kB
Cached:           781496 kB
SwapCached:            0 kB
Active:          3309056 kB
Inactive:         393796 kB
Active(anon):    2932728 kB
Inactive(anon):     8776 kB
Active(file):     376328 kB
Inactive(file):   385020 kB
Unevictable:       16756 kB
Mlocked:           16756 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:                96 kB
Writeback:             0 kB
AnonPages:       2935964 kB
Mapped:            75428 kB
Shmem:             10048 kB

-- After --
MemTotal:       32881396 kB
MemFree:        22677904 kB
MemAvailable:   26543092 kB
Buffers:            2276 kB
Cached:          4205908 kB
SwapCached:            0 kB
Active:          3863016 kB
Inactive:        3768596 kB
Active(anon):    3437368 kB
Inactive(anon):     8772 kB
Active(file):     425648 kB
Inactive(file):  3759824 kB
Unevictable:       16756 kB
Mlocked:           16756 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:           1336180 kB
Writeback:             0 kB
AnonPages:       3440528 kB
Mapped:            74992 kB
Shmem:             10044 kB
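
As a reading aid for the four snapshots above, the sketch below derives
swap consumption and anonymous memory growth from the quoted SwapTotal,
SwapFree and AnonPages values. The struct and helper names are
hypothetical; the numbers are copied from the snapshots.

```c
#include <stdint.h>

/* Selected /proc/meminfo fields, all in kB. */
struct meminfo_sample {
    uint64_t swap_total;
    uint64_t swap_free;
    uint64_t anon_pages;
};

/* Swap currently in use. */
static uint64_t swap_used_kb(const struct meminfo_sample *s)
{
    return s->swap_total - s->swap_free;
}

/* Growth in anonymous memory between two samples. */
static uint64_t anon_growth_kb(const struct meminfo_sample *before,
                               const struct meminfo_sample *after)
{
    return after->anon_pages - before->anon_pages;
}

/* Values from the snapshots: { SwapTotal, SwapFree, AnonPages }. */
static const struct meminfo_sample without_before = { 4194300, 4194300,  3896540 };
static const struct meminfo_sample without_after  = { 4194300,   35616, 29476628 };
static const struct meminfo_sample with_before    = { 4194300, 4194300,  2935964 };
static const struct meminfo_sample with_after     = { 4194300, 4194300,  3440528 };
```

Without the patches, swap in use after the run is 4158684 kB (nearly all
of the 4GB) and AnonPages grew by 25580088 kB. With the patches, no swap
is in use and AnonPages grew by only 504564 kB.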
Nitesh Narayan Lal Feb. 7, 2019, 2:48 p.m. UTC | #3
On 2/4/19 1:15 PM, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.
>
> In order to support this I have added a new hypercall to provide unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
Hi Alexander,

Did you get a chance to look at my v8 posting of Guest Free Page Hinting
[1]?
Since both solutions are trying to solve the same problem, it would be
great if we could collaborate and come up with a unified solution.

[1] https://lkml.org/lkml/2019/2/4/993
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.
>
> ---
>
> Alexander Duyck (4):
>       madvise: Expose ability to set dontneed from kernel
>       kvm: Add host side support for free memory hints
>       kvm: Add guest side support for free memory hints
>       mm: Add merge page notifier
>
>
>  Documentation/virtual/kvm/cpuid.txt      |    4 ++
>  Documentation/virtual/kvm/hypercalls.txt |   14 ++++++++
>  arch/x86/include/asm/page.h              |   25 +++++++++++++++
>  arch/x86/include/uapi/asm/kvm_para.h     |    3 ++
>  arch/x86/kernel/kvm.c                    |   51 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/cpuid.c                     |    6 +++-
>  arch/x86/kvm/x86.c                       |   35 +++++++++++++++++++++
>  include/linux/gfp.h                      |    4 ++
>  include/linux/mm.h                       |    2 +
>  include/uapi/linux/kvm_para.h            |    1 +
>  mm/madvise.c                             |   13 +++++++-
>  mm/page_alloc.c                          |    2 +
>  12 files changed, 158 insertions(+), 2 deletions(-)
>
> --
Alexander Duyck Feb. 7, 2019, 4:56 p.m. UTC | #4
On Thu, 2019-02-07 at 09:48 -0500, Nitesh Narayan Lal wrote:
> On 2/4/19 1:15 PM, Alexander Duyck wrote:
> > This patch set provides a mechanism by which guests can notify the host of
> > pages that are not currently in use. Using this data a KVM host can more
> > easily balance memory workloads between guests and improve overall system
> > performance by avoiding unnecessary writing of unused pages to swap.
> > 
> > In order to support this I have added a new hypercall to provide unused
> > page hints and made use of mechanisms currently used by PowerPC and s390
> > architectures to provide those hints. To reduce the overhead of this call
> > I am only using it per huge page instead of doing a notification per 4K
> > page. By doing this we can avoid the expense of fragmenting higher order
> > pages, and reduce overall cost for the hypercall as it will only be
> > performed once per huge page.
> > 
> > Because we are limiting this to huge pages it was necessary to add a
> > secondary location where we make the call as the buddy allocator can merge
> > smaller pages into a higher order huge page.
> > 
> > This approach is not usable in all cases. Specifically, when KVM direct
> > device assignment is used, the memory for a guest is permanently assigned
> > to physical pages in order to support DMA from the assigned device. In
> > this case we cannot give the pages back, so the hypercall is disabled by
> > the host.
> > 
> > Another situation that can lead to issues is if the page were accessed
> > immediately after free. For example, if page poisoning is enabled the
> > guest will populate the page *after* freeing it. In this case it does not
> > make sense to provide a hint about the page being freed so we do not
> > perform the hypercalls from the guest if this functionality is enabled.
> 
> Hi Alexander,
> 
> Did you get a chance to look at my v8 posting of Guest Free Page Hinting
> [1]?
> Since both solutions are trying to solve the same problem, it would be
> great if we could collaborate and come up with a unified solution.
> 
> [1] https://lkml.org/lkml/2019/2/4/993

I haven't had a chance to review these yet.

I'll try to take a look later today and provide review notes based on
what I find.

Thanks.

- Alex
Michael S. Tsirkin Feb. 10, 2019, 12:51 a.m. UTC | #5
On Mon, Feb 04, 2019 at 10:15:33AM -0800, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.

There's an obvious overlap here with Nilal's work and with Wei's
already-merged work. So please Cc the people reviewing Nilal's and Wei's
patches.


> In order to support this I have added a new hypercall to provide unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
> 
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
> 
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
> 
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
> 
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.
> 
> ---
> 
> Alexander Duyck (4):
>       madvise: Expose ability to set dontneed from kernel
>       kvm: Add host side support for free memory hints
>       kvm: Add guest side support for free memory hints
>       mm: Add merge page notifier
> 
> 
>  Documentation/virtual/kvm/cpuid.txt      |    4 ++
>  Documentation/virtual/kvm/hypercalls.txt |   14 ++++++++
>  arch/x86/include/asm/page.h              |   25 +++++++++++++++
>  arch/x86/include/uapi/asm/kvm_para.h     |    3 ++
>  arch/x86/kernel/kvm.c                    |   51 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/cpuid.c                     |    6 +++-
>  arch/x86/kvm/x86.c                       |   35 +++++++++++++++++++++
>  include/linux/gfp.h                      |    4 ++
>  include/linux/mm.h                       |    2 +
>  include/uapi/linux/kvm_para.h            |    1 +
>  mm/madvise.c                             |   13 +++++++-
>  mm/page_alloc.c                          |    2 +
>  12 files changed, 158 insertions(+), 2 deletions(-)
> 
> --