[RFC,v11,0/2] mm: Support for page hinting

Message ID 20190710195158.19640-1-nitesh@redhat.com (mailing list archive)
Series mm: Support for page hinting

Message

Nitesh Narayan Lal July 10, 2019, 7:51 p.m. UTC
This patch series proposes an efficient mechanism for reporting free memory
from a guest to its hypervisor. It especially enables guests with no page cache
(e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
rapidly hand back free memory to the hypervisor.
This approach has a minimal impact on the existing core-mm infrastructure.

Measurement results (measurement details appended to this email):
*Number of 5GB guests (each touching 4GB memory) that can be launched
without swap usage on a system with 15GB:
unmodified kernel - 2, 3rd with 2.5GB   
v11 page hinting - 6, 7th with 26MB    
v1 bubble hinting[1] - 6, 7th with 1.8GB (bubble hinting is another series
proposed to solve the same problems)

*Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           


This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
A new hook after buddy merging is used to set the bits in the bitmap.
Currently, the bits are only cleared when pages are hinted, not when pages are
re-allocated.
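
To make the bitmap tracking concrete, here is a minimal illustrative sketch of
what such a hook could look like. This is not code from the series;
hint_bitmap_index() and zone->free_hint_bitmap are hypothetical names:

#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Illustrative only: record a page freed into the buddy as a hinting
 * candidate by setting its bit in the per-zone bitmap.
 */
static inline void page_hinting_mark_free(struct zone *zone,
                                          struct page *page, int order)
{
        /* Only pages of at least the tracked granularity are recorded. */
        if (order < MAX_ORDER - 2)
                return;

        /* The buddy free path already holds zone->lock at this point. */
        set_bit(hint_bitmap_index(zone, page_to_pfn(page)),
                zone->free_hint_bitmap);
}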

Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
threshold is met, trying to isolate and report pages that are still free.
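
A similarly hypothetical sketch of the threshold trigger (free_hint_threshold,
zone->free_hint_count and zone->hinting_work are assumed names, not the
series' API):

#include <linux/atomic.h>
#include <linux/workqueue.h>

/*
 * Illustrative only: once enough tracked pages accumulate in a zone,
 * kick the asynchronous worker that scans the bitmap.
 */
static void page_hinting_maybe_schedule(struct zone *zone)
{
        if (atomic_read(&zone->free_hint_count) >= free_hint_threshold)
                schedule_work(&zone->hinting_work);
}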

The isolated pages are reported via virtio-balloon, which is responsible for
sending batched pages to the host synchronously. Once the hypervisor has
processed the hinting request, the isolated pages are returned to the buddy.
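
On the host side, processing a hinting request essentially means dropping the
backing pages of each reported range. A rough userspace sketch of what the VMM
might do (illustrative only; process_hint() is a made-up helper, and whether
MADV_DONTNEED or MADV_FREE should be used is still an open item, see below):

#include <stddef.h>
#include <sys/mman.h>

/* hva/len describe the host mapping of one reported guest-free range. */
static int process_hint(void *hva, size_t len)
{
        return madvise(hva, len, MADV_DONTNEED);
}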

Changelog in v11:
* Added logic to take care of multiple NUMA node scenarios.
* Simplified the logic for reporting isolated pages to the host (e.g., replaced
dynamically allocated arrays with static ones, and introduced a wait event
instead of a polling loop to wait for a response from the host).
* Added a mutex to prevent a race condition when page hinting is enabled by
multiple drivers.
* Simplified the logic responsible for decrementing the free page counter for
each zone.
* Simplified code structuring/naming.
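
For reference, the wait event mentioned above follows the standard kernel
wait_event()/wake_up() pattern; a simplified sketch (hinting_wq, host_acked
and hinting_ack() are hypothetical names, and locking/ordering details are
omitted):

#include <linux/virtio.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(hinting_wq);
static bool host_acked;

/* Reporting side: sleep instead of spinning until the host responds. */
static void wait_for_host_response(void)
{
        wait_event(hinting_wq, host_acked);
}

/* Virtqueue callback invoked on the host's acknowledgement. */
static void hinting_ack(struct virtqueue *vq)
{
        host_acked = true;
        wake_up(&hinting_wq);
}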

Known work items for the future:
* Test device-assigned guests to ensure that hinting doesn't break them.
* Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
* Decide between MADV_DONTNEED and MADV_FREE.
* Look into memory hotplug, more efficient locking, better naming conventions to
avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
* Come up with proper/traceable error messages/logs and look into other code
simplifications (if necessary).

Benefit analysis:
1. Number of 5GB guests (each touching 4GB memory) that can be launched without
swap usage on a system with 15GB:
unmodified kernel - 2, 3rd with 2.5GB   
v11 page hinting - 6, 7th with 26MB    
v1 bubble hinting - 6, 7th with 1.8GB   

Conclusion - In this particular test case, using either v11 page hinting or
v1 bubble hinting, 4 more guests could be launched without swapping compared
to an unmodified kernel.
For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
hinting as it touches less swap space.

Setup & procedure - 
Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
Guest Memory = 5GB
Number of CPUs in the guest = 1
Host swap = 4GB
Workload = test allocation program that allocates 4GB memory, touches it via
memset and exits.
The first guest is launched, and once its console is up, the test allocation
program is executed with a 4 GB memory request (due to this, the guest occupies
almost 4-5 GB of memory on the host in a system without page hinting). Once
this program exits, another guest is launched on the host and the same process
is followed. This is repeated until swap starts being used.
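
The test allocation workload described above amounts to something like the
following minimal sketch (not the exact test source):

#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t size = 4UL << 30;        /* 4 GB */
        char *buf = malloc(size);

        if (!buf)
                return 1;
        memset(buf, 1, size);           /* fault in every page */
        free(buf);
        return 0;
}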

2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           

For this particular test case, in a guest which doesn't require swap access,
the "memhog 6G" execution time lies within a range of 15-30s.
Conclusion -
In the above test case, for an unmodified kernel, the memhog execution time in
the third guest rises to above 2 minutes due to swap access.
Using either page hinting or bubble hinting brings this execution time back to
the normal range of 15-30s.

Setup & procedure -
Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
Guest Memory = 6GB
Number of CPUs in the guest = 4
Process = 3 guests are launched and the ‘memhog 6G’ execution time is
monitored in each of them, one after the other.
Host swap = 4GB

Performance Analysis:
1. will-it-scale's page_fault1
Setup -
Guest Memory = 6GB
Number of cores = 24
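
Each row below is will-it-scale's raw CSV output; assuming the standard
will-it-scale output format, the columns are: number of tasks, process-mode
score, idle percentage in process mode, thread-mode score, idle percentage in
thread mode, and the ideal linear-scaling reference.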

Unmodified kernel -
0,0,100,0,100,0
1,514453,95.84,519502,95.83,519502
2,991485,91.67,932268,91.68,1039004
3,1381237,87.36,1264214,87.64,1558506
4,1789116,83.36,1597767,83.88,2078008
5,2181552,79.20,1889489,80.08,2597510
6,2452416,75.05,2001879,77.10,3117012
7,2671047,70.90,2263866,73.22,3636514
8,2930081,66.75,2333813,70.60,4156016
9,3126431,62.60,2370108,68.28,4675518
10,3211937,58.44,2454093,65.74,5195020
11,3162172,54.32,2450822,63.21,5714522
12,3154261,50.14,2272290,58.98,6234024
13,3115174,46.02,2369679,57.74,6753526
14,3150511,41.86,2470837,54.02,7273028
15,3134158,37.71,2428129,51.98,7792530
16,3143067,33.57,2340469,49.54,8312032
17,3112457,29.43,2263627,44.81,8831534
18,3089724,25.29,2181879,38.69,9351036
19,3076878,21.15,2236505,40.01,9870538
20,3091978,16.95,2266327,35.00,10390040
21,3082927,12.84,2172578,28.12,10909542
22,3055282,8.73,2176269,29.14,11429044
23,3081144,4.56,2138442,24.87,11948546
24,3075509,0.45,2173753,21.62,12468048

page hinting -
0,0,100,0,100,0
1,491683,95.83,494366,95.82,494366
2,988415,91.67,919660,91.68,988732
3,1344829,87.52,1244608,87.69,1483098
4,1797933,83.37,1625797,83.70,1977464
5,2179009,79.21,1881534,80.13,2471830
6,2449858,75.07,2078137,76.82,2966196
7,2732122,70.90,2178105,73.75,3460562
8,2910965,66.75,2340901,70.28,3954928
9,3006665,62.61,2353748,67.91,4449294
10,3164752,58.46,2377936,65.08,4943660
11,3234846,54.32,2510149,63.14,5438026
12,3165477,50.17,2412007,59.91,5932392
13,3141457,46.05,2421548,57.85,6426758
14,3135839,41.90,2378021,53.81,6921124
15,3109113,37.75,2269290,51.76,7415490
16,3093613,33.62,2346185,48.73,7909856
17,3086542,29.49,2352140,46.19,8404222
18,3048991,25.36,2217144,41.52,8898588
19,2965500,21.18,2313614,38.18,9392954
20,2928977,17.05,2175316,35.67,9887320
21,2896667,12.91,2141311,28.90,10381686
22,3047782,8.76,2177664,28.24,10876052
23,2994503,4.58,2160976,22.97,11370418
24,3038762,0.47,2053533,22.39,11864784

bubble-hinting v1 - 
0,0,100,0,100,0
1,515272,95.83,492355,95.81,515272
2,985903,91.66,919653,91.68,1030544
3,1475300,87.51,1353723,87.65,1545816
4,1783938,83.36,1586307,83.78,2061088
5,2093307,79.20,1867395,79.95,2576360
6,2441370,75.05,2055421,76.65,3091632
7,2650471,70.89,2246014,72.93,3606904
8,2926782,66.75,2333601,70.41,4122176
9,3107617,62.60,2383112,68.46,4637448
10,3192332,58.44,2441626,65.84,5152720
11,3268043,54.32,2235964,62.92,5667992
12,3191105,50.18,2449045,60.49,6183264
13,3145317,46.05,2377317,57.80,6698536
14,3161552,41.91,2395814,53.26,7213808
15,3140443,37.77,2333200,51.42,7729080
16,3130866,33.65,2150967,46.11,8244352
17,3112894,29.52,2372068,45.93,8759624
18,3078424,25.39,2336211,39.85,9274896
19,3036457,21.27,2224821,35.25,9790168
20,3046330,17.13,2199755,37.43,10305440
21,2981130,12.98,2214862,28.67,10820712
22,3017481,8.84,2195996,29.69,11335984
23,2979906,4.68,2173395,25.90,11851256
24,2971170,0.52,2134311,21.89,12366528

Conclusion -
For an unmodified kernel, with every fresh boot there is a 3-4% delta observed
in the results w.r.t. the numbers mentioned above. For both bubble hinting and
page hinting, there was no noticeable degradation observed beyond this
expected boot-to-boot variability.

Page hinting vs bubble hinting:
From a benefits and performance perspective, both solutions look quite similar
so far. However, unlike bubble hinting, which is more invasive, the core mm
changes required for page hinting are minimal.

[1] https://lkml.org/lkml/2019/6/19/926

Nitesh Narayan Lal (2):
  mm: page_hinting: core infrastructure
  virtio-balloon: page_hinting: reporting to the host

 drivers/virtio/Kconfig              |   1 +
 drivers/virtio/virtio_balloon.c     |  91 +++++++++-
 include/linux/page_hinting.h        |  45 +++++
 include/uapi/linux/virtio_balloon.h |  11 ++
 mm/Kconfig                          |   6 +
 mm/Makefile                         |   1 +
 mm/page_alloc.c                     |  18 +-
 mm/page_hinting.c                   | 250 ++++++++++++++++++++++++++++
 8 files changed, 414 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 mm/page_hinting.c

--

Comments

Dave Hansen July 10, 2019, 8:19 p.m. UTC | #1
On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
> This patch series proposes an efficient mechanism for reporting free memory
> from a guest to its hypervisor. It especially enables guests with no page cache
> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> rapidly hand back free memory to the hypervisor.
> This approach has a minimal impact on the existing core-mm infrastructure.
> 
> Measurement results (measurement details appended to this email):
> *Number of 5GB guests (each touching 4GB memory) that can be launched
> without swap usage on a system with 15GB:

This sounds like a reasonable measurement, but I think you're missing a
sentence or two explaining why this test was used.

> unmodified kernel - 2, 3rd with 2.5GB   

What does "3rd with 2.5GB" mean?  The third gets 2.5GB before failing an
allocation and crashing?

> v11 page hinting - 6, 7th with 26MB    
> v1 bubble hinting[1] - 6, 7th with 1.8GB (bubble hinting is another series
> proposed to solve the same problems)

Could you please make an effort to format things so that reviewers can
easily read them?  Aligning columns and using common units would be very
helpful, for instance:

     unmodified kernel - 2, 3rd with 2.50 GB
      v11 page hinting - 6, 7th with 0.03 GB
  v1 bubble hinting[1] - 6, 7th with 1.80 GB

See how you can scan that easily and compare between the rows?

I think you did some analysis below.  But, that seems misplaced.  It's
better to include the conclusion here and the details to back it up
later.  As it stands, the cover letter just throws some data at a
reviewer and hopes they can make sense of it.

> *Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           

Again, I'm finding myself having to reformat your data just so I can
make sense of it.  You also forgot the unit for Guest 1 in row 3.

   unmodified - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB

  v11 hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
  v1 bubble   - Guest1:23s, Guest2:11s, Guest3:26s swap used = 0

So, what is this supposed to show?  What does it mean?  Why do the
numbers vary *so* much?
Alexander Duyck July 10, 2019, 11:40 p.m. UTC | #2
On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:

The results up here were redundant with what is below so I am just
dropping them. I would suggest only including one set of results in
any future cover page as it is confusing to duplicate it like that.

> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
> A new hook after buddy merging is used to set the bits in the bitmap.
> Currently, the bits are only cleared when pages are hinted, not when pages are
> re-allocated.
>
> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
> threshold is met, trying to isolate and report pages that are still free.
>
> The isolated pages are reported via virtio-balloon, which is responsible for
> sending batched pages to the host synchronously. Once the hypervisor has
> processed the hinting request, the isolated pages are returned to the buddy.
>
> Changelog in v11:
> * Added logic to take care of multiple NUMA node scenarios.
> * Simplified the logic for reporting isolated pages to the host (e.g., replaced
> dynamically allocated arrays with static ones, and introduced a wait event
> instead of a polling loop to wait for a response from the host).
> * Added a mutex to prevent a race condition when page hinting is enabled by
> multiple drivers.
> * Simplified the logic responsible for decrementing the free page counter for
> each zone.
> * Simplified code structuring/naming.
>
> Known work items for the future:
> * Test device-assigned guests to ensure that hinting doesn't break them.
> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
> * Decide between MADV_DONTNEED and MADV_FREE.
> * Look into memory hotplug, more efficient locking, better naming conventions to
> avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
> * Come up with proper/traceable error messages/logs and look into other code
> simplifications (if necessary).
>
> Benefit analysis:
> 1. Number of 5GB guests (each touching 4GB memory) that can be launched without
> swap usage on a system with 15GB:
> unmodified kernel - 2, 3rd with 2.5GB
> v11 page hinting - 6, 7th with 26MB
> v1 bubble hinting - 6, 7th with 1.8GB
>
> Conclusion - In this particular test case, using either v11 page hinting or
> v1 bubble hinting, 4 more guests could be launched without swapping compared
> to an unmodified kernel.
> For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
> hinting as it touches less swap space.

I'm confused by the comment. From what I can tell bubble hinting came
up with 1.8GB of memory while page hinting only managed to achieve
.026GB (Using the same units makes it easier to visualize the
difference). Also your test says "can be launched without swap usage",
yet you say the bubble hinting is touching swap, which makes no sense
to me.

> Setup & procedure -
> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
> Guest Memory = 5GB
> Number of CPUs in the guest = 1
> Host swap = 4GB
> Workload = test allocation program that allocates 4GB memory, touches it via
> memset and exits.
> The first guest is launched, and once its console is up, the test allocation
> program is executed with a 4 GB memory request (due to this, the guest occupies
> almost 4-5 GB of memory on the host in a system without page hinting). Once
> this program exits, another guest is launched on the host and the same process
> is followed. This is repeated until swap starts being used.
>
> 2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0
>
> For this particular test case, in a guest which doesn't require swap access,
> the "memhog 6G" execution time lies within a range of 15-30s.
> Conclusion -
> In the above test case, for an unmodified kernel, the memhog execution time in
> the third guest rises to above 2 minutes due to swap access.
> Using either page hinting or bubble hinting brings this execution time back to
> the normal range of 15-30s.

So really this test doesn't add much value. The whole reason why
Guest3 runs so much slower is because it is going to swap. I initially
did this to demonstrate a point, but now running this test doesn't
prove much as it isn't really meant to be a performance test. It is
essentially just a duplicate of the "how many guests can you run" test
that is passing itself off as some sort of performance test.

We could probably just drop this from future versions of this as long
as we verify that the memory hinting is freeing most of the memory
back and the guest is reporting a size less than the total guest
memory size.

> Setup & procedure -
> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
> Guest Memory = 6GB
> Number of CPUs in the guest = 4
> Process = 3 guests are launched and the ‘memhog 6G’ execution time is
> monitored in each of them, one after the other.
> Host swap = 4GB
>
> Performance Analysis:
> 1. will-it-scale's page_fault1
> Setup -
> Guest Memory = 6GB
> Number of cores = 24
>
> Unmodified kernel -
> 0,0,100,0,100,0
> 1,514453,95.84,519502,95.83,519502
> 2,991485,91.67,932268,91.68,1039004
> 3,1381237,87.36,1264214,87.64,1558506
> 4,1789116,83.36,1597767,83.88,2078008
> 5,2181552,79.20,1889489,80.08,2597510
> 6,2452416,75.05,2001879,77.10,3117012
> 7,2671047,70.90,2263866,73.22,3636514
> 8,2930081,66.75,2333813,70.60,4156016
> 9,3126431,62.60,2370108,68.28,4675518
> 10,3211937,58.44,2454093,65.74,5195020
> 11,3162172,54.32,2450822,63.21,5714522
> 12,3154261,50.14,2272290,58.98,6234024
> 13,3115174,46.02,2369679,57.74,6753526
> 14,3150511,41.86,2470837,54.02,7273028
> 15,3134158,37.71,2428129,51.98,7792530
> 16,3143067,33.57,2340469,49.54,8312032
> 17,3112457,29.43,2263627,44.81,8831534
> 18,3089724,25.29,2181879,38.69,9351036
> 19,3076878,21.15,2236505,40.01,9870538
> 20,3091978,16.95,2266327,35.00,10390040
> 21,3082927,12.84,2172578,28.12,10909542
> 22,3055282,8.73,2176269,29.14,11429044
> 23,3081144,4.56,2138442,24.87,11948546
> 24,3075509,0.45,2173753,21.62,12468048
>
> page hinting -
> 0,0,100,0,100,0
> 1,491683,95.83,494366,95.82,494366
> 2,988415,91.67,919660,91.68,988732
> 3,1344829,87.52,1244608,87.69,1483098
> 4,1797933,83.37,1625797,83.70,1977464
> 5,2179009,79.21,1881534,80.13,2471830
> 6,2449858,75.07,2078137,76.82,2966196
> 7,2732122,70.90,2178105,73.75,3460562
> 8,2910965,66.75,2340901,70.28,3954928
> 9,3006665,62.61,2353748,67.91,4449294
> 10,3164752,58.46,2377936,65.08,4943660
> 11,3234846,54.32,2510149,63.14,5438026
> 12,3165477,50.17,2412007,59.91,5932392
> 13,3141457,46.05,2421548,57.85,6426758
> 14,3135839,41.90,2378021,53.81,6921124
> 15,3109113,37.75,2269290,51.76,7415490
> 16,3093613,33.62,2346185,48.73,7909856
> 17,3086542,29.49,2352140,46.19,8404222
> 18,3048991,25.36,2217144,41.52,8898588
> 19,2965500,21.18,2313614,38.18,9392954
> 20,2928977,17.05,2175316,35.67,9887320
> 21,2896667,12.91,2141311,28.90,10381686
> 22,3047782,8.76,2177664,28.24,10876052
> 23,2994503,4.58,2160976,22.97,11370418
> 24,3038762,0.47,2053533,22.39,11864784
>
> bubble-hinting v1 -
> 0,0,100,0,100,0
> 1,515272,95.83,492355,95.81,515272
> 2,985903,91.66,919653,91.68,1030544
> 3,1475300,87.51,1353723,87.65,1545816
> 4,1783938,83.36,1586307,83.78,2061088
> 5,2093307,79.20,1867395,79.95,2576360
> 6,2441370,75.05,2055421,76.65,3091632
> 7,2650471,70.89,2246014,72.93,3606904
> 8,2926782,66.75,2333601,70.41,4122176
> 9,3107617,62.60,2383112,68.46,4637448
> 10,3192332,58.44,2441626,65.84,5152720
> 11,3268043,54.32,2235964,62.92,5667992
> 12,3191105,50.18,2449045,60.49,6183264
> 13,3145317,46.05,2377317,57.80,6698536
> 14,3161552,41.91,2395814,53.26,7213808
> 15,3140443,37.77,2333200,51.42,7729080
> 16,3130866,33.65,2150967,46.11,8244352
> 17,3112894,29.52,2372068,45.93,8759624
> 18,3078424,25.39,2336211,39.85,9274896
> 19,3036457,21.27,2224821,35.25,9790168
> 20,3046330,17.13,2199755,37.43,10305440
> 21,2981130,12.98,2214862,28.67,10820712
> 22,3017481,8.84,2195996,29.69,11335984
> 23,2979906,4.68,2173395,25.90,11851256
> 24,2971170,0.52,2134311,21.89,12366528

Okay, so this doesn't match up with the results you gave me last time
(https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
and actually more closely matches what I was expecting to see. The
bubble-hinting patches are performing within a few percent of what the
baseline kernel was doing. I am assuming the results from before had
some additional debugging enabled for the bubble-hinting test that
wasn't enabled for the other ones.

> Conclusion -
> For an unmodified kernel, with every fresh boot there is a 3-4% delta observed
> in the results w.r.t. the numbers mentioned above. For both bubble hinting and
> page hinting, there was no noticeable degradation observed beyond this
> expected boot-to-boot variability.
>
> Page hinting vs bubble hinting:
> From a benefits and performance perspective, both solutions look quite similar
> so far. However, unlike bubble hinting, which is more invasive, the core mm
> changes required for page hinting are minimal.
>
> [1] https://lkml.org/lkml/2019/6/19/926

So I think I called it out in the review of the patch but I think we
may want to see what happens if we increase the size of the memory in
the guest to something more like 64G or larger. My main concern is
that as we increase the size of memory the walk through the bitmap is
going to become more and more expensive and I am worried that at some
point it will start impacting the results.
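
(For scale, assuming 4 KiB base pages and MAX_ORDER = 11, the order
MAX_ORDER - 2 tracking granularity is 2 MiB per bit, so a 64 GB guest
corresponds to 64 GiB / 2 MiB = 32768 bits, i.e. roughly a 4 KiB bitmap
per zone to walk on every pass.)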
Nitesh Narayan Lal July 11, 2019, 11:30 a.m. UTC | #3
On 7/10/19 7:40 PM, Alexander Duyck wrote:
> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> The results up here were redundant with what is below so I am just
> dropping them. I would suggest only including one set of results in
> any future cover page as it is confusing to duplicate it like that.
>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor has
>> processed the hinting request, the isolated pages are returned to the buddy.
>>
>> Changelog in v11:
>> * Added logic to take care of multiple NUMA node scenarios.
>> * Simplified the logic for reporting isolated pages to the host (e.g., replaced
>> dynamically allocated arrays with static ones, and introduced a wait event
>> instead of a polling loop to wait for a response from the host).
>> * Added a mutex to prevent a race condition when page hinting is enabled by
>> multiple drivers.
>> * Simplified the logic responsible for decrementing the free page counter for
>> each zone.
>> * Simplified code structuring/naming.
>>
>> Known work items for the future:
>> * Test device-assigned guests to ensure that hinting doesn't break them.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, better naming conventions to
>> avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
>> * Come up with proper/traceable error messages/logs and look into other code
>> simplifications (if necessary).
>>
>> Benefit analysis:
>> 1. Number of 5GB guests (each touching 4GB memory) that can be launched without
>> swap usage on a system with 15GB:
>> unmodified kernel - 2, 3rd with 2.5GB
>> v11 page hinting - 6, 7th with 26MB
>> v1 bubble hinting - 6, 7th with 1.8GB
>>
>> Conclusion - In this particular test case, using either v11 page hinting or
>> v1 bubble hinting, 4 more guests could be launched without swapping compared
>> to an unmodified kernel.
>> For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
>> hinting as it touches less swap space.
> I'm confused by the comment. From what I can tell bubble hinting came
> up with 1.8GB of memory while page hinting only managed to achieve
> .026GB (Using the same units makes it easier to visualize the
> difference). Also your test says "can be launched without swap usage",
> yet you say the bubble hinting is touching swap, which makes no sense
> to me.
I will work on the cover to improve this part.
Basically, in each case, the first number indicates the number of
guests that were launched without touching the swap space. For instance,
with bubble hinting, I was able to launch 6 guests without any swap
usage. On launching the 7th guest, there was initially no swap usage;
however, as the test app started allocating 4GB of memory, swap came into
the picture. 1.8 GB is the swap usage after the completion of the test
application.
>> Setup & procedure -
>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>> Guest Memory = 5GB
>> Number of CPUs in the guest = 1
>> Host swap = 4GB
>> Workload = test allocation program that allocates 4GB memory, touches it via
>> memset and exits.
>> The first guest is launched, and once its console is up, the test allocation
>> program is executed with a 4 GB memory request (due to this, the guest occupies
>> almost 4-5 GB of memory on the host in a system without page hinting). Once
>> this program exits, another guest is launched on the host and the same process
>> is followed. This is repeated until swap starts being used.
>>
>> 2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
>> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
>> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
>> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0
>>
>> For this particular test case, in a guest which doesn't require swap access,
>> the "memhog 6G" execution time lies within a range of 15-30s.
>> Conclusion -
>> In the above test case, for an unmodified kernel, the memhog execution time in
>> the third guest rises to above 2 minutes due to swap access.
>> Using either page hinting or bubble hinting brings this execution time back to
>> the normal range of 15-30s.
> So really this test doesn't add much value. The whole reason why
> Guest3 runs so much slower is because it is going to swap. I initially
> did this to demonstrate a point, but now running this test doesn't
> prove much as it isn't really meant to be a performance test. It is
> essentially just a duplicate of the "how many guests can you run" test
> that is passing itself off as some sort of performance test.
>
> We could probably just drop this from future versions of this as long
> as we verify that the memory hinting is freeing most of the memory
> back and the guest is reporting a size less than the total guest
> memory size.
>
+1, makes sense to keep just one of the above two.
>> Setup & procedure -
>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>> Guest Memory = 6GB
>> Number of CPUs in the guest = 4
>> Process = 3 guests are launched and the ‘memhog 6G’ execution time is
>> monitored in each of them, one after the other.
>> Host swap = 4GB
>>
>> Performance Analysis:
>> 1. will-it-scale's page_fault1
>> Setup -
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Unmodified kernel -
>> 0,0,100,0,100,0
>> 1,514453,95.84,519502,95.83,519502
>> 2,991485,91.67,932268,91.68,1039004
>> 3,1381237,87.36,1264214,87.64,1558506
>> 4,1789116,83.36,1597767,83.88,2078008
>> 5,2181552,79.20,1889489,80.08,2597510
>> 6,2452416,75.05,2001879,77.10,3117012
>> 7,2671047,70.90,2263866,73.22,3636514
>> 8,2930081,66.75,2333813,70.60,4156016
>> 9,3126431,62.60,2370108,68.28,4675518
>> 10,3211937,58.44,2454093,65.74,5195020
>> 11,3162172,54.32,2450822,63.21,5714522
>> 12,3154261,50.14,2272290,58.98,6234024
>> 13,3115174,46.02,2369679,57.74,6753526
>> 14,3150511,41.86,2470837,54.02,7273028
>> 15,3134158,37.71,2428129,51.98,7792530
>> 16,3143067,33.57,2340469,49.54,8312032
>> 17,3112457,29.43,2263627,44.81,8831534
>> 18,3089724,25.29,2181879,38.69,9351036
>> 19,3076878,21.15,2236505,40.01,9870538
>> 20,3091978,16.95,2266327,35.00,10390040
>> 21,3082927,12.84,2172578,28.12,10909542
>> 22,3055282,8.73,2176269,29.14,11429044
>> 23,3081144,4.56,2138442,24.87,11948546
>> 24,3075509,0.45,2173753,21.62,12468048
>>
>> page hinting -
>> 0,0,100,0,100,0
>> 1,491683,95.83,494366,95.82,494366
>> 2,988415,91.67,919660,91.68,988732
>> 3,1344829,87.52,1244608,87.69,1483098
>> 4,1797933,83.37,1625797,83.70,1977464
>> 5,2179009,79.21,1881534,80.13,2471830
>> 6,2449858,75.07,2078137,76.82,2966196
>> 7,2732122,70.90,2178105,73.75,3460562
>> 8,2910965,66.75,2340901,70.28,3954928
>> 9,3006665,62.61,2353748,67.91,4449294
>> 10,3164752,58.46,2377936,65.08,4943660
>> 11,3234846,54.32,2510149,63.14,5438026
>> 12,3165477,50.17,2412007,59.91,5932392
>> 13,3141457,46.05,2421548,57.85,6426758
>> 14,3135839,41.90,2378021,53.81,6921124
>> 15,3109113,37.75,2269290,51.76,7415490
>> 16,3093613,33.62,2346185,48.73,7909856
>> 17,3086542,29.49,2352140,46.19,8404222
>> 18,3048991,25.36,2217144,41.52,8898588
>> 19,2965500,21.18,2313614,38.18,9392954
>> 20,2928977,17.05,2175316,35.67,9887320
>> 21,2896667,12.91,2141311,28.90,10381686
>> 22,3047782,8.76,2177664,28.24,10876052
>> 23,2994503,4.58,2160976,22.97,11370418
>> 24,3038762,0.47,2053533,22.39,11864784
>>
>> bubble-hinting v1 -
>> 0,0,100,0,100,0
>> 1,515272,95.83,492355,95.81,515272
>> 2,985903,91.66,919653,91.68,1030544
>> 3,1475300,87.51,1353723,87.65,1545816
>> 4,1783938,83.36,1586307,83.78,2061088
>> 5,2093307,79.20,1867395,79.95,2576360
>> 6,2441370,75.05,2055421,76.65,3091632
>> 7,2650471,70.89,2246014,72.93,3606904
>> 8,2926782,66.75,2333601,70.41,4122176
>> 9,3107617,62.60,2383112,68.46,4637448
>> 10,3192332,58.44,2441626,65.84,5152720
>> 11,3268043,54.32,2235964,62.92,5667992
>> 12,3191105,50.18,2449045,60.49,6183264
>> 13,3145317,46.05,2377317,57.80,6698536
>> 14,3161552,41.91,2395814,53.26,7213808
>> 15,3140443,37.77,2333200,51.42,7729080
>> 16,3130866,33.65,2150967,46.11,8244352
>> 17,3112894,29.52,2372068,45.93,8759624
>> 18,3078424,25.39,2336211,39.85,9274896
>> 19,3036457,21.27,2224821,35.25,9790168
>> 20,3046330,17.13,2199755,37.43,10305440
>> 21,2981130,12.98,2214862,28.67,10820712
>> 22,3017481,8.84,2195996,29.69,11335984
>> 23,2979906,4.68,2173395,25.90,11851256
>> 24,2971170,0.52,2134311,21.89,12366528
> Okay, so this doesn't match up with the results you gave me last time
> (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
> and actually more closely matches what I was expecting to see. The
> bubble-hinting patches are performing within a few percent of what the
> baseline kernel was doing. 
Interestingly, even with an unmodified kernel, with every fresh boot I
observed a certain amount of variability in the results, which I stated
below.
> I am assuming the results from before had
> some additional debugging enabled for the bubble-hinting test that
> wasn't enabled for the other ones.

Nope, I had debugging options enabled for all the cases. This time
around I disabled all the debug options.

>
>> Conclusion -
>> For an unmodified kernel, with every fresh boot there is a 3-4% delta observed
>> in the results w.r.t. the numbers mentioned above. For both bubble hinting and
>> page hinting, there was no noticeable degradation observed beyond this
>> expected boot-to-boot variability.
>>
>> Page hinting vs bubble hinting:
>> From a benefits and performance perspective, both solutions look quite similar
>> so far. However, unlike bubble hinting, which is more invasive, the core mm
>> changes required for page hinting are minimal.
>>
>> [1] https://lkml.org/lkml/2019/6/19/926
> So I think I called it out in the review of the patch but I think we
> may want to see what happens if we increase the size of the memory in
> the guest to something more like 64G or larger. My main concern is
> that as we increase the size of memory the walk through the bitmap is
> going to become more and more expensive and I am worried that at some
> point it will start impacting the results.
Ok, I can try that scenario.
Nitesh Narayan Lal July 11, 2019, 11:37 a.m. UTC | #4
On 7/10/19 4:19 PM, Dave Hansen wrote:
> On 7/10/19 12:51 PM, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for reporting free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
>>
>> Measurement results (measurement details appended to this email):
>> *Number of 5GB guests (each touching 4GB memory) that can be launched
>> without swap usage on a system with 15GB:
> This sounds like a reasonable measurement, but I think you're missing a
> sentence or two explaining why this test was used.
I will re-work the cover email to better communicate the numbers.
>
>> unmodified kernel - 2, 3rd with 2.5GB   
> What does "3rd with 2.5GB" mean?  The third gets 2.5GB before failing an
> allocation and crashing?
It doesn't crash or fail. To complete the execution of the test
application (which allocates 4GB of memory) in the 3rd guest, 2.5GB of
swap was accessed.
>
>> v11 page hinting - 6, 7th with 26MB    
>> v1 bubble hinting[1] - 6, 7th with 1.8GB (bubble hinting is another series
>> proposed to solve the same problems)
> Could you please make an effort to format things so that reviewers can
> easily read them?  Aligning columns and using common units would be very
> helpful, for instance:
>
>      unmodified kernel - 2, 3rd with 2.50 GB
>       v11 page hinting - 6, 7th with 0.03 GB
>   v1 bubble hinting[1] - 6, 7th with 1.80 GB
>
> See how you can scan that easily and compare between the rows?
>
> I think you did some analysis below.  But, that seems misplaced.  It's
> better to include the conclusion here and the details to back it up
> later.  As it stands, the cover letter just throws some data at a
> reviewer and hopes they can make sense of it.
I will improve this. Thanks.
>
>> *Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
>> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB       
>> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0           
>> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0           
> Again, I'm finding myself having to reformat your data just so I can
> make sense of it.  You also forgot the unit for Guest 1 in row 3.
>
>    unmodified - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
>
>   v11 hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
>   v1 bubble   - Guest1:23s, Guest2:11s, Guest3:26s swap used = 0
>
> So, what is this supposed to show?  What does it mean?  Why do the
> numbers vary *so* much?

Basically, the idea was to communicate that with hinting, swap was not
accessed and hence the execution time is lower.

But as you already mentioned, next time around I will format this and add
the conclusion along with these numbers.
I agree with Alexander's comment that there is no point in having the
same thing in two places.
Alexander Duyck July 11, 2019, 2:58 p.m. UTC | #5
On Thu, Jul 11, 2019 at 4:31 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 7/10/19 7:40 PM, Alexander Duyck wrote:
> > On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >
> > The results up here were redundant with what is below so I am just
> > dropping them. I would suggest only including one set of results in
> > any future cover page as it is confusing to duplicate it like that.
> >
> >> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
> >> A new hook after buddy merging is used to set the bits in the bitmap.
> >> Currently, the bits are only cleared when pages are hinted, not when pages are
> >> re-allocated.
> >>
> >> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
> >> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
> >> threshold is met, trying to isolate and report pages that are still free.
> >>
> >> The isolated pages are reported via virtio-balloon, which is responsible for
> >> sending batched pages to the host synchronously. Once the hypervisor has
> >> processed the hinting request, the isolated pages are returned to the buddy.
> >>
> >> Changelog in v11:
> >> * Added logic to take care of multiple NUMA node scenarios.
> >> * Simplified the logic for reporting isolated pages to the host (e.g., replaced
> >> dynamically allocated arrays with static ones, and introduced a wait event
> >> instead of a polling loop to wait for a response from the host).
> >> * Added a mutex to prevent a race condition when page hinting is enabled by
> >> multiple drivers.
> >> * Simplified the logic responsible for decrementing the free page counter for
> >> each zone.
> >> * Simplified code structuring/naming.
> >>
> >> Known work items for the future:
> >> * Test device-assigned guests to ensure that hinting doesn't break them.
> >> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
> >> * Decide between MADV_DONTNEED and MADV_FREE.
> >> * Look into memory hotplug, more efficient locking, better naming conventions to
> >> avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
> >> * Come up with proper/traceable error messages/logs and look into other code
> >> simplifications (if necessary).
> >>
> >> Benefit analysis:
> >> 1. Number of 5GB guests (each touching 4GB memory) that can be launched without
> >> swap usage on a system with 15GB:
> >> unmodified kernel - 2, 3rd with 2.5GB
> >> v11 page hinting - 6, 7th with 26MB
> >> v1 bubble hinting - 6, 7th with 1.8GB
> >>
> >> Conclusion - In this particular test case, using either v11 page hinting or
> >> v1 bubble hinting, 4 more guests could be launched without swapping compared
> >> to an unmodified kernel.
> >> For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
> >> hinting as it touches less swap space.
> > I'm confused by the comment. From what I can tell bubble hinting came
> > up with 1.8GB of memory while page hinting only managed to achieve
> > .026GB (Using the same units makes it easier to visualize the
> > difference). Also your test says "can be launched without swap usage",
> > yet you say the bubble hinting is touching swap, which makes no sense
> > to me.
> I will work on the cover to improve this part.
> Basically, in each case, the first number indicates the number of
> guests that were launched without touching the swap space. For instance,
> with bubble hinting, I was able to launch 6 guests without any swap
> usage. On launching the 7th guest, there was initially no swap usage;
> however, as the test app started allocating 4GB of memory, swap came into
> the picture. 1.8 GB is the swap usage after the completion of the test
> application.
> >> Setup & procedure -
> >> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
> >> Guest Memory = 5GB
> >> Number of CPUs in the guest = 1
> >> Host swap = 4GB
> >> Workload = test allocation program that allocates 4GB memory, touches it via
> >> memset and exits.
> >> The first guest is launched, and once its console is up, the test allocation
> >> program is executed with a 4 GB memory request (due to this, the guest occupies
> >> almost 4-5 GB of memory on the host in a system without page hinting). Once
> >> this program exits, another guest is launched on the host and the same process
> >> is followed. This is repeated until swap starts being used.
> >>
> >> 2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
> >> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
> >> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
> >> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0
> >>
> >> For this particular test case, in a guest which doesn't require swap access,
> >> the "memhog 6G" execution time lies within a range of 15-30s.
> >> Conclusion -
> >> In the above test case, for an unmodified kernel, the memhog execution time in
> >> the third guest rises to above 2 minutes due to swap access.
> >> Using either page hinting or bubble hinting brings this execution time back to
> >> the normal range of 15-30s.
> > So really this test doesn't add much value. The whole reason why
> > Guest3 runs so much slower is because it is going to swap. I initially
> > did this to demonstrate a point, but now running this test doesn't
> > prove much as it isn't really meant to be a performance test. It is
> > essentially just a duplicate of the "how many guests can you run" test
> > that is passing itself off as some sort of performance test.
> >
> > We could probably just drop this from future versions of this as long
> > as we verify that the memory hinting is freeing most of the memory
> > back and the guest is reporting a size less than the total guest
> > memory size.
> >
> +1, makes sense to keep just one of the above two.
> >> Setup & procedure -
> >> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
> >> Guest Memory = 6GB
> >> Number of CPUs in the guest = 4
> >> Process = 3 guests are launched and the ‘memhog 6G’ execution time is
> >> monitored in each of them, one after the other.
> >> Host swap = 4GB
> >>
> >> Performance Analysis:
> >> 1. will-it-scale's page_fault1
> >> Setup -
> >> Guest Memory = 6GB
> >> Number of cores = 24
> >>
> >> Unmodified kernel -
> >> 0,0,100,0,100,0
> >> 1,514453,95.84,519502,95.83,519502
> >> 2,991485,91.67,932268,91.68,1039004
> >> 3,1381237,87.36,1264214,87.64,1558506
> >> 4,1789116,83.36,1597767,83.88,2078008
> >> 5,2181552,79.20,1889489,80.08,2597510
> >> 6,2452416,75.05,2001879,77.10,3117012
> >> 7,2671047,70.90,2263866,73.22,3636514
> >> 8,2930081,66.75,2333813,70.60,4156016
> >> 9,3126431,62.60,2370108,68.28,4675518
> >> 10,3211937,58.44,2454093,65.74,5195020
> >> 11,3162172,54.32,2450822,63.21,5714522
> >> 12,3154261,50.14,2272290,58.98,6234024
> >> 13,3115174,46.02,2369679,57.74,6753526
> >> 14,3150511,41.86,2470837,54.02,7273028
> >> 15,3134158,37.71,2428129,51.98,7792530
> >> 16,3143067,33.57,2340469,49.54,8312032
> >> 17,3112457,29.43,2263627,44.81,8831534
> >> 18,3089724,25.29,2181879,38.69,9351036
> >> 19,3076878,21.15,2236505,40.01,9870538
> >> 20,3091978,16.95,2266327,35.00,10390040
> >> 21,3082927,12.84,2172578,28.12,10909542
> >> 22,3055282,8.73,2176269,29.14,11429044
> >> 23,3081144,4.56,2138442,24.87,11948546
> >> 24,3075509,0.45,2173753,21.62,12468048
> >>
> >> page hinting -
> >> 0,0,100,0,100,0
> >> 1,491683,95.83,494366,95.82,494366
> >> 2,988415,91.67,919660,91.68,988732
> >> 3,1344829,87.52,1244608,87.69,1483098
> >> 4,1797933,83.37,1625797,83.70,1977464
> >> 5,2179009,79.21,1881534,80.13,2471830
> >> 6,2449858,75.07,2078137,76.82,2966196
> >> 7,2732122,70.90,2178105,73.75,3460562
> >> 8,2910965,66.75,2340901,70.28,3954928
> >> 9,3006665,62.61,2353748,67.91,4449294
> >> 10,3164752,58.46,2377936,65.08,4943660
> >> 11,3234846,54.32,2510149,63.14,5438026
> >> 12,3165477,50.17,2412007,59.91,5932392
> >> 13,3141457,46.05,2421548,57.85,6426758
> >> 14,3135839,41.90,2378021,53.81,6921124
> >> 15,3109113,37.75,2269290,51.76,7415490
> >> 16,3093613,33.62,2346185,48.73,7909856
> >> 17,3086542,29.49,2352140,46.19,8404222
> >> 18,3048991,25.36,2217144,41.52,8898588
> >> 19,2965500,21.18,2313614,38.18,9392954
> >> 20,2928977,17.05,2175316,35.67,9887320
> >> 21,2896667,12.91,2141311,28.90,10381686
> >> 22,3047782,8.76,2177664,28.24,10876052
> >> 23,2994503,4.58,2160976,22.97,11370418
> >> 24,3038762,0.47,2053533,22.39,11864784
> >>
> >> bubble-hinting v1 -
> >> 0,0,100,0,100,0
> >> 1,515272,95.83,492355,95.81,515272
> >> 2,985903,91.66,919653,91.68,1030544
> >> 3,1475300,87.51,1353723,87.65,1545816
> >> 4,1783938,83.36,1586307,83.78,2061088
> >> 5,2093307,79.20,1867395,79.95,2576360
> >> 6,2441370,75.05,2055421,76.65,3091632
> >> 7,2650471,70.89,2246014,72.93,3606904
> >> 8,2926782,66.75,2333601,70.41,4122176
> >> 9,3107617,62.60,2383112,68.46,4637448
> >> 10,3192332,58.44,2441626,65.84,5152720
> >> 11,3268043,54.32,2235964,62.92,5667992
> >> 12,3191105,50.18,2449045,60.49,6183264
> >> 13,3145317,46.05,2377317,57.80,6698536
> >> 14,3161552,41.91,2395814,53.26,7213808
> >> 15,3140443,37.77,2333200,51.42,7729080
> >> 16,3130866,33.65,2150967,46.11,8244352
> >> 17,3112894,29.52,2372068,45.93,8759624
> >> 18,3078424,25.39,2336211,39.85,9274896
> >> 19,3036457,21.27,2224821,35.25,9790168
> >> 20,3046330,17.13,2199755,37.43,10305440
> >> 21,2981130,12.98,2214862,28.67,10820712
> >> 22,3017481,8.84,2195996,29.69,11335984
> >> 23,2979906,4.68,2173395,25.90,11851256
> >> 24,2971170,0.52,2134311,21.89,12366528
> > Okay, so this doesn't match up with the results you gave me last time
> > (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
> > and actually more closely matches what I was expecting to see. The
> > bubble-hinting patches are performing within a few percent of what the
> > baseline kernel was doing.
> Interestingly, even with an unmodified kernel, with every fresh boot I
> observed a certain amount of variability in the results, which I stated
> below.
> > I am assuming the results from before had
> > some additional debugging enabled for the bubble-hinting test that
> > wasn't enabled for the other ones.
>
> Nope, I had debugging options enabled for all the cases. This time
> around I disabled all the debug options.

We can agree to disagree I guess. Those debugging options had reduced
the throughput by over 30% on the guest kernel in my test runs. I was
never able to reproduce the data you reported as enabling the same
debug features on an unmodified kernel had reduced the throughput for
the test just the same as it did for the bubble hinting version. Were
you running the debug options on the host kernel or the guest? I
suppose it is possible that having those debug options enabled on the
host might trigger similar behavior to what you reported since you
were using MADV_FREE versus MADV_DONTNEED so you wouldn't have to
reallocate the pages and could circumvent the page allocation
debugging.

> >
> >> Conclusion -
> >> For an unmodified kernel, with every fresh boot there is a 3-4% delta observed
> >> in the results w.r.t. the numbers mentioned above. For both bubble hinting and
> >> page hinting, there was no noticeable degradation observed beyond this
> >> expected boot-to-boot variability.
> >>
> >> Page hinting vs bubble hinting:
> >> From a benefits and performance perspective, both solutions look quite similar
> >> so far. However, unlike bubble hinting, which is more invasive, the core mm
> >> changes required for page hinting are minimal.
> >>
> >> [1] https://lkml.org/lkml/2019/6/19/926
> > So I think I called it out in the review of the patch but I think we
> > may want to see what happens if we increase the size of the memory in
> > the guest to something more like 64G or larger. My main concern is
> > that as we increase the size of memory the walk through the bitmap is
> > going to become more and more expensive and I am worried that at some
> > point it will start impacting the results.
> Ok, I can try that scenario.
>
>
>
> --
> Thanks
> Nitesh
>
Nitesh Narayan Lal July 11, 2019, 3:03 p.m. UTC | #6
On 7/11/19 10:58 AM, Alexander Duyck wrote:
> On Thu, Jul 11, 2019 at 4:31 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 7/10/19 7:40 PM, Alexander Duyck wrote:
>>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>
>>> The results up here were redundant with what is below so I am just
>>> dropping them. I would suggest only including one set of results in
>>> any future cover page as it is confusing to duplicate it like that.
>>>
>>>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>>>> A new hook after buddy merging is used to set the bits in the bitmap.
>>>> Currently, the bits are only cleared when pages are hinted, not when pages are
>>>> re-allocated.
>>>>
>>>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>>>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>>>> threshold is met, trying to isolate and report pages that are still free.
>>>>
>>>> The isolated pages are reported via virtio-balloon, which is responsible for
>>>> sending batched pages to the host synchronously. Once the hypervisor has
>>>> processed the hinting request, the isolated pages are returned to the buddy.
>>>>
>>>> Changelog in v11:
>>>> * Added logic to take care of multiple NUMA node scenarios.
>>>> * Simplified the logic for reporting isolated pages to the host (e.g., replaced
>>>> dynamically allocated arrays with static ones, and introduced a wait event
>>>> instead of a polling loop to wait for a response from the host).
>>>> * Added a mutex to prevent a race condition when page hinting is enabled by
>>>> multiple drivers.
>>>> * Simplified the logic responsible for decrementing the free page counter for
>>>> each zone.
>>>> * Simplified code structuring/naming.
>>>>
>>>> Known work items for the future:
>>>> * Test device-assigned guests to ensure that hinting doesn't break them.
>>>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device-side support.
>>>> * Decide between MADV_DONTNEED and MADV_FREE.
>>>> * Look into memory hotplug, more efficient locking, better naming conventions to
>>>> avoid confusion with VIRTIO_BALLOON_F_FREE_PAGE_HINT support.
>>>> * Come up with proper/traceable error messages/logs and look into other code
>>>> simplifications (if necessary).
>>>>
>>>> Benefit analysis:
>>>> 1. Number of 5GB guests (each touching 4GB memory) that can be launched without
>>>> swap usage on a system with 15GB:
>>>> unmodified kernel - 2, 3rd with 2.5GB
>>>> v11 page hinting - 6, 7th with 26MB
>>>> v1 bubble hinting - 6, 7th with 1.8GB
>>>>
>>>> Conclusion - In this particular test case, using either v11 page hinting or
>>>> v1 bubble hinting, 4 more guests could be launched without swapping compared
>>>> to an unmodified kernel.
>>>> For the 7th guest launch, v11 page hinting is slightly better than v1 bubble
>>>> hinting as it touches less swap space.
>>> I'm confused by the comment. From what I can tell bubble hinting came
>>> up with 1.8GB of memory while page hinting only managed to achieve
>>> .026GB (Using the same units makes it easier to visualize the
>>> difference). Also your test says "can be launched without swap usage",
>>> yet you say the bubble hinting is touching swap, which makes no sense
>>> to me.
>> I will work on the cover to improve this part.
>> Basically, in each case, the first number indicates the number of
>> guests that were launched without touching the swap space. For instance,
>> with bubble hinting, I was able to launch 6 guests without any swap
>> usage. On launching the 7th guest, there was initially no swap usage;
>> however, as the test app started allocating 4GB of memory, swap came into
>> the picture. 1.8 GB is the swap usage after the completion of the test
>> application.
>>>> Setup & procedure -
>>>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>>>> Guest Memory = 5GB
>>>> Number of CPUs in the guest = 1
>>>> Host swap = 4GB
>>>> Workload = test allocation program that allocates 4GB memory, touches it via
>>>> memset and exits.
>>>> The first guest is launched, and once its console is up, the test allocation
>>>> program is executed with a 4 GB memory request (due to this, the guest occupies
>>>> almost 4-5 GB of memory on the host in a system without page hinting). Once
>>>> this program exits, another guest is launched on the host and the same process
>>>> is followed. This is repeated until swap starts being used.
>>>>
>>>> 2. Memhog execution time (For 3 guests each of 6GB on a system with 15GB):
>>>> unmodified kernel - Guest1:21s, Guest2:27s, Guest3:2m37s swap used = 3.7GB
>>>> v11 page hinting - Guest1:23s, Guest2:26s, Guest3:21s swap used = 0
>>>> v1 bubble hinting - Guest1:23, Guest2:11s, Guest3:26s swap used = 0
>>>>
>>>> For this particular test case, in a guest which doesn't require swap access,
>>>> the "memhog 6G" execution time lies within a range of 15-30s.
>>>> Conclusion -
>>>> In the above test case, for an unmodified kernel, the memhog execution time in
>>>> the third guest rises to above 2 minutes due to swap access.
>>>> Using either page hinting or bubble hinting brings this execution time back to
>>>> the normal range of 15-30s.
>>> So really this test doesn't add much value. The whole reason why
>>> Guest3 runs so much slower is because it is going to swap. I initially
>>> did this to demonstrate a point, but now running this test doesn't
>>> prove much as it isn't really meant to be a performance test. It is
>>> essentially just a duplicate of the "how many guests can you run" test
>>> that is passing itself off as some sort of performance test.
>>>
>>> We could probably just drop this from future versions of this as long
>>> as we verify that the memory hinting is freeing most of the memory
>>> back and the guest is reporting a size less than the total guest
>>> memory size.
>>>
>> +1, makes sense to keep just one of the above two.
>>>> Setup & procedure -
>>>> Total NUMA Node Memory ~ 15 GB (All guests are run on a single NUMA node)
>>>> Guest Memory = 6GB
>>>> Number of CPUs in the guest = 4
>>>> Process = 3 guests are launched and the ‘memhog 6G’ execution time is
>>>> monitored in each of them, one after the other.
>>>> Host swap = 4GB
>>>>
>>>> Performance Analysis:
>>>> 1. will-it-scale's page_fault1
>>>> Setup -
>>>> Guest Memory = 6GB
>>>> Number of cores = 24
>>>>
>>>> Unmodified kernel -
>>>> 0,0,100,0,100,0
>>>> 1,514453,95.84,519502,95.83,519502
>>>> 2,991485,91.67,932268,91.68,1039004
>>>> 3,1381237,87.36,1264214,87.64,1558506
>>>> 4,1789116,83.36,1597767,83.88,2078008
>>>> 5,2181552,79.20,1889489,80.08,2597510
>>>> 6,2452416,75.05,2001879,77.10,3117012
>>>> 7,2671047,70.90,2263866,73.22,3636514
>>>> 8,2930081,66.75,2333813,70.60,4156016
>>>> 9,3126431,62.60,2370108,68.28,4675518
>>>> 10,3211937,58.44,2454093,65.74,5195020
>>>> 11,3162172,54.32,2450822,63.21,5714522
>>>> 12,3154261,50.14,2272290,58.98,6234024
>>>> 13,3115174,46.02,2369679,57.74,6753526
>>>> 14,3150511,41.86,2470837,54.02,7273028
>>>> 15,3134158,37.71,2428129,51.98,7792530
>>>> 16,3143067,33.57,2340469,49.54,8312032
>>>> 17,3112457,29.43,2263627,44.81,8831534
>>>> 18,3089724,25.29,2181879,38.69,9351036
>>>> 19,3076878,21.15,2236505,40.01,9870538
>>>> 20,3091978,16.95,2266327,35.00,10390040
>>>> 21,3082927,12.84,2172578,28.12,10909542
>>>> 22,3055282,8.73,2176269,29.14,11429044
>>>> 23,3081144,4.56,2138442,24.87,11948546
>>>> 24,3075509,0.45,2173753,21.62,12468048
>>>>
>>>> page hinting -
>>>> 0,0,100,0,100,0
>>>> 1,491683,95.83,494366,95.82,494366
>>>> 2,988415,91.67,919660,91.68,988732
>>>> 3,1344829,87.52,1244608,87.69,1483098
>>>> 4,1797933,83.37,1625797,83.70,1977464
>>>> 5,2179009,79.21,1881534,80.13,2471830
>>>> 6,2449858,75.07,2078137,76.82,2966196
>>>> 7,2732122,70.90,2178105,73.75,3460562
>>>> 8,2910965,66.75,2340901,70.28,3954928
>>>> 9,3006665,62.61,2353748,67.91,4449294
>>>> 10,3164752,58.46,2377936,65.08,4943660
>>>> 11,3234846,54.32,2510149,63.14,5438026
>>>> 12,3165477,50.17,2412007,59.91,5932392
>>>> 13,3141457,46.05,2421548,57.85,6426758
>>>> 14,3135839,41.90,2378021,53.81,6921124
>>>> 15,3109113,37.75,2269290,51.76,7415490
>>>> 16,3093613,33.62,2346185,48.73,7909856
>>>> 17,3086542,29.49,2352140,46.19,8404222
>>>> 18,3048991,25.36,2217144,41.52,8898588
>>>> 19,2965500,21.18,2313614,38.18,9392954
>>>> 20,2928977,17.05,2175316,35.67,9887320
>>>> 21,2896667,12.91,2141311,28.90,10381686
>>>> 22,3047782,8.76,2177664,28.24,10876052
>>>> 23,2994503,4.58,2160976,22.97,11370418
>>>> 24,3038762,0.47,2053533,22.39,11864784
>>>>
>>>> bubble-hinting v1 -
>>>> 0,0,100,0,100,0
>>>> 1,515272,95.83,492355,95.81,515272
>>>> 2,985903,91.66,919653,91.68,1030544
>>>> 3,1475300,87.51,1353723,87.65,1545816
>>>> 4,1783938,83.36,1586307,83.78,2061088
>>>> 5,2093307,79.20,1867395,79.95,2576360
>>>> 6,2441370,75.05,2055421,76.65,3091632
>>>> 7,2650471,70.89,2246014,72.93,3606904
>>>> 8,2926782,66.75,2333601,70.41,4122176
>>>> 9,3107617,62.60,2383112,68.46,4637448
>>>> 10,3192332,58.44,2441626,65.84,5152720
>>>> 11,3268043,54.32,2235964,62.92,5667992
>>>> 12,3191105,50.18,2449045,60.49,6183264
>>>> 13,3145317,46.05,2377317,57.80,6698536
>>>> 14,3161552,41.91,2395814,53.26,7213808
>>>> 15,3140443,37.77,2333200,51.42,7729080
>>>> 16,3130866,33.65,2150967,46.11,8244352
>>>> 17,3112894,29.52,2372068,45.93,8759624
>>>> 18,3078424,25.39,2336211,39.85,9274896
>>>> 19,3036457,21.27,2224821,35.25,9790168
>>>> 20,3046330,17.13,2199755,37.43,10305440
>>>> 21,2981130,12.98,2214862,28.67,10820712
>>>> 22,3017481,8.84,2195996,29.69,11335984
>>>> 23,2979906,4.68,2173395,25.90,11851256
>>>> 24,2971170,0.52,2134311,21.89,12366528
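(A quick note on the format for anyone skimming the raw numbers:
will-it-scale prints one CSV row per task count as
tasks,processes,processes_idle,threads,threads_idle,linear. So in the
unmodified run, "24,3075509,0.45,2173753,21.62,12468048" is 24 tasks
scoring 3075509 in process mode at 0.45% CPU idle and 2173753 in thread
mode at 21.62% idle, with 12468048 the ideal linear-scaling reference,
i.e. 24 x the best single-task score.)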
>>> Okay, so this doesn't match up with the results you gave me last time
>>> (https://lore.kernel.org/lkml/afac6f92-74f5-4580-0303-12b7374e5011@redhat.com/),
>>> and actually more closely matches what I was expecting to see. The
>>> bubble-hinting patches are performing within a few percent of what the
>>> baseline kernel was doing.
>> Interestingly, even with an unmodified kernel I observed a certain
>> amount of variability in the results across fresh boots, as I note
>> below.
>>> I am assuming the results from before had
>>> some additional debugging enabled for the bubble-hinting test that
>>> wasn't enabled for the other ones.
>> Nope, I had debugging options enabled for all the cases. This time
>> around I disabled all the debug options.
> We can agree to disagree I guess. Those debugging options had reduced
> the throughput by over 30% on the guest kernel in my test runs. I was
> never able to reproduce the data you reported, as enabling the same
> debug features on an unmodified kernel reduced the throughput for the
> test just as much as it did for the bubble-hinting version. Were you
> running the debug options on the host kernel or the guest?
In the guest. Do the results I shared without debug options match
what you have?
I am also curious: do you see any variability in the page_fault1
results for an unmodified kernel across fresh boots? If so, how often?
> I suppose it is possible that having those debug options enabled on the
> host might trigger behavior similar to what you reported, since you
> were using MADV_FREE versus MADV_DONTNEED: you wouldn't have to
> reallocate the pages and could circumvent the page allocation
> debugging.
>
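As a side note on the MADV_FREE vs. MADV_DONTNEED work item: the
difference being pointed at above can be sketched in a few lines of
userspace C (purely illustrative; the series itself issues the madvise
on the host side against guest memory ranges):

#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4UL << 20;	/* 4MB of anonymous memory */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, len);		/* fault all pages in */

	/* DONTNEED: pages are dropped immediately; the next touch takes
	 * a fresh fault and a fresh page allocation. */
	madvise(buf, len, MADV_DONTNEED);
	memset(buf, 1, len);

	/* FREE: pages are only reclaimed under memory pressure; writing
	 * to them before that cancels the hint, with no new allocation
	 * (which is what lets MADV_FREE sidestep page allocation
	 * debugging on the host). */
	madvise(buf, len, MADV_FREE);
	return 0;
}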
>>>> Conclusion -
>>>> For an unmodified kernel, with every fresh boot, a 3-4% delta is observed
>>>> in the results with respect to the numbers mentioned above. For both
>>>> bubble-hinting and page-hinting, no degradation was observed beyond this
>>>> expected variability.
>>>>
>>>> Page hinting vs bubble hinting:
>>>> From a benefit and performance perspective, both solutions look quite
>>>> similar so far. However, unlike bubble-hinting, which is more invasive,
>>>> page hinting requires only minimal changes to core mm.
>>>>
>>>> [1] https://lkml.org/lkml/2019/6/19/926
>>> So I think I called this out in the review of the patch, but we may
>>> want to see what happens if we increase the size of the memory in
>>> the guest to something more like 64G or larger. My main concern is
>>> that as we increase the size of memory the walk through the bitmap
>>> becomes more and more expensive, and I am worried that at some
>>> point it will start impacting the results.
>> Ok, I can try that scenario.
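To put the bitmap-walk concern in rough numbers, here is a
back-of-the-envelope sketch (my assumptions, not taken from the
patches: x86 MAX_ORDER = 11 and 4K pages, so one bit covers a
MAX_ORDER - 2 block of 512 pages = 2MB):

#include <stdio.h>

int main(void)
{
	unsigned long long chunk = 2ULL << 20;	/* 2MB per bit */
	unsigned long long sizes_gb[] = { 6, 64, 512 };
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long long bits = (sizes_gb[i] << 30) / chunk;
		printf("%3lluGB guest -> %6llu bits (%llu bytes) per scan\n",
		       sizes_gb[i], bits, bits / 8);
	}
	return 0;
}

That works out to 3072 bits for the 6GB guest, 32768 bits (4KB) at
64GB, and 262144 bits (32KB) at 512GB. The footprint stays small, so if
the walk does start hurting at 64GB+ it would presumably be due to the
repeated full scans as the threshold keeps firing, rather than the
bitmap size itself.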
Alexander Duyck July 11, 2019, 3:08 p.m. UTC | #7
On Thu, Jul 11, 2019 at 8:04 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> [long quote of the cover letter, benchmark data, and earlier discussion
> snipped; it appears in full upthread]
>
> > We can agree to disagree I guess. Those debugging options had reduced
> > the throughput by over 30% on the guest kernel in my test runs. I was
> > never able to reproduce the data you reported, as enabling the same
> > debug features on an unmodified kernel reduced the throughput for the
> > test just as much as it did for the bubble-hinting version. Were you
> > running the debug options on the host kernel or the guest?
> In the guest. Do the results I shared without debug options match
> what you have?
> I am also curious: do you see any variability in the page_fault1
> results for an unmodified kernel across fresh boots? If so, how often?

I see some variability, but not much. Usually it can vary by +/- 5% or
so. What I have been doing is collecting multiple runs, working out
the average, and then comparing that against an average with the
patches applied.
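
Concretely, something as small as the following is enough to judge
whether a delta is outside the noise (placeholder scores, build with
-lm; plug in the per-run numbers for one thread count):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double runs[] = { 3075509, 3081144, 2994503 };	/* placeholders */
	int i, n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0, var = 0, mean;

	for (i = 0; i < n; i++)
		sum += runs[i];
	mean = sum / n;
	for (i = 0; i < n; i++)
		var += (runs[i] - mean) * (runs[i] - mean);
	printf("mean %.0f, spread +/- %.1f%%\n",
	       mean, 100.0 * sqrt(var / n) / mean);
	return 0;
}

If the patched average lands within that spread of the baseline
average, the difference probably isn't meaningful.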

One other thing you can probably do to limit the variability would be
to look at disabling any power management features on the system. One
thing you could be seeing is the effect of the CPU enabling turbo mode
or going into sleep states if idle. That can easily throw the numbers
around quite a bit.
Nitesh Narayan Lal July 11, 2019, 3:19 p.m. UTC | #8
On 7/11/19 11:08 AM, Alexander Duyck wrote:
>
> [long quote snipped; see upthread]
>
>> In the guest. Do the results I shared without debug options match
>> what you have?
>> I am also curious: do you see any variability in the page_fault1
>> results for an unmodified kernel across fresh boots? If so, how often?
> I see some variability, but not much. Usually it can vary by +/- 5% or
> so.
+1
>  What I have been doing is collecting multiple runs, working out
> the average, and then comparing that against an average with the
> patches applied.
> Yeah, I didn't share the average values, but I do the same.
> I just wanted to mention the variability so that there is no confusion
> if a later value comes out in the +/- 3-4% range.
>
> One other thing you can probably do to limit the variability would be
> to look at disabling any power management features on the system. One
> thing you could be seeing is the effect of the CPU enabling turbo mode
> or going into sleep states if idle. That can easily throw the numbers
> around quite a bit.
Was the variability you mentioned observed after disabling these options?
Alexander Duyck July 11, 2019, 5:01 p.m. UTC | #9
On Thu, Jul 11, 2019 at 8:19 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> [long quote snipped; see upthread]
>
> > One other thing you can probably do to limit the variability would be
> > to look at disabling any power management features on the system. One
> > thing you could be seeing is the effect of the CPU enabling turbo mode
> > or going into sleep states if idle. That can easily throw the numbers
> > around quite a bit.
> Was the variability you mentioned observed after disabling these options?

It wasn't entirely eliminated, but it was reduced. It also reduces the
overall performance at the lower thread counts, though, as it drops
single-thread performance by 10% or more.
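
For completeness, one concrete way to disable those features on an
Intel host (illustrative, assuming the intel_pstate driver and the
cpupower utility; not necessarily the exact setup used here): write 1
to /sys/devices/system/cpu/intel_pstate/no_turbo to turn off turbo,
pin the governor with "cpupower frequency-set -g performance", and
limit deep idle states with "cpupower idle-set -D 0".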