Message ID | 20200413111810.GA801367@xps-13 (mailing list archive)
State      | New, archived
Series     | [v2] mm: swap: use fixed-size readahead during swapoff
Andrea Righi <andrea.righi@canonical.com> writes:

[snip]

> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ebed37bbf7a3..c71abc8df304 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -20,6 +20,7 @@
>  #include <linux/migrate.h>
>  #include <linux/vmalloc.h>
>  #include <linux/swap_slots.h>
> +#include <linux/oom.h>
>  #include <linux/huge_mm.h>
>
>  #include <asm/pgtable.h>
> @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
>  	max_pages = 1 << READ_ONCE(page_cluster);
>  	if (max_pages <= 1)
>  		return 1;
> +	/*
> +	 * If current task is using too much memory or swapoff is running
> +	 * simply use the max readahead size. Since we likely want to load a
> +	 * lot of pages back into memory, using a fixed-size max readhaead can
> +	 * give better performance in this case.
> +	 */
> +	if (oom_task_origin(current))
> +		return max_pages;
>
>  	hits = atomic_xchg(&swapin_readahead_hits, 0);
>  	pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,

Think about this again. If my understanding is correct, the access
pattern during swapoff is sequential, so why doesn't swap readahead
work? If that is the case, can you root-cause it first?

Best Regards,
Huang, Ying
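For context on the oom_task_origin(current) test in the hunk above: the
swapoff path marks the calling task with set_current_oom_origin() while it
drains the swap device, so the check effectively means "swapoff is running
in this task". A rough sketch of that relationship, paraphrased from
include/linux/oom.h and the swapoff path in mm/swapfile.c (simplified;
details may differ in v5.6):

	/* include/linux/oom.h (paraphrased): the flag lives in signal_struct */
	static inline void set_current_oom_origin(void)
	{
		current->signal->oom_flag_origin = true;
	}

	static inline bool oom_task_origin(const struct task_struct *p)
	{
		return p->signal->oom_flag_origin;
	}

	/* mm/swapfile.c, swapoff path (sketch): the flag is set around
	 * try_to_unuse(), which is what the patch above keys off */
	set_current_oom_origin();
	err = try_to_unuse(p->type, false, 0);
	clear_current_oom_origin();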
Andrea Righi <andrea.righi@canonical.com> writes:

> The global swap-in readahead policy takes in account the previous access
> patterns, using a scaling heuristic to determine the optimal readahead
> chunk dynamically.
>
> This works pretty well in most cases, but like any heuristic there are
> specific cases when this approach is not ideal, for example the swapoff
> scenario.
>
> During swapoff we just want to load back into memory all the swapped-out
> pages and for this specific use case a fixed-size readahead is more
> efficient.
>
> The specific use case this patch is addressing is to improve swapoff
> performance when a VM has been hibernated, resumed and all memory needs
> to be forced back to RAM by disabling swap (see the test case below).

Why do you need to swapoff after resuming? Is the swap device used for
anything other than hibernation? I guess the process is,

1) add swap device to VM
2) hibernate
3) resume
4) swapoff

Are some pages swapped out in step 2? If so, can we just set
/proc/sys/vm/swappiness to 0 to avoid swapping in step 2?

Best Regards,
Huang, Ying
On Mon, Apr 13, 2020 at 09:13:33PM +0800, Huang, Ying wrote:
> Andrea Righi <andrea.righi@canonical.com> writes:
>
> [snip]
>
> Why do you need to swapoff after resuming? Is the swap device used for
> anything other than hibernation? I guess the process is,
>
> 1) add swap device to VM
> 2) hibernate
> 3) resume
> 4) swapoff

Correct, the swap device is used only for hibernation; when the system
is resumed the swap is disabled (swapoff).

> Are some pages swapped out in step 2? If so, can we just set
> /proc/sys/vm/swappiness to 0 to avoid swapping in step 2?

Sorry, can you elaborate more on this? All anonymous pages are swapped
out during step 2; it doesn't matter if we set swappiness to 0, they are
swapped out anyway, because we need to save them somewhere in order to
hibernate and shut down the system.

Thanks,
-Andrea
On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
> Andrea Righi <andrea.righi@canonical.com> writes:
>
> [snip]
>
> Think about this again. If my understanding is correct, the access
> pattern during swapoff is sequential, so why doesn't swap readahead
> work? If that is the case, can you root-cause it first?

Theoretically, if the pattern is sequential, the current heuristic should
already select a big readahead size, but apparently it's not doing that.

I'll repeat my tests tracing the readahead size during swapoff to see
exactly what's going on here.

Thanks,
-Andrea
Andrea Righi <andrea.righi@canonical.com> writes:

> On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
>
> [snip]
>
> Theoretically, if the pattern is sequential, the current heuristic should
> already select a big readahead size, but apparently it's not doing that.
>
> I'll repeat my tests tracing the readahead size during swapoff to see
> exactly what's going on here.

I haven't verified it, but it may be helpful to call lookup_swap_cache()
before swapin_readahead() in unuse_pte_range(). The theory behind it is
to update the swap readahead statistics via lookup_swap_cache().

Best Regards,
Huang, Ying
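For reference, the statistics being referred to here work roughly as
sketched below (a paraphrase of the cluster readahead accounting in
mm/swap_state.c, not the verbatim v5.6 code): lookup_swap_cache() counts a
"hit" whenever a page brought in by a previous readahead is actually used,
and swapin_nr_pages() consumes that counter to size the next window. A
swapoff loop that never calls lookup_swap_cache() therefore leaves the hit
counter at zero.

	/* Simplified sketch of the cluster readahead feedback loop
	 * (paraphrased; the real code has more cases, e.g. the separate
	 * VMA readahead statistics). */
	struct page *lookup_swap_cache(swp_entry_t entry,
				       struct vm_area_struct *vma,
				       unsigned long addr)
	{
		struct page *page;

		page = find_get_page(swap_address_space(entry), swp_offset(entry));
		if (page && TestClearPageReadahead(page))
			/* a page read ahead earlier was actually used: a hit */
			atomic_inc(&swapin_readahead_hits);
		return page;
	}

	static unsigned long swapin_nr_pages(unsigned long offset)
	{
		unsigned int hits, pages, max_pages = 1 << READ_ONCE(page_cluster);

		/* consume the hits accumulated since the last readahead ... */
		hits = atomic_xchg(&swapin_readahead_hits, 0);
		/* ... and let them grow (or shrink) the next readahead window */
		pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
					  atomic_read(&last_readahead_pages));
		return pages;
	}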
Andrea Righi <andrea.righi@canonical.com> writes:

> On Mon, Apr 13, 2020 at 09:13:33PM +0800, Huang, Ying wrote:
>
> [snip]
>
> Sorry, can you elaborate more on this? All anonymous pages are swapped
> out during step 2; it doesn't matter if we set swappiness to 0, they are
> swapped out anyway, because we need to save them somewhere in order to
> hibernate and shut down the system.

All pages will be written to the swap device in step 2, but the normal
swapout path isn't used, so these pages will be restored in step 3
instead of step 4. However, at the beginning of step 2 the system may try
to reclaim some pages, and those reclaimed anonymous pages will be
restored in step 4. This may be avoided by setting
/proc/sys/vm/swappiness to 0 before step 2.

Best Regards,
Huang, Ying

> Thanks,
> -Andrea
On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
> Andrea Righi <andrea.righi@canonical.com> writes:
>
> [snip]
>
> I haven't verified it, but it may be helpful to call lookup_swap_cache()
> before swapin_readahead() in unuse_pte_range(). The theory behind it is
> to update the swap readahead statistics via lookup_swap_cache().

I did more tests trying to collect some useful information.

In particular I've been focusing on tracing the distribution of the
values returned by swapin_nr_pages() in different scenarios.

To do so I made swapin_nr_pages() trace-able and I used the following
bcc command to measure the distribution of the returned values:

 # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'

I've collected this metric in the following scenarios:
 - 5.6 vanilla
 - 5.6 + lookup_swap_cache() before swapin_readahead() in
   unuse_pte_range()
 - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
   in unuse_pte_range()
 - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
   before swapin_readahead() in unuse_pte_range()

Each kernel has been tested both with swappiness=0 and swappiness=60.
Results are pretty much identical with either swappiness value, so I'm
just reporting the default case here (w/ swappiness=60).

Result
======

= swapoff performance (elapsed time) =

 vanilla              22.09s
 lookup_swap_cache()  23.87s
 hits++               16.10s
 hits=last_ra_pages    8.81s

= swapin_nr_pages() $retval distribution =

5.6 vanilla:
 r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        36948      $retval = 8
        44151      $retval = 4
        49290      $retval = 1
        527771     $retval = 2

5.6 lookup_swap_cache() before swapin_readahead():
 r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        13093      $retval = 1
        56703      $retval = 8
        123067     $retval = 2
        366118     $retval = 4

5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
 r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        2589       $retval = 1
        8016       $retval = 2
        40021      $retval = 8
        566038     $retval = 4

5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
 r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        785        $retval = 2
        1072       $retval = 1
        21844      $retval = 4
        644168     $retval = 8

In the vanilla case, the readahead heuristic seems to choose 2 pages
most of the time. This is because we are not properly considering the
hits (hits are always 0 in the swapoff code path) and, as you correctly
pointed out, we can fix this by calling lookup_swap_cache() in
unuse_pte_range() before calling swapin_readahead().

With this change the distribution of the readahead size moves more
toward 4 pages, but we still have some 2s. That looks good; however, it
doesn't seem to speed up swapoff very much... maybe because calling
lookup_swap_cache() introduces a small overhead? (I still need to
investigate this theory.)

In the next test I've tried to always increment hits by 1 before calling
swapin_readahead() in unuse_pte_range(). This is basically cheating,
because I'm faking the hit ratio, forcing the heuristic to use a larger
readahead size; in fact, the readahead size moves even more toward 4
pages and swapoff performance is a little better now.

Pushing the "cheating" even further, I can pretend that the previous
readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
I'm forcing the heuristic to move toward the max size and keep using it.
The result here is pretty much identical to my fixed-size patch, because
swapin_nr_pages() returns the max readahead size pretty much all the
time during swapoff (8 pages or, more in general, vm.page-cluster).

Personally I don't like forcing the heuristic in this way very much;
it'd be nice if it would just work by accounting the proper hit ratio
(so just by adding lookup_swap_cache() as you correctly suggested), but
this solution doesn't seem to improve performance in reality. For this
reason I still think we should treat the swapoff scenario as a special
one and somehow bypass the readahead heuristic and always return the
max readahead size.

Looking at the hits of the previous step in the swapoff case just
doesn't work, because we may have some misses, but they will become hits
very soon, since we are reading all the swapped out pages back into
memory. This is why using the max readahead size gives better
swapoff performance.

What do you think?

Thanks,
-Andrea
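To make the distributions above easier to interpret, here is a small
stand-alone model of the growth logic in __swapin_nr_pages() (a paraphrase
for illustration only, not the kernel code; constants and corner cases may
differ in v5.6). With no hit feedback the window collapses to 1-2 pages
even for sequential offsets, while a non-zero hit count lets it ramp up to
the vm.page-cluster maximum, which matches the vanilla vs.
"hits=last_ra_pages" distributions reported above.

	#include <stdio.h>

	/*
	 * Toy model of the readahead window sizing (paraphrased from
	 * __swapin_nr_pages() in mm/swap_state.c; simplified for illustration).
	 */
	static unsigned int model_nr_pages(unsigned int hits, unsigned int max_pages,
					   unsigned int prev_win, int sequential)
	{
		unsigned int pages = hits + 2;

		if (pages == 2) {
			/* no hits: read 2 pages only if the fault looks sequential */
			if (!sequential)
				pages = 1;
		} else {
			/* round up to a power of two, starting from 4 */
			unsigned int roundup = 4;

			while (roundup < pages)
				roundup <<= 1;
			pages = roundup;
		}
		if (pages > max_pages)
			pages = max_pages;
		/* don't let the window shrink too quickly */
		if (pages < prev_win / 2)
			pages = prev_win / 2;

		return pages;
	}

	int main(void)
	{
		unsigned int win = 8;	/* vm.page-cluster = 3 -> max 8 pages */
		int i;

		/* swapoff without hit accounting: the window decays and sticks at 2 */
		for (i = 0; i < 4; i++) {
			win = model_nr_pages(0, 8, win, 1);
			printf("no hits, step %d -> %u pages\n", i, win);
		}

		/* with hit feedback the window ramps back up toward the maximum */
		printf("6 hits           -> %u pages\n", model_nr_pages(6, 8, win, 1));
		return 0;
	}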
Andrea Righi <andrea.righi@canonical.com> writes:

[snip]

> Looking at the hits of the previous step in the swapoff case just
> doesn't work, because we may have some misses, but they will become hits
> very soon, since we are reading all the swapped out pages back into
> memory. This is why using the max readahead size gives better
> swapoff performance.
>
> What do you think?

From your description, it appears that you are using cluster readahead
instead of vma readahead. Can you verify this via,

 # cat /sys/kernel/mm/swap/vma_ra_enabled

And if it returns false, you can enable it via,

 # echo 1 > /sys/kernel/mm/swap/vma_ra_enabled

This is worth trying because the swapoff code now swaps in pages in page
table order instead of swap entry order; but this will also turn
sequential disk reads into random disk reads. Let's see the performance
results.

And please make sure that in unuse_pte_range(), after lookup_swap_cache()
returns a non-NULL page, it's unnecessary to call swapin_readahead().

Best Regards,
Huang, Ying

> Thanks,
> -Andrea
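To illustrate the "page table order" point: since the swapoff rewrite,
try_to_unuse() walks each mm's page tables and swaps entries in as it
finds them, rather than iterating over swap slots. A rough sketch of the
unuse_pte_range() loop (paraphrased, not the exact v5.6 code; the real
function takes more parameters and does more bookkeeping):

	static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
				   unsigned long addr, unsigned long end,
				   unsigned int type)
	{
		pte_t *pte = pte_offset_map(pmd, addr);

		do {
			swp_entry_t entry;

			if (!is_swap_pte(*pte))
				continue;
			entry = pte_to_swp_entry(*pte);
			if (swp_type(entry) != type)
				continue;
			/*
			 * Pages are faulted back in virtual-address order here,
			 * so the on-disk (swap entry) order can look random to
			 * the readahead code.
			 */
			/* ... lookup_swap_cache()/swapin_readahead() + unuse_pte() ... */
		} while (pte++, addr += PAGE_SIZE, addr != end);

		return 0;
	}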
On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote:
> Andrea Righi <andrea.righi@canonical.com> writes:
>
> [snip]
>
> From your description, it appears that you are using cluster readahead
> instead of vma readahead. Can you verify this via,
>
>  # cat /sys/kernel/mm/swap/vma_ra_enabled

# cat /sys/kernel/mm/swap/vma_ra_enabled
true

However, it's still using the cluster readahead because I'm using a
swap file and nr_rotate_swap=1, so, according to the following, it's
never using the vma readahead:

static inline bool swap_use_vma_readahead(void)
{
	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
}

I'll investigate more on this; I think there's no reason to prevent the
usage of vma readahead if the underlying device is non-rotational.

> And if it returns false, you can enable it via,
>
>  # echo 1 > /sys/kernel/mm/swap/vma_ra_enabled
>
> This is worth trying because the swapoff code now swaps in pages in page
> table order instead of swap entry order; but this will also turn
> sequential disk reads into random disk reads. Let's see the performance
> results.
>
> And please make sure that in unuse_pte_range(), after lookup_swap_cache()
> returns a non-NULL page, it's unnecessary to call swapin_readahead().

This is what I was missing! Sorry about that, I was still calling
swapin_readahead() even if the page was already in the cache; that was
unnecessary overhead and it was messing up the hit ratio!

With lookup_swap_cache() applied properly everything looks way better:

 r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        378        $retval = 1
        2138       $retval = 2
        25365      $retval = 4
        105314     $retval = 8

 swapoff time: 9.87s

So, basically it gives the same performance result as my fixed-size
approach, but it's definitely a better and cleaner solution.

Just to make it clear, here's the patch that I applied (if it looks good
to you I can send another version with a better description):

 mm/swapfile.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9fd47e6f7a86..cb9eb517178d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1944,7 +1944,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	vmf.pmd = pmd;
 	last_ra = atomic_read(&last_readahead_pages);
 	atomic_set(&swapin_readahead_hits, last_ra);
-	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
+	page = lookup_swap_cache(entry, vma, addr);
+	if (!page)
+		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
 	if (!page) {
 		if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
 			goto try_next;

Thanks!
-Andrea
Andrea Righi <andrea.righi@canonical.com> writes:

> mm/swapfile.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9fd47e6f7a86..cb9eb517178d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1944,7 +1944,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	vmf.pmd = pmd;
>  	last_ra = atomic_read(&last_readahead_pages);
>  	atomic_set(&swapin_readahead_hits, last_ra);

You need to remove the above 2 lines first.

Best Regards,
Huang, Ying

> -	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
> +	page = lookup_swap_cache(entry, vma, addr);
> +	if (!page)
> +		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
>  	if (!page) {
>  		if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
>  			goto try_next;
>
> Thanks!
> -Andrea
On Wed, Apr 15, 2020 at 03:44:08PM +0800, Huang, Ying wrote:
> Andrea Righi <andrea.righi@canonical.com> writes:
>
> > mm/swapfile.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9fd47e6f7a86..cb9eb517178d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1944,7 +1944,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  	vmf.pmd = pmd;
> >  	last_ra = atomic_read(&last_readahead_pages);
> >  	atomic_set(&swapin_readahead_hits, last_ra);
>
> You need to remove the above 2 lines first.

Meh... too much enthusiasm, and I definitely need more coffee this
morning. Here's the right patch applied:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5871a2aa86a5..8b38441b66fa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1940,7 +1940,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	vmf.vma = vma;
 	vmf.address = addr;
 	vmf.pmd = pmd;
-	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
+	page = lookup_swap_cache(entry, vma, addr);
+	if (!page)
+		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
 	if (!page) {
 		if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
 			goto try_next;

And here are the right results:

 r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        1618       $retval = 1
        4960       $retval = 2
        41315      $retval = 4
        103521     $retval = 8

 swapoff time: 12.19s

So, not as good as the fixed-size readahead, but it's definitely an
improvement, considering that the swapoff time is ~22s without this
patch applied.

I think this change can be a simple and reasonable compromise.

Thanks again and sorry for the noise,
-Andrea
On Wed, Apr 15, 2020 at 09:32:47AM +0200, Andrea Righi wrote:
> On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote:
>
> [snip]
>
> However, it's still using the cluster readahead because I'm using a
> swap file and nr_rotate_swap=1, so, according to the following, it's
> never using the vma readahead:
>
> static inline bool swap_use_vma_readahead(void)
> {
> 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> }
>
> I'll investigate more on this; I think there's no reason to prevent the
> usage of vma readahead if the underlying device is non-rotational.

Few more details about this.

Even though I have vma_ra_enabled (in /sys/kernel/mm/swap/vma_ra_enabled),
it looks like my virtio_blk device is considered rotational by default:

 $ cat /sys/block/vda/queue/rotational
 1

Therefore the vma readahead is not used. If I change this to
non-rotational then the vma readahead is used; this is confirmed by the
fact that swapin_nr_pages() isn't called anymore:

 r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
        COUNT      EVENT
        13         $retval = 1
        18         $retval = 2
        23         $retval = 4
        29         $retval = 8

 swapoff time: 16.44s

In terms of swapoff performance, vma readahead works better than the
vanilla cluster readahead (~21s), but the "improved" cluster readahead
(with lookup_swap_cache() in unuse_pte_range()) still gives slightly
better results (~12s).

-Andrea
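For reference, the nr_rotate_swap gate mentioned above is set at swapon
time based on the block queue's rotational flag, roughly as sketched below
(paraphrased from mm/swapfile.c; simplified, details may differ in v5.6).
That is why a virtio_blk disk that advertises itself as rotational
disables VMA readahead for every swap-in, even with vma_ra_enabled set:

	/* Sketch of how the swapon path classifies the swap device
	 * (paraphrased; not the verbatim kernel code). */
	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
		/* SSD-like device: VMA readahead stays allowed */
		p->flags |= SWP_SOLIDSTATE;
	} else {
		/* rotational device backing the swap area: VMA readahead is
		 * disabled globally while this device is swapped on */
		atomic_inc(&nr_rotate_swap);
	}

Under that assumption, marking the queue non-rotational before swapon
(for example via the rotational attribute in sysfs, or by configuring the
virtio-blk backend accordingly) should be enough to re-enable VMA
readahead for this test.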
Andrea Righi <andrea.righi@canonical.com> writes:

[snip]

> In terms of swapoff performance, vma readahead works better than the
> vanilla cluster readahead (~21s), but the "improved" cluster readahead
> (with lookup_swap_cache() in unuse_pte_range()) still gives slightly
> better results (~12s).

Thanks for testing. Can you check the return value of
__swapin_nr_pages() to verify whether vma readahead works as expected?

Best Regards,
Huang, Ying

> -Andrea
Andrea Righi <andrea.righi@canonical.com> writes:

[snip]

> @@ -1940,7 +1940,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	vmf.vma = vma;
>  	vmf.address = addr;
>  	vmf.pmd = pmd;
> -	page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
> +	page = lookup_swap_cache(entry, vma, addr);
> +	if (!page)
> +		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);

The vmf assignment can be moved inside the "if" block. Otherwise the
patch looks good to me.

>  	if (!page) {
>  		if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
>  			goto try_next;
>
> [snip]
>
> I think this change can be a simple and reasonable compromise.

Yes. I think so too.

Best Regards,
Huang, Ying

> Thanks again and sorry for the noise,
> -Andrea
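A minimal sketch of the cleanup being suggested (a hypothetical
rearrangement of the hunk above, shown only to illustrate the review
comment; not a tested patch): the vm_fault structure only needs to be
filled in when we actually fall back to swapin_readahead().

	/* hypothetical rearrangement (illustration only) */
	page = lookup_swap_cache(entry, vma, addr);
	if (!page) {
		struct vm_fault vmf = {
			.vma = vma,
			.address = addr,
			.pmd = pmd,
		};

		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
	}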
On Thu, Apr 16, 2020 at 08:41:39AM +0800, Huang, Ying wrote: > Andrea Righi <andrea.righi@canonical.com> writes: > > > On Wed, Apr 15, 2020 at 09:32:47AM +0200, Andrea Righi wrote: > >> On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote: > >> > Andrea Righi <andrea.righi@canonical.com> writes: > >> > > >> > > On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote: > >> > >> Andrea Righi <andrea.righi@canonical.com> writes: > >> > >> > >> > >> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote: > >> > >> >> Andrea Righi <andrea.righi@canonical.com> writes: > >> > >> >> > >> > >> >> [snip] > >> > >> >> > >> > >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c > >> > >> >> > index ebed37bbf7a3..c71abc8df304 100644 > >> > >> >> > --- a/mm/swap_state.c > >> > >> >> > +++ b/mm/swap_state.c > >> > >> >> > @@ -20,6 +20,7 @@ > >> > >> >> > #include <linux/migrate.h> > >> > >> >> > #include <linux/vmalloc.h> > >> > >> >> > #include <linux/swap_slots.h> > >> > >> >> > +#include <linux/oom.h> > >> > >> >> > #include <linux/huge_mm.h> > >> > >> >> > > >> > >> >> > #include <asm/pgtable.h> > >> > >> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset) > >> > >> >> > max_pages = 1 << READ_ONCE(page_cluster); > >> > >> >> > if (max_pages <= 1) > >> > >> >> > return 1; > >> > >> >> > + /* > >> > >> >> > + * If current task is using too much memory or swapoff is running > >> > >> >> > + * simply use the max readahead size. Since we likely want to load a > >> > >> >> > + * lot of pages back into memory, using a fixed-size max readhaead can > >> > >> >> > + * give better performance in this case. > >> > >> >> > + */ > >> > >> >> > + if (oom_task_origin(current)) > >> > >> >> > + return max_pages; > >> > >> >> > > >> > >> >> > hits = atomic_xchg(&swapin_readahead_hits, 0); > >> > >> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, > >> > >> >> > >> > >> >> Thinks this again. If my understanding were correct, the accessing > >> > >> >> pattern during swapoff is sequential, why swap readahead doesn't work? > >> > >> >> If so, can you root cause that firstly? > >> > >> > > >> > >> > Theoretically if the pattern is sequential the current heuristic should > >> > >> > already select a big readahead size, but apparently it's not doing that. > >> > >> > > >> > >> > I'll repeat my tests tracing the readahead size during swapoff to see > >> > >> > exactly what's going on here. > >> > >> > >> > >> I haven't verify it. It may be helpful to call lookup_swap_cache() > >> > >> before swapin_readahead() in unuse_pte_range(). The theory behind it is > >> > >> to update the swap readahead statistics via lookup_swap_cache(). > >> > > > >> > > I did more tests trying to collect some useful information. > >> > > > >> > > In particular I've been focusing at tracing the distribution of the > >> > > values returned by swapin_nr_pages() in different scenarios. 
> >> > >
> >> > > To do so I made swapin_nr_pages() traceable and I used the following
> >> > > bcc command to measure the distribution of the returned values:
> >> > >
> >> > >  # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
> >> > >
> >> > > I've collected this metric in the following scenarios:
> >> > >  - 5.6 vanilla
> >> > >  - 5.6 + lookup_swap_cache() before swapin_readahead() in
> >> > >    unuse_pte_range()
> >> > >  - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
> >> > >    in unuse_pte_range()
> >> > >  - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
> >> > >    before swapin_readahead() in unuse_pte_range()
> >> > >
> >> > > Each kernel has been tested both with swappiness=0 and swappiness=60.
> >> > > Results are pretty much identical regardless of the swappiness value,
> >> > > so I'm just reporting the default case here (w/ swappiness=60).
> >> > >
> >> > > Result
> >> > > ======
> >> > >
> >> > > = swapoff performance (elapsed time) =
> >> > >
> >> > >  vanilla             22.09s
> >> > >  lookup_swap_cache() 23.87s
> >> > >  hits++              16.10s
> >> > >  hits=last_ra_pages   8.81s
> >> > >
> >> > > = swapin_nr_pages() $retval distribution =
> >> > >
> >> > > 5.6 vanilla:
> >> > >  r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > >  COUNT      EVENT
> >> > >  36948      $retval = 8
> >> > >  44151      $retval = 4
> >> > >  49290      $retval = 1
> >> > >  527771     $retval = 2
> >> > >
> >> > > 5.6 lookup_swap_cache() before swapin_readahead():
> >> > >  r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > >  COUNT      EVENT
> >> > >  13093      $retval = 1
> >> > >  56703      $retval = 8
> >> > >  123067     $retval = 2
> >> > >  366118     $retval = 4
> >> > >
> >> > > 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
> >> > >  r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > >  COUNT      EVENT
> >> > >  2589       $retval = 1
> >> > >  8016       $retval = 2
> >> > >  40021      $retval = 8
> >> > >  566038     $retval = 4
> >> > >
> >> > > 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
> >> > >  r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > >  COUNT      EVENT
> >> > >  785        $retval = 2
> >> > >  1072       $retval = 1
> >> > >  21844      $retval = 4
> >> > >  644168     $retval = 8
> >> > >
> >> > > In the vanilla case, the readahead heuristic seems to choose 2 pages
> >> > > most of the time. This is because we are not properly considering the
> >> > > hits (hits are always 0 in the swapoff code path) and, as you correctly
> >> > > pointed out, we can fix this by calling lookup_swap_cache() in
> >> > > unuse_pte_range() before calling swapin_readahead().
> >> > >
> >> > > With this change the distribution of the readahead size moves more
> >> > > toward 4 pages, but we still have some 2s. That looks good, however it
> >> > > doesn't seem to speed up swapoff very much... maybe because calling
> >> > > lookup_swap_cache() introduces a small overhead? (I still need to
> >> > > investigate this theory.)
> >> > >
> >> > > In the next test I always incremented hits by 1 before calling
> >> > > swapin_readahead() in unuse_pte_range(). This is basically cheating,
> >> > > because I'm faking the hit ratio, forcing the heuristic to use a larger
> >> > > readahead size; in fact, the readahead size moves even more toward 4
> >> > > pages and swapoff performance is a little better now.
> >> > >
> >> > > Pushing the "cheating" even further, I can pretend that the previous
> >> > > readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
> >> > > I'm forcing the heuristic to move toward the max size and keep using it.
> >> > > The result here is pretty much identical to my fixed-size patch, because
> >> > > swapin_nr_pages() returns the max readahead size pretty much all the
> >> > > time during swapoff (8 pages or, more in general, vm.page-cluster).
> >> > >
> >> > > Personally I don't like forcing the heuristic this way very much; it'd
> >> > > be nice if it just worked by accounting the proper hit ratio (so just
> >> > > by adding lookup_swap_cache() as you correctly suggested), but that
> >> > > solution doesn't seem to improve performance in reality. For this
> >> > > reason I still think we should treat the swapoff scenario as a special
> >> > > one and somehow bypass the readahead heuristic and always return the
> >> > > max readahead size.
> >> > >
> >> > > Looking at the hits of the previous step in the swapoff case just
> >> > > doesn't work, because we may have some misses, but they will become
> >> > > hits very soon, since we are reading all the swapped-out pages back
> >> > > into memory. This is why using the max readahead size gives better
> >> > > swapoff performance.
> >> > >
> >> > > What do you think?
> >> >
> >> > From your description, it appears that you are using cluster readahead
> >> > instead of vma readahead. Can you verify this via,
> >> >
> >> >  # cat /sys/kernel/mm/swap/vma_ra_enabled
> >>
> >> # cat /sys/kernel/mm/swap/vma_ra_enabled
> >> true
> >>
> >> However, it's still using cluster readahead because I'm using a
> >> swap file and nr_rotate_swap=1, so, according to the following, it's
> >> never using vma readahead:
> >>
> >> static inline bool swap_use_vma_readahead(void)
> >> {
> >> 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> >> }
> >>
> >> I'll investigate this more; I think there's no reason to prevent the
> >> usage of vma readahead if the underlying device is non-rotational.
> >
> > A few more details about this.
> >
> > Even if I have vma_ra_enabled set (in /sys/kernel/mm/swap/vma_ra_enabled),
> > it looks like my virtio_blk device is considered rotational by default:
> >
> >  $ cat /sys/block/vda/queue/rotational
> >  1
> >
> > Therefore vma readahead is not used. If I mark the device as
> > non-rotational (rotational=0), then vma readahead is used; this is
> > confirmed by the fact that swapin_nr_pages() is almost never called
> > anymore:
> >
> >  r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >  COUNT      EVENT
> >  13         $retval = 1
> >  18         $retval = 2
> >  23         $retval = 4
> >  29         $retval = 8
> >
> >  swapoff time: 16.44s
> >
> > In terms of swapoff performance vma readahead works better than the
> > vanilla cluster readahead (~21s), but the "improved" cluster readahead
> > (with lookup_swap_cache() in unuse_pte_range()) still gives slightly
> > better results (~12s).
>
> Thanks for testing. Can you check the return value of
> __swapin_nr_pages() to verify whether vma readahead works as expected?
This is the vanilla kernel w/ vma readahead enabled:

 r::__swapin_nr_pages():unsigned int:$retval
 COUNT      EVENT
 1286       $retval = 1
 6516       $retval = 4
 81252      $retval = 8
 260815     $retval = 2

 swapoff time: real 18.20

And this is the kernel with the lookup_swap_cache() patch and vma
readahead enabled:

 r::__swapin_nr_pages():unsigned int:$retval
 COUNT      EVENT
 321        $retval = 2
 509        $retval = 1
 3017       $retval = 4
 124090     $retval = 8

 swapoff time: 15.33s

So, vma readahead seems to work as expected, and swapoff is now faster
with vma readahead enabled as well.

-Andrea
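To see why the vanilla distributions above cluster around 2 pages, the
window sizing that swapin_nr_pages()/__swapin_nr_pages() performs can be
thought of as the simplified model below. This is an illustrative sketch
only, not the exact mm/swap_state.c code; the function and parameter
names are made up here, and the exact rounding details may differ. The
point it captures is the one discussed in this thread: with hits stuck
at 0 the window stays at 1-2 pages, and it only grows toward max_pages
once previously readahead pages are actually counted as hits (which the
swapoff path never does unless the swap cache is consulted first).

	/* Simplified model of the hits-based readahead window (sketch). */
	unsigned int ra_window_model(unsigned int hits, unsigned int max_pages,
				     unsigned int prev_win, int sequential)
	{
		unsigned int pages = hits + 2;

		if (pages == 2) {
			/* no hits recorded: stay small unless the offset is adjacent */
			if (!sequential)
				pages = 1;
		} else {
			/* round up to the next power of two */
			unsigned int roundup = 4;

			while (roundup < pages)
				roundup <<= 1;
			pages = roundup;
		}

		if (pages > max_pages)
			pages = max_pages;
		if (pages < prev_win / 2)	/* don't shrink the window too fast */
			pages = prev_win / 2;

		return pages;
	}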
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ebed37bbf7a3..c71abc8df304 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -20,6 +20,7 @@
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
+#include <linux/oom.h>
 #include <linux/huge_mm.h>
 
 #include <asm/pgtable.h>
@@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
 	max_pages = 1 << READ_ONCE(page_cluster);
 	if (max_pages <= 1)
 		return 1;
+	/*
+	 * If the current task is using too much memory or swapoff is running,
+	 * simply use the max readahead size. Since we likely want to load a
+	 * lot of pages back into memory, using a fixed-size max readahead can
+	 * give better performance in this case.
+	 */
+	if (oom_task_origin(current))
+		return max_pages;
 
 	hits = atomic_xchg(&swapin_readahead_hits, 0);
 	pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
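For context on why oom_task_origin(current) is true exactly during
swapoff: if I recall correctly, the swapoff path in mm/swapfile.c flags
the calling task as an OOM origin around try_to_unuse(), roughly as
sketched below (simplified, and the exact call sites and arguments may
differ from the version under discussion):

	set_current_oom_origin();
	err = try_to_unuse(p->type, false, 0);	/* swap everything back in */
	clear_current_oom_origin();

So every swap-in readahead decision taken on behalf of try_to_unuse()
sees the flag set and, with this patch, falls back to the fixed
max-sized window.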
The global swap-in readahead policy takes into account the previous
access patterns, using a scaling heuristic to determine the optimal
readahead chunk dynamically.

This works pretty well in most cases, but like any heuristic there are
specific cases when this approach is not ideal, for example the swapoff
scenario.

During swapoff we just want to load back into memory all the swapped-out
pages, and for this specific use case a fixed-size readahead is more
efficient.

The specific use case this patch is addressing is to improve swapoff
performance when a VM has been hibernated, resumed and all memory needs
to be forced back to RAM by disabling swap (see the test case below).

But it is not the only case where a fixed-size readahead can show its
benefits. More generally, the fixed-size approach can be beneficial in
all the cases where a task that is using a large number of swapped-out
pages needs to load them back into memory as fast as possible.

Testing environment
===================

 - Host:
   CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache)
   HDD: PC401 NVMe SK hynix 512GB
   MEM: 16GB

 - Guest (kvm):
   8GB of RAM
   virtio block driver
   16GB swap file on ext4 (/swapfile)

Test case
=========

 - allocate 85% of memory
 - `systemctl hibernate` to force all the pages to be swapped-out to the
   swap file
 - resume the system
 - measure the time that swapoff takes to complete:
   # /usr/bin/time swapoff /swapfile

Result
======

                 5.6 vanilla   5.6 w/ this patch
                 -----------   -----------------
 page-cluster=1       26.77s              21.25s
 page-cluster=2       28.29s              12.66s
 page-cluster=3       22.09s               8.77s
 page-cluster=4       21.50s               7.60s
 page-cluster=5       25.35s               7.75s
 page-cluster=6       23.19s               8.32s
 page-cluster=7       22.25s               9.40s
 page-cluster=8       22.09s               8.93s

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
---
Changes in v2:
 - avoid introducing a new ABI to select the fixed-size readahead

NOTE: after running some tests with this new patch I don't see any
difference in terms of performance, so I'm reporting the same test
results as the previous version.

 mm/swap_state.c | 9 +++++++++
 1 file changed, 9 insertions(+)
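A rough way to reproduce the test case above from inside the guest; the
allocation tool (stress-ng), sizes and exact ordering here are
illustrative assumptions, not taken from the original setup:

	# swap file used both for hibernation and for the measurement
	fallocate -l 16G /swapfile && chmod 600 /swapfile
	mkswap /swapfile && swapon /swapfile

	sysctl vm.page-cluster=3         # readahead window of 2^3 = 8 pages

	# allocate ~85% of RAM and keep it mapped
	stress-ng --vm 1 --vm-bytes 85% --vm-hang 0 &

	systemctl hibernate              # anonymous memory is written to swap

	# ...power the VM back on and let it resume from the swap file...
	/usr/bin/time swapoff /swapfile  # measure how long the swap-in takes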