Message ID: 20241027011959.9226-1-21cnbao@gmail.com (mailing list archive)
State:      New
Series:     [RFC] mm: count zeromap read and set for swapout and swapin
On Sat, Oct 26, 2024 at 6:20 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> When the proportion of folios from the zero map is small, missing their
> accounting may not significantly impact profiling. However, it's easy
> to construct a scenario where this becomes an issue: for example,
> allocating 1 GB of memory, writing zeros from userspace, followed by
> MADV_PAGEOUT, and then swapping it back in. In this case, the swap-out
> and swap-in counts seem to vanish into a black hole, potentially
> causing semantic ambiguity.

I agree. It also makes developing around this area more challenging.
I'm working on the swap abstraction, and sometimes I can't tell if I
screwed up somewhere, or if a proportion of these allocated entries go
towards this optimization...

Thanks for taking a stab at fixing this, Barry!

> We have two ways to address this:
>
> 1. Add a separate counter specifically for the zero map.
> 2. Continue using the current accounting, treating the zero map like
>    a normal backend. (This aligns with the current behavior of zRAM
>    when supporting same-page fills at the device level.)

Hmm, my understanding of the pswpout/pswpin counters is that they only
apply to IO done directly to the backend device, no? That's why we
have a separate set of counters for zswap, and do not count them
towards pswp(in|out).

For users who have swap files on physical disks, the performance
difference between reading directly from the swapfile and going
through these optimizations could be really large. I think it makes
sense to have a separate set of counters for zero-mapped pages
(ideally, both at the host level and at the cgroup level?)
On Sun, Oct 27, 2024 at 3:45 PM Nhat Pham <nphamcs@gmail.com> wrote:
[..]
> Hmm, my understanding of the pswpout/pswpin counters is that they only
> apply to IO done directly to the backend device, no? That's why we
> have a separate set of counters for zswap, and do not count them
> towards pswp(in|out).
>
> For users who have swap files on physical disks, the performance
> difference between reading directly from the swapfile and going
> through these optimizations could be really large. I think it makes
> sense to have a separate set of counters for zero-mapped pages
> (ideally, both at the host level and at the cgroup level?)

Agreed, it is better to have a separate counter for zeromap.
That raises a question, though: what is the proper name for it :-)

zeromap_swpin and zeromap_swpout seem too long, and zswpin
and zswpout are already used by zswap.

Thanks
Barry
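For concreteness, option 1 amounts to bumping a dedicated vmstat event in
the two zeromap paths in mm/page_io.c. A minimal sketch, assuming
hypothetical event names SWPOUT_ZERO/SWPIN_ZERO (the naming is exactly what
is being discussed below) and the helpers introduced by the zeromap series:

/* swap_writepage(), sketch; SWPOUT_ZERO is a placeholder name */
if (is_folio_zero_filled(folio)) {
        swap_zeromap_folio_set(folio);
        count_vm_events(SWPOUT_ZERO, folio_nr_pages(folio));
        folio_unlock(folio);
        return 0;
}

/* swap_read_folio(), sketch; SWPIN_ZERO is a placeholder name */
if (swap_read_folio_zeromap(folio)) {
        count_vm_events(SWPIN_ZERO, folio_nr_pages(folio));
        folio_unlock(folio);
        goto finish;
}

The corresponding enum vm_event_item entries and /proc/vmstat strings would
be added alongside PSWPIN/PSWPOUT.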
On 28/10/2024 02:32, Barry Song wrote:
[..]
> Agreed, it is better to have a separate counter for zeromap.
> That raises a question, though: what is the proper name for it :-)
>
> zeromap_swpin and zeromap_swpout seem too long, and zswpin
> and zswpout are already used by zswap.

I wonder if, instead of having counters, it might be better to keep track
of the number of zeropages currently stored in zeromap, similar to how
zswap_same_filled_pages did it. It will be more complicated than this
patch, but would give more insight into the current state of the system.

Joshua (in CC) was going to have a look at that.

Thanks,
Usama
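For reference, the zswap precedent mentioned above was a gauge rather than
an event counter. The same pattern for zeromap could look roughly like the
sketch below, where zeromap_stored_pages is a made-up name:

/* sketch: pages currently stored as zero in the zeromap, in the
 * spirit of the old zswap_same_filled_pages counter */
static atomic_long_t zeromap_stored_pages = ATOMIC_LONG_INIT(0);

/* swap-out path, after setting bits in the zeromap */
atomic_long_add(folio_nr_pages(folio), &zeromap_stored_pages);

/* wherever the zeromap bits are cleared (slot freed or reused) */
atomic_long_sub(nr_pages, &zeromap_stored_pages);

Unlike in/out event counters, this value goes down when swap slots are
freed, so it reflects the current state rather than accumulated history.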
On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> I wonder if, instead of having counters, it might be better to keep track
> of the number of zeropages currently stored in zeromap, similar to how
> zswap_same_filled_pages did it. It will be more complicated than this
> patch, but would give more insight into the current state of the system.
>
> Joshua (in CC) was going to have a look at that.

I don't think one can substitute for the other.

The "current zeromapped page" counter gives you a breakdown of where
memory resides, whereas the in/out counters explain past performance
based on events that have happened. That's why you have zswap,
zswapped, zswpout, zswpin.
On Sun, Oct 27, 2024 at 7:32 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Agreed, it is better to have a separate counter for zeromap.
> That raises a question, though: what is the proper name for it :-)
>
> zeromap_swpin and zeromap_swpout seem too long, and zswpin
> and zswpout are already used by zswap.

Hmmmmm. How about zeroswpin? zeroswpout?

zeromap_swpin and zeromap_swpout are not the end of the world, FWIW :)
On 28/10/2024 16:33, Nhat Pham wrote:
> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> I wonder if, instead of having counters, it might be better to keep track
>> of the number of zeropages currently stored in zeromap, similar to how
>> zswap_same_filled_pages did it.
[..]
> I don't think one can substitute for the other.

Yes, agreed, they have separate uses and provide different information, but
maybe it is wasteful to have both types of counters? They are counters, so
maybe they don't consume too many resources, but I think we should still
think about it..

If you think from a production context, I feel like the number of pages
currently in zeromap could be more useful than just an accumulation of all
zeromap loads/stores, which after days is just going to be a very large
number. You can compare the accumulations at different points in time, but
they don't take into account the freeing of swap slots and swap_reclaim.

If we are open to having both types, then it's ok. But it might be good to
have that discussion here.

> The "current zeromapped page" counter gives you a breakdown of where
> memory resides, whereas the in/out counters explain past performance
> based on events that have happened. That's why you have zswap,
> zswapped, zswpout, zswpin.
On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
> Yes, agreed, they have separate uses and provide different information, but
> maybe it is wasteful to have both types of counters? They are counters, so
> maybe they don't consume too many resources, but I think we should still
> think about it..

Not for or against here, but I would say that statement is debatable
at best for memcg stats :)

Each new counter consumes 2 longs per-memcg per-CPU (see
memcg_vmstats_percpu), about 16 bytes, which is not a lot but it can
quickly add up with a large number of CPUs/memcgs/stats.

Also, when flushing the stats we iterate all of them to propagate
updates from per-CPU counters. This is already a slowpath, so adding
one stat is not a big deal, but again, because we iterate all stats on
multiple CPUs (and sometimes on each node as well), the overall flush
latency becomes a concern sometimes.

All of that is not to say we shouldn't add more memcg stats, but we
have to be mindful of the resources.
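A toy userspace model of the layout described above may make the cost shape
concrete: each stat keeps a per-CPU running value plus a snapshot from the
last flush (the two longs), and a flush walks every stat on every CPU. This
is only an illustration, not the memcg_vmstats_percpu code itself:

#include <stdio.h>

#define NR_CPUS  8      /* illustration only */
#define NR_STATS 4      /* every new stat grows each per-CPU block */

/* per-CPU block: two longs per stat, as in memcg_vmstats_percpu */
struct stats_percpu {
        long state[NR_STATS];      /* pending per-CPU updates */
        long state_prev[NR_STATS]; /* value seen at the last flush */
};

static struct stats_percpu pcpu[NR_CPUS];
static long aggregated[NR_STATS];

/* a flush touches NR_CPUS * NR_STATS entries; in the kernel this is
 * also repeated per memcg (and per node for some stats) */
static void flush(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                for (int i = 0; i < NR_STATS; i++) {
                        long delta = pcpu[cpu].state[i] -
                                     pcpu[cpu].state_prev[i];
                        pcpu[cpu].state_prev[i] = pcpu[cpu].state[i];
                        aggregated[i] += delta;
                }
}

int main(void)
{
        pcpu[3].state[1] += 5;  /* a hot-path increment on "CPU 3" */
        flush();
        printf("stat 1 = %ld\n", aggregated[1]);  /* prints 5 */
        return 0;
}

Adding one stat grows every per-CPU block and lengthens the inner loop of
every flush, which is the concern raised above.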
On 28.10.24 17:34, Nhat Pham wrote:
> On Sun, Oct 27, 2024 at 7:32 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> zeromap_swpin and zeromap_swpout seem too long, and zswpin
>> and zswpout are already used by zswap.
>
> Hmmmmm. How about zeroswpin? zeroswpout?

Is this kind of a "swpin_skip" / "swpout_skip"? Because we optimized it
out?
On 28/10/2024 17:08, Yosry Ahmed wrote:
[..]
> All of that is not to say we shouldn't add more memcg stats, but we
> have to be mindful of the resources.

Yes, agreed! Plus the cost of incrementing similar counters (which of
course is also not much).

Not trying to block this patch in any way. I just think it's a good point
to discuss here whether we are ok with both types of counters, and if that
is too wasteful, which one we should have.
On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
> Not trying to block this patch in any way. I just think it's a good point
> to discuss here whether we are ok with both types of counters, and if that
> is too wasteful, which one we should have.

Hi Usama,
my point is that with all of the below three counters:
1. PSWPIN/PSWPOUT
2. ZSWPIN/ZSWPOUT
3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)

shouldn't we have been able to determine the portion of zeromap
swap indirectly?

Thanks
Barry
On Mon, Oct 28, 2024 at 12:54 PM Barry Song <21cnbao@gmail.com> wrote:
[..]
> my point is that with all of the below three counters:
> 1. PSWPIN/PSWPOUT
> 2. ZSWPIN/ZSWPOUT
> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)
>
> shouldn't we have been able to determine the portion of zeromap
> swap indirectly?

What about swap entries that get freed without being swapped in (e.g.
swapped-out anon memory being freed, MADV_FREE, shmem truncate, etc.)?
On 28/10/2024 19:54, Barry Song wrote:
[..]
> my point is that with all of the below three counters:
> 1. PSWPIN/PSWPOUT
> 2. ZSWPIN/ZSWPOUT
> 3. SWAPIN_SKIP/SWAPOUT_SKIP (or ZEROSWPIN/ZEROSWPOUT, whatever)
>
> shouldn't we have been able to determine the portion of zeromap
> swap indirectly?

Hmm, I might be wrong, but I would have thought no?

What if you swap out a zero folio, but then discard it?
zeromap_swpout would be incremented, but zeromap_swpin would not.
On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
> Hmm, I might be wrong, but I would have thought no?
>
> What if you swap out a zero folio, but then discard it?
> zeromap_swpout would be incremented, but zeromap_swpin would not.

I understand. It looks like we have two issues to tackle:
1. We shouldn't let zeromap swap in or out anything that vanishes into
   a black hole
2. We want to find out how much I/O/memory has been saved due to
   zeromap so far

From my perspective, issue 1 requires a "fix", while issue 2 is more
of an optimization.

I consider issue 1 to be more critical because, after observing a phone
running for some time, I've been able to roughly estimate the portion
zeromap can help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP, even
without a SWPIN counter. However, I agree that issue 2 still holds
significant value as a separate patch.

Thanks
Barry
On 28/10/2024 20:42, Barry Song wrote:
[..]
> I understand. It looks like we have two issues to tackle:
> 1. We shouldn't let zeromap swap in or out anything that vanishes into
>    a black hole
> 2. We want to find out how much I/O/memory has been saved due to
>    zeromap so far
>
> From my perspective, issue 1 requires a "fix", while issue 2 is more
> of an optimization.

Hmm, I don't understand why point 1 would be an issue.

If it's discarded, that's fine as far as I can see.

As a reference, memory.stat.zswapped != memory.stat.zswapout -
memory.stat.zswapin, because zswapped would take into account swapped-out
anon memory being freed, MADV_FREE, shmem truncate, etc., as Yosry said
about zeromap, but zswapout and zswapin don't.
On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
> Hmm, I don't understand why point 1 would be an issue.
>
> If it's discarded, that's fine as far as I can see.

It is fine to you, and probably to me, who knows zeromap as well :-) but
any userspace code like the below might be entirely confused:

	p = malloc(1GB);
	write zeros to p (or to part of p);
	madv_pageout(p, 1GB);
	read p to swap it back in;

The entire procedure used to involve 1GB of swap-out and 1GB of swap-in by
any means. Now, it records 0 swaps.

I don't expect userspace to be as smart as you :-)

> As a reference, memory.stat.zswapped != memory.stat.zswapout -
> memory.stat.zswapin, because zswapped would take into account swapped-out
> anon memory being freed, MADV_FREE, shmem truncate, etc., as Yosry said
> about zeromap, but zswapout and zswapin don't.

I understand. However, I believe what we really need to focus on is this:
if we've swapped out, for instance, 100GB in the past hour, how much of
that 100GB is zero? This information can help us assess the proportion of
zero data in the workload, along with the potential benefits that zeromap
can provide for memory, I/O space, or read/write operations. Additionally,
having the second count can enhance accuracy when considering
MADV_DONTNEED, FREE, TRUNCATE, and so on.

Thanks
Barry
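A runnable version of the scenario above, as a sketch: it assumes swap is
enabled, MADV_PAGEOUT support (Linux 5.4+), and a kernel with the zeromap
optimization, in which case both deltas stay near zero even though 1 GiB
was swapped out and faulted back:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ (1UL << 30)  /* 1 GiB */

/* read one counter, e.g. "pswpout", from /proc/vmstat */
static long vmstat_read(const char *key)
{
        char name[64];
        long val;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%63s %ld", name, &val) == 2) {
                if (!strcmp(name, key)) {
                        fclose(f);
                        return val;
                }
        }
        fclose(f);
        return -1;
}

int main(void)
{
        char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        volatile char sink = 0;
        long out, in;

        if (p == MAP_FAILED)
                return 1;
        memset(p, 0, SZ);               /* dirty 1 GiB with zeros */
        out = vmstat_read("pswpout");
        madvise(p, SZ, MADV_PAGEOUT);   /* zeromap: marked, no device IO */
        in = vmstat_read("pswpin");
        for (unsigned long i = 0; i < SZ; i += 4096)
                sink += p[i];           /* fault everything back in */
        printf("pswpout delta: %ld\n", vmstat_read("pswpout") - out);
        printf("pswpin  delta: %ld\n", vmstat_read("pswpin") - in);
        (void)sink;
        return 0;
}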
On 28/10/2024 21:15, Barry Song wrote:
[..]
> It is fine to you, and probably to me, who knows zeromap as well :-) but
> any userspace code like the below might be entirely confused:
>
>	p = malloc(1GB);
>	write zeros to p (or to part of p);
>	madv_pageout(p, 1GB);
>	read p to swap it back in;
>
> The entire procedure used to involve 1GB of swap-out and 1GB of swap-in by
> any means. Now, it records 0 swaps.
>
> I don't expect userspace to be as smart as you :-)

Ah, I completely agree, we need to account for it in some metric. I
probably misunderstood "We shouldn't let zeromap swap in or out anything
that vanishes into a black hole" as meaning we should not have the zeromap
optimization for those cases. What I guess you meant is that we need to
account for it in some metric.

> I understand. However, I believe what we really need to focus on is this:
> if we've swapped out, for instance, 100GB in the past hour, how much of
> that 100GB is zero?
[..]

Yes, completely agree!

I think we can look into adding all three metrics: zeromap_swapped,
zeromap_swpout, zeromap_swpin (or whatever names work).
On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@gmail.com> wrote:
[..]
> Yes, completely agree!
>
> I think we can look into adding all three metrics: zeromap_swapped,
> zeromap_swpout, zeromap_swpin (or whatever names work).

It's great to reach an agreement. Let me work on some patches for it.

By the way, I recently had an idea: if we can conduct the zeromap check
earlier - for example, before allocating swap slots and pageout() - could
we completely eliminate swap slot occupation and allocation/release for
zeromap data? For example, we could use a special swap entry value in the
PTE to indicate zero content and directly fill the page with zeros when
swapping back. We've observed that swap slot allocation and freeing can
consume a lot of CPU and slow down functions like zap_pte_range() and
swap-in. If we can entirely skip these steps, it could improve
performance. However, I'm uncertain about the benefits we would gain if
we only have 1-2% zeromap data.

I'm just putting this idea out there to see if you're interested in moving
forward with it. :-)

Thanks
Barry
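The shape of that idea, as a purely hypothetical sketch: SWP_ZERO is an
invented reserved swap type in the spirit of existing special entries such
as migration entries, and this glosses over the ordering problem discussed
below (checking content before unmap races with concurrent writes):

/* reclaim side: the folio was found to be all zeros, so install a
 * marker entry instead of allocating a swap slot */
swp_entry_t entry = swp_entry(SWP_ZERO, 0);     /* SWP_ZERO is made up */
set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));

/* fault side, early in do_swap_page(): no device IO, and no swap
 * slot to free here or in zap_pte_range() */
entry = pte_to_swp_entry(vmf->orig_pte);
if (swp_type(entry) == SWP_ZERO) {
        folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
        /* ... map the zero-filled folio and return ... */
}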
On 28/10/2024 21:40, Barry Song wrote:
[..]
> It's great to reach an agreement. Let me work on some patches for it.

Thanks!

> By the way, I recently had an idea: if we can conduct the zeromap check
> earlier - for example, before allocating swap slots and pageout() - could
> we completely eliminate swap slot occupation and allocation/release for
> zeromap data? For example, we could use a special swap entry value in the
> PTE to indicate zero content and directly fill the page with zeros when
> swapping back.
[..]

If I remember correctly, this was one of the ideas floated around in the
initial version of the zeromap series, but it was evaluated as a lot more
complicated to do than what the current zeromap code looks like. But I
think it's definitely worth looking into!
On Tue, Oct 29, 2024 at 5:49 AM Usama Arif <usamaarif642@gmail.com> wrote: > > > > On 28/10/2024 21:40, Barry Song wrote: > > On Tue, Oct 29, 2024 at 5:24 AM Usama Arif <usamaarif642@gmail.com> wrote: > >> > >> > >> > >> On 28/10/2024 21:15, Barry Song wrote: > >>> On Tue, Oct 29, 2024 at 4:51 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>> > >>>> > >>>> > >>>> On 28/10/2024 20:42, Barry Song wrote: > >>>>> On Tue, Oct 29, 2024 at 4:00 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 28/10/2024 19:54, Barry Song wrote: > >>>>>>> On Tue, Oct 29, 2024 at 1:20 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On 28/10/2024 17:08, Yosry Ahmed wrote: > >>>>>>>>> On Mon, Oct 28, 2024 at 10:00 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On 28/10/2024 16:33, Nhat Pham wrote: > >>>>>>>>>>> On Mon, Oct 28, 2024 at 5:23 AM Usama Arif <usamaarif642@gmail.com> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> I wonder if instead of having counters, it might be better to keep track > >>>>>>>>>>>> of the number of zeropages currently stored in zeromap, similar to how > >>>>>>>>>>>> zswap_same_filled_pages did it. It will be more complicated than this > >>>>>>>>>>>> patch, but would give more insight into the current state of the system. > >>>>>>>>>>>> > >>>>>>>>>>>> Joshua (in CC) was going to have a look at that. > >>>>>>>>>>> > >>>>>>>>>>> I don't think one can substitute for the other. > >>>>>>>>>> > >>>>>>>>>> Yes, agreed, they have separate uses and provide different information, but > >>>>>>>>>> maybe it is wasteful to have both types of counters? They are counters, so maybe they > >>>>>>>>>> don't consume too many resources, but I think we should still think about > >>>>>>>>>> it. > >>>>>>>>> > >>>>>>>>> Not for or against here, but I would say that statement is debatable > >>>>>>>>> at best for memcg stats :) > >>>>>>>>> > >>>>>>>>> Each new counter consumes 2 longs per-memcg per-CPU (see > >>>>>>>>> memcg_vmstats_percpu), about 16 bytes, which is not a lot, but it can > >>>>>>>>> quickly add up with a large number of CPUs/memcgs/stats. > >>>>>>>>> > >>>>>>>>> Also, when flushing the stats we iterate all of them to propagate > >>>>>>>>> updates from per-CPU counters. This is already a slowpath, so adding > >>>>>>>>> one stat is not a big deal, but again, because we iterate all stats on > >>>>>>>>> multiple CPUs (and sometimes on each node as well), the overall flush > >>>>>>>>> latency becomes a concern sometimes. > >>>>>>>>> > >>>>>>>>> All of that is not to say we shouldn't add more memcg stats, but we > >>>>>>>>> have to be mindful of the resources. > >>>>>>>> > >>>>>>>> Yes, agreed! Plus the cost of incrementing similar counters (which of course is > >>>>>>>> also not much). > >>>>>>>> > >>>>>>>> Not trying to block this patch in any way. I just think it's a good point > >>>>>>>> to discuss here whether we are OK with both types of counters, and if it's too wasteful, > >>>>>>>> which one we should have. > >>>>>>> > >>>>>>> Hi Usama, > >>>>>>> my point is that with all the below three counters: > >>>>>>> 1. PSWPIN/PSWPOUT > >>>>>>> 2. ZSWPIN/ZSWPOUT > >>>>>>> 3. SWAPIN_SKIP/SWAPOUT_SKIP or (ZEROSWPIN, ZEROSWPOUT, whatever) > >>>>>>> > >>>>>>> shouldn't we have been able to determine the portion of zeromap > >>>>>>> swap indirectly? > >>>>>>> > >>>>>> > >>>>>> Hmm, I might be wrong, but I would have thought no? > >>>>>> > >>>>>> What if you swap out a zero folio, but then discard it?
> >>>>>> zeromap_swpout would be incremented, but zeromap_swpin would not. > >>>>> > >>>>> I understand. It looks like we have two issues to tackle: > >>>>> 1. We shouldn't let zeromap swap in or out anything that vanishes into > >>>>> a black hole > >>>>> 2. We want to find out how much I/O/memory has been saved due to zeromap so far > >>>>> > >>>>> From my perspective, issue 1 requires a "fix", while issue 2 is more > >>>>> of an optimization. > >>>> > >>>> Hmm, I don't understand why point 1 would be an issue. > >>>> > >>>> If it's discarded, that's fine as far as I can see. > >>> > >>> It is fine to you, and probably to me, since we know zeromap well :-) but > >>> any userspace code > >>> like the below might be entirely confused: > >>> > >>> p = malloc(1G); > >>> write zeros to p, or to part of p; > >>> madv_pageout(p, 1G); > >>> read p to swap it back in. > >>> > >>> The entire procedure used to involve 1GB of swap-out and 1GB of swap-in by any > >>> means. Now, it records zero swaps. > >>> > >>> I don't expect userspace to be as smart as you :-) > >>> > >> Ah, I completely agree, we need to account for it in some metric. I probably > >> misunderstood when you said "We shouldn't let zeromap swap in or out anything that > >> vanishes into a black hole" as meaning we should not have the zeromap optimization for those > >> cases. What I guess you meant is that we need to account for it in some metric. > >> > >>>> > >>>> As a reference, memory.stat.zswapped != memory.stat.zswapout - memory.stat.zswapin, > >>>> because zswapped takes into account swapped-out anon memory that is later freed (MADV_FREE, > >>>> shmem truncate, etc.), as Yosry said about zeromap, but zswapout and zswapin don't. > >>> > >>> I understand. However, I believe what we really need to focus on is > >>> this: if we've > >>> swapped out, for instance, 100GB in the past hour, how much of that 100GB is > >>> zero? This information can help us assess the proportion of zero data in the > >>> workload, along with the potential benefits that zeromap can provide for memory, > >>> I/O space, or read/write operations. Additionally, having the second count > >>> can enhance accuracy when considering MADV_DONTNEED, FREE, TRUNCATE, > >>> and so on. > >>> > >> Yes, completely agree! > >> > >> I think we can look into adding all three metrics, zeromap_swapped, zeromap_swpout, > >> zeromap_swpin (or whatever name works). > > > > It's great to reach an agreement. Let me work on some patches for it. > > Thanks! > > > > > By the way, I recently had an idea: if we can conduct the zeromap check > > earlier, for example before allocating swap slots and calling pageout(), could > > we completely eliminate swap slot occupation and allocation/release > > for zeromap data? For example, we could use a special swap > > entry value in the PTE to indicate zero content and directly fill it with > > zeros when swapping back. We've observed that swap slot allocation and > > freeing can consume a lot of CPU and slow down functions like > > zap_pte_range and swap-in. If we can entirely skip these steps, it > > could improve performance. However, I'm uncertain about the benefits we > > would gain if we only have 1-2% zeromap data. > > If I remember correctly, this was one of the ideas floated around in the > initial version of the zeromap series, but it was evaluated as a lot more > complicated to do than what the current zeromap code looks like. But I > think it's definitely worth looking into! Sorry for the noise. I didn't review the initial discussion.
But my feeling is that it might be valuable considering the report from Zhiguo: https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/ In fact, our recent benchmark also indicates that swap free could account for a significant portion of the time in do_swap_page(). > > > > > I'm just putting this idea out there to see if you're interested in moving > > forward with it. :-) > > > >> > >>>> > >>>> > >>>>> > >>>>> I consider issue 1 to be more critical because, after observing a phone > >>>>> running for some time, I've been able to roughly estimate the portion > >>>>> zeromap can > >>>>> help save using only PSWPOUT, ZSWPOUT, and SWAPOUT_SKIP, even without a > >>>>> SWPIN counter. However, I agree that issue 2 still holds significant value > >>>>> as a separate patch. > >>>>> > >>> > > Thanks, Barry
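For reference, the scenario Barry sketches is easy to reproduce from userspace. A minimal test program might look like the following (a hypothetical illustration only; it assumes an active swap device, a kernel with the zeromap optimization, and MADV_PAGEOUT, which needs Linux 5.4 or later):

/* zeromap_repro.c: hypothetical repro for the "vanishing" swap counts. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21			/* reclaim the given range (5.4+) */
#endif

#define SZ (1UL << 30)			/* 1 GB */

int main(void)
{
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0, SZ);		/* all-zero pages: zeromap candidates */

	if (madvise(p, SZ, MADV_PAGEOUT))	/* swap the range out */
		perror("madvise");

	/*
	 * Fault everything back in. Without the proposed accounting,
	 * pswpin/pswpout in /proc/vmstat do not move for this 1 GB.
	 */
	volatile char sum = 0;
	for (size_t i = 0; i < SZ; i += 4096)
		sum += p[i];

	munmap(p, SZ);
	return 0;
}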
[..] > > > By the way, I recently had an idea: if we can conduct the zeromap check > > > earlier, for example before allocating swap slots and calling pageout(), could > > > we completely eliminate swap slot occupation and allocation/release > > > for zeromap data? For example, we could use a special swap > > > entry value in the PTE to indicate zero content and directly fill it with > > > zeros when swapping back. We've observed that swap slot allocation and > > > freeing can consume a lot of CPU and slow down functions like > > > zap_pte_range and swap-in. If we can entirely skip these steps, it > > > could improve performance. However, I'm uncertain about the benefits we > > > would gain if we only have 1-2% zeromap data. > > > > If I remember correctly, this was one of the ideas floated around in the > > initial version of the zeromap series, but it was evaluated as a lot more > > complicated to do than what the current zeromap code looks like. But I > > think it's definitely worth looking into! Yup, I did suggest this on the first version: https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/, and Usama took a stab at implementing it in the second version: https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/ David and Shakeel pointed out a few problems. I think they are fixable, but the complexity/benefit tradeoff was getting unclear at that point. If we can make it work without too much complexity, that would of course be great. > > Sorry for the noise. I didn't review the initial discussion. But my feeling > is that it might be valuable considering the report from Zhiguo: > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/ > > In fact, our recent benchmark also indicates that swap free could account > for a significant portion of the time in do_swap_page(). As Shakeel mentioned in a reply to Usama's patch above, we would need to check the contents of the page after it's unmapped. So we would likely need to allocate a swap slot, walk the rmap and unmap, check the contents, walk the rmap again and update the PTEs, and free the swap slot. So the swap free will essentially be moved from the fault path to the reclaim path, not eliminated. It may still be worth it; I'm not sure. We also need to make sure we keep the rmap intact after the first walk and unmap in case we need to go back and update the PTEs again. Overall, I think the complexity is unlikely to be low.
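Spelled out, the ordering Yosry describes would look roughly like the sketch below. Every helper name here is illustrative rather than a real mm/ interface; the point is only the sequence: slot first, unmap second, content check third.

/*
 * Illustrative ordering only; none of these helpers are the real
 * mm/vmscan.c interfaces.
 */
static bool reclaim_folio_maybe_zero(struct folio *folio)
{
	/* 1. A real slot is still allocated up front. */
	swp_entry_t slot = allocate_swap_slot(folio);

	/* 2. Unmap: PTEs now hold 'slot', concurrent writers fault. */
	unmap_folio_to_swap_entry(folio, slot);

	/* 3. Only now are the contents stable enough to test. */
	if (folio_contents_are_zero(folio)) {
		/*
		 * 4. Second rmap walk to rewrite the just-installed
		 * swap PTEs to a special zero entry, then return the
		 * slot. The swap free moves here, out of
		 * do_swap_page(), rather than disappearing.
		 */
		rewrite_swap_ptes_to_zero_entry(folio, slot);
		free_swap_slot(slot);
		return true;
	}

	pageout_folio(folio, slot);	/* normal writeback path */
	return false;
}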
On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > [..] > > > > By the way, I recently had an idea: if we can conduct the zeromap check > > > > earlier - for example - before allocating swap slots and pageout(), could > > > > we completely eliminate swap slot occupation and allocation/release > > > > for zeromap data? For example, we could use a special swap > > > > entry value in the PTE to indicate zero content and directly fill it with > > > > zeros when swapping back. We've observed that swap slot allocation and > > > > freeing can consume a lot of CPU and slow down functions like > > > > zap_pte_range and swap-in. If we can entirely skip these steps, it > > > > could improve performance. However, I'm uncertain about the benefits we > > > > would gain if we only have 1-2% zeromap data. > > > > > > If I remember correctly this was one of the ideas floated around in the > > > initial version of the zeromap series, but it was evaluated as a lot more > > > complicated to do than what the current zeromap code looks like. But I > > > think its definitely worth looking into! > > Yup, I did suggest this on the first version: > https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/ > > , and Usama took a stab at implementing it in the second version: > https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/ > > David and Shakeel pointed out a few problems. I think they are > fixable, but the complexity/benefit tradeoff was getting unclear at > that point. > > If we can make it work without too much complexity, that would be > great of course. > > > > > Sorry for the noise. I didn't review the initial discussion. But my feeling > > is that it might be valuable considering the report from Zhiguo: > > > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/ > > > > In fact, our recent benchmark also indicates that swap free could account > > for a significant portion in do_swap_page(). > > As Shakeel mentioned in a reply to Usama's patch mentioned above, we > would need to check the contents of the page after it's unmapped. So > likely we need to allocate a swap slot, walk the rmap and unmap, check > contents, walk the rmap again and update the PTEs, free the swap slot. > So the issue is that we can't check the content before allocating slots and unmapping during reclamation? If we find the content is zero, can we skip all slot operations and go directly to rmap/unmap by using a special PTE? > So the swap free will be essentially moved from the fault path to the > reclaim path, not eliminated. It may still be worth it, not sure. We > also need to make sure we keep the rmap intact after the first walk > and unmap in case we need to go back and update the PTEs again. > > Overall, I think the complexity is unlikely to be low.
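For concreteness, the fault side of Barry's special-PTE idea could look something like the sketch below. The "zero" entry encoding is hypothetical, and locking (the PTE lock in particular) is glossed over; the allocation and rmap calls mirror what a normal anonymous fault does.

/* Hypothetical swap-in path for a reserved "zero" swap entry. */
static vm_fault_t do_zero_swap_entry(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct folio *folio;

	/*
	 * No slot was ever allocated at reclaim time: nothing to free,
	 * no zeromap bitmap to consult, no swapcache, no I/O.
	 */
	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
	if (!folio)
		return VM_FAULT_OOM;

	folio_add_new_anon_rmap(folio, vma, vmf->address, RMAP_EXCLUSIVE);
	folio_add_lru_vma(folio, vma);
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
		   mk_pte(&folio->page, vma->vm_page_prot));
	return 0;
}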
On Mon, Oct 28, 2024 at 3:52 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > [..] > > > > > By the way, I recently had an idea: if we can conduct the zeromap check > > > > > earlier - for example - before allocating swap slots and pageout(), could > > > > > we completely eliminate swap slot occupation and allocation/release > > > > > for zeromap data? For example, we could use a special swap > > > > > entry value in the PTE to indicate zero content and directly fill it with > > > > > zeros when swapping back. We've observed that swap slot allocation and > > > > > freeing can consume a lot of CPU and slow down functions like > > > > > zap_pte_range and swap-in. If we can entirely skip these steps, it > > > > > could improve performance. However, I'm uncertain about the benefits we > > > > > would gain if we only have 1-2% zeromap data. > > > > > > > > If I remember correctly this was one of the ideas floated around in the > > > > initial version of the zeromap series, but it was evaluated as a lot more > > > > complicated to do than what the current zeromap code looks like. But I > > > > think its definitely worth looking into! > > > > Yup, I did suggest this on the first version: > > https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/ > > > > , and Usama took a stab at implementing it in the second version: > > https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/ > > > > David and Shakeel pointed out a few problems. I think they are > > fixable, but the complexity/benefit tradeoff was getting unclear at > > that point. > > > > If we can make it work without too much complexity, that would be > > great of course. > > > > > > > > Sorry for the noise. I didn't review the initial discussion. But my feeling > > > is that it might be valuable considering the report from Zhiguo: > > > > > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/ > > > > > > In fact, our recent benchmark also indicates that swap free could account > > > for a significant portion in do_swap_page(). > > > > As Shakeel mentioned in a reply to Usama's patch mentioned above, we > > would need to check the contents of the page after it's unmapped. So > > likely we need to allocate a swap slot, walk the rmap and unmap, check > > contents, walk the rmap again and update the PTEs, free the swap slot. > > > > So the issue is that we can't check the content before allocating slots and > unmapping during reclamation? If we find the content is zero, can we skip > all slot operations and go directly to rmap/unmap by using a special PTE? We need to unmap first before checking the content, otherwise the content can change right after we check it.
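The window being closed here is the classic check-then-act race; as a timeline:

/*
 * Why the zero check must follow the unmap (illustrative timeline):
 *
 *   reclaim                             userspace
 *   -------                             ---------
 *   reads page, sees all zeroes
 *                                       writes 0x42 via still-live PTE
 *   installs "zero" entry in PTE
 *   frees the page
 *                                       later faults: gets a zero page,
 *                                       so the 0x42 write is silently lost
 *
 * Once the PTE is gone, a concurrent writer must take a fault and
 * cannot slip a store in behind the content check.
 */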
On Tue, Oct 29, 2024 at 6:54 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Mon, Oct 28, 2024 at 3:52 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > [..] > > > > > > By the way, I recently had an idea: if we can conduct the zeromap check > > > > > > earlier - for example - before allocating swap slots and pageout(), could > > > > > > we completely eliminate swap slot occupation and allocation/release > > > > > > for zeromap data? For example, we could use a special swap > > > > > > entry value in the PTE to indicate zero content and directly fill it with > > > > > > zeros when swapping back. We've observed that swap slot allocation and > > > > > > freeing can consume a lot of CPU and slow down functions like > > > > > > zap_pte_range and swap-in. If we can entirely skip these steps, it > > > > > > could improve performance. However, I'm uncertain about the benefits we > > > > > > would gain if we only have 1-2% zeromap data. > > > > > > > > > > If I remember correctly this was one of the ideas floated around in the > > > > > initial version of the zeromap series, but it was evaluated as a lot more > > > > > complicated to do than what the current zeromap code looks like. But I > > > > > think its definitely worth looking into! > > > > > > Yup, I did suggest this on the first version: > > > https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/ > > > > > > , and Usama took a stab at implementing it in the second version: > > > https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/ > > > > > > David and Shakeel pointed out a few problems. I think they are > > > fixable, but the complexity/benefit tradeoff was getting unclear at > > > that point. > > > > > > If we can make it work without too much complexity, that would be > > > great of course. > > > > > > > > > > > Sorry for the noise. I didn't review the initial discussion. But my feeling > > > > is that it might be valuable considering the report from Zhiguo: > > > > > > > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/ > > > > > > > > In fact, our recent benchmark also indicates that swap free could account > > > > for a significant portion in do_swap_page(). > > > > > > As Shakeel mentioned in a reply to Usama's patch mentioned above, we > > > would need to check the contents of the page after it's unmapped. So > > > likely we need to allocate a swap slot, walk the rmap and unmap, check > > > contents, walk the rmap again and update the PTEs, free the swap slot. > > > > > > > So the issue is that we can't check the content before allocating slots and > > unmapping during reclamation? If we find the content is zero, can we skip > > all slot operations and go directly to rmap/unmap by using a special PTE? > > We need to unmap first before checking the content, otherwise the > content can change right after we check it. Well, do we have a way to terminate the unmap if we find pte_dirty and ensure the folio is still mapped after try_to_unmap_one()? Then we could activate it again after try_to_unmap. It might just be noise. Let me take some more time to think about it. :-)
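A very rough sketch of what Barry is suggesting, assuming the content check happens before the unmap and a dirty PTE aborts the operation (hypothetical helpers again; the real try_to_unmap_one() has no such hook today):

static bool unmap_one_if_still_clean(struct folio *folio,
				     struct vm_area_struct *vma,
				     unsigned long addr, pte_t *ptep)
{
	if (pte_dirty(*ptep)) {
		/*
		 * Written since we sampled the contents: bail out so
		 * reclaim can re-activate the folio instead.
		 */
		return false;
	}
	/* Still clean: install the (hypothetical) zero entry directly. */
	set_pte_at(vma->vm_mm, addr, ptep, make_zero_swp_pte());
	return true;
}

One catch worth noting: the PTE is normally already dirty from the original zero write, so the content check would have to clear the dirty bit (with a TLB flush) when it samples the page, which hints at where the complexity Yosry mentions starts to pile up.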
On Mon, Oct 28, 2024 at 4:03 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Oct 29, 2024 at 6:54 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Mon, Oct 28, 2024 at 3:52 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Tue, Oct 29, 2024 at 6:33 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > [..] > > > > > > > By the way, I recently had an idea: if we can conduct the zeromap check > > > > > > > earlier - for example - before allocating swap slots and pageout(), could > > > > > > > we completely eliminate swap slot occupation and allocation/release > > > > > > > for zeromap data? For example, we could use a special swap > > > > > > > entry value in the PTE to indicate zero content and directly fill it with > > > > > > > zeros when swapping back. We've observed that swap slot allocation and > > > > > > > freeing can consume a lot of CPU and slow down functions like > > > > > > > zap_pte_range and swap-in. If we can entirely skip these steps, it > > > > > > > could improve performance. However, I'm uncertain about the benefits we > > > > > > > would gain if we only have 1-2% zeromap data. > > > > > > > > > > > > If I remember correctly this was one of the ideas floated around in the > > > > > > initial version of the zeromap series, but it was evaluated as a lot more > > > > > > complicated to do than what the current zeromap code looks like. But I > > > > > > think its definitely worth looking into! > > > > > > > > Yup, I did suggest this on the first version: > > > > https://lore.kernel.org/linux-mm/CAJD7tkYcTV_GOZV3qR6uxgFEvYXw1rP-h7WQjDnsdwM=g9cpAw@mail.gmail.com/ > > > > > > > > , and Usama took a stab at implementing it in the second version: > > > > https://lore.kernel.org/linux-mm/20240604105950.1134192-1-usamaarif642@gmail.com/ > > > > > > > > David and Shakeel pointed out a few problems. I think they are > > > > fixable, but the complexity/benefit tradeoff was getting unclear at > > > > that point. > > > > > > > > If we can make it work without too much complexity, that would be > > > > great of course. > > > > > > > > > > > > > > Sorry for the noise. I didn't review the initial discussion. But my feeling > > > > > is that it might be valuable considering the report from Zhiguo: > > > > > > > > > > https://lore.kernel.org/linux-mm/20240805153639.1057-1-justinjiang@vivo.com/ > > > > > > > > > > In fact, our recent benchmark also indicates that swap free could account > > > > > for a significant portion in do_swap_page(). > > > > > > > > As Shakeel mentioned in a reply to Usama's patch mentioned above, we > > > > would need to check the contents of the page after it's unmapped. So > > > > likely we need to allocate a swap slot, walk the rmap and unmap, check > > > > contents, walk the rmap again and update the PTEs, free the swap slot. > > > > > > > > > > So the issue is that we can't check the content before allocating slots and > > > unmapping during reclamation? If we find the content is zero, can we skip > > > all slot operations and go directly to rmap/unmap by using a special PTE? > > > > We need to unmap first before checking the content, otherwise the > > content can change right after we check it. > > Well, do we have a way to terminate the unmap if we find pte_dirty and ensure > the folio is still mapped after try_to_unmap_one()? Then we could > activate it again > after try_to_unmap. > > It might just be noise. Let me take some more time to think about it. 
FWIW, the swap abstraction layer Yosry proposed last year (and I'm working on right now) will allow you to store these zeromapped swap entries without requiring any swap slots allocated on the swapfile. It's basically the same thing as swap/zswap decoupling. Not stopping you guys from optimizing it, since all I have right now is a (most certainly buggy) prototype + there might be benefits if we can get around the swap subsystem altogether for these zero mapped entries. Just letting you know there's a backup plan :)
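For readers new to the thread, the decoupling Nhat mentions boils down to an indirection layer between PTEs and swap backends. A toy sketch of the idea (purely illustrative, not the prototype's actual design):

/* Toy model: PTEs store a virtual ID that resolves to a descriptor. */
enum vswap_backing {
	VSWAP_ZERO,		/* known zero-filled: no backing at all */
	VSWAP_ZSWAP,		/* compressed copy lives in zswap */
	VSWAP_SWAPFILE,		/* real slot on the swap device */
};

struct vswap_desc {
	enum vswap_backing backing;
	union {
		struct zswap_entry *zswap;	/* VSWAP_ZSWAP */
		swp_entry_t slot;		/* VSWAP_SWAPFILE */
	};
};

Zeromapped entries would then simply sit in the VSWAP_ZERO state, consuming a descriptor but no slot on the swapfile.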
> FWIW, the swap abstraction layer Yosry proposed last year (and I'm > working on right now) will allow you to store these zeromapped swap > entries without requiring any swap slots allocated on the swapfile. > It's basically the same thing as swap/zswap decoupling. I don't know if I said this before, but thank you for keeping this alive. Much appreciated :) > > Not stopping you guys from optimizing it, since all I have right now > is a (most certainly buggy) prototype + there might be benefits if we > can get around the swap subsystem altogether for these zero mapped > entries. Just letting you know there's a backup plan :)
On Tue, Oct 29, 2024 at 10:55 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > FWIW, the swap abstraction layer Yosry proposed last year (and I'm > > working on right now) will allow you to store these zeromapped swap > > entries without requiring any swap slots allocated on the swapfile. > > It's basically the same thing as swap/zswap decoupling. > > I don't know if I said this before, but thank you for keeping this > alive. Much appreciated :) > Of course, it's my pleasure :) I'll be sure to cc you when I send the prototype out as an RFC - your input is much appreciated, as always :)
diff --git a/mm/page_io.c b/mm/page_io.c
index 5d9b6e6cf96c..90c5ea870038 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -226,6 +226,19 @@ static void swap_zeromap_folio_clear(struct folio *folio)
 	}
 }
 
+static inline void count_swpout_vm_event(struct folio *folio)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (unlikely(folio_test_pmd_mappable(folio))) {
+		count_memcg_folio_events(folio, THP_SWPOUT, 1);
+		count_vm_event(THP_SWPOUT);
+	}
+#endif
+	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
+	count_memcg_folio_events(folio, PSWPOUT, folio_nr_pages(folio));
+	count_vm_events(PSWPOUT, folio_nr_pages(folio));
+}
+
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.
@@ -258,6 +271,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	 */
 	if (is_folio_zero_filled(folio)) {
 		swap_zeromap_folio_set(folio);
+		count_swpout_vm_event(folio);
 		folio_unlock(folio);
 		return 0;
 	} else {
@@ -282,19 +296,6 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	return 0;
 }
 
-static inline void count_swpout_vm_event(struct folio *folio)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (unlikely(folio_test_pmd_mappable(folio))) {
-		count_memcg_folio_events(folio, THP_SWPOUT, 1);
-		count_vm_event(THP_SWPOUT);
-	}
-#endif
-	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
-	count_memcg_folio_events(folio, PSWPOUT, folio_nr_pages(folio));
-	count_vm_events(PSWPOUT, folio_nr_pages(folio));
-}
-
 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
 static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
 {
@@ -621,6 +622,9 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 	delayacct_swapin_start();
 
 	if (swap_read_folio_zeromap(folio)) {
+		count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
+		count_memcg_folio_events(folio, PSWPIN, folio_nr_pages(folio));
+		count_vm_events(PSWPIN, folio_nr_pages(folio));
 		folio_unlock(folio);
 		goto finish;
 	} else if (zswap_load(folio)) {
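With a patch along these lines applied (note it counts zeromap folios in the existing PSWPIN/PSWPOUT counters rather than the separate zeromap counters discussed above), the effect can be observed from userspace. A small helper for eyeballing the counters before and after the repro program earlier in the thread, assuming the usual /proc/vmstat field names:

/* vmstat_swp.c: print pswpin/pswpout from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "pswpin ", 7) ||
		    !strncmp(line, "pswpout ", 8))
			fputs(line, stdout);
	fclose(f);
	return 0;
}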