Message ID | 20230615034830.1361853-1-hezhongkun.hzk@bytedance.com (mailing list archive) |
---|---|
State | New |
Series | [RFC,1/3] zram: charge the compressed RAM to the page's memcgroup |
On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He <hezhongkun.hzk@bytedance.com> wrote: > > The compressed RAM is currently charged to kernel, not to > any memory cgroup, which is not satisfy our usage scenario. > if the memory of a task is limited by memcgroup, it will > swap out the memory to zram swap device when the memory > is insufficient. In that case, the memory limit will have > no effect. > > So, it should makes sense to charge the compressed RAM to > the page's memory cgroup. We used to do this a long time ago, but we had per-memcg swapfiles [1] to prevent compressed pages from different memcgs from sharing the same zspage. Does this patchset alone suffer from the same problem, i.e., memcgs sharing zspages? [1] https://lwn.net/Articles/592923/
On Thu, Jun 15, 2023 at 6:59 AM Yu Zhao <yuzhao@google.com> wrote: > On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He > <hezhongkun.hzk@bytedance.com> wrote: > > > > The compressed RAM is currently charged to kernel, not to > > any memory cgroup, which is not satisfy our usage scenario. > > if the memory of a task is limited by memcgroup, it will > > swap out the memory to zram swap device when the memory > > is insufficient. In that case, the memory limit will have > > no effect. > > > > So, it should makes sense to charge the compressed RAM to > > the page's memory cgroup. > While looking at this in the past weeks, I believe that there are two distinct problems: 1. Direct zram usage by a process within a cg, i.e. a process writing to a zram device 2. Indirect zram usage by a process within a cg via swap (described above) Both of them probably require different solutions. In order to fix #1, a zram device's memory should be accounted towards a cgroup. IMHO this is something that should be fixed. Yu Zhao and Yosry are probably much more familiar with the solution to #2. WRT per-cgroup-swapfile, to me this is addressing #2, but I agree with Yu Zhao that there are probably better solutions to this. Lastly, this patchset, while it will possibly not address the swap issue (#2) completely, does it satisfy the needs of #1? - fabian > We used to do this a long time ago, but we had per-memcg swapfiles [1] > to prevent compressed pages from different memcgs from sharing the > same zspage. > > Does this patchset alone suffer from the same problem, i.e., memcgs > sharing zspages? > > [1] https://lwn.net/Articles/592923/
On 15.06.23 05:48, Zhongkun He wrote: > The compressed RAM is currently charged to kernel, not to > any memory cgroup, which is not satisfy our usage scenario. > if the memory of a task is limited by memcgroup, it will > swap out the memory to zram swap device when the memory > is insufficient. In that case, the memory limit will have > no effect. > > So, it should makes sense to charge the compressed RAM to > the page's memory cgroup. Interesting. When looking at possible ways to achieve that in a clean way, my understanding was that the page might not always be accounted to a memcg (e.g., simply some buffer?). Besides the swap->zram case I was thinking about other fs->zram cases, or just a straight read/write to the zram device. The easiest way to see where that goes wrong I think is zram_bvec_write_partial(), where we simply allocate a temporary page. Either I am missing something important, or this only works in some special cases. I came to the conclusion that the only reasonable way is to assign a complete zram device to a single cgroup and have all memory accounted to that cgroup. That also does not solve all use cases (especially the swap->zram case that might be better off just using zswap?) but at least the memory gets accounted in a more predictable way. Happy to learn more.
On Thu 15-06-23 11:48:30, Zhongkun He wrote:
[...]
> @@ -1419,6 +1420,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  	struct zcomp_strm *zstrm;
>  	unsigned long element = 0;
>  	enum zram_pageflags flags = 0;
> +	struct mem_cgroup *memcg, *old_memcg;
> +
> +	memcg = page_memcg(page);
> +	old_memcg = set_active_memcg(memcg);
>
>  	mem = kmap_atomic(page);
>  	if (page_same_filled(mem, &element)) {
[...]
> @@ -1470,8 +1475,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  		handle = zs_malloc(zram->mem_pool, comp_len,
>  				GFP_NOIO | __GFP_HIGHMEM |
>  				__GFP_MOVABLE);
> -		if (IS_ERR_VALUE(handle))
> -			return PTR_ERR((void *)handle);
> +		if (IS_ERR_VALUE(handle)) {
> +			ret = PTR_ERR((void *)handle);
> +			goto out;
> +		}
>
>  		if (comp_len != PAGE_SIZE)
>  			goto compress_again;

I am not really deeply familiar with zram implementation nor usage but
how is the above allocation going to be charged without __GFP_ACCOUNT in
the gfp mask?

Also what exactly is going to happen for the swap backed by the zram
device? Your memcg might be hitting the hard limit and therefore
swapping out. Wouldn't zs_malloc fail very likely under that condition
making the swap effectively unusable?
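[For context on the __GFP_ACCOUNT question: kernel allocations are only charged to a memcg when __GFP_ACCOUNT is passed, and set_active_memcg() merely redirects such accounted allocations to a memcg other than the allocating task's. A minimal sketch of that scheme follows; the helper name alloc_charged_to() is made up for illustration, while set_active_memcg(), __GFP_ACCOUNT and kmalloc() are existing interfaces. This is not code from the posted series.]

#include <linux/memcontrol.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

/*
 * Illustrative helper, not part of the posted patch: charge a kernel
 * buffer to a specific memcg instead of the allocating task's memcg.
 */
static void *alloc_charged_to(struct mem_cgroup *memcg, size_t size)
{
	struct mem_cgroup *old_memcg = set_active_memcg(memcg);
	/*
	 * Without __GFP_ACCOUNT this allocation would not be charged to
	 * any memcg at all, regardless of which memcg is active -- which
	 * is the point Michal raises about the zs_malloc() call above.
	 */
	void *buf = kmalloc(size, GFP_NOIO | __GFP_ACCOUNT);

	set_active_memcg(old_memcg);
	return buf;
}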
On Thu, Jun 15, 2023 at 1:00 PM Yu Zhao <yuzhao@google.com> wrote: > > On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He > <hezhongkun.hzk@bytedance.com> wrote: > > > > The compressed RAM is currently charged to kernel, not to > > any memory cgroup, which is not satisfy our usage scenario. > > if the memory of a task is limited by memcgroup, it will > > swap out the memory to zram swap device when the memory > > is insufficient. In that case, the memory limit will have > > no effect. > > > > So, it should makes sense to charge the compressed RAM to > > the page's memory cgroup. > > We used to do this a long time ago, but we had per-memcg swapfiles [1[ > to prevent compressed pages from different memcgs from sharing the > same zspage. > > Does this patchset alone suffer from the same problem, i.e., memcgs > sharing zspages? > > [1] https://lwn.net/Articles/592923/ Thanks for your reply. Yes, different memcgs may share the same zspages.
> While looking at this in the past weeks, I believe that there are two distinct problems: > 1. Direct zram usage by process within a cg ie. a process writing to a zram device > 2. Indirect zram usage by a process within a cg via swap (described above) > > Both of them probably require different solutions. > In order to fix #1, accounting a zram device should be accounted towards a cgroup. IMHO this is something that should be fixed. > > Yu Zhao and Yosry are probably much more familiar with the solution to #2. > WRT per-cgrou-swapfile, to me this is addressing #2, but I agree with Yu Zhao, that there are probably better solutions to this. > > Lastly, this patchset, while it will possibly not address the swap issue (#2) completely, is it satisfying the needs of #1? > > - fabian Thanks for your reply, Fabian. According to the previous design and test results, this patchset can meet the needs of both the direct and indirect usage scenarios, charging the compressed memory to the memory cgroup.
On Thu, Jun 15, 2023 at 5:27 PM David Hildenbrand <david@redhat.com> wrote: > > Interesting. When looking at possible ways to achieve that in a clean > way, my understanding was that the page might not always be accounted to > a memcg (e.g., simply some buffer?). Besides the swap->zram case I was > thinking about other fs->zram case, or just a straight read/write to the > zram device. > > The easiest way to see where that goes wrong I think is > zram_bvec_write_partial(), where we simply allocate a temporary page. > > Either I am missing something important, or this only works in some > special cases. > > I came to the conclusion that the only reasonable way is to assign a > complete zram device to a single cgroup and have all memory accounted to > that cgroup. Does also not solve all use cases (especially the > swap->zram case that might be better offjust using zswap?) but at least > the memory gets accounted in a more predictable way. > > Happy to learn more. > > -- > Cheers, > > David / dhildenb > Hi David, thanks for your reply. This should be my mistake; I didn't describe the problem clearly. The reason for the problem is not the temporary page allocated at the beginning of zram_bvec_write_partial() and released at the end, but the compressed memory allocated by zs_malloc() in zram_write_page(). So, assigning a complete zram device to a single cgroup may not meet the need. We need to charge the compressed memory to the original page's memory cgroup, which is not charged so far.
On 15.06.23 13:15, 贺中坤 wrote: > On Thu, Jun 15, 2023 at 5:27 PM David Hildenbrand <david@redhat.com> wrote: >> >> Interesting. When looking at possible ways to achieve that in a clean >> way, my understanding was that the page might not always be accounted to >> a memcg (e.g., simply some buffer?). Besides the swap->zram case I was >> thinking about other fs->zram case, or just a straight read/write to the >> zram device. >> >> The easiest way to see where that goes wrong I think is >> zram_bvec_write_partial(), where we simply allocate a temporary page. >> >> Either I am missing something important, or this only works in some >> special cases. >> >> I came to the conclusion that the only reasonable way is to assign a >> complete zram device to a single cgroup and have all memory accounted to >> that cgroup. Does also not solve all use cases (especially the >> swap->zram case that might be better offjust using zswap?) but at least >> the memory gets accounted in a more predictable way. >> >> Happy to learn more. >> >> -- >> Cheers, >> >> David / dhildenb >> > > Hi david, thanks for your reply. This should be my mistake, I didn't > describe the > problem clearly. > > The reason for the problem is not the temporary page allocated at the beginning > of zram_bvec_write_partial() and released at the end,but the compressed memory > allocated by zs_malloc() in zram_write_page(). > > So, it may not meet the need to assign a complete zram device to a > single cgroup. > we need to charge the compressed memory to the original page's memory cgroup, > which is not charged so far. > Yes, but my point is that there are cases where the pages you get are not charged. zram_bvec_write_partial() is just one such example that highlights the issue.
Hi Michal, glad to hear from you. > I am not really deeply familiar with zram implementation nor usage but > how is the above allocation going to be charged without __GFP_ACCOUNT in > the gfp mask? Yes, zs_malloc() does not charge the compressed memory, even if we add this gfp flag, so we need to implement that in this patchset. But this flag should be used to enable the feature. > Also what exactly is going to happen for the swap backed by the zram > device? Your memcg might be hitting the hard limit and therefore > swapping out. Wouldn't zs_malloc fail very likely under that condition > making the swap effectively unusable? This is the key point. As I said above, zs_malloc() does not charge the compressed memory, so zs_malloc() will not fail under that condition. If the zram swap is large enough, zs_malloc() never fails unless the system OOMs, so memory.max has no effect. > -- > Michal Hocko > SUSE Labs
On Thu, Jun 15, 2023 at 12:00 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > > > While looking at this in the past weeks, I believe that there are two distinct problems: > > 1. Direct zram usage by process within a cg ie. a process writing to a zram device > > 2. Indirect zram usage by a process within a cg via swap (described above) > > > > Both of them probably require different solutions. > > In order to fix #1, accounting a zram device should be accounted towards a cgroup. IMHO this is something that should be fixed. > > > > Yu Zhao and Yosry are probably much more familiar with the solution to #2. > > WRT per-cgrou-swapfile, to me this is addressing #2, but I agree with Yu Zhao, that there are probably better solutions to this. > > > > Lastly, this patchset, while it will possibly not address the swap issue (#2) completely, is it satisfying the needs of #1? > > > > - fabian > > Thanks for your reply, fabian. According to the previous design and > test results, this patchset can meet the need of direct and > indirect usage scenarios,charge the compressed memory to memory cgroup. As said, I can not speak about the indirect swap case, but if it is addressing the direct case, and putting the memory accounting on the cgroup - then this would address one of the use-cases I have in mind. Thanks! - fabian
On Thu 15-06-23 19:58:37, 贺中坤 wrote: > Hi michal, glad to hear from you. > > > I am not really deeply familiar with zram implementation nor usage but > > how is the above allocation going to be charged without __GFP_ACCOUNT in > > the gfp mask? > > Yes,zs_malloc() did not charge compressed memory, even if we add this gfp. > so we need to implement this function in this patchset. But this flag should be > used to enable this feature. Let me check I understand. This patch on its own doesn't really do anything. You need the zs_malloc support implemented in patch 3 for this to have any effect. Even with that in place the zs_malloc doesn't follow the __GFP_ACCOUNT scheme we use for allocation tracking. Correct? > > Also what exactly is going to happen for the swap backed by the zram > > device? Your memcg might be hitting the hard limit and therefore > > swapping out. Wouldn't zs_malloc fail very likely under that condition > > making the swap effectively unusable? > > This is the key point, as i said above, zs_malloc() did not charge > compressed memory, > so zs_malloc will not fail under that condition. if the zram swap is > large enough, zs_malloc > never fails unless system OOM. so memory.max will be invalidated. I do not think this is answering my question. Or maybe I just misunderstand. Let me try again. Say you have a memcg under hard limit pressure so any further charge is going to fail. How can you reasonably implement zram back swapout if the memory is charged?
On Thu, Jun 15, 2023 at 7:19 PM David Hildenbrand <david@redhat.com> wrote: > > Yes, but my point is that there are cases where the pages you get are > not charged. zram_bvec_write_partial() is just one such example that > highlights the issue. Sorry, I got it. > > -- > Cheers, > > David / dhildenb >
On 15.06.23 14:19, 贺中坤 wrote: > On Thu, Jun 15, 2023 at 7:19 PM David Hildenbrand <david@redhat.com> wrote: >> >> Yes, but my point is that there are cases where the pages you get are >> not charged. zram_bvec_write_partial() is just one such example that >> highlights the issue. > > Sorry ,I got it. I suspect for the swap->zram we should always get charged pages, because we're effectively writing out charged anon/shmem pages only -- without any buffer in between. For the fs->zram or direct zram access device case I'm not so sure. It highly depends on what gets mapped into the bio (e.g., a kernel buffer, zeropage, ...). If it's a pagecache page, that should be charged and we're good. Not so sure about fs metadata or some other fs cases (e.g., write() to a file that bypasses the pagecache).
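[A minimal sketch of the distinction David describes -- an assumption about how a guard could look, not something the posted patch does. page_memcg() and set_active_memcg() are existing interfaces; the helper names are invented.]

#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>

/*
 * Sketch only: redirect the charge to the bio page's memcg when the page
 * is actually charged (swapped-out anon/shmem, pagecache); uncharged
 * pages (kernel buffers, the zeropage, some fs metadata) keep the
 * default accounting context.
 */
static struct mem_cgroup *charge_ctx_enter(struct page *page)
{
	struct mem_cgroup *memcg = page_memcg(page);

	return memcg ? set_active_memcg(memcg) : NULL;
}

static void charge_ctx_exit(struct page *page, struct mem_cgroup *old_memcg)
{
	/* Only restore if charge_ctx_enter() actually switched the memcg. */
	if (page_memcg(page))
		set_active_memcg(old_memcg);
}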
> Let me check I understand. This patch on its own doesn't really do > anything. You need the zs_malloc support implemented in patch 3 for this > to have any effect. Even with that in place the zs_malloc doesn't follow > the __GFP_ACCOUNT scheme we use for allocation tracking. Correct? > Yes, I will use it in the next version. > I do not think this is answering my question. Or maybe I just > misunderstand. Let me try again. Say you have a memcg under hard limit > pressure so any further charge is going to fail. How can you reasonably > implement zram back swapout if the memory is charged? > Sorry, let me try to explain again. I have a memcg under hard limit pressure. Any further charge will try to free memory and swap it out to the zram backend, which compresses and stores the data in memory, so any further charge is not going to fail. The charged memory is swapped out to compressed memory step by step, but the compressed memory is not charged to the original memcg. So the actual memory usage is already greater than the hard limit in some cases. This patchset will charge the compressed memory to the original memcg, limited by memory.max. > -- > Michal Hocko > SUSE Labs
On Thu 15-06-23 21:09:16, 贺中坤 wrote: > > Let me check I understand. This patch on its own doesn't really do > > anything. You need the zs_malloc support implemented in patch 3 for this > > to have any effect. Even with that in place the zs_malloc doesn't follow > > the __GFP_ACCOUNT scheme we use for allocation tracking. Correct? > > > > Yes, I will use it on next version. OK, also make sure that the zsmalloc support is implemented before zram depends on it. > > I do not think this is answering my question. Or maybe I just > > misunderstand. Let me try again. Say you have a memcg under hard limit > > pressure so any further charge is going to fail. How can you reasonably > > implement zram back swapout if the memory is charged? > > > > Sorry, let me try to explain again. I have a memcg under hard limit pressure. > Any further charge will try to free memory and swapout to zram back which > is compressed and stored data in memory.so any further charge is not going > to fail. The charged memory is swapout to compressed memory step by > step, but the compressed memory is not charged to the original memcgroup. > So, Actual memory usage is already greater than the hard limit in some cases. > This pachset will charge the compressed memory to the original memcg, > limited by memory.max This is not really answering my question though. memcg under hard limit is not really my concern. This is a simpler case. I am not saying it doesn't need to get addressed but it is the memcg hard limited case that is much more interesting. Because your charges are going to fail very likely and that would mean that swapout would fail AFAIU. If my understanding is wrong then it would really help to describe that case much more in the changelog.
> I suspect for the swap->zram we should always get charged pages, because > we're effectively writing out charged anon/shmem pages only -- without > any buffer in between. Hi David, the charged memory will be released in swap->zram. New pages are allocated by alloc_zspage(), and we did not charge the page directly, but the objects (like slab does), because the zspage is shared by any memcg. > > For the fs->zram or direct zram access device case I'm not so sure. It > highly depends on what gets mapped into the bio (e.g., a kernel buffer, > zeropage, ...). If it's a pagecache page, that should be charged and > we're good. No so sure about fs metadata or some other fs cases (e.g., > write() to a file that bypass the pagecache). > Yes, the pagecaches are charged in fs->zram, but will be released if we drop the cache. The compressed objects are not charged. > -- > Cheers, > > David / dhildenb >
> > OK, also make sure that the zsmalloc support is implemented before zram > depends on it. > OK. I got it. > > This is not really answering my question though. memcg under hard limit > is not really my concern. This is a simpler case. I am not saying it > doesn't need to get addresses but it is the memcg hard limited case that > is much more interested. Because your charges are going to fail very > likely and that would mean that swapout would fail AFAIU. If my > understanding is wrong then it would really help to describe that case > much more in the changelog. > OK, got it. In the many cases I have tested, it will not fail because we did not charge the page directly, but the objects (like slab, compressed page), for the zspage may be shared by any memcg. > -- > Michal Hocko > SUSE Labs
On Thu 15-06-23 22:13:09, 贺中坤 wrote: > > This is not really answering my question though. memcg under hard limit > > is not really my concern. This is a simpler case. I am not saying it > > doesn't need to get addresses but it is the memcg hard limited case that > > is much more interested. Because your charges are going to fail very > > likely and that would mean that swapout would fail AFAIU. If my > > understanding is wrong then it would really help to describe that case > > much more in the changelog. > > > > OK, Got it. In many cases I have tested, it will not fail because we did > not charge the page directly,but the objects(like slab,compressed page), > for the zspage may be shared by any memcg. That sounds like a broken design to me.
On 15.06.23 15:40, 贺中坤 wrote: >> I suspect for the swap->zram we should always get charged pages, because >> we're effectively writing out charged anon/shmem pages only -- without >> any buffer in between. > > Hi David,the charged memory will be released in swap->zram. New pages > are allocated by alloc_zspage(), and we did not charge the page directly,but > the objects(like slab), because the zspage are shared by any memcg. > >> >> For the fs->zram or direct zram access device case I'm not so sure. It >> highly depends on what gets mapped into the bio (e.g., a kernel buffer, >> zeropage, ...). If it's a pagecache page, that should be charged and >> we're good. No so sure about fs metadata or some other fs cases (e.g., >> write() to a file that bypass the pagecache). >> > > Yes, the pagecaches are charged in fs->zram, but will be released if > we drop the cache. the compressed objects are not charged. Yes. But just to stress again, one issue I see is that if there is a page in the BIO that is not charged, you cannot charge the compressed page. Assume you have some FS on that zram block device and you want to make sure it gets properly charged to whoever is reading/writing a file on that filesystem. (imagine something like a compress shmem) If a user (or the filesystem?) can trigger a BIO that has an uncharged page in it, it would not get charged accordingly. The "easy" reproducer would have been O_DIRECT write() using the shared zeropage, but zram_write_page() is smart enough to optimize for that case (page_same_filled()). :) Maybe I'm over-thinking this (well, the we do have partial I/O support, so something seems to be able to trigger such cases), and it would be great if someone with more FS->BIO experience could comment. I'll note that this is fundamentally different to zswap, because with zswap you don't get arbitrary BIOs, you get an anon or shmem page (that should be charged).
On Thu, Jun 15, 2023 at 1:57 AM Fabian Deutsch <fdeutsch@redhat.com> wrote: > > > On Thu, Jun 15, 2023 at 6:59 AM Yu Zhao <yuzhao@google.com> wrote: >> >> On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He >> <hezhongkun.hzk@bytedance.com> wrote: >> > >> > The compressed RAM is currently charged to kernel, not to >> > any memory cgroup, which is not satisfy our usage scenario. >> > if the memory of a task is limited by memcgroup, it will >> > swap out the memory to zram swap device when the memory >> > is insufficient. In that case, the memory limit will have >> > no effect. >> > >> > So, it should makes sense to charge the compressed RAM to >> > the page's memory cgroup. > > > While looking at this in the past weeks, I believe that there are two distinct problems: > 1. Direct zram usage by process within a cg ie. a process writing to a zram device > 2. Indirect zram usage by a process within a cg via swap (described above) > > Both of them probably require different solutions. > In order to fix #1, accounting a zram device should be accounted towards a cgroup. IMHO this is something that should be fixed. > > Yu Zhao and Yosry are probably much more familiar with the solution to #2. > WRT per-cgrou-swapfile, to me this is addressing #2, but I agree with Yu Zhao, that there are probably better solutions to this. Thanks Fabian for tagging me. I am not familiar with #1, so I will speak to #2. Zhongkun, There are a few parts that I do not understand -- hopefully you can help me out here: (1) If I understand correctly in this patch we set the active memcg trying to charge any pages allocated in a zspage to the current memcg, yet that zspage will contain multiple compressed object slots, not just the one used by this memcg. Aren't we overcharging the memcg? Basically the first memcg that happens to allocate the zspage will pay for all the objects in this zspage, even after it stops using the zspage completely? (2) Patch 3 seems to be charging the compressed slots to the memcgs, yet this patch is trying to charge the entire zspage. Aren't we double charging the zspage? I am guessing this isn't happening because (as Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so this patch may be NOP, and the actual charging is coming from patch 3 only. (3) Zswap recently implemented per-memcg charging of compressed objects in a much simpler way. If your main interest is #2 (which is what I understand from the commit log), it seems like zswap might be providing this already? Why can't you use zswap? Is it the fact that zswap requires a backing swapfile? Thanks! > > Lastly, this patchset, while it will possibly not address the swap issue (#2) completely, is it satisfying the needs of #1? > > - fabian > >> >> We used to do this a long time ago, but we had per-memcg swapfiles [1[ >> to prevent compressed pages from different memcgs from sharing the >> same zspage. >> >> Does this patchset alone suffer from the same problem, i.e., memcgs >> sharing zspages? >> >> [1] https://lwn.net/Articles/592923/ >>
On Thu, Jun 15, 2023 at 10:21 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 15-06-23 22:13:09, 贺中坤 wrote: > > > This is not really answering my question though. memcg under hard limit > > > is not really my concern. This is a simpler case. I am not saying it > > > doesn't need to get addresses but it is the memcg hard limited case that > > > is much more interested. Because your charges are going to fail very > > > likely and that would mean that swapout would fail AFAIU. If my > > > understanding is wrong then it would really help to describe that case > > > much more in the changelog. > > > > > > > OK, Got it. In many cases I have tested, it will not fail because we did > > not charge the page directly,but the objects(like slab,compressed page), > > for the zspage may be shared by any memcg. > > That sounds like a broken design to me. > -- > Michal Hocko > SUSE Labs I will try more cases in different compression ratios and make sure that swapout will not fail.
> > Yes. But just to stress again, one issue I see is that if there is a > page in the BIO that is not charged, you cannot charge the compressed page. I got it. Thanks. > > Assume you have some FS on that zram block device and you want to make > sure it gets properly charged to whoever is reading/writing a file on > that filesystem. (imagine something like a compress shmem) > > If a user (or the filesystem?) can trigger a BIO that has an uncharged > page in it, it would not get charged accordingly. > > The "easy" reproducer would have been O_DIRECT write() using the shared > zeropage, but zram_write_page() is smart enough to optimize for that > case (page_same_filled()). :) Ok, I will try it. > > Maybe I'm over-thinking this (well, the we do have partial I/O support, > so something seems to be able to trigger such cases), and it would be > great if someone with more FS->BIO experience could comment. > > I'll note that this is fundamentally different to zswap, because with > zswap you don't get arbitrary BIOs, you get an anon or shmem page (that > should be charged). > Hi David, I know your concern and I will try to find the uncharged case. > -- > Cheers, > > David / dhildenb >
> Thanks Fabian for tagging me. > > I am not familiar with #1, so I will speak to #2. Zhongkun, There are > a few parts that I do not understand -- hopefully you can help me out > here: > > (1) If I understand correctly in this patch we set the active memcg > trying to charge any pages allocated in a zspage to the current memcg, > yet that zspage will contain multiple compressed object slots, not > just the one used by this memcg. Aren't we overcharging the memcg? > Basically the first memcg that happens to allocate the zspage will pay > for all the objects in this zspage, even after it stops using the > zspage completely? It will not overcharge. As you said below, we are not using __GFP_ACCOUNT here; the compressed slots are charged to the memcgs instead. > > (2) Patch 3 seems to be charging the compressed slots to the memcgs, > yet this patch is trying to charge the entire zspage. Aren't we double > charging the zspage? I am guessing this isn't happening because (as > Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so > this patch may be NOP, and the actual charging is coming from patch 3 > only. Yes, the actual charging is coming from patch 3. This patch just delivers the BIO page's memcg to the current task, which is not the consumer. > > (3) Zswap recently implemented per-memcg charging of compressed > objects in a much simpler way. If your main interest is #2 (which is > what I understand from the commit log), it seems like zswap might be > providing this already? Why can't you use zswap? Is it the fact that > zswap requires a backing swapfile? Thanks for your reply and review. Yes, zswap requires a backing swapfile. The I/O path is very complex and sometimes it will throttle the whole system if some resources are short, so we hope to use zram. > > Thanks! >
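[Patch 3 itself is not shown in this thread. Purely to illustrate what per-slot charging via the obj_cgroup API (the mechanism slab and zswap accounting use) could look like -- the wrapper name is invented, and this is not necessarily how patch 3 implements it.]

#include <linux/err.h>
#include <linux/memcontrol.h>
#include <linux/zsmalloc.h>

/*
 * Illustrative wrapper, not the actual patch 3: charge a compressed
 * object to the current/active memcg before storing it, and roll the
 * charge back if the allocation fails. A real implementation would also
 * have to remember objcg alongside the handle so the charge can be
 * undone on zs_free(), similar to what zswap does per entry.
 */
static unsigned long zs_malloc_charged(struct zs_pool *pool, size_t size, gfp_t gfp)
{
	struct obj_cgroup *objcg = get_obj_cgroup_from_current();
	unsigned long handle;

	if (objcg && obj_cgroup_charge(objcg, gfp, size)) {
		obj_cgroup_put(objcg);
		return (unsigned long)ERR_PTR(-ENOMEM);
	}

	handle = zs_malloc(pool, size, gfp);
	if (IS_ERR_VALUE(handle) && objcg)
		obj_cgroup_uncharge(objcg, size);

	if (objcg)
		obj_cgroup_put(objcg);
	return handle;
}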
On Fri 16-06-23 11:31:18, 贺中坤 wrote: > On Thu, Jun 15, 2023 at 10:21 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 15-06-23 22:13:09, 贺中坤 wrote: > > > > This is not really answering my question though. memcg under hard limit > > > > is not really my concern. This is a simpler case. I am not saying it > > > > doesn't need to get addresses but it is the memcg hard limited case that > > > > is much more interested. Because your charges are going to fail very > > > > likely and that would mean that swapout would fail AFAIU. If my > > > > understanding is wrong then it would really help to describe that case > > > > much more in the changelog. > > > > > > > > > > OK, Got it. In many cases I have tested, it will not fail because we did > > > not charge the page directly,but the objects(like slab,compressed page), > > > for the zspage may be shared by any memcg. > > > > That sounds like a broken design to me. > > -- > > Michal Hocko > > SUSE Labs > > I will try more cases in different compression ratios and make sure > that swapout will not fail. I do not think different compression methods will cut it. You essentially need some form of memory reserves - in the memcg world a pre-charged pool of memory readily available for the swapout. Another way would be to allow the charge to bypass the limit with a guarantee that this will be a temporary breach.
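[To make the "bypass the limit temporarily" idea concrete: a conceptual sketch only, not an existing zram/zsmalloc interface, and whether forcing the charge like this is acceptable is exactly the open design question. The sketch leans on the fact that memcg charging treats __GFP_NOFAIL as "force the charge even above the limit"; the function name is invented.]

#include <linux/memcontrol.h>

/*
 * Conceptual sketch: charge the compressed object to its owning memcg,
 * but never let a failed charge break the swap-out path itself. The
 * second attempt forces the charge over the hard limit (a temporary
 * breach) instead of failing zs_malloc() and thus the swapout.
 */
static void charge_compressed_slot(struct obj_cgroup *objcg, size_t size)
{
	if (!objcg)
		return;

	/* GFP_NOIO: no fs/io reclaim recursion from the swap-out path. */
	if (!obj_cgroup_charge(objcg, GFP_NOIO, size))
		return;

	/* Hard limit hit: breach it temporarily rather than fail swap-out. */
	obj_cgroup_charge(objcg, GFP_NOIO | __GFP_NOFAIL, size);
}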
On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > > > Thanks Fabian for tagging me. > > > > I am not familiar with #1, so I will speak to #2. Zhongkun, There are > > a few parts that I do not understand -- hopefully you can help me out > > here: > > > > (1) If I understand correctly in this patch we set the active memcg > > trying to charge any pages allocated in a zspage to the current memcg, > > yet that zspage will contain multiple compressed object slots, not > > just the one used by this memcg. Aren't we overcharging the memcg? > > Basically the first memcg that happens to allocate the zspage will pay > > for all the objects in this zspage, even after it stops using the > > zspage completely? > > It will not overcharge. As you said below, we are not using > __GFP_ACCOUNT and charging the compressed slots to the memcgs. > > > > > (2) Patch 3 seems to be charging the compressed slots to the memcgs, > > yet this patch is trying to charge the entire zspage. Aren't we double > > charging the zspage? I am guessing this isn't happening because (as > > Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so > > this patch may be NOP, and the actual charging is coming from patch 3 > > only. > > YES, the actual charging is coming from patch 3. This patch just > delivers the BIO page's memcg to the current task which is not the > consumer. > > > > > (3) Zswap recently implemented per-memcg charging of compressed > > objects in a much simpler way. If your main interest is #2 (which is > > what I understand from the commit log), it seems like zswap might be > > providing this already? Why can't you use zswap? Is it the fact that > > zswap requires a backing swapfile? > > Thanks for your reply and review. Yes, the zswap requires a backing > swapfile. The I/O path is very complex, sometimes it will throttle the > whole system if some resources are short , so we hope to use zram. Is the only problem with zswap for you the requirement of a backing swapfile? If yes, I am in the early stages of developing a solution to make zswap work without a backing swapfile. This was discussed in LSF/MM [1]. Would this make zswap usable in for your use case? [1]https://lore.kernel.org/linux-mm/CAJD7tkYb_sGN8mfGVjr2JxdB8Pz8Td=yj9_sBCMrmsKQo56vTg@mail.gmail.com/ > > > > > Thanks! > >
On 16.06.23 09:37, Yosry Ahmed wrote: > On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: >> >>> Thanks Fabian for tagging me. >>> >>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are >>> a few parts that I do not understand -- hopefully you can help me out >>> here: >>> >>> (1) If I understand correctly in this patch we set the active memcg >>> trying to charge any pages allocated in a zspage to the current memcg, >>> yet that zspage will contain multiple compressed object slots, not >>> just the one used by this memcg. Aren't we overcharging the memcg? >>> Basically the first memcg that happens to allocate the zspage will pay >>> for all the objects in this zspage, even after it stops using the >>> zspage completely? >> >> It will not overcharge. As you said below, we are not using >> __GFP_ACCOUNT and charging the compressed slots to the memcgs. >> >>> >>> (2) Patch 3 seems to be charging the compressed slots to the memcgs, >>> yet this patch is trying to charge the entire zspage. Aren't we double >>> charging the zspage? I am guessing this isn't happening because (as >>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so >>> this patch may be NOP, and the actual charging is coming from patch 3 >>> only. >> >> YES, the actual charging is coming from patch 3. This patch just >> delivers the BIO page's memcg to the current task which is not the >> consumer. >> >>> >>> (3) Zswap recently implemented per-memcg charging of compressed >>> objects in a much simpler way. If your main interest is #2 (which is >>> what I understand from the commit log), it seems like zswap might be >>> providing this already? Why can't you use zswap? Is it the fact that >>> zswap requires a backing swapfile? >> >> Thanks for your reply and review. Yes, the zswap requires a backing >> swapfile. The I/O path is very complex, sometimes it will throttle the >> whole system if some resources are short , so we hope to use zram. > > Is the only problem with zswap for you the requirement of a backing swapfile? > > If yes, I am in the early stages of developing a solution to make > zswap work without a backing swapfile. This was discussed in LSF/MM > [1]. Would this make zswap usable in for your use case? Out of curiosity, are there any other known pros/cons when using zswap-without-swap instead of zram? I know that zram requires sizing (size of the virtual block device) and consumes metadata, zswap doesn't.
On Fri, Jun 16, 2023 at 12:57 AM David Hildenbrand <david@redhat.com> wrote: > > On 16.06.23 09:37, Yosry Ahmed wrote: > > On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > >> > >>> Thanks Fabian for tagging me. > >>> > >>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are > >>> a few parts that I do not understand -- hopefully you can help me out > >>> here: > >>> > >>> (1) If I understand correctly in this patch we set the active memcg > >>> trying to charge any pages allocated in a zspage to the current memcg, > >>> yet that zspage will contain multiple compressed object slots, not > >>> just the one used by this memcg. Aren't we overcharging the memcg? > >>> Basically the first memcg that happens to allocate the zspage will pay > >>> for all the objects in this zspage, even after it stops using the > >>> zspage completely? > >> > >> It will not overcharge. As you said below, we are not using > >> __GFP_ACCOUNT and charging the compressed slots to the memcgs. > >> > >>> > >>> (2) Patch 3 seems to be charging the compressed slots to the memcgs, > >>> yet this patch is trying to charge the entire zspage. Aren't we double > >>> charging the zspage? I am guessing this isn't happening because (as > >>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so > >>> this patch may be NOP, and the actual charging is coming from patch 3 > >>> only. > >> > >> YES, the actual charging is coming from patch 3. This patch just > >> delivers the BIO page's memcg to the current task which is not the > >> consumer. > >> > >>> > >>> (3) Zswap recently implemented per-memcg charging of compressed > >>> objects in a much simpler way. If your main interest is #2 (which is > >>> what I understand from the commit log), it seems like zswap might be > >>> providing this already? Why can't you use zswap? Is it the fact that > >>> zswap requires a backing swapfile? > >> > >> Thanks for your reply and review. Yes, the zswap requires a backing > >> swapfile. The I/O path is very complex, sometimes it will throttle the > >> whole system if some resources are short , so we hope to use zram. > > > > Is the only problem with zswap for you the requirement of a backing swapfile? > > > > If yes, I am in the early stages of developing a solution to make > > zswap work without a backing swapfile. This was discussed in LSF/MM > > [1]. Would this make zswap usable in for your use case? > > Out of curiosity, are there any other known pros/cons when using > zswap-without-swap instead of zram? > > I know that zram requires sizing (size of the virtual block device) and > consumes metadata, zswap doesn't. We don't use zram in our data centers so I am not an expert about zram, but off the top of my head there are a few more advantages to zswap: (1) Better memcg support (which this series is attempting to address in zram, although in a much more complicated way). (2) We internally have incompressible memory handling on top of zswap, which is something that we would like to upstream when zswap-without-swap is supported. Basically if a page does not compress well enough to save memory we reject it from zswap and make it unevictable (if there is no backing swapfile). The existence of zswap in the MM layer helps with this. Since zram is a block device from the MM perspective, it's more difficult to do something like this. Incompressible pages just sit in zram AFAICT. (3) Writeback support. 
If you're running out of memory to store compressed pages you can add a swapfile in runtime and zswap will start writing to it freeing up space to compress more pages. This wouldn't be possible in the same way in zram. Zram supports writing to a backing device but in a more manual way (userspace has to write to an interface to tell zram to write some pages). The disadvantage is that zswap doesn't currently support being used without a swapfile, but once this support exists, I am not sure what other disadvantages zswap would have compared to zram. > > -- > Cheers, > > David / dhildenb >
On 16.06.23 10:04, Yosry Ahmed wrote: > On Fri, Jun 16, 2023 at 12:57 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 16.06.23 09:37, Yosry Ahmed wrote: >>> On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: >>>> >>>>> Thanks Fabian for tagging me. >>>>> >>>>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are >>>>> a few parts that I do not understand -- hopefully you can help me out >>>>> here: >>>>> >>>>> (1) If I understand correctly in this patch we set the active memcg >>>>> trying to charge any pages allocated in a zspage to the current memcg, >>>>> yet that zspage will contain multiple compressed object slots, not >>>>> just the one used by this memcg. Aren't we overcharging the memcg? >>>>> Basically the first memcg that happens to allocate the zspage will pay >>>>> for all the objects in this zspage, even after it stops using the >>>>> zspage completely? >>>> >>>> It will not overcharge. As you said below, we are not using >>>> __GFP_ACCOUNT and charging the compressed slots to the memcgs. >>>> >>>>> >>>>> (2) Patch 3 seems to be charging the compressed slots to the memcgs, >>>>> yet this patch is trying to charge the entire zspage. Aren't we double >>>>> charging the zspage? I am guessing this isn't happening because (as >>>>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so >>>>> this patch may be NOP, and the actual charging is coming from patch 3 >>>>> only. >>>> >>>> YES, the actual charging is coming from patch 3. This patch just >>>> delivers the BIO page's memcg to the current task which is not the >>>> consumer. >>>> >>>>> >>>>> (3) Zswap recently implemented per-memcg charging of compressed >>>>> objects in a much simpler way. If your main interest is #2 (which is >>>>> what I understand from the commit log), it seems like zswap might be >>>>> providing this already? Why can't you use zswap? Is it the fact that >>>>> zswap requires a backing swapfile? >>>> >>>> Thanks for your reply and review. Yes, the zswap requires a backing >>>> swapfile. The I/O path is very complex, sometimes it will throttle the >>>> whole system if some resources are short , so we hope to use zram. >>> >>> Is the only problem with zswap for you the requirement of a backing swapfile? >>> >>> If yes, I am in the early stages of developing a solution to make >>> zswap work without a backing swapfile. This was discussed in LSF/MM >>> [1]. Would this make zswap usable in for your use case? >> >> Out of curiosity, are there any other known pros/cons when using >> zswap-without-swap instead of zram? >> >> I know that zram requires sizing (size of the virtual block device) and >> consumes metadata, zswap doesn't. > > We don't use zram in our data centers so I am not an expert about > zram, but off the top of my head there are a few more advantages to > zswap: Thanks! > (1) Better memcg support (which this series is attempting to address > in zram, although in a much more complicated way). Right. I think this patch also misses to update apply the charging in the recompress case. (only triggered by user space IIUC) > > (2) We internally have incompressible memory handling on top of zswap, > which is something that we would like to upstream when > zswap-without-swap is supported. Basically if a page does not compress > well enough to save memory we reject it from zswap and make it > unevictable (if there is no backing swapfile). The existence of zswap > in the MM layer helps with this. 
Since zram is a block device from the > MM perspective, it's more difficult to do something like this. > Incompressible pages just sit in zram AFAICT. I see. With ZRAM_HUGE we still have to store the uncompressed page (because, it's a block device and has to hold that data). > > (3) Writeback support. If you're running out of memory to store > compressed pages you can add a swapfile in runtime and zswap will > start writing to it freeing up space to compress more pages. This > wouldn't be possible in the same way in zram. Zram supports writing to > a backing device but in a more manual way (userspace has to write to > an interface to tell zram to write some pages). Right, that zram backing device stuff is really sub-optimal and only useful in corner cases (most probably not datacenters). What one can do with zram is to add a second swap device with lower priority. Looking at my Fedora machine: $ cat /proc/swaps Filename Type Size Used Priority /dev/dm-2 partition 16588796 0 -2 /dev/zram0 partition 8388604 0 100 Guess the difference here is that you won't be writing out the compressed data to the disk, but anything the gets swapped out afterwards will end up on the disk. I can see how the zswap behavior might be better in that case (instead of swapping out some additional pages you relocate the already-swapped-out-to-zswap pages to the disk).
On Fri, Jun 16, 2023 at 1:37 AM David Hildenbrand <david@redhat.com> wrote: > > On 16.06.23 10:04, Yosry Ahmed wrote: > > On Fri, Jun 16, 2023 at 12:57 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 16.06.23 09:37, Yosry Ahmed wrote: > >>> On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote: > >>>> > >>>>> Thanks Fabian for tagging me. > >>>>> > >>>>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are > >>>>> a few parts that I do not understand -- hopefully you can help me out > >>>>> here: > >>>>> > >>>>> (1) If I understand correctly in this patch we set the active memcg > >>>>> trying to charge any pages allocated in a zspage to the current memcg, > >>>>> yet that zspage will contain multiple compressed object slots, not > >>>>> just the one used by this memcg. Aren't we overcharging the memcg? > >>>>> Basically the first memcg that happens to allocate the zspage will pay > >>>>> for all the objects in this zspage, even after it stops using the > >>>>> zspage completely? > >>>> > >>>> It will not overcharge. As you said below, we are not using > >>>> __GFP_ACCOUNT and charging the compressed slots to the memcgs. > >>>> > >>>>> > >>>>> (2) Patch 3 seems to be charging the compressed slots to the memcgs, > >>>>> yet this patch is trying to charge the entire zspage. Aren't we double > >>>>> charging the zspage? I am guessing this isn't happening because (as > >>>>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so > >>>>> this patch may be NOP, and the actual charging is coming from patch 3 > >>>>> only. > >>>> > >>>> YES, the actual charging is coming from patch 3. This patch just > >>>> delivers the BIO page's memcg to the current task which is not the > >>>> consumer. > >>>> > >>>>> > >>>>> (3) Zswap recently implemented per-memcg charging of compressed > >>>>> objects in a much simpler way. If your main interest is #2 (which is > >>>>> what I understand from the commit log), it seems like zswap might be > >>>>> providing this already? Why can't you use zswap? Is it the fact that > >>>>> zswap requires a backing swapfile? > >>>> > >>>> Thanks for your reply and review. Yes, the zswap requires a backing > >>>> swapfile. The I/O path is very complex, sometimes it will throttle the > >>>> whole system if some resources are short , so we hope to use zram. > >>> > >>> Is the only problem with zswap for you the requirement of a backing swapfile? > >>> > >>> If yes, I am in the early stages of developing a solution to make > >>> zswap work without a backing swapfile. This was discussed in LSF/MM > >>> [1]. Would this make zswap usable in for your use case? > >> > >> Out of curiosity, are there any other known pros/cons when using > >> zswap-without-swap instead of zram? > >> > >> I know that zram requires sizing (size of the virtual block device) and > >> consumes metadata, zswap doesn't. > > > > We don't use zram in our data centers so I am not an expert about > > zram, but off the top of my head there are a few more advantages to > > zswap: > > Thanks! > > > (1) Better memcg support (which this series is attempting to address > > in zram, although in a much more complicated way). > > Right. I think this patch also misses to update apply the charging in the recompress > case. (only triggered by user space IIUC) > > > > > (2) We internally have incompressible memory handling on top of zswap, > > which is something that we would like to upstream when > > zswap-without-swap is supported. 
Basically if a page does not compress > > well enough to save memory we reject it from zswap and make it > > unevictable (if there is no backing swapfile). The existence of zswap > > in the MM layer helps with this. Since zram is a block device from the > > MM perspective, it's more difficult to do something like this. > > Incompressible pages just sit in zram AFAICT. > > I see. With ZRAM_HUGE we still have to store the uncompressed page > (because, it's a block device and has to hold that data). Right. > > > > > (3) Writeback support. If you're running out of memory to store > > compressed pages you can add a swapfile in runtime and zswap will > > start writing to it freeing up space to compress more pages. This > > wouldn't be possible in the same way in zram. Zram supports writing to > > a backing device but in a more manual way (userspace has to write to > > an interface to tell zram to write some pages). > > Right, that zram backing device stuff is really sub-optimal and only useful > in corner cases (most probably not datacenters). > > What one can do with zram is to add a second swap device with lower priority. > Looking at my Fedora machine: > > $ cat /proc/swaps > Filename Type Size Used Priority > /dev/dm-2 partition 16588796 0 -2 > /dev/zram0 partition 8388604 0 100 > > > Guess the difference here is that you won't be writing out the compressed > data to the disk, but anything the gets swapped out afterwards will > end up on the disk. I can see how the zswap behavior might be better in that case > (instead of swapping out some additional pages you relocate the > already-swapped-out-to-zswap pages to the disk). Yeah I am hoping we can enable the use of zswap without a backing swapfile, and I keep seeing use cases that would benefit from that. > > -- > Cheers, > > David / dhildenb >
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f6d90f1ba5cf..03b508447473 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -33,6 +33,7 @@
 #include <linux/debugfs.h>
 #include <linux/cpuhotplug.h>
 #include <linux/part_stat.h>
+#include <linux/memcontrol.h>
 
 #include "zram_drv.h"
 
@@ -1419,6 +1420,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	struct zcomp_strm *zstrm;
 	unsigned long element = 0;
 	enum zram_pageflags flags = 0;
+	struct mem_cgroup *memcg, *old_memcg;
+
+	memcg = page_memcg(page);
+	old_memcg = set_active_memcg(memcg);
 
 	mem = kmap_atomic(page);
 	if (page_same_filled(mem, &element)) {
@@ -1426,7 +1431,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		/* Free memory associated with this sector now. */
 		flags = ZRAM_SAME;
 		atomic64_inc(&zram->stats.same_pages);
-		goto out;
+		goto out_free;
 	}
 	kunmap_atomic(mem);
 
@@ -1440,7 +1445,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
 		pr_err("Compression failed! err=%d\n", ret);
 		zs_free(zram->mem_pool, handle);
-		return ret;
+		goto out;
 	}
 
 	if (comp_len >= huge_class_size)
@@ -1470,8 +1475,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		handle = zs_malloc(zram->mem_pool, comp_len,
 				GFP_NOIO | __GFP_HIGHMEM |
 				__GFP_MOVABLE);
-		if (IS_ERR_VALUE(handle))
-			return PTR_ERR((void *)handle);
+		if (IS_ERR_VALUE(handle)) {
+			ret = PTR_ERR((void *)handle);
+			goto out;
+		}
 
 		if (comp_len != PAGE_SIZE)
 			goto compress_again;
@@ -1491,7 +1498,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
 		zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
 		zs_free(zram->mem_pool, handle);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
@@ -1506,7 +1514,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
 	zs_unmap_object(zram->mem_pool, handle);
 	atomic64_add(comp_len, &zram->stats.compr_data_size);
-out:
+out_free:
 	/*
 	 * Free memory associated with this sector
 	 * before overwriting unused sectors.
@@ -1531,6 +1539,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 
 	/* Update stats */
 	atomic64_inc(&zram->stats.pages_stored);
+out:
+	set_active_memcg(old_memcg);
 	return ret;
 }
The compressed RAM is currently charged to the kernel, not to any
memory cgroup, which does not satisfy our usage scenario. If the
memory of a task is limited by a memory cgroup, the task will swap
memory out to the zram swap device when memory is insufficient. In
that case, the memory limit has no effect.

So it makes sense to charge the compressed RAM to the page's memory
cgroup.

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 drivers/block/zram/zram_drv.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)