Message ID: 20221031054108.541190-1-senozhatsky@chromium.org
Series: zsmalloc/zram: configurable zspage size
On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
> Hello,
>
> Some use-cases and/or data patterns may benefit from larger zspages.
> Currently the limit on the number of physical pages that are linked
> into a zspage is hardcoded to 4. A higher limit changes key
> characteristics of a number of the size classes, improving compactness
> of the pool and reducing the amount of memory the zsmalloc pool uses.
> More on this in the 0002 commit message.

Hi Sergey,

I think the idea of breaking away from a fixed number of subpages per
zspage is a really good starting point for further optimization.
However, I am worried about introducing a per-pool config at this
stage. How about introducing just one golden value for the zspage
size? Say, order-3 or order-4 via Kconfig, keeping the default at
order-2?

And then we can make more effort to auto-tune it on the fly, based on
the wasted memory and the number of size classes. A nice property we
can build on is that we have an indirection table (handle <-> zpage),
so we can move objects at any time; I think we can end up with a
better approach in the end.
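For readers unfamiliar with zsmalloc's clustering, here is a minimal
userspace sketch of why the page limit matters, assuming the general
approach of picking the chain length that maximizes the used fraction
of the chained pages; the function name, class sizes and exact
arithmetic are illustrative, not the kernel code or the values used by
this series.

#include <stdio.h>

#define PAGE_SIZE 4096

/*
 * Pick the chain length (number of order-0 pages linked into a zspage)
 * that maximizes the used fraction of the chain for a given class
 * size, up to max_pages. This mirrors the idea behind zsmalloc's
 * clustering; the real kernel function differs in details.
 */
static int pages_per_zspage(int class_size, int max_pages)
{
	int i, best_order = 1, best_usedpc = 0;

	for (i = 1; i <= max_pages; i++) {
		int zspage_size = i * PAGE_SIZE;
		int waste = zspage_size % class_size;
		int usedpc = (zspage_size - waste) * 100 / zspage_size;

		if (usedpc > best_usedpc) {
			best_usedpc = usedpc;
			best_order = i;
		}
	}
	return best_order;
}

int main(void)
{
	/*
	 * Example class sizes. Per this sketch's arithmetic, a
	 * 3504-byte class stays at a 1-page chain (~85% used) with a
	 * limit of 4, but gets a 6-page chain (~99% used) with a
	 * limit of 8.
	 */
	int sizes[] = { 2960, 3264, 3504 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("class %4d: chain %d (limit 4) vs %d (limit 8)\n",
		       sizes[i], pages_per_zspage(sizes[i], 4),
		       pages_per_zspage(sizes[i], 8));
	return 0;
}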
Hi,

On (22/11/10 14:44), Minchan Kim wrote:
> On Mon, Oct 31, 2022 at 02:40:59PM +0900, Sergey Senozhatsky wrote:
[..]
> I think the idea of breaking away from a fixed number of subpages per
> zspage is a really good starting point for further optimization.
> However, I am worried about introducing a per-pool config at this
> stage. How about introducing just one golden value for the zspage
> size? Say, order-3 or order-4 via Kconfig, keeping the default at
> order-2?

Sorry, not sure I'm following. So you want a .config value for the
zspage limit? I really like the sysfs knob, because then one may set
values on a per-device basis (if they have multiple zram devices in a
system with different data patterns):

zram0, which is used as a swap device, uses, say, 4
zram1, which is a vfat block device, uses, say, 6
zram2, which is an ext4 block device, uses, say, 8

The whole point of the series is that one single value does not fit
all purposes. There is no silver bullet.

> And then we can make more effort to auto-tune it on the fly, based on
> the wasted memory and the number of size classes. A nice property we
> can build on is that we have an indirection table (handle <-> zpage),
> so we can move objects at any time; I think we can end up with a
> better approach in the end.

It still needs to be per zram device (per zspool). The sysfs knob
doesn't stop us from having auto-tuned values in the future.
On Fri, Nov 11, 2022 at 09:56:36AM +0900, Sergey Senozhatsky wrote:
> Hi,
>
> On (22/11/10 14:44), Minchan Kim wrote:
[..]
> > However, I am worried about introducing a per-pool config at this
> > stage. How about introducing just one golden value for the zspage
> > size? Say, order-3 or order-4 via Kconfig, keeping the default at
> > order-2?
>
> Sorry, not sure I'm following. So you want a .config value for the
> zspage limit? I really like the sysfs knob, because then one may set
> values on a per-device basis (if they have multiple zram devices in a
> system with different data patterns):

Yes, I wanted to have just a global policy that drives zsmalloc in a
smarter way without requiring a big effort from the user to decide on
the right tuning value (I think that decision process would be quite
painful for a normal user who doesn't have enough resources), since
zsmalloc's design makes that possible. But as an interim solution,
until we prove there is no regression, we could provide just a config
option and then remove it later once we add aggressive zspage
compaction (if necessary, please see below), since a config option is
easier to deprecate than a sysfs knob.

> zram0, which is used as a swap device, uses, say, 4
> zram1, which is a vfat block device, uses, say, 6
> zram2, which is an ext4 block device, uses, say, 8
>
> The whole point of the series is that one single value does not fit
> all purposes. There is no silver bullet.

I understand what you want to achieve with a per-pool config that
exposes the knob to the user, but my worry is still how a user could
decide on the best fit, since workloads are so dynamic. Some groups
have enough resources to run fleet experiments while many others
don't, so if we really need the per-pool config step, I'd at least
like to provide a default guide to the user in the documentation,
along with the tunable knobs for experimenting. Maybe we can suggest 4
for the swap case and 8 for the fs case.

I don't object to the sysfs knobs for those use cases, but can't we
deal with the issue in a better way?

In general, the bigger pages_per_zspage, the more memory we save. It
would be the same with slab_order in the slab allocator, but slab has
a limit due to the cost of high-order allocations and the internal
fragmentation of bigger-order slabs. zsmalloc is different in that it
doesn't expose memory addresses directly and it knows when an object
is accessed by the user. And it doesn't need high-order allocations,
either. That's how zsmalloc could support object migration and page
migration. With those features, theoretically, zsmalloc doesn't need a
limit on pages_per_zspage, so I am looking forward to seeing zsmalloc
handle the memory fragmentation problem in a better way.

My only concern with a bigger pages_per_zspage (e.g., 8 or 16) is
exhausting memory when zram is used for swap. That use case aims to
help under memory pressure, but in the worst case, the bigger
pages_per_zspage is, the more chance there is of running out of
memory.

However, we could bound the worst-case memory consumption to

    for class in classes:
        wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size

with *aggressive zspage compaction*. Right now we rely on the shrinker
(it might already be enough) to trigger it, but we could change the
policy so that compaction runs once the wasted memory in a size class
crosses a threshold we define, for the zram fs use case, since that
one would be used without memory pressure.

What do you think about it?
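A minimal standalone-C rendering of the bound sketched above; the
struct fields and helper are invented for illustration and are not the
real zsmalloc data structures.

#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Illustrative only: field names are made up, not real zsmalloc ones. */
struct size_class_info {
	size_t object_size;       /* size served by this class */
	size_t pages_per_zspage;  /* chain length chosen for this class */
};

/*
 * Worst case per class: a whole zspage chain is pinned in memory by a
 * single live object, so everything but that one object is waste.
 */
static size_t worst_case_wasted_bytes(const struct size_class_info *classes,
				      size_t nr_classes)
{
	size_t wasted_bytes = 0;

	for (size_t i = 0; i < nr_classes; i++)
		wasted_bytes += classes[i].pages_per_zspage * PAGE_SIZE -
				classes[i].object_size;

	return wasted_bytes;
}

int main(void)
{
	/* Two toy classes: a 4-page chain and an 8-page chain. */
	const struct size_class_info classes[] = {
		{ .object_size = 3264, .pages_per_zspage = 4 },
		{ .object_size = 2960, .pages_per_zspage = 8 },
	};

	printf("worst-case waste: %zu bytes\n",
	       worst_case_wasted_bytes(classes, 2));
	return 0;
}

The bound grows linearly with pages_per_zspage for every class, which
is why longer chains lean harder on compaction actually being able to
release sparsely used chains.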
Hi Minchan,

On (22/11/11 09:03), Minchan Kim wrote:
> > Sorry, not sure I'm following. So you want a .config value for the
> > zspage limit? I really like the sysfs knob, because then one may set
> > values on a per-device basis (if they have multiple zram devices in a
> > system with different data patterns):
>
> Yes, I wanted to have just a global policy that drives zsmalloc in a
> smarter way without requiring a big effort from the user to decide on
> the right tuning value (I think that decision process would be quite
> painful for a normal user who doesn't have enough resources), since
> zsmalloc's design makes that possible. But as an interim solution,
> until we prove there is no regression, we could provide just a config
> option and then remove it later once we add aggressive zspage
> compaction (if necessary, please see below), since a config option is
> easier to deprecate than a sysfs knob.

[..]

> I understand what you want to achieve with a per-pool config that
> exposes the knob to the user, but my worry is still how a user could
> decide on the best fit, since workloads are so dynamic. Some groups
> have enough resources to run fleet experiments while many others
> don't, so if we really need the per-pool config step, I'd at least
> like to provide a default guide to the user in the documentation,
> along with the tunable knobs for experimenting. Maybe we can suggest 4
> for the swap case and 8 for the fs case.
>
> I don't object to the sysfs knobs for those use cases, but can't we
> deal with the issue in a better way?

[..]

> with *aggressive zspage compaction*. Right now we rely on the shrinker
> (it might already be enough) to trigger it, but we could change the
> policy so that compaction runs once the wasted memory in a size class
> crosses a threshold we define, for the zram fs use case, since that
> one would be used without memory pressure.
>
> What do you think about it?

This is tricky. I didn't want us to come up with any sort of policy
based on assumptions. For instance, we know that SUSE uses zram with a
fs under severe memory pressure (so severe that they immediately
noticed when we removed the zsmalloc handle allocation slow path and
reported a regression), so the assumption that the fs zram use-case is
not memory sensitive does not always hold.

There are too many variables. We have different data patterns, yes,
but even the same data patterns have different characteristics when
compressed with different algorithms; then we also have different host
states (memory pressure, etc.) and so on.

I think that it'll be safer for us to execute it the other way. We can
(that's what I was going to do) reach out to people (Android, SUSE,
Meta, ChromeOS, Google cloud, WebOS, Tizen) and ask them to run
experiments (try out various numbers). Then (several months later) we
can take a look at the data - what numbers work for which workloads -
and then we can introduce/change policies based on evidence and real
use cases.

Who knows, maybe a zspage_chain_size of 6 can be the new default, and
then we can add a .config policy, maybe 7 or 8. Or maybe we won't find
a single number that works equally well for everyone (even in similar
use cases). This is where the sysfs knob is very useful. Unlike
.config, which has no flexibility, especially when your entire fleet
uses the same .config for all builds, a sysfs knob lets people run
numerous A/B tests simultaneously (not to mention that some setups
have many zram devices, which can have different zspage_chain_size-s).
And we don't even need to deprecate it if we introduce a generic one
like allocator_tunables, which will support `key=val` tuples. Then we
can just deprecate a specific `key`.
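As a rough illustration of the `key=val` idea, here is a hypothetical
userspace sketch of tuple parsing. The allocator_tunables name comes
from the message above; the keys (zspage_chain_size,
compaction_threshold) are assumptions of mine, not an existing
zram/zsmalloc interface.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct zs_tunables {
	unsigned long zspage_chain_size;     /* hypothetical key */
	unsigned long compaction_threshold;  /* hypothetical key */
};

/* Parse a comma-separated list of key=val tuples into zs_tunables. */
static int parse_allocator_tunables(char *buf, struct zs_tunables *t)
{
	char *tuple;

	for (tuple = strtok(buf, ","); tuple; tuple = strtok(NULL, ",")) {
		char *eq = strchr(tuple, '=');

		if (!eq)
			return -1;
		*eq = '\0';

		if (!strcmp(tuple, "zspage_chain_size"))
			t->zspage_chain_size = strtoul(eq + 1, NULL, 10);
		else if (!strcmp(tuple, "compaction_threshold"))
			t->compaction_threshold = strtoul(eq + 1, NULL, 10);
		else
			return -1;	/* unknown key */
	}
	return 0;
}

int main(void)
{
	char input[] = "zspage_chain_size=6,compaction_threshold=1048576";
	struct zs_tunables t = { 0 };

	if (!parse_allocator_tunables(input, &t))
		printf("chain=%lu threshold=%lu\n",
		       t.zspage_chain_size, t.compaction_threshold);
	return 0;
}

Deprecating a specific key then amounts to ignoring it in the parser
while keeping the attribute itself stable.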
On (22/11/11 09:03), Minchan Kim wrote:
[..]
> My only concern with a bigger pages_per_zspage (e.g., 8 or 16) is
> exhausting memory when zram is used for swap. That use case aims to
> help under memory pressure, but in the worst case, the bigger
> pages_per_zspage is, the more chance there is of running out of
> memory.

It's hard to speak in concrete terms here. What locally may look like
a less optimal configuration can result in a more optimal
configuration globally. Yes, some zspage chains get longer, but in
return we get very different clustering and zspool
performance/configuration.

Example, a synthetic test on my host.

zspage_chain_size 4
-------------------

zsmalloc classes

 class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
 ...
 Total                 13            51         413836    412973      159955                           3

zram mm_stat

 1691783168 628083717 655175680        0 655175680       60        0    34048    34049

zspage_chain_size 8
-------------------

zsmalloc classes

 class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
 ...
 Total                 18            87         414852    412978      156666                           0

zram mm_stat

 1691803648 627793930 641703936        0 641703936       60        0    33591    33591

Note that we have a lower "pages_used" value for the same amount of
stored data: down to 156666 from 159955 pages. So it *could be* that
longer zspage chains are beneficial even in memory-sensitive cases,
but we need more data on this, so that we can speak "statistically".
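For perspective (my arithmetic, assuming 4 KiB pages and using only
the numbers above): 159955 - 156666 = 3289 pages, i.e.
3289 * 4096 bytes ~= 12.8 MiB, or roughly 2% less pool memory for the
same amount of stored data.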
On (22/11/11 09:03), Minchan Kim wrote:
[..]
>     for class in classes:
>         wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
>
> with *aggressive zspage compaction*. Right now we rely on the shrinker
> (it might already be enough) to trigger it, but we could change the
> policy so that compaction runs once the wasted memory in a size class
> crosses a threshold

That threshold can be another tunable in the zramX/allocator_tunables
sysfs knob and struct zs_tunables. But overall it sounds like a bigger
project for some time next year.

We already have the zs_compact() sysfs knob, so user-space can invoke
it as often as it wants to (not aware if anyone does, btw); maybe the
new compaction should be something slightly different. I don't have
any ideas yet. One way or the other, it still can use the same sysfs
knob :)
On (22/11/11 09:03), Minchan Kim wrote:
[..]
>     for class in classes:
>         wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
>
> with *aggressive zspage compaction*. Right now we rely on the shrinker
> (it might already be enough) to trigger it, but we could change the
> policy so that compaction runs once the wasted memory in a size class
> crosses a threshold

Compaction does something good only when we can release a zspage in
the end. Otherwise we just hold the global pool->lock (assuming that
we land the zsmalloc writeback series) and simply move objects around
zspages. So the ability to limit the zspage chain size can still be
valuable on another level, as a measure to reduce the dependency on
compaction success.

Maybe we can make compaction slightly more successful. For instance,
we could start moving objects not only within zspages of the same size
class but also, for example, to class size + X (upper size classes).
As an example: when all zspages in a class are almost full, but class
size + 1 has almost empty pages. In other words, sort of as if those
classes had been merged (a virtual merge). A single pool->lock would
be handy for it.

But this is more of a research project (intern project?), with an
unclear outcome and ETA. I think in the meantime we can let people
start experimenting with various zspage chain sizes, so that maybe at
some point we can arrive at a new "default" value for all zspools,
higher than the current 4, which has been around for many years.
Can't think, at present, of a better way forward.
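A rough thought-experiment sketch of the "virtual merge" idea above;
all structures and helpers here are invented for illustration and do
not exist in mm/zsmalloc.c.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct class_stub {
	size_t size;              /* object size served by the class */
	bool has_almost_empty;    /* an almost-empty zspage to merge into? */
};

/*
 * Find a destination class for objects of classes[src]: prefer the
 * source class itself, otherwise peek at up to max_up "upper" classes
 * (larger object sizes). Storing an object there wastes
 * (dst->size - src->size) bytes per object, but may let the source
 * zspage chain be released entirely.
 */
static const struct class_stub *pick_dst_class(const struct class_stub *classes,
					       size_t nr_classes, size_t src,
					       size_t max_up)
{
	for (size_t i = src; i < nr_classes && i <= src + max_up; i++) {
		if (classes[i].has_almost_empty)
			return &classes[i];
	}
	return NULL;	/* nothing to merge into; leave the class alone */
}

int main(void)
{
	const struct class_stub classes[] = {
		{ .size = 3072, .has_almost_empty = false },
		{ .size = 3088, .has_almost_empty = false },
		{ .size = 3104, .has_almost_empty = true  },
	};
	const struct class_stub *dst = pick_dst_class(classes, 3, 0, 2);

	if (dst)
		printf("merge class 3072 objects into class %zu\n", dst->size);
	return 0;
}

The trade-off is the obvious one: each object relocated one or more
classes up wastes the class-size delta, so such a policy would likely
need its own threshold before holding pool->lock for the moves pays
off.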
On (22/11/15 15:01), Sergey Senozhatsky wrote:
> On (22/11/11 09:03), Minchan Kim wrote:
> [..]
> >     for class in classes:
> >         wasted_bytes += class->pages_per_zspage * PAGE_SIZE - an object size
> >
> > with *aggressive zspage compaction*. Right now we rely on the shrinker
> > (it might already be enough) to trigger it, but we could change the
> > policy so that compaction runs once the wasted memory in a size class
> > crosses a threshold
>
> Compaction does something good only when we can release a zspage in
> the end. Otherwise we just hold the global pool->lock (assuming that
> we land the zsmalloc writeback series) and simply move objects around
> zspages. So the ability to limit the zspage chain size can still be
> valuable on another level, as a measure to reduce the dependency on
> compaction success.
>
> Maybe we can make compaction slightly more successful. For instance,
> we could start moving objects not only within zspages of the same size
> class but also, for example, to class size + X (upper size classes).
> As an example: when all zspages in a class are almost full, but class
> size + 1 has almost empty pages. In other words, sort of as if those
> classes had been merged (a virtual merge). A single pool->lock would
> be handy for it.

What I'm trying to say here is that the "aggressiveness of compaction"
should probably be measured not by compaction frequency, but by the
overall cost of compaction operations.

An aggressive compaction frequency doesn't help us much if the state
of the pool doesn't change significantly between compactions. E.g. if
we do 10 compaction calls, then only the first one potentially
compacts some zspages; the remaining ones don't do anything.

The cost of compaction operations is a measure of how hard compaction
tries. Does it move objects to neighbouring classes and so on? Maybe
we can do something here.

But then the question is: how do we control that we don't drain the
battery too fast? And perhaps some other questions too.
On Tue, Nov 15, 2022 at 04:59:29PM +0900, Sergey Senozhatsky wrote:
> On (22/11/15 15:01), Sergey Senozhatsky wrote:
> > On (22/11/11 09:03), Minchan Kim wrote:
> > [..]
> > > with *aggressive zspage compaction*. Right now we rely on the shrinker
> > > (it might already be enough) to trigger it, but we could change the
> > > policy so that compaction runs once the wasted memory in a size class
> > > crosses a threshold
> >
> > Compaction does something good only when we can release a zspage in
> > the end. Otherwise we just hold the global pool->lock (assuming that
> > we land the zsmalloc writeback series) and simply move objects around
> > zspages. So the ability to limit the zspage chain size can still be
> > valuable on another level, as a measure to reduce the dependency on
> > compaction success.
[..]
> What I'm trying to say here is that the "aggressiveness of compaction"
> should probably be measured not by compaction frequency, but by the
> overall cost of compaction operations.
>
> An aggressive compaction frequency doesn't help us much if the state
> of the pool doesn't change significantly between compactions. E.g. if
> we do 10 compaction calls, then only the first one potentially
> compacts some zspages; the remaining ones don't do anything.
>
> The cost of compaction operations is a measure of how hard compaction
> tries. Does it move objects to neighbouring classes and so on? Maybe
> we can do something here.
>
> But then the question is: how do we control that we don't drain the
> battery too fast? And perhaps some other questions too.

Sure, once we start talking about battery, there are a lot of things
we need to consider, not only zram-direct ones but also indirect
effects caused by memory pressure and workload patterns. That's not
something we can control, and it would consume much more battery. I
understand your concern, but I also think a per-device sysfs knob
can't fully solve the issue, since workloads are too dynamic even
within the same swap file/fs. I'd like to try finding a sweet spot in
general. If that turns out to be too hard, then we need to introduce
the knob with a reasonable guideline for how to find it.

Let me try to get data under an Android workload on how much blindly
increasing ZS_MAX_PAGES_PER_ZSPAGE will change things.
On (22/11/15 15:23), Minchan Kim wrote:
> Sure, once we start talking about battery, there are a lot of things
> we need to consider, not only zram-direct ones but also indirect
> effects caused by memory pressure and workload patterns. That's not
> something we can control, and it would consume much more battery. I
> understand your concern, but I also think a per-device sysfs knob
> can't fully solve the issue, since workloads are too dynamic even
> within the same swap file/fs. I'd like to try finding a sweet spot in
> general. If that turns out to be too hard, then we need to introduce
> the knob with a reasonable guideline for how to find it.
>
> Let me try to get data under an Android workload on how much blindly
> increasing ZS_MAX_PAGES_PER_ZSPAGE will change things.

I don't want to push for the sysfs knob. What I like about a sysfs
knob vs Kconfig is that sysfs is opt-in. We can ask folks to try
things out; people will know what to look at and they will keep an eye
on the metrics, then they come back to us. So we can sit down, look at
the numbers and draw some conclusions.

Kconfig is not opt-in. It'll happen for everyone, as a policy,
transparently, and then we rely on
a) people tracking metrics that they were not asked to track
b) people noticing changes (positive or negative) in metrics that they
   don't keep an eye on
c) people figuring out that the change in metrics is related to the
   zsmalloc Kconfig (and that's a very non-obvious conclusion)
d) people reaching out to us

That's way too much to rely on. Chances are we will never hear back.

I understand that you don't like sysfs, and it's probably not the best
thing, but Kconfig is not better. I like the opt-in nature of sysfs -
if you change it, then you know what you are doing.