| Message ID | 20191018002820.307763-1-guro@fb.com (mailing list archive) |
|---|---|
| Series | The new slab memory controller |
On 10/17/19 8:28 PM, Roman Gushchin wrote:
> The existing slab memory controller is based on the idea of replicating
> slab allocator internals for each memory cgroup. This approach promises
> a low memory overhead (one pointer per page), and isn't adding too much
> code on hot allocation and release paths. But is has a very serious flaw:
                                                ^it^
> it leads to a low slab utilization.
>
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
>
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.

In the case of slub memory allocator, it is not just unused space within
a slab. It is also the use of per-cpu slabs that can hold up a lot of
memory, especially if the tasks jump around to different cpus. The
problem is compounded if a lot of memcgs are being used. Memory
utilization can improve quite significantly if per-cpu slabs are
disabled. Of course, it comes with a performance cost.

Cheers,
Longman
On Fri, Oct 18, 2019 at 01:03:54PM -0400, Waiman Long wrote:
> On 10/17/19 8:28 PM, Roman Gushchin wrote:
> > The existing slab memory controller is based on the idea of replicating
> > slab allocator internals for each memory cgroup. This approach promises
> > a low memory overhead (one pointer per page), and isn't adding too much
> > code on hot allocation and release paths. But is has a very serious flaw:
>                                                 ^it^
> > it leads to a low slab utilization.
> >
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> >
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
>
> In the case of slub memory allocator, it is not just unused space within
> a slab. It is also the use of per-cpu slabs that can hold up a lot of
> memory, especially if the tasks jump around to different cpus. The
> problem is compounded if a lot of memcgs are being used. Memory
> utilization can improve quite significantly if per-cpu slabs are
> disabled. Of course, it comes with a performance cost.

Right, but it's basically the same problem: if slabs can be used exclusively
by a single memory cgroup, slab utilization is low. Per-cpu slabs are just
making the problem worse by increasing the number of mostly empty slabs
proportionally to the number of CPUs.

With memory cgroup accounting disabled, slab utilization is quite high even
with per-cpu slabs. So the problem isn't in per-cpu slabs by themselves;
they just were not designed to exist in so many copies.

Thanks!
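To put rough numbers on the effect discussed above (each accounted cache cloned per memcg, each clone keeping per-cpu slabs), here is a back-of-the-envelope sketch in C. Every figure in it is hypothetical and chosen only to illustrate the caches x memcgs x CPUs scaling; none of the numbers are measurements from this thread.

```c
#include <stdio.h>

/*
 * Back-of-the-envelope illustration (hypothetical numbers): with
 * per-memcg kmem_caches, each accounted cache is cloned per memcg,
 * and SLUB additionally keeps per-cpu slabs for each of those clones.
 * A sparsely used clone can therefore pin roughly one mostly-empty
 * slab per CPU.
 */
int main(void)
{
	unsigned long nr_cpus = 32;              /* assumed machine size */
	unsigned long nr_memcgs = 200;           /* assumed number of cgroups */
	unsigned long nr_accounted_caches = 20;  /* assumed caches touched per memcg */
	unsigned long slab_bytes = 4 * 4096;     /* assumed order-2 slab */
	double occupancy = 0.1;                  /* assumed 10% of objects in use */

	unsigned long nr_slabs = nr_cpus * nr_memcgs * nr_accounted_caches;
	double wasted_gb = nr_slabs * slab_bytes * (1.0 - occupancy) / (1 << 30);

	printf("worst case: %lu mostly-empty slabs, ~%.1f GB of unused slab space\n",
	       nr_slabs, wasted_gb);
	return 0;
}
```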
On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
[...]
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.

How much of the memory are we talking about here?

Also is there any pattern for specific caches that tend to utilize
much worse than others?
On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> [...]
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
>
> How much of the memory are we talking about here?

Just to be more specific: your cover letter mentions several hundreds of MBs,
but there is no sense of scale relative to the overall charged memory. How
much of that is the actual kmem accounted memory?

> Also is there any pattern for specific caches that tend to utilize
> much worse than others?
On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
>
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
>
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and changing
> memcg pointer to the parent.
>
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are destroyed
> with root counterparts. It allows to remove a good amount of code and make
> things generally simpler.

What is the performance impact? Also, what is the effect on the memory
reclaim side and on isolation? I would expect that mixing objects from
different cgroups would have a negative/unpredictable impact on
memcg slab shrinking.
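The ownership scheme described in the quoted cover letter (a per-slab-page vector of per-object ownership entries, each pointing at a refcounted intermediate object that can be switched to the parent memcg) can be sketched roughly as below. This is a simplified userspace illustration with hypothetical names, not the actual structures or identifiers from the patchset.

```c
#include <stdatomic.h>

/*
 * Illustrative sketch of the reparenting scheme described above.
 * Names are hypothetical; the real patchset uses kernel types and
 * different identifiers.
 */

struct mem_cgroup;                       /* stands in for the real memcg */

/* Intermediate object: a reference-counted, switchable memcg pointer. */
struct memcg_ptr {
	_Atomic(struct mem_cgroup *) memcg;
	atomic_ulong refcnt;             /* one ref per live object */
};

/* One ownership slot per object on a slab page. */
struct slab_page_meta {
	struct memcg_ptr **obj_vec;      /* vector sized to objects-per-slab */
	unsigned int nr_objects;
};

/* Charge path: remember which memcg owns this object slot. */
static void account_object(struct slab_page_meta *meta, unsigned int idx,
			   struct memcg_ptr *ptr)
{
	atomic_fetch_add(&ptr->refcnt, 1);
	meta->obj_vec[idx] = ptr;
}

/*
 * Reparenting: a single pointer swap moves every object that references
 * this memcg_ptr to the parent, without walking slab pages.
 */
static void reparent(struct memcg_ptr *ptr, struct mem_cgroup *parent)
{
	atomic_store(&ptr->memcg, parent);
}
```

An uncharge path would drop the reference taken in account_object(), and once the count reaches zero the memcg_ptr itself could be freed; those details are omitted from the sketch.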
On Tue, Oct 22, 2019 at 03:28:00PM +0200, Michal Hocko wrote:
> On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> > On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > [...]
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> >
> > How much of the memory are we talking about here?
>
> Just to be more specific: your cover letter mentions several hundreds of MBs,
> but there is no sense of scale relative to the overall charged memory. How
> much of that is the actual kmem accounted memory?

As I wrote, on average it saves 30-45% of slab memory. The smallest number
I've seen was about 15%, the largest over 60%.

The amount of slab memory isn't a very stable metric in general: it heavily
depends on the workload pattern, memory pressure, uptime, etc. In absolute
numbers I've seen savings from ~60 Mb for an empty VM to more than 2 Gb for
some production workloads.

Btw, please note that after a recent change from Vlastimil,
6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting"), slab counters
include large allocations which are passed directly to the page allocator.
That makes the memory savings smaller in percentage terms, but of course
not in absolute numbers.

> > Also is there any pattern for specific caches that tend to utilize
> > much worse than others?

Caches which usually have many objects (e.g. inodes) initially have a better
utilization, but as some of them get reclaimed, the utilization drops. And if
the cgroup is already dead, no one can reuse these mostly empty slab pages,
so it's pretty wasteful.

So I don't think the problem is specific to any cache; it's pretty general.
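The drgn script used for the estimates above is not included in the thread. As a very rough stand-in (my own sketch, not that script), object-level utilization can be approximated from /proc/slabinfo; unlike the drgn approach this needs root, sees only the global merged view, and ignores per-slab metadata, page-level padding and per-memcg detail.

```c
#include <stdio.h>
#include <string.h>

/*
 * Rough approximation of slab object utilization from /proc/slabinfo:
 * bytes in allocated objects vs. bytes in all object slots. A stand-in
 * sketch only, not the drgn script referenced in the thread.
 */
int main(void)
{
	FILE *f = fopen("/proc/slabinfo", "r");
	char line[512];
	unsigned long long used = 0, total = 0;

	if (!f) {
		perror("/proc/slabinfo");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char name[64];
		unsigned long active, num, objsize;

		/* skip the version line and the "# name ..." header */
		if (line[0] == '#' || !strncmp(line, "slabinfo", 8))
			continue;
		if (sscanf(line, "%63s %lu %lu %lu", name, &active, &num,
			   &objsize) != 4)
			continue;

		used  += (unsigned long long)active * objsize;
		total += (unsigned long long)num * objsize;
	}
	fclose(f);

	if (total)
		printf("approx. object utilization: %.1f%%\n",
		       100.0 * used / total);
	return 0;
}
```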
On Tue, Oct 22, 2019 at 03:31:48PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> >
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> >
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and changing
> > memcg pointer to the parent.
> >
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are destroyed
> > with root counterparts. It allows to remove a good amount of code and make
> > things generally simpler.
>
> What is the performance impact?

As I wrote, so far we haven't found any regression on any real-world
workload. Of course, it's pretty easy to come up with a synthetic test
which will show some performance hit: e.g. allocate and free a large
number of objects from a single cache from a single cgroup. The reason
is simple: stats and accounting are more precise, so it requires more
work. But I don't think it's a real problem.

On the other hand, I expect to see some positive effects from the
significantly reduced number of unmovable pages: memory fragmentation
should become lower. And all kernel objects will reside on a smaller
number of pages, so we can expect better cache utilization.

> Also, what is the effect on the memory reclaim side and on isolation?
> I would expect that mixing objects from different cgroups would have
> a negative/unpredictable impact on memcg slab shrinking.

Slab shrinking already works on a per-object basis, so no changes there.
Quite the opposite: now the freed space can be reused by other cgroups,
whereas previously shrinking was often a useless operation, as nobody
could reuse the space until all objects were freed and the page was
returned to the page allocator.

Thanks!
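To make the quoted "charging is performed on page-basis, with rounding up and remembering leftovers" line concrete, here is a minimal sketch of such a byte-level charging scheme. The names are hypothetical and locking, error handling and uncharging are omitted; it is an illustration of the idea, not the code from the patchset. Page-sized charges are taken against the memcg, and the unused remainder is cached so that subsequent small objects don't hit the page counter at all.

```c
#include <stdio.h>

#define PAGE_SIZE 4096UL

/*
 * Minimal sketch (hypothetical names) of byte-level charging with
 * page-granular charges and a cached leftover, as described in the
 * quoted cover letter.
 */
struct memcg_stock {
	unsigned long nr_bytes;          /* pre-charged bytes not yet consumed */
	unsigned long charged_pages;     /* what was actually charged to the memcg */
};

static void charge_pages(struct memcg_stock *s, unsigned long nr_pages)
{
	/* stands in for the real page-counter charge on the memcg */
	s->charged_pages += nr_pages;
}

static void obj_charge(struct memcg_stock *s, unsigned long size)
{
	if (size > s->nr_bytes) {
		/* round the shortfall up to whole pages, remember the rest */
		unsigned long pages =
			(size - s->nr_bytes + PAGE_SIZE - 1) / PAGE_SIZE;

		charge_pages(s, pages);
		s->nr_bytes += pages * PAGE_SIZE;
	}
	s->nr_bytes -= size;
}

int main(void)
{
	struct memcg_stock stock = { 0, 0 };

	/* e.g. 10 objects of 600 bytes: 6000 bytes, but only 2 pages charged */
	for (int i = 0; i < 10; i++)
		obj_charge(&stock, 600);

	printf("charged %lu pages for 6000 bytes, %lu bytes left in stock\n",
	       stock.charged_pages, stock.nr_bytes);
	return 0;
}
```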