Message ID | 20220429201131.3397875-2-yosryahmed@google.com |
---|---|
State | New, archived |
Series | KVM: mm: count KVM mmu usage in memory stats |
On Fri, 29 Apr 2022 21:11:28 +0100, Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Add NR_SECONDARY_PAGETABLE stat to count secondary page table uses, e.g.
> KVM mmu. This provides more insight into the kernel memory used by a
> workload.
>
> This stat will be used by subsequent patches to count KVM mmu
> memory usage.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> [...]
>
> +	  secondary_pagetables
> +		Amount of memory allocated for secondary page tables,
> +		this currently includes KVM mmu allocations on x86
> +		and arm64.

Can you please explain what the rationale is for this? We already
account for the (arm64) S2 PTs as a userspace allocation (see
115bae923ac8bb29ee635). You are saying that this is related to a
'workload', but given that the accounting is global, I fail to see how
you can attribute these allocations to a particular VM.

What do you plan to do for IOMMU page tables? After all, they serve
the exact same purpose, and I'd expect these to be handled the same
way (i.e. why is this KVM specific?).

Thanks,

	M.
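For readers following the reference: "accounted as a userspace allocation" means the stage-2 table pages are allocated with GFP_KERNEL_ACCOUNT, so they are charged to the memory cgroup of the allocating VMM task. A minimal illustrative sketch of that pattern follows; the function name is made up and this is not the arm64 commit itself.

#include <linux/gfp.h>

/*
 * Illustrative only: __GFP_ACCOUNT (part of GFP_KERNEL_ACCOUNT) charges the
 * page to the allocating task's memcg, which is how stage-2 page tables
 * already show up in a VM's cgroup kernel-memory usage.
 */
static void *alloc_accounted_table_page(void)
{
	return (void *)__get_free_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
}

That charging answers who owns the memory, but it does not break it out as a distinct category in memory.stat; the latter is what the new stat is about.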
On Mon, May 2, 2022 at 3:01 AM Marc Zyngier <maz@kernel.org> wrote:
>
> Can you please explain what the rationale is for this? We already
> account for the (arm64) S2 PTs as a userspace allocation (see
> 115bae923ac8bb29ee635).

This can be considered a continuation of that work. The mentioned
commit accounts S2 PTs to the VM process's cgroup kernel memory. We
have stats for the total kernel memory, and some fine-grained
categories of that, like pagetables, stack, slab, etc.

This patch just adds another category to give further insight into
what exactly is using kernel memory.

> You are saying that this is related to a 'workload', but given that
> the accounting is global, I fail to see how you can attribute these
> allocations to a particular VM.

The main motivation is having the memcg stats, which give attribution
to workloads. If you think it's more appropriate, we can add it as a
memcg-only stat, like MEMCG_VMALLOC (see 4e5aa1f4c2b4 ("memcg: add
per-memcg vmalloc stat")). The only reason I made this a global stat
too is to be consistent with NR_PAGETABLE.

> What do you plan to do for IOMMU page tables? After all, they serve
> the exact same purpose, and I'd expect these to be handled the same
> way (i.e. why is this KVM specific?).

The reason this was named NR_SECONDARY_PAGETABLE instead of
NR_KVM_PAGETABLE is exactly that: to leave room to incrementally
account other types of secondary page tables to this stat. It is just
that we are currently interested in the KVM MMU usage.
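To make the two directions in this reply concrete, here is a rough sketch; the wrapper functions are invented for illustration, while MEMCG_VMALLOC, NR_SECONDARY_PAGETABLE, mod_memcg_page_state() and mod_lruvec_page_state() are the interfaces referenced in the thread and in this series.

#include <linux/memcontrol.h>
#include <linux/mm.h>

/*
 * memcg-only alternative, in the style of MEMCG_VMALLOC: visible in
 * memory.stat, but it needs memcg-specific plumbing and has no per-node
 * breakdown. (MEMCG_VMALLOC stands in for a would-be
 * MEMCG_SECONDARY_PAGETABLE, which does not exist.)
 */
static void charge_memcg_only(struct page *page, int nr)
{
	mod_memcg_page_state(page, MEMCG_VMALLOC, nr);
}

/*
 * node_stat_item approach taken by this patch: callers (the later KVM
 * patches) would do a single update that is reflected in /proc/vmstat,
 * per-node meminfo and memory.stat.
 */
static void charge_node_stat(struct page *page, int nr)
{
	mod_lruvec_page_state(page, NR_SECONDARY_PAGETABLE, nr);
}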
On Mon, May 2, 2022 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]
>
> The reason this was named NR_SECONDARY_PAGETABLE instead of
> NR_KVM_PAGETABLE is exactly that: to leave room to incrementally
> account other types of secondary page tables to this stat. It is just
> that we are currently interested in the KVM MMU usage.

Any thoughts on this? Do you think MEMCG_SECONDARY_PAGETABLE would be
more appropriate here?
On Mon, May 9, 2022 at 9:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]
>
> Any thoughts on this? Do you think MEMCG_SECONDARY_PAGETABLE would be
> more appropriate here?

I think NR_SECONDARY_PAGETABLE is good. Later it can include page
tables from other subsystems.

The only nit (which you can ignore) I have is the very long memcg stat
and vmstat names. Other than that, the patch looks good to me.
Hey Yosry,

On Mon, May 02, 2022 at 11:46:26AM -0700, Yosry Ahmed wrote:
> The main motivation is having the memcg stats, which give attribution
> to workloads. If you think it's more appropriate, we can add it as a
> memcg-only stat, like MEMCG_VMALLOC (see 4e5aa1f4c2b4 ("memcg: add
> per-memcg vmalloc stat")). The only reason I made this a global stat
> too is to be consistent with NR_PAGETABLE.

Please no memcg-specific stats if a regular vmstat item is possible
and useful at the system level as well, like in this case. It's extra
memcg code, extra callbacks, and it doesn't have NUMA node awareness.

> The reason this was named NR_SECONDARY_PAGETABLE instead of
> NR_KVM_PAGETABLE is exactly that: to leave room to incrementally
> account other types of secondary page tables to this stat. It is just
> that we are currently interested in the KVM MMU usage.

Do you actually care at the supervisor level that this memory is used
for guest page tables?

It seems to me you primarily care that it is reported *somewhere*
(hence the piggybacking off of NR_PAGETABLE at first). And whether
it's page tables or iommu tables or whatever else allocated for the
purpose of virtualization, it doesn't make much of a difference to the
host/cgroup that is tracking it, right?

(The proximity to nr_pagetable could also be confusing. A high page
table count can be a hint to userspace to enable THP. It seems
actionable in a different way than a high number of kvm page tables or
iommu page tables.)

How about NR_VIRT? It's shorter, seems descriptive enough, less room
for confusion, and is more easily extensible in the future.
On Thu, May 12, 2022, Johannes Weiner wrote:
> Do you actually care at the supervisor level that this memory is used
> for guest page tables?

Hmm, yes? KVM does have a decent number of large-ish allocations that
aren't for page tables, but except for page tables, the number/size of
those allocations scales linearly with either the number of vCPUs or
the amount of memory assigned to the VM (with no room for improvement
barring KVM changes).

Off the top of my head, KVM's secondary page tables are the only
allocations that don't scale linearly, especially when nested
virtualization is in use.

> (The proximity to nr_pagetable could also be confusing. A high page
> table count can be a hint to userspace to enable THP. It seems
> actionable in a different way than a high number of kvm page tables or
> iommu page tables.)

I don't know about iommu page tables, but on the KVM side a high count
can also be a good signal that enabling THP would be beneficial. It's
definitely actionable in a different way too, though.

> How about NR_VIRT? It's shorter, seems descriptive enough, less room
> for confusion, and is more easily extensible in the future.

I don't like NR_VIRT because VFIO/iommu can be used for
non-virtualization things, and we'd be lying by omission unless KVM
(and other users) updates all of its large-ish allocations to account
them correctly.
On Thu, May 12, 2022 at 11:29:38PM +0000, Sean Christopherson wrote:
> Hmm, yes? KVM does have a decent number of large-ish allocations that
> aren't for page tables, but except for page tables, the number/size of
> those allocations scales linearly with either the number of vCPUs or
> the amount of memory assigned to the VM (with no room for improvement
> barring KVM changes).
>
> Off the top of my head, KVM's secondary page tables are the only
> allocations that don't scale linearly, especially when nested
> virtualization is in use.

Thanks, that's useful information.

Are these other allocations accounted somewhere? If not, are they
potential containment holes that will need fixing eventually?

> I don't know about iommu page tables, but on the KVM side a high count
> can also be a good signal that enabling THP would be beneficial.

Well, maybe.

It might help, but ultimately it's the process that's in control in
all cases: it's unmovable kernel memory allocated to manage virtual
address space inside the task.

So I'm still a bit at a loss whether these things should all be lumped
in together or kept separately. meminfo and memory.stat are permanent
ABI, so we should try to establish in advance whether the new item is
really a first-class consumer or part of something bigger.

The patch initially piggybacked on NR_PAGETABLE. I found an email from
you asking why it couldn't be a separate item, but it didn't give the
reasoning behind that decision. Could you share your thoughts on that?

Thanks
On Fri, May 13, 2022, Johannes Weiner wrote:
> Are these other allocations accounted somewhere? If not, are they
> potential containment holes that will need fixing eventually?

All allocations that are tied to a specific VM/vCPU are tagged
GFP_KERNEL_ACCOUNT, so we should be good on that front.

> The patch initially piggybacked on NR_PAGETABLE. I found an email from
> you asking why it couldn't be a separate item, but it didn't give the
> reasoning behind that decision. Could you share your thoughts on that?

It was mostly an honest question; I too am trying to understand what
userspace wants to do with this information. I was/am also trying to
understand the benefits of doing the tracking through page_state and
not a dedicated KVM stat. E.g. KVM already has specific stats for the
number of leaf pages mapped into a VM, why not do the same for
non-leaf pages?
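To make the alternative Sean raises concrete: a dedicated KVM stat would be an internal counter, analogous to KVM's existing per-VM leaf-mapping stats, surfaced through KVM's stats interface rather than /proc/vmstat or memory.stat. The sketch below is purely illustrative; the type, field and helper names are hypothetical and are not existing KVM code.

#include <linux/atomic.h>

/* Hypothetical per-VM counter for non-leaf (page-table) pages. */
struct vm_mmu_stats {
	atomic64_t nonleaf_pages;
};

/* Called by the MMU when a shadow/stage-2 table page is added or removed. */
static inline void mmu_stats_add_nonleaf(struct vm_mmu_stats *stats, long nr)
{
	atomic64_add(nr, &stats->nonleaf_pages);
}

The tradeoff discussed in the rest of the thread is that such a counter is only visible to tooling that knows about KVM, whereas a node_stat_item shows up in the generic system and cgroup memory interfaces.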
Thanks everyone for participating in this discussion and looking into this.

On Fri, May 13, 2022 at 9:12 AM Sean Christopherson <seanjc@google.com> wrote:
> It was mostly an honest question; I too am trying to understand what
> userspace wants to do with this information. I was/am also trying to
> understand the benefits of doing the tracking through page_state and
> not a dedicated KVM stat. E.g. KVM already has specific stats for the
> number of leaf pages mapped into a VM, why not do the same for
> non-leaf pages?

Let me cast some light on this.

The reason this started out piggybacking on NR_PAGETABLE is that we had
a remnant of an old internal implementation, predating NR_PAGETABLE's
introduction upstream, that accounted KVM secondary page tables as
normal page tables. This made me think that behavior was preferable.
Personally, I wanted to make it a separate thing from the beginning,
and when I found opinions here that also suggested a separate stat, I
went ahead with that.

As for where to put this information, it does not have to be
NR_SECONDARY_PAGETABLE. Ultimately, people working on KVM are the ones
that will interpret and act upon this data, so if you have somewhere
else in mind please let me know, Sean.
On Fri, May 13, 2022 at 9:12 AM Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> It was mostly an honest question; I too am trying to understand what
> userspace wants to do with this information. I was/am also trying to
> understand the benefits of doing the tracking through page_state and
> not a dedicated KVM stat. E.g. KVM already has specific stats for the
> number of leaf pages mapped into a VM, why not do the same for
> non-leaf pages?

Let me answer why a more general stat is useful and what the potential
userspace reaction is:

For a memory type which is significant enough, it is useful to expose
it in the general interfaces, so that the general data/stat collection
infrastructure can collect it instead of relying on workload-dependent
stat collectors. In addition, the stat does not necessarily have to
drive a userspace reaction in an online fashion; we also collect stats
for offline analysis, which greatly influences the priority order of
optimization work items.

Next, the question is whether we really need a separate stat item
(secondary_pagetable instead of just plain pagetable) exposed in the
stable API. To me, secondary_pagetable is general (not KVM specific)
enough and can be significant, so having a separate dedicated stat
should be OK. I am also fine with lumping it into the pagetable stat
for now, but we do want it to be accounted somewhere.
On Fri, May 13, 2022 at 10:14 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> [...]
>
> Next, the question is whether we really need a separate stat item
> (secondary_pagetable instead of just plain pagetable) exposed in the
> stable API. To me, secondary_pagetable is general (not KVM specific)
> enough and can be significant, so having a separate dedicated stat
> should be OK. I am also fine with lumping it into the pagetable stat
> for now, but we do want it to be accounted somewhere.

Any thoughts on this, Johannes?
On Thu, May 19, 2022 at 06:56:54PM -0700, Yosry Ahmed wrote:
> On Fri, May 13, 2022 at 10:14 AM Shakeel Butt <shakeelb@google.com> wrote:
> > [...]
> > Next, the question is whether we really need a separate stat item
> > (secondary_pagetable instead of just plain pagetable) exposed in the
> > stable API.
>
> Any thoughts on this, Johannes?

I agree that this memory should show up in vmstat/memory.stat in some
form or another.

The arguments on whether this should be part of NR_PAGETABLE or a
separate entry seem a bit vague to me. I was hoping somebody more
familiar with KVM could provide a better picture of memory consumption
in that area.

Sean had mentioned that these allocations already get tracked through
GFP_KERNEL_ACCOUNT. That's good, but if they are significant enough to
track, they should be represented in memory.stat in some form. Sean
also pointed out, though, that those allocations tend to scale rather
differently than the page tables, so it probably makes sense to keep
those two things separate at least.

Any thoughts on putting shadow page tables and iommu page tables into
the existing NR_PAGETABLE item? If not, what are the cons?

And creating (maybe later) a separate NR_VIRT for the other
GFP_KERNEL_ACCOUNT allocations in KVM?
On Fri, May 20, 2022 at 7:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> [...]
>
> Any thoughts on putting shadow page tables and iommu page tables into
> the existing NR_PAGETABLE item? If not, what are the cons?
>
> And creating (maybe later) a separate NR_VIRT for the other
> GFP_KERNEL_ACCOUNT allocations in KVM?

I agree with Sean that an NR_VIRT stat would be inaccurate by omission
unless we account for all KVM allocations under it. This might be an
unnecessary burden, according to what Sean said, as most other
allocations scale linearly with the number of vCPUs or the memory
assigned to the VM.

I don't have enough context to say whether we should piggyback KVM MMU
pages onto the existing NR_PAGETABLE item, but from a high level it
seems like it would be more helpful if they were a separate stat.
Anyway, I am willing to go with whatever Sean thinks is best.
On Tue, May 24, 2022 at 03:31:52PM -0700, Yosry Ahmed wrote:
> I agree with Sean that an NR_VIRT stat would be inaccurate by omission
> unless we account for all KVM allocations under it. This might be an
> unnecessary burden, according to what Sean said, as most other
> allocations scale linearly with the number of vCPUs or the memory
> assigned to the VM.

I think it's fine to table the addition of NR_VIRT for now. My
conclusion from this discussion was just that if we do want to add more
KVM-related allocation sites later on, they would likely be something
separate and not share an item with the shadow tables. This simplifies
the discussion around how to present the shadow tables.

That said, stats can be incremental and still useful. memory.current
itself lies by omission. It's more important to catch what's of
significance and allow users to narrow down pathological cases.

> I don't have enough context to say whether we should piggyback KVM MMU
> pages onto the existing NR_PAGETABLE item, but from a high level it
> seems like it would be more helpful if they were a separate stat.
> Anyway, I am willing to go with whatever Sean thinks is best.

Somebody should work this out and put it into a changelog. It's
permanent ABI.
On Wed, May 25, 2022, Johannes Weiner wrote:
> On Tue, May 24, 2022 at 03:31:52PM -0700, Yosry Ahmed wrote:
> > I don't have enough context to say whether we should piggyback KVM MMU
> > pages onto the existing NR_PAGETABLE item, but from a high level it
> > seems like it would be more helpful if they were a separate stat.
> > Anyway, I am willing to go with whatever Sean thinks is best.
>
> Somebody should work this out and put it into a changelog. It's
> permanent ABI.

After a lot of waffling, my vote is to add a dedicated NR_SECONDARY_PAGETABLE.

It's somewhat redundant from a KVM perspective, as NR_SECONDARY_PAGETABLE
will scale with KVM's per-VM pages_{4k,2m,1g} stats unless the guest is
doing something bizarre, e.g. accessing only 4kb chunks of 2mb pages so
that KVM is forced to allocate a large number of page tables even though
the guest isn't accessing that much memory. But someone would need to
either understand how KVM works to make that connection, or know (or be
told) to go look at KVM's stats if they're running VMs to better decipher
the stats.

And even in the little bit of time I played with this, I found having
nr_page_table_pages side-by-side with nr_secondary_page_table_pages to be
very informative. E.g. when backing a VM with THP versus HugeTLB,
nr_secondary_page_table_pages is roughly the same, but nr_page_table_pages
is an order of magnitude higher with THP. I'm guessing the THP behavior is
due to something triggering DoubleMap, but now I want to find out why
that's happening.

So while I'm pretty sure a clever user could glean the same info by
cross-referencing NR_PAGETABLE stats with KVM stats, I think having
NR_SECONDARY_PAGETABLE will at the very least prove to be helpful for
understanding tradeoffs between VM backing types, and likely even steer
folks towards potential optimizations.

Baseline:
# grep page_table /proc/vmstat
nr_page_table_pages 2830
nr_secondary_page_table_pages 0

THP:
# grep page_table /proc/vmstat
nr_page_table_pages 7584
nr_secondary_page_table_pages 140

HugeTLB:
# grep page_table /proc/vmstat
nr_page_table_pages 3153
nr_secondary_page_table_pages 153
On Wed, May 25, 2022 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> So while I'm pretty sure a clever user could glean the same info by
> cross-referencing NR_PAGETABLE stats with KVM stats, I think having
> NR_SECONDARY_PAGETABLE will at the very least prove to be helpful for
> understanding tradeoffs between VM backing types, and likely even steer
> folks towards potential optimizations.

Interesting findings! Thanks for taking the time to look into this, Sean!

I will refresh this patchset and summarize the discussion in the
commit message, and also fix some nits on the KVM side. Does this
sound good to everyone?
On Fri, May 27, 2022 at 11:33:27AM -0700, Yosry Ahmed wrote:
> Interesting findings! Thanks for taking the time to look into this, Sean!
>
> I will refresh this patchset and summarize the discussion in the
> commit message, and also fix some nits on the KVM side. Does this
> sound good to everyone?

Yes, thanks for summarizing this. Sounds good to me!
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69d7a6983f78..828cb6b6f918 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1312,6 +1312,11 @@ PAGE_SIZE multiple when read back.
 	  pagetables
 		Amount of memory allocated for page tables.
 
+	  secondary_pagetables
+		Amount of memory allocated for secondary page tables,
+		this currently includes KVM mmu allocations on x86
+		and arm64.
+
 	  percpu (npn)
 		Amount of memory used for storing per-cpu kernel
 		data structures.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 061744c436d9..894d6317f3bd 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -973,6 +973,7 @@ You may not have all of these fields.
     SReclaimable:     159856 kB
     SUnreclaim:       124508 kB
     PageTables:        24448 kB
+    SecPageTables:         0 kB
     NFS_Unstable:          0 kB
     Bounce:                0 kB
     WritebackTmp:          0 kB
@@ -1067,6 +1068,9 @@ SUnreclaim
 PageTables
               amount of memory dedicated to the lowest level of page
               tables.
+SecPageTables
+              amount of memory dedicated to secondary page tables, this
+              currently includes KVM mmu allocations on x86 and arm64.
 NFS_Unstable
               Always zero. Previous counted pages which had been written to
               the server, but has not been committed to stable storage.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ec8bb24a5a22..9fe716832546 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -433,6 +433,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d ShadowCallStack:%8lu kB\n"
 #endif
 			     "Node %d PageTables:     %8lu kB\n"
+			     "Node %d SecPageTables:  %8lu kB\n"
 			     "Node %d NFS_Unstable:   %8lu kB\n"
 			     "Node %d Bounce:         %8lu kB\n"
 			     "Node %d WritebackTmp:   %8lu kB\n"
@@ -459,6 +460,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
 #endif
 			     nid, K(node_page_state(pgdat, NR_PAGETABLE)),
+			     nid, K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
 			     nid, 0UL,
 			     nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
 			     nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..fad29024eb2e 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -108,6 +108,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 	show_val_kb(m, "PageTables:     ",
 		    global_node_page_state(NR_PAGETABLE));
+	show_val_kb(m, "SecPageTables:  ",
+		    global_node_page_state(NR_SECONDARY_PAGETABLE));
 	show_val_kb(m, "NFS_Unstable:   ", 0);
 	show_val_kb(m, "Bounce:         ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 962b14d403e8..35f57f2578c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -219,6 +219,7 @@ enum node_stat_item {
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
 	NR_PAGETABLE,		/* used for pagetables */
+	NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 725f76723220..89fbd1793960 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1388,6 +1388,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "kernel", MEMCG_KMEM },
 	{ "kernel_stack", NR_KERNEL_STACK_KB },
 	{ "pagetables", NR_PAGETABLE },
+	{ "secondary_pagetables", NR_SECONDARY_PAGETABLE },
 	{ "percpu", MEMCG_PERCPU_B },
 	{ "sock", MEMCG_SOCK },
 	{ "vmalloc", MEMCG_VMALLOC },
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2db95780e003..96d00ae9d5c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5932,7 +5932,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
 		" unevictable:%lu dirty:%lu writeback:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
-		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
+		" mapped:%lu shmem:%lu pagetables:%lu\n"
+		" secondary_pagetables:%lu bounce:%lu\n"
 		" kernel_misc_reclaimable:%lu\n"
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
@@ -5949,6 +5950,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_FILE_MAPPED),
 		global_node_page_state(NR_SHMEM),
 		global_node_page_state(NR_PAGETABLE),
+		global_node_page_state(NR_SECONDARY_PAGETABLE),
 		global_zone_page_state(NR_BOUNCE),
 		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE),
 		global_zone_page_state(NR_FREE_PAGES),
@@ -5982,6 +5984,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" shadow_call_stack:%lukB"
 #endif
 			" pagetables:%lukB"
+			" secondary_pagetables:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -6007,6 +6010,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			node_page_state(pgdat, NR_KERNEL_SCS_KB),
 #endif
 			K(node_page_state(pgdat, NR_PAGETABLE)),
+			K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
 			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no");
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b75b1a64b54c..50bbec73809b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1240,6 +1240,7 @@ const char * const vmstat_text[] = {
 	"nr_shadow_call_stack",
 #endif
 	"nr_page_table_pages",
+	"nr_secondary_page_table_pages",
 #ifdef CONFIG_SWAP
 	"nr_swapcached",
 #endif
Add NR_SECONDARY_PAGETABLE stat to count secondary page table uses, e.g.
KVM mmu. This provides more insight into the kernel memory used by a
workload.

This stat will be used by subsequent patches to count KVM mmu
memory usage.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 5 +++++
 Documentation/filesystems/proc.rst      | 4 ++++
 drivers/base/node.c                     | 2 ++
 fs/proc/meminfo.c                       | 2 ++
 include/linux/mmzone.h                  | 1 +
 mm/memcontrol.c                         | 1 +
 mm/page_alloc.c                         | 6 +++++-
 mm/vmstat.c                             | 1 +
 8 files changed, 21 insertions(+), 1 deletion(-)
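For context on how the follow-up KVM patches might consume the new item, here is a hypothetical sketch of the allocation/accounting pairing. The helper names are invented; only NR_SECONDARY_PAGETABLE (added above), GFP_KERNEL_ACCOUNT, mod_lruvec_page_state() and virt_to_page() are existing interfaces, and the real KVM changes may be structured differently.

#include <linux/gfp.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>

/* Allocate a secondary page-table page: charged to the caller's memcg via
 * GFP_KERNEL_ACCOUNT and counted in the new NR_SECONDARY_PAGETABLE item. */
static void *sp_table_alloc(void)
{
	void *table = (void *)__get_free_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

	if (table)
		mod_lruvec_page_state(virt_to_page(table), NR_SECONDARY_PAGETABLE, 1);
	return table;
}

/* Uncharge the stat before freeing the page back to the allocator. */
static void sp_table_free(void *table)
{
	mod_lruvec_page_state(virt_to_page(table), NR_SECONDARY_PAGETABLE, -1);
	free_page((unsigned long)table);
}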