Message ID | 20240418030703.38628-1-lirongqing@baidu.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: SVM: Consider NUMA affinity when allocating per-CPU save_area |
+Tom, Mike, and Peter

On Thu, Apr 18, 2024, Li RongQing wrote:
> The save_area of each per-CPU svm_data is accessed predominantly from
> its own local CPU, so allocate it node-local for performance.
>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
>  arch/x86/kvm/svm/sev.c | 6 +++---
>  arch/x86/kvm/svm/svm.c | 2 +-
>  arch/x86/kvm/svm/svm.h | 6 +++++-
>  3 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 61a7531..cce8ec7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3179,13 +3179,13 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>  		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
>  }
>
> -struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
> +struct page *snp_safe_alloc_page_node(struct kvm_vcpu *vcpu, int node)
>  {
>  	unsigned long pfn;
>  	struct page *p;
>
>  	if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> -		return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +		return alloc_pages_node(node, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
>
>  	/*
>  	 * Allocate an SNP-safe page to workaround the SNP erratum where
> @@ -3196,7 +3196,7 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
>  	 * Allocate one extra page, choose a page which is not
>  	 * 2MB-aligned, and free the other.
>  	 */
> -	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> +	p = alloc_pages_node(node, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);

This made me realize the existing code is buggy.  The allocation for the per-CPU
save area shouldn't be accounted.

Also, what's the point of taking @vcpu?  It's a nice enough flag to say "this
should be accounted", but it's decidedly odd.

How about we fix both in a single series, and end up with this over 3-4 patches?
I.e. pass -1 where vcpu is non-NULL, and the current CPU for the save area.

	struct page *snp_safe_alloc_page(int cpu)
	{
		unsigned long pfn;
		struct page *p;
		gfp_t gfp;
		int node;

		if (cpu >= 0) {
			node = cpu_to_node(cpu);
			gfp = GFP_KERNEL;
		} else {
			node = NUMA_NO_NODE;
			gfp = GFP_KERNEL_ACCOUNT;
		}
		gfp |= __GFP_ZERO;

		if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
			return alloc_pages_node(node, gfp, 0);

		/*
		 * Allocate an SNP-safe page to workaround the SNP erratum where
		 * the CPU will incorrectly signal an RMP violation #PF if a
		 * hugepage (2MB or 1GB) collides with the RMP entry of a
		 * 2MB-aligned VMCB, VMSA, or AVIC backing page.
		 *
		 * Allocate one extra page, choose a page which is not
		 * 2MB-aligned, and free the other.
		 */
		p = alloc_pages_node(node, gfp, 1);
		if (!p)
			return NULL;

		split_page(p, 1);

		pfn = page_to_pfn(p);
		if (IS_ALIGNED(pfn, PTRS_PER_PMD))
			__free_page(p++);
		else
			__free_page(p + 1);

		return p;
	}
On Mon, Apr 22, 2024 at 6:29 PM Sean Christopherson <seanjc@google.com> wrote:
>
> +Tom, Mike, and Peter
>
> On Thu, Apr 18, 2024, Li RongQing wrote:
> > The save_area of each per-CPU svm_data is accessed predominantly from
> > its own local CPU, so allocate it node-local for performance.
> >
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
[...]
> This made me realize the existing code is buggy.  The allocation for the per-CPU
> save area shouldn't be accounted.
>
> Also, what's the point of taking @vcpu?  It's a nice enough flag to say "this
> should be accounted", but it's decidedly odd.
>
> How about we fix both in a single series, and end up with this over 3-4 patches?
> I.e. pass -1 where vcpu is non-NULL, and the current CPU for the save area.

Looks good to me. Internally we already use GFP_KERNEL for these
allocations. But we had an issue with split_page() and memcg
accounting internally. Yosry submitted the following:

+	if (memcg_kmem_charge(p, GFP_KERNEL_ACCOUNT, 0)) {
+		__free_page(p);
+		return NULL;
+	}

Not sure if this is an issue with our kernel or if we should use
split_page_memcg() here? It was suggested internally but we didn't
want to backport it.

[...]
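For context, a rough sketch of where that internal workaround sits in the
allocation path, assuming the older three-argument memcg_kmem_charge(page,
gfp, order) API that the snippet implies; this reconstructs the backported
pattern being described, not the upstream code:

	/*
	 * Allocate unaccounted, so the (then-buggy) split_page() has
	 * nothing to mis-account, then split and free the unwanted half.
	 */
	p = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 1);
	if (!p)
		return NULL;

	split_page(p, 1);

	pfn = page_to_pfn(p);
	if (IS_ALIGNED(pfn, PTRS_PER_PMD))
		__free_page(p++);
	else
		__free_page(p + 1);

	/* Only now charge the single page that is actually kept. */
	if (memcg_kmem_charge(p, GFP_KERNEL_ACCOUNT, 0)) {
		__free_page(p);
		return NULL;
	}

	return p;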
On Tue, Apr 23, 2024 at 6:30 AM Peter Gonda <pgonda@google.com> wrote:
>
> On Mon, Apr 22, 2024 at 6:29 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > +Tom, Mike, and Peter
[...]
> > This made me realize the existing code is buggy.  The allocation for the per-CPU
> > save area shouldn't be accounted.
> >
> > Also, what's the point of taking @vcpu?  It's a nice enough flag to say "this
> > should be accounted", but it's decidedly odd.
> >
> > How about we fix both in a single series, and end up with this over 3-4 patches?
> > I.e. pass -1 where vcpu is non-NULL, and the current CPU for the save area.
>
> Looks good to me. Internally we already use GFP_KERNEL for these
> allocations. But we had an issue with split_page() and memcg
> accounting internally. Yosry submitted the following:
>
> +	if (memcg_kmem_charge(p, GFP_KERNEL_ACCOUNT, 0)) {
> +		__free_page(p);
> +		return NULL;
> +	}
>
> Not sure if this is an issue with our kernel or if we should use
> split_page_memcg() here? It was suggested internally but we didn't
> want to backport it.

The referenced internal code is in an older kernel where split_page()
was buggy and did not handle memcg charging correctly (did not call
split_page_memcg()). So we needed to make the allocation with
GFP_KERNEL, then manually account the used page after splitting and
freeing the unneeded page.

AFAICT, this is not needed upstream and the current code is fine.

> > 	struct page *snp_safe_alloc_page(int cpu)
> > 	{
> > 		unsigned long pfn;
> > 		struct page *p;
> > 		gfp_t gfp;
> > 		int node;
> >
> > 		if (cpu >= 0) {
> > 			node = cpu_to_node(cpu);
> > 			gfp = GFP_KERNEL;
> > 		} else {
> > 			node = NUMA_NO_NODE;
> > 			gfp = GFP_KERNEL_ACCOUNT;
> > 		}

FWIW, from the pov of someone who has zero familiarity with this code,
passing @cpu only to make inferences about the GFP flags and numa
nodes is confusing tbh.

Would it be clearer to pass in the gfp flags and node id directly? I
think it would make it clearer why we choose to account the allocation
and/or specify a node in some cases but not others.

[...]
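To make the suggestion concrete, a minimal sketch of that alternative; the
signature is hypothetical and simply rearranges Sean's sketch so that the
caller supplies the gfp flags and node id directly:

	struct page *snp_safe_alloc_page_node(int node, gfp_t gfp)
	{
		unsigned long pfn;
		struct page *p;

		gfp |= __GFP_ZERO;

		if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
			return alloc_pages_node(node, gfp, 0);

		/*
		 * SNP erratum workaround: allocate order-1, split, and keep
		 * whichever page is not 2MB-aligned.
		 */
		p = alloc_pages_node(node, gfp, 1);
		if (!p)
			return NULL;

		split_page(p, 1);

		pfn = page_to_pfn(p);
		if (IS_ALIGNED(pfn, PTRS_PER_PMD))
			__free_page(p++);
		else
			__free_page(p + 1);

		return p;
	}

Callers would then state their intent directly, e.g. GFP_KERNEL with
cpu_to_node(cpu) for the per-CPU save area versus GFP_KERNEL_ACCOUNT for
vCPU-owned pages.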
On Tue, Apr 23, 2024, Yosry Ahmed wrote:
> On Tue, Apr 23, 2024 at 6:30 AM Peter Gonda <pgonda@google.com> wrote:
> > > 	struct page *snp_safe_alloc_page(int cpu)
> > > 	{
> > > 		unsigned long pfn;
> > > 		struct page *p;
> > > 		gfp_t gfp;
> > > 		int node;
> > >
> > > 		if (cpu >= 0) {
> > > 			node = cpu_to_node(cpu);
> > > 			gfp = GFP_KERNEL;
> > > 		} else {
> > > 			node = NUMA_NO_NODE;
> > > 			gfp = GFP_KERNEL_ACCOUNT;
> > > 		}
>
> FWIW, from the pov of someone who has zero familiarity with this code,
> passing @cpu only to make inferences about the GFP flags and numa
> nodes is confusing tbh.
>
> Would it be clearer to pass in the gfp flags and node id directly? I
> think it would make it clearer why we choose to account the allocation
> and/or specify a node in some cases but not others.

Hmm, yeah, passing GFP directly makes sense, if only to align with alloc_page()
and not reinvent the wheel.  But forcing all callers to explicitly provide a node
ID is a net negative, i.e. having snp_safe_alloc_page() and snp_safe_alloc_page_node()
as originally suggested makes sense.

But snp_safe_alloc_page() should again follow alloc_pages() and pass numa_node_id(),
not NUMA_NO_NODE.
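With the explicit helper sketched above, the no-node wrapper Sean describes
would be roughly the following (hypothetical, mirroring how alloc_pages()
falls back to the local node):

	static inline struct page *snp_safe_alloc_page(gfp_t gfp)
	{
		/*
		 * Mirror alloc_pages() exactly: resolve the local node via
		 * numa_node_id() here rather than passing NUMA_NO_NODE down.
		 */
		return snp_safe_alloc_page_node(numa_node_id(), gfp);
	}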
On Tue, Apr 23, 2024 at 11:43 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Apr 23, 2024, Yosry Ahmed wrote:
> > On Tue, Apr 23, 2024 at 6:30 AM Peter Gonda <pgonda@google.com> wrote:
> > > > 	struct page *snp_safe_alloc_page(int cpu)
[...]
> > FWIW, from the pov of someone who has zero familiarity with this code,
> > passing @cpu only to make inferences about the GFP flags and numa
> > nodes is confusing tbh.
> >
> > Would it be clearer to pass in the gfp flags and node id directly? I
> > think it would make it clearer why we choose to account the allocation
> > and/or specify a node in some cases but not others.
>
> Hmm, yeah, passing GFP directly makes sense, if only to align with alloc_page()
> and not reinvent the wheel.  But forcing all callers to explicitly provide a node
> ID is a net negative, i.e. having snp_safe_alloc_page() and snp_safe_alloc_page_node()
> as originally suggested makes sense.

Yeah if most callers do not need to provide a node then it makes sense
to have a specialized snp_safe_alloc_page_node() variant.

> But snp_safe_alloc_page() should again follow alloc_pages() and pass numa_node_id(),
> not NUMA_NO_NODE.

alloc_pages_node() will use numa_node_id() if you pass in NUMA_NO_NODE.
That's the documented behavior and it seems to be widely used. I don't
see anyone using numa_node_id() directly with alloc_pages_node().
On Tue, Apr 23, 2024, Yosry Ahmed wrote:
> On Tue, Apr 23, 2024 at 11:43 AM Sean Christopherson <seanjc@google.com> wrote:
> > But snp_safe_alloc_page() should again follow alloc_pages() and pass numa_node_id(),
> > not NUMA_NO_NODE.
>
> alloc_pages_node() will use numa_node_id() if you pass in NUMA_NO_NODE.

The code uses numa_mem_id():

	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

which is presumably the exact same thing as numa_node_id() on x86.  But I don't
want to have to think that hard :-)

In other words, all I'm saying is that if we want to mirror alloc_pages() and
alloc_pages_node(), then we should mirror them exactly.

> That's the documented behavior and it seems to be widely
> used. I don't see anyone using numa_node_id() directly with
> alloc_pages_node().
On Tue, Apr 23, 2024 at 12:00 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Apr 23, 2024, Yosry Ahmed wrote:
> > On Tue, Apr 23, 2024 at 11:43 AM Sean Christopherson <seanjc@google.com> wrote:
> > > But snp_safe_alloc_page() should again follow alloc_pages() and pass numa_node_id(),
> > > not NUMA_NO_NODE.
> >
> > alloc_pages_node() will use numa_node_id() if you pass in NUMA_NO_NODE.
>
> The code uses numa_mem_id():
>
> 	if (nid == NUMA_NO_NODE)
> 		nid = numa_mem_id();
>
> which is presumably the exact same thing as numa_node_id() on x86.  But I don't
> want to have to think that hard :-)

Uh yes, I missed numa_node_id() vs numa_mem_id(). Anyway, using
NUMA_NO_NODE with alloc_pages_node() is intended as an abstraction
such that you don't need to think about it :P

> In other words, all I'm saying is that if we want to mirror alloc_pages() and
> alloc_pages_node(), then we should mirror them exactly.

It's different though, these functions are core MM APIs used by the
entire kernel. snp_safe_alloc_page() is just a helper here wrapping
those core MM APIs rather than mirroring them.

In this case snp_safe_alloc_page() would wrap
snp_safe_alloc_page_node() which would wrap alloc_pages_node(). So it
should use alloc_pages_node() as intended: pass in a node id, or
NUMA_NO_NODE if you just want the closest node.

Just my 2c, not that it will make a difference in practice.
On 4/23/24 14:08, Yosry Ahmed wrote:
> On Tue, Apr 23, 2024 at 12:00 PM Sean Christopherson <seanjc@google.com> wrote:
>> On Tue, Apr 23, 2024, Yosry Ahmed wrote:
[...]
>> In other words, all I'm saying is that if we want to mirror alloc_pages() and
>> alloc_pages_node(), then we should mirror them exactly.
>
> It's different though, these functions are core MM APIs used by the
> entire kernel. snp_safe_alloc_page() is just a helper here wrapping
> those core MM APIs rather than mirroring them.
>
> In this case snp_safe_alloc_page() would wrap
> snp_safe_alloc_page_node() which would wrap alloc_pages_node(). So it
> should use alloc_pages_node() as intended: pass in a node id, or
> NUMA_NO_NODE if you just want the closest node.
>
> Just my 2c, not that it will make a difference in practice.

Pretty much agree with everything everyone has said. The per-CPU
allocation shouldn't be GFP_KERNEL_ACCOUNT, and having a
snp_safe_alloc_page() that wraps snp_safe_alloc_page_node(), which then
calls alloc_pages_node(), seems the best way to me.

Thanks,
Tom
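Under the shape Tom endorses, and assuming the helpers sketched earlier in
the thread, the two kinds of call sites would end up looking roughly like
this (illustrative only, not the merged patch):

	/* Per-CPU save area in svm_cpu_init(): node-local, unaccounted. */
	sd->save_area = snp_safe_alloc_page_node(cpu_to_node(cpu), GFP_KERNEL);

	/* vCPU-owned pages (e.g. the VMSA): accounted, local node. */
	p = snp_safe_alloc_page(GFP_KERNEL_ACCOUNT);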
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 61a7531..cce8ec7 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3179,13 +3179,13 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 		ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
 }
 
-struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
+struct page *snp_safe_alloc_page_node(struct kvm_vcpu *vcpu, int node)
 {
 	unsigned long pfn;
 	struct page *p;
 
 	if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
-		return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+		return alloc_pages_node(node, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
 
 	/*
 	 * Allocate an SNP-safe page to workaround the SNP erratum where
@@ -3196,7 +3196,7 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 	 * Allocate one extra page, choose a page which is not
 	 * 2MB-aligned, and free the other.
 	 */
-	p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
+	p = alloc_pages_node(node, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
 	if (!p)
 		return NULL;
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d1a9f995..69fc809 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -703,7 +703,7 @@ static int svm_cpu_init(int cpu)
 	int ret = -ENOMEM;
 
 	memset(sd, 0, sizeof(struct svm_cpu_data));
-	sd->save_area = snp_safe_alloc_page(NULL);
+	sd->save_area = snp_safe_alloc_page_node(NULL, cpu_to_node(cpu));
 	if (!sd->save_area)
 		return ret;
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7f1fbd8..3bbf638 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -694,8 +694,12 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm);
 void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
-struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+struct page *snp_safe_alloc_page_node(struct kvm_vcpu *vcpu, int node);
 
+static inline struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
+{
+	return snp_safe_alloc_page_node(vcpu, NUMA_NO_NODE);
+}
 /* vmenter.S */
 
 void __svm_sev_es_vcpu_run(struct vcpu_svm *svm, bool spec_ctrl_intercepted);
The save_area of each per-CPU svm_data is accessed predominantly from
its own local CPU, so allocate it node-local for performance.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 arch/x86/kvm/svm/sev.c | 6 +++---
 arch/x86/kvm/svm/svm.c | 2 +-
 arch/x86/kvm/svm/svm.h | 6 +++++-
 3 files changed, 9 insertions(+), 5 deletions(-)