
KVM: arm: memcg awareness

Message ID 1615959984-7122-1-git-send-email-wanpengli@tencent.com (mailing list archive)
State New, archived
Series KVM: arm: memcg awareness

Commit Message

Wanpeng Li March 17, 2021, 5:46 a.m. UTC
From: Wanpeng Li <wanpengli@tencent.com>

KVM allocations in the arm64 KVM code which are tied to the lifetime
of the VM process should be charged to the VM process's cgroup.
This will help the memcg controller make the right decisions.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/arm64/kvm/arm.c               |  5 +++--
 arch/arm64/kvm/hyp/pgtable.c       |  4 ++--
 arch/arm64/kvm/mmu.c               |  4 ++--
 arch/arm64/kvm/pmu-emul.c          |  2 +-
 arch/arm64/kvm/reset.c             |  2 +-
 arch/arm64/kvm/vgic/vgic-debug.c   |  2 +-
 arch/arm64/kvm/vgic/vgic-init.c    |  2 +-
 arch/arm64/kvm/vgic/vgic-irqfd.c   |  2 +-
 arch/arm64/kvm/vgic/vgic-its.c     | 14 +++++++-------
 arch/arm64/kvm/vgic/vgic-mmio-v3.c |  2 +-
 arch/arm64/kvm/vgic/vgic-v4.c      |  2 +-
 11 files changed, 21 insertions(+), 20 deletions(-)
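
For reference, GFP_KERNEL_ACCOUNT is simply (GFP_KERNEL | __GFP_ACCOUNT): the
allocation is charged to the memory cgroup of the task performing it (here,
the VMM process issuing the KVM ioctl), and the charge is dropped when the
allocation is freed. A minimal sketch of the pattern applied throughout the
series (the helper name is hypothetical, not part of the patch):

/* Sketch only, not part of the patch. */
#include <linux/gfp.h>
#include <linux/slab.h>

/* Hypothetical helper for allocations tied to the VM's lifetime. */
static void *kvm_vm_scoped_zalloc(size_t size)
{
	/* __GFP_ACCOUNT charges the memory to current's memcg. */
	return kzalloc(size, GFP_KERNEL_ACCOUNT);
}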

Comments

Michal Hocko March 17, 2021, 7:57 a.m. UTC | #1
On Wed 17-03-21 13:46:24, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> KVM allocations in the arm kvm code which are tied to the life 
> of the VM process should be charged to the VM process's cgroup.

How much memory are we talking about?

> This will help the memcg controler to do the right decisions.

This is a bit vague. What is the right decision? AFAICS none of that
memory is considered during oom victim selection. The only thing the memcg
controller can help with is to contain and account this additional
memory. This might help to better isolate multiple workloads on the same
system. Maybe this is what you wanted to say? Or maybe this is a way to
prevent untrusted users from consuming a lot of memory?

[...]
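
One concrete way to observe the containment/accounting effect described above
is to compare the VMM cgroup's memory.current (cgroup v2) before and after a
VM and its vGIC are created. A userspace sketch, using a hypothetical cgroup
name:

/* Userspace sketch; "vm1" is a hypothetical cgroup the VMM runs in. */
#include <stdio.h>

static long read_memory_current(const char *cgroup)
{
	char path[256];
	long bytes = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.current", cgroup);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &bytes) != 1)
		bytes = -1;
	fclose(f);
	return bytes;
}

int main(void)
{
	/* Sample before and after the KVM_CREATE_VM / vGIC creation ioctls. */
	printf("charged: %ld bytes\n", read_memory_current("vm1"));
	return 0;
}
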
Wanpeng Li March 17, 2021, 8:04 a.m. UTC | #2
On Wed, 17 Mar 2021 at 15:57, Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 17-03-21 13:46:24, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > KVM allocations in the arm kvm code which are tied to the life
> > of the VM process should be charged to the VM process's cgroup.
>
> How much memory are we talking about?
>
> > This will help the memcg controler to do the right decisions.
>
> This is a bit vague. What is the right decision? AFAICS none of that
> memory is considered during oom victim selection. The only thing memcg
> controler can help with is to contain and account this additional
> memory. This might help to better isolate multiple workloads on the same
> system. Maybe this is what you wanted to say? Or maybe this is a way to
> prevent untrusted users from consuming a lot of memory?

It is explained in the patchset for x86 KVM that is already upstream; I
don't think I need to copy and paste it. :)

    Wanpeng
Wanpeng Li March 17, 2021, 8:04 a.m. UTC | #3
On Wed, 17 Mar 2021 at 16:04, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> On Wed, 17 Mar 2021 at 15:57, Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 17-03-21 13:46:24, Wanpeng Li wrote:
> > > From: Wanpeng Li <wanpengli@tencent.com>
> > >
> > > KVM allocations in the arm kvm code which are tied to the life
> > > of the VM process should be charged to the VM process's cgroup.
> >
> > How much memory are we talking about?
> >
> > > This will help the memcg controler to do the right decisions.
> >
> > This is a bit vague. What is the right decision? AFAICS none of that
> > memory is considered during oom victim selection. The only thing memcg
> > controler can help with is to contain and account this additional
> > memory. This might help to better isolate multiple workloads on the same
> > system. Maybe this is what you wanted to say? Or maybe this is a way to
> > prevent untrusted users from consuming a lot of memory?
>

https://patchwork.kernel.org/project/kvm/patch/20190211190252.198101-1-bgardon@google.com/

> It is explained in this patchset for x86 kvm which is upstream, I
> think I don't need to copy and paste. :)
>
>     Wanpeng
Marc Zyngier March 17, 2021, 9:10 a.m. UTC | #4
On Wed, 17 Mar 2021 05:46:24 +0000,
Wanpeng Li <kernellwp@gmail.com> wrote:
> 
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> KVM allocations in the arm kvm code which are tied to the life 
> of the VM process should be charged to the VM process's cgroup.
> This will help the memcg controler to do the right decisions.
> 
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
>  arch/arm64/kvm/arm.c               |  5 +++--
>  arch/arm64/kvm/hyp/pgtable.c       |  4 ++--
>  arch/arm64/kvm/mmu.c               |  4 ++--
>  arch/arm64/kvm/pmu-emul.c          |  2 +-
>  arch/arm64/kvm/reset.c             |  2 +-
>  arch/arm64/kvm/vgic/vgic-debug.c   |  2 +-
>  arch/arm64/kvm/vgic/vgic-init.c    |  2 +-
>  arch/arm64/kvm/vgic/vgic-irqfd.c   |  2 +-
>  arch/arm64/kvm/vgic/vgic-its.c     | 14 +++++++-------
>  arch/arm64/kvm/vgic/vgic-mmio-v3.c |  2 +-
>  arch/arm64/kvm/vgic/vgic-v4.c      |  2 +-
>  11 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 7f06ba7..8040874 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -278,9 +278,10 @@ long kvm_arch_dev_ioctl(struct file *filp,
>  struct kvm *kvm_arch_alloc_vm(void)
>  {
>  	if (!has_vhe())
> -		return kzalloc(sizeof(struct kvm), GFP_KERNEL);
> +		return kzalloc(sizeof(struct kvm), GFP_KERNEL_ACCOUNT);
>  
> -	return vzalloc(sizeof(struct kvm));
> +	return __vmalloc(sizeof(struct kvm),
> +			GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  }
>  
>  void kvm_arch_free_vm(struct kvm *kvm)
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 926fc07..a0845d3 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -366,7 +366,7 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>  	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
>  		return -EINVAL;
>  
> -	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
> +	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);

No, this is wrong.

You cannot account the hypervisor page tables to the guest because we
don't ever unmap them, and because we can't distinguish two data
structures from two different VMs occupying the same page.

>  	if (!childp)
>  		return -ENOMEM;
>  
> @@ -401,7 +401,7 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
>  {
>  	u64 levels = ARM64_HW_PGTABLE_LEVELS(va_bits);
>  
> -	pgt->pgd = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
> +	pgt->pgd = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);

There is no VM in this context. There isn't even any userspace
whatsoever in the system when this is called.

[...]

> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
> index 66508b0..a80cc37 100644
> --- a/arch/arm64/kvm/vgic/vgic-v4.c
> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
> @@ -227,7 +227,7 @@ int vgic_v4_init(struct kvm *kvm)
>  	nr_vcpus = atomic_read(&kvm->online_vcpus);
>  
>  	dist->its_vm.vpes = kcalloc(nr_vcpus, sizeof(*dist->its_vm.vpes),
> -				    GFP_KERNEL);
> +				    GFP_KERNEL_ACCOUNT);

And now for the elephant in the room: what do you do for the GICv4 VPTs
that are allocated for each vPE?

	M.
Marc Zyngier March 17, 2021, 9:15 a.m. UTC | #5
On Wed, 17 Mar 2021 08:04:20 +0000,
Wanpeng Li <kernellwp@gmail.com> wrote:
> 
> On Wed, 17 Mar 2021 at 15:57, Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 17-03-21 13:46:24, Wanpeng Li wrote:
> > > From: Wanpeng Li <wanpengli@tencent.com>
> > >
> > > KVM allocations in the arm kvm code which are tied to the life
> > > of the VM process should be charged to the VM process's cgroup.
> >
> > How much memory are we talking about?
> >
> > > This will help the memcg controler to do the right decisions.
> >
> > This is a bit vague. What is the right decision? AFAICS none of that
> > memory is considered during oom victim selection. The only thing memcg
> > controler can help with is to contain and account this additional
> > memory. This might help to better isolate multiple workloads on the same
> > system. Maybe this is what you wanted to say? Or maybe this is a way to
> > prevent untrusted users from consuming a lot of memory?
> 
> It is explained in this patchset for x86 kvm which is upstream, I
> think I don't need to copy and paste. :)

You expect us to review your patches, don't you?

Surely you can afford to explain why I should consider it at all. And
if you cannot be bothered, maybe I shouldn't either.

	M.
Michal Hocko March 17, 2021, 9:44 a.m. UTC | #6
On Wed 17-03-21 16:04:51, Wanpeng Li wrote:
> On Wed, 17 Mar 2021 at 16:04, Wanpeng Li <kernellwp@gmail.com> wrote:
> >
> > On Wed, 17 Mar 2021 at 15:57, Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 17-03-21 13:46:24, Wanpeng Li wrote:
> > > > From: Wanpeng Li <wanpengli@tencent.com>
> > > >
> > > > KVM allocations in the arm kvm code which are tied to the life
> > > > of the VM process should be charged to the VM process's cgroup.
> > >
> > > How much memory are we talking about?
> > >
> > > > This will help the memcg controler to do the right decisions.
> > >
> > > This is a bit vague. What is the right decision? AFAICS none of that
> > > memory is considered during oom victim selection. The only thing memcg
> > > controler can help with is to contain and account this additional
> > > memory. This might help to better isolate multiple workloads on the same
> > > system. Maybe this is what you wanted to say? Or maybe this is a way to
> > > prevent untrusted users from consuming a lot of memory?
> >
> 
> https://patchwork.kernel.org/project/kvm/patch/20190211190252.198101-1-bgardon@google.com/
> 
> > It is explained in this patchset for x86 kvm which is upstream, I
> > think I don't need to copy and paste. :)

How is one supposed to know that? If you want to spare some typing, you
could have referenced commit 4183683918ef ("kvm: vmx: Add memcg accounting
to KVM allocations").

Btw. that explanation is rather vague as well. It doesn't answer any of
my questions above. It is not my place to judge whether these points are
important to the respective maintainers; I just want to point out that
once somebody revisits this code and tries to find out why the accounting
was added, the answer will be far from clear, because "memcg doing the
right thing" doesn't say much in itself.
Paolo Bonzini March 17, 2021, 10:40 a.m. UTC | #7
On 17/03/21 10:10, Marc Zyngier wrote:
>> @@ -366,7 +366,7 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>   	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
>>   		return -EINVAL;
>>   
>> -	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
>> +	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> No, this is wrong.
> 
> You cannot account the hypervisor page tables to the guest because we
> don't ever unmap them, and that we can't distinguish two data
> structures from two different VMs occupying the same page.

If you never unmap them, there should at least be a shrinker to get rid 
of unused pages in the event of memory pressure.

Paolo
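
For context, the shrinker interface being referred to looks roughly like the
stub below. This is a sketch only: the names are hypothetical, the callbacks
free nothing, and, as the reply that follows explains, the hyp page tables
cannot currently be reclaimed this way.

/* Stub sketch of a shrinker; names are hypothetical. */
#include <linux/shrinker.h>

static unsigned long hyp_pgtable_count(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	/* Would return the number of freeable page-table pages. */
	return 0;
}

static unsigned long hyp_pgtable_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	/* Would free up to sc->nr_to_scan objects and return how many. */
	return SHRINK_STOP;
}

static struct shrinker hyp_pgtable_shrinker = {
	.count_objects = hyp_pgtable_count,
	.scan_objects  = hyp_pgtable_scan,
	.seeks         = DEFAULT_SEEKS,
};

/* register_shrinker(&hyp_pgtable_shrinker) would hook this into reclaim. */
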
Marc Zyngier March 17, 2021, 10:53 a.m. UTC | #8
On Wed, 17 Mar 2021 10:40:23 +0000,
Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 17/03/21 10:10, Marc Zyngier wrote:
> >> @@ -366,7 +366,7 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> >>   	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
> >>   		return -EINVAL;
> >>   -	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
> >> +	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> > No, this is wrong.
> > 
> > You cannot account the hypervisor page tables to the guest because we
> > don't ever unmap them, and that we can't distinguish two data
> > structures from two different VMs occupying the same page.
> 
> If you never unmap them, there should at least be a shrinker to get
> rid of unused pages in the event of memory pressure.

We don't track where these pages are coming from or whether they can
safely be unmapped. Until we can track such ownership and deal with
page sharing, these mappings have to stay.

At most, this represents the amount of memory required to map the whole
of the linear mapping.

	M.
Paolo Bonzini March 17, 2021, 10:55 a.m. UTC | #9
On 17/03/21 11:53, Marc Zyngier wrote:
> On Wed, 17 Mar 2021 10:40:23 +0000,
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 17/03/21 10:10, Marc Zyngier wrote:
>>>> @@ -366,7 +366,7 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>    	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
>>>>    		return -EINVAL;
>>>>    -	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
>>>> +	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
>>> No, this is wrong.
>>>
>>> You cannot account the hypervisor page tables to the guest because we
>>> don't ever unmap them, and that we can't distinguish two data
>>> structures from two different VMs occupying the same page.
>>
>> If you never unmap them, there should at least be a shrinker to get
>> rid of unused pages in the event of memory pressure.
> 
> We don't track where these pages are coming from or whether they can
> safely be unmapped. Until we can track such ownership and deal with
> page sharing, these mappings have to stay,
> 
> At most, this represent the amount of memory required to map the whole
> of the linear mapping.

Ah, these are the EL2 pages, not the stage2 page tables, right?  If so, 
sorry for the noise.

Paolo
Marc Zyngier March 17, 2021, 11:03 a.m. UTC | #10
On Wed, 17 Mar 2021 10:55:00 +0000,
Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 17/03/21 11:53, Marc Zyngier wrote:
> > On Wed, 17 Mar 2021 10:40:23 +0000,
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> 
> >> On 17/03/21 10:10, Marc Zyngier wrote:
> >>>> @@ -366,7 +366,7 @@ static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> >>>>    	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
> >>>>    		return -EINVAL;
> >>>>    -	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
> >>>> +	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
> >>> No, this is wrong.
> >>> 
> >>> You cannot account the hypervisor page tables to the guest because we
> >>> don't ever unmap them, and that we can't distinguish two data
> >>> structures from two different VMs occupying the same page.
> >> 
> >> If you never unmap them, there should at least be a shrinker to get
> >> rid of unused pages in the event of memory pressure.
> > 
> > We don't track where these pages are coming from or whether they can
> > safely be unmapped. Until we can track such ownership and deal with
> > page sharing, these mappings have to stay,
> > 
> > At most, this represent the amount of memory required to map the whole
> > of the linear mapping.
> 
> Ah, these are the EL2 pages, not the stage2 page tables, right?  If
> so, sorry for the noise.

Yes, EL2 page tables when running non-VHE. VHE doesn't have that
problem for obvious reasons. Stage-2 page tables can be completely
discarded at any point, and the MMU notifiers already deal with that.

	M.

Patch

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 7f06ba7..8040874 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -278,9 +278,10 @@  long kvm_arch_dev_ioctl(struct file *filp,
 struct kvm *kvm_arch_alloc_vm(void)
 {
 	if (!has_vhe())
-		return kzalloc(sizeof(struct kvm), GFP_KERNEL);
+		return kzalloc(sizeof(struct kvm), GFP_KERNEL_ACCOUNT);
 
-	return vzalloc(sizeof(struct kvm));
+	return __vmalloc(sizeof(struct kvm),
+			GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 }
 
 void kvm_arch_free_vm(struct kvm *kvm)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 926fc07..a0845d3 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -366,7 +366,7 @@  static int hyp_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
 	if (WARN_ON(level == KVM_PGTABLE_MAX_LEVELS - 1))
 		return -EINVAL;
 
-	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
+	childp = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 	if (!childp)
 		return -ENOMEM;
 
@@ -401,7 +401,7 @@  int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits)
 {
 	u64 levels = ARM64_HW_PGTABLE_LEVELS(va_bits);
 
-	pgt->pgd = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL);
+	pgt->pgd = (kvm_pte_t *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 	if (!pgt->pgd)
 		return -ENOMEM;
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894..8c9dc49 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -370,7 +370,7 @@  int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu)
 		return -EINVAL;
 	}
 
-	pgt = kzalloc(sizeof(*pgt), GFP_KERNEL);
+	pgt = kzalloc(sizeof(*pgt), GFP_KERNEL_ACCOUNT);
 	if (!pgt)
 		return -ENOMEM;
 
@@ -1244,7 +1244,7 @@  int kvm_mmu_init(void)
 		goto out;
 	}
 
-	hyp_pgtable = kzalloc(sizeof(*hyp_pgtable), GFP_KERNEL);
+	hyp_pgtable = kzalloc(sizeof(*hyp_pgtable), GFP_KERNEL_ACCOUNT);
 	if (!hyp_pgtable) {
 		kvm_err("Hyp mode page-table not allocated\n");
 		err = -ENOMEM;
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index e32c6e1..00cf750 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -967,7 +967,7 @@  int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 		mutex_lock(&vcpu->kvm->lock);
 
 		if (!vcpu->kvm->arch.pmu_filter) {
-			vcpu->kvm->arch.pmu_filter = bitmap_alloc(nr_events, GFP_KERNEL);
+			vcpu->kvm->arch.pmu_filter = bitmap_alloc(nr_events, GFP_KERNEL_ACCOUNT);
 			if (!vcpu->kvm->arch.pmu_filter) {
 				mutex_unlock(&vcpu->kvm->lock);
 				return -ENOMEM;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index bd354cd..3cbcf6b 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -110,7 +110,7 @@  static int kvm_vcpu_finalize_sve(struct kvm_vcpu *vcpu)
 		    vl > SVE_VL_ARCH_MAX))
 		return -EIO;
 
-	buf = kzalloc(SVE_SIG_REGS_SIZE(sve_vq_from_vl(vl)), GFP_KERNEL);
+	buf = kzalloc(SVE_SIG_REGS_SIZE(sve_vq_from_vl(vl)), GFP_KERNEL_ACCOUNT);
 	if (!buf)
 		return -ENOMEM;
 
diff --git a/arch/arm64/kvm/vgic/vgic-debug.c b/arch/arm64/kvm/vgic/vgic-debug.c
index f38c40a..e6a01f2 100644
--- a/arch/arm64/kvm/vgic/vgic-debug.c
+++ b/arch/arm64/kvm/vgic/vgic-debug.c
@@ -92,7 +92,7 @@  static void *vgic_debug_start(struct seq_file *s, loff_t *pos)
 		goto out;
 	}
 
-	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL_ACCOUNT);
 	if (!iter) {
 		iter = ERR_PTR(-ENOMEM);
 		goto out;
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 052917d..27d1513 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -134,7 +134,7 @@  static int kvm_vgic_dist_init(struct kvm *kvm, unsigned int nr_spis)
 	struct kvm_vcpu *vcpu0 = kvm_get_vcpu(kvm, 0);
 	int i;
 
-	dist->spis = kcalloc(nr_spis, sizeof(struct vgic_irq), GFP_KERNEL);
+	dist->spis = kcalloc(nr_spis, sizeof(struct vgic_irq), GFP_KERNEL_ACCOUNT);
 	if (!dist->spis)
 		return  -ENOMEM;
 
diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c
index 79f8899..475059b 100644
--- a/arch/arm64/kvm/vgic/vgic-irqfd.c
+++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
@@ -139,7 +139,7 @@  int kvm_vgic_setup_default_irq_routing(struct kvm *kvm)
 	u32 nr = dist->nr_spis;
 	int i, ret;
 
-	entries = kcalloc(nr, sizeof(*entries), GFP_KERNEL);
+	entries = kcalloc(nr, sizeof(*entries), GFP_KERNEL_ACCOUNT);
 	if (!entries)
 		return -ENOMEM;
 
diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index 40cbaca..bd90730 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -48,7 +48,7 @@  static struct vgic_irq *vgic_add_lpi(struct kvm *kvm, u32 intid,
 	if (irq)
 		return irq;
 
-	irq = kzalloc(sizeof(struct vgic_irq), GFP_KERNEL);
+	irq = kzalloc(sizeof(struct vgic_irq), GFP_KERNEL_ACCOUNT);
 	if (!irq)
 		return ERR_PTR(-ENOMEM);
 
@@ -332,7 +332,7 @@  int vgic_copy_lpi_list(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 **intid_ptr)
 	 * we must be careful not to overrun the array.
 	 */
 	irq_count = READ_ONCE(dist->lpi_list_count);
-	intids = kmalloc_array(irq_count, sizeof(intids[0]), GFP_KERNEL);
+	intids = kmalloc_array(irq_count, sizeof(intids[0]), GFP_KERNEL_ACCOUNT);
 	if (!intids)
 		return -ENOMEM;
 
@@ -985,7 +985,7 @@  static int vgic_its_alloc_collection(struct vgic_its *its,
 	if (!vgic_its_check_id(its, its->baser_coll_table, coll_id, NULL))
 		return E_ITS_MAPC_COLLECTION_OOR;
 
-	collection = kzalloc(sizeof(*collection), GFP_KERNEL);
+	collection = kzalloc(sizeof(*collection), GFP_KERNEL_ACCOUNT);
 	if (!collection)
 		return -ENOMEM;
 
@@ -1029,7 +1029,7 @@  static struct its_ite *vgic_its_alloc_ite(struct its_device *device,
 {
 	struct its_ite *ite;
 
-	ite = kzalloc(sizeof(*ite), GFP_KERNEL);
+	ite = kzalloc(sizeof(*ite), GFP_KERNEL_ACCOUNT);
 	if (!ite)
 		return ERR_PTR(-ENOMEM);
 
@@ -1150,7 +1150,7 @@  static struct its_device *vgic_its_alloc_device(struct vgic_its *its,
 {
 	struct its_device *device;
 
-	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	device = kzalloc(sizeof(*device), GFP_KERNEL_ACCOUNT);
 	if (!device)
 		return ERR_PTR(-ENOMEM);
 
@@ -1847,7 +1847,7 @@  void vgic_lpi_translation_cache_init(struct kvm *kvm)
 		struct vgic_translation_cache_entry *cte;
 
 		/* An allocation failure is not fatal */
-		cte = kzalloc(sizeof(*cte), GFP_KERNEL);
+		cte = kzalloc(sizeof(*cte), GFP_KERNEL_ACCOUNT);
 		if (WARN_ON(!cte))
 			break;
 
@@ -1888,7 +1888,7 @@  static int vgic_its_create(struct kvm_device *dev, u32 type)
 	if (type != KVM_DEV_TYPE_ARM_VGIC_ITS)
 		return -ENODEV;
 
-	its = kzalloc(sizeof(struct vgic_its), GFP_KERNEL);
+	its = kzalloc(sizeof(struct vgic_its), GFP_KERNEL_ACCOUNT);
 	if (!its)
 		return -ENOMEM;
 
diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
index 15a6c98..22ab4ba 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
@@ -826,7 +826,7 @@  static int vgic_v3_insert_redist_region(struct kvm *kvm, uint32_t index,
 	if (vgic_v3_rdist_overlap(kvm, base, size))
 		return -EINVAL;
 
-	rdreg = kzalloc(sizeof(*rdreg), GFP_KERNEL);
+	rdreg = kzalloc(sizeof(*rdreg), GFP_KERNEL_ACCOUNT);
 	if (!rdreg)
 		return -ENOMEM;
 
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index 66508b0..a80cc37 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -227,7 +227,7 @@  int vgic_v4_init(struct kvm *kvm)
 	nr_vcpus = atomic_read(&kvm->online_vcpus);
 
 	dist->its_vm.vpes = kcalloc(nr_vcpus, sizeof(*dist->its_vm.vpes),
-				    GFP_KERNEL);
+				    GFP_KERNEL_ACCOUNT);
 	if (!dist->its_vm.vpes)
 		return -ENOMEM;