Message ID: 20181026075900.111462-1-marcorr@google.com (mailing list archive)
Series: use vmalloc to allocate vmx vcpus
On Fri, Oct 26, 2018 at 12:58:58AM -0700, Marc Orr wrote:
> A couple of patches to allocate vmx vcpus with vmalloc instead of
> kalloc, which enables vcpu allocation to succeed when contiguous
> physical memory is sparse.

A question that may have been asked already, but if so I didn't see it ...
does kvm_vcpu need to be so damn big?  It's 22kB with the random .config
I happen to have (which gets rounded to 32kB, an order-3 allocation).
If we can knock 6kB off it (e.g. by allocating pieces separately), it
becomes an order-2 allocation.  So, biggest chunks:

	struct kvm_vcpu_arch       arch;                 /*   576 21568 */

	struct kvm_mmu             mmu;                  /*   336   400 */
	struct kvm_mmu             nested_mmu;           /*   736   400 */
	struct fpu                 user_fpu;             /*  2176  4160 */
	struct fpu                 guest_fpu;            /*  6336  4160 */
	struct kvm_cpuid_entry2    cpuid_entries[80];    /* 10580  3200 */
	struct x86_emulate_ctxt    emulate_ctxt;         /* 13792  2656 */
	struct kvm_pmu             pmu;                  /* 17344  1520 */
	struct kvm_vcpu_hv         hyperv;               /* 18872  1872 */
	gfn_t                      gfns[64];             /* 20832   512 */

That's about 19kB of the 22kB right there.  Can any of them be shrunk
in size or allocated separately?
On Fri, Oct 26, 2018 at 05:29:48AM -0700, Matthew Wilcox wrote:
> A question that may have been asked already, but if so I didn't see it ...
> does kvm_vcpu need to be so damn big?  It's 22kB with the random .config
> I happen to have (which gets rounded to 32kB, an order-3 allocation).
> If we can knock 6kB off it, it becomes an order-2 allocation.  So,
> biggest chunks:
>
> 	struct kvm_vcpu_arch       arch;                 /*   576 21568 */
>
> 	struct kvm_mmu             mmu;                  /*   336   400 */
> 	struct kvm_mmu             nested_mmu;           /*   736   400 */
> 	struct fpu                 user_fpu;             /*  2176  4160 */
> 	struct fpu                 guest_fpu;            /*  6336  4160 */
> 	struct kvm_cpuid_entry2    cpuid_entries[80];    /* 10580  3200 */
> 	struct x86_emulate_ctxt    emulate_ctxt;         /* 13792  2656 */
> 	struct kvm_pmu             pmu;                  /* 17344  1520 */
> 	struct kvm_vcpu_hv         hyperv;               /* 18872  1872 */
> 	gfn_t                      gfns[64];             /* 20832   512 */
>
> That's about 19kB of the 22kB right there.  Can any of them be shrunk
> in size or allocated separately?

I just had a fun conversation with Dave Hansen (cc'd), in which he
indicated that the fpu should definitely be dynamically allocated.  See
also his commits from 2015 around XSAVE:

0c8c0f03e3a292e031596484275c14cf39c0ab7a
4109ca066b6b899ac7549bf3aac94b178ac95891
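An aside on the order arithmetic in the messages above: the buddy
allocator rounds each allocation up to a power-of-two number of pages.
A minimal userspace sketch of that arithmetic, assuming the x86-64
default 4 KiB page size (this is not the kernel's get_order()
implementation, just the same math):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the kernel's get_order() arithmetic: the buddy
 * allocator hands out blocks of 2^order pages, so a ~22kB kvm_vcpu
 * costs an order-3 (32kB) block, while anything at or under 16kB fits
 * in an order-2 block. PAGE_SIZE of 4096 is an assumption (the x86-64
 * default). */
#define PAGE_SIZE 4096UL

static unsigned int alloc_order(unsigned long size)
{
	/* Round up to whole pages... */
	unsigned long pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
	unsigned int order = 0;

	/* ...then to the next power of two of pages. */
	while ((1UL << order) < pages)
		order++;
	return order;
}
```

With the numbers in this thread, 22kB lands at order 3 (8 pages, 32kB),
so shaving ~6kB off the struct brings it down to order 2 (4 pages, 16kB).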
On 10/26/18 7:45 AM, Matthew Wilcox wrote:
> 	struct fpu                 user_fpu;             /*  2176  4160 */
> 	struct fpu                 guest_fpu;            /*  6336  4160 */

Those are *not* supposed to be embedded in any other structures.  My bad
for not documenting this better.

It also seems really goofy that we need an xsave buffer in the
task_struct for user fpu state, then another in the vcpu.  Isn't one for
user state enough?

In any case, I'd suggest getting rid of 'user_fpu', then either moving
'guest_fpu' to the bottom of the structure, or just making it a 'struct
fpu *' and dynamically allocating it separately.

To do this, I'd take fpu__init_task_struct_size(), and break it apart a
bit to tell you the size of the 'struct fpu' separately from the size of
the 'task struct'.
On Fri, 26 Oct 2018 at 15:59, Marc Orr <marcorr@google.com> wrote:
>
> A couple of patches to allocate vmx vcpus with vmalloc instead of
> kalloc, which enables vcpu allocation to succeed when contiguous
> physical memory is sparse.

We have not yet encountered memory so fragmented that allocating
kvm-related metadata fails in our overcommit pools. Is this a real
requirement from production environments?

Regards,
Wanpeng Li

>
> Compared to the last version of these patches, this version:
> 1. Splits out the refactoring of the vmx_msrs struct into its own
>    patch, as suggested by Sean Christopherson <sean.j.christopherson@intel.com>.
> 2. Leverages the __vmalloc() API rather than introducing a new vzalloc()
>    API, as suggested by Michal Hocko <mhocko@kernel.org>.
>
> Marc Orr (2):
>   kvm: vmx: refactor vmx_msrs struct for vmalloc
>   kvm: vmx: use vmalloc() to allocate vcpus
>
>  arch/x86/kvm/vmx.c  | 92 +++++++++++++++++++++++++++++++++++++++++----
>  virt/kvm/kvm_main.c | 28 ++++++++------
>  2 files changed, 100 insertions(+), 20 deletions(-)
>
> --
> 2.19.1.568.g152ad8e336-goog
On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> We have not yet encountered memory so fragmented that allocating
> kvm-related metadata fails in our overcommit pools. Is this a real
> requirement from production environments?

Yes.
On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
> On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> > We have not yet encountered memory so fragmented that allocating
> > kvm-related metadata fails in our overcommit pools. Is this a real
> > requirement from production environments?
>
> Yes.

Are your logs granular enough to determine if turning this into an
order-2 allocation (by splitting out the "struct fpu" allocations) will
be sufficient to resolve your problem, or do we need to turn it into an
order-1 or vmalloc allocation to achieve your production goals?
On Mon, Oct 29, 2018 at 9:48 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
>> On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
>> > We have not yet encountered memory so fragmented that allocating
>> > kvm-related metadata fails in our overcommit pools. Is this a real
>> > requirement from production environments?
>>
>> Yes.
>
> Are your logs granular enough to determine if turning this into an
> order-2 allocation (by splitting out the "struct fpu" allocations) will
> be sufficient to resolve your problem, or do we need to turn it into an
> order-1 or vmalloc allocation to achieve your production goals?

Turning this into an order-2 allocation should suffice.
Thanks for all the discussion on this.  Give me a bit to investigate
Dave's suggestions around refactoring the fpu state, and I'll report
back with what I find.

Thanks,
Marc

On Mon, Oct 29, 2018 at 11:12 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Mon, Oct 29, 2018 at 9:48 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
> >> On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> >> > We have not yet encountered memory so fragmented that allocating
> >> > kvm-related metadata fails in our overcommit pools. Is this a real
> >> > requirement from production environments?
> >>
> >> Yes.
> >
> > Are your logs granular enough to determine if turning this into an
> > order-2 allocation (by splitting out the "struct fpu" allocations) will
> > be sufficient to resolve your problem, or do we need to turn it into an
> > order-1 or vmalloc allocation to achieve your production goals?
>
> Turning this into an order-2 allocation should suffice.
On Mon, Oct 29, 2018 at 12:16 PM Marc Orr <marcorr@google.com> wrote:
>
> Thanks for all the discussion on this.  Give me a bit to investigate
> Dave's suggestions around refactoring the fpu state, and I'll report
> back with what I find.
>
> Thanks,
> Marc

Also, sorry for top-posting on my last email!
On Fri, Oct 26, 2018 at 7:49 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/26/18 7:45 AM, Matthew Wilcox wrote:
> > 	struct fpu                 user_fpu;             /*  2176  4160 */
> > 	struct fpu                 guest_fpu;            /*  6336  4160 */
>
> Those are *not* supposed to be embedded in any other structures.  My bad
> for not documenting this better.
>
> It also seems really goofy that we need an xsave buffer in the
> task_struct for user fpu state, then another in the vcpu.  Isn't one for
> user state enough?
>
> In any case, I'd suggest getting rid of 'user_fpu', then either moving
> 'guest_fpu' to the bottom of the structure, or just making it a 'struct
> fpu *' and dynamically allocating it separately.

I've written a patch to get rid of user_fpu, as suggested here, and will
be sending that out shortly.

> To do this, I'd take fpu__init_task_struct_size(), and break it apart a
> bit to tell you the size of the 'struct fpu' separately from the size of
> the 'task struct'.

I've written a 2nd patch to make guest_fpu a 'struct fpu *' and
dynamically allocate it separately.  The reason I went with this
suggestion, rather than moving 'struct fpu' to the bottom of
kvm_vcpu_arch, is that I believe that solution would still expand
kvm_vcpu_arch by the size of the fpu, depending on which fpregs_state
is in use.
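The shape of that second patch can be sketched in plain C: embedding
the fpu costs sizeof(struct fpu) in every vcpu regardless of use, while
a pointer costs one machine word plus a separate allocation.  The
struct names and create/destroy helpers below are illustrative
stand-ins, not the real kernel definitions; only the 4160-byte fpu size
is taken from the pahole output earlier in the thread:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for struct fpu: 4160 bytes, per the pahole
 * output earlier in this thread. Not the real kernel definition. */
struct fpu {
	unsigned char state[4160];
};

/* Before: the fpu is embedded, bloating every vcpu by 4160 bytes. */
struct vcpu_embedded {
	int other_state;
	struct fpu guest_fpu;
};

/* After: only a pointer lives in the vcpu; the fpu itself becomes a
 * separate, smaller allocation made at vcpu-creation time. */
struct vcpu_pointer {
	int other_state;
	struct fpu *guest_fpu;
};

static struct vcpu_pointer *vcpu_create(void)
{
	struct vcpu_pointer *vcpu = calloc(1, sizeof(*vcpu));

	if (!vcpu)
		return NULL;
	/* The fpu is now its own (order-1 rather than order-3-inflating)
	 * allocation, freed alongside the vcpu. */
	vcpu->guest_fpu = calloc(1, sizeof(*vcpu->guest_fpu));
	if (!vcpu->guest_fpu) {
		free(vcpu);
		return NULL;
	}
	return vcpu;
}

static void vcpu_destroy(struct vcpu_pointer *vcpu)
{
	free(vcpu->guest_fpu);
	free(vcpu);
}
```

The trade-off is one extra pointer dereference on every guest_fpu
access, in exchange for a much smaller contiguous vcpu allocation.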
On Mon, Oct 29, 2018 at 9:48 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 29, 2018 at 09:25:05AM -0700, Jim Mattson wrote:
> > On Sun, Oct 28, 2018 at 6:58 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> > > We have not yet encountered memory so fragmented that allocating
> > > kvm-related metadata fails in our overcommit pools. Is this a real
> > > requirement from production environments?
> >
> > Yes.
>
> Are your logs granular enough to determine if turning this into an
> order-2 allocation (by splitting out the "struct fpu" allocations) will
> be sufficient to resolve your problem, or do we need to turn it into an
> order-1 or vmalloc allocation to achieve your production goals?

As noted in my response to Dave Hansen, I've implemented his
suggestions, and they were successful in drastically reducing the size
of the vcpu_vmx struct, which is great.  Specifically, on an upstream
kernel, I've reduced the size of the struct from 23680 bytes down to
15168 bytes, which is order 2.

All that being said, I don't really understand why we wouldn't convert
this memory allocation from a kmalloc() into a vmalloc().  From my
point of view, we are still close to bloating vcpu_vmx into an order-3
allocation, and it's common for vendors to append to vcpu_vmx directly,
or more likely to its embedded structs.  Though, arguably, vendors
should not be doing that.

Most importantly, it just isn't obvious to me why kmalloc() is
preferred over vmalloc().  From my point of view, vmalloc() does the
exact same thing as kmalloc(), except that it works when contiguous
memory is sparse, which seems better to me.
On Wed, Oct 31, 2018 at 01:17:40PM +0000, Marc Orr wrote:
> All that being said, I don't really understand why we wouldn't convert
> this memory allocation from a kmalloc() into a vmalloc().  From my
> point of view, we are still close to bloating vcpu_vmx into an order-3
> allocation, and it's common for vendors to append to vcpu_vmx directly,
> or more likely to its embedded structs.  Though, arguably, vendors
> should not be doing that.
>
> Most importantly, it just isn't obvious to me why kmalloc() is
> preferred over vmalloc().  From my point of view, vmalloc() does the
> exact same thing as kmalloc(), except that it works when contiguous
> memory is sparse, which seems better to me.

It's ever-so-slightly faster to access kmalloc memory than vmalloc
memory; kmalloc memory comes from the main kernel mapping, generally
mapped with 1GB pages, while vmalloc memory is necessarily accessed
using 4kB pages, taking an extra two steps in the page table hierarchy.
There's more software overhead involved too; in addition to allocating
the physical pages (which both kmalloc and vmalloc have to do), vmalloc
has to allocate a vmap_area and a vm_struct to describe the virtual
mapping.

The virtual address space can also get fragmented, just like the
physical address space does, potentially leading to the amusing
scenario where you can allocate physically contiguous memory, but not
find a contiguous chunk of vmalloc space to put it in.

For larger allocations, we tend to prefer kvmalloc(), which gives us
the best of both worlds, allocating from kmalloc first and vmalloc as a
fallback; but you've got some fun gyrations to go through to do the
physical mapping, so that's not entirely appropriate for your case.
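The kvmalloc() fallback pattern described above can be sketched in
userspace C.  Everything here is a stand-in: try_contiguous() simulates
kmalloc() failing on a fragmented system via a size threshold, and the
real kernel implementation (in mm/util.c) differs in its gfp-flag
handling:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace sketch of the kvmalloc() idea: prefer the fast, physically
 * contiguous allocator, and fall back to the virtually contiguous one
 * when fragmentation defeats it. */

#define CONTIG_LIMIT (2 * 4096)	/* pretend order-1 is the best we can do */

static void *try_contiguous(size_t size)
{
	/* Stand-in for kmalloc(): "fails" above a threshold to simulate
	 * a fragmented physical address space. */
	if (size > CONTIG_LIMIT)
		return NULL;
	return malloc(size);
}

static void *virtually_contiguous(size_t size)
{
	/* Stand-in for vmalloc(): always succeeds if memory exists,
	 * since it only needs single pages. */
	return malloc(size);
}

/* Mirrors the kvmalloc() shape; *used_fallback reports which path ran,
 * the way is_vmalloc_addr() does for real kernel pointers. */
static void *kvmalloc_sketch(size_t size, bool *used_fallback)
{
	void *p = try_contiguous(size);

	*used_fallback = (p == NULL);
	return p ? p : virtually_contiguous(size);
}
```

A caller that needs to build a physical mapping would then branch on
the same condition, which is exactly the two-code-path question raised
later in the thread.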
On Wed, Oct 31, 2018 at 6:27 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 31, 2018 at 01:17:40PM +0000, Marc Orr wrote:
> > All that being said, I don't really understand why we wouldn't convert
> > this memory allocation from a kmalloc() into a vmalloc().  From my
> > point of view, we are still close to bloating vcpu_vmx into an order-3
> > allocation, and it's common for vendors to append to vcpu_vmx directly,
> > or more likely to its embedded structs.  Though, arguably, vendors
> > should not be doing that.
> >
> > Most importantly, it just isn't obvious to me why kmalloc() is
> > preferred over vmalloc().  From my point of view, vmalloc() does the
> > exact same thing as kmalloc(), except that it works when contiguous
> > memory is sparse, which seems better to me.
>
> It's ever-so-slightly faster to access kmalloc memory than vmalloc
> memory; kmalloc memory comes from the main kernel mapping, generally
> mapped with 1GB pages, while vmalloc memory is necessarily accessed
> using 4kB pages, taking an extra two steps in the page table hierarchy.
> There's more software overhead involved too; in addition to allocating
> the physical pages (which both kmalloc and vmalloc have to do), vmalloc
> has to allocate a vmap_area and a vm_struct to describe the virtual
> mapping.
>
> The virtual address space can also get fragmented, just like the
> physical address space does, potentially leading to the amusing
> scenario where you can allocate physically contiguous memory, but not
> find a contiguous chunk of vmalloc space to put it in.
>
> For larger allocations, we tend to prefer kvmalloc(), which gives us
> the best of both worlds, allocating from kmalloc first and vmalloc as a
> fallback; but you've got some fun gyrations to go through to do the
> physical mapping, so that's not entirely appropriate for your case.

Thanks for the explanation.  Is there a way to dynamically detect which
allocation kvmalloc() performed (i.e., kmalloc() or vmalloc())?  If so,
we could use kvmalloc(), and add two code paths to do the physical
mapping, according to whether the underlying memory was allocated with
kmalloc() or vmalloc().
On Wed, Oct 31, 2018 at 01:48:44PM +0000, Marc Orr wrote:
> Thanks for the explanation.  Is there a way to dynamically detect which
> allocation kvmalloc() performed (i.e., kmalloc() or vmalloc())?  If so,
> we could use kvmalloc(), and add two code paths to do the physical
> mapping, according to whether the underlying memory was allocated with
> kmalloc() or vmalloc().

Yes -- it's used in the implementation of kvfree():

	if (is_vmalloc_addr(addr))
		vfree(addr);
	else
		kfree(addr);
On Wed, Oct 31, 2018 at 7:21 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 31, 2018 at 01:48:44PM +0000, Marc Orr wrote:
> > Thanks for the explanation.  Is there a way to dynamically detect which
> > allocation kvmalloc() performed (i.e., kmalloc() or vmalloc())?  If so,
> > we could use kvmalloc(), and add two code paths to do the physical
> > mapping, according to whether the underlying memory was allocated with
> > kmalloc() or vmalloc().
>
> Yes -- it's used in the implementation of kvfree():
>
> 	if (is_vmalloc_addr(addr))
> 		vfree(addr);
> 	else
> 		kfree(addr);

I can drop the vmalloc() patches (unless someone else thinks we should
proceed with a kvmalloc() version).  I discussed them with my
colleagues, and the consensus on our end is that we shouldn't let these
structs bloat to this size in the first place.  Thanks for everyone's
help in shrinking them by two fpu structs' worth!

I'll be sending out another version of the patch series, with the two
fpu patches and without the vmalloc() patches, after I hear back from
Dave on a question I just sent.