Message ID | 1399224322-22028-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com (mailing list archive)
---|---
State | New, archived
On 05/04/2014 07:25 PM, Aneesh Kumar K.V wrote:
> We reserve 5% of total ram for CMA allocation and not using that can
> result in us running out of numa node memory with specific
> configuration. One caveat is we may not have node local hpt with pinned
> vcpu configuration. But currently libvirt also pins the vcpu to cpuset
> after creating hash page table.

I don't understand the problem. Can you please elaborate?

Alex

>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 23 ++++++-----------------
>  1 file changed, 6 insertions(+), 17 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index fb25ebc0af0c..f32896ffd784 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -52,7 +52,7 @@ static void kvmppc_rmap_reset(struct kvm *kvm);
>
>  long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
>  {
> -	unsigned long hpt;
> +	unsigned long hpt = 0;
>  	struct revmap_entry *rev;
>  	struct page *page = NULL;
>  	long order = KVM_DEFAULT_HPT_ORDER;
> @@ -64,22 +64,11 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
>  	}
>
>  	kvm->arch.hpt_cma_alloc = 0;
> -	/*
> -	 * try first to allocate it from the kernel page allocator.
> -	 * We keep the CMA reserved for failed allocation.
> -	 */
> -	hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT |
> -			       __GFP_NOWARN, order - PAGE_SHIFT);
> -
> -	/* Next try to allocate from the preallocated pool */
> -	if (!hpt) {
> -		VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
> -		page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
> -		if (page) {
> -			hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
> -			kvm->arch.hpt_cma_alloc = 1;
> -		} else
> -			--order;
> +	VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
> +	page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
> +	if (page) {
> +		hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
> +		kvm->arch.hpt_cma_alloc = 1;
>  	}
>
>  	/* Lastly try successively smaller sizes from the page allocator */

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Alexander Graf <agraf@suse.de> writes:

> On 05/04/2014 07:25 PM, Aneesh Kumar K.V wrote:
>> We reserve 5% of total ram for CMA allocation and not using that can
>> result in us running out of numa node memory with specific
>> configuration. One caveat is we may not have node local hpt with pinned
>> vcpu configuration. But currently libvirt also pins the vcpu to cpuset
>> after creating hash page table.
>
> I don't understand the problem. Can you please elaborate?

Lets take a system with 100GB RAM. We reserve around 5GB for htab
allocation. Now if we use rest of available memory for hugetlbfs
(because we want all the guest to be backed by huge pages), we would
end up in a situation where we have a few GB of free RAM and 5GB of CMA
reserve area. Now if we allow hash page table allocation to consume the
free space, we would end up hitting page allocation failure for other
non movable kernel allocation even though we still have 5GB CMA reserve
space free.

-aneesh
> Am 05.05.2014 um 16:35 schrieb "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>:
>
> Alexander Graf <agraf@suse.de> writes:
>
>>> On 05/04/2014 07:25 PM, Aneesh Kumar K.V wrote:
>>> We reserve 5% of total ram for CMA allocation and not using that can
>>> result in us running out of numa node memory with specific
>>> configuration. One caveat is we may not have node local hpt with pinned
>>> vcpu configuration. But currently libvirt also pins the vcpu to cpuset
>>> after creating hash page table.
>>
>> I don't understand the problem. Can you please elaborate?
>
> Lets take a system with 100GB RAM. We reserve around 5GB for htab
> allocation. Now if we use rest of available memory for hugetlbfs
> (because we want all the guest to be backed by huge pages), we would
> end up in a situation where we have a few GB of free RAM and 5GB of CMA
> reserve area. Now if we allow hash page table allocation to consume the
> free space, we would end up hitting page allocation failure for other
> non movable kernel allocation even though we still have 5GB CMA reserve
> space free.

Isn't this a greater problem? We should start swapping before we hit
the point where non movable kernel allocation fails, no?

The fact that KVM uses a good number of normal kernel pages is maybe
suboptimal, but shouldn't be a critical problem.

Alex

> -aneesh
Alexander Graf <agraf@suse.de> writes:

>> Am 05.05.2014 um 16:35 schrieb "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>:
>>
>> Alexander Graf <agraf@suse.de> writes:
>>
>>>> On 05/04/2014 07:25 PM, Aneesh Kumar K.V wrote:
>>>> We reserve 5% of total ram for CMA allocation and not using that can
>>>> result in us running out of numa node memory with specific
>>>> configuration. One caveat is we may not have node local hpt with pinned
>>>> vcpu configuration. But currently libvirt also pins the vcpu to cpuset
>>>> after creating hash page table.
>>>
>>> I don't understand the problem. Can you please elaborate?
>>
>> Lets take a system with 100GB RAM. We reserve around 5GB for htab
>> allocation. Now if we use rest of available memory for hugetlbfs
>> (because we want all the guest to be backed by huge pages), we would
>> end up in a situation where we have a few GB of free RAM and 5GB of CMA
>> reserve area. Now if we allow hash page table allocation to consume the
>> free space, we would end up hitting page allocation failure for other
>> non movable kernel allocation even though we still have 5GB CMA reserve
>> space free.
>
> Isn't this a greater problem? We should start swapping before we hit
> the point where non movable kernel allocation fails, no?

But there is nothing much to swap. Because most of the memory is
reserved for guest RAM via hugetlbfs.

> The fact that KVM uses a good number of normal kernel pages is maybe
> suboptimal, but shouldn't be a critical problem.

Yes. But then in this case we could do better isn't it ? We already have
a large part of guest RAM kept aside for htab allocation which cannot
be used for non movable allocation. And we ignore that reserve space
and use other areas for hash page table allocation with the current
code. We actually hit this case in one of the test box.

KVM guest htab at c000001e50000000 (order 30), LPID 1
libvirtd invoked oom-killer: gfp_mask=0x2000d0, order=0, oom_score_adj=0
libvirtd cpuset=/ mems_allowed=0,16
CPU: 72 PID: 20044 Comm: libvirtd Not tainted 3.10.23-1401.pkvm2_1.4.ppc64 #1
Call Trace:
[c000001e3b63f150] [c000000000017330] .show_stack+0x130/0x200 (unreliable)
[c000001e3b63f220] [c00000000087a888] .dump_stack+0x28/0x3c
[c000001e3b63f290] [c000000000876a4c] .dump_header+0xbc/0x228
[c000001e3b63f360] [c0000000001dd838] .oom_kill_process+0x318/0x4c0
[c000001e3b63f440] [c0000000001de258] .out_of_memory+0x518/0x550
[c000001e3b63f520] [c0000000001e5aac] .__alloc_pages_nodemask+0xb3c/0xbf0
[c000001e3b63f700] [c000000000243580] .new_slab+0x440/0x490
[c000001e3b63f7a0] [c0000000008781fc] .__slab_alloc+0x17c/0x618
[c000001e3b63f8d0] [c0000000002467fc] .kmem_cache_alloc_node_trace+0xcc/0x300
[c000001e3b63f990] [c00000000010f62c] .alloc_fair_sched_group+0xfc/0x200
[c000001e3b63fa60] [c000000000104f00] .sched_create_group+0x50/0xe0
[c000001e3b63fae0] [c000000000104fc0] .cpu_cgroup_css_alloc+0x30/0x80
[c000001e3b63fb60] [c0000000001513ec] .cgroup_mkdir+0x2bc/0x6e0
[c000001e3b63fc50] [c000000000275aec] .vfs_mkdir+0x14c/0x220
[c000001e3b63fcf0] [c00000000027a734] .SyS_mkdirat+0x94/0x110
[c000001e3b63fdb0] [c00000000027a7e4] .SyS_mkdir+0x34/0x50
[c000001e3b63fe30] [c000000000009f54] syscall_exit+0x0/0x98
Node 0 DMA free:23424kB min:23424kB low:29248kB high:35136kB active_anon:0kB inactive_anon:128kB active_file:256kB inactive_file:384kB unevictable:9536kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:65931776kB mlocked:9536kB dirty:64kB writeback:0kB mapped:5376kB shmem:0kB slab_reclaimable:23616kB slab_unreclaimable:1237056kB kernel_stack:18256kB pagetables:1088kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:78 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 16 DMA free:5787008kB min:21376kB low:26688kB high:32064kB active_anon:1984kB inactive_anon:2112kB active_file:896kB inactive_file:64kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:60060032kB mlocked:0kB dirty:128kB writeback:3712kB mapped:0kB shmem:0kB slab_reclaimable:23424kB slab_unreclaimable:826048kB kernel_stack:576kB pagetables:1408kB unstable:0kB bounce:0kB free_cma:5767040kB writeback_tmp:0kB pages_scanned:756 all_unreclaimable? yes
On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
> Isn't this a greater problem? We should start swapping before we hit
> the point where non movable kernel allocation fails, no?

Possibly but the fact remains, this can be avoided by making sure that
if we create a CMA reserve for KVM, then it uses it rather than using
the rest of main memory for hash tables.

> The fact that KVM uses a good number of normal kernel pages is maybe
> suboptimal, but shouldn't be a critical problem.

The point is that we explicitly reserve those pages in CMA for use
by KVM for that specific purpose, but the current code tries first
to get them out of the normal pool.

This is not an optimal behaviour and is what Aneesh patches are
trying to fix.

Cheers,
Ben.
On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>> Isn't this a greater problem? We should start swapping before we hit
>> the point where non movable kernel allocation fails, no?
> Possibly but the fact remains, this can be avoided by making sure that
> if we create a CMA reserve for KVM, then it uses it rather than using
> the rest of main memory for hash tables.

So why were we preferring non-CMA memory before? Considering that Aneesh
introduced that logic in fa61a4e3 I suppose this was just a mistake?

>> The fact that KVM uses a good number of normal kernel pages is maybe
>> suboptimal, but shouldn't be a critical problem.
> The point is that we explicitly reserve those pages in CMA for use
> by KVM for that specific purpose, but the current code tries first
> to get them out of the normal pool.
>
> This is not an optimal behaviour and is what Aneesh patches are
> trying to fix.

I agree, and I agree that it's worth it to make better use of our
resources. But we still shouldn't crash.

However, reading through this thread I think I've slowly grasped what
the problem is. The hugetlbfs size calculation.

I guess something in your stack overreserves huge pages because it
doesn't account for the fact that some part of system memory is already
reserved for CMA.

So the underlying problem is something completely orthogonal. The patch
body as is is fine, but the patch description should simply say that we
should prefer the CMA region because it's already reserved for us for
this purpose and we make better use of our available resources that way.

All the bits about pinning, numa, libvirt and whatnot don't really
matter and are just details that led Aneesh to find this non-optimal
allocation.

Alex
On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
> >> Isn't this a greater problem? We should start swapping before we hit
> >> the point where non movable kernel allocation fails, no?
> > Possibly but the fact remains, this can be avoided by making sure that
> > if we create a CMA reserve for KVM, then it uses it rather than using
> > the rest of main memory for hash tables.
>
> So why were we preferring non-CMA memory before? Considering that Aneesh
> introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.

> >> The fact that KVM uses a good number of normal kernel pages is maybe
> >> suboptimal, but shouldn't be a critical problem.
> > The point is that we explicitly reserve those pages in CMA for use
> > by KVM for that specific purpose, but the current code tries first
> > to get them out of the normal pool.
> >
> > This is not an optimal behaviour and is what Aneesh patches are
> > trying to fix.
>
> I agree, and I agree that it's worth it to make better use of our
> resources. But we still shouldn't crash.

Well, Linux hitting out of memory conditions has never been a happy
story :-)

> However, reading through this thread I think I've slowly grasped what
> the problem is. The hugetlbfs size calculation.

Not really.

> I guess something in your stack overreserves huge pages because it
> doesn't account for the fact that some part of system memory is already
> reserved for CMA.

Either that or simply Linux runs out because we dirty too fast...
really, Linux has never been good at dealing with OO situations,
especially when things like network drivers and filesystems try to do
ATOMIC or NOIO allocs...

> So the underlying problem is something completely orthogonal. The patch
> body as is is fine, but the patch description should simply say that we
> should prefer the CMA region because it's already reserved for us for
> this purpose and we make better use of our available resources that way.

No.

We give a chunk of memory to hugetlbfs, it's all good and fine.

Whatever remains is split between CMA and the normal page allocator.

Without Aneesh latest patch, when creating guests, KVM starts allocating
it's hash tables from the latter instead of CMA (we never allocate from
hugetlb pool afaik, only guest pages do that, not hash tables).

So we exhaust the page allocator and get linux into OOM conditions
while there's plenty of space in CMA. But the kernel cannot use CMA for
it's own allocations, only to back user pages, which we don't care about
because our guest pages are covered by our hugetlb reserve :-)

> All the bits about pinning, numa, libvirt and whatnot don't really
> matter and are just details that led Aneesh to find this non-optimal
> allocation.

Cheers,
Ben.
On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>> Isn't this a greater problem? We should start swapping before we hit
>>>> the point where non movable kernel allocation fails, no?
>>> Possibly but the fact remains, this can be avoided by making sure that
>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>> the rest of main memory for hash tables.
>> So why were we preferring non-CMA memory before? Considering that Aneesh
>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
> I assume so.
>
>>>> The fact that KVM uses a good number of normal kernel pages is maybe
>>>> suboptimal, but shouldn't be a critical problem.
>>> The point is that we explicitly reserve those pages in CMA for use
>>> by KVM for that specific purpose, but the current code tries first
>>> to get them out of the normal pool.
>>>
>>> This is not an optimal behaviour and is what Aneesh patches are
>>> trying to fix.
>> I agree, and I agree that it's worth it to make better use of our
>> resources. But we still shouldn't crash.
> Well, Linux hitting out of memory conditions has never been a happy
> story :-)
>
>> However, reading through this thread I think I've slowly grasped what
>> the problem is. The hugetlbfs size calculation.
> Not really.
>
>> I guess something in your stack overreserves huge pages because it
>> doesn't account for the fact that some part of system memory is already
>> reserved for CMA.
> Either that or simply Linux runs out because we dirty too fast...
> really, Linux has never been good at dealing with OO situations,
> especially when things like network drivers and filesystems try to do
> ATOMIC or NOIO allocs...
>
>> So the underlying problem is something completely orthogonal. The patch
>> body as is is fine, but the patch description should simply say that we
>> should prefer the CMA region because it's already reserved for us for
>> this purpose and we make better use of our available resources that way.
> No.
>
> We give a chunk of memory to hugetlbfs, it's all good and fine.
>
> Whatever remains is split between CMA and the normal page allocator.
>
> Without Aneesh latest patch, when creating guests, KVM starts allocating
> it's hash tables from the latter instead of CMA (we never allocate from
> hugetlb pool afaik, only guest pages do that, not hash tables).
>
> So we exhaust the page allocator and get linux into OOM conditions
> while there's plenty of space in CMA. But the kernel cannot use CMA for
> it's own allocations, only to back user pages, which we don't care about
> because our guest pages are covered by our hugetlb reserve :-)

Yes. Write that in the patch description and I'm happy ;).

Alex
Alexander Graf <agraf@suse.de> writes:

> On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
>> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>>> Isn't this a greater problem? We should start swapping before we hit
>>>>> the point where non movable kernel allocation fails, no?
>>>> Possibly but the fact remains, this can be avoided by making sure that
>>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>>> the rest of main memory for hash tables.
>>> So why were we preferring non-CMA memory before? Considering that Aneesh
>>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
>> I assume so.

....
...

>>
>> Whatever remains is split between CMA and the normal page allocator.
>>
>> Without Aneesh latest patch, when creating guests, KVM starts allocating
>> it's hash tables from the latter instead of CMA (we never allocate from
>> hugetlb pool afaik, only guest pages do that, not hash tables).
>>
>> So we exhaust the page allocator and get linux into OOM conditions
>> while there's plenty of space in CMA. But the kernel cannot use CMA for
>> it's own allocations, only to back user pages, which we don't care about
>> because our guest pages are covered by our hugetlb reserve :-)
>
> Yes. Write that in the patch description and I'm happy ;).

How about the below:

Current KVM code first try to allocate hash page table from the normal
page allocator before falling back to the CMA reserve region. One of the
side effects of that is, we could exhaust the page allocator and get
linux into OOM conditions while we still have plenty of space in CMA.

Fix this by trying the CMA reserve region first and then falling back
to normal page allocator if we fail to get enough memory from CMA
reserve area.

-aneesh
On 05/06/2014 04:20 PM, Aneesh Kumar K.V wrote:
> Alexander Graf <agraf@suse.de> writes:
>
>> On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
>>> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>>>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>>>> Isn't this a greater problem? We should start swapping before we hit
>>>>>> the point where non movable kernel allocation fails, no?
>>>>> Possibly but the fact remains, this can be avoided by making sure that
>>>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>>>> the rest of main memory for hash tables.
>>>> So why were we preferring non-CMA memory before? Considering that Aneesh
>>>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
>>> I assume so.
> ....
> ...
>
>>> Whatever remains is split between CMA and the normal page allocator.
>>>
>>> Without Aneesh latest patch, when creating guests, KVM starts allocating
>>> it's hash tables from the latter instead of CMA (we never allocate from
>>> hugetlb pool afaik, only guest pages do that, not hash tables).
>>>
>>> So we exhaust the page allocator and get linux into OOM conditions
>>> while there's plenty of space in CMA. But the kernel cannot use CMA for
>>> it's own allocations, only to back user pages, which we don't care about
>>> because our guest pages are covered by our hugetlb reserve :-)
>> Yes. Write that in the patch description and I'm happy ;).
>>
> How about the below:
>
> Current KVM code first try to allocate hash page table from the normal
> page allocator before falling back to the CMA reserve region. One of the
> side effects of that is, we could exhaust the page allocator and get
> linux into OOM conditions while we still have plenty of space in CMA.
>
> Fix this by trying the CMA reserve region first and then falling back
> to normal page allocator if we fail to get enough memory from CMA
> reserve area.

Fix the grammar (I've spotted a good number of mistakes), then this
should do. Please also improve the headline.

Alex
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index fb25ebc0af0c..f32896ffd784 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -52,7 +52,7 @@ static void kvmppc_rmap_reset(struct kvm *kvm);
 
 long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 {
-	unsigned long hpt;
+	unsigned long hpt = 0;
 	struct revmap_entry *rev;
 	struct page *page = NULL;
 	long order = KVM_DEFAULT_HPT_ORDER;
@@ -64,22 +64,11 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 	}
 
 	kvm->arch.hpt_cma_alloc = 0;
-	/*
-	 * try first to allocate it from the kernel page allocator.
-	 * We keep the CMA reserved for failed allocation.
-	 */
-	hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT |
-			       __GFP_NOWARN, order - PAGE_SHIFT);
-
-	/* Next try to allocate from the preallocated pool */
-	if (!hpt) {
-		VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
-		page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
-		if (page) {
-			hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
-			kvm->arch.hpt_cma_alloc = 1;
-		} else
-			--order;
+	VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
+	page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
+	if (page) {
+		hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
+		kvm->arch.hpt_cma_alloc = 1;
 	}
 
 	/* Lastly try successively smaller sizes from the page allocator */
We reserve 5% of total ram for CMA allocation and not using that can
result in us running out of numa node memory with specific
configuration. One caveat is we may not have node local hpt with pinned
vcpu configuration. But currently libvirt also pins the vcpu to cpuset
after creating hash page table.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)