Message ID | 20220330225642.1163897-1-song@kernel.org (mailing list archive)
---|---
Series | introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack
On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
> [1]
> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/

The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed in this series. And I think the solution I proposed is kind of wonky with respect to hibernate. So I think maybe hibernate should be fixed to not impose restrictions on the direct map, so the wonkiness is not needed. But then this "fixes" series becomes quite extensive.

I wonder, why not just push patch 1 here, then re-enable this thing when it is all properly fixed up. It looked like your code could handle the allocation not actually getting large pages.

Another solution that would keep large pages but still need fixing up later: just don't use VM_FLUSH_RESET_PERMS for now. Call set_memory_nx() and then set_memory_rw() on the module space address before vfree(). This will clean up everything that's needed with respect to direct map permissions. Have vmalloc warn if it sees VM_FLUSH_RESET_PERMS and huge pages together.
> On Mar 30, 2022, at 5:04 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
>> [1]
>> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
>
> The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed in this series. And I think the solution I proposed is kind of wonky with respect to hibernate. So I think maybe hibernate should be fixed to not impose restrictions on the direct map, so the wonkiness is not needed. But then this "fixes" series becomes quite extensive.
>
> I wonder, why not just push patch 1 here, then re-enable this thing when it is all properly fixed up. It looked like your code could handle the allocation not actually getting large pages.

Only shipping patch 1 should eliminate the issues. But that will also reduce the benefit in iTLB efficiency (I don't know by how much yet).

> Another solution that would keep large pages but still need fixing up later: just don't use VM_FLUSH_RESET_PERMS for now. Call set_memory_nx() and then set_memory_rw() on the module space address before vfree(). This will clean up everything that's needed with respect to direct map permissions. Have vmalloc warn if it sees VM_FLUSH_RESET_PERMS and huge pages together.

Do you mean we should remove set_vm_flush_reset_perms() from alloc_new_pack() and do set_memory_nx() and set_memory_rw() before we call vfree() in bpf_prog_pack_free()? If this works, I would prefer we go this way.

Thanks,
Song
On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
> issues [1], [2].

Please fix the underlying issues instead of papering over them and creating a huge maintenance burden for others.
On Thu, 2022-03-31 at 00:46 +0000, Song Liu wrote:
> > On Mar 30, 2022, at 5:04 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> >
> > On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
> > > [1]
> > > https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
> >
> > The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed in this series. And I think the solution I proposed is kind of wonky with respect to hibernate. So I think maybe hibernate should be fixed to not impose restrictions on the direct map, so the wonkiness is not needed. But then this "fixes" series becomes quite extensive.
> >
> > I wonder, why not just push patch 1 here, then re-enable this thing when it is all properly fixed up. It looked like your code could handle the allocation not actually getting large pages.
>
> Only shipping patch 1 should eliminate the issues. But that will also reduce the benefit in iTLB efficiency (I don't know by how much yet).

Yea, it's just a matter of what order/timeline things get done in. This change didn't get enough mm attention ahead of time. Now there are two issues: one where the root cause is not fully clear, and one that probably needs a wider fix. Just thinking it could be nice to take some time on it, rather than rush to finish what was already too rushed.

> > Another solution that would keep large pages but still need fixing up later: just don't use VM_FLUSH_RESET_PERMS for now. Call set_memory_nx() and then set_memory_rw() on the module space address before vfree(). This will clean up everything that's needed with respect to direct map permissions. Have vmalloc warn if it sees VM_FLUSH_RESET_PERMS and huge pages together.
>
> Do you mean we should remove set_vm_flush_reset_perms() from alloc_new_pack() and do set_memory_nx() and set_memory_rw() before we call vfree() in bpf_prog_pack_free()? If this works, I would prefer we go this way.

I believe this would work functionally.
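For reference, a minimal sketch of the cleanup Rick describes above, assuming the pack is a single vmalloc'ed region whose size is known to the caller; the function name and parameters are illustrative, not the actual bpf_prog_pack_free() signature:

```c
#include <linux/set_memory.h>
#include <linux/vmalloc.h>

/*
 * Illustrative only: instead of tagging the area with
 * VM_FLUSH_RESET_PERMS at allocation time, reset the permissions by
 * hand just before freeing. Per the discussion above, set_memory_rw()
 * also cleans up the direct map permissions, so a plain vfree() is
 * then safe.
 */
static void prog_pack_free_sketch(void *pack, unsigned long pack_size)
{
	unsigned long addr = (unsigned long)pack;
	int nr_pages = pack_size >> PAGE_SHIFT;

	set_memory_nx(addr, nr_pages);	/* drop executable first */
	set_memory_rw(addr, nr_pages);	/* make writable again */
	vfree(pack);
}
```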
Hi Christoph,

> On Mar 30, 2022, at 10:37 PM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
>> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
>> issues [1], [2].
>
> Please fix the underlying issues instead of papering over them and creating a huge maintenance burden for others.

I agree that this set is papering over the issue, and I would like your recommendations here. The biggest problem to me is that we (or at least I) don't know all the issues HAVE_ARCH_HUGE_VMALLOC will trigger on x86_64. Right now we have a bug report from Paul and the warning from Rick, but I am afraid there might be other issues.

How about we approach it like this: since it is still early in the release cycle (pre rc1), we can keep HAVE_ARCH_HUGE_VMALLOC on for x86_64 for now and try to fix all the reported issues and warnings. If things don't go very well, we can turn HAVE_ARCH_HUGE_VMALLOC off after rc4 or rc5.

Does this make sense?

Thanks,
Song
+ Nicholas and Claudio,

> On Mar 31, 2022, at 4:59 PM, Song Liu <songliubraving@fb.com> wrote:
>
> Hi Christoph,
>
>> On Mar 30, 2022, at 10:37 PM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
>>> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
>>> issues [1], [2].
>>
>> Please fix the underlying issues instead of papering over them and creating a huge maintenance burden for others.

After reading the code a little more, I wonder what the best strategy would be. IIUC, most of the kernel is not ready for huge-page-backed vmalloc memory. For example, module_alloc() cannot work with huge pages at the moment. And the error Paul Menzel reported in drm_fb_helper.c will probably hit powerpc with the 5.17 kernel as-is? (trace attached below)

Right now, we have VM_NO_HUGE_VMAP to let a user opt out of huge pages. However, given there are so many users of vmalloc, vzalloc, etc., we probably do need a flag for the user to opt in instead.

Does this make sense? Any recommendations are really appreciated.

Thanks,
Song

[ 1.687983] BUG: Bad page state in process systemd-udevd pfn:102e03
[ 1.687992] fbcon: Taking over console
[ 1.688007] page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x3 pfn:0x102e03
[ 1.688011] head:(____ptrval____) order:9 compound_mapcount:0 compound_pincount:0
[ 1.688013] flags: 0x2fffc000010000(head|node=0|zone=2|lastcpupid=0x3fff)
[ 1.688018] raw: 002fffc000000000 ffffe815040b8001 ffffe815040b80c8 0000000000000000
[ 1.688020] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 1.688022] head: 002fffc000010000 0000000000000000 dead000000000122 0000000000000000
[ 1.688023] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 1.688024] page dumped because: corrupted mapping in tail page
[ 1.688025] Modules linked in: r8169(+) k10temp snd_pcm(+) xhci_hcd snd_timer realtek ohci_hcd ehci_pci(+) i2c_piix4 ehci_hcd radeon(+) snd sg drm_ttm_helper ttm soundcore coreboot_table acpi_cpufreq fuse ipv6 autofs4
[ 1.688045] CPU: 1 PID: 151 Comm: systemd-udevd Not tainted 5.16.0-11615-gfac54e2bfb5b #319
[ 1.688048] Hardware name: ASUS F2A85-M_PRO/F2A85-M_PRO, BIOS 4.16-337-gb87986e67b 03/25/2022
[ 1.688050] Call Trace:
[ 1.688051] <TASK>
[ 1.688053] dump_stack_lvl+0x34/0x44
[ 1.688059] bad_page.cold+0x63/0x94
[ 1.688063] free_tail_pages_check+0xd1/0x110
[ 1.688067] ? _raw_spin_lock+0x13/0x30
[ 1.688071] free_pcp_prepare+0x251/0x2e0
[ 1.688075] free_unref_page+0x1d/0x110
[ 1.688078] __vunmap+0x28a/0x380
[ 1.688082] drm_fbdev_cleanup+0x5f/0xb0
[ 1.688085] drm_fbdev_fb_destroy+0x15/0x30
[ 1.688087] unregister_framebuffer+0x1d/0x30
[ 1.688091] drm_client_dev_unregister+0x69/0xe0
[ 1.688095] drm_dev_unregister+0x2e/0x80
[ 1.688098] drm_dev_unplug+0x21/0x40
[ 1.688100] simpledrm_remove+0x11/0x20
[ 1.688103] platform_remove+0x1f/0x40
[ 1.688106] __device_release_driver+0x17a/0x240
[ 1.688109] device_release_driver+0x24/0x30
[ 1.688112] bus_remove_device+0xd8/0x140
[ 1.688115] device_del+0x18b/0x3f0
[ 1.688118] ? _raw_spin_unlock_irqrestore+0x1b/0x30
[ 1.688121] ? try_to_wake_up+0x94/0x5b0
[ 1.688124] platform_device_del.part.0+0x13/0x70
[ 1.688127] platform_device_unregister+0x1c/0x30
[ 1.688130] drm_aperture_detach_drivers+0xa1/0xd0
[ 1.688134] drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x60
[ 1.688137] radeon_pci_probe+0x54/0xf0 [radeon]
[ 1.688212] local_pci_probe+0x45/0x80
[ 1.688216] ? pci_match_device+0xd7/0x130
[ 1.688219] pci_device_probe+0xc2/0x1d0
[ 1.688223] really_probe+0x1f5/0x3d0
[ 1.688226] __driver_probe_device+0xfe/0x180
[ 1.688229] driver_probe_device+0x1e/0x90
[ 1.688232] __driver_attach+0xc0/0x1c0
[ 1.688235] ? __device_attach_driver+0xe0/0xe0
[ 1.688237] ? __device_attach_driver+0xe0/0xe0
[ 1.688239] bus_for_each_dev+0x78/0xc0
[ 1.688242] bus_add_driver+0x149/0x1e0
[ 1.688245] driver_register+0x8f/0xe0
[ 1.688248] ? 0xffffffffc051d000
[ 1.688250] do_one_initcall+0x44/0x200
[ 1.688254] ? kmem_cache_alloc_trace+0x170/0x2c0
[ 1.688257] do_init_module+0x5c/0x260
[ 1.688262] __do_sys_finit_module+0xb4/0x120
[ 1.688266] __do_fast_syscall_32+0x6b/0xe0
[ 1.688270] do_fast_syscall_32+0x2f/0x70
[ 1.688272] entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[ 1.688275] RIP: 0023:0xf7e51549
[ 1.688278] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 cd 0f 05 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
[ 1.688281] RSP: 002b:00000000ffa1666c EFLAGS: 00200292 ORIG_RAX: 000000000000015e
[ 1.688285] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00000000f7e30e09
[ 1.688287] RDX: 0000000000000000 RSI: 00000000f9a705d0 RDI: 00000000f9a6f6a0
[ 1.688288] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1.688290] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1.688291] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1.688294] </TASK>
[ 1.688355] Disabling lock debugging due to kernel taint
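As background for the opt-out flag mentioned above, a minimal sketch of how a caller avoids huge pages today, along the lines of what x86's module_alloc() does; the wrapper name is made up for illustration:

```c
#include <linux/vmalloc.h>

/*
 * Illustrative wrapper: passing VM_NO_HUGE_VMAP forces the mapping to
 * use base (4K) pages, which is how callers such as module_alloc()
 * currently opt out of huge-page-backed vmalloc.
 */
static void *vmalloc_base_pages_only(unsigned long size)
{
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL, VM_NO_HUGE_VMAP,
				    NUMA_NO_NODE, __builtin_return_address(0));
}
```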
On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
> >> Please fix the underlying issues instead of papering over them and creating a huge maintenance burden for others.
>
> After reading the code a little more, I wonder what the best strategy would be. IIUC, most of the kernel is not ready for huge-page-backed vmalloc memory. For example, module_alloc() cannot work with huge pages at the moment. And the error Paul Menzel reported in drm_fb_helper.c will probably hit powerpc with the 5.17 kernel as-is? (trace attached below)
>
> Right now, we have VM_NO_HUGE_VMAP to let a user opt out of huge pages. However, given there are so many users of vmalloc, vzalloc, etc., we probably do need a flag for the user to opt in instead.
>
> Does this make sense? Any recommendations are really appreciated.

I think there are multiple aspects here:

 - if we think that the kernel is not ready for hugepage-backed vmalloc in general, we need to disable it in powerpc for now.
 - if we think even in the longer run only some users can cope with hugepage-backed vmalloc, we need to turn it into an opt-in in general, and not just for x86.
 - there still appear to be various unresolved underlying x86-specific issues that need to be fixed either way.
> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
>>>> Please fix the underlying issues instead of papering over them and creating a huge maintenance burden for others.
>>
>> After reading the code a little more, I wonder what the best strategy would be. IIUC, most of the kernel is not ready for huge-page-backed vmalloc memory. For example, module_alloc() cannot work with huge pages at the moment. And the error Paul Menzel reported in drm_fb_helper.c will probably hit powerpc with the 5.17 kernel as-is? (trace attached below)
>>
>> Right now, we have VM_NO_HUGE_VMAP to let a user opt out of huge pages. However, given there are so many users of vmalloc, vzalloc, etc., we probably do need a flag for the user to opt in instead.
>>
>> Does this make sense? Any recommendations are really appreciated.
>
> I think there are multiple aspects here:
>
> - if we think that the kernel is not ready for hugepage-backed vmalloc in general, we need to disable it in powerpc for now.

Nicholas and Claudio,

What do you think about the status of hugepage-backed vmalloc on powerpc? I found that module_alloc() and kvm_s390_pv_alloc_vm() opt out of huge pages, but I am not aware of users that benefit from huge pages (except the vfs hash, which was mentioned in 8abddd968a30). Does an opt-in flag (instead of the current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?

Thanks,
Song

> - if we think even in the longer run only some users can cope with hugepage-backed vmalloc, we need to turn it into an opt-in in general, and not just for x86.
> - there still appear to be various unresolved underlying x86-specific issues that need to be fixed either way.
Hi Nicholas and Claudio,

> On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
>
>> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> [...]
>>
>> I think there are multiple aspects here:
>>
>> - if we think that the kernel is not ready for hugepage-backed vmalloc in general, we need to disable it in powerpc for now.
>
> Nicholas and Claudio,
>
> What do you think about the status of hugepage-backed vmalloc on powerpc? I found that module_alloc() and kvm_s390_pv_alloc_vm() opt out of huge pages, but I am not aware of users that benefit from huge pages (except the vfs hash, which was mentioned in 8abddd968a30). Does an opt-in flag (instead of the current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?

Could you please share your comments on this? Specifically, does it make sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think the current opt-out flag is the better approach, what would be the best practice for finding all the cases that need to opt out?

Thanks,
Song

>> - if we think even in the longer run only some users can cope with hugepage-backed vmalloc, we need to turn it into an opt-in in general, and not just for x86.
>> - there still appear to be various unresolved underlying x86-specific issues that need to be fixed either way.
On Thu, 7 Apr 2022 19:57:25 +0000 Song Liu <songliubraving@fb.com> wrote:

> Hi Nicholas and Claudio,
>
>> On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
>>
>> [...]
>>
>> What do you think about the status of hugepage-backed vmalloc on powerpc? I found that module_alloc() and kvm_s390_pv_alloc_vm() opt out of huge pages, but I am not aware of users that benefit from huge pages (except the vfs hash, which was mentioned in 8abddd968a30). Does an opt-in flag (instead of the current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?
>
> Could you please share your comments on this? Specifically, does it make sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think the current opt-out flag is the better approach, what would be the best practice for finding all the cases that need to opt out?

An opt-in flag would surely make sense, and it would be more backwards compatible with existing code. That way each user can decide whether to fix the code to allow for hugepages, if possible at all. For example, the case you mentioned for s390 (kvm_s390_pv_alloc_vm) would not be fixable, because of a hardware limitation (the whole area _must_ be mapped with 4k pages).

If the consensus were to keep the current opt-out, then I guess each user would have to check each usage of vmalloc and similar, and see if anything breaks. To be honest, I think an opt-out would only be possible after having the opt-in for a (long) while, when most users would have fixed their code.

In short, I fully support opt-in.
> On Apr 8, 2022, at 3:08 AM, Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>
> On Thu, 7 Apr 2022 19:57:25 +0000 Song Liu <songliubraving@fb.com> wrote:
>
>> [...]
>>
>> Could you please share your comments on this? Specifically, does it make sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think the current opt-out flag is the better approach, what would be the best practice for finding all the cases that need to opt out?
>
> An opt-in flag would surely make sense, and it would be more backwards compatible with existing code. That way each user can decide whether to fix the code to allow for hugepages, if possible at all. For example, the case you mentioned for s390 (kvm_s390_pv_alloc_vm) would not be fixable, because of a hardware limitation (the whole area _must_ be mapped with 4k pages).
>
> If the consensus were to keep the current opt-out, then I guess each user would have to check each usage of vmalloc and similar, and see if anything breaks. To be honest, I think an opt-out would only be possible after having the opt-in for a (long) while, when most users would have fixed their code.
>
> In short, I fully support opt-in.

Thanks, Claudio! I will prepare patches to replace VM_NO_HUGE_VMAP with an opt-in flag and use the new flag in BPF. Please let me know any comments/suggestions on this direction.

Song
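To make the opt-in direction concrete, here is a hypothetical sketch. The flag name VM_ALLOW_HUGE_VMAP, its value, and the helper below are assumptions for illustration; nothing like this exists in the tree at the time of this thread:

```c
#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Assumed new opt-in flag; the real name and bit value would be chosen
 * by the patches that replace VM_NO_HUGE_VMAP.
 */
#define VM_ALLOW_HUGE_VMAP	0x00000800

/*
 * Illustrative allocator: only a caller that explicitly passes the new
 * flag may get a huge-page-backed mapping; every existing vmalloc,
 * vzalloc, etc. user keeps base-page behavior by default. PMD_SIZE
 * alignment lets the allocation actually use PMD-level huge pages.
 */
static void *vmalloc_huge_sketch(unsigned long size)
{
	return __vmalloc_node_range(size, PMD_SIZE, VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL,
				    VM_ALLOW_HUGE_VMAP, NUMA_NO_NODE,
				    __builtin_return_address(0));
}
```

This inverts the current default: instead of auditing every vmalloc user for huge-page safety (the opt-out problem Claudio describes), only the handful of callers that want huge pages, such as bpf_prog_pack, need to change.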