Message ID | 20220714042420.1847125-3-naoya.horiguchi@linux.dev (mailing list archive) |
---|---|
State | New |
Series | mm, hwpoison: enable 1GB hugepage support (v7) |
On Thu, Jul 14, 2022 at 01:24:14PM +0900, Naoya Horiguchi wrote:
> +/*
> + * pud_huge() returns 1 if @pud is hugetlb related entry, that is normal
> + * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
> + * Otherwise, returns 0.
> + */
>  int pud_huge(pud_t pud)
>  {
> -	return !!(pud_val(pud) & _PAGE_PSE);
> +	return !pud_none(pud) &&
> +		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
>  }

Hi,

This causes i915 to trip a BUG_ON() on x86-32 when I start X.

[ 225.777375] kernel BUG at mm/memory.c:2664!
[ 225.777391] invalid opcode: 0000 [#1] PREEMPT SMP
[ 225.777405] CPU: 0 PID: 2402 Comm: Xorg Not tainted 6.1.0-rc3-bdg+ #86
[ 225.777415] Hardware name: /8I865G775-G, BIOS F1 08/29/2006
[ 225.777421] EIP: __apply_to_page_range+0x24d/0x31c
[ 225.777437] Code: ff ff 8b 55 e8 8b 45 cc e8 0a 11 ec ff 89 d8 83 c4 28 5b 5e 5f 5d c3 81 7d e0 a0 ef 96 c1 74 ad 8b 45 d0 e8 2d 83 49 00 eb a3 <0f> 0b 25 00 f0 ff ff 81 eb 00 00 00 40 01 c3 8b 45 ec 8b 00 e8 76
[ 225.777446] EAX: 00000001 EBX: c53a3b58 ECX: b5c00000 EDX: c258aa00
[ 225.777454] ESI: b5c00000 EDI: b5900000 EBP: c4b0fdb4 ESP: c4b0fd80
[ 225.777462] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010202
[ 225.777470] CR0: 80050033 CR2: b5900000 CR3: 053a3000 CR4: 000006d0
[ 225.777479] Call Trace:
[ 225.777486] ? i915_memcpy_init_early+0x63/0x63 [i915]
[ 225.777684] apply_to_page_range+0x21/0x27
[ 225.777694] ? i915_memcpy_init_early+0x63/0x63 [i915]
[ 225.777870] remap_io_mapping+0x49/0x75 [i915]
[ 225.778046] ? i915_memcpy_init_early+0x63/0x63 [i915]
[ 225.778220] ? mutex_unlock+0xb/0xd
[ 225.778231] ? i915_vma_pin_fence+0x6d/0xf7 [i915]
[ 225.778420] vm_fault_gtt+0x2a9/0x8f1 [i915]
[ 225.778644] ? lock_is_held_type+0x56/0xe7
[ 225.778655] ? lock_is_held_type+0x7a/0xe7
[ 225.778663] ? 0xc1000000
[ 225.778670] __do_fault+0x21/0x6a
[ 225.778679] handle_mm_fault+0x708/0xb21
[ 225.778686] ? mt_find+0x21e/0x5ae
[ 225.778696] exc_page_fault+0x185/0x705
[ 225.778704] ? doublefault_shim+0x127/0x127
[ 225.778715] handle_exception+0x130/0x130
[ 225.778723] EIP: 0xb700468a
[ 225.778730] Code: 44 24 40 8b 7c 24 1c 89 47 54 8b 44 24 5c 65 2b 05 14 00 00 00 0f 85 8a 01 00 00 83 c4 6c 5b 5e 5f 5d c3 8b 44 24 1c 8b 40 28 <c7> 00 00 00 00 00 8b 44 24 20 8d 90 20 1b 00 00 8b 02 83 e8 01 89
[ 225.778738] EAX: b5900000 EBX: b7148000 ECX: 00000000 EDX: 00000000
[ 225.778745] ESI: 0103eb60 EDI: b7148000 EBP: b6cf7000 ESP: bfd76650
[ 225.778752] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00010246
[ 225.778761] ? doublefault_shim+0x127/0x127
[ 225.778769] Modules linked in: i915 prime_numbers i2c_algo_bit iosf_mbi drm_buddy video wmi drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm drm_panel_orientation_quirks backlight cfg80211 rfkill sch_fq_codel xt_tcpudp xt_multiport xt_state iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 ip_tables x_tables binfmt_misc i2c_dev iTCO_wdt snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer psmouse i2c_i801 snd i2c_smbus uhci_hcd i2c_core pcspkr soundcore lpc_ich mfd_core ehci_pci ehci_hcd skge intel_agp intel_gtt usbcore agpgart usb_common rng_core parport_pc parport evdev
[ 225.778899] ---[ end trace 0000000000000000 ]---
[ 225.778906] EIP: __apply_to_page_range+0x24d/0x31c
[ 225.778916] Code: ff ff 8b 55 e8 8b 45 cc e8 0a 11 ec ff 89 d8 83 c4 28 5b 5e 5f 5d c3 81 7d e0 a0 ef 96 c1 74 ad 8b 45 d0 e8 2d 83 49 00 eb a3 <0f> 0b 25 00 f0 ff ff 81 eb 00 00 00 40 01 c3 8b 45 ec 8b 00 e8 76
[ 225.778924] EAX: 00000001 EBX: c53a3b58 ECX: b5c00000 EDX: c258aa00
[ 225.778931] ESI: b5c00000 EDI: b5900000 EBP: c4b0fdb4 ESP: c4b0fd80
[ 225.778938] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010202
[ 225.778946] CR0: 80050033 CR2: b5900000 CR3: 053a3000 CR4: 000006d0
On Wed, Nov 02, 2022 at 10:51:40PM +0200, Ville Syrjälä wrote:
> On Thu, Jul 14, 2022 at 01:24:14PM +0900, Naoya Horiguchi wrote:
> > +/*
> > + * pud_huge() returns 1 if @pud is hugetlb related entry, that is normal
> > + * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
> > + * Otherwise, returns 0.
> > + */
> >  int pud_huge(pud_t pud)
> >  {
> > -	return !!(pud_val(pud) & _PAGE_PSE);
> > +	return !pud_none(pud) &&
> > +		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
> >  }
>
> Hi,
>
> This causes i915 to trip a BUG_ON() on x86-32 when I start X.

Hello,

Thank you for finding and reporting the issue.

x86-32 does not enable CONFIG_ARCH_HAS_GIGANTIC_PAGE, so pud_huge() is
supposed to be false on x86-32.  Doing like below looks to me a fix
(reverting to the original behavior for x86-32):

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 6b3033845c6d..bf73f25aaa32 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -37,8 +37,12 @@ int pmd_huge(pmd_t pmd)
  */
 int pud_huge(pud_t pud)
 {
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 	return !pud_none(pud) &&
 		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
+#else
+	return !!(pud_val(pud) & _PAGE_PSE);	// or "return 0;" ?
+#endif
 }
 
 #ifdef CONFIG_HUGETLB_PAGE

Let me guess what the PUD entry was there when triggering the issue.
Assuming that the original code (before 3a194f3f8ad0) was correct, the PSE
bit in pud_val(pud) should be always cleared.  So, when pud_huge() returns
true since 3a194f3f8ad0, the PRESENT bit should be clear and some other
bits (rather than PRESENT and PSE) are set so that pud_none() is false.
I'm not sure what such a non-present PUD entry does mean.

Thanks,
Naoya Horiguchi

> --
> Ville Syrjälä
> Intel
On Sat, Nov 05, 2022 at 12:59:30AM +0900, Naoya Horiguchi wrote:
> Let me guess what the PUD entry was there when triggering the issue.
> Assuming that the original code (before 3a194f3f8ad0) was correct, the PSE
> bit in pud_val(pud) should be always cleared.  So, when pud_huge() returns
> true since 3a194f3f8ad0, the PRESENT bit should be clear and some other
> bits (rather than PRESENT and PSE) are set so that pud_none() is false.
> I'm not sure what such a non-present PUD entry does mean.

pud_val()==0 when it blows up, and pud_none() is false because
pgtable-nopmd.h says so with 2 level paging.

And given that I just tested with PAE / 3 level paging,
and sure enough it no longer blows up.

So looks to me like maybe this new code just doesn't understand
how the levels get folded.

I might also be missing something obvious, but why is it even
necessary to treat PRESENT==0+PSE==0 as a huge entry?
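
The failure mode described in this message can be reproduced in a small userspace model. The sketch below is not kernel code: the `MODEL_*` constants and the helpers `pud_none_folded()`, `pud_none_real()`, `pud_huge_old()` and `pud_huge_new()` are hypothetical names introduced for illustration, and `pud_none_real()` is simplified to a plain zero test. It only tries to show why a zero entry ends up classified as huge once pud_none() is hard-wired to false by a folded level.

```c
/* Userspace model only; none of these helpers exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_PRESENT UINT64_C(0x001) /* bit 0, as on x86 */
#define MODEL_PAGE_PSE     UINT64_C(0x080) /* bit 7, as on x86 */

/* pud_none() as a folded level reports it: an entry is never "none". */
static bool pud_none_folded(uint64_t pud)
{
	(void)pud;
	return false;
}

/* pud_none() when the level really exists (assumption: zero means none). */
static bool pud_none_real(uint64_t pud)
{
	return pud == 0;
}

/* pud_huge() before the patch: looks at the PSE bit only. */
static bool pud_huge_old(uint64_t pud)
{
	return (pud & MODEL_PAGE_PSE) != 0;
}

/* pud_huge() after the patch, parameterised on the pud_none() flavour. */
static bool pud_huge_new(uint64_t pud, bool (*none)(uint64_t))
{
	return !none(pud) &&
	       (pud & (MODEL_PAGE_PRESENT | MODEL_PAGE_PSE)) != MODEL_PAGE_PRESENT;
}

/* Documents the behaviour discussed in the thread for a zero entry. */
static void folded_model_self_check(void)
{
	assert(!pud_huge_old(0));                  /* old check: not huge */
	assert(!pud_huge_new(0, pud_none_real));   /* real level: not huge */
	assert(pud_huge_new(0, pud_none_folded));  /* folded level: "huge" */
}
```

Calling `folded_model_self_check()` from a test `main()` exercises all three cases; only the folded variant reports the zero entry as huge, which matches the observation that the problem goes away with PAE.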
On Sat, Nov 05, 2022 at 12:23:40AM +0200, Ville Syrjälä wrote:
> pud_val()==0 when it blows up, and pud_none() is false because
> pgtable-nopmd.h says so with 2 level paging.
>
> And given that I just tested with PAE / 3 level paging,
> and sure enough it no longer blows up.
>
> So looks to me like maybe this new code just doesn't understand
> how the levels get folded.

OK, so branching based on "#if CONFIG_PGTABLE_LEVELS > 2" seems better.
Thank you for additional testing.

> I might also be missing something obvious, but why is it even
> necessary to treat PRESENT==0+PSE==0 as a huge entry?

The format of pud entry differs based on PRESENT bit, and PSE bit is
checked before PRESENT bit.  So in order to distinguish from a normal
huge entry, we had to define that a non-present huge entry should have
its PSE bit cleared (although this sounds counter-intuitive).

Thanks,
Naoya Horiguchi
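
The two points in this reply (branching on the number of page-table levels, and the PRESENT/PSE encoding) can be combined in one userspace sketch. Everything here is an assumption made for illustration: `model_pud_huge()` and the `MODEL_*` constants are hypothetical, a runtime `levels` argument stands in for the preprocessor test on CONFIG_PGTABLE_LEVELS so that both branches can be exercised, the non-zero test stands in for !pud_none(), and returning false for two levels follows the `return 0;` alternative floated earlier in the thread rather than any confirmed final patch.

```c
/* Userspace model only; not the code that was eventually merged. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_PRESENT UINT64_C(0x001) /* bit 0, as on x86 */
#define MODEL_PAGE_PSE     UINT64_C(0x080) /* bit 7, as on x86 */

/*
 * levels stands in for CONFIG_PGTABLE_LEVELS. With two levels the PUD is
 * folded, so this model never reports a huge entry (an assumption).
 */
static bool model_pud_huge(uint64_t pud, int levels)
{
	if (levels <= 2)
		return false;
	return pud != 0 &&
	       (pud & (MODEL_PAGE_PRESENT | MODEL_PAGE_PSE)) != MODEL_PAGE_PRESENT;
}

static void levels_model_self_check(void)
{
	/* Arbitrary bit standing in for the payload of a non-present entry. */
	const uint64_t other = UINT64_C(0x1000);

	assert(!model_pud_huge(0, 2));      /* folded level, zero entry */
	assert(!model_pud_huge(other, 2));  /* folded level, any entry */
	assert(!model_pud_huge(0, 3));      /* empty entry */

	/* PRESENT set, PSE clear: points to a lower-level table. */
	assert(!model_pud_huge(MODEL_PAGE_PRESENT | other, 3));
	/* PRESENT and PSE set: normal huge entry. */
	assert(model_pud_huge(MODEL_PAGE_PRESENT | MODEL_PAGE_PSE | other, 3));
	/* PRESENT and PSE clear, non-zero: migration or hwpoison entry. */
	assert(model_pud_huge(other, 3));
}
```

The last assertion is the counter-intuitive case from the explanation above: a non-present huge entry is recognised precisely because both PRESENT and PSE are clear while the entry is non-zero.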
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 509408da0da1..6b3033845c6d 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -30,9 +30,15 @@ int pmd_huge(pmd_t pmd)
 		(pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
 }
 
+/*
+ * pud_huge() returns 1 if @pud is hugetlb related entry, that is normal
+ * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry.
+ * Otherwise, returns 0.
+ */
 int pud_huge(pud_t pud)
 {
-	return !!(pud_val(pud) & _PAGE_PSE);
+	return !pud_none(pud) &&
+		(pud_val(pud) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cf8ccee7654c..77119d93a0f9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6978,10 +6978,38 @@ struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
-	if (flags & (FOLL_GET | FOLL_PIN))
+	struct page *page = NULL;
+	spinlock_t *ptl;
+	pte_t pte;
+
+	if (WARN_ON_ONCE(flags & FOLL_PIN))
 		return NULL;
 
-	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+retry:
+	ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
+	if (!pud_huge(*pud))
+		goto out;
+	pte = huge_ptep_get((pte_t *)pud);
+	if (pte_present(pte)) {
+		page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
+	} else {
+		if (is_hugetlb_entry_migration(pte)) {
+			spin_unlock(ptl);
+			__migration_entry_wait(mm, (pte_t *)pud, ptl);
+			goto retry;
+		}
+		/*
+		 * hwpoisoned entry is treated as no_page_table in
+		 * follow_page_mask().
+		 */
+	}
+out:
+	spin_unlock(ptl);
+	return page;
 }
 
 struct page * __weak
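
The expression `(address & ~PUD_MASK) >> PAGE_SHIFT` in follow_huge_pud() picks the base page inside the PUD-sized hugepage that contains `address`. A small worked sketch of that arithmetic follows, assuming the x86-64 values PAGE_SHIFT = 12 and PUD_SHIFT = 30 (4 KiB base pages, 1 GiB PUD mappings). The `MODEL_*` macros and `subpage_index()` are hypothetical names, not kernel symbols.

```c
/* Userspace arithmetic sketch; constants assume x86-64. */
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* 4 KiB base pages */
#define MODEL_PUD_SHIFT  30 /* 1 GiB per PUD entry */
#define MODEL_PUD_MASK   (~((UINT64_C(1) << MODEL_PUD_SHIFT) - 1))

/* Index of the 4 KiB page within the 1 GiB hugepage covering address. */
static uint64_t subpage_index(uint64_t address)
{
	return (address & ~MODEL_PUD_MASK) >> MODEL_PAGE_SHIFT;
}

static void subpage_index_self_check(void)
{
	/* Start of a 1 GiB-aligned region: first base page. */
	assert(subpage_index(UINT64_C(0x40000000)) == 0);
	/* Offsets below 4 KiB stay within the first base page. */
	assert(subpage_index(UINT64_C(0x40000fff)) == 0);
	/* One base page further. */
	assert(subpage_index(UINT64_C(0x40001000)) == 1);
	/* Last base page: 2^18 - 1 = 262143. */
	assert(subpage_index(UINT64_C(0x7ffff000)) == 262143);
}
```

In the patch this index is added to the head page returned by pud_page(), so the result is the struct page for the exact 4 KiB slice that was looked up.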