Message ID | 20250114080927.2616684-2-vivek.kasireddy@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/memfd: reserve hugetlb folios before allocation | expand |
On Tue, 14 Jan 2025 00:08:00 -0800 Vivek Kasireddy <vivek.kasireddy@intel.com> wrote: > There are cases when we try to pin a folio but discover that it has > not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > but there is a chance that we might encounter a crash/failure > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > at that instant. This issue was reported by syzbot: > > kernel BUG at mm/hugetlb.c:2403! > > ... > > Therefore, to avoid this situation and fix this issue, we just need > to make a reservation (by calling hugetlb_reserve_pages()) before > we try to allocate the folio. This will ensure that we are properly > doing region/subpool accounting associated with our allocation. > > While at it, move subpool_inode() into hugetlb header and also > replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to > crash the system in this scenario and instead we could just warn > and fail the allocation. > > ... > > @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid, > struct folio *folio; > > spin_lock_irq(&hugetlb_lock); > + if (WARN_ON_ONCE(!h->resv_huge_pages)) { > + spin_unlock_irq(&hugetlb_lock); > + return NULL; > + } > + What is is that we're warning of here? Is there any action which either kernel developers or the user can take to prevent this warning from being issued? IOW, maybe the WARN shouldn't be present?
Hi Andrew, > Subject: Re: [PATCH v2 1/2] mm/memfd: reserve hugetlb folios before > allocation > > > > There are cases when we try to pin a folio but discover that it has > > not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > > but there is a chance that we might encounter a crash/failure > > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > > at that instant. This issue was reported by syzbot: > > > > kernel BUG at mm/hugetlb.c:2403! > > > > ... > > > > Therefore, to avoid this situation and fix this issue, we just need > > to make a reservation (by calling hugetlb_reserve_pages()) before > > we try to allocate the folio. This will ensure that we are properly > > doing region/subpool accounting associated with our allocation. > > > > While at it, move subpool_inode() into hugetlb header and also > > replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to > > crash the system in this scenario and instead we could just warn > > and fail the allocation. > > > > ... > > > > @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct > hstate *h, int preferred_nid, > > struct folio *folio; > > > > spin_lock_irq(&hugetlb_lock); > > + if (WARN_ON_ONCE(!h->resv_huge_pages)) { > > + spin_unlock_irq(&hugetlb_lock); > > + return NULL; > > + } > > + > > What is is that we're warning of here? The warning serves two purposes: 1) To flag a situation that is unexpected at that instant 2) To alert the callers (mostly future) that they need to somehow reserve their hugetlb folios before calling this function > Is there any action which > either kernel developers or the user can take to prevent this warning > from being issued? Yeah, the callers of this function need to make a reservation and ensure that hugetlb_reserve_pages() gets called (probably via hugetlbfs_file_mmap() or other possible means) before they get their folios allocated via this function. > > IOW, maybe the WARN shouldn't be present? Instead of silently failing, warning the caller about the failure mode seems like the right thing to do in this situation. However, I am OK if this warning is not present (given that this not a common use-case as of now) but do you see any concern if it stays? Thanks, Vivek
On 14.01.25 09:08, Vivek Kasireddy wrote: > There are cases when we try to pin a folio but discover that it has > not been faulted-in. So, we try to allocate it in memfd_alloc_folio() > but there is a chance that we might encounter a crash/failure > (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations > at that instant. This issue was reported by syzbot: > > kernel BUG at mm/hugetlb.c:2403! > Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI > CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted > 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 > Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 > f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f > RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 > RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 > RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed > RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 > R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 > R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 > FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 > memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 > udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] > udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 > udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] > udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 > vfs_ioctl fs/ioctl.c:51 [inline] > __do_sys_ioctl fs/ioctl.c:906 [inline] > __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Therefore, to avoid this situation and fix this issue, we just need > to make a reservation (by calling hugetlb_reserve_pages()) before > we try to allocate the folio. This will ensure that we are properly > doing region/subpool accounting associated with our allocation. > > While at it, move subpool_inode() into hugetlb header and also > replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to > crash the system in this scenario and instead we could just warn > and fail the allocation. > > Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") > Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > Cc: Steve Sistare <steven.sistare@oracle.com> > Cc: Muchun Song <muchun.song@linux.dev> > Cc: David Hildenbrand <david@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > include/linux/hugetlb.h | 5 +++++ > mm/hugetlb.c | 14 ++++++-------- > mm/memfd.c | 14 +++++++++++--- > 3 files changed, 22 insertions(+), 11 deletions(-) > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > index ae4fe8615bb6..38c580548564 100644 > --- a/include/linux/hugetlb.h > +++ b/include/linux/hugetlb.h > @@ -712,6 +712,11 @@ extern unsigned int default_hstate_idx; > > #define default_hstate (hstates[default_hstate_idx]) > > +static inline struct hugepage_subpool *subpool_inode(struct inode *inode) > +{ > + return HUGETLBFS_SB(inode->i_sb)->spool; > +} > + > static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio) > { > return folio->_hugetlb_subpool; > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index c498874a7170..ef948f56b864 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -251,11 +251,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool, > return ret; > } > > -static inline struct hugepage_subpool *subpool_inode(struct inode *inode) > -{ > - return HUGETLBFS_SB(inode->i_sb)->spool; > -} > - > static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) > { > return subpool_inode(file_inode(vma->vm_file)); > @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid, > struct folio *folio; > > spin_lock_irq(&hugetlb_lock); > + if (WARN_ON_ONCE(!h->resv_huge_pages)) { > + spin_unlock_irq(&hugetlb_lock); > + return NULL; > + } > + > folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid, > nmask); > - if (folio) { > - VM_BUG_ON(!h->resv_huge_pages); > + if (folio) > h->resv_huge_pages--; > - } > > spin_unlock_irq(&hugetlb_lock); > return folio; > diff --git a/mm/memfd.c b/mm/memfd.c > index 35a370d75c9a..0d128c44fb78 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -70,7 +70,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > #ifdef CONFIG_HUGETLB_PAGE > struct folio *folio; > gfp_t gfp_mask; > - int err; > + int err = -ENOMEM; > > if (is_file_hugepages(memfd)) { > /* > @@ -79,12 +79,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > * alloc from. Also, the folio will be pinned for an indefinite > * amount of time, so it is not expected to be migrated away. > */ > + struct inode *inode = file_inode(memfd); > struct hstate *h = hstate_file(memfd); > > gfp_mask = htlb_alloc_mask(h); > gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); > idx >>= huge_page_order(h); > > + if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0)) > + return ERR_PTR(err); > + > folio = alloc_hugetlb_folio_reserve(h, > numa_node_id(), > NULL, > @@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > idx); > if (err) { > folio_put(folio); > - return ERR_PTR(err); > + goto err; > } > + > + hugetlb_set_folio_subpool(folio, subpool_inode(inode)); > folio_unlock(folio); > return folio; > } > - return ERR_PTR(-ENOMEM); > +err: > + hugetlb_unreserve_pages(inode, idx, idx + 1, 0); Hmmm, shouldn't we maybe only un-reserve if we were responsible for the reservation above? If it's already reserved before this call, we should probably leave it as is? Or maybe we never want to un-reserve at all here?
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index ae4fe8615bb6..38c580548564 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -712,6 +712,11 @@ extern unsigned int default_hstate_idx; #define default_hstate (hstates[default_hstate_idx]) +static inline struct hugepage_subpool *subpool_inode(struct inode *inode) +{ + return HUGETLBFS_SB(inode->i_sb)->spool; +} + static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio) { return folio->_hugetlb_subpool; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c498874a7170..ef948f56b864 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -251,11 +251,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool, return ret; } -static inline struct hugepage_subpool *subpool_inode(struct inode *inode) -{ - return HUGETLBFS_SB(inode->i_sb)->spool; -} - static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) { return subpool_inode(file_inode(vma->vm_file)); @@ -2397,12 +2392,15 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid, struct folio *folio; spin_lock_irq(&hugetlb_lock); + if (WARN_ON_ONCE(!h->resv_huge_pages)) { + spin_unlock_irq(&hugetlb_lock); + return NULL; + } + folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, preferred_nid, nmask); - if (folio) { - VM_BUG_ON(!h->resv_huge_pages); + if (folio) h->resv_huge_pages--; - } spin_unlock_irq(&hugetlb_lock); return folio; diff --git a/mm/memfd.c b/mm/memfd.c index 35a370d75c9a..0d128c44fb78 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -70,7 +70,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) #ifdef CONFIG_HUGETLB_PAGE struct folio *folio; gfp_t gfp_mask; - int err; + int err = -ENOMEM; if (is_file_hugepages(memfd)) { /* @@ -79,12 +79,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) * alloc from. Also, the folio will be pinned for an indefinite * amount of time, so it is not expected to be migrated away. */ + struct inode *inode = file_inode(memfd); struct hstate *h = hstate_file(memfd); gfp_mask = htlb_alloc_mask(h); gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE); idx >>= huge_page_order(h); + if (!hugetlb_reserve_pages(inode, idx, idx + 1, NULL, 0)) + return ERR_PTR(err); + folio = alloc_hugetlb_folio_reserve(h, numa_node_id(), NULL, @@ -95,12 +99,16 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) idx); if (err) { folio_put(folio); - return ERR_PTR(err); + goto err; } + + hugetlb_set_folio_subpool(folio, subpool_inode(inode)); folio_unlock(folio); return folio; } - return ERR_PTR(-ENOMEM); +err: + hugetlb_unreserve_pages(inode, idx, idx + 1, 0); + return ERR_PTR(err); } #endif return shmem_read_folio(memfd->f_mapping, idx);
There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but there is a chance that we might encounter a crash/failure (VM_BUG_ON(!h->resv_huge_pages)) if there are no active reservations at that instant. This issue was reported by syzbot: kernel BUG at mm/hugetlb.c:2403! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Therefore, to avoid this situation and fix this issue, we just need to make a reservation (by calling hugetlb_reserve_pages()) before we try to allocate the folio. This will ensure that we are properly doing region/subpool accounting associated with our allocation. While at it, move subpool_inode() into hugetlb header and also replace the VM_BUG_ON() with WARN_ON_ONCE() as there is no need to crash the system in this scenario and instead we could just warn and fail the allocation. Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> --- include/linux/hugetlb.h | 5 +++++ mm/hugetlb.c | 14 ++++++-------- mm/memfd.c | 14 +++++++++++--- 3 files changed, 22 insertions(+), 11 deletions(-)