diff mbox series

[v2] secretmem: fix unhandled fault in truncate

Message ID 20220707165650.248088-1-rppt@kernel.org (mailing list archive)
State New
Headers show
Series [v2] secretmem: fix unhandled fault in truncate | expand

Commit Message

Mike Rapoport July 7, 2022, 4:56 p.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

syzkaller reports the following issue:

BUG: unable to handle page fault for address: ffff888021f7e005
PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
Oops: 0002 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
FS:  00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 zero_user_segments include/linux/highmem.h:272 [inline]
 folio_zero_range include/linux/highmem.h:428 [inline]
 truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
 truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
 truncate_inode_pages mm/truncate.c:452 [inline]
 truncate_pagecache+0x63/0x90 mm/truncate.c:753
 simple_setattr+0xed/0x110 fs/libfs.c:535
 secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
 notify_change+0xb8c/0x12b0 fs/attr.c:424
 do_truncate+0x13c/0x200 fs/open.c:65
 do_sys_ftruncate+0x536/0x730 fs/open.c:193
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fb29d900899
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
 </TASK>
Modules linked in:
CR2: ffff888021f7e005
---[ end trace 0000000000000000 ]---

Eric Biggers suggested that this happens when
secretmem_setattr()->simple_setattr() races with secretmem_fault() so
that a page that is faulted in by secretmem_fault() (and thus removed
from the direct map) is zeroed by inode truncation right afterwards.

Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
secretmem_fault() prevents the race.

Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---

v2: use inode_lock_shared() rather than add a new rw_sem to secretmem

Axel, I didn't add your Reviewed-by because v2 is quite different.

 mm/secretmem.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)


base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a

Comments

Yang Shi July 7, 2022, 5:48 p.m. UTC | #1
On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> syzkaller reports the following issue:
>
> BUG: unable to handle page fault for address: ffff888021f7e005
> PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> Oops: 0002 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> FS:  00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  zero_user_segments include/linux/highmem.h:272 [inline]
>  folio_zero_range include/linux/highmem.h:428 [inline]
>  truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
>  truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
>  truncate_inode_pages mm/truncate.c:452 [inline]
>  truncate_pagecache+0x63/0x90 mm/truncate.c:753
>  simple_setattr+0xed/0x110 fs/libfs.c:535
>  secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
>  notify_change+0xb8c/0x12b0 fs/attr.c:424
>  do_truncate+0x13c/0x200 fs/open.c:65
>  do_sys_ftruncate+0x536/0x730 fs/open.c:193
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7fb29d900899
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
>  </TASK>
> Modules linked in:
> CR2: ffff888021f7e005
> ---[ end trace 0000000000000000 ]---
>
> Eric Biggers suggested that this happens when
> secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> that a page that is faulted in by secretmem_fault() (and thus removed
> from the direct map) is zeroed by inode truncation right afterwards.
>
> Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> secretmem_fault() prevents the race.

Should invalidate_lock be used to serialize between page fault and truncate?


>
> Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> Suggested-by: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>
> v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
>
> Axel, I didn't add your Reviewed-by because v2 is quite different.
>
>  mm/secretmem.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 206ed6b40c1d..a4fabf705e4f 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>         gfp_t gfp = vmf->gfp_mask;
>         unsigned long addr;
>         struct page *page;
> +       vm_fault_t ret;
>         int err;
>
>         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>                 return vmf_error(-EINVAL);
>
> +       inode_lock_shared(inode);
> +
>  retry:
>         page = find_lock_page(mapping, offset);
>         if (!page) {
>                 page = alloc_page(gfp | __GFP_ZERO);
> -               if (!page)
> -                       return VM_FAULT_OOM;
> +               if (!page) {
> +                       ret = VM_FAULT_OOM;
> +                       goto out;
> +               }
>
>                 err = set_direct_map_invalid_noflush(page);
>                 if (err) {
>                         put_page(page);
> -                       return vmf_error(err);
> +                       ret = vmf_error(err);
> +                       goto out;
>                 }
>
>                 __SetPageUptodate(page);
> @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>                         if (err == -EEXIST)
>                                 goto retry;
>
> -                       return vmf_error(err);
> +                       ret = vmf_error(err);
> +                       goto out;
>                 }
>
>                 addr = (unsigned long)page_address(page);
> @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>         }
>
>         vmf->page = page;
> -       return VM_FAULT_LOCKED;
> +       ret = VM_FAULT_LOCKED;
> +
> +out:
> +       inode_unlock_shared(inode);
> +       return ret;
>  }
>
>  static const struct vm_operations_struct secretmem_vm_ops = {
>
> base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> --
> 2.34.1
>
>
Axel Rasmussen July 7, 2022, 8:27 p.m. UTC | #2
Looks correct to me - at least, I'm confident it will prevent the
race. I'm not really familiar enough with filesystem APIs to know
whether Yang's suggestion is better or not, though.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>


On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> syzkaller reports the following issue:
>
> BUG: unable to handle page fault for address: ffff888021f7e005
> PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> Oops: 0002 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> FS:  00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  zero_user_segments include/linux/highmem.h:272 [inline]
>  folio_zero_range include/linux/highmem.h:428 [inline]
>  truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
>  truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
>  truncate_inode_pages mm/truncate.c:452 [inline]
>  truncate_pagecache+0x63/0x90 mm/truncate.c:753
>  simple_setattr+0xed/0x110 fs/libfs.c:535
>  secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
>  notify_change+0xb8c/0x12b0 fs/attr.c:424
>  do_truncate+0x13c/0x200 fs/open.c:65
>  do_sys_ftruncate+0x536/0x730 fs/open.c:193
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7fb29d900899
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
>  </TASK>
> Modules linked in:
> CR2: ffff888021f7e005
> ---[ end trace 0000000000000000 ]---
>
> Eric Biggers suggested that this happens when
> secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> that a page that is faulted in by secretmem_fault() (and thus removed
> from the direct map) is zeroed by inode truncation right afterwards.
>
> Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> secretmem_fault() prevents the race.
>
> Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> Suggested-by: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>
> v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
>
> Axel, I didn't add your Reviewed-by because v2 is quite different.
>
>  mm/secretmem.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 206ed6b40c1d..a4fabf705e4f 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>         gfp_t gfp = vmf->gfp_mask;
>         unsigned long addr;
>         struct page *page;
> +       vm_fault_t ret;
>         int err;
>
>         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>                 return vmf_error(-EINVAL);
>
> +       inode_lock_shared(inode);
> +
>  retry:
>         page = find_lock_page(mapping, offset);
>         if (!page) {
>                 page = alloc_page(gfp | __GFP_ZERO);
> -               if (!page)
> -                       return VM_FAULT_OOM;
> +               if (!page) {
> +                       ret = VM_FAULT_OOM;
> +                       goto out;
> +               }
>
>                 err = set_direct_map_invalid_noflush(page);
>                 if (err) {
>                         put_page(page);
> -                       return vmf_error(err);
> +                       ret = vmf_error(err);
> +                       goto out;
>                 }
>
>                 __SetPageUptodate(page);
> @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>                         if (err == -EEXIST)
>                                 goto retry;
>
> -                       return vmf_error(err);
> +                       ret = vmf_error(err);
> +                       goto out;
>                 }
>
>                 addr = (unsigned long)page_address(page);
> @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>         }
>
>         vmf->page = page;
> -       return VM_FAULT_LOCKED;
> +       ret = VM_FAULT_LOCKED;
> +
> +out:
> +       inode_unlock_shared(inode);
> +       return ret;
>  }
>
>  static const struct vm_operations_struct secretmem_vm_ops = {
>
> base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> --
> 2.34.1
>
Darrick J. Wong July 7, 2022, 8:55 p.m. UTC | #3
On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > syzkaller reports the following issue:
> >
> > BUG: unable to handle page fault for address: ffff888021f7e005
> > PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> > Oops: 0002 [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> > Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> > RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> > RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> > RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> > RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> > R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> > R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> > FS:  00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >  <TASK>
> >  zero_user_segments include/linux/highmem.h:272 [inline]
> >  folio_zero_range include/linux/highmem.h:428 [inline]
> >  truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
> >  truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
> >  truncate_inode_pages mm/truncate.c:452 [inline]
> >  truncate_pagecache+0x63/0x90 mm/truncate.c:753
> >  simple_setattr+0xed/0x110 fs/libfs.c:535
> >  secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
> >  notify_change+0xb8c/0x12b0 fs/attr.c:424
> >  do_truncate+0x13c/0x200 fs/open.c:65
> >  do_sys_ftruncate+0x536/0x730 fs/open.c:193
> >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> >  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > RIP: 0033:0x7fb29d900899
> > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> > RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> > RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> > RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> > R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
> >  </TASK>
> > Modules linked in:
> > CR2: ffff888021f7e005
> > ---[ end trace 0000000000000000 ]---
> >
> > Eric Biggers suggested that this happens when
> > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > that a page that is faulted in by secretmem_fault() (and thus removed
> > from the direct map) is zeroed by inode truncation right afterwards.
> >
> > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > secretmem_fault() prevents the race.
> 
> Should invalidate_lock be used to serialize between page fault and truncate?

I would have thought so, given Documentation/filesystems/locking.rst:

"->fault() is called when a previously not present pte is about to be
faulted in. The filesystem must find and return the page associated with
the passed in "pgoff" in the vm_fault structure. If it is possible that
the page may be truncated and/or invalidated, then the filesystem must
lock invalidate_lock, then ensure the page is not already truncated
(invalidate_lock will block subsequent truncate), and then return with
VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."

IIRC page faults aren't supposed to take i_rwsem because the fault could
be in response to someone mmaping a file into memory and then write()ing
to the same file using the mmapped region.  The write() takes
inode_lock and faults on the buffer, so the fault cannot take inode_lock
again.

That said... I don't think memfd_secret files /can/ be written to?  Hard
to say, since I can't find a manpage describing what that syscall does.

--D

> 
> >
> > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >
> > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> >
> > Axel, I didn't add your Reviewed-by because v2 is quite different.
> >
> >  mm/secretmem.c | 21 ++++++++++++++++-----
> >  1 file changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > index 206ed6b40c1d..a4fabf705e4f 100644
> > --- a/mm/secretmem.c
> > +++ b/mm/secretmem.c
> > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> >         gfp_t gfp = vmf->gfp_mask;
> >         unsigned long addr;
> >         struct page *page;
> > +       vm_fault_t ret;
> >         int err;
> >
> >         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> >                 return vmf_error(-EINVAL);
> >
> > +       inode_lock_shared(inode);
> > +
> >  retry:
> >         page = find_lock_page(mapping, offset);
> >         if (!page) {
> >                 page = alloc_page(gfp | __GFP_ZERO);
> > -               if (!page)
> > -                       return VM_FAULT_OOM;
> > +               if (!page) {
> > +                       ret = VM_FAULT_OOM;
> > +                       goto out;
> > +               }
> >
> >                 err = set_direct_map_invalid_noflush(page);
> >                 if (err) {
> >                         put_page(page);
> > -                       return vmf_error(err);
> > +                       ret = vmf_error(err);
> > +                       goto out;
> >                 }
> >
> >                 __SetPageUptodate(page);
> > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> >                         if (err == -EEXIST)
> >                                 goto retry;
> >
> > -                       return vmf_error(err);
> > +                       ret = vmf_error(err);
> > +                       goto out;
> >                 }
> >
> >                 addr = (unsigned long)page_address(page);
> > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> >         }
> >
> >         vmf->page = page;
> > -       return VM_FAULT_LOCKED;
> > +       ret = VM_FAULT_LOCKED;
> > +
> > +out:
> > +       inode_unlock_shared(inode);
> > +       return ret;
> >  }
> >
> >  static const struct vm_operations_struct secretmem_vm_ops = {
> >
> > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > --
> > 2.34.1
> >
> >
Yang Shi July 7, 2022, 10:09 p.m. UTC | #4
On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > syzkaller reports the following issue:
> > >
> > > BUG: unable to handle page fault for address: ffff888021f7e005
> > > PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> > > Oops: 0002 [#1] PREEMPT SMP KASAN
> > > CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > > RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> > > Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> > > RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> > > RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> > > RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> > > RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> > > R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> > > R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> > > FS:  00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > >  <TASK>
> > >  zero_user_segments include/linux/highmem.h:272 [inline]
> > >  folio_zero_range include/linux/highmem.h:428 [inline]
> > >  truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
> > >  truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
> > >  truncate_inode_pages mm/truncate.c:452 [inline]
> > >  truncate_pagecache+0x63/0x90 mm/truncate.c:753
> > >  simple_setattr+0xed/0x110 fs/libfs.c:535
> > >  secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
> > >  notify_change+0xb8c/0x12b0 fs/attr.c:424
> > >  do_truncate+0x13c/0x200 fs/open.c:65
> > >  do_sys_ftruncate+0x536/0x730 fs/open.c:193
> > >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > >  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > >  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > RIP: 0033:0x7fb29d900899
> > > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > > RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> > > RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> > > RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> > > RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> > > R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> > > R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
> > >  </TASK>
> > > Modules linked in:
> > > CR2: ffff888021f7e005
> > > ---[ end trace 0000000000000000 ]---
> > >
> > > Eric Biggers suggested that this happens when
> > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > from the direct map) is zeroed by inode truncation right afterwards.
> > >
> > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > secretmem_fault() prevents the race.
> >
> > Should invalidate_lock be used to serialize between page fault and truncate?
>
> I would have thought so, given Documentation/filesystems/locking.rst:
>
> "->fault() is called when a previously not present pte is about to be
> faulted in. The filesystem must find and return the page associated with
> the passed in "pgoff" in the vm_fault structure. If it is possible that
> the page may be truncated and/or invalidated, then the filesystem must
> lock invalidate_lock, then ensure the page is not already truncated
> (invalidate_lock will block subsequent truncate), and then return with
> VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
>
> IIRC page faults aren't supposed to take i_rwsem because the fault could
> be in response to someone mmaping a file into memory and then write()ing
> to the same file using the mmapped region.  The write() takes
> inode_lock and faults on the buffer, so the fault cannot take inode_lock
> again.

Do you mean writing from one part of the file to the other part of the
file so the "from" buffer used by copy_from_user() is part of the
mmaped region?

Another possible deadlock issue by using inode_lock in page faults is
mmap_lock is acquired before inode_lock, but write may acquire
inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
not cause real deadlock since mmap_lock is not exclusive for page
faults. But such pattern should be avoided IMHO.

>
> That said... I don't think memfd_secret files /can/ be written to?  Hard
> to say, since I can't find a manpage describing what that syscall does.
>
> --D
>
> >
> > >
> > > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > ---
> > >
> > > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> > >
> > > Axel, I didn't add your Reviewed-by because v2 is quite different.
> > >
> > >  mm/secretmem.c | 21 ++++++++++++++++-----
> > >  1 file changed, 16 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > > index 206ed6b40c1d..a4fabf705e4f 100644
> > > --- a/mm/secretmem.c
> > > +++ b/mm/secretmem.c
> > > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > >         gfp_t gfp = vmf->gfp_mask;
> > >         unsigned long addr;
> > >         struct page *page;
> > > +       vm_fault_t ret;
> > >         int err;
> > >
> > >         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > >                 return vmf_error(-EINVAL);
> > >
> > > +       inode_lock_shared(inode);
> > > +
> > >  retry:
> > >         page = find_lock_page(mapping, offset);
> > >         if (!page) {
> > >                 page = alloc_page(gfp | __GFP_ZERO);
> > > -               if (!page)
> > > -                       return VM_FAULT_OOM;
> > > +               if (!page) {
> > > +                       ret = VM_FAULT_OOM;
> > > +                       goto out;
> > > +               }
> > >
> > >                 err = set_direct_map_invalid_noflush(page);
> > >                 if (err) {
> > >                         put_page(page);
> > > -                       return vmf_error(err);
> > > +                       ret = vmf_error(err);
> > > +                       goto out;
> > >                 }
> > >
> > >                 __SetPageUptodate(page);
> > > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > >                         if (err == -EEXIST)
> > >                                 goto retry;
> > >
> > > -                       return vmf_error(err);
> > > +                       ret = vmf_error(err);
> > > +                       goto out;
> > >                 }
> > >
> > >                 addr = (unsigned long)page_address(page);
> > > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > >         }
> > >
> > >         vmf->page = page;
> > > -       return VM_FAULT_LOCKED;
> > > +       ret = VM_FAULT_LOCKED;
> > > +
> > > +out:
> > > +       inode_unlock_shared(inode);
> > > +       return ret;
> > >  }
> > >
> > >  static const struct vm_operations_struct secretmem_vm_ops = {
> > >
> > > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > > --
> > > 2.34.1
> > >
> > >
Mike Rapoport July 8, 2022, 8:28 a.m. UTC | #5
On Thu, Jul 07, 2022 at 03:09:32PM -0700, Yang Shi wrote:
> On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > Eric Biggers suggested that this happens when
> > > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > > from the direct map) is zeroed by inode truncation right afterwards.
> > > >
> > > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > > secretmem_fault() prevents the race.
> > >
> > > Should invalidate_lock be used to serialize between page fault and truncate?
> >
> > I would have thought so, given Documentation/filesystems/locking.rst:
> >
> > "->fault() is called when a previously not present pte is about to be
> > faulted in. The filesystem must find and return the page associated with
> > the passed in "pgoff" in the vm_fault structure. If it is possible that
> > the page may be truncated and/or invalidated, then the filesystem must
> > lock invalidate_lock, then ensure the page is not already truncated
> > (invalidate_lock will block subsequent truncate), and then return with
> > VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
> >
> > IIRC page faults aren't supposed to take i_rwsem because the fault could
> > be in response to someone mmaping a file into memory and then write()ing
> > to the same file using the mmapped region.  The write() takes
> > inode_lock and faults on the buffer, so the fault cannot take inode_lock
> > again.
> 
> Do you mean writing from one part of the file to the other part of the
> file so the "from" buffer used by copy_from_user() is part of the
> mmaped region?
> 
> Another possible deadlock issue by using inode_lock in page faults is
> mmap_lock is acquired before inode_lock, but write may acquire
> inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
> not cause real deadlock since mmap_lock is not exclusive for page
> faults. But such pattern should be avoided IMHO.
>
> > That said... I don't think memfd_secret files /can/ be written to?

memfd_secret files cannot be written to, they can only be mmap()ed.
Synchronization is only required between
do_truncate()->...->simple_setatt() and secretmem->fault() and I don't see
how that can deadlock.

I'm not an fs expert though, so if you think that invalidate_lock() is
safer, I don't mind s/inode_lock/invalidate_lock/ in the patch.

> > Hard to say, since I can't find a manpage describing what that syscall
> > does.

Right, I don't see it's published :-/

There is a groff version:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/memfd_secret.2

> > --D
> >
> > >
> > > >
> > > > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > > > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > > ---
> > > >
> > > > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> > > >
> > > > Axel, I didn't add your Reviewed-by because v2 is quite different.
> > > >
> > > >  mm/secretmem.c | 21 ++++++++++++++++-----
> > > >  1 file changed, 16 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > > > index 206ed6b40c1d..a4fabf705e4f 100644
> > > > --- a/mm/secretmem.c
> > > > +++ b/mm/secretmem.c
> > > > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > >         gfp_t gfp = vmf->gfp_mask;
> > > >         unsigned long addr;
> > > >         struct page *page;
> > > > +       vm_fault_t ret;
> > > >         int err;
> > > >
> > > >         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > > >                 return vmf_error(-EINVAL);
> > > >
> > > > +       inode_lock_shared(inode);
> > > > +
> > > >  retry:
> > > >         page = find_lock_page(mapping, offset);
> > > >         if (!page) {
> > > >                 page = alloc_page(gfp | __GFP_ZERO);
> > > > -               if (!page)
> > > > -                       return VM_FAULT_OOM;
> > > > +               if (!page) {
> > > > +                       ret = VM_FAULT_OOM;
> > > > +                       goto out;
> > > > +               }
> > > >
> > > >                 err = set_direct_map_invalid_noflush(page);
> > > >                 if (err) {
> > > >                         put_page(page);
> > > > -                       return vmf_error(err);
> > > > +                       ret = vmf_error(err);
> > > > +                       goto out;
> > > >                 }
> > > >
> > > >                 __SetPageUptodate(page);
> > > > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > >                         if (err == -EEXIST)
> > > >                                 goto retry;
> > > >
> > > > -                       return vmf_error(err);
> > > > +                       ret = vmf_error(err);
> > > > +                       goto out;
> > > >                 }
> > > >
> > > >                 addr = (unsigned long)page_address(page);
> > > > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > >         }
> > > >
> > > >         vmf->page = page;
> > > > -       return VM_FAULT_LOCKED;
> > > > +       ret = VM_FAULT_LOCKED;
> > > > +
> > > > +out:
> > > > +       inode_unlock_shared(inode);
> > > > +       return ret;
> > > >  }
> > > >
> > > >  static const struct vm_operations_struct secretmem_vm_ops = {
> > > >
> > > > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > > > --
> > > > 2.34.1
> > > >
> > > >
Yang Shi July 12, 2022, 5:40 p.m. UTC | #6
On Fri, Jul 8, 2022 at 1:29 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, Jul 07, 2022 at 03:09:32PM -0700, Yang Shi wrote:
> > On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > > > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > >
> > > > > Eric Biggers suggested that this happens when
> > > > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > > > from the direct map) is zeroed by inode truncation right afterwards.
> > > > >
> > > > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > > > secretmem_fault() prevents the race.
> > > >
> > > > Should invalidate_lock be used to serialize between page fault and truncate?
> > >
> > > I would have thought so, given Documentation/filesystems/locking.rst:
> > >
> > > "->fault() is called when a previously not present pte is about to be
> > > faulted in. The filesystem must find and return the page associated with
> > > the passed in "pgoff" in the vm_fault structure. If it is possible that
> > > the page may be truncated and/or invalidated, then the filesystem must
> > > lock invalidate_lock, then ensure the page is not already truncated
> > > (invalidate_lock will block subsequent truncate), and then return with
> > > VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
> > >
> > > IIRC page faults aren't supposed to take i_rwsem because the fault could
> > > be in response to someone mmaping a file into memory and then write()ing
> > > to the same file using the mmapped region.  The write() takes
> > > inode_lock and faults on the buffer, so the fault cannot take inode_lock
> > > again.
> >
> > Do you mean writing from one part of the file to the other part of the
> > file so the "from" buffer used by copy_from_user() is part of the
> > mmaped region?
> >
> > Another possible deadlock issue by using inode_lock in page faults is
> > mmap_lock is acquired before inode_lock, but write may acquire
> > inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
> > not cause real deadlock since mmap_lock is not exclusive for page
> > faults. But such pattern should be avoided IMHO.
> >
> > > That said... I don't think memfd_secret files /can/ be written to?
>
> memfd_secret files cannot be written to, they can only be mmap()ed.
> Synchronization is only required between
> do_truncate()->...->simple_setatt() and secretmem->fault() and I don't see
> how that can deadlock.

Sure, there is no deadlock.

>
> I'm not an fs expert though, so if you think that invalidate_lock() is
> safer, I don't mind s/inode_lock/invalidate_lock/ in the patch.

IIUC invalidate_lock should be preferred per the filesystem's locking
document. And I found Jan Kara's email of the invalidate_lock
patchset, please refer to
https://lore.kernel.org/linux-mm/20210715133202.5975-1-jack@suse.cz/.

>
> > > Hard to say, since I can't find a manpage describing what that syscall
> > > does.
>
> Right, I don't see it's published :-/
>
> There is a groff version:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/memfd_secret.2
>
> > > --D
> > >
> > > >
> > > > >
> > > > > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > > > > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > > > ---
> > > > >
> > > > > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> > > > >
> > > > > Axel, I didn't add your Reviewed-by because v2 is quite different.
> > > > >
> > > > >  mm/secretmem.c | 21 ++++++++++++++++-----
> > > > >  1 file changed, 16 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > > > > index 206ed6b40c1d..a4fabf705e4f 100644
> > > > > --- a/mm/secretmem.c
> > > > > +++ b/mm/secretmem.c
> > > > > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > >         gfp_t gfp = vmf->gfp_mask;
> > > > >         unsigned long addr;
> > > > >         struct page *page;
> > > > > +       vm_fault_t ret;
> > > > >         int err;
> > > > >
> > > > >         if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > > > >                 return vmf_error(-EINVAL);
> > > > >
> > > > > +       inode_lock_shared(inode);
> > > > > +
> > > > >  retry:
> > > > >         page = find_lock_page(mapping, offset);
> > > > >         if (!page) {
> > > > >                 page = alloc_page(gfp | __GFP_ZERO);
> > > > > -               if (!page)
> > > > > -                       return VM_FAULT_OOM;
> > > > > +               if (!page) {
> > > > > +                       ret = VM_FAULT_OOM;
> > > > > +                       goto out;
> > > > > +               }
> > > > >
> > > > >                 err = set_direct_map_invalid_noflush(page);
> > > > >                 if (err) {
> > > > >                         put_page(page);
> > > > > -                       return vmf_error(err);
> > > > > +                       ret = vmf_error(err);
> > > > > +                       goto out;
> > > > >                 }
> > > > >
> > > > >                 __SetPageUptodate(page);
> > > > > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > >                         if (err == -EEXIST)
> > > > >                                 goto retry;
> > > > >
> > > > > -                       return vmf_error(err);
> > > > > +                       ret = vmf_error(err);
> > > > > +                       goto out;
> > > > >                 }
> > > > >
> > > > >                 addr = (unsigned long)page_address(page);
> > > > > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > >         }
> > > > >
> > > > >         vmf->page = page;
> > > > > -       return VM_FAULT_LOCKED;
> > > > > +       ret = VM_FAULT_LOCKED;
> > > > > +
> > > > > +out:
> > > > > +       inode_unlock_shared(inode);
> > > > > +       return ret;
> > > > >  }
> > > > >
> > > > >  static const struct vm_operations_struct secretmem_vm_ops = {
> > > > >
> > > > > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > > > > --
> > > > > 2.34.1
> > > > >
> > > > >
>
> --
> Sincerely yours,
> Mike.
Jan Kara July 13, 2022, 1:27 p.m. UTC | #7
On Tue 12-07-22 10:40:11, Yang Shi wrote:
> On Fri, Jul 8, 2022 at 1:29 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Thu, Jul 07, 2022 at 03:09:32PM -0700, Yang Shi wrote:
> > > On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > > > > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > > >
> > > > > > Eric Biggers suggested that this happens when
> > > > > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > > > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > > > > from the direct map) is zeroed by inode truncation right afterwards.
> > > > > >
> > > > > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > > > > secretmem_fault() prevents the race.
> > > > >
> > > > > Should invalidate_lock be used to serialize between page fault and truncate?
> > > >
> > > > I would have thought so, given Documentation/filesystems/locking.rst:
> > > >
> > > > "->fault() is called when a previously not present pte is about to be
> > > > faulted in. The filesystem must find and return the page associated with
> > > > the passed in "pgoff" in the vm_fault structure. If it is possible that
> > > > the page may be truncated and/or invalidated, then the filesystem must
> > > > lock invalidate_lock, then ensure the page is not already truncated
> > > > (invalidate_lock will block subsequent truncate), and then return with
> > > > VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
> > > >
> > > > IIRC page faults aren't supposed to take i_rwsem because the fault could
> > > > be in response to someone mmaping a file into memory and then write()ing
> > > > to the same file using the mmapped region.  The write() takes
> > > > inode_lock and faults on the buffer, so the fault cannot take inode_lock
> > > > again.
> > >
> > > Do you mean writing from one part of the file to the other part of the
> > > file so the "from" buffer used by copy_from_user() is part of the
> > > mmaped region?
> > >
> > > Another possible deadlock issue by using inode_lock in page faults is
> > > mmap_lock is acquired before inode_lock, but write may acquire
> > > inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
> > > not cause real deadlock since mmap_lock is not exclusive for page
> > > faults. But such pattern should be avoided IMHO.
> > >
> > > > That said... I don't think memfd_secret files /can/ be written to?
> >
> > memfd_secret files cannot be written to, they can only be mmap()ed.
> > Synchronization is only required between
> > do_truncate()->...->simple_setatt() and secretmem->fault() and I don't see
> > how that can deadlock.
> 
> Sure, there is no deadlock.
> 
> >
> > I'm not an fs expert though, so if you think that invalidate_lock() is
> > safer, I don't mind s/inode_lock/invalidate_lock/ in the patch.
> 
> IIUC invalidate_lock should be preferred per the filesystem's locking
> document. And I found Jan Kara's email of the invalidate_lock
> patchset, please refer to
> https://lore.kernel.org/linux-mm/20210715133202.5975-1-jack@suse.cz/.

Yeah, so using invalidate_lock for such synchronization would be certainly
more standard than using inode_lock. Although I agree that for filesystems
that do not support read(2) and write(2) there does not seem to be an
immediate risk of a deadlock when inode_lock is used inside a page fault.

								Honza
diff mbox series

Patch

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 206ed6b40c1d..a4fabf705e4f 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -55,22 +55,28 @@  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 	gfp_t gfp = vmf->gfp_mask;
 	unsigned long addr;
 	struct page *page;
+	vm_fault_t ret;
 	int err;
 
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return vmf_error(-EINVAL);
 
+	inode_lock_shared(inode);
+
 retry:
 	page = find_lock_page(mapping, offset);
 	if (!page) {
 		page = alloc_page(gfp | __GFP_ZERO);
-		if (!page)
-			return VM_FAULT_OOM;
+		if (!page) {
+			ret = VM_FAULT_OOM;
+			goto out;
+		}
 
 		err = set_direct_map_invalid_noflush(page);
 		if (err) {
 			put_page(page);
-			return vmf_error(err);
+			ret = vmf_error(err);
+			goto out;
 		}
 
 		__SetPageUptodate(page);
@@ -86,7 +92,8 @@  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 			if (err == -EEXIST)
 				goto retry;
 
-			return vmf_error(err);
+			ret = vmf_error(err);
+			goto out;
 		}
 
 		addr = (unsigned long)page_address(page);
@@ -94,7 +101,11 @@  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 	}
 
 	vmf->page = page;
-	return VM_FAULT_LOCKED;
+	ret = VM_FAULT_LOCKED;
+
+out:
+	inode_unlock_shared(inode);
+	return ret;
 }
 
 static const struct vm_operations_struct secretmem_vm_ops = {