
[v4] mm: Fix possible NULL pointer dereference in __swap_duplicate

Message ID e223b0e6ba2f4924984b1917cc717bd5@honor.com (mailing list archive)
State New
Series [v4] mm: Fix possible NULL pointer dereference in __swap_duplicate

Commit Message

gaoxu Feb. 19, 2025, 1:56 a.m. UTC
Add a NULL check on the return value of swp_swap_info in __swap_duplicate
to prevent crashes caused by NULL pointer dereference.

The reason why swp_swap_info() returns NULL is unclear; it may be due to
CPU cache issues or DDR bit flips. The probability of this issue is very
small, and the stack info we encountered is as follows:
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000058
[RB/E]rb_sreason_str_set: sreason_str set null_pointer
Mem abort info:
  ESR = 0x0000000096000005
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x05: level 1 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 39-bit VAs, pgdp=00000008a80e5000
[0000000000000058] pgd=0000000000000000, p4d=0000000000000000,
pud=0000000000000000
Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
Skip md ftrace buffer dump for: 0x1609e0
...
pc : swap_duplicate+0x44/0x164
lr : copy_page_range+0x508/0x1e78
sp : ffffffc0f2a699e0
x29: ffffffc0f2a699e0 x28: ffffff8a5b28d388 x27: ffffff8b06603388
x26: ffffffdf7291fe70 x25: 0000000000000006 x24: 0000000000100073
x23: 00000000002d2d2f x22: 0000000000000008 x21: 0000000000000000
x20: 00000000002d2d2f x19: 18000000002d2d2f x18: ffffffdf726faec0
x17: 0000000000000000 x16: 0010000000000001 x15: 0040000000000001
x14: 0400000000000001 x13: ff7ffffffffffb7f x12: ffeffffffffffbff
x11: ffffff8a5c7e1898 x10: 0000000000000018 x9 : 0000000000000006
x8 : 1800000000000000 x7 : 0000000000000000 x6 : ffffff8057c01f10
x5 : 000000000000a318 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000006daf200000 x1 : 0000000000000001 x0 : 18000000002d2d2f
Call trace:
 swap_duplicate+0x44/0x164
 copy_page_range+0x508/0x1e78
 copy_process+0x1278/0x21cc
 kernel_clone+0x90/0x438
 __arm64_sys_clone+0x5c/0x8c
 invoke_syscall+0x58/0x110
 do_el0_svc+0x8c/0xe0
 el0_svc+0x38/0x9c
 el0t_64_sync_handler+0x44/0xec
 el0t_64_sync+0x1a8/0x1ac
Code: 9139c35a 71006f3f 54000568 f8797b55 (f9402ea8)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs

The patch only provides a workaround, but there is no more effective
software solution for handling the bit-flip problem. This patch changes
the issue from a system crash to a process exception, thereby reducing
the impact on the entire machine.

Signed-off-by: gao xu <gaoxu2@honor.com>
---
v1 -> v2:
- Add WARN_ON_ONCE.
- Update the commit info.
v2 -> v3: Delete the review tags (This is my issue, and I apologize).
v3 -> v4: Add swap entry logging per Barry Song's suggestion.
---
 mm/swapfile.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Barry Song Feb. 19, 2025, 2:28 a.m. UTC | #1
On Wed, Feb 19, 2025 at 2:56 PM gaoxu <gaoxu2@honor.com> wrote:
>
> Add a NULL check on the return value of swp_swap_info in __swap_duplicate
> to prevent crashes caused by NULL pointer dereference.
>
> The reason why swp_swap_info() returns NULL is unclear; it may be due to
> CPU cache issues or DDR bit flips. The probability of this issue is very
> small, and the stack info we encountered is as follows:
> Unable to handle kernel NULL pointer dereference at virtual address
> 0000000000000058
> [RB/E]rb_sreason_str_set: sreason_str set null_pointer
> Mem abort info:
>   ESR = 0x0000000096000005
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
>   FSC = 0x05: level 1 translation fault
> Data abort info:
>   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> user pgtable: 4k pages, 39-bit VAs, pgdp=00000008a80e5000
> [0000000000000058] pgd=0000000000000000, p4d=0000000000000000,
> pud=0000000000000000
> Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
> Skip md ftrace buffer dump for: 0x1609e0
> ...
> pc : swap_duplicate+0x44/0x164
> lr : copy_page_range+0x508/0x1e78
> sp : ffffffc0f2a699e0
> x29: ffffffc0f2a699e0 x28: ffffff8a5b28d388 x27: ffffff8b06603388
> x26: ffffffdf7291fe70 x25: 0000000000000006 x24: 0000000000100073
> x23: 00000000002d2d2f x22: 0000000000000008 x21: 0000000000000000
> x20: 00000000002d2d2f x19: 18000000002d2d2f x18: ffffffdf726faec0
> x17: 0000000000000000 x16: 0010000000000001 x15: 0040000000000001
> x14: 0400000000000001 x13: ff7ffffffffffb7f x12: ffeffffffffffbff
> x11: ffffff8a5c7e1898 x10: 0000000000000018 x9 : 0000000000000006
> x8 : 1800000000000000 x7 : 0000000000000000 x6 : ffffff8057c01f10
> x5 : 000000000000a318 x4 : 0000000000000000 x3 : 0000000000000000
> x2 : 0000006daf200000 x1 : 0000000000000001 x0 : 18000000002d2d2f
> Call trace:
>  swap_duplicate+0x44/0x164
>  copy_page_range+0x508/0x1e78
>  copy_process+0x1278/0x21cc
>  kernel_clone+0x90/0x438
>  __arm64_sys_clone+0x5c/0x8c
>  invoke_syscall+0x58/0x110
>  do_el0_svc+0x8c/0xe0
>  el0_svc+0x38/0x9c
>  el0t_64_sync_handler+0x44/0xec
>  el0t_64_sync+0x1a8/0x1ac
> Code: 9139c35a 71006f3f 54000568 f8797b55 (f9402ea8)
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Oops: Fatal exception
> SMP: stopping secondary CPUs
>
> The patch seems to only provide a workaround, but there are no more
> effective software solutions to handle the bit flips problem. This patch
> will change the issue from a system crash to a process exception, thereby
> reducing the impact on the entire machine.
>
> Signed-off-by: gao xu <gaoxu2@honor.com>

Regardless of whether the above statement is 100% accurate or whether
a bit-flip actually exists, providing this check still seems useful,
at least for defensive programming.

Reviewed-by: Barry Song <baohua@kernel.org>

> ---
> v1 -> v2:
> - Add WARN_ON_ONCE.
> - update the commit info.
> v2 -> v3: Delete the review tags (This is my issue, and I apologize).
> V3 -> v4: Add swap entry logging per Barry Song's suggestion.
> ---
>  mm/swapfile.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 7448a3876..403df1817 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3521,6 +3521,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>         int err, i;
>
>         si = swp_swap_info(entry);
> +       if (WARN_ON_ONCE(!si)) {
> +               pr_err("%s%08lx\n", Bad_file, entry.val);
> +               return -EINVAL;
> +       }
>
>         offset = swp_offset(entry);
>         VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> --
> 2.17.1

Andrew Morton Feb. 19, 2025, 4:03 a.m. UTC | #2
On Wed, 19 Feb 2025 15:28:26 +1300 Barry Song <21cnbao@gmail.com> wrote:

> > The patch seems to only provide a workaround, but there are no more
> > effective software solutions to handle the bit flips problem. This patch
> > will change the issue from a system crash to a process exception, thereby
> > reducing the impact on the entire machine.
> >
> > Signed-off-by: gao xu <gaoxu2@honor.com>
> 
> Regardless of whether the above statement is 100% accurate or whether
> a bit-flip actually exists, providing this check still seems useful,
> at least for defensive programming.

I'm doubtful as well.

How often has this crash been observed?

gaoxu Feb. 19, 2025, 6:23 a.m. UTC | #3
> 
> On Wed, 19 Feb 2025 15:28:26 +1300 Barry Song <21cnbao@gmail.com>
> wrote:
> 
> > > The patch seems to only provide a workaround, but there are no more
> > > effective software solutions to handle the bit flips problem. This
> > > > patch will change the issue from a system crash to a process
> > > exception, thereby reducing the impact on the entire machine.
> > >
> > > Signed-off-by: gao xu <gaoxu2@honor.com>
> >
> > Regardless of whether the above statement is 100% accurate or whether
> > a bit-flip actually exists, providing this check still seems useful,
> > at least for defensive programming.
> 
> I'm doubtful as well.
> 
> How often has this crash been observed?

The probability of this issue occurring is approximately 1 in 500,000 per week.

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7448a3876..403df1817 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3521,6 +3521,10 @@  static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	int err, i;
 
 	si = swp_swap_info(entry);
+	if (WARN_ON_ONCE(!si)) {
+		pr_err("%s%08lx\n", Bad_file, entry.val);
+		return -EINVAL;
+	}
 
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);