
md: fix a crash in mempool_free

Message ID: alpine.LRH.2.21.2211031121070.18305@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State: Superseded, archived
Series: md: fix a crash in mempool_free

Commit Message

Mikulas Patocka Nov. 3, 2022, 3:23 p.m. UTC
There's a crash in mempool_free when running the lvm test
shell/lvchange-rebuild-raid.sh.

The reason for the crash is this:
* super_written calls atomic_dec_and_test(&mddev->pending_writes) and
  wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
  and bio_put(bio).
* so, the process that was waiting on sb_wait and has just been woken up is
  racing with bio_put(bio).
* if the process wins the race, it calls bioset_exit before bio_put(bio)
  is executed.
* bio_put(bio) attempts to free a bio into a destroyed bio set - causing
  a crash in mempool_free.

We fix this bug by moving bio_put before atomic_dec_and_test.

The function md_end_flush has a similar bug - we must call bio_put before
we decrement the number of in-progress bios.

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 11557f0067 P4D 11557f0067 PUD 0
 Oops: 0002 [#1] PREEMPT SMP
 CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: kdelayd flush_expired_bios [dm_delay]
 RIP: 0010:mempool_free+0x47/0x80
 Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
 RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
 RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
 RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
 R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
 R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
 FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
 Call Trace:
  <TASK>
  clone_endio+0xf4/0x1c0 [dm_mod]
  clone_endio+0xf4/0x1c0 [dm_mod]
  __submit_bio+0x76/0x120
  submit_bio_noacct_nocheck+0xb6/0x2a0
  flush_expired_bios+0x28/0x2f [dm_delay]
  process_one_work+0x1b4/0x300
  worker_thread+0x45/0x3e0
  ? rescuer_thread+0x380/0x380
  kthread+0xc2/0x100
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>
 Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
 CR2: 0000000000000000
 ---[ end trace 0000000000000000 ]---

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 drivers/md/md.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
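
Illustrated, the two sides of the race described above - a minimal sketch
(the waiter side is abbreviated, and the name of the bio set is an
assumption rather than a quote from md.c):

	/* Completion side: tail of super_written() before this patch */
	if (atomic_dec_and_test(&mddev->pending_writes))
		wake_up(&mddev->sb_wait);	/* waiter may resume from here on */
	rdev_dec_pending(rdev, mddev);
	bio_put(bio);				/* frees into mddev's bio set */

	/* Waiter side (abbreviated): once woken, nothing prevents it from
	 * destroying the bio set before the bio_put() above has run. */
	wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes) == 0);
	...
	bioset_exit(&mddev->sync_set);	/* a late bio_put() now frees into a
					 * destroyed bio set and crashes in
					 * mempool_free() */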

Comments

NeilBrown Nov. 4, 2022, 4:40 a.m. UTC | #1
On Fri, 04 Nov 2022, Mikulas Patocka wrote:
> There's a crash in mempool_free when running the lvm test
> shell/lvchange-rebuild-raid.sh.
> 
> The reason for the crash is this:
> * super_written calls atomic_dec_and_test(&mddev->pending_writes) and
>   wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
>   and bio_put(bio).
> * so, the process that waited on sb_wait and that is woken up is racing
>   with bio_put(bio).
> * if the process wins the race, it calls bioset_exit before bio_put(bio)
>   is executed.
> * bio_put(bio) attempts to free a bio into a destroyed bio set - causing
>   a crash in mempool_free.
> 
> We fix this bug by moving bio_put before atomic_dec_and_test.
> 
> The function md_end_flush has a similar bug - we must call bio_put before
> we decrement the number of in-progress bios.
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000000
>  #PF: supervisor write access in kernel mode
>  #PF: error_code(0x0002) - not-present page
>  PGD 11557f0067 P4D 11557f0067 PUD 0
>  Oops: 0002 [#1] PREEMPT SMP
>  CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>  Workqueue: kdelayd flush_expired_bios [dm_delay]
>  RIP: 0010:mempool_free+0x47/0x80
>  Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
>  RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
>  RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
>  RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
>  RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
>  R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
>  R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
>  FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
>  Call Trace:
>   <TASK>
>   clone_endio+0xf4/0x1c0 [dm_mod]
>   clone_endio+0xf4/0x1c0 [dm_mod]
>   __submit_bio+0x76/0x120
>   submit_bio_noacct_nocheck+0xb6/0x2a0
>   flush_expired_bios+0x28/0x2f [dm_delay]
>   process_one_work+0x1b4/0x300
>   worker_thread+0x45/0x3e0
>   ? rescuer_thread+0x380/0x380
>   kthread+0xc2/0x100
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x1f/0x30
>   </TASK>
>  Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
>  CR2: 0000000000000000
>  ---[ end trace 0000000000000000 ]---
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
> 
> ---
>  drivers/md/md.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/drivers/md/md.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/md.c	2022-11-03 15:29:02.000000000 +0100
> +++ linux-2.6/drivers/md/md.c	2022-11-03 15:33:17.000000000 +0100
> @@ -509,13 +509,14 @@ static void md_end_flush(struct bio *bio
>  	struct md_rdev *rdev = bio->bi_private;
>  	struct mddev *mddev = rdev->mddev;
>  
> +	bio_put(bio);
> +
>  	rdev_dec_pending(rdev, mddev);
>  
>  	if (atomic_dec_and_test(&mddev->flush_pending)) {
>  		/* The pre-request flush has finished */
>  		queue_work(md_wq, &mddev->flush_work);
>  	}
> -	bio_put(bio);
>  }
>  
>  static void md_submit_flush_data(struct work_struct *ws);
> @@ -913,10 +914,11 @@ static void super_written(struct bio *bi
>  	} else
>  		clear_bit(LastDev, &rdev->flags);
>  
> +	bio_put(bio);
> +
>  	if (atomic_dec_and_test(&mddev->pending_writes))
>  		wake_up(&mddev->sb_wait);
>  	rdev_dec_pending(rdev, mddev);
> -	bio_put(bio);
>  }

Thanks. I think this is a clear improvement.
I think it would be a little better if the rdev_dec_pending were also
moved up.
Then both code fragments would be:
  bio_put ; rdev_dec_pending ; atomic_dec_and_test
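
For super_written that would read, roughly (a sketch of the suggested
ordering, not a posted patch):

	bio_put(bio);				/* 1. release the bio first */
	rdev_dec_pending(rdev, mddev);		/* 2. then the rdev reference */
	if (atomic_dec_and_test(&mddev->pending_writes))
		wake_up(&mddev->sb_wait);	/* 3. wake the waiter last */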

Thanks,
NeilBrown
Paul Menzel Nov. 4, 2022, 5:52 a.m. UTC | #2
Dear Neil,


Just a heads-up, Mozilla Thunderbird 102.4.1 does not thread your message.

Your reply contains:

     In-reply-to: 
=?utf-8?q?=3Calpine=2ELRH=2E2=2E21=2E2211031121070=2E18305=40fi?=
  =?utf-8?q?le01=2Eintranet=2Eprod=2Eint=2Erdu2=2Eredhat=2Ecom=3E?=
     References: 
=?utf-8?q?=3Calpine=2ELRH=2E2=2E21=2E2211031121070=2E18305=40fil?=
  =?utf-8?q?e01=2Eintranet=2Eprod=2Eint=2Erdu2=2Eredhat=2Ecom=3E?=

Mikulas’ message has:

     Message-ID: 
<alpine.LRH.2.21.2211031121070.18305@file01.intranet.prod.int.rdu2.redhat.com>

It looks strange that the message ID, which does not seem to contain any 
non-ASCII characters, is "re-encoded". No idea if that violates any 
standards, though.


Kind regards,

Paul
Mikulas Patocka Nov. 4, 2022, 1:52 p.m. UTC | #3
On Fri, 4 Nov 2022, NeilBrown wrote:

> > ---
> >  drivers/md/md.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6/drivers/md/md.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/md/md.c	2022-11-03 15:29:02.000000000 +0100
> > +++ linux-2.6/drivers/md/md.c	2022-11-03 15:33:17.000000000 +0100
> > @@ -509,13 +509,14 @@ static void md_end_flush(struct bio *bio
> >  	struct md_rdev *rdev = bio->bi_private;
> >  	struct mddev *mddev = rdev->mddev;
> >  
> > +	bio_put(bio);
> > +
> >  	rdev_dec_pending(rdev, mddev);
> >  
> >  	if (atomic_dec_and_test(&mddev->flush_pending)) {
> >  		/* The pre-request flush has finished */
> >  		queue_work(md_wq, &mddev->flush_work);
> >  	}
> > -	bio_put(bio);
> >  }
> >  
> >  static void md_submit_flush_data(struct work_struct *ws);
> > @@ -913,10 +914,11 @@ static void super_written(struct bio *bi
> >  	} else
> >  		clear_bit(LastDev, &rdev->flags);
> >  
> > +	bio_put(bio);
> > +
> >  	if (atomic_dec_and_test(&mddev->pending_writes))
> >  		wake_up(&mddev->sb_wait);
> >  	rdev_dec_pending(rdev, mddev);
> > -	bio_put(bio);
> >  }
> 
> Thanks. I think this is a clear improvement.
> I think it would be a little better if the rdev_dec_pending were also
> move up.
> Then both code fragments would be:
>   bio_put ; rdev_dec_pending ; atomic_dec_and_test
> 
> Thanks,
> NeilBrown

Yes, I'll send a second patch that moves rdev_dec_pending up too.

BTW, even this is theoretically incorrect:

> >     if (atomic_dec_and_test(&mddev->pending_writes))
> >             wake_up(&mddev->sb_wait);

Suppose that you execute atomic_dec_and_test, then there's a context 
switch to a process that destroys the md device, and then a context switch 
back - now "wake_up(&mddev->sb_wait)" is called on freed memory.

I think that we should use wait_var_event/wake_up_var instead of sb_wait. 
That will use preallocated hashed wait queues.
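
Roughly, that would be (a sketch of the idea; wake_up_var() only hashes the
address of the variable into a preallocated wait queue, so the waker does
not touch mddev's memory after the final decrement):

	/* completion side */
	if (atomic_dec_and_test(&mddev->pending_writes))
		wake_up_var(&mddev->pending_writes);

	/* waiter side, instead of waiting on mddev->sb_wait */
	wait_var_event(&mddev->pending_writes,
		       atomic_read(&mddev->pending_writes) == 0);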

Mikulas
NeilBrown Nov. 5, 2022, 10:34 p.m. UTC | #4
On Sat, 05 Nov 2022, Mikulas Patocka wrote:
> 
> On Fri, 4 Nov 2022, NeilBrown wrote:
> 
> > > ---
> > >  drivers/md/md.c |    6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > Index: linux-2.6/drivers/md/md.c
> > > ===================================================================
> > > --- linux-2.6.orig/drivers/md/md.c	2022-11-03 15:29:02.000000000 +0100
> > > +++ linux-2.6/drivers/md/md.c	2022-11-03 15:33:17.000000000 +0100
> > > @@ -509,13 +509,14 @@ static void md_end_flush(struct bio *bio
> > >  	struct md_rdev *rdev = bio->bi_private;
> > >  	struct mddev *mddev = rdev->mddev;
> > >  
> > > +	bio_put(bio);
> > > +
> > >  	rdev_dec_pending(rdev, mddev);
> > >  
> > >  	if (atomic_dec_and_test(&mddev->flush_pending)) {
> > >  		/* The pre-request flush has finished */
> > >  		queue_work(md_wq, &mddev->flush_work);
> > >  	}
> > > -	bio_put(bio);
> > >  }
> > >  
> > >  static void md_submit_flush_data(struct work_struct *ws);
> > > @@ -913,10 +914,11 @@ static void super_written(struct bio *bi
> > >  	} else
> > >  		clear_bit(LastDev, &rdev->flags);
> > >  
> > > +	bio_put(bio);
> > > +
> > >  	if (atomic_dec_and_test(&mddev->pending_writes))
> > >  		wake_up(&mddev->sb_wait);
> > >  	rdev_dec_pending(rdev, mddev);
> > > -	bio_put(bio);
> > >  }
> > 
> > Thanks. I think this is a clear improvement.
> > I think it would be a little better if the rdev_dec_pending were also
> > move up.
> > Then both code fragments would be:
> >   bio_put ; rdev_dec_pending ; atomic_dec_and_test
> > 
> > Thanks,
> > NeilBrown
> 
> Yes, I'll send a second patch that moves rdev_dec_pending up too.
> 
> BTW. even this is theoretically incorrect:
> 
> > >     if (atomic_dec_and_test(&mddev->pending_writes))
> > >             wake_up(&mddev->sb_wait);
> 
> Suppose that you execute atomic_dec_and_test and then there's a context 
> switch to a process that destroys the md device and then there's a context 
> switch back and you call "wake_up(&mddev->sb_wait)" on freed memory.
> 
> I think that we should use wait_var_event/wake_up_var instead of sb_wait. 
> That will use preallocated hashed wait queues.
> 

I agree there is a potential problem.  Using wait_var_event is an
approach that could work.
An alternative would be to change that code to

  if (atomic_dec_and_lock(&mddev->pending_writes, &mddev->lock)) {
           wake_up(&mddev->sb_wait);
           spin_unlock(&mddev->lock);
  }

As __md_stop() takes mddev->lock, it would not be able to get to the
'free' until after the lock was dropped.
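
The teardown ordering being relied on, as a sketch (the body of __md_stop()
is abbreviated to the locking that matters here):

	/* __md_stop() takes mddev->lock before any freeing, so it cannot
	 * get past this point while the waker above still holds the lock. */
	spin_lock(&mddev->lock);
	...
	spin_unlock(&mddev->lock);
	...
	bioset_exit(&mddev->bio_set);	/* teardown happens only after the
					 * lock was taken and released */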

Thanks,
NeilBrown

Patch

Index: linux-2.6/drivers/md/md.c
===================================================================
--- linux-2.6.orig/drivers/md/md.c	2022-11-03 15:29:02.000000000 +0100
+++ linux-2.6/drivers/md/md.c	2022-11-03 15:33:17.000000000 +0100
@@ -509,13 +509,14 @@  static void md_end_flush(struct bio *bio
 	struct md_rdev *rdev = bio->bi_private;
 	struct mddev *mddev = rdev->mddev;
 
+	bio_put(bio);
+
 	rdev_dec_pending(rdev, mddev);
 
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
 		/* The pre-request flush has finished */
 		queue_work(md_wq, &mddev->flush_work);
 	}
-	bio_put(bio);
 }
 
 static void md_submit_flush_data(struct work_struct *ws);
@@ -913,10 +914,11 @@  static void super_written(struct bio *bi
 	} else
 		clear_bit(LastDev, &rdev->flags);
 
+	bio_put(bio);
+
 	if (atomic_dec_and_test(&mddev->pending_writes))
 		wake_up(&mddev->sb_wait);
 	rdev_dec_pending(rdev, mddev);
-	bio_put(bio);
 }
 
 void md_super_write(struct mddev *mddev, struct md_rdev *rdev,