diff mbox

[v2] blk-mq: Fix race between resetting the timer and completion handling

Message ID 20180207200724.GD695913@devbig577.frc2.facebook.com (mailing list archive)
State New, archived
Headers show

Commit Message

Tejun Heo Feb. 7, 2018, 8:07 p.m. UTC
Hello, Bart.

On Wed, Feb 07, 2018 at 06:14:13PM +0000, Bart Van Assche wrote:
> When I wrote my comment I was not sure whether or not non-reentrancy is
> guaranteed for work queue items. However, according to what I found in the
> workqueue implementation I think that is guaranteed. So it shouldn't be
> possible that the timer activated by blk_add_timer() gets handled before
> aborted_gstate is reset. But since the timeout handler and completion

Yeah, we're basically single threaded in the timeout path.

> handlers can be executed by different CPUs, shouldn't a memory barrier be
> inserted between the blk_add_timer() call and resetting aborted_gsync to
> guarantee that a completion cannot occur before blk_add_timer() has reset
> RQF_MQ_TIMEOUT_EXPIRED?

Ah, you're right.  u64_stat_sync doesn't imply barriers, so we want
something like the following.

Comments

Bart Van Assche Feb. 7, 2018, 11:48 p.m. UTC | #1
On Wed, 2018-02-07 at 12:07 -0800, tj@kernel.org wrote:
> Ah, you're right.  u64_stat_sync doesn't imply barriers, so we want

> something like the following.

> 

> diff --git a/block/blk-mq.c b/block/blk-mq.c

> index df93102..d6edf3b 100644

> --- a/block/blk-mq.c

> +++ b/block/blk-mq.c

> @@ -593,7 +593,7 @@ static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)

>  	 */

>  	local_irq_save(flags);

>  	u64_stats_update_begin(&rq->aborted_gstate_sync);

> -	rq->aborted_gstate = gstate;

> +	smp_store_release(&rq->aborted_gstate, gstate);

>  	u64_stats_update_end(&rq->aborted_gstate_sync);

>  	local_irq_restore(flags);

>  }

> @@ -605,7 +605,7 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)

>  

>  	do {

>  		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);

> -		aborted_gstate = rq->aborted_gstate;

> +		aborted_gstate = smp_load_acquire(&rq->aborted_gstate);

>  	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));

>  

>  	return aborted_gstate;

> @@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)

>  		 * ->aborted_gstate is set, this may lead to ignored

>  		 * completions and further spurious timeouts.

>  		 */

> -		blk_mq_rq_update_aborted_gstate(req, 0);

>  		blk_add_timer(req);

> +		blk_mq_rq_update_aborted_gstate(req, 0);

>  		break;

>  	case BLK_EH_NOT_HANDLED:

>  		break;


Hello Tejun,

With this patch applied I see requests for which it seems like the timeout handler
did not get invoked:

sdc/hctx0/busy:0000000095e04b7c {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT|MQ_TIMEOUT_EXPIRED, .state=i
n_flight, .gstate=0xed/0xed, .tag=26, .internal_tag=-1, .cmd=Write(10) 2a 00 00  
00 60 ba 00 00 08 00, .retries=0, .result = 0x50000, .flags=TAGGED|INITIALIZED,  
.timeout=1.000, allocated 1093.180 s ago}

sdc/hctx0/busy:0000000065a64e9b {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT|MQ_TIMEOUT_EXPIRED, .state=i
n_flight, .gstate=0x5/0x5, .tag=27, .internal_tag=-1, .cmd=Write(10) 2a 00 00 00
 62 d2 00 00 08 00, .retries=0, .result = 0x50000, .flags=TAGGED|INITIALIZED, .t
imeout=1.000, allocated 1093.180 s ago}

[ ... ]

sdc/hctx3/busy:00000000479cc2a9 {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT, .state=in_flight, .gstate=0
x11/0x11, .tag=57, .internal_tag=-1, .cmd=Write(10) 2a 00 00 00 61 d2 00 00 08 0
0, .retries=0, .result = 0x0, .flags=TAGGED|INITIALIZED, .timeout=1.000, allocat
ed 1093.150 s ago}

sdc/hctx3/busy:000000008fd130d5 {.op=WRITE, .cmd_flags=FAILFAST_TRANSPORT|SYNC|N
OMERGE|IDLE, .rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT, .state=in_flight, .gstate=0
xd/0xd, .tag=61, .internal_tag=-1, .cmd=Write(10) 2a 00 00 00 c3 94 00 00 08 00,
 .retries=0, .result = 0x0, .flags=TAGGED|INITIALIZED, .timeout=1.000, allocated
 1093.140 s ago}

As one can see for some requests MQ_TIMEOUT_EXPIRED is set and .result = 0x50000.
The value of .result means that the SCSI error handler has submitted an abort (see
also scmnd->result = DID_ABORT << 16 in drivers/infiniband/ulp/srp/ib_srp.c). For
the last two requests shown above however MQ_TIMEOUT_EXPIRED is not set and the
SCSI result has value 0. I think that means that it can happen that a request
times out but that the timeout handler does not get invoked ...

Thanks,

Bart.
Bart Van Assche Feb. 8, 2018, 1:09 a.m. UTC | #2
On Wed, 2018-02-07 at 23:48 +0000, Bart Van Assche wrote:
> With this patch applied I see requests for which it seems like the timeout handler

> did not get invoked: [ ... ]


I just noticed the following in the system log, which is probably the reason why some
requests got stuck:

Feb  7 15:16:26 ubuntu-vm kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Feb  7 15:16:26 ubuntu-vm kernel: IP: scsi_times_out+0x17/0x2c0 [scsi_mod]
Feb  7 15:16:26 ubuntu-vm kernel: PGD 0 P4D 0  
Feb  7 15:16:26 ubuntu-vm kernel: Oops: 0000 [#1] PREEMPT SMP
Feb  7 15:16:26 ubuntu-vm kernel: Modules linked in: dm_service_time ib_srp libcrc32c crc32c_generic scsi_transport_srp target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod
rdma_rxe ip6_udp_tunnel udp_tunnel ib_umad ib_uverbs scsi_debug brd mq_deadline kyber_iosched deadline_iosched cfq_iosched bfq crct10dif_pclmul crc32_pclmul af_packet ghash_clmulni_intel pcbc
aesni_intel aes_x86_64 crypto_simd cryptd glue_helper serio_raw virtio_console virtio_balloon sg button i2c_piix4 dm_multipath dm_mod dax scsi_dh_rdac scsi_dh_emc scsi_dh_alua ib_iser rdma_cm iw_cm
ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables ipv6 autofs4 ext4 crc16 mbcache jbd2 sd_mod virtio_net virtio_blk virtio_scsi sr_mod cdrom ata_generic
pata_acpi psmouse crc32c_intel i2c_core atkbd
Feb  7 15:16:26 ubuntu-vm kernel: uhci_hcd ehci_hcd intel_agp ata_piix intel_gtt agpgart libata virtio_pci usbcore virtio_ring virtio scsi_mod usb_common unix
Feb  7 15:16:26 ubuntu-vm kernel: CPU: 0 PID: 146 Comm: kworker/0:1H Not tainted 4.15.0-dbg+ #1
Feb  7 15:16:26 ubuntu-vm kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Feb  7 15:16:26 ubuntu-vm kernel: Workqueue: kblockd blk_mq_timeout_work
Feb  7 15:16:26 ubuntu-vm kernel: RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
Feb  7 15:16:26 ubuntu-vm kernel: RSP: 0018:ffff98f0c02abd58 EFLAGS: 00010246
Feb  7 15:16:26 ubuntu-vm kernel: RAX: 0000000000000000 RBX: ffff965de2b3a2c0 RCX: 0000000000000000
Feb  7 15:16:26 ubuntu-vm kernel: RDX: ffffffffc0094740 RSI: 0000000000000000 RDI: ffff965de2b3a2c0
Feb  7 15:16:26 ubuntu-vm kernel: RBP: ffff965de2b3a438 R08: fffffffffffffffc R09: 0000000000000007
Feb  7 15:16:26 ubuntu-vm kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
Feb  7 15:16:26 ubuntu-vm kernel: R13: 0000000000000000 R14: ffff965de2a44218 R15: ffff965de2a48728
Feb  7 15:16:26 ubuntu-vm kernel: FS:  0000000000000000(0000) GS:ffff965dffc00000(0000) knlGS:0000000000000000
Feb  7 15:16:26 ubuntu-vm kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  7 15:16:26 ubuntu-vm kernel: CR2: 0000000000000000 CR3: 000000003ce0f003 CR4: 00000000003606f0
Feb  7 15:16:26 ubuntu-vm kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb  7 15:16:26 ubuntu-vm kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb  7 15:16:26 ubuntu-vm kernel: Call Trace:
Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_terminate_expired+0x42/0x80
Feb  7 15:16:26 ubuntu-vm kernel: bt_iter+0x3d/0x50
Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_queue_tag_busy_iter+0xe9/0x200
Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_timeout_work+0x181/0x2e0
Feb  7 15:16:26 ubuntu-vm kernel: process_one_work+0x21c/0x6d0
Feb  7 15:16:26 ubuntu-vm kernel: worker_thread+0x35/0x380
Feb  7 15:16:26 ubuntu-vm kernel: kthread+0x117/0x130
Feb  7 15:16:26 ubuntu-vm kernel: ret_from_fork+0x24/0x30
Feb  7 15:16:26 ubuntu-vm kernel: Code: ff ff 0f ff e9 fd fe ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 8d af 78 01 00 00 53 48 8b 87 b0 01 00 00 48 89 fb <4c> 8b 28 0f 1f 44 00 00 65
8b 05 6a b5 f8 3f 83 f8 0f 0f 87 ed  
Feb  7 15:16:26 ubuntu-vm kernel: RIP: scsi_times_out+0x17/0x2c0 [scsi_mod] RSP: ffff98f0c02abd58
Feb  7 15:16:26 ubuntu-vm kernel: CR2: 0000000000000000
Feb  7 15:16:26 ubuntu-vm kernel: ---[ end trace ce6c20d95bab450e ]---

Bart.
Tejun Heo Feb. 8, 2018, 3:39 p.m. UTC | #3
On Thu, Feb 08, 2018 at 01:09:57AM +0000, Bart Van Assche wrote:
> On Wed, 2018-02-07 at 23:48 +0000, Bart Van Assche wrote:
> > With this patch applied I see requests for which it seems like the timeout handler
> > did not get invoked: [ ... ]
> 
> I just noticed the following in the system log, which is probably the reason why some
> requests got stuck:
> 
> Feb  7 15:16:26 ubuntu-vm kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
> Feb  7 15:16:26 ubuntu-vm kernel: IP: scsi_times_out+0x17/0x2c0 [scsi_mod]
> Feb  7 15:16:26 ubuntu-vm kernel: PGD 0 P4D 0  
> Feb  7 15:16:26 ubuntu-vm kernel: Oops: 0000 [#1] PREEMPT SMP
> Feb  7 15:16:26 ubuntu-vm kernel: Modules linked in: dm_service_time ib_srp libcrc32c crc32c_generic scsi_transport_srp target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod
> rdma_rxe ip6_udp_tunnel udp_tunnel ib_umad ib_uverbs scsi_debug brd mq_deadline kyber_iosched deadline_iosched cfq_iosched bfq crct10dif_pclmul crc32_pclmul af_packet ghash_clmulni_intel pcbc
> aesni_intel aes_x86_64 crypto_simd cryptd glue_helper serio_raw virtio_console virtio_balloon sg button i2c_piix4 dm_multipath dm_mod dax scsi_dh_rdac scsi_dh_emc scsi_dh_alua ib_iser rdma_cm iw_cm
> ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables ipv6 autofs4 ext4 crc16 mbcache jbd2 sd_mod virtio_net virtio_blk virtio_scsi sr_mod cdrom ata_generic
> pata_acpi psmouse crc32c_intel i2c_core atkbd
> Feb  7 15:16:26 ubuntu-vm kernel: uhci_hcd ehci_hcd intel_agp ata_piix intel_gtt agpgart libata virtio_pci usbcore virtio_ring virtio scsi_mod usb_common unix
> Feb  7 15:16:26 ubuntu-vm kernel: CPU: 0 PID: 146 Comm: kworker/0:1H Not tainted 4.15.0-dbg+ #1
> Feb  7 15:16:26 ubuntu-vm kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Feb  7 15:16:26 ubuntu-vm kernel: Workqueue: kblockd blk_mq_timeout_work
> Feb  7 15:16:26 ubuntu-vm kernel: RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
> Feb  7 15:16:26 ubuntu-vm kernel: RSP: 0018:ffff98f0c02abd58 EFLAGS: 00010246
> Feb  7 15:16:26 ubuntu-vm kernel: RAX: 0000000000000000 RBX: ffff965de2b3a2c0 RCX: 0000000000000000
> Feb  7 15:16:26 ubuntu-vm kernel: RDX: ffffffffc0094740 RSI: 0000000000000000 RDI: ffff965de2b3a2c0
> Feb  7 15:16:26 ubuntu-vm kernel: RBP: ffff965de2b3a438 R08: fffffffffffffffc R09: 0000000000000007
> Feb  7 15:16:26 ubuntu-vm kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> Feb  7 15:16:26 ubuntu-vm kernel: R13: 0000000000000000 R14: ffff965de2a44218 R15: ffff965de2a48728
> Feb  7 15:16:26 ubuntu-vm kernel: FS:  0000000000000000(0000) GS:ffff965dffc00000(0000) knlGS:0000000000000000
> Feb  7 15:16:26 ubuntu-vm kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb  7 15:16:26 ubuntu-vm kernel: CR2: 0000000000000000 CR3: 000000003ce0f003 CR4: 00000000003606f0
> Feb  7 15:16:26 ubuntu-vm kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Feb  7 15:16:26 ubuntu-vm kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Feb  7 15:16:26 ubuntu-vm kernel: Call Trace:
> Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_terminate_expired+0x42/0x80
> Feb  7 15:16:26 ubuntu-vm kernel: bt_iter+0x3d/0x50
> Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_queue_tag_busy_iter+0xe9/0x200
> Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_timeout_work+0x181/0x2e0
> Feb  7 15:16:26 ubuntu-vm kernel: process_one_work+0x21c/0x6d0
> Feb  7 15:16:26 ubuntu-vm kernel: worker_thread+0x35/0x380
> Feb  7 15:16:26 ubuntu-vm kernel: kthread+0x117/0x130
> Feb  7 15:16:26 ubuntu-vm kernel: ret_from_fork+0x24/0x30
> Feb  7 15:16:26 ubuntu-vm kernel: Code: ff ff 0f ff e9 fd fe ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 8d af 78 01 00 00 53 48 8b 87 b0 01 00 00 48 89 fb <4c> 8b 28 0f 1f 44 00 00 65
> 8b 05 6a b5 f8 3f 83 f8 0f 0f 87 ed  
> Feb  7 15:16:26 ubuntu-vm kernel: RIP: scsi_times_out+0x17/0x2c0 [scsi_mod] RSP: ffff98f0c02abd58
> Feb  7 15:16:26 ubuntu-vm kernel: CR2: 0000000000000000
> Feb  7 15:16:26 ubuntu-vm kernel: ---[ end trace ce6c20d95bab450e ]---

So, that's dereferencing %rax which is NULL.  That gotta be the ->host
deref in the following.

  enum blk_eh_timer_return scsi_times_out(struct request *req)
  {
	  struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
	  enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
	  struct Scsi_Host *host = scmd->device->host;
  ...

That sounds more like a scsi hotplug but than an issue in the timeout
code unless we messed up @req pointer to begin with.

Thanks.
Tejun Heo Feb. 8, 2018, 3:40 p.m. UTC | #4
On Thu, Feb 08, 2018 at 07:39:40AM -0800, tj@kernel.org wrote:
> That sounds more like a scsi hotplug but than an issue in the timeout
                                       ^bug
> code unless we messed up @req pointer to begin with.
Bart Van Assche Feb. 8, 2018, 4:31 p.m. UTC | #5
On Thu, 2018-02-08 at 07:39 -0800, tj@kernel.org wrote:
> On Thu, Feb 08, 2018 at 01:09:57AM +0000, Bart Van Assche wrote:

> > On Wed, 2018-02-07 at 23:48 +0000, Bart Van Assche wrote:

> > > With this patch applied I see requests for which it seems like the timeout handler

> > > did not get invoked: [ ... ]

> > 

> > I just noticed the following in the system log, which is probably the reason why some

> > requests got stuck:

> > 

> > Feb  7 15:16:26 ubuntu-vm kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)

> > Feb  7 15:16:26 ubuntu-vm kernel: IP: scsi_times_out+0x17/0x2c0 [scsi_mod]

> > Feb  7 15:16:26 ubuntu-vm kernel: PGD 0 P4D 0  

> > Feb  7 15:16:26 ubuntu-vm kernel: Oops: 0000 [#1] PREEMPT SMP

> > Feb  7 15:16:26 ubuntu-vm kernel: Modules linked in: dm_service_time ib_srp libcrc32c crc32c_generic scsi_transport_srp target_core_pscsi target_core_file ib_srpt target_core_iblock

> > target_core_mod

> > rdma_rxe ip6_udp_tunnel udp_tunnel ib_umad ib_uverbs scsi_debug brd mq_deadline kyber_iosched deadline_iosched cfq_iosched bfq crct10dif_pclmul crc32_pclmul af_packet ghash_clmulni_intel pcbc

> > aesni_intel aes_x86_64 crypto_simd cryptd glue_helper serio_raw virtio_console virtio_balloon sg button i2c_piix4 dm_multipath dm_mod dax scsi_dh_rdac scsi_dh_emc scsi_dh_alua ib_iser rdma_cm

> > iw_cm

> > ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables ipv6 autofs4 ext4 crc16 mbcache jbd2 sd_mod virtio_net virtio_blk virtio_scsi sr_mod cdrom

> > ata_generic

> > pata_acpi psmouse crc32c_intel i2c_core atkbd

> > Feb  7 15:16:26 ubuntu-vm kernel: uhci_hcd ehci_hcd intel_agp ata_piix intel_gtt agpgart libata virtio_pci usbcore virtio_ring virtio scsi_mod usb_common unix

> > Feb  7 15:16:26 ubuntu-vm kernel: CPU: 0 PID: 146 Comm: kworker/0:1H Not tainted 4.15.0-dbg+ #1

> > Feb  7 15:16:26 ubuntu-vm kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014

> > Feb  7 15:16:26 ubuntu-vm kernel: Workqueue: kblockd blk_mq_timeout_work

> > Feb  7 15:16:26 ubuntu-vm kernel: RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]

> > Feb  7 15:16:26 ubuntu-vm kernel: RSP: 0018:ffff98f0c02abd58 EFLAGS: 00010246

> > Feb  7 15:16:26 ubuntu-vm kernel: RAX: 0000000000000000 RBX: ffff965de2b3a2c0 RCX: 0000000000000000

> > Feb  7 15:16:26 ubuntu-vm kernel: RDX: ffffffffc0094740 RSI: 0000000000000000 RDI: ffff965de2b3a2c0

> > Feb  7 15:16:26 ubuntu-vm kernel: RBP: ffff965de2b3a438 R08: fffffffffffffffc R09: 0000000000000007

> > Feb  7 15:16:26 ubuntu-vm kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002

> > Feb  7 15:16:26 ubuntu-vm kernel: R13: 0000000000000000 R14: ffff965de2a44218 R15: ffff965de2a48728

> > Feb  7 15:16:26 ubuntu-vm kernel: FS:  0000000000000000(0000) GS:ffff965dffc00000(0000) knlGS:0000000000000000

> > Feb  7 15:16:26 ubuntu-vm kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

> > Feb  7 15:16:26 ubuntu-vm kernel: CR2: 0000000000000000 CR3: 000000003ce0f003 CR4: 00000000003606f0

> > Feb  7 15:16:26 ubuntu-vm kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

> > Feb  7 15:16:26 ubuntu-vm kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

> > Feb  7 15:16:26 ubuntu-vm kernel: Call Trace:

> > Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_terminate_expired+0x42/0x80

> > Feb  7 15:16:26 ubuntu-vm kernel: bt_iter+0x3d/0x50

> > Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_queue_tag_busy_iter+0xe9/0x200

> > Feb  7 15:16:26 ubuntu-vm kernel: blk_mq_timeout_work+0x181/0x2e0

> > Feb  7 15:16:26 ubuntu-vm kernel: process_one_work+0x21c/0x6d0

> > Feb  7 15:16:26 ubuntu-vm kernel: worker_thread+0x35/0x380

> > Feb  7 15:16:26 ubuntu-vm kernel: kthread+0x117/0x130

> > Feb  7 15:16:26 ubuntu-vm kernel: ret_from_fork+0x24/0x30

> > Feb  7 15:16:26 ubuntu-vm kernel: Code: ff ff 0f ff e9 fd fe ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 8d af 78 01 00 00 53 48 8b 87 b0 01 00 00 48 89 fb <4c> 8b 28 0f 1f 44 00 00

> > 65

> > 8b 05 6a b5 f8 3f 83 f8 0f 0f 87 ed  

> > Feb  7 15:16:26 ubuntu-vm kernel: RIP: scsi_times_out+0x17/0x2c0 [scsi_mod] RSP: ffff98f0c02abd58

> > Feb  7 15:16:26 ubuntu-vm kernel: CR2: 0000000000000000

> > Feb  7 15:16:26 ubuntu-vm kernel: ---[ end trace ce6c20d95bab450e ]---

> 

> So, that's dereferencing %rax which is NULL.  That gotta be the ->host

> deref in the following.

> 

>   enum blk_eh_timer_return scsi_times_out(struct request *req)

>   {

> 	  struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);

> 	  enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;

> 	  struct Scsi_Host *host = scmd->device->host;

>   ...

> 

> That sounds more like a scsi hotplug bug than an issue in the timeout

> code unless we messed up @req pointer to begin with.


I don't think that this is related to SCSI hotplugging: this crash does not
occur with the v4.15 block layer core and it does not occur with my timeout
handler rework patch applied either. I think that means that we cannot
exclude the block layer core timeout handler rework as a possible cause.

The disassembler output is as follows:

(gdb) disas /s scsi_times_out
Dump of assembler code for function scsi_times_out:
drivers/scsi/scsi_error.c:
282     {
   0x0000000000005bd0 <+0>:     push   %r13
   0x0000000000005bd2 <+2>:     push   %r12
   0x0000000000005bd4 <+4>:     push   %rbp
./include/linux/blk-mq.h:
300             return rq + 1;
   0x0000000000005bd5 <+5>:     lea    0x178(%rdi),%rbp
drivers/scsi/scsi_error.c:
282     {
   0x0000000000005bdc <+12>:    push   %rbx
283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285             struct Scsi_Host *host = scmd->device->host;
   0x0000000000005bdd <+13>:    mov    0x1b0(%rdi),%rax
282     {
   0x0000000000005be4 <+20>:    mov    %rdi,%rbx
283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285             struct Scsi_Host *host = scmd->device->host;
   0x0000000000005be7 <+23>:    mov    (%rax),%r13
   0x0000000000005bea <+26>:    nopl   0x0(%rax,%rax,1)
[ ... ]
(gdb) print /x sizeof(struct request)
$2 = 0x178
(gdb) print &(((struct scsi_cmnd*)0)->device)
$4 = (struct scsi_device **) 0x38 <scsi_cmd_get_serial+8>
(gdb) print &(((struct scsi_device*)0)->host)       
$5 = (struct Scsi_Host **) 0x0

The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
instruction at that address tries to dereference scsi_cmnd.device (%rax). The
register dump shows that that pointer has the value NULL. The only function I
know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
caller of that function in the SCSI core is scsi_initialize_rq(). That function
has two callers, namely scsi_init_command() and blk_get_request(). However,
the scsi_cmnd.device pointer is not cleared when a request finishes. This is
why I think that the above crash report indicates that scsi_times_out() was
called for a request that was being reinitialized and not by device hotplugging.

Thanks,

Bart.
Tejun Heo Feb. 8, 2018, 5 p.m. UTC | #6
Hello, Bart.

On Thu, Feb 08, 2018 at 04:31:43PM +0000, Bart Van Assche wrote:
> > That sounds more like a scsi hotplug bug than an issue in the timeout
> > code unless we messed up @req pointer to begin with.
> 
> I don't think that this is related to SCSI hotplugging: this crash does not
> occur with the v4.15 block layer core and it does not occur with my timeout
> handler rework patch applied either. I think that means that we cannot
> exclude the block layer core timeout handler rework as a possible cause.
> 
> The disassembler output is as follows:
> 
> (gdb) disas /s scsi_times_out
> Dump of assembler code for function scsi_times_out:
> drivers/scsi/scsi_error.c:
> 282     {
>    0x0000000000005bd0 <+0>:     push   %r13
>    0x0000000000005bd2 <+2>:     push   %r12
>    0x0000000000005bd4 <+4>:     push   %rbp
> ./include/linux/blk-mq.h:
> 300             return rq + 1;
>    0x0000000000005bd5 <+5>:     lea    0x178(%rdi),%rbp
> drivers/scsi/scsi_error.c:
> 282     {
>    0x0000000000005bdc <+12>:    push   %rbx
> 283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
> 284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 285             struct Scsi_Host *host = scmd->device->host;
>    0x0000000000005bdd <+13>:    mov    0x1b0(%rdi),%rax
> 282     {
>    0x0000000000005be4 <+20>:    mov    %rdi,%rbx
> 283             struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
> 284             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 285             struct Scsi_Host *host = scmd->device->host;
>    0x0000000000005be7 <+23>:    mov    (%rax),%r13
>    0x0000000000005bea <+26>:    nopl   0x0(%rax,%rax,1)
> [ ... ]
> (gdb) print /x sizeof(struct request)
> $2 = 0x178
> (gdb) print &(((struct scsi_cmnd*)0)->device)
> $4 = (struct scsi_device **) 0x38 <scsi_cmd_get_serial+8>
> (gdb) print &(((struct scsi_device*)0)->host)       
> $5 = (struct Scsi_Host **) 0x0
> 
> The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
> instruction at that address tries to dereference scsi_cmnd.device (%rax). The
> register dump shows that that pointer has the value NULL. The only function I
> know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
> caller of that function in the SCSI core is scsi_initialize_rq(). That function
> has two callers, namely scsi_init_command() and blk_get_request(). However,
> the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> why I think that the above crash report indicates that scsi_times_out() was
> called for a request that was being reinitialized and not by device hotplugging.

I could be misreading it but scsi_cmnd->device dereference should be
the following.

    0x0000000000005bdd <+13>:    mov    0x1b0(%rdi),%rax

%rdi is @req, 0x1b0(%rdi) seems to be the combined arithmetic of
blk_mq_rq_to_pdu() and ->device dereference - 0x178 + 0x38.  The
faulting access is (%rax), which is deref'ing host from device.

Thanks.
Bart Van Assche Feb. 8, 2018, 5:10 p.m. UTC | #7
On Thu, 2018-02-08 at 09:00 -0800, tj@kernel.org wrote:
> On Thu, Feb 08, 2018 at 04:31:43PM +0000, Bart Van Assche wrote:

> > The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The

> > instruction at that address tries to dereference scsi_cmnd.device (%rax). The

> > register dump shows that that pointer has the value NULL. The only function I

> > know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only

> > caller of that function in the SCSI core is scsi_initialize_rq(). That function

> > has two callers, namely scsi_init_command() and blk_get_request(). However,

> > the scsi_cmnd.device pointer is not cleared when a request finishes. This is

> > why I think that the above crash report indicates that scsi_times_out() was

> > called for a request that was being reinitialized and not by device hotplugging.

> 

> I could be misreading it but scsi_cmnd->device dereference should be

> the following.

> 

>     0x0000000000005bdd <+13>:    mov    0x1b0(%rdi),%rax

> 

> %rdi is @req, 0x1b0(%rdi) seems to be the combined arithmetic of

> blk_mq_rq_to_pdu() and ->device dereference - 0x178 + 0x38.  The

> faulting access is (%rax), which is deref'ing host from device.


Hello Tejun,

I think "dereferencing a pointer" means reading the memory location that pointer points
at? Anyway, I think we both interpret the crash report in the same way, namely that it
means that scmd->device == NULL.

Thanks,

Bart.
Tejun Heo Feb. 8, 2018, 5:19 p.m. UTC | #8
Hello, Bart.

On Thu, Feb 08, 2018 at 05:10:45PM +0000, Bart Van Assche wrote:
> I think "dereferencing a pointer" means reading the memory location that pointer points
> at? Anyway, I think we both interpret the crash report in the same way, namely that it
> means that scmd->device == NULL.

No, what I'm trying to say is that it's is the device->host reference
not the scmd->device reference.

Thanks.
Bart Van Assche Feb. 8, 2018, 5:37 p.m. UTC | #9
On Thu, 2018-02-08 at 09:19 -0800, tj@kernel.org wrote:
> Hello, Bart.

> 

> On Thu, Feb 08, 2018 at 05:10:45PM +0000, Bart Van Assche wrote:

> > I think "dereferencing a pointer" means reading the memory location that pointer points

> > at? Anyway, I think we both interpret the crash report in the same way, namely that it

> > means that scmd->device == NULL.

> 

> No, what I'm trying to say is that it's is the device->host reference

> not the scmd->device reference.


Again, I think that means that we agree about the interpretation of the crash
report. The big question here is what the next step should be to analyze this
further and/or to avoid that this crash can occur?

Thanks,

Bart.
Tejun Heo Feb. 8, 2018, 5:40 p.m. UTC | #10
On Thu, Feb 08, 2018 at 05:37:46PM +0000, Bart Van Assche wrote:
> On Thu, 2018-02-08 at 09:19 -0800, tj@kernel.org wrote:
> > Hello, Bart.
> > 
> > On Thu, Feb 08, 2018 at 05:10:45PM +0000, Bart Van Assche wrote:
> > > I think "dereferencing a pointer" means reading the memory location that pointer points
> > > at? Anyway, I think we both interpret the crash report in the same way, namely that it
> > > means that scmd->device == NULL.
> > 
> > No, what I'm trying to say is that it's is the device->host reference
> > not the scmd->device reference.
> 
> Again, I think that means that we agree about the interpretation of the crash
> report. The big question here is what the next step should be to analyze this
> further and/or to avoid that this crash can occur?

Heh, sorry about not being clear.  What I'm trying to say is that
scmd->device != NULL && device->host == NULL.  Or was this what you
were saying all along?

Thanks.
Bart Van Assche Feb. 8, 2018, 5:48 p.m. UTC | #11
On Thu, 2018-02-08 at 09:40 -0800, tj@kernel.org wrote:
> Heh, sorry about not being clear.  What I'm trying to say is that

> scmd->device != NULL && device->host == NULL.  Or was this what you

> were saying all along?


What I agree with is that the request pointer (req argument) is stored in %rdi
and that mov 0x1b0(%rdi),%rax loads scmd->device into %rax. Since RIP ==
scsi_times_out+0x17, since the instruction at that address tries to dereference
%rax and since the register dump shows that %rax == NULL I think that means that
scmd->device == NULL.

Thanks,

Bart.
Tejun Heo Feb. 8, 2018, 5:54 p.m. UTC | #12
On Thu, Feb 08, 2018 at 05:48:32PM +0000, Bart Van Assche wrote:
> On Thu, 2018-02-08 at 09:40 -0800, tj@kernel.org wrote:
> > Heh, sorry about not being clear.  What I'm trying to say is that
> > scmd->device != NULL && device->host == NULL.  Or was this what you
> > were saying all along?
> 
> What I agree with is that the request pointer (req argument) is stored in %rdi
> and that mov 0x1b0(%rdi),%rax loads scmd->device into %rax. Since RIP ==
> scsi_times_out+0x17, since the instruction at that address tries to dereference
> %rax and since the register dump shows that %rax == NULL I think that means that
> scmd->device == NULL.

Ah, I was completely confused about which one had to be NULL.  You're
absolutely right.  Let's continue earlier in the thread.

Thanks.
diff mbox

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..d6edf3b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -593,7 +593,7 @@  static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 	 */
 	local_irq_save(flags);
 	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
+	smp_store_release(&rq->aborted_gstate, gstate);
 	u64_stats_update_end(&rq->aborted_gstate_sync);
 	local_irq_restore(flags);
 }
@@ -605,7 +605,7 @@  static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 
 	do {
 		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
+		aborted_gstate = smp_load_acquire(&rq->aborted_gstate);
 	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
 
 	return aborted_gstate;
@@ -836,8 +836,8 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		 * ->aborted_gstate is set, this may lead to ignored
 		 * completions and further spurious timeouts.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		blk_mq_rq_update_aborted_gstate(req, 0);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;