[v2] blk-mq: Fix race between resetting the timer and completion handling

Message ID 1518037339.2870.61.camel@wdc.com (mailing list archive)
State New, archived

Commit Message

Bart Van Assche Feb. 7, 2018, 9:02 p.m. UTC
On Wed, 2018-02-07 at 12:09 -0800, tj@kernel.org wrote:
> Hello,
> 
> On Wed, Feb 07, 2018 at 07:03:56PM +0000, Bart Van Assche wrote:
> > I tried the above patch but already during the first iteration of the test I
> > noticed that the test hung, probably due to the following request that got stuck:
> > 
> > $ (cd /sys/kernel/debug/block && grep -aH . */*/*/rq_list)
> > 00000000a98cff60 {.op=SCSI_IN, .cmd_flags=, .rq_flags=MQ_INFLIGHT|PREEMPT|QUIET|IO_STAT|PM,
> >  .state=idle, .tag=22, .internal_tag=-1, .cmd=Synchronize Cache(10) 35 00 00 00 00 00, .retries=0,
> >  .result = 0x0, .flags=TAGGED, .timeout=60.000, allocated 872.690 s ago}
> 
> I wonder how this happened. We can lose a completion when it races
> against BLK_EH_RESET_TIMER; however, the command should time out
> later because the timer is running again by then.  Maybe we actually
> hit the memory barrier race that you pointed out in the other message?

Hello Tejun,

The patch that I used in my test had an smp_wmb() call (see also below). Anyway,
I will see whether I can extract more state information through debugfs.

Comments

Tejun Heo Feb. 7, 2018, 9:40 p.m. UTC | #1
Hello,

On Wed, Feb 07, 2018 at 09:02:21PM +0000, Bart Van Assche wrote:
> The patch that I used in my test had an smp_wmb() call (see also below). Anyway,
> I will see whether I can extract more state information through debugfs.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ef4f6df0f1df..8eb2105d82b7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -827,13 +827,9 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> -		/*
> -		 * As nothing prevents from completion happening while
> -		 * ->aborted_gstate is set, this may lead to ignored
> -		 * completions and further spurious timeouts.
> -		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
>  		blk_add_timer(req);
> +		smp_wmb();
> +		blk_mq_rq_update_aborted_gstate(req, 0);

Without the matching rmb, just adding a wmb won't do much, but given
x86's strong default ordering and the other operations around it, what
you were seeing is probably not caused by a lack of barriers.

Thanks.
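
To illustrate the point about barrier pairing, here is a minimal userspace sketch, not kernel code. It assumes C11 fences as the closest portable analogue: atomic_thread_fence(memory_order_release) stands in for smp_wmb() and atomic_thread_fence(memory_order_acquire) for smp_rmb() (both are somewhat stronger than the kernel primitives). struct fake_request, reset_timer() and read_deadline_if_live() are hypothetical names invented for the example; only the ordering "arm the timer, barrier, clear aborted_gstate" is taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the request fields; not the kernel's struct request. */
struct fake_request {
	uint64_t deadline;                   /* plain data, like the re-armed timer deadline */
	atomic_uint_fast64_t aborted_gstate; /* word that the other side polls */
};

/* Start in the "aborted" state (non-zero), as in a request that just timed out. */
static void fake_request_init(struct fake_request *rq, uint64_t gstate)
{
	assert(rq != NULL);
	rq->deadline = 0;
	atomic_init(&rq->aborted_gstate, gstate);
}

/* Writer side: same ordering as the BLK_EH_RESET_TIMER case in the patch. */
static void reset_timer(struct fake_request *rq, uint64_t new_deadline)
{
	rq->deadline = new_deadline;               /* ~ blk_add_timer(req) */
	atomic_thread_fence(memory_order_release); /* ~ smp_wmb() */
	atomic_store_explicit(&rq->aborted_gstate, 0, memory_order_relaxed);
}

/*
 * Reader side: the fence between the two loads is the "matching rmb".
 * Without it, the writer's fence alone does not guarantee that a reader
 * which observes aborted_gstate == 0 also observes the new deadline.
 * Returns 1 and fills *deadline if the request is no longer marked aborted.
 */
static int read_deadline_if_live(struct fake_request *rq, uint64_t *deadline)
{
	if (atomic_load_explicit(&rq->aborted_gstate, memory_order_relaxed) != 0)
		return 0;
	atomic_thread_fence(memory_order_acquire); /* ~ smp_rmb() */
	*deadline = rq->deadline;
	return 1;
}
```

On x86 both fences compile to no extra instructions beyond a compiler barrier, which is consistent with the observation that a missing barrier is unlikely to explain the hang seen there.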
Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ef4f6df0f1df..8eb2105d82b7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -827,13 +827,9 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		smp_wmb();
+		blk_mq_rq_update_aborted_gstate(req, 0);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;