From patchwork Thu Mar 3 12:53:06 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hannes Reinecke
X-Patchwork-Id: 8492511
X-Patchwork-Delegate: agross@codeaurora.org
Subject: Re: [PATCH v5 03/15] scsi: ufs: implement scsi host timeout handler
To: ygardi@codeaurora.org
References: <1456666367-11418-1-git-send-email-ygardi@codeaurora.org>
 <1456666367-11418-4-git-send-email-ygardi@codeaurora.org>
 <56D544E6.8040005@suse.de>
 <2b8282aad0b3edfaf873628edf03513d.squirrel@us.codeaurora.org>
 <56D7E652.90401@suse.de>
Cc: james.bottomley@hansenpartnership.com, linux-kernel@vger.kernel.org,
 linux-scsi@vger.kernel.org, linux-arm-msm@vger.kernel.org,
 santoshsy@gmail.com, linux-scsi-owner@vger.kernel.org,
 Gilad Broner, Vinayak Holikatti, "James E.J. Bottomley",
 "Martin K. Petersen"
From: Hannes Reinecke
Message-ID: <56D833B2.6030104@suse.de>
Date: Thu, 3 Mar 2016 20:53:06 +0800
X-Mailing-List: linux-arm-msm@vger.kernel.org

On 03/03/2016 05:10 PM, ygardi@codeaurora.org wrote:
>> On 03/01/2016 09:25 PM, ygardi@codeaurora.org wrote:
>>>> On 02/28/2016 09:32 PM, Yaniv Gardi wrote:
>>>>> A race condition exists between request requeueing and scsi layer
>>>>> error handling:
>>>>> When UFS driver queuecommand returns a busy status for a request,
>>>>> it will be requeued and its tag will be freed and set to -1.
>>>>> At the same time it is possible that the request will timeout and
>>>>> scsi layer will start error handling for it. The scsi layer reuses
>>>>> the request and its tag to send error related commands to the device,
>>>>> however its tag is no longer valid.
>>>> Hmm. How can the host return a 'busy' status for a request?
>>>> From my understanding we have three possibilities:
>>>>
>>>> 1) queuecommand returns busy; however, that means that the command has
>>>> never been sent and this issue shouldn't occur
>>>> 2) The command returns with BUSY status. But in this case it has
>>>> already been returned, so there cannot be any timeout coming in.
>>>> 3) The host receives a command with a tag which is already in-use.
>>>> However, that should have been prevented by the block-layer, which
>>>> really should ensure that this situation never happens.
>>>>
>>>> So either way I look at it, it really looks like a bug and adding a
>>>> timeout handler will just paper over it.
>>>> (Not that a timeout handler is a bad idea, in fact I'm convinced that
>>>> you need one. Just not for this purpose.)
>>>>
>>>> So can you elaborate how this 'busy' status comes about?
>>>> Is the command sent to the device?
>>>>
>>>> Cheers,
>>>>
>>>> Hannes
>>>
>>>
>>> Hi Hannes,
>>>
>>> it's going to be a bit long :)
>>> I think you are missing the point.
>>> I will describe a race condition that happened to us a while ago, that
>>> was quite difficult to understand and fix.
>>> So, this patch is not about the "busy" returning to the scsi dispatch
>>> routine. it's about the abort triggered after 30 seconds.
>>>
>>> imagine a request being queued and sent to the scsi, and then to the
>>> ufs.
>>> a timer, initialized to 30 seconds, starts ticking.
>>> but the request is never sent to the ufs device, as queuecommand()
>>> returns with "SCSI_MLQUEUE_HOST_BUSY"
>>> by looking at the code, this could happen, for example:
>>>	err = ufshcd_hold(hba, true);
>>>	if (err) {
>>>		err = SCSI_MLQUEUE_HOST_BUSY;
>>>		goto out;
>>>	}
>>>
>> Uuhhh.
>> You probably should not have pointed me to that piece of code ...
>> open-coding loops in ufshcd_hold() ... shudder.
>> (Did I ever review that one? Must've ...)
>> _Anyway_: sleeping in queuecommand is always a bad idea, as then
>> precisely those issues you've just described will happen.
>>
>> Couldn't you just call
>> ufshcd_hold(hba, false)
>> instead of
>> ufshcd_hold(hba, true)
>> ?
>> The request will be requeued more-or-less immediately, avoiding the
>> issue with timeout handler kicking in.
>> And the queue will remain blocked until the ungate work item returns, at
>> which point I/O submission will continue.
>> As the request will be requeued to the head of the queue there won't be
>> other I/O competing with tags, so it shouldn't have any adverse effects.
>>
>> Wouldn't that work?
>>
>> Cheers,
>>
>> Hannes
>
> Hi Hannes
>
> This is a bug, and it should be fixed.

Oh, definitely agreed. The question is _where_.

> if you choose to bypass it, by calling ufshcd_hold(hba, false), not only
> the race condition is still there, and can pop-out at any other point in
> the future, but also, not sure what are the consequences of
> ufshcd_hold(hba, false) instead of "true".

Well ... seeing it's your driver, I would've thought _you_ should know ...

> so, changing the already tested and working code, (not to return BUSY from
> queuecommand) is not a fix.

Hey, I did _not_ suggest not to return BUSY from queuecommand.

I was suggesting this patch (the diff at the end of this mail), which, by
reading the code, should be avoiding this issue.
I was just asking you if you could give this patch a spin and see if it
works.
If not (for whatever reason) I'm happy to accept your patch. But first I
would like to have an explanation why the above would _not_ work.

Unfortunately I don't have the hardware, otherwise I'd be running the
tests myself.

Cheers,

Hannes

Reviewed-by: Dolev Raviv

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 9c1b94b..b9295ad 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -1388,7 +1388,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
 		goto out;
 	}
 
-	err = ufshcd_hold(hba, true);
+	err = ufshcd_hold(hba, false);
 	if (err) {
 		err = SCSI_MLQUEUE_HOST_BUSY;
 		clear_bit_unlock(tag, &hba->lrb_in_use);
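
For readers following the tag handling discussed in this thread, the busy
path in the hunk above can be sketched in userspace C. This is a minimal
illustration, not kernel code: a tag bit is claimed, the hold step fails,
and the tag is released again before a busy status is returned. All
sketch_ names and the SKETCH_HOST_BUSY value are hypothetical stand-ins,
and sketch_hold() only models "clocks on or off"; it makes no claim about
the real ufshcd_hold() semantics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NUM_TAGS  32u
#define SKETCH_HOST_BUSY 1   /* hypothetical stand-in for SCSI_MLQUEUE_HOST_BUSY */

/* Hypothetical, heavily reduced model of the host state. */
struct sketch_hba {
	atomic_ulong lrb_in_use;   /* one bit per tag, set while the tag is owned */
	bool clocks_on;            /* assumption: hold succeeds only when true */
};

/* Claim a tag bit; returns true only if the bit was previously clear. */
bool sketch_tag_acquire(struct sketch_hba *hba, unsigned int tag)
{
	assert(tag < SKETCH_NUM_TAGS);
	unsigned long bit = 1UL << tag;
	return !(atomic_fetch_or(&hba->lrb_in_use, bit) & bit);
}

/* Release a tag bit (the role clear_bit_unlock() plays in the hunk). */
void sketch_tag_release(struct sketch_hba *hba, unsigned int tag)
{
	atomic_fetch_and(&hba->lrb_in_use, ~(1UL << tag));
}

/* Report whether a tag is currently owned by a command. */
bool sketch_tag_in_use(struct sketch_hba *hba, unsigned int tag)
{
	return (atomic_load(&hba->lrb_in_use) >> tag) & 1UL;
}

/* Modelled hold step: 0 on success, non-zero when clocks are off. */
int sketch_hold(struct sketch_hba *hba)
{
	return hba->clocks_on ? 0 : -1;
}

/* Busy path: the tag must not stay claimed once busy is returned. */
int sketch_queuecommand(struct sketch_hba *hba, unsigned int tag)
{
	if (!sketch_tag_acquire(hba, tag))
		return SKETCH_HOST_BUSY;   /* tag already owned elsewhere */

	if (sketch_hold(hba)) {
		sketch_tag_release(hba, tag);
		return SKETCH_HOST_BUSY;   /* command was never sent */
	}
	return 0;                          /* command owns the tag */
}
```

In this sketch, after a busy return sketch_tag_in_use() reports false for
that tag, which is the state the quoted patch description is concerned
with: anything that later reuses the request's tag for error handling is
working with a tag the request no longer owns.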