Message ID | 20210507123103.10265-1-dwagner@suse.de (mailing list archive) |
---|---|
Headers | show |
Series | Serialize timeout handling and done callback. | expand |
Hi, We got a customer report where qla2xxx was crashing only if the kernel was booting and ql2xextended_error_logging was set. Loading the module with the log option didn't trigger the crash. After starring for a long time at the crash report I figured the problem might be a race between the timeout handler and done callback. I've come up with these patches here but unfortunatly, our customer is not able to reproduce the problem in the lab anymore (it was caused by a hardware issue which got fixed). So for these patches I don't have any feedback. Maybe they make sense to add the driver even if I don't have prove it really address the mentioned bug hence this is marked as RFC. Thanks, Daniel Daniel Wagner (2): qla2xxx: Refactor asynchronous command initialization qla2xxx: Do not free resource to early in qla24xx_async_gpsc_sp_done() drivers/scsi/qla2xxx/qla_def.h | 5 ++ drivers/scsi/qla2xxx/qla_gbl.h | 4 +- drivers/scsi/qla2xxx/qla_gs.c | 86 ++++++++++------------------- drivers/scsi/qla2xxx/qla_init.c | 91 +++++++++++++------------------ drivers/scsi/qla2xxx/qla_iocb.c | 54 +++++++++++++----- drivers/scsi/qla2xxx/qla_mbx.c | 11 ++-- drivers/scsi/qla2xxx/qla_mid.c | 5 +- drivers/scsi/qla2xxx/qla_mr.c | 7 +-- drivers/scsi/qla2xxx/qla_target.c | 6 +- 9 files changed, 127 insertions(+), 142 deletions(-)
Hi, On Fri, May 07, 2021 at 02:31:00PM +0200, Daniel Wagner wrote: > Maybe they make sense to add the driver even if I don't have prove it > really address the mentioned bug hence this is marked as RFC. Any feedback? Thanks, Daniel
On Tue, 25 May 2021, 6:26am, Daniel Wagner wrote: > External Email > > ---------------------------------------------------------------------- > Hi, > > On Fri, May 07, 2021 at 02:31:00PM +0200, Daniel Wagner wrote: > > Maybe they make sense to add the driver even if I don't have prove it > > really address the mentioned bug hence this is marked as RFC. > > Any feedback? > Apologies for the delay, Daniel. I will review this by end of this week. Regards, -Arun
On Fri, 7 May 2021, 5:31am, Daniel Wagner wrote: > External Email > > ---------------------------------------------------------------------- > Hi, > > We got a customer report where qla2xxx was crashing only if the kernel > was booting and ql2xextended_error_logging was set. Loading the module > with the log option didn't trigger the crash. > > After starring for a long time at the crash report I figured the > problem might be a race between the timeout handler and done callback. > I've come up with these patches here but unfortunatly, our customer is > not able to reproduce the problem in the lab anymore (it was caused by > a hardware issue which got fixed). So for these patches I don't have > any feedback. Thanks Daniel for the report and your effort here. Agree, this needs to be fixed. > > Maybe they make sense to add the driver even if I don't have prove it > really address the mentioned bug hence this is marked as RFC. If you do not mind, can I take this from here? This touches quite a number of paths, and I would like to have this go through a full regression cycle before this is merged. That said, I had some general comments: 1. I see sp->lock was introduced, but could not locate where it was used. 2. I did not see a release of lock after a successful kref_put_lock, maybe that piece was missed out. 3. The sp->done should be invoked only once, and I do not see if this is taken care of. 4. sp->cmd_sp could be a SCSI IO too, where sp is not allocated separately, so qla2x00_sp_release shall not be called for it. Regards, -Arun > > Thanks, > Daniel > > Daniel Wagner (2): > qla2xxx: Refactor asynchronous command initialization > qla2xxx: Do not free resource to early in qla24xx_async_gpsc_sp_done() > > drivers/scsi/qla2xxx/qla_def.h | 5 ++ > drivers/scsi/qla2xxx/qla_gbl.h | 4 +- > drivers/scsi/qla2xxx/qla_gs.c | 86 ++++++++++------------------- > drivers/scsi/qla2xxx/qla_init.c | 91 +++++++++++++------------------ > drivers/scsi/qla2xxx/qla_iocb.c | 54 +++++++++++++----- > drivers/scsi/qla2xxx/qla_mbx.c | 11 ++-- > drivers/scsi/qla2xxx/qla_mid.c | 5 +- > drivers/scsi/qla2xxx/qla_mr.c | 7 +-- > drivers/scsi/qla2xxx/qla_target.c | 6 +- > 9 files changed, 127 insertions(+), 142 deletions(-) > >
Hi Arun, On Mon, May 31, 2021 at 02:06:24AM -0700, Arun Easi wrote: > Thanks Daniel for the report and your effort here. Agree, this needs to be > fixed. Good to hear! > If you do not mind, can I take this from here? This touches quite a > number of paths, and I would like to have this go through a full > regression cycle before this is merged. Sure, that is what I hoped for. It is an invasive change and this needs to be properly tested with a few different setups. Something I can't really do. So I would be glad if you could pick up the patches and fix them up. > That said, I had some general comments: > > 1. I see sp->lock was introduced, but could not locate where it was > used. I thought I needed it for serializing the kref operations. The lock itself is not used in the driver. After re-reading the documentation, the lock is not necessary as kref_put() is able to serialize the ref counter inc/dec operation correctly. The lock would only be useful to serialize the kref_put() with something which runs in the driver concurrently. > 2. I did not see a release of lock after a successful kref_put_lock, maybe > that piece was missed out. I think you got it right. The lock is not necessary. > 3. The sp->done should be invoked only once, and I do not see if this is > taken care of. qla2x00_sp_release() will only be called when the ref counter gets 0. This makes sure we only call sp->done() once. > 4. sp->cmd_sp could be a SCSI IO too, where sp is not allocated > separately, so qla2x00_sp_release shall not be called for it. Okay, didn't realize this. Thanks, Daniel