Message ID | 20170115091925.GA26656@gmail.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On Sun, 2017-01-15 at 10:19 +0100, Ingo Molnar wrote: > So there's a new mpt3sas SCSI driver boot regression, introduced in > this merge window, which made one of my servers unbootable. We're not reverting a fix that would cause regressions for others. However, The fix was manifestly wrong, so does this fix of the fix work for you: http://marc.info/?l=linux-scsi&m=148329237807604 It's been languishing a bit because no-one seemed to care enough to test or review it. IOf you can add a tested by, that will give the two we need to push it. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Jan 15, 2017 at 8:11 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > We're not reverting a fix that would cause regressions for others. Oh HELL YES we are. The rule is that we never break old stuff. Some new fix that fixes something that never used to work, but breaks something else, gets reverted very aggressively. So if a new bugfix or workaround causes problems for existing users, it gets reverted. The fact that it fixed something else is COMPLETELY IRRELEVANT. We do not do the "one step forward, two steps back" dance. If you can't fix a bug without breaking old systems, the "fix" gets reverted. Apparently there is already a possible real fix in flight, so I won't actually do the revert, but I very much want to object to your statement. Reverts happen. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2017-01-15 at 10:54 -0800, Linus Torvalds wrote: > On Sun, Jan 15, 2017 at 8:11 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > We're not reverting a fix that would cause regressions for others. > > Oh HELL YES we are. > > The rule is that we never break old stuff. Some new fix that fixes > something that never used to work, but breaks something else, gets > reverted very aggressively. > > So if a new bugfix or workaround causes problems for existing users, > it gets reverted. The fact that it fixed something else is COMPLETELY > IRRELEVANT. > > We do not do the "one step forward, two steps back" dance. If you > can't fix a bug without breaking old systems, the "fix" gets > reverted. > > Apparently there is already a possible real fix in flight, so I won't > actually do the revert, but I very much want to object to your > statement. > > Reverts happen. Can we compromise on "try not to revert a fix ...". The problem with bugs in regression fixes is that we now have two constituencies: the people who get the regression back if we revert the fix and the people who are bitten by the bug in the original regression fix. In this particular case, I think we should always try to fix the fix because reversion also violates "never break old stuff". There are corner cases, of course, like if the latter constituency is much bigger and the fix is hard to fix, then we might revert and try again. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Jan 15, 2017 at 11:13 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > Can we compromise on "try not to revert a fix ...". No. It's about timing, and about how serious the regression is. For example, if this happened in rc7, I would have reverted immediately. No questions asked. In this case, the "fix" was was also much less important then the problem it caused. Some specialized pass-through command not working right, vs a machine not even booting? There's just no question what-so-ever. So the "fix" you claim just wasn't nearly important enough. It was also pretty recent and clearly things had worked for _years_ without it. In fact, I'm still somewhat inclined to revert it, just to have a working rc4 release later today. But I'm hoping maybe Ingo has time to test things (although I suspect he's already asleep). Linus -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2017-01-15 at 11:41 -0800, Linus Torvalds wrote: > On Sun, Jan 15, 2017 at 11:13 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > Can we compromise on "try not to revert a fix ...". > > No. > > It's about timing, and about how serious the regression is. > > For example, if this happened in rc7, I would have reverted > immediately. No questions asked. > > In this case, the "fix" was was also much less important then the > problem it caused. Some specialized pass-through command not working > right, vs a machine not even booting? There's just no question > what-so-ever. > > So the "fix" you claim just wasn't nearly important enough. It was > also pretty recent and clearly things had worked for _years_ without > it. > > In fact, I'm still somewhat inclined to revert it, just to have a > working rc4 release later today. But I'm hoping maybe Ingo has time > to test things (although I suspect he's already asleep). OK, so the patch to revert would actually be commit 669f044170d8933c3d66d231b69ea97cb8447338 Author: Bart Van Assche <bart.vanassche@sandisk.com> Date: Tue Nov 22 16:17:13 2016 -0800 scsi: srp_transport: Move queuecommand() wait code to SCSI core Because that change in the wait code broke the "fix" in mpt3sas. Before that was applied, it actually worked even though I think it's a wrong fix. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2017-01-15 at 11:49 -0800, James Bottomley wrote: > On Sun, 2017-01-15 at 11:41 -0800, Linus Torvalds wrote: > > On Sun, Jan 15, 2017 at 11:13 AM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > Can we compromise on "try not to revert a fix ...". > > > > No. > > > > It's about timing, and about how serious the regression is. > > > > For example, if this happened in rc7, I would have reverted > > immediately. No questions asked. > > > > In this case, the "fix" was was also much less important then the > > problem it caused. Some specialized pass-through command not > > working > > right, vs a machine not even booting? There's just no question > > what-so-ever. > > > > So the "fix" you claim just wasn't nearly important enough. It was > > also pretty recent and clearly things had worked for _years_ > > without > > it. > > > > In fact, I'm still somewhat inclined to revert it, just to have a > > working rc4 release later today. But I'm hoping maybe Ingo has > > time > > to test things (although I suspect he's already asleep). > > OK, so the patch to revert would actually be > > commit 669f044170d8933c3d66d231b69ea97cb8447338 > Author: Bart Van Assche <bart.vanassche@sandisk.com> > Date: Tue Nov 22 16:17:13 2016 -0800 > > scsi: srp_transport: Move queuecommand() wait code to SCSI core > > Because that change in the wait code broke the "fix" in mpt3sas. > Before that was applied, it actually worked even though I think it's > a wrong fix. Hello James, I disagree. Even if my patch would be reverted that still wouldn't fix the severe race condition that was introduced in the mpt3sas driver by the patch that triggers the lockup during boot. As I explained two weeks ago (see also https://www.spinics.net/lists/kernel/msg2411413.htm l), commit 18f6084a989b ("scsi: mpt3sas: Fix secure erase premature termination") is the one that should be reverted instead of my patch. I agree with Linus that the offending mpt3sas patch already should have been reverted. Bart.
* James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > On Sun, 2017-01-15 at 10:19 +0100, Ingo Molnar wrote: > > So there's a new mpt3sas SCSI driver boot regression, introduced in > > this merge window, which made one of my servers unbootable. > > We're not reverting a fix that would cause regressions for others. You really need to reconsider that stance ... > However, The fix was manifestly wrong, so does this fix of the fix work for you: > > http://marc.info/?l=linux-scsi&m=148329237807604 > > It's been languishing a bit because no-one seemed to care enough to > test or review it. IOf you can add a tested by, that will give the two > we need to push it. I have tested your other patch that you pointed to: http://marc.info/?l=linux-scsi&m=148449968522828 Which patch fixes the bug too (I removed my revert first) - so you can add my: Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> BTW., is it wise to work around the out of spec firmware in the mpt3sas code and leave the overly optimistic assumptions in the SCSI code intact? The problem is that other SCSI hardware could be affected as well - and especially enterprise class server hardware has long testing and thus regression latencies (as my example proves). Wouldn't it be more robust to only submit one pass-through command at a time from the SCSI layer, and maybe opt-in hardware that is known to implement the SAT standard fully? (But I'm just kibitzing here really.) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2017-01-16 at 10:22 +0100, Ingo Molnar wrote: > * James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > On Sun, 2017-01-15 at 10:19 +0100, Ingo Molnar wrote: > > > So there's a new mpt3sas SCSI driver boot regression, introduced > > > in > > > this merge window, which made one of my servers unbootable. > > > > We're not reverting a fix that would cause regressions for others. > > You really need to reconsider that stance ... > > > However, The fix was manifestly wrong, so does this fix of the fix > > work for you: > > > > http://marc.info/?l=linux-scsi&m=148329237807604 > > > > It's been languishing a bit because no-one seemed to care enough to > > test or review it. IOf you can add a tested by, that will give the > > two > > we need to push it. > > I have tested your other patch that you pointed to: > > http://marc.info/?l=linux-scsi&m=148449968522828 > > Which patch fixes the bug too (I removed my revert first) - so you > can add my: > > Reported-by: Ingo Molnar <mingo@kernel.org> > Tested-by: Ingo Molnar <mingo@kernel.org> Thanks ... just checking you tested the second version with the concurrency part? > BTW., is it wise to work around the out of spec firmware in the > mpt3sas code and leave the overly optimistic assumptions in the SCSI > code intact? The problem is that other SCSI hardware could be > affected as well - and especially enterprise class server hardware > has long testing and thus regression latencies (as my example > proves). Realistically, there is no other card. Every other SAS implementation uses the in-kernel SAT, which does the right thing. We've suggested on a few occasions that the mpt SAS might like to use it as well, given we keep tripping on SAT problems in their firmware. > Wouldn't it be more robust to only submit one pass-through command at > a time from the SCSI layer, and maybe opt-in hardware that is known > to implement the SAT standard fully? Unfortunately it's a lot more complex: the standard gives a queueing mechanism for SAT pass through, so mostly you *can* have multiple commands outstanding, so it looks like we shouldn't globally restrict that. However, it seems the mpt3 firmware is using a queue single command model *and* not doing the right thing with return codes hence the failure. Since the failure mode is mpt3 specific, I think the best place for the fix is in their code. We can revisit this decision if something else comes along that also has this problem (UAS springs to mind). James > (But I'm just kibitzing here really.) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Jan 15, 2017 at 10:02:51PM +0000, Bart Van Assche wrote: > I disagree. Even if my patch would be reverted that still wouldn't fix > the severe race condition that was introduced in the mpt3sas driver by > the patch that triggers the lockup during boot. As I explained two > weeks ago (see also https://www.spinics.net/lists/kernel/msg2411413.htm > l), commit 18f6084a989b ("scsi: mpt3sas: Fix secure erase premature > termination") is the one that should be reverted instead of my patch. I > agree with Linus that the offending mpt3sas patch already should have > been reverted. In addition to that ATA passthrough through scsi CDBs is a fringe feature compared to normal operation of the HBA. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2017-01-16 at 07:27 -0800, Christoph Hellwig wrote: > On Sun, Jan 15, 2017 at 10:02:51PM +0000, Bart Van Assche wrote: > > I disagree. Even if my patch would be reverted that still wouldn't > > fix the severe race condition that was introduced in the mpt3sas > > driver by the patch that triggers the lockup during boot. As I > > explained two weeks ago (see also > > https://www.spinics.net/lists/kernel/msg2411413.htm > > l), commit 18f6084a989b ("scsi: mpt3sas: Fix secure erase premature > > termination") is the one that should be reverted instead of my > > patch. I agree with Linus that the offending mpt3sas patch already > > should have been reverted. > > In addition to that ATA passthrough through scsi CDBs is a fringe > feature compared to normal operation of the HBA. So how about, instead of arguing about reversion, one or other of you reviews the proposed fix of the fix which would avoid having to revert anything? James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > On Mon, 2017-01-16 at 10:22 +0100, Ingo Molnar wrote: > > * James Bottomley <James.Bottomley@HansenPartnership.com> wrote: > > > > > On Sun, 2017-01-15 at 10:19 +0100, Ingo Molnar wrote: > > > > So there's a new mpt3sas SCSI driver boot regression, introduced > > > > in > > > > this merge window, which made one of my servers unbootable. > > > > > > We're not reverting a fix that would cause regressions for others. > > > > You really need to reconsider that stance ... > > > > > However, The fix was manifestly wrong, so does this fix of the fix > > > work for you: > > > > > > http://marc.info/?l=linux-scsi&m=148329237807604 > > > > > > It's been languishing a bit because no-one seemed to care enough to > > > test or review it. IOf you can add a tested by, that will give the > > > two > > > we need to push it. > > > > I have tested your other patch that you pointed to: > > > > http://marc.info/?l=linux-scsi&m=148449968522828 > > > > Which patch fixes the bug too (I removed my revert first) - so you > > can add my: > > > > Reported-by: Ingo Molnar <mingo@kernel.org> > > Tested-by: Ingo Molnar <mingo@kernel.org> > > Thanks ... just checking you tested the second version with the > concurrency part? I tested the one I linked to: http://marc.info/?l=linux-scsi&m=148449968522828 I don't know which version that is exactly, I just tested what you asked me to test. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 16, 2017 at 08:14:39AM -0800, James Bottomley wrote: > So how about, instead of arguing about reversion, one or other of you > reviews the proposed fix of the fix which would avoid having to revert > anything? As far as I can tell everyone is waiting for you to resend it with the review comment from the Broadcom maintainers adddressed. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
====================> From 0734e6d2a7f757172d6b7750d8fcf602909300e6 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@kernel.org> Date: Sun, 15 Jan 2017 09:59:39 +0100 Subject: [PATCH] Revert "scsi: mpt3sas: Fix secure erase premature termination" This reverts commit 18f6084a989ba1b38702f9af37a2e4049a924be6. Conflicts: drivers/scsi/mpt3sas/mpt3sas_scsih.c Signed-off-by: Ingo Molnar <mingo@kernel.org> --- drivers/scsi/mpt3sas/mpt3sas_scsih.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index b5c966e319d3..3573daa2cce8 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -4063,13 +4063,6 @@ scsih_qcmd(struct Scsi_Host *shost, struct scsi_cmnd *scmd) if (ioc->logging_level & MPT_DEBUG_SCSI) scsi_print_command(scmd); - /* - * Lock the device for any subsequent command until command is - * done. - */ - if (ata_12_16_cmd(scmd)) - scsi_internal_device_block(scmd->device); - sas_device_priv_data = scmd->device->hostdata; if (!sas_device_priv_data || !sas_device_priv_data->sas_target) { scmd->result = DID_NO_CONNECT << 16; @@ -4650,9 +4643,6 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply) if (scmd == NULL) return 1; - if (ata_12_16_cmd(scmd)) - scsi_internal_device_unblock(scmd->device, SDEV_RUNNING); - mpi_request = mpt3sas_base_get_msg_frame(ioc, smid); if (mpi_reply == NULL) {