Message ID: 20210416235329.49234-1-snitzer@redhat.com (mailing list archive)
Series: nvme: improve error handling and ana_state to work well with dm-multipath
On Fri, Apr 16 2021 at 7:53pm -0400, Mike Snitzer <snitzer@redhat.com> wrote:

> Hi,
>
> This patchset reflects changes needed to make NVMe error handling and
> ANA state updates work well with dm-multipath (which always sets
> REQ_FAILFAST_TRANSPORT).
>
> RHEL8 has been carrying an older ~5.9 based version of this patchset
> (since RHEL8.3, August 2020).
>
> RHEL9 is coming, would really prefer that these changes land upstream
> rather than carry them within RHEL.
>
> All review/feedback welcome.
>
> Thanks,
> Mike
>
> v3 -> v4, less is more:
> - folded REQ_FAILFAST_TRANSPORT local retry and FAILUP patches
> - simplified nvme_failup_req(), removes needless blk_path_error() et al
> - removed comment block in nvme_decide_disposition()
>
> v2 -> v3:
> - Added Reviewed-by tags to BLK_STS_DO_NOT_RETRY patch.
> - Eliminated __nvme_end_req() and added code comment to
>   nvme_failup_req() in FAILUP handling patch.
>
> Mike Snitzer (3):
>   nvme: return BLK_STS_DO_NOT_RETRY if the DNR bit is set
>   nvme: allow local retry and proper failover for REQ_FAILFAST_TRANSPORT
>   nvme: decouple basic ANA log page re-read support from native
>     multipathing
>
>  drivers/nvme/host/core.c      | 22 +++++++++++++++++++---
>  drivers/nvme/host/multipath.c | 16 +++++++++++-----
>  drivers/nvme/host/nvme.h      |  4 ++++
>  include/linux/blk_types.h     |  8 ++++++++
>  4 files changed, 42 insertions(+), 8 deletions(-)

Sorry for all the noise, but I had a cut-and-paste issue with this cover
letter; should've said "[PATCH v4 0/4] ..."

While I'm replying, I _think_ there is consensus that patch 1 is
worthwhile and acceptable. Here is a combined diff of patches 2+3 to
illustrate just how minimalist these proposed changes are:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 540d6fd8ffef..83ca96292157 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -299,6 +299,7 @@ enum nvme_disposition {
         COMPLETE,
         RETRY,
         FAILOVER,
+        FAILUP,
 };

 static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
@@ -306,15 +307,16 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
         if (likely(nvme_req(req)->status == 0))
                 return COMPLETE;

-        if (blk_noretry_request(req) ||
+        if ((req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) ||
             (nvme_req(req)->status & NVME_SC_DNR) ||
             nvme_req(req)->retries >= nvme_max_retries)
                 return COMPLETE;

-        if (req->cmd_flags & REQ_NVME_MPATH) {
+        if (req->cmd_flags & (REQ_NVME_MPATH | REQ_FAILFAST_TRANSPORT)) {
                 if (nvme_is_path_error(nvme_req(req)->status) ||
                     blk_queue_dying(req->q))
-                        return FAILOVER;
+                        return (req->cmd_flags & REQ_NVME_MPATH) ?
+                                FAILOVER : FAILUP;
         } else {
                 if (blk_queue_dying(req->q))
                         return COMPLETE;
@@ -336,6 +338,14 @@ static inline void nvme_end_req(struct request *req)
         blk_mq_end_request(req, status);
 }

+static inline void nvme_failup_req(struct request *req)
+{
+        nvme_update_ana(req);
+
+        nvme_req(req)->status = NVME_SC_HOST_PATH_ERROR;
+        nvme_end_req(req);
+}
+
 void nvme_complete_rq(struct request *req)
 {
         trace_nvme_complete_rq(req);
@@ -354,6 +364,9 @@ void nvme_complete_rq(struct request *req)
         case FAILOVER:
                 nvme_failover_req(req);
                 return;
+        case FAILUP:
+                nvme_failup_req(req);
+                return;
         }
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..7d94250264aa 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -65,23 +65,29 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
         }
 }

-void nvme_failover_req(struct request *req)
+void nvme_update_ana(struct request *req)
 {
         struct nvme_ns *ns = req->q->queuedata;
         u16 status = nvme_req(req)->status & 0x7ff;
-        unsigned long flags;
-
-        nvme_mpath_clear_current_path(ns);

         /*
          * If we got back an ANA error, we know the controller is alive but not
-         * ready to serve this namespace.  Kick of a re-read of the ANA
+         * ready to serve this namespace.  Kick off a re-read of the ANA
          * information page, and just try any other available path for now.
          */
         if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
                 set_bit(NVME_NS_ANA_PENDING, &ns->flags);
                 queue_work(nvme_wq, &ns->ctrl->ana_work);
         }
+}
+
+void nvme_failover_req(struct request *req)
+{
+        struct nvme_ns *ns = req->q->queuedata;
+        unsigned long flags;
+
+        nvme_mpath_clear_current_path(ns);
+        nvme_update_ana(req);

         spin_lock_irqsave(&ns->head->requeue_lock, flags);
         blk_steal_bios(&ns->head->requeue_list, req);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 07b34175c6ce..4eed8536625c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -664,6 +664,7 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
 void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
                         struct nvme_ctrl *ctrl, int *flags);
 void nvme_failover_req(struct request *req);
+void nvme_update_ana(struct request *req);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
 void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id);
@@ -714,6 +715,9 @@ static inline void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
 static inline void nvme_failover_req(struct request *req)
 {
 }
+static inline void nvme_update_ana(struct request *req)
+{
+}
 static inline void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
 {
 }
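[Editorial note: to make the control flow above easier to follow outside a
kernel tree, here is a minimal standalone C sketch of the four-way
disposition the combined diff gives nvme_decide_disposition(). Everything
in it (the fake_req struct, the flag values, the simplified status fields)
is a stand-in invented for illustration, not the kernel's definitions; only
the branch structure mirrors the patched code.]

/*
 * Standalone sketch (not kernel code) of the patched disposition logic.
 * All types, flags and status fields are simplified stand-ins.
 */
#include <stdbool.h>
#include <stdio.h>

enum disposition { COMPLETE, RETRY, FAILOVER, FAILUP };

/* stand-in request flags */
#define REQ_FAILFAST_DEV        (1u << 0)
#define REQ_FAILFAST_TRANSPORT  (1u << 1)
#define REQ_FAILFAST_DRIVER     (1u << 2)
#define REQ_NVME_MPATH          (1u << 3)

struct fake_req {
        unsigned int cmd_flags;
        unsigned int status;            /* 0 == success */
        bool status_is_path_error;      /* stands in for nvme_is_path_error() */
        bool status_has_dnr;            /* stands in for NVME_SC_DNR */
        bool queue_dying;               /* stands in for blk_queue_dying() */
        int retries, max_retries;
};

static enum disposition decide(const struct fake_req *req)
{
        if (req->status == 0)
                return COMPLETE;

        /* DNR set, retries exhausted, or DEV/DRIVER failfast: give up */
        if ((req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) ||
            req->status_has_dnr || req->retries >= req->max_retries)
                return COMPLETE;

        /*
         * Path errors on native multipath requests fail over inside NVMe;
         * the same errors on REQ_FAILFAST_TRANSPORT (dm-multipath) requests
         * are "failed up" so the upper layer can pick another path.
         * Everything else gets a local retry on the same path.
         */
        if (req->cmd_flags & (REQ_NVME_MPATH | REQ_FAILFAST_TRANSPORT)) {
                if (req->status_is_path_error || req->queue_dying)
                        return (req->cmd_flags & REQ_NVME_MPATH) ?
                                FAILOVER : FAILUP;
        } else {
                if (req->queue_dying)
                        return COMPLETE;
        }
        return RETRY;
}

int main(void)
{
        static const char *names[] = { "COMPLETE", "RETRY", "FAILOVER", "FAILUP" };
        struct fake_req dm_req = {
                .cmd_flags = REQ_FAILFAST_TRANSPORT,
                .status = 1,                    /* any non-zero NVMe status */
                .status_is_path_error = true,
                .max_retries = 5,
        };
        struct fake_req native_req = dm_req;

        native_req.cmd_flags = REQ_NVME_MPATH;

        printf("dm-multipath request     -> %s\n", names[decide(&dm_req)]);
        printf("native multipath request -> %s\n", names[decide(&native_req)]);
        return 0;
}

[Built with any C compiler, the two sample requests print FAILUP for the
dm-multipath (REQ_FAILFAST_TRANSPORT) case and FAILOVER for the native
(REQ_NVME_MPATH) case, which is the point of the change: the same path
error is either handled inside NVMe or handed back to the upper layer.]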
On Fri, Apr 16 2021 at 8:02pm -0400, Mike Snitzer <snitzer@redhat.com> wrote:

> On Fri, Apr 16 2021 at 7:53pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
> > Hi,
> >
> > This patchset reflects changes needed to make NVMe error handling and
> > ANA state updates work well with dm-multipath (which always sets
> > REQ_FAILFAST_TRANSPORT).
[ .. ]
> While I'm replying, I _think_ there is consensus that patch 1 is
> worthwhile and acceptable. Here is a combined diff of patches 2+3 to
> illustrate just how minimalist these proposed changes are:
[ .. ]

Since I sent these changes late on Friday I figured I'd try to get them
on your radar. I would really appreciate timely review of this series
for 5.13:

https://patchwork.kernel.org/project/dm-devel/list/?series=468905

Thanks,
Mike
> RHEL9 is coming, would really prefer that these changes land upstream
> rather than carry them within RHEL.

We've told you from the very beginning that dm-multipath on nvme is not
a supported configuration. Red Hat decided to ignore that and live with
the pain.

Your major version change is a chance to fix this up on the Red Hat
side, not to resubmit bogus patches upstream.

In other words: please get your house in order NOW.
On Tue, Apr 20 2021 at 5:37am -0400, Christoph Hellwig <hch@lst.de> wrote:

> > RHEL9 is coming, would really prefer that these changes land upstream
> > rather than carry them within RHEL.
>
> We've told you from the very beginning that dm-multipath on nvme is not
> a supported configuration.

You have some high quality revisionist history there. But other than
pointing that out I'm not going to dwell on our past discussions on how
NVMe multipathing would be.

> Red Hat decided to ignore that and live with the pain.

Red Hat supports both native nvme-multipath _and_ DM-multipath on NVMe.

The only "pain" I've been living with is trying to get you to be
impartial and allow others to provide Linux multipathing as they see
fit.

> Your major version change is a chance to fix this up on the Red Hat
> side, not to resubmit bogus patches upstream.

Please spare me the vapid and baseless assertion about patches you
refuse to review technically, without political motivation.

> In other words: please get your house in order NOW.

My simple 3 patch submission was an attempt to do so. Reality is the
Linux NVMe maintainers need to get their collective house in order.

Until sanity prevails these NVMe changes will be carried in RHEL. And if
you go out of your way to cause trivial, or elaborate, conflicts now
that you _know_ the changes that are being carried, it will be handled
without issue.

Sad this is where we are, but it is what it is.

Linux is about choice that is founded upon need. Hostile action that
unilaterally limits choice is antithetical to Linux and Open Source.

Mike
On Tue, 2021-04-20 at 10:38 -0400, Mike Snitzer wrote:
[ .. ]

Hello

Let me add some reasons why, primarily as a support person, this is
important, and try to avoid another combative situation.

Customers depend on managing device-mapper-multipath the way it is now,
even with the advent of nvme-over-F/C. Years of administration and
management for multiple Enterprise O/S vendor customers (Suse/Red Hat,
Oracle) all depend on managing multipath access in a transparent way.

I respect everybody's point of view here, but native does change log
alerting and recovery, and that is what will take time for customers to
adopt.

It is going to take time for Enterprise customers to transition, so all
we want is an option for them. At some point they will move to native,
but we always like to keep in step with upstream as much as possible.

Of course we could live with RHEL-only for a while, but that defeats our
intention to be as close to upstream as possible.

If we could have this accepted upstream for now, perhaps when customers
are ready to move to native only we could phase this out.

Any technical reason why this would not fly is of course important to
consider, but perhaps for now we have a parallel option until we don't.

With due respect to all concerned,
Laurence Oberman
On 4/20/21 5:46 PM, Laurence Oberman wrote:
[ .. ]
> It is going to take time for Enterprise customers to transition, so all
> we want is an option for them. At some point they will move to native,
> but we always like to keep in step with upstream as much as possible.
[ .. ]
> Any technical reason why this would not fly is of course important to
> consider, but perhaps for now we have a parallel option until we don't.
>
Curiously, we (as in we as SLES developers) have found just the opposite.

NVMe is a new technology, and out of necessity there will not be any
existing installations that we have to be compatible with. We have
switched to native NVMe multipathing with SLE15, and decided to educate
customers that NVMe is a different concept than SCSI, and one shouldn't
try to treat both the same way. This was helped by the fact that SLE15
is a new release, so customers were accustomed to having to change bits
and pieces in their infrastructure to support new releases.

Overall it worked reasonably well; we sure found plenty of bugs, but
that was kind of expected, and for better or worse nearly all of them
turned out to be upstream issues. Which was good for us (nothing beats
being able to blame things on upstream, if one is careful not to linger
too much on the fact that one is part of upstream); and upstream these
things will need to be fixed anyway.

So we had a bit of a mixed experience, but customers seemed to be happy
enough with this step.

Sorry about that :-)

Cheers,
Hannes
On Sat, May 01 2021 at 7:58am -0400, Hannes Reinecke <hare@suse.de> wrote:

> On 4/20/21 5:46 PM, Laurence Oberman wrote:
> [ .. ]
>
> Curiously, we (as in we as SLES developers) have found just the opposite.
>
> NVMe is a new technology, and out of necessity there will not be any
> existing installations that we have to be compatible with. We have
> switched to native NVMe multipathing with SLE15, and decided to educate
> customers that NVMe is a different concept than SCSI, and one shouldn't
> try to treat both the same way.

As you well know: dm-multipath was first engineered to handle SCSI, and
it was too tightly coupled at the start, but the scsi_dh interface
provided sorely missing abstraction. With NVMe, dm-multipath was further
enhanced to not do work only needed for SCSI.

Seems SUSE has forgotten that dm-multipath has taken strides to properly
abstract away SCSI-specific details; at least this patchset forgot it
(because proper layering/abstraction is too hard? that mantra is what
gave native NVMe multipath life, BTW):
https://patchwork.kernel.org/project/dm-devel/list/?series=475271

Long story short, there is utility in dm-multipath being transport
agnostic, with specialized protocol-specific bits properly abstracted.
If you or others don't care about any of that anymore, that's fine! But
it doesn't mean others don't. Thankfully both can exist perfectly fine;
sadly that clearly isn't possible without absurd tribal fighting (at
least in the context of NVMe).

And to be clear, Hannes: your quick review of this patchset couldn't
have been less helpful or informed. Yet it enabled the NVMe maintainers
to ignore technical review (you gave them cover). The lack of proper
technical review of this patchset was astonishing, but hch's
dysfunctional attack that took its place really _should_ concern others.
Seems it doesn't; must be nice to not have a dog in the fight other than
philosophical ideals that enable turning a blind eye.

> This was helped by the fact that SLE15 is a new release, so customers
> were accustomed to having to change bits and pieces in their
> infrastructure to support new releases.

Sounds like you either have very few customers and/or they don't use
layers that were engineered with dm-multipath as an integral layer in
the IO stack. That's fine, but that doesn't prove anything other than
your limited experience.

> Overall it worked reasonably well; we sure found plenty of bugs, but
> that was kind of expected, and for better or worse nearly all of them
> turned out to be upstream issues. Which was good for us (nothing beats
> being able to blame things on upstream, if one is careful not to linger
> too much on the fact that one is part of upstream); and upstream these
> things will need to be fixed anyway.
>
> So we had a bit of a mixed experience, but customers seemed to be happy
> enough with this step.
>
> Sorry about that :-)

Nothing to be sorry about; good on you and the others at SUSE
engineering for improving native NVMe multipathing. Red Hat supports it
too, so your and others' efforts are appreciated there.

Mike