Message ID | E463DF2B2E584B4A82673F53D62C2EF466A0E4DA@cosmail01.lsi.com (mailing list archive) |
---|---|
State | Deferred, archived |
Hi Babu,

On 2009/04/21 3:05 +0900, Moger, Babu wrote:
> This patch introduces the mechanism to recover from I/O failures by
> re-initializing the path if the device is running on only one path.
>
> Problem: Device mapper fails the path for every I/O error. It does not
> care about the type of error. There are certain errors which can be
> recovered by re-initializing the path again. I have seen this problem
> during my testing on rdac device handler. I have observed I/O errors
> when there is a change in Lun ownership. When Lun ownership changes
> device will return back with check condition with
> sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed).
> Currently, device mapper fails the path for this error and eventually
> this will lead to I/O error. We don't want to see I/O error for this reason.

Shouldn't we handle this type of device error inside device handler?

> The patch will set the flag pg_init_required if the device is running
> on single path. The process_queued_ios will re-initialize path if required.
> I have tested this patch on LSI rdac handler.
>
> Signed-off-by: Babu Moger <babu.moger@lsi.com>
> ---
>
> --- linux-2.6.30-rc2/drivers/md/dm-mpath.c.orig 2009-04-17 16:49:33.000000000 -0500
> +++ linux-2.6.30-rc2/drivers/md/dm-mpath.c 2009-04-17 17:09:51.000000000 -0500
> @@ -1152,6 +1152,15 @@ static int do_end_io(struct multipath *m
>          return error;
>
>      spin_lock_irqsave(&m->lock, flags);
> +    /*
> +     * If this is the only path left, then lets try to
> +     * re-initialize the PG one last time..
> +     */
> +    if (m->nr_valid_paths == 1 && m->hw_handler_name) {
> +        m->pg_init_required = 1;
> +        spin_unlock_irqrestore(&m->lock, flags);
> +        goto requeue;
> +    }
>      if (!m->nr_valid_paths) {
>          if (__must_push_back(m)) {
>              spin_unlock_irqrestore(&m->lock, flags);

What happens in case of a real I/O error (e.g. I/O to a broken sector)?
Is it correctly handled and returned to upper layer at last?
I'm asking that because the change looks dm retries such errors forever.
Or am I missing anything?

Thanks,
Kiyoshi Ueda

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Hi Kiyoshi,

Thanks for your comment.

> -----Original Message-----
> From: Kiyoshi Ueda [mailto:k-ueda@ct.jp.nec.com]
> Sent: Monday, April 20, 2009 8:07 PM
> To: Moger, Babu
> Cc: 'dm-devel@redhat.com'; linux-scsi@vger.kernel.org; Chauhan, Vijay; 'sekharan@us.ibm.com'
> Subject: Re: [PATCH] dm mpath: Try recover from I/O failure by re-initializing the PG if device is running on one path
>
> Hi Babu,
>
> On 2009/04/21 3:05 +0900, Moger, Babu wrote:
> > This patch introduces the mechanism to recover from I/O failures by
> > re-initializing the path if the device is running on only one path.
> >
> > Problem: Device mapper fails the path for every I/O error. It does not
> > care about the type of error. There are certain errors which can be
> > recovered by re-initializing the path again. I have seen this problem
> > during my testing on rdac device handler. I have observed I/O errors
> > when there is a change in Lun ownership. When Lun ownership changes
> > device will return back with check condition with
> > sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed).
> > Currently, device mapper fails the path for this error and eventually
> > this will lead to I/O error. We don't want to see I/O error for this reason.
>
> Shouldn't we handle this type of device error inside device handler?

The current error in question requires re-activation of the path.
We already have a code to handle this scenario in device handler.
But, the problem is the return status does not go to DM layer.
The return status gets lost in scsi layer. For DM layer all the errors
are -EIO. Any thoughts from your side.

> > The patch will set the flag pg_init_required if the device is running
> > on single path. The process_queued_ios will re-initialize path if required.
> > I have tested this patch on LSI rdac handler.
> >
> > Signed-off-by: Babu Moger <babu.moger@lsi.com>
> > ---
> >
> > --- linux-2.6.30-rc2/drivers/md/dm-mpath.c.orig 2009-04-17 16:49:33.000000000 -0500
> > +++ linux-2.6.30-rc2/drivers/md/dm-mpath.c 2009-04-17 17:09:51.000000000 -0500
> > @@ -1152,6 +1152,15 @@ static int do_end_io(struct multipath *m
> >          return error;
> >
> >      spin_lock_irqsave(&m->lock, flags);
> > +    /*
> > +     * If this is the only path left, then lets try to
> > +     * re-initialize the PG one last time..
> > +     */
> > +    if (m->nr_valid_paths == 1 && m->hw_handler_name) {
> > +        m->pg_init_required = 1;
> > +        spin_unlock_irqrestore(&m->lock, flags);
> > +        goto requeue;
> > +    }
> >      if (!m->nr_valid_paths) {
> >          if (__must_push_back(m)) {
> >              spin_unlock_irqrestore(&m->lock, flags);
>
> What happens in case of a real I/O error (e.g. I/O to a broken sector)?
> Is it correctly handled and returned to upper layer at last?
> I'm asking that because the change looks dm retries such errors forever.
> Or am I missing anything?

Yes, you are right. There are chances of that happening.
I will investigate and get back.

> Thanks,
> Kiyoshi Ueda
Hi Babu,

On 2009/04/22 2:06 +0900, Moger, Babu wrote:
> Hi Kiyoshi,
>
> Thanks for your comment.
>
>
>> -----Original Message-----
>> From: Kiyoshi Ueda [mailto:k-ueda@ct.jp.nec.com]
>> Sent: Monday, April 20, 2009 8:07 PM
>> To: Moger, Babu
>> Cc: 'dm-devel@redhat.com'; linux-scsi@vger.kernel.org; Chauhan, Vijay; 'sekharan@us.ibm.com'
>> Subject: Re: [PATCH] dm mpath: Try recover from I/O failure by re-initializing the PG if device is running on one path
>>
>> Hi Babu,
>>
>> On 2009/04/21 3:05 +0900, Moger, Babu wrote:
>>> This patch introduces the mechanism to recover from I/O failures by
>>> re-initializing the path if the device is running on only one path.
>>>
>>> Problem: Device mapper fails the path for every I/O error. It does not
>>> care about the type of error. There are certain errors which can be
>>> recovered by re-initializing the path again. I have seen this problem
>>> during my testing on rdac device handler. I have observed I/O errors
>>> when there is a change in Lun ownership. When Lun ownership changes
>>> device will return back with check condition with
>>> sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed).
>>> Currently, device mapper fails the path for this error and eventually
>>> this will lead to I/O error. We don't want to see I/O error for this reason.
>>
>> Shouldn't we handle this type of device error inside device handler?
>
> The current error in question requires re-activation of the path.
> We already have a code to handle this scenario in device handler.
> But, the problem is the return status does not go to DM layer.
> The return status gets lost in scsi layer. For DM layer all the errors
> are -EIO. Any thoughts from your side.

Oh, I missed the point and I thought that re-activating the path
in your device handler was enough for the error.
Currently, I have no idea to handle your case only in dm without
seeing I/O error.

By the way, who did change the ownership when the device was running
with one path in your testing? I can't see why such case happened.

Thanks,
Kiyoshi Ueda
Hi Kiyoshi,

> >>
> >> Hi Babu,
> >>
> >> On 2009/04/21 3:05 +0900, Moger, Babu wrote:
> >>> This patch introduces the mechanism to recover from I/O failures by
> >>> re-initializing the path if the device is running on only one path.
> >>>
> >>> Problem: Device mapper fails the path for every I/O error. It does not
> >>> care about the type of error. There are certain errors which can be
> >>> recovered by re-initializing the path again. I have seen this problem
> >>> during my testing on rdac device handler. I have observed I/O errors
> >>> when there is a change in Lun ownership. When Lun ownership changes
> >>> device will return back with check condition with
> >>> sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed).
> >>> Currently, device mapper fails the path for this error and eventually
> >>> this will lead to I/O error. We don't want to see I/O error for this reason.
> >>
> >> Shouldn't we handle this type of device error inside device handler?
> >
> > The current error in question requires re-activation of the path.
> > We already have a code to handle this scenario in device handler.
> > But, the problem is the return status does not go to DM layer.
> > The return status gets lost in scsi layer. For DM layer all the errors
> > are -EIO. Any thoughts from your side.
>
> Oh, I missed the point and I thought that re-activating the path
> in your device handler was enough for the error.
> Currently, I have no idea to handle your case only in dm without
> seeing I/O error.
>
I have discussed about re-activating path in device handler with Chandra. Looks like that will lead to other issues (one is long boot up). Looks like that is not an option.

> By the way, who did change the ownership when the device was running
> with one path in your testing? I can't see why such case happened.
>
This can happen if the user knowingly or unknowingly changes the ownership. Also we have other feature in our storage which will allow user to redistribute the luns. Thanks for your comment.

> Thanks,
> Kiyoshi Ueda
You mean to say that we do not require this patch/fix?

On Wed, 2009-04-22 at 08:03 -0600, Moger, Babu wrote:
> Hi Kiyoshi,
>
> > >>
> > >> Hi Babu,
> > >>
> > >> On 2009/04/21 3:05 +0900, Moger, Babu wrote:
> > >>> This patch introduces the mechanism to recover from I/O failures by
> > >>> re-initializing the path if the device is running on only one path.
> > >>>
> > >>> Problem: Device mapper fails the path for every I/O error. It does not
> > >>> care about the type of error. There are certain errors which can be
> > >>> recovered by re-initializing the path again. I have seen this problem
> > >>> during my testing on rdac device handler. I have observed I/O errors
> > >>> when there is a change in Lun ownership. When Lun ownership changes
> > >>> device will return back with check condition with
> > >>> sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed).
> > >>> Currently, device mapper fails the path for this error and eventually
> > >>> this will lead to I/O error. We don't want to see I/O error for this reason.
> > >>
> > >> Shouldn't we handle this type of device error inside device handler?
> > >
> > > The current error in question requires re-activation of the path.
> > > We already have a code to handle this scenario in device handler.
> > > But, the problem is the return status does not go to DM layer.
> > > The return status gets lost in scsi layer. For DM layer all the errors
> > > are -EIO. Any thoughts from your side.
> >
> > Oh, I missed the point and I thought that re-activating the path
> > in your device handler was enough for the error.
> > Currently, I have no idea to handle your case only in dm without
> > seeing I/O error.
> >
> I have discussed about re-activating path in device handler with Chandra. Looks like that will lead to other issues (one is long boot up). Looks like that is not an option.
> > By the way, who did change the ownership when the device was running
> > with one path in your testing? I can't see why such case happened.
> >
> This can happen if the user knowingly or unknowingly changes the ownership. Also we have other feature in our storage which will allow user to redistribute the luns. Thanks for your comment.
>
> > Thanks,
> > Kiyoshi Ueda
On Mon, Apr 20, 2009 at 11:05 AM, Moger, Babu <Babu.Moger@lsi.com> wrote:
> This patch introduces the mechanism to recover from I/O failures by re-initializing the path if the device is running on only one path.
>
> Problem: Device mapper fails the path for every I/O error.
> It does not care about the type of error.

This is the fundamental problem. Different layers of the block IO
path have to agree on how to handle each possible type of error that
can be returned. I don't know where to find such an agreement and
think an implementation that does discriminate is needed.

> There are certain errors which can be recovered by re-initializing the path again. I have seen this problem during my testing on rdac device handler. I have observed I/O errors when there is a change in Lun ownership. When Lun ownership changes device will return back with check condition with sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed). Currently, device mapper fails the path for this error and eventually this will lead to I/O error. We don't want to see I/O error for this reason.

1) This patch isn't discriminating between transport, media, or other
device errors. Wouldn't it make sense to discriminate?
"LUN ownership changed" sounds like some of the events possible in
multi-initiator environment would want to be notified about and
perhaps even take some action (renegotiate access to

2) Will this result in resetting a SATA device?
I ask because device reset may result in data loss due to WCE enabled.
I just don't know the higher parts of the block SW stack and how
errors flow up the stack.

thanks,
grant

>
> The patch will set the flag pg_init_required if the device is running on single path. The process_queued_ios will re-initialize path if required. I have tested this patch on LSI rdac handler.
>
> Signed-off-by: Babu Moger <babu.moger@lsi.com>
> ---
>
> --- linux-2.6.30-rc2/drivers/md/dm-mpath.c.orig 2009-04-17 16:49:33.000000000 -0500
> +++ linux-2.6.30-rc2/drivers/md/dm-mpath.c 2009-04-17 17:09:51.000000000 -0500
> @@ -1152,6 +1152,15 @@ static int do_end_io(struct multipath *m
>          return error;
>
>      spin_lock_irqsave(&m->lock, flags);
> +    /*
> +     * If this is the only path left, then lets try to
> +     * re-initialize the PG one last time..
> +     */
> +    if (m->nr_valid_paths == 1 && m->hw_handler_name) {
> +        m->pg_init_required = 1;
> +        spin_unlock_irqrestore(&m->lock, flags);
> +        goto requeue;
> +    }
>      if (!m->nr_valid_paths) {
>          if (__must_push_back(m)) {
>              spin_unlock_irqrestore(&m->lock, flags);
> -----Original Message-----
> From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> Sent: Wednesday, April 22, 2009 12:33 PM
> To: Moger, Babu
> Cc: Kiyoshi Ueda; 'dm-devel@redhat.com'; linux-scsi@vger.kernel.org; Chauhan, Vijay
> Subject: RE: [PATCH] dm mpath: Try recover from I/O failure by re-initializing the PG if device is running on one path
>
> You mean to say that we do not require this patch/fix?

No, I did not mean that. What I meant was that re-activating the path
from the device handler (rdac_check_sense) is not an option. We have to
find other ways to deal with it.

A Lun ownership change can happen if the user knowingly or unknowingly
changes the ownership. Our redistribute feature can also change the Lun
ownership.

> On Wed, 2009-04-22 at 08:03 -0600, Moger, Babu wrote:
> > Hi Kiyoshi,
> >
> > > >>
> > > >> Hi Babu,
> > > >>
> > > >> On 2009/04/21 3:05 +0900, Moger, Babu wrote:
> > > >>> This patch introduces the mechanism to recover from I/O failures by
> > > >>> re-initializing the path if the device is running on only one path.
> > > >>>
> > > >>> Problem: Device mapper fails the path for every I/O error. It does not
> > > >>> care about the type of error. There are certain errors which can be
> > > >>> recovered by re-initializing the path again. I have seen this problem
> > > >>> during my testing on rdac device handler. I have observed I/O errors
> > > >>> when there is a change in Lun ownership. When Lun ownership changes
> > > >>> device will return back with check condition with
> > > >>> sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed).
> > > >>> Currently, device mapper fails the path for this error and eventually
> > > >>> this will lead to I/O error. We don't want to see I/O error for this reason.
> > > >>
> > > >> Shouldn't we handle this type of device error inside device handler?
> > > >
> > > > The current error in question requires re-activation of the path.
> > > > We already have a code to handle this scenario in device handler.
> > > > But, the problem is the return status does not go to DM layer.
> > > > The return status gets lost in scsi layer. For DM layer all the errors
> > > > are -EIO. Any thoughts from your side.
> > >
> > > Oh, I missed the point and I thought that re-activating the path
> > > in your device handler was enough for the error.
> > > Currently, I have no idea to handle your case only in dm without
> > > seeing I/O error.
> > >
> > I have discussed about re-activating path in device handler with Chandra. Looks like that will lead to other issues (one is long boot up). Looks like that is not an option.
> > > By the way, who did change the ownership when the device was running
> > > with one path in your testing? I can't see why such case happened.
> > >
> > This can happen if the user knowingly or unknowingly changes the ownership. Also we have other feature in our storage which will allow user to redistribute the luns. Thanks for your comment.
> >
> > > Thanks,
> > > Kiyoshi Ueda
> > Problem: Device mapper fails the path for every I/O error.
> > It does not care about the type of error.
>
> This is the fundamental problem. Different layers of the block IO
> path have to agree on how to handle each possible type of error that
> can be returned. I don't know where to find such an agreement and
> think an implementation that does discriminate is needed.
>
> > There are certain errors which can be recovered by re-initializing the path again. I have seen this problem during my testing on rdac device handler. I have observed I/O errors when there is a change in Lun ownership. When Lun ownership changes device will return back with check condition with sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed). Currently, device mapper fails the path for this error and eventually this will lead to I/O error. We don't want to see I/O error for this reason.
>
> 1) This patch isn't discriminating between transport, media, or other
> device errors. Wouldn't it make sense to discriminate?
> "LUN ownership changed" sounds like some of the events possible in
> multi-initiator environment would want to be notified about and
> perhaps even take some action (renegotiate access to

We cannot discriminate between errors in DM, because error-specific
information is not available to DM.

> 2) Will this result in resetting a SATA device?
> I ask because device reset may result in data loss due to WCE enabled.
> I just don't know the higher parts of the block SW stack and how
> errors flow up the stack.

I am not sure about this. I don't know whether re-activating the device
(i.e. calling activate_path) causes a reset.
On Wed, 2009-04-22 at 10:41 -0700, Grant Grundler wrote:
> On Mon, Apr 20, 2009 at 11:05 AM, Moger, Babu <Babu.Moger@lsi.com> wrote:
> > This patch introduces the mechanism to recover from I/O failures by re-initializing the path if the device is running on only one path.
> >
> > Problem: Device mapper fails the path for every I/O error.
> > It does not care about the type of error.
>
> This is the fundamental problem. Different layers of the block IO
> path have to agree on how to handle each possible type of error that
> can be returned. I don't know where to find such an agreement and
> think an implementation that does discriminate is needed.
>
> > There are certain errors which can be recovered by re-initializing the path again. I have seen this problem during my testing on rdac device handler. I have observed I/O errors when there is a change in Lun ownership. When Lun ownership changes device will return back with check condition with sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed). Currently, device mapper fails the path for this error and eventually this will lead to I/O error. We don't want to see I/O error for this reason.
>
> 1) This patch isn't discriminating between transport, media, or other
> device errors. Wouldn't it make sense to discriminate?

Yes, it would. But currently we do not have it.

> "LUN ownership changed" sounds like some of the events possible in
> multi-initiator environment would want to be notified about and
> perhaps even take some action (renegotiate access to
>
> 2) Will this result in resetting a SATA device?
> I ask because device reset may result in data loss due to WCE enabled.
> I just don't know the higher parts of the block SW stack and how
> errors flow up the stack.

The device is not hung; the I/O will come back after a while.

BTW, activate doesn't do a reset; it just sends a command to the
controller (in the LSI rdac case, a mode select).

>
> thanks,
> grant
>
> >
> > The patch will set the flag pg_init_required if the device is running on single path. The process_queued_ios will re-initialize path if required. I have tested this patch on LSI rdac handler.
> >
> > Signed-off-by: Babu Moger <babu.moger@lsi.com>
> > ---
> >
> > --- linux-2.6.30-rc2/drivers/md/dm-mpath.c.orig 2009-04-17 16:49:33.000000000 -0500
> > +++ linux-2.6.30-rc2/drivers/md/dm-mpath.c 2009-04-17 17:09:51.000000000 -0500
> > @@ -1152,6 +1152,15 @@ static int do_end_io(struct multipath *m
> >          return error;
> >
> >      spin_lock_irqsave(&m->lock, flags);
> > +    /*
> > +     * If this is the only path left, then lets try to
> > +     * re-initialize the PG one last time..
> > +     */
> > +    if (m->nr_valid_paths == 1 && m->hw_handler_name) {
> > +        m->pg_init_required = 1;
> > +        spin_unlock_irqrestore(&m->lock, flags);
> > +        goto requeue;
> > +    }
> >      if (!m->nr_valid_paths) {
> >          if (__must_push_back(m)) {
> >              spin_unlock_irqrestore(&m->lock, flags);
--- linux-2.6.30-rc2/drivers/md/dm-mpath.c.orig 2009-04-17 16:49:33.000000000 -0500
+++ linux-2.6.30-rc2/drivers/md/dm-mpath.c 2009-04-17 17:09:51.000000000 -0500
@@ -1152,6 +1152,15 @@ static int do_end_io(struct multipath *m
         return error;
 
     spin_lock_irqsave(&m->lock, flags);
+    /*
+     * If this is the only path left, then lets try to
+     * re-initialize the PG one last time..
+     */
+    if (m->nr_valid_paths == 1 && m->hw_handler_name) {
+        m->pg_init_required = 1;
+        spin_unlock_irqrestore(&m->lock, flags);
+        goto requeue;
+    }
     if (!m->nr_valid_paths) {
         if (__must_push_back(m)) {
             spin_unlock_irqrestore(&m->lock, flags);
This patch introduces the mechanism to recover from I/O failures by re-initializing the path if the device is running on only one path.

Problem: Device mapper fails the path for every I/O error. It does not care about the type of error. There are certain errors which can be recovered by re-initializing the path again. I have seen this problem during my testing on rdac device handler. I have observed I/O errors when there is a change in Lun ownership. When Lun ownership changes device will return back with check condition with sense 0x05/0x94/0x01(SK/ASC/ASCQ -meaning Lun ownership changed). Currently, device mapper fails the path for this error and eventually this will lead to I/O error. We don't want to see I/O error for this reason.

The patch will set the flag pg_init_required if the device is running on single path. The process_queued_ios will re-initialize path if required. I have tested this patch on LSI rdac handler.

Signed-off-by: Babu Moger <babu.moger@lsi.com>
---