Message ID | 20190227171442.11853-6-igor.j.konopko@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | lightnvm: bugfixes and improvements | expand |
> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote: > > Currently when unknown error occurs on read path > there is only dmesg information about it, but it > is not counted in sysfs statistics. Since this is > still an error we should also count it there. > > Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> > --- > drivers/lightnvm/pblk-core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c > index eabcbc119681..a98b2255f963 100644 > --- a/drivers/lightnvm/pblk-core.c > +++ b/drivers/lightnvm/pblk-core.c > @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd) > atomic_long_inc(&pblk->read_failed); > break; > default: > + atomic_long_inc(&pblk->read_failed); > pblk_err(pblk, "unknown read error:%d\n", rqd->error); > } > #ifdef CONFIG_NVM_PBLK_DEBUG > -- > 2.17.1 I left this out intentionally so that we could correlate the logs from the controller and the errors in the read path. Since we do not have an standard way to correlate this on SMART yet, let’s add this now (I assume that you are using it for something?) and we can separate the error stats in the future. Reviewed-by: Javier González <javier@javigon.com>
Igor: Have you seen this happening in real life? I think it would be better to count all expected errors and put them in the right bucket (without spamming dmesg). If we need a new bucket for i.e. vendor-specific-errors, let's do that instead. Someone wiser than me told me that every error print in the log is a potential customer call. Javier: Yeah, I think S.M.A.R.T is the way to deliver this information. Why can't we let the drives expose this info and remove this from pblk? What's blocking that? Thanks, Hans On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote: > > > On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote: > > > > Currently when unknown error occurs on read path > > there is only dmesg information about it, but it > > is not counted in sysfs statistics. Since this is > > still an error we should also count it there. > > > > Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> > > --- > > drivers/lightnvm/pblk-core.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c > > index eabcbc119681..a98b2255f963 100644 > > --- a/drivers/lightnvm/pblk-core.c > > +++ b/drivers/lightnvm/pblk-core.c > > @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd) > > atomic_long_inc(&pblk->read_failed); > > break; > > default: > > + atomic_long_inc(&pblk->read_failed); > > pblk_err(pblk, "unknown read error:%d\n", rqd->error); > > } > > #ifdef CONFIG_NVM_PBLK_DEBUG > > -- > > 2.17.1 > > I left this out intentionally so that we could correlate the logs from > the controller and the errors in the read path. Since we do not have an > standard way to correlate this on SMART yet, let’s add this now (I > assume that you are using it for something?) and we can separate the > error stats in the future. > > Reviewed-by: Javier González <javier@javigon.com>
> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote: > > Igor: Have you seen this happening in real life? > > I think it would be better to count all expected errors and put them > in the right bucket (without spamming dmesg). If we need a new bucket > for i.e. vendor-specific-errors, let's do that instead. > > Someone wiser than me told me that every error print in the log is a > potential customer call. > > Javier: Yeah, I think S.M.A.R.T is the way to deliver this > information. Why can't we let the drives expose this info and remove > this from pblk? What's blocking that? Until now the spec. We added some new log information in Denali exactly for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to have it here, at least for debugging. > > Thanks, > Hans > > On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote: >>> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote: >>> >>> Currently when unknown error occurs on read path >>> there is only dmesg information about it, but it >>> is not counted in sysfs statistics. Since this is >>> still an error we should also count it there. >>> >>> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> >>> --- >>> drivers/lightnvm/pblk-core.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c >>> index eabcbc119681..a98b2255f963 100644 >>> --- a/drivers/lightnvm/pblk-core.c >>> +++ b/drivers/lightnvm/pblk-core.c >>> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd) >>> atomic_long_inc(&pblk->read_failed); >>> break; >>> default: >>> + atomic_long_inc(&pblk->read_failed); >>> pblk_err(pblk, "unknown read error:%d\n", rqd->error); >>> } >>> #ifdef CONFIG_NVM_PBLK_DEBUG >>> -- >>> 2.17.1 >> >> I left this out intentionally so that we could correlate the logs from >> the controller and the errors in the read path. Since we do not have an >> standard way to correlate this on SMART yet, let’s add this now (I >> assume that you are using it for something?) and we can separate the >> error stats in the future. >> >> Reviewed-by: Javier González <javier@javigon.com>
On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote: > > > On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote: > > > > Igor: Have you seen this happening in real life? > > > > I think it would be better to count all expected errors and put them > > in the right bucket (without spamming dmesg). If we need a new bucket > > for i.e. vendor-specific-errors, let's do that instead. > > > > Someone wiser than me told me that every error print in the log is a > > potential customer call. > > > > Javier: Yeah, I think S.M.A.R.T is the way to deliver this > > information. Why can't we let the drives expose this info and remove > > this from pblk? What's blocking that? > > Until now the spec. We added some new log information in Denali exactly > for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to > have it here, at least for debugging. Why add it to the spec? Why not use whatever everyone else is using? https://en.wikipedia.org/wiki/S.M.A.R.T. : "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its primary function is to detect and report various indicators of drive reliability with the intent of anticipating imminent hardware failures." Sounds like what we want here. For debugging, a trace point or something(i.e. BPF) would be a better solution that would not impact hot-path performance. > > > > > Thanks, > > Hans > > > > On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote: > >>> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote: > >>> > >>> Currently when unknown error occurs on read path > >>> there is only dmesg information about it, but it > >>> is not counted in sysfs statistics. Since this is > >>> still an error we should also count it there. > >>> > >>> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> > >>> --- > >>> drivers/lightnvm/pblk-core.c | 1 + > >>> 1 file changed, 1 insertion(+) > >>> > >>> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c > >>> index eabcbc119681..a98b2255f963 100644 > >>> --- a/drivers/lightnvm/pblk-core.c > >>> +++ b/drivers/lightnvm/pblk-core.c > >>> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd) > >>> atomic_long_inc(&pblk->read_failed); > >>> break; > >>> default: > >>> + atomic_long_inc(&pblk->read_failed); > >>> pblk_err(pblk, "unknown read error:%d\n", rqd->error); > >>> } > >>> #ifdef CONFIG_NVM_PBLK_DEBUG > >>> -- > >>> 2.17.1 > >> > >> I left this out intentionally so that we could correlate the logs from > >> the controller and the errors in the read path. Since we do not have an > >> standard way to correlate this on SMART yet, let’s add this now (I > >> assume that you are using it for something?) and we can separate the > >> error stats in the future. > >> > >> Reviewed-by: Javier González <javier@javigon.com>
> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote: > > On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote: >>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote: >>> >>> Igor: Have you seen this happening in real life? >>> >>> I think it would be better to count all expected errors and put them >>> in the right bucket (without spamming dmesg). If we need a new bucket >>> for i.e. vendor-specific-errors, let's do that instead. >>> >>> Someone wiser than me told me that every error print in the log is a >>> potential customer call. >>> >>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this >>> information. Why can't we let the drives expose this info and remove >>> this from pblk? What's blocking that? >> >> Until now the spec. We added some new log information in Denali exactly >> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to >> have it here, at least for debugging. > > Why add it to the spec? Why not use whatever everyone else is using? > > https://en.wikipedia.org/wiki/S.M.A.R.T. : > "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often > written as SMART) is a monitoring system included in computer hard > disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its > primary function is to detect and report various indicators of drive > reliability with the intent of anticipating imminent hardware > failures." > Sounds like what we want here. I know what smart is… You need to define the fields. Maybe you want to read Denali again - the extensions are couple with smart. > For debugging, a trace point or something(i.e. BPF) would be a better > solution that would not impact hot-path performance. Cool. Look forward to the patches ;) Javier
On 04.03.2019 12:45, Javier González wrote: > >> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote: >> >> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote: >>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote: >>>> >>>> Igor: Have you seen this happening in real life? >>>> >>>> I think it would be better to count all expected errors and put them >>>> in the right bucket (without spamming dmesg). If we need a new bucket >>>> for i.e. vendor-specific-errors, let's do that instead. Generally I'm seeing different types of errors (which are typically as Javier mention controller errors) in cases such as hot drive removal, etc. We can skip that patch, since this are kind of corner cases. I can also create new type of pblk stats, sth. like "controller errors", which would collect all the other unexpected errors in one place instead of mixing them with real read/write errors as I did. >>>> >>>> Someone wiser than me told me that every error print in the log is a >>>> potential customer call. >>>> >>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this >>>> information. Why can't we let the drives expose this info and remove >>>> this from pblk? What's blocking that? >>> >>> Until now the spec. We added some new log information in Denali exactly >>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to >>> have it here, at least for debugging. >> >> Why add it to the spec? Why not use whatever everyone else is using? >> >> https://en.wikipedia.org/wiki/S.M.A.R.T. : >> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often >> written as SMART) is a monitoring system included in computer hard >> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its >> primary function is to detect and report various indicators of drive >> reliability with the intent of anticipating imminent hardware >> failures." >> Sounds like what we want here. > > I know what smart is… You need to define the fields. Maybe you want to > read Denali again - the extensions are couple with smart. > >> For debugging, a trace point or something(i.e. BPF) would be a better >> solution that would not impact hot-path performance. > > Cool. Look forward to the patches ;) > > Javier >
On Mon, Mar 4, 2019 at 1:42 PM Igor Konopko <igor.j.konopko@intel.com> wrote: > > > > On 04.03.2019 12:45, Javier González wrote: > > > >> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote: > >> > >> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote: > >>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote: > >>>> > >>>> Igor: Have you seen this happening in real life? > >>>> > >>>> I think it would be better to count all expected errors and put them > >>>> in the right bucket (without spamming dmesg). If we need a new bucket > >>>> for i.e. vendor-specific-errors, let's do that instead. > > Generally I'm seeing different types of errors (which are typically as > Javier mention controller errors) in cases such as hot drive removal, etc. > > We can skip that patch, since this are kind of corner cases. I can also > create new type of pblk stats, sth. like "controller errors", which > would collect all the other unexpected errors in one place instead of > mixing them with real read/write errors as I did. Yes, that makes sense. Thanks, Hans > > >>>> > >>>> Someone wiser than me told me that every error print in the log is a > >>>> potential customer call. > >>>> > >>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this > >>>> information. Why can't we let the drives expose this info and remove > >>>> this from pblk? What's blocking that? > >>> > >>> Until now the spec. We added some new log information in Denali exactly > >>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to > >>> have it here, at least for debugging. > >> > >> Why add it to the spec? Why not use whatever everyone else is using? > >> > >> https://en.wikipedia.org/wiki/S.M.A.R.T. : > >> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often > >> written as SMART) is a monitoring system included in computer hard > >> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its > >> primary function is to detect and report various indicators of drive > >> reliability with the intent of anticipating imminent hardware > >> failures." > >> Sounds like what we want here. > > > > I know what smart is… You need to define the fields. Maybe you want to > > read Denali again - the extensions are couple with smart. > > > >> For debugging, a trace point or something(i.e. BPF) would be a better > >> solution that would not impact hot-path performance. > > > > Cool. Look forward to the patches ;) > > > > Javier > >
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index eabcbc119681..a98b2255f963 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd) atomic_long_inc(&pblk->read_failed); break; default: + atomic_long_inc(&pblk->read_failed); pblk_err(pblk, "unknown read error:%d\n", rqd->error); } #ifdef CONFIG_NVM_PBLK_DEBUG
Currently when unknown error occurs on read path there is only dmesg information about it, but it is not counted in sysfs statistics. Since this is still an error we should also count it there. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> --- drivers/lightnvm/pblk-core.c | 1 + 1 file changed, 1 insertion(+)