[05/13] lightnvm: pblk: Count all read errors in stats

Message ID	20190227171442.11853-6-igor.j.konopko@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Igor Konopko <igor.j.konopko@intel.com> To: mb@lightnvm.io, javier@javigon.com, hans.holmberg@cnexlabs.com Cc: linux-block@vger.kernel.org, igor.j.konopko@intel.com Subject: [PATCH 05/13] lightnvm: pblk: Count all read errors in stats Date: Wed, 27 Feb 2019 18:14:34 +0100 Message-Id: <20190227171442.11853-6-igor.j.konopko@intel.com> In-Reply-To: <20190227171442.11853-1-igor.j.konopko@intel.com> References: <20190227171442.11853-1-igor.j.konopko@intel.com> Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	lightnvm: bugfixes and improvements \| expand [00/13] lightnvm: bugfixes and improvements [01/13] lightnvm: pblk: Line reference fix in GC [02/13] lightnvm: pblk: Gracefully handle GC data malloc fail [03/13] lightnvm: pblk: Fix put line back behaviour [04/13] lightnvm: pblk: Rollback in gc read [05/13] lightnvm: pblk: Count all read errors in stats [06/13] lightnvm: pblk: Ensure that erase is chunk aligned [07/13] lightnvm: pblk: Cleanly fail when there is not enough memory [08/13] lightnvm: pblk: Set proper read stutus in bio [09/13] lightnvm: pblk: Kick writer for flush requests [10/13] lightnvm: pblk: Reduce L2P DRAM footprint [11/13] lightnvm: pblk: Remove unused smeta_ssec field [12/13] lightnvm: pblk: close opened chunks [13/13] lightnvm: Inherit mdts from the parent nvme device

Igor Konopko Feb. 27, 2019, 5:14 p.m. UTC

Currently when unknown error occurs on read path
there is only dmesg information about it, but it
is not counted in sysfs statistics. Since this is
still an error we should also count it there.

Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
---
 drivers/lightnvm/pblk-core.c | 1 +
 1 file changed, 1 insertion(+)

Javier González March 4, 2019, 7:42 a.m. UTC | #1

> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
> 
> Currently when unknown error occurs on read path
> there is only dmesg information about it, but it
> is not counted in sysfs statistics. Since this is
> still an error we should also count it there.
> 
> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
> ---
> drivers/lightnvm/pblk-core.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> index eabcbc119681..a98b2255f963 100644
> --- a/drivers/lightnvm/pblk-core.c
> +++ b/drivers/lightnvm/pblk-core.c
> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
> 		atomic_long_inc(&pblk->read_failed);
> 		break;
> 	default:
> +		atomic_long_inc(&pblk->read_failed);
> 		pblk_err(pblk, "unknown read error:%d\n", rqd->error);
> 	}
> #ifdef CONFIG_NVM_PBLK_DEBUG
> --
> 2.17.1

I left this out intentionally  so that we could correlate the logs from
the controller and the errors in the read path. Since we do not have an
standard way to correlate this on SMART yet, let’s add this now (I
assume that you are using it for something?) and we can separate the
error stats in the future.

Reviewed-by: Javier González <javier@javigon.com>

Hans Holmberg March 4, 2019, 9:02 a.m. UTC | #2

Igor: Have you seen this happening in real life?

I think it would be better to count all expected errors and put them
in the right bucket (without spamming dmesg). If we need a new bucket
for i.e. vendor-specific-errors, let's do that instead.

Someone wiser than me told me that every error print in the log is a
potential customer call.

Javier: Yeah, I think S.M.A.R.T is the way to deliver this
information. Why can't we let the drives expose this info and remove
this from pblk? What's blocking that?

Thanks,
Hans

On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote:
>
> > On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
> >
> > Currently when unknown error occurs on read path
> > there is only dmesg information about it, but it
> > is not counted in sysfs statistics. Since this is
> > still an error we should also count it there.
> >
> > Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
> > ---
> > drivers/lightnvm/pblk-core.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> > index eabcbc119681..a98b2255f963 100644
> > --- a/drivers/lightnvm/pblk-core.c
> > +++ b/drivers/lightnvm/pblk-core.c
> > @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
> >               atomic_long_inc(&pblk->read_failed);
> >               break;
> >       default:
> > +             atomic_long_inc(&pblk->read_failed);
> >               pblk_err(pblk, "unknown read error:%d\n", rqd->error);
> >       }
> > #ifdef CONFIG_NVM_PBLK_DEBUG
> > --
> > 2.17.1
>
> I left this out intentionally  so that we could correlate the logs from
> the controller and the errors in the read path. Since we do not have an
> standard way to correlate this on SMART yet, let’s add this now (I
> assume that you are using it for something?) and we can separate the
> error stats in the future.
>
> Reviewed-by: Javier González <javier@javigon.com>

Javier González March 4, 2019, 9:23 a.m. UTC | #3

> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
> 
> Igor: Have you seen this happening in real life?
> 
> I think it would be better to count all expected errors and put them
> in the right bucket (without spamming dmesg). If we need a new bucket
> for i.e. vendor-specific-errors, let's do that instead.
> 
> Someone wiser than me told me that every error print in the log is a
> potential customer call.
> 
> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
> information. Why can't we let the drives expose this info and remove
> this from pblk? What's blocking that?

Until now the spec. We added some new log information in Denali exactly
for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
have it here, at least for debugging.

> 
> Thanks,
> Hans
> 
> On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote:
>>> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
>>> 
>>> Currently when unknown error occurs on read path
>>> there is only dmesg information about it, but it
>>> is not counted in sysfs statistics. Since this is
>>> still an error we should also count it there.
>>> 
>>> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
>>> ---
>>> drivers/lightnvm/pblk-core.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>> 
>>> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
>>> index eabcbc119681..a98b2255f963 100644
>>> --- a/drivers/lightnvm/pblk-core.c
>>> +++ b/drivers/lightnvm/pblk-core.c
>>> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
>>>              atomic_long_inc(&pblk->read_failed);
>>>              break;
>>>      default:
>>> +             atomic_long_inc(&pblk->read_failed);
>>>              pblk_err(pblk, "unknown read error:%d\n", rqd->error);
>>>      }
>>> #ifdef CONFIG_NVM_PBLK_DEBUG
>>> --
>>> 2.17.1
>> 
>> I left this out intentionally  so that we could correlate the logs from
>> the controller and the errors in the read path. Since we do not have an
>> standard way to correlate this on SMART yet, let’s add this now (I
>> assume that you are using it for something?) and we can separate the
>> error stats in the future.
>> 
>> Reviewed-by: Javier González <javier@javigon.com>

Hans Holmberg March 4, 2019, 11:41 a.m. UTC | #4

On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
>
> > On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
> >
> > Igor: Have you seen this happening in real life?
> >
> > I think it would be better to count all expected errors and put them
> > in the right bucket (without spamming dmesg). If we need a new bucket
> > for i.e. vendor-specific-errors, let's do that instead.
> >
> > Someone wiser than me told me that every error print in the log is a
> > potential customer call.
> >
> > Javier: Yeah, I think S.M.A.R.T is the way to deliver this
> > information. Why can't we let the drives expose this info and remove
> > this from pblk? What's blocking that?
>
> Until now the spec. We added some new log information in Denali exactly
> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
> have it here, at least for debugging.

Why add it to the spec? Why not use whatever everyone else is using?

https://en.wikipedia.org/wiki/S.M.A.R.T. :
"S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
written as SMART) is a monitoring system included in computer hard
disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
primary function is to detect and report various indicators of drive
reliability with the intent of anticipating imminent hardware
failures."
Sounds like what we want here.

For debugging, a trace point or something(i.e. BPF) would be a better
solution that would not impact hot-path performance.

>
> >
> > Thanks,
> > Hans
> >
> > On Mon, Mar 4, 2019 at 8:42 AM Javier González <javier@javigon.com> wrote:
> >>> On 27 Feb 2019, at 18.14, Igor Konopko <igor.j.konopko@intel.com> wrote:
> >>>
> >>> Currently when unknown error occurs on read path
> >>> there is only dmesg information about it, but it
> >>> is not counted in sysfs statistics. Since this is
> >>> still an error we should also count it there.
> >>>
> >>> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
> >>> ---
> >>> drivers/lightnvm/pblk-core.c | 1 +
> >>> 1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> >>> index eabcbc119681..a98b2255f963 100644
> >>> --- a/drivers/lightnvm/pblk-core.c
> >>> +++ b/drivers/lightnvm/pblk-core.c
> >>> @@ -493,6 +493,7 @@ void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd)
> >>>              atomic_long_inc(&pblk->read_failed);
> >>>              break;
> >>>      default:
> >>> +             atomic_long_inc(&pblk->read_failed);
> >>>              pblk_err(pblk, "unknown read error:%d\n", rqd->error);
> >>>      }
> >>> #ifdef CONFIG_NVM_PBLK_DEBUG
> >>> --
> >>> 2.17.1
> >>
> >> I left this out intentionally  so that we could correlate the logs from
> >> the controller and the errors in the read path. Since we do not have an
> >> standard way to correlate this on SMART yet, let’s add this now (I
> >> assume that you are using it for something?) and we can separate the
> >> error stats in the future.
> >>
> >> Reviewed-by: Javier González <javier@javigon.com>

Javier González March 4, 2019, 11:45 a.m. UTC | #5

> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote:
> 
> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
>>> 
>>> Igor: Have you seen this happening in real life?
>>> 
>>> I think it would be better to count all expected errors and put them
>>> in the right bucket (without spamming dmesg). If we need a new bucket
>>> for i.e. vendor-specific-errors, let's do that instead.
>>> 
>>> Someone wiser than me told me that every error print in the log is a
>>> potential customer call.
>>> 
>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
>>> information. Why can't we let the drives expose this info and remove
>>> this from pblk? What's blocking that?
>> 
>> Until now the spec. We added some new log information in Denali exactly
>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
>> have it here, at least for debugging.
> 
> Why add it to the spec? Why not use whatever everyone else is using?
> 
> https://en.wikipedia.org/wiki/S.M.A.R.T. :
> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
> written as SMART) is a monitoring system included in computer hard
> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
> primary function is to detect and report various indicators of drive
> reliability with the intent of anticipating imminent hardware
> failures."
> Sounds like what we want here.

I know what smart is… You need to define the fields. Maybe you want to
read Denali again - the extensions are couple with smart.

> For debugging, a trace point or something(i.e. BPF) would be a better
> solution that would not impact hot-path performance.

Cool. Look forward to the patches ;)

Javier

Igor Konopko March 4, 2019, 12:42 p.m. UTC | #6

On 04.03.2019 12:45, Javier González wrote:
> 
>> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote:
>>
>> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
>>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
>>>>
>>>> Igor: Have you seen this happening in real life?
>>>>
>>>> I think it would be better to count all expected errors and put them
>>>> in the right bucket (without spamming dmesg). If we need a new bucket
>>>> for i.e. vendor-specific-errors, let's do that instead.

Generally I'm seeing different types of errors (which are typically as 
Javier mention controller errors) in cases such as hot drive removal, etc.

We can skip that patch, since this are kind of corner cases. I can also
create new type of pblk stats, sth. like "controller errors", which 
would collect all the other unexpected errors in one place instead of 
mixing them with real read/write errors as I did.

>>>>
>>>> Someone wiser than me told me that every error print in the log is a
>>>> potential customer call.
>>>>
>>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
>>>> information. Why can't we let the drives expose this info and remove
>>>> this from pblk? What's blocking that?
>>>
>>> Until now the spec. We added some new log information in Denali exactly
>>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
>>> have it here, at least for debugging.
>>
>> Why add it to the spec? Why not use whatever everyone else is using?
>>
>> https://en.wikipedia.org/wiki/S.M.A.R.T. :
>> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
>> written as SMART) is a monitoring system included in computer hard
>> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
>> primary function is to detect and report various indicators of drive
>> reliability with the intent of anticipating imminent hardware
>> failures."
>> Sounds like what we want here.
> 
> I know what smart is… You need to define the fields. Maybe you want to
> read Denali again - the extensions are couple with smart.
> 
>> For debugging, a trace point or something(i.e. BPF) would be a better
>> solution that would not impact hot-path performance.
> 
> Cool. Look forward to the patches ;)
> 
> Javier
>

Hans Holmberg March 4, 2019, 12:48 p.m. UTC | #7

On Mon, Mar 4, 2019 at 1:42 PM Igor Konopko <igor.j.konopko@intel.com> wrote:
>
>
>
> On 04.03.2019 12:45, Javier González wrote:
> >
> >> On 4 Mar 2019, at 12.41, Hans Holmberg <hans@owltronix.com> wrote:
> >>
> >> On Mon, Mar 4, 2019 at 10:23 AM Javier González <javier@javigon.com> wrote:
> >>>> On 4 Mar 2019, at 10.02, Hans Holmberg <hans.ml.holmberg@owltronix.com> wrote:
> >>>>
> >>>> Igor: Have you seen this happening in real life?
> >>>>
> >>>> I think it would be better to count all expected errors and put them
> >>>> in the right bucket (without spamming dmesg). If we need a new bucket
> >>>> for i.e. vendor-specific-errors, let's do that instead.
>
> Generally I'm seeing different types of errors (which are typically as
> Javier mention controller errors) in cases such as hot drive removal, etc.
>
> We can skip that patch, since this are kind of corner cases. I can also
> create new type of pblk stats, sth. like "controller errors", which
> would collect all the other unexpected errors in one place instead of
> mixing them with real read/write errors as I did.

Yes, that makes sense.

Thanks,
Hans

>
> >>>>
> >>>> Someone wiser than me told me that every error print in the log is a
> >>>> potential customer call.
> >>>>
> >>>> Javier: Yeah, I think S.M.A.R.T is the way to deliver this
> >>>> information. Why can't we let the drives expose this info and remove
> >>>> this from pblk? What's blocking that?
> >>>
> >>> Until now the spec. We added some new log information in Denali exactly
> >>> for this. But since pblk supports OCSSD 1.2 and 2.0 I think it is needed to
> >>> have it here, at least for debugging.
> >>
> >> Why add it to the spec? Why not use whatever everyone else is using?
> >>
> >> https://en.wikipedia.org/wiki/S.M.A.R.T. :
> >> "S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often
> >> written as SMART) is a monitoring system included in computer hard
> >> disk drives (HDDs), solid-state drives (SSDs),[1] and eMMC drives. Its
> >> primary function is to detect and report various indicators of drive
> >> reliability with the intent of anticipating imminent hardware
> >> failures."
> >> Sounds like what we want here.
> >
> > I know what smart is… You need to define the fields. Maybe you want to
> > read Denali again - the extensions are couple with smart.
> >
> >> For debugging, a trace point or something(i.e. BPF) would be a better
> >> solution that would not impact hot-path performance.
> >
> > Cool. Look forward to the patches ;)
> >
> > Javier
> >

[05/13] lightnvm: pblk: Count all read errors in stats

Commit Message

Comments

Patch