Message ID | 1456059126-25469-1-git-send-email-mpa@pengutronix.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Markus, On Sun, 21 Feb 2016 13:52:06 +0100 Markus Pargmann <mpa@pengutronix.de> wrote: > ECC is only calculated for written pages. As erased pages are not > actively written the ECC is always invalid. For this purpose the > Hardware BCH unit is able to check for erased pages and does not raise > an ECC error in this case. This behaviour can be influenced using the > BCH_MODE register which sets the number of allowed bitflips in an erased > page. Unfortunately the unit is not capable of fixing the bitflips in > memory. > > To avoid complete software checks for erased pages, we can simply check > buffers with uncorrectable ECC errors because we know that any erased > page with errors is uncorrectable by the BCH unit. > > This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand > to correct erased pages. To have the valid data in the buffer before > using them, this patch moves the read_page_swap_end() call before the > ECC status checking for-loop. > > Signed-off-by: Markus Pargmann <mpa@pengutronix.de> > --- > drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++---- > 1 file changed, 44 insertions(+), 5 deletions(-) > > diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > index 235ddcb58f39..ce5a21252102 100644 > --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > @@ -1035,14 +1035,58 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip, > /* Loop over status bytes, accumulating ECC status. */ > status = auxiliary_virt + nfc_geo->auxiliary_status_offset; > > + read_page_swap_end(this, buf, nfc_geo->payload_size, > + this->payload_virt, this->payload_phys, > + nfc_geo->payload_size, > + payload_virt, payload_phys); > + > for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) { > if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED)) > continue; > > if (*status == STATUS_UNCORRECTABLE) { > + int flips; > + > + /* > + * The ECC hardware has an uncorrectable ECC status > + * code in case we have bitflips in an erased page. As > + * nothing was written into this subpage the ECC is > + * obviously wrong and we can not trust it. We assume > + * at this point that we are reading an erased page and > + * try to correct the bitflips in buffer up to > + * ecc_strength bitflips. If this is a page with random > + * data, we exceed this number of bitflips and have a > + * ECC failure. Otherwise we use the corrected buffer. > + */ > + if (i == 0) { > + /* The first block includes metadata */ > + flips = nand_check_erased_ecc_chunk( > + buf + i * nfc_geo->ecc_chunk_size, > + nfc_geo->ecc_chunk_size, > + NULL, 0, > + auxiliary_virt, > + nfc_geo->metadata_size, > + nfc_geo->ecc_strength); > + } else { > + flips = nand_check_erased_ecc_chunk( > + buf + i * nfc_geo->ecc_chunk_size, > + nfc_geo->ecc_chunk_size, > + NULL, 0, > + NULL, 0, > + nfc_geo->ecc_strength); > + } You're not checking ECC bytes. I know it's a bit more complicated to implement, but as Brian stated several times, you should always check ECC bytes along with data bytes when testing for an erased chunk. Indeed, it might appear that the user really programmed ff on the page, and in this case you don't want to consider the page as erased. In this case, the ECC bytes will be different from ff, and you'll be able to identify it by checking those ECC bytes. Best Regards, Boris
Hi Boris, On Wednesday 24 February 2016 14:57:50 Boris Brezillon wrote: > Hi Markus, > > On Sun, 21 Feb 2016 13:52:06 +0100 > Markus Pargmann <mpa@pengutronix.de> wrote: > > > ECC is only calculated for written pages. As erased pages are not > > actively written the ECC is always invalid. For this purpose the > > Hardware BCH unit is able to check for erased pages and does not raise > > an ECC error in this case. This behaviour can be influenced using the > > BCH_MODE register which sets the number of allowed bitflips in an erased > > page. Unfortunately the unit is not capable of fixing the bitflips in > > memory. > > > > To avoid complete software checks for erased pages, we can simply check > > buffers with uncorrectable ECC errors because we know that any erased > > page with errors is uncorrectable by the BCH unit. > > > > This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand > > to correct erased pages. To have the valid data in the buffer before > > using them, this patch moves the read_page_swap_end() call before the > > ECC status checking for-loop. > > > > Signed-off-by: Markus Pargmann <mpa@pengutronix.de> > > --- > > drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++---- > > 1 file changed, 44 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > > index 235ddcb58f39..ce5a21252102 100644 > > --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > > +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > > @@ -1035,14 +1035,58 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip, > > /* Loop over status bytes, accumulating ECC status. */ > > status = auxiliary_virt + nfc_geo->auxiliary_status_offset; > > > > + read_page_swap_end(this, buf, nfc_geo->payload_size, > > + this->payload_virt, this->payload_phys, > > + nfc_geo->payload_size, > > + payload_virt, payload_phys); > > + > > for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) { > > if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED)) > > continue; > > > > if (*status == STATUS_UNCORRECTABLE) { > > + int flips; > > + > > + /* > > + * The ECC hardware has an uncorrectable ECC status > > + * code in case we have bitflips in an erased page. As > > + * nothing was written into this subpage the ECC is > > + * obviously wrong and we can not trust it. We assume > > + * at this point that we are reading an erased page and > > + * try to correct the bitflips in buffer up to > > + * ecc_strength bitflips. If this is a page with random > > + * data, we exceed this number of bitflips and have a > > + * ECC failure. Otherwise we use the corrected buffer. > > + */ > > + if (i == 0) { > > + /* The first block includes metadata */ > > + flips = nand_check_erased_ecc_chunk( > > + buf + i * nfc_geo->ecc_chunk_size, > > + nfc_geo->ecc_chunk_size, > > + NULL, 0, > > + auxiliary_virt, > > + nfc_geo->metadata_size, > > + nfc_geo->ecc_strength); > > + } else { > > + flips = nand_check_erased_ecc_chunk( > > + buf + i * nfc_geo->ecc_chunk_size, > > + nfc_geo->ecc_chunk_size, > > + NULL, 0, > > + NULL, 0, > > + nfc_geo->ecc_strength); > > + } > > You're not checking ECC bytes. I know it's a bit more complicated to > implement, but as Brian stated several times, you should always check > ECC bytes along with data bytes when testing for an erased chunk. > > Indeed, it might appear that the user really programmed ff on the page, > and in this case you don't want to consider the page as erased. > In this case, the ECC bytes will be different from ff, and you'll > be able to identify it by checking those ECC bytes. Thanks for the feedback. Talking with a coworker about this we may have found a better approach to this that is less complicated to implement. The hardware unit allows us to set a bitflip threshold for erased pages. The ECC unit creates an ECC error only if the number of bitflips exceeds this threshold, but it does not correct these. So the idea is to change the patch so that we set pages, that are signaled by the ECC as erased, to 0xff completely without checking. So the ECC will do all the work and we completely trust in its abilities to do it correctly. I will send a modified patch as soon as I have some time for this. Best Regards, Markus
On Mon, 11 Apr 2016 08:34:35 +0200 Markus Pargmann <mpa@pengutronix.de> wrote: > Hi Boris, > > On Wednesday 24 February 2016 14:57:50 Boris Brezillon wrote: > > Hi Markus, > > > > On Sun, 21 Feb 2016 13:52:06 +0100 > > Markus Pargmann <mpa@pengutronix.de> wrote: > > > > > ECC is only calculated for written pages. As erased pages are not > > > actively written the ECC is always invalid. For this purpose the > > > Hardware BCH unit is able to check for erased pages and does not raise > > > an ECC error in this case. This behaviour can be influenced using the > > > BCH_MODE register which sets the number of allowed bitflips in an erased > > > page. Unfortunately the unit is not capable of fixing the bitflips in > > > memory. > > > > > > To avoid complete software checks for erased pages, we can simply check > > > buffers with uncorrectable ECC errors because we know that any erased > > > page with errors is uncorrectable by the BCH unit. > > > > > > This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand > > > to correct erased pages. To have the valid data in the buffer before > > > using them, this patch moves the read_page_swap_end() call before the > > > ECC status checking for-loop. > > > > > > Signed-off-by: Markus Pargmann <mpa@pengutronix.de> > > > --- > > > drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++---- > > > 1 file changed, 44 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > > > index 235ddcb58f39..ce5a21252102 100644 > > > --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > > > +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c > > > @@ -1035,14 +1035,58 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip, > > > /* Loop over status bytes, accumulating ECC status. */ > > > status = auxiliary_virt + nfc_geo->auxiliary_status_offset; > > > > > > + read_page_swap_end(this, buf, nfc_geo->payload_size, > > > + this->payload_virt, this->payload_phys, > > > + nfc_geo->payload_size, > > > + payload_virt, payload_phys); > > > + > > > for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) { > > > if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED)) > > > continue; > > > > > > if (*status == STATUS_UNCORRECTABLE) { > > > + int flips; > > > + > > > + /* > > > + * The ECC hardware has an uncorrectable ECC status > > > + * code in case we have bitflips in an erased page. As > > > + * nothing was written into this subpage the ECC is > > > + * obviously wrong and we can not trust it. We assume > > > + * at this point that we are reading an erased page and > > > + * try to correct the bitflips in buffer up to > > > + * ecc_strength bitflips. If this is a page with random > > > + * data, we exceed this number of bitflips and have a > > > + * ECC failure. Otherwise we use the corrected buffer. > > > + */ > > > + if (i == 0) { > > > + /* The first block includes metadata */ > > > + flips = nand_check_erased_ecc_chunk( > > > + buf + i * nfc_geo->ecc_chunk_size, > > > + nfc_geo->ecc_chunk_size, > > > + NULL, 0, > > > + auxiliary_virt, > > > + nfc_geo->metadata_size, > > > + nfc_geo->ecc_strength); > > > + } else { > > > + flips = nand_check_erased_ecc_chunk( > > > + buf + i * nfc_geo->ecc_chunk_size, > > > + nfc_geo->ecc_chunk_size, > > > + NULL, 0, > > > + NULL, 0, > > > + nfc_geo->ecc_strength); > > > + } > > > > You're not checking ECC bytes. I know it's a bit more complicated to > > implement, but as Brian stated several times, you should always check > > ECC bytes along with data bytes when testing for an erased chunk. > > > > Indeed, it might appear that the user really programmed ff on the page, > > and in this case you don't want to consider the page as erased. > > In this case, the ECC bytes will be different from ff, and you'll > > be able to identify it by checking those ECC bytes. > > Thanks for the feedback. Talking with a coworker about this we may have found a > better approach to this that is less complicated to implement. The hardware > unit allows us to set a bitflip threshold for erased pages. The ECC unit > creates an ECC error only if the number of bitflips exceeds this threshold, but > it does not correct these. So the idea is to change the patch so that we set > pages, that are signaled by the ECC as erased, to 0xff completely without > checking. So the ECC will do all the work and we completely trust in its > abilities to do it correctly. Sounds good.
On Tue, 12 Apr 2016 22:39:08 +0000 Han Xu <han.xu@nxp.com> wrote: > > Thanks for the feedback. Talking with a coworker about this we may have found a > > better approach to this that is less complicated to implement. The hardware > > unit allows us to set a bitflip threshold for erased pages. The ECC unit > > creates an ECC error only if the number of bitflips exceeds this threshold, but > > it does not correct these. So the idea is to change the patch so that we set > > pages, that are signaled by the ECC as erased, to 0xff completely without > > checking. So the ECC will do all the work and we completely trust in its > > abilities to do it correctly. > > Sounds good. > > some new platforms with new gpmi controller could check the count of 0 bits in page, > refer to my patch https://patchwork.ozlabs.org/patch/587124/ > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and > only check the uncorrectable branch and then correct data sounds better. Setting threshold > and correcting all erased page may highly impact the performance. Indeed, bitflips in erased pages is not so common, and penalizing the likely case (erased pages without any bitflips) doesn't look like a good idea in the end. You can still implement this check in software. You can have a look at nand_check_erased_ecc_chunk() [1] if you need an example, but you'll have to adapt it because your controller does not guarantees that ECC bits for a given chunk are byte aligned :-/ [1]http://lxr.free-electrons.com/source/drivers/mtd/nand/nand_base.c#L1200
On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote: > On Tue, 12 Apr 2016 22:39:08 +0000 > Han Xu <han.xu@nxp.com> wrote: > > > > Thanks for the feedback. Talking with a coworker about this we may have found a > > > better approach to this that is less complicated to implement. The hardware > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit > > > creates an ECC error only if the number of bitflips exceeds this threshold, but > > > it does not correct these. So the idea is to change the patch so that we set > > > pages, that are signaled by the ECC as erased, to 0xff completely without > > > checking. So the ECC will do all the work and we completely trust in its > > > abilities to do it correctly. > > > > Sounds good. > > > > some new platforms with new gpmi controller could check the count of 0 bits in page, > > refer to my patch https://patchwork.ozlabs.org/patch/587124/ > > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and > > only check the uncorrectable branch and then correct data sounds better. Setting threshold > > and correcting all erased page may highly impact the performance. > > Indeed, bitflips in erased pages is not so common, and penalizing the > likely case (erased pages without any bitflips) doesn't look like a good > idea in the end. Are erased pages really read that often? I am not sure how UBI handles this, does it read every page before writing? > > You can still implement this check in software. You can have a look at > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll > have to adapt it because your controller does not guarantees that ECC > bits for a given chunk are byte aligned :-/ Yes I used this function in the patch. The issue is that I am not quite sure yet where to find the raw ECC data (without rereading the page). The reference manual is not extremely clear about that, ecc data may be in the 'auxilliary data' but I am not sure that it really is available somewhere. Best Regards, Markus
Hi Markus, On Fri, 15 Apr 2016 09:55:45 +0200 Markus Pargmann <mpa@pengutronix.de> wrote: > On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote: > > On Tue, 12 Apr 2016 22:39:08 +0000 > > Han Xu <han.xu@nxp.com> wrote: > > > > > > Thanks for the feedback. Talking with a coworker about this we may have found a > > > > better approach to this that is less complicated to implement. The hardware > > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit > > > > creates an ECC error only if the number of bitflips exceeds this threshold, but > > > > it does not correct these. So the idea is to change the patch so that we set > > > > pages, that are signaled by the ECC as erased, to 0xff completely without > > > > checking. So the ECC will do all the work and we completely trust in its > > > > abilities to do it correctly. > > > > > > Sounds good. > > > > > > some new platforms with new gpmi controller could check the count of 0 bits in page, > > > refer to my patch https://patchwork.ozlabs.org/patch/587124/ > > > > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and > > > only check the uncorrectable branch and then correct data sounds better. Setting threshold > > > and correcting all erased page may highly impact the performance. > > > > Indeed, bitflips in erased pages is not so common, and penalizing the > > likely case (erased pages without any bitflips) doesn't look like a good > > idea in the end. > > Are erased pages really read that often? Yes, it's not unusual to have those "empty pages?" checks (added Artem and Richard to get a confirmation). AFAIR, UBIFS check for empty pages in its journal heads after an unclean unmount (which happens quite often) to make sure there's no corruption. > I am not sure how UBI handles > this, does it read every page before writing? Nope, or maybe it does when you activate some extra checks. > > > > > You can still implement this check in software. You can have a look at > > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll > > have to adapt it because your controller does not guarantees that ECC > > bits for a given chunk are byte aligned :-/ > > Yes I used this function in the patch. The issue is that I am not quite > sure yet where to find the raw ECC data (without rereading the page). > The reference manual is not extremely clear about that, ecc data may be > in the 'auxilliary data' but I am not sure that it really is available > somewhere. AFAIR (and I'm not sure since it was a long time ago), you don't have direct access to ECC bytes with the GPMI engine. If that's the case, you'll have to read the ECC bytes manually (moving the page pointer using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with this engine, because ECC bytes are not guaranteed to be byte aligned (see gpmi ->read_page_raw() implementation). Once you've retrieved ECC bytes (or bits in this case), for each ECC chunk, you can use the nand_check_erased_ecc_chunk() function (just make sure you're padding the last ECC byte of each chunk with ones so that bitflips cannot be reported on this section). Best Regards, Boris
Hi Boris, On Friday 15 April 2016 10:35:08 Boris Brezillon wrote: > Hi Markus, > > On Fri, 15 Apr 2016 09:55:45 +0200 > Markus Pargmann <mpa@pengutronix.de> wrote: > > > On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote: > > > On Tue, 12 Apr 2016 22:39:08 +0000 > > > Han Xu <han.xu@nxp.com> wrote: > > > > > > > > Thanks for the feedback. Talking with a coworker about this we may have found a > > > > > better approach to this that is less complicated to implement. The hardware > > > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit > > > > > creates an ECC error only if the number of bitflips exceeds this threshold, but > > > > > it does not correct these. So the idea is to change the patch so that we set > > > > > pages, that are signaled by the ECC as erased, to 0xff completely without > > > > > checking. So the ECC will do all the work and we completely trust in its > > > > > abilities to do it correctly. > > > > > > > > Sounds good. > > > > > > > > some new platforms with new gpmi controller could check the count of 0 bits in page, > > > > refer to my patch https://patchwork.ozlabs.org/patch/587124/ > > > > > > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and > > > > only check the uncorrectable branch and then correct data sounds better. Setting threshold > > > > and correcting all erased page may highly impact the performance. > > > > > > Indeed, bitflips in erased pages is not so common, and penalizing the > > > likely case (erased pages without any bitflips) doesn't look like a good > > > idea in the end. > > > > Are erased pages really read that often? > > Yes, it's not unusual to have those "empty pages?" checks (added Artem > and Richard to get a confirmation). AFAIR, UBIFS check for empty pages > in its journal heads after an unclean unmount (which happens quite > often) to make sure there's no corruption. > > > I am not sure how UBI handles > > this, does it read every page before writing? > > Nope, or maybe it does when you activate some extra checks. > > > > > > > > > You can still implement this check in software. You can have a look at > > > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll > > > have to adapt it because your controller does not guarantees that ECC > > > bits for a given chunk are byte aligned :-/ > > > > Yes I used this function in the patch. The issue is that I am not quite > > sure yet where to find the raw ECC data (without rereading the page). > > The reference manual is not extremely clear about that, ecc data may be > > in the 'auxilliary data' but I am not sure that it really is available > > somewhere. > > AFAIR (and I'm not sure since it was a long time ago), you don't have > direct access to ECC bytes with the GPMI engine. If that's the case, > you'll have to read the ECC bytes manually (moving the page pointer > using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with > this engine, because ECC bytes are not guaranteed to be byte aligned > (see gpmi ->read_page_raw() implementation). > Once you've retrieved ECC bytes (or bits in this case), for each ECC > chunk, you can use the nand_check_erased_ecc_chunk() function (just make > sure you're padding the last ECC byte of each chunk with ones so that > bitflips cannot be reported on this section). Thanks for the information. So I understand that this approach is the preferred one to avoid any performance issues for normal operation. I actually won't be able to fix this patch accordingly for some time. If anyone else needs this earlier, feel free to implement it. Best Regards, Markus
Hi Markus, On Fri, 15 Apr 2016 11:35:07 +0200 Markus Pargmann <mpa@pengutronix.de> wrote: > Hi Boris, > > On Friday 15 April 2016 10:35:08 Boris Brezillon wrote: > > Hi Markus, > > > > On Fri, 15 Apr 2016 09:55:45 +0200 > > Markus Pargmann <mpa@pengutronix.de> wrote: > > > > > On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote: > > > > On Tue, 12 Apr 2016 22:39:08 +0000 > > > > Han Xu <han.xu@nxp.com> wrote: > > > > > > > > > > Thanks for the feedback. Talking with a coworker about this we may have found a > > > > > > better approach to this that is less complicated to implement. The hardware > > > > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit > > > > > > creates an ECC error only if the number of bitflips exceeds this threshold, but > > > > > > it does not correct these. So the idea is to change the patch so that we set > > > > > > pages, that are signaled by the ECC as erased, to 0xff completely without > > > > > > checking. So the ECC will do all the work and we completely trust in its > > > > > > abilities to do it correctly. > > > > > > > > > > Sounds good. > > > > > > > > > > some new platforms with new gpmi controller could check the count of 0 bits in page, > > > > > refer to my patch https://patchwork.ozlabs.org/patch/587124/ > > > > > > > > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and > > > > > only check the uncorrectable branch and then correct data sounds better. Setting threshold > > > > > and correcting all erased page may highly impact the performance. > > > > > > > > Indeed, bitflips in erased pages is not so common, and penalizing the > > > > likely case (erased pages without any bitflips) doesn't look like a good > > > > idea in the end. > > > > > > Are erased pages really read that often? > > > > Yes, it's not unusual to have those "empty pages?" checks (added Artem > > and Richard to get a confirmation). AFAIR, UBIFS check for empty pages > > in its journal heads after an unclean unmount (which happens quite > > often) to make sure there's no corruption. > > > > > I am not sure how UBI handles > > > this, does it read every page before writing? > > > > Nope, or maybe it does when you activate some extra checks. > > > > > > > > > > > > > You can still implement this check in software. You can have a look at > > > > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll > > > > have to adapt it because your controller does not guarantees that ECC > > > > bits for a given chunk are byte aligned :-/ > > > > > > Yes I used this function in the patch. The issue is that I am not quite > > > sure yet where to find the raw ECC data (without rereading the page). > > > The reference manual is not extremely clear about that, ecc data may be > > > in the 'auxilliary data' but I am not sure that it really is available > > > somewhere. > > > > AFAIR (and I'm not sure since it was a long time ago), you don't have > > direct access to ECC bytes with the GPMI engine. If that's the case, > > you'll have to read the ECC bytes manually (moving the page pointer > > using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with > > this engine, because ECC bytes are not guaranteed to be byte aligned > > (see gpmi ->read_page_raw() implementation). > > Once you've retrieved ECC bytes (or bits in this case), for each ECC > > chunk, you can use the nand_check_erased_ecc_chunk() function (just make > > sure you're padding the last ECC byte of each chunk with ones so that > > bitflips cannot be reported on this section). > > Thanks for the information. So I understand that this approach is the > preferred one to avoid any performance issues for normal operation. > > I actually won't be able to fix this patch accordingly for some time. If > anyone else needs this earlier, feel free to implement it. I just did [1] (it applies on top of your patch), but maybe you can test it (I don't have any imx platforms right now) ;). If these changes work, feel free to squash them into your previous patch. Thanks, Boris [1]http://code.bulix.org/bq6yyh-96549
Hi Boris, On Friday 15 April 2016 11:39:06 Boris Brezillon wrote: > Hi Markus, > > On Fri, 15 Apr 2016 11:35:07 +0200 > Markus Pargmann <mpa@pengutronix.de> wrote: > > > Hi Boris, > > > > On Friday 15 April 2016 10:35:08 Boris Brezillon wrote: > > > Hi Markus, > > > > > > On Fri, 15 Apr 2016 09:55:45 +0200 > > > Markus Pargmann <mpa@pengutronix.de> wrote: > > > > > > > On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote: > > > > > On Tue, 12 Apr 2016 22:39:08 +0000 > > > > > Han Xu <han.xu@nxp.com> wrote: > > > > > > > > > > > > Thanks for the feedback. Talking with a coworker about this we may have found a > > > > > > > better approach to this that is less complicated to implement. The hardware > > > > > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit > > > > > > > creates an ECC error only if the number of bitflips exceeds this threshold, but > > > > > > > it does not correct these. So the idea is to change the patch so that we set > > > > > > > pages, that are signaled by the ECC as erased, to 0xff completely without > > > > > > > checking. So the ECC will do all the work and we completely trust in its > > > > > > > abilities to do it correctly. > > > > > > > > > > > > Sounds good. > > > > > > > > > > > > some new platforms with new gpmi controller could check the count of 0 bits in page, > > > > > > refer to my patch https://patchwork.ozlabs.org/patch/587124/ > > > > > > > > > > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and > > > > > > only check the uncorrectable branch and then correct data sounds better. Setting threshold > > > > > > and correcting all erased page may highly impact the performance. > > > > > > > > > > Indeed, bitflips in erased pages is not so common, and penalizing the > > > > > likely case (erased pages without any bitflips) doesn't look like a good > > > > > idea in the end. > > > > > > > > Are erased pages really read that often? > > > > > > Yes, it's not unusual to have those "empty pages?" checks (added Artem > > > and Richard to get a confirmation). AFAIR, UBIFS check for empty pages > > > in its journal heads after an unclean unmount (which happens quite > > > often) to make sure there's no corruption. > > > > > > > I am not sure how UBI handles > > > > this, does it read every page before writing? > > > > > > Nope, or maybe it does when you activate some extra checks. > > > > > > > > > > > > > > > > > You can still implement this check in software. You can have a look at > > > > > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll > > > > > have to adapt it because your controller does not guarantees that ECC > > > > > bits for a given chunk are byte aligned :-/ > > > > > > > > Yes I used this function in the patch. The issue is that I am not quite > > > > sure yet where to find the raw ECC data (without rereading the page). > > > > The reference manual is not extremely clear about that, ecc data may be > > > > in the 'auxilliary data' but I am not sure that it really is available > > > > somewhere. > > > > > > AFAIR (and I'm not sure since it was a long time ago), you don't have > > > direct access to ECC bytes with the GPMI engine. If that's the case, > > > you'll have to read the ECC bytes manually (moving the page pointer > > > using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with > > > this engine, because ECC bytes are not guaranteed to be byte aligned > > > (see gpmi ->read_page_raw() implementation). > > > Once you've retrieved ECC bytes (or bits in this case), for each ECC > > > chunk, you can use the nand_check_erased_ecc_chunk() function (just make > > > sure you're padding the last ECC byte of each chunk with ones so that > > > bitflips cannot be reported on this section). > > > > Thanks for the information. So I understand that this approach is the > > preferred one to avoid any performance issues for normal operation. > > > > I actually won't be able to fix this patch accordingly for some time. If > > anyone else needs this earlier, feel free to implement it. > > I just did [1] (it applies on top of your patch), but maybe you > can test it (I don't have any imx platforms right now) ;). Great, thank you :). I just tested the patch and it works for me. The erased page bitflips are still detected and fixed. I will send a new version then. Best Regards, Markus > > If these changes work, feel free to squash them into your previous > patch. > > Thanks, > > Boris > > [1]http://code.bulix.org/bq6yyh-96549 > >
On Fri, 15 Apr 2016 15:33:07 +0000 Han Xu <han.xu@nxp.com> wrote: > > > > I just did [1] (it applies on top of your patch), but maybe you > > can test it (I don't have any imx platforms right now) ;). > > Great, thank you :). I just tested the patch and it works for me. The > erased page bitflips are still detected and fixed. I will send a new > version then. > > Hi Markus, > > Could you please share how to verify the patch, in other words, how to reproduce the > UBIFS corruption issue consistently. Thanks. I should really post a new version of the nandflipbits tool [1]. That's clearly the easiest solution to verify that your bitflip correction is reliable. [1]http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html
On Friday 15 April 2016 17:40:31 Boris Brezillon wrote: > On Fri, 15 Apr 2016 15:33:07 +0000 > Han Xu <han.xu@nxp.com> wrote: > > > > > > > I just did [1] (it applies on top of your patch), but maybe you > > > can test it (I don't have any imx platforms right now) ;). > > > > Great, thank you :). I just tested the patch and it works for me. The > > erased page bitflips are still detected and fixed. I will send a new > > version then. > > > > Hi Markus, > > > > Could you please share how to verify the patch, in other words, how to reproduce the > > UBIFS corruption issue consistently. Thanks. I used a simple bashscript with 'nandwrite --noecc' that writes a number of zeroes in every sub-page. For written pages the ECC will handle the flips. For erased pages we can test the algorithm that handles the erased page bitflips. > > I should really post a new version of the nandflipbits tool [1]. > That's clearly the easiest solution to verify that your bitflip > correction is reliable. Seems really useful. Best Regards, Markus > > > [1]http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html > >
diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c index 235ddcb58f39..ce5a21252102 100644 --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c @@ -1035,14 +1035,58 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip, /* Loop over status bytes, accumulating ECC status. */ status = auxiliary_virt + nfc_geo->auxiliary_status_offset; + read_page_swap_end(this, buf, nfc_geo->payload_size, + this->payload_virt, this->payload_phys, + nfc_geo->payload_size, + payload_virt, payload_phys); + for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) { if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED)) continue; if (*status == STATUS_UNCORRECTABLE) { + int flips; + + /* + * The ECC hardware has an uncorrectable ECC status + * code in case we have bitflips in an erased page. As + * nothing was written into this subpage the ECC is + * obviously wrong and we can not trust it. We assume + * at this point that we are reading an erased page and + * try to correct the bitflips in buffer up to + * ecc_strength bitflips. If this is a page with random + * data, we exceed this number of bitflips and have a + * ECC failure. Otherwise we use the corrected buffer. + */ + if (i == 0) { + /* The first block includes metadata */ + flips = nand_check_erased_ecc_chunk( + buf + i * nfc_geo->ecc_chunk_size, + nfc_geo->ecc_chunk_size, + NULL, 0, + auxiliary_virt, + nfc_geo->metadata_size, + nfc_geo->ecc_strength); + } else { + flips = nand_check_erased_ecc_chunk( + buf + i * nfc_geo->ecc_chunk_size, + nfc_geo->ecc_chunk_size, + NULL, 0, + NULL, 0, + nfc_geo->ecc_strength); + } + + if (flips > 0) { + max_bitflips = max_t(unsigned int, max_bitflips, + flips); + mtd->ecc_stats.corrected += flips; + continue; + } + mtd->ecc_stats.failed++; continue; } + mtd->ecc_stats.corrected += *status; max_bitflips = max_t(unsigned int, max_bitflips, *status); } @@ -1062,11 +1106,6 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip, chip->oob_poi[0] = ((uint8_t *) auxiliary_virt)[0]; } - read_page_swap_end(this, buf, nfc_geo->payload_size, - this->payload_virt, this->payload_phys, - nfc_geo->payload_size, - payload_virt, payload_phys); - return max_bitflips; }
ECC is only calculated for written pages. As erased pages are not actively written the ECC is always invalid. For this purpose the Hardware BCH unit is able to check for erased pages and does not raise an ECC error in this case. This behaviour can be influenced using the BCH_MODE register which sets the number of allowed bitflips in an erased page. Unfortunately the unit is not capable of fixing the bitflips in memory. To avoid complete software checks for erased pages, we can simply check buffers with uncorrectable ECC errors because we know that any erased page with errors is uncorrectable by the BCH unit. This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand to correct erased pages. To have the valid data in the buffer before using them, this patch moves the read_page_swap_end() call before the ECC status checking for-loop. Signed-off-by: Markus Pargmann <mpa@pengutronix.de> --- drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++---- 1 file changed, 44 insertions(+), 5 deletions(-)