lightnvm: pblk: fix race condition on GC

Message ID	20190127065421.10662-1-hlitz@ucsc.edu (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Heiner Litz <hlitz@ucsc.edu> To: mb@lightnvm.io Cc: javier@javigon.com, hans.holmberg@cnexlabs.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Heiner Litz <hlitz@ucsc.edu> Subject: [PATCH] lightnvm: pblk: fix race condition on GC Date: Sat, 26 Jan 2019 22:54:21 -0800 Message-Id: <20190127065421.10662-1-hlitz@ucsc.edu> Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	lightnvm: pblk: fix race condition on GC \| expand lightnvm: pblk: fix race condition on GC

Message ID

20190127065421.10662-1-hlitz@ucsc.edu (mailing list archive)

State

New, archived

Headers

From: Heiner Litz <hlitz@ucsc.edu>
To: mb@lightnvm.io
Cc: javier@javigon.com, hans.holmberg@cnexlabs.com,
        linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
        Heiner Litz <hlitz@ucsc.edu>
Subject: [PATCH] lightnvm: pblk: fix race condition on GC
Date: Sat, 26 Jan 2019 22:54:21 -0800
Message-Id: <20190127065421.10662-1-hlitz@ucsc.edu>
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk

Series

lightnvm: pblk: fix race condition on GC | expand

Commit Message

Heiner Litz Jan. 27, 2019, 6:54 a.m. UTC

This patch fixes a race condition where a write is mapped to the last
sectors of a line. The write is synced to the device but the L2P is not
updated yet. When the line is garbage collected before the L2P update is
performed, the sectors are ignored by the GC logic and the line is freed
before all sectors are moved. When the L2P is finally updated, it contains
a mapping to a freed line, subsequent reads of the corresponding LBAs fail.

Note that looking up the L2P and checking the ppa in the write buffer needs
to be performed atomically, hence the refactor of pblk_lookup_l2p_rand.

Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
---
 drivers/lightnvm/pblk-read.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

Comments

Javier González Jan. 29, 2019, 8:13 a.m. UTC | #1

> On 27 Jan 2019, at 07.54, Heiner Litz <hlitz@ucsc.edu> wrote:
> 
> This patch fixes a race condition where a write is mapped to the last
> sectors of a line. The write is synced to the device but the L2P is not
> updated yet. When the line is garbage collected before the L2P update is
> performed, the sectors are ignored by the GC logic and the line is freed
> before all sectors are moved. When the L2P is finally updated, it contains
> a mapping to a freed line, subsequent reads of the corresponding LBAs fail.

Hi Heiner,

This has been an interesting issue to debug - good catch!

> 
> Note that looking up the L2P and checking the ppa in the write buffer needs
> to be performed atomically, hence the refactor of pblk_lookup_l2p_rand.
> 
> Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> ---
> drivers/lightnvm/pblk-read.c | 27 +++++++++++++++++++++++++--
> 1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
> index 3789185144da..7c556b2218e4 100644
> --- a/drivers/lightnvm/pblk-read.c
> +++ b/drivers/lightnvm/pblk-read.c
> @@ -529,13 +529,35 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
> 	int valid_secs = 0;
> 	int i;
> 
> -	pblk_lookup_l2p_rand(pblk, ppa_list_l2p, lba_list, nr_secs);
> -
> +	spin_lock(&pblk->trans_lock);
> 	for (i = 0; i < nr_secs; i++) {
> 		if (lba_list[i] == ADDR_EMPTY)
> 			continue;
> 
> +		ppa_list_l2p[i] = pblk_trans_map_get(pblk, lba_list[i]);
> 		ppa_gc = addr_to_gen_ppa(pblk, paddr_list_gc[i], line->id);
> +
> +		/* Obtain ppa from cache if the sector has been synced to the
> +		   device but the L2P has not been updated yet */
> +		if(pblk_addr_in_cache(ppa_list_l2p[i])) {
> +			struct pblk_rb *rb = &pblk->rwb;
> +			struct pblk_rb_entry *entry;
> +			struct pblk_w_ctx *w_ctx;
> +			u64 pos = pblk_addr_to_cacheline(ppa_list_l2p[i]);
> +
> +#ifdef CONFIG_NVM_PBLK_DEBUG
> +			/* Ensure that the access will not cause an overflow */
> +			BUG_ON(pos >= rb->nr_entries);
> +#endif
> +
> +			entry = &rb->entries[pos];
> +			w_ctx = &entry->w_ctx;
> +			if (pblk_ppa_comp(w_ctx->ppa, ppa_gc)) {
> +				rqd->ppa_list[valid_secs++] = ppa_gc;
> +				continue;
> +			}
> +		}
> +
> 		if (!pblk_ppa_comp(ppa_list_l2p[i], ppa_gc)) {
> 			paddr_list_gc[i] = lba_list[i] = ADDR_EMPTY;
> 			continue;
> @@ -543,6 +565,7 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
> 
> 		rqd->ppa_list[valid_secs++] = ppa_list_l2p[i];
> 	}
> +	spin_unlock(&pblk->trans_lock);
> 
> #ifdef CONFIG_NVM_PBLK_DEBUG
> 	atomic_long_add(valid_secs, &pblk->inflight_reads);
> --
> 2.17.1


Here is a suggestion: Why not add an atomic counter to the
line stating the sectors that are synced in the L2P table and then
loosely wait (i.e., check and sleep / schedule) until the counter
reaches 0 on pblk_line_close_ws()? This way you guarantee that the line
does not close - and therefore never reaches the GC lists - before all
the L2P entries for that line point to the media. Any other form of
synchronization that puts the burden at pblk_line_close_ws() would also
work for me.

In essence, I would rather pay the price on a per-line basis than
blocking the trans_lock longer for each I/O.

Thoughts?

Javier

Javier González Jan. 29, 2019, 6:39 p.m. UTC | #2

> On 29 Jan 2019, at 17.36, Heiner Litz <hlitz@ucsc.edu> wrote:
> 
> Javier,
> 
> On Tue, Jan 29, 2019 at 12:13 AM Javier González <javier@javigon.com> wrote:
> > On 27 Jan 2019, at 07.54, Heiner Litz <hlitz@ucsc.edu> wrote:
> >
> > This patch fixes a race condition where a write is mapped to the last
> > sectors of a line. The write is synced to the device but the L2P is not
> > updated yet. When the line is garbage collected before the L2P update is
> > performed, the sectors are ignored by the GC logic and the line is freed
> > before all sectors are moved. When the L2P is finally updated, it contains
> > a mapping to a freed line, subsequent reads of the corresponding LBAs fail.
> 
> Hi Heiner,
> 
> This has been an interesting issue to debug - good catch!
> 
> Felt more like a marathon than a catch ;)

Hehehe, I know. It is a good maraton then :)

> 
> 
> >
> > Note that looking up the L2P and checking the ppa in the write buffer needs
> > to be performed atomically, hence the refactor of pblk_lookup_l2p_rand.
> >
> > Signed-off-by: Heiner Litz <hlitz@ucsc.edu>
> > ---
> > drivers/lightnvm/pblk-read.c | 27 +++++++++++++++++++++++++--
> > 1 file changed, 25 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
> > index 3789185144da..7c556b2218e4 100644
> > --- a/drivers/lightnvm/pblk-read.c
> > +++ b/drivers/lightnvm/pblk-read.c
> > @@ -529,13 +529,35 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
> >       int valid_secs = 0;
> >       int i;
> >
> > -     pblk_lookup_l2p_rand(pblk, ppa_list_l2p, lba_list, nr_secs);
> > -
> > +     spin_lock(&pblk->trans_lock);
> >       for (i = 0; i < nr_secs; i++) {
> >               if (lba_list[i] == ADDR_EMPTY)
> >                       continue;
> >
> > +             ppa_list_l2p[i] = pblk_trans_map_get(pblk, lba_list[i]);
> >               ppa_gc = addr_to_gen_ppa(pblk, paddr_list_gc[i], line->id);
> > +
> > +             /* Obtain ppa from cache if the sector has been synced to the
> > +                device but the L2P has not been updated yet */
> > +             if(pblk_addr_in_cache(ppa_list_l2p[i])) {
> > +                     struct pblk_rb *rb = &pblk->rwb;
> > +                     struct pblk_rb_entry *entry;
> > +                     struct pblk_w_ctx *w_ctx;
> > +                     u64 pos = pblk_addr_to_cacheline(ppa_list_l2p[i]);
> > +
> > +#ifdef CONFIG_NVM_PBLK_DEBUG
> > +                     /* Ensure that the access will not cause an overflow */
> > +                     BUG_ON(pos >= rb->nr_entries);
> > +#endif
> > +
> > +                     entry = &rb->entries[pos];
> > +                     w_ctx = &entry->w_ctx;
> > +                     if (pblk_ppa_comp(w_ctx->ppa, ppa_gc)) {
> > +                             rqd->ppa_list[valid_secs++] = ppa_gc;
> > +                             continue;
> > +                     }
> > +             }
> > +
> >               if (!pblk_ppa_comp(ppa_list_l2p[i], ppa_gc)) {
> >                       paddr_list_gc[i] = lba_list[i] = ADDR_EMPTY;
> >                       continue;
> > @@ -543,6 +565,7 @@ static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
> >
> >               rqd->ppa_list[valid_secs++] = ppa_list_l2p[i];
> >       }
> > +     spin_unlock(&pblk->trans_lock);
> >
> > #ifdef CONFIG_NVM_PBLK_DEBUG
> >       atomic_long_add(valid_secs, &pblk->inflight_reads);
> > --
> > 2.17.1
> 
> 
> Here is a suggestion: Why not add an atomic counter to the
> line stating the sectors that are synced in the L2P table and then
> loosely wait (i.e., check and sleep / schedule) until the counter
> reaches 0 on pblk_line_close_ws()? This way you guarantee that the line
> does not close - and therefore never reaches the GC lists - before all
> the L2P entries for that line point to the media. Any other form of
> synchronization that puts the burden at pblk_line_close_ws() would also
> work for me.
> 
> In essence, I would rather pay the price on a per-line basis than
> blocking the trans_lock longer for each I/O.
> 
> The patch only adds 2 well predictable branches to the loop so I think the
> impact would be minimal, but I still think your approach is cleaner.
> I suggest checking the proposed sync counter whenever selecting a GC
> candidate. I'll send out a V2
> 

Sounds good! Thanks!

Javier

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 3789185144da..7c556b2218e4 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -529,13 +529,35 @@  static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
 	int valid_secs = 0;
 	int i;
 
-	pblk_lookup_l2p_rand(pblk, ppa_list_l2p, lba_list, nr_secs);
-
+	spin_lock(&pblk->trans_lock);
 	for (i = 0; i < nr_secs; i++) {
 		if (lba_list[i] == ADDR_EMPTY)
 			continue;
 
+		ppa_list_l2p[i] = pblk_trans_map_get(pblk, lba_list[i]);
 		ppa_gc = addr_to_gen_ppa(pblk, paddr_list_gc[i], line->id);
+
+		/* Obtain ppa from cache if the sector has been synced to the
+		   device but the L2P has not been updated yet */
+		if(pblk_addr_in_cache(ppa_list_l2p[i])) {
+			struct pblk_rb *rb = &pblk->rwb;
+			struct pblk_rb_entry *entry;
+			struct pblk_w_ctx *w_ctx;
+			u64 pos = pblk_addr_to_cacheline(ppa_list_l2p[i]);
+
+#ifdef CONFIG_NVM_PBLK_DEBUG
+			/* Ensure that the access will not cause an overflow */
+			BUG_ON(pos >= rb->nr_entries);
+#endif
+
+			entry = &rb->entries[pos];
+			w_ctx = &entry->w_ctx;
+			if (pblk_ppa_comp(w_ctx->ppa, ppa_gc)) {
+				rqd->ppa_list[valid_secs++] = ppa_gc;
+				continue;
+			}
+		}
+
 		if (!pblk_ppa_comp(ppa_list_l2p[i], ppa_gc)) {
 			paddr_list_gc[i] = lba_list[i] = ADDR_EMPTY;
 			continue;
@@ -543,6 +565,7 @@  static int read_ppalist_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
 
 		rqd->ppa_list[valid_secs++] = ppa_list_l2p[i];
 	}
+	spin_unlock(&pblk->trans_lock);
 
 #ifdef CONFIG_NVM_PBLK_DEBUG
 	atomic_long_add(valid_secs, &pblk->inflight_reads);

lightnvm: pblk: fix race condition on GC

Commit Message

Comments

Patch