diff mbox

mds: sort dentries when committing dir fragment

Message ID 50B32E48.50403@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yan, Zheng Nov. 26, 2012, 8:54 a.m. UTC
On 11/26/2012 10:19 AM, Yan, Zheng wrote:
> On 11/26/2012 06:32 AM, Sage Weil wrote:
>> I pushed an alternative approach to wip-tmap.
>>
>> This sorting is an artifact of tmap's crummy implementation, and the mds 
>> workaround will need to get reverted when we switch to omap.  Instead, fix 
>> tmap so that it will tolerate unsorted keys.  (Also, drop the ENOENT on rm 
>> on missing key.)
>>
>> Eventually we can deprecate and remove tmap entirely...
>>
>> What do you think?
> 
> This approach is cleaner than mine. But I think your fix isn't enough because
> MDS may provide tmap contains misordered items to the TMAPPUT method. Misordered
> items will confuse future TMAPUP. This fix is either sorting items when handling
> TMAPPUT or searching forward for any potential misordered items when TMAP_SET
> wants to add a new item or TMAP_RM fails to find an item.
>

how about the patch attached below

-----
From e3c4fb68dc6c7592b7f53ab7a98b561167b567df Mon Sep 17 00:00:00 2001
From: "Yan, Zheng" <zheng.z.yan@intel.com>
Date: Mon, 26 Nov 2012 12:28:30 +0800
Subject: [PATCH] osd: check misordered items in TMAP

There is a bug in the MDS that causes misordered items in TMAPs
that store dir fragments. Misordered items confuse TMAP updates.

Fix this by adding code to do_tmapup() to check if there are
misordered items that may affect the correctness of TMAP update.
Fall back to use do_tmapup_slow if misordered items are found.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/osd/ReplicatedPG.cc | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

Comments

Sage Weil Nov. 26, 2012, 7:06 p.m. UTC | #1
On Mon, 26 Nov 2012, Yan, Zheng wrote:

> On 11/26/2012 10:19 AM, Yan, Zheng wrote:
> > On 11/26/2012 06:32 AM, Sage Weil wrote:
> >> I pushed an alternative approach to wip-tmap.
> >>
> >> This sorting is an artifact of tmap's crummy implementation, and the mds 
> >> workaround will need to get reverted when we switch to omap.  Instead, fix 
> >> tmap so that it will tolerate unsorted keys.  (Also, drop the ENOENT on rm 
> >> on missing key.)
> >>
> >> Eventually we can deprecate and remove tmap entirely...
> >>
> >> What do you think?
> > 
> > This approach is cleaner than mine. But I think your fix isn't enough because
> > MDS may provide tmap contains misordered items to the TMAPPUT method. Misordered
> > items will confuse future TMAPUP. This fix is either sorting items when handling
> > TMAPPUT or searching forward for any potential misordered items when TMAP_SET
> > wants to add a new item or TMAP_RM fails to find an item.
> >
> 
> how about the patch attached below

Pushed this to wip-tmap, along with a change to TMAPPUT that verifies 
everything is fully sorted.

I'm am bit worried that looking specifically for the <foo>_ sorting 
behavior is fragile, though.  And I'm less worried about performance of 
MDS stuff on bobtail than just correctness (fixing this particular issue). 
That will probably mean we trigger a slow update frequently, particularly 
when snapshots are in use, but I'm okay with that...

sage

> 
> -----
> >From e3c4fb68dc6c7592b7f53ab7a98b561167b567df Mon Sep 17 00:00:00 2001
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> Date: Mon, 26 Nov 2012 12:28:30 +0800
> Subject: [PATCH] osd: check misordered items in TMAP
> 
> There is a bug in the MDS that causes misordered items in TMAPs
> that store dir fragments. Misordered items confuse TMAP updates.
> 
> Fix this by adding code to do_tmapup() to check if there are
> misordered items that may affect the correctness of TMAP update.
> Fall back to use do_tmapup_slow if misordered items are found.
> 
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  src/osd/ReplicatedPG.cc | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
> index 48cba10..71f5363 100644
> --- a/src/osd/ReplicatedPG.cc
> +++ b/src/osd/ReplicatedPG.cc
> @@ -1803,8 +1803,17 @@ int ReplicatedPG::do_tmapup(OpContext *ctx, bufferlist::iterator& bp, OSDOp& osd
>  	  nkeys--;
>  	}
>  	if (!ip.end()) {
> +	  string last_key = nextkey;
> +
>  	  ::decode(nextkey, ip);
>  	  ::decode(nextval, ip);
> +
> +	  if (nextkey <= last_key) {
> +	    dout(0) << "tmapup warning: key '" << key << "' < previous key '" << last_key
> +		    << "', falling back to an inefficient (unsorted) update" << dendl;
> +	    bp = orig_bp;
> +	    return do_tmapup_slow(ctx, bp, osd_op, newop.outdata);
> +	  }
>  	} else {
>  	  have_next = false;
>  	}
> @@ -1848,6 +1857,35 @@ int ReplicatedPG::do_tmapup(OpContext *ctx, bufferlist::iterator& bp, OSDOp& osd
>        ::encode(nextkey, newkeydata);
>        ::encode(nextval, newkeydata);
>        dout(20) << "  keep " << nextkey << " " << nextval.length() << dendl;
> +
> +      /*
> +       * TMAPs for storing dir fragments may contain misordered items.
> +       * Only items corresponding to dentries that have the same name
> +       * prefix can be out of order.
> +       */
> +      size_t lastlen = nextkey.find_last_of('_');
> +      if (lastlen > 0 && lastlen != string::npos) {
> +	string last_key = nextkey;
> +	bufferlist::iterator tmp_ip = ip;
> +	while (!tmp_ip.end()) {
> +	  ::decode(nextkey, tmp_ip);
> +	  ::decode(nextval, tmp_ip);
> +
> +	  if (nextkey <= last_key) {
> +	    dout(0) << "tmapup warning: key '" << nextkey << "' < previous key '" << last_key
> +		    << "', falling back to an inefficient (unsorted) update" << dendl;
> +	    bp = orig_bp;
> +	    return do_tmapup_slow(ctx, bp, osd_op, newop.outdata);
> +	  }
> +
> +	  size_t len = nextkey.find_last_of('_');
> +	  if (len == 0 || len == string::npos)
> +	    break;
> +	  len = min(len, lastlen);
> +	  if (last_key.compare(0, len, nextkey, 0, len) < 0)
> +	    break;
> +	}
> +      }
>      }
>      if (!ip.end()) {
>        bufferlist rest;
> -- 
> 1.7.11.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 48cba10..71f5363 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1803,8 +1803,17 @@  int ReplicatedPG::do_tmapup(OpContext *ctx, bufferlist::iterator& bp, OSDOp& osd
 	  nkeys--;
 	}
 	if (!ip.end()) {
+	  string last_key = nextkey;
+
 	  ::decode(nextkey, ip);
 	  ::decode(nextval, ip);
+
+	  if (nextkey <= last_key) {
+	    dout(0) << "tmapup warning: key '" << key << "' < previous key '" << last_key
+		    << "', falling back to an inefficient (unsorted) update" << dendl;
+	    bp = orig_bp;
+	    return do_tmapup_slow(ctx, bp, osd_op, newop.outdata);
+	  }
 	} else {
 	  have_next = false;
 	}
@@ -1848,6 +1857,35 @@  int ReplicatedPG::do_tmapup(OpContext *ctx, bufferlist::iterator& bp, OSDOp& osd
       ::encode(nextkey, newkeydata);
       ::encode(nextval, newkeydata);
       dout(20) << "  keep " << nextkey << " " << nextval.length() << dendl;
+
+      /*
+       * TMAPs for storing dir fragments may contain misordered items.
+       * Only items corresponding to dentries that have the same name
+       * prefix can be out of order.
+       */
+      size_t lastlen = nextkey.find_last_of('_');
+      if (lastlen > 0 && lastlen != string::npos) {
+	string last_key = nextkey;
+	bufferlist::iterator tmp_ip = ip;
+	while (!tmp_ip.end()) {
+	  ::decode(nextkey, tmp_ip);
+	  ::decode(nextval, tmp_ip);
+
+	  if (nextkey <= last_key) {
+	    dout(0) << "tmapup warning: key '" << nextkey << "' < previous key '" << last_key
+		    << "', falling back to an inefficient (unsorted) update" << dendl;
+	    bp = orig_bp;
+	    return do_tmapup_slow(ctx, bp, osd_op, newop.outdata);
+	  }
+
+	  size_t len = nextkey.find_last_of('_');
+	  if (len == 0 || len == string::npos)
+	    break;
+	  len = min(len, lastlen);
+	  if (last_key.compare(0, len, nextkey, 0, len) < 0)
+	    break;
+	}
+      }
     }
     if (!ip.end()) {
       bufferlist rest;