Message ID | 1363531902-24909-5-git-send-email-zheng.z.yan@intel.com (mailing list archive)
State | New, archived
Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests…). More importantly, this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart. I don't have the MDSTable semantics in my head, so I'll need to look into this later unless somebody else volunteers to do so…
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> When an MDS becomes active, the table server re-sends 'agree' messages
> for old prepared requests. If the recovered MDS starts a new table request
> at the same time, the new request's ID can happen to be the same as an old
> prepared request's ID, because the current table client assigns request IDs
> from zero after the MDS restarts.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>
> [quoted patch snipped; the full diff appears at the end of this thread]

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/20/2013 07:09 AM, Greg Farnum wrote:
> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests…). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so…

It's not just 4 billion requests: an MDS restart has several stages, and the mdsmap epoch increases at each stage. I don't think there are any more colliding states in the table. The table client/server use two-phase commit; it's similar to a client request that involves multiple MDSes, and the reqid is analogous to the client request ID. The difference is that a client request ID is unique because a new client always gets a unique session ID.

Thanks
Yan, Zheng

> [remainder of quoted message snipped]
On Wed, 20 Mar 2013, Yan, Zheng wrote:
> On 03/20/2013 07:09 AM, Greg Farnum wrote:
> > [earlier discussion snipped]
>
> Not just 4 billion requests: an MDS restart has several stages, and the
> mdsmap epoch increases at each stage. I don't think there are any more
> colliding states in the table. The table client/server use two-phase
> commit; it's similar to a client request that involves multiple MDSes,
> and the reqid is analogous to the client request ID. The difference is
> that a client request ID is unique because a new client always gets a
> unique session ID.

Each time a tid is consumed (at least for an update) it is journaled in
the EMetaBlob::table_tids list, right? So we could actually take a max
from journal replay and pick up where we left off? That seems like the
cleanest approach.

I'm not too worried about 2^32 tids, I guess, but it would be nicer to
avoid that possibility.

sage

> [remainder of quoted message snipped]
On 03/20/2013 02:15 PM, Sage Weil wrote:
> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>> [earlier discussion snipped]
>
> Each time a tid is consumed (at least for an update) it is journaled in
> the EMetaBlob::table_tids list, right? So we could actually take a max
> from journal replay and pick up where we left off? That seems like the
> cleanest.

This approach works only if the client journals the reqid before sending the request to the server, but in the current implementation the client journals the reqid when it receives the server's 'agree' message.

Regards
Yan, Zheng

> I'm not too worried about 2^32 tids, I guess, but it would be nicer to
> avoid that possibility.
>
> sage
>
> [remainder of quoted message snipped]
On 03/20/2013 02:15 PM, Sage Weil wrote:
> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>> [earlier discussion snipped]
>
> Each time a tid is consumed (at least for an update) it is journaled in
> the EMetaBlob::table_tids list, right? So we could actually take a max
> from journal replay and pick up where we left off? That seems like the
> cleanest.
>
> I'm not too worried about 2^32 tids, I guess, but it would be nicer to
> avoid that possibility.

Can we re-use the client request ID as the table client request ID?

Regards
Yan, Zheng

> [remainder of quoted message snipped]
On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
> On 03/20/2013 02:15 PM, Sage Weil wrote:
>> [earlier discussion snipped]
>
> Can we re-use the client request ID as the table client request ID?
>
> Regards
> Yan, Zheng

Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work, as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates.

As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to roll back every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage? Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places…
-Greg
On 03/21/2013 02:31 AM, Greg Farnum wrote:
> On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
>> [earlier discussion snipped]
>
> Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work, as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates.

You are right, the client request ID does not work.

> As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to roll back every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage? Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places…
> -Greg

The table server sends an 'agree' message to the table client after a 'prepare' entry is safely logged. The table server re-sends the 'agree' message in two cases: one is that the table client restarts, the other is that the table server itself restarts. The purpose of re-sending the 'agree' message is to check whether the table client still wants to keep the update preparation. (The table client might crash before submitting the update.) The purpose of the reqid is to associate a table update preparation request with the server's 'agree' reply message. The problem here is that the table client does not keep reqids unique across restarts. If you feel 2^32 reqids are still enough, setting the reqid to a randomized 64-bit value should be safe enough.

Thanks
Yan, Zheng
On Thu, Mar 21, 2013 at 1:07 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> On 03/21/2013 02:31 AM, Greg Farnum wrote:
>> [earlier discussion snipped]
>
> The table server sends an 'agree' message to the table client after a 'prepare' entry is safely logged. The table server re-sends the 'agree' message in two cases: one is that the table client restarts, the other is that the table server itself restarts. The purpose of re-sending the 'agree' message is to check whether the table client still wants to keep the update preparation. (The table client might crash before submitting the update.) The purpose of the reqid is to associate a table update preparation request with the server's 'agree' reply message. The problem here is that the table client does not keep reqids unique across restarts. If you feel 2^32 reqids are still enough, setting the reqid to a randomized 64-bit value should be safe enough.

Right. I'd like to somehow mark those reqids so that we can tell when they come from a different incarnation of the MDS TableClient daemon. One way is via some piece of random data that will probably distinguish them, although if we have something which we can know is different, that would be preferable.

I think we can work something out of the startup session data each MDS does with the monitors, but I'm not sure I can check any time soon; I have a number of other things to get to now that I've gotten through (the first round on) this series. Thanks for all the patches, by the way. :)
-Greg
diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index bb1c833..859782a 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
       dout(2) << "boot_start " << step << ": opening snap table" << dendl;
       snapserver->load(gather.new_sub());
     }
+
+    anchorclient->init();
+    snapclient->init();
 
     dout(2) << "boot_start " << step << ": opening mds log" << dendl;
     mdlog->open(gather.new_sub());
diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
index ea021f5..beba0a3 100644
--- a/src/mds/MDSTableClient.cc
+++ b/src/mds/MDSTableClient.cc
@@ -34,6 +34,11 @@
 #undef dout_prefix
 #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
 
+void MDSTableClient::init()
+{
+  // make reqid unique between MDS restarts
+  last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
+}
 
 void MDSTableClient::handle_request(class MMDSTableRequest *m)
 {
diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
index e15837f..78035db 100644
--- a/src/mds/MDSTableClient.h
+++ b/src/mds/MDSTableClient.h
@@ -63,6 +63,8 @@ public:
   MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
   virtual ~MDSTableClient() {}
 
+  void init();
+
   void handle_request(MMDSTableRequest *m);
 
   void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);