Message ID | 1527132420-10740-1-git-send-email-dongsheng.yang@easystack.cn (mailing list archive)
---|---
State | New, archived
There is a related change in https://github.com/ceph/ceph/pull/22200. That PR sets osd_request_timeout while rbdmap.service is stopping.

Thanks

On 05/24/2018 11:27 AM, Dongsheng Yang wrote:
> The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.
>
> [Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.
>
> In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.
>
> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.
>
> [Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.
>
> At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.
>
> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
> ---
>  include/linux/ceph/libceph.h |  1 +
>  net/ceph/debugfs.c           | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> index 9ce689a..8fdea50 100644
> --- a/include/linux/ceph/libceph.h
> +++ b/include/linux/ceph/libceph.h
> @@ -133,6 +133,7 @@ struct ceph_client {
>  	struct dentry *debugfs_monmap;
>  	struct dentry *debugfs_osdmap;
>  	struct dentry *debugfs_options;
> +	struct dentry *debugfs_osd_req_timeout;
>  #endif
>  };
>
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index 46f6570..2b6cae3 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -389,6 +389,28 @@ static int client_options_show(struct seq_file *s, void *p)
>  CEPH_DEFINE_SHOW_FUNC(osdc_show)
>  CEPH_DEFINE_SHOW_FUNC(client_options_show)
>
> +static int osd_request_timeout_get(void *data, u64 *val)
> +{
> +	struct ceph_client *client = data;
> +
> +	*val = client->options->osd_request_timeout;
> +	return 0;
> +}
> +
> +static int osd_request_timeout_set(void *data, u64 val)
> +{
> +	struct ceph_client *client = data;
> +
> +	client->options->osd_request_timeout = val;
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(client_osd_req_timeout_fops,
> +			osd_request_timeout_get,
> +			osd_request_timeout_set,
> +			"%lld\n");
> +
> +
>  int __init ceph_debugfs_init(void)
>  {
>  	ceph_debugfs_dir = debugfs_create_dir("ceph", NULL);
> @@ -457,6 +479,14 @@ int ceph_debugfs_client_init(struct ceph_client *client)
>  	if (!client->debugfs_options)
>  		goto out;
>
> +	client->debugfs_osd_req_timeout = debugfs_create_file("osd_request_timeout",
> +							      0600,
> +							      client->debugfs_dir,
> +							      client,
> +							      &client_osd_req_timeout_fops);
> +	if (!client->debugfs_osd_req_timeout)
> +		goto out;
> +
>  	return 0;
>
>  out:
> @@ -467,6 +497,7 @@ int ceph_debugfs_client_init(struct ceph_client *client)
>  void ceph_debugfs_client_cleanup(struct ceph_client *client)
>  {
>  	dout("ceph_debugfs_client_cleanup %p\n", client);
> +	debugfs_remove(client->debugfs_osd_req_timeout);
>  	debugfs_remove(client->debugfs_options);
>  	debugfs_remove(client->debugfs_osdmap);
>  	debugfs_remove(client->debugfs_monmap);
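[Editor's note: as an illustration of the workflow described in Question 3, here is a minimal sketch of how a shutdown path could flip the proposed knob. It assumes this patch is applied, that debugfs is mounted at /sys/kernel/debug, and that the per-client directory layout matches libceph's existing debugfs entries. Note that the patch exposes client->options->osd_request_timeout verbatim, while the map option is given in seconds, so confirm which unit the stored field uses on the kernel in question before relying on a particular number.]

    #!/bin/sh
    # Sketch: before unmapping at shutdown, stop waiting forever on a dead cluster.
    # Assumes this patch is applied and debugfs is mounted at /sys/kernel/debug.
    TIMEOUT=30    # illustrative value; 0 keeps the default "never time out"

    for f in /sys/kernel/debug/ceph/*/osd_request_timeout; do
        [ -e "$f" ] || continue      # no libceph clients on this host
        echo "$TIMEOUT" > "$f"       # stuck OSD requests should now start timing out
    done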
On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.

Hi Dongsheng,

Changing just osd_request_timeout won't do anything about outstanding requests waiting for the acquisition of the exclusive lock. This is an rbd problem and should be dealt with in rbd.c.

> [Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.
>
> In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.

Why can't rbd images be unmapped in a proper way before the cluster is shut down?

> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.
>
> [Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.
>
> At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.

The plan is to add a new unmap option, so one can do "rbd unmap -o full-force", as noted in https://tracker.ceph.com/issues/20927. This is an old problem, but it's been blocked on various issues in libceph. Most of them look like they will be resolved in 4.18, bringing better support for "umount -f" in kcephfs. We should be able to reuse that work for "rbd unmap -o full-force".

Thanks,

Ilya
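[Editor's note: a short sketch of the two approaches being contrasted here. The image name and device path are placeholders, and -o full-force was only a proposed option (tracker issue 20927) at the time of this thread, not something an existing kernel accepted.]

    # Per-mapping timeout, fixed at map time (what Question 2 wants to avoid):
    rbd map -o osd_request_timeout=30 rbd/myimage

    # The direction suggested instead: fail outstanding I/O and detach, without
    # umounting first (proposed syntax, not yet merged when this was written):
    rbd unmap -o full-force /dev/rbd0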
On 05/24/2018 04:23 PM, Ilya Dryomov wrote:
> On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>> The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.
>
> Hi Dongsheng,
>
> Changing just osd_request_timeout won't do anything about outstanding requests waiting for the acquisition of the exclusive lock. This is an rbd problem and should be dealt with in rbd.c.

Yes, we are using images without exclusive-lock.

And yes, the -ETIMEDOUT will currently make it try to acquire the lock again and again; we need to handle that properly.

>> [Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.
>>
>> In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.
>
> Why can't rbd images be unmapped in a proper way before the cluster is shut down?

That's optional; that's another solution in our plan. But that solution can't solve the problem when the ceph cluster is not working and we can't recover it soon: then we can't reboot the nodes and all the IO threads are stuck in D state.

>> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.
>>
>> [Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.
>>
>> At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.
>
> The plan is to add a new unmap option, so one can do "rbd unmap -o full-force", as noted in https://tracker.ceph.com/issues/20927. This is an old problem, but it's been blocked on various issues in libceph. Most of them look like they will be resolved in 4.18, bringing better support for "umount -f" in kcephfs. We should be able to reuse that work for "rbd unmap -o full-force".

But even if we have a full-force unmap for the rbd device, we need to umount the fs first, e.g. ext4. When we umount the fs before rbd unmap, we will be blocked and the umount process will be in D state.

Thanks

> Thanks,
>
> Ilya
On Thu, May 24, 2018 at 11:15 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> On 05/24/2018 04:23 PM, Ilya Dryomov wrote:
>> On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>> The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.
>>
>> Hi Dongsheng,
>>
>> Changing just osd_request_timeout won't do anything about outstanding requests waiting for the acquisition of the exclusive lock. This is an rbd problem and should be dealt with in rbd.c.
>
> Yes, we are using images without exclusive-lock.
>
> And yes, the -ETIMEDOUT will currently make it try to acquire the lock again and again; we need to handle that properly.

Well, any change that is pushed upstream needs to work with the full set of features.

>>> [Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.
>>>
>>> In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.
>>
>> Why can't rbd images be unmapped in a proper way before the cluster is shut down?
>
> That's optional; that's another solution in our plan. But that solution can't solve the problem when the ceph cluster is not working and we can't recover it soon: then we can't reboot the nodes and all the IO threads are stuck in D state.

That is (hopefully!) an extreme case. Rebooting nodes with I/O threads in D state is a sure way to lose data.

Wouldn't implementing proper unmapping somewhere in the orchestration layer make this problem pretty much go away?

>>> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.
>>>
>>> [Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.
>>>
>>> At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.
>>
>> The plan is to add a new unmap option, so one can do "rbd unmap -o full-force", as noted in https://tracker.ceph.com/issues/20927. This is an old problem, but it's been blocked on various issues in libceph. Most of them look like they will be resolved in 4.18, bringing better support for "umount -f" in kcephfs. We should be able to reuse that work for "rbd unmap -o full-force".
>
> But even if we have a full-force unmap for the rbd device, we need to umount the fs first, e.g. ext4. When we umount the fs before rbd unmap, we will be blocked and the umount process will be in D state.
I think there are two separate issues here. The first is that there is no way to say "I don't care about possible data loss, unmap this device no matter what". This is what -o full-force is targeted at. As I said in the github PR, you wouldn't need to umount before unmapping with it.

The second issue is what to do on reboot if the cluster is unavailable, i.e. _whether_ and when to raise the axe and fail outstanding requests. The problem is that there is no way to distinguish between a transient hold-up and a full outage. I'm not sure whether a default timeout will suit everybody, because people get really unhappy when they lose data.

Thanks,

Ilya
On Thu, May 24, 2018 at 4:19 PM, 杨东升 <dongsheng.yang@easystack.cn> wrote:
> Hi Ilya,
>
> At 2018-05-24 21:17:55, "Ilya Dryomov" <idryomov@gmail.com> wrote:
>> On Thu, May 24, 2018 at 11:15 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>> On 05/24/2018 04:23 PM, Ilya Dryomov wrote:
>>>> On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>>>> The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.
>>>>
>>>> Hi Dongsheng,
>>>>
>>>> Changing just osd_request_timeout won't do anything about outstanding requests waiting for the acquisition of the exclusive lock. This is an rbd problem and should be dealt with in rbd.c.
>>>
>>> Yes, we are using images without exclusive-lock.
>>>
>>> And yes, the -ETIMEDOUT will currently make it try to acquire the lock again and again; we need to handle that properly.
>>
>> Well, any change that is pushed upstream needs to work with the full set of features.
>
> Yes, agreed. Will fix it later after our discussion below.
>
>>>>> [Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.
>>>>>
>>>>> In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.
>>>>
>>>> Why can't rbd images be unmapped in a proper way before the cluster is shut down?
>>>
>>> That's optional; that's another solution in our plan. But that solution can't solve the problem when the ceph cluster is not working and we can't recover it soon: then we can't reboot the nodes and all the IO threads are stuck in D state.
>>
>> That is (hopefully!) an extreme case. Rebooting nodes with I/O threads in D state is a sure way to lose data.
>>
>> Wouldn't implementing proper unmapping somewhere in the orchestration layer make this problem pretty much go away?
>
> It's possible and we are doing that work, but this patch is actually going to fix the last problem you mentioned.........
>
>>>>> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.
>>>>>
>>>>> [Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.
>>>>>
>>>>> At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.
>>>> The plan is to add a new unmap option, so one can do "rbd unmap -o full-force", as noted in https://tracker.ceph.com/issues/20927. This is an old problem, but it's been blocked on various issues in libceph. Most of them look like they will be resolved in 4.18, bringing better support for "umount -f" in kcephfs. We should be able to reuse that work for "rbd unmap -o full-force".
>>>
>>> But even if we have a full-force unmap for the rbd device, we need to umount the fs first, e.g. ext4. When we umount the fs before rbd unmap, we will be blocked and the umount process will be in D state.
>>
>> I think there are two separate issues here. The first is that there is no way to say "I don't care about possible data loss, unmap this device no matter what". This is what -o full-force is targeted at. As I said in the github PR, you wouldn't need to umount before unmapping with it.
>>
>> The second issue is what to do on reboot if the cluster is unavailable, i.e. _whether_ and when to raise the axe and fail outstanding requests. The problem is that there is no way to distinguish between a transient hold-up and a full outage. I'm not sure whether a default timeout will suit everybody, because people get really unhappy when they lose data.
>
> ..... Yes, this is what we want to solve. We plan to set the timeout from the rbdmap.service setting and then do an umount and unmap.
>
> I agree that it's difficult to decide on a default timeout for it, so we give the user an option to decide, and set its default to 0.
>
> https://github.com/yangdongsheng/ceph/commit/8dceacac1e9c707f011157b433f0cbd1c7053f1e#diff-59215bb23b07b89b90f1cd0e02abb8d9R10

Your github PR was setting a default timeout of 60 seconds for everybody -- that is what I reacted to.

> Then I think that at least will not make things worse, but we have an optional way to solve this kind of problem if you want it. What do you think?

I won't object to a new rbdmap.service configurable, but I think it should be implemented in terms of -o full-force, with the timeout logic based on systemd's mechanisms. IIRC, if the service doesn't stop in time, systemd will send a SIGTERM followed by a SIGKILL after another timeout. I think it could be as simple as "unmap everything with -o full-force" on SIGKILL, or something like that.

Let's wait for -o full-force and then reevaluate.

Thanks,

Ilya
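[Editor's note: a sketch of the policy Ilya outlines, kept separate from the kernel mechanism: try the clean teardown that rbdmap normally performs, and only force-unmap if it does not finish within a deadline. The deadline, the device glob, and the watchdog construction are illustrative assumptions, and -o full-force is still the proposed, not-yet-merged option.]

    #!/bin/sh
    # Illustrative stop-path helper for rbdmap.service (assumptions noted above).
    DEADLINE=30

    # Watchdog: if the clean path is still stuck after $DEADLINE seconds,
    # give up on in-flight data and force the devices away.
    (
        sleep "$DEADLINE"
        for dev in /dev/rbd[0-9]*; do
            [ -b "$dev" ] && rbd unmap -o full-force "$dev"
        done
    ) &
    watchdog=$!

    # Clean path: what the stop job normally does (umount, then unmap).
    for dev in /dev/rbd[0-9]*; do
        [ -b "$dev" ] || continue
        umount "$dev" 2>/dev/null
        rbd unmap "$dev"
    done

    kill "$watchdog" 2>/dev/null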
On 24/05/2018 22:55, Ilya Dryomov wrote:
> On Thu, May 24, 2018 at 4:19 PM, 杨东升 <dongsheng.yang@easystack.cn> wrote:
>> Hi Ilya,
>>
>> At 2018-05-24 21:17:55, "Ilya Dryomov" <idryomov@gmail.com> wrote:
>>> On Thu, May 24, 2018 at 11:15 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>>> On 05/24/2018 04:23 PM, Ilya Dryomov wrote:
>>>>> On Thu, May 24, 2018 at 5:27 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>>>>> The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.
>>>>>
>>>>> Hi Dongsheng,
>>>>>
>>>>> Changing just osd_request_timeout won't do anything about outstanding requests waiting for the acquisition of the exclusive lock. This is an rbd problem and should be dealt with in rbd.c.
>>>>
>>>> Yes, we are using images without exclusive-lock.
>>>>
>>>> And yes, the -ETIMEDOUT will currently make it try to acquire the lock again and again; we need to handle that properly.
>>>
>>> Well, any change that is pushed upstream needs to work with the full set of features.
>>
>> Yes, agreed. Will fix it later after our discussion below.
>>
>>>>>> [Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.
>>>>>>
>>>>>> In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.
>>>>>
>>>>> Why can't rbd images be unmapped in a proper way before the cluster is shut down?
>>>>
>>>> That's optional; that's another solution in our plan. But that solution can't solve the problem when the ceph cluster is not working and we can't recover it soon: then we can't reboot the nodes and all the IO threads are stuck in D state.
>>>
>>> That is (hopefully!) an extreme case. Rebooting nodes with I/O threads in D state is a sure way to lose data.
>>>
>>> Wouldn't implementing proper unmapping somewhere in the orchestration layer make this problem pretty much go away?
>>
>> It's possible and we are doing that work, but this patch is actually going to fix the last problem you mentioned.........
>>
>>>>>> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.
>>>>>>
>>>>>> [Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.
>>>>>>
>>>>>> At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.
>>>>> The plan is to add a new unmap option, so one can do "rbd unmap -o full-force", as noted in https://tracker.ceph.com/issues/20927. This is an old problem, but it's been blocked on various issues in libceph. Most of them look like they will be resolved in 4.18, bringing better support for "umount -f" in kcephfs. We should be able to reuse that work for "rbd unmap -o full-force".
>>>>
>>>> But even if we have a full-force unmap for the rbd device, we need to umount the fs first, e.g. ext4. When we umount the fs before rbd unmap, we will be blocked and the umount process will be in D state.
>>>
>>> I think there are two separate issues here. The first is that there is no way to say "I don't care about possible data loss, unmap this device no matter what". This is what -o full-force is targeted at. As I said in the github PR, you wouldn't need to umount before unmapping with it.
>>>
>>> The second issue is what to do on reboot if the cluster is unavailable, i.e. _whether_ and when to raise the axe and fail outstanding requests. The problem is that there is no way to distinguish between a transient hold-up and a full outage. I'm not sure whether a default timeout will suit everybody, because people get really unhappy when they lose data.
>>
>> ..... Yes, this is what we want to solve. We plan to set the timeout from the rbdmap.service setting and then do an umount and unmap.
>>
>> I agree that it's difficult to decide on a default timeout for it, so we give the user an option to decide, and set its default to 0.
>>
>> https://github.com/yangdongsheng/ceph/commit/8dceacac1e9c707f011157b433f0cbd1c7053f1e#diff-59215bb23b07b89b90f1cd0e02abb8d9R10
>
> Your github PR was setting a default timeout of 60 seconds for everybody -- that is what I reacted to.

HA, yes, that was not a good idea.

>> Then I think that at least will not make things worse, but we have an optional way to solve this kind of problem if you want it. What do you think?
>
> I won't object to a new rbdmap.service configurable, but I think it should be implemented in terms of -o full-force, with the timeout logic based on systemd's mechanisms. IIRC, if the service doesn't stop in time, systemd will send a SIGTERM followed by a SIGKILL after another timeout. I think it could be as simple as "unmap everything with -o full-force" on SIGKILL, or something like that.
>
> Let's wait for -o full-force and then reevaluate.

Okay, let's wait for full-force. Thanks.

> Thanks,
>
> Ilya
[resend because of an SMTP error, please ignore this if you have already received it.......]

Hi Ilya,

I think there is no conflict between this patch and -o full-force. We can use them in different use cases.

(1) This patch is simple. When we are going to fix the problem of umounting the fs and unmapping the device while the ceph cluster is unavailable in production, we want the logic to be as simple as possible, so that it introduces a regression with very little probability, especially when we need to backport commits to stable branches.

(2) When we don't want to change the original logic of user applications. Let's compare the work we need to do in higher-level applications: if we are going to use full-force to solve the problem, we need to change the user applications. For example, k8s umounts the fs first and then detaches the disk, and it's not easy to change that part of the framework.

(3) When we don't want to implement another "timeout and retry with full-force". From what we discussed about full-force, IIUC, we don't have to use full-force at first; we should try the normal way and retry with full-force on a timeout. For example, as you mentioned, we can retry when we get a specific signal during systemd shutdown. But in some other use cases, we would have to implement this timeout-and-retry mechanism ourselves.

And yes, there are some other cases where this patch is not suitable, for example when the system doesn't have debugfs mounted.

So I think we can merge this patch into upstream, but continue to implement full-force. What do you think?

Thanks

Dongsheng

On 05/24/2018 11:27 AM, Dongsheng Yang wrote:
> The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.
>
> [Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.
>
> In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.
>
> [Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.
>
> [Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.
>
> At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.
>
> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
> ---
>  include/linux/ceph/libceph.h |  1 +
>  net/ceph/debugfs.c           | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
> index 9ce689a..8fdea50 100644
> --- a/include/linux/ceph/libceph.h
> +++ b/include/linux/ceph/libceph.h
> @@ -133,6 +133,7 @@ struct ceph_client {
>  	struct dentry *debugfs_monmap;
>  	struct dentry *debugfs_osdmap;
>  	struct dentry *debugfs_options;
> +	struct dentry *debugfs_osd_req_timeout;
>  #endif
>  };
>
> diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
> index 46f6570..2b6cae3 100644
> --- a/net/ceph/debugfs.c
> +++ b/net/ceph/debugfs.c
> @@ -389,6 +389,28 @@ static int client_options_show(struct seq_file *s, void *p)
>  CEPH_DEFINE_SHOW_FUNC(osdc_show)
>  CEPH_DEFINE_SHOW_FUNC(client_options_show)
>
> +static int osd_request_timeout_get(void *data, u64 *val)
> +{
> +	struct ceph_client *client = data;
> +
> +	*val = client->options->osd_request_timeout;
> +	return 0;
> +}
> +
> +static int osd_request_timeout_set(void *data, u64 val)
> +{
> +	struct ceph_client *client = data;
> +
> +	client->options->osd_request_timeout = val;
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(client_osd_req_timeout_fops,
> +			osd_request_timeout_get,
> +			osd_request_timeout_set,
> +			"%lld\n");
> +
> +
>  int __init ceph_debugfs_init(void)
>  {
>  	ceph_debugfs_dir = debugfs_create_dir("ceph", NULL);
> @@ -457,6 +479,14 @@ int ceph_debugfs_client_init(struct ceph_client *client)
>  	if (!client->debugfs_options)
>  		goto out;
>
> +	client->debugfs_osd_req_timeout = debugfs_create_file("osd_request_timeout",
> +							      0600,
> +							      client->debugfs_dir,
> +							      client,
> +							      &client_osd_req_timeout_fops);
> +	if (!client->debugfs_osd_req_timeout)
> +		goto out;
> +
>  	return 0;
>
>  out:
> @@ -467,6 +497,7 @@ int ceph_debugfs_client_init(struct ceph_client *client)
>  void ceph_debugfs_client_cleanup(struct ceph_client *client)
>  {
>  	dout("ceph_debugfs_client_cleanup %p\n", client);
> +	debugfs_remove(client->debugfs_osd_req_timeout);
>  	debugfs_remove(client->debugfs_options);
>  	debugfs_remove(client->debugfs_osdmap);
>  	debugfs_remove(client->debugfs_monmap);
On Sat, May 26, 2018 at 3:21 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> [resend because of an SMTP error, please ignore this if you have already received it.......]
>
> Hi Ilya,
>
> I think there is no conflict between this patch and -o full-force. We can use them in different use cases.
>
> (1) This patch is simple. When we are going to fix the problem of umounting the fs and unmapping the device while the ceph cluster is unavailable in production, we want the logic to be as simple as possible, so that it introduces a regression with very little probability, especially when we need to backport commits to stable branches.

This patch is simple only because it doesn't handle all cases. As you acknowledged, it doesn't deal with requests that are stuck on the exclusive lock at all. Once you make it do that, it won't be any simpler than the -o full-force patch.

The fundamental problem with this patch is that it introduces a timeout at the wrong level. If we were to add such a timeout, it would need to work at the rbd level and not within libceph, because there are things at the rbd level that need handling when the timeout fires.

> (2) When we don't want to change the original logic of user applications. Let's compare the work we need to do in higher-level applications: if we are going to use full-force to solve the problem, we need to change the user applications. For example, k8s umounts the fs first and then detaches the disk, and it's not easy to change that part of the framework.
>
> (3) When we don't want to implement another "timeout and retry with full-force". From what we discussed about full-force, IIUC, we don't have to use full-force at first; we should try the normal way and retry with full-force on a timeout. For example, as you mentioned, we can retry when we get a specific signal during systemd shutdown. But in some other use cases, we would have to implement this timeout-and-retry mechanism ourselves.

-o full-force is a mechanism; "try the normal way first and retry with full-force on a timeout" is a policy. One may want to do something else before forcing, leave it up to the user, etc. Or, if the cluster is known to be permanently unavailable, force without a timeout.

> And yes, there are some other cases where this patch is not suitable, for example when the system doesn't have debugfs mounted.
>
> So I think we can merge this patch into upstream, but continue to implement full-force.

I'm not opposed to wiring up an rbd-level timeout, but in order to be merged the code must handle all cases. The reason I suggested waiting for -o full-force is that it should take care of the hard stuff and make implementing a proper timeout handler much easier.

Thanks,

Ilya
Hi Ilya,

On 05/28/2018 05:13 PM, Ilya Dryomov wrote:
> On Sat, May 26, 2018 at 3:21 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>> [resend because of an SMTP error, please ignore this if you have already received it.......]
>>
>> Hi Ilya,
>>
>> I think there is no conflict between this patch and -o full-force. We can use them in different use cases.
>>
>> (1) This patch is simple. When we are going to fix the problem of umounting the fs and unmapping the device while the ceph cluster is unavailable in production, we want the logic to be as simple as possible, so that it introduces a regression with very little probability, especially when we need to backport commits to stable branches.
>
> This patch is simple only because it doesn't handle all cases. As you acknowledged, it doesn't deal with requests that are stuck on the exclusive lock at all. Once you make it do that, it won't be any simpler than the -o full-force patch.
>
> The fundamental problem with this patch is that it introduces a timeout at the wrong level. If we were to add such a timeout, it would need to work at the rbd level and not within libceph, because there are things at the rbd level that need handling when the timeout fires.
>
>> (2) When we don't want to change the original logic of user applications. Let's compare the work we need to do in higher-level applications: if we are going to use full-force to solve the problem, we need to change the user applications. For example, k8s umounts the fs first and then detaches the disk, and it's not easy to change that part of the framework.
>>
>> (3) When we don't want to implement another "timeout and retry with full-force". From what we discussed about full-force, IIUC, we don't have to use full-force at first; we should try the normal way and retry with full-force on a timeout. For example, as you mentioned, we can retry when we get a specific signal during systemd shutdown. But in some other use cases, we would have to implement this timeout-and-retry mechanism ourselves.
>
> -o full-force is a mechanism; "try the normal way first and retry with full-force on a timeout" is a policy. One may want to do something else before forcing, leave it up to the user, etc. Or, if the cluster is known to be permanently unavailable, force without a timeout.
>
>> And yes, there are some other cases where this patch is not suitable, for example when the system doesn't have debugfs mounted.
>>
>> So I think we can merge this patch into upstream, but continue to implement full-force.
>
> I'm not opposed to wiring up an rbd-level timeout, but in order to be merged the code must handle all cases. The reason I suggested waiting for -o full-force is that it should take care of the hard stuff and make implementing a proper timeout handler much easier.

Okay, agreed on this point. Let's wait for full-force and then decide whether it's necessary to implement an rbd-level timeout.

Thanks :)

> Thanks,
>
> Ilya
diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 9ce689a..8fdea50 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -133,6 +133,7 @@ struct ceph_client {
 	struct dentry *debugfs_monmap;
 	struct dentry *debugfs_osdmap;
 	struct dentry *debugfs_options;
+	struct dentry *debugfs_osd_req_timeout;
 #endif
 };

diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 46f6570..2b6cae3 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -389,6 +389,28 @@ static int client_options_show(struct seq_file *s, void *p)
 CEPH_DEFINE_SHOW_FUNC(osdc_show)
 CEPH_DEFINE_SHOW_FUNC(client_options_show)

+static int osd_request_timeout_get(void *data, u64 *val)
+{
+	struct ceph_client *client = data;
+
+	*val = client->options->osd_request_timeout;
+	return 0;
+}
+
+static int osd_request_timeout_set(void *data, u64 val)
+{
+	struct ceph_client *client = data;
+
+	client->options->osd_request_timeout = val;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(client_osd_req_timeout_fops,
+			osd_request_timeout_get,
+			osd_request_timeout_set,
+			"%lld\n");
+
+
 int __init ceph_debugfs_init(void)
 {
 	ceph_debugfs_dir = debugfs_create_dir("ceph", NULL);
@@ -457,6 +479,14 @@ int ceph_debugfs_client_init(struct ceph_client *client)
 	if (!client->debugfs_options)
 		goto out;

+	client->debugfs_osd_req_timeout = debugfs_create_file("osd_request_timeout",
+							      0600,
+							      client->debugfs_dir,
+							      client,
+							      &client_osd_req_timeout_fops);
+	if (!client->debugfs_osd_req_timeout)
+		goto out;
+
 	return 0;

 out:
@@ -467,6 +497,7 @@ int ceph_debugfs_client_init(struct ceph_client *client)
 void ceph_debugfs_client_cleanup(struct ceph_client *client)
 {
 	dout("ceph_debugfs_client_cleanup %p\n", client);
+	debugfs_remove(client->debugfs_osd_req_timeout);
 	debugfs_remove(client->debugfs_options);
 	debugfs_remove(client->debugfs_osdmap);
 	debugfs_remove(client->debugfs_monmap);
The default value of osd_request_timeout is 0, which means never time out, and we can set this value when mapping an rbd device with -o "osd_request_timeout=XX". But we can't change this value online.

[Question 1]: Why do we need to set osd_request_timeout? When we are going to reboot a node that has krbd devices mapped, even with rbdmap.service enabled, we will be blocked during shutdown if the ceph cluster is not working.

In particular, if we have three controller nodes running as ceph mons, but at the same time there are k8s pods with krbd devices on these nodes, then we can't shut down the last controller node when we want to shut down all nodes, because by the time we shut down the last controller node the ceph cluster is no longer reachable.

[Question 2]: Why don't we use rbd map -o "osd_request_timeout=XX"? We don't want osd_request_timeout set for the whole lifecycle of the rbd device; a networking problem or cluster recovery could make requests time out, which would make the fs read-only and take the application down.

[Question 3]: How does this patch solve these problems? With this patch, we can map rbd devices with the default value of osd_request_timeout, meaning never time out, which solves the problem mentioned in Question 2.

At the same time, we can set osd_request_timeout to whatever we need during system shutdown, for example in rbdmap.service. Then we can make sure we can shut down or reboot the host normally no matter whether the ceph cluster is working well or not. This solves the problem mentioned in Question 1.

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
---
 include/linux/ceph/libceph.h |  1 +
 net/ceph/debugfs.c           | 31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)
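[Editor's note: a quick interactive sketch of exercising the knob this patch adds, assuming a test host with one mapped image, the patch applied, and debugfs mounted. The image name and the fsid/client-id directory pattern are assumptions, and, as noted earlier, the unit of the stored value should be confirmed for the kernel at hand.]

    # Map with the default, i.e. no timeout at all:
    rbd map rbd/testimg

    # Each libceph client gets its own debugfs directory under /sys/kernel/debug/ceph:
    cd /sys/kernel/debug/ceph/*.client*/

    cat osd_request_timeout       # -> 0, never time out
    echo 30 > osd_request_timeout
    cat osd_request_timeout       # -> 30, changed online without remapping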