Message ID | 20230509084637.213326-1-xiubli@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ceph: just wait the osd requests' callbacks to finish when unmounting | expand |
xiubli@redhat.com writes: > From: Xiubo Li <xiubli@redhat.com> > > The sync_filesystem() will flush all the dirty buffer and submit the > osd reqs to the osdc and then is blocked to wait for all the reqs to > finish. But the when the reqs' replies come, the reqs will be removed > from osdc just before the req->r_callback()s are called. Which means > the sync_filesystem() will be woke up by leaving the req->r_callback()s > are still running. > > This will be buggy when the waiter require the req->r_callback()s to > release some resources before continuing. So we need to make sure the > req->r_callback()s are called before removing the reqs from the osdc. > > WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0 > CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1 > Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 > RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0 > RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202 > RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00 > RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000 > RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000 > R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40 > R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000 > FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > generic_shutdown_super+0x47/0x120 > kill_anon_super+0x14/0x30 > ceph_kill_sb+0x36/0x90 [ceph] > deactivate_locked_super+0x29/0x60 > cleanup_mnt+0xb8/0x140 > task_work_run+0x67/0xb0 > exit_to_user_mode_prepare+0x23d/0x240 > syscall_exit_to_user_mode+0x25/0x60 > do_syscall_64+0x40/0x80 > entry_SYSCALL_64_after_hwframe+0x63/0xcd > RIP: 0033:0x7fd83dc39e9b > > We need to increase the blocker counter to make sure all the osd > requests' callbacks have been finished just before calling the > kill_anon_super() when unmounting. (Sorry for taking so long replying to this patch! And I've still a few others on the queue!) I've been looking at this patch and at patch "ceph: drop the messages from MDS when unmounting", but I still can't say whether they are correct or not. They seem to be working, but they don't _look_ right. For example, mdsc->stopping is being used by ceph_dec_stopping_blocker() and ceph_inc_stopping_blocker() for setting the ceph_mds_client state, but the old usage for that field (e.g. in ceph_mdsc_pre_umount()) is being kept. Is that correct? Maybe, but it looks wrong: the old usage isn't protected by the spinlock and doesn't use the atomic counter. Another example: in patch "ceph: drop the messages from MDS when unmounting" we're adding calls to ceph_inc_stopping_blocker() in ceph_handle_caps(), ceph_handle_quota(), and ceph_handle_snap(). Are those the only places where this is needed? Obviously not, because this new patch is adding extra calls in the read/write paths. But maybe there are more places...? And another one: changing ceph_inc_stopping_blocker() to accept NULL to distinguish between mds and osd requests makes things look even more hacky :-( On the other end, I've been testing these patches thoroughly and couldn't see any issues with them. And although I'm still not convinced that they will not deadlock in some corner cases, I don't have a better solution either for the problem they're solving. FWIW you can add my: Tested-by: Luís Henriques <lhenriques@suse.de> to this patch (the other one already has it), but I'll need to spend more time to see if there are better solutions. Cheers,
On 6/2/23 19:29, Luís Henriques wrote: > xiubli@redhat.com writes: > >> From: Xiubo Li <xiubli@redhat.com> >> >> The sync_filesystem() will flush all the dirty buffer and submit the >> osd reqs to the osdc and then is blocked to wait for all the reqs to >> finish. But the when the reqs' replies come, the reqs will be removed >> from osdc just before the req->r_callback()s are called. Which means >> the sync_filesystem() will be woke up by leaving the req->r_callback()s >> are still running. >> >> This will be buggy when the waiter require the req->r_callback()s to >> release some resources before continuing. So we need to make sure the >> req->r_callback()s are called before removing the reqs from the osdc. >> >> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0 >> CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1 >> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 >> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0 >> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202 >> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00 >> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000 >> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000 >> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40 >> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000 >> FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> <TASK> >> generic_shutdown_super+0x47/0x120 >> kill_anon_super+0x14/0x30 >> ceph_kill_sb+0x36/0x90 [ceph] >> deactivate_locked_super+0x29/0x60 >> cleanup_mnt+0xb8/0x140 >> task_work_run+0x67/0xb0 >> exit_to_user_mode_prepare+0x23d/0x240 >> syscall_exit_to_user_mode+0x25/0x60 >> do_syscall_64+0x40/0x80 >> entry_SYSCALL_64_after_hwframe+0x63/0xcd >> RIP: 0033:0x7fd83dc39e9b >> >> We need to increase the blocker counter to make sure all the osd >> requests' callbacks have been finished just before calling the >> kill_anon_super() when unmounting. > (Sorry for taking so long replying to this patch! And I've still a few > others on the queue!) > > I've been looking at this patch and at patch "ceph: drop the messages from > MDS when unmounting", but I still can't say whether they are correct or > not. They seem to be working, but they don't _look_ right. > > For example, mdsc->stopping is being used by ceph_dec_stopping_blocker() > and ceph_inc_stopping_blocker() for setting the ceph_mds_client state, but > the old usage for that field (e.g. in ceph_mdsc_pre_umount()) is being > kept. Is that correct? Maybe, but it looks wrong: the old usage isn't > protected by the spinlock and doesn't use the atomic counter. This is okay, the spin lock "stopping_lock" is not trying to protect the "stopping", and it's just trying to protect the "stopping_blockers" and the order of accessing "fsc->mdsc->stopping" and "fsc->mdsc->stopping_blockers", you can try without this you can reproduce it again. > > Another example: in patch "ceph: drop the messages from MDS when > unmounting" we're adding calls to ceph_inc_stopping_blocker() in > ceph_handle_caps(), ceph_handle_quota(), and ceph_handle_snap(). Are > those the only places where this is needed? Obviously not, because this > new patch is adding extra calls in the read/write paths. But maybe there > are more places...? I have gone through all the related code and this should be all the places, which will grab the inode, we need to do this. You can confirm that. > And another one: changing ceph_inc_stopping_blocker() to accept NULL to > distinguish between mds and osd requests makes things look even more > hacky :-( This can be improved. Let me update it to make it easier to read. > > On the other end, I've been testing these patches thoroughly and couldn't > see any issues with them. And although I'm still not convinced that they > will not deadlock in some corner cases, I don't have a better solution > either for the problem they're solving. > > FWIW you can add my: > > Tested-by: Luís Henriques <lhenriques@suse.de> > > to this patch (the other one already has it), but I'll need to spend more > time to see if there are better solutions. Thanks Luis for your test and review. - Xiubo > > Cheers,
Xiubo Li <xiubli@redhat.com> writes: > On 6/2/23 19:29, Luís Henriques wrote: >> xiubli@redhat.com writes: >> >>> From: Xiubo Li <xiubli@redhat.com> >>> >>> The sync_filesystem() will flush all the dirty buffer and submit the >>> osd reqs to the osdc and then is blocked to wait for all the reqs to >>> finish. But the when the reqs' replies come, the reqs will be removed >>> from osdc just before the req->r_callback()s are called. Which means >>> the sync_filesystem() will be woke up by leaving the req->r_callback()s >>> are still running. >>> >>> This will be buggy when the waiter require the req->r_callback()s to >>> release some resources before continuing. So we need to make sure the >>> req->r_callback()s are called before removing the reqs from the osdc. >>> >>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0 >>> CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1 >>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 >>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0 >>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202 >>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00 >>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000 >>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000 >>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40 >>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000 >>> FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0 >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> Call Trace: >>> <TASK> >>> generic_shutdown_super+0x47/0x120 >>> kill_anon_super+0x14/0x30 >>> ceph_kill_sb+0x36/0x90 [ceph] >>> deactivate_locked_super+0x29/0x60 >>> cleanup_mnt+0xb8/0x140 >>> task_work_run+0x67/0xb0 >>> exit_to_user_mode_prepare+0x23d/0x240 >>> syscall_exit_to_user_mode+0x25/0x60 >>> do_syscall_64+0x40/0x80 >>> entry_SYSCALL_64_after_hwframe+0x63/0xcd >>> RIP: 0033:0x7fd83dc39e9b >>> >>> We need to increase the blocker counter to make sure all the osd >>> requests' callbacks have been finished just before calling the >>> kill_anon_super() when unmounting. >> (Sorry for taking so long replying to this patch! And I've still a few >> others on the queue!) >> >> I've been looking at this patch and at patch "ceph: drop the messages from >> MDS when unmounting", but I still can't say whether they are correct or >> not. They seem to be working, but they don't _look_ right. >> >> For example, mdsc->stopping is being used by ceph_dec_stopping_blocker() >> and ceph_inc_stopping_blocker() for setting the ceph_mds_client state, but >> the old usage for that field (e.g. in ceph_mdsc_pre_umount()) is being >> kept. Is that correct? Maybe, but it looks wrong: the old usage isn't >> protected by the spinlock and doesn't use the atomic counter. > > This is okay, the spin lock "stopping_lock" is not trying to protect the > "stopping", and it's just trying to protect the "stopping_blockers" and the > order of accessing "fsc->mdsc->stopping" and "fsc->mdsc->stopping_blockers", you > can try without this you can reproduce it again. > > >> >> Another example: in patch "ceph: drop the messages from MDS when >> unmounting" we're adding calls to ceph_inc_stopping_blocker() in >> ceph_handle_caps(), ceph_handle_quota(), and ceph_handle_snap(). Are >> those the only places where this is needed? Obviously not, because this >> new patch is adding extra calls in the read/write paths. But maybe there >> are more places...? > > I have gone through all the related code and this should be all the places, > which will grab the inode, we need to do this. You can confirm that. > >> And another one: changing ceph_inc_stopping_blocker() to accept NULL to >> distinguish between mds and osd requests makes things look even more >> hacky :-( > > This can be improved. > > Let me update it to make it easier to read. > >> >> On the other end, I've been testing these patches thoroughly and couldn't >> see any issues with them. And although I'm still not convinced that they >> will not deadlock in some corner cases, I don't have a better solution >> either for the problem they're solving. >> >> FWIW you can add my: >> >> Tested-by: Luís Henriques <lhenriques@suse.de> >> >> to this patch (the other one already has it), but I'll need to spend more >> time to see if there are better solutions. > > Thanks Luis for your test and review. One final question: Would it be better to use wait_for_completion_timeout() in ceph_kill_sb()? I'm mostly worried with the read/write paths, of course. For example, the complexity of function ceph_writepages_start() makes it easy to introduce a bug where we'll have an ceph_inc_stopping_blocker() without an ceph_dec_stopping_blocker(). On the other hand, adding a timeout won't solve anything as we'll get the fscrypt warning in the unlikely even of timing-out; but maybe that's more acceptable than blocking the umount operation forever. So, I'm not really sure it's worth it. Cheers,
On 6/5/23 22:54, Luís Henriques wrote: > Xiubo Li <xiubli@redhat.com> writes: > >> On 6/2/23 19:29, Luís Henriques wrote: >>> xiubli@redhat.com writes: >>> >>>> From: Xiubo Li <xiubli@redhat.com> >>>> >>>> The sync_filesystem() will flush all the dirty buffer and submit the >>>> osd reqs to the osdc and then is blocked to wait for all the reqs to >>>> finish. But the when the reqs' replies come, the reqs will be removed >>>> from osdc just before the req->r_callback()s are called. Which means >>>> the sync_filesystem() will be woke up by leaving the req->r_callback()s >>>> are still running. >>>> >>>> This will be buggy when the waiter require the req->r_callback()s to >>>> release some resources before continuing. So we need to make sure the >>>> req->r_callback()s are called before removing the reqs from the osdc. >>>> >>>> WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0 >>>> CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1 >>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 >>>> RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0 >>>> RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202 >>>> RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00 >>>> RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000 >>>> RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000 >>>> R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40 >>>> R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000 >>>> FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0 >>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>>> Call Trace: >>>> <TASK> >>>> generic_shutdown_super+0x47/0x120 >>>> kill_anon_super+0x14/0x30 >>>> ceph_kill_sb+0x36/0x90 [ceph] >>>> deactivate_locked_super+0x29/0x60 >>>> cleanup_mnt+0xb8/0x140 >>>> task_work_run+0x67/0xb0 >>>> exit_to_user_mode_prepare+0x23d/0x240 >>>> syscall_exit_to_user_mode+0x25/0x60 >>>> do_syscall_64+0x40/0x80 >>>> entry_SYSCALL_64_after_hwframe+0x63/0xcd >>>> RIP: 0033:0x7fd83dc39e9b >>>> >>>> We need to increase the blocker counter to make sure all the osd >>>> requests' callbacks have been finished just before calling the >>>> kill_anon_super() when unmounting. >>> (Sorry for taking so long replying to this patch! And I've still a few >>> others on the queue!) >>> >>> I've been looking at this patch and at patch "ceph: drop the messages from >>> MDS when unmounting", but I still can't say whether they are correct or >>> not. They seem to be working, but they don't _look_ right. >>> >>> For example, mdsc->stopping is being used by ceph_dec_stopping_blocker() >>> and ceph_inc_stopping_blocker() for setting the ceph_mds_client state, but >>> the old usage for that field (e.g. in ceph_mdsc_pre_umount()) is being >>> kept. Is that correct? Maybe, but it looks wrong: the old usage isn't >>> protected by the spinlock and doesn't use the atomic counter. >> This is okay, the spin lock "stopping_lock" is not trying to protect the >> "stopping", and it's just trying to protect the "stopping_blockers" and the >> order of accessing "fsc->mdsc->stopping" and "fsc->mdsc->stopping_blockers", you >> can try without this you can reproduce it again. >> >> >>> Another example: in patch "ceph: drop the messages from MDS when >>> unmounting" we're adding calls to ceph_inc_stopping_blocker() in >>> ceph_handle_caps(), ceph_handle_quota(), and ceph_handle_snap(). Are >>> those the only places where this is needed? Obviously not, because this >>> new patch is adding extra calls in the read/write paths. But maybe there >>> are more places...? >> I have gone through all the related code and this should be all the places, >> which will grab the inode, we need to do this. You can confirm that. >> >>> And another one: changing ceph_inc_stopping_blocker() to accept NULL to >>> distinguish between mds and osd requests makes things look even more >>> hacky :-( >> This can be improved. >> >> Let me update it to make it easier to read. >> >>> On the other end, I've been testing these patches thoroughly and couldn't >>> see any issues with them. And although I'm still not convinced that they >>> will not deadlock in some corner cases, I don't have a better solution >>> either for the problem they're solving. >>> >>> FWIW you can add my: >>> >>> Tested-by: Luís Henriques <lhenriques@suse.de> >>> >>> to this patch (the other one already has it), but I'll need to spend more >>> time to see if there are better solutions. >> Thanks Luis for your test and review. > One final question: > > Would it be better to use wait_for_completion_timeout() in ceph_kill_sb()? I will add one mount option at the same time for the umount timeout and switch to wait_for_completion_killable_timeout(). > I'm mostly worried with the read/write paths, of course. For example, the > complexity of function ceph_writepages_start() makes it easy to introduce > a bug where we'll have an ceph_inc_stopping_blocker() without an > ceph_dec_stopping_blocker(). This also could be improved, we can just skip submitting the requests in case failing to increase the counters. > > On the other hand, adding a timeout won't solve anything as we'll get the > fscrypt warning in the unlikely even of timing-out; but maybe that's more > acceptable than blocking the umount operation forever. So, I'm not really > sure it's worth it. This sound reasonable to me. Thanks Luis, - Xiubo > Cheers,
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 683ba9fbd590..eb89e3f2a416 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -272,6 +272,7 @@ static void finish_netfs_read(struct ceph_osd_request *req) } netfs_subreq_terminated(subreq, err, false); iput(req->r_inode); + ceph_dec_stopping_blocker(fsc->mdsc); } static bool ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq) @@ -399,6 +400,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq) } else { osd_req_op_extent_osd_iter(req, 0, &iter); } + ceph_inc_stopping_blocker(fsc->mdsc, NULL); req->r_callback = finish_netfs_read; req->r_priv = subreq; req->r_inode = inode; @@ -871,6 +873,7 @@ static void writepages_finish(struct ceph_osd_request *req) else kfree(osd_data->pages); ceph_osdc_put_request(req); + ceph_dec_stopping_blocker(fsc->mdsc); } /* @@ -1179,6 +1182,7 @@ static int ceph_writepages_start(struct address_space *mapping, BUG_ON(len < ceph_fscrypt_page_offset(pages[locked_pages - 1]) + thp_size(pages[locked_pages - 1]) - offset); + ceph_inc_stopping_blocker(fsc->mdsc, NULL); req->r_callback = writepages_finish; req->r_inode = inode; diff --git a/fs/ceph/super.c b/fs/ceph/super.c index de2a45fa451a..c29d2ddd1a7b 100644 --- a/fs/ceph/super.c +++ b/fs/ceph/super.c @@ -1475,16 +1475,20 @@ static int ceph_init_fs_context(struct fs_context *fc) /* * Return true if mdsc successfully increase blocker counter, * or false if the mdsc is in stopping and flushed state. + * + * session: NULL then it's for osd requests or for mds requests. */ bool ceph_inc_stopping_blocker(struct ceph_mds_client *mdsc, struct ceph_mds_session *session) { - mutex_lock(&session->s_mutex); - inc_session_sequence(session); - mutex_unlock(&session->s_mutex); + if (session) { + mutex_lock(&session->s_mutex); + inc_session_sequence(session); + mutex_unlock(&session->s_mutex); + } spin_lock(&mdsc->stopping_lock); - if (mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED) { + if (session && mdsc->stopping >= CEPH_MDSC_STOPPING_FLUSHED) { spin_unlock(&mdsc->stopping_lock); return false; }