Message ID | 1528107946-2327-1-git-send-email-dongsheng.yang@easystack.cn (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Jun 4, 2018 at 12:25 PM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > There is a problem if we abort a linger request and then request_reinit(), > because abort_request() aborted the osd_request but did not put the lreq->r_klref; > Then we will hit this WARN_ON(kref_read(&req->r_kref) != 1); > > [ 609.492250] WARNING: CPU: 6 PID: 4933 at net/ceph//osd_client.c:558 request_reinit+0x116/0x130 [libceph] > [ 609.501561] Modules linked in: rbd(OE) libceph(OE) dns_resolver tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg cfg80211 rfkill snd_hda_codec_generic ext4 snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core pcbc mbcache aesni_intel snd_hwdep crypto_simd snd_seq cryptd jbd2 glue_helper snd_seq_device nfsd snd_pcm snd_timer snd virtio_balloon auth_rpcgss pcspkr soundcore i2c_piix4 nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c virtio_blk virtio_console ata_generic pata_acpi 8139too qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crc32c_intel virtio_pci i2c_core virtio_ring 8139cp virtio mii floppy serio_raw dm_mirror dm_region_hash > [ 609.518259] dm_log dm_mod dax [last unloaded: libceph] > [ 609.520015] CPU: 6 PID: 4933 Comm: kworker/6:0 Tainted: G OE 4.17.0-rc6+ #16 > [ 609.522261] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 > [ 609.524135] Workqueue: events handle_timeout [libceph] > [ 609.525887] RIP: 0010:request_reinit+0x116/0x130 [libceph] > [ 609.527723] RSP: 0018:ffffaf90833ebd78 EFLAGS: 00010202 > [ 609.529502] RAX: 0000000000000002 RBX: ffff89bb140f1f00 RCX: 0000000000000000 > [ 609.531576] RDX: 0000000000000000 RSI: ffff89bb5f196978 RDI: ffff89bb5f196978 > [ 609.533643] RBP: ffff89bb52b47800 R08: 0000000000000000 R09: 0000000000000000 > [ 609.535737] R10: 0000000000000001 R11: 0000000000aaaaaa R12: ffff89baf96632e0 > [ 609.537836] R13: 0000000000000001 R14: 0000000000000000 R15: ffff89bb516af768 > [ 609.539919] FS: 0000000000000000(0000) GS:ffff89bb5f180000(0000) knlGS:0000000000000000 > [ 609.542151] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 609.544009] CR2: 000055ed56747020 CR3: 000000032e00a003 CR4: 00000000000206e0 > [ 609.546111] Call Trace: > [ 609.547482] handle_timeout+0x4e1/0x750 [libceph] > [ 609.549208] ? pick_next_task_fair+0x273/0x630 > [ 609.550867] ? __switch_to+0xa8/0x480 > [ 609.552422] process_one_work+0x158/0x360 > [ 609.554012] worker_thread+0x47/0x3e0 > [ 609.555560] kthread+0xf8/0x130 > [ 609.557018] ? max_active_store+0x80/0x80 > [ 609.558606] ? kthread_bind+0x10/0x10 > [ 609.560142] ret_from_fork+0x35/0x40 > > Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> > --- > net/ceph/osd_client.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c > index ab287e5..406ecdf 100644 > --- a/net/ceph/osd_client.c > +++ b/net/ceph/osd_client.c > @@ -2385,6 +2385,9 @@ static void cancel_request(struct ceph_osd_request *req) > > static void abort_request(struct ceph_osd_request *req, int err) > { > + if (req->r_linger) > + return; > + > dout("%s req %p tid %llu err %d\n", __func__, req, req->r_tid, err); > > cancel_map_check(req); I think all OSD requests should be aborted, without exception. Otherwise abort_request() and its users become less useful... This WARN can be triggered under (almost) normal conditions, without abort_request() in the mix -- all it takes is a timing irregularity [1]. A proper fix is to probably get rid of request_reinit(), but it will take some effort. [1] http://tracker.ceph.com/issues/20996 Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/05/2018 04:25 PM, Ilya Dryomov wrote: > On Mon, Jun 4, 2018 at 12:25 PM, Dongsheng Yang > <dongsheng.yang@easystack.cn> wrote: >> There is a problem if we abort a linger request and then request_reinit(), >> because abort_request() aborted the osd_request but did not put the lreq->r_klref; >> Then we will hit this WARN_ON(kref_read(&req->r_kref) != 1); >> >> [ 609.492250] WARNING: CPU: 6 PID: 4933 at net/ceph//osd_client.c:558 request_reinit+0x116/0x130 [libceph] >> [ 609.501561] Modules linked in: rbd(OE) libceph(OE) dns_resolver tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg cfg80211 rfkill snd_hda_codec_generic ext4 snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core pcbc mbcache aesni_intel snd_hwdep crypto_simd snd_seq cryptd jbd2 glue_helper snd_seq_device nfsd snd_pcm snd_timer snd virtio_balloon auth_rpcgss pcspkr soundcore i2c_piix4 nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c virtio_blk virtio_console ata_generic pata_acpi 8139too qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crc32c_intel virtio_pci i2c_core virtio_ring 8139cp virtio mii floppy serio_raw dm_mirror dm_region_hash >> [ 609.518259] dm_log dm_mod dax [last unloaded: libceph] >> [ 609.520015] CPU: 6 PID: 4933 Comm: kworker/6:0 Tainted: G OE 4.17.0-rc6+ #16 >> [ 609.522261] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 >> [ 609.524135] Workqueue: events handle_timeout [libceph] >> [ 609.525887] RIP: 0010:request_reinit+0x116/0x130 [libceph] >> [ 609.527723] RSP: 0018:ffffaf90833ebd78 EFLAGS: 00010202 >> [ 609.529502] RAX: 0000000000000002 RBX: ffff89bb140f1f00 RCX: 0000000000000000 >> [ 609.531576] RDX: 0000000000000000 RSI: ffff89bb5f196978 RDI: ffff89bb5f196978 >> [ 609.533643] RBP: ffff89bb52b47800 R08: 0000000000000000 R09: 0000000000000000 >> [ 609.535737] R10: 0000000000000001 R11: 0000000000aaaaaa R12: ffff89baf96632e0 >> [ 609.537836] R13: 0000000000000001 R14: 0000000000000000 R15: ffff89bb516af768 >> [ 609.539919] FS: 0000000000000000(0000) GS:ffff89bb5f180000(0000) knlGS:0000000000000000 >> [ 609.542151] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 609.544009] CR2: 000055ed56747020 CR3: 000000032e00a003 CR4: 00000000000206e0 >> [ 609.546111] Call Trace: >> [ 609.547482] handle_timeout+0x4e1/0x750 [libceph] >> [ 609.549208] ? pick_next_task_fair+0x273/0x630 >> [ 609.550867] ? __switch_to+0xa8/0x480 >> [ 609.552422] process_one_work+0x158/0x360 >> [ 609.554012] worker_thread+0x47/0x3e0 >> [ 609.555560] kthread+0xf8/0x130 >> [ 609.557018] ? max_active_store+0x80/0x80 >> [ 609.558606] ? kthread_bind+0x10/0x10 >> [ 609.560142] ret_from_fork+0x35/0x40 >> >> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> >> --- >> net/ceph/osd_client.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c >> index ab287e5..406ecdf 100644 >> --- a/net/ceph/osd_client.c >> +++ b/net/ceph/osd_client.c >> @@ -2385,6 +2385,9 @@ static void cancel_request(struct ceph_osd_request *req) >> >> static void abort_request(struct ceph_osd_request *req, int err) >> { >> + if (req->r_linger) >> + return; >> + >> dout("%s req %p tid %llu err %d\n", __func__, req, req->r_tid, err); >> >> cancel_map_check(req); > I think all OSD requests should be aborted, without exception. > Otherwise abort_request() and its users become less useful... > > This WARN can be triggered under (almost) normal conditions, without > abort_request() in the mix -- all it takes is a timing irregularity [1]. > A proper fix is to probably get rid of request_reinit(), but it will > take some effort. Hi Ilya, yes, that's not correct way to fix this problem. :) thanx Dongsheng > > [1] http://tracker.ceph.com/issues/20996 > > Thanks, > > Ilya > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index ab287e5..406ecdf 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2385,6 +2385,9 @@ static void cancel_request(struct ceph_osd_request *req) static void abort_request(struct ceph_osd_request *req, int err) { + if (req->r_linger) + return; + dout("%s req %p tid %llu err %d\n", __func__, req, req->r_tid, err); cancel_map_check(req);
There is a problem if we abort a linger request and then request_reinit(), because abort_request() aborted the osd_request but did not put the lreq->r_klref; Then we will hit this WARN_ON(kref_read(&req->r_kref) != 1); [ 609.492250] WARNING: CPU: 6 PID: 4933 at net/ceph//osd_client.c:558 request_reinit+0x116/0x130 [libceph] [ 609.501561] Modules linked in: rbd(OE) libceph(OE) dns_resolver tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg cfg80211 rfkill snd_hda_codec_generic ext4 snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core pcbc mbcache aesni_intel snd_hwdep crypto_simd snd_seq cryptd jbd2 glue_helper snd_seq_device nfsd snd_pcm snd_timer snd virtio_balloon auth_rpcgss pcspkr soundcore i2c_piix4 nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c virtio_blk virtio_console ata_generic pata_acpi 8139too qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crc32c_intel virtio_pci i2c_core virtio_ring 8139cp virtio mii floppy serio_raw dm_mirror dm_region_hash [ 609.518259] dm_log dm_mod dax [last unloaded: libceph] [ 609.520015] CPU: 6 PID: 4933 Comm: kworker/6:0 Tainted: G OE 4.17.0-rc6+ #16 [ 609.522261] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 609.524135] Workqueue: events handle_timeout [libceph] [ 609.525887] RIP: 0010:request_reinit+0x116/0x130 [libceph] [ 609.527723] RSP: 0018:ffffaf90833ebd78 EFLAGS: 00010202 [ 609.529502] RAX: 0000000000000002 RBX: ffff89bb140f1f00 RCX: 0000000000000000 [ 609.531576] RDX: 0000000000000000 RSI: ffff89bb5f196978 RDI: ffff89bb5f196978 [ 609.533643] RBP: ffff89bb52b47800 R08: 0000000000000000 R09: 0000000000000000 [ 609.535737] R10: 0000000000000001 R11: 0000000000aaaaaa R12: ffff89baf96632e0 [ 609.537836] R13: 0000000000000001 R14: 0000000000000000 R15: ffff89bb516af768 [ 609.539919] FS: 0000000000000000(0000) GS:ffff89bb5f180000(0000) knlGS:0000000000000000 [ 609.542151] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 609.544009] CR2: 000055ed56747020 CR3: 000000032e00a003 CR4: 00000000000206e0 [ 609.546111] Call Trace: [ 609.547482] handle_timeout+0x4e1/0x750 [libceph] [ 609.549208] ? pick_next_task_fair+0x273/0x630 [ 609.550867] ? __switch_to+0xa8/0x480 [ 609.552422] process_one_work+0x158/0x360 [ 609.554012] worker_thread+0x47/0x3e0 [ 609.555560] kthread+0xf8/0x130 [ 609.557018] ? max_active_store+0x80/0x80 [ 609.558606] ? kthread_bind+0x10/0x10 [ 609.560142] ret_from_fork+0x35/0x40 Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> --- net/ceph/osd_client.c | 3 +++ 1 file changed, 3 insertions(+)