Message ID | CAOi1vP8-6KUh5drZ5Sc5g08VdTtE6kovDeUUD0f8d2L1FX+mLg@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 05/31/2018 04:13 PM, Ilya Dryomov wrote: > On Thu, May 31, 2018 at 9:18 AM, Dongsheng Yang > <dongsheng.yang@easystack.cn> wrote: >> On 05/31/2018 03:11 AM, Ilya Dryomov wrote: >>> On Tue, May 29, 2018 at 5:22 AM, Dongsheng Yang >>> <dongsheng.yang@easystack.cn> wrote: >>>> We will cancel all watch delayed work in >>>> cancel_delayed_work_sync(&rbd_dev->watch_dwork); >>>> If we queue delayed work after this, there will be a use-after-free >>>> problem: >>>> >>>> [ 549.932085] BUG: unable to handle kernel NULL pointer dereference at >>>> 0000000000000000 >>>> [ 549.934134] PGD 0 P4D 0 >>>> [ 549.935145] Oops: 0000 [#1] SMP PTI >>>> [ 549.936283] Modules linked in: rbd(OE) libceph(OE) tcp_diag udp_diag >>>> inet_diag unix_diag af_packet_diag netlink_diag dns_resolver ebtable_filter >>>> ebtables ip6table_filter ip6_tables iptable_filter sg cfg80211 rfkill >>>> snd_hda_codec_generic ext4 snd_hda_intel snd_hda_codec crct10dif_pclmul >>>> crc32_pclmul ghash_clmulni_intel snd_hda_core pcbc snd_hwdep snd_seq mbcache >>>> aesni_intel snd_seq_device jbd2 crypto_simd nfsd cryptd glue_helper snd_pcm >>>> snd_timer auth_rpcgss pcspkr snd virtio_balloon nfs_acl soundcore i2c_piix4 >>>> lockd grace sunrpc ip_tables xfs libcrc32c virtio_console virtio_blk >>>> ata_generic pata_acpi 8139too qxl drm_kms_helper syscopyarea sysfillrect >>>> sysimgblt fb_sys_fops ttm drm ata_piix libata crc32c_intel virtio_pci 8139cp >>>> virtio_ring i2c_core mii virtio floppy serio_raw dm_mirror dm_region_hash >>>> [ 549.951835] dm_log dm_mod dax [last unloaded: libceph] >>>> [ 549.953490] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G OE >>>> 4.17.0-rc6+ #13 >>>> [ 549.955502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 >>>> [ 549.957246] RIP: 0010:__queue_work+0x6a/0x3b0 >>>> [ 549.958744] RSP: 0018:ffff9427df1c3e90 EFLAGS: 00010086 >>>> [ 549.960374] RAX: ffff9427deca8400 RBX: 0000000000000000 RCX: >>>> 0000000000000000 >>>> [ 549.962297] RDX: ffff9427deca8400 RSI: ffff9427df1c3e50 RDI: >>>> 0000000000000000 >>>> [ 549.964216] RBP: ffff942783e39e00 R08: ffff9427deca8400 R09: >>>> ffff9427df1c3f00 >>>> [ 549.966136] R10: 0000000000000004 R11: 0000000000000005 R12: >>>> ffff9427cfb85970 >>>> [ 549.968070] R13: 0000000000002000 R14: 000000000001eca0 R15: >>>> 0000000000000007 >>>> [ 549.969999] FS: 0000000000000000(0000) GS:ffff9427df1c0000(0000) >>>> knlGS:0000000000000000 >>>> [ 549.972069] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> [ 549.973775] CR2: 0000000000000000 CR3: 00000004c900a005 CR4: >>>> 00000000000206e0 >>>> [ 549.975695] Call Trace: >>>> [ 549.976900] <IRQ> >>>> [ 549.978033] ? __queue_work+0x3b0/0x3b0 >>>> [ 549.979442] call_timer_fn+0x2d/0x130 >>>> [ 549.980824] run_timer_softirq+0x16e/0x430 >>>> [ 549.982263] ? tick_sched_timer+0x37/0x70 >>>> [ 549.983691] __do_softirq+0xd2/0x280 >>>> [ 549.985035] irq_exit+0xd5/0xe0 >>>> [ 549.986316] smp_apic_timer_interrupt+0x6c/0x130 >>>> [ 549.987835] apic_timer_interrupt+0xf/0x20 >>>> >>>> This patch forbid to queue watch_dwork when we are removing device. >>>> >>>> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> >>>> --- >>>> drivers/block/rbd.c | 10 +++++++--- >>>> 1 file changed, 7 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >>>> index 2b4e90d..d1d8f46 100644 >>>> --- a/drivers/block/rbd.c >>>> +++ b/drivers/block/rbd.c >>>> @@ -3475,9 +3475,13 @@ static void rbd_reregister_watch(struct >>>> work_struct *work) >>>> set_bit(RBD_DEV_FLAG_BLACKLISTED, >>>> &rbd_dev->flags); >>>> wake_requests(rbd_dev, true); >>>> } else { >>>> - queue_delayed_work(rbd_dev->task_wq, >>>> - &rbd_dev->watch_dwork, >>>> - RBD_RETRY_DELAY); >>>> + spin_lock_irq(&rbd_dev->lock); >>>> + if (!test_bit(RBD_DEV_FLAG_REMOVING, >>>> &rbd_dev->flags)) { >>>> + queue_delayed_work(rbd_dev->task_wq, >>>> + &rbd_dev->watch_dwork, >>>> + RBD_RETRY_DELAY); >>>> + } >>>> + spin_unlock_irq(&rbd_dev->lock); >>>> } >>>> mutex_unlock(&rbd_dev->watch_mutex); >>>> return; >>> Hi Dongsheng, >>> >>> What made you think it is rbd (or ceph) related? Do you know what gets >>> dereferenced? Is it reproducible? >> >> Hi Ilya, >> There is a simple reproduce script: >> >> ./vstart.sh -k -l --bluestore >> rbd map -o osd_request_timeout=10 test1 >> time dd if=/dev/zero of=/dev/rbd0 bs=64K count=1000 oflag=direct & >> sleep 1 >> ps -ef|grep -E "ceph-mon|ceph-osd"|gawk '{print "kill -9 "$2}'|bash >> rbd unmap -o force /dev/rbd0 >> >> But maybe you need to run this script in a loop, because that's not 100% >> happen. > Ah, I see. I think a more correct fix would be to move ->watch_dwork > cancellation: > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 2b4e90d06822..23ae0df7a978 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -3400,7 +3400,6 @@ static void cancel_tasks_sync(struct rbd_device *rbd_dev) > { > dout("%s rbd_dev %p\n", __func__, rbd_dev); > > - cancel_delayed_work_sync(&rbd_dev->watch_dwork); > cancel_work_sync(&rbd_dev->acquired_lock_work); > cancel_work_sync(&rbd_dev->released_lock_work); > cancel_delayed_work_sync(&rbd_dev->lock_dwork); > @@ -3418,6 +3417,7 @@ static void rbd_unregister_watch(struct > rbd_device *rbd_dev) > rbd_dev->watch_state = RBD_WATCH_STATE_UNREGISTERED; > mutex_unlock(&rbd_dev->watch_mutex); > > + cancel_delayed_work_sync(&rbd_dev->watch_dwork); > ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc); > } > > This way rbd_reregister_watch() should either bail out early because > the watch is UNREGISTERED at that point or just get cancelled. Sounds good. what's your suggestion about the [2/2] about lock_dwork? thanx Dongsheng > > Thanks, > > Ilya > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 31, 2018 at 10:34 AM, Dongsheng Yang <dongsheng.yang@easystack.cn> wrote: > > > On 05/31/2018 04:13 PM, Ilya Dryomov wrote: >> >> On Thu, May 31, 2018 at 9:18 AM, Dongsheng Yang >> <dongsheng.yang@easystack.cn> wrote: >>> >>> On 05/31/2018 03:11 AM, Ilya Dryomov wrote: >>>> >>>> On Tue, May 29, 2018 at 5:22 AM, Dongsheng Yang >>>> <dongsheng.yang@easystack.cn> wrote: >>>>> >>>>> We will cancel all watch delayed work in >>>>> cancel_delayed_work_sync(&rbd_dev->watch_dwork); >>>>> If we queue delayed work after this, there will be a use-after-free >>>>> problem: >>>>> >>>>> [ 549.932085] BUG: unable to handle kernel NULL pointer dereference at >>>>> 0000000000000000 >>>>> [ 549.934134] PGD 0 P4D 0 >>>>> [ 549.935145] Oops: 0000 [#1] SMP PTI >>>>> [ 549.936283] Modules linked in: rbd(OE) libceph(OE) tcp_diag udp_diag >>>>> inet_diag unix_diag af_packet_diag netlink_diag dns_resolver >>>>> ebtable_filter >>>>> ebtables ip6table_filter ip6_tables iptable_filter sg cfg80211 rfkill >>>>> snd_hda_codec_generic ext4 snd_hda_intel snd_hda_codec crct10dif_pclmul >>>>> crc32_pclmul ghash_clmulni_intel snd_hda_core pcbc snd_hwdep snd_seq >>>>> mbcache >>>>> aesni_intel snd_seq_device jbd2 crypto_simd nfsd cryptd glue_helper >>>>> snd_pcm >>>>> snd_timer auth_rpcgss pcspkr snd virtio_balloon nfs_acl soundcore >>>>> i2c_piix4 >>>>> lockd grace sunrpc ip_tables xfs libcrc32c virtio_console virtio_blk >>>>> ata_generic pata_acpi 8139too qxl drm_kms_helper syscopyarea >>>>> sysfillrect >>>>> sysimgblt fb_sys_fops ttm drm ata_piix libata crc32c_intel virtio_pci >>>>> 8139cp >>>>> virtio_ring i2c_core mii virtio floppy serio_raw dm_mirror >>>>> dm_region_hash >>>>> [ 549.951835] dm_log dm_mod dax [last unloaded: libceph] >>>>> [ 549.953490] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G OE >>>>> 4.17.0-rc6+ #13 >>>>> [ 549.955502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 >>>>> [ 549.957246] RIP: 0010:__queue_work+0x6a/0x3b0 >>>>> [ 549.958744] RSP: 0018:ffff9427df1c3e90 EFLAGS: 00010086 >>>>> [ 549.960374] RAX: ffff9427deca8400 RBX: 0000000000000000 RCX: >>>>> 0000000000000000 >>>>> [ 549.962297] RDX: ffff9427deca8400 RSI: ffff9427df1c3e50 RDI: >>>>> 0000000000000000 >>>>> [ 549.964216] RBP: ffff942783e39e00 R08: ffff9427deca8400 R09: >>>>> ffff9427df1c3f00 >>>>> [ 549.966136] R10: 0000000000000004 R11: 0000000000000005 R12: >>>>> ffff9427cfb85970 >>>>> [ 549.968070] R13: 0000000000002000 R14: 000000000001eca0 R15: >>>>> 0000000000000007 >>>>> [ 549.969999] FS: 0000000000000000(0000) GS:ffff9427df1c0000(0000) >>>>> knlGS:0000000000000000 >>>>> [ 549.972069] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [ 549.973775] CR2: 0000000000000000 CR3: 00000004c900a005 CR4: >>>>> 00000000000206e0 >>>>> [ 549.975695] Call Trace: >>>>> [ 549.976900] <IRQ> >>>>> [ 549.978033] ? __queue_work+0x3b0/0x3b0 >>>>> [ 549.979442] call_timer_fn+0x2d/0x130 >>>>> [ 549.980824] run_timer_softirq+0x16e/0x430 >>>>> [ 549.982263] ? tick_sched_timer+0x37/0x70 >>>>> [ 549.983691] __do_softirq+0xd2/0x280 >>>>> [ 549.985035] irq_exit+0xd5/0xe0 >>>>> [ 549.986316] smp_apic_timer_interrupt+0x6c/0x130 >>>>> [ 549.987835] apic_timer_interrupt+0xf/0x20 >>>>> >>>>> This patch forbid to queue watch_dwork when we are removing device. >>>>> >>>>> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> >>>>> --- >>>>> drivers/block/rbd.c | 10 +++++++--- >>>>> 1 file changed, 7 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >>>>> index 2b4e90d..d1d8f46 100644 >>>>> --- a/drivers/block/rbd.c >>>>> +++ b/drivers/block/rbd.c >>>>> @@ -3475,9 +3475,13 @@ static void rbd_reregister_watch(struct >>>>> work_struct *work) >>>>> set_bit(RBD_DEV_FLAG_BLACKLISTED, >>>>> &rbd_dev->flags); >>>>> wake_requests(rbd_dev, true); >>>>> } else { >>>>> - queue_delayed_work(rbd_dev->task_wq, >>>>> - &rbd_dev->watch_dwork, >>>>> - RBD_RETRY_DELAY); >>>>> + spin_lock_irq(&rbd_dev->lock); >>>>> + if (!test_bit(RBD_DEV_FLAG_REMOVING, >>>>> &rbd_dev->flags)) { >>>>> + queue_delayed_work(rbd_dev->task_wq, >>>>> + >>>>> &rbd_dev->watch_dwork, >>>>> + RBD_RETRY_DELAY); >>>>> + } >>>>> + spin_unlock_irq(&rbd_dev->lock); >>>>> } >>>>> mutex_unlock(&rbd_dev->watch_mutex); >>>>> return; >>>> >>>> Hi Dongsheng, >>>> >>>> What made you think it is rbd (or ceph) related? Do you know what gets >>>> dereferenced? Is it reproducible? >>> >>> >>> Hi Ilya, >>> There is a simple reproduce script: >>> >>> ./vstart.sh -k -l --bluestore >>> rbd map -o osd_request_timeout=10 test1 >>> time dd if=/dev/zero of=/dev/rbd0 bs=64K count=1000 oflag=direct & >>> sleep 1 >>> ps -ef|grep -E "ceph-mon|ceph-osd"|gawk '{print "kill -9 "$2}'|bash >>> rbd unmap -o force /dev/rbd0 >>> >>> But maybe you need to run this script in a loop, because that's not 100% >>> happen. >> >> Ah, I see. I think a more correct fix would be to move ->watch_dwork >> cancellation: >> >> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >> index 2b4e90d06822..23ae0df7a978 100644 >> --- a/drivers/block/rbd.c >> +++ b/drivers/block/rbd.c >> @@ -3400,7 +3400,6 @@ static void cancel_tasks_sync(struct rbd_device >> *rbd_dev) >> { >> dout("%s rbd_dev %p\n", __func__, rbd_dev); >> >> - cancel_delayed_work_sync(&rbd_dev->watch_dwork); >> cancel_work_sync(&rbd_dev->acquired_lock_work); >> cancel_work_sync(&rbd_dev->released_lock_work); >> cancel_delayed_work_sync(&rbd_dev->lock_dwork); >> @@ -3418,6 +3417,7 @@ static void rbd_unregister_watch(struct >> rbd_device *rbd_dev) >> rbd_dev->watch_state = RBD_WATCH_STATE_UNREGISTERED; >> mutex_unlock(&rbd_dev->watch_mutex); >> >> + cancel_delayed_work_sync(&rbd_dev->watch_dwork); >> ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc); >> } >> >> This way rbd_reregister_watch() should either bail out early because >> the watch is UNREGISTERED at that point or just get cancelled. > > > Sounds good. what's your suggestion about the [2/2] about lock_dwork? I don't think it's needed. In order for rbd_acquire_lock() to possibly requeue itself, it must be queued after it is cancelled. Looking at the places where ->lock_dwork is queued: - rbd_wait_state_locked(), i.e. new I/O request -- shoudn't happen at that point - rbd_reacquire_lock(), from rbd_reregister_watch() if LOCKED -- again shouldn't happen at that point because it is preceded by rbd_unlock() in rbd_dev_image_unlock() If you are seeing a similar crash on ->lock_dwork, the bug is somewhere else. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 2b4e90d06822..23ae0df7a978 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3400,7 +3400,6 @@ static void cancel_tasks_sync(struct rbd_device *rbd_dev) { dout("%s rbd_dev %p\n", __func__, rbd_dev); - cancel_delayed_work_sync(&rbd_dev->watch_dwork); cancel_work_sync(&rbd_dev->acquired_lock_work); cancel_work_sync(&rbd_dev->released_lock_work); cancel_delayed_work_sync(&rbd_dev->lock_dwork); @@ -3418,6 +3417,7 @@ static void rbd_unregister_watch(struct rbd_device *rbd_dev) rbd_dev->watch_state = RBD_WATCH_STATE_UNREGISTERED; mutex_unlock(&rbd_dev->watch_mutex); + cancel_delayed_work_sync(&rbd_dev->watch_dwork); ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc); }