Message ID | 20250219003419.241017-1-slava@dubeyko.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | ceph: fix slab-use-after-free in have_mon_and_osd_map() | expand |
This fixes the TOCTOU problem of accessing the epoch field after map-free. I'd like to know if it's not leaving a correctness problem instead. Is the return value of have_mon_and_osd_map still valid and relevant in this case, where it is being concurrently freed? On Wed, Feb 19, 2025 at 2:34 AM Viacheslav Dubeyko <slava@dubeyko.com> wrote: > > From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> > > The generic/395 and generic/397 is capable of generating > the oops is on line net/ceph/ceph_common.c:794 with > KASAN enabled. > > BUG: KASAN: slab-use-after-free in have_mon_and_osd_map+0x56/0x70 > Read of size 4 at addr ffff88811012d810 by task mount.ceph/13305 > > CPU: 2 UID: 0 PID: 13305 Comm: mount.ceph Not tainted 6.14.0-rc2-build2+ #1266 > Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 > Call Trace: > <TASK> > dump_stack_lvl+0x57/0x80 > ? have_mon_and_osd_map+0x56/0x70 > print_address_description.constprop.0+0x84/0x330 > ? have_mon_and_osd_map+0x56/0x70 > print_report+0xe2/0x1e0 > ? rcu_read_unlock_sched+0x60/0x80 > ? kmem_cache_debug_flags+0xc/0x20 > ? fixup_red_left+0x17/0x30 > ? have_mon_and_osd_map+0x56/0x70 > kasan_report+0x8d/0xc0 > ? have_mon_and_osd_map+0x56/0x70 > have_mon_and_osd_map+0x56/0x70 > ceph_open_session+0x182/0x290 > ? __pfx_ceph_open_session+0x10/0x10 > ? __init_swait_queue_head+0x8d/0xa0 > ? __pfx_autoremove_wake_function+0x10/0x10 > ? shrinker_register+0xdd/0xf0 > ceph_get_tree+0x333/0x680 > vfs_get_tree+0x49/0x180 > do_new_mount+0x1a3/0x2d0 > ? __pfx_do_new_mount+0x10/0x10 > ? security_capable+0x39/0x70 > path_mount+0x6dd/0x730 > ? __pfx_path_mount+0x10/0x10 > ? kmem_cache_free+0x1e5/0x270 > ? user_path_at+0x48/0x60 > do_mount+0x99/0xe0 > ? __pfx_do_mount+0x10/0x10 > ? lock_release+0x155/0x190 > __do_sys_mount+0x141/0x180 > do_syscall_64+0x9f/0x100 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > RIP: 0033:0x7f01b1b14f3e > Code: 48 8b 0d d5 3e 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a2 3e 0f 00 f7 d8 64 89 01 48 > RSP: 002b:00007fffd129fa08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 > RAX: ffffffffffffffda RBX: 0000564ec01a7850 RCX: 00007f01b1b14f3e > RDX: 0000564ec00f2225 RSI: 00007fffd12a1964 RDI: 0000564ec0147a20 > RBP: 00007fffd129fbd0 R08: 0000564ec014da90 R09: 0000000000000080 > R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffd12a194e > R13: 0000000000000000 R14: 00007fffd129fa50 R15: 00007fffd129fa40 > </TASK> > > Allocated by task 13305: > stack_trace_save+0x8c/0xc0 > kasan_save_stack+0x1e/0x40 > kasan_save_track+0x10/0x30 > __kasan_kmalloc+0x3a/0x50 > __kmalloc_noprof+0x247/0x290 > ceph_osdmap_alloc+0x16/0x130 > ceph_osdc_init+0x27a/0x4c0 > ceph_create_client+0x153/0x190 > create_fs_client+0x50/0x2a0 > ceph_get_tree+0xff/0x680 > vfs_get_tree+0x49/0x180 > do_new_mount+0x1a3/0x2d0 > path_mount+0x6dd/0x730 > do_mount+0x99/0xe0 > __do_sys_mount+0x141/0x180 > do_syscall_64+0x9f/0x100 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Freed by task 9475: > stack_trace_save+0x8c/0xc0 > kasan_save_stack+0x1e/0x40 > kasan_save_track+0x10/0x30 > kasan_save_free_info+0x3b/0x50 > __kasan_slab_free+0x18/0x30 > kfree+0x212/0x290 > handle_one_map+0x23c/0x3b0 > ceph_osdc_handle_map+0x3c9/0x590 > mon_dispatch+0x655/0x6f0 > ceph_con_process_message+0xc3/0xe0 > ceph_con_v1_try_read+0x614/0x760 > ceph_con_workfn+0x2de/0x650 > process_one_work+0x486/0x7c0 > process_scheduled_works+0x73/0x90 > worker_thread+0x1c8/0x2a0 > kthread+0x2ec/0x300 > ret_from_fork+0x24/0x40 > ret_from_fork_asm+0x1a/0x30 > > The buggy address belongs to the object at ffff88811012d800 > which belongs to the cache kmalloc-512 of size 512 > The buggy address is located 16 bytes inside of > freed 512-byte region [ffff88811012d800, ffff88811012da00) > > The buggy address belongs to the physical page: > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11012c > head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > flags: 0x200000000000040(head|node=0|zone=2) > page_type: f5(slab) > raw: 0200000000000040 ffff888100042c80 dead000000000100 dead000000000122 > raw: 0000000000000000 0000000080100010 00000000f5000000 0000000000000000 > head: 0200000000000040 ffff888100042c80 dead000000000100 dead000000000122 > head: 0000000000000000 0000000080100010 00000000f5000000 0000000000000000 > head: 0200000000000002 ffffea0004404b01 ffffffffffffffff 0000000000000000 > head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > ffff88811012d700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > ffff88811012d780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc > > ffff88811012d800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > ^ > ffff88811012d880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff88811012d900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== > Disabling lock debugging due to kernel taint > libceph: client274326 fsid 8598140e-35c2-11ee-b97c-001517c545cc > libceph: mon0 (1)90.155.74.19:6789 session established > libceph: client274327 fsid 8598140e-35c2-11ee-b97c-001517c545cc > > We have such scenario: > > Thread 1: > void ceph_osdmap_destroy(...) { > <skipped> > kfree(map); > } > Thread 1 sleep... > > Thread 2: > static bool have_mon_and_osd_map(struct ceph_client *client) { > return client->monc.monmap && client->monc.monmap->epoch && > client->osdc.osdmap && client->osdc.osdmap->epoch; > } > Thread 2 has oops... > > Thread 1 wake up: > static int handle_one_map(...) { > <skipped> > osdc->osdmap = newmap; > <skipped> > } > > This patch fixes the issue by means of locking > client->osdc.lock and client->monc.mutex before > the checking client->osdc.osdmap and > client->monc.monmap. > > Reported-by: David Howells <dhowells@redhat.com> > Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> > --- > net/ceph/ceph_common.c | 14 ++++++++++++-- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c > index 4c6441536d55..5c8fd78d6bd5 100644 > --- a/net/ceph/ceph_common.c > +++ b/net/ceph/ceph_common.c > @@ -790,8 +790,18 @@ EXPORT_SYMBOL(ceph_reset_client_addr); > */ > static bool have_mon_and_osd_map(struct ceph_client *client) > { > - return client->monc.monmap && client->monc.monmap->epoch && > - client->osdc.osdmap && client->osdc.osdmap->epoch; > + bool have_mon_map = false; > + bool have_osd_map = false; > + > + mutex_lock(&client->monc.mutex); > + have_mon_map = client->monc.monmap && client->monc.monmap->epoch; > + mutex_unlock(&client->monc.mutex); > + > + down_read(&client->osdc.lock); > + have_osd_map = client->osdc.osdmap && client->osdc.osdmap->epoch; > + up_read(&client->osdc.lock); > + > + return have_mon_map && have_osd_map; > } > > /* > -- > 2.48.0 >
On Wed, 2025-02-19 at 14:44 +0200, Alex Markuze wrote: > This fixes the TOCTOU problem of accessing the epoch field after map-free. > I'd like to know if it's not leaving a correctness problem instead. Is > the return value of have_mon_and_osd_map still valid and relevant in > this case, where it is being concurrently freed? > Frankly speaking, I don't quite follow your point. The handle_one_map() [1] can change the old map on new one: if (newmap != osdc->osdmap) { <skipped> ceph_osdmap_destroy(osdc->osdmap); <--- Thread could sleep here ---> osdc->osdmap = newmap; } And have_mon_and_osd_map() [2] can try to access osdc->osdmap in the middle of this change operation without using the lock, because this osdmap change is executing under osdc.lock. So, do you mean that it is impossible case? Or do you mean that the suggested fix is not enough for fixing the issue? What is your vision of the fix then? :) The issue is not about complete stop but about of switching from one osdmap to another one. And have_mon_and_osd_map() is used in __ceph_open_session() [3] to join the ceph cluster, and open root directory: /* * mount: join the ceph cluster, and open root directory. */ int __ceph_open_session(struct ceph_client *client, unsigned long started) { unsigned long timeout = client->options->mount_timeout; long err; /* open session, and wait for mon and osd maps */ err = ceph_monc_open_session(&client->monc); if (err < 0) return err; while (!have_mon_and_osd_map(client)) { if (timeout && time_after_eq(jiffies, started + timeout)) return -ETIMEDOUT; /* wait */ dout("mount waiting for mon_map\n"); err = wait_event_interruptible_timeout(client->auth_wq, have_mon_and_osd_map(client) || (client->auth_err < 0), ceph_timeout_jiffies(timeout)); if (err < 0) return err; if (client->auth_err < 0) return client->auth_err; } pr_info("client%llu fsid %pU\n", ceph_client_gid(client), &client->fsid); ceph_debugfs_client_init(client); return 0; } EXPORT_SYMBOL(__ceph_open_session); So, we simply need to be sure that some osdmap is available, as far as I can see. Potentially, maybe, we need to NULL the osdc->osdmap in ceph_osdc_stop() [4]: void ceph_osdc_stop(struct ceph_osd_client *osdc) { destroy_workqueue(osdc->completion_wq); destroy_workqueue(osdc->notify_wq); cancel_delayed_work_sync(&osdc->timeout_work); cancel_delayed_work_sync(&osdc->osds_timeout_work); down_write(&osdc->lock); while (!RB_EMPTY_ROOT(&osdc->osds)) { struct ceph_osd *osd = rb_entry(rb_first(&osdc->osds), struct ceph_osd, o_node); close_osd(osd); } up_write(&osdc->lock); WARN_ON(refcount_read(&osdc->homeless_osd.o_ref) != 1); osd_cleanup(&osdc->homeless_osd); WARN_ON(!list_empty(&osdc->osd_lru)); WARN_ON(!RB_EMPTY_ROOT(&osdc->linger_requests)); WARN_ON(!RB_EMPTY_ROOT(&osdc->map_checks)); WARN_ON(!RB_EMPTY_ROOT(&osdc->linger_map_checks)); WARN_ON(atomic_read(&osdc->num_requests)); WARN_ON(atomic_read(&osdc->num_homeless)); ceph_osdmap_destroy(osdc->osdmap); <--- Here, we need to NULL the poiner mempool_destroy(osdc->req_mempool); ceph_msgpool_destroy(&osdc->msgpool_op); ceph_msgpool_destroy(&osdc->msgpool_op_reply); } Thanks, Slava. [1] https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/osd_client.c#L4025 [2] https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/ceph_common.c#L791 [3] https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/ceph_common.c#L800 [4] https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/osd_client.c#L5285
It's a good fix, I just want to make sure we are not leaving more subtle correctness issues. Adding a NULL and a proper cleanup sounds like a good practice. On Wed, Feb 19, 2025 at 10:13 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote: > > On Wed, 2025-02-19 at 14:44 +0200, Alex Markuze wrote: > > This fixes the TOCTOU problem of accessing the epoch field after map-free. > > I'd like to know if it's not leaving a correctness problem instead. Is > > the return value of have_mon_and_osd_map still valid and relevant in > > this case, where it is being concurrently freed? > > > > Frankly speaking, I don't quite follow your point. > > The handle_one_map() [1] can change the old map on new one: > > if (newmap != osdc->osdmap) { > <skipped> > ceph_osdmap_destroy(osdc->osdmap); > <--- Thread could sleep here ---> > osdc->osdmap = newmap; > } > > And have_mon_and_osd_map() [2] can try to access osdc->osdmap > in the middle of this change operation without using the lock, > because this osdmap change is executing under osdc.lock. > > So, do you mean that it is impossible case? Or do you mean that > the suggested fix is not enough for fixing the issue? What is your > vision of the fix then? :) > > The issue is not about complete stop but about of switching from > one osdmap to another one. And have_mon_and_osd_map() is used > in __ceph_open_session() [3] to join the ceph cluster, and open > root directory: > > /* > * mount: join the ceph cluster, and open root directory. > */ > int __ceph_open_session(struct ceph_client *client, unsigned long started) > { > unsigned long timeout = client->options->mount_timeout; > long err; > > /* open session, and wait for mon and osd maps */ > err = ceph_monc_open_session(&client->monc); > if (err < 0) > return err; > > while (!have_mon_and_osd_map(client)) { > if (timeout && time_after_eq(jiffies, started + timeout)) > return -ETIMEDOUT; > > /* wait */ > dout("mount waiting for mon_map\n"); > err = wait_event_interruptible_timeout(client->auth_wq, > have_mon_and_osd_map(client) || (client->auth_err < 0), > ceph_timeout_jiffies(timeout)); > if (err < 0) > return err; > if (client->auth_err < 0) > return client->auth_err; > } > > pr_info("client%llu fsid %pU\n", ceph_client_gid(client), > &client->fsid); > ceph_debugfs_client_init(client); > > return 0; > } > EXPORT_SYMBOL(__ceph_open_session); > > So, we simply need to be sure that some osdmap is available, > as far as I can see. Potentially, maybe, we need to NULL > the osdc->osdmap in ceph_osdc_stop() [4]: > > void ceph_osdc_stop(struct ceph_osd_client *osdc) > { > destroy_workqueue(osdc->completion_wq); > destroy_workqueue(osdc->notify_wq); > cancel_delayed_work_sync(&osdc->timeout_work); > cancel_delayed_work_sync(&osdc->osds_timeout_work); > > down_write(&osdc->lock); > while (!RB_EMPTY_ROOT(&osdc->osds)) { > struct ceph_osd *osd = rb_entry(rb_first(&osdc->osds), > struct ceph_osd, o_node); > close_osd(osd); > } > up_write(&osdc->lock); > WARN_ON(refcount_read(&osdc->homeless_osd.o_ref) != 1); > osd_cleanup(&osdc->homeless_osd); > > WARN_ON(!list_empty(&osdc->osd_lru)); > WARN_ON(!RB_EMPTY_ROOT(&osdc->linger_requests)); > WARN_ON(!RB_EMPTY_ROOT(&osdc->map_checks)); > WARN_ON(!RB_EMPTY_ROOT(&osdc->linger_map_checks)); > WARN_ON(atomic_read(&osdc->num_requests)); > WARN_ON(atomic_read(&osdc->num_homeless)); > > ceph_osdmap_destroy(osdc->osdmap); <--- Here, we need to NULL the > poiner > mempool_destroy(osdc->req_mempool); > ceph_msgpool_destroy(&osdc->msgpool_op); > ceph_msgpool_destroy(&osdc->msgpool_op_reply); > } > > Thanks, > Slava. > > [1] > https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/osd_client.c#L4025 > [2] > https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/ceph_common.c#L791 > [3] > https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/ceph_common.c#L800 > [4] > https://elixir.bootlin.com/linux/v6.14-rc2/source/net/ceph/osd_client.c#L5285 > > >
On Thu, 2025-02-20 at 13:43 +0200, Alex Markuze wrote: > It's a good fix, I just want to make sure we are not leaving more > subtle correctness issues. > Adding a NULL and a proper cleanup sounds like a good practice. > Sounds good! Let me slightly rework the patch. Thanks, Slava.
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c index 4c6441536d55..5c8fd78d6bd5 100644 --- a/net/ceph/ceph_common.c +++ b/net/ceph/ceph_common.c @@ -790,8 +790,18 @@ EXPORT_SYMBOL(ceph_reset_client_addr); */ static bool have_mon_and_osd_map(struct ceph_client *client) { - return client->monc.monmap && client->monc.monmap->epoch && - client->osdc.osdmap && client->osdc.osdmap->epoch; + bool have_mon_map = false; + bool have_osd_map = false; + + mutex_lock(&client->monc.mutex); + have_mon_map = client->monc.monmap && client->monc.monmap->epoch; + mutex_unlock(&client->monc.mutex); + + down_read(&client->osdc.lock); + have_osd_map = client->osdc.osdmap && client->osdc.osdmap->epoch; + up_read(&client->osdc.lock); + + return have_mon_map && have_osd_map; } /*