Message ID: 1651652269-15342-1-git-send-email-byungchul.park@lge.com
Series: DEPT (Dependency Tracker)
On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
>
> Hi Linus and folks,
>
> I've been developing a tool for detecting deadlock possibilities by
> tracking wait/event rather than lock(?) acquisition order, to try to
> cover all synchronization mechanisms.

So what is the actual status of reports these days?

Last time I looked at some reports, it gave a lot of false positives
due to mis-understanding prepare_to_sleep().

For this all to make sense, it would need to not have false positives
(or at least a very small number of them, together with a way to
sanely get rid of them), and also have a track record of finding
things that lockdep doesn't.

Maybe such reports have been sent out with the current situation, and
I haven't seen them.

              Linus
Linus wrote:
>
> On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> >
> > Hi Linus and folks,
> >
> > I've been developing a tool for detecting deadlock possibilities by
> > tracking wait/event rather than lock(?) acquisition order, to try to
> > cover all synchronization mechanisms.
>
> So what is the actual status of reports these days?
>
> Last time I looked at some reports, it gave a lot of false positives
> due to mis-understanding prepare_to_sleep().

Yes, it did. I handled that case in the following way:

1. Stage the wait at prepare_to_sleep(); it might be used at commit.
   At this point it is not yet an actual wait that Dept considers.
2. If the condition for sleep is true, the wait gets committed at
   __schedule(). The wait becomes an actual one that Dept considers.
3. If the condition is false and the task gets back to TASK_RUNNING,
   clean (= reset) the staged wait.

That way, Dept only works with the waits through sleep that actually
hit __schedule().

> For this all to make sense, it would need to not have false positives
> (or at least a very small number of them together with a way to sanely

Yes, I agree with you. I got rid of them in the way I described above.

> get rid of them), and also have a track record of finding things that
> lockdep doesn't.

I have some reports in which wait_for_completion or a waitqueue is
involved. It's worth noting that those are not tracked by Lockdep. I'm
checking whether they are true positives, and I will share those
reports once I'm more confident about them.

> Maybe such reports have been sent out with the current situation, and
> I haven't seen them.

Dept reports have usually been sent to me privately, not to LKML. As I
said, I'm planning to share them.

Byungchul

>               Linus
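A hedged sketch of the stage/commit/clean scheme just described, as a
tiny standalone state machine (illustrative only; every name below is
invented for the example and none of it is DEPT code):

#include <stdio.h>

/*
 * Toy model of the staging scheme described above; not DEPT code.
 * A wait is only reported to the checker once it is COMMITTED,
 * i.e. once the sleeping path actually reaches __schedule().
 */
enum wait_state { WAIT_NONE, WAIT_STAGED, WAIT_COMMITTED };

static enum wait_state state = WAIT_NONE;

static void stage_wait(void)    /* 1. at prepare_to_sleep()          */
{
	state = WAIT_STAGED;
}

static void commit_wait(void)   /* 2. at __schedule(): becomes real  */
{
	if (state == WAIT_STAGED)
		state = WAIT_COMMITTED;
}

static void clean_wait(void)    /* 3. condition false, TASK_RUNNING  */
{
	state = WAIT_NONE;
}

int main(void)
{
	stage_wait();
	clean_wait();           /* woken before sleeping: no report   */
	printf("committed? %d\n", state == WAIT_COMMITTED);  /* 0 */

	stage_wait();
	commit_wait();          /* actually slept: this wait counts   */
	printf("committed? %d\n", state == WAIT_COMMITTED);  /* 1 */
	return 0;
}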
On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> Linus wrote:
> >
> > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> > >
> > > Hi Linus and folks,
> > >
> > > I've been developing a tool for detecting deadlock possibilities by
> > > tracking wait/event rather than lock(?) acquisition order, to try to
> > > cover all synchronization mechanisms.
> >
> > So what is the actual status of reports these days?
> >
> > Last time I looked at some reports, it gave a lot of false positives
> > due to mis-understanding prepare_to_sleep().
>
> Yes, it did. I handled that case in the following way:
>
> 1. Stage the wait at prepare_to_sleep(); it might be used at commit.
>    At this point it is not yet an actual wait that Dept considers.
> 2. If the condition for sleep is true, the wait gets committed at
>    __schedule(). The wait becomes an actual one that Dept considers.
> 3. If the condition is false and the task gets back to TASK_RUNNING,
>    clean (= reset) the staged wait.
>
> That way, Dept only works with the waits through sleep that actually
> hit __schedule().
>
> > For this all to make sense, it would need to not have false positives
> > (or at least a very small number of them together with a way to sanely
>
> Yes, I agree with you. I got rid of them in the way I described above.
>

IMHO DEPT should not report what lockdep allows (not talking about
wait events). I mean, lockdep allows some kinds of nested locks, but
DEPT reports them. When I was collecting reports from DEPT on various
configurations, most of them were reports of down_write_nested(),
which is allowed by lockdep.

DEPT should not report at least what we know is not a real deadlock.
Otherwise there will be reports that are never fixed, which is quite
unpleasant, and reporters cannot examine every one of them to tell
whether it is a real deadlock or not.

> > get rid of them), and also have a track record of finding things that
> > lockdep doesn't.
>
> I have some reports in which wait_for_completion or a waitqueue is
> involved. It's worth noting that those are not tracked by Lockdep. I'm
> checking whether they are true positives, and I will share those
> reports once I'm more confident about them.
>
> > Maybe such reports have been sent out with the current situation, and
> > I haven't seen them.
>
> Dept reports have usually been sent to me privately, not to LKML. As I
> said, I'm planning to share them.
>
> Byungchul
>
> >               Linus
On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > Linus wrote:
> > > [...]
> > > For this all to make sense, it would need to not have false positives
> > > (or at least a very small number of them together with a way to sanely
> >
> > Yes, I agree with you. I got rid of them in the way I described above.
>
> IMHO DEPT should not report what lockdep allows (not talking about

No.

> wait events). I mean, lockdep allows some kinds of nested locks, but
> DEPT reports them.

You have already asked exactly the same question in another LKML
thread. I answered it then, but let me explain it again.

---

CASE 1.

   lock L with depth n
   lock_nested L' with depth n + 1
   ...
   unlock L'
   unlock L

This case is allowed by Lockdep.
This case is allowed by DEPT because it's not a deadlock.

CASE 2.

   lock L with depth n
   lock A
   lock_nested L' with depth n + 1
   ...
   unlock L'
   unlock A
   unlock L

This case is allowed by Lockdep.
This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.

---

The following scenario would explain why CASE 2 is problematic.

   THREAD X                          THREAD Y

   lock L with depth n
                                     lock L' with depth n
   lock A
                                     lock A
   lock_nested L' with depth n + 1
                                     lock_nested L'' with depth n + 1
   ...                               ...
   unlock L'                         unlock L''
   unlock A                          unlock A
   unlock L                          unlock L'

Yes, I need to check whether the report you shared with me is a true
one, but it's not because DEPT doesn't work with *_nested() APIs.

Byungchul
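To make CASE 2 concrete, here is a hedged kernel-style sketch of the
pattern (purely illustrative; the struct and lock names are invented,
and this is not code from DEPT or from any report in this thread):

/*
 * Illustrative sketch of CASE 2 above; not from any real subsystem.
 * Two contexts walk a parent->child chain of the same lock class,
 * taking an unrelated lock A in between.
 */
static DEFINE_MUTEX(a_lock);                     /* lock A */

static void walk(struct node *parent, struct node *child)
{
	mutex_lock(&parent->lock);               /* L,  depth n */
	mutex_lock(&a_lock);                     /* A           */
	mutex_lock_nested(&child->lock, SINGLE_DEPTH_NESTING); /* L', n+1 */
	/* ... */
	mutex_unlock(&child->lock);
	mutex_unlock(&a_lock);
	mutex_unlock(&parent->lock);
}

/*
 * THREAD X: walk(n1, n2) ends up holding a_lock, waiting on n2->lock.
 * THREAD Y: walk(n2, n3) ends up holding n2->lock, waiting on a_lock.
 * The _nested() annotation silences the L -> L' check, but A and L'
 * are still inverted between the two threads: an ABBA deadlock.
 */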
On Wed, May 04, 2022 at 11:17:02AM -0700, Linus Torvalds wrote:
> On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> >
> > Hi Linus and folks,
> >
> > I've been developing a tool for detecting deadlock possibilities by
> > tracking wait/event rather than lock(?) acquisition order, to try to
> > cover all synchronization mechanisms.
>
> So what is the actual status of reports these days?

I'd like to mention one important thing here. The reporting would get
stronger as more wait-event pairs get tagged, everywhere DEPT can
work. Anything, e.g. a HW-SW interface, any retry logic and so on, can
be a wait-event pair as long as it waits for an event in some form.
For example, polling on an IO-mapped read register and initiating the
HW work that produces the event can also be a pair. Definitely those
make DEPT more useful.

---

The way to use the APIs:

1. Define an SDT (Simple Dependency Tracker):

      DEFINE_DEPT_SDT(my_hw_event);              <- add this

2. Tag the waits:

      sdt_wait(&my_hw_event);                    <- add this
      ... retry logic until my hw work done ...  <- the original code

3. Tag the events:

      sdt_event(&my_hw_event);                   <- add this
      run_my_hw();                               <- the original code

---

These are all we should do. I believe DEPT would be a very useful tool
once all wait-event pairs get tagged by the developers in all
subsystems and device drivers.

Byungchul

> Last time I looked at some reports, it gave a lot of false positives
> due to mis-understanding prepare_to_sleep().
>
> For this all to make sense, it would need to not have false positives
> (or at least a very small number of them, together with a way to
> sanely get rid of them), and also have a track record of finding
> things that lockdep doesn't.
>
> Maybe such reports have been sent out with the current situation, and
> I haven't seen them.
>
>               Linus
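As an example of the HW-SW interface case, tagging a polled status
register might look roughly like this. Only DEFINE_DEPT_SDT, sdt_wait
and sdt_event come from the description above; the register bits,
helpers and timings are invented for the sketch:

/*
 * Hedged sketch of tagging a HW poll loop with the SDT API described
 * above. HW_DONE_BIT/HW_GO_BIT and both helpers are made-up names.
 */
DEFINE_DEPT_SDT(my_hw_done);

static int wait_hw_done(void __iomem *status)
{
	int retries = 1000;

	sdt_wait(&my_hw_done);            /* tag the wait side    */
	while (!(readl(status) & HW_DONE_BIT) && --retries)
		udelay(10);
	return retries ? 0 : -ETIMEDOUT;
}

static void kick_hw(void __iomem *ctrl)
{
	sdt_event(&my_hw_done);           /* tag the event side   */
	writel(HW_GO_BIT, ctrl);          /* initiate the HW work */
}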
On Mon, 9 May 2022 09:16:37 +0900
Byungchul Park <byungchul.park@lge.com> wrote:

> CASE 2.
>
>    lock L with depth n
>    lock A
>    lock_nested L' with depth n + 1
>    ...
>    unlock L'
>    unlock A
>    unlock L
>
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.
>
> ---
>
> The following scenario would explain why CASE 2 is problematic.
>
>    THREAD X                          THREAD Y
>
>    lock L with depth n
>                                      lock L' with depth n
>    lock A
>                                      lock A
>    lock_nested L' with depth n + 1

I'm confused by what exactly you are saying is a deadlock above.

Are you saying that lock A and L' are inversed? If so, lockdep had
better detect that regardless of L. A nested lock associates the
nesting with the same type of lock. That is, in lockdep, "nested"
tells lockdep not to trigger on L and L', but it will not ignore that
A was taken.

-- Steve

>                                      lock_nested L'' with depth n + 1
>    ...                               ...
>    unlock L'                         unlock L''
>    unlock A                          unlock A
>    unlock L                          unlock L'
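In code, Steve's point reads roughly like this (my annotation sketch,
with invented struct names; the "recorded" comments paraphrase
lockdep's chain tracking as described in this thread, not verified
against current lockdep behavior):

/*
 * Sketch of what a _nested() annotation suppresses, and what it
 * does not. parent/child/a are invented names.
 */
static void example(struct node *parent, struct node *child)
{
	mutex_lock(&parent->lock);      /* class L, subclass 0        */
	mutex_lock(&a);                 /* chain records L -> A       */
	mutex_lock_nested(&child->lock, SINGLE_DEPTH_NESTING);
					/* the L -> L' self-dependency
					 * is suppressed, but A stays
					 * in the held-lock chain     */
	mutex_unlock(&child->lock);
	mutex_unlock(&a);
	mutex_unlock(&parent->lock);
}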
I tried DEPT-v6 applied against 5.18-rc5, and it reported the
following false positive. The reason why it's nonsense is that in
context A's [W] wait:

[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928] folio_wait_bit_common+0x2fa/0x460
[ 1538.546248] __filemap_get_folio+0x3e4/0x420
[ 1538.546558] pagecache_get_page+0x11/0x40
[ 1538.546852] ext4_mb_init_group+0x80/0x2e0
[ 1538.547152] ext4_mb_good_group_nolock+0x2a3/0x2d0
	...

... we're reading the block allocation bitmap into the page cache.
This does not correspond to a real inode, and so we don't actually
take ei->i_data_sem on the pseudo-inode used here.

In contrast, in context B's [W] and [E] stack traces, the
folio_wait_bit is clearly associated with a page which is mapped to a
real inode:

[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485] ext4_map_blocks+0x17b/0x680
[ 1538.554772] mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122] ext4_writepages+0x798/0x990
[ 1538.555409] do_writepages+0xcf/0x1c0
[ 1538.555682] __writeback_single_inode+0x58/0x3f0
[ 1538.556014] writeback_sb_inodes+0x210/0x540
	...

[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455] ext4_bio_write_page+0x400/0x560
[ 1538.559765] mpage_submit_page+0x5c/0x80
[ 1538.560051] mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409] mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764] ext4_writepages+0x798/0x990
[ 1538.561057] do_writepages+0xcf/0x1c0
[ 1538.561329] __writeback_single_inode+0x58/0x3f0
	...

In any case, this will ***never*** deadlock, and it's due to DEPT
fundamentally not understanding that waits on different pages may
involve pages belonging to completely different inodes, so there is
zero chance this would ever deadlock. I suspect there will be similar
false positives for tests (or userspace) that use the
copy_file_range(2) or sendfile(2) system calls.

I've included the full DEPT log report below.

						- Ted

generic/011		[20:11:16][ 1533.411773] run fstests generic/011 at 2022-05-07 20:11:16
[ 1533.509603] DEPT_INFO_ONCE: Need to expand the ring buffer.
[ 1536.910044] DEPT_INFO_ONCE: Pool(wait) is empty.
[ 1538.533315] ===================================================
[ 1538.533793] DEPT: Circular dependency has been detected.
[ 1538.534199] 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571 Not tainted
[ 1538.534645] ---------------------------------------------------
[ 1538.535035] summary
[ 1538.535177] ---------------------------------------------------
[ 1538.535567] *** DEADLOCK ***
[ 1538.535567]
[ 1538.535854] context A
[ 1538.536008] [S] down_write(&ei->i_data_sem:0)
[ 1538.536323] [W] folio_wait_bit_common(pglocked:0)
[ 1538.536655] [E] up_write(&ei->i_data_sem:0)
[ 1538.536958]
[ 1538.537063] context B
[ 1538.537216] [S] (unknown)(pglocked:0)
[ 1538.537480] [W] down_write(&ei->i_data_sem:0)
[ 1538.537789] [E] folio_wake_bit(pglocked:0)
[ 1538.538082]
[ 1538.538184] [S]: start of the event context
[ 1538.538460] [W]: the wait blocked
[ 1538.538680] [E]: the event not reachable
[ 1538.538939] ---------------------------------------------------
[ 1538.539327] context A's detail
[ 1538.539530] ---------------------------------------------------
[ 1538.539918] context A
[ 1538.540072] [S] down_write(&ei->i_data_sem:0)
[ 1538.540382] [W] folio_wait_bit_common(pglocked:0)
[ 1538.540712] [E] up_write(&ei->i_data_sem:0)
[ 1538.541015]
[ 1538.541119] [S] down_write(&ei->i_data_sem:0):
[ 1538.541410] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.541782] stacktrace:
[ 1538.541946] ext4_map_blocks+0x17b/0x680
[ 1538.542234] ext4_getblk+0x5f/0x1f0
[ 1538.542493] ext4_bread+0xc/0x70
[ 1538.542736] ext4_append+0x48/0xf0
[ 1538.542991] ext4_init_new_dir+0xc8/0x160
[ 1538.543284] ext4_mkdir+0x19a/0x320
[ 1538.543542] vfs_mkdir+0x83/0xe0
[ 1538.543788] do_mkdirat+0x8c/0x130
[ 1538.544042] __x64_sys_mkdir+0x29/0x30
[ 1538.544319] do_syscall_64+0x40/0x90
[ 1538.544584] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.544949]
[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928] folio_wait_bit_common+0x2fa/0x460
[ 1538.546248] __filemap_get_folio+0x3e4/0x420
[ 1538.546558] pagecache_get_page+0x11/0x40
[ 1538.546852] ext4_mb_init_group+0x80/0x2e0
[ 1538.547152] ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.547496] ext4_mb_regular_allocator+0x391/0x780
[ 1538.547840] ext4_mb_new_blocks+0x44e/0x720
[ 1538.548145] ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.548455] ext4_map_blocks+0x19e/0x680
[ 1538.548743] ext4_getblk+0x5f/0x1f0
[ 1538.549006] ext4_bread+0xc/0x70
[ 1538.549250] ext4_append+0x48/0xf0
[ 1538.549505] ext4_init_new_dir+0xc8/0x160
[ 1538.549798] ext4_mkdir+0x19a/0x320
[ 1538.550058] vfs_mkdir+0x83/0xe0
[ 1538.550302] do_mkdirat+0x8c/0x130
[ 1538.550557]
[ 1538.550660] [E] up_write(&ei->i_data_sem:0):
[ 1538.550940] (N/A)
[ 1538.551071] ---------------------------------------------------
[ 1538.551459] context B's detail
[ 1538.551662] ---------------------------------------------------
[ 1538.552047] context B
[ 1538.552202] [S] (unknown)(pglocked:0)
[ 1538.552466] [W] down_write(&ei->i_data_sem:0)
[ 1538.552775] [E] folio_wake_bit(pglocked:0)
[ 1538.553071]
[ 1538.553174] [S] (unknown)(pglocked:0):
[ 1538.553422] (N/A)
[ 1538.553553]
[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485] ext4_map_blocks+0x17b/0x680
[ 1538.554772] mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122] ext4_writepages+0x798/0x990
[ 1538.555409] do_writepages+0xcf/0x1c0
[ 1538.555682] __writeback_single_inode+0x58/0x3f0
[ 1538.556014] writeback_sb_inodes+0x210/0x540
[ 1538.556324] __writeback_inodes_wb+0x4c/0xe0
[ 1538.556635] wb_writeback+0x298/0x450
[ 1538.556911] wb_do_writeback+0x29e/0x320
[ 1538.557199] wb_workfn+0x6a/0x2c0
[ 1538.557447] process_one_work+0x302/0x650
[ 1538.557743] worker_thread+0x55/0x400
[ 1538.558013] kthread+0xf0/0x120
[ 1538.558251] ret_from_fork+0x1f/0x30
[ 1538.558518]
[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455] ext4_bio_write_page+0x400/0x560
[ 1538.559765] mpage_submit_page+0x5c/0x80
[ 1538.560051] mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409] mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764] ext4_writepages+0x798/0x990
[ 1538.561057] do_writepages+0xcf/0x1c0
[ 1538.561329] __writeback_single_inode+0x58/0x3f0
[ 1538.561662] writeback_sb_inodes+0x210/0x540
[ 1538.561973] __writeback_inodes_wb+0x4c/0xe0
[ 1538.562283] wb_writeback+0x298/0x450
[ 1538.562555] wb_do_writeback+0x29e/0x320
[ 1538.562842] wb_workfn+0x6a/0x2c0
[ 1538.563095] process_one_work+0x302/0x650
[ 1538.563387] worker_thread+0x55/0x400
[ 1538.563658] kthread+0xf0/0x120
[ 1538.563895] ret_from_fork+0x1f/0x30
[ 1538.564161] ---------------------------------------------------
[ 1538.564548] information that might be helpful
[ 1538.564832] ---------------------------------------------------
[ 1538.565223] CPU: 1 PID: 46539 Comm: dirstress Not tainted 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571
[ 1538.565854] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 1538.566394] Call Trace:
[ 1538.566559] <TASK>
[ 1538.566701] dump_stack_lvl+0x4f/0x68
[ 1538.566945] print_circle.cold+0x15b/0x169
[ 1538.567218] ? print_circle+0xe0/0xe0
[ 1538.567461] cb_check_dl+0x55/0x60
[ 1538.567687] bfs+0xd5/0x1b0
[ 1538.567874] add_dep+0xd3/0x1a0
[ 1538.568083] ? __filemap_get_folio+0x3e4/0x420
[ 1538.568374] add_wait+0xe3/0x250
[ 1538.568590] ? __filemap_get_folio+0x3e4/0x420
[ 1538.568886] dept_wait_split_map+0xb1/0x130
[ 1538.569163] folio_wait_bit_common+0x2fa/0x460
[ 1538.569456] ? lock_is_held_type+0xfc/0x130
[ 1538.569733] __filemap_get_folio+0x3e4/0x420
[ 1538.570013] ? __lock_release+0x1b2/0x2c0
[ 1538.570278] pagecache_get_page+0x11/0x40
[ 1538.570543] ext4_mb_init_group+0x80/0x2e0
[ 1538.570813] ? ext4_get_group_desc+0xb2/0x200
[ 1538.571102] ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.571418] ext4_mb_regular_allocator+0x391/0x780
[ 1538.571733] ? rcu_read_lock_sched_held+0x3f/0x70
[ 1538.572044] ? trace_kmem_cache_alloc+0x2c/0xd0
[ 1538.572343] ? kmem_cache_alloc+0x1f7/0x3f0
[ 1538.572618] ext4_mb_new_blocks+0x44e/0x720
[ 1538.572896] ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.573179] ? find_held_lock+0x2b/0x80
[ 1538.573434] ext4_map_blocks+0x19e/0x680
[ 1538.573693] ext4_getblk+0x5f/0x1f0
[ 1538.573927] ext4_bread+0xc/0x70
[ 1538.574141] ext4_append+0x48/0xf0
[ 1538.574369] ext4_init_new_dir+0xc8/0x160
[ 1538.574634] ext4_mkdir+0x19a/0x320
[ 1538.574866] vfs_mkdir+0x83/0xe0
[ 1538.575082] do_mkdirat+0x8c/0x130
[ 1538.575308] __x64_sys_mkdir+0x29/0x30
[ 1538.575557] do_syscall_64+0x40/0x90
[ 1538.575795] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.576128] RIP: 0033:0x7f0960466b07
[ 1538.576367] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
[ 1538.577576] RSP: 002b:00007ffd0fa955a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 1538.578069] RAX: ffffffffffffffda RBX: 0000000000000239 RCX: 00007f0960466b07
[ 1538.578533] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffd0fa955d0
[ 1538.578995] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000010
[ 1538.579458] R10: 00007ffd0fa95345 R11: 0000000000000246 R12: 00000000000003e8
[ 1538.579923] R13: 0000000000000000 R14: 00007ffd0fa955d0 R15: 00007ffd0fa95dd0
[ 1538.580389] </TASK>
[ 1540.581382] EXT4-fs (vdb): mounted filesystem with ordered data mode. Quota mode: none.
 [20:11:24] 8s

P.S.  Later on the console, the test ground to a halt because DEPT
started WARNING over and over and over again....

[ 3129.686102] DEPT_WARN_ON: dt->ecxt_held_pos == DEPT_MAX_ECXT_HELD
[ 3129.686396] ? __might_fault+0x32/0x80
[ 3129.686660] WARNING: CPU: 1 PID: 107320 at kernel/dependency/dept.c:1537 add_ecxt+0x1c0/0x1d0
[ 3129.687040] ? __might_fault+0x32/0x80
[ 3129.687282] CPU: 1 PID: 107320 Comm: aio-stress Tainted: G        W 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571

... with multiple CPUs completely spamming the serial console. This
should probably be a WARN_ON_ONCE, or something that disables DEPT
entirely, since apparently no useful DEPT reports (or any useful
kernel work, for that matter) are going to happen after this.
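For what it's worth, the once-only, self-disabling variant being
suggested could look something like this (a sketch under the
assumption that DEPT has, or grows, a global enable flag;
dept_enabled and the macro body are invented, not actual DEPT code):

/*
 * Sketch only: warn once and turn the checker off, so one internal
 * overflow can't spam the console. A real version would need an
 * atomic flag that all DEPT entry points check.
 */
static bool dept_enabled = true;

#define DEPT_WARN_ON_ONCE(cond)						\
({									\
	bool __ret = !!(cond);						\
	if (__ret && dept_enabled) {					\
		dept_enabled = false;	/* stop all DEPT tracking */	\
		WARN_ONCE(1, "DEPT_WARN_ON_ONCE: %s\n", #cond);		\
	}								\
	__ret;								\
})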
Oh, one other problem with DEPT --- it's SLOW --- the overhead is
enormous. Using kvm-xfstests[1] running "kvm-xfstests smoke", here are
some sample times:

                        LOCKDEP         DEPT
Time to first test      49 seconds      602 seconds
ext4/001                2 s             22 s
ext4/003                2 s             8 s
ext4/005                0 s             7 s
ext4/020                1 s             8 s
ext4/021                11 s            17 s
ext4/023                0 s             83 s
generic/001             4 s             76 s
generic/002             0 s             11 s
generic/003             10 s            19 s

There are some large variations; in some cases, some xfstests take 10x
as much time or more to run. In fact, when I first started the
kvm-xfstests run with DEPT, I thought something had hung and that the
tests would never start. (In fact, with gce-xfstests the default
"something has gone terribly wrong with the kexec" watchdog had fired,
and I didn't get any test results using gce-xfstests at all. If DEPT
goes in without any optimizations, I'm going to have to adjust the
watchdog timers for gce-xfstests.)

The bottom line is that at the moment, between the false positives and
the significant overhead imposed by DEPT, I would suggest that if DEPT
ever does go in, it should be possible to disable DEPT and only use
the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just because
DEPT is S - L - O - W.

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

						- Ted

P.S.  Darrick and I have both disabled using LOCKDEP by default
because it slows down ext4 -g auto testing by a factor of 2, and xfs
-g auto testing by a factor of 3. So the fact that DEPT is a factor of
2x to 10x or more slower than LOCKDEP when running various xfstests
should be a real concern.
On Mon, May 09, 2022 at 04:47:12PM -0400, Steven Rostedt wrote:
> On Mon, 9 May 2022 09:16:37 +0900
> Byungchul Park <byungchul.park@lge.com> wrote:
>
> > CASE 2.
> >
> >    lock L with depth n
> >    lock A
> >    lock_nested L' with depth n + 1
> >    ...
> >    unlock L'
> >    unlock A
> >    unlock L
> >
> > This case is allowed by Lockdep.
> > This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.
>
> I'm confused by what exactly you are saying is a deadlock above.
>
> Are you saying that lock A and L' are inversed? If so, lockdep had
> better

Hi Steven,

Yes, I was talking about A and L'.

> detect that regardless of L. A nested lock associates the nesting
> with

When I checked the Lockdep code, L' with depth n + 1 and L' with depth
n have different classes in Lockdep. That's why I said Lockdep cannot
detect it. By any chance, has it changed so as to consider this case?
Or am I missing something?

> the same type of lock. That is, in lockdep, "nested" tells lockdep
> not to trigger on L and L', but it will not ignore that A was taken.

It will not ignore A, but it would work like this:

   THREAD X                THREAD Y

   lock Ln                 lock Ln
   lock A                  lock A
   lock_nested Lm          lock_nested Lm

So Lockdep considers this case safe, though actually it is not.

Byungchul
On Mon, May 09, 2022 at 06:28:17PM -0400, Theodore Ts'o wrote:
> Oh, one other problem with DEPT --- it's SLOW --- the overhead is
> enormous. Using kvm-xfstests[1] running "kvm-xfstests smoke", here
> are some sample times:

Yes, right. DEPT has never been optimized. It even turns on
CONFIG_LOCKDEP and CONFIG_PROVE_LOCKING when CONFIG_DEPT gets enabled,
because of a porting issue. I have no choice but to rely on those to
develop DEPT out of tree. Of course, that's something I don't like.

Plus, for now, I'm focusing on removing false positives. Once that is
considered settled, I will work on performance optimization. But DEPT
will have to keep relying on the Lockdep configs, adding its overhead
on top of theirs, until it can be developed in the tree.

>                         LOCKDEP         DEPT
> Time to first test      49 seconds      602 seconds
> ext4/001                2 s             22 s
> ext4/003                2 s             8 s
> ext4/005                0 s             7 s
> ext4/020                1 s             8 s
> ext4/021                11 s            17 s
> ext4/023                0 s             83 s
> generic/001             4 s             76 s
> generic/002             0 s             11 s
> generic/003             10 s            19 s
>
> There are some large variations; in some cases, some xfstests take
> 10x as much time or more to run. In fact, when I first started the
> kvm-xfstests run with DEPT, I thought something had hung and that the
> tests would never start. (In fact, with gce-xfstests the default
> "something has gone terribly wrong with the kexec" watchdog had
> fired, and I didn't get any test results using gce-xfstests at all.
> If DEPT goes in without any optimizations, I'm going to have to
> adjust the watchdog timers for gce-xfstests.)

Thank you for letting me know. I will work on the optimization as
well.

> The bottom line is that at the moment, between the false positives
> and the significant overhead imposed by DEPT, I would suggest that if
> DEPT ever does go in, it should be possible to disable DEPT and only
> use the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just
> because DEPT is S - L - O - W.
>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
>
>						- Ted
>
> P.S.  Darrick and I have both disabled using LOCKDEP by default
> because it slows down ext4 -g auto testing by a factor of 2, and xfs
> -g auto testing by a factor of 3. So the fact that DEPT is a factor
> of 2x to 10x or more slower than LOCKDEP when running various
> xfstests should be a real concern.

DEPT is tracking way more objects than Lockdep, so it's inevitably
slower, but let me try to bring it to a performance similar to
Lockdep's.

Byungchul
On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> Yes, right. DEPT has never been optimized. It even turns on
> CONFIG_LOCKDEP and CONFIG_PROVE_LOCKING when CONFIG_DEPT gets
> enabled, because of a porting issue. I have no choice but to rely on
> those to develop DEPT out of tree. Of course, that's something I
> don't like.

Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
overhead can explain only a tiny fraction of the slowdown. Consider:
if time to first test (time to boot the kernel, set up the test
environment, figure out which tests to run, etc.) is 12 seconds w/o
LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING, and 602 seconds with
DEPT, you can really only blame 37 seconds out of the 602 seconds of
DEPT on unnecessary PROVE_LOCKING overhead.

So let's assume we can get rid of all of the PROVE_LOCKING overhead.
We're still talking about 12 seconds for time-to-first-test without
any lock debugging, versus ** 565 ** seconds for time-to-first-test
with DEPT. That's a factor of 47x for DEPT sans LOCKDEP overhead,
compared to a 4x overhead for PROVE_LOCKING.

> Plus, for now, I'm focusing on removing false positives. Once that is
> considered settled, I will work on performance optimization. But DEPT
> will have to keep relying on the Lockdep configs, adding its overhead
> on top of theirs, until it can be developed in the tree.

Well, please take a look at the false positive which I reported. I
suspect that in order to fix that particular false positive, we'll
either need to have a way to disable DEPT on waiting on all page/folio
dirty bits, or it will need to treat pages from different inodes
and/or address spaces as being entirely separate classes, instead of
collapsing all inode dirty bits, and all of the various inodes'
mutexes (such as ext4's i_data_sem), as being part of a single object
class.

> DEPT is tracking way more objects than Lockdep, so it's inevitably
> slower, but let me try to bring it to a performance similar to
> Lockdep's.

In order to eliminate some of these false positives, I suspect it's
going to increase the number of object classes that DEPT will need to
track even *more*. At which point, the cost/benefit of DEPT may get
called into question, especially if all of the false positives can't
be suppressed.

						- Ted
Ted wrote:
> On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > Yes, right. DEPT has never been optimized. [...]
>
> Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
> overhead can explain only a tiny fraction of the slowdown. [...]
> That's a factor of 47x for DEPT sans LOCKDEP overhead, compared to a
> 4x overhead for PROVE_LOCKING.

Okay. I will work on it.

> Well, please take a look at the false positive which I reported. I
> suspect that in order to fix that particular false positive, we'll
> either need to have a way to disable DEPT on waiting on all
> page/folio dirty bits, or it will need to treat pages from different
> inodes and/or address spaces as being entirely separate classes,
> instead of collapsing all inode dirty bits, and all of the various
> inodes' mutexes (such as ext4's i_data_sem), as being part of a
> single object class.

I'd rather solve it by assigning different classes to different types
of inode. That is the right way.

> In order to eliminate some of these false positives, I suspect it's
> going to increase the number of object classes that DEPT will need
> to track even *more*. At which point, the cost/benefit of DEPT may
> get called into question, especially if all of the false positives
> can't be suppressed.

Look, let's talk in general terms. There's no way to get rid of the
false positives altogether. It's a decision issue, a *balance* between
considering potential cases and considering only real ones.
Definitely, potential is not real: the more potential things we
consider, the higher the chances are that false positives appear.

But yes, the advantage we'd gain by detecting potential ones should be
higher than the risk of being bothered by false ones. Do you think a
tool is useless if it produces a few false positives? Of course it'd
be a problem if there were too many, but otherwise, I think it'd be a
great tool if the advantage > the risk.

Don't get me wrong here. It doesn't mean DEPT is perfect as it stands.
The performance should be improved, and the false alarms that appear
should be removed, of course. I'm talking about the direction.

For now, there's no tool to track wait/event itself in the Linux
kernel - a subset of the functionality exists, though. DEPT is the
first try for that purpose and can be a useful tool given the right
direction.

I know what you are concerned about. I bet it's the false positives
that are going to bother you once it's merged. I'll insist that DEPT
shouldn't be used as a mandatory testing tool until it is considered
stable enough. But what about the ones who would take advantage of
DEPT? Why don't you think of the folks who would take advantage of the
hints about the dependencies of their synchronization, especially when
their subsystem requires very complicated synchronization? Should a
tool be useful only in the final testing stage? What about its
usefulness during the development stage?

It's worth noting that DEPT works with any wait/event, so any lockup,
e.g. even one caused by a HW-SW interface, retry logic or the like,
can be detected by DEPT once all waits and events are tagged properly.
I believe the advantage of that is much higher than the downside of
facing false alarms. It's just my opinion; I'm going to respect the
majority opinion.

Byungchul

>
>						- Ted
>
On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> > [...]
> > IMHO DEPT should not report what lockdep allows (not talking about
>
> No.
>
> > wait events). I mean, lockdep allows some kinds of nested locks,
> > but DEPT reports them.
>
> You have already asked exactly the same question in another LKML
> thread. I answered it then, but let me explain it again.
>
> ---
>
> CASE 1.
>
>    lock L with depth n
>    lock_nested L' with depth n + 1
>    ...
>    unlock L'
>    unlock L
>
> This case is allowed by Lockdep.
> This case is allowed by DEPT because it's not a deadlock.
>
> CASE 2.
>
>    lock L with depth n
>    lock A
>    lock_nested L' with depth n + 1
>    ...
>    unlock L'
>    unlock A
>    unlock L
>
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.
>

Yeah, we discussed this in previous threads [1].

And the case was:
   scan_mutex -> object_lock -> kmemleak_lock -> object_lock
and DEPT reported:
   object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
as a deadlock.

But IIUC, what DEPT reported happens only under scan_mutex, and it is
not simple to just not take the locks, because the object can be
removed from the list and freed while scanning via kmemleak_free()
without kmemleak_lock and object_lock.

I'm just still not sure that someone will fix such a warning in the
future - even if the locking rule is not good - if it will not cause a
real deadlock.

> ---
>
> The following scenario would explain why CASE 2 is problematic.
>
>    THREAD X                          THREAD Y
>
>    lock L with depth n
>                                      lock L' with depth n
>    lock A
>                                      lock A
>    lock_nested L' with depth n + 1
>                                      lock_nested L'' with depth n + 1
>    ...                               ...
>    unlock L'                         unlock L''
>    unlock A                          unlock A
>    unlock L                          unlock L'
>
> Yes, I need to check whether the report you shared with me is a true
> one, but it's not because DEPT doesn't work with *_nested() APIs.

Sorry, it was not right to just say DEPT doesn't work with _nested()
APIs.

> Byungchul

[1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
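For readers without mm/kmemleak.c at hand, the shape of the ordering
being discussed is roughly this (a heavily simplified sketch from my
reading of kmemleak, with lookups and error handling elided; treat the
details as assumptions, not a faithful copy of the code):

/*
 * Heavily simplified sketch of the kmemleak ordering under
 * discussion. All of this runs under scan_mutex, which is what
 * keeps the inversion from biting in practice.
 */
static void scan_object(struct kmemleak_object *object)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&object->lock, flags);    /* object_lock */
	scan_block(/* the object's memory range */);
	raw_spin_unlock_irqrestore(&object->lock, flags);
}

static void scan_block(void *start, void *end)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&kmemleak_lock, flags);   /* kmemleak_lock */
	/* look up the object a scanned pointer refers to, then: */
	raw_spin_lock_nested(&other->lock, SINGLE_DEPTH_NESTING);
	/* ... mark it as referenced ... */
}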
On Tue, 10 May 2022 08:38:38 +0900
Byungchul Park <byungchul.park@lge.com> wrote:

> Yes, I was talking about A and L'.
>
> > detect that regardless of L. A nested lock associates the nesting
> > with
>
> When I checked the Lockdep code, L' with depth n + 1 and L' with
> depth n have different classes in Lockdep.

If that's the case, then that's a bug in lockdep.

> That's why I said Lockdep cannot detect it. By any chance, has it
> changed so as to consider this case? Or am I missing something?

No, it's not that lockdep cannot detect it; it should detect it. If it
is not detecting it, then we need to fix that.

-- Steve
On Tue, May 10, 2022 at 10:12:54AM -0400, Steven Rostedt wrote:
> On Tue, 10 May 2022 08:38:38 +0900
> Byungchul Park <byungchul.park@lge.com> wrote:
>
> > Yes, I was talking about A and L'.
> >
> > When I checked the Lockdep code, L' with depth n + 1 and L' with
> > depth n have different classes in Lockdep.
>
> If that's the case, then that's a bug in lockdep.

Yes, agreed. I should have said 'Lockdep doesn't detect it currently'
rather than 'Lockdep can't detect it'. I also think we can make it
work for this case by fixing the bug in Lockdep.

> > That's why I said Lockdep cannot detect it. By any chance, has it
> > changed so as to consider this case? Or am I missing something?
>
> No, it's not that lockdep cannot detect it; it should detect it. If
> it is not detecting it, then we need to fix that.

Yes.

Byungchul

> -- Steve
On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > [...]
> > CASE 2.
> >
> >    lock L with depth n
> >    lock A
> >    lock_nested L' with depth n + 1
> >    ...
> >    unlock L'
> >    unlock A
> >    unlock L
> >
> > This case is allowed by Lockdep.
> > This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.
>
> Yeah, we discussed this in previous threads [1].
>
> And the case was:
>    scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> and DEPT reported:
>    object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
> as a deadlock.
>
> But IIUC, what DEPT reported happens only under scan_mutex, and it is
> not simple to just not take the locks, because the object can be
> removed from the list and freed while scanning via kmemleak_free()
> without kmemleak_lock and object_lock.

That should be one of the following orders:

1. kmemleak_lock -> object_lock -> object_lock(nested)
2. object_lock -> object_lock(nested) -> kmemleak_lock

> I'm just still not sure that someone will fix such a warning in the
> future - even if the locking rule is not good - if it will not cause
> a real deadlock.

There are more important things than making the code just work for
now: for example, maintenance, and communication via the code between
the current developers and potential newcomers in the future.

At least, a comment describing why the wrong order in the code is safe
should be added. I wouldn't allow the current order in the code if I
were the maintainer.

Byungchul

> > ---
> >
> > The following scenario would explain why CASE 2 is problematic.
> >
> >    THREAD X                          THREAD Y
> >
> >    lock L with depth n
> >                                      lock L' with depth n
> >    lock A
> >                                      lock A
> >    lock_nested L' with depth n + 1
> >                                      lock_nested L'' with depth n + 1
> >    ...                               ...
> >    unlock L'                         unlock L''
> >    unlock A                          unlock A
> >    unlock L                          unlock L'
> >
> > Yes, I need to check whether the report you shared with me is a
> > true one, but it's not because DEPT doesn't work with *_nested()
> > APIs.
>
> Sorry, it was not right to just say DEPT doesn't work with _nested()
> APIs.
>
> [1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
>
> --
> Thanks,
> Hyeonggon
On Tue, May 10, 2022 at 02:37:40PM +0900, Byungchul Park wrote:
> Ted wrote:
> > On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > > DEPT is tracking way more objects than Lockdep, so it's
> > > inevitably slower, but let me try to bring it to a performance
> > > similar to Lockdep's.
> >
> > In order to eliminate some of these false positives, I suspect it's
> > going to increase the number of object classes that DEPT will need
> > to track even *more*. At which point, the cost/benefit of DEPT may
> > get called into question, especially if all of the false positives
> > can't be suppressed.
>
> [...]
>
> I know what you are concerned about. I bet it's the false positives
> that are going to bother you once it's merged. I'll insist that DEPT
> shouldn't be used as a mandatory testing tool until it is considered
> stable enough. But what about the ones who would take advantage of
> DEPT? Why don't you think of the folks who would take advantage of
> the hints about the dependencies of their synchronization, especially
> when their subsystem requires very complicated synchronization?
> Should a tool be useful only in the final testing stage? What about
> its usefulness during the development stage?
>
> It's worth noting that DEPT works with any wait/event, so any lockup,
> e.g. even one caused by a HW-SW interface, retry logic or the like,
> can be detected by DEPT once all waits and events are tagged
> properly. I believe the advantage of that is much higher than the
> downside of facing false alarms. It's just my opinion; I'm going to
> respect the majority opinion.

s/take advantage/have the benefit/g

Byungchul
On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > [...]
> > And the case was:
> >    scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > and DEPT reported:
> >    object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
> > as a deadlock.
> >
> > But IIUC, what DEPT reported happens only under scan_mutex, and it
> > is not simple to just not take the locks, because the object can be
> > removed from the list and freed while scanning via kmemleak_free()
> > without kmemleak_lock and object_lock.
>
> That should be one of the following orders:
>
> 1. kmemleak_lock -> object_lock -> object_lock(nested)
> 2. object_lock -> object_lock(nested) -> kmemleak_lock
>
> > I'm just still not sure that someone will fix such a warning in the
> > future - even if the locking rule is not good - if it will not
> > cause a real deadlock.
>
> There are more important things than making the code just work for
> now: for example, maintenance, and communication via the code between
> the current developers and potential newcomers in the future.

Then we will keep getting the same reports from DEPT until the already
existing bad code (even if it does not cause a deadlock) is reworked.
If you think that is the right thing to do, okay.

> At least, a comment describing why the wrong order in the code is
> safe should be added.

AFAIK the comment is already there in mm/kmemleak.c.

> I wouldn't allow the current order in the code if I were the
> maintainer.

[+Cc Catalin]

He may have an opinion.

Thanks,
Hyeonggon

> Byungchul
+cc mcgrof@kernel.org (firmware)
+cc holt@sgi.com (syscall reboot)

Hi Luis, Robin and folks,

I'm developing a tool for lockup detection, DEPT (Dependency Tracker).
I got a DEPT report from Hyeonggon - thanks, Hyeonggon!

It doesn't mean the code *definitely* has a deadlock. However, it
looks problematic to me, so I'd like to ask a few things to see
whether it actually is. Because Hyeonggon didn't run
decode_stacktrace.sh before sending the report to me, I don't have a
version with better debugging information, but I can explain it in
this mail.

The problematic scenario looks like:

PROCESS A           PROCESS B           WORKER C

__do_sys_reboot()
                    __do_sys_reboot()
mutex_lock(&system_transition_mutex)
...                 mutex_lock(&system_transition_mutex) <- stuck
...
                                        request_firmware_work_func()
                                        _request_firmware()
                                        firmware_fallback_sysfs()
                                        usermodehelper_read_lock_wait()
                                        down_read(&umhelper_sem)
...
                                        fw_load_sysfs_fallback()
                                        fw_sysfs_wait_timeout()
                                        wait_for_completion_killable_timeout(&fw_st->completion) <- stuck
kernel_halt()
__usermodehelper_disable()
down_write(&umhelper_sem) <- stuck

--------------------------------------------------------
All the 3 contexts are stuck at this point.
--------------------------------------------------------

PROCESS A           PROCESS B           WORKER C

...
up_write(&umhelper_sem)
...
mutex_unlock(&system_transition_mutex) <- cannot wake up B
                    ...
                    kernel_halt()
                    notifier_call_chain()
                    fw_shutdown_notify()
                    kill_pending_fw_fallback_reqs()
                    __fw_load_abort()
                    complete_all(&fw_st->completion) <- cannot wake up C
                                        ...
                                        usermodehelper_read_unlock()
                                        up_read(&umhelper_sem) <- cannot wake up A

So I think this scenario is problematic. Or am I missing something
here? Or do you think it's okay because the wait_for_completion_*()
has a timeout? AFAIK, timeouts are not supposed to fire in normal
cases.

It'd be appreciated if you share your opinion on the report.

Byungchul

---

[   18.136012][    T1] ===================================================
[   18.136419][    T1] DEPT: Circular dependency has been detected.
[   18.136782][    T1] 5.18.0-rc3-57979-gc2b89afca919 #2374 Tainted: G    B
[   18.137249][    T1] ---------------------------------------------------
[   18.137649][    T1] summary
[   18.137823][    T1] ---------------------------------------------------
[   18.138222][    T1] *** DEADLOCK ***
[   18.138222][    T1]
[   18.138569][    T1] context A
[   18.138754][    T1] [S] __mutex_lock_common(system_transition_mutex:0)
[   18.139170][    T1] [W] down_write(umhelper_sem:0)
[   18.139482][    T1] [E] mutex_unlock(system_transition_mutex:0)
[   18.139865][    T1]
[   18.140004][    T1] context B
[   18.140189][    T1] [S] (unknown)(&fw_st->completion:0)
[   18.140527][    T1] [W] __mutex_lock_common(system_transition_mutex:0)
[   18.140942][    T1] [E] complete_all(&fw_st->completion:0)
[   18.141295][    T1]
[   18.141434][    T1] context C
[   18.141618][    T1] [S] down_read(umhelper_sem:0)
[   18.141926][    T1] [W] wait_for_completion_killable_timeout(&fw_st->completion:0)
[   18.142402][    T1] [E] up_read(umhelper_sem:0)
[   18.142699][    T1]
[   18.142837][    T1] [S]: start of the event context
[   18.143134][    T1] [W]: the wait blocked
[   18.143379][    T1] [E]: the event not reachable
[   18.143661][    T1] ---------------------------------------------------
[   18.144063][    T1] context A's detail
[   18.144293][    T1] ---------------------------------------------------
[   18.144691][    T1] context A
[   18.144875][    T1] [S] __mutex_lock_common(system_transition_mutex:0)
[   18.145290][    T1] [W] down_write(umhelper_sem:0)
[   18.145602][    T1] [E] mutex_unlock(system_transition_mutex:0)
[   18.145982][    T1]
[   18.146120][    T1] [S] __mutex_lock_common(system_transition_mutex:0):
[   18.146519][    T1] [<ffffffff810ee14c>] __do_sys_reboot+0x11f/0x24f
[   18.146907][    T1] stacktrace:
[   18.147101][    T1] __mutex_lock+0x1f3/0x3f3
[   18.147396][    T1] __do_sys_reboot+0x11f/0x24f
[   18.147706][    T1] do_syscall_64+0xd4/0xfb
[   18.148001][    T1] entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.148379][    T1]
[   18.148517][    T1] [W] down_write(umhelper_sem:0):
[   18.148815][    T1] [<ffffffff810d9c14>] __usermodehelper_disable+0x80/0x17f
[   18.149243][    T1] stacktrace:
[   18.149438][    T1] __dept_wait+0x115/0x15b
[   18.149726][    T1] dept_wait+0xcd/0xf3
[   18.149993][    T1] down_write+0x4e/0x82
[   18.150266][    T1] __usermodehelper_disable+0x80/0x17f
[   18.150615][    T1] kernel_halt+0x33/0x5d
[   18.150893][    T1] __do_sys_reboot+0x197/0x24f
[   18.151201][    T1] do_syscall_64+0xd4/0xfb
[   18.151489][    T1] entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.151866][    T1]
[   18.152004][    T1] [E] mutex_unlock(system_transition_mutex:0):
[   18.152368][    T1] (N/A)
[   18.152532][    T1] ---------------------------------------------------
[   18.152931][    T1] context B's detail
[   18.153161][    T1] ---------------------------------------------------
[   18.153559][    T1] context B
[   18.153743][    T1] [S] (unknown)(&fw_st->completion:0)
[   18.154082][    T1] [W] __mutex_lock_common(system_transition_mutex:0)
[   18.154496][    T1] [E] complete_all(&fw_st->completion:0)
[   18.154848][    T1]
[   18.154987][    T1] [S] (unknown)(&fw_st->completion:0):
[   18.155310][    T1] (N/A)
[   18.155473][    T1]
[   18.155612][    T1] [W] __mutex_lock_common(system_transition_mutex:0):
[   18.156014][    T1] [<ffffffff810ee14c>] __do_sys_reboot+0x11f/0x24f
[   18.156400][    T1] stacktrace:
[   18.156594][    T1] __mutex_lock+0x1ce/0x3f3
[   18.156887][    T1] __do_sys_reboot+0x11f/0x24f
[   18.157196][    T1] do_syscall_64+0xd4/0xfb
[   18.157482][    T1] entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.157856][    T1]
[   18.157993][    T1] [E] complete_all(&fw_st->completion:0):
[   18.158330][    T1] [<ffffffff81c04230>] kill_pending_fw_fallback_reqs+0x66/0x95
[   18.158778][    T1] stacktrace:
[   18.158973][    T1] complete_all+0x28/0x58
[   18.159256][    T1] kill_pending_fw_fallback_reqs+0x66/0x95
[   18.159624][    T1] fw_shutdown_notify+0x7/0xa
[   18.159929][    T1] notifier_call_chain+0x4f/0x81
[   18.160246][    T1] blocking_notifier_call_chain+0x4c/0x64
[   18.160611][    T1] kernel_halt+0x13/0x5d
[   18.160889][    T1] __do_sys_reboot+0x197/0x24f
[   18.161197][    T1] do_syscall_64+0xd4/0xfb
[   18.161485][    T1] entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.161861][    T1] ---------------------------------------------------
[   18.162260][    T1] context C's detail
[   18.162490][    T1] ---------------------------------------------------
[   18.162889][    T1] context C
[   18.163073][    T1] [S] down_read(umhelper_sem:0)
[   18.163379][    T1] [W] wait_for_completion_killable_timeout(&fw_st->completion:0)
[   18.163855][    T1] [E] up_read(umhelper_sem:0)
[   18.164150][    T1]
[   18.164288][    T1] [S] down_read(umhelper_sem:0):
[   18.164578][    T1] [<ffffffff810d8f99>] usermodehelper_read_lock_wait+0xad/0x139
[   18.165027][    T1] stacktrace:
[   18.165220][    T1] down_read+0x74/0x85
[   18.165487][    T1] usermodehelper_read_lock_wait+0xad/0x139
[   18.165860][    T1] firmware_fallback_sysfs+0x118/0x521
[   18.166207][    T1] _request_firmware+0x7ef/0x91b
[   18.166525][    T1] request_firmware_work_func+0xb1/0x13b
[   18.166884][    T1] process_one_work+0x4c3/0x771
[   18.167199][    T1] worker_thread+0x37f/0x49e
[   18.167497][    T1] kthread+0x122/0x131
[   18.167768][    T1] ret_from_fork+0x1f/0x30
[   18.168055][    T1]
[   18.168192][    T1] [W] wait_for_completion_killable_timeout(&fw_st->completion:0):
[   18.168650][    T1] [<ffffffff81c046b7>] firmware_fallback_sysfs+0x42a/0x521
[   18.169076][    T1] stacktrace:
[   18.169270][    T1] wait_for_completion_killable_timeout+0x3c/0x58
[   18.169676][    T1] firmware_fallback_sysfs+0x42a/0x521
[   18.170024][    T1] _request_firmware+0x7ef/0x91b
[   18.170341][    T1] request_firmware_work_func+0xb1/0x13b
[   18.170699][    T1] process_one_work+0x4c3/0x771
[   18.171012][    T1] worker_thread+0x37f/0x49e
[   18.171309][    T1] kthread+0x122/0x131
[   18.171575][    T1] ret_from_fork+0x1f/0x30
[   18.171863][    T1]
[   18.172001][    T1] [E] up_read(umhelper_sem:0):
[   18.172281][    T1] (N/A)
[   18.172445][    T1] ---------------------------------------------------
[   18.172844][    T1] information that might be helpful
[   18.173151][    T1] ---------------------------------------------------
[   18.173549][    T1] CPU: 0 PID: 1 Comm: init Tainted: G    B 5.18.0-rc3-57979-gc2b89afca919 #2374
[   18.174144][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   18.174687][    T1] Call Trace:
[   18.174882][    T1] <TASK>
[   18.175056][    T1] print_circle+0x6a2/0x6f9
[   18.175326][    T1] ? extend_queue+0xf4/0xf4
[   18.175594][    T1] ? __list_add_valid+0xce/0xf6
[   18.175886][    T1] ? __list_add+0x45/0x4e
[   18.176144][    T1] ? print_circle+0x6f9/0x6f9
[   18.176422][    T1] cb_check_dl+0xc0/0xc7
[   18.176675][    T1] bfs+0x1c8/0x27b
[   18.176897][    T1] ? disconnect_class+0x24c/0x24c
[   18.177195][    T1] ? __list_add+0x45/0x4e
[   18.177454][    T1] add_dep+0x140/0x217
[   18.177697][    T1] ? add_ecxt+0x33b/0x33b
[   18.177955][    T1] ? llist_del_first+0xc/0x46
[   18.178234][    T1] add_wait+0x58a/0x723
[   18.178482][    T1] ? __thaw_task+0x3e/0x3e
[   18.178745][    T1] ? add_dep+0x217/0x217
[   18.178998][    T1] ? llist_add_batch+0x33/0x4c
[   18.179281][    T1] ? check_new_class+0x139/0x30d
[   18.179574][    T1] __dept_wait+0x115/0x15b
[   18.179837][    T1] ? __usermodehelper_disable+0x80/0x17f
[   18.180170][    T1] ? add_wait+0x723/0x723
[   18.180426][    T1] ? lock_release+0x338/0x338
[   18.180704][    T1] ? __usermodehelper_disable+0x80/0x17f
[   18.181037][    T1] dept_wait+0xcd/0xf3
[   18.181280][    T1] down_write+0x4e/0x82
[   18.181527][    T1] ? __usermodehelper_disable+0x80/0x17f
[   18.181861][    T1] __usermodehelper_disable+0x80/0x17f
[   18.182184][    T1] ? __usermodehelper_set_disable_depth+0x3a/0x3a
[   18.182565][    T1] ? dept_ecxt_exit+0x1c9/0x1f7
[   18.182854][    T1] ? blocking_notifier_call_chain+0x57/0x64
[   18.183205][    T1] kernel_halt+0x33/0x5d
[   18.183458][    T1] __do_sys_reboot+0x197/0x24f
[   18.183742][    T1] ? kernel_power_off+0x6b/0x6b
[   18.184033][    T1] ? dept_on+0x27/0x33
[   18.184275][    T1] ? dept_exit+0x38/0x42
[   18.184526][    T1] ? dept_on+0x27/0x33
[   18.184767][    T1] ? dept_on+0x27/0x33
[   18.185008][    T1] ? dept_exit+0x38/0x42
[   18.185260][    T1] ? dept_enirq_transition+0x25a/0x295
[   18.185582][    T1] ? syscall_enter_from_user_mode+0x47/0x71
[   18.185930][    T1] ? dept_aware_softirqs_disable+0x1e/0x1e
[   18.186274][    T1] ? syscall_enter_from_user_mode+0x47/0x71
[   18.186622][    T1] do_syscall_64+0xd4/0xfb
[   18.186883][    T1] ? asm_exc_page_fault+0x1e/0x30
[   18.187180][    T1] ? dept_aware_softirqs_disable+0x1e/0x1e
[   18.187526][    T1] ? dept_kernel_enter+0x15/0x1e
[   18.187821][    T1] ? do_syscall_64+0x13/0xfb
[   18.188094][    T1] entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.188443][    T1] RIP: 0033:0x4485f7
[   18.188672][    T1] Code: 00 75 05 48 83 c4 28 c3 e8 26 17 00 00 66 0f 1f 44 00 00 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8
[   18.189822][    T1] RSP: 002b:00007ffc92206f28 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[   18.190316][    T1] RAX: ffffffffffffffda RBX: 00007ffc92207118 RCX: 00000000004485f7
[   18.190784][    T1] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
[   18.191254][    T1] RBP: 00007ffc92206f30 R08: 00000000016376a0 R09: 0000000000000000
[   18.191722][    T1] R10: 00000000004c3820 R11: 0000000000000202 R12: 0000000000000001
[   18.192194][    T1] R13: 00007ffc92207108 R14: 00000000004bf8d0 R15: 0000000000000001
[   18.192667][    T1] </TASK>
Hello,

Just took a look out of curiosity.

On Thu, May 12, 2022 at 02:25:57PM +0900, Byungchul Park wrote:
>    PROCESS A          PROCESS B          WORKER C
>
>    __do_sys_reboot()
>                       __do_sys_reboot()
>    mutex_lock(&system_transition_mutex)
>    ...                mutex_lock(&system_transition_mutex) <- stuck
>    ...
>                                          request_firmware_work_func()
>                                          _request_firmware()
>                                          firmware_fallback_sysfs()
>                                          usermodehelper_read_lock_wait()
>                                          down_read(&umhelper_sem)
>    ...
>                                          fw_load_sysfs_fallback()
>                                          fw_sysfs_wait_timeout()
>                                          wait_for_completion_killable_timeout(&fw_st->completion) <- stuck
>    kernel_halt()
>    __usermodehelper_disable()
>    down_write(&umhelper_sem) <- stuck
>
>    --------------------------------------------------------
>    All the 3 contexts are stuck at this point.
>    --------------------------------------------------------
>
>    PROCESS A          PROCESS B          WORKER C
>
>    ...
>    up_write(&umhelper_sem)
>    ...
>    mutex_unlock(&system_transition_mutex) <- cannot wake up B
>
>                       ...
>                       kernel_halt()
>                       notifier_call_chain()
>                       fw_shutdown_notify()
>                       kill_pending_fw_fallback_reqs()
>                       __fw_load_abort()
>                       complete_all(&fw_st->completion) <- cannot wake up C
>
>                                          ...
>                                          usermodehelper_read_unlock()
>                                          up_read(&umhelper_sem) <- cannot wake up A

I'm not sure I'm reading it correctly but it looks like the "process B"
column is superfluous given that it's waiting on the same lock to do the
same thing that A is already doing (besides, you can't really halt the
machine twice). What it's reporting seems to be an ABBA deadlock between
A waiting on umhelper_sem and C waiting on fw_st->completion.

The report seems spurious:

1. wait_for_completion_killable_timeout() doesn't need someone to wake
   it up to make forward progress because it will unstick itself once
   the timeout expires.

2. complete_all() from __fw_load_abort() isn't the only source of
   wakeup. The fw loader can be, and mainly should be, woken up by
   firmware loading actually completing instead of being aborted.

I guess the reason why B shows up there is that the operation order is
such that, just between A and C, the complete_all() takes place before
__usermodehelper_disable(), so the whole thing kinda doesn't make sense
as you can't block a past operation by a future one. Inserting process B
introduces the reverse ordering.

Thanks.
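[The A/C cycle Tejun describes can be condensed into a runnable userspace
sketch. Everything below is hypothetical: pthreads stand in for the kernel
primitives, the B edge and the timeout are deliberately left out, and the
function names are made up for illustration.]

	#include <pthread.h>
	#include <unistd.h>

	static pthread_mutex_t transition_mutex = PTHREAD_MUTEX_INITIALIZER;
	static pthread_rwlock_t umhelper_sem = PTHREAD_RWLOCK_INITIALIZER;
	static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t completion = PTHREAD_COND_INITIALIZER;
	static int fw_done;	/* nobody ever sets this in the sketch */

	/* context A: the reboot path */
	static void *context_a(void *unused)
	{
		pthread_mutex_lock(&transition_mutex);
		usleep(100000);	/* let C grab the read lock first */
		/* blocks: C holds umhelper_sem for read */
		pthread_rwlock_wrlock(&umhelper_sem);
		pthread_rwlock_unlock(&umhelper_sem);	/* never reached */
		pthread_mutex_unlock(&transition_mutex);
		return NULL;
	}

	/* context C: the firmware fallback worker */
	static void *context_c(void *unused)
	{
		pthread_rwlock_rdlock(&umhelper_sem);
		pthread_mutex_lock(&completion_lock);
		/* the kernel wait has a timeout; this one doesn't */
		while (!fw_done)
			pthread_cond_wait(&completion, &completion_lock);
		pthread_mutex_unlock(&completion_lock);
		pthread_rwlock_unlock(&umhelper_sem);	/* would let A proceed */
		return NULL;
	}

	int main(void)
	{
		pthread_t a, c;

		pthread_create(&c, NULL, context_c, NULL);
		pthread_create(&a, NULL, context_a, NULL);
		pthread_join(a, NULL);	/* hangs: A waits on C's read lock,
					 * C waits on a wakeup that never comes */
		pthread_join(c, NULL);
		return 0;
	}

Because the real wait is wait_for_completion_killable_timeout(), context C
eventually unsticks itself, which is exactly why Tejun calls the report
spurious.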
Tejun wrote:
> Hello,

Hello,

> I'm not sure I'm reading it correctly but it looks like the "process B"
> column

I think you're interpreting the report correctly.

> is superfluous given that it's waiting on the same lock to do the same
> thing that A is already doing (besides, you can't really halt the
> machine twice).

Indeed! I've been in a daze. I thought kernel_halt() could be called
twice for two different purposes. Sorry for the noise.

> What it's reporting seems to be an ABBA deadlock between A waiting on
> umhelper_sem and C waiting on fw_st->completion. The report seems
> spurious:
>
> 1. wait_for_completion_killable_timeout() doesn't need someone to wake
>    it up to make forward progress because it will unstick itself once
>    the timeout expires.

I have a question about this one. Yes, it would never get stuck, thanks
to the timeout. However, IIUC, timeouts are not supposed to expire in
normal cases. So I thought a timeout expiration indicates an abnormal
case that is worth reporting in terms of dependency, so as to prevent
further expirations. That's why I have been trying to track even the
APIs with timeouts.

Do you think DEPT shouldn't track the timeout APIs? If I was wrong, I
will stop tracking them.

> 2. complete_all() from __fw_load_abort() isn't the only source of
>    wakeup. The fw loader can be, and mainly should be, woken up by
>    firmware loading actually completing instead of being aborted.

This is the point I'd like to ask about. In normal cases, fw_load_done()
might happen, of course, if the loading gets completed. However, I was
wondering whether the kernel ensures that either fw_load_done() or
fw_load_abort() is called by *another* context during kernel_halt().

> Thanks.

Thank you very much!

Byungchul

>
> --
> tejun
>
On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> I have a question about this one. Yes, it would never get stuck, thanks
> to the timeout. However, IIUC, timeouts are not supposed to expire in
> normal cases. So I thought a timeout expiration indicates an abnormal
> case that is worth reporting in terms of dependency, so as to prevent
> further expirations. That's why I have been trying to track even the
> APIs with timeouts.

As I believe I've already pointed out to you previously with ext4 and
ocfs2, the jbd2 timeout every five seconds happens **all** the time
while the file system is mounted. Commits happening more frequently than
every five seconds are the exception case, at least for desktop/laptop
workloads.

We *don't* get to the timeout only when a userspace process calls
fsync(2), or if the journal was incorrectly sized by the system
administrator so that it's too small and the workload has so many file
system mutations that we have to prematurely close the transaction ahead
of the 5 second timeout.

> Do you think DEPT shouldn't track the timeout APIs? If I was wrong, I
> will stop tracking them.

DEPT tracking timeouts will cause false positives in at least some
cases. At the very least, there needs to be an easy way to suppress
these false positives on a per wait/mutex/spinlock basis.

					- Ted
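[To make the jbd2 point concrete, here is a minimal userspace sketch --
assuming nothing about actual jbd2 internals -- of a commit thread for
which timeout expiry is the normal path rather than a missed wakeup:]

	#include <errno.h>
	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <time.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;
	static bool commit_requested;

	/* jbd2-style commit loop (illustrative): commit every 5 seconds,
	 * or earlier if someone (e.g. an fsync) requests it. */
	static void *commit_thread(void *unused)
	{
		for (int i = 0; i < 3; i++) {
			struct timespec deadline;

			clock_gettime(CLOCK_REALTIME, &deadline);
			deadline.tv_sec += 5;	/* the commit interval */

			pthread_mutex_lock(&lock);
			while (!commit_requested) {
				int err = pthread_cond_timedwait(&kick, &lock,
								 &deadline);

				if (err == ETIMEDOUT)	/* the *normal* path */
					break;
			}
			commit_requested = false;
			pthread_mutex_unlock(&lock);

			printf("committing transaction\n");	/* ... */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, commit_thread, NULL);
		pthread_join(t, NULL);	/* three timeouts expire; nothing is wrong */
		return 0;
	}

A dependency tracker that records each 5-second expiry as a blocked wait
would warn on every iteration of such a loop, which is the class of false
positive being described here.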
Hello,

On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> > 1. wait_for_completion_killable_timeout() doesn't need someone to wake
> >    it up to make forward progress because it will unstick itself once
> >    the timeout expires.
>
> I have a question about this one. Yes, it would never get stuck, thanks
> to the timeout. However, IIUC, timeouts are not supposed to expire in
> normal cases. So I thought a timeout expiration indicates an abnormal
> case that is worth reporting in terms of dependency, so as to prevent
> further expirations. That's why I have been trying to track even the
> APIs with timeouts.
>
> Do you think DEPT shouldn't track the timeout APIs? If I was wrong, I
> will stop tracking them.

Without actually surveying the use cases I can't say for sure, but my
experience has been that we often get pretty creative with timeouts, and
that they're something people actively think about and monitor (and it's
usually not subtle). Given that, I'm skeptical about how much value a
dependency checker would add by warning about timeouts. It might be a
net negative rather than the other way around.

> > 2. complete_all() from __fw_load_abort() isn't the only source of
> >    wakeup. The fw loader can be, and mainly should be, woken up by
> >    firmware loading actually completing instead of being aborted.
>
> This is the point I'd like to ask about. In normal cases, fw_load_done()
> might happen, of course, if the loading gets completed. However, I was
> wondering whether the kernel ensures that either fw_load_done() or
> fw_load_abort() is called by *another* context during kernel_halt().

We'll have to walk through the code to tell. On a cursory look, though,
up until that point (just before shutting down the usermode helper), I
don't see anything which would actively block firmware loading.

Thanks.
On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
> On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> > On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > > > CASE 1.
> > > >
> > > >    lock L with depth n
> > > >    lock_nested L' with depth n + 1
> > > >    ...
> > > >    unlock L'
> > > >    unlock L
> > > >
> > > > This case is allowed by Lockdep.
> > > > This case is allowed by DEPT cuz it's not a deadlock.
> > > >
> > > > CASE 2.
> > > >
> > > >    lock L with depth n
> > > >    lock A
> > > >    lock_nested L' with depth n + 1
> > > >    ...
> > > >    unlock L'
> > > >    unlock A
> > > >    unlock L
> > > >
> > > > This case is allowed by Lockdep.
> > > > This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> > >
> > > Yeah, in previous threads we discussed this [1]
> > >
> > > And the case was:
> > >    scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > > And DEPT reported:
> > >    object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
> > > as a deadlock.
> > >
> > > But IIUC - what DEPT reported happens only under scan_mutex, and
> > > it is not as simple as just not taking those locks, because the
> > > object can be removed from the list and freed while scanning via
> > > kmemleak_free() without kmemleak_lock and object_lock.

The above kmemleak sequence shouldn't deadlock since those locks, even
if taken in a different order, are serialised by scan_mutex. For various
reasons, trying to reduce the latency, I ended up with some
fine-grained, per-object locking.

For object allocation (rbtree modification) and tree search, we use
kmemleak_lock. During scanning (which can take minutes under
scan_mutex), we want to prevent (a) long latencies and (b) freeing the
object being scanned. We release the locks regularly for (a) and hold
the object->lock for (b).

In another thread Byungchul mentioned:

| context X			context Y
|
| lock mutex A			lock mutex A
| lock B			lock C
| lock C			lock B
| unlock C			unlock B
| unlock B			unlock C
| unlock mutex A		unlock mutex A
|
| In my opinion, lock B and lock C are unnecessary if they are always
| along with lock mutex A. Or we should keep correct lock order across
| all the code.

If these were the only two places, yes, locks B and C would be
unnecessary. But we have those locks acquired (not nested) on the
allocation path (kmemleak_lock) and the freeing path (object->lock). We
don't want to block those paths while scan_mutex is held.

That said, we may be able to use a single kmemleak_lock for everything.
The object freeing path may be affected slightly during scanning, but
the scanning code does release the lock every MAX_SCAN_SIZE bytes. It
may even get slightly faster as we'd hammer a single lock (I'll do some
benchmarks).

But from a correctness perspective, I think the DEPT tool should be
improved a bit to detect when such out-of-order locking is serialised by
an enclosing lock/mutex.
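[The serialised-by-an-enclosing-lock pattern is easy to reproduce
standalone. A minimal pthreads sketch with generic lock names -- not the
kmemleak code: B and C are taken in opposite orders by the two threads,
yet no deadlock is possible because both orderings only ever occur under
A:]

	#include <pthread.h>

	static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t C = PTHREAD_MUTEX_INITIALIZER;

	static void *context_x(void *unused)
	{
		pthread_mutex_lock(&A);
		pthread_mutex_lock(&B);
		pthread_mutex_lock(&C);	/* B -> C, but serialised by A */
		pthread_mutex_unlock(&C);
		pthread_mutex_unlock(&B);
		pthread_mutex_unlock(&A);
		return NULL;
	}

	static void *context_y(void *unused)
	{
		pthread_mutex_lock(&A);
		pthread_mutex_lock(&C);
		pthread_mutex_lock(&B);	/* C -> B: reversed, still safe under A */
		pthread_mutex_unlock(&B);
		pthread_mutex_unlock(&C);
		pthread_mutex_unlock(&A);
		return NULL;
	}

	int main(void)
	{
		pthread_t x, y;

		pthread_create(&x, NULL, context_x, NULL);
		pthread_create(&y, NULL, context_y, NULL);
		pthread_join(x, NULL);	/* always terminates: the two critical
					 * sections can never interleave */
		pthread_join(y, NULL);
		return 0;
	}

An order tracker that looks only at the B/C pairs sees B -> C in one
thread and C -> B in the other and reports a cycle; recognising that both
pairs are dominated by A is what would suppress the report.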
On Thu, May 12, 2022 at 09:56:46AM -0400, Theodore Ts'o wrote:
> On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> > I have a question about this one. Yes, it would never get stuck,
> > thanks to the timeout. However, IIUC, timeouts are not supposed to
> > expire in normal cases. So I thought a timeout expiration indicates
> > an abnormal case that is worth reporting in terms of dependency, so
> > as to prevent further expirations. That's why I have been trying to
> > track even the APIs with timeouts.
>
> As I believe I've already pointed out to you previously with ext4 and
> ocfs2, the jbd2 timeout every five seconds happens **all** the time
> while the file system is mounted. Commits happening more frequently
> than every five seconds are the exception case, at least for
> desktop/laptop workloads.

Thanks, Ted. It's easy to stop tracking the APIs with timeouts. I've
just been afraid that cases we would want to catch anyway get skipped
along with them. However, I should stop tracking them if it produces
too many false alarms.

> We *don't* get to the timeout only when a userspace process calls
> fsync(2), or if the journal was incorrectly sized by the system
> administrator so that it's too small and the workload has so many file
> system mutations that we have to prematurely close the transaction
> ahead of the 5 second timeout.

Yeah... that's how journaling works. Thanks.

> > Do you think DEPT shouldn't track the timeout APIs? If I was wrong, I
> > will stop tracking them.
>
> DEPT tracking timeouts will cause false positives in at least some
> cases. At the very least, there needs to be an easy way to suppress
> these false positives on a per wait/mutex/spinlock basis.

The easy way is to stop tracking the waits that come with a timeout,
until DEPT can take the timeout behaviour itself into account when
evaluating waits and events.

Thanks.

Byungchul

>
> - Ted
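[The suppression Byungchul describes could plausibly reduce to a guard
like the following -- purely illustrative pseudocode, not actual DEPT
internals; dept_wait() and the bookkeeping hook are hypothetical names:]

	/* Illustrative only -- a hypothetical DEPT-style hook, not real
	 * kernel code. In the kernel, MAX_SCHEDULE_TIMEOUT is LONG_MAX,
	 * i.e. "no timeout". */
	#include <limits.h>
	#define MAX_SCHEDULE_TIMEOUT	LONG_MAX

	static void dept_wait_sketch(void *wait_id, long timeout)
	{
		/* A wait with a finite timeout unsticks itself, so
		 * recording it as a dependency can only produce the
		 * false positives discussed above; skip it. */
		if (timeout != MAX_SCHEDULE_TIMEOUT)
			return;

		/* record_wait_dependency(wait_id);  -- hypothetical
		 * bookkeeping for waits that really can block forever */
	}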
On Thu, May 19, 2022 at 11:11:10AM +0100, Catalin Marinas wrote:
> On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
> > On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> > > On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > > > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > > > > CASE 1.
> > > > >
> > > > >    lock L with depth n
> > > > >    lock_nested L' with depth n + 1
> > > > >    ...
> > > > >    unlock L'
> > > > >    unlock L
> > > > >
> > > > > This case is allowed by Lockdep.
> > > > > This case is allowed by DEPT cuz it's not a deadlock.
> > > > >
> > > > > CASE 2.
> > > > >
> > > > >    lock L with depth n
> > > > >    lock A
> > > > >    lock_nested L' with depth n + 1
> > > > >    ...
> > > > >    unlock L'
> > > > >    unlock A
> > > > >    unlock L
> > > > >
> > > > > This case is allowed by Lockdep.
> > > > > This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> > > >
> > > > Yeah, in previous threads we discussed this [1]
> > > >
> > > > And the case was:
> > > >    scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > > > And DEPT reported:
> > > >    object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
> > > > as a deadlock.
> > > >
> > > > But IIUC - what DEPT reported happens only under scan_mutex, and
> > > > it is not as simple as just not taking those locks, because the
> > > > object can be removed from the list and freed while scanning via
> > > > kmemleak_free() without kmemleak_lock and object_lock.
>
> The above kmemleak sequence shouldn't deadlock since those locks, even
> if taken in a different order, are serialised by scan_mutex. For
> various reasons, trying to reduce the latency, I ended up with some
> fine-grained, per-object locking.

I understand why you introduced the fine-grained locking. However, the
reversed order should be avoided anyway. As Steven said, Lockdep should
also have detected this case; put differently, this would have been
detected if Lockdep worked correctly.

Making a tool skip a reversed order when it is already protected by
another lock is not a technical problem. But because each lock has its
own purpose, as you explained, nobody knows whether cases might arise
someday that take kmemleak_lock and object_lock without holding
scan_mutex. I'm wondering how other folks think this case should be
handled, though.

> For object allocation (rbtree modification) and tree search, we use
> kmemleak_lock. During scanning (which can take minutes under
> scan_mutex), we want to prevent (a) long latencies and (b) freeing the
> object being scanned. We release the locks regularly for (a) and hold
> the object->lock for (b).
>
> In another thread Byungchul mentioned:
>
> | context X			context Y
> |
> | lock mutex A			lock mutex A
> | lock B			lock C
> | lock C			lock B
> | unlock C			unlock B
> | unlock B			unlock C
> | unlock mutex A		unlock mutex A
> |
> | In my opinion, lock B and lock C are unnecessary if they are always
> | along with lock mutex A. Or we should keep correct lock order across
> | all the code.
>
> If these were the only two places, yes, locks B and C would be
> unnecessary. But we have those locks acquired (not nested) on the
> allocation path (kmemleak_lock) and the freeing path (object->lock).
> We don't want to block those paths while scan_mutex is held.
>
> That said, we may be able to use a single kmemleak_lock for
> everything. The object freeing path may be affected slightly during
> scanning, but the scanning code does release the lock every
> MAX_SCAN_SIZE bytes. It may even get slightly faster as we'd hammer a
> single lock (I'll do some benchmarks).
>
> But from a correctness perspective, I think the DEPT tool should be
> improved a bit to detect when such out-of-order locking is serialised
> by an enclosing lock/mutex.

Again, I don't think this is a technical issue.

Byungchul

>
> --
> Catalin
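[For reference, the lockdep annotation that CASE 1 and CASE 2 revolve
around looks like this in kernel code -- a compile-only sketch, not a
quote from the thread; down_write_nested() and SINGLE_DEPTH_NESTING are
the real kernel names, while the function and its parameters are made up
for illustration:]

	#include <linux/rwsem.h>
	#include <linux/lockdep.h>	/* SINGLE_DEPTH_NESTING == 1 */

	static void lock_parent_then_child(struct rw_semaphore *parent,
					   struct rw_semaphore *child)
	{
		down_write(parent);
		/* Same lock class at depth n + 1: the _nested variant
		 * tells lockdep the recursion is intentional, so lockdep
		 * stays quiet here. */
		down_write_nested(child, SINGLE_DEPTH_NESTING);

		/* ... operate on both objects ... */

		up_write(child);
		up_write(parent);
	}

Whether DEPT should trust such annotations the way lockdep does, or
instead insist on a single global order as Byungchul argues above, is
the open question in this subthread.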