Message ID | ZxcDzT/iv/f0Gyz0@localhost.localdomain (mailing list archive)
---|---
State | Superseded
Delegated to: | BPF
Series | bpf: Fix out-of-bounds write in trie_get_next_key()
Byeonguk Jeong <jungbu2855@gmail.com> writes:

> trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
> while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
> full paths from the root to leaves. For example, consider a trie whose
> max_prefixlen is 8, with the nodes with keys 0x00/0, 0x00/1, 0x00/2, ...
> 0x00/8 inserted. Subsequent calls to trie_get_next_key() with _key with
> .prefixlen = 8 cause 9 nodes to be written to the node stack of size 8.
>
> Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
> Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>

Makes sense!

Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
On Mon, Oct 21, 2024 at 6:49 PM Byeonguk Jeong <jungbu2855@gmail.com> wrote:
>
> trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
> while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
> full paths from the root to leaves. For example, consider a trie whose
> max_prefixlen is 8, with the nodes with keys 0x00/0, 0x00/1, 0x00/2, ...
> 0x00/8 inserted. Subsequent calls to trie_get_next_key() with _key with
> .prefixlen = 8 cause 9 nodes to be written to the node stack of size 8.

Hmm. It sounds possible, but please demonstrate it with a selftest.
With the amount of fuzzing, I'm surprised it was not discovered earlier.

pw-bot: cr
On Tue, Oct 22, 2024 at 12:51:05PM -0700, Alexei Starovoitov wrote:
> On Mon, Oct 21, 2024 at 6:49 PM Byeonguk Jeong <jungbu2855@gmail.com> wrote:
> >
> > trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
> > while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
> > full paths from the root to leaves. For example, consider a trie whose
> > max_prefixlen is 8, with the nodes with keys 0x00/0, 0x00/1, 0x00/2, ...
> > 0x00/8 inserted. Subsequent calls to trie_get_next_key() with _key with
> > .prefixlen = 8 cause 9 nodes to be written to the node stack of size 8.
>
> Hmm. It sounds possible, but please demonstrate it with a selftest.
> With the amount of fuzzing, I'm surprised it was not discovered earlier.
>
> pw-bot: cr

With the simple test below, the kernel crashes in a minute, or you can easily
discover the bug on KFENCE-enabled kernels.

#!/bin/bash

bpftool map create /sys/fs/bpf/lpm type lpm_trie key 5 value 1 \
	entries 16 flags 0x1 name lpm

for i in {0..8}; do
	bpftool map update pinned /sys/fs/bpf/lpm \
		key hex 0$i 00 00 00 00 \
		value hex 00 any
done

while true; do
	bpftool map dump pinned /sys/fs/bpf/lpm
done

In my environment (6.12-rc4, with CONFIG_KFENCE), dmesg gave me this message
as expected.
[ 463.141394] BUG: KFENCE: out-of-bounds write in trie_get_next_key+0x2f2/0x670
[ 463.143422] Out-of-bounds write at 0x0000000095bc45ea (256B right of kfence-#156):
[ 463.144438]  trie_get_next_key+0x2f2/0x670
[ 463.145439]  map_get_next_key+0x261/0x410
[ 463.146444]  __sys_bpf+0xad4/0x1170
[ 463.147438]  __x64_sys_bpf+0x74/0xc0
[ 463.148431]  do_syscall_64+0x79/0x150
[ 463.149425]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 463.151436] kfence-#156: 0x00000000279749c1-0x0000000034dc4abb, size=256, cache=kmalloc-256
[ 463.153414] allocated by task 2021 on cpu 2 at 463.140440s (0.012974s ago):
[ 463.154413]  trie_get_next_key+0x252/0x670
[ 463.155411]  map_get_next_key+0x261/0x410
[ 463.156402]  __sys_bpf+0xad4/0x1170
[ 463.157390]  __x64_sys_bpf+0x74/0xc0
[ 463.158386]  do_syscall_64+0x79/0x150
[ 463.159372]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
On 10/22/2024 9:45 AM, Byeonguk Jeong wrote:
> trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
> while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
> full paths from the root to leaves. For example, consider a trie whose
> max_prefixlen is 8, with the nodes with keys 0x00/0, 0x00/1, 0x00/2, ...
> 0x00/8 inserted. Subsequent calls to trie_get_next_key() with _key with
> .prefixlen = 8 cause 9 nodes to be written to the node stack of size 8.
>
> Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
> Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>
> ---

Tested-by: Hou Tao <houtao1@huawei.com>

Without the fix, there will be a KASAN report as shown below when dumping all
keys in the lpm-trie through bpf_map_get_next_key().

However, I have a dumb question: does it make sense to reject an element with
prefixlen = 0? Because I can't think of a use case where a zero-length prefix
would be useful.

==================================================================
BUG: KASAN: slab-out-of-bounds in trie_get_next_key+0x133/0x530
Write of size 8 at addr ffff8881076c2fc0 by task test_lpm_trie.b/446

CPU: 0 UID: 0 PID: 446 Comm: test_lpm_trie.b Not tainted 6.11.0+ #52
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
Call Trace:
 <TASK>
 dump_stack_lvl+0x6e/0xb0
 print_report+0xce/0x610
 ? trie_get_next_key+0x133/0x530
 ? kasan_complete_mode_report_info+0x3c/0x200
 ? trie_get_next_key+0x133/0x530
 kasan_report+0x9c/0xd0
 ? trie_get_next_key+0x133/0x530
 __asan_store8+0x81/0xb0
 trie_get_next_key+0x133/0x530
 __sys_bpf+0x1b03/0x3140
 ? __pfx___sys_bpf+0x10/0x10
 ? __pfx_vfs_write+0x10/0x10
 ? find_held_lock+0x8e/0xb0
 ? ksys_write+0xee/0x180
 ? syscall_exit_to_user_mode+0xb3/0x220
 ? mark_held_locks+0x28/0x90
 ? mark_held_locks+0x28/0x90
 __x64_sys_bpf+0x45/0x60
 x64_sys_call+0x1b2a/0x20d0
 do_syscall_64+0x5d/0x100
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f9c5e9c9c5d
......
 </TASK>

Allocated by task 446:
 kasan_save_stack+0x28/0x50
 kasan_save_track+0x14/0x30
 kasan_save_alloc_info+0x36/0x40
 __kasan_kmalloc+0x84/0xa0
 __kmalloc_noprof+0x214/0x540
 trie_get_next_key+0xa7/0x530
 __sys_bpf+0x1b03/0x3140
 __x64_sys_bpf+0x45/0x60
 x64_sys_call+0x1b2a/0x20d0
 do_syscall_64+0x5d/0x100
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

The buggy address belongs to the object at ffff8881076c2f80
which belongs to the cache kmalloc-rnd-09-64 of size 64
The buggy address is located 0 bytes to the right of
allocated 64-byte region [ffff8881076c2f80, ffff8881076c2fc0)

>  kernel/bpf/lpm_trie.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 0218a5132ab5..9b60eda0f727 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -655,7 +655,7 @@ static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
>  	if (!key || key->prefixlen > trie->max_prefixlen)
>  		goto find_leftmost;
>
> -	node_stack = kmalloc_array(trie->max_prefixlen,
> +	node_stack = kmalloc_array(trie->max_prefixlen + 1,
>  				   sizeof(struct lpm_trie_node *),
>  				   GFP_ATOMIC | __GFP_NOWARN);
>  	if (!node_stack)
On Wed, Oct 23, 2024 at 10:03:44AM +0800, Hou Tao wrote:
>
> Without the fix, there will be a KASAN report as shown below when dumping
> all keys in the lpm-trie through bpf_map_get_next_key().

Thank you for testing.

> However, I have a dumb question: does it make sense to reject an element
> with prefixlen = 0? Because I can't think of a use case where a
> zero-length prefix would be useful.

With prefixlen = 0, it would always return -ENOENT, I think. Maybe it is
good to reject it earlier!
On Tue, Oct 22, 2024 at 12:51:05PM -0700, Alexei Starovoitov wrote:
> On Mon, Oct 21, 2024 at 6:49 PM Byeonguk Jeong <jungbu2855@gmail.com> wrote:
> >
> > trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
> > while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
> > full paths from the root to leaves. For example, consider a trie whose
> > max_prefixlen is 8, with the nodes with keys 0x00/0, 0x00/1, 0x00/2, ...
> > 0x00/8 inserted. Subsequent calls to trie_get_next_key() with _key with
> > .prefixlen = 8 cause 9 nodes to be written to the node stack of size 8.
>
> Hmm. It sounds possible, but please demonstrate it with a selftest.
> With the amount of fuzzing, I'm surprised it was not discovered earlier.
>
> pw-bot: cr

I sent this again because lkml did not understand the previous one, which
was 8B encoded.

With the simple test below, the kernel crashes in a minute, or you can
discover the bug on KFENCE-enabled kernels easily.

#!/bin/bash

bpftool map create /sys/fs/bpf/lpm type lpm_trie key 5 value 1 \
	entries 16 flags 0x1 name lpm

for i in {0..8}; do
	bpftool map update pinned /sys/fs/bpf/lpm \
		key hex 0$i 00 00 00 00 \
		value hex 00 any
done

while true; do
	bpftool map dump pinned /sys/fs/bpf/lpm
done

In my environment (6.12-rc4, with CONFIG_KFENCE), dmesg gave me this message
as expected.
[ 463.141394] BUG: KFENCE: out-of-bounds write in trie_get_next_key+0x2f2/0x670
[ 463.143422] Out-of-bounds write at 0x0000000095bc45ea (256B right of kfence-#156):
[ 463.144438]  trie_get_next_key+0x2f2/0x670
[ 463.145439]  map_get_next_key+0x261/0x410
[ 463.146444]  __sys_bpf+0xad4/0x1170
[ 463.147438]  __x64_sys_bpf+0x74/0xc0
[ 463.148431]  do_syscall_64+0x79/0x150
[ 463.149425]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 463.151436] kfence-#156: 0x00000000279749c1-0x0000000034dc4abb, size=256, cache=kmalloc-256
[ 463.153414] allocated by task 2021 on cpu 2 at 463.140440s (0.012974s ago):
[ 463.154413]  trie_get_next_key+0x252/0x670
[ 463.155411]  map_get_next_key+0x261/0x410
[ 463.156402]  __sys_bpf+0xad4/0x1170
[ 463.157390]  __x64_sys_bpf+0x74/0xc0
[ 463.158386]  do_syscall_64+0x79/0x150
[ 463.159372]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Hi,

On 10/23/2024 3:30 PM, Byeonguk Jeong wrote:
> On Wed, Oct 23, 2024 at 10:03:44AM +0800, Hou Tao wrote:
>> Without the fix, there will be a KASAN report as shown below when dumping
>> all keys in the lpm-trie through bpf_map_get_next_key().
> Thank you for testing.

Alexei suggested adding a bpf selftest for the patch. I think you could
reference the code in lpm_trie_map_batch_ops.c [1] or similar and add a new
file that uses bpf_map_get_next_key() to demonstrate the out-of-bounds
problem. The test can be run by ./test_maps. There is some documentation
for the procedure in [2].

[1]: tools/testing/selftests/bpf/map_tests/lpm_trie_map_batch_ops.c
[2]: https://github.com/torvalds/linux/blob/master/Documentation/bpf/bpf_devel_QA.rst

>> However, I have a dumb question: does it make sense to reject an element
>> with prefixlen = 0? Because I can't think of a use case where a
>> zero-length prefix would be useful.
> With prefixlen = 0, it would always return -ENOENT, I think. Maybe it is
> good to reject it earlier!

Which procedure will return -ENOENT? I think the element with prefixlen = 0
could still be found through the key with prefixlen = 0.
Hi,

On Wed, Oct 23, 2024 at 05:59:53PM +0800, Hou Tao wrote:
> Alexei suggested adding a bpf selftest for the patch. I think you could
> reference the code in lpm_trie_map_batch_ops.c [1] or similar and add a
> new file that uses bpf_map_get_next_key() to demonstrate the
> out-of-bounds problem. The test can be run by ./test_maps. There is some
> documentation for the procedure in [2].
>
> [1]: tools/testing/selftests/bpf/map_tests/lpm_trie_map_batch_ops.c
> [2]: https://github.com/torvalds/linux/blob/master/Documentation/bpf/bpf_devel_QA.rst

Okay, I will add a new test. Thanks for the detailed guidance.

> Which procedure will return -ENOENT? I think the element with
> prefixlen = 0 could still be found through the key with prefixlen = 0.

I mean, BPF_MAP_GET_NEXT_KEY with .prefixlen = 0 would give us -ENOENT, as
it follows postorder. BPF_MAP_LOOKUP_ELEM still finds the element with
prefixlen 0 through the key with prefixlen 0, as you said.
On 10/24/2024 9:48 AM, Byeonguk Jeong wrote:
> Hi,
>
> On Wed, Oct 23, 2024 at 05:59:53PM +0800, Hou Tao wrote:
>> Alexei suggested adding a bpf selftest for the patch. I think you could
>> reference the code in lpm_trie_map_batch_ops.c [1] or similar and add a
>> new file that uses bpf_map_get_next_key() to demonstrate the
>> out-of-bounds problem. The test can be run by ./test_maps. There is some
>> documentation for the procedure in [2].
>>
>> [1]: tools/testing/selftests/bpf/map_tests/lpm_trie_map_batch_ops.c
>> [2]: https://github.com/torvalds/linux/blob/master/Documentation/bpf/bpf_devel_QA.rst
> Okay, I will add a new test. Thanks for the detailed guidance.
>
>> Which procedure will return -ENOENT? I think the element with
>> prefixlen = 0 could still be found through the key with prefixlen = 0.
> I mean, BPF_MAP_GET_NEXT_KEY with .prefixlen = 0 would give us -ENOENT,
> as it follows postorder. BPF_MAP_LOOKUP_ELEM still finds the element
> with prefixlen 0 through the key with prefixlen 0, as you said.

I see. But considering the element with .prefixlen = 0 is the last one in
the map, returning -ENOENT is expected.
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 0218a5132ab5..9b60eda0f727 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -655,7 +655,7 @@ static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
 	if (!key || key->prefixlen > trie->max_prefixlen)
 		goto find_leftmost;
 
-	node_stack = kmalloc_array(trie->max_prefixlen,
+	node_stack = kmalloc_array(trie->max_prefixlen + 1,
 				   sizeof(struct lpm_trie_node *),
 				   GFP_ATOMIC | __GFP_NOWARN);
 	if (!node_stack)
trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
full paths from the root to leaves. For example, consider a trie whose
max_prefixlen is 8, with the nodes with keys 0x00/0, 0x00/1, 0x00/2, ...
0x00/8 inserted. Subsequent calls to trie_get_next_key() with _key with
.prefixlen = 8 cause 9 nodes to be written to the node stack of size 8.

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>
---
 kernel/bpf/lpm_trie.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)