Message ID: 20221212003711.24977-1-laoar.shao@gmail.com (mailing list archive)
Series: mm, bpf: Add BPF into /proc/meminfo
On 12/12/22 01:37, Yafang Shao wrote:
> Currently there's no way to get BPF memory usage; we can only estimate
> it via bpftool or memcg, neither of which is reliable.
>
> - bpftool
>   `bpftool {map,prog} show` can show us the memlock of each map and
>   prog, but the memlock differs from the real memory size. The memlock
>   of a bpf object is approximately
>   `round_up(key_size + value_size, 8) * max_entries`,
>   so 1) it can't apply to non-preallocated bpf maps, whose real memory
>   size may grow or shrink dynamically, and 2) the element size of some
>   bpf maps is not `key_size + value_size`; for example, the element
>   size of an htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`.
>   That said, the difference between these two values can be very large
>   if key_size and value_size are small. For example, in my verification
>   the memlock and real memory size of a preallocated hash map are:
>
>   $ grep BPF /proc/meminfo
>   BPF:             1026048 B <<< the size of the preallocated memalloc pool
>
>   (create hash map)
>
>   $ bpftool map show
>   3: hash  name count_map  flags 0x0
>           key 4B  value 4B  max_entries 1048576  memlock 8388608B
>
>   $ grep BPF /proc/meminfo
>   BPF:            84919344 B
>
>   So the real memory size is $((84919344 - 1026048)), i.e. 83893296
>   bytes, while the memlock is only 8388608 bytes.
>
> - memcg
>   With memcg we only know that the BPF memory usage is less than
>   memory.usage_in_bytes (or memory.current in v2). Furthermore, we only
>   know that the BPF memory usage is less than $MemTotal if the BPF
>   object is charged into the root memcg :)
>
> So we need a way to get the BPF memory usage, especially as more and
> more bpf programs are running in production environments. The memory
> usage of BPF is not trivial, and it deserves a new item in
> /proc/meminfo.
>
> This patchset introduces a solution to calculate the BPF memory usage.
> This solution is similar to how memory is charged into memcg, so it is
> easy to understand. It counts three types of memory usage -
> - page
>   via kmalloc, vmalloc, kmem_cache_alloc or allocating pages directly,
>   and their families.
>   When a page is allocated, we count its size and mark the head page,
>   and then check the head page at page freeing.
> - slab
>   via kmalloc, kmem_cache_alloc and their families.
>   When a slab object is allocated, we mark this object in its slab and
>   check it at slab object freeing. That said, we need extra memory to
>   store the information of each object in a slab.
> - percpu
>   via alloc_percpu and its family.
>   When a percpu area is allocated, we mark this area in its percpu
>   chunk and check it at percpu area freeing. That said, we need extra
>   memory to store the information of each area in a percpu chunk.
>
> So we only need to annotate the allocation to add to the BPF memory
> size; the subtraction of the BPF memory size is handled automatically
> at freeing. We can annotate it in irq, softirq or process context. To
> avoid counting nested allocations, for example the percpu backing
> allocator, we reuse __GFP_ACCOUNT to filter them out. __GFP_ACCOUNT
> also makes the count consistent with memcg accounting.

So you can't easily annotate the freeing places as well, to avoid the
whole tracking infrastructure? I thought there was a patchset for a
whole bpf-specific memory allocator, where accounting would be
implemented naturally, I would imagine.

> To store the information of a slab or a page, we need to create a new
> member in struct page, but we can do it in a page extension, which
> avoids changing the size of struct page. So a new page extension,
> active_vm, is introduced. Each page and each slab object that is
> allocated as BPF memory will have a struct active_vm.
> The reason it is named active_vm is that we can extend it to other
> areas easily; for example, in the future we may use it to count other
> memory usage.
>
> The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM
> at compile time or the kernel parameter `active_vm=` at runtime.

The issue with page_ext is the extra memory usage, so it was rather
intended for debugging features that can be always compiled in, but
only enabled at runtime when debugging is needed. The overhead is only
paid when enabled. That's at least the case for page_owner and
page_table_check. The 32-bit page_idle is rather an oddity that could
have instead stayed 64-bit only.

But this is proposing page_ext functionality supposed to be enabled at
all times in production, with the goal of improved accounting. Not
on-demand debugging. I'm afraid the costs will outweigh the benefits.

Just a quick thought, in case the bpf accounting really can't be
handled without marking pages and slab objects - since memcg already
has hooks there without need of page_ext, couldn't it be done by
extending the memcg infra instead?

> Below is the result of this patchset,
>
>   $ grep BPF /proc/meminfo
>   BPF:                1002 kB
>
> Currently only bpf maps are supported, and only slub is supported.
>
> Future works:
> - support bpf prog
> - not sure if it needs to support slab
>   (it seems slab will be deprecated)
> - support per-map memory usage
> - support per-memcg memory usage
>
> Yafang Shao (9):
>   mm: Introduce active vm item
>   mm: Allow using active vm in all contexts
>   mm: percpu: Account active vm for percpu
>   mm: slab: Account active vm for slab
>   mm: Account active vm for page
>   bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
>   bpf: Use bpf_map_kzalloc in arraymap
>   bpf: Use bpf_map_kvcalloc in bpf_local_storage
>   bpf: Use active vm to account bpf map memory usage
>
>  fs/proc/meminfo.c              |   3 +
>  include/linux/active_vm.h      |  73 ++++++++++++
>  include/linux/bpf.h            |   8 ++
>  include/linux/page_ext.h       |   1 +
>  include/linux/sched.h          |   5 +
>  kernel/bpf/arraymap.c          |  16 +--
>  kernel/bpf/bpf_local_storage.c |   4 +-
>  kernel/bpf/memalloc.c          |   5 +
>  kernel/bpf/ringbuf.c           |  75 ++++++++----
>  kernel/bpf/syscall.c           |  40 ++++++-
>  kernel/fork.c                  |   4 +
>  mm/Kconfig                     |   8 ++
>  mm/Makefile                    |   1 +
>  mm/active_vm.c                 | 203 +++++++++++++++++++++++++++++++++
>  mm/active_vm.h                 |  74 ++++++++++++
>  mm/page_alloc.c                |  14 +++
>  mm/page_ext.c                  |   4 +
>  mm/percpu-internal.h           |   3 +
>  mm/percpu.c                    |  43 +++++++
>  mm/slab.h                      |   7 ++
>  mm/slub.c                      |   2 +
>  21 files changed, 557 insertions(+), 36 deletions(-)
>  create mode 100644 include/linux/active_vm.h
>  create mode 100644 mm/active_vm.c
>  create mode 100644 mm/active_vm.h
On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 12/12/22 01:37, Yafang Shao wrote:
> > Currently there's no way to get BPF memory usage, while we can only
> > estimate the usage by bpftool or memcg, both of which are not reliable.
> >
> > [...]
> > [...]
>
> So you can't easily annotate the freeing places as well, to avoid the
> whole tracking infrastructure?

The trouble is kfree_rcu(). For example:

    old_item = active_vm_item_set(ACTIVE_VM_BPF);
    kfree_rcu();
    active_vm_item_set(old_item);

If we want to pass ACTIVE_VM_BPF into the deferred rcu context, we will
have to change lots of code in the RCU subsystem. I'm not sure it is
worth it.

> I thought there was a patchset for a whole bpf-specific memory
> allocator, where accounting would be implemented naturally, I would
> imagine.
I posted a patchset[1] which annotates both allocating and freeing
several months ago. But unfortunately, after more investigation and
verification, I found that the deferred freeing context is a problem
which can't be resolved easily. That's why I finally decided to
annotate allocating only.

[1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/

> > [...]
>
> The issue with page_ext is the extra memory usage, so it was rather
> intended for debugging features that can be always compiled in, but
> only enabled at runtime when debugging is needed. The overhead is only
> paid when enabled. That's at least the case for page_owner and
> page_table_check. The 32-bit page_idle is rather an oddity that could
> have instead stayed 64-bit only.

Right, it seems page_ext is currently for debugging purposes only.

> But this is proposing page_ext functionality supposed to be enabled at
> all times in production, with the goal of improved accounting. Not
> on-demand debugging. I'm afraid the costs will outweigh the benefits.

The memory overhead of this new page extension is 8/4096, which is
about 0.2% of total memory. Not too big to be acceptable. If the user
really thinks this overhead is unacceptable, he can set "active_vm=off"
to disable it.

To reduce the memory overhead further, I have a bold idea.
Actually, we don't need to allocate such a page extension for every
page; we only need to allocate it when the user needs to access it.
That said, it seems we could allocate some kinds of page extensions
dynamically rather than preallocating them at boot, but I haven't
investigated deeply whether that can work. What do you think?

> Just a quick thought, in case the bpf accounting really can't be
> handled without marking pages and slab objects - since memcg already
> has hooks there without need of page_ext, couldn't it be done by
> extending the memcg infra instead?

We need to make sure the accounting of BPF memory usage still works
even without memcg; see also the previous discussion[2].

[2]. https://lore.kernel.org/linux-mm/Yy53cgcwx+hTll4R@slm.duckdns.org/
On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 12/12/22 01:37, Yafang Shao wrote:
> > > Currently there's no way to get BPF memory usage, while we can only
> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > >
> > > [...]
> >
> > So you can't easily annotate the freeing places as well, to avoid the
> > whole tracking infrastructure?
>
> The trouble is kfree_rcu().
> For example:
>
>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
>     kfree_rcu();
>     active_vm_item_set(old_item);
>
> If we want to pass ACTIVE_VM_BPF into the deferred rcu context, we
> will have to change lots of code in the RCU subsystem. I'm not sure it
> is worth it.

(+Cc rcu folks)

IMO adding a new kfree_rcu() variant for BPF that accounts BPF memory
usage would be much less churn :)

> > I thought there was a patchset for a whole bpf-specific memory
> > allocator, where accounting would be implemented naturally, I would
> > imagine.
>
> I posted a patchset[1] which annotates both allocating and freeing
> several months ago. But unfortunately, after more investigation and
> verification, I found that the deferred freeing context is a problem
> which can't be resolved easily. That's why I finally decided to
> annotate allocating only.
>
> [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
>
> > > [...]
> >
> > The issue with page_ext is the extra memory usage, so it was rather
> > intended for debugging features that can be always compiled in, but
> > only enabled at runtime when debugging is needed. The overhead is
> > only paid when enabled. That's at least the case for page_owner and
> > page_table_check. The 32-bit page_idle is rather an oddity that
> > could have instead stayed 64-bit only.
>
> Right, it seems page_ext is currently for debugging purposes only.
>
> > But this is proposing page_ext functionality supposed to be enabled
> > at all times in production, with the goal of improved accounting.
> > Not on-demand debugging. I'm afraid the costs will outweigh the
> > benefits.
>
> The memory overhead of this new page extension is 8/4096, which is
> about 0.2% of total memory. Not too big to be acceptable.

It's generally unacceptable to increase sizeof(struct page) (or to
enable page_ext by default - that's why page_ext is for debugging
purposes only).

> If the user really thinks this overhead is unacceptable, he can set
> "active_vm=off" to disable it.

I'd say many people won't welcome adding 0.2% of total memory by
default to get BPF memory usage.

> To reduce the memory overhead further, I have a bold idea.
> Actually, we don't need to allocate such a page extension for every
> page; we only need to allocate it when the user needs to access it.
> That said, it seems we could allocate some kinds of page extensions
> dynamically rather than preallocating them at boot, but I haven't
> investigated deeply whether that can work. What do you think?
>
> > Just a quick thought, in case the bpf accounting really can't be
> > handled without marking pages and slab objects - since memcg already
> > has hooks there without need of page_ext, couldn't it be done by
> > extending the memcg infra instead?
>
> We need to make sure the accounting of BPF memory usage still works
> even without memcg; see also the previous discussion[2].
>
> [2]. https://lore.kernel.org/linux-mm/Yy53cgcwx+hTll4R@slm.duckdns.org/
On 12/13/22 15:56, Hyeonggon Yoo wrote:
> On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
>> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > On 12/12/22 01:37, Yafang Shao wrote:
>> > > Currently there's no way to get BPF memory usage, while we can only
>> > > estimate the usage by bpftool or memcg, both of which are not reliable.
>> > >
>> > > [...]
>> >
>> > So you can't easily annotate the freeing places as well, to avoid
>> > the whole tracking infrastructure?
>> The trouble is kfree_rcu(). For example:
>>
>>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
>>     kfree_rcu();
>>     active_vm_item_set(old_item);
>>
>> If we want to pass ACTIVE_VM_BPF into the deferred rcu context, we
>> will have to change lots of code in the RCU subsystem. I'm not sure
>> it is worth it.
>
> (+Cc rcu folks)
>
> IMO adding a new kfree_rcu() variant for BPF that accounts BPF memory
> usage would be much less churn :)

Alternatively, just account the bpf memory as freed already when
calling kfree_rcu()? I think the amount of memory "in flight" to be
freed by rcu is a separate issue (if it's actually an issue) and not
something each kfree_rcu() user should think about separately.

>> [...]
>> > [...]
>>
>> The memory overhead of this new page extension is 8/4096, which is
>> about 0.2% of total memory. Not too big to be acceptable.
>
> It's generally unacceptable to increase sizeof(struct page) (or to
> enable page_ext by default - that's why page_ext is for debugging
> purposes only).
>
>> If the user really thinks this overhead is unacceptable, he can set
>> "active_vm=off" to disable it.
>
> I'd say many people won't welcome adding 0.2% of total memory by
> default to get BPF memory usage.

Agreed.

>> To reduce the memory overhead further, I have a bold idea.
>> Actually, we don't need to allocate such a page extension for every
>> page; we only need to allocate it when the user needs to access it.
>> That said, it seems we could allocate some kinds of page extensions
>> dynamically rather than preallocating them at boot, but I haven't
>> investigated deeply whether that can work. What do you think?

There are lots of benefits (simplicity) to page_ext being allocated as
it is today. What you're suggesting will be better solved (in a few
years :) by Matthew's bold ideas about shrinking the current struct
page and allocating usecase-specific descriptors.
>> [...]
On Tue, Dec 13, 2022 at 04:52:09PM +0100, Vlastimil Babka wrote:
> On 12/13/22 15:56, Hyeonggon Yoo wrote:
> > On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> >> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> > On 12/12/22 01:37, Yafang Shao wrote:
> >> > > Currently there's no way to get BPF memory usage, while we can only
> >> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> >> > >
> >> > > [...]
> >> >
> >> > So you can't easily annotate the freeing places as well, to avoid the
> >> > whole tracking infrastructure?
> >>
> >> The trouble is kfree_rcu(). For example:
> >>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> >>     kfree_rcu();
> >>     active_vm_item_set(old_item);
> >> If we want to pass ACTIVE_VM_BPF into the deferred rcu context, we
> >> will have to change lots of code in the RCU subsystem. I'm not sure it
> >> is worth it.
> >
> > (+Cc rcu folks)
> >
> > IMO adding a new kfree_rcu() variant for BPF that accounts BPF memory
> > usage would be much less churn :)
>
> Alternatively, just account the bpf memory as freed already when calling
> kfree_rcu()? I think the amount of memory "in flight" to be freed by rcu is
> a separate issue (if it's actually an issue) and not something each
> kfree_rcu() user should think about separately?

If the in-flight memory really does need to be accounted for, then one
straightforward approach is to use call_rcu() and do the first part of
the needed accounting at the call_rcu() callsite and the rest of the
accounting when the callback is invoked. Or, if memory must be freed
quickly even on ChromeOS and Android, use call_rcu_hurry() instead
of call_rcu().

Or is there some accounting requirement that I am missing?

							Thanx, Paul

> >> > I thought there was a patchset for a whole
> >> > bpf-specific memory allocator, where accounting would be implemented
> >> > naturally, I would imagine.
> >>
> >> I posted a patchset[1] which annotates both allocating and freeing
> >> several months ago.
> >> But unfortunately, after more investigation and verification, I found
> >> that the deferred freeing context is a problem which can't be resolved
> >> easily.
> >> That's why I finally decided to annotate allocating only.
> >>
> >> [1].
> >> https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
> >>
> >> > > To store the information of a slab or a page, we need to create a new
> >> > > member in struct page, but we can do it in a page extension, which
> >> > > avoids changing the size of struct page. So a new page extension
> >> > > active_vm is introduced. Each page and each slab which is allocated as
> >> > > BPF memory will have a struct active_vm. The reason it is named
> >> > > active_vm is that we can extend it to other areas easily; for example,
> >> > > in the future we may use it to count other memory usage.
> >> > >
> >> > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM
> >> > > at compile time or the kernel parameter `active_vm=` at runtime.
> >> >
> >> > The issue with page_ext is the extra memory usage, so it was rather
> >> > intended for debugging features that can be always compiled in, but only
> >> > enabled at runtime when debugging is needed. The overhead is only paid
> >> > when enabled. That's at least the case of page_owner and
> >> > page_table_check. The 32bit page_idle is rather an oddity that could
> >> > have instead stayed 64-bit only.
> >>
> >> Right, it seems currently page_ext is for debugging purposes only.
> >>
> >> > But this is proposing a page_ext functionality supposed to be enabled at
> >> > all times in production, with the goal of improved accounting. Not
> >> > on-demand debugging. I'm afraid the costs will outweigh the benefits.
> >>
> >> The memory overhead of this new page extension is 8/4096, which is
> >> about 0.2% of total memory. Not too big to be acceptable.
> >
> > It's generally unacceptable to increase sizeof(struct page)
> > (nor to enable page_ext by default, and that's why page_ext is for
> > debugging purposes only)
> >
> >> If the user really
> >> thinks this overhead is not acceptable, he can set "active_vm=off" to
> >> disable it.
> >
> > I'd say many people won't welcome adding 0.2% of total memory by default
> > to get BPF memory usage.
>
> Agreed.
>
> >> To reduce the memory overhead further, I have a bold idea.
> >> Actually, we don't need to allocate such a page extension for every
> >> page; we only need to allocate it if the user needs to access it.
> >> That said, it seems that we could allocate this kind of page extension
> >> dynamically rather than preallocating it at boot, but I haven't
> >> investigated it deeply enough to check whether it can work. What do you
> >> think?
>
> There's lots of benefits (simplicity) of page_ext being allocated as it is
> today. What you're suggesting will be better solved (in a few years :) by
> Matthew's bold ideas about shrinking the current struct page and allocating
> usecase-specific descriptors.
>
> >> > Just a quick thought, in case the bpf accounting really can't be handled
> >> > without marking pages and slab objects - since memcg already has hooks
> >> > there without need of page_ext, couldn't it be done by extending the
> >> > memcg infra instead?
> >>
> >> We need to make sure the accounting of BPF memory usage is still
> >> workable even without memcg; see also the previous discussion[2].
> >>
> >> [2]. https://lore.kernel.org/linux-mm/Yy53cgcwx+hTll4R@slm.duckdns.org/
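[Editor's note: for readers skimming the thread, the cover letter's allocation-side-only scheme can be modeled in a few lines. All names here are illustrative toy stand-ins, not the kernel API: the real implementation marks the head page via a page extension and filters nested allocations with __GFP_ACCOUNT.]

```python
# Toy model: annotate only the allocation side; the freeing path finds the
# mark by itself and subtracts automatically, so kfree_rcu()-style deferred
# frees need no annotation at the free site.
active_vm_bytes = 0
page_ext = {}        # page -> accounted size; stands in for the page extension

GFP_ACCOUNT = 0x1    # stand-in for __GFP_ACCOUNT

def alloc_pages(page_id, size, gfp):
    global active_vm_bytes
    if gfp & GFP_ACCOUNT:          # nested allocations without the flag are skipped
        active_vm_bytes += size
        page_ext[page_id] = size   # mark the (head) page
    return page_id

def free_pages(page_id):
    # No annotation needed here: the mark left at allocation time is enough.
    global active_vm_bytes
    active_vm_bytes -= page_ext.pop(page_id, 0)

p = alloc_pages("page0", 4096, GFP_ACCOUNT)
assert active_vm_bytes == 4096
free_pages(p)
assert active_vm_bytes == 0
```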
On Tue, Dec 13, 2022 at 10:56 PM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> > On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 12/12/22 01:37, Yafang Shao wrote:
> > > > [... cover letter quoted in full above, trimmed here ...]
> > >
> > > So you can't easily annotate the freeing places as well, to avoid the
> > > whole tracking infrastructure?
> >
> > The trouble is kfree_rcu(). For example:
> >     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> >     kfree_rcu();
> >     active_vm_item_set(old_item);
> > If we want to pass ACTIVE_VM_BPF into the deferred rcu context, we
> > will have to change lots of code in the RCU subsystem. I'm not sure it
> > is worth it.
>
> (+Cc rcu folks)
>
> IMO adding a new kfree_rcu() variant for BPF that accounts BPF memory
> usage would be much less churn :)

Besides kfree_rcu(), there is other deferred freeing, for example
vfree_deferred(), but vfree_deferred() can be handled easily.

Once we go down this road, BPF memory usage accounting will become a burden
on every newly introduced deferred-freeing path. For example, if the MM
folks optimize the memory freeing code in the future without noticing this
BPF accounting, it will break easily. So in the long run, the restrictions
should be as few as possible.

> > > I thought there was a patchset for a whole
> > > bpf-specific memory allocator, where accounting would be implemented
> > > naturally, I would imagine.
> >
> > I posted a patchset[1] which annotates both allocating and freeing
> > several months ago.
> > But unfortunately, after more investigation and verification, I found
> > that the deferred freeing context is a problem which can't be resolved
> > easily.
> > That's why I finally decided to annotate allocating only.
> >
> > [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
> >
> > > > To store the information of a slab or a page, we need to create a new
> > > > member in struct page, but we can do it in a page extension, which
> > > > avoids changing the size of struct page. So a new page extension
> > > > active_vm is introduced. Each page and each slab which is allocated as
> > > > BPF memory will have a struct active_vm.
> > > > The reason it is named active_vm is that we can extend it to other
> > > > areas easily; for example, in the future we may use it to count other
> > > > memory usage.
> > > >
> > > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM
> > > > at compile time or the kernel parameter `active_vm=` at runtime.
> > >
> > > The issue with page_ext is the extra memory usage, so it was rather
> > > intended for debugging features that can be always compiled in, but only
> > > enabled at runtime when debugging is needed. The overhead is only paid
> > > when enabled. That's at least the case of page_owner and
> > > page_table_check. The 32bit page_idle is rather an oddity that could
> > > have instead stayed 64-bit only.
> >
> > Right, it seems currently page_ext is for debugging purposes only.
> >
> > > But this is proposing a page_ext functionality supposed to be enabled at
> > > all times in production, with the goal of improved accounting. Not
> > > on-demand debugging. I'm afraid the costs will outweigh the benefits.
> >
> > The memory overhead of this new page extension is 8/4096, which is
> > about 0.2% of total memory. Not too big to be acceptable.
>
> It's generally unacceptable to increase sizeof(struct page)
> (nor to enable page_ext by default, and that's why page_ext is for
> debugging purposes only)

Then we can disable page_ext by default. A user who wants it can pass
"active_vm=on".

> > If the user really
> > thinks this overhead is not acceptable, he can set "active_vm=off" to
> > disable it.
>
> I'd say many people won't welcome adding 0.2% of total memory by default
> to get BPF memory usage.

This 0.2% of total memory can be reduced significantly by introducing a
dynamic page extension, so it is not a problem that will last forever. We
can optimize the memory overhead step by step to make it acceptable to more
and more people.
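[Editor's note: the ~0.2% figure discussed above comes from the 8-byte extension per 4 KiB page; a quick sanity check (the 256 GiB machine size is chosen arbitrarily for illustration):]

```python
EXT_BYTES, PAGE_BYTES = 8, 4096

# Fractional overhead of one 8-byte page extension per 4 KiB page.
overhead = EXT_BYTES / PAGE_BYTES
print(f"{overhead:.4%}")           # 0.1953%, i.e. the ~0.2% quoted above

# On, say, a 256 GiB machine this amounts to:
total_bytes = 256 * 2**30
print(int(total_bytes * overhead) // 2**20, "MiB")   # 512 MiB
```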
On Tue, Dec 13, 2022 at 11:52 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/13/22 15:56, Hyeonggon Yoo wrote:
> > On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> >> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >
> >> > On 12/12/22 01:37, Yafang Shao wrote:
> >> > > [... cover letter quoted in full above, trimmed here ...]
> >> >
> >> > So you can't easily annotate the freeing places as well, to avoid the
> >> > whole tracking infrastructure?
> >>
> >> The trouble is kfree_rcu(). For example:
> >>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> >>     kfree_rcu();
> >>     active_vm_item_set(old_item);
> >> If we want to pass ACTIVE_VM_BPF into the deferred rcu context, we
> >> will have to change lots of code in the RCU subsystem. I'm not sure it
> >> is worth it.
> >
> > (+Cc rcu folks)
> >
> > IMO adding a new kfree_rcu() variant for BPF that accounts BPF memory
> > usage would be much less churn :)
>
> Alternatively, just account the bpf memory as freed already when calling
> kfree_rcu()? I think the amount of memory "in flight" to be freed by rcu is
> a separate issue (if it's actually an issue) and not something each
> kfree_rcu() user should think about separately?

I'm not sure whether it is a problem for other users as well, but I can
explain why it is an issue for BPF accounting. In BPF accounting, we need
to store something in the task_struct and use it later. So if the 'storer'
and the 'user' are different tasks, the information will be lost.

> >> > I thought there was a patchset for a whole
> >> > bpf-specific memory allocator, where accounting would be implemented
> >> > naturally, I would imagine.
> >>
> >> I posted a patchset[1] which annotates both allocating and freeing
> >> several months ago.
> >> But unfortunately, after more investigation and verification, I found
> >> that the deferred freeing context is a problem which can't be resolved
> >> easily.
> >> That's why I finally decided to annotate allocating only.
> >>
> >> [1]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/
> >>
> >> > > To store the information of a slab or a page, we need to create a new
> >> > > member in struct page, but we can do it in a page extension, which
> >> > > avoids changing the size of struct page. So a new page extension
> >> > > active_vm is introduced.
> >> > > Each page and each slab which is allocated as BPF memory will have a
> >> > > struct active_vm. The reason it is named active_vm is that we can
> >> > > extend it to other areas easily; for example, in the future we may
> >> > > use it to count other memory usage.
> >> > >
> >> > > The new page extension active_vm can be disabled via CONFIG_ACTIVE_VM
> >> > > at compile time or the kernel parameter `active_vm=` at runtime.
> >> >
> >> > The issue with page_ext is the extra memory usage, so it was rather
> >> > intended for debugging features that can be always compiled in, but only
> >> > enabled at runtime when debugging is needed. The overhead is only paid
> >> > when enabled. That's at least the case of page_owner and
> >> > page_table_check. The 32bit page_idle is rather an oddity that could
> >> > have instead stayed 64-bit only.
> >>
> >> Right, it seems currently page_ext is for debugging purposes only.
> >>
> >> > But this is proposing a page_ext functionality supposed to be enabled at
> >> > all times in production, with the goal of improved accounting. Not
> >> > on-demand debugging. I'm afraid the costs will outweigh the benefits.
> >>
> >> The memory overhead of this new page extension is 8/4096, which is
> >> about 0.2% of total memory. Not too big to be acceptable.
> >
> > It's generally unacceptable to increase sizeof(struct page)
> > (nor to enable page_ext by default, and that's why page_ext is for
> > debugging purposes only)
> >
> >> If the user really
> >> thinks this overhead is not acceptable, he can set "active_vm=off" to
> >> disable it.
> >
> > I'd say many people won't welcome adding 0.2% of total memory by default
> > to get BPF memory usage.
>
> Agreed.
>
> >> To reduce the memory overhead further, I have a bold idea.
> >> Actually, we don't need to allocate such a page extension for every
> >> page; we only need to allocate it if the user needs to access it.
> >> That said, it seems that we could allocate this kind of page extension
> >> dynamically rather than preallocating it at boot, but I haven't
> >> investigated it deeply enough to check whether it can work. What do you
> >> think?
>
> There's lots of benefits (simplicity) of page_ext being allocated as it is
> today.

These benefits are also what limit it to debugging purposes only :)
If we can make it run in production environments, it will be more useful.

> What you're suggesting will be better solved (in a few years :) by
> Matthew's bold ideas about shrinking the current struct page and allocating
> usecase-specific descriptors.

So the memory overhead won't be a problem in the future, right?
On Wed, Dec 14, 2022 at 3:22 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Dec 13, 2022 at 04:52:09PM +0100, Vlastimil Babka wrote:
> > On 12/13/22 15:56, Hyeonggon Yoo wrote:
> > > On Tue, Dec 13, 2022 at 07:52:42PM +0800, Yafang Shao wrote:
> > >> On Tue, Dec 13, 2022 at 1:54 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >> >
> > >> > On 12/12/22 01:37, Yafang Shao wrote:
> > >> > > [... cover letter quoted in full above, trimmed here ...]
> > >> > > To avoid counting nested allocations, for example the percpu
> > >> > > backing allocator, we reuse __GFP_ACCOUNT to filter them out.
> > >> > > __GFP_ACCOUNT also makes the count consistent with memcg accounting.
> > >> >
> > >> > So you can't easily annotate the freeing places as well, to avoid the
> > >> > whole tracking infrastructure?
> > >>
> > >> The trouble is kfree_rcu(). For example:
> > >>     old_item = active_vm_item_set(ACTIVE_VM_BPF);
> > >>     kfree_rcu();
> > >>     active_vm_item_set(old_item);
> > >> If we want to pass ACTIVE_VM_BPF into the deferred rcu context, we
> > >> will have to change lots of code in the RCU subsystem. I'm not sure it
> > >> is worth it.
> > >
> > > (+Cc rcu folks)
> > >
> > > IMO adding a new kfree_rcu() variant for BPF that accounts BPF memory
> > > usage would be much less churn :)
> >
> > Alternatively, just account the bpf memory as freed already when calling
> > kfree_rcu()? I think the amount of memory "in flight" to be freed by rcu
> > is a separate issue (if it's actually an issue) and not something each
> > kfree_rcu() user should think about separately?
>
> If the in-flight memory really does need to be accounted for, then one
> straightforward approach is to use call_rcu() and do the first part of
> the needed accounting at the call_rcu() callsite and the rest of the
> accounting when the callback is invoked. Or, if memory must be freed
> quickly even on ChromeOS and Android, use call_rcu_hurry() instead
> of call_rcu().

Right, call_rcu() can make it work.
But I'm not sure whether all kfree_rcu() calls in kernel/bpf can be replaced
by call_rcu(). Alexei, any comment on it?
$ grep -r "kfree_rcu" kernel/bpf/
kernel/bpf/local_storage.c:     kfree_rcu(new, rcu);
kernel/bpf/lpm_trie.c:          kfree_rcu(node, rcu);
kernel/bpf/lpm_trie.c:          kfree_rcu(parent, rcu);
kernel/bpf/lpm_trie.c:          kfree_rcu(node, rcu);
kernel/bpf/lpm_trie.c:          kfree_rcu(node, rcu);
kernel/bpf/bpf_inode_storage.c: kfree_rcu(local_storage, rcu);
kernel/bpf/bpf_task_storage.c:  kfree_rcu(local_storage, rcu);
kernel/bpf/trampoline.c:        kfree_rcu(im, rcu);
kernel/bpf/core.c:              kfree_rcu(progs, rcu);
kernel/bpf/core.c:              * no need to call kfree_rcu(), just call kfree() directly.
kernel/bpf/core.c:              kfree_rcu(progs, rcu);
kernel/bpf/bpf_local_storage.c: * kfree(), else do kfree_rcu().
kernel/bpf/bpf_local_storage.c: kfree_rcu(local_storage, rcu);
kernel/bpf/bpf_local_storage.c: kfree_rcu(selem, rcu);
kernel/bpf/bpf_local_storage.c: kfree_rcu(selem, rcu);
kernel/bpf/bpf_local_storage.c: kfree_rcu(local_storage, rcu);
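[Editor's note: Paul's call_rcu()-based suggestion, i.e. do the first part of the accounting at the callsite and the rest when the callback runs, can be sketched as a toy model. The queue and function names below are stand-ins for illustration, not the RCU API.]

```python
# Toy model of two-part accounting across a deferred free: the usage counter
# is decremented at the call_rcu()-style callsite, so "in flight" memory is
# already reported as freed; the deferred callback only does the real kfree.
bpf_bytes = 0
rcu_callbacks = []       # stand-in for the RCU callback queue

def bpf_alloc(size):
    global bpf_bytes
    bpf_bytes += size
    return {"size": size, "freed": False}

def call_rcu_free(obj):
    # First part of the accounting happens immediately, at the callsite...
    global bpf_bytes
    bpf_bytes -= obj["size"]
    rcu_callbacks.append(obj)

def rcu_invoke_callbacks():
    # ...and the actual free happens later, after the grace period.
    while rcu_callbacks:
        rcu_callbacks.pop()["freed"] = True

m = bpf_alloc(4096)
call_rcu_free(m)
assert bpf_bytes == 0    # accounted as freed before the grace period ends
rcu_invoke_callbacks()
assert m["freed"]
```

This matches Vlastimil's observation above: the accounting no longer depends on which task (or context) eventually runs the callback, sidestepping the task_struct hand-off problem.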