Message ID: 20240702084423.1717904-1-link@vivo.com
Series: Introduce PMC (PER-MEMCG-CACHE)
On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
> This patchset would like to discuss an idea called PMC (PER-MEMCG-CACHE).
>
> Background
> ===
>
> Modern computer systems always have performance gaps between hardware
> components, such as the performance differences between CPU, memory, and
> disk. Due to the principle of locality of reference in data access:
>
> 1. Programs often access data that has been accessed before.
> 2. Programs access the next set of data after accessing a particular
>    piece of data.
>
> As a result:
> 1. CPU caches are used to speed up access to memory that has already
>    been accessed.
> 2. Disk prefetching techniques are used to prepare the next set of data
>    to be accessed in advance (to avoid direct disk access).
>
> This basic exploitation of locality greatly enhances computer performance.
>
> PMC (per-MEMCG-cache) is similar: it exploits a principle of locality to
> enhance program performance.
>
> In modern computers, especially smartphones, services are provided to
> users on a per-application basis (such as Camera, Chat, etc.), where an
> application is composed of multiple processes working together to provide
> the service.
>
> The basic unit for managing resources in a computer is the process, which
> in turn uses threads to share memory and accomplish tasks. Memory is
> shared among the threads within a process.
>
> However, modern computers have the following locality deficiencies:
>
> 1. Different forms of memory exist and are not interconnected (anonymous
>    pages, file pages, special memory such as DMA-BUF, various kernel-mode
>    allocations, etc.).
> 2. Memory is isolated between processes; apart from explicitly shared
>    memory, processes do not exchange memory with each other.
> 3. When an application switches between functions, one process usually
>    releases memory while another process requests it, and the requesting
>    process has to obtain that memory from the lowest level through
>    competition.
>
> Take a camera application as an example:
>
> Camera applications typically provide a photo capture service as well as
> a photo preview service. The capture process usually uses DMA-BUF to
> share image data between the CPU and DMA devices. For preview, multiple
> algorithm processes are typically involved in processing the image data,
> which may also involve heap memory and other resources.
>
> During the switch between capture and preview, the application typically
> releases DMA-BUF memory and the algorithms then allocate heap memory. The
> flow of system memory during this process is managed by the PCP/buddy
> system.
>
> However, the PCP and buddy pools are shared system-wide, so subsequently
> requested memory may not be available because previously freed memory has
> already been consumed elsewhere (for file reading, for example),
> requiring a competitive process (memory reclaim) to obtain it.
>
> So, if the released memory can be handed out with high priority within
> the same application, we can satisfy the locality requirement, improve
> performance, and avoid unnecessary memory reclaim.
>
> The PMC solution is similar to PCP in that both establish cache pools
> according to certain rules.
>
> Why base it on MEMCG?
> ===
>
> A MEMCG container can hold selected processes grouped by certain
> strategies (typical examples include grouping by app or by UID).
> Processes within the same MEMCG can then be used for statistics, upper
> limit restrictions, and reclaim control.
>
> All processes within a MEMCG are treated as a single memory unit, sharing
> memory among themselves. As a result, when one process releases memory,
> another process within the same group can obtain it with the highest
> priority, fully exploiting the locality of memory allocation within the
> MEMCG (such as APP grouping).
>
> In addition, MEMCG provides feature interfaces that can be toggled
> dynamically and are fully controllable by policy. This provides greater
> flexibility and does not impact performance when disabled (controlled
> through a static key).
>
>
> About the PMC implementation
> ===
> A cache switch is provided for each MEMCG (except the root). When the
> user enables the cache, processes within the MEMCG share memory through
> it.
>
> The cache pool sits in front of the PCP. Every order-0 page released by a
> process in the MEMCG is released into the cache pool first, and when
> memory is requested, it is likewise obtained from the cache pool first.
>
> `memory.cache` is the sole entry point for controlling PMC; it accepts
> the following nested keys:
> 1. "enable=[y|n]" enables or disables the targeted MEMCG's cache.
> 2. "keys=nid=%d,watermark=%u,reaper_time=%u,limit=%u" tunes an already
>    enabled PMC:
>    a) `nid` targets a single node; otherwise all nodes are changed.
>    b) `watermark` controls caching behavior: pages are cached on release
>       only while the zone's free pages exceed the zone high watermark
>       plus this value. (unit: bytes, default 50MB, min 10MB
>       per-node-all-zones)
>    c) `reaper_time` controls the reaper interval; when it expires, all
>       cache in this MEMCG is reaped. (unit: us, default 5s, 0 disables)
>    d) `limit` caps the maximum memory held by the cache pool. (unit:
>       bytes, default 100MB, max 500MB per-node-all-zones)
>
> Performance
> ===
> PMC is based on MEMCG, and measuring it requires complex workloads shared
> between application processes. Therefore, at the moment, we are unable to
> provide a better testing methodology for this patchset.
>
> Here are the internal test results we can provide, using the camera
> application as an example. (1-NODE-1-ZONE-8G-RAM)
>
> Test case: capture in rear portrait HDR mode
> 1. Test mode: rear portrait HDR mode. This scene needs more than 800MB of
>    RAM, including dmabuf (470MB), PSS (150MB), and APU (200MB).
> 2. Test steps: take a photo, then click the thumbnail to view the full
>    image.
>
> The overall latency from clicking the shutter button to showing the whole
> image improves by 500ms, and the total slowpath cost across all camera
> threads is reduced from 958ms to 495ms. Especially for shot-to-shot in
> this mode, the preview delay of each frame shows a significant
> improvement.

Hello Huan,

thank you for sharing your work.

Some high-level thoughts:
1) Naming is hard, but it took me quite a while to realize that you're
talking about free memory. Cache is obviously an overloaded term, but
per-memcg-cache can mean absolutely anything (pagecache? cpu cache? ...),
so maybe it's not the best choice.
2) Overall an idea to have a per-memcg free memory pool makes sense to me,
especially if we talk 2MB or 1GB pages (or order > 0 in general).
3) You absolutely have to integrate the reclaim mechanism with the generic
memory reclaim mechanism, which is driven by memory pressure.
4) You claim a ~50% performance win in your workload, which is a lot. It's
not clear to me where it's coming from. It's hard to believe the page
allocation/release paths are taking 50% of the cpu time. Please, clarify.

There are a lot of other questions, and you highlighted some of them below
(and these are indeed the right questions to ask), but let's start with
something.

Thanks
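To make the proposed knobs concrete, here is a minimal userspace sketch of
driving the `memory.cache` interface described in the cover letter. The
file only exists with this patchset applied; the cgroup path and the
values are illustrative assumptions.

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static int write_str(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0)
			return -1;
		if (write(fd, val, strlen(val)) < 0) {
			close(fd);
			return -1;
		}
		return close(fd);
	}

	int main(void)
	{
		const char *f = "/sys/fs/cgroup/camera/memory.cache";

		/* Enable the cache for this memcg. */
		if (write_str(f, "enable=y"))
			perror("enable");

		/* Tune node 0: 50MB watermark, 5s reaper, 100MB limit. */
		if (write_str(f,
			      "keys=nid=0,watermark=52428800,reaper_time=5000000,limit=104857600"))
			perror("keys");

		return 0;
	}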
On 2024/7/3 3:27, Roman Gushchin wrote:
> On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
> > This patchset would like to discuss an idea called PMC (PER-MEMCG-CACHE).

[...]

> Hello Huan,
>
> thank you for sharing your work.

thanks

> Some high-level thoughts:
> 1) Naming is hard, but it took me quite a while to realize that you're
> talking

Haha, sorry for my poor English.

> about free memory. Cache is obviously an overloaded term, but
> per-memcg-cache can mean absolutely anything (pagecache? cpu cache? ...),
> so maybe it's not

Currently, my idea is that all memory released by processes under a memcg
goes into the `cache`; its original attributes are ignored, and it can be
freely requested by processes under the same memcg (so dma-buf, page
cache, heap, driver memory, and so on). Maybe the name PMP would be
friendlier? :)

> the best choice.
> 2) Overall an idea to have a per-memcg free memory pool makes sense to me,
> especially if we talk 2MB or 1GB pages (or order > 0 in general).

I like it too :)

> 3) You absolutely have to integrate the reclaim mechanism with the generic
> memory reclaim mechanism, which is driven by memory pressure.

Yes, I will think about it.

> 4) You claim a ~50% performance win in your workload, which is a lot. It's
> not clear to me where it's coming from. It's hard to believe the page
> allocation/release paths are taking 50% of the cpu time. Please, clarify.

Let me describe it more specifically. In our test scenario, we have 8GB of
RAM, and our camera application has a complex set of algorithms, with a
peak memory requirement of up to 3GB.

Therefore, in a multi-application background scenario, starting the camera
and taking photos creates very high memory pressure. In this scenario, any
released memory is quickly consumed by other processes (for file pages,
for example).

So, during the switch from camera capture to preview, DMA-BUF memory is
released while the memory used for the preview algorithm is simultaneously
requested. We have to take a lot of slow path routes to obtain enough
memory for the preview algorithm, and the just-released DMA-BUF memory
does not seem to help much.

But with PMC (let's call it that for now), we are able to quickly satisfy
the memory needs of the subsequent preview process with the just-released
DMA-BUF memory, without having to go through the slow path, resulting in a
significant performance improvement. (Of course, breaking the migrate type
may not be good.)

> There are a lot of other questions, and you highlighted some of them below
> (and these are indeed the right questions to ask), but let's start with
> something.
>
> Thanks

Thanks
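As a toy illustration of the allocation order Huan describes — a per-memcg
pool consulted before the shared PCP/buddy pools — consider this userspace
model. It is purely a sketch; none of the names or structures come from
the patchset.

	#include <stdio.h>
	#include <stdlib.h>

	struct page { struct page *next; };

	struct pool {
		struct page *head;
		unsigned long count, limit;
	};

	static struct pool shared_pool;		/* stands in for PCP/buddy */

	static void pool_put(struct pool *p, struct page *pg)
	{
		pg->next = p->head;
		p->head = pg;
		p->count++;
	}

	static struct page *pool_get(struct pool *p)
	{
		struct page *pg = p->head;

		if (pg) {
			p->head = pg->next;
			p->count--;
		}
		return pg;
	}

	/* Free path: prefer the app's own pool while below its limit. */
	static void app_free_page(struct pool *pmc, struct page *pg)
	{
		if (pmc->count < pmc->limit)
			pool_put(pmc, pg);
		else
			pool_put(&shared_pool, pg);
	}

	/* Alloc path: the app's pool is checked before the shared pool. */
	static struct page *app_alloc_page(struct pool *pmc)
	{
		struct page *pg = pool_get(pmc);

		return pg ? pg : pool_get(&shared_pool);
	}

	int main(void)
	{
		struct pool camera = { .limit = 100 * 1024 * 1024 / 4096 };
		struct page *pg = malloc(sizeof(*pg));

		app_free_page(&camera, pg);	/* e.g. DMA-BUF teardown */
		printf("reused own page: %d\n", app_alloc_page(&camera) == pg);
		free(pg);
		return 0;
	}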
On Wed, Jul 03, 2024 at 10:23:35AM GMT, Huan Yang wrote:
> On 2024/7/3 3:27, Roman Gushchin wrote:
[...]
> But with PMC (let's call it that for now), we are able to quickly satisfy
> the memory needs of the subsequent preview process with the just-released
> DMA-BUF memory, without having to go through the slow path, resulting in
> a significant performance improvement. (Of course, breaking the migrate
> type may not be good.)

Please correct me if I am wrong. IIUC you have applications with different
latency or performance requirements running on the same system, but the
system is memory constrained. You want applications with stringent
performance requirements to go into the allocation slowpath less, and you
want the lower priority (or no perf requirement) applications to do more
slowpath work (reclaim/compaction) for themselves as well as for the high
priority applications.

What about allocations from softirqs or non-memcg-aware kernel
allocations?

An alternative would be something similar to the watermark-based approach:
low priority applications (or kswapds) do reclaim/compaction at a higher,
newly defined watermark, and the higher priority applications are
protected through the usual memcg protection.

I can see another use-case for whatever solution we come up with, and that
is a reliable userspace oom-killer.

Shakeel
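For reference, the "usual memcg protection" Shakeel mentions is the
existing cgroup v2 memory.min/memory.low interface. A sketch of protecting
the high-priority app's memcg might look as follows; the cgroup path and
the 800MB value are illustrative assumptions.

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path = "/sys/fs/cgroup/camera/memory.low";
		const char *val = "838860800";	/* 800MB, as in the test case */
		int fd = open(path, O_WRONLY);

		/* Global reclaim will then prefer unprotected groups. */
		if (fd < 0 || write(fd, val, strlen(val)) < 0)
			perror("memory.low");
		if (fd >= 0)
			close(fd);
		return 0;
	}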
On Tue, Jul 2, 2024 at 7:23 PM Huan Yang <link@vivo.com> wrote:
>
> On 2024/7/3 3:27, Roman Gushchin wrote:
> > On Tue, Jul 02, 2024 at 04:44:03PM +0800, Huan Yang wrote:
[...]
> > 4) You claim a ~50% performance win in your workload, which is a lot.
> > It's not clear to me where it's coming from. It's hard to believe the
> > page allocation/release paths are taking 50% of the cpu time. Please,
> > clarify.
>
> Let me describe it more specifically. In our test scenario, we have 8GB
> of RAM, and our camera application has a complex set of algorithms, with
> a peak memory requirement of up to 3GB.
>
> Therefore, in a multi-application background scenario, starting the
> camera and taking photos creates very high memory pressure. In this
> scenario, any released memory is quickly consumed by other processes
> (for file pages, for example).
>
> So, during the switch from camera capture to preview, DMA-BUF memory is
> released while the memory used for the preview algorithm is
> simultaneously requested. We have to take a lot of slow path routes to
> obtain enough memory for the preview algorithm, and the just-released
> DMA-BUF memory does not seem to help much.

Hi Huan,

I find this part surprising. Assuming the dmabuf memory doesn't first go
into a page pool (used for some buffers, not all) and actually does get
freed synchronously with fput, this would mean it gets sucked up by other
supposedly background processes before it can be allocated by the preview
process. I thought the preview process was the one most desperate for
memory? You mention file pages, but where is this newly-freed memory
actually going if not to the preview process?

My initial reaction was the same as Roman's, that the PMC should be hooked
up to reclaim instead of depending on the reaper. But I think this might
suggest that wouldn't work, because the system is under such high memory
pressure that reclaim would likely have emptied the PMCs before the
preview process could use them.

One more thing I find odd is that for this to work, a significant portion
of your dmabuf pages would have to be order 0, but we're talking about a
~500M buffer. Does whatever exports this buffer not try to use higher
order pages like here?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/heaps/system_heap.c?h=v6.9#n54

Thanks!
-T.J.

> But with PMC (let's call it that for now), we are able to quickly satisfy
> the memory needs of the subsequent preview process with the just-released
> DMA-BUF memory, without having to go through the slow path, resulting in
> a significant performance improvement.
>
> (Of course, breaking the migrate type may not be good.)
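For context, the allocation strategy at the linked line of system_heap.c
tries a descending list of orders, using non-blocking GFP flags for the
large orders so that falling back is cheap. A condensed paraphrase (see
the link above for the real code):

	#define LOW_ORDER_GFP  (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
	#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
				  | __GFP_NORETRY) & ~__GFP_RECLAIM) \
				  | __GFP_COMP)

	static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP,
				      LOW_ORDER_GFP};
	static const unsigned int orders[] = {8, 4, 0};

	static struct page *alloc_largest_available(unsigned long size,
						    unsigned int max_order)
	{
		struct page *page;
		int i;

		for (i = 0; i < 3; i++) {
			if (size < (PAGE_SIZE << orders[i]))
				continue;
			if (max_order < orders[i])
				continue;
			/* High orders fail fast instead of reclaiming. */
			page = alloc_pages(order_flags[i], orders[i]);
			if (!page)
				continue;	/* fall back to smaller order */
			return page;
		}
		return NULL;
	}

Under fragmentation the order-8 and order-4 attempts fail without entering
reclaim, which is why, as discussed below, most pages end up order 0.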
On 2024/7/4 6:59, T.J. Mercier wrote:
> On Tue, Jul 2, 2024 at 7:23 PM Huan Yang <link@vivo.com> wrote:
[...]
> Hi Huan,

Hi T.J.

> I find this part surprising. Assuming the dmabuf memory doesn't first
> go into a page pool (used for some buffers, not all) and actually does

Actually, when PMC is enabled, we make the page free path avoid freeing
into the page pool.

> get freed synchronously with fput, this would mean it gets sucked up
> by other supposedly background processes before it can be allocated by
> the preview process. I thought the preview process was the one most
> desperate for memory? You mention file pages, but where is this
> newly-freed memory actually going if not to the preview process? My

This was discovered through a meminfo observation program. When the
dma-buf is released, there is a noticeable increase in cache. This may be
triggered by the pagecache when loading the algorithm model.

Additionally, the algorithm heap memory cannot benefit from the release of
the dma-buf. I believe this is related to the migratetype: the stack/heap
cannot get priority access to the dma-buf memory released by the kernel
(HIGHUSER_MOVABLE).

So PMC breaks that boundary and shares all of the memory, even if that is
incorrect :) (If my understanding of the fragmentation issue is wrong,
please correct me.)

> initial reaction was the same as Roman's, that the PMC should be hooked
> up to reclaim instead of depending on the reaper. But I think this might
> suggest that wouldn't work, because the system is under such high memory
> pressure that reclaim would likely have emptied the PMCs before the
> preview process could use them.

The point you raised is indeed very likely to happen, as there is immense
memory pressure. Currently, we only open the PMC when the application is
in the foreground, and close it when the application goes to the
background. It is indeed unnecessary to drain the PMC while the
application is in the foreground, and a longer reaper timeout would be
more useful. (Thanks to the flexibility provided by memcg.)

> One more thing I find odd is that for this to work, a significant portion
> of your dmabuf pages would have to be order 0, but we're talking about a
> ~500M buffer. Does whatever exports this buffer not try to use higher
> order pages like here?

Yes, our heap is actually configured with orders 8, 4, and 0, but in our
practical application and observation, it is often difficult to satisfy
the high-order allocations, so falling back to order 0 is the most common
case. (Therefore, for our MID_ORDER allocation we use LOW_ORDER_GFP.)

Just like the testing scenario I mentioned earlier, with 8GB of RAM and
the camera peaking at around 3GB, the fragmentation at that point causes
most of the DMA-BUF allocations to fall back to order 0.

PMC is intended for real-world, high-load applications; I don't think it's
very practical for regular applications.

Thanks
HY

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dma-buf/heaps/system_heap.c?h=v6.9#n54
>
> Thanks!
> -T.J.
On 2024/7/4 1:27, Shakeel Butt wrote:
> On Wed, Jul 03, 2024 at 10:23:35AM GMT, Huan Yang wrote:
[...]
> Please correct me if I am wrong. IIUC you have applications with
> different latency or performance requirements running on the same
> system, but the system is memory constrained. You want applications with
> stringent performance requirements to go into the allocation slowpath
> less, and you want the lower priority (or no perf requirement)
> applications to do more slowpath work (reclaim/compaction) for
> themselves as well as for the high priority applications.

Yes, PMC does carry the idea of priority control. On a smartphone, the
aspect users perceive most strongly is the foreground app. In the scenario
I described, the camera application should have absolute priority for
memory, and its internal memory usage should be served first (especially
when we place the PMC allocation ahead of the buddy free path).

> What about allocations from softirqs or non-memcg-aware kernel
> allocations?

Sorry, softirqs I can't explain. But many kernel threads are also placed
in the root memcg. In our scenario, we set all processes related to the
camera application into the same memcg (both user and kernel threads).

> An alternative would be something similar to the watermark-based
> approach: low priority applications (or kswapds) do reclaim/compaction
> at a higher, newly defined watermark, and the higher priority
> applications are protected through the usual memcg protection.

Again, please correct me if I am wrong, but I understand that even with a
boost, watermark control cannot finely select which applications or
processes should reclaim at the higher watermark; application grouping and
selection would need to be re-implemented. Through PMC, we can proactively
group the processes required by the application, opening the cache only
when the app enters the foreground and closing it in the background.

> I can see another use-case for whatever solution we come up with, and
> that is a reliable userspace oom-killer.

Yes, that would help LMKD. Unfortunately, our product also has other
dimensions of assessment, including application persistence. This means
that when the camera launches, we can only kill unnecessary applications
to free up a small amount of memory for its startup; when it then requests
memory for taking a photo, allocation is relatively lazy during the
kill-check phase.

And one more thing: the memory released by killing applications may not
arrive in time to meet the instantaneous memory requirements (many
zram-compressed pages, which are not too fast to decompress).

Thanks,
HY

> Shakeel
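As a sketch of the grouping Huan describes — moving every camera-related
task into one memcg — the standard cgroup v2 cgroup.procs file can be
used. The cgroup path and the PID list below are illustrative assumptions
(and note that mainline cgroup v2 does not allow moving kernel threads out
of the root cgroup; that part would need a vendor change).

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int move_pid(const char *cgroup, pid_t pid)
	{
		char path[256], buf[32];
		int fd, n;

		snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup);
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return -1;
		n = snprintf(buf, sizeof(buf), "%d", pid);
		if (write(fd, buf, n) < 0) {
			close(fd);
			return -1;
		}
		return close(fd);
	}

	int main(void)
	{
		/* Hypothetical PIDs of capture, preview, algorithm tasks. */
		pid_t camera_tasks[] = { 1234, 1235, 1236 };
		unsigned int i;

		for (i = 0; i < sizeof(camera_tasks) / sizeof(camera_tasks[0]); i++)
			if (move_pid("/sys/fs/cgroup/camera", camera_tasks[i]))
				perror("cgroup.procs");
		return 0;
	}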
On Wed, Jul 3, 2024 at 7:29 PM Huan Yang <link@vivo.com> wrote:
>
> On 2024/7/4 6:59, T.J. Mercier wrote:
[...]
> This was discovered through a meminfo observation program. When the
> dma-buf is released, there is a noticeable increase in cache. This may
> be triggered by the pagecache when loading the algorithm model.
>
> Additionally, the algorithm heap memory cannot benefit from the release
> of the dma-buf. I believe this is related to the migratetype: the
> stack/heap cannot get priority access to the dma-buf memory released by
> the kernel (HIGHUSER_MOVABLE).
>
> So PMC breaks that boundary and shares all of the memory, even if that
> is incorrect :) (If my understanding of the fragmentation issue is
> wrong, please correct me.)

Oh, that would make sense, but then the memory *is* going to your preview
process, just not in the form you were hoping for. So model loading and
your heap allocations were fighting for memory, probably thrashing the
file pages? I guess it's more important for your app's performance to get
the heap allocations done first, and I think I can understand how PMC
would give a sort of priority to those over the file pages during the
preview transition. Ok. Sorry, I don't have an opinion on this part yet,
if that's what's happening.

> The point you raised is indeed very likely to happen, as there is
> immense memory pressure. Currently, we only open the PMC when the
> application is in the foreground, and close it when the application goes
> to the background. It is indeed unnecessary to drain the PMC while the
> application is in the foreground, and a longer reaper timeout would be
> more useful. (Thanks to the flexibility provided by memcg.)
>
> Yes, our heap is actually configured with orders 8, 4, and 0, but in our
> practical application and observation, it is often difficult to satisfy
> the high-order allocations, so falling back to order 0 is the most
> common case.
>
> Just like the testing scenario I mentioned earlier, with 8GB of RAM and
> the camera peaking at around 3GB, the fragmentation at that point causes
> most of the DMA-BUF allocations to fall back to order 0.
>
> PMC is intended for real-world, high-load applications; I don't think
> it's very practical for regular applications.

Got it, thanks.

> Thanks
> HY