Message ID | 1585221127-11458-1-git-send-email-laoar.shao@gmail.com (mailing list archive) |
---|---|
Series | psi: enhance psi with the help of ebpf |
On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> PSI gives us a powerful way to analyze memory pressure issues, but we
> can make it more powerful with the help of tracepoints, kprobes, eBPF,
> etc. Especially with eBPF we can flexibly get more details of the
> memory pressure.
>
> In order to achieve this goal, a new parameter is added to
> psi_memstall_{enter, leave}, which indicates the specific type of a
> memstall. There are ten memstall types so far:
>   MEMSTALL_KSWAPD
>   MEMSTALL_RECLAIM_DIRECT
>   MEMSTALL_RECLAIM_MEMCG
>   MEMSTALL_RECLAIM_HIGH
>   MEMSTALL_KCOMPACTD
>   MEMSTALL_COMPACT
>   MEMSTALL_WORKINGSET_REFAULT
>   MEMSTALL_WORKINGSET_THRASHING
>   MEMSTALL_MEMDELAY
>   MEMSTALL_SWAPIO

What does this provide over the events tracked in /proc/vmstat?

Can you elaborate a bit on how you are using this information? It's not
quite clear to me from the example in patch #2.
On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> > PSI gives us a powerful way to analyze memory pressure issues, but we
> > can make it more powerful with the help of tracepoints, kprobes, eBPF,
> > etc. Especially with eBPF we can flexibly get more details of the
> > memory pressure.
> >
> > In order to achieve this goal, a new parameter is added to
> > psi_memstall_{enter, leave}, which indicates the specific type of a
> > memstall. There are ten memstall types so far:
> >   MEMSTALL_KSWAPD
> >   MEMSTALL_RECLAIM_DIRECT
> >   MEMSTALL_RECLAIM_MEMCG
> >   MEMSTALL_RECLAIM_HIGH
> >   MEMSTALL_KCOMPACTD
> >   MEMSTALL_COMPACT
> >   MEMSTALL_WORKINGSET_REFAULT
> >   MEMSTALL_WORKINGSET_THRASHING
> >   MEMSTALL_MEMDELAY
> >   MEMSTALL_SWAPIO
>
> What does this provide over the events tracked in /proc/vmstat?
>

/proc/vmstat only tells us which events occurred, but it can't tell us
how long these events take. Sometimes we really want to know how long an
event takes, and PSI can provide us that data.

For example, when I was doing performance tuning for a database service,
I observed that the latency spikes correlated with the workingset_refault
counter in /proc/vmstat, and at that time I really wanted to know the
distribution of latencies caused by workingset refaults, but there was no
easy way to get it. Now, with the newly added MEMSTALL_WORKINGSET_REFAULT,
I can get the latencies caused by workingset refaults.

> Can you elaborate a bit on how you are using this information? It's not
> quite clear to me from the example in patch #2.
>

From the traced data in patch #2, we can see that the high latencies of
user tasks are always memstall type 7, which is
MEMSTALL_WORKINGSET_THRASHING; we should then look into the details of
the workingset of those user tasks and think about how to improve it -
for example, by reducing the workingset.

BTW, there's an error in the definition of show_psi_memstall_type() in
patch #2 (that's an old version); I will correct it.

To summarize: with the pressure data in /proc/pressure/memory we know
that the system is under memory pressure, and with the newly added
tracing facility in this patchset we can get the reason for this memory
pressure and then think about what to change. The workflow can be
illustrated as below:

                          REASON       ACTION
                        | compaction | look into the details of compaction
    Memory pressure --- | vmscan     | look into the details of vmscan
                        | workingset | look into the details of workingset
                        | etc        | ...

Thanks
Yafang
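As an illustration of the measurement described above, a minimal bpftrace
sketch could attach to psi_memstall_{enter, leave} and histogram the
refault stall time. This assumes the patchset's extra "type" argument
(arg1) and that the enum values follow the order listed in the cover
letter, so that MEMSTALL_WORKINGSET_REFAULT == 6; both are assumptions
about this proposal, not part of the upstream kernel:

    // Histogram of workingset-refault memstall latency, in microseconds.
    // arg1 is assumed to be the proposed memstall type, and 6 is assumed
    // to be MEMSTALL_WORKINGSET_REFAULT.
    kprobe:psi_memstall_enter /arg1 == 6/ { @start[tid] = nsecs; }
    kprobe:psi_memstall_leave /arg1 == 6 && @start[tid]/ {
            @refault_stall_us = hist((nsecs - @start[tid]) / 1000);
            delete(@start[tid]);
    }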
On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> > > PSI gives us a powerful way to analyze memory pressure issues, but we
> > > can make it more powerful with the help of tracepoints, kprobes, eBPF,
> > > etc. Especially with eBPF we can flexibly get more details of the
> > > memory pressure.
> > > [...]
> >
> > What does this provide over the events tracked in /proc/vmstat?
> >
>
> /proc/vmstat only tells us which events occurred, but it can't tell us
> how long these events take. Sometimes we really want to know how long an
> event takes, and PSI can provide us that data.
>
> For example, when I was doing performance tuning for a database service,
> I observed that the latency spikes correlated with the workingset_refault
> counter in /proc/vmstat, and at that time I really wanted to know the
> distribution of latencies caused by workingset refaults, but there was no
> easy way to get it. Now, with the newly added MEMSTALL_WORKINGSET_REFAULT,
> I can get the latencies caused by workingset refaults.

Okay, but how do you use that information in practice?

> > Can you elaborate a bit on how you are using this information? It's not
> > quite clear to me from the example in patch #2.
> >
>
> From the traced data in patch #2, we can see that the high latencies of
> user tasks are always memstall type 7, which is
> MEMSTALL_WORKINGSET_THRASHING; we should then look into the details of
> the workingset of those user tasks and think about how to improve it -
> for example, by reducing the workingset.

That's an analysis we run frequently as well: we see high pressure, and
then correlate it with the events.

High rate of refaults? The workingset is too big.

High rate of compaction work? Somebody is asking for higher-order pages
under load; check THP events next.

etc.

This works fairly reliably. I'm curious what the extra per-event latency
breakdown would add and where it would be helpful.

I'm not really opposed to your patches if it is, I just don't see the
usecase right now.
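For reference, the "correlate pressure with events" workflow can also be
watched live with a short bpftrace script on the existing vmscan and
compaction tracepoints (a sketch only; it uses stock tracepoints rather
than the /proc/vmstat counters discussed above):

    // Per-second counts of direct reclaim and compaction activity.
    tracepoint:vmscan:mm_vmscan_direct_reclaim_begin { @direct_reclaim = count(); }
    tracepoint:compaction:mm_compaction_begin        { @compaction = count(); }
    interval:s:1 {
            print(@direct_reclaim);
            print(@compaction);
            clear(@direct_reclaim);
            clear(@compaction);
    }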
On Tue, Mar 31, 2020 at 11:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> > On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Mar 26, 2020 at 07:12:05AM -0400, Yafang Shao wrote:
> > > > PSI gives us a powerful way to analyze memory pressure issues, but we
> > > > can make it more powerful with the help of tracepoints, kprobes, eBPF,
> > > > etc. Especially with eBPF we can flexibly get more details of the
> > > > memory pressure.
> > > > [...]
> > >
> > > What does this provide over the events tracked in /proc/vmstat?
> > >
> >
> > /proc/vmstat only tells us which events occurred, but it can't tell us
> > how long these events take. Sometimes we really want to know how long an
> > event takes, and PSI can provide us that data.
> >
> > For example, when I was doing performance tuning for a database service,
> > I observed that the latency spikes correlated with the workingset_refault
> > counter in /proc/vmstat, and at that time I really wanted to know the
> > distribution of latencies caused by workingset refaults, but there was no
> > easy way to get it. Now, with the newly added MEMSTALL_WORKINGSET_REFAULT,
> > I can get the latencies caused by workingset refaults.
>
> Okay, but how do you use that information in practice?
>

With the newly added facility, we can know when these events occur and
how long each event takes. We can then use these data to generate a
latency heat map [1] and compare whether these latencies match the
application latencies recorded in its logs - for example, the slow query
log in MySQL. If the refault latencies match the slow query log, then
those slow queries are caused by workingset refaults. If the refault
latencies don't match the slow query log, IOW are much smaller than the
slow queries, then the slow queries aren't caused by workingset refaults.

A high rate of refaults may not cause high pressure if the backing device
is fast enough, while the latencies of refaults have a direct
relationship with memory pressure.

[1]. http://www.brendangregg.com/HeatMaps/latency.html

> > > Can you elaborate a bit on how you are using this information? It's not
> > > quite clear to me from the example in patch #2.
> > >
> >
> > From the traced data in patch #2, we can see that the high latencies of
> > user tasks are always memstall type 7, which is
> > MEMSTALL_WORKINGSET_THRASHING; we should then look into the details of
> > the workingset of those user tasks and think about how to improve it -
> > for example, by reducing the workingset.
>
> That's an analysis we run frequently as well: we see high pressure, and
> then correlate it with the events.
>
> High rate of refaults? The workingset is too big.
>
> High rate of compaction work? Somebody is asking for higher-order pages
> under load; check THP events next.
>
> etc.
>
> This works fairly reliably. I'm curious what the extra per-event latency
> breakdown would add and where it would be helpful.
>
> I'm not really opposed to your patches if it is, I just don't see the
> usecase right now.
>

As I explained above, the rates of these events can't give us a full view
of the memory pressure. High memory pressure may not be caused by a high
rate of vmstat events; it can also be caused by a low rate of events with
high latencies. Latencies are what the application really cares about,
and that's why PSI is so useful for us.

Furthermore, some types of memstall are not recorded in vmstat at all,
e.g.:

    type of memstall               vmstat event

    MEMSTALL_KSWAPD                pageoutrun, {pgscan, pgsteal}_kswapd
    MEMSTALL_RECLAIM_DIRECT        {pgscan, pgsteal}_direct
    MEMSTALL_RECLAIM_MEMCG         /* no event */
    MEMSTALL_RECLAIM_HIGH          /* no event */
    MEMSTALL_KCOMPACTD             compact_daemon_wake
    MEMSTALL_COMPACT               compact_{stall, fail, success}
    MEMSTALL_WORKINGSET_REFAULT    workingset_refault
    MEMSTALL_WORKINGSET_THRASH     /* no event */
    MEMSTALL_MEMDELAY              /* no event */
    MEMSTALL_SWAPIO                pswpin

After we add these types of memstall, we don't need to add the missing
events one by one.

Thanks
Yafang
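For context, the raw data for such a latency heat map can already be
collected on a stock kernel (without the per-type breakdown proposed in
this series) by probing the existing psi_memstall_{enter, leave}
functions with bpftrace; a rough sketch, emitting one
"<completion timestamp ns> <duration us>" line per memstall for offline
bucketing:

    // One line per memstall: completion timestamp (ns) and duration (us).
    // Nested psi_memstall_enter() calls are not disambiguated here.
    kprobe:psi_memstall_enter { @ts[tid] = nsecs; }
    kprobe:psi_memstall_leave /@ts[tid]/ {
            printf("%llu %llu\n", nsecs, (nsecs - @ts[tid]) / 1000);
            delete(@ts[tid]);
    }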
On Wed, Apr 01, 2020 at 09:22:24AM +0800, Yafang Shao wrote:
> On Tue, Mar 31, 2020 at 11:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> > > On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> With the newly added facility, we can know when these events occur and
> how long each event takes. We can then use these data to generate a
> latency heat map [1] and compare whether these latencies match the
> application latencies recorded in its logs - for example, the slow query
> log in MySQL. If the refault latencies match the slow query log, then
> those slow queries are caused by workingset refaults. If the refault
> latencies don't match the slow query log, IOW are much smaller than the
> slow queries, then the slow queries aren't caused by workingset refaults.

Okay, you want to use it much finer-grained to understand individual
end-to-end latencies for your services, rather than "the system is
melting down and I want to know why." That sounds valid to me.

> > > > Can you elaborate a bit on how you are using this information? It's not
> > > > quite clear to me from the example in patch #2.
> > > >
> > >
> > > From the traced data in patch #2, we can see that the high latencies of
> > > user tasks are always memstall type 7, which is
> > > MEMSTALL_WORKINGSET_THRASHING; we should then look into the details of
> > > the workingset of those user tasks and think about how to improve it -
> > > for example, by reducing the workingset.
> >
> > That's an analysis we run frequently as well: we see high pressure, and
> > then correlate it with the events.
> >
> > High rate of refaults? The workingset is too big.
> >
> > High rate of compaction work? Somebody is asking for higher-order pages
> > under load; check THP events next.
> >
> > etc.
> >
> > This works fairly reliably. I'm curious what the extra per-event latency
> > breakdown would add and where it would be helpful.
> >
> > I'm not really opposed to your patches if it is, I just don't see the
> > usecase right now.
> >
>
> As I explained above, the rates of these events can't give us a full view
> of the memory pressure. High memory pressure may not be caused by a high
> rate of vmstat events; it can also be caused by a low rate of events with
> high latencies. Latencies are what the application really cares about,
> and that's why PSI is so useful for us.
>
> Furthermore, some types of memstall are not recorded in vmstat at all,
> e.g.:
>
>     type of memstall               vmstat event
>
>     MEMSTALL_KSWAPD                pageoutrun, {pgscan, pgsteal}_kswapd
>     MEMSTALL_RECLAIM_DIRECT        {pgscan, pgsteal}_direct
>     MEMSTALL_RECLAIM_MEMCG         /* no event */
>     MEMSTALL_RECLAIM_HIGH          /* no event */
>     MEMSTALL_KCOMPACTD             compact_daemon_wake
>     MEMSTALL_COMPACT               compact_{stall, fail, success}
>     MEMSTALL_WORKINGSET_REFAULT    workingset_refault
>     MEMSTALL_WORKINGSET_THRASH     /* no event */
>     MEMSTALL_MEMDELAY              /* no event */
>     MEMSTALL_SWAPIO                pswpin
>
> After we add these types of memstall, we don't need to add the missing
> events one by one.

I'm a bit concerned about the maintainability of these things. It makes
moving code around harder, and it forces somebody who has no interest in
this debugging facility to think about the categories.

And naming them is hard, even for somebody who does care. I'm not a fan
of MEMSTALL_MEMDELAY, for example, because it's way too nondescript. The
distinction between MEMSTALL_WORKINGSET_REFAULT and
MEMSTALL_WORKINGSET_THRASH is dubious as well.

These are recipes for bit rot and for making the code harder to hack on.

I see two options to do this better: One is to use stack traces as
identifiers instead of a made-up type. The other is to use the calling
function as the id (see how kmalloc_track_caller() utilizes _RET_IP_).

bpftrace can use the stack as a map key. So this should already work
without any kernel modifications, using @start[tid, kstack]?
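One way to render the @start[tid, kstack] idea with bpftrace on an
unmodified kernel: since the stack at psi_memstall_leave() is not the
same as the one at psi_memstall_enter(), this sketch saves the enter-time
stack per thread and keys the latency histogram by it (a sketch only;
nested psi_memstall_enter() calls would overwrite the per-thread state):

    // Memstall latency histograms keyed by the kernel stack at
    // psi_memstall_enter() time, i.e. by call site, with no kernel changes.
    kprobe:psi_memstall_enter {
            @start[tid] = nsecs;
            @site[tid] = kstack;
    }
    kprobe:psi_memstall_leave /@start[tid]/ {
            @latency_us[@site[tid]] = hist((nsecs - @start[tid]) / 1000);
            delete(@start[tid]);
            delete(@site[tid]);
    }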
On Fri, Apr 3, 2020 at 11:48 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 01, 2020 at 09:22:24AM +0800, Yafang Shao wrote:
> > On Tue, Mar 31, 2020 at 11:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Fri, Mar 27, 2020 at 09:17:59AM +0800, Yafang Shao wrote:
> > > > On Thu, Mar 26, 2020 at 10:31 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > With the newly added facility, we can know when these events occur and
> > how long each event takes. We can then use these data to generate a
> > latency heat map [1] and compare whether these latencies match the
> > application latencies recorded in its logs - for example, the slow query
> > log in MySQL. If the refault latencies match the slow query log, then
> > those slow queries are caused by workingset refaults. If the refault
> > latencies don't match the slow query log, IOW are much smaller than the
> > slow queries, then the slow queries aren't caused by workingset refaults.
>
> Okay, you want to use it much finer-grained to understand individual
> end-to-end latencies for your services, rather than "the system is
> melting down and I want to know why." That sounds valid to me.
>

Right, individual end-to-end latencies are very important for
latency-sensitive services.

> > > [...]
> >
> > As I explained above, the rates of these events can't give us a full view
> > of the memory pressure. High memory pressure may not be caused by a high
> > rate of vmstat events; it can also be caused by a low rate of events with
> > high latencies. Latencies are what the application really cares about,
> > and that's why PSI is so useful for us.
> >
> > Furthermore, some types of memstall are not recorded in vmstat at all,
> > e.g.:
> >
> >     type of memstall               vmstat event
> >
> >     MEMSTALL_KSWAPD                pageoutrun, {pgscan, pgsteal}_kswapd
> >     MEMSTALL_RECLAIM_DIRECT        {pgscan, pgsteal}_direct
> >     MEMSTALL_RECLAIM_MEMCG         /* no event */
> >     MEMSTALL_RECLAIM_HIGH          /* no event */
> >     MEMSTALL_KCOMPACTD             compact_daemon_wake
> >     MEMSTALL_COMPACT               compact_{stall, fail, success}
> >     MEMSTALL_WORKINGSET_REFAULT    workingset_refault
> >     MEMSTALL_WORKINGSET_THRASH     /* no event */
> >     MEMSTALL_MEMDELAY              /* no event */
> >     MEMSTALL_SWAPIO                pswpin
> >
> > After we add these types of memstall, we don't need to add the missing
> > events one by one.
>
> I'm a bit concerned about the maintainability of these things. It makes
> moving code around harder, and it forces somebody who has no interest in
> this debugging facility to think about the categories.
>
> And naming them is hard, even for somebody who does care. I'm not a fan
> of MEMSTALL_MEMDELAY, for example, because it's way too nondescript. The
> distinction between MEMSTALL_WORKINGSET_REFAULT and
> MEMSTALL_WORKINGSET_THRASH is dubious as well.
>

Agree with you that the naming is not good.

> These are recipes for bit rot and for making the code harder to hack on.
>
> I see two options to do this better: One is to use stack traces as
> identifiers instead of a made-up type. The other is to use the calling
> function as the id (see how kmalloc_track_caller() utilizes _RET_IP_).
>
> bpftrace can use the stack as a map key. So this should already work
> without any kernel modifications, using @start[tid, kstack]?
>

If we don't make any kernel modifications, it is not easy to tell whether
a psi_memstall_{enter, leave} pair is nested inside another one, and
nested psi_memstall_{enter, leave} calls may add some noise to the
measurements.

Seems the first option is better. With _RET_IP_ we can also get the
caller. So how about adding tracepoints for psi_memstall_{enter, leave}
as below?

@@ -904,7 +904,7 @@ void psi_memstall_enter(unsigned long *flags, enum memstall_types type)
        if (*flags)
                return;
+       trace_psi_memstall_enter(_RET_IP_);

@@ -944,7 +943,7 @@ void psi_memstall_leave(unsigned long *flags, enum memstall_types type)
        if (*flags)
                return;
+       trace_psi_memstall_leave(_RET_IP_);

Thanks
Yafang
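If tracepoints along those lines were added, per-caller latency could be
collected with a bpftrace sketch like the following (hypothetical: the
tracepoint category, names, and the "ip" field are assumed from the
proposal above and do not exist in the upstream kernel):

    // Per-caller memstall latency, assuming proposed psi tracepoints that
    // record the _RET_IP_ of psi_memstall_{enter, leave} in an "ip" field.
    tracepoint:psi:psi_memstall_enter {
            @start[tid] = nsecs;
            @ip[tid] = args->ip;
    }
    tracepoint:psi:psi_memstall_leave /@start[tid]/ {
            @latency_us[ksym(@ip[tid])] = hist((nsecs - @start[tid]) / 1000);
            delete(@start[tid]);
            delete(@ip[tid]);
    }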