Message ID | 20230724132224.916-2-thunder.leizhen@huaweicloud.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | softirq: redefine the type of kernel_stat.softirqs[] as unsigned long | expand |
On Mon, Jul 24, 2023 at 09:22:23PM +0800, thunder.leizhen@huaweicloud.com wrote: > From: Zhen Lei <thunder.leizhen@huawei.com> > > The statistics function of softirq is supported by commit aa0ce5bbc2db > ("softirq: introduce statistics for softirq") in 2009. At that time, > 64-bit processors should not have many cores and would not face > significant count overflow problems. Now it's common for a processor to > have hundreds of cores. Assume that there are 100 cores and 10 > TIMER_SOFTIRQ are generated per second, then the 32-bit sum will be > overflowed after 50 days. 50 days is long enough to take a snapshot. You should always be using difference between, not absolute values, and understand that they can wrap. We only tend to change the size of a counter when it can wrap sufficiently quickly that we might miss a wrap (eg tens of seconds).
On 2023/7/24 21:50, Matthew Wilcox wrote: > On Mon, Jul 24, 2023 at 09:22:23PM +0800, thunder.leizhen@huaweicloud.com wrote: >> From: Zhen Lei <thunder.leizhen@huawei.com> >> >> The statistics function of softirq is supported by commit aa0ce5bbc2db >> ("softirq: introduce statistics for softirq") in 2009. At that time, >> 64-bit processors should not have many cores and would not face >> significant count overflow problems. Now it's common for a processor to >> have hundreds of cores. Assume that there are 100 cores and 10 >> TIMER_SOFTIRQ are generated per second, then the 32-bit sum will be >> overflowed after 50 days. > > 50 days is long enough to take a snapshot. You should always be using > difference between, not absolute values, and understand that they can > wrap. We only tend to change the size of a counter when it can wrap > sufficiently quickly that we might miss a wrap (eg tens of seconds). Yes, I think patch 2/2 can be dropped. I reduced the number of soft interrupts generated in one second, and actually 100+ or 1000 is normal. But I think patch 1/2 is necessary. The sum of the output scattered values does not match the output sum. To solve this problem, we only need to adjust the type of a local variable. > > . >
On 2023/7/25 10:00, Leizhen (ThunderTown) wrote: > > > On 2023/7/24 21:50, Matthew Wilcox wrote: >> On Mon, Jul 24, 2023 at 09:22:23PM +0800, thunder.leizhen@huaweicloud.com wrote: >>> From: Zhen Lei <thunder.leizhen@huawei.com> >>> >>> The statistics function of softirq is supported by commit aa0ce5bbc2db >>> ("softirq: introduce statistics for softirq") in 2009. At that time, >>> 64-bit processors should not have many cores and would not face >>> significant count overflow problems. Now it's common for a processor to >>> have hundreds of cores. Assume that there are 100 cores and 10 >>> TIMER_SOFTIRQ are generated per second, then the 32-bit sum will be >>> overflowed after 50 days. >> >> 50 days is long enough to take a snapshot. You should always be using >> difference between, not absolute values, and understand that they can >> wrap. We only tend to change the size of a counter when it can wrap >> sufficiently quickly that we might miss a wrap (eg tens of seconds). Sometimes it can take a long time to view it again. For example, it is possible to run a complete business test for hours or even days, and then calculate the average. > > Yes, I think patch 2/2 can be dropped. I reduced the number of soft > interrupts generated in one second, and actually 100+ or 1000 is normal. > But I think patch 1/2 is necessary. The sum of the output scattered values > does not match the output sum. To solve this problem, we only need to > adjust the type of a local variable. However, it is important to consider that when the local variable is changed to u64, the output string becomes longer. It is not clear if the user-mode program parses it only by u32. > > >> >> . >> >
On Tue, Jul 25, 2023 at 05:09:05PM +0800, Leizhen (ThunderTown) wrote: > On 2023/7/25 10:00, Leizhen (ThunderTown) wrote: > > On 2023/7/24 21:50, Matthew Wilcox wrote: > >> On Mon, Jul 24, 2023 at 09:22:23PM +0800, thunder.leizhen@huaweicloud.com wrote: > >>> From: Zhen Lei <thunder.leizhen@huawei.com> > >>> > >>> The statistics function of softirq is supported by commit aa0ce5bbc2db > >>> ("softirq: introduce statistics for softirq") in 2009. At that time, > >>> 64-bit processors should not have many cores and would not face > >>> significant count overflow problems. Now it's common for a processor to > >>> have hundreds of cores. Assume that there are 100 cores and 10 > >>> TIMER_SOFTIRQ are generated per second, then the 32-bit sum will be > >>> overflowed after 50 days. > >> > >> 50 days is long enough to take a snapshot. You should always be using > >> difference between, not absolute values, and understand that they can > >> wrap. We only tend to change the size of a counter when it can wrap > >> sufficiently quickly that we might miss a wrap (eg tens of seconds). > > Sometimes it can take a long time to view it again. For example, it is > possible to run a complete business test for hours or even days, and > then calculate the average. I've been part of teams which have done such multi-hour tests. That isn't how monitoring was performed. Instead snapshots were taken every minute or even more frequently, because we wanted to know how these counters were fluctuating during the test -- were there time periods when the number of sortirqs spiked, or was it constant during the test? > > Yes, I think patch 2/2 can be dropped. I reduced the number of soft > > interrupts generated in one second, and actually 100+ or 1000 is normal. > > But I think patch 1/2 is necessary. The sum of the output scattered values > > does not match the output sum. To solve this problem, we only need to > > adjust the type of a local variable. > > However, it is important to consider that when the local variable is changed > to u64, the output string becomes longer. It is not clear if the user-mode > program parses it only by u32. There's no need for the numbers to add up. They won't anyway, because summing them is racy , so they'll always be a little off.
On 2023/7/25 23:26, Matthew Wilcox wrote: > On Tue, Jul 25, 2023 at 05:09:05PM +0800, Leizhen (ThunderTown) wrote: >> On 2023/7/25 10:00, Leizhen (ThunderTown) wrote: >>> On 2023/7/24 21:50, Matthew Wilcox wrote: >>>> On Mon, Jul 24, 2023 at 09:22:23PM +0800, thunder.leizhen@huaweicloud.com wrote: >>>>> From: Zhen Lei <thunder.leizhen@huawei.com> >>>>> >>>>> The statistics function of softirq is supported by commit aa0ce5bbc2db >>>>> ("softirq: introduce statistics for softirq") in 2009. At that time, >>>>> 64-bit processors should not have many cores and would not face >>>>> significant count overflow problems. Now it's common for a processor to >>>>> have hundreds of cores. Assume that there are 100 cores and 10 >>>>> TIMER_SOFTIRQ are generated per second, then the 32-bit sum will be >>>>> overflowed after 50 days. >>>> >>>> 50 days is long enough to take a snapshot. You should always be using >>>> difference between, not absolute values, and understand that they can >>>> wrap. We only tend to change the size of a counter when it can wrap >>>> sufficiently quickly that we might miss a wrap (eg tens of seconds). >> >> Sometimes it can take a long time to view it again. For example, it is >> possible to run a complete business test for hours or even days, and >> then calculate the average. > > I've been part of teams which have done such multi-hour tests. That > isn't how monitoring was performed. Instead snapshots were taken every > minute or even more frequently, because we wanted to know how these > counters were fluctuating during the test -- were there time periods > when the number of sortirqs spiked, or was it constant during the test? > >>> Yes, I think patch 2/2 can be dropped. I reduced the number of soft >>> interrupts generated in one second, and actually 100+ or 1000 is normal. >>> But I think patch 1/2 is necessary. The sum of the output scattered values >>> does not match the output sum. To solve this problem, we only need to >>> adjust the type of a local variable. >> >> However, it is important to consider that when the local variable is changed >> to u64, the output string becomes longer. It is not clear if the user-mode >> program parses it only by u32. > > There's no need for the numbers to add up. They won't anyway, because > summing them is racy , so they'll always be a little off. Okay, thanks for the reply. I got it. I just summed it up temporarily to prove that integer overflow is possible, and there's no actual requirement. > > . >
diff --git a/fs/proc/stat.c b/fs/proc/stat.c index da60956b2915645..84aac577a50cabb 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -86,7 +86,7 @@ static int show_stat(struct seq_file *p, void *v) u64 guest, guest_nice; u64 sum = 0; u64 sum_softirq = 0; - unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; + u64 per_softirq_sums[NR_SOFTIRQS] = {0}; struct timespec64 boottime; user = nice = system = idle = iowait =