fs/proc: introduce /proc/stat2 file

Message ID	20181029192521.23059-1-dave@stgolabs.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Davidlohr Bueso <dave@stgolabs.net> To: akpm@linux-foundation.org Cc: longman@redhat.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, dave@stgolabs.net, Davidlohr Bueso <dbueso@suse.de> Subject: [PATCH] fs/proc: introduce /proc/stat2 file Date: Mon, 29 Oct 2018 12:25:21 -0700 Message-Id: <20181029192521.23059-1-dave@stgolabs.net> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	fs/proc: introduce /proc/stat2 file \| expand fs/proc: introduce /proc/stat2 file

Davidlohr Bueso Oct. 29, 2018, 7:25 p.m. UTC

A recent report from a large database vendor which I shall not name
shows concerns about poor performance when consuming /proc/stat info.
Particularly  kstat_irq() pops up in the profiles and most time is
being spent there. The overall system is under a lot of irqs and
almost 1k cores, thus this comes to little surprise.

Granted that procfs in general is not known for its performance,
nor designed for it, for that matter. Some users, however may be able
to overcome this performance limitation, some not. Therefore it isn't
bad having a kernel option for users that don't want any hard irq info
-- and care enough about this.

This patch introduces a new /proc/stat2 file that is identical to the
regular 'stat' except that it zeroes all hard irq statistics. The new
file is a drop in replacement to stat for users that need performance.

The stat file is not touched, of course -- this was also previously
suggested by Waiman:
https://lore.kernel.org/lkml/1524166562-5644-1-git-send-email-longman@redhat.com/

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 Documentation/filesystems/proc.txt | 12 +++++++---
 fs/proc/stat.c                     | 45 ++++++++++++++++++++++++++++++++------
 2 files changed, 47 insertions(+), 10 deletions(-)

Waiman Long Oct. 29, 2018, 7:35 p.m. UTC | #1

On 10/29/2018 03:25 PM, Davidlohr Bueso wrote:
> A recent report from a large database vendor which I shall not name
> shows concerns about poor performance when consuming /proc/stat info.
> Particularly  kstat_irq() pops up in the profiles and most time is
> being spent there. The overall system is under a lot of irqs and
> almost 1k cores, thus this comes to little surprise.
>
> Granted that procfs in general is not known for its performance,
> nor designed for it, for that matter. Some users, however may be able
> to overcome this performance limitation, some not. Therefore it isn't
> bad having a kernel option for users that don't want any hard irq info
> -- and care enough about this.
>
> This patch introduces a new /proc/stat2 file that is identical to the
> regular 'stat' except that it zeroes all hard irq statistics. The new
> file is a drop in replacement to stat for users that need performance.
>
> The stat file is not touched, of course -- this was also previously
> suggested by Waiman:
> https://lore.kernel.org/lkml/1524166562-5644-1-git-send-email-longman@redhat.com/
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

I am wondering if /proc/stat_noirqs will be a more descriptive name of
the intent of this new procfs file or we should just go with the more
generic stat2 name.

Cheers,
Longman

> ---
>  Documentation/filesystems/proc.txt | 12 +++++++---
>  fs/proc/stat.c                     | 45 ++++++++++++++++++++++++++++++++------
>  2 files changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..563b01decb1e 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -27,7 +27,7 @@ Table of Contents
>    1.5	SCSI info
>    1.6	Parallel port info in /proc/parport
>    1.7	TTY info in /proc/tty
> -  1.8	Miscellaneous kernel statistics in /proc/stat
> +  1.8	Miscellaneous kernel statistics in /proc/stat and /proc/stat2
>    1.9	Ext4 file system parameters
>  
>    2	Modifying System Parameters
> @@ -140,6 +140,7 @@ Table 1-1: Process specific entries in /proc
>   mem		Memory held by this process
>   root		Link to the root directory of this process
>   stat		Process status
> + stat2		Process status without irq information
>   statm		Process memory status information
>   status		Process status in human readable form
>   wchan		Present with CONFIG_KALLSYMS=y: it shows the kernel function
> @@ -1301,8 +1302,8 @@ To see  which  tty's  are  currently in use, you can simply look into the file
>    unknown              /dev/tty        4    1-63 console 
>  
>  
> -1.8 Miscellaneous kernel statistics in /proc/stat
> --------------------------------------------------
> +1.8 Miscellaneous kernel statistics in /proc/stat and /proc/stat2
> +-----------------------------------------------------------------
>  
>  Various pieces   of  information about  kernel activity  are  available in the
>  /proc/stat file.  All  of  the numbers reported  in  this file are  aggregates
> @@ -1371,6 +1372,11 @@ of the possible system softirqs. The first column is the total of all
>  softirqs serviced; each subsequent column is the total for that particular
>  softirq.
>  
> +The stat2 file acts as a performance alternative to /proc/stat for workloads
> +and systems that care and are under heavy irq load. In order to to be completely
> +compatible, /proc/stat and /proc/stat2 are identical with the exception that the
> +later will show 0 for any (hard)irq-related fields. This refers particularly
> +to the "intr" line and 'irq' column for that aggregate in the cpu line.
>  
>  1.9 Ext4 file system parameters
>  -------------------------------
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 535eda7857cf..349040270003 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -79,7 +79,7 @@ static u64 get_iowait_time(int cpu)
>  
>  #endif
>  
> -static int show_stat(struct seq_file *p, void *v)
> +static int __show_stat(struct seq_file *p, void *v, bool irq_stats)
>  {
>  	int i, j;
>  	u64 user, nice, system, idle, iowait, irq, softirq, steal;
> @@ -100,13 +100,17 @@ static int show_stat(struct seq_file *p, void *v)
>  		system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
>  		idle += get_idle_time(i);
>  		iowait += get_iowait_time(i);
> -		irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
>  		softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
>  		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
>  		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
>  		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
> -		sum += kstat_cpu_irqs_sum(i);
> -		sum += arch_irq_stat_cpu(i);
> +
> +		if (irq_stats) {
> +			irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
> +
> +			sum += kstat_cpu_irqs_sum(i);
> +			sum += arch_irq_stat_cpu(i);
> +		}
>  
>  		for (j = 0; j < NR_SOFTIRQS; j++) {
>  			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> @@ -115,7 +119,9 @@ static int show_stat(struct seq_file *p, void *v)
>  			sum_softirq += softirq_stat;
>  		}
>  	}
> -	sum += arch_irq_stat();
> +
> +	if (irq_stats)
> +		sum += arch_irq_stat();
>  
>  	seq_put_decimal_ull(p, "cpu  ", nsec_to_clock_t(user));
>  	seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
> @@ -136,7 +142,8 @@ static int show_stat(struct seq_file *p, void *v)
>  		system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
>  		idle = get_idle_time(i);
>  		iowait = get_iowait_time(i);
> -		irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
> +		if (irq_stats)
> +			irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
>  		softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
>  		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
>  		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
> @@ -158,7 +165,7 @@ static int show_stat(struct seq_file *p, void *v)
>  
>  	/* sum again ? it could be updated? */
>  	for_each_irq_nr(j)
> -		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> +		seq_put_decimal_ull(p, " ", irq_stats ? kstat_irqs_usr(j) : 0);
>  
>  	seq_printf(p,
>  		"\nctxt %llu\n"
> @@ -181,6 +188,16 @@ static int show_stat(struct seq_file *p, void *v)
>  	return 0;
>  }
>  
> +static int show_stat(struct seq_file *p, void *v)
> +{
> +	return __show_stat(p, v, true);
> +}
> +
> +static int show_stat2(struct seq_file *p, void *v)
> +{
> +	return __show_stat(p, v, false);
> +}
> +
>  static int stat_open(struct inode *inode, struct file *file)
>  {
>  	unsigned int size = 1024 + 128 * num_online_cpus();
> @@ -190,6 +207,12 @@ static int stat_open(struct inode *inode, struct file *file)
>  	return single_open_size(file, show_stat, NULL, size);
>  }
>  
> +static int stat2_open(struct inode *inode, struct file *file)
> +{
> +	unsigned int size = 1024 + 128 * num_online_cpus();
> +	return single_open_size(file, show_stat2, NULL, size);
> +}
> +
>  static const struct file_operations proc_stat_operations = {
>  	.open		= stat_open,
>  	.read		= seq_read,
> @@ -197,9 +220,17 @@ static const struct file_operations proc_stat_operations = {
>  	.release	= single_release,
>  };
>  
> +static const struct file_operations proc_stat2_operations = {
> +	.open		= stat2_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
>  static int __init proc_stat_init(void)
>  {
>  	proc_create("stat", 0, NULL, &proc_stat_operations);
> +	proc_create("stat2", 0, NULL, &proc_stat2_operations);
>  	return 0;
>  }
>  fs_initcall(proc_stat_init);

Davidlohr Bueso Oct. 29, 2018, 8 p.m. UTC | #2

On Mon, 29 Oct 2018, Waiman Long wrote:

>I am wondering if /proc/stat_noirqs will be a more descriptive name of
>the intent of this new procfs file or we should just go with the more
>generic stat2 name.

The reason why I went with '2' instead of a more rescriptive name
was that I think of the call as a drop-in replacement/extention to
stat. Therefore the same fields are maintained, otherwise with stat_noirqs
I feel like instead of zeroing out, they should just be removed.

But otoh, I have no strong objection in renaming either.

Thanks,
Davidlohr

Waiman Long Oct. 29, 2018, 8:29 p.m. UTC | #3

On 10/29/2018 04:00 PM, Davidlohr Bueso wrote:
> On Mon, 29 Oct 2018, Waiman Long wrote:
>
>> I am wondering if /proc/stat_noirqs will be a more descriptive name of
>> the intent of this new procfs file or we should just go with the more
>> generic stat2 name.
>
> The reason why I went with '2' instead of a more rescriptive name
> was that I think of the call as a drop-in replacement/extention to
> stat. Therefore the same fields are maintained, otherwise with
> stat_noirqs
> I feel like instead of zeroing out, they should just be removed.
>
> But otoh, I have no strong objection in renaming either.
>
> Thanks,
> Davidlohr

I am just questioning the rationale for the stat2 name. I am not
advocating to use stat_noirqs neither.

BTW, since you are making stat2 compatible with stat, will that be
easier from the user API perspective if we use a sysctl parameter to
turn on and off IRQs reporting for /proc/stat, for example?

I know that there are pros and cons for each approach, I just want to
consider all the available options and choose the best one.

Cheers,
Longman

Davidlohr Bueso Oct. 29, 2018, 8:38 p.m. UTC | #4

On Mon, 29 Oct 2018, Waiman Long wrote:

>BTW, since you are making stat2 compatible with stat, will that be
>easier from the user API perspective if we use a sysctl parameter to
>turn on and off IRQs reporting for /proc/stat, for example?

For one /proc/stat is also common for debugging envs (ie: performance)
and I fear that if a tunnable modifies the behavior of the output, we
it might never be usable again (at least not without having users also
now consider the systctl parameter). Making it dynamic I think is not
worth it.

Thanks,
Davidlohr

Waiman Long Oct. 29, 2018, 8:59 p.m. UTC | #5

On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
> On Mon, 29 Oct 2018, Waiman Long wrote:
>
>> BTW, since you are making stat2 compatible with stat, will that be
>> easier from the user API perspective if we use a sysctl parameter to
>> turn on and off IRQs reporting for /proc/stat, for example?
>
> For one /proc/stat is also common for debugging envs (ie: performance)
> and I fear that if a tunnable modifies the behavior of the output, we
> it might never be usable again (at least not without having users also
> now consider the systctl parameter). Making it dynamic I think is not
> worth it.
>
> Thanks,
> Davidlohr

This is just a matter if it is easier for users to modify their code to
use /proc/stat2 or turning on a sysctl parameter. Again, this will
certainly depend on the circumstances.

Cheers,
Longman

Waiman Long Oct. 29, 2018, 9:01 p.m. UTC | #6

On 10/29/2018 03:25 PM, Davidlohr Bueso wrote:
> A recent report from a large database vendor which I shall not name
> shows concerns about poor performance when consuming /proc/stat info.
> Particularly  kstat_irq() pops up in the profiles and most time is
> being spent there. The overall system is under a lot of irqs and
> almost 1k cores, thus this comes to little surprise.
>
> Granted that procfs in general is not known for its performance,
> nor designed for it, for that matter. Some users, however may be able
> to overcome this performance limitation, some not. Therefore it isn't
> bad having a kernel option for users that don't want any hard irq info
> -- and care enough about this.
>
> This patch introduces a new /proc/stat2 file that is identical to the
> regular 'stat' except that it zeroes all hard irq statistics. The new
> file is a drop in replacement to stat for users that need performance.
>
> The stat file is not touched, of course -- this was also previously
> suggested by Waiman:
> https://lore.kernel.org/lkml/1524166562-5644-1-git-send-email-longman@redhat.com/
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  Documentation/filesystems/proc.txt | 12 +++++++---
>  fs/proc/stat.c                     | 45 ++++++++++++++++++++++++++++++++------
>  2 files changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..563b01decb1e 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -27,7 +27,7 @@ Table of Contents
>    1.5	SCSI info
>    1.6	Parallel port info in /proc/parport
>    1.7	TTY info in /proc/tty
> -  1.8	Miscellaneous kernel statistics in /proc/stat
> +  1.8	Miscellaneous kernel statistics in /proc/stat and /proc/stat2
>    1.9	Ext4 file system parameters
>  
>    2	Modifying System Parameters
> @@ -140,6 +140,7 @@ Table 1-1: Process specific entries in /proc
>   mem		Memory held by this process
>   root		Link to the root directory of this process
>   stat		Process status
> + stat2		Process status without irq information
>   statm		Process memory status information
>   status		Process status in human readable form
>   wchan		Present with CONFIG_KALLSYMS=y: it shows the kernel function
> @@ -1301,8 +1302,8 @@ To see  which  tty's  are  currently in use, you can simply look into the file
>    unknown              /dev/tty        4    1-63 console 
>  
>  
> -1.8 Miscellaneous kernel statistics in /proc/stat
> --------------------------------------------------
> +1.8 Miscellaneous kernel statistics in /proc/stat and /proc/stat2
> +-----------------------------------------------------------------
>  
>  Various pieces   of  information about  kernel activity  are  available in the
>  /proc/stat file.  All  of  the numbers reported  in  this file are  aggregates
> @@ -1371,6 +1372,11 @@ of the possible system softirqs. The first column is the total of all
>  softirqs serviced; each subsequent column is the total for that particular
>  softirq.
>  
> +The stat2 file acts as a performance alternative to /proc/stat for workloads

A "performant alternative", right?

-Longman

Vito Caputo Oct. 29, 2018, 9:23 p.m. UTC | #7

On Mon, Oct 29, 2018 at 04:59:03PM -0400, Waiman Long wrote:
> On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
> > On Mon, 29 Oct 2018, Waiman Long wrote:
> >
> >> BTW, since you are making stat2 compatible with stat, will that be
> >> easier from the user API perspective if we use a sysctl parameter to
> >> turn on and off IRQs reporting for /proc/stat, for example?
> >
> > For one /proc/stat is also common for debugging envs (ie: performance)
> > and I fear that if a tunnable modifies the behavior of the output, we
> > it might never be usable again (at least not without having users also
> > now consider the systctl parameter). Making it dynamic I think is not
> > worth it.
> >
> > Thanks,
> > Davidlohr
> 
> This is just a matter if it is easier for users to modify their code to
> use /proc/stat2 or turning on a sysctl parameter. Again, this will
> certainly depend on the circumstances.
> 

I wonder if it makes sense to introduce a more general mechanism for
toggling subfields in proc files.  Extended attributes could probably be
abused to key the subfields, write a 1 or 0 to well-known names for
toggling them on a per-fd basis via fsetxattr.

For this particular case the program would just have to add code like:

int	zero = 0;
fsetxattr(proc_stat_fd, "intr", &zero, sizeof(zero), XATTR_REPLACE);

Just putting it out there.  I've certainly wanted an ability to noop
fields before where I was polling proc frequently and skipping the bulk
of what was there but syscpu was still rather high.

I'm definitely not in favor of just adding another stat file that is the
same format as the existing one with the intrs zeroed out.  It's a dirty
hack; fine for your local needs but too gross for upstream IMHO.

Regards,
Vito Caputo

Waiman Long Oct. 29, 2018, 9:35 p.m. UTC | #8

On 10/29/2018 05:23 PM, Vito Caputo wrote:
> On Mon, Oct 29, 2018 at 04:59:03PM -0400, Waiman Long wrote:
>> On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
>>> On Mon, 29 Oct 2018, Waiman Long wrote:
>>>
>>>> BTW, since you are making stat2 compatible with stat, will that be
>>>> easier from the user API perspective if we use a sysctl parameter to
>>>> turn on and off IRQs reporting for /proc/stat, for example?
>>> For one /proc/stat is also common for debugging envs (ie: performance)
>>> and I fear that if a tunnable modifies the behavior of the output, we
>>> it might never be usable again (at least not without having users also
>>> now consider the systctl parameter). Making it dynamic I think is not
>>> worth it.
>>>
>>> Thanks,
>>> Davidlohr
>> This is just a matter if it is easier for users to modify their code to
>> use /proc/stat2 or turning on a sysctl parameter. Again, this will
>> certainly depend on the circumstances.
>>
> I wonder if it makes sense to introduce a more general mechanism for
> toggling subfields in proc files.  Extended attributes could probably be
> abused to key the subfields, write a 1 or 0 to well-known names for
> toggling them on a per-fd basis via fsetxattr.
>
> For this particular case the program would just have to add code like:
>
> int	zero = 0;
> fsetxattr(proc_stat_fd, "intr", &zero, sizeof(zero), XATTR_REPLACE);
>
> Just putting it out there.  I've certainly wanted an ability to noop
> fields before where I was polling proc frequently and skipping the bulk
> of what was there but syscpu was still rather high.
>
> I'm definitely not in favor of just adding another stat file that is the
> same format as the existing one with the intrs zeroed out.  It's a dirty
> hack; fine for your local needs but too gross for upstream IMHO.
>
> Regards,
> Vito Caputo

Does procfs allow extended attributes? I am not sure if using extended
attributes is a usual practice for doing this kind of control on a
procfs file.

Cheers,
Longman

Vito Caputo Oct. 29, 2018, 10:41 p.m. UTC | #9

On Mon, Oct 29, 2018 at 05:35:15PM -0400, Waiman Long wrote:
> On 10/29/2018 05:23 PM, Vito Caputo wrote:
> > On Mon, Oct 29, 2018 at 04:59:03PM -0400, Waiman Long wrote:
> >> On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
> >>> On Mon, 29 Oct 2018, Waiman Long wrote:
> >>>
> >>>> BTW, since you are making stat2 compatible with stat, will that be
> >>>> easier from the user API perspective if we use a sysctl parameter to
> >>>> turn on and off IRQs reporting for /proc/stat, for example?
> >>> For one /proc/stat is also common for debugging envs (ie: performance)
> >>> and I fear that if a tunnable modifies the behavior of the output, we
> >>> it might never be usable again (at least not without having users also
> >>> now consider the systctl parameter). Making it dynamic I think is not
> >>> worth it.
> >>>
> >>> Thanks,
> >>> Davidlohr
> >> This is just a matter if it is easier for users to modify their code to
> >> use /proc/stat2 or turning on a sysctl parameter. Again, this will
> >> certainly depend on the circumstances.
> >>
> > I wonder if it makes sense to introduce a more general mechanism for
> > toggling subfields in proc files.  Extended attributes could probably be
> > abused to key the subfields, write a 1 or 0 to well-known names for
> > toggling them on a per-fd basis via fsetxattr.
> >
> > For this particular case the program would just have to add code like:
> >
> > int	zero = 0;
> > fsetxattr(proc_stat_fd, "intr", &zero, sizeof(zero), XATTR_REPLACE);
> >
> > Just putting it out there.  I've certainly wanted an ability to noop
> > fields before where I was polling proc frequently and skipping the bulk
> > of what was there but syscpu was still rather high.
> >
> > I'm definitely not in favor of just adding another stat file that is the
> > same format as the existing one with the intrs zeroed out.  It's a dirty
> > hack; fine for your local needs but too gross for upstream IMHO.
> >
> > Regards,
> > Vito Caputo
> 
> Does procfs allow extended attributes? I am not sure if using extended
> attributes is a usual practice for doing this kind of control on a
> procfs file.
> 

I'm not aware of any such usage, but I didn't dig into the code to see if
there were already conflicting xattr users.

If this did turn out to be a reasonable approach, it would probably be
wise to prefix the key strings with a namespace in case future uses come
up.  Like "filter:intr" or something along those lines.

WRT procfs support for xattr, if it's not already implemented it's not
difficult to add.  The important factor is this utilizes an existing vfs
api, there's nothing about procfs prohibiting xattr handling.

Regards,
Vito Caputo

Daniel Colascione Oct. 29, 2018, 11:04 p.m. UTC | #10

On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> This patch introduces a new /proc/stat2 file that is identical to the
> regular 'stat' except that it zeroes all hard irq statistics. The new
> file is a drop in replacement to stat for users that need performance.

For a while now, I've been thinking over ways to improve the
performance of collecting various bits of kernel information. I don't
think that a proliferation of special-purpose named bag-of-fields file
variants is the right answer, because even if you add a few info-file
variants, you're still left with a situation where a given file
provides a particular caller with too little or too much information.
I'd much rather move to a model in which userspace *explicitly* tells
the kernel which fields it wants, with the kernel replying with just
those particular fields, maybe in their raw binary representations.
The ASCII-text bag-of-everything files would remain available for
ad-hoc and non-performance critical use, but programs that cared about
performance would have an efficient bypass. One concrete approach is
to let users open up today's proc files and, instead of read(2)ing a
text blob, use an ioctl to retrieve specified and targeted information
of the sort that would normally be encoded in the text blob. Because
callers would open the same file when using either the text or binary
interfaces, little would have to change, and it'd be easy to implement
fallbacks when a particular system doesn't support a particular
fast-path ioctl.

Vito Caputo Oct. 30, 2018, 12:58 a.m. UTC | #11

On Mon, Oct 29, 2018 at 11:04:45PM +0000, Daniel Colascione wrote:
> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> > This patch introduces a new /proc/stat2 file that is identical to the
> > regular 'stat' except that it zeroes all hard irq statistics. The new
> > file is a drop in replacement to stat for users that need performance.
> 
> For a while now, I've been thinking over ways to improve the
> performance of collecting various bits of kernel information. I don't
> think that a proliferation of special-purpose named bag-of-fields file
> variants is the right answer, because even if you add a few info-file
> variants, you're still left with a situation where a given file
> provides a particular caller with too little or too much information.
> I'd much rather move to a model in which userspace *explicitly* tells
> the kernel which fields it wants, with the kernel replying with just
> those particular fields, maybe in their raw binary representations.
> The ASCII-text bag-of-everything files would remain available for
> ad-hoc and non-performance critical use, but programs that cared about
> performance would have an efficient bypass. One concrete approach is
> to let users open up today's proc files and, instead of read(2)ing a
> text blob, use an ioctl to retrieve specified and targeted information
> of the sort that would normally be encoded in the text blob. Because
> callers would open the same file when using either the text or binary
> interfaces, little would have to change, and it'd be easy to implement
> fallbacks when a particular system doesn't support a particular
> fast-path ioctl.

We have two extremes of granularity in the /proc and /sys virtual
filesystems today:

On procfs there's these legacy files which aggregate loosely-related
system information, and in cases where you actually want most of what's
provided, it's a nice optimization because you can sample it all in a
single pread() call.

On sysfs the granularity is much finer with it being fairly common to
find a file-per-datum.  This has other advantages, like not needing to
parse snowflake formats which sometimes varied across kernel versions
like in procfs, or needing to burden the kernel to produce more
information than necessary.

But anyone who has written tools trying to sample large subsets of the
granular information in sysfs at a high rate will know how quickly it
becomes rather costly in terms of system calls.

The last time I went down this path, I wished there were a system call
like readv() which accepted a vector a new iovec type specifying an fd.

Then the sysfs model could be made a more efficient by coalescing all
the required read syscalls into a single megaread bundling all the
relevant fds that are simply kept open and reused.

If we had such a readv() variant, the sysfs granular model could be used
to granularly expose all the information we currently expose in /proc,
while still being relatively efficient in terms of system calls per
sample.  Sure you still have to lookup and open all the files of
interest, but that only needs to occur once at initialization.

Regards,
Vito Caputo

Davidlohr Bueso Oct. 30, 2018, 6:57 p.m. UTC | #12

On Mon, 29 Oct 2018, Vito Caputo wrote:

>I'm definitely not in favor of just adding another stat file that is the
>same format as the existing one with the intrs zeroed out.  It's a dirty
>hack; fine for your local needs but too gross for upstream IMHO.

I suspect very few users of /proc/stat actually use the irq fields in the
first place. So the common case ends up doing unnecessary operations. The
stat2 approach is not perfect, but I think it's the best approach so far.
This sort of renaming is not uncommon when we cannot break userspace, and
its not like procfs is not already far contaminated already.

There are not enough users that care about this stuff, afaik. What you
suggest sounds like a lot of over-engineering.

Thanks,
Davidlohr

Vito Caputo Oct. 30, 2018, 10:40 p.m. UTC | #13

On Tue, Oct 30, 2018 at 11:57:56AM -0700, Davidlohr Bueso wrote:
> On Mon, 29 Oct 2018, Vito Caputo wrote:
> 
> > I'm definitely not in favor of just adding another stat file that is the
> > same format as the existing one with the intrs zeroed out.  It's a dirty
> > hack; fine for your local needs but too gross for upstream IMHO.
> 
> I suspect very few users of /proc/stat actually use the irq fields in the
> first place. So the common case ends up doing unnecessary operations. The
> stat2 approach is not perfect, but I think it's the best approach so far.
> This sort of renaming is not uncommon when we cannot break userspace, and
> its not like procfs is not already far contaminated already.
> 
> There are not enough users that care about this stuff, afaik. What you
> suggest sounds like a lot of over-engineering.
> 

What you suggest sounds like a kludge with zero engineering at all.

My suggestion might be stupid, insofar as the same thing can be achieved
using ioctls without assigning surprising semantics to an existing vfs
api.  But I think the spirit of the suggestion is a reasonable
compromise, if /proc/stat is not a deprecated interface and it's
too difficult to make everything it collects sufficiently efficient on
any size machine.

If you create /proc/stat2 to omit interrupts, do we then create
/proc/stat3 to omit CPUs when just interrupts are of interest to the
application running on a 256-cpu machine?

Furthermore, your rationale of procfs already being contaminated is an
active embrace of the broken windows effect.  If you recognize something
is flawed, it's cause to work harder to improve the situation when
working in the flawed area, not worsen it.

Regards,
Vito Caputo

Davidlohr Bueso Oct. 30, 2018, 11:15 p.m. UTC | #14

On Tue, 30 Oct 2018, Vito Caputo wrote:

>If you create /proc/stat2 to omit interrupts, do we then create
>/proc/stat3 to omit CPUs when just interrupts are of interest to the
>application running on a 256-cpu machine?

Be real, this is a bogus argument. As mentioned, stat2 is named as such
because it's a replacement for a better common case, not some random increment.

Thanks,
Davidlohr

Andrew Morton Nov. 6, 2018, 11:48 p.m. UTC | #15

On Mon, 29 Oct 2018 23:04:45 +0000 Daniel Colascione <dancol@google.com> wrote:

> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> > This patch introduces a new /proc/stat2 file that is identical to the
> > regular 'stat' except that it zeroes all hard irq statistics. The new
> > file is a drop in replacement to stat for users that need performance.
> 
> For a while now, I've been thinking over ways to improve the
> performance of collecting various bits of kernel information. I don't
> think that a proliferation of special-purpose named bag-of-fields file
> variants is the right answer, because even if you add a few info-file
> variants, you're still left with a situation where a given file
> provides a particular caller with too little or too much information.
> I'd much rather move to a model in which userspace *explicitly* tells
> the kernel which fields it wants, with the kernel replying with just
> those particular fields, maybe in their raw binary representations.
> The ASCII-text bag-of-everything files would remain available for
> ad-hoc and non-performance critical use, but programs that cared about
> performance would have an efficient bypass. One concrete approach is
> to let users open up today's proc files and, instead of read(2)ing a
> text blob, use an ioctl to retrieve specified and targeted information
> of the sort that would normally be encoded in the text blob. Because
> callers would open the same file when using either the text or binary
> interfaces, little would have to change, and it'd be easy to implement
> fallbacks when a particular system doesn't support a particular
> fast-path ioctl.

Yup.  There are better ways of getting information out of the kernel,
to say the least.

It would be interesting to know precisely which stat fields the
database-which-shall-not-be-named is looking for.  Then we could cook
up a very whizzy way of getting at the info.

A downside of the stat2 approach is that applications will need to be
rebuilt.  And hopefully when people do this, they'll open
"/etc/my-app-name/symlink-to-proc-stat" (or use per-application config)
so they won't need a rebuild when we add /proc/stat3!

A /proc/change-how-stat-works tunable would avoid the need to rebuild
applications.  But if a system still has some applications which want
the irq info then that doesn't work.

It's all very sad, really.

btw,

> +The stat2 file acts as a performance alternative to /proc/stat for workloads
> +and systems that care and are under heavy irq load. In order to to be completely
> +compatible, /proc/stat and /proc/stat2 are identical with the exception that the
> +later will show 0 for any (hard)irq-related fields. This refers particularly

"latter"

> +to the "intr" line and 'irq' column for that aggregate in the cpu line.

btw2, please quantify "poor performance".  That helps us determine how
much we care about all of this!

Davidlohr Bueso Nov. 7, 2018, 3:32 a.m. UTC | #16

On Tue, 06 Nov 2018, Andrew Morton wrote:

>It would be interesting to know precisely which stat fields the
>database-which-shall-not-be-named is looking for.  Then we could cook
>up a very whizzy way of getting at the info.

The ctxt field, afaik. In any case they have been able to work around
the bottleneck. I'm not sure if that is the case for Waiman, however.

>
>A downside of the stat2 approach is that applications will need to be
>rebuilt.  And hopefully when people do this, they'll open
>"/etc/my-app-name/symlink-to-proc-stat" (or use per-application config)
>so they won't need a rebuild when we add /proc/stat3!
>
>A /proc/change-how-stat-works tunable would avoid the need to rebuild
>applications.  But if a system still has some applications which want
>the irq info then that doesn't work.
>
>It's all very sad, really.
>
>btw,
>
>> +The stat2 file acts as a performance alternative to /proc/stat for workloads
>> +and systems that care and are under heavy irq load. In order to to be completely
>> +compatible, /proc/stat and /proc/stat2 are identical with the exception that the
>> +later will show 0 for any (hard)irq-related fields. This refers particularly
>
>"latter"
>
>> +to the "intr" line and 'irq' column for that aggregate in the cpu line.
>
>btw2, please quantify "poor performance".  That helps us determine how
>much we care about all of this!

Up to a quarter of a second is what was reported as being spent every time
/proc/stat is used. This is with 1k cores and 4k interrupts.

Thanks,
Davidlohr

Miklos Szeredi Nov. 7, 2018, 10:03 a.m. UTC | #17

On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 29 Oct 2018 23:04:45 +0000 Daniel Colascione <dancol@google.com> wrote:
>
>> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>> > This patch introduces a new /proc/stat2 file that is identical to the
>> > regular 'stat' except that it zeroes all hard irq statistics. The new
>> > file is a drop in replacement to stat for users that need performance.
>>
>> For a while now, I've been thinking over ways to improve the
>> performance of collecting various bits of kernel information. I don't
>> think that a proliferation of special-purpose named bag-of-fields file
>> variants is the right answer, because even if you add a few info-file
>> variants, you're still left with a situation where a given file
>> provides a particular caller with too little or too much information.
>> I'd much rather move to a model in which userspace *explicitly* tells
>> the kernel which fields it wants, with the kernel replying with just
>> those particular fields, maybe in their raw binary representations.
>> The ASCII-text bag-of-everything files would remain available for
>> ad-hoc and non-performance critical use, but programs that cared about
>> performance would have an efficient bypass. One concrete approach is
>> to let users open up today's proc files and, instead of read(2)ing a
>> text blob, use an ioctl to retrieve specified and targeted information
>> of the sort that would normally be encoded in the text blob. Because
>> callers would open the same file when using either the text or binary
>> interfaces, little would have to change, and it'd be easy to implement
>> fallbacks when a particular system doesn't support a particular
>> fast-path ioctl.

Please.   Sysfs, with the one value per file rule, was created exactly
for the purpose of eliminating these sort of problems with procfs.  So
instead of inventing special purpose interfaces for proc, just make
the info available in sysfs, if not already available.

Thanks,
Miklos

Daniel Colascione Nov. 7, 2018, 3:42 p.m. UTC | #18

On Wed, Nov 7, 2018 at 10:03 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Mon, 29 Oct 2018 23:04:45 +0000 Daniel Colascione <dancol@google.com> wrote:
>>
>>> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
>>> > This patch introduces a new /proc/stat2 file that is identical to the
>>> > regular 'stat' except that it zeroes all hard irq statistics. The new
>>> > file is a drop in replacement to stat for users that need performance.
>>>
>>> For a while now, I've been thinking over ways to improve the
>>> performance of collecting various bits of kernel information. I don't
>>> think that a proliferation of special-purpose named bag-of-fields file
>>> variants is the right answer, because even if you add a few info-file
>>> variants, you're still left with a situation where a given file
>>> provides a particular caller with too little or too much information.
>>> I'd much rather move to a model in which userspace *explicitly* tells
>>> the kernel which fields it wants, with the kernel replying with just
>>> those particular fields, maybe in their raw binary representations.
>>> The ASCII-text bag-of-everything files would remain available for
>>> ad-hoc and non-performance critical use, but programs that cared about
>>> performance would have an efficient bypass. One concrete approach is
>>> to let users open up today's proc files and, instead of read(2)ing a
>>> text blob, use an ioctl to retrieve specified and targeted information
>>> of the sort that would normally be encoded in the text blob. Because
>>> callers would open the same file when using either the text or binary
>>> interfaces, little would have to change, and it'd be easy to implement
>>> fallbacks when a particular system doesn't support a particular
>>> fast-path ioctl.
>
> Please.   Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs.  So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.

First of all, is sysfs even right? Some people, for whatever reason,
are extremely particular about the purposes of various virtual
filesystems. "No, sysfs is for exposing kernel objects, not
configuration!" is something I've heard more than once. Who's to say
that sysfs is for exposing /proc/pid/stat, which isn't a "kernel
object" itself? (A process is not its struct task.) More generally,
objections about APIs rooted in arcane kernel-internal considerations
about the purposes of various virtual filesystems --- procfs, sysfs,
debugfs, configfs --- makes the userspace API worse, because it
enshrines implementation details (is this thing a kobject or not?) in
public API. If I had my way, we'd have continued putting *everything*
in procfs and just make procfs the "I want stuff from the kernel" API.
Nobody in userspace cares about these filesystem divisions.

Second, slurping from a sysfs-style setup in which there's one file
per piece of information creates massive overhead, because there's
currently no way to open multiple paths with one system call and no
way to read from multiple FDs with one system call. If you want this
kind of setup to work, you need some kind of batched openat-and-read
system call mechanism. I think a simple "get information from this
procfs FD" system call --- something like statx --- is both cleaner
and more efficient. Plus, without a batch operation, there's no way to
achieve atomicity. It's perfectly reasonable for userspace to request
some bits of information about a process want these bits to be
consistent with each other. Now, such an API would be good to add, but
it's not enough, since a generic batched openat-and-read would still
have to go through VFS, create struct files, (probably) encode to
ASCII, and so on. Why should any system pay to do that much work when
the fields anyone might want could be obtained with a simple
copy_to_user?

Third, and finally, a sysfs-style tree for processes doesn't currently
exist. Would you propose having *two* *different* representations of
the process list as virtual filesystems? That's another pointless
exposure of internal kernel divisions in the user API. We already have
procfs. Let's just make it better.

Miklos Szeredi Nov. 7, 2018, 3:54 p.m. UTC | #19

On Wed, Nov 7, 2018 at 4:42 PM, Daniel Colascione <dancol@google.com> wrote:

> configuration!" is something I've heard more than once. Who's to say
> that sysfs is for exposing /proc/pid/stat,

Patch is about /proc/stat not /proc/PID/stat.  Please revise your
arguments based on that.

Thanks,
Miklos

Daniel Colascione Nov. 7, 2018, 4:01 p.m. UTC | #20

On Wed, Nov 7, 2018 at 3:54 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Wed, Nov 7, 2018 at 4:42 PM, Daniel Colascione <dancol@google.com> wrote:
>
>> configuration!" is something I've heard more than once. Who's to say
>> that sysfs is for exposing /proc/pid/stat,
>
> Patch is about /proc/stat not /proc/PID/stat.  Please revise your
> arguments based on that.

My argument stands. The considerations I'm discussing apply to both
process-specific and system-global information files.

Waiman Long Nov. 7, 2018, 4:31 p.m. UTC | #21

On 11/06/2018 10:32 PM, Davidlohr Bueso wrote:
> On Tue, 06 Nov 2018, Andrew Morton wrote:
>
>> It would be interesting to know precisely which stat fields the
>> database-which-shall-not-be-named is looking for.  Then we could cook
>> up a very whizzy way of getting at the info.
>
> The ctxt field, afaik. In any case they have been able to work around
> the bottleneck. I'm not sure if that is the case for Waiman, however.
>

In my case, the customers just complain about the slowdown in reading
/proc/stat on some platforms vs. the others because some had many more
interrupt lines than the others. They didn't specifically call out what
they were looking at.

>>
>> A downside of the stat2 approach is that applications will need to be
>> rebuilt.  And hopefully when people do this, they'll open
>> "/etc/my-app-name/symlink-to-proc-stat" (or use per-application config)
>> so they won't need a rebuild when we add /proc/stat3!
>>
>> A /proc/change-how-stat-works tunable would avoid the need to rebuild
>> applications.  But if a system still has some applications which want
>> the irq info then that doesn't work.
>>
>> It's all very sad, really.
>>
>> btw,
>>
>>> +The stat2 file acts as a performance alternative to /proc/stat for
>>> workloads
>>> +and systems that care and are under heavy irq load. In order to to
>>> be completely
>>> +compatible, /proc/stat and /proc/stat2 are identical with the
>>> exception that the
>>> +later will show 0 for any (hard)irq-related fields. This refers
>>> particularly
>>
>> "latter"
>>
>>> +to the "intr" line and 'irq' column for that aggregate in the cpu
>>> line.
>>
>> btw2, please quantify "poor performance".  That helps us determine how
>> much we care about all of this!
>
> Up to a quarter of a second is what was reported as being spent every
> time
> /proc/stat is used. This is with 1k cores and 4k interrupts.
>
> Thanks,
> Davidlohr

Yes, the time spent will scale more or less linearly with the # of cores
and # of interrupts.

Thanks,
Longman

Vito Caputo Nov. 7, 2018, 8:32 p.m. UTC | #22

On Wed, Nov 07, 2018 at 11:03:06AM +0100, Miklos Szeredi wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Mon, 29 Oct 2018 23:04:45 +0000 Daniel Colascione <dancol@google.com> wrote:
> >
> >> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >> > This patch introduces a new /proc/stat2 file that is identical to the
> >> > regular 'stat' except that it zeroes all hard irq statistics. The new
> >> > file is a drop in replacement to stat for users that need performance.
> >>
> >> For a while now, I've been thinking over ways to improve the
> >> performance of collecting various bits of kernel information. I don't
> >> think that a proliferation of special-purpose named bag-of-fields file
> >> variants is the right answer, because even if you add a few info-file
> >> variants, you're still left with a situation where a given file
> >> provides a particular caller with too little or too much information.
> >> I'd much rather move to a model in which userspace *explicitly* tells
> >> the kernel which fields it wants, with the kernel replying with just
> >> those particular fields, maybe in their raw binary representations.
> >> The ASCII-text bag-of-everything files would remain available for
> >> ad-hoc and non-performance critical use, but programs that cared about
> >> performance would have an efficient bypass. One concrete approach is
> >> to let users open up today's proc files and, instead of read(2)ing a
> >> text blob, use an ioctl to retrieve specified and targeted information
> >> of the sort that would normally be encoded in the text blob. Because
> >> callers would open the same file when using either the text or binary
> >> interfaces, little would have to change, and it'd be easy to implement
> >> fallbacks when a particular system doesn't support a particular
> >> fast-path ioctl.
> 
> Please.   Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs.  So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.
> 

I like the sysfs approach to organizing the data, and have wanted
fd-batching IO syscalls in other circumstances anyways, so I think
there's a good possibility of something along those lines getting added
eventually.

At a past employer I had written some backup software which had to
reassemble versioned files from chains of reverse differentials (think
rdiff-backup).  I had all the information needed to quickly construct a
multi-fd iovec to supply to a single batched readv syscall when
servicing versioned reads from a FUSE mount that involved a potentially
long chain of diffs, but no such syscall exists.  The more
differentials, the more fragmented the operation tended to be, requiring
increasing numbers of smaller reads across more files to reconstruct the
buffer.

The same thing would be useful for making reads from large numbers of
sysfs files less costly.  I presume proposing such a generally
applicable VFS API addition would meet less resistance than specialized
proc interfaces, perhaps naively :).

Regards,
Vito Caputo

Dave Chinner Nov. 8, 2018, 2:07 a.m. UTC | #23

On Wed, Nov 07, 2018 at 11:03:06AM +0100, Miklos Szeredi wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Mon, 29 Oct 2018 23:04:45 +0000 Daniel Colascione <dancol@google.com> wrote:
> >
> >> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> >> > This patch introduces a new /proc/stat2 file that is identical to the
> >> > regular 'stat' except that it zeroes all hard irq statistics. The new
> >> > file is a drop in replacement to stat for users that need performance.
> >>
> >> For a while now, I've been thinking over ways to improve the
> >> performance of collecting various bits of kernel information. I don't
> >> think that a proliferation of special-purpose named bag-of-fields file
> >> variants is the right answer, because even if you add a few info-file
> >> variants, you're still left with a situation where a given file
> >> provides a particular caller with too little or too much information.
> >> I'd much rather move to a model in which userspace *explicitly* tells
> >> the kernel which fields it wants, with the kernel replying with just
> >> those particular fields, maybe in their raw binary representations.
> >> The ASCII-text bag-of-everything files would remain available for
> >> ad-hoc and non-performance critical use, but programs that cared about
> >> performance would have an efficient bypass. One concrete approach is
> >> to let users open up today's proc files and, instead of read(2)ing a
> >> text blob, use an ioctl to retrieve specified and targeted information
> >> of the sort that would normally be encoded in the text blob. Because
> >> callers would open the same file when using either the text or binary
> >> interfaces, little would have to change, and it'd be easy to implement
> >> fallbacks when a particular system doesn't support a particular
> >> fast-path ioctl.
> 
> Please.   Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs.  So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.

This doesn't solve the problem.

The problem is that this specific implementation of per-cpu
counters need to be summed on every read. Hence when you have a huge
number of CPUs each per-cpu iteration that takes a substantial
amount of time.

If only we had percpu counters that had a fixed, extremely low read
overhead that doesn't care about the number of CPUs in the
machine....

Oh, wait, we do: percpu_counters.[ch].

This all seems like a counter implementation deficiency to me, not
an interface problem...

Cheers,

Dave.

Davidlohr Bueso Nov. 8, 2018, 7:24 a.m. UTC | #24

On Thu, 08 Nov 2018, Dave Chinner wrote:

>If only we had percpu counters that had a fixed, extremely low read
>overhead that doesn't care about the number of CPUs in the
>machine....
>
>Oh, wait, we do: percpu_counters.[ch].
>
>This all seems like a counter implementation deficiency to me, not
>an interface problem...

Yeah fair point, as long as we can sacrifice accuracy by replacing
kernel_stat -- or maybe just replace the hard irq stats, which I
still think only accounts for 1% of all stat users. I have not looked
at how filesystems tune the batch size, but it would certainly be worth
looking into methinks.

Thanks,
Davidlohr

Davidlohr Bueso Nov. 8, 2018, 7:44 a.m. UTC | #25

On Wed, 07 Nov 2018, Davidlohr Bueso wrote:
>I have not looked at how filesystems tune the batch size, but it would certainly be worth
>looking into methinks.

nm this part, percpu_counter_batch is not tunable. It would
still probably be acceptable (famous last words) to at least
move the bottleneck in question to percpu_counter api.

Thanks,
Davidlohr

fs/proc: introduce /proc/stat2 file

Commit Message

Comments

Patch