[v3] fs/proc: add VmTaskSize field to /proc/$$/status

Message ID	1557158023-23021-1-git-send-email-jsavitz@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Joel Savitz <jsavitz@redhat.com> To: linux-kernel@vger.kernel.org Cc: Joel Savitz <jsavitz@redhat.com>, Alexey Dobriyan <adobriyan@gmail.com>, Andrew Morton <akpm@linux-foundation.org>, Vlastimil Babka <vbabka@suse.cz>, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, Michael Ellerman <mpe@ellerman.id.au>, Ram Pai <linuxram@us.ibm.com>, Andrea Arcangeli <aarcange@redhat.com>, Huang Ying <ying.huang@intel.com>, Sandeep Patil <sspatil@android.com>, Rafael Aquini <aquini@redhat.com>, Yury Norov <yury.norov@gmail.com>, linux-fsdevel@vger.kernel.org Subject: [PATCH v3] fs/proc: add VmTaskSize field to /proc/$$/status Date: Mon, 6 May 2019 11:53:43 -0400 Message-Id: <1557158023-23021-1-git-send-email-jsavitz@redhat.com> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	[v3] fs/proc: add VmTaskSize field to /proc/$$/status \| expand [v3] fs/proc: add VmTaskSize field to /proc/$$/status

Joel Savitz May 6, 2019, 3:53 p.m. UTC

There is currently no easy and architecture-independent way to find the
lowest unusable virtual address available to a process without
brute-force calculation. This patch allows a user to easily retrieve
this value via /proc/<pid>/status.

Using this patch, any program that previously needed to waste cpu cycles
recalculating a non-sensitive process-dependent value already known to
the kernel can now be optimized to use this mechanism.

Signed-off-by: Joel Savitz <jsavitz@redhat.com>
---
 Documentation/filesystems/proc.txt | 2 ++
 fs/proc/task_mmu.c                 | 2 ++
 2 files changed, 4 insertions(+)

Rafael Aquini May 7, 2019, 12:54 p.m. UTC | #1

On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
> There is currently no easy and architecture-independent way to find the
> lowest unusable virtual address available to a process without
> brute-force calculation. This patch allows a user to easily retrieve
> this value via /proc/<pid>/status.
> 
> Using this patch, any program that previously needed to waste cpu cycles
> recalculating a non-sensitive process-dependent value already known to
> the kernel can now be optimized to use this mechanism.
> 
> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> ---
>  Documentation/filesystems/proc.txt | 2 ++
>  fs/proc/task_mmu.c                 | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 66cad5c86171..1c6a912e3975 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -187,6 +187,7 @@ read the file /proc/PID/status:
>    VmLib:      1412 kB
>    VmPTE:        20 kb
>    VmSwap:        0 kB
> +  VmTaskSize:	137438953468 kB
>    HugetlbPages:          0 kB
>    CoreDumping:    0
>    THP_enabled:	  1
> @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
>   VmPTE                       size of page table entries
>   VmSwap                      amount of swap used by anonymous private data
>                               (shmem swap usage is not included)
> + VmTaskSize                  lowest unusable address in process virtual memory

Can we change this help text to "size of process' virtual address space memory" ?

>   HugetlbPages                size of hugetlb memory portions
>   CoreDumping                 process's memory is currently being dumped
>                               (killing the process may lead to a corrupted core)
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 95ca1fe7283c..0af7081f7b19 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  	seq_put_decimal_ull_width(m,
>  		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
>  	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> +	seq_put_decimal_ull_width(m,
> +		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
>  	seq_puts(m, " kB\n");
>  	hugetlb_report_usage(m, mm);
>  }
> -- 
> 2.18.1
> 
Acked-by: Rafael Aquini <aquini@redhat.com>

Yury Norov May 8, 2019, 6:37 a.m. UTC | #2

On Tue, May 07, 2019 at 08:54:31AM -0400, Rafael Aquini wrote:
> On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
> > There is currently no easy and architecture-independent way to find the
> > lowest unusable virtual address available to a process without
> > brute-force calculation. This patch allows a user to easily retrieve
> > this value via /proc/<pid>/status.
> > 
> > Using this patch, any program that previously needed to waste cpu cycles
> > recalculating a non-sensitive process-dependent value already known to
> > the kernel can now be optimized to use this mechanism.
> > 
> > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> > ---
> >  Documentation/filesystems/proc.txt | 2 ++
> >  fs/proc/task_mmu.c                 | 2 ++
> >  2 files changed, 4 insertions(+)
> > 
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 66cad5c86171..1c6a912e3975 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
> >    VmLib:      1412 kB
> >    VmPTE:        20 kb
> >    VmSwap:        0 kB
> > +  VmTaskSize:	137438953468 kB
> >    HugetlbPages:          0 kB
> >    CoreDumping:    0
> >    THP_enabled:	  1
> > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
> >   VmPTE                       size of page table entries
> >   VmSwap                      amount of swap used by anonymous private data
> >                               (shmem swap usage is not included)
> > + VmTaskSize                  lowest unusable address in process virtual memory
> 
> Can we change this help text to "size of process' virtual address space memory" ?

Agree. Or go in other direction and make it VmEnd

> >   HugetlbPages                size of hugetlb memory portions
> >   CoreDumping                 process's memory is currently being dumped
> >                               (killing the process may lead to a corrupted core)
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 95ca1fe7283c..0af7081f7b19 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> >  	seq_put_decimal_ull_width(m,
> >  		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
> >  	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> > +	seq_put_decimal_ull_width(m,
> > +		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
> >  	seq_puts(m, " kB\n");
> >  	hugetlb_report_usage(m, mm);
> >  }

I'm OK with technical part, but I still have questions not answered
(or wrongly answered) in v1 and v2. Below is the very detailed
description of the concerns I have.

1. What is the exact reason for it? Original version tells about some
test that takes so much time that you were able to drink a cup of
coffee before it was done. The test as you said implements linear
search to find the last page and so is of O(n). If it's only for some
random test, I think the kernel can survive without it. Do you have a
real example of useful programs that suffer without this information?

2. I have nothing against taking breaks and see nothing weird if
ineffective algorithms take time. On my system (x86, Ubuntu) the last
mapped region according to /proc/<pid>/maps is:
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall]
So to find the required address, we have to inspect 2559 pages. With a
binary search it would take 12 iterations at max. If my calculation is
wrong or your environment is completely different - please elaborate.

3. As far as I can see, Linux currently does not support dynamic
TASK_SIZE. It means that for any platform+ABI combination VmTaskSize
will be always the same. So VmTaskSize would be effectively useless waste
of lines. In fact, TASK SIZE is compiler time information and should
be exposed to user in headers. In discussion to v2 Rafael Aquini answered
for this concern that TASK_SIZE is a runtime resolved macro. It's
true, but my main point is: GCC knows what type of binary it compiles
and is able to select proper value. We are already doing similar things
where appropriate. Refer for example to my arm64/ilp32 series:

arch/arm64/include/uapi/asm/bitsperlong.h:
-#define __BITS_PER_LONG 64
+#if defined(__LP64__)
+/* Assuming __LP64__ will be defined for native ELF64's and not for ILP32. */
+#  define __BITS_PER_LONG 64
+#elif defined(__ILP32__)
+#  define __BITS_PER_LONG 32
+#else
+#  error "Neither LP64 nor ILP32: unsupported ABI in asm/bitsperlong.h"
+#endif

__LP64__ and __ILP32__ are symbols provided by GCC to distinguish
between ABIs. So userspace is able to take proper __BITS_PER_LONG value
at compile time, not at runtime. I think, you can do the same thing for
TASK_SIZE. If there are real usecases of the proposed feature, I think
programs will benefit even more if they will be able to get this info
at compile time.

Thaks,
Yury

Rafael Aquini May 8, 2019, 1:47 p.m. UTC | #3

On Tue, May 07, 2019 at 11:37:16PM -0700, Yury Norov wrote:
> On Tue, May 07, 2019 at 08:54:31AM -0400, Rafael Aquini wrote:
> > On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
> > > There is currently no easy and architecture-independent way to find the
> > > lowest unusable virtual address available to a process without
> > > brute-force calculation. This patch allows a user to easily retrieve
> > > this value via /proc/<pid>/status.
> > > 
> > > Using this patch, any program that previously needed to waste cpu cycles
> > > recalculating a non-sensitive process-dependent value already known to
> > > the kernel can now be optimized to use this mechanism.
> > > 
> > > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> > > ---
> > >  Documentation/filesystems/proc.txt | 2 ++
> > >  fs/proc/task_mmu.c                 | 2 ++
> > >  2 files changed, 4 insertions(+)
> > > 
> > > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > > index 66cad5c86171..1c6a912e3975 100644
> > > --- a/Documentation/filesystems/proc.txt
> > > +++ b/Documentation/filesystems/proc.txt
> > > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
> > >    VmLib:      1412 kB
> > >    VmPTE:        20 kb
> > >    VmSwap:        0 kB
> > > +  VmTaskSize:	137438953468 kB
> > >    HugetlbPages:          0 kB
> > >    CoreDumping:    0
> > >    THP_enabled:	  1
> > > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
> > >   VmPTE                       size of page table entries
> > >   VmSwap                      amount of swap used by anonymous private data
> > >                               (shmem swap usage is not included)
> > > + VmTaskSize                  lowest unusable address in process virtual memory
> > 
> > Can we change this help text to "size of process' virtual address space memory" ?
> 
> Agree. Or go in other direction and make it VmEnd
> 
> > >   HugetlbPages                size of hugetlb memory portions
> > >   CoreDumping                 process's memory is currently being dumped
> > >                               (killing the process may lead to a corrupted core)
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 95ca1fe7283c..0af7081f7b19 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> > >  	seq_put_decimal_ull_width(m,
> > >  		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
> > >  	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> > > +	seq_put_decimal_ull_width(m,
> > > +		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
> > >  	seq_puts(m, " kB\n");
> > >  	hugetlb_report_usage(m, mm);
> > >  }
> 
> I'm OK with technical part, but I still have questions not answered
> (or wrongly answered) in v1 and v2. Below is the very detailed
> description of the concerns I have.
> 
> 1. What is the exact reason for it? Original version tells about some
> test that takes so much time that you were able to drink a cup of
> coffee before it was done. The test as you said implements linear
> search to find the last page and so is of O(n). If it's only for some
> random test, I think the kernel can survive without it. Do you have a
> real example of useful programs that suffer without this information?
> 
> 
> 2. I have nothing against taking breaks and see nothing weird if
> ineffective algorithms take time. On my system (x86, Ubuntu) the last
> mapped region according to /proc/<pid>/maps is:
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall]
> So to find the required address, we have to inspect 2559 pages. With a
> binary search it would take 12 iterations at max. If my calculation is
> wrong or your environment is completely different - please elaborate.
> 
> 3. As far as I can see, Linux currently does not support dynamic
> TASK_SIZE. It means that for any platform+ABI combination VmTaskSize
> will be always the same. So VmTaskSize would be effectively useless waste

Assuming you can have it fixed and decide upon one another at compile
time also is not necessarely true, unfortunately. One could adjust PAGE_OFFSET, 
at kernel config, to provide different splits for the virtual address space,
which will affect TASK_SIZE, eventually. (see arch/x86/Kconfig)

 
> of lines. In fact, TASK SIZE is compiler time information and should
> be exposed to user in headers. In discussion to v2 Rafael Aquini answered
> for this concern that TASK_SIZE is a runtime resolved macro. It's
> true, but my main point is: GCC knows what type of binary it compiles
> and is able to select proper value. We are already doing similar things
> where appropriate. Refer for example to my arm64/ilp32 series:
> arch/arm64/include/uapi/asm/bitsperlong.h:
> -#define __BITS_PER_LONG 64
> +#if defined(__LP64__)
> +/* Assuming __LP64__ will be defined for native ELF64's and not for ILP32. */
> +#  define __BITS_PER_LONG 64
> +#elif defined(__ILP32__)
> +#  define __BITS_PER_LONG 32
> +#else
> +#  error "Neither LP64 nor ILP32: unsupported ABI in asm/bitsperlong.h"
> +#endif
> 


You are correct, but you miss the point Joel is trying to provide that
value in an architectural agnostic way to avoid the hassle of keep adding
more preprocessor complexity and being concerned about arch particularities.

You were spot on pointing the issues with the prctl(2) approach before,
but I don't see the need to overengineer the suggested approach here. 
The cost of getting mm->task_size exported via /proc/<pid>/status is
neglectable, and prevents further complexity going in for such simple
task.


Regards,
-- Rafael

Michael Ellerman May 10, 2019, 3:32 a.m. UTC | #4

Yury Norov <yury.norov@gmail.com> writes:
> On Tue, May 07, 2019 at 08:54:31AM -0400, Rafael Aquini wrote:
>> On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
>> > There is currently no easy and architecture-independent way to find the
>> > lowest unusable virtual address available to a process without
>> > brute-force calculation. This patch allows a user to easily retrieve
>> > this value via /proc/<pid>/status.
>> > 
>> > Using this patch, any program that previously needed to waste cpu cycles
>> > recalculating a non-sensitive process-dependent value already known to
>> > the kernel can now be optimized to use this mechanism.
>> > 
>> > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
>> > ---
>> >  Documentation/filesystems/proc.txt | 2 ++
>> >  fs/proc/task_mmu.c                 | 2 ++
>> >  2 files changed, 4 insertions(+)
>> > 
>> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>> > index 66cad5c86171..1c6a912e3975 100644
>> > --- a/Documentation/filesystems/proc.txt
>> > +++ b/Documentation/filesystems/proc.txt
>> > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
>> >    VmLib:      1412 kB
>> >    VmPTE:        20 kb
>> >    VmSwap:        0 kB
>> > +  VmTaskSize:	137438953468 kB
>> >    HugetlbPages:          0 kB
>> >    CoreDumping:    0
>> >    THP_enabled:	  1
>> > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
>> >   VmPTE                       size of page table entries
>> >   VmSwap                      amount of swap used by anonymous private data
>> >                               (shmem swap usage is not included)
>> > + VmTaskSize                  lowest unusable address in process virtual memory
>> 
>> Can we change this help text to "size of process' virtual address space memory" ?
>
> Agree. Or go in other direction and make it VmEnd

Yeah I think VmEnd would be clearer to folks who aren't familiar with
the kernel's usage of the TASK_SIZE terminology.

>> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> > index 95ca1fe7283c..0af7081f7b19 100644
>> > --- a/fs/proc/task_mmu.c
>> > +++ b/fs/proc/task_mmu.c
>> > @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> >  	seq_put_decimal_ull_width(m,
>> >  		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
>> >  	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
>> > +	seq_put_decimal_ull_width(m,
>> > +		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
>> >  	seq_puts(m, " kB\n");
>> >  	hugetlb_report_usage(m, mm);
>> >  }
>
> I'm OK with technical part, but I still have questions not answered
> (or wrongly answered) in v1 and v2. Below is the very detailed
> description of the concerns I have.
>
> 1. What is the exact reason for it? Original version tells about some
> test that takes so much time that you were able to drink a cup of
> coffee before it was done. The test as you said implements linear
> search to find the last page and so is of O(n). If it's only for some
> random test, I think the kernel can survive without it. Do you have a
> real example of useful programs that suffer without this information?
>
>
> 2. I have nothing against taking breaks and see nothing weird if
> ineffective algorithms take time. On my system (x86, Ubuntu) the last
> mapped region according to /proc/<pid>/maps is:
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall]
> So to find the required address, we have to inspect 2559 pages. With a
> binary search it would take 12 iterations at max. If my calculation is
> wrong or your environment is completely different - please elaborate.

I agree it should not be hard to calculate, but at the same time it's
trivial for the kernel to export the information so I don't see why the
kernel shouldn't.

> 3. As far as I can see, Linux currently does not support dynamic
> TASK_SIZE. It means that for any platform+ABI combination VmTaskSize
> will be always the same. So VmTaskSize would be effectively useless waste
> of lines. In fact, TASK SIZE is compiler time information and should
> be exposed to user in headers. In discussion to v2 Rafael Aquini answered
> for this concern that TASK_SIZE is a runtime resolved macro. It's
> true, but my main point is: GCC knows what type of binary it compiles
> and is able to select proper value. We are already doing similar things
> where appropriate. Refer for example to my arm64/ilp32 series:
>
> arch/arm64/include/uapi/asm/bitsperlong.h:
> -#define __BITS_PER_LONG 64
> +#if defined(__LP64__)
> +/* Assuming __LP64__ will be defined for native ELF64's and not for ILP32. */
> +#  define __BITS_PER_LONG 64
> +#elif defined(__ILP32__)
> +#  define __BITS_PER_LONG 32
> +#else
> +#  error "Neither LP64 nor ILP32: unsupported ABI in asm/bitsperlong.h"
> +#endif
>
> __LP64__ and __ILP32__ are symbols provided by GCC to distinguish
> between ABIs. So userspace is able to take proper __BITS_PER_LONG value
> at compile time, not at runtime. I think, you can do the same thing for
> TASK_SIZE.

No you can't do it at compile time for TASK_SIZE.

On powerpc a 64-bit program might run on a kernel built with 4K pages
where TASK_SIZE is 64TB, and that same program can run on a kernel built
with 64K pages where TASK_SIZE is 4PB.

And it's not just determined by PAGE_SIZE, that same program might also
run on an older kernel where TASK_SIZE with 64K pages was 512TB.

cheers

Yury Norov May 10, 2019, 7:25 a.m. UTC | #5

On Fri, May 10, 2019 at 01:32:22PM +1000, Michael Ellerman wrote:
> Yury Norov <yury.norov@gmail.com> writes:
> > On Tue, May 07, 2019 at 08:54:31AM -0400, Rafael Aquini wrote:
> >> On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
> >> > There is currently no easy and architecture-independent way to find the
> >> > lowest unusable virtual address available to a process without
> >> > brute-force calculation. This patch allows a user to easily retrieve
> >> > this value via /proc/<pid>/status.
> >> > 
> >> > Using this patch, any program that previously needed to waste cpu cycles
> >> > recalculating a non-sensitive process-dependent value already known to
> >> > the kernel can now be optimized to use this mechanism.
> >> > 
> >> > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> >> > ---
> >> >  Documentation/filesystems/proc.txt | 2 ++
> >> >  fs/proc/task_mmu.c                 | 2 ++
> >> >  2 files changed, 4 insertions(+)
> >> > 
> >> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> >> > index 66cad5c86171..1c6a912e3975 100644
> >> > --- a/Documentation/filesystems/proc.txt
> >> > +++ b/Documentation/filesystems/proc.txt
> >> > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
> >> >    VmLib:      1412 kB
> >> >    VmPTE:        20 kb
> >> >    VmSwap:        0 kB
> >> > +  VmTaskSize:	137438953468 kB
> >> >    HugetlbPages:          0 kB
> >> >    CoreDumping:    0
> >> >    THP_enabled:	  1
> >> > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
> >> >   VmPTE                       size of page table entries
> >> >   VmSwap                      amount of swap used by anonymous private data
> >> >                               (shmem swap usage is not included)
> >> > + VmTaskSize                  lowest unusable address in process virtual memory
> >> 
> >> Can we change this help text to "size of process' virtual address space memory" ?
> >
> > Agree. Or go in other direction and make it VmEnd
> 
> Yeah I think VmEnd would be clearer to folks who aren't familiar with
> the kernel's usage of the TASK_SIZE terminology.
> 
> >> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >> > index 95ca1fe7283c..0af7081f7b19 100644
> >> > --- a/fs/proc/task_mmu.c
> >> > +++ b/fs/proc/task_mmu.c
> >> > @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> >> >  	seq_put_decimal_ull_width(m,
> >> >  		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
> >> >  	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> >> > +	seq_put_decimal_ull_width(m,
> >> > +		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
> >> >  	seq_puts(m, " kB\n");
> >> >  	hugetlb_report_usage(m, mm);
> >> >  }
> >
> > I'm OK with technical part, but I still have questions not answered
> > (or wrongly answered) in v1 and v2. Below is the very detailed
> > description of the concerns I have.
> >
> > 1. What is the exact reason for it? Original version tells about some
> > test that takes so much time that you were able to drink a cup of
> > coffee before it was done. The test as you said implements linear
> > search to find the last page and so is of O(n). If it's only for some
> > random test, I think the kernel can survive without it. Do you have a
> > real example of useful programs that suffer without this information?
> >
> >
> > 2. I have nothing against taking breaks and see nothing weird if
> > ineffective algorithms take time. On my system (x86, Ubuntu) the last
> > mapped region according to /proc/<pid>/maps is:
> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall]
> > So to find the required address, we have to inspect 2559 pages. With a
> > binary search it would take 12 iterations at max. If my calculation is
> > wrong or your environment is completely different - please elaborate.
> 
> I agree it should not be hard to calculate, but at the same time it's
> trivial for the kernel to export the information so I don't see why the
> kernel shouldn't.

Kernel shouldn't do it unless there will be real users of the feature.
Otherwise it's pure bloating.

One possible user of it that I can imagine is mmap(MAP_FIXED). The
documentation is very clear about it:

   Furthermore,  this  option  is  extremely  hazardous (when used on its own),
   because it forcibly removes preexisting mappings, making it easy for a 
   multithreaded  process  to corrupt its own address space.

VmEnd provided by kernel may encourage people to solve their problems
by using MAP_FIXED which is potentially dangerous.

Another scenario of VmEnd is to understand how many top bits of address will
be always zero to allocate them for user's purpose, like smart pointers. It
worth to discuss this usecase with compiler people. If they have interest,
I think it's more straightforward to give them something like:
   int preserve_top_bits(int nbits);

Anything else?
 
> > 3. As far as I can see, Linux currently does not support dynamic
> > TASK_SIZE. It means that for any platform+ABI combination VmTaskSize
> > will be always the same. So VmTaskSize would be effectively useless waste
> > of lines. In fact, TASK SIZE is compiler time information and should
> > be exposed to user in headers. In discussion to v2 Rafael Aquini answered
> > for this concern that TASK_SIZE is a runtime resolved macro. It's
> > true, but my main point is: GCC knows what type of binary it compiles
> > and is able to select proper value. We are already doing similar things
> > where appropriate. Refer for example to my arm64/ilp32 series:
> >
> > arch/arm64/include/uapi/asm/bitsperlong.h:
> > -#define __BITS_PER_LONG 64
> > +#if defined(__LP64__)
> > +/* Assuming __LP64__ will be defined for native ELF64's and not for ILP32. */
> > +#  define __BITS_PER_LONG 64
> > +#elif defined(__ILP32__)
> > +#  define __BITS_PER_LONG 32
> > +#else
> > +#  error "Neither LP64 nor ILP32: unsupported ABI in asm/bitsperlong.h"
> > +#endif
> >
> > __LP64__ and __ILP32__ are symbols provided by GCC to distinguish
> > between ABIs. So userspace is able to take proper __BITS_PER_LONG value
> > at compile time, not at runtime. I think, you can do the same thing for
> > TASK_SIZE.
> 
> No you can't do it at compile time for TASK_SIZE.
> 
> On powerpc a 64-bit program might run on a kernel built with 4K pages
> where TASK_SIZE is 64TB, and that same program can run on a kernel built
> with 64K pages where TASK_SIZE is 4PB.
> 
> And it's not just determined by PAGE_SIZE, that same program might also
> run on an older kernel where TASK_SIZE with 64K pages was 512TB.

OK, understood.

Michael Ellerman May 14, 2019, 6:17 a.m. UTC | #6

Yury Norov <yury.norov@gmail.com> writes:
> On Fri, May 10, 2019 at 01:32:22PM +1000, Michael Ellerman wrote:
>> Yury Norov <yury.norov@gmail.com> writes:
>> > On Tue, May 07, 2019 at 08:54:31AM -0400, Rafael Aquini wrote:
>> >> On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
>> >> > There is currently no easy and architecture-independent way to find the
>> >> > lowest unusable virtual address available to a process without
>> >> > brute-force calculation. This patch allows a user to easily retrieve
>> >> > this value via /proc/<pid>/status.
>> >> > 
>> >> > Using this patch, any program that previously needed to waste cpu cycles
>> >> > recalculating a non-sensitive process-dependent value already known to
>> >> > the kernel can now be optimized to use this mechanism.
>> >> > 
>> >> > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
>> >> > ---
>> >> >  Documentation/filesystems/proc.txt | 2 ++
>> >> >  fs/proc/task_mmu.c                 | 2 ++
>> >> >  2 files changed, 4 insertions(+)
>> >> > 
>> >> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
>> >> > index 66cad5c86171..1c6a912e3975 100644
>> >> > --- a/Documentation/filesystems/proc.txt
>> >> > +++ b/Documentation/filesystems/proc.txt
>> >> > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
>> >> >    VmLib:      1412 kB
>> >> >    VmPTE:        20 kb
>> >> >    VmSwap:        0 kB
>> >> > +  VmTaskSize:	137438953468 kB
>> >> >    HugetlbPages:          0 kB
>> >> >    CoreDumping:    0
>> >> >    THP_enabled:	  1
>> >> > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
>> >> >   VmPTE                       size of page table entries
>> >> >   VmSwap                      amount of swap used by anonymous private data
>> >> >                               (shmem swap usage is not included)
>> >> > + VmTaskSize                  lowest unusable address in process virtual memory
>> >> 
>> >> Can we change this help text to "size of process' virtual address space memory" ?
>> >
>> > Agree. Or go in other direction and make it VmEnd
>> 
>> Yeah I think VmEnd would be clearer to folks who aren't familiar with
>> the kernel's usage of the TASK_SIZE terminology.
>> 
>> >> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> >> > index 95ca1fe7283c..0af7081f7b19 100644
>> >> > --- a/fs/proc/task_mmu.c
>> >> > +++ b/fs/proc/task_mmu.c
>> >> > @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> >> >  	seq_put_decimal_ull_width(m,
>> >> >  		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
>> >> >  	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
>> >> > +	seq_put_decimal_ull_width(m,
>> >> > +		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
>> >> >  	seq_puts(m, " kB\n");
>> >> >  	hugetlb_report_usage(m, mm);
>> >> >  }
>> >
>> > I'm OK with technical part, but I still have questions not answered
>> > (or wrongly answered) in v1 and v2. Below is the very detailed
>> > description of the concerns I have.
>> >
>> > 1. What is the exact reason for it? Original version tells about some
>> > test that takes so much time that you were able to drink a cup of
>> > coffee before it was done. The test as you said implements linear
>> > search to find the last page and so is of O(n). If it's only for some
>> > random test, I think the kernel can survive without it. Do you have a
>> > real example of useful programs that suffer without this information?
>> >
>> >
>> > 2. I have nothing against taking breaks and see nothing weird if
>> > ineffective algorithms take time. On my system (x86, Ubuntu) the last
>> > mapped region according to /proc/<pid>/maps is:
>> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall]
>> > So to find the required address, we have to inspect 2559 pages. With a
>> > binary search it would take 12 iterations at max. If my calculation is
>> > wrong or your environment is completely different - please elaborate.
>> 
>> I agree it should not be hard to calculate, but at the same time it's
>> trivial for the kernel to export the information so I don't see why the
>> kernel shouldn't.
>
> Kernel shouldn't do it unless there will be real users of the feature.
> Otherwise it's pure bloating.

A single line or two of code to print a value that's useful information
for userspace is hardly "bloat".

I agree it's good to have users for things, but this seems like it's so
trivial that we should just add it and someone will find a use for it.

> One possible user of it that I can imagine is mmap(MAP_FIXED). The
> documentation is very clear about it:
>
>    Furthermore,  this  option  is  extremely  hazardous (when used on its own),
>    because it forcibly removes preexisting mappings, making it easy for a 
>    multithreaded  process  to corrupt its own address space.
>
> VmEnd provided by kernel may encourage people to solve their problems
> by using MAP_FIXED which is potentially dangerous.

There's MAP_FIXED_NOREPLACE now which is not dangerous.

Using MAX_FIXED_NOREPLACE and VmEnd would make it relatively easy to do
a userspace ASLR implementation, so that actually is an argument in
favour IMHO.

> Another scenario of VmEnd is to understand how many top bits of address will
> be always zero to allocate them for user's purpose, like smart pointers. It
> worth to discuss this usecase with compiler people. If they have interest,
> I think it's more straightforward to give them something like:
>    int preserve_top_bits(int nbits);

You mean a syscall?

With things like hardware pointer tagging / colouring coming along I
think you're right that using VmEnd and assuming the top bits are never
used is a bad idea, an explicit interface would be better.

cheers

Yury Norov May 19, 2019, 9:31 p.m. UTC | #7

On Tue, May 14, 2019 at 04:17:46PM +1000, Michael Ellerman wrote:
> Yury Norov <yury.norov@gmail.com> writes:
> > On Fri, May 10, 2019 at 01:32:22PM +1000, Michael Ellerman wrote:
> >> Yury Norov <yury.norov@gmail.com> writes:
> >> > On Tue, May 07, 2019 at 08:54:31AM -0400, Rafael Aquini wrote:
> >> >> On Mon, May 06, 2019 at 11:53:43AM -0400, Joel Savitz wrote:
> >> >> > There is currently no easy and architecture-independent way to find the
> >> >> > lowest unusable virtual address available to a process without
> >> >> > brute-force calculation. This patch allows a user to easily retrieve
> >> >> > this value via /proc/<pid>/status.
> >> >> > 
> >> >> > Using this patch, any program that previously needed to waste cpu cycles
> >> >> > recalculating a non-sensitive process-dependent value already known to
> >> >> > the kernel can now be optimized to use this mechanism.
> >> >> > 
> >> >> > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> >> >> > ---
> >> >> >  Documentation/filesystems/proc.txt | 2 ++
> >> >> >  fs/proc/task_mmu.c                 | 2 ++
> >> >> >  2 files changed, 4 insertions(+)
> >> >> > 
> >> >> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> >> >> > index 66cad5c86171..1c6a912e3975 100644
> >> >> > --- a/Documentation/filesystems/proc.txt
> >> >> > +++ b/Documentation/filesystems/proc.txt
> >> >> > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
> >> >> >    VmLib:      1412 kB
> >> >> >    VmPTE:        20 kb
> >> >> >    VmSwap:        0 kB
> >> >> > +  VmTaskSize:	137438953468 kB
> >> >> >    HugetlbPages:          0 kB
> >> >> >    CoreDumping:    0
> >> >> >    THP_enabled:	  1
> >> >> > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
> >> >> >   VmPTE                       size of page table entries
> >> >> >   VmSwap                      amount of swap used by anonymous private data
> >> >> >                               (shmem swap usage is not included)
> >> >> > + VmTaskSize                  lowest unusable address in process virtual memory
> >> >> 
> >> >> Can we change this help text to "size of process' virtual address space memory" ?
> >> >
> >> > Agree. Or go in other direction and make it VmEnd
> >> 
> >> Yeah I think VmEnd would be clearer to folks who aren't familiar with
> >> the kernel's usage of the TASK_SIZE terminology.
> >> 
> >> >> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >> >> > index 95ca1fe7283c..0af7081f7b19 100644
> >> >> > --- a/fs/proc/task_mmu.c
> >> >> > +++ b/fs/proc/task_mmu.c
> >> >> > @@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> >> >> >  	seq_put_decimal_ull_width(m,
> >> >> >  		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
> >> >> >  	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> >> >> > +	seq_put_decimal_ull_width(m,
> >> >> > +		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
> >> >> >  	seq_puts(m, " kB\n");
> >> >> >  	hugetlb_report_usage(m, mm);
> >> >> >  }
> >> >
> >> > I'm OK with technical part, but I still have questions not answered
> >> > (or wrongly answered) in v1 and v2. Below is the very detailed
> >> > description of the concerns I have.
> >> >
> >> > 1. What is the exact reason for it? Original version tells about some
> >> > test that takes so much time that you were able to drink a cup of
> >> > coffee before it was done. The test as you said implements linear
> >> > search to find the last page and so is of O(n). If it's only for some
> >> > random test, I think the kernel can survive without it. Do you have a
> >> > real example of useful programs that suffer without this information?
> >> >
> >> >
> >> > 2. I have nothing against taking breaks and see nothing weird if
> >> > ineffective algorithms take time. On my system (x86, Ubuntu) the last
> >> > mapped region according to /proc/<pid>/maps is:
> >> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0     [vsyscall]
> >> > So to find the required address, we have to inspect 2559 pages. With a
> >> > binary search it would take 12 iterations at max. If my calculation is
> >> > wrong or your environment is completely different - please elaborate.
> >> 
> >> I agree it should not be hard to calculate, but at the same time it's
> >> trivial for the kernel to export the information so I don't see why the
> >> kernel shouldn't.
> >
> > Kernel shouldn't do it unless there will be real users of the feature.
> > Otherwise it's pure bloating.
> 
> A single line or two of code to print a value that's useful information
> for userspace is hardly "bloat".
> 
> I agree it's good to have users for things, but this seems like it's so
> trivial that we should just add it and someone will find a use for it.

Little bloat is still bloat. Trivial useless code is still useless.

If someone finds a use of VmEnd, it should be thoroughly reviewed for
better alternatives.
 
> > One possible user of it that I can imagine is mmap(MAP_FIXED). The
> > documentation is very clear about it:
> >
> >    Furthermore,  this  option  is  extremely  hazardous (when used on its own),
> >    because it forcibly removes preexisting mappings, making it easy for a 
> >    multithreaded  process  to corrupt its own address space.
> >
> > VmEnd provided by kernel may encourage people to solve their problems
> > by using MAP_FIXED which is potentially dangerous.
> 
> There's MAP_FIXED_NOREPLACE now which is not dangerous.

MAP_FIXED_NOREPLACE is still not supported by glibc and not
documented. (Glibc doesn't use mman-common.h that comes from kernel,
and defines all mmap-related stuff in its own bits/mman.h). Therefore
from the point of view of 99% users MAP_FIXED_NOREPLACE doesn't exist.
Bionic defines MAP_FIXED_NOREPLACE but does not document it and doesn't
use.

> Using MAX_FIXED_NOREPLACE and VmEnd would make it relatively easy to do
> a userspace ASLR implementation, so that actually is an argument in
> favour IMHO.

Kernel-supported ASLR works well since 2.6.12. Do you see any
downside of using it?

MAP_RANDOM would be even more handy for userspace ASLR.

VmEnd in current form would break certain userspace programs that
has DEFAULT_MAP_WINDOW != TASK_SIZE. This is the case for 48-bit VA
programs running on 52-bits VA ARM kernel. See 363524d2b1227
(arm64: mm: Introduce DEFAULT_MAP_WINDOW).

> > Another scenario of VmEnd is to understand how many top bits of address will
> > be always zero to allocate them for user's purpose, like smart pointers. It
> > worth to discuss this usecase with compiler people. If they have interest,
> > I think it's more straightforward to give them something like:
> >    int preserve_top_bits(int nbits);
> 
> You mean a syscall?
> 
> With things like hardware pointer tagging / colouring coming along I
> think you're right that using VmEnd and assuming the top bits are never
> used is a bad idea, an explicit interface would be better.
> 
> cheers

[v3] fs/proc: add VmTaskSize field to /proc/$$/status

Commit Message

Comments

Patch