Message ID | 20220118235244.540103-1-yury.norov@gmail.com (mailing list archive) |
---|---|
State | New |
Series | vmap(): don't allow invalid pages |
On Tue, Jan 18, 2022 at 03:52:44PM -0800, Yury Norov wrote: > vmap() takes struct page *pages as one of arguments, and user may provide > an invalid pointer which would lead to DABT at address translation later. Could we spell out 'DABT'? Presumably that's an ARM-specific thing. Just like we don't say #PF for Intel page faults, I think this is probably a 'data abort'? > Currently, kernel checks the pages against NULL. In my case, however, the > address was not NULL, and was big enough so that the hardware generated > Address Size Abort on arm64. > > Interestingly, this abort happens even if copy_from_kernel_nofault() is > used, which is quite inconvenient for debugging purposes. > > This patch adds a pfn_valid() check into vmap() path, so that invalid > mapping will not be created. > > RFC: https://lkml.org/lkml/2022/1/18/815 > v1: use pfn_valid() instead of adding an arch-specific > arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. > > Signed-off-by: Yury Norov <yury.norov@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> > --- > mm/vmalloc.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index d2a00ad4e1dd..a4134ee56b10 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > return -EBUSY; > if (WARN_ON(!page)) > return -ENOMEM; > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > + return -EINVAL; > set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); > (*nr)++; > } while (pte++, addr += PAGE_SIZE, addr != end); > -- > 2.30.2 >
On 1/19/22 6:21 AM, Matthew Wilcox wrote: > On Tue, Jan 18, 2022 at 03:52:44PM -0800, Yury Norov wrote: >> vmap() takes struct page *pages as one of arguments, and user may provide >> an invalid pointer which would lead to DABT at address translation later. > > Could we spell out 'DABT'? Presumably that's an ARM-specific thing. > Just like we don't say #PF for Intel page faults, I think this is > probably a 'data abort'? Right, it is a data abort. > >> Currently, kernel checks the pages against NULL. In my case, however, the >> address was not NULL, and was big enough so that the hardware generated >> Address Size Abort on arm64. Could you please provide the actual abort stack here with details. >> >> Interestingly, this abort happens even if copy_from_kernel_nofault() is >> used, which is quite inconvenient for debugging purposes. >> >> This patch adds a pfn_valid() check into vmap() path, so that invalid >> mapping will not be created. >> >> RFC: https://lkml.org/lkml/2022/1/18/815 >> v1: use pfn_valid() instead of adding an arch-specific >> arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. This should be after the '---' below. >> >> Signed-off-by: Yury Norov <yury.norov@gmail.com> > > Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> > >> --- >> mm/vmalloc.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >> index d2a00ad4e1dd..a4134ee56b10 100644 >> --- a/mm/vmalloc.c >> +++ b/mm/vmalloc.c >> @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, >> return -EBUSY; >> if (WARN_ON(!page)) >> return -ENOMEM; >> + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) >> + return -EINVAL; >> set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); >> (*nr)++; >> } while (pte++, addr += PAGE_SIZE, addr != end); >> -- >> 2.30.2 >> Why shouldn't this just scan over the entire user-provided struct page array and make sure that all pages therein are valid via the above method, but in vmap() itself before calling vmap_pages_range()? Because it seems like a single invalid page detected in vmap_pages_pte_range() will abort the entire vmap() anyway. This will also enable us to drop the existing NULL check above.
Hi, I replied to the original RFC before spotting this; duplicating those comments here because I think they apply regardless of the mechanism used to work around this. On Tue, Jan 18, 2022 at 03:52:44PM -0800, Yury Norov wrote: > vmap() takes struct page *pages as one of arguments, and user may provide > an invalid pointer which would lead to DABT at address translation later. > > Currently, kernel checks the pages against NULL. In my case, however, the > address was not NULL, and was big enough so that the hardware generated > Address Size Abort on arm64. Can you give an example of when this might happen? It sounds like you're actually hitting this, so a backtrace would be nice. I'm a bit confused as to why we'd try to vmap() pages that we didn't have a legitimate struct page for -- where did these addresses come from? It sounds like this is going wrong at a higher level, and we're passing entirely bogus struct page pointers around. This seems like the sort of thing DEBUG_VIRTUAL or similar should check when we initially generate the struct page pointer. > Interestingly, this abort happens even if copy_from_kernel_nofault() is > used, which is quite inconvenient for debugging purposes. I can go take a look at this, but TBH we never expect to take an address size fault to begin with, so this is arguably correct -- it's an internal consistency problem. > This patch adds a pfn_valid() check into vmap() path, so that invalid > mapping will not be created. > > RFC: https://lkml.org/lkml/2022/1/18/815 > v1: use pfn_valid() instead of adding an arch-specific > arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. > > Signed-off-by: Yury Norov <yury.norov@gmail.com> > --- > mm/vmalloc.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index d2a00ad4e1dd..a4134ee56b10 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > return -EBUSY; > if (WARN_ON(!page)) > return -ENOMEM; > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > + return -EINVAL; My fear here is that for this to fire, we've already passed a bogus struct page pointer around the intermediate infrastructure, and any of that might try to use it in unsafe ways (in future even if we don't use it today). I think the fundamental issue here is that we generate a bogus struct page pointer at all, and knowing where that came from would help to fix that. Thanks, Mark. > set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); > (*nr)++; > } while (pte++, addr += PAGE_SIZE, addr != end); > -- > 2.30.2 >
On 2022-01-18 23:52, Yury Norov wrote: > vmap() takes struct page *pages as one of arguments, and user may provide > an invalid pointer which would lead to DABT at address translation later. > > Currently, kernel checks the pages against NULL. In my case, however, the > address was not NULL, and was big enough so that the hardware generated > Address Size Abort on arm64. > > Interestingly, this abort happens even if copy_from_kernel_nofault() is > used, which is quite inconvenient for debugging purposes. > > This patch adds a pfn_valid() check into vmap() path, so that invalid > mapping will not be created. > > RFC: https://lkml.org/lkml/2022/1/18/815 > v1: use pfn_valid() instead of adding an arch-specific > arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. > > Signed-off-by: Yury Norov <yury.norov@gmail.com> > --- > mm/vmalloc.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index d2a00ad4e1dd..a4134ee56b10 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > return -EBUSY; > if (WARN_ON(!page)) > return -ENOMEM; > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) Is it page_to_pfn() guaranteed to work without blowing up if page is invalid in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's true... Robin. > + return -EINVAL; > set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); > (*nr)++; > } while (pte++, addr += PAGE_SIZE, addr != end);
On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > Is it page_to_pfn() guaranteed to work without blowing up if page is invalid > in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's > true... Even if it does blow up, at least it's blowing up here where someone can start to debug it, rather than blowing up on first access, where we no longer have the invalid struct page pointer. I don't think we have a 'page_valid' function which will tell us whether a random pointer is actually a struct page or not.
On Wed, Jan 19, 2022 at 3:17 AM Mark Rutland <mark.rutland@arm.com> wrote: > > Hi, > > I replied to the original RFC before spotting this; duplicating those comments > here because I think they apply regardless of the mechanism used to work around > this. > > On Tue, Jan 18, 2022 at 03:52:44PM -0800, Yury Norov wrote: > > vmap() takes struct page *pages as one of arguments, and user may provide > > an invalid pointer which would lead to DABT at address translation later. > > > > Currently, kernel checks the pages against NULL. In my case, however, the > > address was not NULL, and was big enough so that the hardware generated > > Address Size Abort on arm64. > > Can you give an example of when this might happen? It sounds like you're > actually hitting this, so a backtrace would be nice. > > I'm a bit confused as to why we'd try to vmap() pages that we > didn't have a legitimate struct page for -- where did these addresses > come from? > > It sounds like this is going wrong at a higher level, and we're passing > entirely bogus struct page pointers around. This seems like the sort of > thing DEBUG_VIRTUAL or similar should check when we initially generate > the struct page pointer. Hi Mark, This is out-of-tree code that does: vaddr1 = dma_alloc_coherent() page = virt_to_page() // Wrong here ... vaddr2 = vmap(page) memset(vaddr2) // Fault here virt_to_page() returns a wrong pointer if vaddr1 is not a linear kernel address. The problem is that vmap() populates pte with bad pfn successfully, and it's much harder to debug at memory access time. > > Interestingly, this abort happens even if copy_from_kernel_nofault() is > > used, which is quite inconvenient for debugging purposes. > > I can go take a look at this, but TBH we never expect to take an address size > fault to begin with, so this is arguably correct -- it's an internal > consistency problem. > > > This patch adds a pfn_valid() check into vmap() path, so that invalid > > mapping will not be created. > > > > RFC: https://lkml.org/lkml/2022/1/18/815 > > v1: use pfn_valid() instead of adding an arch-specific > > arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. > > > > Signed-off-by: Yury Norov <yury.norov@gmail.com> > > --- > > mm/vmalloc.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index d2a00ad4e1dd..a4134ee56b10 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > > return -EBUSY; > > if (WARN_ON(!page)) > > return -ENOMEM; > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > + return -EINVAL; > > My fear here is that for this to fire, we've already passed a bogus struct page > pointer around the intermediate infrastructure, and any of that might try to > use it in unsafe ways (in future even if we don't use it today). > > I think the fundamental issue here is that we generate a bogus struct page > pointer at all, and knowing where that came from would help to fix that. You're right. That's why WARN_ON() is used for the page == null in the code above, I believe, - to let client code know that something goes wrong, and it's not a regular ENOMEM situation. Thanks, Yury > Thanks, > Mark. > > > set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); > > (*nr)++; > > } while (pte++, addr += PAGE_SIZE, addr != end); > > -- > > 2.30.2 > >
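To make the failure mode concrete, the buggy pattern Yury describes might look like the following minimal sketch (the function name and single-page usage are illustrative, not from the actual out-of-tree driver):

#include <linux/dma-mapping.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/mm.h>

static void *broken_remap(struct device *dev)
{
	dma_addr_t dma_handle;
	void *vaddr1, *vaddr2;
	struct page *page;

	/* May return a non-linear (remapped) kernel address */
	vaddr1 = dma_alloc_coherent(dev, PAGE_SIZE, &dma_handle, GFP_KERNEL);
	if (!vaddr1)
		return NULL;

	/*
	 * BUG: virt_to_page() is only defined for linear-map addresses,
	 * so for a non-linear vaddr1 this yields a bogus struct page.
	 */
	page = virt_to_page(vaddr1);

	/* vmap() happily installs a PTE with the bogus pfn... */
	vaddr2 = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	if (!vaddr2)
		return NULL;

	/* ...and the abort only fires here, on first access. */
	memset(vaddr2, 0, PAGE_SIZE);
	return vaddr2;
}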
On Tue, Jan 18, 2022 at 10:17 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 1/19/22 6:21 AM, Matthew Wilcox wrote: > > On Tue, Jan 18, 2022 at 03:52:44PM -0800, Yury Norov wrote: > >> vmap() takes struct page *pages as one of arguments, and user may provide > >> an invalid pointer which would lead to DABT at address translation later. > > > > Could we spell out 'DABT'? Presumably that's an ARM-specific thing. > > Just like we don't say #PF for Intel page faults, I think this is > > probably a 'data abort'? > > Right, it is a data abort. > > > > >> Currently, kernel checks the pages against NULL. In my case, however, the > >> address was not NULL, and was big enough so that the hardware generated > >> Address Size Abort on arm64. > > Could you please provide the actual abort stack here with details. [ 665.484101] Unhandled fault at 0xffff8000252cd000 [ 665.488807] Mem abort info: [ 665.491617] ESR = 0x96000043 [ 665.494675] EC = 0x25: DABT (current EL), IL = 32 bits [ 665.499985] SET = 0, FnV = 0 [ 665.503039] EA = 0, S1PTW = 0 [ 665.506167] Data abort info: [ 665.509047] ISV = 0, ISS = 0x00000043 [ 665.512882] CM = 0, WnR = 1 [ 665.515851] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000818cb000 [ 665.522550] [ffff8000252cd000] pgd=000000affcfff003, pud=000000affcffe003, pmd=0000008fad8c3003, pte=00688000a5217713 [ 665.533160] Internal error: level 3 address size fault: 96000043 [#1] SMP [ 665.539936] Modules linked in: [...] [ 665.616212] CPU: 178 PID: 13199 Comm: test Tainted: P OE 5.4.0-84-generic #94~18.04.1-Ubuntu [ 665.626806] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 [ 665.636618] pstate: 80400009 (Nzcv daif +PAN -UAO) [ 665.641407] pc : __memset+0x38/0x188 [ 665.645146] lr : test+0xcc/0x3f8 [ 665.650184] sp : ffff8000359bb840 [ 665.653486] x29: ffff8000359bb840 x28: 0000000000000000 [ 665.658785] x27: 0000000000000000 x26: 0000000000231000 [ 665.664083] x25: ffff00ae660f6110 x24: ffff00ae668cb800 [ 665.669382] x23: 0000000000000001 x22: ffff00af533e5000 [ 665.674680] x21: 0000000000001000 x20: 0000000000000000 [ 665.679978] x19: ffff00ae66950000 x18: ffffffffffffffff [ 665.685276] x17: 00000000588636a5 x16: 0000000000000013 [ 665.690574] x15: ffffffffffffffff x14: 000000000007ffff [ 665.695872] x13: 0000000080000000 x12: 0140000000000000 [ 665.701170] x11: 0000000000000041 x10: ffff8000652cd000 [ 665.706468] x9 : ffff8000252cf000 x8 : ffff8000252cd000 [ 665.711767] x7 : 0303030303030303 x6 : 0000000000001000 [ 665.717065] x5 : ffff8000252cd000 x4 : 0000000000000000 [ 665.722363] x3 : ffff8000252cdfff x2 : 0000000000000001 [ 665.727661] x1 : 0000000000000003 x0 : ffff8000252cd000 [ 665.732960] Call trace: [ 665.735395] __memset+0x38/0x188 [...] TCR_EL1 is 34b5503510, and so TCR_EL1.IPS is 0b100. The pfn that caused the address size abort has bit#47 set, which is far above the 16TB that is allowed by IPS == 0b100. > >> > >> Interestingly, this abort happens even if copy_from_kernel_nofault() is > >> used, which is quite inconvenient for debugging purposes. > >> > >> This patch adds a pfn_valid() check into vmap() path, so that invalid > >> mapping will not be created. > >> > >> RFC: https://lkml.org/lkml/2022/1/18/815 > >> v1: use pfn_valid() instead of adding an arch-specific > >> arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. > > This should be after the '---' below.
> > >> > >> Signed-off-by: Yury Norov <yury.norov@gmail.com> > > > > Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > >> --- > >> mm/vmalloc.c | 2 ++ > >> 1 file changed, 2 insertions(+) > >> > >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c > >> index d2a00ad4e1dd..a4134ee56b10 100644 > >> --- a/mm/vmalloc.c > >> +++ b/mm/vmalloc.c > >> @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > >> return -EBUSY; > >> if (WARN_ON(!page)) > >> return -ENOMEM; > >> + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > >> + return -EINVAL; > >> set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); > >> (*nr)++; > >> } while (pte++, addr += PAGE_SIZE, addr != end); > >> -- > >> 2.30.2 > >> > > Why shouldn't this just scan over the entire user-provided struct page > array and make sure that all pages therein are valid via the above method, > but in vmap() itself before calling vmap_pages_range()? Because it seems > like a single invalid page detected in vmap_pages_pte_range() will > abort the entire vmap() anyway. This will also enable us to drop the > existing NULL check above. I can do this, but why is it any better than the current approach? Thanks, Yury
On Wed, Jan 19, 2022 at 04:27:32PM +0000, Matthew Wilcox wrote: > On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: > > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > > > Is it page_to_pfn() guaranteed to work without blowing up if page is invalid > > in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's > > true... > > Even if it does blow up, at least it's blowing up here where someone > can start to debug it, rather than blowing up on first access, where > we no longer have the invlid struct page pointer. > > I don't think we have a 'page_valid' function which will tell us whether > a random pointer is actually a struct page or not. Isn't it supposed to be: if (!pfn_valid(pfn)) { handle invalid pfn; } page = pfn_to_page(pfn); Anything else - even trying to convert an invalid page back to a pfn, could well be unreliable (sparsemem or discontigmem).
On Wed, Jan 19, 2022 at 05:54:15PM +0000, Russell King (Oracle) wrote: > On Wed, Jan 19, 2022 at 04:27:32PM +0000, Matthew Wilcox wrote: > > On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: > > > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > > > > > Is it page_to_pfn() guaranteed to work without blowing up if page is invalid > > > in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's > > > true... > > > > Even if it does blow up, at least it's blowing up here where someone > > can start to debug it, rather than blowing up on first access, where > > we no longer have the invlid struct page pointer. > > > > I don't think we have a 'page_valid' function which will tell us whether > > a random pointer is actually a struct page or not. > > Isn't it supposed to be: > > if (!pfn_valid(pfn)) { > handle invalid pfn; > } > > page = pfn_to_page(pfn); > > Anything else - even trying to convert an invalid page back to a pfn, > could well be unreliable (sparsemem or discontigmem). This function is passed an array of pages. We have no way of doing what you propose.
On Wed, Jan 19, 2022 at 09:00:23AM -0800, Yury Norov wrote: > On Wed, Jan 19, 2022 at 3:17 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > Hi, > > > > I replied to the original RFC before spotting this; duplicating those comments > > here because I think they apply regardless of the mechanism used to work around > > this. > > > > On Tue, Jan 18, 2022 at 03:52:44PM -0800, Yury Norov wrote: > > > vmap() takes struct page *pages as one of arguments, and user may provide > > > an invalid pointer which would lead to DABT at address translation later. > > > > > > Currently, kernel checks the pages against NULL. In my case, however, the > > > address was not NULL, and was big enough so that the hardware generated > > > Address Size Abort on arm64. > > > > Can you give an example of when this might happen? It sounds like you're > > actually hitting this, so a backtrace would be nice. > > > > I'm a bit confused as to why we'd try to vmap() pages that we > > didn't have a legitimate struct page for -- where did these addresses > > come from? > > > > It sounds like this is going wrong at a higher level, and we're passing > > entirely bogus struct page pointers around. This seems like the sort of > > thing DEBUG_VIRTUAL or similar should check when we initially generate > > the struct page pointer. > > Hi Mark, > > This is out-of-tree code that does: > > vaddr1 = dma_alloc_coherent() > page = virt_to_page() // Wrong here > ... > vaddr2 = vmap(page) > memset(vaddr2) // Fault here > > virt_to_page() returns a wrong pointer if vaddr1 is not a linear kernel address. > The problem is that vmap() populates pte with bad pfn successfully, and it's > much harder to debug at memory access time. Ah, ok. FWIW, that case should be caught by DEBUG_VIRTUAL when that's enabled. It would be nice to put that within the commit message, since that clearly indicates this is about catching a mistake in buggy code, and indicates where to go looking next. > > > Interestingly, this abort happens even if copy_from_kernel_nofault() is > > > used, which is quite inconvenient for debugging purposes. > > > > I can go take a look at this, but TBH we never expect to take an address size > > fault to begin with, so this is arguably correct -- it's an internal > > consistency problem. > > > > > This patch adds a pfn_valid() check into vmap() path, so that invalid > > > mapping will not be created. > > > > > > RFC: https://lkml.org/lkml/2022/1/18/815 > > > v1: use pfn_valid() instead of adding an arch-specific > > > arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. > > > > > > Signed-off-by: Yury Norov <yury.norov@gmail.com> > > > --- > > > mm/vmalloc.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > index d2a00ad4e1dd..a4134ee56b10 100644 > > > --- a/mm/vmalloc.c > > > +++ b/mm/vmalloc.c > > > @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > > > return -EBUSY; > > > if (WARN_ON(!page)) > > > return -ENOMEM; > > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > > + return -EINVAL; > > > > My fear here is that for this to fire, we've already passed a bogus struct page > > pointer around the intermediate infrastructure, and any of that might try to > > use it in unsafe ways (in future even if we don't use it today). > > > > I think the fundamental issue here is that we generate a bogus struct page > > pointer at all, and knowing where that came from would help to fix that. > > You're right.
> That's why WARN_ON() is used for the page == null in the code > above, I believe, - to let client code know that something goes wrong, and it's > not a regular ENOMEM situation. Sure; with all the above in mind I think it's fine for this to be best-effort (e.g. as Robin noted page_to_pfn() might consume parts of the struct page before this), since either way that will indicate roughly where the problem is. As above, it would just be nice for the commit message to be a little more explicit as to that. Thanks, Mark.
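For reference, the CONFIG_DEBUG_VIRTUAL check Mark refers to lives in arch/arm64/mm/physaddr.c; in kernels of this era it looks roughly like the sketch below (quoted from memory, so treat the exact details as approximate). virt_to_page() on a non-linear address funnels through __virt_to_phys() and trips the WARN:

phys_addr_t __virt_to_phys(unsigned long x)
{
	/* Warn if x is not a linear-map ("lm") address */
	WARN(!__is_lm_address(x),
	     "virt_to_phys used for non-linear address: %pK (%pS)\n",
	     (void *)x, (void *)x);

	return __virt_to_phys_nodebug(x);
}
EXPORT_SYMBOL(__virt_to_phys);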
On 2022-01-19 16:27, Matthew Wilcox wrote: > On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: >>> + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) >> >> Is it page_to_pfn() guaranteed to work without blowing up if page is invalid >> in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's >> true... > > Even if it does blow up, at least it's blowing up here where someone > can start to debug it, rather than blowing up on first access, where > we no longer have the invlid struct page pointer. But if that were the case then we'd blow up on the following line when mk_pte(page, prot) ends up calling it same anyway. Indeed it's arguably the best-case scenario since it would also blow up in page_address() if we hit the vmap_range loop rather than going down the vmap_small_pages_range_noflush() path. Furthermore, if you *are* lucky enough to take a fault upon accessing a bogus mapping, then surely a phys_to_page() calculation on whatever ended up in the PTE should get you back the original "pointer" anyway, shouldn't it? Sure it's a bit more work to extract the caller out of the VMA if necessary, but hey, that's debugging! Maybe vmap() failing means you then pass the offending nonsense to __free_pages() and that ruins your day anyway... The implications are infinitely worse if you've mapped something that did happen to be a valid page, but wasn't the *right* page, such that you don't fault but corrupt things or trigger a fatal system error instead. I'd echo Mark's point that if we can't trust a page pointer to be correct then we've already lost. In general the proportion of "wrong" pointers one can viably attempt to detect is so small that it's rarely ever worth trying, and the cases that are so obviously wrong tend to show up well enough in normal operation (although NULL-safety is of course a bit of a special case when it can simplify API usage). > I don't think we have a 'page_valid' function which will tell us whether > a random pointer is actually a struct page or not. Indeed, my impression is that the only legitimate way to get hold of a page pointer without assumed provenance is via pfn_to_page(), which is where pfn_valid() comes in. Thus pfn_valid(page_to_pfn()) really *should* be a tautology. I guess in this instance it would be technically feasible to implement a function which checks "is this a correctly-aligned pointer within memmap bounds", but IMO that would be a massive step in the wrong direction. We're developers; sometimes we introduce bugs when developing code, and it takes effort to debug them, but that still doesn't make it a good idea to optimise normal code paths for the expectation of writing new catastrophically-bad bugs. Plus logically if such a "page_valid()" check could be justified at all then that should rightfully lead to a churn-fest of adding it to pretty much every interface which accepts page pointers - one half of vmap() is hardly special. FWIW, If anything I reckon a DEBUG_VM option that made checks within page_to_x/x_to_page as appropriate would help Yury's issue just as much, while having the potential to be considerably more useful in general. Cheers, Robin.
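For illustration only, the kind of "is this plausibly a struct page pointer" helper Robin argues against might look like the hypothetical FLATMEM-only sketch below; no such helper exists in the kernel, and the point of the argument is that it should stay that way:

#include <linux/mm.h>

/* Hypothetical, FLATMEM only: memmap is the single contiguous mem_map array. */
static bool page_plausible(const struct page *page)
{
	unsigned long off = (unsigned long)page - (unsigned long)mem_map;

	/* Must be a properly aligned element of the memmap array... */
	if (off % sizeof(struct page))
		return false;

	/* ...lying within [mem_map, mem_map + max_mapnr). */
	return page >= mem_map && page < mem_map + max_mapnr;
}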
On Wed, Jan 19, 2022 at 06:01:24PM +0000, Matthew Wilcox wrote: > On Wed, Jan 19, 2022 at 05:54:15PM +0000, Russell King (Oracle) wrote: > > On Wed, Jan 19, 2022 at 04:27:32PM +0000, Matthew Wilcox wrote: > > > On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: > > > > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > > > > > > > Is it page_to_pfn() guaranteed to work without blowing up if page is invalid > > > > in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's > > > > true... > > > > > > Even if it does blow up, at least it's blowing up here where someone > > > can start to debug it, rather than blowing up on first access, where > > > we no longer have the invlid struct page pointer. > > > > > > I don't think we have a 'page_valid' function which will tell us whether > > > a random pointer is actually a struct page or not. > > > > Isn't it supposed to be: > > > > if (!pfn_valid(pfn)) { > > handle invalid pfn; > > } > > > > page = pfn_to_page(pfn); > > > > Anything else - even trying to convert an invalid page back to a pfn, > > could well be unreliable (sparsemem or discontigmem). > > This function is passed an array of pages. We have no way of doing > what you propose. You can't go from a struct page to "this is valid", it's too late by the time you call vmap() - that's my fundamental point. If the translation from a PFN to a struct page can return pointers to something that isn't a valid struct page, then it can also (with sparsemem) return a pointer to _another_ struct page that could well be valid depending on how the struct page arrays are laid out in memory. To repeat: once you have a struct page, it's too late to determine if the struct page is valid.
On Wed, Jan 19, 2022 at 06:43:10PM +0000, Robin Murphy wrote: > Indeed, my impression is that the only legitimate way to get hold of a page > pointer without assumed provenance is via pfn_to_page(), which is where > pfn_valid() comes in. Thus pfn_valid(page_to_pfn()) really *should* be a > tautology. That can only be true if pfn == page_to_pfn(pfn_to_page(pfn)) for all values of pfn. Given how pfn_to_page() is defined in the sparsemem case: #define __pfn_to_page(pfn) \ ({ unsigned long __pfn = (pfn); \ struct mem_section *__sec = __pfn_to_section(__pfn); \ __section_mem_map_addr(__sec) + __pfn; \ }) #define page_to_pfn __page_to_pfn that isn't the case, especially when looking at page_to_pfn(): #define __page_to_pfn(pg) \ ({ const struct page *__pg = (pg); \ int __sec = page_to_section(__pg); \ (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \ }) Where: static inline unsigned long page_to_section(const struct page *page) { return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK; } So if page_to_section() returns something that is, e.g. zero for an invalid page in a non-zero section, you're not going to end up with the right pfn from page_to_pfn(). As I've said now a couple of times, trying to determine of a struct page pointer is valid is the wrong question to be asking.
On Wed, Jan 19, 2022 at 06:57:34PM +0000, Russell King (Oracle) wrote: > On Wed, Jan 19, 2022 at 06:01:24PM +0000, Matthew Wilcox wrote: > > On Wed, Jan 19, 2022 at 05:54:15PM +0000, Russell King (Oracle) wrote: > > > On Wed, Jan 19, 2022 at 04:27:32PM +0000, Matthew Wilcox wrote: > > > > On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: > > > > > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > > > > > > > > > Is it page_to_pfn() guaranteed to work without blowing up if page is invalid > > > > > in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's > > > > > true... > > > > > > > > Even if it does blow up, at least it's blowing up here where someone > > > > can start to debug it, rather than blowing up on first access, where > > > > we no longer have the invlid struct page pointer. > > > > > > > > I don't think we have a 'page_valid' function which will tell us whether > > > > a random pointer is actually a struct page or not. > > > > > > Isn't it supposed to be: > > > > > > if (!pfn_valid(pfn)) { > > > handle invalid pfn; > > > } > > > > > > page = pfn_to_page(pfn); > > > > > > Anything else - even trying to convert an invalid page back to a pfn, > > > could well be unreliable (sparsemem or discontigmem). > > > > This function is passed an array of pages. We have no way of doing > > what you propose. > > You can't go from a struct page to "this is valid", it's too late by the > time you call vmap() - that's my fundamental point. Yes, and we have debugging code in __virt_to_phys() that would have caught this, had Yury enabled CONFIG_DEBUG_VIRTUAL. My point is that in this instance, page_to_pfn() doesn't crash, which lets vmap() set up a mapping to a completely bogus physical address. We're better off checking pfn_valid() here than not. > If the translation from a PFN to a struct page can return pointers to > something that isn't a valid struct page, then it can also (with > sparsemem) return a pointer to _another_ struct page that could well > be valid depending on how the struct page arrays are laid out in > memory. Sure, it's not going to catch everything. But I don't think that letting perfect be the enemy of the good here is the right approach.
On Wed, Jan 19, 2022 at 07:35:33PM +0000, Matthew Wilcox wrote: > On Wed, Jan 19, 2022 at 06:57:34PM +0000, Russell King (Oracle) wrote: > > On Wed, Jan 19, 2022 at 06:01:24PM +0000, Matthew Wilcox wrote: > > > On Wed, Jan 19, 2022 at 05:54:15PM +0000, Russell King (Oracle) wrote: > > > > On Wed, Jan 19, 2022 at 04:27:32PM +0000, Matthew Wilcox wrote: > > > > > On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: > > > > > > > + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) > > > > > > > > > > > > Is it page_to_pfn() guaranteed to work without blowing up if page is invalid > > > > > > in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's > > > > > > true... > > > > > > > > > > Even if it does blow up, at least it's blowing up here where someone > > > > > can start to debug it, rather than blowing up on first access, where > > > > > we no longer have the invlid struct page pointer. > > > > > > > > > > I don't think we have a 'page_valid' function which will tell us whether > > > > > a random pointer is actually a struct page or not. > > > > > > > > Isn't it supposed to be: > > > > > > > > if (!pfn_valid(pfn)) { > > > > handle invalid pfn; > > > > } > > > > > > > > page = pfn_to_page(pfn); > > > > > > > > Anything else - even trying to convert an invalid page back to a pfn, > > > > could well be unreliable (sparsemem or discontigmem). > > > > > > This function is passed an array of pages. We have no way of doing > > > what you propose. > > > > You can't go from a struct page to "this is valid", it's too late by the > > time you call vmap() - that's my fundamental point. > > Yes, and we have debugging code in __virt_to_phys() that would have > caught this, had Yury enabled CONFIG_DEBUG_VIRTUAL. My point is that > in this instance, page_to_pfn() doesn't crash, which lets vmap() set > up a mapping to a completely bogus physical address. We're better > off checking pfn_valid() here than not. I don't disagree that pfn_valid() will catch _some_, but it should, no, must not be a substitute for ensuring that the proper checks are done when creating e.g. an array of struct pages.
On 1/19/22 10:52 PM, Yury Norov wrote: >> Why shouldn't this just scan over the entire user-provided struct page >> array and make sure that all pages therein are valid via the above method, >> but in vmap() itself before calling vmap_pages_range()? Because it seems >> like a single invalid page detected in vmap_pages_pte_range() will >> abort the entire vmap() anyway. This will also enable us to drop the >> existing NULL check above. > > I can do this, but why is it any better than the current approach? Because it will just return on the first instance where the valid page check fails, saving us some CPU cycles and avoiding an incomplete mapping?
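As a sketch of the proposal above, vmap() would validate the whole array up front, before vmap_pages_range() is called, e.g. via a hypothetical helper like this (untested illustration, not a posted patch):

#include <linux/mm.h>
#include <linux/bug.h>

/* Hypothetical: reject the whole vmap() request on the first bad page. */
static bool vmap_pages_all_valid(struct page **pages, unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++)
		if (WARN_ON(!pages[i] || !pfn_valid(page_to_pfn(pages[i]))))
			return false;
	return true;
}

vmap() could then return NULL early when this fails, which would also allow dropping the per-PTE NULL check in vmap_pages_pte_range().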
On Thu, Jan 20, 2022 at 09:07:11AM +0530, Anshuman Khandual wrote: > > > On 1/19/22 10:52 PM, Yury Norov wrote: > >> Why shouldn't this just scan over the entire user-provided struct page > >> array and make sure that all pages therein are valid via the above method, > >> but in vmap() itself before calling vmap_pages_range()? Because it seems > >> like a single invalid page detected in vmap_pages_pte_range() will > >> abort the entire vmap() anyway. This will also enable us to drop the > >> existing NULL check above. > > > > I can do this, but why is it any better than the current approach? > > Because it will just return on the first instance where the valid page > check fails, saving us some CPU cycles and avoiding an incomplete mapping? The valid page check is never intended to fail! If you're worried about the efficiency of doing this, you have seriously confused your priorities.
On 2022-01-19 19:12, Russell King (Oracle) wrote: > On Wed, Jan 19, 2022 at 06:43:10PM +0000, Robin Murphy wrote: >> Indeed, my impression is that the only legitimate way to get hold of a page >> pointer without assumed provenance is via pfn_to_page(), which is where >> pfn_valid() comes in. Thus pfn_valid(page_to_pfn()) really *should* be a >> tautology. > > That can only be true if pfn == page_to_pfn(pfn_to_page(pfn)) for all > values of pfn. > > Given how pfn_to_page() is defined in the sparsemem case: > > #define __pfn_to_page(pfn) \ > ({ unsigned long __pfn = (pfn); \ > struct mem_section *__sec = __pfn_to_section(__pfn); \ > __section_mem_map_addr(__sec) + __pfn; \ > }) > #define page_to_pfn __page_to_pfn > > that isn't the case, especially when looking at page_to_pfn(): > > #define __page_to_pfn(pg) \ > ({ const struct page *__pg = (pg); \ > int __sec = page_to_section(__pg); \ > (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \ > }) > > Where: > > static inline unsigned long page_to_section(const struct page *page) > { > return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK; > } > > So if page_to_section() returns something that is, e.g. zero for an > invalid page in a non-zero section, you're not going to end up with > the right pfn from page_to_pfn(). Right, I emphasised "should" in an attempt to imply "in the absence of serious bugs that have further-reaching consequences anyway". > As I've said now a couple of times, trying to determine of a struct > page pointer is valid is the wrong question to be asking. And doing so in one single place, on the justification of avoiding an incredibly niche symptom, is even more so. Not to mention that an address size fault is one of the best possible outcomes anyway, vs. the untold damage that may stem from accesses actually going through to random parts of the physical memory map. Robin.
On Thu, Jan 20, 2022 at 12:22:35PM +0000, Robin Murphy wrote: > On 2022-01-19 19:12, Russell King (Oracle) wrote: > > On Wed, Jan 19, 2022 at 06:43:10PM +0000, Robin Murphy wrote: > > > Indeed, my impression is that the only legitimate way to get hold of a page > > > pointer without assumed provenance is via pfn_to_page(), which is where > > > pfn_valid() comes in. Thus pfn_valid(page_to_pfn()) really *should* be a > > > tautology. > > > > That can only be true if pfn == page_to_pfn(pfn_to_page(pfn)) for all > > values of pfn. > > > > Given how pfn_to_page() is defined in the sparsemem case: > > > > #define __pfn_to_page(pfn) \ > > ({ unsigned long __pfn = (pfn); \ > > struct mem_section *__sec = __pfn_to_section(__pfn); \ > > __section_mem_map_addr(__sec) + __pfn; \ > > }) > > #define page_to_pfn __page_to_pfn > > > > that isn't the case, especially when looking at page_to_pfn(): > > > > #define __page_to_pfn(pg) \ > > ({ const struct page *__pg = (pg); \ > > int __sec = page_to_section(__pg); \ > > (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \ > > }) > > > > Where: > > > > static inline unsigned long page_to_section(const struct page *page) > > { > > return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK; > > } > > > > So if page_to_section() returns something that is, e.g. zero for an > > invalid page in a non-zero section, you're not going to end up with > > the right pfn from page_to_pfn(). > > Right, I emphasised "should" in an attempt to imply "in the absence of > serious bugs that have further-reaching consequences anyway". > > > As I've said now a couple of times, trying to determine of a struct > > page pointer is valid is the wrong question to be asking. > > And doing so in one single place, on the justification of avoiding an > incredibly niche symptom, is even more so. Not to mention that an address > size fault is one of the best possible outcomes anyway, vs. the untold > damage that may stem from accesses actually going through to random parts of > the physical memory map. I don't see it as a "niche" symptom. If we start off with the struct page being invalid, then the result of page_to_pfn() can not be relied upon to produce something that is meaningful - which is exactly why the vmap() issue arises. With a pfn_valid() check, we at least know that the PFN points at memory. However, that memory could be _anything_ in the system - it could be the kernel image, and it could give userspace access to change kernel code. So, while it is useful to do a pfn_valid() check in vmap(), as I said to willy, this must _not_ be the primary check. It should IMHO use WARN_ON() to make it blatantly obvious that it should be something we expect _not_ to trigger under normal circumstances, but is there to catch programming errors elsewhere.
On 2022-01-20 13:03, Russell King (Oracle) wrote: > On Thu, Jan 20, 2022 at 12:22:35PM +0000, Robin Murphy wrote: >> On 2022-01-19 19:12, Russell King (Oracle) wrote: >>> On Wed, Jan 19, 2022 at 06:43:10PM +0000, Robin Murphy wrote: >>>> Indeed, my impression is that the only legitimate way to get hold of a page >>>> pointer without assumed provenance is via pfn_to_page(), which is where >>>> pfn_valid() comes in. Thus pfn_valid(page_to_pfn()) really *should* be a >>>> tautology. >>> >>> That can only be true if pfn == page_to_pfn(pfn_to_page(pfn)) for all >>> values of pfn. >>> >>> Given how pfn_to_page() is defined in the sparsemem case: >>> >>> #define __pfn_to_page(pfn) \ >>> ({ unsigned long __pfn = (pfn); \ >>> struct mem_section *__sec = __pfn_to_section(__pfn); \ >>> __section_mem_map_addr(__sec) + __pfn; \ >>> }) >>> #define page_to_pfn __page_to_pfn >>> >>> that isn't the case, especially when looking at page_to_pfn(): >>> >>> #define __page_to_pfn(pg) \ >>> ({ const struct page *__pg = (pg); \ >>> int __sec = page_to_section(__pg); \ >>> (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \ >>> }) >>> >>> Where: >>> >>> static inline unsigned long page_to_section(const struct page *page) >>> { >>> return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK; >>> } >>> >>> So if page_to_section() returns something that is, e.g. zero for an >>> invalid page in a non-zero section, you're not going to end up with >>> the right pfn from page_to_pfn(). >> >> Right, I emphasised "should" in an attempt to imply "in the absence of >> serious bugs that have further-reaching consequences anyway". >> >>> As I've said now a couple of times, trying to determine of a struct >>> page pointer is valid is the wrong question to be asking. >> >> And doing so in one single place, on the justification of avoiding an >> incredibly niche symptom, is even more so. Not to mention that an address >> size fault is one of the best possible outcomes anyway, vs. the untold >> damage that may stem from accesses actually going through to random parts of >> the physical memory map. > > I don't see it as a "niche" symptom. The commit message specifically cites a Data Abort "at address translation later". Broadly speaking, a Data Abort due to an address size fault only occurs if you've been lucky enough that the bogus PA which got mapped is so spectacularly wrong that it's beyond the range configured in TCR.IPS. How many other architectures even have a mechanism like that? > If we start off with the struct page being invalid, then the result of > page_to_pfn() can not be relied upon to produce something that is > meaningful - which is exactly why the vmap() issue arises. > > With a pfn_valid() check, we at least know that the PFN points at > memory. No, we know it points to some PA space which has a struct page to represent it. pfn_valid() only says that pfn_to_page() will yield a valid result. That also includes things like reserved pages covering non-RAM areas, where a kernel VA mapping existing at all could potentially be fatal to the system even if it's never explicitly accessed - for all we know it might be a carveout belonging to overly-aggressive Secure software such that even a speculative prefetch might trigger an instant system reset. > However, that memory could be _anything_ in the system - it > could be the kernel image, and it could give userspace access to > change kernel code.
> > So, while it is useful to do a pfn_valid() check in vmap(), as I said > to willy, this must _not_ be the primary check. It should IMHO use > WARN_ON() to make it blatantly obvious that it should be something we > expect _not_ to trigger under normal circumstances, but is there to > catch programming errors elsewhere. Rather, "to partially catch unrelated programming errors elsewhere, provided the buggy code happens to call vmap() rather than any of the many other functions with a struct page * argument." That's where it stretches my definition of "useful" just a bit too far. It's not about perfect being the enemy of good, it's about why vmap() should be special, and death by a thousand "useful" cuts - if we don't trust the pointer, why not check its alignment for basic plausibility first? If it seems valid, why not check if the page flags look sensible to make sure? How many useful little checks is too many? Every bit of code footprint and execution overhead imposed unconditionally on all end users to theoretically save developers' debugging time still adds up. Although on that note, it looks like arch/arm's pfn_valid() is still a linear scan of the memblock array, so the overhead of adding that for every page in every vmap() might not even be so small... Robin.
On Thu, Jan 20, 2022 at 04:37:01PM +0000, Robin Murphy wrote: > On 2022-01-20 13:03, Russell King (Oracle) wrote: > > On Thu, Jan 20, 2022 at 12:22:35PM +0000, Robin Murphy wrote: > > > On 2022-01-19 19:12, Russell King (Oracle) wrote: > > > > On Wed, Jan 19, 2022 at 06:43:10PM +0000, Robin Murphy wrote: > > > > > Indeed, my impression is that the only legitimate way to get hold of a page > > > > > pointer without assumed provenance is via pfn_to_page(), which is where > > > > > pfn_valid() comes in. Thus pfn_valid(page_to_pfn()) really *should* be a > > > > > tautology. > > > > > > > > That can only be true if pfn == page_to_pfn(pfn_to_page(pfn)) for all > > > > values of pfn. > > > > > > > > Given how pfn_to_page() is defined in the sparsemem case: > > > > > > > > #define __pfn_to_page(pfn) \ > > > > ({ unsigned long __pfn = (pfn); \ > > > > struct mem_section *__sec = __pfn_to_section(__pfn); \ > > > > __section_mem_map_addr(__sec) + __pfn; \ > > > > }) > > > > #define page_to_pfn __page_to_pfn > > > > > > > > that isn't the case, especially when looking at page_to_pfn(): > > > > > > > > #define __page_to_pfn(pg) \ > > > > ({ const struct page *__pg = (pg); \ > > > > int __sec = page_to_section(__pg); \ > > > > (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \ > > > > }) > > > > > > > > Where: > > > > > > > > static inline unsigned long page_to_section(const struct page *page) > > > > { > > > > return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK; > > > > } > > > > > > > > So if page_to_section() returns something that is, e.g. zero for an > > > > invalid page in a non-zero section, you're not going to end up with > > > > the right pfn from page_to_pfn(). > > > > > > Right, I emphasised "should" in an attempt to imply "in the absence of > > > serious bugs that have further-reaching consequences anyway". > > > > > > > As I've said now a couple of times, trying to determine of a struct > > > > page pointer is valid is the wrong question to be asking. > > > > > > And doing so in one single place, on the justification of avoiding an > > > incredibly niche symptom, is even more so. Not to mention that an address > > > size fault is one of the best possible outcomes anyway, vs. the untold > > > damage that may stem from accesses actually going through to random parts of > > > the physical memory map. > > > > I don't see it as a "niche" symptom. > > The commit message specifically cites a Data Abort "at address translation > later". Broadly speaking, a Data Abort due to an address size fault only > occurs if you've been lucky enough that the bogus PA which got mapped is so > spectacularly wrong that it's beyond the range configured in TCR.IPS. How > many other architectures even have a mechanism like that? I think we're misinterpreting each other. > > If we start off with the struct page being invalid, then the result of > > page_to_pfn() can not be relied upon to produce something that is > > meaningful - which is exactly why the vmap() issue arises. > > > > With a pfn_valid() check, we at least know that the PFN points at > > memory. > > No, we know it points to some PA space which has a struct page to represent > it. pfn_valid() only says that pfn_to_page() will yield a valid result.
> That also includes things like reserved pages covering non-RAM areas, where a > kernel VA mapping existing at all could potentially be fatal to the system > even if it's never explicitly accessed - for all we know it might be a > carveout belonging to overly-aggressive Secure software such that even a > speculative prefetch might trigger an instant system reset. So are you saying that the "address size fault" can happen because we've mapped something for which pfn_valid() returns true? > > However, that memory could be _anything_ in the system - it > > could be the kernel image, and it could give userspace access to > > change kernel code. > > > > So, while it is useful to do a pfn_valid() check in vmap(), as I said > > to willy, this must _not_ be the primary check. It should IMHO use > > WARN_ON() to make it blatantly obvious that it should be something we > > expect _not_ to trigger under normal circumstances, but is there to > > catch programming errors elsewhere. > > Rather, "to partially catch unrelated programming errors elsewhere, provided > the buggy code happens to call vmap() rather than any of the many other > functions with a struct page * argument." That's where it stretches my > definition of "useful" just a bit too far. It's not about perfect being the > enemy of good, it's about why vmap() should be special, and death by a > thousand "useful" cuts - if we don't trust the pointer, why not check its > alignment for basic plausibility first? If it seems valid, why not check if > the page flags look sensible to make sure? How many useful little checks is > too many? Every bit of code footprint and execution overhead imposed > unconditionally on all end users to theoretically save developers' debugging > time still adds up. Although on that note, it looks like arch/arm's > pfn_valid() is still a linear scan of the memblock array, so the overhead of > adding that for every page in every vmap() might not even be so small... Well, I think I've adequately explained why I believe: pfn_valid(page_to_pfn(page)) being used as the primary check is substandard, and will likely lead to a future CVE. When generating an array of struct page's, I believe that it is the responsibility of the generator to ensure that the array only contains valid pages.
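To illustrate the discipline Russell is describing, here is a hedged sketch of generator-side validation - each pfn is checked before it is ever turned into a struct page (the helper name is made up):

#include <linux/mm.h>

/* Hypothetical example: build a page array from pfns, validating first. */
static int pfns_to_pages(const unsigned long *pfns, struct page **pages,
			 unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		if (!pfn_valid(pfns[i]))
			return -EINVAL;	/* reject before pfn_to_page() */
		pages[i] = pfn_to_page(pfns[i]);
	}
	return 0;
}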
On Thu, Jan 20, 2022 at 04:54:03PM +0000, Russell King (Oracle) wrote: > Well, I think I've adequately explained why I believe: > > pfn_valid(page_to_pfn(page)) > > being used as the primary check is substandard, and will likely lead to > a future CVE. When generating an array of struct page's, I believe that > it is the responsibility of the generator to ensure that the array > only contains valid pages. So you're saying that virt_to_page() should be made more expensive?
On Wed, Jan 19, 2022 at 7:37 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 1/19/22 10:52 PM, Yury Norov wrote: > >> Why shouldn't this just scan over the entire user-provided struct page > >> array and make sure that all pages therein are valid via the above method, > >> but in vmap() itself before calling vmap_pages_range()? Because it seems > >> like a single invalid page detected in vmap_pages_pte_range() will > >> abort the entire vmap() anyway. This will also enable us to drop the > >> existing NULL check above. > > > > I can do this, but why is it any better than the current approach? > > Because it will just return on the first instance where the valid page > check fails, saving us some CPU cycles and avoiding an incomplete mapping? This should normally never happen; that's why WARN_ON() is there. If it happens, there is a serious problem, and the code must be fixed. So, no CPU cycle savings in real life.
On Thu, Jan 20, 2022 at 8:37 AM Robin Murphy <robin.murphy@arm.com> wrote: > > On 2022-01-20 13:03, Russell King (Oracle) wrote: > > On Thu, Jan 20, 2022 at 12:22:35PM +0000, Robin Murphy wrote: > >> On 2022-01-19 19:12, Russell King (Oracle) wrote: > >>> On Wed, Jan 19, 2022 at 06:43:10PM +0000, Robin Murphy wrote: > >>>> Indeed, my impression is that the only legitimate way to get hold of a page > >>>> pointer without assumed provenance is via pfn_to_page(), which is where > >>>> pfn_valid() comes in. Thus pfn_valid(page_to_pfn()) really *should* be a > >>>> tautology. > >>> > >>> That can only be true if pfn == page_to_pfn(pfn_to_page(pfn)) for all > >>> values of pfn. > >>> > >>> Given how pfn_to_page() is defined in the sparsemem case: > >>> > >>> #define __pfn_to_page(pfn) \ > >>> ({ unsigned long __pfn = (pfn); \ > >>> struct mem_section *__sec = __pfn_to_section(__pfn); \ > >>> __section_mem_map_addr(__sec) + __pfn; \ > >>> }) > >>> #define page_to_pfn __page_to_pfn > >>> > >>> that isn't the case, especially when looking at page_to_pfn(): > >>> > >>> #define __page_to_pfn(pg) \ > >>> ({ const struct page *__pg = (pg); \ > >>> int __sec = page_to_section(__pg); \ > >>> (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \ > >>> }) > >>> > >>> Where: > >>> > >>> static inline unsigned long page_to_section(const struct page *page) > >>> { > >>> return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK; > >>> } > >>> > >>> So if page_to_section() returns something that is, e.g. zero for an > >>> invalid page in a non-zero section, you're not going to end up with > >>> the right pfn from page_to_pfn(). > >> > >> Right, I emphasised "should" in an attempt to imply "in the absence of > >> serious bugs that have further-reaching consequences anyway". > >> > >>> As I've said now a couple of times, trying to determine of a struct > >>> page pointer is valid is the wrong question to be asking. > >> > >> And doing so in one single place, on the justification of avoiding an > >> incredibly niche symptom, is even more so. Not to mention that an address > >> size fault is one of the best possible outcomes anyway, vs. the untold > >> damage that may stem from accesses actually going through to random parts of > >> the physical memory map. It's not a single place. Many exported functions check arguments one way or another. __vunmap() in vfree() path, for example, checks for address alignment, which looks quite similar to me. And later it even does BUG_ON(!page). > > I don't see it as a "niche" symptom. > > The commit message specifically cites a Data Abort "at address > translation later". Broadly speaking, a Data Abort due to an address > size fault only occurs if you've been lucky enough that the bogus PA > which got mapped is so spectacularly wrong that it's beyond the range > configured in TCR.IPS. How many other architectures even have a > mechanism like that? > > > If we start off with the struct page being invalid, then the result of > > page_to_pfn() can not be relied upon to produce something that is > > meaningful - which is exactly why the vmap() issue arises. > > > > With a pfn_valid() check, we at least know that the PFN points at > > memory. > > No, we know it points to some PA space which has a struct page to > represent it. pfn_valid() only says that pfn_to_page() will yield a > valid result.
> That also includes things like reserved pages covering > non-RAM areas, where a kernel VA mapping existing at all could > potentially be fatal to the system even if it's never explicitly > accessed - for all we know it might be a carveout belonging to > overly-aggressive Secure software such that even a speculative prefetch > might trigger an instant system reset. > > > However, that memory could be _anything_ in the system - it > > could be the kernel image, and it could give userspace access to > > change kernel code. > > > > So, while it is useful to do a pfn_valid() check in vmap(), as I said > > to willy, this must _not_ be the primary check. It should IMHO use > > WARN_ON() to make it blatantly obvious that it should be something we > > expect _not_ to trigger under normal circumstances, but is there to > > catch programming errors elsewhere. It actually uses WARN_ON(). > Rather, "to partially catch unrelated programming errors elsewhere, > provided the buggy code happens to call vmap() rather than any of the > many other functions with a struct page * argument." That's where it > stretches my definition of "useful" just a bit too far. It's not about > perfect being the enemy of good, it's about why vmap() should be > special, and death by a thousand "useful" cuts - if we don't trust the > pointer, why not check its alignment for basic plausibility first? Because in that particular case pfn_valid() is enough. If someone else has a real case where IS_ALIGNED() would help - I will be all for adding that check in vmap(). > If it > seems valid, why not check if the page flags look sensible to make sure? > How many useful little checks is too many? I'd put in the 'too many' group those that test for something that has never happened to people. > Every bit of code footprint > and execution overhead imposed unconditionally on all end users to > theoretically save developers' debugging time still adds up. Not theoretically - practically! End users will value kernel stability even when buggy drivers are installed. They will also value developers who fix bugs quickly. It has been noticed that DEBUG_VIRTUAL could catch this bug. But sometimes stopping the production hardware, building a custom kernel with many debug options in the hope that one of them will help, and running a suspicious driver for hours can itself take more than a day. Thanks, Yury > Although on > that note, it looks like arch/arm's pfn_valid() is still a linear scan > of the memblock array, so the overhead of adding that for every page in > every vmap() might not even be so small... > > Robin.
On Wed, Jan 19, 2022 at 01:28:14PM +0000, Robin Murphy wrote: > Is it page_to_pfn() guaranteed to work without blowing up if page is invalid > in the first place? Looking at the CONFIG_SPARSEMEM case I'm not sure that's > true... Something that all the ARM people weighing in on this don't understand is that basically nobody uses SPARSEMEM without SPARSEMEM_VMEMMAP. So all this complicated code to do page_to_pfn() is never tested. Real users all do a simple subtraction and so the simple pfn_valid() works fine.
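For reference, with SPARSEMEM_VMEMMAP the conversions collapse to simple pointer arithmetic over a virtually contiguous memmap (include/asm-generic/memory_model.h):

/* memmap is virtually contiguous. */
#define __pfn_to_page(pfn)	(vmemmap + (pfn))
#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)

On such configurations page_to_pfn() is a plain subtraction that cannot fault even for a bogus pointer, so the pfn_valid() check in the patch is cheap and safe to reach.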
vmap() takes struct page *pages as one of its arguments, and a user may provide an invalid pointer which would lead to a DABT (data abort) at address translation later. Currently, the kernel checks the pages against NULL. In my case, however, the address was not NULL, and was big enough that the hardware generated an Address Size Abort on arm64. Interestingly, this abort happens even if copy_from_kernel_nofault() is used, which is quite inconvenient for debugging purposes. This patch adds a pfn_valid() check into the vmap() path, so that an invalid mapping will not be created. RFC: https://lkml.org/lkml/2022/1/18/815 v1: use pfn_valid() instead of adding an arch-specific arch_vmap_page_valid(). Thanks to Matthew Wilcox for the hint. Signed-off-by: Yury Norov <yury.norov@gmail.com> --- mm/vmalloc.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c index d2a00ad4e1dd..a4134ee56b10 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -477,6 +477,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, return -EBUSY; if (WARN_ON(!page)) return -ENOMEM; + if (WARN_ON(!pfn_valid(page_to_pfn(page)))) + return -EINVAL; set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); (*nr)++; } while (pte++, addr += PAGE_SIZE, addr != end);