
[RFC] Randomization of address chosen by mmap.

Message ID 20180304034704.GB20725@bombadil.infradead.org (mailing list archive)
State New, archived

Commit Message

Matthew Wilcox (Oracle) March 4, 2018, 3:47 a.m. UTC
On Sat, Mar 03, 2018 at 04:00:45PM -0500, Daniel Micay wrote:
> The main thing I'd like to see is just the option to get a guarantee
> of enforced gaps around mappings, without necessarily even having
> randomization of the gap size. It's possible to add guard pages in
> userspace but it adds overhead by doubling the number of system calls
> to map memory (mmap PROT_NONE region, mprotect the inner portion to
> PROT_READ|PROT_WRITE) and *everything* using mmap would need to
> cooperate which is unrealistic.

So something like this?

To use it, OR in PROT_GUARD(n) to the PROT flags of mmap, and it should
pad the map by n pages.  I haven't tested it, so I'm sure it's buggy,
but it seems like a fairly cheap way to give us padding after every
mapping.

Running it on an old kernel will result in no padding, so to see if it
worked or not, try mapping something immediately after it.
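Roughly, usage might look like this (an untested sketch; the hint-address probe at the
end is just one way to see whether the padding took effect, since on an old kernel the
second mapping may land right next to the first):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef PROT_GUARD
#define PROT_GUARD(x)	(((x) & 0xffff) << 4)	/* as defined in the patch below */
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* One page of data plus 16 pages of trailing guard. */
	char *p = mmap(NULL, page, PROT_READ | PROT_WRITE | PROT_GUARD(16),
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Probe with a hint just past the mapping; if the guard pages were
	 * reserved, the kernel should not hand this address back. */
	void *q = mmap(p + page, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	printf("mapping at %p, probe landed at %p\n", (void *)p, q);
	return 0;
}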

Comments

Matthew Wilcox (Oracle) March 4, 2018, 8:56 p.m. UTC | #1
On Sat, Mar 03, 2018 at 07:47:04PM -0800, Matthew Wilcox wrote:
> On Sat, Mar 03, 2018 at 04:00:45PM -0500, Daniel Micay wrote:
> > The main thing I'd like to see is just the option to get a guarantee
> > of enforced gaps around mappings, without necessarily even having
> > randomization of the gap size. It's possible to add guard pages in
> > userspace but it adds overhead by doubling the number of system calls
> > to map memory (mmap PROT_NONE region, mprotect the inner portion to
> > PROT_READ|PROT_WRITE) and *everything* using mmap would need to
> > cooperate which is unrealistic.
> 
> So something like this?
> 
> To use it, OR in PROT_GUARD(n) to the PROT flags of mmap, and it should
> pad the map by n pages.  I haven't tested it, so I'm sure it's buggy,
> but it seems like a fairly cheap way to give us padding after every
> mapping.
> 
> Running it on an old kernel will result in no padding, so to see if it
> worked or not, try mapping something immediately after it.

Thinking about this more ...

 - When you call munmap, if you pass in the same (addr, length) that were
   used for mmap, then it should unmap the guard pages as well (that
   wasn't part of the patch, so it would have to be added)
 - If 'addr' is higher than the mapped address, and length at least
   reaches the end of the mapping, then I would expect the guard pages to
   "move down" and be after the end of the newly-shortened mapping.
 - If 'addr' is higher than the mapped address, and the length doesn't
   reach the end of the old mapping, we split the old mapping into two.
   I would expect the guard pages to apply to both mappings, insofar as
   they'll fit.  For an example, suppose we have a five-page mapping with
   two guard pages (MMMMMGG), and then we unmap the fourth page.  Now we
   have a three-page mapping with one guard page followed immediately
   by a one-page mapping with two guard pages (MMMGMGG).

I would say that mremap cannot change the number of guard pages.
Although I'm a little tempted to add an mremap flag to permit the mapping
to expand into the guard pages.  That would give us a nice way to reserve
address space for a mapping we think is going to expand.
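As a purely illustrative sketch of those proposed munmap semantics (this is what the
rules above would give, not what the patch below currently implements):

#include <sys/mman.h>
#include <unistd.h>

#ifndef PROT_GUARD
#define PROT_GUARD(x)	(((x) & 0xffff) << 4)
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* Five-page mapping with two trailing guard pages: MMMMMGG */
	char *m = mmap(NULL, 5 * page,
		       PROT_READ | PROT_WRITE | PROT_GUARD(2),
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED)
		return 1;

	/* Unmap only the fourth page.  Under the rules above the VMA would
	 * split into a three-page mapping with one guard page followed by a
	 * one-page mapping with two guard pages: MMMGMGG */
	munmap(m + 3 * page, page);

	/* Unmapping with the original (addr, length) would also drop the
	 * trailing guard pages (not part of the patch below yet). */
	munmap(m, 5 * page);
	return 0;
}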
blackzert@gmail.com March 5, 2018, 1:09 p.m. UTC | #2
> On 4 Mar 2018, at 23:56, Matthew Wilcox <willy@infradead.org> wrote:
> Thinking about this more ...
> 
> - When you call munmap, if you pass in the same (addr, length) that were
>   used for mmap, then it should unmap the guard pages as well (that
>   wasn't part of the patch, so it would have to be added)
> - If 'addr' is higher than the mapped address, and length at least
>   reaches the end of the mapping, then I would expect the guard pages to
>   "move down" and be after the end of the newly-shortened mapping.
> - If 'addr' is higher than the mapped address, and the length doesn't
>   reach the end of the old mapping, we split the old mapping into two.
>   I would expect the guard pages to apply to both mappings, insofar as
>   they'll fit.  For an example, suppose we have a five-page mapping with
>   two guard pages (MMMMMGG), and then we unmap the fourth page.  Now we
>   have a three-page mapping with one guard page followed immediately
>   by a one-page mapping with two guard pages (MMMGMGG).

I’m analysing that approach and I see many more problems:
- each time you call mmap like this, you still increase the count of VMAs, as my
patch did
- now the vma_merge feature shouldn’t work at all, unless MAP_FIXED or
PROT_GUARD(0) is used
- the entropy you provide is only about 16 bits, which is really not that hard to brute-force
- in your patch you don’t use vm_guard during the address search; I see many
potential sources of bugs here
- if you unmap/remap one page inside a region, the vm_guard field will describe the head
or the tail pages of the VMA, not both; the kernel doesn’t know how to handle that
- user mode now chooses the entropy with the PROT_GUARD macro; where does it get it?
User mode shouldn’t be responsible for entropy at all

I can’t understand what direction this conversation is going in. I was talking
about a weak implementation in the Linux kernel, but got many comments saying that ASLR
should be implemented in user mode, which is really weird to me.

I think it is possible to add guard pages to my implementation, but the initial
problem was about the entropy of the address choice. I would like to resolve it step by
step.

Thanks,
Ilya
Daniel Micay March 5, 2018, 2:23 p.m. UTC | #3
On 5 March 2018 at 08:09, Ilya Smith <blackzert@gmail.com> wrote:
>
>> On 4 Mar 2018, at 23:56, Matthew Wilcox <willy@infradead.org> wrote:
>> Thinking about this more ...
>>
>> - When you call munmap, if you pass in the same (addr, length) that were
>>   used for mmap, then it should unmap the guard pages as well (that
>>   wasn't part of the patch, so it would have to be added)
>> - If 'addr' is higher than the mapped address, and length at least
>>   reaches the end of the mapping, then I would expect the guard pages to
>>   "move down" and be after the end of the newly-shortened mapping.
>> - If 'addr' is higher than the mapped address, and the length doesn't
>>   reach the end of the old mapping, we split the old mapping into two.
>>   I would expect the guard pages to apply to both mappings, insofar as
>>   they'll fit.  For an example, suppose we have a five-page mapping with
>>   two guard pages (MMMMMGG), and then we unmap the fourth page.  Now we
>>   have a three-page mapping with one guard page followed immediately
>>   by a one-page mapping with two guard pages (MMMGMGG).
>
> I’m analysing that approach and I see many more problems:
> - each time you call mmap like this, you still increase the count of VMAs, as my
> patch did
> - now the vma_merge feature shouldn’t work at all, unless MAP_FIXED or
> PROT_GUARD(0) is used
> - the entropy you provide is only about 16 bits, which is really not that hard to brute-force
> - in your patch you don’t use vm_guard during the address search; I see many
> potential sources of bugs here
> - if you unmap/remap one page inside a region, the vm_guard field will describe the head
> or the tail pages of the VMA, not both; the kernel doesn’t know how to handle that
> - user mode now chooses the entropy with the PROT_GUARD macro; where does it get it?
> User mode shouldn’t be responsible for entropy at all

I didn't suggest this as the way of implementing fine-grained
randomization but rather a small starting point for hardening address
space layout further. I don't think it should be tied to a mmap flag
but rather something like a personality flag or a global sysctl. It
doesn't need to be random at all to be valuable, and it's just a first
step. It doesn't mean there can't be switches between random pivots
like OpenBSD mmap, etc. I'm not so sure that randomly switching around
is going to result in isolating things very well though.

The VMA count issue is at least something very predictable with a
performance cost only for kernel operations.

> I can’t understand what direction this conversation is going in. I was talking
> about a weak implementation in the Linux kernel, but got many comments saying that ASLR
> should be implemented in user mode, which is really weird to me.

That's not what I said. I was saying that splitting things into
regions based on the type of allocation works really well and allows
for high entropy bases, but that the kernel can't really do that right
now. It could split up code that starts as PROT_EXEC into a region but
that's generally not how libraries are mapped in so it won't know
until mprotect which is obviously too late. Unless it had some kind of
type key passed from userspace, it can't really do that.

> I think it is possible to add guard pages to my implementation, but the initial
> problem was about the entropy of the address choice. I would like to resolve it step by
> step.

Starting with fairly aggressive fragmentation of the address space is
going to be a really hard sell. The costs of a very spread out address
space in terms of TLB misses, etc. are unclear. Starting with enforced
gaps (1 page) and randomization for those wouldn't rule out having
finer-grained randomization, like randomly switching between different
regions. This needs to be cheap enough that people want to enable it,
and the goals need to be clearly spelled out. The goal needs to be
clearer than "more randomization == good" and then accepting a high
performance cost for that.

I'm not dictating how things should be done, I don't have any say
about that. I'm just trying to discuss it.
blackzert@gmail.com March 5, 2018, 4:05 p.m. UTC | #4
> On 5 Mar 2018, at 17:23, Daniel Micay <danielmicay@gmail.com> wrote:
> I didn't suggest this as the way of implementing fine-grained
> randomization but rather a small starting point for hardening address
> space layout further. I don't think it should be tied to a mmap flag
> but rather something like a personality flag or a global sysctl. It
> doesn't need to be random at all to be valuable, and it's just a first
> step. It doesn't mean there can't be switches between random pivots
> like OpenBSD mmap, etc. I'm not so sure that randomly switching around
> is going to result in isolating things very well though.
> 

Here I like the idea of Kees Cook:
> I think this will need a larger knob -- doing this by default is
> likely to break stuff, I'd imagine? Bikeshedding: I'm not sure if this
> should be setting "3" for /proc/sys/kernel/randomize_va_space, or a
> separate one like /proc/sys/mm/randomize_mmap_allocation.
I mean there should be a way to turn randomization off, since some applications
really do need huge amounts of memory.
If you have suggestions here, it would be really helpful to discuss them.
I think one switch could be global, for the system administrator, something like
/proc/sys/mm/randomize_mmap_allocation, and it would also be good to have
some ioctl to switch it off in case the application knows what it is doing.

I would like to implement it in v2 of the patch.

>> I can’t understand what direction this conversation is going in. I was talking
>> about a weak implementation in the Linux kernel, but got many comments saying that ASLR
>> should be implemented in user mode, which is really weird to me.
> 
> That's not what I said. I was saying that splitting things into
> regions based on the type of allocation works really well and allows
> for high entropy bases, but that the kernel can't really do that right
> now. It could split up code that starts as PROT_EXEC into a region but
> that's generally not how libraries are mapped in so it won't know
> until mprotect which is obviously too late. Unless it had some kind of
> type key passed from userspace, it can't really do that.

Yes, that’s really true; I wrote about it earlier. This is the issue: the kernel can’t
provide such an interface, which is why I try to get the maximum out of the current mmap
design. Maybe later we could split mmap into different actions for the different types of
memory it handles. But that will be a very long road, I think.

>> I think it is possible to add guard pages to my implementation, but the initial
>> problem was about the entropy of the address choice. I would like to resolve it step by
>> step.
> 
> Starting with fairly aggressive fragmentation of the address space is
> going to be a really hard sell. The costs of a very spread out address
> space in terms of TLB misses, etc. are unclear. Starting with enforced
> gaps (1 page) and randomization for those wouldn't rule out having
> finer-grained randomization, like randomly switching between different
> regions. This needs to be cheap enough that people want to enable it,
> and the goals need to be clearly spelled out. The goal needs to be
> clearer than "more randomization == good" and then accepting a high
> performance cost for that.
> 

I want to clarify: as far as I know, the TLB doesn’t care about the distance between
pages, since it works on pages, so in theory TLB misses are not an issue here. I
agree that I need to show the performance costs here; I will, just give me some time
please.

The enforced gaps, in my case:
+	addr = get_random_long() % ((high - low) >> PAGE_SHIFT);
+	addr = low + (addr << PAGE_SHIFT);
but as you are saying, the entropy here should be decreased.

How about something like this:
+	addr = get_random_long() % min(((high - low) >> PAGE_SHIFT), MAX_SECURE_GAP);
+	addr = high - (addr << PAGE_SHIFT);
where MAX_SECURE_GAP is configurable, probably with a sysctl.
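As a sketch only (not compile-tested; MAX_SECURE_GAP is a made-up name for that
sysctl-backed limit, low/high are the bounds of the free gap the allocator found,
and get_random_long()/min_t() are the usual kernel helpers):

/*
 * Sketch: cap the random offset below 'high' at MAX_SECURE_GAP pages.
 */
static unsigned long randomize_gap_addr(unsigned long low, unsigned long high)
{
	unsigned long span = (high - low) >> PAGE_SHIFT;
	unsigned long pages;

	if (!span)
		return low;

	pages = get_random_long() % min_t(unsigned long, span, MAX_SECURE_GAP);
	return high - (pages << PAGE_SHIFT);
}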

How do you like it?

> I'm not dictating how things should be done, I don't have any say
> about that. I'm just trying to discuss it.

Sorry, and thanks for your involvement. I really appreciate it.

Thanks,
Ilya
Matthew Wilcox (Oracle) March 5, 2018, 4:23 p.m. UTC | #5
On Mon, Mar 05, 2018 at 04:09:31PM +0300, Ilya Smith wrote:
> > On 4 Mar 2018, at 23:56, Matthew Wilcox <willy@infradead.org> wrote:
> > Thinking about this more ...
> > 
> > - When you call munmap, if you pass in the same (addr, length) that were
> >   used for mmap, then it should unmap the guard pages as well (that
> >   wasn't part of the patch, so it would have to be added)
> > - If 'addr' is higher than the mapped address, and length at least
> >   reaches the end of the mapping, then I would expect the guard pages to
> >   "move down" and be after the end of the newly-shortened mapping.
> > - If 'addr' is higher than the mapped address, and the length doesn't
> >   reach the end of the old mapping, we split the old mapping into two.
> >   I would expect the guard pages to apply to both mappings, insofar as
> >   they'll fit.  For an example, suppose we have a five-page mapping with
> >   two guard pages (MMMMMGG), and then we unmap the fourth page.  Now we
> >   have a three-page mapping with one guard page followed immediately
> >   by a one-page mapping with two guard pages (MMMGMGG).
> 
> I’m analysing that approach and I see many more problems:
> - each time you call mmap like this, you still increase the count of VMAs, as my
> patch did

Umm ... yes, each time you call mmap, you get a VMA.  I'm not sure why
that's a problem with my patch.  I was trying to solve the problem Daniel
pointed out, that mapping a guard region after each mmap cost twice as
many VMAs, and it solves that problem.

> - now the vma_merge feature shouldn’t work at all, unless MAP_FIXED or
> PROT_GUARD(0) is used

That's true.

> - the entropy you provide is only about 16 bits, which is really not that hard to brute-force

It's 16 bits per mapping.  I think that'll make enough attacks harder
to be worthwhile.

> - in your patch you don’t use vm_guard during the address search; I see many
> potential sources of bugs here

Don't need to.  vm_end includes the guard pages.

> - if you unmap/remap one page inside a region, the vm_guard field will describe the head
> or the tail pages of the VMA, not both; the kernel doesn’t know how to handle that

There are no head pages.  The guard pages are only placed after the real end.

> - user mode now chooses the entropy with the PROT_GUARD macro; where does it get it?
> User mode shouldn’t be responsible for entropy at all

I can't agree with that.  The user has plenty of opportunities to get
randomness; from /dev/random is the easiest, but you could also do timing
attacks on your own cachelines, for example.
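For instance, a hypothetical helper where userspace draws its own 16 bits of padding
with getrandom(2) and passes it in via PROT_GUARD:

#include <sys/mman.h>
#include <sys/random.h>

#ifndef PROT_GUARD
#define PROT_GUARD(x)	(((x) & 0xffff) << 4)
#endif

/* Hypothetical helper: userspace picks its own guard-page entropy. */
void *mmap_random_guard(size_t len)
{
	unsigned short pad;

	if (getrandom(&pad, sizeof(pad), 0) != (ssize_t)sizeof(pad))
		pad = 1;	/* fall back to a single guard page */

	return mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_GUARD(pad),
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}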
blackzert@gmail.com March 5, 2018, 7:27 p.m. UTC | #6
> On 5 Mar 2018, at 19:23, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Mon, Mar 05, 2018 at 04:09:31PM +0300, Ilya Smith wrote:
>> 
>> I’m analysing that approach and I see many more problems:
>> - each time you call mmap like this, you still increase the count of VMAs, as my
>> patch did
> 
> Umm ... yes, each time you call mmap, you get a VMA.  I'm not sure why
> that's a problem with my patch.  I was trying to solve the problem Daniel
> pointed out, that mapping a guard region after each mmap cost twice as
> many VMAs, and it solves that problem.
> 
The issue was with the VMA count, as Daniel mentioned:
the higher the count, the harder it is to walk the tree. I think this is fine.

>> - the entropy you provide is only about 16 bits, which is really not that hard to brute-force
> 
> It's 16 bits per mapping.  I think that'll make enough attacks harder
> to be worthwhile.

Well yes, it’s OK, sorry. I would just like to have a maximum of 32 bits of entropy some day :)

>> - in your patch you don’t use vm_guard during the address search; I see many
>> potential sources of bugs here
> 
> Don't need to.  vm_end includes the guard pages.
> 
>> - if you unmap/remap one page inside a region, the vm_guard field will describe the head
>> or the tail pages of the VMA, not both; the kernel doesn’t know how to handle that
> 
> There are no head pages.  The guard pages are only placed after the real end.
> 

OK, we have MG where G = vm_guard, right? So when you do a VMA split,
you may end up in the situation m1g1m2G; how do you handle that? I mean when M is
split with only one page inside this region. How do you handle it?

>> - user mode now chooses the entropy with the PROT_GUARD macro; where does it get it?
>> User mode shouldn’t be responsible for entropy at all
> 
> I can't agree with that.  The user has plenty of opportunities to get
> randomness; from /dev/random is the easiest, but you could also do timing
> attacks on your own cachelines, for example.

I think the usual case is to use randomization for every mmap, or not to use it at all
for the whole process. So here I think it would be nice to have some variable
changeable with sysctl (root only) and ioctl (for greedy processes).

Well, let me summarize:
My approach chooses a random gap inside the gap range with the following lines:

+	addr = get_random_long() % ((high - low) >> PAGE_SHIFT);
+	addr = low + (addr << PAGE_SHIFT);

This could be improved by limiting the maximum possible entropy in this shift.
To prevent the situation where an attacker may massage allocations and
predict the chosen address, I randomly choose the memory region. I still
like my idea, but I’m not going to push it anymore, since you have yours now.

Your idea just provides a random non-mappable and non-accessible offset
from the best-fit region. This consumes memory (a 1GB gap if the random value
is 0xffff). But it works, should be faster, and should resolve the issue.

My point was that the current implementation needs to be changed, and you
have your own approach for that. :)
Let’s keep mine in mind till better times (or worse?) ;)
Will you finish your approach and upstream it?

Best regards,
Ilya
Matthew Wilcox (Oracle) March 5, 2018, 7:47 p.m. UTC | #7
On Mon, Mar 05, 2018 at 10:27:32PM +0300, Ilya Smith wrote:
> > On 5 Mar 2018, at 19:23, Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, Mar 05, 2018 at 04:09:31PM +0300, Ilya Smith wrote:
> >> I’m analysing that approach and I see many more problems:
> >> - each time you call mmap like this, you still increase the count of VMAs, as my
> >> patch did
> > 
> > Umm ... yes, each time you call mmap, you get a VMA.  I'm not sure why
> > that's a problem with my patch.  I was trying to solve the problem Daniel
> > pointed out, that mapping a guard region after each mmap cost twice as
> > many VMAs, and it solves that problem.
> > 
> The issue was with the VMA count, as Daniel mentioned:
> the higher the count, the harder it is to walk the tree. I think this is fine.

The performance problem Daniel was mentioning with your patch was not
with the number of VMAs but with the scattering of addresses across the
page table tree.

> >> - the entropy you provide is only about 16 bits, which is really not that hard to brute-force
> > 
> > It's 16 bits per mapping.  I think that'll make enough attacks harder
> > to be worthwhile.
> 
> Well yes, it’s OK, sorry. I would just like to have a maximum of 32 bits of entropy some day :)

We could put 32 bits of padding into the prot argument on 64-bit systems
(and obviously you need a 64-bit address space to use that many bits).  The
thing is that you can't then put anything else into those pages (without
using MAP_FIXED).

> >> - if you unmap/remap one page inside a region, the vm_guard field will describe the head
> >> or the tail pages of the VMA, not both; the kernel doesn’t know how to handle that
> > 
> > There are no head pages.  The guard pages are only placed after the real end.
> 
> OK, we have MG where G = vm_guard, right? So when you do a VMA split,
> you may end up in the situation m1g1m2G; how do you handle that? I mean when M is
> split with only one page inside this region. How do you handle it?

I thought I covered that in my earlier email.  Using one letter per page,
and a five-page mapping with two guard pages: MMMMMGG.  Now unmap the
fourth page, and the VMA gets split into two.  You get: MMMGMGG.

> > I can't agree with that.  The user has plenty of opportunities to get
> > randomness; from /dev/random is the easiest, but you could also do timing
> > attacks on your own cachelines, for example.
> 
> I think the usual case is to use randomization for every mmap, or not to use it at all
> for the whole process. So here I think it would be nice to have some variable
> changeable with sysctl (root only) and ioctl (for greedy processes).

I think this functionality can just as well live inside libc as in
the kernel.
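Very roughly, along the lines Daniel described at the start of the thread (over-allocate
with PROT_NONE, then mprotect the part you actually use); the wrapper name here is made
up and it only handles anonymous mappings:

#include <stddef.h>
#include <sys/mman.h>

/* Made-up libc-style wrapper: 'len' usable bytes followed by 'guard' bytes
 * that stay PROT_NONE.  Two syscalls per mapping, anonymous memory only. */
void *mmap_with_guard(size_t len, size_t guard, int prot)
{
	char *p = mmap(NULL, len + guard, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return MAP_FAILED;
	if (mprotect(p, len, prot) != 0) {
		munmap(p, len + guard);
		return MAP_FAILED;
	}
	return p;
}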

> Well, let me summarize:
> My approach chooses a random gap inside the gap range with the following lines:
> 
> +	addr = get_random_long() % ((high - low) >> PAGE_SHIFT);
> +	addr = low + (addr << PAGE_SHIFT);
> 
> This could be improved by limiting the maximum possible entropy in this shift.
> To prevent the situation where an attacker may massage allocations and
> predict the chosen address, I randomly choose the memory region. I still
> like my idea, but I’m not going to push it anymore, since you have yours now.
> 
> Your idea just provides a random non-mappable and non-accessible offset
> from the best-fit region. This consumes memory (a 1GB gap if the random value
> is 0xffff). But it works, should be faster, and should resolve the issue.

umm ... 64k * 4k is a 256MB gap, not 1GB.  And it consumes address space,
not memory.

> My point was that the current implementation needs to be changed, and you
> have your own approach for that. :)
> Let’s keep mine in mind till better times (or worse?) ;)
> Will you finish your approach and upstream it?

I'm just putting it out there for discussion.  If people think this is
the right approach, then I'm happy to finish it off.  If the consensus
is that we should randomly pick addresses instead, I'm happy if your
approach gets merged.
blackzert@gmail.com March 5, 2018, 8:20 p.m. UTC | #8
> On 5 Mar 2018, at 22:47, Matthew Wilcox <willy@infradead.org> wrote:
>>>> - the entropy you provide is only about 16 bits, which is really not that hard to brute-force
>>> 
>>> It's 16 bits per mapping.  I think that'll make enough attacks harder
>>> to be worthwhile.
>> 
>> Well yes, it’s OK, sorry. I would just like to have a maximum of 32 bits of entropy some day :)
> 
> We could put 32 bits of padding into the prot argument on 64-bit systems
> (and obviously you need a 64-bit address space to use that many bits).  The
> thing is that you can't then put anything else into those pages (without
> using MAP_FIXED).
> 

This one sounds good to me. In my approach it is possible to map there, but ok.

>>>> - if you unmap/remap one page inside a region, the vm_guard field will describe the head
>>>> or the tail pages of the VMA, not both; the kernel doesn’t know how to handle that
>>> 
>>> There are no head pages.  The guard pages are only placed after the real end.
>> 
>> OK, we have MG where G = vm_guard, right? So when you do a VMA split,
>> you may end up in the situation m1g1m2G; how do you handle that? I mean when M is
>> split with only one page inside this region. How do you handle it?
> 
> I thought I covered that in my earlier email.  Using one letter per page,
> and a five-page mapping with two guard pages: MMMMMGG.  Now unmap the
> fourth page, and the VMA gets split into two.  You get: MMMGMGG.
> 
I was just curious; it’s not an issue for me. Now it’s clear, thanks.

>>> I can't agree with that.  The user has plenty of opportunities to get
>>> randomness; from /dev/random is the easiest, but you could also do timing
>>> attacks on your own cachelines, for example.
>> 
>> I think the usual case is to use randomization for every mmap, or not to use it at all
>> for the whole process. So here I think it would be nice to have some variable
>> changeable with sysctl (root only) and ioctl (for greedy processes).
> 
> I think this functionality can just as well live inside libc as in
> the kernel.
> 

Good news for them :)

>> Well, let me summarize:
>> My approach chooses a random gap inside the gap range with the following lines:
>> 
>> +	addr = get_random_long() % ((high - low) >> PAGE_SHIFT);
>> +	addr = low + (addr << PAGE_SHIFT);
>> 
>> This could be improved by limiting the maximum possible entropy in this shift.
>> To prevent the situation where an attacker may massage allocations and
>> predict the chosen address, I randomly choose the memory region. I still
>> like my idea, but I’m not going to push it anymore, since you have yours now.
>> 
>> Your idea just provides a random non-mappable and non-accessible offset
>> from the best-fit region. This consumes memory (a 1GB gap if the random value
>> is 0xffff). But it works, should be faster, and should resolve the issue.
> 
> umm ... 64k * 4k is a 256MB gap, not 1GB.  And it consumes address space,
> not memory.
> 

Hmm, yes… I found 8 bits somewhere. 256MB should be enough for everyone.

>> My point was that the current implementation needs to be changed, and you
>> have your own approach for that. :)
>> Let’s keep mine in mind till better times (or worse?) ;)
>> Will you finish your approach and upstream it?
> 
> I'm just putting it out there for discussion.  If people think this is
> the right approach, then I'm happy to finish it off.  If the consensus
> is that we should randomly pick addresses instead, I'm happy if your
> approach gets merged.

So now, it’s time to call for people? Sorry, I’m new here.

Thanks,
Ilya

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4ef7fb1726ab..9da6df7f62fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2183,8 +2183,8 @@  extern int install_special_mapping(struct mm_struct *mm,
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
-	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	unsigned long len, unsigned long pad_len, vm_flags_t vm_flags,
+	unsigned long pgoff, struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1c5dea402501..9c2b66fa0561 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -299,6 +299,7 @@  struct vm_area_struct {
 	struct mm_struct *vm_mm;	/* The address space we belong to. */
 	pgprot_t vm_page_prot;		/* Access permissions of this VMA. */
 	unsigned long vm_flags;		/* Flags, see mm.h. */
+	unsigned int vm_guard;		/* Number of trailing guard pages */
 
 	/*
 	 * For areas with an address space and backing store,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f8b134f5608f..d88babdf97f9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -12,6 +12,7 @@ 
 #define PROT_EXEC	0x4		/* page can be executed */
 #define PROT_SEM	0x8		/* page may be used for atomic ops */
 #define PROT_NONE	0x0		/* page can not be accessed */
+#define PROT_GUARD(x)	(((x) & 0xffff) << 4)	/* guard pages */
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
 
diff --git a/mm/memory.c b/mm/memory.c
index 1cfc4699db42..5b0f87afa0af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4125,6 +4125,9 @@  int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 					    flags & FAULT_FLAG_REMOTE))
 		return VM_FAULT_SIGSEGV;
 
+	if (DIV_ROUND_UP(vma->vm_end - address, PAGE_SIZE) <= vma->vm_guard)
+		return VM_FAULT_SIGSEGV;
+
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
 	 * space.  Kernel faults are handled more gracefully.
diff --git a/mm/mmap.c b/mm/mmap.c
index 575766ec02f8..b9844b810ee7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1433,6 +1433,7 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long pgoff, unsigned long *populate,
 			struct list_head *uf)
 {
+	unsigned int guard_len = ((prot >> 4) & 0xffff) << PAGE_SHIFT;
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
@@ -1458,6 +1459,8 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 	len = PAGE_ALIGN(len);
 	if (!len)
 		return -ENOMEM;
+	if (len + guard_len < len)
+		return -ENOMEM;
 
 	/* offset overflow? */
 	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
@@ -1472,7 +1475,7 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
-	addr = get_unmapped_area(file, addr, len, pgoff, flags);
+	addr = get_unmapped_area(file, addr, len + guard_len, pgoff, flags);
 	if (offset_in_page(addr))
 		return addr;
 
@@ -1591,7 +1594,7 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, len + guard_len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1727,8 +1730,8 @@  static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 }
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		unsigned long len, unsigned long pad_len, vm_flags_t vm_flags,
+		unsigned long pgoff, struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1737,24 +1740,24 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long charged = 0;
 
 	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
+	if (!may_expand_vm(mm, vm_flags, pad_len >> PAGE_SHIFT)) {
 		unsigned long nr_pages;
 
 		/*
 		 * MAP_FIXED may remove pages of mappings that intersects with
 		 * requested mapping. Account for the pages it would unmap.
 		 */
-		nr_pages = count_vma_pages_range(mm, addr, addr + len);
+		nr_pages = count_vma_pages_range(mm, addr, addr + pad_len);
 
 		if (!may_expand_vm(mm, vm_flags,
-					(len >> PAGE_SHIFT) - nr_pages))
+					(pad_len >> PAGE_SHIFT) - nr_pages))
 			return -ENOMEM;
 	}
 
 	/* Clear old maps */
-	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
+	while (find_vma_links(mm, addr, addr + pad_len, &prev, &rb_link,
 			      &rb_parent)) {
-		if (do_munmap(mm, addr, len, uf))
+		if (do_munmap(mm, addr, pad_len, uf))
 			return -ENOMEM;
 	}
 
@@ -1771,7 +1774,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	/*
 	 * Can we just expand an old mapping?
 	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+	vma = vma_merge(mm, prev, addr, addr + pad_len, vm_flags,
 			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
 	if (vma)
 		goto out;
@@ -1789,9 +1792,10 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma->vm_mm = mm;
 	vma->vm_start = addr;
-	vma->vm_end = addr + len;
+	vma->vm_end = addr + pad_len;
 	vma->vm_flags = vm_flags;
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);
+	vma->vm_guard = (pad_len - len) >> PAGE_SHIFT;
 	vma->vm_pgoff = pgoff;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);