mbox series

[0/2] RFC: READ/WRITE_ONCE vma/mm cleanups

Message ID 20190301035550.1124-1-aarcange@redhat.com (mailing list archive)
Headers show
Series RFC: READ/WRITE_ONCE vma/mm cleanups | expand

Message

Andrea Arcangeli March 1, 2019, 3:55 a.m. UTC
Hello,

This was a well known issue for more than a decade, but until a few
months ago we relied on the compiler to stick to atomic accesses and
updates while walking and updating pagetables.

However now the 64bit native_set_pte finally uses WRITE_ONCE and
gup_pmd_range uses READ_ONCE as well.

This convert more racy VM places to avoid depending on the expected
compiler behavior to achieve kernel runtime correctness.

It mostly guarantees gcc to do atomic updates at 64bit granularity
(practically not needed) and it also prevents gcc to emit code that
risks getting confused if the memory unexpectedly changes under it
(unlikely to ever be needed).

The list of vm_start/end/pgoff to update isn't complete, I covered the
most obvious places, but before wasting too much time at doing a full
audit I thought it was safer to post it and get some comment. More
updates can be posted incrementally anyway.

Andrea Arcangeli (2):
  coredump: use READ_ONCE to read mm->flags
  mm: use READ/WRITE_ONCE to access anonymous vmas
    vm_start/vm_end/vm_pgoff

 fs/coredump.c |  2 +-
 mm/gup.c      | 23 +++++++++++++----------
 mm/internal.h |  3 ++-
 mm/memory.c   |  2 +-
 mm/mmap.c     | 16 ++++++++--------
 mm/rmap.c     |  3 ++-
 mm/vmacache.c |  3 ++-
 7 files changed, 29 insertions(+), 23 deletions(-)

Comments

Kirill A. Shutemov March 1, 2019, 9:37 a.m. UTC | #1
On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> Hello,
> 
> This was a well known issue for more than a decade, but until a few
> months ago we relied on the compiler to stick to atomic accesses and
> updates while walking and updating pagetables.
> 
> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> gup_pmd_range uses READ_ONCE as well.
> 
> This convert more racy VM places to avoid depending on the expected
> compiler behavior to achieve kernel runtime correctness.
> 
> It mostly guarantees gcc to do atomic updates at 64bit granularity
> (practically not needed) and it also prevents gcc to emit code that
> risks getting confused if the memory unexpectedly changes under it
> (unlikely to ever be needed).
> 
> The list of vm_start/end/pgoff to update isn't complete, I covered the
> most obvious places, but before wasting too much time at doing a full
> audit I thought it was safer to post it and get some comment. More
> updates can be posted incrementally anyway.

The intention is described well to my eyes.

Do I understand correctly, that it's attempt to get away with modifying
vma's fields under down_read(mmap_sem)?

I'm not fan of this.

It can help with producing stable value for the one field, but it doesn't
help if more than one thing changed under you. Like if both vm_start and
vm_end modifed under you, it can lead to inconsistency. Like vm_end <
vm_start.
Vlastimil Babka March 1, 2019, 1:04 p.m. UTC | #2
On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
>> Hello,
>>
>> This was a well known issue for more than a decade, but until a few
>> months ago we relied on the compiler to stick to atomic accesses and
>> updates while walking and updating pagetables.
>>
>> However now the 64bit native_set_pte finally uses WRITE_ONCE and
>> gup_pmd_range uses READ_ONCE as well.
>>
>> This convert more racy VM places to avoid depending on the expected
>> compiler behavior to achieve kernel runtime correctness.
>>
>> It mostly guarantees gcc to do atomic updates at 64bit granularity
>> (practically not needed) and it also prevents gcc to emit code that
>> risks getting confused if the memory unexpectedly changes under it
>> (unlikely to ever be needed).
>>
>> The list of vm_start/end/pgoff to update isn't complete, I covered the
>> most obvious places, but before wasting too much time at doing a full
>> audit I thought it was safer to post it and get some comment. More
>> updates can be posted incrementally anyway.
> 
> The intention is described well to my eyes.
> 
> Do I understand correctly, that it's attempt to get away with modifying
> vma's fields under down_read(mmap_sem)?

If that's the intention, then IMHO it's not that well described. It
talks about "racy VM places" but e.g. the __mm_populate() changes are
for code protected by down_read(). So what's going on here?

> I'm not fan of this.
> 
> It can help with producing stable value for the one field, but it doesn't
> help if more than one thing changed under you. Like if both vm_start and
> vm_end modifed under you, it can lead to inconsistency. Like vm_end <
> vm_start.
>
Andrea Arcangeli March 1, 2019, 4:54 p.m. UTC | #3
Hello Kirill and Vlastimil,

On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> >> Hello,
> >>
> >> This was a well known issue for more than a decade, but until a few
> >> months ago we relied on the compiler to stick to atomic accesses and
> >> updates while walking and updating pagetables.
> >>
> >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> >> gup_pmd_range uses READ_ONCE as well.
> >>
> >> This convert more racy VM places to avoid depending on the expected
> >> compiler behavior to achieve kernel runtime correctness.
> >>
> >> It mostly guarantees gcc to do atomic updates at 64bit granularity
> >> (practically not needed) and it also prevents gcc to emit code that
> >> risks getting confused if the memory unexpectedly changes under it
> >> (unlikely to ever be needed).
> >>
> >> The list of vm_start/end/pgoff to update isn't complete, I covered the
> >> most obvious places, but before wasting too much time at doing a full
> >> audit I thought it was safer to post it and get some comment. More
> >> updates can be posted incrementally anyway.
> > 
> > The intention is described well to my eyes.
> > 
> > Do I understand correctly, that it's attempt to get away with modifying
> > vma's fields under down_read(mmap_sem)?

The issue is that we already get away with it, but we do it without
READ/WRITE_ONCE. The patch should changes nothing, it should only
reduce the dependency on the compiler to do what we expect.

> If that's the intention, then IMHO it's not that well described. It
> talks about "racy VM places" but e.g. the __mm_populate() changes are
> for code protected by down_read(). So what's going on here?

expand_stack can move anonymous vma vm_end up or vm_start/pgoff down,
while we hold the mmap_sem for writing. See the location of the three
WRITE_ONCE in the patch.

So whenever we deal with a vma that we don't know if it's filebacked
(filebacked vmas cannot growsup/down) and that we don't know if it has
VM_GROWSDOWN/UP set, we shall use READ_ONCE to access
vm_start/end/pgoff. This is the only thing the patch is about, and it
should make no runtime difference at all, but then the WRITE_ONCE in
native_set_pte also should make no runtime difference just like the
READ_ONCE in gup_pmd_range should make no runtime difference. I mean
we don't trust the compiler with gup_fast but then we trust it with
expand_stack vs find_vma.
Davidlohr Bueso March 1, 2019, 6:49 p.m. UTC | #4
On Fri, 01 Mar 2019, Andrea Arcangeli wrote:

>
>On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
>> On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
>> > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
>> >> Hello,
>> >>
>> >> This was a well known issue for more than a decade, but until a few
>> >> months ago we relied on the compiler to stick to atomic accesses and
>> >> updates while walking and updating pagetables.
>> >>
>> >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
>> >> gup_pmd_range uses READ_ONCE as well.
>> >>
>> >> This convert more racy VM places to avoid depending on the expected
>> >> compiler behavior to achieve kernel runtime correctness.
>> >>
>> >> It mostly guarantees gcc to do atomic updates at 64bit granularity
>> >> (practically not needed) and it also prevents gcc to emit code that
>> >> risks getting confused if the memory unexpectedly changes under it
>> >> (unlikely to ever be needed).
>> >>
>> >> The list of vm_start/end/pgoff to update isn't complete, I covered the
>> >> most obvious places, but before wasting too much time at doing a full
>> >> audit I thought it was safer to post it and get some comment. More
>> >> updates can be posted incrementally anyway.
>> >
>> > The intention is described well to my eyes.
>> >
>> > Do I understand correctly, that it's attempt to get away with modifying
>> > vma's fields under down_read(mmap_sem)?
>
>The issue is that we already get away with it, but we do it without
>READ/WRITE_ONCE. The patch should changes nothing, it should only
>reduce the dependency on the compiler to do what we expect.
>
>> If that's the intention, then IMHO it's not that well described. It
>> talks about "racy VM places" but e.g. the __mm_populate() changes are
>> for code protected by down_read(). So what's going on here?
>
>expand_stack can move anonymous vma vm_end up or vm_start/pgoff down,
>while we hold the mmap_sem for writing. See the location of the three
>WRITE_ONCE in the patch.

You mean for reading, right? Yes, with expand_stack being held for read
such members were never really serialized by the mmap_sem and thus we
should not be computing stale values.

Acked-by: Davidlohr Bueso <dbueso@suse.de>

Thanks.
Kirill A. Shutemov March 4, 2019, 10:12 a.m. UTC | #5
On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote:
> Hello Kirill and Vlastimil,
> 
> On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> > >> Hello,
> > >>
> > >> This was a well known issue for more than a decade, but until a few
> > >> months ago we relied on the compiler to stick to atomic accesses and
> > >> updates while walking and updating pagetables.
> > >>
> > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> > >> gup_pmd_range uses READ_ONCE as well.
> > >>
> > >> This convert more racy VM places to avoid depending on the expected
> > >> compiler behavior to achieve kernel runtime correctness.
> > >>
> > >> It mostly guarantees gcc to do atomic updates at 64bit granularity
> > >> (practically not needed) and it also prevents gcc to emit code that
> > >> risks getting confused if the memory unexpectedly changes under it
> > >> (unlikely to ever be needed).
> > >>
> > >> The list of vm_start/end/pgoff to update isn't complete, I covered the
> > >> most obvious places, but before wasting too much time at doing a full
> > >> audit I thought it was safer to post it and get some comment. More
> > >> updates can be posted incrementally anyway.
> > > 
> > > The intention is described well to my eyes.
> > > 
> > > Do I understand correctly, that it's attempt to get away with modifying
> > > vma's fields under down_read(mmap_sem)?
> 
> The issue is that we already get away with it, but we do it without
> READ/WRITE_ONCE. The patch should changes nothing, it should only
> reduce the dependency on the compiler to do what we expect.

Yes, it is pre-existing problem. And yes, complier may screw this up.
The patch may reduce dependency on the compiler, but it doesn't mean it
reduces chance of race.

Consider your changes into __mm_populate() and populate_vma_page_range().
You put READ_ONCE() in both functions. But populate_vma_page_range() gets
called from __mm_populate(). Before your change compiler may optimize the
code and load from the memory once for a field. With your changes complier
will issue two loads.

It *increases* chances of the race, not reduces them.

The current locking scheme doesn't allow modifying VMA field without
down_write(mmap_sem).

We do have hacks[1] that try to bypass the limitation, but AFAIK we never
had a solid explanation why this should work. Sparkling READ_ONCE()
doesn't help with this, but makes it appears legitimate.

[1] I believe we also touch vm_flags without proper locking to set/clear
VM_LOCKED.
Michal Hocko March 5, 2019, 1 p.m. UTC | #6
On Mon 04-03-19 13:12:10, Kirill A. Shutemov wrote:
> On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote:
> > Hello Kirill and Vlastimil,
> > 
> > On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote:
> > > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote:
> > > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote:
> > > >> Hello,
> > > >>
> > > >> This was a well known issue for more than a decade, but until a few
> > > >> months ago we relied on the compiler to stick to atomic accesses and
> > > >> updates while walking and updating pagetables.
> > > >>
> > > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and
> > > >> gup_pmd_range uses READ_ONCE as well.
> > > >>
> > > >> This convert more racy VM places to avoid depending on the expected
> > > >> compiler behavior to achieve kernel runtime correctness.
> > > >>
> > > >> It mostly guarantees gcc to do atomic updates at 64bit granularity
> > > >> (practically not needed) and it also prevents gcc to emit code that
> > > >> risks getting confused if the memory unexpectedly changes under it
> > > >> (unlikely to ever be needed).
> > > >>
> > > >> The list of vm_start/end/pgoff to update isn't complete, I covered the
> > > >> most obvious places, but before wasting too much time at doing a full
> > > >> audit I thought it was safer to post it and get some comment. More
> > > >> updates can be posted incrementally anyway.
> > > > 
> > > > The intention is described well to my eyes.
> > > > 
> > > > Do I understand correctly, that it's attempt to get away with modifying
> > > > vma's fields under down_read(mmap_sem)?
> > 
> > The issue is that we already get away with it, but we do it without
> > READ/WRITE_ONCE. The patch should changes nothing, it should only
> > reduce the dependency on the compiler to do what we expect.
> 
> Yes, it is pre-existing problem. And yes, complier may screw this up.
> The patch may reduce dependency on the compiler, but it doesn't mean it
> reduces chance of race.
> 
> Consider your changes into __mm_populate() and populate_vma_page_range().
> You put READ_ONCE() in both functions. But populate_vma_page_range() gets
> called from __mm_populate(). Before your change compiler may optimize the
> code and load from the memory once for a field. With your changes complier
> will issue two loads.
> 
> It *increases* chances of the race, not reduces them.
> 
> The current locking scheme doesn't allow modifying VMA field without
> down_write(mmap_sem).
> 
> We do have hacks[1] that try to bypass the limitation, but AFAIK we never
> had a solid explanation why this should work. Sparkling READ_ONCE()
> doesn't help with this, but makes it appears legitimate.

I do agree with Kirill here. Sprinkling {READ,WRITE}_ONCE around just
doesn't solve anything. I am pretty sure that people will not think
about it and we will end up in a similar half covered situation in few
years again. I would rather remove all those hacks and use a saner
locking scheme instead.

> [1] I believe we also touch vm_flags without proper locking to set/clear
> VM_LOCKED.