Message ID | 20190301035550.1124-1-aarcange@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | RFC: READ/WRITE_ONCE vma/mm cleanups | expand |
On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: > Hello, > > This was a well known issue for more than a decade, but until a few > months ago we relied on the compiler to stick to atomic accesses and > updates while walking and updating pagetables. > > However now the 64bit native_set_pte finally uses WRITE_ONCE and > gup_pmd_range uses READ_ONCE as well. > > This convert more racy VM places to avoid depending on the expected > compiler behavior to achieve kernel runtime correctness. > > It mostly guarantees gcc to do atomic updates at 64bit granularity > (practically not needed) and it also prevents gcc to emit code that > risks getting confused if the memory unexpectedly changes under it > (unlikely to ever be needed). > > The list of vm_start/end/pgoff to update isn't complete, I covered the > most obvious places, but before wasting too much time at doing a full > audit I thought it was safer to post it and get some comment. More > updates can be posted incrementally anyway. The intention is described well to my eyes. Do I understand correctly, that it's attempt to get away with modifying vma's fields under down_read(mmap_sem)? I'm not fan of this. It can help with producing stable value for the one field, but it doesn't help if more than one thing changed under you. Like if both vm_start and vm_end modifed under you, it can lead to inconsistency. Like vm_end < vm_start.
On 3/1/19 10:37 AM, Kirill A. Shutemov wrote: > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: >> Hello, >> >> This was a well known issue for more than a decade, but until a few >> months ago we relied on the compiler to stick to atomic accesses and >> updates while walking and updating pagetables. >> >> However now the 64bit native_set_pte finally uses WRITE_ONCE and >> gup_pmd_range uses READ_ONCE as well. >> >> This convert more racy VM places to avoid depending on the expected >> compiler behavior to achieve kernel runtime correctness. >> >> It mostly guarantees gcc to do atomic updates at 64bit granularity >> (practically not needed) and it also prevents gcc to emit code that >> risks getting confused if the memory unexpectedly changes under it >> (unlikely to ever be needed). >> >> The list of vm_start/end/pgoff to update isn't complete, I covered the >> most obvious places, but before wasting too much time at doing a full >> audit I thought it was safer to post it and get some comment. More >> updates can be posted incrementally anyway. > > The intention is described well to my eyes. > > Do I understand correctly, that it's attempt to get away with modifying > vma's fields under down_read(mmap_sem)? If that's the intention, then IMHO it's not that well described. It talks about "racy VM places" but e.g. the __mm_populate() changes are for code protected by down_read(). So what's going on here? > I'm not fan of this. > > It can help with producing stable value for the one field, but it doesn't > help if more than one thing changed under you. Like if both vm_start and > vm_end modifed under you, it can lead to inconsistency. Like vm_end < > vm_start. >
Hello Kirill and Vlastimil, On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote: > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote: > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: > >> Hello, > >> > >> This was a well known issue for more than a decade, but until a few > >> months ago we relied on the compiler to stick to atomic accesses and > >> updates while walking and updating pagetables. > >> > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and > >> gup_pmd_range uses READ_ONCE as well. > >> > >> This convert more racy VM places to avoid depending on the expected > >> compiler behavior to achieve kernel runtime correctness. > >> > >> It mostly guarantees gcc to do atomic updates at 64bit granularity > >> (practically not needed) and it also prevents gcc to emit code that > >> risks getting confused if the memory unexpectedly changes under it > >> (unlikely to ever be needed). > >> > >> The list of vm_start/end/pgoff to update isn't complete, I covered the > >> most obvious places, but before wasting too much time at doing a full > >> audit I thought it was safer to post it and get some comment. More > >> updates can be posted incrementally anyway. > > > > The intention is described well to my eyes. > > > > Do I understand correctly, that it's attempt to get away with modifying > > vma's fields under down_read(mmap_sem)? The issue is that we already get away with it, but we do it without READ/WRITE_ONCE. The patch should changes nothing, it should only reduce the dependency on the compiler to do what we expect. > If that's the intention, then IMHO it's not that well described. It > talks about "racy VM places" but e.g. the __mm_populate() changes are > for code protected by down_read(). So what's going on here? expand_stack can move anonymous vma vm_end up or vm_start/pgoff down, while we hold the mmap_sem for writing. See the location of the three WRITE_ONCE in the patch. So whenever we deal with a vma that we don't know if it's filebacked (filebacked vmas cannot growsup/down) and that we don't know if it has VM_GROWSDOWN/UP set, we shall use READ_ONCE to access vm_start/end/pgoff. This is the only thing the patch is about, and it should make no runtime difference at all, but then the WRITE_ONCE in native_set_pte also should make no runtime difference just like the READ_ONCE in gup_pmd_range should make no runtime difference. I mean we don't trust the compiler with gup_fast but then we trust it with expand_stack vs find_vma.
On Fri, 01 Mar 2019, Andrea Arcangeli wrote: > >On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote: >> On 3/1/19 10:37 AM, Kirill A. Shutemov wrote: >> > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: >> >> Hello, >> >> >> >> This was a well known issue for more than a decade, but until a few >> >> months ago we relied on the compiler to stick to atomic accesses and >> >> updates while walking and updating pagetables. >> >> >> >> However now the 64bit native_set_pte finally uses WRITE_ONCE and >> >> gup_pmd_range uses READ_ONCE as well. >> >> >> >> This convert more racy VM places to avoid depending on the expected >> >> compiler behavior to achieve kernel runtime correctness. >> >> >> >> It mostly guarantees gcc to do atomic updates at 64bit granularity >> >> (practically not needed) and it also prevents gcc to emit code that >> >> risks getting confused if the memory unexpectedly changes under it >> >> (unlikely to ever be needed). >> >> >> >> The list of vm_start/end/pgoff to update isn't complete, I covered the >> >> most obvious places, but before wasting too much time at doing a full >> >> audit I thought it was safer to post it and get some comment. More >> >> updates can be posted incrementally anyway. >> > >> > The intention is described well to my eyes. >> > >> > Do I understand correctly, that it's attempt to get away with modifying >> > vma's fields under down_read(mmap_sem)? > >The issue is that we already get away with it, but we do it without >READ/WRITE_ONCE. The patch should changes nothing, it should only >reduce the dependency on the compiler to do what we expect. > >> If that's the intention, then IMHO it's not that well described. It >> talks about "racy VM places" but e.g. the __mm_populate() changes are >> for code protected by down_read(). So what's going on here? > >expand_stack can move anonymous vma vm_end up or vm_start/pgoff down, >while we hold the mmap_sem for writing. See the location of the three >WRITE_ONCE in the patch. You mean for reading, right? Yes, with expand_stack being held for read such members were never really serialized by the mmap_sem and thus we should not be computing stale values. Acked-by: Davidlohr Bueso <dbueso@suse.de> Thanks.
On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote: > Hello Kirill and Vlastimil, > > On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote: > > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote: > > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: > > >> Hello, > > >> > > >> This was a well known issue for more than a decade, but until a few > > >> months ago we relied on the compiler to stick to atomic accesses and > > >> updates while walking and updating pagetables. > > >> > > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and > > >> gup_pmd_range uses READ_ONCE as well. > > >> > > >> This convert more racy VM places to avoid depending on the expected > > >> compiler behavior to achieve kernel runtime correctness. > > >> > > >> It mostly guarantees gcc to do atomic updates at 64bit granularity > > >> (practically not needed) and it also prevents gcc to emit code that > > >> risks getting confused if the memory unexpectedly changes under it > > >> (unlikely to ever be needed). > > >> > > >> The list of vm_start/end/pgoff to update isn't complete, I covered the > > >> most obvious places, but before wasting too much time at doing a full > > >> audit I thought it was safer to post it and get some comment. More > > >> updates can be posted incrementally anyway. > > > > > > The intention is described well to my eyes. > > > > > > Do I understand correctly, that it's attempt to get away with modifying > > > vma's fields under down_read(mmap_sem)? > > The issue is that we already get away with it, but we do it without > READ/WRITE_ONCE. The patch should changes nothing, it should only > reduce the dependency on the compiler to do what we expect. Yes, it is pre-existing problem. And yes, complier may screw this up. The patch may reduce dependency on the compiler, but it doesn't mean it reduces chance of race. Consider your changes into __mm_populate() and populate_vma_page_range(). You put READ_ONCE() in both functions. But populate_vma_page_range() gets called from __mm_populate(). Before your change compiler may optimize the code and load from the memory once for a field. With your changes complier will issue two loads. It *increases* chances of the race, not reduces them. The current locking scheme doesn't allow modifying VMA field without down_write(mmap_sem). We do have hacks[1] that try to bypass the limitation, but AFAIK we never had a solid explanation why this should work. Sparkling READ_ONCE() doesn't help with this, but makes it appears legitimate. [1] I believe we also touch vm_flags without proper locking to set/clear VM_LOCKED.
On Mon 04-03-19 13:12:10, Kirill A. Shutemov wrote: > On Fri, Mar 01, 2019 at 11:54:52AM -0500, Andrea Arcangeli wrote: > > Hello Kirill and Vlastimil, > > > > On Fri, Mar 01, 2019 at 02:04:38PM +0100, Vlastimil Babka wrote: > > > On 3/1/19 10:37 AM, Kirill A. Shutemov wrote: > > > > On Thu, Feb 28, 2019 at 10:55:48PM -0500, Andrea Arcangeli wrote: > > > >> Hello, > > > >> > > > >> This was a well known issue for more than a decade, but until a few > > > >> months ago we relied on the compiler to stick to atomic accesses and > > > >> updates while walking and updating pagetables. > > > >> > > > >> However now the 64bit native_set_pte finally uses WRITE_ONCE and > > > >> gup_pmd_range uses READ_ONCE as well. > > > >> > > > >> This convert more racy VM places to avoid depending on the expected > > > >> compiler behavior to achieve kernel runtime correctness. > > > >> > > > >> It mostly guarantees gcc to do atomic updates at 64bit granularity > > > >> (practically not needed) and it also prevents gcc to emit code that > > > >> risks getting confused if the memory unexpectedly changes under it > > > >> (unlikely to ever be needed). > > > >> > > > >> The list of vm_start/end/pgoff to update isn't complete, I covered the > > > >> most obvious places, but before wasting too much time at doing a full > > > >> audit I thought it was safer to post it and get some comment. More > > > >> updates can be posted incrementally anyway. > > > > > > > > The intention is described well to my eyes. > > > > > > > > Do I understand correctly, that it's attempt to get away with modifying > > > > vma's fields under down_read(mmap_sem)? > > > > The issue is that we already get away with it, but we do it without > > READ/WRITE_ONCE. The patch should changes nothing, it should only > > reduce the dependency on the compiler to do what we expect. > > Yes, it is pre-existing problem. And yes, complier may screw this up. > The patch may reduce dependency on the compiler, but it doesn't mean it > reduces chance of race. > > Consider your changes into __mm_populate() and populate_vma_page_range(). > You put READ_ONCE() in both functions. But populate_vma_page_range() gets > called from __mm_populate(). Before your change compiler may optimize the > code and load from the memory once for a field. With your changes complier > will issue two loads. > > It *increases* chances of the race, not reduces them. > > The current locking scheme doesn't allow modifying VMA field without > down_write(mmap_sem). > > We do have hacks[1] that try to bypass the limitation, but AFAIK we never > had a solid explanation why this should work. Sparkling READ_ONCE() > doesn't help with this, but makes it appears legitimate. I do agree with Kirill here. Sprinkling {READ,WRITE}_ONCE around just doesn't solve anything. I am pretty sure that people will not think about it and we will end up in a similar half covered situation in few years again. I would rather remove all those hacks and use a saner locking scheme instead. > [1] I believe we also touch vm_flags without proper locking to set/clear > VM_LOCKED.