Message ID | 20180703082718.GF16767@dhcp22.suse.cz (mailing list archive) |
---|---|
State | New, archived |
On Tue, Jul 03, 2018 at 10:27:18AM +0200, Michal Hocko wrote:
> On Tue 03-07-18 11:12:05, Kirill A. Shutemov wrote:
> > On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> > > On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> > > [...]
> > > > I probably miss the explanation somewhere, but what's wrong with allowing
> > > > other thread to re-populate the VMA?
> > >
> > > We have discussed that earlier and it boils down to how is racy access
> > > to munmap supposed to behave. Right now we have either the original
> > > content or SEGV. If we allow to simply madvise_dontneed before real
> > > unmap we could get a new page as well. There might be (quite broken I
> > > would say) user space code that would simply corrupt data silently that
> > > way.
> >
> > Okay, so we add a lot of complexity to accommodate broken userspace that
> > may or may not exist. Is it right? :)
>
> I would really love to do the most simple and obvious thing
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 336bee8c4e25..86ffb179c3b5 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2811,6 +2811,8 @@ EXPORT_SYMBOL(vm_munmap);
>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>  {
>  	profile_munmap(addr);
> +	if (len > LARGE_NUMBER)
> +		do_madvise(addr, len, MADV_DONTNEED);
>  	return vm_munmap(addr, len);
>  }
>
> but the argument that current semantic of good data or SEGV on
> racing threads is no longer preserved sounds valid to me. Remember
> optimizations shouldn't eat your data. How do we ensure that we won't
> corrupt data silently?

+linux-api

Frankly, I don't see change in semantics here.

Code that has race between munmap() and page fault would get intermittent
SIGSEGV before and after the approach with simple MADV_DONTNEED.

To be safe, I wouldn't go with the optimization if the process has custom
SIGSEGV handler.

> Besides that if this was so simple then we do not even need any kernel
> code. You could do that from glibc resp. any munmap wrapper. So maybe
> the proper answer is, if you do care then just help the system and
> DONTNEED your data before you munmap as an optimization for large
> mappings.

Kernel latency problems have to be handled by kernel.
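[A minimal userspace sketch of the wrapper idea Michal floats above ("DONTNEED your data before you munmap"): drop the pages with MADV_DONTNEED, which only takes mmap_sem for read, before doing the real munmap(). The names `munmap_lazy` and `LARGE_MAPPING_THRESHOLD` are illustrative, not anything defined by glibc or the kernel.]

```c
/*
 * Hypothetical userspace munmap() wrapper: for large mappings, drop the
 * pages first with MADV_DONTNEED (mmap_sem held for read only) so the
 * bulk of the page freeing does not happen with mmap_sem held for write
 * inside munmap() itself.
 */
#include <stddef.h>
#include <sys/mman.h>

#define LARGE_MAPPING_THRESHOLD (64UL << 20)	/* 64MB, arbitrary tuning knob */

int munmap_lazy(void *addr, size_t len)
{
	if (len > LARGE_MAPPING_THRESHOLD) {
		/*
		 * Best effort only: MADV_DONTNEED fails with EINVAL on
		 * mlocked or hugetlb ranges, in which case we simply fall
		 * through to the plain munmap() below.
		 */
		(void)madvise(addr, len, MADV_DONTNEED);
	}
	return munmap(addr, len);
}
```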
On Tue 03-07-18 12:19:11, Kirill A. Shutemov wrote:
> On Tue, Jul 03, 2018 at 10:27:18AM +0200, Michal Hocko wrote:
> > On Tue 03-07-18 11:12:05, Kirill A. Shutemov wrote:
> > > On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> > > > On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> > > > [...]
> > > > > I probably miss the explanation somewhere, but what's wrong with allowing
> > > > > other thread to re-populate the VMA?
> > > >
> > > > We have discussed that earlier and it boils down to how is racy access
> > > > to munmap supposed to behave. Right now we have either the original
> > > > content or SEGV. If we allow to simply madvise_dontneed before real
> > > > unmap we could get a new page as well. There might be (quite broken I
> > > > would say) user space code that would simply corrupt data silently that
> > > > way.
> > >
> > > Okay, so we add a lot of complexity to accommodate broken userspace that
> > > may or may not exist. Is it right? :)
> >
> > I would really love to do the most simple and obvious thing
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 336bee8c4e25..86ffb179c3b5 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2811,6 +2811,8 @@ EXPORT_SYMBOL(vm_munmap);
> >  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >  {
> >  	profile_munmap(addr);
> > +	if (len > LARGE_NUMBER)
> > +		do_madvise(addr, len, MADV_DONTNEED);
> >  	return vm_munmap(addr, len);
> >  }
> >
> > but the argument that current semantic of good data or SEGV on
> > racing threads is no longer preserved sounds valid to me. Remember
> > optimizations shouldn't eat your data. How do we ensure that we won't
> > corrupt data silently?
>
> +linux-api
>
> Frankly, I don't see change in semantics here.
>
> Code that has race between munmap() and page fault would get intermittent
> SIGSEGV before and after the approach with simple MADV_DONTNEED.

prior to this patch you would either get an expected content (if you
win the race) or SEGV otherwise. With the above change you would get a
third state - a fresh new page (zero page) if you lost the race half
way. That sounds like a change of a long term semantic.

How much that matters is of course a question. Userspace is known to do
the most unexpected things you never even dreamed of.
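[To make the "third state" Michal describes concrete, here is a toy reproducer, not part of the patch set: one thread keeps reading a mapping while the main thread tears it down the way the proposed optimization would (MADV_DONTNEED, then munmap). Depending on timing the reader sees the original byte, a fresh zero-filled page, or dies with SIGSEGV. Build with `gcc -pthread`.]

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN (4UL << 20)	/* 4MB anonymous mapping, size is arbitrary */

static volatile unsigned char *map;

/* Reader thread: races against the unmap happening in main(). */
static void *reader(void *arg)
{
	(void)arg;
	for (;;) {
		unsigned char c = map[0];	/* faults (SIGSEGV) once munmap() wins */

		if (c != 0xaa)
			printf("read 0x%02x, not 0xaa: fresh zero page\n", c);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset((void *)map, 0xaa, MAP_LEN);

	pthread_create(&t, NULL, reader, NULL);
	usleep(1000);

	/* What the discussed optimization effectively does inside munmap(): */
	madvise((void *)map, MAP_LEN, MADV_DONTNEED);	/* reader may now see zeroes */
	munmap((void *)map, MAP_LEN);			/* reader now gets SIGSEGV */

	sleep(1);	/* give the reader time to hit one of the outcomes */
	return 0;
}
```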
On Tue, Jul 03, 2018 at 01:34:53PM +0200, Michal Hocko wrote:
> On Tue 03-07-18 12:19:11, Kirill A. Shutemov wrote:
> > On Tue, Jul 03, 2018 at 10:27:18AM +0200, Michal Hocko wrote:
> > > On Tue 03-07-18 11:12:05, Kirill A. Shutemov wrote:
> > > > On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
> > > > > On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
> > > > > [...]
> > > > > > I probably miss the explanation somewhere, but what's wrong with allowing
> > > > > > other thread to re-populate the VMA?
> > > > >
> > > > > We have discussed that earlier and it boils down to how is racy access
> > > > > to munmap supposed to behave. Right now we have either the original
> > > > > content or SEGV. If we allow to simply madvise_dontneed before real
> > > > > unmap we could get a new page as well. There might be (quite broken I
> > > > > would say) user space code that would simply corrupt data silently that
> > > > > way.
> > > >
> > > > Okay, so we add a lot of complexity to accommodate broken userspace that
> > > > may or may not exist. Is it right? :)
> > >
> > > I would really love to do the most simple and obvious thing
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 336bee8c4e25..86ffb179c3b5 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2811,6 +2811,8 @@ EXPORT_SYMBOL(vm_munmap);
> > >  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> > >  {
> > >  	profile_munmap(addr);
> > > +	if (len > LARGE_NUMBER)
> > > +		do_madvise(addr, len, MADV_DONTNEED);
> > >  	return vm_munmap(addr, len);
> > >  }
> > >
> > > but the argument that current semantic of good data or SEGV on
> > > racing threads is no longer preserved sounds valid to me. Remember
> > > optimizations shouldn't eat your data. How do we ensure that we won't
> > > corrupt data silently?
> >
> > +linux-api
> >
> > Frankly, I don't see change in semantics here.
> >
> > Code that has race between munmap() and page fault would get intermittent
> > SIGSEGV before and after the approach with simple MADV_DONTNEED.
>
> prior to this patch you would either get an expected content (if you
> win the race) or SEGV otherwise. With the above change you would get a
> third state - a fresh new page (zero page) if you lost the race half
> way. That sounds like a change of a long term semantic.
>
> How much that matters is of course a question. Userspace is known to do
> the most unexpected things you never even dreamed of.

I bet nobody would notice the difference.

Let's go the simple way. The price to protect against *theoretical* broken
userspace is too high.
On 7/3/18 5:14 AM, Kirill A. Shutemov wrote:
> On Tue, Jul 03, 2018 at 01:34:53PM +0200, Michal Hocko wrote:
>> On Tue 03-07-18 12:19:11, Kirill A. Shutemov wrote:
>>> On Tue, Jul 03, 2018 at 10:27:18AM +0200, Michal Hocko wrote:
>>>> On Tue 03-07-18 11:12:05, Kirill A. Shutemov wrote:
>>>>> On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote:
>>>>>> On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote:
>>>>>> [...]
>>>>>>> I probably miss the explanation somewhere, but what's wrong with allowing
>>>>>>> other thread to re-populate the VMA?
>>>>>> We have discussed that earlier and it boils down to how is racy access
>>>>>> to munmap supposed to behave. Right now we have either the original
>>>>>> content or SEGV. If we allow to simply madvise_dontneed before real
>>>>>> unmap we could get a new page as well. There might be (quite broken I
>>>>>> would say) user space code that would simply corrupt data silently that
>>>>>> way.
>>>>> Okay, so we add a lot of complexity to accommodate broken userspace that
>>>>> may or may not exist. Is it right? :)
>>>> I would really love to do the most simple and obvious thing
>>>>
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 336bee8c4e25..86ffb179c3b5 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -2811,6 +2811,8 @@ EXPORT_SYMBOL(vm_munmap);
>>>>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>>>  {
>>>>  	profile_munmap(addr);
>>>> +	if (len > LARGE_NUMBER)
>>>> +		do_madvise(addr, len, MADV_DONTNEED);
>>>>  	return vm_munmap(addr, len);
>>>>  }
>>>>
>>>> but the argument that current semantic of good data or SEGV on
>>>> racing threads is no longer preserved sounds valid to me. Remember
>>>> optimizations shouldn't eat your data. How do we ensure that we won't
>>>> corrupt data silently?
>>> +linux-api
>>>
>>> Frankly, I don't see change in semantics here.
>>>
>>> Code that has race between munmap() and page fault would get intermittent
>>> SIGSEGV before and after the approach with simple MADV_DONTNEED.
>> prior to this patch you would either get an expected content (if you
>> win the race) or SEGV otherwise. With the above change you would get a
>> third state - a fresh new page (zero page) if you lost the race half
>> way. That sounds like a change of a long term semantic.
>>
>> How much that matters is of course a question. Userspace is known to do
>> the most unexpected things you never even dreamed of.
> I bet nobody would notice the difference.
>
> Let's go the simple way. The price to protect against *theoretical* broken
> userspace is too high.

That simple way has two major issues:

* The unexpected third state as Michal mentioned. VM_DEAD is a simple way
  to deal with it. It may not be able to kill all corner cases, but it
  should be a good simple approach to deal with the most wacky applications.

* Can't handle mlocked and hugetlb vmas mentioned by Andrew. MADV_DONTNEED
  just skips them.

Actually, I think your suggestion about just calling regular do_munmap()
when getting the exclusive lock sounds reasonable. With this approach, we
can solve the above caveats and make code simple enough (Of course not
that simple as Michal expects :-)

Thanks,
Yang
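[A quick userspace check of the mlocked-VMA caveat Yang mentions: madvise(MADV_DONTNEED) refuses locked (and hugetlb) ranges with EINVAL, so any "DONTNEED then munmap" scheme needs a fallback path for them. Illustration only; the mlock() below needs sufficient RLIMIT_MEMLOCK or root to succeed.]

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;	/* 2MB, arbitrary */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (mlock(p, len)) {
		perror("mlock");	/* likely RLIMIT_MEMLOCK too low */
		return 1;
	}
	if (madvise(p, len, MADV_DONTNEED))
		/* expected: EINVAL, because the range is locked */
		printf("madvise(MADV_DONTNEED) on locked range: %s\n",
		       strerror(errno));
	else
		printf("madvise(MADV_DONTNEED) unexpectedly succeeded\n");

	return munmap(p, len) ? 1 : 0;	/* munmap() handles locked VMAs fine */
}
```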
diff --git a/mm/mmap.c b/mm/mmap.c
index 336bee8c4e25..86ffb179c3b5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2811,6 +2811,8 @@ EXPORT_SYMBOL(vm_munmap);
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
+	if (len > LARGE_NUMBER)
+		do_madvise(addr, len, MADV_DONTNEED);
 	return vm_munmap(addr, len);
 }