Message ID | 1452864188-2417-25-git-send-email-ian.campbell@citrix.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote: > This avoids a potential issue with a fork after allocation but before > madvise. > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > --- > v7: New, replacing "tools/libs/call: linux: avoid forking between mmap > and madvise". > --- > tools/libs/call/linux.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c > index 3641e41..651f380 100644 > --- a/tools/libs/call/linux.c > +++ b/tools/libs/call/linux.c I didn't notice you only handled this for Linux until now. I think FreeBSD and NetBSD need similar treatment, too? But then current BSD* code doesn't even support DONTFORK in madvise. Adding Roger for more input. The changes in this patch look fine to me. Wei. > @@ -88,7 +88,7 @@ void *osdep_alloc_pages(xencall_handle *xcall, size_t npages) > { > size_t size = npages * PAGE_SIZE; > void *p; > - int rc, saved_errno; > + int rc, i, saved_errno; > > /* Address returned by mmap is page aligned. */ > p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0); > @@ -107,6 +107,18 @@ void *osdep_alloc_pages(xencall_handle *xcall, size_t npages) > goto out; > } > > + /* > + * Touch each page in turn to force them to be un-CoWed, in case a > + * fork happened in another thread at an inopportune moment > + * above. The madvise() will prevent any subsequent fork calls from > + * causing the same problem. > + */ > + for ( i = 0; i < npages ; i++ ) > + { > + char *c = (char *)p + (i*PAGE_SIZE); > + *c = 0; > + } > + > return p; > > out: > -- > 2.1.4 >
On Tue, 2016-01-19 at 13:24 +0000, Wei Liu wrote: > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote: > > This avoids a potential issue with a fork after allocation but before > > madvise. > > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > > --- > > v7: New, replacing "tools/libs/call: linux: avoid forking between mmap > > and madvise". > > --- > > tools/libs/call/linux.c | 14 +++++++++++++- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c > > index 3641e41..651f380 100644 > > --- a/tools/libs/call/linux.c > > +++ b/tools/libs/call/linux.c > > I didn't notice you only handled this for Linux until now. > > I think FreeBSD and NetBSD need similar treatment, too? But then current > BSD* code doesn't even support DONTFORK in madvise. I think any updates to the *BSD side should come as separate improvements. If they are necessary at all, the Linux stuff comes from Linux making use of particular behaviours on mlock()'s memory (specifically, IIRC, doing things such as THP and NUMA balancing which can cause page faults on the mlock'd pages despite them being locked) which are (possibly) allowed by POSIX, but *BSD may not take the same interpretation. > Adding Roger for more input. > > The changes in this patch look fine to me. May I take that as an Ack?
On Tue, Jan 19, 2016 at 01:40:55PM +0000, Ian Campbell wrote: > On Tue, 2016-01-19 at 13:24 +0000, Wei Liu wrote: > > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote: > > > This avoids a potential issue with a fork after allocation but before > > > madvise. > > > > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > > > --- > > > v7: New, replacing "tools/libs/call: linux: avoid forking between mmap > > > and madvise". > > > --- > > > tools/libs/call/linux.c | 14 +++++++++++++- > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c > > > index 3641e41..651f380 100644 > > > --- a/tools/libs/call/linux.c > > > +++ b/tools/libs/call/linux.c > > > > I didn't notice you only handled this for Linux until now. > > > > I think FreeBSD and NetBSD need similar treatment, too? But then current > > BSD* code doesn't even support DONTFORK in madvise. > > I think any updates to the *BSD side should come as separate improvements. > > If they are necessary at all, the Linux stuff comes from Linux making use > of particular behaviours on mlock()'s memory (specifically, IIRC, doing > things such as THP and NUMA balancing which can cause page faults on the > mlock'd pages despite them being locked) which are (possibly) allowed by > POSIX, but *BSD may not take the same interpretation. > > > Adding Roger for more input. > > > > The changes in this patch look fine to me. > > May I take that as an Ack? > Yes.
El 19/01/16 a les 14.24, Wei Liu ha escrit: > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote: >> This avoids a potential issue with a fork after allocation but before >> madvise. >> >> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> >> --- >> v7: New, replacing "tools/libs/call: linux: avoid forking between mmap >> and madvise". >> --- >> tools/libs/call/linux.c | 14 +++++++++++++- >> 1 file changed, 13 insertions(+), 1 deletion(-) >> >> diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c >> index 3641e41..651f380 100644 >> --- a/tools/libs/call/linux.c >> +++ b/tools/libs/call/linux.c > > I didn't notice you only handled this for Linux until now. > > I think FreeBSD and NetBSD need similar treatment, too? But then current > BSD* code doesn't even support DONTFORK in madvise. > > Adding Roger for more input. Hm, right, thanks for noticing this. I don't think FreeBSD needs a similar treatment (pre-faulting), because mlock will remove any CoW when making the pages wired. Also, AFAICT we don't need to call madvise or minherit(2) because mlock(2) already takes care of preventing the memory region from being copied to the child on fork: "Locked mappings are not inherited by the child process after a fork(2)." [0] So I think we are safe on the FreeBSD side. Roger. [0] https://www.freebsd.org/cgi/man.cgi?query=mlock
On Tue, Jan 19, 2016 at 03:54:54PM +0100, Roger Pau Monné wrote: > El 19/01/16 a les 14.24, Wei Liu ha escrit: > > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote: > >> This avoids a potential issue with a fork after allocation but before > >> madvise. > >> > >> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > >> --- > >> v7: New, replacing "tools/libs/call: linux: avoid forking between mmap > >> and madvise". > >> --- > >> tools/libs/call/linux.c | 14 +++++++++++++- > >> 1 file changed, 13 insertions(+), 1 deletion(-) > >> > >> diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c > >> index 3641e41..651f380 100644 > >> --- a/tools/libs/call/linux.c > >> +++ b/tools/libs/call/linux.c > > > > I didn't notice you only handled this for Linux until now. > > > > I think FreeBSD and NetBSD need similar treatment, too? But then current > > BSD* code doesn't even support DONTFORK in madvise. > > > > Adding Roger for more input. > > Hm, right, thanks for noticing this. I don't think FreeBSD needs a > similar treatment (pre-faulting), because mlock will remove any CoW when > making the pages wired. > > Also, AFAICT we don't need to call madvise or minherit(2) because > mlock(2) already takes care of preventing the memory region from being > copied to the child on fork: > > "Locked mappings are not inherited by the child process after a > fork(2)." [0] > > So I think we are safe on the FreeBSD side. > But what if the process forks between mmap and mlock? I think that warrants touching the area like we do for Linux here. Wei. > Roger. > > [0] https://www.freebsd.org/cgi/man.cgi?query=mlock >
On Tue, 2016-01-19 at 14:58 +0000, Wei Liu wrote: > On Tue, Jan 19, 2016 at 03:54:54PM +0100, Roger Pau Monné wrote: > > El 19/01/16 a les 14.24, Wei Liu ha escrit: > > > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote: > > > > This avoids a potential issue with a fork after allocation but > > > > before > > > > madvise. > > > > > > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > > > > --- > > > > v7: New, replacing "tools/libs/call: linux: avoid forking between > > > > mmap > > > > and madvise". > > > > --- > > > > tools/libs/call/linux.c | 14 +++++++++++++- > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c > > > > index 3641e41..651f380 100644 > > > > --- a/tools/libs/call/linux.c > > > > +++ b/tools/libs/call/linux.c > > > > > > I didn't notice you only handled this for Linux until now. > > > > > > I think FreeBSD and NetBSD need similar treatment, too? But then > > > current > > > BSD* code doesn't even support DONTFORK in madvise. > > > > > > Adding Roger for more input. > > > > Hm, right, thanks for noticing this. I don't think FreeBSD needs a > > similar treatment (pre-faulting), because mlock will remove any CoW > > when > > making the pages wired. > > > > Also, AFAICT we don't need to call madvise or minherit(2) because > > mlock(2) already takes care of preventing the memory region from being > > copied to the child on fork: > > > > "Locked mappings are not inherited by the child process after a > > fork(2)." [0] > > > > So I think we are safe on the FreeBSD side. > > > > But what if the process forks between mmap and mlock? I think that > warrants touching the area like we do for Linux here. mlock guarantees the memory is populated, I think, which is equivalent to touching it. On Linux we use madvise not mlock, which doesn't make the same claims. > Ian.
On Tue, Jan 19, 2016 at 03:03:31PM +0000, Ian Campbell wrote: > On Tue, 2016-01-19 at 14:58 +0000, Wei Liu wrote: > > On Tue, Jan 19, 2016 at 03:54:54PM +0100, Roger Pau Monné wrote: > > > El 19/01/16 a les 14.24, Wei Liu ha escrit: > > > > On Fri, Jan 15, 2016 at 01:23:03PM +0000, Ian Campbell wrote: > > > > > This avoids a potential issue with a fork after allocation but > > > > > before > > > > > madvise. > > > > > > > > > > Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > > > > > --- > > > > > v7: New, replacing "tools/libs/call: linux: avoid forking between > > > > > mmap > > > > > and madvise". > > > > > --- > > > > > tools/libs/call/linux.c | 14 +++++++++++++- > > > > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c > > > > > index 3641e41..651f380 100644 > > > > > --- a/tools/libs/call/linux.c > > > > > +++ b/tools/libs/call/linux.c > > > > > > > > I didn't notice you only handled this for Linux until now. > > > > > > > > I think FreeBSD and NetBSD need similar treatment, too? But then > > > > current > > > > BSD* code doesn't even support DONTFORK in madvise. > > > > > > > > Adding Roger for more input. > > > > > > Hm, right, thanks for noticing this. I don't think FreeBSD needs a > > > similar treatment (pre-faulting), because mlock will remove any CoW > > > when > > > making the pages wired. > > > > > > Also, AFAICT we don't need to call madvise or minherit(2) because > > > mlock(2) already takes care of preventing the memory region from being > > > copied to the child on fork: > > > > > > "Locked mappings are not inherited by the child process after a > > > fork(2)." [0] > > > > > > So I think we are safe on the FreeBSD side. > > > > > > > But what if the process forks between mmap and mlock? I think that > > warrants touching the area like we do for Linux here. > > mlock guarantees the memory is populated, I think, which is equivalent to > touching it. > > On Linux we use madvise not mlock, which doesn't make the same claims. > I see. I wonder why we didn't use mlock(2) in Linux too in the first place. Wei. > > > Ian.
On Tue, 2016-01-19 at 15:49 +0000, Wei Liu wrote: > > I see. I wonder why we didn't use mlock(2) in Linux too in the first > place. We did, but mlock(2) on Linux doesn't guarantee that there will be no page faults, it only guarantees that there will be no page faults which require I/O to satisfy (~= it won't swap the page out). Some text calls faults which require I/O "major" vs "minor" page faults which don't need I/O to satisfy. But Linux does things which can result in a supposedly locked PTE being not present or not-writable, which results minor page faults. The main cause of that is CoW due to fork (which affects both the parent and child). Other ones could include being in the middle of moving pages between NUMA nodes (current/source page is R/O) and THP (likewise content moving around). BSD says (https://www.freebsd.org/cgi/man.cgi?query=mlock): After an mlock() system call, the indicated pages will cause neither a non-resident page nor address-translation fault until they are unlocked. which I think rules out both major and minor faults (i.e. faults of any kind, which is what we need to pass the memory through to Xen). Using madvise() instead of mlock() on Linux is really just a hack, in that it gets us closer to the semantics we need, but isn't really the same. Really we ought to provide a /dev/xen/foo from which memory can be allocated with the right properties (much like /dev/xen/gntalloc does for grantable memory). Ian.
diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c index 3641e41..651f380 100644 --- a/tools/libs/call/linux.c +++ b/tools/libs/call/linux.c @@ -88,7 +88,7 @@ void *osdep_alloc_pages(xencall_handle *xcall, size_t npages) { size_t size = npages * PAGE_SIZE; void *p; - int rc, saved_errno; + int rc, i, saved_errno; /* Address returned by mmap is page aligned. */ p = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0); @@ -107,6 +107,18 @@ void *osdep_alloc_pages(xencall_handle *xcall, size_t npages) goto out; } + /* + * Touch each page in turn to force them to be un-CoWed, in case a + * fork happened in another thread at an inopportune moment + * above. The madvise() will prevent any subsequent fork calls from + * causing the same problem. + */ + for ( i = 0; i < npages ; i++ ) + { + char *c = (char *)p + (i*PAGE_SIZE); + *c = 0; + } + return p; out:
This avoids a potential issue with a fork after allocation but before madvise. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> --- v7: New, replacing "tools/libs/call: linux: avoid forking between mmap and madvise". --- tools/libs/call/linux.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)