Message ID | 20160927030621.20862-1-rafael.tinoco@canonical.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote: > Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback > mechanism for systems not supporting memfd_create syscall (started > being supported since 3.17). This is really dubious code in general and IMHO should just be reverted. We have a golden rule that any time QEMU needs to be able to create a file on disk, then the path should be explicitly provided as a command line argument so that mgmt apps can control the location used. > Backporting memfd_create might not be accepted for distros relying > on older kernels. Nowadays there is no way for security driver > to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX. > > It is more appropriate to include UUID and/or VM names in the > temporary filename, allowing security driver rules to be applied > while maintaining the required unpredictability with mkstemp. We should not have QEMU creating unpredictabile filenames in the first place - any filenames should be determined by libvirt explicitly. > This change will allow libvirt to know exact memfd file to be created > for vhost log AND to create appropriate security rules to allow access > per instance (instead of a opened rule like <tmpdir>/memfd-*). Even with this change it is bad - we don't want driver backends creating arbitrary files in the shared /tmp directory. Regards, Daniel
> On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote: > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote: >> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback >> mechanism for systems not supporting memfd_create syscall (started >> being supported since 3.17). > > This is really dubious code in general and IMHO should just > be reverted. There are numerous people relying on older kernels in openstack deployments - sometimes with specific drivers (ovswitch, dpdk, infiniband) holding kernel upgrades - but still in need of upgrading userland (e.g. newer releases). Having a fallback mechanism seems appropriate for those cases. > > We have a golden rule that any time QEMU needs to be able to > create a file on disk, then the path should be explicitly > provided as a command line argument so that mgmt apps can > control the location used. > >> Backporting memfd_create might not be accepted for distros relying >> on older kernels. Nowadays there is no way for security driver >> to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX. >> >> It is more appropriate to include UUID and/or VM names in the >> temporary filename, allowing security driver rules to be applied >> while maintaining the required unpredictability with mkstemp. > > We should not have QEMU creating unpredictabile filenames in the > first place - any filenames should be determined by libvirt > explicitly. Note that the filename, per se, is not as important as other files, since qemu won't provide it for being accessed by external programs, and, deletes the file, while keeping the descriptor, right after its creation (due to its nature, that is probably why it was created in /tmp). Having libvirt to define a filename that would not be used for recent kernels (> 3.17) and would exist for a fraction of second doesn't seem right to me. > >> This change will allow libvirt to know exact memfd file to be created >> for vhost log AND to create appropriate security rules to allow access >> per instance (instead of a opened rule like <tmpdir>/memfd-*). > > Even with this change it is bad - we don't want driver backends > creating arbitrary files in the shared /tmp directory. On the other hand, if we are creating a tmp file, like I said, I see benefit on having unpredictability (mkstemp), but providing predictable parts to allow security driver to apply rules per instance basis (/tmp/memfd-UUID*, /tmp/memfd-VMname*). Looking forward to a decision so I can backport correct behaviour (with or without memfd file). Thank you! Best Regards, Rafael
Hi ----- Original Message ----- > > > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote: > > > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote: > > We should not have QEMU creating unpredictabile filenames in the > > first place - any filenames should be determined by libvirt > > explicitly. > > Note that the filename, per se, is not as important as other files, > since qemu won't provide it for being accessed by external programs, and, > deletes the file, while keeping the descriptor, right after its creation > (due to its nature, that is probably why it was created in /tmp). > > Having libvirt to define a filename that would not be used for recent > kernels (> 3.17) and would exist for a fraction of second doesn't seem > right to me. > There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules?
Hello! > On Sep 27, 2016, at 08:13, Marc-André Lureau <mlureau@redhat.com> wrote: > >> Note that the filename, per se, is not as important as other files, >> since qemu won't provide it for being accessed by external programs, and, >> deletes the file, while keeping the descriptor, right after its creation >> (due to its nature, that is probably why it was created in /tmp). >> >> Having libvirt to define a filename that would not be used for recent >> kernels (> 3.17) and would exist for a fraction of second doesn't seem >> right to me. >> > > There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules? You got a point. With a per-vm TMPDIR we don't have to care about filenames in future for the security driver, while still securing them per-instance base. I'll come back to you! Thank you!
On Tue, Sep 27, 2016 at 07:13:55AM -0400, Marc-André Lureau wrote: > Hi > > ----- Original Message ----- > > > > > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote: > > > > > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote: > > > We should not have QEMU creating unpredictabile filenames in the > > > first place - any filenames should be determined by libvirt > > > explicitly. > > > > Note that the filename, per se, is not as important as other files, > > since qemu won't provide it for being accessed by external programs, and, > > deletes the file, while keeping the descriptor, right after its creation > > (due to its nature, that is probably why it was created in /tmp). > > > > Having libvirt to define a filename that would not be used for recent > > kernels (> 3.17) and would exist for a fraction of second doesn't seem > > right to me. > > > > There are other parts of qemu that rely on creating temporary files, and > this seems to lack a bit of uniformity. Would it make sense to define a > place where qemu could create those? Or setting TMPDIR should help too. > Could libvirt set a per-vm TMPDIR with appropriate security rules? The other places that use mkstemp are block for snapshot=on, which libvirt does not support as we want control over the filename. This needs fixing by allowing a filename to be given. The qemu sockets code uses it for auto-creating a UNIX domain socket path, but again libvirt doesn't support that usage. The exec.c file uses it, but that honours an explicit directory path provided on the command line. So this memfd code really is the first place which is causing a real Just setting TMPDIR per VM doesn't magically solve all these cases as it isn't reasonable to assume that all these files should be in the same location. Certainly block snapshot file will be somewhere different from others, due to its size. Regards, Daniel
On Tue, Sep 27, 2016 at 11:01:10AM -0000, Rafael David Tinoco wrote: > > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote: > > > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote: > >> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback > >> mechanism for systems not supporting memfd_create syscall (started > >> being supported since 3.17). > > > > This is really dubious code in general and IMHO should just > > be reverted. > > There are numerous people relying on older kernels in openstack > deployments - sometimes with specific drivers (ovswitch, dpdk, > infiniband) holding kernel upgrades - but still in need of upgrading > userland (e.g. newer releases). Having a fallback mechanism seems > appropriate for those cases. I'm not against some kind of fallback - just about the way it silently creates files in /tmp. > > Note that the filename, per se, is not as important as other files, > since qemu won't provide it for being accessed by external programs, and, > deletes the file, while keeping the descriptor, right after its creation > (due to its nature, that is probably why it was created in /tmp). If it doesn't shared with other processes, and is deleted immediately, why does the file need to be on disk at all ? Regards, Daniel
Hi ----- Original Message ----- > On Tue, Sep 27, 2016 at 07:13:55AM -0400, Marc-André Lureau wrote: > > Hi > > > > ----- Original Message ----- > > > > > > > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> > > > > wrote: > > > > > > > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote: > > > > We should not have QEMU creating unpredictabile filenames in the > > > > first place - any filenames should be determined by libvirt > > > > explicitly. > > > > > > Note that the filename, per se, is not as important as other files, > > > since qemu won't provide it for being accessed by external programs, and, > > > deletes the file, while keeping the descriptor, right after its creation > > > (due to its nature, that is probably why it was created in /tmp). > > > > > > Having libvirt to define a filename that would not be used for recent > > > kernels (> 3.17) and would exist for a fraction of second doesn't seem > > > right to me. > > > > > > > There are other parts of qemu that rely on creating temporary files, and > > this seems to lack a bit of uniformity. Would it make sense to define a > > place where qemu could create those? Or setting TMPDIR should help too. > > Could libvirt set a per-vm TMPDIR with appropriate security rules? > > The other places that use mkstemp are block for snapshot=on, which > libvirt does not support as we want control over the filename. This > needs fixing by allowing a filename to be given. The qemu sockets code > uses it for auto-creating a UNIX domain socket path, but again libvirt > doesn't support that usage. The exec.c file uses it, but that honours > an explicit directory path provided on the command line. So this memfd > code really is the first place which is causing a real Have you reviewed the hundreds of libraries qemu link to? :) > Just setting TMPDIR per VM doesn't magically solve all these cases as > it isn't reasonable to assume that all these files should be in the > same location. Certainly block snapshot file will be somewhere different > from others, due to its size. I am not claiming it solves all problems, but at least it seems it would be quite appropriate for security concerns to have per-vm TMPDIR.
Sorry, I was only able to come back to this today. > On Sep 27, 2016, at 09:18, Daniel Berrange <1626972@bugs.launchpad.net> wrote: > >> There are numerous people relying on older kernels in openstack >> deployments - sometimes with specific drivers (ovswitch, dpdk, >> infiniband) holding kernel upgrades - but still in need of upgrading >> userland (e.g. newer releases). Having a fallback mechanism seems >> appropriate for those cases. > > I'm not against some kind of fallback - just about the way it > silently creates files in /tmp. > That is why memfd_create is used here I suppose: To allow anonymous-backed-pages to have a descriptor and to be sealed. When falling back this mechanism I don't see any other way other than creating a temporary file. Of course one way would be something like: http://paste.ubuntu.com/23270379/ But this is pretty much the same, just solving the "where to place the temporary file" (non configurable for this usage). >> >> Note that the filename, per se, is not as important as other files, >> since qemu won't provide it for being accessed by external programs, and, >> deletes the file, while keeping the descriptor, right after its creation >> (due to its nature, that is probably why it was created in /tmp). > > If it doesn't shared with other processes, and is deleted immediately, > why does the file need to be on disk at all ? Well, it unlinks the file but the references are still there while the descriptor isn't closed by this process, or by the one that receives the descriptor (that is why is the "unlink" so early). If you check vhost_dev_log_resize(), it gets *possible* new vhost log (if a new size is given) and informs the vhost dev driver about the new log base (vhost_ops->vhost_set_log_base). For vhost_user, this means that the file descriptors for vhost logs are likely going to be passed to vhost backend (fds[] in vhost_user_set_log_base). This is just one example, not sure about others. Probably the best approach here, like what Marc-André said, is to create some sort of TMPDIR, set by libvirt perhaps ? > > Regards, > Daniel
Hello Marc, > On Sep 27, 2016, at 08:13, Marc-André Lureau <mlureau@redhat.com> wrote: > >>> On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote: >>> We should not have QEMU creating unpredictabile filenames in the >>> first place - any filenames should be determined by libvirt >>> explicitly. >> >> Note that the filename, per se, is not as important as other files, >> since qemu won't provide it for being accessed by external programs, and, >> deletes the file, while keeping the descriptor, right after its creation >> (due to its nature, that is probably why it was created in /tmp). >> >> Having libvirt to define a filename that would not be used for recent >> kernels (> 3.17) and would exist for a fraction of second doesn't seem >> right to me. >> > > There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules? Best move I can see. Only problem is that if we do that, we would have to create a fallback mechanism for when TMPDIR is not set. It would go back to /tmp ? In my particular case (for 1 vhost log file): -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3 I could have something similar to: -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3,vhostpath=/var/lib/XXXX/YYYY/ and put mkstemp() files (one per vhost device) in there. Even so, what to do when "vhostpath" is not informed ? I'm worried that, right now there are security drivers either blocking the live migration entirely or allowing all instances to be able to read /tmp/memfd-XXXX. Don't you think we could push the first patch until we come up with a better approach for the tmp (and default tmp) files & directories ? The patch is not worse than what was committed already. Tks Rafael
On Mon, Oct 03, 2016 at 03:41:10PM -0000, Rafael David Tinoco wrote: > Sorry, I was only able to come back to this today. > > > On Sep 27, 2016, at 09:18, Daniel Berrange <1626972@bugs.launchpad.net> wrote: > > > >> There are numerous people relying on older kernels in openstack > >> deployments - sometimes with specific drivers (ovswitch, dpdk, > >> infiniband) holding kernel upgrades - but still in need of upgrading > >> userland (e.g. newer releases). Having a fallback mechanism seems > >> appropriate for those cases. > > > > I'm not against some kind of fallback - just about the way it > > silently creates files in /tmp. > > > > That is why memfd_create is used here I suppose: To allow anonymous- > backed-pages to have a descriptor and to be sealed. When falling back > this mechanism I don't see any other way other than creating a temporary > file. Of course one way would be something like: > > http://paste.ubuntu.com/23270379/ > > But this is pretty much the same, just solving the "where to place the > temporary file" (non configurable for this usage). > > >> > >> Note that the filename, per se, is not as important as other files, > >> since qemu won't provide it for being accessed by external programs, and, > >> deletes the file, while keeping the descriptor, right after its creation > >> (due to its nature, that is probably why it was created in /tmp). > > > > If it doesn't shared with other processes, and is deleted immediately, > > why does the file need to be on disk at all ? > > Well, it unlinks the file but the references are still there while the > descriptor isn't closed by this process, or by the one that receives the > descriptor (that is why is the "unlink" so early). > > If you check vhost_dev_log_resize(), it gets *possible* new vhost log > (if a new size is given) and informs the vhost dev driver about the new > log base (vhost_ops->vhost_set_log_base). > > For vhost_user, this means that the file descriptors for vhost logs are > likely going to be passed to vhost backend (fds[] in > vhost_user_set_log_base). This is just one example, not sure about > others. > > Probably the best approach here, like what Marc-André said, is to create > some sort of TMPDIR, set by libvirt perhaps ? So you're saying that the file descriptor here is actually getting passed to a different process for it to use ? If so that means we definitely do not want this in TMPDIR. If we create a generic file in TMPDIR, then its going to have a generic security label. That means that the other process we're giving the FD to is going to have to be granted permission to access this FD and we certainly don't want to grant permission for it to access any of QEMU's other FDs. So for the SELinux integration, we'll need this FD to be in a specific directory, so that we can setup policy such that the file created gets given a specific SELinux label. We can then grant the other process access to only that particular file, and not anything else of QEMU's. This makes me wonder about the memfd_create() code path too - we'll again not want that external process to be granted access to arbitrary FDs of QEMU's and I'm not sure of a way to get the memfd FD to have a specific label. So I think it is possible that when using libvirt we'll want the ability to tell QEMU to *always* use an explicit file in a path libvirt specifies, and never use memfd even if available. Regards, Daniel
Hello Daniel, > On Oct 03, 2016, at 14:55, Daniel P. Berrange <berrange@redhat.com> wrote: > >> Well, it unlinks the file but the references are still there while the >> descriptor isn't closed by this process, or by the one that receives the >> descriptor (that is why is the "unlink" so early). >> >> If you check vhost_dev_log_resize(), it gets *possible* new vhost log >> (if a new size is given) and informs the vhost dev driver about the new >> log base (vhost_ops->vhost_set_log_base). >> >> For vhost_user, this means that the file descriptors for vhost logs are >> likely going to be passed to vhost backend (fds[] in >> vhost_user_set_log_base). This is just one example, not sure about >> others. >> >> Probably the best approach here, like what Marc-André said, is to create >> some sort of TMPDIR, set by libvirt perhaps ? > > So you're saying that the file descriptor here is actually getting > passed to a different process for it to use ? > > If so that means we definitely do not want this in TMPDIR. If we > create a generic file in TMPDIR, then its going to have a generic > security label. That means that the other process we're giving the > FD to is going to have to be granted permission to access this FD > and we certainly don't want to grant permission for it to access > any of QEMU's other FDs. So for the SELinux integration, we'll > need this FD to be in a specific directory, so that we can setup > policy such that the file created gets given a specific SELinux > label. We can then grant the other process access to only that > particular file, and not anything else of QEMU's. > > This makes me wonder about the memfd_create() code path too - we'll > again not want that external process to be granted access to arbitrary > FDs of QEMU's and I'm not sure of a way to get the memfd FD to have > a specific label. So I think it is possible that when using libvirt > we'll want the ability to tell QEMU to *always* use an explicit file > in a path libvirt specifies, and never use memfd even if available. Check this execution path: (vhost_vsock_device_realize) vhost_dev_init vhost_commit |- vhost_get_log_size |... |- vhost_dev_log_resize (vhost_dev_log_resize): vhost_log_get -> here if the size is bigger, a new log is created dev->vhost_ops->vhost_set_log_base() -> kernel or user vhost driver vhost_log_put() ---- So, * In case of the kernel mode, this is just a: vhost in kernel mode = vhost_kernel_set_log_base return vhost_kernel_call(dev, VHOST_SET_LOG_BASE, &base); which makes an ioctl to dev->opaque file descriptor to set a new vhost log base. * But in the case of user mode: vhost in user mode = vhost_user_set_log_base which gets the log file descriptor (log->fd) and gives to vhost_user_write. vhost_user_write will do a qemu_chr_fe_set_msgfds passing the log file descriptors for the backend vhost driver (CharDriverState). If I'm reading this right.. if the backend driver is: static int tcp_set_msgfds(CharDriverState *chr, int *fds, int num) it would check for: !qio_channel_has_feature(s->ioc, QIO_CHANNEL_FEATURE_FD_PASS)) { and configure s->write_msgfds. This would be sent in: static int tcp_chr_write(CharDriverState *chr, const uint8_t *buf, int len) with "io_channel_send_full" + "qio_channel_writev_full + io_writev from QIOChannelClass. ---- https://www.berrange.com/posts/2016/08/16/ This, from your blog, probably confirms this behaviour: "The migration code supports a number of different protocols besides just “tcp:“. In particular it allows an “fd:” protocol to tell QEMU to use a passed-in file descriptor, and an “exec:” protocol to tell QEMU to launch an external command to tunnel the connection. It is desirable to be able to use TLS with these protocols too, but when using TLS the client QEMU needs to know the hostname of the target QEMU in order to correctly validate the x509 certificate it receives. Thus, a second “tls-hostname” parameter was added to allow QEMU to be informed of the hostname to use for x509 certificate validation when using a non-tcp migration protocol. This can be set on the source QEMU prior to starting the migration using the “migrate_set_str_parameter” monitor command" =)
Yes, definitely. Check this: /** * @qemu_chr_fe_set_msgfds: * * For backends capable of fd passing, set an array of fds to be passed with * the next send operation. * A subsequent call to this function before calling a write function will * result in overwriting the fd array with the new value without being send. * Upon writing the message the fd array is freed. * * Returns: -1 if fd passing isn't supported. */ int qemu_chr_fe_set_msgfds(CharDriverState *s, int *fds, int num); ---- So, at least for vhost_dev_log_resize, this "interface" is being implemented: vhost_user_set_log_base -> VhostUserMsg = VHOST_USER_SET_LOG_BASE vhost_user_write(with the VHOST_USER_GET_LOG_BASE message): - configures the file descriptors(... , fds, fd_num) qemu_chr_fe_set_msgfds - writes them down the char driver qemu_chr_fe_write_all > On Oct 03, 2016, at 15:46, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote: > >> So you're saying that the file descriptor here is actually getting >> passed to a different process for it to use ?
On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote: > Yes, definitely. Check this: [snip] So in that case, I think we must add ability to specify an explicit path that apps can use *regardles* of whether memfd support exists or not. > > On Oct 03, 2016, at 15:46, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote: > > > >> So you're saying that the file descriptor here is actually getting > >> passed to a different process for it to use ? > Regards, Daniel
Let me work on it. I'll get back soon. Tks Daniel. > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote: > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote: >> Yes, definitely. Check this: > > [snip] > > So in that case, I think we must add ability to specify an explicit path > that apps can use *regardles* of whether memfd support exists or not. > >>> On Oct 03, 2016, at 15:46, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote: >>> >>>> So you're saying that the file descriptor here is actually getting >>>> passed to a different process for it to use ?
Hi Rafael, Daniel, On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco < rafael.tinoco@canonical.com> wrote: > Let me work on it. I'll get back soon. > > thanks for working on it, before that I have a few questions: Tks Daniel. > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> > wrote: > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote: > >> Yes, definitely. Check this: > > > > [snip] > > > > So in that case, I think we must add ability to specify an explicit path > > that apps can use *regardles* of whether memfd support exists or not. > How will this path be used? Is it going to be global to qemu for various use (kinda like $TMP), or per-device, or for memfd fallback only? Should the path pre-exist? (I suppose, if not, qemu should clean it up when leaving) > > >>> On Oct 03, 2016, at 15:46, Rafael David Tinoco < > rafael.tinoco@canonical.com> wrote: > >>> > >>>> So you're saying that the file descriptor here is actually getting > >>>> passed to a different process for it to use ? > > > -- Marc-André Lureau
On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote: > Hi Rafael, Daniel, > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco < > rafael.tinoco@canonical.com> wrote: > > > Let me work on it. I'll get back soon. > > > > > thanks for working on it, before that I have a few questions: > > Tks Daniel. > > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> > > wrote: > > > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote: > > >> Yes, definitely. Check this: > > > > > > [snip] > > > > > > So in that case, I think we must add ability to specify an explicit path > > > that apps can use *regardles* of whether memfd support exists or not. > > > > How will this path be used? Is it going to be global to qemu for various > use (kinda like $TMP), or per-device, or for memfd fallback only? Should > the path pre-exist? (I suppose, if not, qemu should clean it up when > leaving) I'd expect it to be an option set against the vhost user backend, since that's the thing using this. If other things have similar usage needs wrt memfd in future, they would also need similar path config option. Regards, Daniel
Hi On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <berrange@redhat.com> wrote: > On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote: > > Hi Rafael, Daniel, > > > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco < > > rafael.tinoco@canonical.com> wrote: > > > > > Let me work on it. I'll get back soon. > > > > > > > > thanks for working on it, before that I have a few questions: > > > > Tks Daniel. > > > > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> > > > wrote: > > > > > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote: > > > >> Yes, definitely. Check this: > > > > > > > > [snip] > > > > > > > > So in that case, I think we must add ability to specify an explicit > path > > > > that apps can use *regardles* of whether memfd support exists or not. > > > > > > > How will this path be used? Is it going to be global to qemu for various > > use (kinda like $TMP), or per-device, or for memfd fallback only? Should > > the path pre-exist? (I suppose, if not, qemu should clean it up when > > leaving) > > I'd expect it to be an option set against the vhost user backend, since > that's the thing using this. > > If other things have similar usage needs wrt memfd in future, they would > also need similar path config option. > The log may be shared if there are several vhost-user (stored in vhost_log_shm global), so I think it makes more sense to have a global config path for it, or you may end up duplicating that information per vhost backend and having files in either of the specified paths. > > > Regards, > Daniel > -- > |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ > :| > |: http://libvirt.org -o- http://virt-manager.org > :| > |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ > :| >
On Tue, Oct 04, 2016 at 01:10:17PM +0000, Marc-André Lureau wrote: > Hi > > On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <berrange@redhat.com> > wrote: > > > On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote: > > > Hi Rafael, Daniel, > > > > > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco < > > > rafael.tinoco@canonical.com> wrote: > > > > > > > Let me work on it. I'll get back soon. > > > > > > > > > > > thanks for working on it, before that I have a few questions: > > > > > > Tks Daniel. > > > > > > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> > > > > wrote: > > > > > > > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote: > > > > >> Yes, definitely. Check this: > > > > > > > > > > [snip] > > > > > > > > > > So in that case, I think we must add ability to specify an explicit > > path > > > > > that apps can use *regardles* of whether memfd support exists or not. > > > > > > > > > > How will this path be used? Is it going to be global to qemu for various > > > use (kinda like $TMP), or per-device, or for memfd fallback only? Should > > > the path pre-exist? (I suppose, if not, qemu should clean it up when > > > leaving) > > > > I'd expect it to be an option set against the vhost user backend, since > > that's the thing using this. > > > > If other things have similar usage needs wrt memfd in future, they would > > also need similar path config option. > > > > The log may be shared if there are several vhost-user (stored in > vhost_log_shm global), so I think it makes more sense to have a global > config path for it, or you may end up duplicating that information per > vhost backend and having files in either of the specified paths. Hmm, is there a reason why it is shared? That seems to make an assumption that all vhost-user backends would be managed by the same external process. While that may be the common case today, it doesn't feel like a reasonable assumption to make long term. IOW it feels wiser to have it set per-NIC unless I'm missing something important that means it must be shared ? Regards, Daniel
> On Oct 04, 2016, at 10:10, Marc-André Lureau <marcandre.lureau@gmail.com> wrote: > > > How will this path be used? Is it going to be global to qemu for various > > use (kinda like $TMP), or per-device, or for memfd fallback only? Should > > the path pre-exist? (I suppose, if not, qemu should clean it up when > > leaving) > > I'd expect it to be an option set against the vhost user backend, since > that's the thing using this. > > If other things have similar usage needs wrt memfd in future, they would > also need similar path config option. I was going for that approach. I could have something similar to: -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3,vhostpath=/var/lib/XXXX/YYYY/ > The log may be shared if there are several vhost-user (stored in vhost_log_shm global), so I think it makes more sense to have a global config path for it, or you may end up duplicating that information per vhost backend and having files in either of the specified paths. But, yes, indeed the vhost_log_shm makes that approach tricky. If sharing the same log file with multiple vhost backend. Besides, tools like openstack would put all the vhost log files in the same place at the end. Having a global config path, forced to be specified, orelse the vhost log isn't created, like when it fails nowadays. This seems to be the right approach.
True. What about having a single config parameter as a place to put all vhost logs for all drives for a single instance ? Remove the memfd implementation with all the memfd shared_memory option ? Replace it with a open+unlink+ftruncate+mmap approach only. This way every device would get its own log file and vhost-user backends would be able to get its file descriptors. (and, of course, allow the security drivers to do their jobs). >> On Oct 04, 2016, at 10:25, Daniel P. Berrange <berrange@redhat.com> wrote: >> >> Hmm, is there a reason why it is shared? That seems to make an assumption >> that all vhost-user backends would be managed by the same external process. >> While that may be the common case today, it doesn't feel like a reasonable >> assumption to make long term. IOW it feels wiser to have it set per-NIC >> unless I'm missing something important that means it must be shared ? >
Hi On Tue, Oct 4, 2016 at 5:25 PM Daniel P. Berrange <berrange@redhat.com> wrote: > On Tue, Oct 04, 2016 at 01:10:17PM +0000, Marc-André Lureau wrote: > > Hi > > > > On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <berrange@redhat.com> > > wrote: > > > > > On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote: > > > > Hi Rafael, Daniel, > > > > > > > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco < > > > > rafael.tinoco@canonical.com> wrote: > > > > > > > > > Let me work on it. I'll get back soon. > > > > > > > > > > > > > > thanks for working on it, before that I have a few questions: > > > > > > > > Tks Daniel. > > > > > > > > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange < > berrange@redhat.com> > > > > > wrote: > > > > > > > > > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco > wrote: > > > > > >> Yes, definitely. Check this: > > > > > > > > > > > > [snip] > > > > > > > > > > > > So in that case, I think we must add ability to specify an > explicit > > > path > > > > > > that apps can use *regardles* of whether memfd support exists or > not. > > > > > > > > > > > > > How will this path be used? Is it going to be global to qemu for > various > > > > use (kinda like $TMP), or per-device, or for memfd fallback only? > Should > > > > the path pre-exist? (I suppose, if not, qemu should clean it up when > > > > leaving) > > > > > > I'd expect it to be an option set against the vhost user backend, since > > > that's the thing using this. > > > > > > If other things have similar usage needs wrt memfd in future, they > would > > > also need similar path config option. > > > > > > > The log may be shared if there are several vhost-user (stored in > > vhost_log_shm global), so I think it makes more sense to have a global > > config path for it, or you may end up duplicating that information per > > vhost backend and having files in either of the specified paths. > > Hmm, is there a reason why it is shared? That seems to make an assumption > that all vhost-user backends would be managed by the same external process. > While that may be the common case today, it doesn't feel like a reasonable > assumption to make long term. IOW it feels wiser to have it set per-NIC > unless I'm missing something important that means it must be shared ? > > It's a shared log, just like they share the same ram. Duplicating the log would mostly make migration more difficult to handle and increase a bit memory usage. > Regards, > Daniel > -- > |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ > :| > |: http://libvirt.org -o- http://virt-manager.org > :| > |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ > :| >
On Tue, Oct 4, 2016 at 5:34 PM Rafael David Tinoco < rafael.tinoco@canonical.com> wrote: > True. > > What about having a single config parameter as a place to put all vhost > logs for all drives for a single instance ? Remove the memfd implementation > with all the memfd shared_memory option ? Replace it with a > open+unlink+ftruncate+mmap approach only. > > I fail to see your point, memfd is superior to open+unlink and has other advantages with sealing etc. Regarding shared log, see my previous reply to Daniel. This way every device would get its own log file and vhost-user backends > would be able to get its file descriptors. (and, of course, allow the > security drivers to do their jobs). > > >> On Oct 04, 2016, at 10:25, Daniel P. Berrange <berrange@redhat.com> > wrote: > >> > >> Hmm, is there a reason why it is shared? That seems to make an > assumption > >> that all vhost-user backends would be managed by the same external > process. > >> While that may be the common case today, it doesn't feel like a > reasonable > >> assumption to make long term. IOW it feels wiser to have it set per-NIC > >> unless I'm missing something important that means it must be shared ? > > > > -- Marc-André Lureau
> On Oct 04, 2016, at 10:50, Marc-André Lureau <marcandre.lureau@gmail.com> wrote: > > What about having a single config parameter as a place to put all vhost logs for all drives for a single instance ? Remove the memfd implementation with all the memfd shared_memory option ? Replace it with a open+unlink+ftruncate+mmap approach only. > > > I fail to see your point, memfd is superior to open+unlink and has other advantages with sealing etc. I was just summarising needs based on previous statement from Daniel: > This makes me wonder about the memfd_create() code path too - we'll > again not want that external process to be granted access to arbitrary > FDs of QEMU's and I'm not sure of a way to get the memfd FD to have > a specific label. So I think it is possible that when using libvirt > we'll want the ability to tell QEMU to *always* use an explicit file > in a path libvirt specifies, and never use memfd even if available. > > Regards, > Daniel
Hello Again, finally I could get back to this, and.. I was finishing a patch creating the open+truncate+mmap+unlink mechanism on files specified by "vhostlog" parameter of tap devices. Patch is done, problem is that... looks like the "memfd" is only used for shared logs AND vhost-net (used for tap devices) doesn't use it. In the following... (scenario 1) Linux kvm01 4.8.0-22-generic #24-Ubuntu SMP Sat Oct 8 09:15:00 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux with: -netdev tap,id=net0,vhost=on -device virtio-net-pci,netdev=net0,id=net0,mac=52:54:00:20:c5:42,bus=pci.0,addr=0x3 ## kvm01 $ ./instance.sh qemu_memfd_check qemu_memfd_alloc: enter qemu_memfd_alloc: memfd_create with no sealing qemu_memfd_alloc: memfd_create worked, truncating... qemu_memfd_alloc: mmaping qemu_memfd_free: enter qemu_memfd_check: ok vhost_dev_start: enter vhost_log_get: enter vhost_log_alloc: enter vhost_log_alloc: local vhost_log_get: not shared vhost_log_put: enter vhost_log_put: enter vhost_log_put: local free (qemu) migrate -d tcp:kvm02:4444 (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compres Migration status: completed total time: 14586 milliseconds downtime: 10 milliseconds setup: 20 milliseconds transferred ram: 377224 kbytes throughput: 212.02 mbps remaining ram: 0 kbytes total ram: 4001544 kbytes duplicate: 908879 pages skipped: 0 pages normal: 92129 pages normal bytes: 368516 kbytes dirty sync count: 4 ## kvm02 $ ./instance.sh qemu_memfd_check qemu_memfd_alloc: enter qemu_memfd_alloc: memfd_create with no sealing qemu_memfd_alloc: memfd_create worked, truncating... qemu_memfd_alloc: mmaping qemu_memfd_free: enter qemu_memfd_check: ok vhost_dev_start: enter (scenario 2) Linux kvm01 3.13.0-99-generic #146-Ubuntu SMP Wed Oct 12 20:56:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux with: -netdev tap,id=net0,vhost=on -device virtio-net-pci,netdev=net0,id=net0,mac=52:54:00:20:c5:42,bus=pci.0,addr=0x3 ## kvm01 $ ./instance.sh qemu_memfd_check qemu_memfd_alloc: enter qemu_memfd_alloc: memfd_create with no sealing qemu_memfd_alloc: memfd_create failed #2 qemu_memfd_alloc: fallback qemu_memfd_alloc: fname = /tmp/memfd-XXXXXX qemu_memfd_alloc: fallback truncating qemu_memfd_alloc: mmaping qemu_memfd_free qemu_memfd_check: ok vhost_dev_start: enter vhost_log_get: enter vhost_log_alloc: enter vhost_log_alloc: local vhost_log_get: not shared vhost_log_put: enter vhost_log_put: enter vhost_log_put: local free (qemu) migrate -d tcp:kvm02:4444 (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compres Migration status: completed total time: 15400 milliseconds downtime: 9 milliseconds setup: 5 milliseconds transferred ram: 375812 kbytes throughput: 199.99 mbps remaining ram: 0 kbytes total ram: 4001544 kbytes duplicate: 909186 pages skipped: 0 pages normal: 91776 pages normal bytes: 367104 kbytes dirty sync count: 3 ## kvm02 $ ./instance.sh qemu_memfd_check qemu_memfd_alloc: enter qemu_memfd_alloc: memfd_create with no sealing qemu_memfd_alloc: memfd_create failed #2 qemu_memfd_alloc: fallback qemu_memfd_alloc: fname = /tmp/memfd-XXXXXX qemu_memfd_alloc: fallback truncating qemu_memfd_alloc: mmaping qemu_memfd_free qemu_memfd_check: ok vhost_dev_start: enter For kvm01, we have 2 parts: (1) From "-netdev tap,id=net0,vhost=on": - net_init_clients() - net_init_client() - net_client_init() - net_client_init1() - net_client_init_fun() .. net_init_tap() in my case - net_init_tap_one() - vhost_net_init() - vhost_dev_init() - migration checks (host feature, memfd functional test) (2) From "-device virtio-net-pci,netdev=net0...": - virtio_pci_device_plugged() - virtio_pci_modern_regions_init() - virtio_pci_common_write() - virtio_set_status() - virtio_net_set_status() - virtio_net_vhost_status() - vhost_net_start() - vhost_net_start_one() - vhost_dev_start() - does the log allocation logic It looks like "vhost_requires_shm_log" isn't defined by my underlaying VHOST driver (vhost-net in my case). It seems that vhost-user defines it (from VhostOps user_ops). Judging by the outputs above, looks like vhost_dev_log_is_shared is returning false, making (2) - vhost_dev_start - to use a different log allocation (malloc) than the one that was tested for allowing migrations at (1) - vhost_dev_init. Question: Why to check for "memfd" when its not sure - yet - if a shared descriptor and memory pointer is going to be needed for the migration to happen ? Do you want me to change that ? If memfd fails, but, the guest in question is using regular "malloc" for vhost log, we are marking it unable to live migrate by mistake. I could check for vhost_requires_shm_log pointer during vhost_dev_init (coming from tap). Also, if possible, I would like comments about a draft: https://pastebin.canonical.com/168579/ (please disregard printfs and minor problems) OBS: I'm basically removing fallback mechanism from memfd, creating a generic qemu_mmap_XXX implementation, adding a vhostlog parameter in tap cmdline AND changing the decision on what to use: if vhostlog is present in cmdline, qemu_mmap_XXX on vhostlog is used. If it is a directory, a random file is created inside it. If it is a file, the file is used. If no vhostlog is given (default while libvirt isn't changed), it tries first to use memfd (all newer kernels), and, if not possible, it tries to fallback using the qemu_mmap mechanism on "tmp" directory creating random files. PS: Remember that this is because selinux/apparmor labelling on tmp files (and because file descriptors can be passed away, like we discussed before). If that is okay I'll provide a patch asap. Let me know if you prefer something else. Thank you, Rafael > On Oct 04, 2016, at 12:29, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote: > > >> On Oct 04, 2016, at 10:50, Marc-André Lureau <marcandre.lureau@gmail.com> wrote: >> >> What about having a single config parameter as a place to put all vhost logs for all drives for a single instance ? Remove the memfd implementation with all the memfd shared_memory option ? Replace it with a open+unlink+ftruncate+mmap approach only. >> >> >> I fail to see your point, memfd is superior to open+unlink and has other advantages with sealing etc. > > I was just summarising needs based on previous statement from Daniel: > >> This makes me wonder about the memfd_create() code path too - we'll >> again not want that external process to be granted access to arbitrary >> FDs of QEMU's and I'm not sure of a way to get the memfd FD to have >> a specific label. So I think it is possible that when using libvirt >> we'll want the ability to tell QEMU to *always* use an explicit file >> in a path libvirt specifies, and never use memfd even if available. >> >> Regards, >> Daniel
The correct (and draft) one: http://pastebin.ubuntu.com/23357210/ Im passing vhostlog parameter as "hdev->log_filename" so it can be accessed from net_init_tap()-> functions AND from vhost_dev_start()-> functions. This way I don't have to change function prototypes anymore. > On Oct 21, 2016, at 01:03, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote: > > Also, if possible, I would like comments about a draft: > > https://pastebin.canonical.com/168579/ > (please disregard printfs and minor problems)
Hi On Fri, Oct 21, 2016 at 6:03 AM Rafael David Tinoco < rafael.tinoco@canonical.com> wrote: > Judging by the outputs above, looks like vhost_dev_log_is_shared is > returning false, making (2) - vhost_dev_start - to use a different log > allocation (malloc) than the one that was tested for allowing migrations at > (1) - vhost_dev_init. > > correct > Question: Why to check for "memfd" when its not sure - yet - if a shared > descriptor and memory pointer is going to be needed for the migration to > happen ? Do you want me to It's done early enough to disable migration. > change that ? If memfd fails, but, the guest in question is using regular > "malloc" for vhost log, we are marking it unable to live migrate by > mistake. I could check for vhost_requires_shm_log pointer during > vhost_dev_init (coming from tap). > > Right, it should be done only if vhost_dev_log_is_shared is true. Patch welcome > Also, if possible, I would like comments about a draft: > > https://pastebin.canonical.com/168579/ > (please disregard printfs and minor problems) > > OBS: I'm basically removing fallback mechanism from memfd, creating a > generic qemu_mmap_XXX implementation, adding a vhostlog parameter in tap > cmdline AND changing the decision on what to use: if vhostlog is present in > cmdline, qemu_mmap_XXX on vhostlog is used. If it is a directory, a random > file is created inside it. If it is a file, the file is used. If no > vhostlog is given (default while libvirt isn't changed), it tries first to > use memfd (all newer kernels), and, if not possible, it tries to fallback > using the qemu_mmap mechanism on "tmp" directory creating random files. > Sounds reasonable, but I am not sure so many fallbacks are necessary. I would just have an optional filename. > > PS: Remember that this is because selinux/apparmor labelling on tmp files > (and because file descriptors can be passed away, like we discussed before). > > If that is okay I'll provide a patch asap. Let me know if you prefer > something else. > Ok, I hope other comments on the idea, and I'll review your patch once on the ML. Thanks
diff --git a/util/memfd.c b/util/memfd.c index 4571d1a..4b715ac 100644 --- a/util/memfd.c +++ b/util/memfd.c @@ -30,6 +30,9 @@ #include <glib/gprintf.h> #include "qemu/memfd.h" +#include "qmp-commands.h" +#include "qemu-common.h" +#include "sysemu/sysemu.h" #ifdef CONFIG_MEMFD #include <sys/memfd.h> @@ -94,11 +97,32 @@ void *qemu_memfd_alloc(const char *name, size_t size, unsigned int seals, return NULL; } } else { + int ret = 0; const char *tmpdir = g_get_tmp_dir(); + UuidInfo *uinfo; + NameInfo *ninfo; gchar *fname; - fname = g_strdup_printf("%s/memfd-XXXXXX", tmpdir); + uinfo = qmp_query_uuid(NULL); + + ret = strcmp(uinfo->UUID, UUID_NONE); + if (ret == 0) { + ninfo = qmp_query_name(NULL); + if (ninfo->has_name) { + fname = g_strdup_printf("%s/memfd-%s-XXXXXX", tmpdir, + ninfo->name); + } else { + fname = g_strdup_printf("%s/memfd-XXXXXX", tmpdir); + } + qapi_free_NameInfo(ninfo); + } else { + fname = g_strdup_printf("%s/memfd-%s-XXXXXX", tmpdir, + uinfo->UUID); + } + mfd = mkstemp(fname); + + qapi_free_UuidInfo(uinfo); unlink(fname); g_free(fname);
Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback mechanism for systems not supporting memfd_create syscall (started being supported since 3.17). Backporting memfd_create might not be accepted for distros relying on older kernels. Nowadays there is no way for security driver to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX. It is more appropriate to include UUID and/or VM names in the temporary filename, allowing security driver rules to be applied while maintaining the required unpredictability with mkstemp. This change will allow libvirt to know exact memfd file to be created for vhost log AND to create appropriate security rules to allow access per instance (instead of a opened rule like <tmpdir>/memfd-*). Example of apparmor deny messages with this change: Per VM UUID (preferred, generated automatically by libvirt): kernel: [26632.154856] type=1400 audit(1474945148.633:78): apparmor= "DENIED" operation="mknod" profile="libvirt-0b96011f-0dc0-44a3-92c3- 196de2efab6d" name="/tmp/memfd-0b96011f-0dc0-44a3-92c3-196de2efab6d- qeHrBV" pid=75161 comm="qemu-system-x86" requested_mask="c" denied_ mask="c" fsuid=107 ouid=107 Per VM name (if no UUID is specified): kernel: [26447.505653] type=1400 audit(1474944963.985:72): apparmor= "DENIED" operation="mknod" profile="libvirt-00000000-0000-0000-0000- 000000000000" name="/tmp/memfd-instance-teste-osYpHh" pid=74648 comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=107 ouid=107 Signed-off-by: Rafael David Tinoco <rafael.tinoco@canonical.com> --- util/memfd.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-)