util: secure memfd_create fallback mechanism

Message ID	20160927030621.20862-1-rafael.tinoco@canonical.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> From: Rafael David Tinoco <rafael.tinoco@canonical.com> To: qemu-devel@nongnu.org Date: Tue, 27 Sep 2016 03:06:21 +0000 Message-Id: <20160927030621.20862-1-rafael.tinoco@canonical.com> Subject: [Qemu-devel] [PATCH] util: secure memfd_create fallback mechanism Precedence: list Cc: 1626972@bugs.launchpad.net, marcandre.lureau@redhat.com, mst@redhat.com Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>

Rafael Tinoco Sept. 27, 2016, 3:06 a.m. UTC

Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
mechanism for systems not supporting memfd_create syscall (started
being supported since 3.17).

Backporting memfd_create might not be accepted for distros relying
on older kernels. Nowadays there is no way for security driver
to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.

It is more appropriate to include UUID and/or VM names in the
temporary filename, allowing security driver rules to be applied
while maintaining the required unpredictability with mkstemp.

This change will allow libvirt to know exact memfd file to be created
for vhost log AND to create appropriate security rules to allow access
per instance (instead of a opened rule like <tmpdir>/memfd-*).

Example of apparmor deny messages with this change:

Per VM UUID (preferred, generated automatically by libvirt):

kernel: [26632.154856] type=1400 audit(1474945148.633:78): apparmor=
"DENIED" operation="mknod" profile="libvirt-0b96011f-0dc0-44a3-92c3-
196de2efab6d" name="/tmp/memfd-0b96011f-0dc0-44a3-92c3-196de2efab6d-
qeHrBV" pid=75161 comm="qemu-system-x86" requested_mask="c" denied_
mask="c" fsuid=107 ouid=107

Per VM name (if no UUID is specified):

kernel: [26447.505653] type=1400 audit(1474944963.985:72): apparmor=
"DENIED" operation="mknod" profile="libvirt-00000000-0000-0000-0000-
000000000000" name="/tmp/memfd-instance-teste-osYpHh" pid=74648
comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=107
ouid=107

Signed-off-by: Rafael David Tinoco <rafael.tinoco@canonical.com>
---
 util/memfd.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Daniel P. Berrangé Sept. 27, 2016, 8:36 a.m. UTC | #1

On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
> mechanism for systems not supporting memfd_create syscall (started
> being supported since 3.17).

This is really dubious code in general and IMHO should just
be reverted.

We have a golden rule that any time QEMU needs to be able to
create a file on disk, then the path should be explicitly
provided as a command line argument so that mgmt apps can
control the location used.

> Backporting memfd_create might not be accepted for distros relying
> on older kernels. Nowadays there is no way for security driver
> to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
> 
> It is more appropriate to include UUID and/or VM names in the
> temporary filename, allowing security driver rules to be applied
> while maintaining the required unpredictability with mkstemp.

We should not have QEMU creating unpredictabile filenames in the
first place - any filenames should be determined by libvirt
explicitly.

> This change will allow libvirt to know exact memfd file to be created
> for vhost log AND to create appropriate security rules to allow access
> per instance (instead of a opened rule like <tmpdir>/memfd-*).

Even with this change it is bad - we don't want driver backends
creating arbitrary files in the shared /tmp directory.

Regards,
Daniel

Rafael Tinoco Sept. 27, 2016, 11:01 a.m. UTC | #2

> On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote:
> 
> On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
>> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
>> mechanism for systems not supporting memfd_create syscall (started
>> being supported since 3.17).
> 
> This is really dubious code in general and IMHO should just
> be reverted.

There are numerous people relying on older kernels in openstack 
deployments - sometimes with specific drivers (ovswitch, dpdk, 
infiniband) holding kernel upgrades - but still in need of upgrading 
userland (e.g. newer releases). Having a fallback mechanism seems 
appropriate for those cases.

> 
> We have a golden rule that any time QEMU needs to be able to
> create a file on disk, then the path should be explicitly
> provided as a command line argument so that mgmt apps can
> control the location used.
> 
>> Backporting memfd_create might not be accepted for distros relying
>> on older kernels. Nowadays there is no way for security driver
>> to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
>> 
>> It is more appropriate to include UUID and/or VM names in the
>> temporary filename, allowing security driver rules to be applied
>> while maintaining the required unpredictability with mkstemp.
> 
> We should not have QEMU creating unpredictabile filenames in the
> first place - any filenames should be determined by libvirt
> explicitly.

Note that the filename, per se, is not as important as other files, 
since qemu won't provide it for being accessed by external programs, and,
deletes the file, while keeping the descriptor, right after its creation
(due to its nature, that is probably why it was created in /tmp).

Having libvirt to define a filename that would not be used for recent
kernels (> 3.17) and would exist for a fraction of second doesn't seem
right to me. 

> 
>> This change will allow libvirt to know exact memfd file to be created
>> for vhost log AND to create appropriate security rules to allow access
>> per instance (instead of a opened rule like <tmpdir>/memfd-*).
> 
> Even with this change it is bad - we don't want driver backends
> creating arbitrary files in the shared /tmp directory.

On the other hand, if we are creating a tmp file, like I said, I see 
benefit on having unpredictability (mkstemp), but providing predictable
parts to allow security driver to apply rules per instance basis 
(/tmp/memfd-UUID*, /tmp/memfd-VMname*). 

Looking forward to a decision so I can backport correct behaviour
(with or without memfd file).  

Thank you!

Best Regards,
Rafael

Marc-André Lureau Sept. 27, 2016, 11:13 a.m. UTC | #3

Hi

----- Original Message -----
> 
> > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote:
> > 
> > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> > We should not have QEMU creating unpredictabile filenames in the
> > first place - any filenames should be determined by libvirt
> > explicitly.
> 
> Note that the filename, per se, is not as important as other files,
> since qemu won't provide it for being accessed by external programs, and,
> deletes the file, while keeping the descriptor, right after its creation
> (due to its nature, that is probably why it was created in /tmp).
> 
> Having libvirt to define a filename that would not be used for recent
> kernels (> 3.17) and would exist for a fraction of second doesn't seem
> right to me.
> 

There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules?

Rafael Tinoco Sept. 27, 2016, 11:31 a.m. UTC | #4

Hello!

> On Sep 27, 2016, at 08:13, Marc-André Lureau <mlureau@redhat.com> wrote:
> 
>> Note that the filename, per se, is not as important as other files,
>> since qemu won't provide it for being accessed by external programs, and,
>> deletes the file, while keeping the descriptor, right after its creation
>> (due to its nature, that is probably why it was created in /tmp).
>> 
>> Having libvirt to define a filename that would not be used for recent
>> kernels (> 3.17) and would exist for a fraction of second doesn't seem
>> right to me.
>> 
> 
> There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules?

You got a point. With a per-vm TMPDIR we don't have to care about filenames in future for the security driver, while still securing them per-instance base. I'll come back to you! 

Thank you!

Daniel P. Berrangé Sept. 27, 2016, 12:16 p.m. UTC | #5

On Tue, Sep 27, 2016 at 07:13:55AM -0400, Marc-André Lureau wrote:
> Hi
> 
> ----- Original Message -----
> > 
> > > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote:
> > > 
> > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> > > We should not have QEMU creating unpredictabile filenames in the
> > > first place - any filenames should be determined by libvirt
> > > explicitly.
> > 
> > Note that the filename, per se, is not as important as other files,
> > since qemu won't provide it for being accessed by external programs, and,
> > deletes the file, while keeping the descriptor, right after its creation
> > (due to its nature, that is probably why it was created in /tmp).
> > 
> > Having libvirt to define a filename that would not be used for recent
> > kernels (> 3.17) and would exist for a fraction of second doesn't seem
> > right to me.
> > 
> 
> There are other parts of qemu that rely on creating temporary files, and
> this seems to lack a bit of uniformity. Would it make sense to define a
> place where qemu could create those? Or setting TMPDIR should help too.
> Could libvirt set a per-vm TMPDIR with appropriate security rules?

The other places that use mkstemp are block for snapshot=on, which
libvirt does not support as we want control over the filename. This
needs fixing by allowing a filename to be given. The qemu sockets code
uses it for auto-creating a UNIX domain socket path, but again libvirt
doesn't support that usage. The exec.c file uses it, but that honours
an explicit directory path provided on the command line. So this memfd
code really is the first place which is causing a real

Just setting TMPDIR per VM doesn't magically solve all these cases as
it isn't reasonable to assume that all these files should be in the
same location. Certainly block snapshot file will be somewhere different
from others, due to its size.

Regards,
Daniel

Daniel P. Berrangé Sept. 27, 2016, 12:18 p.m. UTC | #6

On Tue, Sep 27, 2016 at 11:01:10AM -0000, Rafael David Tinoco wrote:
> > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote:
> > 
> > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> >> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
> >> mechanism for systems not supporting memfd_create syscall (started
> >> being supported since 3.17).
> > 
> > This is really dubious code in general and IMHO should just
> > be reverted.
> 
> There are numerous people relying on older kernels in openstack 
> deployments - sometimes with specific drivers (ovswitch, dpdk, 
> infiniband) holding kernel upgrades - but still in need of upgrading 
> userland (e.g. newer releases). Having a fallback mechanism seems 
> appropriate for those cases.

I'm not against some kind of fallback - just about the way it
silently creates files in /tmp.

> 
> Note that the filename, per se, is not as important as other files, 
> since qemu won't provide it for being accessed by external programs, and,
> deletes the file, while keeping the descriptor, right after its creation
> (due to its nature, that is probably why it was created in /tmp).

If it doesn't shared with other processes, and is deleted immediately,
why does the file need to be on disk at all ?


Regards,
Daniel

Marc-André Lureau Sept. 27, 2016, 12:25 p.m. UTC | #7

Hi

----- Original Message -----
> On Tue, Sep 27, 2016 at 07:13:55AM -0400, Marc-André Lureau wrote:
> > Hi
> > 
> > ----- Original Message -----
> > > 
> > > > On Sep 27, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com>
> > > > wrote:
> > > > 
> > > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> > > > We should not have QEMU creating unpredictabile filenames in the
> > > > first place - any filenames should be determined by libvirt
> > > > explicitly.
> > > 
> > > Note that the filename, per se, is not as important as other files,
> > > since qemu won't provide it for being accessed by external programs, and,
> > > deletes the file, while keeping the descriptor, right after its creation
> > > (due to its nature, that is probably why it was created in /tmp).
> > > 
> > > Having libvirt to define a filename that would not be used for recent
> > > kernels (> 3.17) and would exist for a fraction of second doesn't seem
> > > right to me.
> > > 
> > 
> > There are other parts of qemu that rely on creating temporary files, and
> > this seems to lack a bit of uniformity. Would it make sense to define a
> > place where qemu could create those? Or setting TMPDIR should help too.
> > Could libvirt set a per-vm TMPDIR with appropriate security rules?
> 
> The other places that use mkstemp are block for snapshot=on, which
> libvirt does not support as we want control over the filename. This
> needs fixing by allowing a filename to be given. The qemu sockets code
> uses it for auto-creating a UNIX domain socket path, but again libvirt
> doesn't support that usage. The exec.c file uses it, but that honours
> an explicit directory path provided on the command line. So this memfd
> code really is the first place which is causing a real

Have you reviewed the hundreds of libraries qemu link to? :)

> Just setting TMPDIR per VM doesn't magically solve all these cases as
> it isn't reasonable to assume that all these files should be in the
> same location. Certainly block snapshot file will be somewhere different
> from others, due to its size.

I am not claiming it solves all problems, but at least it seems it would be quite appropriate for security concerns to have per-vm TMPDIR.

Rafael Tinoco Oct. 3, 2016, 3:41 p.m. UTC | #8

Sorry, I was only able to come back to this today.

> On Sep 27, 2016, at 09:18, Daniel Berrange <1626972@bugs.launchpad.net> wrote:
> 
>> There are numerous people relying on older kernels in openstack 
>> deployments - sometimes with specific drivers (ovswitch, dpdk, 
>> infiniband) holding kernel upgrades - but still in need of upgrading 
>> userland (e.g. newer releases). Having a fallback mechanism seems 
>> appropriate for those cases.
> 
> I'm not against some kind of fallback - just about the way it
> silently creates files in /tmp.
> 

That is why memfd_create is used here I suppose: To allow anonymous-backed-pages to have a descriptor and to be sealed. When falling back this mechanism I don't see any other way other than creating a temporary file. Of course one way would be something like:

http://paste.ubuntu.com/23270379/

But this is pretty much the same, just solving the "where to place the temporary file" (non configurable for this usage). 

>> 
>> Note that the filename, per se, is not as important as other files, 
>> since qemu won't provide it for being accessed by external programs, and,
>> deletes the file, while keeping the descriptor, right after its creation
>> (due to its nature, that is probably why it was created in /tmp).
> 
> If it doesn't shared with other processes, and is deleted immediately,
> why does the file need to be on disk at all ?

Well, it unlinks the file but the references are still there while the descriptor isn't closed by this process, or by the one that receives the descriptor (that is why is the "unlink" so early). 

If you check vhost_dev_log_resize(), it gets *possible* new vhost log (if a new size is given) and informs the vhost dev driver about the new log base (vhost_ops->vhost_set_log_base). 

For vhost_user, this means that the file descriptors for vhost logs are likely going to be passed to vhost backend (fds[] in vhost_user_set_log_base). This is just one example, not sure about others. 

Probably the best approach here, like what Marc-André said, is to create some sort of TMPDIR, set by libvirt perhaps ?

> 
> Regards,
> Daniel

Rafael Tinoco Oct. 3, 2016, 5:38 p.m. UTC | #9

Hello Marc, 

> On Sep 27, 2016, at 08:13, Marc-André Lureau <mlureau@redhat.com> wrote:
> 
>>> On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
>>> We should not have QEMU creating unpredictabile filenames in the
>>> first place - any filenames should be determined by libvirt
>>> explicitly.
>> 
>> Note that the filename, per se, is not as important as other files,
>> since qemu won't provide it for being accessed by external programs, and,
>> deletes the file, while keeping the descriptor, right after its creation
>> (due to its nature, that is probably why it was created in /tmp).
>> 
>> Having libvirt to define a filename that would not be used for recent
>> kernels (> 3.17) and would exist for a fraction of second doesn't seem
>> right to me.
>> 
> 
> There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules?

Best move I can see. Only problem is that if we do that, we would have to create a fallback mechanism for when TMPDIR is not set. It would go back to /tmp ? 

In my particular case (for 1 vhost log file):

-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3

I could have something similar to:

-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3,vhostpath=/var/lib/XXXX/YYYY/ 

and put mkstemp() files (one per vhost device) in there. 

Even so, what to do when "vhostpath" is not informed ? 

I'm worried that, right now there are security drivers either blocking the live migration entirely or allowing all instances to be able to read /tmp/memfd-XXXX. 

Don't you think we could push the first patch until we come up with a better approach for the tmp (and default tmp) files & directories ? The patch is not worse than what was committed already. 

Tks

Rafael

Daniel P. Berrangé Oct. 3, 2016, 5:55 p.m. UTC | #10

On Mon, Oct 03, 2016 at 03:41:10PM -0000, Rafael David Tinoco wrote:
> Sorry, I was only able to come back to this today.
> 
> > On Sep 27, 2016, at 09:18, Daniel Berrange <1626972@bugs.launchpad.net> wrote:
> > 
> >> There are numerous people relying on older kernels in openstack 
> >> deployments - sometimes with specific drivers (ovswitch, dpdk, 
> >> infiniband) holding kernel upgrades - but still in need of upgrading 
> >> userland (e.g. newer releases). Having a fallback mechanism seems 
> >> appropriate for those cases.
> > 
> > I'm not against some kind of fallback - just about the way it
> > silently creates files in /tmp.
> > 
> 
> That is why memfd_create is used here I suppose: To allow anonymous-
> backed-pages to have a descriptor and to be sealed. When falling back
> this mechanism I don't see any other way other than creating a temporary
> file. Of course one way would be something like:
> 
> http://paste.ubuntu.com/23270379/
> 
> But this is pretty much the same, just solving the "where to place the
> temporary file" (non configurable for this usage).
> 
> >> 
> >> Note that the filename, per se, is not as important as other files, 
> >> since qemu won't provide it for being accessed by external programs, and,
> >> deletes the file, while keeping the descriptor, right after its creation
> >> (due to its nature, that is probably why it was created in /tmp).
> > 
> > If it doesn't shared with other processes, and is deleted immediately,
> > why does the file need to be on disk at all ?
> 
> Well, it unlinks the file but the references are still there while the
> descriptor isn't closed by this process, or by the one that receives the
> descriptor (that is why is the "unlink" so early).
> 
> If you check vhost_dev_log_resize(), it gets *possible* new vhost log
> (if a new size is given) and informs the vhost dev driver about the new
> log base (vhost_ops->vhost_set_log_base).
> 
> For vhost_user, this means that the file descriptors for vhost logs are
> likely going to be passed to vhost backend (fds[] in
> vhost_user_set_log_base). This is just one example, not sure about
> others.
>
> Probably the best approach here, like what Marc-André said, is to create
> some sort of TMPDIR, set by libvirt perhaps ?

So you're saying that the file descriptor here is actually getting
passed to a different process for it to use ?

If so that means we definitely do not want this in TMPDIR. If we
create a generic file in TMPDIR, then its going to have a generic
security label. That means that the other process we're giving the
FD to is going to have to be granted permission to access this FD
and we certainly don't want to grant permission for it to access
any of QEMU's other FDs. So for the SELinux integration, we'll
need this FD to be in a specific directory, so that we can setup
policy such that the file created gets given a specific SELinux
label. We can then grant the other process access to only that
particular file, and not anything else of QEMU's.

This makes me wonder about the memfd_create() code path too - we'll
again not want that external process to be granted access to arbitrary
FDs of QEMU's and I'm not sure of a way to get the memfd  FD to have
a specific label. So I think it is possible that when using libvirt
we'll want the ability to tell QEMU to *always* use an explicit file
in a path libvirt specifies, and never use memfd even if available.

Regards,
Daniel

Rafael Tinoco Oct. 3, 2016, 6:46 p.m. UTC | #11

Hello Daniel,

> On Oct 03, 2016, at 14:55, Daniel P. Berrange <berrange@redhat.com> wrote:
> 
>> Well, it unlinks the file but the references are still there while the
>> descriptor isn't closed by this process, or by the one that receives the
>> descriptor (that is why is the "unlink" so early).
>> 
>> If you check vhost_dev_log_resize(), it gets *possible* new vhost log
>> (if a new size is given) and informs the vhost dev driver about the new
>> log base (vhost_ops->vhost_set_log_base).
>> 
>> For vhost_user, this means that the file descriptors for vhost logs are
>> likely going to be passed to vhost backend (fds[] in
>> vhost_user_set_log_base). This is just one example, not sure about
>> others.
>> 
>> Probably the best approach here, like what Marc-André said, is to create
>> some sort of TMPDIR, set by libvirt perhaps ?
> 
> So you're saying that the file descriptor here is actually getting
> passed to a different process for it to use ?
> 
> If so that means we definitely do not want this in TMPDIR. If we
> create a generic file in TMPDIR, then its going to have a generic
> security label. That means that the other process we're giving the
> FD to is going to have to be granted permission to access this FD
> and we certainly don't want to grant permission for it to access
> any of QEMU's other FDs. So for the SELinux integration, we'll
> need this FD to be in a specific directory, so that we can setup
> policy such that the file created gets given a specific SELinux
> label. We can then grant the other process access to only that
> particular file, and not anything else of QEMU's.
> 
> This makes me wonder about the memfd_create() code path too - we'll
> again not want that external process to be granted access to arbitrary
> FDs of QEMU's and I'm not sure of a way to get the memfd  FD to have
> a specific label. So I think it is possible that when using libvirt
> we'll want the ability to tell QEMU to *always* use an explicit file
> in a path libvirt specifies, and never use memfd even if available.

Check this execution path:

(vhost_vsock_device_realize)
  vhost_dev_init
  vhost_commit
  |- vhost_get_log_size
  |...
  |- vhost_dev_log_resize

(vhost_dev_log_resize):
  vhost_log_get -> here if the size is bigger, a new log is created
  dev->vhost_ops->vhost_set_log_base() -> kernel or user vhost driver
  vhost_log_put()

----

So,

* In case of the kernel mode, this is just a:

vhost in kernel mode = vhost_kernel_set_log_base
return vhost_kernel_call(dev, VHOST_SET_LOG_BASE, &base);

which makes an ioctl to dev->opaque file descriptor to set a new vhost
log base.

* But in the case of user mode:

vhost in user mode = vhost_user_set_log_base

which gets the log file descriptor (log->fd) and gives to
vhost_user_write. vhost_user_write will do a qemu_chr_fe_set_msgfds
passing the log file descriptors for the backend vhost driver
(CharDriverState).

If I'm reading this right.. if the backend driver is:

static int tcp_set_msgfds(CharDriverState *chr, int *fds, int num)

it would check for:

!qio_channel_has_feature(s->ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {

and configure s->write_msgfds. This would be sent in:

static int tcp_chr_write(CharDriverState *chr, const uint8_t *buf, int
len)

with "io_channel_send_full" + "qio_channel_writev_full + io_writev from
QIOChannelClass.

----

https://www.berrange.com/posts/2016/08/16/

This, from your blog, probably confirms this behaviour:

"The migration code supports a number of different protocols besides
just “tcp:“. In particular it allows an “fd:” protocol to tell QEMU to
use a passed-in file descriptor, and an “exec:” protocol to tell QEMU to
launch an external command to tunnel the connection. It is desirable to
be able to use TLS with these protocols too, but when using TLS the
client QEMU needs to know the hostname of the target QEMU in order to
correctly validate the x509 certificate it receives. Thus, a second
“tls-hostname” parameter was added to allow QEMU to be informed of the
hostname to use for x509 certificate validation when using a non-tcp
migration protocol. This can be set on the source QEMU prior to starting
the migration using the “migrate_set_str_parameter” monitor command"

=)

Rafael Tinoco Oct. 3, 2016, 7:15 p.m. UTC | #12

Yes, definitely. Check this:

/**
 * @qemu_chr_fe_set_msgfds:
 *
 * For backends capable of fd passing, set an array of fds to be passed with
 * the next send operation.
 * A subsequent call to this function before calling a write function will
 * result in overwriting the fd array with the new value without being send.
 * Upon writing the message the fd array is freed.
 *
 * Returns: -1 if fd passing isn't supported.
 */
int qemu_chr_fe_set_msgfds(CharDriverState *s, int *fds, int num);

----

So, at least for vhost_dev_log_resize, this "interface" is being implemented:

vhost_user_set_log_base -> VhostUserMsg = VHOST_USER_SET_LOG_BASE

vhost_user_write(with the VHOST_USER_GET_LOG_BASE message):

- configures the file descriptors(... , fds, fd_num)
  qemu_chr_fe_set_msgfds

- writes them down the char driver
  qemu_chr_fe_write_all

> On Oct 03, 2016, at 15:46, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote:
> 
>> So you're saying that the file descriptor here is actually getting
>> passed to a different process for it to use ?

Daniel P. Berrangé Oct. 4, 2016, 8:36 a.m. UTC | #13

On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> Yes, definitely. Check this:

[snip]

So in that case, I think we must add ability to specify an explicit path
that apps can use *regardles* of whether memfd support exists or not.

> > On Oct 03, 2016, at 15:46, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote:
> > 
> >> So you're saying that the file descriptor here is actually getting
> >> passed to a different process for it to use ?
> 

Regards,
Daniel

Rafael Tinoco Oct. 4, 2016, 12:18 p.m. UTC | #14

Let me work on it. I'll get back soon. 

Tks Daniel.

> On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com> wrote:
> 
> On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
>> Yes, definitely. Check this:
> 
> [snip]
> 
> So in that case, I think we must add ability to specify an explicit path
> that apps can use *regardles* of whether memfd support exists or not.
> 
>>> On Oct 03, 2016, at 15:46, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote:
>>> 
>>>> So you're saying that the file descriptor here is actually getting
>>>> passed to a different process for it to use ?

Marc-André Lureau Oct. 4, 2016, 12:39 p.m. UTC | #15

Hi Rafael, Daniel,

On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
rafael.tinoco@canonical.com> wrote:

> Let me work on it. I'll get back soon.
>
>
thanks for working on it, before that I have a few questions:

Tks Daniel.
>
> > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com>
> wrote:
> >
> > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> >> Yes, definitely. Check this:
> >
> > [snip]
> >
> > So in that case, I think we must add ability to specify an explicit path
> > that apps can use *regardles* of whether memfd support exists or not.
>

How will this path be used? Is it going to be global to qemu for various
use (kinda like $TMP), or per-device, or for memfd fallback only? Should
the path pre-exist? (I suppose, if not, qemu should clean it up when
leaving)

>
> >>> On Oct 03, 2016, at 15:46, Rafael David Tinoco <
> rafael.tinoco@canonical.com> wrote:
> >>>
> >>>> So you're saying that the file descriptor here is actually getting
> >>>> passed to a different process for it to use ?
>
>
> --
Marc-André Lureau

Daniel P. Berrangé Oct. 4, 2016, 12:42 p.m. UTC | #16

On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> Hi Rafael, Daniel,
> 
> On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> rafael.tinoco@canonical.com> wrote:
> 
> > Let me work on it. I'll get back soon.
> >
> >
> thanks for working on it, before that I have a few questions:
> 
> Tks Daniel.
> >
> > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com>
> > wrote:
> > >
> > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> > >> Yes, definitely. Check this:
> > >
> > > [snip]
> > >
> > > So in that case, I think we must add ability to specify an explicit path
> > > that apps can use *regardles* of whether memfd support exists or not.
> >
> 
> How will this path be used? Is it going to be global to qemu for various
> use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> the path pre-exist? (I suppose, if not, qemu should clean it up when
> leaving)

I'd expect it to be an option set against the vhost user backend, since
that's the thing using this.

If other things have similar usage needs wrt memfd in future, they would
also need similar path config option.


Regards,
Daniel

Marc-André Lureau Oct. 4, 2016, 1:10 p.m. UTC | #17

Hi

On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <berrange@redhat.com>
wrote:

> On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> > Hi Rafael, Daniel,
> >
> > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> > rafael.tinoco@canonical.com> wrote:
> >
> > > Let me work on it. I'll get back soon.
> > >
> > >
> > thanks for working on it, before that I have a few questions:
> >
> > Tks Daniel.
> > >
> > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com>
> > > wrote:
> > > >
> > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> > > >> Yes, definitely. Check this:
> > > >
> > > > [snip]
> > > >
> > > > So in that case, I think we must add ability to specify an explicit
> path
> > > > that apps can use *regardles* of whether memfd support exists or not.
> > >
> >
> > How will this path be used? Is it going to be global to qemu for various
> > use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > leaving)
>
> I'd expect it to be an option set against the vhost user backend, since
> that's the thing using this.
>
> If other things have similar usage needs wrt memfd in future, they would
> also need similar path config option.
>

The log may be shared if there are several vhost-user (stored in
vhost_log_shm global), so I think it makes more sense to have a global
config path for it, or you may end up duplicating that information per
vhost backend and having files in either of the specified paths.


>
>
> Regards,
> Daniel
> --
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/
> :|
> |: http://libvirt.org              -o-             http://virt-manager.org
> :|
> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/
> :|
>

Daniel P. Berrangé Oct. 4, 2016, 1:25 p.m. UTC | #18

On Tue, Oct 04, 2016 at 01:10:17PM +0000, Marc-André Lureau wrote:
> Hi
> 
> On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <berrange@redhat.com>
> wrote:
> 
> > On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> > > Hi Rafael, Daniel,
> > >
> > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> > > rafael.tinoco@canonical.com> wrote:
> > >
> > > > Let me work on it. I'll get back soon.
> > > >
> > > >
> > > thanks for working on it, before that I have a few questions:
> > >
> > > Tks Daniel.
> > > >
> > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <berrange@redhat.com>
> > > > wrote:
> > > > >
> > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> > > > >> Yes, definitely. Check this:
> > > > >
> > > > > [snip]
> > > > >
> > > > > So in that case, I think we must add ability to specify an explicit
> > path
> > > > > that apps can use *regardles* of whether memfd support exists or not.
> > > >
> > >
> > > How will this path be used? Is it going to be global to qemu for various
> > > use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> > > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > > leaving)
> >
> > I'd expect it to be an option set against the vhost user backend, since
> > that's the thing using this.
> >
> > If other things have similar usage needs wrt memfd in future, they would
> > also need similar path config option.
> >
> 
> The log may be shared if there are several vhost-user (stored in
> vhost_log_shm global), so I think it makes more sense to have a global
> config path for it, or you may end up duplicating that information per
> vhost backend and having files in either of the specified paths.

Hmm, is there a reason why it is shared? That seems to make an assumption
that all vhost-user backends would be managed by the same external process.
While that may be the common case today, it doesn't feel like a reasonable
assumption to make long term. IOW it feels wiser to have it set per-NIC
unless I'm missing something important that means it must be shared ?

Regards,
Daniel

Rafael Tinoco Oct. 4, 2016, 1:29 p.m. UTC | #19

> On Oct 04, 2016, at 10:10, Marc-André Lureau <marcandre.lureau@gmail.com> wrote:
> 
> > How will this path be used? Is it going to be global to qemu for various
> > use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > leaving)
> 
> I'd expect it to be an option set against the vhost user backend, since
> that's the thing using this.
> 
> If other things have similar usage needs wrt memfd in future, they would
> also need similar path config option.

I was going for that approach. I could have something similar to:

-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3,vhostpath=/var/lib/XXXX/YYYY/ 

> The log may be shared if there are several vhost-user (stored in vhost_log_shm global), so I think it makes more sense to have a global config path for it, or you may end up duplicating that information per vhost backend and having files in either of the specified paths.

But, yes, indeed the vhost_log_shm makes that approach tricky. If sharing the same log file with multiple vhost backend. Besides, tools like openstack would put all the vhost log files in the same place at the end. 

Having a global config path, forced to be specified, orelse the vhost log isn't created, like when it fails nowadays. This seems to be the right approach.

Rafael Tinoco Oct. 4, 2016, 1:34 p.m. UTC | #20

True. 

What about having a single config parameter as a place to put all vhost logs for all drives for a single instance ? Remove the memfd implementation with all the memfd shared_memory option ? Replace it with a open+unlink+ftruncate+mmap approach only.

This way every device would get its own log file and vhost-user backends would be able to get its file descriptors. (and, of course, allow the security drivers to do their jobs). 

>> On Oct 04, 2016, at 10:25, Daniel P. Berrange <berrange@redhat.com> wrote:
>> 
>> Hmm, is there a reason why it is shared? That seems to make an assumption
>> that all vhost-user backends would be managed by the same external process.
>> While that may be the common case today, it doesn't feel like a reasonable
>> assumption to make long term. IOW it feels wiser to have it set per-NIC
>> unless I'm missing something important that means it must be shared ?
>

Marc-André Lureau Oct. 4, 2016, 1:46 p.m. UTC | #21

Hi

On Tue, Oct 4, 2016 at 5:25 PM Daniel P. Berrange <berrange@redhat.com>
wrote:

> On Tue, Oct 04, 2016 at 01:10:17PM +0000, Marc-André Lureau wrote:
> > Hi
> >
> > On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <berrange@redhat.com>
> > wrote:
> >
> > > On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> > > > Hi Rafael, Daniel,
> > > >
> > > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> > > > rafael.tinoco@canonical.com> wrote:
> > > >
> > > > > Let me work on it. I'll get back soon.
> > > > >
> > > > >
> > > > thanks for working on it, before that I have a few questions:
> > > >
> > > > Tks Daniel.
> > > > >
> > > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <
> berrange@redhat.com>
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco
> wrote:
> > > > > >> Yes, definitely. Check this:
> > > > > >
> > > > > > [snip]
> > > > > >
> > > > > > So in that case, I think we must add ability to specify an
> explicit
> > > path
> > > > > > that apps can use *regardles* of whether memfd support exists or
> not.
> > > > >
> > > >
> > > > How will this path be used? Is it going to be global to qemu for
> various
> > > > use (kinda like $TMP), or per-device, or for memfd fallback only?
> Should
> > > > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > > > leaving)
> > >
> > > I'd expect it to be an option set against the vhost user backend, since
> > > that's the thing using this.
> > >
> > > If other things have similar usage needs wrt memfd in future, they
> would
> > > also need similar path config option.
> > >
> >
> > The log may be shared if there are several vhost-user (stored in
> > vhost_log_shm global), so I think it makes more sense to have a global
> > config path for it, or you may end up duplicating that information per
> > vhost backend and having files in either of the specified paths.
>
> Hmm, is there a reason why it is shared? That seems to make an assumption
> that all vhost-user backends would be managed by the same external process.
> While that may be the common case today, it doesn't feel like a reasonable
> assumption to make long term. IOW it feels wiser to have it set per-NIC
> unless I'm missing something important that means it must be shared ?
>
>
It's a shared log, just like they share the same ram. Duplicating the log
would mostly make migration more difficult to handle and increase a bit
memory usage.


> Regards,
> Daniel
> --
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/
> :|
> |: http://libvirt.org              -o-             http://virt-manager.org
> :|
> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/
> :|
>

Marc-André Lureau Oct. 4, 2016, 1:50 p.m. UTC | #22

On Tue, Oct 4, 2016 at 5:34 PM Rafael David Tinoco <
rafael.tinoco@canonical.com> wrote:

> True.
>
> What about having a single config parameter as a place to put all vhost
> logs for all drives for a single instance ? Remove the memfd implementation
> with all the memfd shared_memory option ? Replace it with a
> open+unlink+ftruncate+mmap approach only.
>
>
I fail to see your point, memfd is superior to open+unlink and has other
advantages with sealing etc.

Regarding shared log, see my previous reply to Daniel.

This way every device would get its own log file and vhost-user backends
> would be able to get its file descriptors. (and, of course, allow the
> security drivers to do their jobs).
>
> >> On Oct 04, 2016, at 10:25, Daniel P. Berrange <berrange@redhat.com>
> wrote:
> >>
> >> Hmm, is there a reason why it is shared? That seems to make an
> assumption
> >> that all vhost-user backends would be managed by the same external
> process.
> >> While that may be the common case today, it doesn't feel like a
> reasonable
> >> assumption to make long term. IOW it feels wiser to have it set per-NIC
> >> unless I'm missing something important that means it must be shared ?
> >
>
> --
Marc-André Lureau

Rafael Tinoco Oct. 4, 2016, 3:29 p.m. UTC | #23

> On Oct 04, 2016, at 10:50, Marc-André Lureau <marcandre.lureau@gmail.com> wrote:
> 
> What about having a single config parameter as a place to put all vhost logs for all drives for a single instance ? Remove the memfd implementation with all the memfd shared_memory option ? Replace it with a open+unlink+ftruncate+mmap approach only.
> 
> 
> I fail to see your point, memfd is superior to open+unlink and has other advantages with sealing etc.

I was just summarising needs based on previous statement from Daniel:

> This makes me wonder about the memfd_create() code path too - we'll
> again not want that external process to be granted access to arbitrary
> FDs of QEMU's and I'm not sure of a way to get the memfd  FD to have
> a specific label. So I think it is possible that when using libvirt
> we'll want the ability to tell QEMU to *always* use an explicit file
> in a path libvirt specifies, and never use memfd even if available.
> 
> Regards,
> Daniel

Rafael Tinoco Oct. 21, 2016, 3:03 a.m. UTC | #24

Hello Again, finally I could get back to this, and..

I was finishing a patch creating the open+truncate+mmap+unlink mechanism on files specified by "vhostlog" parameter of tap devices. Patch is done, problem is that... looks like the "memfd" is only used for shared logs AND vhost-net (used for tap devices) doesn't use it. 

In the following...

(scenario 1)

Linux kvm01 4.8.0-22-generic #24-Ubuntu SMP Sat Oct 8 09:15:00 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

with:
-netdev tap,id=net0,vhost=on
-device virtio-net-pci,netdev=net0,id=net0,mac=52:54:00:20:c5:42,bus=pci.0,addr=0x3

## kvm01

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create worked, truncating...
qemu_memfd_alloc: mmaping
qemu_memfd_free: enter
qemu_memfd_check: ok
vhost_dev_start: enter
vhost_log_get: enter
vhost_log_alloc: enter
vhost_log_alloc: local
vhost_log_get: not shared
vhost_log_put: enter
vhost_log_put: enter
vhost_log_put: local free

(qemu) migrate -d tcp:kvm02:4444
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compres
Migration status: completed
total time: 14586 milliseconds
downtime: 10 milliseconds
setup: 20 milliseconds
transferred ram: 377224 kbytes
throughput: 212.02 mbps
remaining ram: 0 kbytes
total ram: 4001544 kbytes
duplicate: 908879 pages
skipped: 0 pages
normal: 92129 pages
normal bytes: 368516 kbytes
dirty sync count: 4

## kvm02

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create worked, truncating...
qemu_memfd_alloc: mmaping
qemu_memfd_free: enter
qemu_memfd_check: ok
vhost_dev_start: enter

(scenario 2)

Linux kvm01 3.13.0-99-generic #146-Ubuntu SMP Wed Oct 12 20:56:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

with:
-netdev tap,id=net0,vhost=on
-device virtio-net-pci,netdev=net0,id=net0,mac=52:54:00:20:c5:42,bus=pci.0,addr=0x3

## kvm01

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create failed #2
qemu_memfd_alloc: fallback
qemu_memfd_alloc: fname = /tmp/memfd-XXXXXX
qemu_memfd_alloc: fallback truncating
qemu_memfd_alloc: mmaping
qemu_memfd_free
qemu_memfd_check: ok
vhost_dev_start: enter
vhost_log_get: enter
vhost_log_alloc: enter
vhost_log_alloc: local
vhost_log_get: not shared
vhost_log_put: enter
vhost_log_put: enter
vhost_log_put: local free

(qemu) migrate -d tcp:kvm02:4444
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compres
Migration status: completed
total time: 15400 milliseconds
downtime: 9 milliseconds
setup: 5 milliseconds
transferred ram: 375812 kbytes
throughput: 199.99 mbps
remaining ram: 0 kbytes
total ram: 4001544 kbytes
duplicate: 909186 pages
skipped: 0 pages
normal: 91776 pages
normal bytes: 367104 kbytes
dirty sync count: 3

## kvm02

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create failed #2
qemu_memfd_alloc: fallback
qemu_memfd_alloc: fname = /tmp/memfd-XXXXXX
qemu_memfd_alloc: fallback truncating
qemu_memfd_alloc: mmaping
qemu_memfd_free
qemu_memfd_check: ok
vhost_dev_start: enter

For kvm01, we have 2 parts:

(1) From "-netdev tap,id=net0,vhost=on":
  - net_init_clients()
  - net_init_client()
  - net_client_init()
  - net_client_init1()
  - net_client_init_fun() .. net_init_tap() in my case
  - net_init_tap_one()
  - vhost_net_init()
  - vhost_dev_init()
  - migration checks (host feature, memfd functional test)

(2) From "-device virtio-net-pci,netdev=net0...":
  - virtio_pci_device_plugged()
  - virtio_pci_modern_regions_init()
  - virtio_pci_common_write()
  - virtio_set_status()
  - virtio_net_set_status()
  - virtio_net_vhost_status()
  - vhost_net_start()
  - vhost_net_start_one()
  - vhost_dev_start()
  - does the log allocation logic

It looks like "vhost_requires_shm_log" isn't defined by my underlaying VHOST driver (vhost-net in my case). It seems that vhost-user defines it (from VhostOps user_ops).

Judging by the outputs above, looks like vhost_dev_log_is_shared is returning false, making (2) - vhost_dev_start - to use a different log allocation (malloc) than the one that was tested for allowing migrations at (1) - vhost_dev_init.

Question: Why to check for "memfd" when its not sure - yet - if a shared descriptor and memory pointer is going to be needed for the migration to happen ? Do you want me to change that ? If memfd fails, but, the guest in question is using regular "malloc" for vhost log, we are marking it unable to live migrate by mistake. I could check for vhost_requires_shm_log pointer during vhost_dev_init (coming from tap).

Also, if possible, I would like comments about a draft:

https://pastebin.canonical.com/168579/
(please disregard printfs and minor problems)

OBS: I'm basically removing fallback mechanism from memfd, creating a generic qemu_mmap_XXX implementation, adding a vhostlog parameter in tap cmdline AND changing the decision on what to use: if vhostlog is present in cmdline, qemu_mmap_XXX on vhostlog is used. If it is a directory, a random file is created inside it. If it is a file, the file is used. If no vhostlog is given (default while libvirt isn't changed), it tries first to use memfd (all newer kernels), and, if not possible, it tries to fallback using the qemu_mmap mechanism on "tmp" directory creating random files. 

PS: Remember that this is because selinux/apparmor labelling on tmp files (and because file descriptors can be passed away, like we discussed before). 

If that is okay I'll provide a patch asap. Let me know if you prefer something else.

Thank you,
Rafael

> On Oct 04, 2016, at 12:29, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote:
> 
> 
>> On Oct 04, 2016, at 10:50, Marc-André Lureau <marcandre.lureau@gmail.com> wrote:
>> 
>> What about having a single config parameter as a place to put all vhost logs for all drives for a single instance ? Remove the memfd implementation with all the memfd shared_memory option ? Replace it with a open+unlink+ftruncate+mmap approach only.
>> 
>> 
>> I fail to see your point, memfd is superior to open+unlink and has other advantages with sealing etc.
> 
> I was just summarising needs based on previous statement from Daniel:
> 
>> This makes me wonder about the memfd_create() code path too - we'll
>> again not want that external process to be granted access to arbitrary
>> FDs of QEMU's and I'm not sure of a way to get the memfd  FD to have
>> a specific label. So I think it is possible that when using libvirt
>> we'll want the ability to tell QEMU to *always* use an explicit file
>> in a path libvirt specifies, and never use memfd even if available.
>> 
>> Regards,
>> Daniel

Rafael Tinoco Oct. 21, 2016, 3:09 a.m. UTC | #25

The correct (and draft) one:
http://pastebin.ubuntu.com/23357210/

Im passing vhostlog parameter as "hdev->log_filename" so it can be accessed from net_init_tap()-> functions AND from vhost_dev_start()-> functions. This way I don't have to change function prototypes anymore.

> On Oct 21, 2016, at 01:03, Rafael David Tinoco <rafael.tinoco@canonical.com> wrote:
> 
> Also, if possible, I would like comments about a draft:
> 
> https://pastebin.canonical.com/168579/
> (please disregard printfs and minor problems)

Marc-André Lureau Oct. 21, 2016, 6:08 p.m. UTC | #26

Hi

On Fri, Oct 21, 2016 at 6:03 AM Rafael David Tinoco <
rafael.tinoco@canonical.com> wrote:

> Judging by the outputs above, looks like vhost_dev_log_is_shared is
> returning false, making (2) - vhost_dev_start - to use a different log
> allocation (malloc) than the one that was tested for allowing migrations at
> (1) - vhost_dev_init.
>
>
correct


> Question: Why to check for "memfd" when its not sure - yet - if a shared
> descriptor and memory pointer is going to be needed for the migration to
> happen ? Do you want me to


It's done early enough to disable migration.


> change that ? If memfd fails, but, the guest in question is using regular
> "malloc" for vhost log, we are marking it unable to live migrate by
> mistake. I could check for vhost_requires_shm_log pointer during
> vhost_dev_init (coming from tap).
>
>
Right, it should be done only if vhost_dev_log_is_shared is true. Patch
welcome


> Also, if possible, I would like comments about a draft:
>
> https://pastebin.canonical.com/168579/
> (please disregard printfs and minor problems)
>
> OBS: I'm basically removing fallback mechanism from memfd, creating a
> generic qemu_mmap_XXX implementation, adding a vhostlog parameter in tap
> cmdline AND changing the decision on what to use: if vhostlog is present in
> cmdline, qemu_mmap_XXX on vhostlog is used. If it is a directory, a random
> file is created inside it. If it is a file, the file is used. If no
> vhostlog is given (default while libvirt isn't changed), it tries first to
> use memfd (all newer kernels), and, if not possible, it tries to fallback
> using the qemu_mmap mechanism on "tmp" directory creating random files.
>

Sounds reasonable, but I am not sure so many fallbacks are necessary. I
would just have an optional filename.

>
> PS: Remember that this is because selinux/apparmor labelling on tmp files
> (and because file descriptors can be passed away, like we discussed before).
>
> If that is okay I'll provide a patch asap. Let me know if you prefer
> something else.
>

Ok, I hope other comments on the idea, and I'll review your patch once on
the ML.

Thanks

util: secure memfd_create fallback mechanism

Commit Message

Comments

Patch