[multiprocess,RFC,36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess

Message ID 20190307072253.9868-1-elena.ufimtseva@oracle.com (mailing list archive)
State New, archived
Series Initial support of multi-process qemu

Commit Message

Elena Ufimtseva March 7, 2019, 7:22 a.m. UTC
From: Elena Ufimtseva <elena.ufimtseva@oracle.com>

TODO: Make relevant changes to the doc.

Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 1109 insertions(+)
 create mode 100644 docs/devel/qemu-multiprocess.txt

Comments

Thomas Huth March 7, 2019, 8:14 a.m. UTC | #1
On 07/03/2019 08.22, elena.ufimtseva@oracle.com wrote:
> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> 
> TODO: Make relevant changes to the doc.
> 
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 1109 insertions(+)
>  create mode 100644 docs/devel/qemu-multiprocess.txt
> 
> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> new file mode 100644
> index 0000000..e29c6c8
> --- /dev/null
> +++ b/docs/devel/qemu-multiprocess.txt
> @@ -0,0 +1,1109 @@
> +/*
> + * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */

Somehow weird to see such a big license statement talking about
"software", but which applies to a text file only... Not sure if it is
an option for you, but maybe one of the Creative Commons licenses
(dual-licensed with the GPLv2+) would be a better fit? E.g. for the QEMU
website, the content is dual-licensed: https://www.qemu.org/license.html

 Thomas
Kevin Wolf March 7, 2019, 2:16 p.m. UTC | #2
Am 07.03.2019 um 09:14 hat Thomas Huth geschrieben:
> On 07/03/2019 08.22, elena.ufimtseva@oracle.com wrote:
> > From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> > 
> > TODO: Make relevant changes to the doc.
> > 
> > Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> > Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> > Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> > ---
> >  docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 1109 insertions(+)
> >  create mode 100644 docs/devel/qemu-multiprocess.txt
> > 
> > diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> > new file mode 100644
> > index 0000000..e29c6c8
> > --- /dev/null
> > +++ b/docs/devel/qemu-multiprocess.txt
> > @@ -0,0 +1,1109 @@
> > +/*
> > + * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a copy
> > + * of this software and associated documentation files (the "Software"), to deal
> > + * in the Software without restriction, including without limitation the rights
> > + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> > + * copies of the Software, and to permit persons to whom the Software is
> > + * furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be included in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> > + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> > + * THE SOFTWARE.
> > + */
> 
> Somehow weird to see such a big license statement talking about
> "software", but which applies to a text file only... Not sure if it is
> an option for you, but maybe one of the Creative Commons licenses
> (dual-licensed with the GPLv2+) would be a better fit? E.g. for the QEMU
> website, the content is dual-licensed: https://www.qemu.org/license.html

While we're talking about licenses, the "All rights reserved." notice is
out of place in a license header that declares that a lot of permissions
are granted. Better to remove it to avoid any ambiguities that could
result from the contradiction. (Applies to the whole series.)

Kevin
Thomas Huth March 7, 2019, 2:21 p.m. UTC | #3
On 07/03/2019 15.16, Kevin Wolf wrote:
> Am 07.03.2019 um 09:14 hat Thomas Huth geschrieben:
>> On 07/03/2019 08.22, elena.ufimtseva@oracle.com wrote:
>>> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>
>>> TODO: Make relevant changes to the doc.
>>>
>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>> ---
>>>  docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 1109 insertions(+)
>>>  create mode 100644 docs/devel/qemu-multiprocess.txt
>>>
>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
>>> new file mode 100644
>>> index 0000000..e29c6c8
>>> --- /dev/null
>>> +++ b/docs/devel/qemu-multiprocess.txt
>>> @@ -0,0 +1,1109 @@
>>> +/*
>>> + * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>> + * of this software and associated documentation files (the "Software"), to deal
>>> + * in the Software without restriction, including without limitation the rights
>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>> + * copies of the Software, and to permit persons to whom the Software is
>>> + * furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice shall be included in
>>> + * all copies or substantial portions of the Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>> + * THE SOFTWARE.
>>> + */
>>
>> Somehow weird to see such a big license statement talking about
>> "software", but which applies to a text file only... Not sure if it is
>> an option for you, but maybe one of the Creative Commons licenses
>> (dual-licensed with the GPLv2+) would be a better fit? E.g. for the QEMU
>> website, the content is dual-licensed: https://www.qemu.org/license.html
> 
> While we're talking about licenses, the "All rights reserved." notice is
> out of place in a license header that declares that a lot of permissions
> are granted. Better to remove it to avoid any ambiguities that could
> result from the contradiction. (Applies to the whole series.)

Apart from that, it is also not required for other work anymore. See:

https://en.wikipedia.org/wiki/All_rights_reserved

 Thomas
Stefan Hajnoczi March 7, 2019, 2:26 p.m. UTC | #4
On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> new file mode 100644
> index 0000000..e29c6c8
> --- /dev/null
> +++ b/docs/devel/qemu-multiprocess.txt

Thanks for this document and the interesting work that you are doing.
I'd like to discuss the security advantages gained by disaggregating
QEMU in more detail.

The security model for VMs managed by libvirt (most production x86, ppc,
s390 guests) is that the QEMU process is untrusted and only has access
to resources belonging to the guest.  SELinux is used to restrict the
process from accessing other files, processes, etc on the host.

QEMU does not hold privileged resources that must be kept away from the
guest.  An escaped guest can access its image file, tap file descriptor,
etc but they are the same resources it could already access via device
emulation.

Can you give specific examples of how disaggregation improves security?

Stefan
Konrad Rzeszutek Wilk March 7, 2019, 2:40 p.m. UTC | #5
On Thu, Mar 07, 2019 at 03:21:47PM +0100, Thomas Huth wrote:
> On 07/03/2019 15.16, Kevin Wolf wrote:
> > Am 07.03.2019 um 09:14 hat Thomas Huth geschrieben:
> >> On 07/03/2019 08.22, elena.ufimtseva@oracle.com wrote:
> >>> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >>>
> >>> TODO: Make relevant changes to the doc.
> >>>
> >>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >>> ---
> >>>  docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 1109 insertions(+)
> >>>  create mode 100644 docs/devel/qemu-multiprocess.txt
> >>>
> >>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> >>> new file mode 100644
> >>> index 0000000..e29c6c8
> >>> --- /dev/null
> >>> +++ b/docs/devel/qemu-multiprocess.txt
> >>> @@ -0,0 +1,1109 @@
> >>> +/*
> >>> + * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> >>> + * of this software and associated documentation files (the "Software"), to deal
> >>> + * in the Software without restriction, including without limitation the rights
> >>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> >>> + * copies of the Software, and to permit persons to whom the Software is
> >>> + * furnished to do so, subject to the following conditions:
> >>> + *
> >>> + * The above copyright notice and this permission notice shall be included in
> >>> + * all copies or substantial portions of the Software.
> >>> + *
> >>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> >>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> >>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> >>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> >>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> >>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> >>> + * THE SOFTWARE.
> >>> + */
> >>
> >> Somehow weird to see such a big license statement talking about
> >> "software", but which applies to a text file only... Not sure if it is
> >> an option for you, but maybe one of the Creative Commons licenses
> >> (dual-licensed with the GPLv2+) would be a better fit? E.g. for the QEMU
> >> website, the content is dual-licensed: https://www.qemu.org/license.html
> > 
> > While we're talking about licenses, the "All rights reserved." notice is
> > out of place in a license header that declares that a lot of permissions
> > are granted. Better to remove it to avoid any ambiguities that could
> > result from the contradiction. (Applies to the whole series.)
> 
> Apart from that, it is also not required for other work anymore. See:
> 
> https://en.wikipedia.org/wiki/All_rights_reserved

Interesting. Do folks know why the Linux Foundation does it?

See for example cf0d37aecc06801d4847fb36740da4a5690d9d45 (in the Linux kernel),
where they stamp every change with their 'All Rights Reserved'.

> 
>  Thomas
Daniel P. Berrangé March 7, 2019, 2:51 p.m. UTC | #6
On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
> On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
> > diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> > new file mode 100644
> > index 0000000..e29c6c8
> > --- /dev/null
> > +++ b/docs/devel/qemu-multiprocess.txt
> 
> Thanks for this document and the interesting work that you are doing.
> I'd like to discuss the security advantages gained by disaggregating
> QEMU in more detail.
> 
> The security model for VMs managed by libvirt (most production x86, ppc,
> s390 guests) is that the QEMU process is untrusted and only has access
> to resources belonging to the guest.  SELinux is used to restrict the
> process from accessing other files, processes, etc on the host.

NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
can even do isolation with traditional DAC by putting each QEMU under
a distinct UID/GID and having libvirtd set ownership on resources each
VM is permitted to use.
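
To illustrate the DAC variant, here is a minimal Python sketch of the classic
owner/group/other permission check. The UIDs, GIDs and file mode are made-up
values, and the function models only the basic mode bits (no root override,
no ACLs, no capabilities); it is an assumption-laden model, not libvirt code.

```python
import stat

def dac_allows(proc_uid, proc_gids, file_uid, file_gid, mode, want):
    """Simplified POSIX DAC check: pick the owner, group or other bits.

    `want` is a mask built from 4 (read), 2 (write) and 1 (execute).
    Root override, ACLs and capabilities are deliberately ignored.
    """
    if proc_uid == file_uid:
        bits = (mode >> 6) & 7      # owner class
    elif file_gid in proc_gids:
        bits = (mode >> 3) & 7      # group class
    else:
        bits = mode & 7             # other class
    return (bits & want) == want

# Hypothetical per-VM identities assigned by the management layer.
VM_A = {"uid": 10001, "gids": {10001}}
VM_B = {"uid": 10002, "gids": {10002}}

# Hypothetical disk image for VM A: owned by VM A's UID/GID, mode 0600.
image_a = {"uid": 10001, "gid": 10001, "mode": stat.S_IRUSR | stat.S_IWUSR}

READ_WRITE = 4 | 2
a_ok = dac_allows(VM_A["uid"], VM_A["gids"],
                  image_a["uid"], image_a["gid"], image_a["mode"], READ_WRITE)
b_ok = dac_allows(VM_B["uid"], VM_B["gids"],
                  image_a["uid"], image_a["gid"], image_a["mode"], READ_WRITE)
```

In this model the QEMU running as VM A's UID can open its own image, while
the one running as VM B's UID falls into the "other" class and is refused.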

> QEMU does not hold privileged resources that must be kept away from the
> guest.  An escaped guest can access its image file, tap file descriptor,
> etc but they are the same resources it could already access via device
> emulation.
> 
> Can you give specific examples of how disaggregation improves security?

I guess one obvious answer is that the existing security mechanisms like
SELinux/AppArmor/DAC can be made to work in a more fine-grained manner if
there are distinct processes. This would allow for a more useful seccomp
filter to better protect against secondary kernel exploits should QEMU
itself be exploited, if we can protect individual components.
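
The seccomp point can be made concrete with a small model. The sketch below
is only set arithmetic in Python: the component names and syscall lists are
illustrative assumptions, not measurements of QEMU, and no real seccomp
filter is installed.

```python
# Hypothetical syscall needs per component (illustrative, not measured).
NEEDS = {
    "vcpu": {"ioctl", "futex", "read", "write"},
    "disk": {"read", "write", "pread64", "pwrite64", "fsync", "io_submit"},
    "net":  {"read", "write", "sendmsg", "recvmsg", "ppoll"},
}

def monolithic_allowlist(needs):
    """One process hosting every component must allow the union of all needs."""
    allowed = set()
    for syscalls in needs.values():
        allowed |= syscalls
    return allowed

def extra_exposure(needs, component):
    """Syscalls a compromised component can reach only because it shares
    a single filter with the other components."""
    return monolithic_allowlist(needs) - needs[component]

union = monolithic_allowlist(NEEDS)
disk_extra = extra_exposure(NEEDS, "disk")
```

With separate processes, each filter could in principle shrink to its own
entry in NEEDS; disk_extra is what the disk code would no longer be able to
call in this toy model.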

Not everything is protected by MAC/DAC. For example network based disks
typically have a username + password for accessing the remote storage
server. Best practice would be a distinct username for every QEMU process
such that each can only access its own storage, but I don't know of any
app which does that. So ability to split off backends into separate
processes could limit exposure of information that is not otherwise
protected by current protection models.

Whether any of this is useful in practice depends on the degree to
which the individual disaggregated pieces of QEMU trust each other.
Effectively they would have to consider each other as untrusted,
so one compromised piece can't simply trigger its desired exploit
via the communication channel with another disaggregated piece.

The broadening of vhost-user support is useful with that in much the
same way I imagine.

Regards,
Daniel
Thomas Huth March 7, 2019, 2:53 p.m. UTC | #7
On 07/03/2019 15.40, Konrad Rzeszutek Wilk wrote:
> On Thu, Mar 07, 2019 at 03:21:47PM +0100, Thomas Huth wrote:
>> On 07/03/2019 15.16, Kevin Wolf wrote:
>>> Am 07.03.2019 um 09:14 hat Thomas Huth geschrieben:
>>>> On 07/03/2019 08.22, elena.ufimtseva@oracle.com wrote:
>>>>> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>>>
>>>>> TODO: Make relevant changes to the doc.
>>>>>
>>>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>>>> ---
>>>>>  docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 1109 insertions(+)
>>>>>  create mode 100644 docs/devel/qemu-multiprocess.txt
>>>>>
>>>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
>>>>> new file mode 100644
>>>>> index 0000000..e29c6c8
>>>>> --- /dev/null
>>>>> +++ b/docs/devel/qemu-multiprocess.txt
>>>>> @@ -0,0 +1,1109 @@
>>>>> +/*
>>>>> + * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
>>>>> + *
>>>>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>>>>> + * of this software and associated documentation files (the "Software"), to deal
>>>>> + * in the Software without restriction, including without limitation the rights
>>>>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>>>>> + * copies of the Software, and to permit persons to whom the Software is
>>>>> + * furnished to do so, subject to the following conditions:
>>>>> + *
>>>>> + * The above copyright notice and this permission notice shall be included in
>>>>> + * all copies or substantial portions of the Software.
>>>>> + *
>>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>>>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>>>>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>>>>> + * THE SOFTWARE.
>>>>> + */
>>>>
>>>> Somehow weird to see such a big license statement talking about
>>>> "software", but which applies to a text file only... Not sure if it is
>>>> an option for you, but maybe one of the Creative Commons licenses
>>>> (dual-licensed with the GPLv2+) would be a better fit? E.g. for the QEMU
>>>> website, the content is dual-licensed: https://www.qemu.org/license.html
>>>
>>> While we're talking about licenses, the "All rights reserved." notice is
>>> out of place in a license header that declares that a lot of permissions
>>> are granted. Better to remove it to avoid any ambiguities that could
>>> result from the contradiction. (Applies to the whole series.)
>>
>> Apart from that, it is also not required for other work anymore. See:
>>
>> https://en.wikipedia.org/wiki/All_rights_reserved
> 
> Interesting. Do folks know why the Linux Foundation does it?
> 
> See for example cf0d37aecc06801d4847fb36740da4a5690d9d45 (in the Linux kernel),
> where they stamp every change with their 'All Rights Reserved'.

No clue why they use it. Seems unnecessary to me. But as always: IANAL

 Thomas
Michael S. Tsirkin March 7, 2019, 4:05 p.m. UTC | #8
On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> The broadening of vhost-user support is useful with that in much the
> same way I imagine.

vhost-user has more of an impact but is also a bigger maintenance
burden as clients are packaged, can be restarted etc individually.
Daniel P. Berrangé March 7, 2019, 4:19 p.m. UTC | #9
On Thu, Mar 07, 2019 at 11:05:36AM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > The broadening of vhost-user support is useful with that in much the
> > same way I imagine.
> 
> vhost-user has more of an impact but is also a bigger maintenance
> burden as clients are packaged, can be restarted etc individually.

It feels like we've already accepted that cost, though, since
vhost-user exists today & has been expanding to cover more backends.

Regards,
Daniel
Michael S. Tsirkin March 7, 2019, 4:46 p.m. UTC | #10
On Thu, Mar 07, 2019 at 04:19:44PM +0000, Daniel P. Berrangé wrote:
> On Thu, Mar 07, 2019 at 11:05:36AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > > The broadening of vhost-user support is useful with that in much the
> > > same way I imagine.
> > 
> > vhost-user has more of an impact but is also a bigger maintenance
> > burden as clients are packaged, can be restarted etc individually.
> 
> It feels like we've already accepted that cost, though, since
> vhost-user exists today & has been expanding to cover more backends.
> 
> Regards,
> Daniel

What I am trying to say is that we could easily add support for
extensions just for in-tree code since these don't create an API that
needs to be maintained.

So e.g. we do not need feature negotiation.

But yes, this could be an extension of vhost-user in some way.

Daniel P. Berrangé March 7, 2019, 4:49 p.m. UTC | #11
On Thu, Mar 07, 2019 at 11:46:19AM -0500, Michael S. Tsirkin wrote:
> On Thu, Mar 07, 2019 at 04:19:44PM +0000, Daniel P. Berrangé wrote:
> > On Thu, Mar 07, 2019 at 11:05:36AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > > > The broadening of vhost-user support is useful with that in much the
> > > > same way I imagine.
> > > 
> > > vhost-user has more of an impact but is also a bigger maintenance
> > > burden as clients are packaged, can be restarted etc individually.
> > 
> > It feels like we've already accepted that cost, though, since
> > vhost-user exists today & has been expanding to cover more backends.
> 
> What I am trying to say is that we could easily add support for
> extensions just for in-tree code since these don't create an API that
> needs to be maintained.
> 
> So e.g. we do not need feature negotiation.

Ah, I see what you mean now. Having stuff in-tree makes migration
saner too, since we don't have a combinatorial expansion of implementations
to worry about testing.

> But yes, this could be an extension of vhost-user in some way.

Regards,
Daniel
Stefan Hajnoczi March 7, 2019, 7:27 p.m. UTC | #12
On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
> > On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
> > > diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> > > new file mode 100644
> > > index 0000000..e29c6c8
> > > --- /dev/null
> > > +++ b/docs/devel/qemu-multiprocess.txt
> > 
> > Thanks for this document and the interesting work that you are doing.
> > I'd like to discuss the security advantages gained by disaggregating
> > QEMU in more detail.
> > 
> > The security model for VMs managed by libvirt (most production x86, ppc,
> > s390 guests) is that the QEMU process is untrusted and only has access
> > to resources belonging to the guest.  SELinux is used to restrict the
> > process from accessing other files, processes, etc on the host.
> 
> NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
> can even do isolation with traditional DAC by putting each QEMU under
> a distinct UID/GID and having libvirtd set ownership on resources each
> VM is permitted to use.
> 
> > QEMU does not hold privileged resources that must be kept away from the
> > guest.  An escaped guest can access its image file, tap file descriptor,
> > etc but they are the same resources it could already access via device
> > emulation.
> > 
> > Can you give specific examples of how disaggregation improves security?

Elena & collaborators: Dan has posted some ideas but please share yours
so the security benefits of this patch series can be better understood.

> I guess one obvious answer is that the existing security mechanisms like
> SELinux/AppArmor/DAC can be made to work in a more fine-grained manner if
> there are distinct processes. This would allow for a more useful seccomp
> filter to better protect against secondary kernel exploits should QEMU
> itself be exploited, if we can protect individual components.

Fine-grained sandboxing is possible in theory but tedious in practice.
From what I can tell this patch series doesn't implement any sandboxing
for child processes.

There must be a convenient way to get fine-grained sandboxing for
disaggregated devices.  In other words, it shouldn't be left as an
exercise to device process authors.

How to do this in practice must be clear from the beginning if
fine-grained sandboxing is the main selling point.

Some details to start the discussion:

 * How will fine-grained SELinux/AppArmor/DAC policies be configured for
   each process?  I guess this requires root, so does libvirt need to
   know about each process?

 * We need to make sure that processes cannot send signals to each
   other, ptrace, interfere in /proc/$PID, etc.  How will this be done?

 * Were you planning to use any other sandboxing mechanisms
   (namespaces?)?  How will they be set up if the device process is
   forked/executed by an unprivileged QEMU?

> Not everything is protected by MAC/DAC. For example network based disks
> typically have a username + password for accessing the remote storage
> server. Best practice would be a distinct username for every QEMU process
> such that each can only access its own storage, but I don't know of any
> app which does that. So ability to split off backends into separate
> processes could limit exposure of information that is not otherwise
> protected by current protection models.

If the disaggregated disk process with a global username + password is
compromised then all your disk images are compromised.  So you still
need to follow the best practice of per-VM credentials even with
disaggregation, and if you do then disaggregation doesn't add anything!

Stefan
John Johnson March 7, 2019, 11:29 p.m. UTC | #13
> On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
>> On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
>>> On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
>>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
>>>> new file mode 100644
>>>> index 0000000..e29c6c8
>>>> --- /dev/null
>>>> +++ b/docs/devel/qemu-multiprocess.txt
>>> 
>>> Thanks for this document and the interesting work that you are doing.
>>> I'd like to discuss the security advantages gained by disaggregating
>>> QEMU in more detail.
>>> 
>>> The security model for VMs managed by libvirt (most production x86, ppc,
>>> s390 guests) is that the QEMU process is untrusted and only has access
>>> to resources belonging to the guest.  SELinux is used to restrict the
>>> process from accessing other files, processes, etc on the host.
>> 
>> NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
>> can even do isolation with traditional DAC by putting each QEMU under
>> a distinct UID/GID and having libvirtd set ownership on resources each
>> VM is permitted to use.
>> 
>>> QEMU does not hold privileged resources that must be kept away from the
>>> guest.  An escaped guest can access its image file, tap file descriptor,
>>> etc but they are the same resources it could already access via device
>>> emulation.
>>> 
>>> Can you give specific examples of how disaggregation improves security?
> 
> Elena & collaborators: Dan has posted some ideas but please share yours
> so the security benefits of this patch series can be better understood.
> 

	Dan covered the main point.  The security regime we use (SELinux)
constrains the actions of processes on objects, so having multiple processes
allows us to apply more fine-grained policies.


>> I guess one obvious answer is that the existing security mechanisms like
>> SELinux/AppArmor/DAC can be made to work in a more fine-grained manner if
>> there are distinct processes. This would allow for a more useful seccomp
>> filter to better protect against secondary kernel exploits should QEMU
>> itself be exploited, if we can protect individual components.
> 
> Fine-grained sandboxing is possible in theory but tedious in practice.
> From what I can tell this patch series doesn't implement any sandboxing
> for child processes.
> 

	The policies aren’t in QEMU, but in the SELinux config files.
They would say, for example, that when the QEMU process exec()s the
disk emulation process, the process security context type transitions
to a new type.  This type would have permission to access the VM image
objects, whereas the QEMU process type (and any other device emulation
process types) cannot access them.
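
	As a rough illustration of the type-transition idea, the Python
sketch below models a domain transition on exec() plus a table of allow
rules. The type names (qemu_t, qemu_disk_exec_t, qemu_disk_t, vm_image_t)
are hypothetical, and the model ignores nearly everything a real SELinux
policy involves; it only shows the shape of the rules.

```python
# Hypothetical policy: (source domain, executable file type) -> new domain.
TYPE_TRANSITIONS = {
    ("qemu_t", "qemu_disk_exec_t"): "qemu_disk_t",
}

# Hypothetical allow rules: (domain, object type) -> permitted operations.
ALLOW = {
    ("qemu_disk_t", "vm_image_t"): {"read", "write"},
}

def domain_after_exec(domain, exec_type):
    """Return the domain a process runs in after exec().

    If no transition rule matches, the domain is unchanged.
    """
    return TYPE_TRANSITIONS.get((domain, exec_type), domain)

def allowed(domain, obj_type, op):
    """Default-deny lookup in the allow table."""
    return op in ALLOW.get((domain, obj_type), set())

# QEMU (qemu_t) exec()s the disk emulation binary (qemu_disk_exec_t).
disk_domain = domain_after_exec("qemu_t", "qemu_disk_exec_t")
disk_can_read = allowed(disk_domain, "vm_image_t", "read")
qemu_can_read = allowed("qemu_t", "vm_image_t", "read")
```

	In this toy policy only the transitioned disk process type can
touch the image objects; the parent QEMU type has no matching allow rule.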

	If you wanted to use DAC, you could do something similar by
making the disk emulation executable setuid to a UID that can access
VM image files.

	In either case, the policies and permissions are set up before
libvirt even runs, so it doesn’t need to be aware of them.


> There must be a convenient way to get fine-grained sandboxing for
> disaggregated devices.  In other words, it shouldn't be left as an
> exercise to device process authors.
> 

	We can add some MAC or DAC suggestions in the documentation.


> How to do this in practice must be clear from the beginning if
> fine-grained sandboxing is the main selling point.
> 
> Some details to start the discussion:
> 
> * How will fine-grained SELinux/AppArmor/DAC policies be configured for
>   each process?  I guess this requires root, so does libvirt need to
>   know about each process?
> 

	The policies would apply to process security context types (or
UIDs in a DAC regime), so I would not expect libvirt to be aware of them.


> * We need to make sure that processes cannot send signals to each
>   other, ptrace, interfere in /proc/$PID, etc.  How will this be done?
> 

	Any process type restrictions would be enforced by SELinux.


> * Were you planning to use any other sandboxing mechanisms
>   (namespaces?)?  How will they be set up if the device process is
>   forked/executed by an unprivileged QEMU?
> 

	All of the QEMU-related processes for a single VM will run
in the same container, but the container is created, along with its SELinux
policies, before libvirt is run.


>> Not everything is protected by MAC/DAC. For example network based disks
>> typically have a username + password for accessing the remote storage
>> server. Best practice would be a distinct username for every QEMU process
>> such that each can only access its own storage, but I don't know of any
>> app which does that. So ability to split off backends into separate
>> processes could limit exposure of information that is not otherwise
>> protected by current protection models.
> 
> If the disaggregated disk process with a global username + password is
> compromised then all your disk images are compromised.  So you still
> need to follow the best practice of per-VM credentials even with
> disaggregation, and if you do then disaggregation doesn't add anything!
> 

	You could put disk secrets in files that can only be read by the
disk emulation process type.  If you wanted even finer granularity, you
could use MCS to run each disk controller instance in a different security
context category, and make the secret files only readable by the corresponding
category.
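
	A simplified way to picture the MCS variant: access is granted only
when the process's category set includes every category on the object. The
Python sketch below assumes that simplified rule and uses made-up category
labels; real MCS constraints are defined in the policy and cover more cases.

```python
def mcs_dominates(process_cats, object_cats):
    """Simplified MCS rule: the process categories must include
    every category attached to the object."""
    return set(object_cats) <= set(process_cats)

# Hypothetical category assignments, one per disk controller instance.
disk_vm1 = {"c101"}
disk_vm2 = {"c202"}

# Hypothetical label on the secret file holding VM 1's storage credentials.
secret_vm1 = {"c101"}

vm1_can_read = mcs_dominates(disk_vm1, secret_vm1)
vm2_can_read = mcs_dominates(disk_vm2, secret_vm1)
```

	Under these assumptions a compromised disk process for VM 2 shares
the same type as VM 1's disk process but still cannot read VM 1's secret.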

	Another layer of security would be to have network security policies
that only allow the disk emulation processes to connect to the storage servers.

								JJ
Stefan Hajnoczi March 8, 2019, 9:50 a.m. UTC | #14
On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> > On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> >> On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
> >>> On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
> >>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> >>>> new file mode 100644
> >>>> index 0000000..e29c6c8
> >>>> --- /dev/null
> >>>> +++ b/docs/devel/qemu-multiprocess.txt
> >>> 
> >>> Thanks for this document and the interesting work that you are doing.
> >>> I'd like to discuss the security advantages gained by disaggregating
> >>> QEMU in more detail.
> >>> 
> >>> The security model for VMs managed by libvirt (most production x86, ppc,
> >>> s390 guests) is that the QEMU process is untrusted and only has access
> >>> to resources belonging to the guest.  SELinux is used to restrict the
> >>> process from accessing other files, processes, etc on the host.
> >> 
> >> NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
> >> can even do isolation with traditional DAC by putting each QEMU under
> >> a distinct UID/GID and having libvirtd set ownership on resources each
> >> VM is permitted to use.
> >> 
> >>> QEMU does not hold privileged resources that must be kept away from the
> >>> guest.  An escaped guest can access its image file, tap file descriptor,
> >>> etc but they are the same resources it could already access via device
> >>> emulation.
> >>> 
> >>> Can you give specific examples of how disaggregation improves security?
> > 
> > Elena & collaborators: Dan has posted some ideas but please share yours
> > so the security benefits of this patch series can be better understood.
> > 
> 
> 	Dan covered the main point.  The security regime we use (selinux)
> constrains the actions of processes on objects, so having multiple processes
> allows us to apply more fine-grained policies.

Please share the SELinux policy files, containerization scripts, etc.
There is probably a home for them in qemu.git, libvirt.git, or elsewhere
upstream.

We need to find a way to make the sandboxing improvements available to
users besides yourself and easily reusable for developers who wish to
convert additional device models.

Thanks,
Stefan
Elena Ufimtseva March 8, 2019, 6:22 p.m. UTC | #15
On Thu, Mar 07, 2019 at 03:16:42PM +0100, Kevin Wolf wrote:
> On 07.03.2019 at 09:14, Thomas Huth wrote:
> > On 07/03/2019 08.22, elena.ufimtseva@oracle.com wrote:
> > > From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> > > 
> > > TODO: Make relevant changes to the doc.
> > > 
> > > Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> > > Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> > > Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> > > ---
> > >  docs/devel/qemu-multiprocess.txt | 1109 ++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 1109 insertions(+)
> > >  create mode 100644 docs/devel/qemu-multiprocess.txt
> > > 
> > > diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> > > new file mode 100644
> > > index 0000000..e29c6c8
> > > --- /dev/null
> > > +++ b/docs/devel/qemu-multiprocess.txt
> > > @@ -0,0 +1,1109 @@
> > > +/*
> > > + * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining a copy
> > > + * of this software and associated documentation files (the "Software"), to deal
> > > + * in the Software without restriction, including without limitation the rights
> > > + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> > > + * copies of the Software, and to permit persons to whom the Software is
> > > + * furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice shall be included in
> > > + * all copies or substantial portions of the Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> > > + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> > > + * THE SOFTWARE.
> > > + */
> > 
> > Somehow weird to see such a big license statement talking about
> > "software", but which applies to a text file only... Not sure if it is
> > an option for you, but maybe one of the Creative Commons licenses
> > (dual-licensed with the GPLv2+) would be a better fit? E.g. for the QEMU
> > website, the content is dual-licensed: https://www.qemu.org/license.html
> 

Thanks Thomas,
working on figuring this part out.

> While we're talking about licenses, the "All rights reserved." notice is
> out of place in a license header that declares that a lot of permissions
> are granted. Better to remove it to avoid any ambiguities that could
> result from the contradiction. (Applies to the whole series.)
>

Thanks Kevin,

This will be removed.

Elena

> Kevin
Daniel P. Berrangé March 11, 2019, 10:20 a.m. UTC | #16
On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> 
> 
> > On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> >> I guess one obvious answer is that the existing security mechanisms like
> >> SELinux/AppArmor/DAC can be made to work in a more fine grained manner if
> >> there are distinct processes. This would allow for a more useful seccomp
> >> filter to better protect against secondary kernel exploits should QEMU
> >> itself be exploited, if we can protect individual components.
> > 
> > Fine-grained sandboxing is possible in theory but tedious in practice.
> > From what I can tell this patch series doesn't implement any sandboxing
> > for child processes.
> > 
> 
> 	The policies aren’t in QEMU, but in the selinux config files.
> They would say, for example, that when the QEMU process exec()s the
> disk emulation process, the process security context type transitions
> to a new type.  This type would have permission to access the VM image
> objects, whereas the QEMU process type (and any other device emulation
> process types) cannot access them.

Note that currently all QEMU instances run by libvirt have seccomp
policy applied that explicitly forbids any use of fork+exec as a way
to reduce avenues of attack for an exploited QEMU.

Even in a modularized QEMU I'd be loath to allow QEMU to have the
fork+exec privilege, unless "QEMU" in this case was just a stub
process that does nothing more than fork+exec the other binaries,
while having zero attack surface exposed to the untrusted guest OS.
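
To make the seccomp point concrete, here is a rough sketch of a filter
that makes execve() fail once it is installed. It is an illustration
only, not the filter that QEMU or libvirt actually install, and the
function names deny_execve() and execve_denied_in_child() are invented
for this example. It is Linux-specific and deliberately incomplete.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Illustration only: make execve() fail with EPERM for the calling
 * process and its descendants.  A production filter would also check
 * seccomp_data.arch and cover execveat(), fork(), clone() and friends.
 * Returns 0 on success, -1 on failure.
 */
static int deny_execve(void)
{
    struct sock_filter filter[] = {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* execve: fall through to the EPERM return; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Needed so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0) {
        return -1;
    }
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Demonstration: install the filter in a child (it cannot be removed
 * again) and try to exec a program there.
 * Returns 1 if the exec was refused with EPERM, 0 if not, -1 on error.
 */
static int execve_denied_in_child(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        char *argv[] = { "/bin/true", NULL };

        if (deny_execve() != 0) {
            _exit(2);
        }
        execv(argv[0], argv);
        _exit(errno == EPERM ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        return -1;
    }
    if (WEXITSTATUS(status) == 2) {
        return -1;
    }
    return WEXITSTATUS(status) == 0;
}
```

With a filter of this kind in place, a QEMU process that tried to
exec() a device emulation program itself would simply get an error,
which is why the spawning has to happen somewhere else.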

> 	If you wanted to use DAC, you could do something similar by
> making the disk emulation executable setuid to a UID that can access
> VM image files.
> 
> 	In either case, the policies and permissions are set up before
> libvirt even runs, so it doesn’t need to be aware of them.

That's not the case bearing in mind the above point about fork+exec
being forbidden. It would likely require libvirt to be in charge of
spawning the various helper binaries from a trusted context.


> > How to do this in practice must be clear from the beginning if
> > fine-grained sandboxing is the main selling point.
> > 
> > Some details to start the discussion:
> > 
> > * How will fine-grained SELinux/AppArmor/DAC policies be configured for
> >   each process?  I guess this requires root, so does libvirt need to
> >   know about each process?
> > 
> 
> 	The policies would apply to process security context types (or
> UIDs in a DAC regime), so I would not expect libvirt to be aware of them.

I'm pretty skeptical that such a large modularization of QEMU can be
done without libvirt being aware of it & needing some kind of changes
applied.


Regards,
Daniel
John Johnson March 22, 2019, 3:26 a.m. UTC | #17
>  On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> 
>> On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <address@hidden> wrote:
>>> 
>>> On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
>>>> I guess one obvious answer is that the existing security mechanisms like
>>>> SELinux/AppArmor/DAC can be made to work in a more fine grained manner if
>>>> there are distinct processes. This would allow for a more useful seccomp
>>>> filter to better protect against secondary kernel exploits should QEMU
>>>> itself be exploited, if we can protect individual components.
>>> 
>>> Fine-grained sandboxing is possible in theory but tedious in practice.
>>> From what I can tell this patch series doesn't implement any sandboxing
>>> for child processes.
>>> 
>> 
>>     The policies aren’t in QEMU, but in the selinux config files.
>> They would say, for example, that when the QEMU process exec()s the
>> disk emulation process, the process security context type transitions
>> to a new type.  This type would have permission to access the VM image
>> objects, whereas the QEMU process type (and any other device emulation
>> process types) cannot access them.
> 
> 
> Note that currently all QEMU instances run by libvirt have seccomp
> policy applied that explicitly forbids any use of fork+exec as a way
> to reduce avenues of attack for an exploited QEMU.
> 
> Even in a modularized QEMU I'd be loath to allow QEMU to have the
> fork+exec privilege, unless "QEMU" in this case was just a stub
> process that does nothing more than fork+exec the other binaries,
> while having zero attack surface exposed to the untrusted guest OS.
> 

	We’re looking at a couple ways to address your concerns.
One is a stub process, as you mentioned above, but if we need to
create programming to fork() and exec() the required emulation
programs before exec()ing QEMU, then it may make sense to just put
that programming into libvirt itself.

	Both paths would need similar changes to QEMU, such as the
ability to receive descriptions of the emulation processes the parent
process has created, and file descriptors that it has set up to
communicate with them.  Each remote device would then be matched with
its corresponding external process.

	The difference would be whether to create a new stub program
to create the emulation processes, or delegate that task to libvirt’s
QEMU driver.
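
	As a rough sketch of the common part of both paths (this is
not code from the series; spawn_emulation_process() and CHILD_FD are
names made up for the illustration), the parent creates a socketpair,
forks, and exec()s the emulation program with one end of the channel on
a fixed descriptor, keeping the other end to hand to QEMU later:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Descriptor number on which the child finds its end of the channel.
 * This is an arbitrary convention chosen for the sketch.
 */
#define CHILD_FD 3

/*
 * Hypothetical helper: start one emulation program (argv[0] is its
 * path) and return its pid.  On success *parent_fd holds the parent's
 * end of a socketpair connected to the child.  Returns -1 on failure.
 */
static pid_t spawn_emulation_process(char *const argv[], int *parent_fd)
{
    int sv[2];
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL && parent_fd != NULL);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: keep only its end of the channel, on a well-known fd. */
        close(sv[0]);
        if (sv[1] != CHILD_FD) {
            if (dup2(sv[1], CHILD_FD) < 0) {
                _exit(127);
            }
            close(sv[1]);
        }
        execv(argv[0], argv);
        _exit(127);     /* only reached if execv() failed */
    }

    /* Parent: keep sv[0]; it would later be handed over to QEMU. */
    close(sv[1]);
    *parent_fd = sv[0];
    return pid;
}
```

	The caller would repeat this for each remote device and then
start QEMU with the collected descriptors.  How QEMU is told which
descriptor belongs to which device is exactly the interface that still
has to be defined, so it is left out here.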

	Do you have an opinion on a stub program vs libvirt integration?

						JJ
Daniel P. Berrangé March 22, 2019, 9:45 a.m. UTC | #18
On Thu, Mar 21, 2019 at 08:26:47PM -0700, John G Johnson wrote:
> 
>  
> >  On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > 
> >> On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <address@hidden> wrote:
> >>> 
> >>> On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> >>>> I guess one obvious answer is that the existing security mechanisms like
> >>>> SELinux/AppArmor/DAC can be made to work in a more fine grained manner if
> >>>> there are distinct processes. This would allow for a more useful seccomp
> >>>> filter to better protect against secondary kernel exploits should QEMU
> >>>> itself be exploited, if we can protect individual components.
> >>> 
> >>> Fine-grained sandboxing is possible in theory but tedious in practice.
> >>> From what I can tell this patch series doesn't implement any sandboxing
> >>> for child processes.
> >>> 
> >> 
> >>     The policies aren’t in QEMU, but in the selinux config files.
> >> They would say, for example, that when the QEMU process exec()s the
> >> disk emulation process, the process security context type transitions
> >> to a new type.  This type would have permission to access the VM image
> >> objects, whereas the QEMU process type (and any other device emulation
> >> process types) cannot access them.
> > 
> > 
> > Note that currently all QEMU instances run by libvirt have seccomp
> > policy applied that explicitly forbids any use of fork+exec as a way
> > to reduce avenues of attack for an exploited QEMU.
> > 
> > Even in a modularized QEMU I'd be loath to allow QEMU to have the
> > fork+exec privilege, unless "QEMU" in this case was just a stub
> > process that does nothing more than fork+exec the other binaries,
> > while having zero attack surface exposed to the untrusted guest OS.
> > 
> 
> 	We’re looking at a couple ways to address your concerns.
> One is a stub process, as you mentioned above, but if we need to
> create programming to fork() and exec() the required emulation
> programs before exec()ing QEMU, then it may make sense to just put
> that programming into libvirt itself.
> 
> 	Both paths would need similar changes to QEMU, such as the
> ability to receive descriptions of the emulation processes the parent
> process has created, and file descriptors that it has set up to
> communicate with them.  Each remote device would then be matched with
> its corresponding external process.
> 
> 	The difference would be whether to create a new stub program
> to create the emulation processes, or delegate that task to libvirt’s
> QEMU driver.
> 
> 	Do you have an opinion on a stub program vs libvirt integration?

Libvirt's preference would be to retain full control over what programs
are spawned. This allows us to control their resource usage / placement
/ security policies. Having a stub that hides this from libvirt will
make this control harder, as we'll then need to interrogate the stub
to find out what it did & apply controls appropriately. Also if more
external processes need to be spawned when hotplugging a device, then
libvirt would definitely want to have full control, as once QEMU vCPUs
have been started we don't trust it.

Regards,
Daniel
Stefan Hajnoczi March 26, 2019, 8:08 a.m. UTC | #19
On Fri, Mar 08, 2019 at 09:50:36AM +0000, Stefan Hajnoczi wrote:
> On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> > > On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > >> On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
> > >>> On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
> > >>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> > >>>> new file mode 100644
> > >>>> index 0000000..e29c6c8
> > >>>> --- /dev/null
> > >>>> +++ b/docs/devel/qemu-multiprocess.txt
> > >>> 
> > >>> Thanks for this document and the interesting work that you are doing.
> > >>> I'd like to discuss the security advantages gained by disaggregating
> > >>> QEMU in more detail.
> > >>> 
> > >>> The security model for VMs managed by libvirt (most production x86, ppc,
> > >>> s390 guests) is that the QEMU process is untrusted and only has access
> > >>> to resources belonging to the guest.  SELinux is used to restrict the
> > >>> process from accessing other files, processes, etc on the host.
> > >> 
> > >> NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
> > >> can even do isolation with traditional DAC by putting each QEMU under
> > >> a distinct UID/GID and having libvirtd set ownership on resources each
> > >> VM is permitted to use.
> > >> 
> > >>> QEMU does not hold privileged resources that must be kept away from the
> > >>> guest.  An escaped guest can access its image file, tap file descriptor,
> > >>> etc but they are the same resources it could already access via device
> > >>> emulation.
> > >>> 
> > >>> Can you give specific examples of how disaggregation improves security?
> > > 
> > > Elena & collaborators: Dan has posted some ideas but please share yours
> > > so the security benefits of this patch series can be better understood.
> > > 
> > 
> > 	Dan covered the main point.  The security regime we use (selinux)
> > constrains the actions of processes on objects, so having multiple processes
> > allows us to apply more fine-grained policies.
> 
> Please share the SELinux policy files, containerization scripts, etc.
> There is probably a home for them in qemu.git, libvirt.git, or elsewhere
> upstream.
> 
> We need to find a way to make the sandboxing improvements available to
> users besides yourself and easily reusable for developers who wish to
> convert additional device models.

Ping?

Without the scripts/policies there is no security benefit from this
patch series.

Stefan
Jag Raman March 26, 2019, 2:31 p.m. UTC | #20
On 3/26/2019 4:08 AM, Stefan Hajnoczi wrote:
> On Fri, Mar 08, 2019 at 09:50:36AM +0000, Stefan Hajnoczi wrote:
>> On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
>>>> On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
>>>>> On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
>>>>>> On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
>>>>>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
>>>>>>> new file mode 100644
>>>>>>> index 0000000..e29c6c8
>>>>>>> --- /dev/null
>>>>>>> +++ b/docs/devel/qemu-multiprocess.txt
>>>>>>
>>>>>> Thanks for this document and the interesting work that you are doing.
>>>>>> I'd like to discuss the security advantages gained by disaggregating
>>>>>> QEMU in more detail.
>>>>>>
>>>>>> The security model for VMs managed by libvirt (most production x86, ppc,
>>>>>> s390 guests) is that the QEMU process is untrusted and only has access
>>>>>> to resources belonging to the guest.  SELinux is used to restrict the
>>>>>> process from accessing other files, processes, etc on the host.
>>>>>
>>>>> NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
>>>>> can even do isolation with traditional DAC by putting each QEMU under
>>>>> a distinct UID/GID and having libvirtd set ownership on resources each
>>>>> VM is permitted to use.
>>>>>
>>>>>> QEMU does not hold privileged resources that must be kept away from the
>>>>>> guest.  An escaped guest can access its image file, tap file descriptor,
>>>>>> etc but they are the same resources it could already access via device
>>>>>> emulation.
>>>>>>
>>>>>> Can you give specific examples of how disaggregation improves security?
>>>>
>>>> Elena & collaborators: Dan has posted some ideas but please share yours
>>>> so the security benefits of this patch series can be better understood.
>>>>
>>>
>>> 	Dan covered the main point.  The security regime we use (selinux)
>>> constrains the actions of processes on objects, so having multiple processes
>>> allows us to apply more fine-grained policies.
>>
>> Please share the SELinux policy files, containerization scripts, etc.
>> There is probably a home for them in qemu.git, libvirt.git, or elsewhere
>> upstream.
>>
>> We need to find a way to make the sandboxing improvements available to
>> users besides yourself and easily reusable for developers who wish to
>> convert additional device models.
> 
> Ping?
> 
> Without the scripts/policies there is no security benefit from this
> patch series.

Hi Stefan,

We are working on this. We'll get back to you once we have this
available.

Thanks!
--
Jag

> 
> Stefan
>
Philippe Mathieu-Daudé March 26, 2019, 10:20 p.m. UTC | #21
On Tue, Mar 26, 2019 at 15:34, Jag Raman <jag.raman@oracle.com> wrote:

>
>
> On 3/26/2019 4:08 AM, Stefan Hajnoczi wrote:
> > On Fri, Mar 08, 2019 at 09:50:36AM +0000, Stefan Hajnoczi wrote:
> >> On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> >>>> On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>> On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> >>>>> On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
> >>>>>> On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
> >>>>>>> diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> >>>>>>> new file mode 100644
> >>>>>>> index 0000000..e29c6c8
> >>>>>>> --- /dev/null
> >>>>>>> +++ b/docs/devel/qemu-multiprocess.txt
> >>>>>>
> >>>>>> Thanks for this document and the interesting work that you are doing.
> >>>>>> I'd like to discuss the security advantages gained by disaggregating
> >>>>>> QEMU in more detail.
> >>>>>>
> >>>>>> The security model for VMs managed by libvirt (most production x86, ppc,
> >>>>>> s390 guests) is that the QEMU process is untrusted and only has access
> >>>>>> to resources belonging to the guest.  SELinux is used to restrict the
> >>>>>> process from accessing other files, processes, etc on the host.
> >>>>>
> >>>>> NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
> >>>>> can even do isolation with traditional DAC by putting each QEMU under
> >>>>> a distinct UID/GID and having libvirtd set ownership on resources each
> >>>>> VM is permitted to use.
> >>>>>
> >>>>>> QEMU does not hold privileged resources that must be kept away from the
> >>>>>> guest.  An escaped guest can access its image file, tap file descriptor,
> >>>>>> etc but they are the same resources it could already access via device
> >>>>>> emulation.
> >>>>>>
> >>>>>> Can you give specific examples of how disaggregation improves security?
> >>>>
> >>>> Elena & collaborators: Dan has posted some ideas but please share yours
> >>>> so the security benefits of this patch series can be better understood.
> >>>>
> >>>
> >>>     Dan covered the main point.  The security regime we use (selinux)
> >>> constrains the actions of processes on objects, so having multiple processes
> >>> allows us to apply more fine-grained policies.
> >>
> >> Please share the SELinux policy files, containerization scripts, etc.
> >> There is probably a home for them in qemu.git, libvirt.git, or elsewhere
> >> upstream.
> >>
> >> We need to find a way to make the sandboxing improvements available to
> >> users besides yourself and easily reusable for developers who wish to
> >> convert additional device models.
>

Also for testing this series.

>
> > Ping?
> >
> > Without the scripts/policies there is no security benefit from this
> > patch series.
>
> Hi Stefan,
>
> We are working on this. We'll get back to you once we have this
> available.
>
> Thanks!
> --
> Jag
>
> >
> > Stefan
> >
>
>
Stefan Hajnoczi March 27, 2019, 4:37 p.m. UTC | #22
On Tue, Mar 26, 2019 at 10:31:53AM -0400, Jag Raman wrote:
> 
> 
> On 3/26/2019 4:08 AM, Stefan Hajnoczi wrote:
> > On Fri, Mar 08, 2019 at 09:50:36AM +0000, Stefan Hajnoczi wrote:
> > > On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> > > > > On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > > > > > On Thu, Mar 07, 2019 at 02:26:09PM +0000, Stefan Hajnoczi wrote:
> > > > > > > On Wed, Mar 06, 2019 at 11:22:53PM -0800, elena.ufimtseva@oracle.com wrote:
> > > > > > > > diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
> > > > > > > > new file mode 100644
> > > > > > > > index 0000000..e29c6c8
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/docs/devel/qemu-multiprocess.txt
> > > > > > > 
> > > > > > > Thanks for this document and the interesting work that you are doing.
> > > > > > > I'd like to discuss the security advantages gained by disaggregating
> > > > > > > QEMU in more detail.
> > > > > > > 
> > > > > > > The security model for VMs managed by libvirt (most production x86, ppc,
> > > > > > > s390 guests) is that the QEMU process is untrusted and only has access
> > > > > > > to resources belonging to the guest.  SELinux is used to restrict the
> > > > > > > process from accessing other files, processes, etc on the host.
> > > > > > 
> > > > > > NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
> > > > > > can even do isolation with traditional DAC by putting each QEMU under
> > > > > > a distinct UID/GID and having libvirtd set ownership on resources each
> > > > > > VM is permitted to use.
> > > > > > 
> > > > > > > QEMU does not hold privileged resources that must be kept away from the
> > > > > > > guest.  An escaped guest can access its image file, tap file descriptor,
> > > > > > > etc but they are the same resources it could already access via device
> > > > > > > emulation.
> > > > > > > 
> > > > > > > Can you give specific examples of how disaggregation improves security?
> > > > > 
> > > > > Elena & collaborators: Dan has posted some ideas but please share yours
> > > > > so the security benefits of this patch series can be better understood.
> > > > > 
> > > > 
> > > > 	Dan covered the main point.  The security regime we use (selinux)
> > > > constrains the actions of processes on objects, so having multiple processes
> > > > allows us to apply more fine-grained policies.
> > > 
> > > Please share the SELinux policy files, containerization scripts, etc.
> > > There is probably a home for them in qemu.git, libvirt.git, or elsewhere
> > > upstream.
> > > 
> > > We need to find a way to make the sandboxing improvements available to
> > > users besides yourself and easily reusable for developers who wish to
> > > convert additional device models.
> > 
> > Ping?
> > 
> > Without the scripts/policies there is no security benefit from this
> > patch series.
> 
> Hi Stefan,
> 
> We are working on this. We'll get back to you once we have this
> available.

Great, thanks!

Stefan
Jag Raman April 23, 2019, 9:26 p.m. UTC | #23
On 3/26/2019 6:20 PM, Philippe Mathieu-Daudé wrote:

>>>> Please share the SELinux policy files, containerization scripts, etc.
>>>> There is probably a home for them in qemu.git, libvirt.git, or elsewhere
>>>> upstream.
>>>>
>>>> We need to find a way to make the sandboxing improvements available to
>>>> users besides yourself and easily reusable for developers who wish to
>>>> convert additional device models.
>>
> 
> Also for testing this series.

Hi,

We are wondering how to deliver the example SELinux policies. I have
posted on Fedora's SELinux mailing list to ask how to upstream an
SELinux policy.

We are developing SELinux Type Enforcements and MCS labels to sandbox
the emulation process. Details of an example Type Enforcement are
available below.

We are also working on changes to libvirt, to launch the remote process
and apply MCS labels. Libvirt changes will be posted separately in the
future.

The Type Enforcements for SELinux are available at the pastebin location
below (also copied at the end of this email):
https://pastebin.com/t1bpS6MY

An RPM package which installs this policy as a SELinux module, and
configures the file contexts for the executables, is available for
download in the link below:
http://wikisend.com/download/156700/mpqemu-selinux-example-1.0-1.fc29.noarch.rpm

The README for the RPM can be obtained by running the following commands:
# rpm2cpio ./mpqemu-selinux-example-1.0-1.fc29.noarch.rpm | cpio -idmv
# cat opt/mpqemu-selinux-example/doc/README

Thanks!
--
Jag


----------
mpqemu.te:
----------

module mpqemu 1.0;


require {
	class process transition;
	class file { execute read };
	class file entrypoint;
	class dir search;
	class file { getattr open read };
	class file { getattr map open read };
	class file { execute map read };
	class lnk_file read;
	class chr_file { lock open read write };
	class file { getattr ioctl lock open read write };
	class process fork;
	class fd use;
	class unix_stream_socket { read write };
	class file open;
	class process { noatsecure rlimitinh siginh };
	class file write;
	class dir { getattr search };
	class file { open read };
	class process getattr;
	type qemu_t;
	type qemu_exec_t;
	type virtd_t;
	type ld_so_cache_t;
	type ld_so_t;
	type lib_t;
	type null_device_t;
	type virt_image_t;
	type shell_exec_t;
	type init_t;
	attribute domain;
	attribute entry_type;
	attribute exec_type;
	attribute application_exec_type;
	attribute file_type, non_security_file_type, non_auth_file_type;
	attribute virt_domain;
	attribute virt_image_type;
	
};


type qemu_lsi53c895a_exec_t;
type qemu_lsi53c895a_img_t;
type qemu_lsi53c895a_t;

typeattribute qemu_lsi53c895a_t virt_domain;

typeattribute qemu_lsi53c895a_exec_t file_type, non_security_file_type, non_auth_file_type;
typeattribute qemu_lsi53c895a_exec_t exec_type;
typeattribute qemu_lsi53c895a_exec_t application_exec_type;
typeattribute qemu_lsi53c895a_exec_t entry_type;
typeattribute qemu_lsi53c895a_img_t file_type, non_security_file_type, non_auth_file_type;
typeattribute qemu_lsi53c895a_img_t virt_image_type;
type_transition qemu_t qemu_lsi53c895a_exec_t : process qemu_lsi53c895a_t;
type_transition virtd_t qemu_exec_t : process qemu_t;

#============= init_t ==============
allow init_t qemu_lsi53c895a_t:dir search;
allow init_t qemu_lsi53c895a_t:file { getattr open read };

#============= qemu_lsi53c895a_t ==============
allow qemu_lsi53c895a_t ld_so_cache_t : file { getattr map open read };
allow qemu_lsi53c895a_t ld_so_t : file { execute map read };
allow qemu_lsi53c895a_t lib_t : lnk_file read;
allow qemu_lsi53c895a_t null_device_t : chr_file { lock open read write };
allow qemu_lsi53c895a_t qemu_lsi53c895a_exec_t : file { execute map read };
allow qemu_lsi53c895a_t qemu_lsi53c895a_img_t : file { getattr ioctl lock open read write };
allow qemu_lsi53c895a_t self : process fork;
allow qemu_lsi53c895a_t qemu_t : fd use;
allow qemu_lsi53c895a_t qemu_t : unix_stream_socket { read write };
allow qemu_lsi53c895a_t qemu_lsi53c895a_exec_t : file entrypoint;

#============= qemu_t ==============
allow qemu_t qemu_lsi53c895a_exec_t : file open;
allow qemu_t qemu_lsi53c895a_t : process { noatsecure rlimitinh siginh };
allow qemu_t virt_image_t : file write;
allow qemu_t qemu_lsi53c895a_t : process transition;
allow qemu_t qemu_lsi53c895a_exec_t : file { execute read };

#============= virtd_t ==============
allow virtd_t shell_exec_t : file entrypoint;



>>> Stefan
>>>
>>
>>
Stefan Hajnoczi April 25, 2019, 3:44 p.m. UTC | #24
On Tue, Apr 23, 2019 at 05:26:33PM -0400, Jag Raman wrote:
> On 3/26/2019 6:20 PM, Philippe Mathieu-Daudé wrote:
> 
> > > > > Please share the SELinux policy files, containerization scripts, etc.
> > > > > There is probably a home for them in qemu.git, libvirt.git, or elsewhere
> > > > > upstream.
> > > > > 
> > > > > We need to find a way to make the sandboxing improvements available to
> > > > > users besides yourself and easily reusable for developers who wish to
> > > > > convert additional device models.
> > > 
> > 
> > Also for testing this series.
> 
> Hi,
> 
> We are wondering how to deliver the example SELinux policies. I have
> posted on Fedora's SELinux mailing list to ask how to upstream an
> SELinux policy.
> 
> We are developing SELinux Type Enforcements and MCS labels to sandbox
> the emulation process. Details of an example Type Enforcement are
> available below.
> 
> We are also working on changes to libvirt, to launch the remote process
> and apply MCS labels. Libvirt changes will be posted separately in the
> future.
> 
> The Type Enforcements for SELinux are available at the pastebin location
> below (also copied at the end of this email):
> https://pastebin.com/t1bpS6MY

Can multiple LSI SCSI controllers be launched such that each process
only has access to a subset of disk images?  Or is the disk image label
per-VM so that there is no isolation between LSI SCSI controller
processes for that VM?

My concern with this overall approach is the practicality vs its
benefits.  Regarding practicality, each emulated device needs to be
proxied separately.  The QEMU subsystem used by the device also needs to
be proxied.  Global state, monitor commands, and live migration all
require code changes to support proxied operation.  This is very
invasive.

Then each emulated device needs an SELinux policy to achieve the
benefits of confinement.  I have no idea how to correctly write a policy
like this and it's likely that developers who contribute a single new
device will not be proficient in it either.  Writing these policies is a
rare thing and few people will be good at this.  It also makes me worry
about how we test and review them.

Despite the efforts required in making this work, all processes still
effectively have full access to the guest since they can access guest
RAM.  What I mean is that the device is actually not confined to its
host process (e.g. LSI SCSI controller process) because it can write
code to executable guest RAM pages.  The guest will then execute that
code and therefore all guest I/O (networking, disk, etc) is still
available indirectly to the "confined" processes.  They are not really
sandboxed from the outside world, regardless of how strict the SELinux
policy is :(.

There are performance issues due to proxying as well, but let's ignore
them for now and focus on security.

How do the benefits compare against today's monolithic approach?  If the
guest exploits monolithic QEMU it has full access to all host files and
APIs available to QEMU.  However, these are largely just the resources
that belong to the guest anyway - not resources we are trying to keep
away from the guest.  With multi-process QEMU each process still has
access to all guest interfaces via the code injection I mentioned above,
but the SELinux policy could restrict access to some resources.  But
this benefit is really small in my opinion, given that the resources
belong to the guest anyway and the guest can already access them.

I think you can implement this for a handful of devices as a one-time
thing, but the invasiveness and the impracticality of getting wide cover
of QEMU make this approach questionable.

Am I mistaken about the invasiveness or impracticality?

Am I misunderstanding the security benefits compared to what already
exists today?

A more practical approach is to strip down QEMU (compiling out unused
devices and features) and to run virtio devices in vhost-user processes
(e.g. virtio-input, virtio-gpu, virtio-fs).  This achieves similar goals
without proxy objects or invasive changes to QEMU since the vhost-user
devices use a different codebase and aren't accessible via the QEMU
monitor.  The limitation is that existing QEMU code and non-virtio
devices aren't available in this model.

Stefan
Jag Raman May 7, 2019, 7 p.m. UTC | #25
Hi Stefan,

Thank you very much for your feedback. Following is a summary of the
discussions our team had regarding your feedback.

On 4/25/2019 11:44 AM, Stefan Hajnoczi wrote:
> 
> Can multiple LSI SCSI controllers be launched such that each process
> only has access to a subset of disk images?  Or is the disk image label
> per-VM so that there is no isolation between LSI SCSI controller
> processes for that VM?

Yes, it is possible to provide each process with access to a subset of
disk images. The Orchestrator (libvirt, etc.) assigns a set of MCS
Categories to each VM, then device instances can be isolated by being
assigned a subset of the VM’s Categories.
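
As a rough illustration of the subset idea (a sketch, not code from the
patch series or from libvirt): the types and helpers below are
hypothetical, and categories are modelled as bits in a 64-bit mask,
whereas real MCS labels allow many more categories. The access rule
shown is the usual MCS dominance check, where a process may touch an
object only if the process carries every category on the object.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation: bit N set means MCS category cN is present. */
typedef uint64_t mcs_set;

#define MCS_CAT(n) ((mcs_set)1 << (n))

/* True if 'inner' contains no category that is missing from 'outer'. */
bool mcs_is_subset(mcs_set inner, mcs_set outer)
{
    return (inner & ~outer) == 0;
}

/*
 * Dominance check: the process may access the object only if the
 * process's categories include every category on the object.
 */
bool mcs_may_access(mcs_set process, mcs_set object)
{
    return mcs_is_subset(object, process);
}
```

With a VM holding {c1, c2, c3}, one controller process and its disk image
could be labelled {c1, c2} and a second pair {c1, c3}. Each process's set
is a subset of the VM's, but neither process dominates the other's image.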

> 
> My concern with this overall approach is the practicality vs its
> benefits.  Regarding practicality, each emulated device needs to be
> proxied separately.  The QEMU subsystem used by the device also needs to
> be proxied.  Global state, monitor commands, and live migration all
> require code changes to support proxied operation.  This is very
> invasive.
> 
> Then each emulated device needs an SELinux policy to achieve the
> benefits of confinement.  I have no idea how to correctly write a policy
> like this and it's likely that developers who contribute a single new
> device will not be proficient in it either.  Writing these policies is a
> rare thing and few people will be good at this.  It also makes me worry
> about how we test and review them.

We also think that having an SELinux policy per device would become
complicated. Our proposal, therefore, is to define SELinux policies for
each device class (disk, network, console, graphics, etc.). The
"fedora-selinux" upstream repo [1] would contain these policies, so the
device developer doesn't have to worry about defining new policies for
each device. This proposal would reduce the complexity of SELinux
policies.

> 
> Despite the efforts required in making this work, all processes still
> effectively have full access to the guest since they can access guest
> RAM.  What I mean is that the device is actually not confined to its
> host process (e.g. LSI SCSI controller process) because it can write
> code to executable guest RAM pages.  The guest will then execute that
> code and therefore all guest I/O (networking, disk, etc) is still
> available indirectly to the "confined" processes.  They are not really
> sandboxed from the outside world, regardless of how strict the SELinux
> policy is :(.
> 
> There are performance issues due to proxying as well, but let's ignore
> them for now and focus on security.

We are also focusing on performance. Please take a look at the following
blog for an initial report on performance. The results are for an iSCSI
backend in Oracle Cloud. We are working on collecting data on a much
heavier IOPS workload like an NVMe backend.

https://blogs.oracle.com/linux/towards-a-more-secure-qemu-hypervisor%2c-part-3-of-3-v2

> 
> How do the benefits compare against today's monolithic approach?  If the
> guest exploits monolithic QEMU it has full access to all host files and
> APIs available to QEMU.  However, these are largely just the resources
> that belong to the guest anyway - not resources we are trying to keep
> away from the guest.  With multi-process QEMU each process still has
> access to all guest interfaces via the code injection I mentioned above,
> but the SELinux policy could restrict access to some resources.  But
> this benefit is really small in my opinion, given that the resources
> belong to the guest anyway and the guest can already access them.

The primary focus of our project is to defend the host from a malicious
guest. The code injection problem you outlined above involves part of
the guest attacking itself, but not the host. Therefore, this wouldn't
compromise our objective.

As you know, there are some parts of QEMU which are not directly
accessible from the guest (via drivers, etc.), which we prefer to call
the control plane. It executes ioctls to the host kernel and has access
to a broader set of syscalls, which the device emulation code doesn’t
need. We want to protect the control plane from emulated devices. In the
case where a device injects code into the RAM to attack another device
on the same VM, the control plane would still be protected.

Another benefit of the project is in detecting and reporting failures
in the emulated devices. For instance, in cases like CVE-2018-18849,
where an emulated device hangs or crashes, it wouldn't directly crash
the QEMU process as well. QEMU could detect the failure, log the
problem and exit, instead of hanging or generating a coredump.
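
A minimal sketch of the detection side, assuming the device emulation
process is a direct child of the process that monitors it. The function
name `reap_remote` is hypothetical and not part of the patch series; it
only uses the standard waitpid() status macros.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until the remote device process exits.
 * Returns 0 for a clean exit, 1 if it was killed by a signal or exited
 * with a nonzero status, and -1 if waitpid() itself failed.
 */
int reap_remote(pid_t pid, const char *name)
{
    int status;
    pid_t r;

    /* Retry if a signal interrupts the wait. */
    do {
        r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);
    if (r != pid) {
        return -1;
    }

    if (WIFSIGNALED(status)) {
        fprintf(stderr, "%s: killed by signal %d\n", name, WTERMSIG(status));
        return 1;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
        fprintf(stderr, "%s: exited with status %d\n", name,
                WEXITSTATUS(status));
        return 1;
    }
    return 0;
}
```

A real monitor would more likely react to SIGCHLD or to the communication
channel closing rather than block in waitpid(), and a hung device would
need a separate timeout mechanism that this sketch does not cover.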

> 
> I think you can implement this for a handful of devices as a one-time
> thing, but the invasiveness and the impracticality of getting wide cover
> of QEMU make this approach questionable.
> 
> Am I mistaken about the invasiveness or impracticality?

We are not planning to implement this for all devices since it would be
impractical. But the project adds a framework for implementing more
devices in the future.

One other thing we would like to bring your attention to is that the
project doesn't affect the current usage. The same devices could still
be used as part of monolithic QEMU if the user chooses to do so.

> 
> Am I misunderstanding the security benefits compared to what already
> exists today?

As far as we know, there is no other open-source KVM-based toolstack
where the privileged operations are in a separate process, the
emulated devices are in a jail, and you can still run legacy OSes
like Windows XP.

> 
> A more practical approach is to strip down QEMU (compiling out unused
> devices and features) and to run virtio devices in vhost-user processes
> (e.g. virtio-input, virtio-gpu, virtio-fs).  This achieves similar goals
> without proxy objects or invasive changes to QEMU since the vhost-user
> devices use a different codebase and aren't accessible via the QEMU
> monitor.  The limitation is that existing QEMU code and non-virtio
> devices aren't available in this model.

In some cases, the user/customer brings in VMs with legacy devices
attached to them. It's not possible to take the virtio/vhost approach in
this case.

[1] https://github.com/fedora-selinux

Thanks!
Elena Ufimtseva May 7, 2019, 9 p.m. UTC | #26
On Mon, Mar 11, 2019 at 10:20:06AM +0000, Daniel P. Berrangé wrote:
> On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> > 
> > 

Hi Daniel, Stefan

We have not replied in a while as we were trying to figure out
the best approach after multiple comments we have received on the
patch series.

Leaving other concerns that you, Stefan and others shared with us
out of this particular topic, we would like to get your opinion on
the following approach.

Please see below.

> > > On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > 
> > > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > >> I guess one obvious answer is that the existing security mechanisms like
> > >> SELinux/ApArmor/DAC can be made to work in a more fine grained manner if
> > >> there are distinct processes. This would allow for a more useful seccomp
> > >> filter to better protect against secondary kernel exploits should QEMU
> > >> itself be exploited, if we can protect individual components.
> > > 
> > > Fine-grained sandboxing is possible in theory but tedious in practice.
> > > From what I can tell this patch series doesn't implement any sandboxing
> > > for child processes.
> > > 
> > 
> > 	The policies aren’t in QEMU, but in the selinux config files.
> > They would say, for example, that when the QEMU process exec()s the
> > disk emulation process, the process security context type transitions
> > to a new type.  This type would have permission to access the VM image
> > objects, whereas the QEMU process type (and any other device emulation
> > process types) cannot access them.
> 
> Note that currently all QEMU instances run by libvirt have seccomp
> policy applied that explicitly forbids any use of fork+exec as a way
> to reduce avenues of attack for an exploited QEMU.
> 
> Even in a modularized QEMU I'd be loathe to allow QEMU to have the
> fork+exec privileged, unless "QEMU" in this case was just a stub
> process that does nothing more than fork+exec the other binaries,
> while having zero attack exposed to the untrusted guest OS.

We see libvirt uses QEMU’s -sandbox option to indicate that QEMU
should use seccomp() to prohibit future use of certain system calls,
including fork() and exec().  Our idea is to enumerate the remote
processes needed via QEMU command line options, and have QEMU exec()
those processes before -sandbox is processed.
We will also initialize seccomp for the emulated device processes.
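
The ordering described above can be sketched as follows. This is an
illustration only: `spawn_remote`, `launch_then_sandbox` and
`sandbox_stub` are hypothetical names, and the sandbox step is a
placeholder callback rather than QEMU's actual -sandbox implementation.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork and exec one remote device process.
 * Returns the child's PID in the parent, or -1 if fork() failed.
 */
pid_t spawn_remote(char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached if exec failed */
    }
    return pid;
}

/* Placeholder for the step that installs the seccomp filter. */
int sandbox_stub(void)
{
    return 0;
}

/*
 * Spawn every remote process listed on the command line first, and
 * only then enter the sandbox, after which fork/exec would be denied.
 * Children already started are not cleaned up on failure in this sketch.
 */
int launch_then_sandbox(char *const *cmds[], size_t n, pid_t *pids,
                        int (*enter_sandbox)(void))
{
    for (size_t i = 0; i < n; i++) {
        pids[i] = spawn_remote(cmds[i]);
        if (pids[i] < 0) {
            return -1;
        }
    }
    return enter_sandbox();
}
```

One consequence of this ordering is that a process cannot be spawned
later, for example on hotplug, once the sandbox is active.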

> 
> > 	If you wanted to use DAC, you could do the something similar by
> > making the disk emulation executable setuid to a UID than can access
> > VM image files.
> > 
> > 	In either case, the policies and permissions are set up before
> > libvirt even runs, so it doesn’t need to be aware of them.
> 
> That's not the case bearing in mind the above point about fork+exec
> being forbidden. It would likely require libvirt to be in charge of
> spawning the various helper binaries from a trusted context.
> 
> 
> > > How to do this in practice must be clear from the beginning if
> > > fine-grained sandboxing is the main selling point.
> > > 
> > > Some details to start the discussion:
> > > 
> > > * How will fine-grained SELinux/AppArmor/DAC policies be configured for
> > >   each process?  I guess this requires root, so does libvirt need to
> > >   know about each process?
> > > 
> > 
> > 	The polices would apply to process security context types (or
> > UIDs in a DAC regime), so I would not expect libvirt to be aware of them.
> 
> I'm pretty skeptical that such a large modularization of QEMU can be
> done without libvirt being aware of it & needing some kind of changes
> applied.
>

We agree with that. With the approach proposed above, we still have to
change hotplug in some way.
If a separate process is spawned, libvirt will be the one doing the
fork/exec of the separate processes. Alternatively, it could launch a
helper binary that unifies how an instance with multiple processes is
started and how devices are hotplugged.

Thanks!
Elena, Jag, John.


> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Stefan Hajnoczi May 23, 2019, 10:40 a.m. UTC | #27
On Tue, May 07, 2019 at 03:00:52PM -0400, Jag Raman wrote:
> Hi Stefan,
> 
> Thank you very much for your feedback. Following is a summary of the
> discussions our team had regarding your feedback.
> 
> On 4/25/2019 11:44 AM, Stefan Hajnoczi wrote:
> > 
> > Can multiple LSI SCSI controllers be launched such that each process
> > only has access to a subset of disk images?  Or is the disk image label
> > per-VM so that there is no isolation between LSI SCSI controller
> > processes for that VM?
> 
> Yes, it is possible to provide each process with access to a subset of
> disk images. The Orchestrator (libvirt, etc.) assigns a set of MCS
> Categories to each VM, then device instances can be isolated by being
> assigned a subset of the VM’s Categories.
> 
> > 
> > My concern with this overall approach is the practicality vs its
> > benefits.  Regarding practicality, each emulated device needs to be
> > proxied separately.  The QEMU subsystem used by the device also needs to
> > be proxied.  Global state, monitor commands, and live migration all
> > require code changes to support proxied operation.  This is very
> > invasive.
> > 
> > Then each emulated device needs an SELinux policy to achieve the
> > benefits of confinement.  I have no idea how to correctly write a policy
> > like this and it's likely that developers who contribute a single new
> > device will not be proficient in it either.  Writing these policies is a
> > rare thing and few people will be good at this.  It also makes me worry
> > about how we test and review them.
> 
> We also think that having an SELinux policy per device would become
> complicated. Our proposal, therefore, is to define SELinux policies for
> each device class - viz. disk, network, console, graphics, etc.
> "fedora-selinux" upstream repo. [1] will contain these policies, so the
> device developer doesn't have to worry about defining new policies for
> each device. This proposal would diminish the complexity of SELinux
> policies.

Have you considered using Linux namespaces?  I'm beginning to think that
SELinux becomes less relevant with pid and mount namespaces to isolate
processes.  The advantage of namespaces is that they are easy to
understand and can be expressed in code instead of a policy file in a
separate package.  This is the approach we're taking with virtiofsd
(vhost-user device backend for virtio-fs).

> > 
> > Despite the efforts required in making this work, all processes still
> > effectively have full access to the guest since they can access guest
> > RAM.  What I mean is that the device is actually not confined to its
> > host process (e.g. LSI SCSI controller process) because it can write
> > code to executable guest RAM pages.  The guest will then execute that
> > code and therefore all guest I/O (networking, disk, etc) is still
> > available indirectly to the "confined" processes.  They are not really
> > sandboxed from the outside world, regardless of how strict the SELinux
> > policy is :(.
> > 
> > There are performance issues due to proxying as well, but let's ignore
> > them for now and focus on security.
> 
> We are also focusing on performance. Please take a look at the following
> blog for an initial report on performance. The results are for an iSCSI
> backend in Oracle Cloud. We are working on collecting data on a much
> heavier IOPS workload like an NVMe backend.
> 
> https://blogs.oracle.com/linux/towards-a-more-secure-qemu-hypervisor%2c-part-3-of-3-v2

Hard to reach a conclusion without also looking at CPU utilization.
IOPS alone don't tell the story.

If the system had spare CPU cycles then the performance results between
built-in LSI and separate LSI will be similar but the efficiency
(IOPS/CPU%) has actually decreased due to the extra CPU cycles required
to forward the hardware register access to the device emulation process.

If you rerun on a system without spare CPU cycles then IOPS degradation
would become apparent.  I'm not saying this is necessarily the case,
maybe the overhead really doesn't have a significant effect, but the
graph shown in the blog post isn't enough to draw a conclusion either
way.

Regarding the proposed QEMU bypass, mechanisms like this already exist in
some form via kvm.ko's ioeventfd and coalesced MMIO features.

Today ioeventfd is only used for performance-critical hardware
registers, so kvm.ko doesn't use a sophisticated dispatch mechanism.  If
you want to use it for all hardware register accesses handled by a
separate process then ioeventfd probably needs to be tweaked somewhat to
make it more scalable for that case.

Coalesced MMIO is also cool.  kvm.ko can accumulate guest MMIO writes in
a buffer that is only collected at a later point in time.  This improves
performance for devices that require multiple hardware register writes
to kick off an I/O operation (only the last one really needs to be
trapped by the device emulation code!).  This sounds similar to an MMIO
access shared ring buffer.
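
To make the batching idea concrete, here is a small single-threaded
sketch of a ring that accumulates register writes and replays them later
in order. The structure and function names are hypothetical and do not
match the kernel's coalesced MMIO ABI; a ring shared between processes
would also need memory barriers, which are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8 /* illustrative size; one slot is always left empty */

struct mmio_write {
    uint64_t addr;
    uint32_t len;
    uint64_t data;
};

struct mmio_ring {
    uint32_t first; /* next slot to drain */
    uint32_t last;  /* next slot to fill */
    struct mmio_write slot[RING_SLOTS];
};

/*
 * Producer side: queue a write without notifying the consumer.
 * Returns false when the ring is full, in which case the caller would
 * have to fall back to a trapped (synchronous) access.
 */
bool ring_push(struct mmio_ring *r, struct mmio_write w)
{
    uint32_t next = (r->last + 1) % RING_SLOTS;

    if (next == r->first) {
        return false;
    }
    r->slot[r->last] = w;
    r->last = next;
    return true;
}

/*
 * Consumer side: copy out up to 'max' queued writes in the order they
 * were issued. Returns the number of entries copied.
 */
size_t ring_drain(struct mmio_ring *r, struct mmio_write *out, size_t max)
{
    size_t n = 0;

    while (r->first != r->last && n < max) {
        out[n++] = r->slot[r->first];
        r->first = (r->first + 1) % RING_SLOTS;
    }
    return n;
}
```

The consumer would typically drain the ring when the final, trapped
register write arrives, so the earlier writes cost no extra exits.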

> > 
> > How do the benefits compare against today's monolithic approach?  If the
> > guest exploits monolithic QEMU it has full access to all host files and
> > APIs available to QEMU.  However, these are largely just the resources
> > that belong to the guest anyway - not resources we are trying to keep
> > away from the guest.  With multi-process QEMU each process still has
> > access to all guest interfaces via the code injection I mentioned above,
> > but the SELinux policy could restrict access to some resources.  But
> > this benefit is really small in my opinion, given that the resources
> > belong to the guest anyway and the guest can already access them.
> 
> The primary focus of our project is to defend the host from malicious
> guest. The code injection problem you outlined above involves part of
> the guest attacking itself, but not the host. Therefore, this wouldn't
> compromise our objective.
> 
> Like you know, there are some parts of QEMU which are not directly
> accessible from the guest (via drivers, etc.), which we prefer to call
> the control plane. It executes ioctls to the host kernel and has access
> to a broader set of syscalls, which the device emulation code doesn’t
> need. We want to protect the control plane from emulated devices. In the
> case where a device injects code into the RAM to attack another device
> on the same VM, the control plane would still be protected.

Are you aware of any cases where the syscall attack surface led to an
exploitable bug in QEMU?  Any proof-of-concept exploit code or a CVE?

> Another benefit with the project would be regarding detecting and
> reporting failures in the emulated devices. For instance, in cases like
> CVE-2018-18849, where an emulated device hangs/crashes, it wouldn't
> directly crash the QEMU process as well. QEMU could detect the failure,
> log the problem and exit, instead of generating coredump/hang.

Debugging is a lot easier with a coredump though :).  I would rather
have a coredump than a nice message that says "LSI died".

> > 
> > I think you can implement this for a handful of devices as a one-time
> > thing, but the invasiveness and the impracticality of getting wide cover
> > of QEMU make this approach questionable.
> > 
> > Am I mistaken about the invasiveness or impracticality?
> 
> We are not planning to implement this for all devices since it would be
> impractical. But the project adds a framework for implementing more
> devices in the future.
> 
> One other thing we would like to bring your attention to is that the
> project doesn't affect the current usage. The same devices could still
> be used as part of monolithic QEMU if the user chooses to do so.

I don't follow, to me this proposal seems extremely invasive and
requires awareness from all developers.

QEMU contains global state (like net/net.c:net_clients or
block.c:all_bdrv_states) and QMP commands that access global state.  All
of this needs to be carefully proxied to avoid losing functionality as
fundamental as the QMP monitor.

This is what worries me about this project.  There are amazing niche
features like record/replay that have been integrated into QEMU without
requiring all developers to be aware of how they work.  If you can
achieve this then I would have no reservations.

Right now I don't see that this will be possible and that's why I'm
challenging you to justify that the reduction in system call attack
surface is actually worth the invasive changes required.

Do you see a way to solve the issues I've mentioned?

Stefan
Stefan Hajnoczi May 23, 2019, 11:11 a.m. UTC | #28
Hi Jag and Elena,
Do you think a call would help to move discussion along more quickly?

We could use the next KVM Community Call on June 4th to discuss
remaining concerns and the next steps:
https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

I also hope to include other core QEMU developers.  As you know, I'm
skeptical, but it could be just me and I don't want to block you
unnecessarily if others are more enthusiastic about this approach.

Stefan
Stefan Hajnoczi May 23, 2019, 11:22 a.m. UTC | #29
On Tue, May 07, 2019 at 02:00:59PM -0700, Elena Ufimtseva wrote:
> On Mon, Mar 11, 2019 at 10:20:06AM +0000, Daniel P. Berrangé wrote:
> > On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> > > 
> > > 
> 
> Hi Daniel, Stefan
> 
> We have not replied in a while as we were trying to figure out
> the best approach after multiple comments we have received on the
> patch series.
> 
> Leaving other concerns that you, Stefan and others shared with us
> out of this particular topic, we would like to get your opinion on
> the following approach.
> 
> Please see below.
> 
> > > > On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > 
> > > > On Thu, Mar 07, 2019 at 02:51:20PM +0000, Daniel P. Berrangé wrote:
> > > >> I guess one obvious answer is that the existing security mechanisms like
> > > >> SELinux/ApArmor/DAC can be made to work in a more fine grained manner if
> > > >> there are distinct processes. This would allow for a more useful seccomp
> > > >> filter to better protect against secondary kernel exploits should QEMU
> > > >> itself be exploited, if we can protect individual components.
> > > > 
> > > > Fine-grained sandboxing is possible in theory but tedious in practice.
> > > > From what I can tell this patch series doesn't implement any sandboxing
> > > > for child processes.
> > > > 
> > > 
> > > 	The policies aren’t in QEMU, but in the selinux config files.
> > > They would say, for example, that when the QEMU process exec()s the
> > > disk emulation process, the process security context type transitions
> > > to a new type.  This type would have permission to access the VM image
> > > objects, whereas the QEMU process type (and any other device emulation
> > > process types) cannot access them.
> > 
> > Note that currently all QEMU instances run by libvirt have seccomp
> > policy applied that explicitly forbids any use of fork+exec as a way
> > to reduce avenues of attack for an exploited QEMU.
> > 
> > Even in a modularized QEMU I'd be loathe to allow QEMU to have the
> > fork+exec privileged, unless "QEMU" in this case was just a stub
> > process that does nothing more than fork+exec the other binaries,
> > while having zero attack exposed to the untrusted guest OS.
> 
> We see libvirt uses QEMU’s -sandbox option to indicate that QEMU
> should use seccomp() to prohibit future use of certain system calls,
> including fork() and exec().  Our idea is to enumerate the remote
> processes needed via QEMU command line options, and have QEMU exec()
> those processes before -sandbox is processed.
> And we also will init seccomp for emulated devices processes.

Sounds good.

My experience with seccomp is that whitelisting syscalls is fragile
because of library dependencies.  Even glibc might invoke syscalls you
didn't expect, especially after a kernel/glibc upgrade, forcing you to
modify the whitelist.

However, once a whitelist is successfully in place it's a simple way to
reduce the syscall attack surface and I think it's worthwhile.
Elena Ufimtseva May 28, 2019, 3:18 p.m. UTC | #30
On Thu, May 23, 2019 at 12:11:30PM +0100, Stefan Hajnoczi wrote:
> Hi Jag and Elena,
> Do you think a call would help to move discussion along more quickly?
>

Hi Stefan,

We would like to join this call.
And thank you for inviting us!

Elena
> We could use the next KVM Community Call on June 4th to discuss
> remaining concerns and the next steps:
> https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
>
> I also hope to include other core QEMU developers.  As you know, I'm
> skeptical, but it could be just me and I don't want to block you
> unnecessarily if others are more enthusiastic about this approach.
>


> Stefan
Elena Ufimtseva May 30, 2019, 8:54 p.m. UTC | #31
On Tue, May 28, 2019 at 08:18:20AM -0700, Elena Ufimtseva wrote:
> On Thu, May 23, 2019 at 12:11:30PM +0100, Stefan Hajnoczi wrote:
> > Hi Jag and Elena,
> > Do you think a call would help to move discussion along more quickly?
> >
> 
> Hi Stefan,
> 
> We would like to join this call.
> And thank you inviting us!
> 
> Elena
> > We could use the next KVM Community Call on June 4th to discuss
> > remaining concerns and the next steps:
> > https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> >
> > I also hope to include other core QEMU developers.  As you know, I'm
> > skeptical, but it could be just me and I don't want to block you
> > unnecessarily if others are more enthusiastic about this approach.
> >

Hi Stefan

We have a few questions about the call.
What is the usual format of the call? Should we present a project
outline for about 5 minutes? We are planning to address some of the
concerns you have voiced regarding the amount of changes, usability,
security and performance. I assume there will be other questions as
well. Is there a time limit per topic?

And would you mind sharing the call details with us?

Thanks!
Elena
> 
> 
> > Stefan
> 
>
Jag Raman June 11, 2019, 3:53 p.m. UTC | #32
On 5/23/2019 6:40 AM, Stefan Hajnoczi wrote:
> On Tue, May 07, 2019 at 03:00:52PM -0400, Jag Raman wrote:
>> Hi Stefan,
>>
>> Thank you very much for your feedback. Following is a summary of the
>> discussions our team had regarding your feedback.
>>
>> On 4/25/2019 11:44 AM, Stefan Hajnoczi wrote:
>>>
>>> Can multiple LSI SCSI controllers be launched such that each process
>>> only has access to a subset of disk images?  Or is the disk image label
>>> per-VM so that there is no isolation between LSI SCSI controller
>>> processes for that VM?
>>
>> Yes, it is possible to provide each process with access to a subset of
>> disk images. The Orchestrator (libvirt, etc.) assigns a set of MCS
>> Categories to each VM, then device instances can be isolated by being
>> assigned a subset of the VM’s Categories.
>>
>>>
>>> My concern with this overall approach is the practicality vs its
>>> benefits.  Regarding practicality, each emulated device needs to be
>>> proxied separately.  The QEMU subsystem used by the device also needs to
>>> be proxied.  Global state, monitor commands, and live migration all
>>> require code changes to support proxied operation.  This is very
>>> invasive.
>>>
>>> Then each emulated device needs an SELinux policy to achieve the
>>> benefits of confinement.  I have no idea how to correctly write a policy
>>> like this and it's likely that developers who contribute a single new
>>> device will not be proficient in it either.  Writing these policies is a
>>> rare thing and few people will be good at this.  It also makes me worry
>>> about how we test and review them.
>>
>> We also think that having an SELinux policy per device would become
>> complicated. Our proposal, therefore, is to define SELinux policies for
>> each device class - viz. disk, network, console, graphics, etc.
>> "fedora-selinux" upstream repo. [1] will contain these policies, so the
>> device developer doesn't have to worry about defining new policies for
>> each device. This proposal would diminish the complexity of SELinux
>> policies.
> 
> Have you considered using Linux namespaces?  I'm beginning to think that
> SELinux becomes less relevant with pid and mount namespaces to isolate
> processes.  The advantage of namespaces is that they are easy to
> understand and can be expressed in code instead of a policy file in a
> separate package.  This is the approach we're taking with virtiofsd
> (vhost-user device backend for virtio-fs).
> 
>>>
>>> Despite the efforts required in making this work, all processes still
>>> effectively have full access to the guest since they can access guest
>>> RAM.  What I mean is that the device is actually not confined to its
>>> host process (e.g. LSI SCSI controller process) because it can write
>>> code to executable guest RAM pages.  The guest will then execute that
>>> code and therefore all guest I/O (networking, disk, etc) is still
>>> available indirectly to the "confined" processes.  They are not really
>>> sandboxed from the outside world, regardless of how strict the SELinux
>>> policy is :(.
>>>
>>> There are performance issues due to proxying as well, but let's ignore
>>> them for now and focus on security.
>>
>> We are also focusing on performance. Please take a look at the following
>> blog for an initial report on performance. The results are for an iSCSI
>> backend in Oracle Cloud. We are working on collecting data on a much
>> heavier IOPS workload like an NVMe backend.
>>
>> https://blogs.oracle.com/linux/towards-a-more-secure-qemu-hypervisor%2c-part-3-of-3-v2
> 
> Hard to reach a conclusion without also looking at CPU utilization.
> IOPS alone don't tell the story.
> 
> If the system had spare CPU cycles then the performance results between
> built-in LSI and separate LSI will be similar but the efficiency
> (IOPS/CPU%) has actually decreased due to the extra CPU cycles required
> to forward the hardware register access to the device emulation process.
> 
> If you rerun on a system without spare CPU cycles then IOPS degradation
> would become apparent.  I'm not saying this is necessarily the case,
> maybe the overhead is really doesn't have a significant effect, but the
> graph shown in the blog post isn't enough to draw a conclusion either
> way.

Hi Stefan,

We are working on getting a better idea about the CPU utilization while 
the performance test is running. We're looking forward to discussing 
this during the forthcoming KVM meeting.

Thank you!
--
Jag

> 
> Regarding the proposed QEMU bypass, these already exist in some form via
> kvm.ko's ioeventfd and coalesced MMIO features.
> 
> Today ioeventfd is only used for performance-critical hardware
> registers, so kvm.ko doesn't use a sophisticated dispatch mechanism.  If
> you want to use it for all hardware register accesses handled by a
> separate process then ioeventfd probably needs to be tweaked somewhat to
> make it more scalable for that case.
> 
> Coalesced MMIO is also cool.  kvm.ko can accumulate guest MMIO writes in
> a buffer that is only collected at a later point in time.  This improves
> performance for devices that require multiple hardware register writes
> to kick off an I/O operation (only the last one really needs to be
> trapped by the device emulation code!).  This sounds similar to an MMIO
> access shared ring buffer.
> 
>>>
>>> How do the benefits compare against today's monolithic approach?  If the
>>> guest exploits monolithic QEMU it has full access to all host files and
>>> APIs available to QEMU.  However, these are largely just the resources
>>> that belong to the guest anyway - not resources we are trying to keep
>>> away from the guest.  With multi-process QEMU each process still has
>>> access to all guest interfaces via the code injection I mentioned above,
>>> but the SELinux policy could restrict access to some resources.  But
>>> this benefit is really small in my opinion, given that the resources
>>> belong to the guest anyway and the guest can already access them.
>>
>> The primary focus of our project is to defend the host from a malicious
>> guest. The code injection problem you outlined above involves part of
>> the guest attacking itself, but not the host. Therefore, this wouldn't
>> compromise our objective.
>>
>> Like you know, there are some parts of QEMU which are not directly
>> accessible from the guest (via drivers, etc.), which we prefer to call
>> the control plane. It executes ioctls to the host kernel and has access
>> to a broader set of syscalls, which the device emulation code doesn’t
>> need. We want to protect the control plane from emulated devices. In the
>> case where a device injects code into the RAM to attack another device
>> on the same VM, the control plane would still be protected.
> 
> Are you aware of any cases where the syscall attack surface led to an
> exploitable bug in QEMU?  Any proof-of-concept exploit code or a CVE?
> 
>> Another benefit with the project would be regarding detecting and
>> reporting failures in the emulated devices. For instance, in cases like
>> CVE-2018-18849, where an emulated device hangs/crashes, it wouldn't
>> directly crash the QEMU process as well. QEMU could detect the failure,
>> log the problem and exit, instead of generating coredump/hang.
> 
> Debugging is a lot easier with a coredump though :).  I would rather
> have a coredump than a nice message that says "LSI died".
> 
>>>
>>> I think you can implement this for a handful of devices as a one-time
>>> thing, but the invasiveness and the impracticality of getting wide cover
>>> of QEMU make this approach questionable.
>>>
>>> Am I mistaken about the invasiveness or impracticality?
>>
>> We are not planning to implement this for all devices since it would be
>> impractical. But the project adds a framework for implementing more
>> devices in the future.
>>
>> One other thing we would like to bring your attention to is that the
>> project doesn't affect the current usage. The same devices could still
>> be used as part of monolithic QEMU if the user chooses to do so.
> 
> I don't follow, to me this proposal seems extremely invasive and
> requires awareness from all developers.
> 
> QEMU contains global state (like net/net.c:net_clients or
> block.c:all_bdrv_states) and QMP commands that access global state.  All
> of this needs to be carefully proxied to avoid losing functionality as
> fundamental as the QMP monitor.
> 
> This is what worries me about this project.  There are amazing niche
> features like record/replay that have been integrated into QEMU without
> requiring all developers to be aware of how they work.  If you can
> achieve this then I would have no reservations.
> 
> Right now I don't see that this will be possible and that's why I'm
> challenging you to justify that the reduction in system call attack
> surface is actually worth the invasive changes required.
> 
> Do you see a way to solve the issues I've mentioned?
> 
> Stefan
>
Jag Raman June 11, 2019, 3:59 p.m. UTC | #33
On 5/30/2019 4:54 PM, Elena Ufimtseva wrote:
> On Tue, May 28, 2019 at 08:18:20AM -0700, Elena Ufimtseva wrote:
>> On Thu, May 23, 2019 at 12:11:30PM +0100, Stefan Hajnoczi wrote:
>>> Hi Jag and Elena,
>>> Do you think a call would help to move discussion along more quickly?
>>>
>>
>> Hi Stefan,
>>
>> We would like to join this call.
>> And thank you for inviting us!
>>
>> Elena
>>> We could use the next KVM Community Call on June 4th to discuss
>>> remaining concerns and the next steps:
>>> https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
>>>
>>> I also hope to include other core QEMU developers.  As you know, I'm
>>> skeptical, but it could be just me and I don't want to block you
>>> unnecessarily if others are more enthusiastic about this approach.
>>>
> 
> Hi Stefan
> 
> A few questions we have are about the call.
> What is the format of the call usually? Should we provide some kind of the project outline for 5 minutes?
> We are planning to address some of the concerns you have voiced in regards to amount of changes, usability,
> security and performance. I assume there will be other questions as well. Is there any time limit per topic?
> 
> And would you mind sharing the call details with us?
> 
> Thanks!
> Elena
>>
>>
>>> Stefan

Hi Stefan,

We would like to add multi-process QEMU to the agenda for any of the
upcoming KVM community calls. Do you know how we could go about doing
this?

Could you kindly share the contact details of the organizer for this
meeting?

Thank you very much!
--
Jag

>>
>>
Stefan Hajnoczi June 12, 2019, 4:24 p.m. UTC | #34
On Thu, May 30, 2019 at 01:54:35PM -0700, Elena Ufimtseva wrote:
> On Tue, May 28, 2019 at 08:18:20AM -0700, Elena Ufimtseva wrote:
> > On Thu, May 23, 2019 at 12:11:30PM +0100, Stefan Hajnoczi wrote:
> > > Hi Jag and Elena,
> > > Do you think a call would help to move discussion along more quickly?
> > >
> > 
> > Hi Stefan,
> > 
> > We would like to join this call.
> > And thank you for inviting us!
> > 
> > Elena
> > > We could use the next KVM Community Call on June 4th to discuss
> > > remaining concerns and the next steps:
> > > https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> > >
> > > I also hope to include other core QEMU developers.  As you know, I'm
> > > skeptical, but it could be just me and I don't want to block you
> > > unnecessarily if others are more enthusiastic about this approach.
> > >
> 
> Hi Stefan
> 
> A few questions we have are about the call.
> What is the format of the call usually? Should we provide some kind of the project outline for 5 minutes?
> We are planning to address some of the concerns you have voiced in regards to amount of changes, usability,
> security and performance. I assume there will be other questions as well. Is there any time limit per topic?
> 
> And would you mind sharing the call details with us?

Hi Elena and Jag,
Sorry, I was away on sick leave.  The KVM Community Call is informal.
The goal is to get people together in a teleconference where we can
discuss topics much more quickly than on the mailing list.  This can
help make progress in areas where the mailing list discussion seems to
be making slow progress.

I would suggest starting with a status update that describes your
current approach (without assuming the audience has familiarity).  Then
you could touch on any issues where you'd like input from the community
and you could take questions.

Our goal should be to get a consensus on whether disaggregated QEMU can
be merged or not.

Here are the calendar details (Tuesday, June 18th at 8:00 UTC):
https://calendar.google.com/calendar/ical/tob1tjqp37v8evp74h0q8kpjqs%40group.calendar.google.com/public/basic.ics

Is this time okay for you?

Stefan
Elena Ufimtseva June 12, 2019, 5:01 p.m. UTC | #35
On Wed, Jun 12, 2019 at 05:24:13PM +0100, Stefan Hajnoczi wrote:
> On Thu, May 30, 2019 at 01:54:35PM -0700, Elena Ufimtseva wrote:
> > On Tue, May 28, 2019 at 08:18:20AM -0700, Elena Ufimtseva wrote:
> > > On Thu, May 23, 2019 at 12:11:30PM +0100, Stefan Hajnoczi wrote:
> > > > Hi Jag and Elena,
> > > > Do you think a call would help to move discussion along more quickly?
> > > >
> > > 
> > > Hi Stefan,
> > > 
> > > We would like to join this call.
> > > And thank you for inviting us!
> > > 
> > > Elena
> > > > We could use the next KVM Community Call on June 4th to discuss
> > > > remaining concerns and the next steps:
> > > > https://calendar.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> > > >
> > > > I also hope to include other core QEMU developers.  As you know, I'm
> > > > skeptical, but it could be just me and I don't want to block you
> > > > unnecessarily if others are more enthusiastic about this approach.
> > > >
> > 
> > Hi Stefan
> > 
> > A few questions we have are about the call.
> > What is the format of the call usually? Should we provide some kind of the project outline for 5 minutes?
> > We are planning to address some of the concerns you have voiced in regards to amount of changes, usability,
> > security and performance. I assume there will be other questions as well. Is there any time limit per topic?
> > 
> > And would you mind sharing the call details with us?
> 
> Hi Elena and Jag,

Hi Stefan,

> Sorry, I was away on sick leave. 

Ah, sorry about that - we have guessed that you were away, but thought
people were mostly on vacation.

> The KVM Community Call is informal.
> The goal is to get people together in a teleconference where we can
> discuss topics much more quickly than on the mailing list.  This can
> help make progress in areas where the mailing list discussion seems to
> be making slow progress.
> 
> I would suggest starting with a status update that describes your
> current approach (without assuming the audience has familiarity).  Then
> you could touch on any issues where you'd like input from the community
> and you could take questions.
> 
> Our goal should be to get a consensus on whether disaggregated QEMU can
> be merged or not.
>

Thanks!
> Here are the calendar details (Tuesday, June 18th at 8:00 UTC):
> https://calendar.google.com/calendar/ical/tob1tjqp37v8evp74h0q8kpjqs%40group.calendar.google.com/public/basic.ics
> 
> Is this time okay for you?

Yes, this time is fine.
Do you have dial-in info for us?

Thank you!

Elena, Jag and JJ
> 
> Stefan

Patch

diff --git a/docs/devel/qemu-multiprocess.txt b/docs/devel/qemu-multiprocess.txt
new file mode 100644
index 0000000..e29c6c8
--- /dev/null
+++ b/docs/devel/qemu-multiprocess.txt
@@ -0,0 +1,1109 @@ 
+/*
+ * Copyright 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+Disaggregating QEMU
+
+This document describes implementation details of multi-process
+qemu.
+
+1. QEMU services
+
+        QEMU can be broadly described as providing three main
+services.  One is a VM control point, where VMs can be created,
+migrated, re-configured, and destroyed.  A second is to emulate the
+CPU instructions within the VM, often accelerated by HW virtualization
+features such as Intel's VT extensions.  Finally, it provides IO
+services to the VM by emulating HW IO devices, such as disk and
+network devices.
+
+1.1 A disaggregated QEMU
+
+        A disaggregated QEMU involves separating QEMU services into
+separate host processes.  Each of these processes can be given only
+the privileges it needs to provide its service, e.g., a disk service
+could be given access only to the disk images it provides, and not be
+allowed to access other files, or any network devices.  An attacker
+who compromised this service would not be able to use this exploit to
+access files or devices beyond what the disk service was given access
+to.
+
+        A control QEMU process would remain, but in disaggregated
+mode, it would be a control point that exec()s the processes needed to
+support the VM being created, but have no direct interface to the VM.
+During VM execution, it would still provide the user interface to
+hot-plug devices or live migrate the VM.
+
+        A first step in creating a disaggregated QEMU is to separate
+IO services from the main QEMU program, which would continue to
+provide CPU emulation, i.e., the control process would also be the CPU
+emulation process.  In a later phase, CPU emulation could be separated
+from the QEMU control process.
+
+
+2. Disaggregating IO services
+
+        Disaggregating IO services is a good place to begin QEMU
+disaggregation for a couple of reasons.  One is that the sheer number
+of IO devices QEMU can emulate provides a large surface of interfaces
+which could potentially be exploited, and, indeed, have been a source
+of exploits in the past.  Another is that the modular nature of QEMU
+device emulation code provides interface points where the QEMU
+functions that perform device emulation can be separated from the QEMU
+functions that manage the emulation of guest CPU instructions.
+
+2.1 QEMU device emulation
+
+        QEMU uses an object-oriented SW architecture for device
+emulation code.  Configured objects are all compiled into the QEMU
+binary, then objects are instantiated by name when used by the guest
+VM.  For example, the code to emulate a device named "foo" is always
+present in QEMU, but its instantiation code is only run when a device
+named "foo" is included in the target VM (such as via the QEMU command
+line as "-device foo").
+
+        The object model is hierarchical, so device emulation code can
+name its parent object (such as "pci-device" for a PCI device) and
+QEMU will instantiate a parent object before calling the device's
+instantiation code.
+
+2.2 Current separation models
+
+        In order to separate the device emulation code from the CPU
+emulation code, the device object code must run in a different
+process.  There are a couple of existing QEMU features that can run
+emulation code separately from the main QEMU process.  These are
+examined below.
+
+2.2.1 vhost user model
+
+        Virtio guest device drivers can be connected to vhost user
+applications in order to perform their IO operations.  This model uses
+special virtio device drivers in the guest and vhost user device
+objects in QEMU, but once the QEMU vhost user code has configured the
+vhost user application, mission-mode IO is performed by the
+application.  The vhost user application is a daemon process that can
+be contacted via a known UNIX domain socket.
+
+2.2.1.1 vhost socket
+
+        As mentioned above, one of the tasks of the vhost device
+object within QEMU is to contact the vhost application and send it
+configuration information about this device instance.  As part of the
+configuration process, the application can also be sent other file
+descriptors over the socket, which then can be used by the vhost user
+application in various ways, some of which are described below.
+
+2.2.1.2 vhost MMIO store acceleration
+
+        VMs are often run using HW virtualization features via the KVM
+kernel driver.  This driver allows QEMU to accelerate the emulation of
+guest CPU instructions by running the guest in a virtual HW mode.
+When the guest executes instructions that cannot be executed by
+virtual HW mode, execution returns to the KVM driver so it can inform
+QEMU to emulate the instructions in SW.
+
+        One of the events that can cause a return to QEMU is when a
+guest device driver accesses an IO location. QEMU then dispatches the
+memory operation to the corresponding QEMU device object.  In the case
+of a vhost user device, the memory operation would need to be sent
+over a socket to the vhost application.  This path is accelerated by
+the QEMU virtio code by setting up an eventfd file descriptor through
+which the vhost application can directly receive MMIO store
+notifications from the KVM driver, instead of needing them to be sent
+to the QEMU process first.
+
+2.2.1.3 vhost interrupt acceleration
+
+        Another optimization used by the vhost application is the
+ability to directly inject interrupts into the VM via the KVM driver,
+again, bypassing the need to send the interrupt back to the QEMU
+process first.  The QEMU virtio setup code configures the KVM driver
+with an eventfd that triggers the device interrupt in the guest when
+the eventfd is written. This irqfd file descriptor is then passed to
+the vhost user application program.
+
+2.2.1.4 vhost access to guest memory
+
+        The vhost application is also allowed to directly access guest
+memory, instead of needing to send the data as messages to QEMU.  This
+is also done with file descriptors sent to the vhost user application
+by QEMU.  These descriptors can be mmap()d by the vhost application to
+map the guest address space into the vhost application.
+
+        IOMMUs introduce another level of complexity, since the
+address given to the guest virtio device to DMA to or from is not a
+guest physical address.  This case is handled by having vhost code
+within QEMU register as a listener for IOMMU mapping changes.  The
+vhost application maintains a cache of IOMMU translations: sending
+translation requests back to QEMU on cache misses, and in turn
+receiving flush requests from QEMU when mappings are purged.
+
+2.2.1.5 applicability to device separation
+
+        Much of the vhost model can be re-used by separated device
+emulation.  In particular, the ideas of using a socket between QEMU
+and the device emulation application, using a file descriptor to
+inject interrupts into the VM via KVM, and allowing the application to
+mmap() the guest should be re-used.
+
+        There are, however, some notable differences between how a
+vhost application works and the needs of separated device emulation.
+The most basic is that vhost uses custom virtio device drivers which
+always trigger IO with MMIO stores.  A separated device emulation model
+must work with existing IO device models and guest device drivers.
+MMIO loads break vhost store acceleration since they are synchronous -
+guest progress cannot continue until the load has been emulated.  By
+contrast, stores are asynchronous: the guest can continue after the
+store event has been sent to the vhost application.
+
+        Another difference is that in the vhost user model, a single
+daemon can support multiple QEMU instances.  This is contrary to the
+security regime desired, in which the emulation application should
+only be allowed to access the files or devices the VM it's running on
+behalf of can access.
+
+2.2.2 qemu-io model
+
+        Qemu-io is a test harness used to test changes to the QEMU
+block backend object code (e.g., the code that implements disk images
+for disk driver emulation).  Qemu-io is not a device emulation
+application per se, but it does compile the QEMU block objects into a
+separate binary from the main QEMU one.  This could be useful for disk
+device emulation, since its emulation applications will need to
+include the QEMU block objects.
+
+2.3 New separation model based on proxy objects
+
+        A different model based on proxy objects in the QEMU program
+communicating with proxy objects in the separated emulation programs
+could provide separation while minimizing the changes needed to the
+device emulation code.  The rest of this section is a discussion of
+how a proxy object model would work.
+
+2.3.1 command line specification
+
+        The QEMU command line options will need to be modified to
+indicate which items are emulated by a separate program, and which
+remain emulated by QEMU itself.
+
+2.3.1.1 devices
+
+        Devices that are to be emulated in a separate process will be
+identified by using "-rdevice" on the QEMU command line in lieu of
+"-device".  The device's other options will also be included in the
+command line, with the addition of a "command" option that specifies
+the remote program to execute to emulate the device.  e.g., an LSI
+SCSI controller and disk can be specified as:
+
+-device lsi53c895a,id=scsi0
+-device scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0
+
+        If these devices are emulated with program "lsi-scsi," the
+QEMU command line would be:
+
+-rdevice lsi53c895a,id=scsi0,command="lsi-scsi"
+-rdevice scsi-hd,drive=drive0,bus=scsi0.0,scsi-id=0
+
+        Some devices are implicitly created by the machine object.
+e.g., the "q35" machine object will create its PCI bus, and attach a
+"ich9-ahci" IDE controller to it.  In this case, options will need to
+be added to the "-machine" command line.  e.g.,
+
+-machine pc-q35,ide-command="ahci-ide"
+
+        will use the "ahci-ide" program to emulate the IDE controller
+and its disks.  The disks themselves still need to be specified with
+"-rdevice", e.g.,
+
+-rdevice ide-hd,drive=drive0,bus=ide.0,unit=0
+
+        The "-rdevice" devices will be parsed into a separate
+QemuOptsList from "-device" ones, but will still have "driver"
+as the implied name of the initial option.
+
+2.3.1.2 backends
+
+        The device's backend would similarly have a changed command
+line specification.  e.g., a qcow2 block backend specified as:
+
+-blockdev driver=file,node-name=file0,filename=disk-file0
+-blockdev driver=qcow2,node-name=drive0,file=file0
+
+becomes
+
+-rblockdev driver=file,node-name=file0,filename=disk-file0
+-rblockdev driver=qcow2,node-name=drive0,file=file0
+
+        As is the case with devices, "-rblockdev" backends will
+be parsed into their own BlockdevOptions_queue.
+
+2.3.2 device proxy objects
+
+        QEMU has an object model based on sub-classes inherited from
+the "object" super-class.  The sub-classes that are of interest here
+are the "device" and "bus" sub-classes whose child sub-classes make up
+the device tree of a QEMU emulated system.
+
+        The proxy object model will use device proxy objects to
+replace the device emulation code within the QEMU process.  These
+objects will live in the same place in the object and bus hierarchies
+as the objects they replace.  i.e., the proxy object for an LSI SCSI
+controller will be a sub-class of the "pci-device" class, and will
+have the same PCI bus parent and the same SCSI bus child objects as
+the LSI controller object it replaces.
+
+        After the QEMU command line has been parsed, the "-rdevice"
+devices will be instantiated in the same manner as "-device" devices
+are (i.e., via qdev_device_add()).  In order to distinguish them from
+regular "-device" device objects, their class name will be the name of
+the class they replace, with "-proxy" appended.  e.g., the "scsi-hd"
+proxy class will be "scsi-hd-proxy".
+
+2.3.2.1 object initialization
+
+        QEMU object initialization occurs in two phases.  The first
+initialization happens once per object class (i.e., there can be many
+SCSI disks in an emulated system, but the "scsi-hd" class has its
+class_init() function called only once).  The second phase happens when
+each object's instance_init() function is called to initialize each
+instance of the object.
+
+        All device objects are sub-classes of the "device" class, so
+they also have a realize() function that is called after
+instance_init() is called and after the object's static properties
+have been initialized.  Many device objects don't even provide an
+instance_init() function, and do all their per-instance work in
+realize().
+
+2.3.2.1.1 class_init
+
+        The class_init() method of a proxy object will, in general,
+behave similarly to the object it replaces, including setting any
+static properties and methods needed by the proxy.
+
+2.3.2.1.2 instance_init / realize
+
+        The instance_init() and realize() functions would only need to
+perform tasks related to being a proxy, such as registering its own
+MMIO handlers, or creating a child bus that other proxy devices can be
+attached to later.  They also need to add a "json_device" string
+property that contains the JSON representation of the command line
+options used to create the object.
+
+        This JSON representation is used to create the corresponding
+object in an emulation process.  e.g., for an LSI SCSI controller
+invoked as:
+
+ -rdevice lsi53c895a,id=scsi0,command="lsi-scsi"
+
+the proxy object would create a
+
+{ "driver" : "lsi53c895a", "id" : "scsi0" }
+
+JSON description.  The "driver" option is assigned to the device name
+when the command line is parsed, so the "-proxy" appended by the
+command line parsing code must be removed.  The "command" option isn't
+needed in the JSON description since it only applies to the proxy
+object in the QEMU process.
+
+        Other tasks are device-specific.  PCI device objects will
+initialize the PCI config space in order to make a valid PCI device
+tree within the QEMU process.  Disk devices will probe their backend
+object to get its JSON description, and publish this description as a
+"json_backend" string property (see the backend discussion below.)
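+
+        As an illustration only: assuming the description follows the
+nested form accepted by "blockdev-add" (an assumption; the exact
+format is not specified here), the "json_backend" property for a qcow2
+image layered on a local file, as in the "-rblockdev" example above,
+might look like:
+
+{ "driver" : "qcow2", "node-name" : "drive0",
+  "file" : { "driver" : "file", "node-name" : "file0",
+             "filename" : "disk-file0" } }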
+
+2.3.2.2 address space registration
+
+        Most devices are driven by guest device driver accesses to IO
+addresses or ports.  The QEMU device emulation code uses QEMU's memory
+region function calls (such as memory_region_init_io()) to add
+callback functions that QEMU will invoke when the guest accesses the
+device's areas of the IO address space.  When a guest driver does
+access the device, the VM will exit HW virtualization mode and return
+to QEMU, which will then look up and execute the corresponding callback
+function.
+
+        A proxy object would need to mirror the memory region calls
+the actual device emulator would perform in its initialization code,
+but with its own callbacks.  When invoked by QEMU as a result of a
+guest IO operation, they will forward the operation to the device
+emulation process via a proxy_proc_send() call.  Any response will
+be read via proxy_proc_recv().
+
+        Note that the callbacks are called with an address space lock,
+so it would not be appropriate to synchronously wait for any
+response.  Instead the QEMU code must be changed to check if the
+thread needs to sleep after the address_space_rw() call (in
+kvm_cpu_exec().)
+
+2.3.2.3 PCI config space
+
+        PCI devices also have a configuration space that can be
+accessed by the guest driver.  Guest accesses to this space are not
+handled by the device emulation object, but by its PCI parent object.
+Much of this space is read-only, but certain registers (especially BAR
+and MSI-related ones) need to be propagated to the emulation process.
+
+2.3.2.3.1 PCI parent proxy
+
+        One way to propagate guest PCI config accesses is to create a
+"pci-device-proxy" class that can serve as the parent of a PCI device
+proxy object.  This class's parent would be "pci-device" and it would
+override the PCI parent's config_read and config_write methods with
+ones that forward these operations to the emulation program.
+
+2.3.2.4 interrupt receipt
+
+        A proxy for a device that generates interrupts will receive
+the interrupt indication via the read callback it provided to
+proxy_ctx_alloc().  The interrupt indication would then be sent up to
+its bus parent to be injected into the guest.  For example, a PCI
+device object may use pci_set_irq().
+
+2.3.3 device backends
+
+        Each type of device has backends which perform IO operations
+in the host system.  For example, block backend objects emulate the
+disk images configured into the VM.  While block backends are
+implemented as objects, not all backends are.  For example, display
+backends (e.g., vnc) are not objects, they register a set of virtual
+functions that are called by QEMU's display emulation.
+
+        These device backends also need to run in the device emulation
+processes, and the emulation process must be given access to the
+corresponding host files or devices.
+
+2.3.3.1 block backends
+
+        Block backends are objects that can implement file protocols
+(such as a local file or an iSCSI volume), implement disk image
+formats (such as qcow2), or serve as a request filter (such as IO
+throttling).  They are often stacked on each other (such as a qcow2
+format on a local file protocol.)  They are named by "node-name"
+properties that are then matched to "drive" properties of the
+corresponding disk devices.
+
+        Block backend objects are not part of the QEMU object model
+(i.e., they're not sub-classes of "object").  They are instantiated
+when the bdrv_file_open() method is invoked with a Qdict dictionary
+of the backend's command line options.
+
+2.3.3.1.1 initialization
+
+        When a "-rblockdev" backend is initialized, it will not open
+the underlying backend object, as is done for "-blockdev" backends.
+Instead, it will create a BlockDriverState node that has a proxy name
+and the original options Qdict.  The proxy name will consist of the
+backend's node-name with "-peer" appended to it (i.e., a "drive0"
+node-name would have a "drive0-peer" peer.)
+
+        A proxy backend object will then be opened, using an
+initialization Qdict containing the "node-name" of the underlying
+backend, so that disk device objects and QMP commands can find it.
+The proxy's Qdict will also be given the proxy name as a "peer"
+property so it can look up its underlying backend object and its
+associated Qdict.
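+
+        For example, for the "drive0" backend above, the proxy's
+initialization Qdict would carry at least the following two entries
+(shown as JSON for illustration; any other keys are omitted because
+they are not specified here):
+
+{ "node-name" : "drive0", "peer" : "drive0-peer" }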
+
+2.3.3.1.2 bdrv_probe_json
+
+        This API returns the JSON description of the peer of a given
+backend proxy.  It will be used by disk device proxy objects to get
+the JSON descriptions of the block backend (and any backends layered
+below) needed to emulate the disk image.
+
+2.3.3.1.3 bdrv_get_json
+
+        This is a new block backend object method that returns the
+JSON description of this object, and all of its underlying objects.  It
+will recursively descend any layered backend objects (e.g., a format
+object will call its underlying protocol object).  This method can be
+invoked on an object that has not been opened.  It will mainly be used
+by bdrv_probe_json().
+
+2.3.3.1.4 bdrv_assign_proxy_name
+
+        The API creates the node with a proxy name, and enters it on a
+list of peer nodes.  This list can be searched by proxy backends to
+find their associated peers.
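+
+        The naming and lookup scheme of 2.3.3.1.1 and 2.3.3.1.4 can be
+modeled in a few lines of Python.  This is a sketch, not QEMU code:
+the PeerList class and its method names are hypothetical, and only
+the "-peer" suffix rule is taken from the text above.
+
```python
class PeerList:
    """Hypothetical registry of un-opened peer nodes, keyed by proxy name."""

    def __init__(self):
        self._peers = {}

    def assign_proxy_name(self, node_name, options):
        # Per the text: the proxy name is the node-name plus "-peer".
        proxy_name = node_name + "-peer"
        if proxy_name in self._peers:
            raise ValueError("duplicate peer: " + proxy_name)
        # Keep a copy of the original options; the peer is not opened here.
        self._peers[proxy_name] = dict(options)
        return proxy_name

    def find_peer(self, proxy_name):
        # Used by a proxy backend to find its peer's options.
        return self._peers[proxy_name]


peers = PeerList()
name = peers.assign_proxy_name("drive0", {"driver": "qcow2", "file": "file0"})
```
+
+        Here "name" is the value a proxy backend would be given as its
+"peer" property.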
+
+2.3.3.1.5 QMP commands
+
+        Various QMP commands operate on blockdevs.  These will need
+to work on rblockdevs in separate processes as well.  There are
+several cases that need to be handled.
+
+2.3.3.1.5.1 adding rblockdevs
+
+        QMP allows users to add blockdevs to a running QEMU instance.
+This is done not just to hot-plug a disk device into a guest, but also
+for advanced blockdev features such as changing quorum devices.
+Likewise, QMP needs to be able to add an rblockdev to the guest, so
+similar operations can be performed on devices being emulated in a
+separate process.
+
+        This operation doesn't need to be performed differently from
+adding an rblockdev from the command line.  Blockdevs are added with a
+qmp_blockdev_add() routine that can be called from either the command
+line parser or from QMP.  Note that the name of the C routine called
+from QMP is generated by a Python script, so a "rblockdev-add" command
+must be implemented by qmp_rblockdev_add().
+
+2.3.3.1.5.2 targeted commands
+
+        Many QMP commands operate on specified blockdevs.  These
+commands will find the proxy node when they look up the targeted name,
+which will then forward the request to the emulation process managing
+the peer node.
+
+2.3.3.1.5.3 blockdev lists
+
+        Several QMP query commands (such as query-block or
+query-block-jobs) operate on all blockdevs.  These will function
+much like targeted commands, with each proxy node forwarding the
+request to its peer emulation process.
+
+2.3.4 proxy APIs
+
+        There will be a set of APIs provided by a process execution
+service for proxy objects to use to manage the separate emulation
+program.
+
+2.3.4.1 proxy_register
+
+        A proxy device object must register itself with the
+proxy_register() API.  The registration call will include validation
+and execution callbacks that will be invoked after the emulated
+machine has been set up in QEMU.
+
+2.3.4.1.1 validation callback
+
+        This callback will be invoked after all devices in the emulated
+system have been initialized.  Its purpose is to validate the device
+configuration by checking that its parent and child bus objects are
+compatible with being proxied.  For example, a disk controller can
+check that all the devices on its bus are proxy objects, or a disk
+object can check that its backend object is a proxy.  If any of the
+validation callbacks return an error, QEMU will exit.  If there are no
+errors, the execution callbacks will be invoked.
+
+2.3.4.1.2 execution callback
+
+        A device proxy object that manages an emulation process will
+provide an execution callback in its proxy_register() call.  This
+callback will allocate an execution context with proxy_ctx_alloc(),
+marshal the arguments needed for the emulation program, and invoke
+proxy_execute() to execute it.
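+
+        The two-phase ordering described in 2.3.4.1.1 and 2.3.4.1.2
+can be modeled as below.  This is a Python sketch under stated
+assumptions: ProxyService and machine_ready() are hypothetical names,
+a validation callback is assumed to return None on success or an
+error string, and all validations are assumed to run before QEMU
+exits (the text does not say whether it stops at the first error).
+
```python
class ProxyService:
    """Hypothetical model of proxy_register() and its two phases."""

    def __init__(self):
        self._registered = []

    def proxy_register(self, validate, execute=None):
        self._registered.append((validate, execute))

    def machine_ready(self):
        # Phase 1: every validation callback must succeed.
        errors = []
        for validate, _ in self._registered:
            err = validate()
            if err is not None:
                errors.append(err)
        if errors:
            raise SystemExit("; ".join(errors))
        # Phase 2: execution callbacks run only after all validations.
        for _, execute in self._registered:
            if execute is not None:
                execute()


started = []
svc = ProxyService()
svc.proxy_register(lambda: None, lambda: started.append("controller"))
svc.proxy_register(lambda: None)            # a disk: validation only
svc.machine_ready()

bad = ProxyService()
bad.proxy_register(lambda: "backend is not a proxy")
try:
    bad.machine_ready()
    exited = False
except SystemExit:
    exited = True
```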
+
+2.3.4.2 proxy_ctx_alloc
+
+        Before the emulation program can be executed, the proxy object
+must call proxy_ctx_alloc() to create an execution context for the
+process.  The execution context will serve as a handle on which the
+other proxy APIs operate.
+
+2.3.4.3 proxy_ctx_callbacks
+
+        This API registers two callback functions, get_reply() and
+get_request(), on the context.  get_reply() is invoked to handle
+replies to requests sent to the emulation process.  get_request() is
+invoked to handle requests from the emulation process.  This API can
+be called multiple times on the same context; a class field within an
+incoming message indicates which callbacks will be invoked.
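+
+        A minimal Python model of the per-class dispatch described
+above.  The message layout is an assumption made for illustration:
+the text only says that a class field selects the callbacks, so the
+"class" and "is_reply" keys and the ProxyCtx class are hypothetical.
+
```python
class ProxyCtx:
    """Hypothetical model of an execution context."""

    def __init__(self):
        self._callbacks = {}

    def ctx_callbacks(self, msg_class, get_reply, get_request):
        # May be called several times, once per message class.
        self._callbacks[msg_class] = (get_reply, get_request)

    def dispatch(self, msg):
        # The class field picks the pair; the direction picks the member.
        get_reply, get_request = self._callbacks[msg["class"]]
        if msg["is_reply"]:
            return get_reply(msg)
        return get_request(msg)


ctx = ProxyCtx()
ctx.ctx_callbacks("mmio", lambda m: "mmio reply", lambda m: "mmio request")
ctx.ctx_callbacks("iommu", lambda m: "iommu reply", lambda m: "iommu request")
```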
+
+2.3.4.4 proxy_execute
+
+        This function executes an emulation program.  It needs to be
+provided with an execution context, the file to execute, and any
+arguments needed by the program.  Before executing the given program,
+it will set up the communications channels for the new process.
+
+2.3.5 communication with emulation process
+
+        The execution service will set up two communication channels
+between the main QEMU process and the emulation process.  The channels
+will be created using socketpair() so that file descriptors can be
+passed from QEMU to the process.
+
+2.3.5.1 requests to emulation process
+
+        The stdin file descriptor of the emulation process will be
+used for requests from QEMU to the emulation process.  The execution
+service provides APIs to send messages to, and receive messages from,
+the emulation process.
+
+2.3.5.1.1 proxy_proc_send
+
+        This API is for the proxy object in QEMU to send messages to
+the emulation process.  Its arguments will include an execution
+context in addition to the actual message.
+
+2.3.5.1.2 proxy_proc_recv
+
+        This API receives replies from the emulation process.  It
+requires the execution context of the target process, and will usually
+be called from the get_reply() callback registered with
+proxy_ctx_callbacks().
+
+2.3.5.2 requests to QEMU process
+
+        The stdout file descriptor of the emulation process will be
+used for requests from the emulation process to QEMU.  As with
+requests to the emulation process, APIs will be provided to facilitate
+communication.
+
+2.3.5.2.1 proxy_qemu_recv
+
+        This API receives requests from the emulation process.  It
+requires the execution context of the target process, and will usually
+be called from the get_request() callback registered with
+proxy_ctx_callbacks().
+
+2.3.5.2.2 proxy_qemu_send
+
+        This API is for the proxy object in QEMU to send replies to
+the emulation process.  Its arguments will include an execution
+context in addition to the actual reply.
+
+2.3.5.3 JSON descriptions
+
+        The initial messages sent to the emulation process will
+describe the devices it will be tasked to emulate.  They will be
+described as JSON arrays of backend and device objects that need to be
+instantiated by the emulation process.
+
+2.3.5.3.1 backend JSON
+
+        The device proxy object will aggregate the "json_backend"
+properties from the disk devices on the bus it controls, and send
+them as a JSON array of objects.  For example, this command line:
+
+-rblockdev driver=file,node-name=file0,filename=disk-file0
+-rblockdev driver=qcow2,node-name=drive0,file=file0
+
+would generate
+
+[
+  { "driver" : "file", "node-name" : "file0", "filename" : "disk-file0" },
+  { "driver" : "qcow2", "node-name" : "drive0", "file" : "file0" }
+]
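+
+        The conversion from option strings to a JSON array can be
+sketched in Python as follows.  The helper names are hypothetical,
+and the parsing is deliberately simplified: it does not implement
+QEMU's option escaping rules (such as ",," inside a value).
+
```python
import json


def rblockdev_to_dict(optstr):
    """Split a "key=value,key=value" option string into a dict."""
    return dict(item.split("=", 1) for item in optstr.split(","))


def backend_json(optstrs):
    # One JSON object per -rblockdev option, in command line order.
    return json.dumps([rblockdev_to_dict(s) for s in optstrs])


array = backend_json([
    "driver=file,node-name=file0,filename=disk-file0",
    "driver=qcow2,node-name=drive0,file=file0",
])
```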
+
+2.3.5.3.2 device JSON
+
+        The device proxy object will aggregate a JSON description of
+itself and devices on the bus it controls (via their "json_device"
+properties), and send them to the emulation process as a JSON array of
+objects.
+
+2.3.5.4 DMA operations
+
+        DMA operations would be handled much like vhost applications
+do.  One of the initial messages sent to the emulation process is a
+guest memory table.  Each entry in this table consists of a file
+descriptor and size that the emulation process can mmap() to directly
+access guest memory, similar to vhost_user_set_mem_table().  Note
+guest memory must be backed by file descriptors, such as when QEMU is
+given the "-mem-path" command line option.
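+
+        The mechanism can be demonstrated with a small Python sketch
+(Python 3.9 or later on a Unix host, for socket.send_fds()).  One
+process plays both roles here, a temporary file stands in for
+file-backed guest RAM, and the single 8-byte size field is an
+assumed message layout, not the format QEMU would use.
+
```python
import mmap
import os
import socket
import struct
import tempfile

PAGE = mmap.PAGESIZE

# Stand-in for guest RAM backed by a file (as with "-mem-path").
ram = tempfile.TemporaryFile()
ram.truncate(PAGE)

qemu_end, emu_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "QEMU" side: send one table entry (the size) with the file
# descriptor attached as SCM_RIGHTS ancillary data.
socket.send_fds(qemu_end, [struct.pack("=Q", PAGE)], [ram.fileno()])

# "Emulation process" side: receive the entry and map the memory.
data, fds, _flags, _addr = socket.recv_fds(emu_end, 8, 1)
(size,) = struct.unpack("=Q", data)
guest_mem = mmap.mmap(fds[0], size)

# A write through the mapping is visible through the original file.
guest_mem[0:4] = b"DMA!"
guest_mem.flush()
readback = os.pread(ram.fileno(), 4, 0)
```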
+
+2.3.5.5 IOMMU operations
+
+        When the emulated system includes an IOMMU, the proxy
+execution service will need to handle IOMMU requests from the
+emulation process using an address_space_get_iotlb_entry() call.  In
+order to handle IOMMU unmaps, the proxy execution service will also
+register as a listener on the device's DMA address space.  When an
+IOMMU memory region is created within the DMA address space, an IOMMU
+notifier for unmaps will be added to the memory region that will
+forward unmaps to the emulation process.
+
+        This also will require a proxy_ctx_callbacks() call to
+register an IOMMU handler for incoming IOMMU requests from the
+emulation program.
+
+2.3.6 device emulation process
+
+        The device emulation process will run the object hierarchy of
+the device, hopefully unmodified.  It will be based on the QEMU source
+code, because for anything but the simplest device, it would not be a
+tractable problem to re-implement both the object model and the many
+device backends that QEMU has.
+
+        The parts of QEMU that the emulation program will need include
+the object model; the memory emulation objects; the device emulation
+objects of the targeted device, and any dependent devices; and, the
+device's backends.  It will also need code to set up the machine
+environment, handle requests from the QEMU process, and route
+machine-level requests (such as interrupts or IOMMU mappings) back to
+the QEMU process.
+
+2.3.6.1 initialization
+
+        The process initialization sequence will follow the same
+sequence as QEMU.  It will first initialize the backend
+objects, then device emulation objects.  The JSON arrays sent by the
+QEMU process will drive which objects need to be created.
+
+2.3.6.1.1 address spaces
+
+        Before the device objects are created, the initial address
+spaces and memory regions must be configured with memory_map_init().
+This creates a RAM memory region object (system_memory) and an IO
+memory region object (system_io).
+
+2.3.6.1.2 RAM
+
+        RAM memory region creation will follow how pc_memory_init()
+creates them, but must use memory_region_init_ram_from_fd() instead of
+memory_region_allocate_system_memory().  The file descriptors needed
+will be supplied by the guest memory table from above.  Those RAM
+regions would then be added to the system_memory memory region with
+memory_region_add_subregion().
+
+2.3.6.1.3 PCI
+
+        IO initialization will be driven by the JSON description sent
+from the QEMU process.  For a PCI device, a PCI bus will need to be
+created with pci_root_bus_new(), and a PCI memory region will need to
+be created and added to the system_memory memory region with
+memory_region_add_subregion_overlap().  The overlap version is
+required for architectures where PCI memory overlaps with RAM memory.
+
+2.3.6.2 MMIO handling
+
+        The device emulation objects will use memory_region_init_io()
+to install their MMIO handlers, and pci_register_bar() to associate
+those handlers with a PCI BAR, as they do within QEMU currently.
+
+        In order to use address_space_rw() in the emulation process to
+handle MMIO requests from QEMU, the PCI physical addresses must be the
+same in the QEMU process and the device emulation process.  In order
+to accomplish that, guest BAR programming must also be forwarded from
+QEMU to the emulation process.
+
+2.3.6.3 interrupt injection
+
+        When device emulation wants to inject an interrupt into the
+VM, the request climbs the device's bus object hierarchy until the
+point where a bus object knows how to signal the interrupt to the
+guest.  The details depend on the type of interrupt being raised.
+
+2.3.6.3.1 PCI pin interrupts
+
+        On x86 systems, there is an emulated IOAPIC object attached to
+the root PCI bus object, and the root PCI object forwards interrupt
+requests to it.  The IOAPIC object, in turn, calls the KVM driver to
+inject the corresponding interrupt into the VM.  The simplest way to
+handle this in an emulation process would be to set up the root PCI bus
+driver (via pci_bus_irqs()) to send an interrupt request back to the
+QEMU process, and have the device proxy object reflect it up the PCI
+tree there.
+
+2.3.6.3.2 PCI MSI/X interrupts
+
+        PCI MSI/X interrupts are implemented in HW as DMA writes to a
+CPU-specific PCI address.  In QEMU on x86, a KVM APIC object receives
+these DMA writes, then calls into the KVM driver to inject the
+interrupt into the VM.  A simple emulation process implementation
+would be to send the MSI DMA address from QEMU as a message at
+initialization, then install an address space handler at that address
+which forwards the MSI message back to QEMU.
+
+2.3.6.4 DMA operations
+
+        When an emulation object wants to DMA into or out of guest
+memory, it first must use dma_memory_map() to convert the DMA address
+to a local virtual address.  The emulation process memory region
+objects set up above will be used to translate the DMA address to a
+local virtual address the device emulation code can access.
+
+2.3.6.5 IOMMU
+
+        When an IOMMU is in use in QEMU, DMA translation uses IOMMU
+memory regions to translate the DMA address to a guest physical
+address before that physical address can be translated to a local
+virtual address.  The emulation process will need similar
+functionality.
+
+2.3.6.5.1 IOTLB cache
+
+        The emulation process will maintain a cache of recent IOMMU
+translations (the IOTLB).  When the translate() callback of an IOMMU
+memory region is invoked, the IOTLB cache will be searched for an
+entry that will map the DMA address to a guest PA.  On a cache miss, a
+message will be sent back to QEMU requesting the corresponding
+translation entry, which will both be used to return a guest address and
+be added to the cache.
+
+2.3.6.5.2 IOTLB purge
+
+        The IOMMU emulation will also need to act on unmap requests
+from QEMU.  These happen when the guest IOMMU driver purges an entry
+from the guest's translation table.
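+
+        The cache and purge behavior of 2.3.6.5.1 and 2.3.6.5.2 can
+be modeled as below.  This Python sketch assumes page-granular
+entries of 4096 bytes and represents the message round trip to QEMU
+as a plain callable; the IOTLBCache class and its method names are
+hypothetical.
+
```python
PAGE_SIZE = 4096


class IOTLBCache:
    """Hypothetical page-granular IOTLB model."""

    def __init__(self, request_from_qemu):
        self._entries = {}               # DMA page -> guest PA page
        self._request = request_from_qemu
        self.misses = 0

    def translate(self, iova):
        page = iova & ~(PAGE_SIZE - 1)
        offset = iova & (PAGE_SIZE - 1)
        if page not in self._entries:
            # Cache miss: ask QEMU, then remember the answer.
            self.misses += 1
            self._entries[page] = self._request(page)
        return self._entries[page] + offset

    def unmap(self, iova, length):
        # Purge every cached page that overlaps [iova, iova + length).
        first = iova & ~(PAGE_SIZE - 1)
        for page in range(first, iova + length, PAGE_SIZE):
            self._entries.pop(page, None)


# Stand-in for the request to QEMU: a fixed offset mapping.
iotlb = IOTLBCache(lambda page: page + 0x100000)
first = iotlb.translate(0x2010)
second = iotlb.translate(0x2020)     # served from the cache
iotlb.unmap(0x2000, PAGE_SIZE)
third = iotlb.translate(0x2010)      # misses again after the purge
```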
+
+2.4 Accelerating device emulation
+
+        The messages that are required to be sent between QEMU and the
+emulation process can add considerable latency to IO operations.  The
+optimizations described below attempt to ameliorate this effect by
+allowing the emulation process to communicate directly with the kernel
+KVM driver.  The KVM file descriptors created would be passed to the
+emulation process via initialization messages, much as the guest
+memory table is.
+
+2.4.1 MMIO acceleration
+
+        Vhost user applications can receive guest virtio driver stores
+directly from KVM.  The issue with the eventfd mechanism used by vhost
+user is that it does not pass any data with the event indication, so
+it cannot handle guest loads or guest stores that carry store data.
+This concept could, however, be expanded to cover more cases.
+
+        The expanded idea would require a new type of KVM device:
+KVM_DEV_TYPE_USER.  This device has two file descriptors: a master
+descriptor that QEMU can use for configuration, and a slave descriptor
+that the emulation process can use to receive MMIO notifications.
+QEMU would create both descriptors using the KVM driver, and pass the
+slave descriptor to the emulation process via an initialization
+message.
+
+2.4.1.1 data structures
+
+2.4.1.1.1 guest physical range
+
+        The guest physical range structure describes the address range
+that a device will respond to.  It includes the base and length of the
+range, as well as which bus the range resides on (e.g., on an x86
+machine, it can specify whether the range refers to memory or IO
+addresses).
+
+        A device can have multiple physical address ranges it responds
+to (e.g., a PCI device can have multiple BARs), so the structure will
+also include an enumeration value to specify which of the device's
+ranges is being referred to.
+
+2.4.1.1.2 MMIO request structure
+
+        This structure describes an MMIO operation.  It includes which
+guest physical range the MMIO was within, the offset within that
+range, the MMIO type (e.g., load or store), and its length and data.
+It also includes a sequence number that can be used to reply to the
+MMIO, and the CPU that issued the MMIO.
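+
+        A Python model of the range structure of 2.4.1.1.1, showing
+how a guest address could be turned into the (range, offset) pair
+carried by an MMIO request.  The field and function names are
+hypothetical; only the set of fields comes from the text.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaRange:
    bus: str       # which bus the range is on, e.g. "memory" or "io"
    index: int     # which of the device's ranges (e.g. a BAR number)
    base: int
    length: int


def find_range(ranges, bus, addr):
    """Return (range, offset) for addr, or None if no range matches."""
    for r in ranges:
        if r.bus == bus and r.base <= addr < r.base + r.length:
            return r, addr - r.base
    return None


bars = [
    PaRange("memory", 0, 0xFEB00000, 0x1000),
    PaRange("io", 1, 0xC000, 0x100),
]
hit = find_range(bars, "memory", 0xFEB00010)
miss = find_range(bars, "io", 0xFEB00010)
```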
+
+2.4.1.1.3 MMIO request queues
+
+        MMIO request queues are FIFO arrays of MMIO request
+structures.  There are two queues: the pending queue is for MMIOs that
+haven't been read by the emulation program, and the sent queue is for
+MMIOs that haven't been acknowledged.  The main use of the second
+queue is to validate MMIO replies from the emulation program.
+
+2.4.1.1.4 scoreboard
+
+        Each CPU in the VM is emulated in QEMU by a separate thread,
+so multiple MMIOs may be waiting to be consumed by an emulation
+program and multiple threads may be waiting for MMIO replies.  The
+scoreboard would contain a wait queue and sequence number for the
+per-CPU threads, allowing them to be individually woken when the MMIO
+reply is received from the emulation program.  It also tracks the
+number of posted MMIO stores to the device that haven't been replied
+to, in order to satisfy the PCI constraint that a load to a device
+will not complete until all previous stores to that device have been
+completed.
+
+2.4.1.1.5 device shadow memory
+
+        Some MMIO loads do not have device side-effects.  These MMIOs
+can be completed without sending an MMIO request to the emulation
+program if the emulation program shares a shadow image of the device's
+memory image with the KVM driver.
+
+        The emulation program will ask the KVM driver to allocate
+memory for the shadow image, and will then use mmap() to directly
+access it.  The emulation program can control KVM access to the shadow
+image by sending KVM an access map telling it which areas of the image
+have no side-effects (and can be completed immediately), and which
+require an MMIO request to the emulation program.  The access map can
+also inform the KVM driver which size accesses are allowed to the
+image.
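+
+        One possible shape for the access map check, as a Python
+sketch.  The tuple layout (start, length, allowed sizes) is an
+assumption made for illustration; the text does not define the
+map's format.
+
```python
def shadow_can_satisfy(access_map, offset, size):
    """Return True if a load can be completed from the shadow image.

    access_map lists the side-effect-free areas as
    (start, length, allowed_sizes) tuples.  Anything not listed needs
    an MMIO request to the emulation program.
    """
    for start, length, allowed_sizes in access_map:
        if start <= offset and offset + size <= start + length:
            return size in allowed_sizes
    return False


# Hypothetical device: bytes 0x00-0x0f hold plain read-back values and
# allow 4-byte loads only; everything else has side-effects.
access_map = [(0x00, 0x10, {4})]
```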
+
+2.4.1.2 master descriptor
+
+        The master descriptor is used by QEMU to configure the new KVM
+device.  The descriptor would be returned by the KVM driver when QEMU
+issues a KVM_CREATE_DEVICE ioctl() with a KVM_DEV_TYPE_USER type.
+
+2.4.1.2.1 KVM_DEV_TYPE_USER device ops
+
+        The KVM_DEV_TYPE_USER operations vector will be registered by
+a kvm_register_device_ops() call when the KVM system is initialized by
+kvm_init().  These device ops are called by the KVM driver when QEMU
+executes certain ioctl()s on its KVM file descriptor.  They include:
+
+2.4.1.2.1.1 create
+
+        This routine is called when QEMU issues a KVM_CREATE_DEVICE
+ioctl() on its per-VM file descriptor.  It will allocate and
+initialize a KVM user device specific data structure, and assign the
+kvm_device private field to it.
+
+2.4.1.2.1.2 ioctl
+
+        This routine is invoked when QEMU issues an ioctl() on the
+master descriptor.  The ioctl() commands supported are defined by the
+KVM device type.  KVM_DEV_TYPE_USER devices will need several commands:
+
+        KVM_DEV_USER_SLAVE_FD creates the slave file descriptor that
+will be passed to the device emulation program.  Only one slave can be
+created by each master descriptor.  The file operations performed by
+this descriptor are described below.
+
+        The KVM_DEV_USER_PA_RANGE command configures a guest physical
+address range that the slave descriptor will receive MMIO
+notifications for.  The range is specified by a guest physical range
+structure argument.  For buses that assign addresses to devices
+dynamically, this command can be executed while the guest is running,
+such as the case when a guest changes a device's PCI BAR registers.
+
+        KVM_DEV_USER_PA_RANGE will use kvm_io_bus_register_dev() to
+register kvm_io_device_ops callbacks to be invoked when the guest
+performs an MMIO operation within the range.  When a range is changed,
+kvm_io_bus_unregister_dev() is used to remove the previous
+instantiation.
+
+        KVM_DEV_USER_TIMEOUT will configure a timeout value that
+specifies how long KVM will wait for the emulation process to respond
+to an MMIO indication.
+
+2.4.1.2.1.3 destroy
+
+        This routine is called when the VM instance is destroyed.  It
+will need to destroy the slave descriptor, and free any memory
+allocated by the driver, as well as the kvm_device structure itself.
+
+2.4.1.3 slave descriptor
+
+        The slave descriptor will have its own file operations vector,
+which responds to system calls on the descriptor performed by the
+device emulation program.
+
+2.4.1.3.1 read
+
+        A read returns any pending MMIO requests from the KVM driver
+as MMIO request structures.  Multiple structures can be returned if
+there are multiple MMIO operations pending.  The MMIO requests are
+moved from the pending queue to the sent queue, and if there are
+threads waiting for space in the pending queue to add new MMIO
+operations, they will be woken here.
+
+2.4.1.3.2 write
+
+        A write also consists of a set of MMIO requests.  They are
+compared to the MMIO requests in the sent queue.  Matches are removed
+from the sent queue, and any threads waiting for the reply are woken.
+If a store is removed, then the number of posted stores in the per-CPU
+scoreboard is decremented.  When the number reaches zero, and a load
+without side-effects was waiting for posted stores to complete, the load
+is continued.
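+
+        The queue movement performed by read and write can be
+modeled as below.  This Python sketch is simplified to a single CPU
+(so one posted-store counter stands in for the per-CPU scoreboard)
+and leaves out sleeping and waking threads; the UserDevice class
+and its method names are hypothetical.
+
```python
from collections import deque


class UserDevice:
    """Hypothetical model of the pending and sent MMIO queues."""

    def __init__(self):
        self.pending = deque()   # not yet read by the emulation program
        self.sent = {}           # read but not yet acknowledged, by seq
        self.posted_stores = 0   # stores that have not been replied to
        self._next_seq = 0

    def guest_mmio(self, kind, offset, data=None):
        # Called on a guest access; kind is "load" or "store".
        seq = self._next_seq
        self._next_seq += 1
        self.pending.append({"seq": seq, "kind": kind,
                             "offset": offset, "data": data})
        if kind == "store":
            self.posted_stores += 1
        return seq

    def read(self):
        # Slave read: hand over all pending requests in FIFO order.
        batch = list(self.pending)
        self.pending.clear()
        for req in batch:
            self.sent[req["seq"]] = req
        return batch

    def write(self, replies):
        # Slave write: accept only replies that match a sent request.
        for reply in replies:
            req = self.sent.pop(reply["seq"], None)
            if req is None:
                raise ValueError("reply to unknown MMIO %d" % reply["seq"])
            if req["kind"] == "store":
                self.posted_stores -= 1

    def load_may_use_shadow(self):
        # PCI ordering: no shadow loads while stores are outstanding.
        return self.posted_stores == 0


dev = UserDevice()
dev.guest_mmio("store", 0x10, b"\x01")
blocked = not dev.load_may_use_shadow()
batch = dev.read()
dev.write([{"seq": req["seq"]} for req in batch])
```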
+
+2.4.1.3.3 ioctl
+
+        There are several ioctl()s that can be performed on the
+slave descriptor.
+
+        A KVM_DEV_USER_SHADOW_SIZE ioctl() causes the KVM driver to
+allocate memory for the shadow image.  This memory can later be
+mmap()ed by the emulation process to share the emulation's view of
+device memory with the KVM driver.
+
+        A KVM_DEV_USER_SHADOW_CTRL ioctl() controls access to the
+shadow image.  It will send the KVM driver a shadow control map, which
+specifies which areas of the image can complete guest loads without
+sending the load request to the emulation program.  It will also
+specify the size of load operations that are allowed.
+
+2.4.1.3.4 poll
+
+        An emulation program will use the poll() call with a POLLIN
+flag to determine if there are MMIO requests waiting to be read.  It
+will return if the pending MMIO request queue is not empty.
+
+2.4.1.3.5 mmap
+
+        This call allows the emulation program to directly access the
+shadow image allocated by the KVM driver.  As device emulation updates
+device memory, changes with no side-effects will be reflected in the
+shadow, and the KVM driver can satisfy guest loads from the shadow
+image without needing to wait for the emulation program.
+
+2.4.1.4 kvm_io_device ops
+
+        Each KVM per-CPU thread can handle MMIO operations on behalf of
+the guest VM.  KVM will use the MMIO's guest physical address to
+search for a matching kvm_io_device to see if the MMIO can be handled
+by the KVM driver instead of exiting back to QEMU.  If a match is
+found, the corresponding callback will be invoked.
+
+2.4.1.4.1 read
+
+        This callback is invoked when the guest performs a load to the
+device.  Loads with side-effects must be handled synchronously, with
+the KVM driver putting the QEMU thread to sleep waiting for the
+emulation process reply before re-starting the guest.  Loads that do
+not have side-effects may be optimized by satisfying them from the
+shadow image, if there are no outstanding stores to the device by this
+CPU.  PCI memory ordering demands that a load cannot complete before
+all older stores to the same device have been completed.
+
+2.4.1.4.2 write
+
+        Stores can be handled asynchronously unless the pending MMIO
+request queue is full.  In this case, the QEMU thread must sleep
+waiting for space in the queue.  Stores will increment the number of
+posted stores in the per-CPU scoreboard, in order to implement the PCI
+ordering constraint above.
+
+2.4.2 interrupt acceleration
+
+        This performance optimization would work much like a vhost
+user application does, where the QEMU process sets up eventfds that
+cause the device's corresponding interrupt to be triggered by the KVM
+driver.  These irq file descriptors are sent to the emulation process
+at initialization, and are used when the emulation code raises a
+device interrupt.
+
+2.4.2.1 intx acceleration
+
+        Traditional PCI pin interrupts are level based, so, in
+addition to an irq file descriptor, a re-sampling file descriptor
+needs to be sent to the emulation program.  This second file
+descriptor allows multiple devices sharing an irq to be notified when
+the interrupt has been acknowledged by the guest, so they can
+re-trigger the interrupt if their device has not de-asserted it.
+
+2.4.2.1.1 intx irq descriptor
+
+        The irq descriptors are created by the proxy object using
+event_notifier_init() to create the irq and re-sampling eventfds, and
+kvm_vm_ioctl(KVM_IRQFD) to bind them to an interrupt.  The interrupt
+route can be found with pci_device_route_intx_to_irq().
+
+2.4.2.1.2 intx routing changes
+
+        Intx routing can be changed when the guest programs the APIC
+the device pin is connected to.  The proxy object in QEMU will use
+pci_device_set_intx_routing_notifier() to be informed of any guest
+changes to the route.  This handler will broadly follow the VFIO
+interrupt logic to change the route: de-assigning the existing irq
+descriptor from its route, then assigning it the new route. (see
+vfio_intx_update())
+
+2.4.2.2 MSI/X acceleration
+
+        MSI/X interrupts are sent as DMA transactions to the host.
+The interrupt data contains a vector that is programmed by the guest.  A
+device may have multiple MSI interrupts associated with it, so
+multiple irq descriptors may need to be sent to the emulation program.
+
+2.4.2.2.1 MSI/X irq descriptor
+
+        This case will also follow the VFIO example.  For each MSI/X
+interrupt, an eventfd is created, a virtual interrupt is allocated by
+kvm_irqchip_add_msi_route(), and the virtual interrupt is bound to the
+eventfd with kvm_irqchip_add_irqfd_notifier().
+
+2.4.2.2.2 MSI/X config space changes
+
+        The guest may dynamically update several MSI-related tables in
+the device's PCI config space.  These include per-MSI interrupt
+enables and vector data.  Additionally, MSIX tables exist in device
+memory space, not config space.  Much like the BAR case above, the
+proxy object must look at guest config space programming to keep the
+MSI interrupt state consistent between QEMU and the emulation program.
+
+
+3. Disaggregated CPU emulation
+
+        After IO services have been disaggregated, a second phase
+would be to separate a process to handle CPU instruction emulation
+from the main QEMU control function.  There are no object separation
+points for this code, so the first task would be to create one.
+
+
+4. Host access controls
+
+        Separating QEMU relies on the host OS's access restriction
+mechanisms to enforce that the differing processes can only access the
+objects they are entitled to.  There are a couple of types of
+mechanisms usually provided by general purpose OSs.
+
+4.1 Discretionary access control
+
+        Discretionary access control allows each user to control who
+can access their files. In Linux, this type of control is usually too
+coarse for QEMU separation, since it only provides three separate
+access controls: one for the same user ID, the second for user IDs
+with the same group ID, and the third for all other user IDs.  Each
+device instance would need a separate user ID to provide access
+control, which is likely to be unwieldy for dynamically created VMs.
+
+4.2 Mandatory access control
+
+        Mandatory access control allows the OS to enforce an additional
+set of controls on top of discretionary access control.
+It also adds other attributes to processes and files such as types,
+roles, and categories, and can establish rules for how processes and
+files can interact.
+
+4.2.1 Type enforcement
+
+        Type enforcement assigns a 'type' attribute to processes and
+files, and allows rules to be written on what operations a process
+with a given type can perform on a file with a given type.  QEMU
+separation could take advantage of type enforcement by running the
+emulation processes with different types, both from the main QEMU
+process, and from the emulation processes of different classes of
+devices.
+
+        For example, guest disk images and disk emulation processes
+could have types separate from the main QEMU process and non-disk
+emulation processes, and the type rules could prevent processes other
+than disk emulation ones from accessing guest disk images.  Similarly,
+network emulation processes can have a type separate from the main
+QEMU process and non-network emulation processes, and only that type can
+access the host tun/tap device used to provide guest networking.
+
+4.2.2 Category enforcement
+
+        Category enforcement assigns a set of numbers within a given
+range to the process or file.  The process is granted access to the
+file if the process's set is a superset of the file's set.  This
+enforcement can be used to separate multiple instances of devices in
+the same class.
+
+        For example, if there are multiple disk devices provided to a
+guest, each device emulation process could be provisioned with a
+separate category.  The different device emulation processes would not
+be able to access each other's backing disk images.
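+
+        The superset rule reduces to a single set comparison.  In
+this Python sketch the category numbers and variable names are
+made up for illustration, with one category provisioned per disk
+emulation process.
+
```python
def category_allows(process_cats, file_cats):
    # Access is granted when the process's set contains the file's set.
    return set(process_cats) >= set(file_cats)


# Hypothetical provisioning: one category per disk emulation process.
disk0_proc, disk0_image = {1}, {1}
disk1_proc, disk1_image = {2}, {2}
```
+
+        A process holding both categories would be able to open
+either image, so such a set should be given only to a process that
+needs that access.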
+
+        Alternatively, categories could be used in lieu of the type
+enforcement scheme described above.  In this scenario, different
+categories would be used to prevent device emulation processes in
+different classes from accessing resources assigned to other classes.
+