libxl: create backend/ xenstore dir for driver domains

Message ID 20200105084148.18887-1-marmarek@invisiblethingslab.com (mailing list archive)
State New, archived

Commit Message

Marek Marczykowski-Górecki Jan. 5, 2020, 8:41 a.m. UTC
Cleaning up backend xenstore entries is a responsibility of the backend.
When backend lives outside of dom0, the domain needs proper permissions
to do it. Normally it is given permission to remove the device dir
itself, but not the dir containing it (named after frontend ID). After a
while those empty leftover directories accumulate to the point of xenstore
returning E2BIG on listing them.

Fix this by giving backend domain write access also to backend/
directory itself when c_info->driver_domain option is set. The code
removing relevant dir is already there (just lacked permissions to do so).

Note this also allows the backend domain to create new entries,
pretending to host backend devices it doesn't have. But since libxl uses
/libxl/ xenstore dir for this information (still outside of backend
domain control), this shouldn't be an issue.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
 tools/libxl/libxl_create.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Ian Jackson Jan. 6, 2020, 2:20 p.m. UTC | #1
Marek Marczykowski-Górecki writes ("[PATCH] libxl: create backend/ xenstore dir for driver domains"):
> Cleaning up backend xenstore entries is a responsibility of the backend.
> When backend lives outside of dom0, the domain needs proper permissions
> to do it. Normally it is given permission to remove the device dir
> itself, but not the dir containing it (named after frontend ID). After a
> while those empty leftover directories accumulate to the point of xenstore
> returning E2BIG on listing them.
> 
> Fix this by giving backend domain write access also to backend/
> directory itself when c_info->driver_domain option is set. The code
> removing relevant dir is already there (just lacked permissions to do so).
> 
> Note this also allows the backend domain to create new entries,
> pretending to host backend devices it doesn't have. But since libxl uses
> /libxl/ xenstore dir for this information (still outside of backend
> domain control), this shouldn't be an issue.

This seems quite hazardous to me.  The reasoning you use to show that
this is OK seems fragile, and in general it doesn't feel right to
give the particular backend such wide scope.

Can we find another way to address this problem ?  I think the
containing directory should be removed by the toolstack.  Why is this
difficult ?  (I presume there is a reason or you would have done it
that way...)

Ian.
Marek Marczykowski-Górecki Jan. 6, 2020, 2:38 p.m. UTC | #2
On Mon, Jan 06, 2020 at 02:20:46PM +0000, Ian Jackson wrote:
> Marek Marczykowski-Górecki writes ("[PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > Cleaning up backend xenstore entries is a responsibility of the backend.
> > When backend lives outside of dom0, the domain needs proper permissions
> > to do it. Normally it is given permission to remove the device dir
> > itself, but not the dir containing it (named after frontend ID). After a
> > while those empty leftover directories accumulate to the point of xenstore
> > returning E2BIG on listing them.
> > 
> > Fix this by giving backend domain write access also to backend/
> > directory itself when c_info->driver_domain option is set. The code
> > removing relevant dir is already there (just lacked permissions to do so).
> > 
> > Note this also allows the backend domain to create new entries,
> > pretending to host backend devices it doesn't have. But since libxl uses
> > /libxl/ xenstore dir for this information (still outside of backend
> > domain control), this shouldn't be an issue.
> 
> This seems quite hazardous to me.  The reasoning you use to show that
> this is OK seems fragile, and in general it doesn't feel right to
> give the particular backend such wide scope.
> 
> Can we find another way to address this problem ?  I think the
> containing directory should be removed by the toolstack.  Why is this
> difficult ?  (I presume there is a reason or you would have done it
> that way...)

It was done this way previously and caused issues, see this commit:

commit 546678c6a60f64fb186640460dfa69a837c8fba5
Author: Roger Pau Monne <roger.pau@citrix.com>
Date:   Wed Sep 23 12:06:56 2015 +0200

    libxl: fix the cleanup of the backend path when using driver domains
    
    With the current libxl implementation the control domain will remove both
    the frontend and the backend xenstore paths of a device that's handled by a
    driver domain. This is incorrect, since the driver domain possibly needs to
    access the backend path in order to perform the disconnection and cleanup of
    the device.
    
    Fix this by making sure the control domain only cleans the frontend path,
    leaving the backend path to be cleaned by the driver domain. Note that if
    the device is not handled by a driver domain the control domain will perform
    the removal of both the frontend and the backend paths.
    
    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
    Reported-by: Alex Velazquez <alex.j.velazquez@gmail.com>
    Cc: Alex Velazquez <alex.j.velazquez@gmail.com>
    Cc: Ian Jackson <ian.jackson@eu.citrix.com>
    Cc: Ian Campbell <ian.campbell@citrix.com>
    Cc: Wei Liu <wei.liu2@citrix.com>
    Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson Jan. 6, 2020, 3:40 p.m. UTC | #3
Adding Roger to the CC.

Marek Marczykowski-Górecki writes ("Re: [PATCH] libxl: create backend/ xenstore dir for driver domains"):
> On Mon, Jan 06, 2020 at 02:20:46PM +0000, Ian Jackson wrote:
> > Marek Marczykowski-Górecki writes ("[PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > > Cleaning up backend xenstore entries is a responsibility of the backend.
> > > When backend lives outside of dom0, the domain needs proper permissions
> > > to do it. Normally it is given permission to remove the device dir
> > > itself, but not the dir containing it (named after frontend ID). After a
> > > while those empty leftover directories accumulate to the point of xenstore
> > > returning E2BIG on listing them.
> > > 
> > > Fix this by giving backend domain write access also to backend/
> > > directory itself when c_info->driver_domain option is set. The code
> > > removing relevant dir is already there (just lacked permissions to do so).
> > > 
> > > Note this also allows the backend domain to create new entries,
> > > pretending to host backend devices it doesn't have. But since libxl uses
> > > /libxl/ xenstore dir for this information (still outside of backend
> > > domain control), this shouldn't be an issue.
> > 
> > This seems quite hazardous to me.  The reasoning you use to show that
> > this is OK seems fragile, and in general it doesn't feel right to
> > give the particular backend such wide scope.
> > 
> > Can we find another way to address this problem ?  I think the
> > containing directory should be removed by the toolstack.  Why is this
> > difficult ?  (I presume there is a reason or you would have done it
> > that way...)
> 
> It was done this way previously and caused issues, see this commit:
> 
> commit 546678c6a60f64fb186640460dfa69a837c8fba5
> Author: Roger Pau Monne <roger.pau@citrix.com>
> Date:   Wed Sep 23 12:06:56 2015 +0200
> 
>     libxl: fix the cleanup of the backend path when using driver domains

Thanks.

>     With the current libxl implementation the control domain will
>     remove both the frontend and the backend xenstore paths of a
>     device that's handled by a driver domain. This is incorrect,
>     since the driver domain possibly needs to access the backend
>     path in order to perform the disconnection and cleanup of the
>     device.
>     
>     Fix this by making sure the control domain only cleans the
>     frontend path, leaving the backend path to be cleaned by the
>     driver domain. Note that if the device is not handled by a
>     driver domain the control domain will perform the removal of
>     both the frontend and the backend paths.

Hmm.  I see my Ack on that.  Nevertheless maybe it is wrong.

Looking at it afresh, I think maybe the right answer is:

 * If the driver domain is expected to be working properly, the
   toolstack should wait for the driver domain to complete the device
   shutdown, before removing the backend node.  Indeed, the toolstack
   ought to wait for this before actually destroying the guest in Xen,
   by the usual logic for clean domain shutdown.

 * There needs to be a way to deal with a broken/unresponsive driver
   domain.  That will involve not waiting for the backend so must
   involve simply deleting the backend from xenstore.

Is the distinction here between "xl shutdown" and "xl destroy", on the
actual guest domain, good enough ?  Hopefully if the driver domain
sees the backend directory simply vanish it can destructively tear
everything down ?

Ian.
Marek Marczykowski-Górecki Jan. 6, 2020, 4:03 p.m. UTC | #4
On Mon, Jan 06, 2020 at 03:40:22PM +0000, Ian Jackson wrote:
> Adding Roger to the CC.
> 
> Marek Marczykowski-Górecki writes ("Re: [PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > On Mon, Jan 06, 2020 at 02:20:46PM +0000, Ian Jackson wrote:
> > > Marek Marczykowski-Górecki writes ("[PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > > > Cleaning up backend xenstore entries is a responsibility of the backend.
> > > > When backend lives outside of dom0, the domain needs proper permissions
> > > > to do it. Normally it is given permission to remove the device dir
> > > > itself, but not the dir containing it (named after frontend ID). After a
> > > > while those empty leftover directories accumulate to the point of xenstore
> > > > returning E2BIG on listing them.
> > > > 
> > > > Fix this by giving backend domain write access also to backend/
> > > > directory itself when c_info->driver_domain option is set. The code
> > > > removing relevant dir is already there (just lacked permissions to do so).
> > > > 
> > > > Note this also allows the backend domain to create new entries,
> > > > pretending to host backend devices it doesn't have. But since libxl uses
> > > > /libxl/ xenstore dir for this information (still outside of backend
> > > > domain control), this shouldn't be an issue.
> > > 
> > > This seems quite hazardous to me.  The reasoning you use to show that
> > > this is OK seems fragile, and in general it doesn't feel right to
> > > give the particular backend such wide scope.
> > > 
> > > Can we find another way to address this problem ?  I think the
> > > containing directory should be removed by the toolstack.  Why is this
> > > difficult ?  (I presume there is a reason or you would have done it
> > > that way...)
> > 
> > It was done this way previously and caused issues, see this commit:
> > 
> > commit 546678c6a60f64fb186640460dfa69a837c8fba5
> > Author: Roger Pau Monne <roger.pau@citrix.com>
> > Date:   Wed Sep 23 12:06:56 2015 +0200
> > 
> >     libxl: fix the cleanup of the backend path when using driver domains
> 
> Thanks.
> 
> >     With the current libxl implementation the control domain will
> >     remove both the frontend and the backend xenstore paths of a
> >     device that's handled by a driver domain. This is incorrect,
> >     since the driver domain possibly needs to access the backend
> >     path in order to perform the disconnection and cleanup of the
> >     device.
> >     
> >     Fix this by making sure the control domain only cleans the
> >     frontend path, leaving the backend path to be cleaned by the
> >     driver domain. Note that if the device is not handled by a
> >     driver domain the control domain will perform the removal of
> >     both the frontend and the backend paths.
> 
> Hmm.  I see my Ack on that.  Nevertheless maybe it is wrong.
> 
> Looking at it afresh, I think maybe the right answer is:
> 
>  * If the driver domain is expected to be working properly, the
>    toolstack should wait for the driver domain to complete the device
>    shutdown, before removing the backend node.  Indeed, the toolstack
>    ought to wait for this before actually destroying the guest in Xen,
>    by the usual logic for clean domain shutdown.

I think that's not enough. .../state = 6 is set by the kernel, but
xl devd in the driver domain may want to clean up things (hotplug scripts
etc). And indeed libxl__device_destroy() is called from
device_hotplug_done(), not device_backend_callback().

Alternatively, toolstack could wait for the actual backend node to be
removed (by the driver domain), and then clean up the parent directory (if
empty). I don't find it particularly appealing, as every contact with
libxl async code reduces overall happiness...

>  * There needs to be a way to deal with a broken/unresponsive driver
>    domain.  That will involve not waiting for the backend so must
>    involve simply deleting the backend from xenstore.

It's already there: if driver domain fails to set .../state = 6 within
a timeout, toolstack will forcibly remove the entry.

> Is the distinction here between "xl shutdown" and "xl destroy", on the
> actual guest domain, good enough ?  Hopefully if the driver domain
> sees the backend directory simply vanish it can destructively tear
> everything down ?

In the past this led to multiple issues, where the hotplug script didn't
know which device actually was removed. In some cases I needed to
work around this by saving a xenstore dump into a file in an "online"
hotplug script, but it is a very ugly solution.
Marek Marczykowski-Górecki March 15, 2020, 10:20 p.m. UTC | #5
On Mon, Jan 06, 2020 at 05:03:40PM +0100, Marek Marczykowski-Górecki wrote:
> On Mon, Jan 06, 2020 at 03:40:22PM +0000, Ian Jackson wrote:
> > Adding Roger to the CC.
> > 
> > Marek Marczykowski-Górecki writes ("Re: [PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > > On Mon, Jan 06, 2020 at 02:20:46PM +0000, Ian Jackson wrote:
> > > > Marek Marczykowski-Górecki writes ("[PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > > > > Cleaning up backend xenstore entries is a responsibility of the backend.
> > > > > When backend lives outside of dom0, the domain needs proper permissions
> > > > > to do it. Normally it is given permission to remove the device dir
> > > > > itself, but not the dir containing it (named after frontend ID). After a
> > > > > while those empty leftover directories accumulate to the point of xenstore
> > > > > returning E2BIG on listing them.
> > > > > 
> > > > > Fix this by giving backend domain write access also to backend/
> > > > > directory itself when c_info->driver_domain option is set. The code
> > > > > removing relevant dir is already there (just lacked permissions to do so).
> > > > > 
> > > > > Note this also allows the backend domain to create new entries,
> > > > > pretending to host backend devices it doesn't have. But since libxl uses
> > > > > /libxl/ xenstore dir for this information (still outside of backend
> > > > > domain control), this shouldn't be an issue.
> > > > 
> > > > This seems quite hazardous to me.  The reasoning you use to show that
> > > > this is OK seems fragile, and in general it doesn't feel right to
> > > > give the particular backend such wide scope.
> > > > 
> > > > Can we find another way to address this problem ?  I think the
> > > > containing directory should be removed by the toolstack.  Why is this
> > > > difficult ?  (I presume there is a reason or you would have done it
> > > > that way...)
> > > 
> > > It was done this way previously and caused issues, see this commit:
> > > 
> > > commit 546678c6a60f64fb186640460dfa69a837c8fba5
> > > Author: Roger Pau Monne <roger.pau@citrix.com>
> > > Date:   Wed Sep 23 12:06:56 2015 +0200
> > > 
> > >     libxl: fix the cleanup of the backend path when using driver domains
> > 
> > Thanks.
> > 
> > >     With the current libxl implementation the control domain will
> > >     remove both the frontend and the backend xenstore paths of a
> > >     device that's handled by a driver domain. This is incorrect,
> > >     since the driver domain possibly needs to access the backend
> > >     path in order to perform the disconnection and cleanup of the
> > >     device.
> > >     
> > >     Fix this by making sure the control domain only cleans the
> > >     frontend path, leaving the backend path to be cleaned by the
> > >     driver domain. Note that if the device is not handled by a
> > >     driver domain the control domain will perform the removal of
> > >     both the frontend and the backend paths.
> > 
> > Hmm.  I see my Ack on that.  Nevertheless maybe it is wrong.
> > 
> > Looking at it afresh, I think maybe the right answer is:
> > 
> >  * If the driver domain is expected to be working properly, the
> >    toolstack should wait for the driver domain to complete the device
> >    shutdown, before removing the backend node.  Indeed, the toolstack
> >    ought to wait for this before actually destroying the guest in Xen,
> >    by the usual logic for clean domain shutdown.
> 
> I think that's not enough. .../state = 6 is set by the kernel, but
> xl devd in the driver domain may want to clean up things (hotplug scripts
> etc). And indeed libxl__device_destroy() is called from
> device_hotplug_done(), not device_backend_callback().
> 
> Alternatively, toolstack could wait for the actual backend node to be
> removed (by the driver domain), and then clean up the parent directory (if
> empty). I don't find it particularly appealing, as every contact with
> libxl async code reduces overall happiness...
> 
> >  * There needs to be a way to deal with a broken/unresponsive driver
> >    domain.  That will involve not waiting for the backend so must
> >    involve simply deleting the backend from xenstore.
> 
> It's already there: if driver domain fails to set .../state = 6 within
> a timeout, toolstack will forcibly remove the entry.
> 
> > Is the distinction here between "xl shutdown" and "xl destroy", on the
> > actual guest domain, good enough ?  Hopefully if the driver domain
> > sees the backend directory simply vanish it can destructively tear
> > everything down ?
> 
> In the past this led to multiple issues, where the hotplug script didn't
> know which device actually was removed. In some cases I needed to
> work around this by saving a xenstore dump into a file in an "online"
> hotplug script, but it is a very ugly solution.

Any opinion on the above?
In the above context (plus the fact that the toolstack uses /libxl to
enumerate devices), I still think giving driver domain write access to
the backend/ node is the right solution for this problem.
Roger Pau Monné March 23, 2020, 3:35 p.m. UTC | #6
On Mon, Jan 06, 2020 at 05:03:40PM +0100, Marek Marczykowski-Górecki wrote:
> On Mon, Jan 06, 2020 at 03:40:22PM +0000, Ian Jackson wrote:
> > Adding Roger to the CC.
> > 
> > Marek Marczykowski-Górecki writes ("Re: [PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > > On Mon, Jan 06, 2020 at 02:20:46PM +0000, Ian Jackson wrote:
> > > > Marek Marczykowski-Górecki writes ("[PATCH] libxl: create backend/ xenstore dir for driver domains"):
> > > > > Cleaning up backend xenstore entries is a responsibility of the backend.
> > > > > When backend lives outside of dom0, the domain needs proper permissions
> > > > > to do it. Normally it is given permission to remove the device dir
> > > > > itself, but not the dir containing it (named after frontend ID). After a
> > > > > while those empty leftover directories accumulate to the point of xenstore
> > > > > returning E2BIG on listing them.
> > > > > 
> > > > > Fix this by giving backend domain write access also to backend/
> > > > > directory itself when c_info->driver_domain option is set. The code
> > > > > removing relevant dir is already there (just lacked permissions to do so).
> > > > > 
> > > > > Note this also allows the backend domain to create new entries,
> > > > > pretending to host backend devices it doesn't have. But since libxl uses
> > > > > /libxl/ xenstore dir for this information (still outside of backend
> > > > > domain control), this shouldn't be an issue.
> > > > 
> > > > This seems quite hazardous to me.  The reasoning you use to show that
> > > > this is OK seems fragile, and in general it doesn't feel right to
> > > > give the particular backend such wide scope.
> > > > 
> > > > Can we find another way to address this problem ?  I think the
> > > > containing directory should be removed by the toolstack.  Why is this
> > > > difficult ?  (I presume there is a reason or you would have done it
> > > > that way...)
> > > 
> > > It was done this way previously and caused issues, see this commit:
> > > 
> > > commit 546678c6a60f64fb186640460dfa69a837c8fba5
> > > Author: Roger Pau Monne <roger.pau@citrix.com>
> > > Date:   Wed Sep 23 12:06:56 2015 +0200
> > > 
> > >     libxl: fix the cleanup of the backend path when using driver domains
> > 
> > Thanks.
> > 
> > >     With the current libxl implementation the control domain will
> > >     remove both the frontend and the backend xenstore paths of a
> > >     device that's handled by a driver domain. This is incorrect,
> > >     since the driver domain possibly needs to access the backend
> > >     path in order to perform the disconnection and cleanup of the
> > >     device.
> > >     
> > >     Fix this by making sure the control domain only cleans the
> > >     frontend path, leaving the backend path to be cleaned by the
> > >     driver domain. Note that if the device is not handled by a
> > >     driver domain the control domain will perform the removal of
> > >     both the frontend and the backend paths.
> > 
> > Hmm.  I see my Ack on that.  Nevertheless maybe it is wrong.
> > 
> > Looking at it afresh, I think maybe the right answer is:
> > 
> >  * If the driver domain is expected to be working properly, the
> >    toolstack should wait for the driver domain to complete the device
> >    shutdown, before removing the backend node.  Indeed, the toolstack
> >    ought to wait for this before actually destroying the guest in Xen,
> >    by the usual logic for clean domain shutdown.
> 
> I think that's not enough. .../state = 6 is set by the kernel, but
> xl devd in the driver domain may want to clean up things (hotplug scripts
> etc). And indeed libxl__device_destroy() is called from
> device_hotplug_done(), not device_backend_callback().
> 
> Alternatively, toolstack could wait for the actual backend node to be
> removed (by the driver domain), and then clean up the parent directory (if
> empty).

I'm not sure you need to clean up the parent directory, albeit it
wouldn't hurt. It needs to be done in a transaction though, so that
you don't race with new additions to it.
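
As an illustration of the race mentioned above, here is a minimal sketch
in plain C of the check-then-remove retry pattern. It does not use libxl
or libxenstore: struct model_dir, model_add_entry() and
remove_dir_if_empty() are hypothetical names for an in-memory model, and
the version counter only stands in for xenstore's own conflict detection
(where, as far as I know, a conflicting transaction fails at
xs_transaction_end() with EAGAIN and has to be restarted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory model of one xenstore directory (not a libxl or
 * libxenstore type). 'version' changes on every write, which is how this
 * model detects a conflicting update at commit time. */
struct model_dir {
    bool exists;
    int nr_entries;
    unsigned version;
};

/* Model of another writer adding a device under the directory. */
static void model_add_entry(struct model_dir *d)
{
    d->exists = true;
    d->nr_entries++;
    d->version++;
}

/* Remove the directory only if it is empty, retrying when the directory
 * changed between the emptiness check and the commit. 'racer', if non-NULL,
 * is called once in that window to simulate a concurrent addition.
 * Returns true if the directory was removed. */
static bool remove_dir_if_empty(struct model_dir *d,
                                void (*racer)(struct model_dir *))
{
    for (;;) {
        unsigned seen = d->version;        /* "transaction start" */

        if (!d->exists || d->nr_entries != 0)
            return false;                  /* missing or in use: leave it */

        if (racer) {                       /* concurrent writer sneaks in */
            racer(d);
            racer = NULL;
        }

        if (d->version != seen)
            continue;                      /* conflict: retry from scratch */

        d->exists = false;                 /* commit the removal */
        d->version++;
        return true;
    }
}
```

Without the retry, the second check would be skipped and a directory that
just gained a new device could be removed from under it.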

> I don't find it particularly appealing, as every contact with
> libxl async code reduces overall happiness...
> 
> >  * There needs to be a way to deal with a broken/unresponsive driver
> >    domain.  That will involve not waiting for the backend so must
> >    involve simply deleting the backend from xenstore.
> 
> It's already there: if driver domain fails to set .../state = 6 within
> a timeout, toolstack will forcibly remove the entry.

Would it work to change this and, instead of monitoring .../state = 6,
monitor whether the parent directory still exists?

> > Is the distinction here between "xl shutdown" and "xl destroy", on the
> > actual guest domain, good enough ?  Hopefully if the driver domain
> > sees the backend directory simply vanish it can destructively tear
> > everything down ?
> 
> In the past this led to multiple issues, where the hotplug script didn't
> know which device actually was removed. In some cases I needed to
> work around this by saving a xenstore dump into a file in an "online"
> hotplug script, but it is a very ugly solution.

Removing the whole directory without giving time to the driver domain
to execute its hotplug scripts can indeed lead to issues, as there's
no guarantee that the hotplug script won't use data in xenstore in
order to perform the cleanup IIRC.

My preferred option would be to wait for the backend directory to be
removed by the driver domain, as I think it's the cleanest and likely
safest approach.

Thanks, Roger.
Marek Marczykowski-Górecki March 24, 2020, 2:45 a.m. UTC | #7
On Mon, Mar 23, 2020 at 04:35:12PM +0100, Roger Pau Monné wrote:
> On Mon, Jan 06, 2020 at 05:03:40PM +0100, Marek Marczykowski-Górecki wrote:
> > Alternatively, toolstack could wait for the actual backend node to be
> > removed (by the driver domain), and then clean up the parent directory (if
> > empty).
> 
> I'm not sure you need to clean up the parent directory,

You do, that's why this is an issue. Otherwise empty directories will
accumulate there, leading to various issues (inability to list, running
out of watches for monitoring them etc).

Example state:

/local/domain/5/backend = ""
/local/domain/5/backend/vif = ""
/local/domain/5/backend/vif/6 = ""
/local/domain/5/backend/vif/7 = ""
/local/domain/5/backend/vif/7/0 = ""
/local/domain/5/backend/vif/7/0/frontend = "/local/domain/7/device/vif/0"
/local/domain/5/backend/vif/7/0/frontend-id = "7"
/local/domain/5/backend/vif/7/0/online = "1"
/local/domain/5/backend/vif/7/0/state = "4"
/local/domain/5/backend/vif/7/0/script = "/etc/xen/scripts/vif-route-qubes"
/local/domain/5/backend/vif/7/0/mac = "00:16:3e:5e:6c:00"
/local/domain/5/backend/vif/7/0/ip = "10.137.0.49 fd09:24ef:4179::a89:31"
/local/domain/5/backend/vif/7/0/bridge = "xenbr0"
/local/domain/5/backend/vif/7/0/handle = "0"
/local/domain/5/backend/vif/7/0/type = "vif"
/local/domain/5/backend/vif/7/0/feature-sg = "1"
/local/domain/5/backend/vif/7/0/feature-gso-tcpv4 = "1"
/local/domain/5/backend/vif/7/0/feature-gso-tcpv6 = "1"
/local/domain/5/backend/vif/7/0/feature-ipv6-csum-offload = "1"
/local/domain/5/backend/vif/7/0/feature-rx-copy = "1"
/local/domain/5/backend/vif/7/0/feature-rx-flip = "0"
/local/domain/5/backend/vif/7/0/feature-multicast-control = "1"
/local/domain/5/backend/vif/7/0/feature-dynamic-multicast-control = "1"
/local/domain/5/backend/vif/7/0/feature-split-event-channels = "1"
/local/domain/5/backend/vif/7/0/multi-queue-max-queues = "2"
/local/domain/5/backend/vif/7/0/feature-ctrl-ring = "1"
/local/domain/5/backend/vif/7/0/hotplug-status = "connected"
/local/domain/5/backend/vif/8 = ""
/local/domain/5/backend/vif/11 = ""
/local/domain/5/backend/vif/12 = ""
/local/domain/5/backend/vif/17 = ""
/local/domain/5/backend/vif/20 = ""
/local/domain/5/backend/vif/23 = ""
/local/domain/5/backend/vif/26 = ""
/local/domain/5/backend/vif/28 = ""
/local/domain/5/backend/vif/29 = ""
/local/domain/5/backend/vif/30 = ""
/local/domain/5/backend/vif/33 = ""
/local/domain/5/backend/vif/34 = ""
(...)
/local/domain/5/backend/vif/416 = ""
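
To make the leftover pattern in the listing above concrete, here is a
small sketch in plain C that counts per-frontend directories with no
children in a dump of this "path = value" form. The helper names
frontend_dir_len() and count_empty_frontend_dirs() are made up for this
illustration and are not part of libxl or the xenstore tools; the sketch
assumes one entry per line, exactly as shown above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the "<prefix>/<frontend-id>" part of 'line' if the line is
 * exactly `<prefix>/<digits> = ""`, otherwise 0. */
static size_t frontend_dir_len(const char *line, const char *prefix)
{
    size_t plen = strlen(prefix);
    size_t i;

    if (strncmp(line, prefix, plen) != 0 || line[plen] != '/')
        return 0;
    i = plen + 1;
    if (!isdigit((unsigned char)line[i]))
        return 0;
    while (isdigit((unsigned char)line[i]))
        i++;
    return strcmp(line + i, " = \"\"") == 0 ? i : 0;
}

/* Count per-frontend directories under 'prefix' that have no child entries. */
static size_t count_empty_frontend_dirs(const char *const *lines, size_t n,
                                        const char *prefix)
{
    size_t empty = 0;

    for (size_t i = 0; i < n; i++) {
        size_t len = frontend_dir_len(lines[i], prefix);
        int has_child = 0;

        if (len == 0)
            continue;
        /* A child is any line that continues the same path with '/'. */
        for (size_t j = 0; j < n && !has_child; j++)
            has_child = strncmp(lines[j], lines[i], len) == 0 &&
                        lines[j][len] == '/';
        if (!has_child)
            empty++;
    }
    return empty;
}
```

Applied to the state above with the prefix /local/domain/5/backend/vif,
only the entry for frontend 7 has children; the other per-frontend
directories shown would all be counted as leftovers.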

> albeit it
> wouldn't hurt. It needs to be done in a transaction though, so that
> you don't race with new additions to it.

Good point.

> > I don't find it particularly appealing, as every contact with
> > libxl async code reduces overall happiness...
> > 
> > >  * There needs to be a way to deal with a broken/unresponsive driver
> > >    domain.  That will involve not waiting for the backend so must
> > >    involve simply deleting the backend from xenstore.
> > 
> > It's already there: if driver domain fails to set .../state = 6 within
> > a timeout, toolstack will forcibly remove the entry.
> 
> Would it work to change this and, instead of monitoring .../state = 6,
> monitor whether the parent directory still exists?

That could be a good idea, to avoid introducing yet another (set of)
callbacks. I'll look into it, it may require different handling of
dom0/non-dom0 backend.

> > > Is the distinction here between "xl shutdown" and "xl destroy", on the
> > > actual guest domain, good enough ?  Hopefully if the driver domain
> > > sees the backend directory simply vanish it can destructively tear
> > > everything down ?
> > 
> > In the past this led to multiple issues, where the hotplug script didn't
> > know which device actually was removed. In some cases I needed to
> > work around this by saving a xenstore dump into a file in an "online"
> > hotplug script, but it is a very ugly solution.
> 
> Removing the whole directory without giving time to the driver domain
> to execute its hotplug scripts can indeed lead to issues, as there's
> no guarantee that the hotplug script won't use data in xenstore in
> order to perform the cleanup IIRC.

Yes, that's what 546678c6a60f64fb186640460dfa69a837c8fba5 fixed, by not
removing it too early.

> My preferred option would be to wait for the backend directory to be
> removed by the driver domain, as I think it's the cleanest and likely
> safest approach.
> 
> Thanks, Roger.
>
Roger Pau Monné March 25, 2020, 10:36 a.m. UTC | #8
On Tue, Mar 24, 2020 at 03:45:30AM +0100, Marek Marczykowski-Górecki wrote:
> On Mon, Mar 23, 2020 at 04:35:12PM +0100, Roger Pau Monné wrote:
> > On Mon, Jan 06, 2020 at 05:03:40PM +0100, Marek Marczykowski-Górecki wrote:
> > > >  * There needs to be a way to deal with a broken/unresponsive driver
> > > >    domain.  That will involve not waiting for the backend so must
> > > >    involve simply deleting the backend from xenstore.
> > > 
> > > It's already there: if driver domain fails to set .../state = 6 within
> > > a timeout, toolstack will forcibly remove the entry.
> > 
> > Would it work to change this and, instead of monitoring .../state = 6,
> > monitor whether the parent directory still exists?
> 
> That could be a good idea, to avoid introducing yet another (set of)
> callbacks. I'll look into it, it may require different handling of
> dom0/non-dom0 backend.

Yes, the domain handling the backend needs to watch .../state, while
the control domain (where the toolstack actually runs) would need to
watch .../ AFAICT.

As you say, I think you could maybe reuse some of the code and add a
special case for the toolstack domain when the backend runs in a
driver domain.

Thanks, Roger.

Patch

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index a6d40b753e..38ca9b85a4 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -763,6 +763,13 @@  retry_transaction:
          */
         libxl__xs_mknod(gc, t, GCSPRINTF("%s/device-model", dom_path), rwperm,
                         ARRAY_SIZE(rwperm));
+
+        /*
+         * Create a local "backend" directory for each guest, writable by that
+         * guest, to allow it properly cleanup removed devices
+         */
+        libxl__xs_mknod(gc, t, GCSPRINTF("%s/backend", dom_path), rwperm,
+                        ARRAY_SIZE(rwperm));
     }
 
     vm_list = libxl_list_vm(ctx, &nb_vm);