Message ID | 20210812005700.3159-1-jandryuk@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] libxl: Fix stubdom PCI passthrough | expand |
On 12.08.2021 02:57, Jason Andryuk wrote: > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are > reflected in the config" broken stubdom PCI passthrough when it moved > libxl__create_pci_backend later in the function. xl pci-attach for a > running PV domain may also have been broken, but that was not verified. > > The stubdomain is running (!starting) and PV, so it calls > libxl__wait_for_backend. With the new placement of > libxl__create_pci_backend, the path does not exist and the call > immediately fails. > libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist > libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3) > libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices > > The wait is only relevant when the backend is already present. num_devs > is already used to determine if the backend needs to be created. Re-use > num_devs to determine if the backend wait is necessary. The wait is > necessary to avoid racing with another PCI attachment reconfiguring the > front/back. If we are creating the backend, then we don't have to worry > about a racing reconfigure. And why is such a race possible in the first place? If two independent actions are permitted in parallel on a domain, wouldn't there better be synchronization closer to the root of the call tree? Jan
Jan Beulich writes ("Re: [PATCH v2] libxl: Fix stubdom PCI passthrough"): > On 12.08.2021 02:57, Jason Andryuk wrote: > > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are > > reflected in the config" broken stubdom PCI passthrough when it moved > > libxl__create_pci_backend later in the function. xl pci-attach for a > > running PV domain may also have been broken, but that was not verified. > > > > The stubdomain is running (!starting) and PV, so it calls > > libxl__wait_for_backend. With the new placement of > > libxl__create_pci_backend, the path does not exist and the call > > immediately fails. > > libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist > > libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3) > > libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices > > > > The wait is only relevant when the backend is already present. num_devs > > is already used to determine if the backend needs to be created. Re-use > > num_devs to determine if the backend wait is necessary. The wait is > > necessary to avoid racing with another PCI attachment reconfiguring the > > front/back. If we are creating the backend, then we don't have to worry > > about a racing reconfigure. > > And why is such a race possible in the first place? If two independent > actions are permitted in parallel on a domain, wouldn't there better > be synchronization closer to the root of the call tree? libxl does not have a per-domain lock that would prevent this kind of thing. Individual operations that might malfunction if done concurrently are supposed to take appropriate precautions. I think that this patch is trying to be those precautions. It's not clear to me whether it's correct, though. I (or another mailntainer) will have to look at it properly... Thanks, Ian.
On Thu, Aug 12, 2021 at 7:20 AM Ian Jackson <iwj@xenproject.org> wrote: > > Jan Beulich writes ("Re: [PATCH v2] libxl: Fix stubdom PCI passthrough"): > > On 12.08.2021 02:57, Jason Andryuk wrote: > > > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are > > > reflected in the config" broken stubdom PCI passthrough when it moved > > > libxl__create_pci_backend later in the function. xl pci-attach for a > > > running PV domain may also have been broken, but that was not verified. > > > > > > The stubdomain is running (!starting) and PV, so it calls > > > libxl__wait_for_backend. With the new placement of > > > libxl__create_pci_backend, the path does not exist and the call > > > immediately fails. > > > libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist > > > libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3) > > > libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices > > > > > > The wait is only relevant when the backend is already present. num_devs > > > is already used to determine if the backend needs to be created. Re-use > > > num_devs to determine if the backend wait is necessary. The wait is > > > necessary to avoid racing with another PCI attachment reconfiguring the > > > front/back. If we are creating the backend, then we don't have to worry > > > about a racing reconfigure. > > > > And why is such a race possible in the first place? If two independent > > actions are permitted in parallel on a domain, wouldn't there better > > be synchronization closer to the root of the call tree? It is possible because pci front/back share the single xenstore state node but have multiple sub-devices. "The wait is necessary to avoid racing with another PCI attachment reconfiguring the front/back" is my determination after thinking through the code. xl is poking at the frontend and backend domains, which are running independently and their state is not under xl's control. Connected is the quiescent state, so it makes sense to wait for it before switching to Reconfigurating. That was the old behavior which I am restoring. > libxl does not have a per-domain lock that would prevent this kind of > thing. Individual operations that might malfunction if done > concurrently are supposed to take appropriate precautions. > > I think that this patch is trying to be those precautions. It's not > clear to me whether it's correct, though. I (or another mailntainer) > will have to look at it properly... Please do :) Regards, Jason
Jason Andryuk writes ("[PATCH v2] libxl: Fix stubdom PCI passthrough"): > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are > reflected in the config" broken stubdom PCI passthrough when it moved > libxl__create_pci_backend later in the function. xl pci-attach for a > running PV domain may also have been broken, but that was not verified. > > The stubdomain is running (!starting) and PV, so it calls > libxl__wait_for_backend. With the new placement of > libxl__create_pci_backend, the path does not exist and the call > immediately fails. ... > The wait is only relevant when the backend is already present. num_devs > is already used to determine if the backend needs to be created. Re-use > num_devs to determine if the backend wait is necessary. The wait is > necessary to avoid racing with another PCI attachment reconfiguring the > front/back. If we are creating the backend, then we don't have to worry > about a racing reconfigure. Thanks for working on this. Sorry it's taken a while for me to look at this properly. It seems very complicated. I confess I am confused. I wonder if I actually understand properly how the code in tree is supposed to work. So if I seem not to be making sense, please explain :-). AFAICT what you are saying is that: In 0fdb48ffe7a1, pci-attach was moved later in the setup, until a time after the stubdomain is running. So that code now always runs with !starting, when previously it would run with !!starting. libxl__wait_for_backend fails if the backend path does not exist; previously, when the domain is being created, the wait would be skipped. Now because !starting, the wait is done, and fails because the backend path is missing. The purpose of the wait is to make sure the frontend is ready to accept the instructions, mostly to prevent multiple pci attach happening simultaneously. You are using num_devs to see whether the backend exists already, so as to skip the failing check. I don't think that is right. But I'm not sure the old code is right either. If you are right about the reason for the wait, it does not seem correctly placed. There is surely a TOCTOU race: first, we do the wait, and then we write, non-transactionally, to xenstore. If two of these processes run at once, they could both decide not to wait, then both try to create the backend and trample on each other. This kind of thing is usually supposed to be dealt with by a combination of the userdata lock (for the config) and xenstore transaction but the code here doesn't seem to do that correctly. Shouldn't all of this looking at xenstore occur within the transaction loop ? What this code seems to do is read some things nontransactionally, then enter a transaction, and then make some writes based on a combination of the pre-transaction and within-transaction data. In particular `num_devs` is read nontransactionally and then written within the transaction, without, I think being checked for subsequent modification. Also, I think the purpose of `starting` is to avoid waiting for the backend to be stable before the frontend is actaully unpaused, in which case presumably the backend would never be Connected and we would deadlock (and eventually time out). In general when we are hot-adding we must wait for the frontend and backend to be stable before saying we're done, whereas with cold-adding we set up the information and simply hope (expect) it all to sort itself out while the domain boots. So, I would be expecting to see the wait to happen *after* the add. There is also the wrinkle that the non-pci code is differently structured, because it must not use libxl__wait_for_backend at all. libxl__wait_for_backend is synchronous and blocking the whole libxl process for an extended time is not allowed. But AIUI we have made an exception for pci because the pci backend is always in dom0 and trusted. I looked through the git history. Very relevant seems 70628d024da42eea libxl: Multi-device passthrough coldplug: do not wait for unstarted guest which has some explanation from me in the commit message. I'm not sure why I didn't spot the anomalous transaction use problem. Also I found 1a734d51902dff44 libxl: fix cold plugged PCI device with stubdomain and, would you believe it 18f93842ac3c2ca4 libxl: fix cold plugged PCI devices with stubdomains which seems at least tangentially relevant, showing that this seems persistently to break :-(. This suggests quite strongly that we should add some tests for pci passthrough including at least one for stubdom coldplug. Also: b6c23c86fe5a1a02 libxl: add all pci devices to xenstore at once (during VM create) which seems OK. There has been a lot of refactoring, but much of it hasn't really changed the structure of this function. The issue I identify above, with the inconsistent use of transactions, seems to have existed since the very beginning. In b0a1af61678b5e4c libxenlight: implement pci passthrough the `retry_transaction:` label seems to me to be in the wrong place. I have CC'd Paul and Oleksandr (committer/reviewer of 0fdb48ffe7a1), Marek (seems to have touched a lot of the stubdom code here) and Stefano (original author of the pci passthrough code here) in case they would like to comment... Thanks, Ian.
> I looked through the git history.
Also there are some doc comments in libxl_internal.h.
In particular "Algorithm for handling device removal"
(near l.2844 in my tree) and the following
I don't think that diagram has been modified since I drew it, so it
may well be out of date. Indeed I hope at least the new QMP
arrangements aren't racy...
And, "As of Xen 4.5 we maintain various information"
which describes the various locks and has a proof
sketch for correct operation.
Ian.
On Tue, Aug 17, 2021 at 8:26 AM Ian Jackson <iwj@xenproject.org> wrote: > > Jason Andryuk writes ("[PATCH v2] libxl: Fix stubdom PCI passthrough"): > > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are > > reflected in the config" broken stubdom PCI passthrough when it moved > > libxl__create_pci_backend later in the function. xl pci-attach for a > > running PV domain may also have been broken, but that was not verified. > > > > The stubdomain is running (!starting) and PV, so it calls > > libxl__wait_for_backend. With the new placement of > > libxl__create_pci_backend, the path does not exist and the call > > immediately fails. > ... > > The wait is only relevant when the backend is already present. num_devs > > is already used to determine if the backend needs to be created. Re-use > > num_devs to determine if the backend wait is necessary. The wait is > > necessary to avoid racing with another PCI attachment reconfiguring the > > front/back. If we are creating the backend, then we don't have to worry > > about a racing reconfigure. > > Thanks for working on this. Sorry it's taken a while for me to look > at this properly. It seems very complicated. I confess I am > confused. I wonder if I actually understand properly how the code in > tree is supposed to work. So if I seem not to be making sense, please > explain :-). > > > AFAICT what you are saying is that: > > In 0fdb48ffe7a1, pci-attach was moved later in the setup, until a time > after the stubdomain is running. So that code now always runs with > !starting, when previously it would run with !!starting. No, it's not a starting vs. starting issue. More that the 0fdb48ffe7a1 change didn't consider the !starting case. We have 3 cases: PV domain - use xen-pcifront/back HVM - use QEMU QMP (Modern QEMU - I'm not sure about qemu-traditional) HVM with Stubdom - xen-pcifront/back to stubdom + QEMU (QMP or xenstore (traditional)). Stubdomain is always !starting (running) when the guest is starting. I think this is so that QEMU is running and can handle QMP commands. > libxl__wait_for_backend fails if the backend path does not exist; > previously, when the domain is being created, the wait would be > skipped. Now because !starting, the wait is done, and fails because > the backend path is missing. Previously, the backend was created before the wait was called. The diff for 0fdb48ffe7a1 shows the movement of the call to libxl__create_pci_backend(). It's not shown in the diff, but libxl__wait_for_backend() does not move which leads to the wait for a non-existent node. > The purpose of the wait is to make sure the frontend is ready to > accept the instructions, mostly to prevent multiple pci attach > happening simultaneously. > > You are using num_devs to see whether the backend exists already, so > as to skip the failing check. I don't think that is right. But I'm > not sure the old code is right either. num_devs is used pre and post-0fdb48ffe7a1 to control behaviour. My change just adds another case. commit 70628d024da4 "libxl: Multi-device passthrough coldplug: do not wait for unstarted guest" which you reference below explains the num_devs use. > If you are right about the reason for the wait, it does not seem > correctly placed. There is surely a TOCTOU race: first, we do the > wait, and then we write, non-transactionally, to xenstore. If two of > these processes run at once, they could both decide not to wait, > then both try to create the backend and trample on each other. Yes, two simultaneous "1st" adds would want to create the backend and clash. > This kind of thing is usually supposed to be dealt with by a > combination of the userdata lock (for the config) and xenstore > transaction but the code here doesn't seem to do that correctly. > > Shouldn't all of this looking at xenstore occur within the transaction > loop ? What this code seems to do is read some things > nontransactionally, then enter a transaction, and then make some > writes based on a combination of the pre-transaction and > within-transaction data. In particular `num_devs` is read > nontransactionally and then written within the transaction, without, I > think being checked for subsequent modification. > > Also, I think the purpose of `starting` is to avoid waiting for the > backend to be stable before the frontend is actaully unpaused, in > which case presumably the backend would never be Connected and we > would deadlock (and eventually time out). In general when we are > hot-adding we must wait for the frontend and backend to be stable > before saying we're done, whereas with cold-adding we set up the > information and simply hope (expect) it all to sort itself out while > the domain boots. So, I would be expecting to see the wait to happen > *after* the add. > > There is also the wrinkle that the non-pci code is differently > structured, because it must not use libxl__wait_for_backend at all. > libxl__wait_for_backend is synchronous and blocking the whole libxl > process for an extended time is not allowed. But AIUI we have made an > exception for pci because the pci backend is always in dom0 and > trusted. I think the wrinkle is that the single PCI backend hosts multiple devices. > I looked through the git history. > > Very relevant seems > 70628d024da42eea > libxl: Multi-device passthrough coldplug: do not wait for unstarted guest > which has some explanation from me in the commit message. I'm not > sure why I didn't spot the anomalous transaction use problem. > > > Also I found > 1a734d51902dff44 > libxl: fix cold plugged PCI device with stubdomain > and, would you believe it > 18f93842ac3c2ca4 > libxl: fix cold plugged PCI devices with stubdomains > which seems at least tangentially relevant, showing that this seems > persistently to break :-(. This suggests quite strongly that we > should add some tests for pci passthrough including at least one for > stubdom coldplug. > > Also: > b6c23c86fe5a1a02 > libxl: add all pci devices to xenstore at once (during VM create) > which seems OK. > > There has been a lot of refactoring, but much of it hasn't really > changed the structure of this function. > > The issue I identify above, with the inconsistent use of transactions, > seems to have existed since the very beginning. In > b0a1af61678b5e4c > libxenlight: implement pci passthrough > the `retry_transaction:` label seems to me to be in the wrong place. > > > I have CC'd Paul and Oleksandr (committer/reviewer of 0fdb48ffe7a1), > Marek (seems to have touched a lot of the stubdom code here) and > Stefano (original author of the pci passthrough code here) > in case they would like to comment... Regards, Jason
diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c index 1a1c263080..19daf1d4ee 100644 --- a/tools/libs/light/libxl_pci.c +++ b/tools/libs/light/libxl_pci.c @@ -157,8 +157,10 @@ static int libxl__device_pci_add_xenstore(libxl__gc *gc, if (domtype == LIBXL_DOMAIN_TYPE_INVALID) return ERROR_FAIL; - if (!starting && domtype == LIBXL_DOMAIN_TYPE_PV) { - if (libxl__wait_for_backend(gc, be_path, GCSPRINTF("%d", XenbusStateConnected)) < 0) + /* wait is only needed if the backend already exists (num_devs != NULL) */ + if (num_devs && !starting && domtype == LIBXL_DOMAIN_TYPE_PV) { + if (libxl__wait_for_backend(gc, be_path, + GCSPRINTF("%d", XenbusStateConnected)) < 0) return ERROR_FAIL; }
commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are reflected in the config" broken stubdom PCI passthrough when it moved libxl__create_pci_backend later in the function. xl pci-attach for a running PV domain may also have been broken, but that was not verified. The stubdomain is running (!starting) and PV, so it calls libxl__wait_for_backend. With the new placement of libxl__create_pci_backend, the path does not exist and the call immediately fails. libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3) libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices The wait is only relevant when the backend is already present. num_devs is already used to determine if the backend needs to be created. Re-use num_devs to determine if the backend wait is necessary. The wait is necessary to avoid racing with another PCI attachment reconfiguring the front/back. If we are creating the backend, then we don't have to worry about a racing reconfigure. Fixes: 0fdb48ffe7a1 ("libxl: Make sure devices added by pci-attach are reflected in the config") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> --- v2: Add Fixes Expand num_devs use in commit message --- tools/libs/light/libxl_pci.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)