[v2] libxl: Fix stubdom PCI passthrough

Message ID	20210812005700.3159-1-jandryuk@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=NCXs=ND=lists.xenproject.org=xen-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 12CFC601FD Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> From: Jason Andryuk <jandryuk@gmail.com> To: xen-devel@lists.xenproject.org Cc: pdurrant@amazon.com, Jason Andryuk <jandryuk@gmail.com>, Ian Jackson <iwj@xenproject.org>, Wei Liu <wl@xen.org>, Anthony PERARD <anthony.perard@citrix.com>, Juergen Gross <jgross@suse.com> Subject: [PATCH v2] libxl: Fix stubdom PCI passthrough Date: Wed, 11 Aug 2021 20:57:00 -0400 Message-Id: <20210812005700.3159-1-jandryuk@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v2] libxl: Fix stubdom PCI passthrough \| expand [v2] libxl: Fix stubdom PCI passthrough

Jason Andryuk Aug. 12, 2021, 12:57 a.m. UTC

commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
reflected in the config" broken stubdom PCI passthrough when it moved
libxl__create_pci_backend later in the function.  xl pci-attach for a
running PV domain may also have been broken, but that was not verified.

The stubdomain is running (!starting) and PV, so it calls
libxl__wait_for_backend.  With the new placement of
libxl__create_pci_backend, the path does not exist and the call
immediately fails.
libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist
libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3)
libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices

The wait is only relevant when the backend is already present.  num_devs
is already used to determine if the backend needs to be created.  Re-use
num_devs to determine if the backend wait is necessary.  The wait is
necessary to avoid racing with another PCI attachment reconfiguring the
front/back. If we are creating the backend, then we don't have to worry
about a racing reconfigure.

Fixes: 0fdb48ffe7a1 ("libxl: Make sure devices added by pci-attach are
reflected in the config")

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
---
v2:
Add Fixes
Expand num_devs use in commit message
---
 tools/libs/light/libxl_pci.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Jan Beulich Aug. 12, 2021, 6:18 a.m. UTC | #1

On 12.08.2021 02:57, Jason Andryuk wrote:
> commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
> reflected in the config" broken stubdom PCI passthrough when it moved
> libxl__create_pci_backend later in the function.  xl pci-attach for a
> running PV domain may also have been broken, but that was not verified.
> 
> The stubdomain is running (!starting) and PV, so it calls
> libxl__wait_for_backend.  With the new placement of
> libxl__create_pci_backend, the path does not exist and the call
> immediately fails.
> libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist
> libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3)
> libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices
> 
> The wait is only relevant when the backend is already present.  num_devs
> is already used to determine if the backend needs to be created.  Re-use
> num_devs to determine if the backend wait is necessary.  The wait is
> necessary to avoid racing with another PCI attachment reconfiguring the
> front/back. If we are creating the backend, then we don't have to worry
> about a racing reconfigure.

And why is such a race possible in the first place? If two independent
actions are permitted in parallel on a domain, wouldn't there better
be synchronization closer to the root of the call tree?

Jan

Ian Jackson Aug. 12, 2021, 11:20 a.m. UTC | #2

Jan Beulich writes ("Re: [PATCH v2] libxl: Fix stubdom PCI passthrough"):
> On 12.08.2021 02:57, Jason Andryuk wrote:
> > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
> > reflected in the config" broken stubdom PCI passthrough when it moved
> > libxl__create_pci_backend later in the function.  xl pci-attach for a
> > running PV domain may also have been broken, but that was not verified.
> > 
> > The stubdomain is running (!starting) and PV, so it calls
> > libxl__wait_for_backend.  With the new placement of
> > libxl__create_pci_backend, the path does not exist and the call
> > immediately fails.
> > libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist
> > libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3)
> > libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices
> > 
> > The wait is only relevant when the backend is already present.  num_devs
> > is already used to determine if the backend needs to be created.  Re-use
> > num_devs to determine if the backend wait is necessary.  The wait is
> > necessary to avoid racing with another PCI attachment reconfiguring the
> > front/back. If we are creating the backend, then we don't have to worry
> > about a racing reconfigure.
> 
> And why is such a race possible in the first place? If two independent
> actions are permitted in parallel on a domain, wouldn't there better
> be synchronization closer to the root of the call tree?

libxl does not have a per-domain lock that would prevent this kind of
thing.  Individual operations that might malfunction if done
concurrently are supposed to take appropriate precautions.

I think that this patch is trying to be those precautions.  It's not
clear to me whether it's correct, though.  I (or another mailntainer)
will have to look at it properly...

Thanks,
Ian.

Jason Andryuk Aug. 13, 2021, 1:22 a.m. UTC | #3

On Thu, Aug 12, 2021 at 7:20 AM Ian Jackson <iwj@xenproject.org> wrote:
>
> Jan Beulich writes ("Re: [PATCH v2] libxl: Fix stubdom PCI passthrough"):
> > On 12.08.2021 02:57, Jason Andryuk wrote:
> > > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
> > > reflected in the config" broken stubdom PCI passthrough when it moved
> > > libxl__create_pci_backend later in the function.  xl pci-attach for a
> > > running PV domain may also have been broken, but that was not verified.
> > >
> > > The stubdomain is running (!starting) and PV, so it calls
> > > libxl__wait_for_backend.  With the new placement of
> > > libxl__create_pci_backend, the path does not exist and the call
> > > immediately fails.
> > > libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist
> > > libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3)
> > > libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices
> > >
> > > The wait is only relevant when the backend is already present.  num_devs
> > > is already used to determine if the backend needs to be created.  Re-use
> > > num_devs to determine if the backend wait is necessary.  The wait is
> > > necessary to avoid racing with another PCI attachment reconfiguring the
> > > front/back. If we are creating the backend, then we don't have to worry
> > > about a racing reconfigure.
> >
> > And why is such a race possible in the first place? If two independent
> > actions are permitted in parallel on a domain, wouldn't there better
> > be synchronization closer to the root of the call tree?

It is possible because pci front/back share the single xenstore state
node but have multiple sub-devices.    "The wait is necessary to avoid
racing with another PCI attachment reconfiguring the front/back" is my
determination after thinking through the code.  xl is poking at the
frontend and backend domains, which are running independently and
their state is not under xl's control.  Connected is the quiescent
state, so it makes sense to wait for it before switching to
Reconfigurating.  That was the old behavior which I am restoring.

> libxl does not have a per-domain lock that would prevent this kind of
> thing.  Individual operations that might malfunction if done
> concurrently are supposed to take appropriate precautions.
>
> I think that this patch is trying to be those precautions.  It's not
> clear to me whether it's correct, though.  I (or another mailntainer)
> will have to look at it properly...

Please do :)

Regards,
Jason

Ian Jackson Aug. 17, 2021, 12:26 p.m. UTC | #4

Jason Andryuk writes ("[PATCH v2] libxl: Fix stubdom PCI passthrough"):
> commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
> reflected in the config" broken stubdom PCI passthrough when it moved
> libxl__create_pci_backend later in the function.  xl pci-attach for a
> running PV domain may also have been broken, but that was not verified.
> 
> The stubdomain is running (!starting) and PV, so it calls
> libxl__wait_for_backend.  With the new placement of
> libxl__create_pci_backend, the path does not exist and the call
> immediately fails.
...
> The wait is only relevant when the backend is already present.  num_devs
> is already used to determine if the backend needs to be created.  Re-use
> num_devs to determine if the backend wait is necessary.  The wait is
> necessary to avoid racing with another PCI attachment reconfiguring the
> front/back. If we are creating the backend, then we don't have to worry
> about a racing reconfigure.

Thanks for working on this.  Sorry it's taken a while for me to look
at this properly.  It seems very complicated.  I confess I am
confused.  I wonder if I actually understand properly how the code in
tree is supposed to work.  So if I seem not to be making sense, please
explain :-).


AFAICT what you are saying is that:

  In 0fdb48ffe7a1, pci-attach was moved later in the setup, until a time
  after the stubdomain is running.  So that code now always runs with
  !starting, when previously it would run with !!starting.

  libxl__wait_for_backend fails if the backend path does not exist;
  previously, when the domain is being created, the wait would be
  skipped.  Now because !starting, the wait is done, and fails because
  the backend path is missing.

  The purpose of the wait is to make sure the frontend is ready to
  accept the instructions, mostly to prevent multiple pci attach
  happening simultaneously.

You are using num_devs to see whether the backend exists already, so
as to skip the failing check.  I don't think that is right.  But I'm
not sure the old code is right either.

If you are right about the reason for the wait, it does not seem
correctly placed.  There is surely a TOCTOU race: first, we do the
wait, and then we write, non-transactionally, to xenstore.  If two of
these processes run at once, they could both decide not to wait,
then both try to create the backend and trample on each other.

This kind of thing is usually supposed to be dealt with by a
combination of the userdata lock (for the config) and xenstore
transaction but the code here doesn't seem to do that correctly.

Shouldn't all of this looking at xenstore occur within the transaction
loop ?  What this code seems to do is read some things
nontransactionally, then enter a transaction, and then make some
writes based on a combination of the pre-transaction and
within-transaction data.  In particular `num_devs` is read
nontransactionally and then written within the transaction, without, I
think being checked for subsequent modification.

Also, I think the purpose of `starting` is to avoid waiting for the
backend to be stable before the frontend is actaully unpaused, in
which case presumably the backend would never be Connected and we
would deadlock (and eventually time out).  In general when we are
hot-adding we must wait for the frontend and backend to be stable
before saying we're done, whereas with cold-adding we set up the
information and simply hope (expect) it all to sort itself out while
the domain boots.  So, I would be expecting to see the wait to happen
*after* the add.

There is also the wrinkle that the non-pci code is differently
structured, because it must not use libxl__wait_for_backend at all.
libxl__wait_for_backend is synchronous and blocking the whole libxl
process for an extended time is not allowed.  But AIUI we have made an
exception for pci because the pci backend is always in dom0 and
trusted.


I looked through the git history.

Very relevant seems
   70628d024da42eea
   libxl: Multi-device passthrough coldplug: do not wait for unstarted guest
which has some explanation from me in the commit message.  I'm not
sure why I didn't spot the anomalous transaction use problem.


Also I found
   1a734d51902dff44
   libxl: fix cold plugged PCI device with stubdomain
and, would you believe it
   18f93842ac3c2ca4
   libxl: fix cold plugged PCI devices with stubdomains
which seems at least tangentially relevant, showing that this seems
persistently to break :-(.  This suggests quite strongly that we
should add some tests for pci passthrough including at least one for
stubdom coldplug.

Also:
   b6c23c86fe5a1a02
   libxl: add all pci devices to xenstore at once (during VM create)
which seems OK.

There has been a lot of refactoring, but much of it hasn't really
changed the structure of this function.

The issue I identify above, with the inconsistent use of transactions,
seems to have existed since the very beginning.  In
   b0a1af61678b5e4c
   libxenlight: implement pci passthrough
the `retry_transaction:` label seems to me to be in the wrong place.


I have CC'd Paul and Oleksandr (committer/reviewer of 0fdb48ffe7a1),
Marek (seems to have touched a lot of the stubdom code here) and
Stefano (original author of the pci passthrough code here)
in case they would like to comment...


Thanks,
Ian.

Ian Jackson Aug. 17, 2021, 12:40 p.m. UTC | #5

> I looked through the git history.

Also there are some doc comments in libxl_internal.h.

In particular "Algorithm for handling device removal"
(near l.2844 in my tree) and the following

I don't think that diagram has been modified since I drew it, so it
may well be out of date.  Indeed I hope at least the new QMP
arrangements aren't racy...

And, "As of Xen 4.5 we maintain various information"
which describes the various locks and has a proof
sketch for correct operation.

Ian.

Jason Andryuk Aug. 18, 2021, 1:13 a.m. UTC | #6

On Tue, Aug 17, 2021 at 8:26 AM Ian Jackson <iwj@xenproject.org> wrote:
>
> Jason Andryuk writes ("[PATCH v2] libxl: Fix stubdom PCI passthrough"):
> > commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
> > reflected in the config" broken stubdom PCI passthrough when it moved
> > libxl__create_pci_backend later in the function.  xl pci-attach for a
> > running PV domain may also have been broken, but that was not verified.
> >
> > The stubdomain is running (!starting) and PV, so it calls
> > libxl__wait_for_backend.  With the new placement of
> > libxl__create_pci_backend, the path does not exist and the call
> > immediately fails.
> ...
> > The wait is only relevant when the backend is already present.  num_devs
> > is already used to determine if the backend needs to be created.  Re-use
> > num_devs to determine if the backend wait is necessary.  The wait is
> > necessary to avoid racing with another PCI attachment reconfiguring the
> > front/back. If we are creating the backend, then we don't have to worry
> > about a racing reconfigure.
>
> Thanks for working on this.  Sorry it's taken a while for me to look
> at this properly.  It seems very complicated.  I confess I am
> confused.  I wonder if I actually understand properly how the code in
> tree is supposed to work.  So if I seem not to be making sense, please
> explain :-).
>
>
> AFAICT what you are saying is that:
>
>   In 0fdb48ffe7a1, pci-attach was moved later in the setup, until a time
>   after the stubdomain is running.  So that code now always runs with
>   !starting, when previously it would run with !!starting.

No, it's not a starting vs. starting issue.  More that the
0fdb48ffe7a1 change didn't consider the !starting case.

We have 3 cases:
PV domain - use xen-pcifront/back
HVM - use QEMU QMP (Modern QEMU - I'm not sure about qemu-traditional)
HVM with Stubdom - xen-pcifront/back to stubdom + QEMU (QMP or
xenstore (traditional)).

Stubdomain is always !starting (running) when the guest is starting.
I think this is so that QEMU is running and can handle QMP commands.

>   libxl__wait_for_backend fails if the backend path does not exist;
>   previously, when the domain is being created, the wait would be
>   skipped.  Now because !starting, the wait is done, and fails because
>   the backend path is missing.

Previously, the backend was created before the wait was called.  The
diff for 0fdb48ffe7a1 shows the movement of the call to
libxl__create_pci_backend().  It's not shown in the diff, but
libxl__wait_for_backend() does not move which leads to the wait for a
non-existent node.

>   The purpose of the wait is to make sure the frontend is ready to
>   accept the instructions, mostly to prevent multiple pci attach
>   happening simultaneously.
>
> You are using num_devs to see whether the backend exists already, so
> as to skip the failing check.  I don't think that is right.  But I'm
> not sure the old code is right either.

num_devs is used pre and post-0fdb48ffe7a1 to control behaviour.  My
change just adds another case.

commit 70628d024da4 "libxl: Multi-device passthrough coldplug: do not
wait for unstarted guest" which you reference below explains the
num_devs use.

> If you are right about the reason for the wait, it does not seem
> correctly placed.  There is surely a TOCTOU race: first, we do the
> wait, and then we write, non-transactionally, to xenstore.  If two of
> these processes run at once, they could both decide not to wait,
> then both try to create the backend and trample on each other.

Yes, two simultaneous "1st" adds would want to create the backend and clash.

> This kind of thing is usually supposed to be dealt with by a
> combination of the userdata lock (for the config) and xenstore
> transaction but the code here doesn't seem to do that correctly.
>
> Shouldn't all of this looking at xenstore occur within the transaction
> loop ?  What this code seems to do is read some things
> nontransactionally, then enter a transaction, and then make some
> writes based on a combination of the pre-transaction and
> within-transaction data.  In particular `num_devs` is read
> nontransactionally and then written within the transaction, without, I
> think being checked for subsequent modification.
>
> Also, I think the purpose of `starting` is to avoid waiting for the
> backend to be stable before the frontend is actaully unpaused, in
> which case presumably the backend would never be Connected and we
> would deadlock (and eventually time out).  In general when we are
> hot-adding we must wait for the frontend and backend to be stable
> before saying we're done, whereas with cold-adding we set up the
> information and simply hope (expect) it all to sort itself out while
> the domain boots.  So, I would be expecting to see the wait to happen
> *after* the add.
>
> There is also the wrinkle that the non-pci code is differently
> structured, because it must not use libxl__wait_for_backend at all.
> libxl__wait_for_backend is synchronous and blocking the whole libxl
> process for an extended time is not allowed.  But AIUI we have made an
> exception for pci because the pci backend is always in dom0 and
> trusted.

I think the wrinkle is that the single PCI backend hosts multiple devices.

> I looked through the git history.
>
> Very relevant seems
>    70628d024da42eea
>    libxl: Multi-device passthrough coldplug: do not wait for unstarted guest
> which has some explanation from me in the commit message.  I'm not
> sure why I didn't spot the anomalous transaction use problem.
>
>
> Also I found
>    1a734d51902dff44
>    libxl: fix cold plugged PCI device with stubdomain
> and, would you believe it
>    18f93842ac3c2ca4
>    libxl: fix cold plugged PCI devices with stubdomains
> which seems at least tangentially relevant, showing that this seems
> persistently to break :-(.  This suggests quite strongly that we
> should add some tests for pci passthrough including at least one for
> stubdom coldplug.
>
> Also:
>    b6c23c86fe5a1a02
>    libxl: add all pci devices to xenstore at once (during VM create)
> which seems OK.
>
> There has been a lot of refactoring, but much of it hasn't really
> changed the structure of this function.
>
> The issue I identify above, with the inconsistent use of transactions,
> seems to have existed since the very beginning.  In
>    b0a1af61678b5e4c
>    libxenlight: implement pci passthrough
> the `retry_transaction:` label seems to me to be in the wrong place.
>
>
> I have CC'd Paul and Oleksandr (committer/reviewer of 0fdb48ffe7a1),
> Marek (seems to have touched a lot of the stubdom code here) and
> Stefano (original author of the pci passthrough code here)
> in case they would like to comment...

Regards,
Jason

[v2] libxl: Fix stubdom PCI passthrough

Commit Message

Comments

Patch