diff mbox series

[XEN,v10,4/5] tools: Add new function to get gsi from dev

Message ID 20240617090035.839640-5-Jiqian.Chen@amd.com (mailing list archive)
State Superseded
Headers show
Series Support device passthrough when dom0 is PVH on Xen | expand

Commit Message

Chen, Jiqian June 17, 2024, 9 a.m. UTC
In PVH dom0, it uses the linux local interrupt mechanism,
when it allocs irq for a gsi, it is dynamic, and follow
the principle of applying first, distributing first. And
irq number is alloced from small to large, but the applying
gsi number is not, may gsi 38 comes before gsi 28, that
causes the irq number is not equal with the gsi number.
And when passthrough a device, QEMU will use its gsi number
to do pirq mapping, see xen_pt_realize->xc_physdev_map_pirq,
but the gsi number is got from file
/sys/bus/pci/devices/<sbdf>/irq, so it will fail when mapping.
And in current codes, there is no method to get gsi for
userspace.

For above purpose, add new function to get gsi. And call this
function before xc_physdev_(un)map_pirq

Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Chen Jiqian <Jiqian.Chen@amd.com>
---
RFC: it needs review and needs to wait for the corresponding third patch on linux kernel side to be merged.
---
 tools/include/xen-sys/Linux/privcmd.h |  7 +++++
 tools/include/xencall.h               |  2 ++
 tools/include/xenctrl.h               |  2 ++
 tools/libs/call/core.c                |  5 ++++
 tools/libs/call/libxencall.map        |  2 ++
 tools/libs/call/linux.c               | 15 +++++++++++
 tools/libs/call/private.h             |  9 +++++++
 tools/libs/ctrl/xc_physdev.c          |  4 +++
 tools/libs/light/Makefile             |  2 +-
 tools/libs/light/libxl_pci.c          | 38 +++++++++++++++++++++++++++
 10 files changed, 85 insertions(+), 1 deletion(-)

Comments

Jan Beulich June 17, 2024, 3:10 p.m. UTC | #1
On 17.06.2024 11:00, Jiqian Chen wrote:
> In PVH dom0, it uses the linux local interrupt mechanism,
> when it allocs irq for a gsi, it is dynamic, and follow
> the principle of applying first, distributing first. And
> irq number is alloced from small to large, but the applying
> gsi number is not, may gsi 38 comes before gsi 28, that
> causes the irq number is not equal with the gsi number.

Hmm, see my earlier explanations on patch 5: GSI and IRQ generally aren't
the same anyway. Therefore this part of the description, while not wrong,
is at least at risk of misleading people.

> --- a/tools/include/xen-sys/Linux/privcmd.h
> +++ b/tools/include/xen-sys/Linux/privcmd.h
> @@ -95,6 +95,11 @@ typedef struct privcmd_mmap_resource {
>  	__u64 addr;
>  } privcmd_mmap_resource_t;
>  
> +typedef struct privcmd_gsi_from_dev {
> +	__u32 sbdf;

That's PCI-centric, without struct and IOCTL names reflecting this fact.

> +	int gsi;

Is "int" legitimate to use here? Doesn't this want to similarly be __u32?

> --- a/tools/include/xencall.h
> +++ b/tools/include/xencall.h
> @@ -113,6 +113,8 @@ int xencall5(xencall_handle *xcall, unsigned int op,
>               uint64_t arg1, uint64_t arg2, uint64_t arg3,
>               uint64_t arg4, uint64_t arg5);
>  
> +int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf);

Hmm, something (by name at least) OS-specific being in the public header
and ...

> --- a/tools/libs/call/libxencall.map
> +++ b/tools/libs/call/libxencall.map
> @@ -10,6 +10,8 @@ VERS_1.0 {
>  		xencall4;
>  		xencall5;
>  
> +		xen_oscall_gsi_from_dev;

... map file. I'm not sure things are intended to be this way.

> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>  #endif
>  }
>  
> +#define PCI_DEVID(bus, devfn)\
> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
> +
> +#define PCI_SBDF(seg, bus, devfn) \
> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))

I'm not a maintainer of this file; if I were, I'd ask that for readability's
sake all excess parentheses be dropped from these.

> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>          goto out_no_irq;
>      }
>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> +#ifdef CONFIG_X86
> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
> +                        (PCI_DEVFN(pci->dev, pci->func)));
> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
> +        /*
> +         * Old kernel version may not support this function,

Just kernel?

> +         * so if fail, keep using irq; if success, use gsi
> +         */
> +        if (gsi > 0) {
> +            irq = gsi;

I'm still puzzled by this, when by now I think we've sufficiently clarified
that IRQs and GSIs use two distinct numbering spaces.

Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
the assumption that it'll fail. What if we decide to make the functionality
available there, too (if only for informational purposes, or for
consistency)? Suddenly you're fallback logic wouldn't work anymore, and
you'd call ...

> +        }
> +#endif
>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);

... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
you strictly want to avoid the call on PV Dom0.

Also for PVH Dom0: I don't think I've seen changes to the hypercall
handling, yet. How can that be when GSI and IRQ aren't the same, and hence
incoming GSI would need translating to IRQ somewhere? I can once again only
assume all your testing was done with IRQs whose numbers happened to match
their GSI numbers. (The difference, imo, would also need calling out in the
public header, where the respective interface struct(s) is/are defined.)

Jan
Chen, Jiqian June 18, 2024, 8:10 a.m. UTC | #2
On 2024/6/17 23:10, Jan Beulich wrote:
> On 17.06.2024 11:00, Jiqian Chen wrote:
>> In PVH dom0, it uses the linux local interrupt mechanism,
>> when it allocs irq for a gsi, it is dynamic, and follow
>> the principle of applying first, distributing first. And
>> irq number is alloced from small to large, but the applying
>> gsi number is not, may gsi 38 comes before gsi 28, that
>> causes the irq number is not equal with the gsi number.
> 
> Hmm, see my earlier explanations on patch 5: GSI and IRQ generally aren't
> the same anyway. Therefore this part of the description, while not wrong,
> is at least at risk of misleading people.
OK, I wll change to say "irq is not the same as gsi".
> 
>> --- a/tools/include/xen-sys/Linux/privcmd.h
>> +++ b/tools/include/xen-sys/Linux/privcmd.h
>> @@ -95,6 +95,11 @@ typedef struct privcmd_mmap_resource {
>>  	__u64 addr;
>>  } privcmd_mmap_resource_t;
>>  
>> +typedef struct privcmd_gsi_from_dev {
>> +	__u32 sbdf;
> 
> That's PCI-centric, without struct and IOCTL names reflecting this fact.
So, change to privcmd_gsi_from_pcidev ?

> 
>> +	int gsi;
> 
> Is "int" legitimate to use here? Doesn't this want to similarly be __u32?
I want to set gsi to negative if there is no record of this translation.

> 
>> --- a/tools/include/xencall.h
>> +++ b/tools/include/xencall.h
>> @@ -113,6 +113,8 @@ int xencall5(xencall_handle *xcall, unsigned int op,
>>               uint64_t arg1, uint64_t arg2, uint64_t arg3,
>>               uint64_t arg4, uint64_t arg5);
>>  
>> +int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf);
> 
> Hmm, something (by name at least) OS-specific being in the public header
> and ...
> 
>> --- a/tools/libs/call/libxencall.map
>> +++ b/tools/libs/call/libxencall.map
>> @@ -10,6 +10,8 @@ VERS_1.0 {
>>  		xencall4;
>>  		xencall5;
>>  
>> +		xen_oscall_gsi_from_dev;
> 
> ... map file. I'm not sure things are intended to be this way.
Let's see other maintainer's opinion.

> 
>> --- a/tools/libs/light/libxl_pci.c
>> +++ b/tools/libs/light/libxl_pci.c
>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>  #endif
>>  }
>>  
>> +#define PCI_DEVID(bus, devfn)\
>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>> +
>> +#define PCI_SBDF(seg, bus, devfn) \
>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
> 
> I'm not a maintainer of this file; if I were, I'd ask that for readability's
> sake all excess parentheses be dropped from these.
Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
It seems other files also do this. See tools/libs/light/libxl_internal.h

> 
>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>          goto out_no_irq;
>>      }
>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>> +#ifdef CONFIG_X86
>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>> +        /*
>> +         * Old kernel version may not support this function,
> 
> Just kernel?
Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
> 
>> +         * so if fail, keep using irq; if success, use gsi
>> +         */
>> +        if (gsi > 0) {
>> +            irq = gsi;
> 
> I'm still puzzled by this, when by now I think we've sufficiently clarified
> that IRQs and GSIs use two distinct numbering spaces.
> 
> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
> the assumption that it'll fail. What if we decide to make the functionality
> available there, too (if only for informational purposes, or for
> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
> you'd call ...
> 
>> +        }
>> +#endif
>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
> 
> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
> you strictly want to avoid the call on PV Dom0.
> 
> Also for PVH Dom0: I don't think I've seen changes to the hypercall
> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
> incoming GSI would need translating to IRQ somewhere? I can once again only
> assume all your testing was done with IRQs whose numbers happened to match
> their GSI numbers. (The difference, imo, would also need calling out in the
> public header, where the respective interface struct(s) is/are defined.)
I feel like you missed out on many of the previous discussions.
Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
But not when using PVH, so passthrough failed.
With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 

And regarding to the implementation of ioctl IOCTL_PRIVCMD_GSI_FROM_DEV, it doesn't have any xen heypercall handling changes, all of its processing logic is on the kernel side.
I know, so you might want to say, "Then you shouldn't put this in xen's code." But this concern was discussed in previous versions, and since the pci maintainer disallowed to add
gsi sysfs on linux kernel side, I had to do so.
Roger, Stefano and Juergen may know more about this part.

> 
> Jan
Jan Beulich June 18, 2024, 9:13 a.m. UTC | #3
On 18.06.2024 10:10, Chen, Jiqian wrote:
> On 2024/6/17 23:10, Jan Beulich wrote:
>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>> --- a/tools/include/xen-sys/Linux/privcmd.h
>>> +++ b/tools/include/xen-sys/Linux/privcmd.h
>>> @@ -95,6 +95,11 @@ typedef struct privcmd_mmap_resource {
>>>  	__u64 addr;
>>>  } privcmd_mmap_resource_t;
>>>  
>>> +typedef struct privcmd_gsi_from_dev {
>>> +	__u32 sbdf;
>>
>> That's PCI-centric, without struct and IOCTL names reflecting this fact.
> So, change to privcmd_gsi_from_pcidev ?

That's what I'd suggest, yes. But remember that it's the kernel maintainers
who have the ultimate say here, as here you're only making a copy of what
the canonical header (in the kernel tree) is going to have.

>>> +	int gsi;
>>
>> Is "int" legitimate to use here? Doesn't this want to similarly be __u32?
> I want to set gsi to negative if there is no record of this translation.

There are surely more explicit ways to signal that case?

>>> --- a/tools/libs/light/libxl_pci.c
>>> +++ b/tools/libs/light/libxl_pci.c
>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>  #endif
>>>  }
>>>  
>>> +#define PCI_DEVID(bus, devfn)\
>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>> +
>>> +#define PCI_SBDF(seg, bus, devfn) \
>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>
>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>> sake all excess parentheses be dropped from these.
> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
> It seems other files also do this. See tools/libs/light/libxl_internal.h

As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
has its own CODING_STYLE, I can't spot anything towards excessive use of
parentheses there.

>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>          goto out_no_irq;
>>>      }
>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>> +#ifdef CONFIG_X86
>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>> +        /*
>>> +         * Old kernel version may not support this function,
>>
>> Just kernel?
> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.

Okay, and when the kernel supports it but the underlying hypervisor doesn't
support what the kernel wants to use in order to fulfill the request, all
is fine? (See also below for what may be needed in the hypervisor, even if
this IOCTL would be satisfied by the kernel without needing to interact with
the hypervisor.)

>>> +         * so if fail, keep using irq; if success, use gsi
>>> +         */
>>> +        if (gsi > 0) {
>>> +            irq = gsi;
>>
>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>> that IRQs and GSIs use two distinct numbering spaces.
>>
>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>> the assumption that it'll fail. What if we decide to make the functionality
>> available there, too (if only for informational purposes, or for
>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>> you'd call ...
>>
>>> +        }
>>> +#endif
>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>
>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>> you strictly want to avoid the call on PV Dom0.
>>
>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>> incoming GSI would need translating to IRQ somewhere? I can once again only
>> assume all your testing was done with IRQs whose numbers happened to match
>> their GSI numbers. (The difference, imo, would also need calling out in the
>> public header, where the respective interface struct(s) is/are defined.)
> I feel like you missed out on many of the previous discussions.
> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
> But not when using PVH, so passthrough failed.
> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 

I understand all of this, to a (I think) sufficient degree at least. Yet what
you're effectively proposing (without explicitly saying so) is that e.g.
struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
to make (simply because the caller may have no way to express things in pIRQ
terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
would be necessary. In fact that field is presently marked as "IN or OUT";
when re-purposed to take a GSI on input, it may end up being necessary to pass
back the pIRQ (in the subject domain's numbering space). Or alternatively it
may be necessary to add yet another sub-function so the GSI can be translated
to the corresponding pIRQ for the domain that's going to use the IRQ, for that
then to be passed into PHYSDEVOP_map_pirq.

> And regarding to the implementation of ioctl IOCTL_PRIVCMD_GSI_FROM_DEV, it doesn't have any xen heypercall handling changes, all of its processing logic is on the kernel side.
> I know, so you might want to say, "Then you shouldn't put this in xen's code." But this concern was discussed in previous versions, and since the pci maintainer disallowed to add
> gsi sysfs on linux kernel side, I had to do so.

Right, but this is a separate aspect (which we simply need to live with on
the Xen side).

Jan
Chen, Jiqian June 20, 2024, 7:03 a.m. UTC | #4
On 2024/6/18 17:13, Jan Beulich wrote:
> On 18.06.2024 10:10, Chen, Jiqian wrote:
>> On 2024/6/17 23:10, Jan Beulich wrote:
>>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>>> --- a/tools/include/xen-sys/Linux/privcmd.h
>>>> +++ b/tools/include/xen-sys/Linux/privcmd.h
>>>> @@ -95,6 +95,11 @@ typedef struct privcmd_mmap_resource {
>>>>  	__u64 addr;
>>>>  } privcmd_mmap_resource_t;
>>>>  
>>>> +typedef struct privcmd_gsi_from_dev {
>>>> +	__u32 sbdf;
>>>
>>> That's PCI-centric, without struct and IOCTL names reflecting this fact.
>> So, change to privcmd_gsi_from_pcidev ?
> 
> That's what I'd suggest, yes. But remember that it's the kernel maintainers
> who have the ultimate say here, as here you're only making a copy of what
> the canonical header (in the kernel tree) is going to have.
OK, then let's wait for the corresponding patch on kernel side to be accepted first.
> 
>>>> +	int gsi;
>>>
>>> Is "int" legitimate to use here? Doesn't this want to similarly be __u32?
>> I want to set gsi to negative if there is no record of this translation.
> 
> There are surely more explicit ways to signal that case?
Maybe, I will think about the implementation on kernel side again.
> 
>>>> --- a/tools/libs/light/libxl_pci.c
>>>> +++ b/tools/libs/light/libxl_pci.c
>>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>>  #endif
>>>>  }
>>>>  
>>>> +#define PCI_DEVID(bus, devfn)\
>>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>>> +
>>>> +#define PCI_SBDF(seg, bus, devfn) \
>>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>>
>>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>>> sake all excess parentheses be dropped from these.
>> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
>> It seems other files also do this. See tools/libs/light/libxl_internal.h
> 
> As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
> has its own CODING_STYLE, I can't spot anything towards excessive use of
> parentheses there.
So, which parentheses do you think are excessive use?
> 
>>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>          goto out_no_irq;
>>>>      }
>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>> +#ifdef CONFIG_X86
>>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>>> +        /*
>>>> +         * Old kernel version may not support this function,
>>>
>>> Just kernel?
>> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
> 
> Okay, and when the kernel supports it but the underlying hypervisor doesn't
> support what the kernel wants to use in order to fulfill the request, all
I don't know what things you mentioned hypervisor doesn't support are,
because xc_physdev_gsi_from_dev is to get the gsi of pcidev through sbdf information,
that relationship can be got only in dom0 instead of Xen hypervisor.

> is fine? (See also below for what may be needed in the hypervisor, even if
You mean xc_physdev_map_pirq needs gsi?

> this IOCTL would be satisfied by the kernel without needing to interact with
> the hypervisor.)
> 
>>>> +         * so if fail, keep using irq; if success, use gsi
>>>> +         */
>>>> +        if (gsi > 0) {
>>>> +            irq = gsi;
>>>
>>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>>> that IRQs and GSIs use two distinct numbering spaces.
>>>
>>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>>> the assumption that it'll fail. What if we decide to make the functionality
>>> available there, too (if only for informational purposes, or for
>>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>>> you'd call ...
>>>
>>>> +        }
>>>> +#endif
>>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>>
>>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>>> you strictly want to avoid the call on PV Dom0.
>>>
>>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>>> incoming GSI would need translating to IRQ somewhere? I can once again only
>>> assume all your testing was done with IRQs whose numbers happened to match
>>> their GSI numbers. (The difference, imo, would also need calling out in the
>>> public header, where the respective interface struct(s) is/are defined.)
>> I feel like you missed out on many of the previous discussions.
>> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
>> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
>> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
>> But not when using PVH, so passthrough failed.
>> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
>> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
>> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 
> 
> I understand all of this, to a (I think) sufficient degree at least. Yet what
> you're effectively proposing (without explicitly saying so) is that e.g.
> struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
> number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
> to make (simply because the caller may have no way to express things in pIRQ
> terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
> would be necessary. In fact that field is presently marked as "IN or OUT";
> when re-purposed to take a GSI on input, it may end up being necessary to pass
> back the pIRQ (in the subject domain's numbering space). Or alternatively it
> may be necessary to add yet another sub-function so the GSI can be translated
> to the corresponding pIRQ for the domain that's going to use the IRQ, for that
> then to be passed into PHYSDEVOP_map_pirq.
If I understood correctly, your concerns about this patch are two:
First, when dom0 is PV, I should not use xc_physdev_gsi_from_dev to get gsi to do xc_physdev_map_pirq, I should keep the original code that use irq.
Second, when dom0 is PVH, I get the gsi, but I should not pass gsi as the fourth parameter of xc_physdev_map_pirq, I should create a new local parameter pirq=-1, and pass it in.

> 
>> And regarding to the implementation of ioctl IOCTL_PRIVCMD_GSI_FROM_DEV, it doesn't have any xen heypercall handling changes, all of its processing logic is on the kernel side.
>> I know, so you might want to say, "Then you shouldn't put this in xen's code." But this concern was discussed in previous versions, and since the pci maintainer disallowed to add
>> gsi sysfs on linux kernel side, I had to do so.
> 
> Right, but this is a separate aspect (which we simply need to live with on
> the Xen side).
> 
> Jan
Jan Beulich June 20, 2024, 7:43 a.m. UTC | #5
On 20.06.2024 09:03, Chen, Jiqian wrote:
> On 2024/6/18 17:13, Jan Beulich wrote:
>> On 18.06.2024 10:10, Chen, Jiqian wrote:
>>> On 2024/6/17 23:10, Jan Beulich wrote:
>>>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>>>> --- a/tools/libs/light/libxl_pci.c
>>>>> +++ b/tools/libs/light/libxl_pci.c
>>>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>>>  #endif
>>>>>  }
>>>>>  
>>>>> +#define PCI_DEVID(bus, devfn)\
>>>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>>>> +
>>>>> +#define PCI_SBDF(seg, bus, devfn) \
>>>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>>>
>>>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>>>> sake all excess parentheses be dropped from these.
>>> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
>>> It seems other files also do this. See tools/libs/light/libxl_internal.h
>>
>> As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
>> has its own CODING_STYLE, I can't spot anything towards excessive use of
>> parentheses there.
> So, which parentheses do you think are excessive use?

#define PCI_DEVID(bus, devfn)\
            (((uint16_t)(bus) << 8) | ((devfn) & 0xff))

#define PCI_SBDF(seg, bus, devfn) \
            (((uint32_t)(seg) << 16) | PCI_DEVID(bus, devfn))

>>>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>>          goto out_no_irq;
>>>>>      }
>>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>>> +#ifdef CONFIG_X86
>>>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>>>> +        /*
>>>>> +         * Old kernel version may not support this function,
>>>>
>>>> Just kernel?
>>> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
>>
>> Okay, and when the kernel supports it but the underlying hypervisor doesn't
>> support what the kernel wants to use in order to fulfill the request, all
> I don't know what things you mentioned hypervisor doesn't support are,
> because xc_physdev_gsi_from_dev is to get the gsi of pcidev through sbdf information,
> that relationship can be got only in dom0 instead of Xen hypervisor.
> 
>> is fine? (See also below for what may be needed in the hypervisor, even if
> You mean xc_physdev_map_pirq needs gsi?

I'd put it slightly differently: You arrange for that function to now take a
GSI when the caller is PVH. But yes, the function, when used with
MAP_PIRQ_TYPE_GSI, clearly expects a GSI as input (see also below).

>> this IOCTL would be satisfied by the kernel without needing to interact with
>> the hypervisor.)
>>
>>>>> +         * so if fail, keep using irq; if success, use gsi
>>>>> +         */
>>>>> +        if (gsi > 0) {
>>>>> +            irq = gsi;
>>>>
>>>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>>>> that IRQs and GSIs use two distinct numbering spaces.
>>>>
>>>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>>>> the assumption that it'll fail. What if we decide to make the functionality
>>>> available there, too (if only for informational purposes, or for
>>>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>>>> you'd call ...
>>>>
>>>>> +        }
>>>>> +#endif
>>>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>>>
>>>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>>>> you strictly want to avoid the call on PV Dom0.
>>>>
>>>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>>>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>>>> incoming GSI would need translating to IRQ somewhere? I can once again only
>>>> assume all your testing was done with IRQs whose numbers happened to match
>>>> their GSI numbers. (The difference, imo, would also need calling out in the
>>>> public header, where the respective interface struct(s) is/are defined.)
>>> I feel like you missed out on many of the previous discussions.
>>> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
>>> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
>>> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
>>> But not when using PVH, so passthrough failed.
>>> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
>>> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
>>> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 
>>
>> I understand all of this, to a (I think) sufficient degree at least. Yet what
>> you're effectively proposing (without explicitly saying so) is that e.g.
>> struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
>> number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
>> to make (simply because the caller may have no way to express things in pIRQ
>> terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
>> would be necessary. In fact that field is presently marked as "IN or OUT";
>> when re-purposed to take a GSI on input, it may end up being necessary to pass
>> back the pIRQ (in the subject domain's numbering space). Or alternatively it
>> may be necessary to add yet another sub-function so the GSI can be translated
>> to the corresponding pIRQ for the domain that's going to use the IRQ, for that
>> then to be passed into PHYSDEVOP_map_pirq.
> If I understood correctly, your concerns about this patch are two:
> First, when dom0 is PV, I should not use xc_physdev_gsi_from_dev to get gsi to do xc_physdev_map_pirq, I should keep the original code that use irq.

Yes.

> Second, when dom0 is PVH, I get the gsi, but I should not pass gsi as the fourth parameter of xc_physdev_map_pirq, I should create a new local parameter pirq=-1, and pass it in.

I think so, yes. You also may need to record the output value, so you can later
use it for unmap. xc_physdev_map_pirq() may also need adjusting, as right now
it wouldn't put a negative incoming *pirq into the .pirq field. I actually
wonder if that's even correct right now, i.e. independent of your change.

Jan
Chen, Jiqian June 20, 2024, 10:23 a.m. UTC | #6
On 2024/6/20 15:43, Jan Beulich wrote:
> On 20.06.2024 09:03, Chen, Jiqian wrote:
>> On 2024/6/18 17:13, Jan Beulich wrote:
>>> On 18.06.2024 10:10, Chen, Jiqian wrote:
>>>> On 2024/6/17 23:10, Jan Beulich wrote:
>>>>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>>>>> --- a/tools/libs/light/libxl_pci.c
>>>>>> +++ b/tools/libs/light/libxl_pci.c
>>>>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>>>>  #endif
>>>>>>  }
>>>>>>  
>>>>>> +#define PCI_DEVID(bus, devfn)\
>>>>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>>>>> +
>>>>>> +#define PCI_SBDF(seg, bus, devfn) \
>>>>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>>>>
>>>>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>>>>> sake all excess parentheses be dropped from these.
>>>> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
>>>> It seems other files also do this. See tools/libs/light/libxl_internal.h
>>>
>>> As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
>>> has its own CODING_STYLE, I can't spot anything towards excessive use of
>>> parentheses there.
>> So, which parentheses do you think are excessive use?
> 
> #define PCI_DEVID(bus, devfn)\
>             (((uint16_t)(bus) << 8) | ((devfn) & 0xff))
> 
> #define PCI_SBDF(seg, bus, devfn) \
>             (((uint32_t)(seg) << 16) | PCI_DEVID(bus, devfn))
Thanks, will change in next version.

> 
>>>>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>>>          goto out_no_irq;
>>>>>>      }
>>>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>>>> +#ifdef CONFIG_X86
>>>>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>>>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>>>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>>>>> +        /*
>>>>>> +         * Old kernel version may not support this function,
>>>>>
>>>>> Just kernel?
>>>> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
>>>
>>> Okay, and when the kernel supports it but the underlying hypervisor doesn't
>>> support what the kernel wants to use in order to fulfill the request, all
>> I don't know what things you mentioned hypervisor doesn't support are,
>> because xc_physdev_gsi_from_dev is to get the gsi of pcidev through sbdf information,
>> that relationship can be got only in dom0 instead of Xen hypervisor.
>>
>>> is fine? (See also below for what may be needed in the hypervisor, even if
>> You mean xc_physdev_map_pirq needs gsi?
> 
> I'd put it slightly differently: You arrange for that function to now take a
> GSI when the caller is PVH. But yes, the function, when used with
> MAP_PIRQ_TYPE_GSI, clearly expects a GSI as input (see also below).
> 
>>> this IOCTL would be satisfied by the kernel without needing to interact with
>>> the hypervisor.)
>>>
>>>>>> +         * so if fail, keep using irq; if success, use gsi
>>>>>> +         */
>>>>>> +        if (gsi > 0) {
>>>>>> +            irq = gsi;
>>>>>
>>>>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>>>>> that IRQs and GSIs use two distinct numbering spaces.
>>>>>
>>>>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>>>>> the assumption that it'll fail. What if we decide to make the functionality
>>>>> available there, too (if only for informational purposes, or for
>>>>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>>>>> you'd call ...
>>>>>
>>>>>> +        }
>>>>>> +#endif
>>>>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>>>>
>>>>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>>>>> you strictly want to avoid the call on PV Dom0.
>>>>>
>>>>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>>>>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>>>>> incoming GSI would need translating to IRQ somewhere? I can once again only
>>>>> assume all your testing was done with IRQs whose numbers happened to match
>>>>> their GSI numbers. (The difference, imo, would also need calling out in the
>>>>> public header, where the respective interface struct(s) is/are defined.)
>>>> I feel like you missed out on many of the previous discussions.
>>>> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
>>>> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
>>>> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
>>>> But not when using PVH, so passthrough failed.
>>>> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
>>>> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
>>>> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 
>>>
>>> I understand all of this, to a (I think) sufficient degree at least. Yet what
>>> you're effectively proposing (without explicitly saying so) is that e.g.
>>> struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
>>> number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
>>> to make (simply because the caller may have no way to express things in pIRQ
>>> terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
>>> would be necessary. In fact that field is presently marked as "IN or OUT";
>>> when re-purposed to take a GSI on input, it may end up being necessary to pass
>>> back the pIRQ (in the subject domain's numbering space). Or alternatively it
>>> may be necessary to add yet another sub-function so the GSI can be translated
>>> to the corresponding pIRQ for the domain that's going to use the IRQ, for that
>>> then to be passed into PHYSDEVOP_map_pirq.
>> If I understood correctly, your concerns about this patch are two:
>> First, when dom0 is PV, I should not use xc_physdev_gsi_from_dev to get gsi to do xc_physdev_map_pirq, I should keep the original code that use irq.
> 
> Yes.
OK, I can change to do this.
But I still have a concern:
Although without my changes, passthrough can work now when dom0 is PV.
And you also agree that: for xc_physdev_map_pirq, when use with MAP_PIRQ_TYPE_GSI, it expects a GSI as input.
Isn't it a wrong for PV dom0 to pass irq in? Why don't we use gsi as it should be used, since we have a function to get gsi now?

> 
>> Second, when dom0 is PVH, I get the gsi, but I should not pass gsi as the fourth parameter of xc_physdev_map_pirq, I should create a new local parameter pirq=-1, and pass it in.
> 
> I think so, yes. You also may need to record the output value, so you can later
> use it for unmap. xc_physdev_map_pirq() may also need adjusting, as right now
> it wouldn't put a negative incoming *pirq into the .pirq field. 
xc_physdev_map_pirq's logic is if we pass a negative in, it sets *pirq to index(gsi).
Is its logic right? If not how do we change it?

> I actually wonder if that's even correct right now, i.e. independent of your change.
Even without my changes, passthrough can work for PV dom0, not for PVH dom0.
According to the logic of hypercall PHYSDEVOP_map_pirq,
if pirq is -1, it calls physdev_map_pirq-> allocate_and_map_gsi_pirq-> allocate_pirq -> get_free_pirq to get pirq.
If pirq is set to positive before calling hypercall, it set pirq to its own value in allocate_pirq.

> 
> Jan
Jan Beulich June 20, 2024, 10:37 a.m. UTC | #7
On 20.06.2024 12:23, Chen, Jiqian wrote:
> On 2024/6/20 15:43, Jan Beulich wrote:
>> On 20.06.2024 09:03, Chen, Jiqian wrote:
>>> On 2024/6/18 17:13, Jan Beulich wrote:
>>>> On 18.06.2024 10:10, Chen, Jiqian wrote:
>>>>> On 2024/6/17 23:10, Jan Beulich wrote:
>>>>>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>>>>>> --- a/tools/libs/light/libxl_pci.c
>>>>>>> +++ b/tools/libs/light/libxl_pci.c
>>>>>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>>>>>  #endif
>>>>>>>  }
>>>>>>>  
>>>>>>> +#define PCI_DEVID(bus, devfn)\
>>>>>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>>>>>> +
>>>>>>> +#define PCI_SBDF(seg, bus, devfn) \
>>>>>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>>>>>
>>>>>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>>>>>> sake all excess parentheses be dropped from these.
>>>>> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
>>>>> It seems other files also do this. See tools/libs/light/libxl_internal.h
>>>>
>>>> As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
>>>> has its own CODING_STYLE, I can't spot anything towards excessive use of
>>>> parentheses there.
>>> So, which parentheses do you think are excessive use?
>>
>> #define PCI_DEVID(bus, devfn)\
>>             (((uint16_t)(bus) << 8) | ((devfn) & 0xff))
>>
>> #define PCI_SBDF(seg, bus, devfn) \
>>             (((uint32_t)(seg) << 16) | PCI_DEVID(bus, devfn))
> Thanks, will change in next version.
> 
>>
>>>>>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>>>>          goto out_no_irq;
>>>>>>>      }
>>>>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>>>>> +#ifdef CONFIG_X86
>>>>>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>>>>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>>>>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>>>>>> +        /*
>>>>>>> +         * Old kernel version may not support this function,
>>>>>>
>>>>>> Just kernel?
>>>>> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
>>>>
>>>> Okay, and when the kernel supports it but the underlying hypervisor doesn't
>>>> support what the kernel wants to use in order to fulfill the request, all
>>> I don't know what things you mentioned hypervisor doesn't support are,
>>> because xc_physdev_gsi_from_dev is to get the gsi of pcidev through sbdf information,
>>> that relationship can be got only in dom0 instead of Xen hypervisor.
>>>
>>>> is fine? (See also below for what may be needed in the hypervisor, even if
>>> You mean xc_physdev_map_pirq needs gsi?
>>
>> I'd put it slightly differently: You arrange for that function to now take a
>> GSI when the caller is PVH. But yes, the function, when used with
>> MAP_PIRQ_TYPE_GSI, clearly expects a GSI as input (see also below).
>>
>>>> this IOCTL would be satisfied by the kernel without needing to interact with
>>>> the hypervisor.)
>>>>
>>>>>>> +         * so if fail, keep using irq; if success, use gsi
>>>>>>> +         */
>>>>>>> +        if (gsi > 0) {
>>>>>>> +            irq = gsi;
>>>>>>
>>>>>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>>>>>> that IRQs and GSIs use two distinct numbering spaces.
>>>>>>
>>>>>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>>>>>> the assumption that it'll fail. What if we decide to make the functionality
>>>>>> available there, too (if only for informational purposes, or for
>>>>>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>>>>>> you'd call ...
>>>>>>
>>>>>>> +        }
>>>>>>> +#endif
>>>>>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>>>>>
>>>>>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>>>>>> you strictly want to avoid the call on PV Dom0.
>>>>>>
>>>>>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>>>>>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>>>>>> incoming GSI would need translating to IRQ somewhere? I can once again only
>>>>>> assume all your testing was done with IRQs whose numbers happened to match
>>>>>> their GSI numbers. (The difference, imo, would also need calling out in the
>>>>>> public header, where the respective interface struct(s) is/are defined.)
>>>>> I feel like you missed out on many of the previous discussions.
>>>>> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
>>>>> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
>>>>> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
>>>>> But not when using PVH, so passthrough failed.
>>>>> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
>>>>> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
>>>>> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 
>>>>
>>>> I understand all of this, to a (I think) sufficient degree at least. Yet what
>>>> you're effectively proposing (without explicitly saying so) is that e.g.
>>>> struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
>>>> number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
>>>> to make (simply because the caller may have no way to express things in pIRQ
>>>> terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
>>>> would be necessary. In fact that field is presently marked as "IN or OUT";
>>>> when re-purposed to take a GSI on input, it may end up being necessary to pass
>>>> back the pIRQ (in the subject domain's numbering space). Or alternatively it
>>>> may be necessary to add yet another sub-function so the GSI can be translated
>>>> to the corresponding pIRQ for the domain that's going to use the IRQ, for that
>>>> then to be passed into PHYSDEVOP_map_pirq.
>>> If I understood correctly, your concerns about this patch are two:
>>> First, when dom0 is PV, I should not use xc_physdev_gsi_from_dev to get gsi to do xc_physdev_map_pirq, I should keep the original code that use irq.
>>
>> Yes.
> OK, I can change to do this.
> But I still have a concern:
> Although without my changes, passthrough can work now when dom0 is PV.
> And you also agree that: for xc_physdev_map_pirq, when use with MAP_PIRQ_TYPE_GSI, it expects a GSI as input.
> Isn't it a wrong for PV dom0 to pass irq in? Why don't we use gsi as it should be used, since we have a function to get gsi now?

Indeed this and ...

>>> Second, when dom0 is PVH, I get the gsi, but I should not pass gsi as the fourth parameter of xc_physdev_map_pirq, I should create a new local parameter pirq=-1, and pass it in.
>>
>> I think so, yes. You also may need to record the output value, so you can later
>> use it for unmap. xc_physdev_map_pirq() may also need adjusting, as right now
>> it wouldn't put a negative incoming *pirq into the .pirq field. 
> xc_physdev_map_pirq's logic is if we pass a negative in, it sets *pirq to index(gsi).
> Is its logic right? If not how do we change it?

... this matches ...

>> I actually wonder if that's even correct right now, i.e. independent of your change.

... the remark here.

> Even without my changes, passthrough can work for PV dom0, not for PVH dom0.

In the common case. I fear no-one ever tried for a device with an IRQ that
has a source override specified in ACPI.

> According to the logic of hypercall PHYSDEVOP_map_pirq,
> if pirq is -1, it calls physdev_map_pirq-> allocate_and_map_gsi_pirq-> allocate_pirq -> get_free_pirq to get pirq.
> If pirq is set to positive before calling hypercall, it set pirq to its own value in allocate_pirq.

Which is what looks wrong to me. Question is what it was done this way in the
first place.

Jan
Anthony PERARD June 20, 2024, 2:38 p.m. UTC | #8
On Mon, Jun 17, 2024 at 05:00:34PM +0800, Jiqian Chen wrote:
> diff --git a/tools/include/xencall.h b/tools/include/xencall.h
> index fc95ed0fe58e..750aab070323 100644
> --- a/tools/include/xencall.h
> +++ b/tools/include/xencall.h
> @@ -113,6 +113,8 @@ int xencall5(xencall_handle *xcall, unsigned int op,
>               uint64_t arg1, uint64_t arg2, uint64_t arg3,
>               uint64_t arg4, uint64_t arg5);
>  
> +int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf);

I don't think that an appropriate library for this new feature.
libxencall is a generic lib to make hypercall.

>  /* Variant(s) of the above, as needed, returning "long" instead of "int". */
>  long xencall2L(xencall_handle *xcall, unsigned int op,
>                 uint64_t arg1, uint64_t arg2);
> diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
> index 9ceca0cffc2f..a0381f74d24b 100644
> --- a/tools/include/xenctrl.h
> +++ b/tools/include/xenctrl.h
> @@ -1641,6 +1641,8 @@ int xc_physdev_unmap_pirq(xc_interface *xch,
>                            uint32_t domid,
>                            int pirq);
>  
> +int xc_physdev_gsi_from_dev(xc_interface *xch, uint32_t sbdf);
> +
>  /*
>   *  LOGGING AND ERROR REPORTING
>   */
> diff --git a/tools/libs/call/core.c b/tools/libs/call/core.c
> index 02c4f8e1aefa..6dae50c9a6ba 100644
> --- a/tools/libs/call/core.c
> +++ b/tools/libs/call/core.c
> @@ -173,6 +173,11 @@ int xencall5(xencall_handle *xcall, unsigned int op,
>      return osdep_hypercall(xcall, &call);
>  }
>  
> +int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf)
> +{
> +    return osdep_oscall(xcall, sbdf);
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libs/call/libxencall.map b/tools/libs/call/libxencall.map
> index d18a3174e9dc..b92a0b5dc12c 100644
> --- a/tools/libs/call/libxencall.map
> +++ b/tools/libs/call/libxencall.map
> @@ -10,6 +10,8 @@ VERS_1.0 {
>  		xencall4;
>  		xencall5;
>  
> +		xen_oscall_gsi_from_dev;

FYI, never change already released version of a library, this would add
a new function to libxencall.1.0. Instead, when adding a new function
to a library that is supposed to be stable (they have a *.map file in
xen case), add it to a new section, that woud be VERS_1.4 in this case.
But libxencall isn't approriate for this new function, so just for
future reference.

>  		xencall_alloc_buffer;
>  		xencall_free_buffer;
>  		xencall_alloc_buffer_pages;
> diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
> index 6d588e6bea8f..92c740e176f2 100644
> --- a/tools/libs/call/linux.c
> +++ b/tools/libs/call/linux.c
> @@ -85,6 +85,21 @@ long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall)
>      return ioctl(xcall->fd, IOCTL_PRIVCMD_HYPERCALL, hypercall);
>  }
>  
> +int osdep_oscall(xencall_handle *xcall, unsigned int sbdf)
> +{
> +    privcmd_gsi_from_dev_t dev_gsi = {
> +        .sbdf = sbdf,
> +        .gsi = -1,
> +    };
> +
> +    if (ioctl(xcall->fd, IOCTL_PRIVCMD_GSI_FROM_DEV, &dev_gsi)) {

Looks like libxencall is only for hypercall, and so I don't think
it's the right place to introducing another ioctl() call.

> +        PERROR("failed to get gsi from dev");
> +        return -1;
> +    }
> +
> +    return dev_gsi.gsi;
> +}
> +
>  static void *alloc_pages_bufdev(xencall_handle *xcall, size_t npages)
>  {
>      void *p;
> diff --git a/tools/libs/call/private.h b/tools/libs/call/private.h
> index 9c3aa432efe2..cd6eb5a3e66f 100644
> --- a/tools/libs/call/private.h
> +++ b/tools/libs/call/private.h
> @@ -57,6 +57,15 @@ int osdep_xencall_close(xencall_handle *xcall);
>  
>  long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall);
>  
> +#if defined(__linux__)
> +int osdep_oscall(xencall_handle *xcall, unsigned int sbdf);
> +#else
> +static inline int osdep_oscall(xencall_handle *xcall, unsigned int sbdf)
> +{
> +    return -1;
> +}
> +#endif
> +
>  void *osdep_alloc_pages(xencall_handle *xcall, size_t nr_pages);
>  void osdep_free_pages(xencall_handle *xcall, void *p, size_t nr_pages);
>  
> diff --git a/tools/libs/ctrl/xc_physdev.c b/tools/libs/ctrl/xc_physdev.c
> index 460a8e779ce8..c1458f3a38b5 100644
> --- a/tools/libs/ctrl/xc_physdev.c
> +++ b/tools/libs/ctrl/xc_physdev.c
> @@ -111,3 +111,7 @@ int xc_physdev_unmap_pirq(xc_interface *xch,
>      return rc;
>  }
>  
> +int xc_physdev_gsi_from_dev(xc_interface *xch, uint32_t sbdf)
> +{

I'm not sure if this is the best place for this new function, but I
can't find another one, so that will do.

> +    return xen_oscall_gsi_from_dev(xch->xcall, sbdf);
> +}
> diff --git a/tools/libs/light/Makefile b/tools/libs/light/Makefile
> index 37e4d1670986..6b616d5ee9b6 100644
> --- a/tools/libs/light/Makefile
> +++ b/tools/libs/light/Makefile
> @@ -40,7 +40,7 @@ OBJS-$(CONFIG_X86) += $(ACPI_OBJS)
>  
>  CFLAGS += -Wno-format-zero-length -Wmissing-declarations -Wformat-nonliteral
>  
> -CFLAGS-$(CONFIG_X86) += -DCONFIG_PCI_SUPP_LEGACY_IRQ
> +CFLAGS-$(CONFIG_X86) += -DCONFIG_PCI_SUPP_LEGACY_IRQ -DCONFIG_X86
>  
>  OBJS-$(CONFIG_X86) += libxl_cpuid.o
>  OBJS-$(CONFIG_X86) += libxl_x86.o
> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> index 96cb4da0794e..376f91759ac6 100644
> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>  #endif
>  }
>  
> +#define PCI_DEVID(bus, devfn)\
> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
> +
> +#define PCI_SBDF(seg, bus, devfn) \
> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
> +
>  static void pci_add_dm_done(libxl__egc *egc,
>                              pci_add_state *pas,
>                              int rc)
> @@ -1418,6 +1424,10 @@ static void pci_add_dm_done(libxl__egc *egc,
>      unsigned long long start, end, flags, size;
>      int irq, i;
>      int r;
> +#ifdef CONFIG_X86
> +    int gsi;
> +    uint32_t sbdf;
> +#endif
>      uint32_t flag = XEN_DOMCTL_DEV_RDM_RELAXED;
>      uint32_t domainid = domid;
>      bool isstubdom = libxl_is_stubdom(ctx, domid, &domainid);
> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>          goto out_no_irq;
>      }
>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> +#ifdef CONFIG_X86

Could you avoid these #ifdef, and move the new arch specific code (and
maybe existing code) into libxl_x86.c ? There's already examples of arch
specific code.

> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
> +                        (PCI_DEVFN(pci->dev, pci->func)));
> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
> +        /*
> +         * Old kernel version may not support this function,
> +         * so if fail, keep using irq; if success, use gsi
> +         */
> +        if (gsi > 0) {
> +            irq = gsi;
> +        }
> +#endif
>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>          if (r < 0) {
>              LOGED(ERROR, domainid, "xc_physdev_map_pirq irq=%d (error=%d)",
> @@ -2172,6 +2194,10 @@ static void pci_remove_detached(libxl__egc *egc,
>      int  irq = 0, i, stubdomid = 0;
>      const char *sysfs_path;
>      FILE *f;
> +#ifdef CONFIG_X86
> +    int gsi;
> +    uint32_t sbdf;
> +#endif
>      uint32_t domainid = prs->domid;
>      bool isstubdom;
>  
> @@ -2239,6 +2265,18 @@ skip_bar:
>      }
>  
>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
> +#ifdef CONFIG_X86
> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
> +                        (PCI_DEVFN(pci->dev, pci->func)));
> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
> +        /*
> +         * Old kernel version may not support this function,
> +         * so if fail, keep using irq; if success, use gsi
> +         */
> +        if (gsi > 0) {
> +            irq = gsi;
> +        }
> +#endif
>          rc = xc_physdev_unmap_pirq(ctx->xch, domid, irq);
>          if (rc < 0) {
>              /*

Thanks,
Chen, Jiqian June 21, 2024, 8:15 a.m. UTC | #9
On 2024/6/20 18:37, Jan Beulich wrote:
> On 20.06.2024 12:23, Chen, Jiqian wrote:
>> On 2024/6/20 15:43, Jan Beulich wrote:
>>> On 20.06.2024 09:03, Chen, Jiqian wrote:
>>>> On 2024/6/18 17:13, Jan Beulich wrote:
>>>>> On 18.06.2024 10:10, Chen, Jiqian wrote:
>>>>>> On 2024/6/17 23:10, Jan Beulich wrote:
>>>>>>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>>>>>>> --- a/tools/libs/light/libxl_pci.c
>>>>>>>> +++ b/tools/libs/light/libxl_pci.c
>>>>>>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>>>>>>  #endif
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +#define PCI_DEVID(bus, devfn)\
>>>>>>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>>>>>>> +
>>>>>>>> +#define PCI_SBDF(seg, bus, devfn) \
>>>>>>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>>>>>>
>>>>>>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>>>>>>> sake all excess parentheses be dropped from these.
>>>>>> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
>>>>>> It seems other files also do this. See tools/libs/light/libxl_internal.h
>>>>>
>>>>> As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
>>>>> has its own CODING_STYLE, I can't spot anything towards excessive use of
>>>>> parentheses there.
>>>> So, which parentheses do you think are excessive use?
>>>
>>> #define PCI_DEVID(bus, devfn)\
>>>             (((uint16_t)(bus) << 8) | ((devfn) & 0xff))
>>>
>>> #define PCI_SBDF(seg, bus, devfn) \
>>>             (((uint32_t)(seg) << 16) | PCI_DEVID(bus, devfn))
>> Thanks, will change in next version.
>>
>>>
>>>>>>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>>>>>          goto out_no_irq;
>>>>>>>>      }
>>>>>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>>>>>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>>>>>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>>>>>>> +        /*
>>>>>>>> +         * Old kernel version may not support this function,
>>>>>>>
>>>>>>> Just kernel?
>>>>>> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
>>>>>
>>>>> Okay, and when the kernel supports it but the underlying hypervisor doesn't
>>>>> support what the kernel wants to use in order to fulfill the request, all
>>>> I don't know what things you mentioned hypervisor doesn't support are,
>>>> because xc_physdev_gsi_from_dev is to get the gsi of pcidev through sbdf information,
>>>> that relationship can be got only in dom0 instead of Xen hypervisor.
>>>>
>>>>> is fine? (See also below for what may be needed in the hypervisor, even if
>>>> You mean xc_physdev_map_pirq needs gsi?
>>>
>>> I'd put it slightly differently: You arrange for that function to now take a
>>> GSI when the caller is PVH. But yes, the function, when used with
>>> MAP_PIRQ_TYPE_GSI, clearly expects a GSI as input (see also below).
>>>
>>>>> this IOCTL would be satisfied by the kernel without needing to interact with
>>>>> the hypervisor.)
>>>>>
>>>>>>>> +         * so if fail, keep using irq; if success, use gsi
>>>>>>>> +         */
>>>>>>>> +        if (gsi > 0) {
>>>>>>>> +            irq = gsi;
>>>>>>>
>>>>>>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>>>>>>> that IRQs and GSIs use two distinct numbering spaces.
>>>>>>>
>>>>>>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>>>>>>> the assumption that it'll fail. What if we decide to make the functionality
>>>>>>> available there, too (if only for informational purposes, or for
>>>>>>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>>>>>>> you'd call ...
>>>>>>>
>>>>>>>> +        }
>>>>>>>> +#endif
>>>>>>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>>>>>>
>>>>>>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>>>>>>> you strictly want to avoid the call on PV Dom0.
>>>>>>>
>>>>>>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>>>>>>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>>>>>>> incoming GSI would need translating to IRQ somewhere? I can once again only
>>>>>>> assume all your testing was done with IRQs whose numbers happened to match
>>>>>>> their GSI numbers. (The difference, imo, would also need calling out in the
>>>>>>> public header, where the respective interface struct(s) is/are defined.)
>>>>>> I feel like you missed out on many of the previous discussions.
>>>>>> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
>>>>>> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
>>>>>> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
>>>>>> But not when using PVH, so passthrough failed.
>>>>>> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
>>>>>> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
>>>>>> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 
>>>>>
>>>>> I understand all of this, to a (I think) sufficient degree at least. Yet what
>>>>> you're effectively proposing (without explicitly saying so) is that e.g.
>>>>> struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
>>>>> number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
>>>>> to make (simply because the caller may have no way to express things in pIRQ
>>>>> terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
>>>>> would be necessary. In fact that field is presently marked as "IN or OUT";
>>>>> when re-purposed to take a GSI on input, it may end up being necessary to pass
>>>>> back the pIRQ (in the subject domain's numbering space). Or alternatively it
>>>>> may be necessary to add yet another sub-function so the GSI can be translated
>>>>> to the corresponding pIRQ for the domain that's going to use the IRQ, for that
>>>>> then to be passed into PHYSDEVOP_map_pirq.
>>>> If I understood correctly, your concerns about this patch are two:
>>>> First, when dom0 is PV, I should not use xc_physdev_gsi_from_dev to get gsi to do xc_physdev_map_pirq, I should keep the original code that use irq.
>>>
>>> Yes.
>> OK, I can change to do this.
>> But I still have a concern:
>> Although without my changes, passthrough can work now when dom0 is PV.
>> And you also agree that: for xc_physdev_map_pirq, when use with MAP_PIRQ_TYPE_GSI, it expects a GSI as input.
>> Isn't it a wrong for PV dom0 to pass irq in? Why don't we use gsi as it should be used, since we have a function to get gsi now?
> 
> Indeed this and ...
> 
>>>> Second, when dom0 is PVH, I get the gsi, but I should not pass gsi as the fourth parameter of xc_physdev_map_pirq, I should create a new local parameter pirq=-1, and pass it in.
>>>
>>> I think so, yes. You also may need to record the output value, so you can later
>>> use it for unmap. xc_physdev_map_pirq() may also need adjusting, as right now
>>> it wouldn't put a negative incoming *pirq into the .pirq field. 
>> xc_physdev_map_pirq's logic is if we pass a negative in, it sets *pirq to index(gsi).
>> Is its logic right? If not how do we change it?
> 
> ... this matches ...
> 
>>> I actually wonder if that's even correct right now, i.e. independent of your change.
> 
> ... the remark here.
So, what should I do next step?
If assume the logic of function xc_physdev_map_pirq and PHYSDEVOP_map_pirq is right,
I think what I did now is right, both PV and PVH dom0 should pass gsi into xc_physdev_map_pirq.
By the way, I found xc_physdev_map_pirq didn't support negative pirq is since your commit 934a5253d932b6f67fe40fc48975a2b0117e4cce, do you remember why?

> 
>> Even without my changes, passthrough can work for PV dom0, not for PVH dom0.
> 
> In the common case. I fear no-one ever tried for a device with an IRQ that
> has a source override specified in ACPI.
> 
>> According to the logic of hypercall PHYSDEVOP_map_pirq,
>> if pirq is -1, it calls physdev_map_pirq-> allocate_and_map_gsi_pirq-> allocate_pirq -> get_free_pirq to get pirq.
>> If pirq is set to positive before calling hypercall, it set pirq to its own value in allocate_pirq.
> 
> Which is what looks wrong to me. Question is what it was done this way in the
> first place.
> 
> Jan
Chen, Jiqian June 21, 2024, 8:34 a.m. UTC | #10
On 2024/6/20 22:38, Anthony PERARD wrote:
> On Mon, Jun 17, 2024 at 05:00:34PM +0800, Jiqian Chen wrote:
>> diff --git a/tools/include/xencall.h b/tools/include/xencall.h
>> index fc95ed0fe58e..750aab070323 100644
>> --- a/tools/include/xencall.h
>> +++ b/tools/include/xencall.h
>> @@ -113,6 +113,8 @@ int xencall5(xencall_handle *xcall, unsigned int op,
>>               uint64_t arg1, uint64_t arg2, uint64_t arg3,
>>               uint64_t arg4, uint64_t arg5);
>>  
>> +int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf);
> 
> I don't think that an appropriate library for this new feature.
> libxencall is a generic lib to make hypercall.
Do you have a suggested place to put this new function?
This new function is to get gsi of a pci device, and only depend on the dom0 kernel, doesn't need to interact with hypervisor.

> 
>>  /* Variant(s) of the above, as needed, returning "long" instead of "int". */
>>  long xencall2L(xencall_handle *xcall, unsigned int op,
>>                 uint64_t arg1, uint64_t arg2);
>> diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
>> index 9ceca0cffc2f..a0381f74d24b 100644
>> --- a/tools/include/xenctrl.h
>> +++ b/tools/include/xenctrl.h
>> @@ -1641,6 +1641,8 @@ int xc_physdev_unmap_pirq(xc_interface *xch,
>>                            uint32_t domid,
>>                            int pirq);
>>  
>> +int xc_physdev_gsi_from_dev(xc_interface *xch, uint32_t sbdf);
>> +
>>  /*
>>   *  LOGGING AND ERROR REPORTING
>>   */
>> diff --git a/tools/libs/call/core.c b/tools/libs/call/core.c
>> index 02c4f8e1aefa..6dae50c9a6ba 100644
>> --- a/tools/libs/call/core.c
>> +++ b/tools/libs/call/core.c
>> @@ -173,6 +173,11 @@ int xencall5(xencall_handle *xcall, unsigned int op,
>>      return osdep_hypercall(xcall, &call);
>>  }
>>  
>> +int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf)
>> +{
>> +    return osdep_oscall(xcall, sbdf);
>> +}
>> +
>>  /*
>>   * Local variables:
>>   * mode: C
>> diff --git a/tools/libs/call/libxencall.map b/tools/libs/call/libxencall.map
>> index d18a3174e9dc..b92a0b5dc12c 100644
>> --- a/tools/libs/call/libxencall.map
>> +++ b/tools/libs/call/libxencall.map
>> @@ -10,6 +10,8 @@ VERS_1.0 {
>>  		xencall4;
>>  		xencall5;
>>  
>> +		xen_oscall_gsi_from_dev;
> 
> FYI, never change already released version of a library, this would add
> a new function to libxencall.1.0. Instead, when adding a new function
> to a library that is supposed to be stable (they have a *.map file in
> xen case), add it to a new section, that woud be VERS_1.4 in this case.
> But libxencall isn't approriate for this new function, so just for
> future reference.
> 
>>  		xencall_alloc_buffer;
>>  		xencall_free_buffer;
>>  		xencall_alloc_buffer_pages;
>> diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
>> index 6d588e6bea8f..92c740e176f2 100644
>> --- a/tools/libs/call/linux.c
>> +++ b/tools/libs/call/linux.c
>> @@ -85,6 +85,21 @@ long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall)
>>      return ioctl(xcall->fd, IOCTL_PRIVCMD_HYPERCALL, hypercall);
>>  }
>>  
>> +int osdep_oscall(xencall_handle *xcall, unsigned int sbdf)
>> +{
>> +    privcmd_gsi_from_dev_t dev_gsi = {
>> +        .sbdf = sbdf,
>> +        .gsi = -1,
>> +    };
>> +
>> +    if (ioctl(xcall->fd, IOCTL_PRIVCMD_GSI_FROM_DEV, &dev_gsi)) {
> 
> Looks like libxencall is only for hypercall, and so I don't think
> it's the right place to introducing another ioctl() call.
It seems IOCTL_PRIVCMD_HYPERCALL is for hypercall.
What I do here is to introduce new call into privcmd fd.
Maybe I can open "/dev/xen/privcmd" directly, so that I don't have to add the *_oscal function.

> 
>> +        PERROR("failed to get gsi from dev");
>> +        return -1;
>> +    }
>> +
>> +    return dev_gsi.gsi;
>> +}
>> +
>>  static void *alloc_pages_bufdev(xencall_handle *xcall, size_t npages)
>>  {
>>      void *p;
>> diff --git a/tools/libs/call/private.h b/tools/libs/call/private.h
>> index 9c3aa432efe2..cd6eb5a3e66f 100644
>> --- a/tools/libs/call/private.h
>> +++ b/tools/libs/call/private.h
>> @@ -57,6 +57,15 @@ int osdep_xencall_close(xencall_handle *xcall);
>>  
>>  long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall);
>>  
>> +#if defined(__linux__)
>> +int osdep_oscall(xencall_handle *xcall, unsigned int sbdf);
>> +#else
>> +static inline int osdep_oscall(xencall_handle *xcall, unsigned int sbdf)
>> +{
>> +    return -1;
>> +}
>> +#endif
>> +
>>  void *osdep_alloc_pages(xencall_handle *xcall, size_t nr_pages);
>>  void osdep_free_pages(xencall_handle *xcall, void *p, size_t nr_pages);
>>  
>> diff --git a/tools/libs/ctrl/xc_physdev.c b/tools/libs/ctrl/xc_physdev.c
>> index 460a8e779ce8..c1458f3a38b5 100644
>> --- a/tools/libs/ctrl/xc_physdev.c
>> +++ b/tools/libs/ctrl/xc_physdev.c
>> @@ -111,3 +111,7 @@ int xc_physdev_unmap_pirq(xc_interface *xch,
>>      return rc;
>>  }
>>  
>> +int xc_physdev_gsi_from_dev(xc_interface *xch, uint32_t sbdf)
>> +{
> 
> I'm not sure if this is the best place for this new function, but I
> can't find another one, so that will do.
Thanks.

> 
>> +    return xen_oscall_gsi_from_dev(xch->xcall, sbdf);
>> +}
>> diff --git a/tools/libs/light/Makefile b/tools/libs/light/Makefile
>> index 37e4d1670986..6b616d5ee9b6 100644
>> --- a/tools/libs/light/Makefile
>> +++ b/tools/libs/light/Makefile
>> @@ -40,7 +40,7 @@ OBJS-$(CONFIG_X86) += $(ACPI_OBJS)
>>  
>>  CFLAGS += -Wno-format-zero-length -Wmissing-declarations -Wformat-nonliteral
>>  
>> -CFLAGS-$(CONFIG_X86) += -DCONFIG_PCI_SUPP_LEGACY_IRQ
>> +CFLAGS-$(CONFIG_X86) += -DCONFIG_PCI_SUPP_LEGACY_IRQ -DCONFIG_X86
>>  
>>  OBJS-$(CONFIG_X86) += libxl_cpuid.o
>>  OBJS-$(CONFIG_X86) += libxl_x86.o
>> diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
>> index 96cb4da0794e..376f91759ac6 100644
>> --- a/tools/libs/light/libxl_pci.c
>> +++ b/tools/libs/light/libxl_pci.c
>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>  #endif
>>  }
>>  
>> +#define PCI_DEVID(bus, devfn)\
>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>> +
>> +#define PCI_SBDF(seg, bus, devfn) \
>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>> +
>>  static void pci_add_dm_done(libxl__egc *egc,
>>                              pci_add_state *pas,
>>                              int rc)
>> @@ -1418,6 +1424,10 @@ static void pci_add_dm_done(libxl__egc *egc,
>>      unsigned long long start, end, flags, size;
>>      int irq, i;
>>      int r;
>> +#ifdef CONFIG_X86
>> +    int gsi;
>> +    uint32_t sbdf;
>> +#endif
>>      uint32_t flag = XEN_DOMCTL_DEV_RDM_RELAXED;
>>      uint32_t domainid = domid;
>>      bool isstubdom = libxl_is_stubdom(ctx, domid, &domainid);
>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>          goto out_no_irq;
>>      }
>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>> +#ifdef CONFIG_X86
> 
> Could you avoid these #ifdef, and move the new arch specific code (and
> maybe existing code) into libxl_x86.c ? There's already examples of arch
> specific code.
OK, will do in next version.

> 
>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>> +        /*
>> +         * Old kernel version may not support this function,
>> +         * so if fail, keep using irq; if success, use gsi
>> +         */
>> +        if (gsi > 0) {
>> +            irq = gsi;
>> +        }
>> +#endif
>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>          if (r < 0) {
>>              LOGED(ERROR, domainid, "xc_physdev_map_pirq irq=%d (error=%d)",
>> @@ -2172,6 +2194,10 @@ static void pci_remove_detached(libxl__egc *egc,
>>      int  irq = 0, i, stubdomid = 0;
>>      const char *sysfs_path;
>>      FILE *f;
>> +#ifdef CONFIG_X86
>> +    int gsi;
>> +    uint32_t sbdf;
>> +#endif
>>      uint32_t domainid = prs->domid;
>>      bool isstubdom;
>>  
>> @@ -2239,6 +2265,18 @@ skip_bar:
>>      }
>>  
>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>> +#ifdef CONFIG_X86
>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>> +        /*
>> +         * Old kernel version may not support this function,
>> +         * so if fail, keep using irq; if success, use gsi
>> +         */
>> +        if (gsi > 0) {
>> +            irq = gsi;
>> +        }
>> +#endif
>>          rc = xc_physdev_unmap_pirq(ctx->xch, domid, irq);
>>          if (rc < 0) {
>>              /*
> 
> Thanks,
>
Jan Beulich June 24, 2024, 8:13 a.m. UTC | #11
On 21.06.2024 10:15, Chen, Jiqian wrote:
> On 2024/6/20 18:37, Jan Beulich wrote:
>> On 20.06.2024 12:23, Chen, Jiqian wrote:
>>> On 2024/6/20 15:43, Jan Beulich wrote:
>>>> On 20.06.2024 09:03, Chen, Jiqian wrote:
>>>>> On 2024/6/18 17:13, Jan Beulich wrote:
>>>>>> On 18.06.2024 10:10, Chen, Jiqian wrote:
>>>>>>> On 2024/6/17 23:10, Jan Beulich wrote:
>>>>>>>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>>>>>>>> --- a/tools/libs/light/libxl_pci.c
>>>>>>>>> +++ b/tools/libs/light/libxl_pci.c
>>>>>>>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>>>>>>>  #endif
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +#define PCI_DEVID(bus, devfn)\
>>>>>>>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>>>>>>>> +
>>>>>>>>> +#define PCI_SBDF(seg, bus, devfn) \
>>>>>>>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>>>>>>>
>>>>>>>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>>>>>>>> sake all excess parentheses be dropped from these.
>>>>>>> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
>>>>>>> It seems other files also do this. See tools/libs/light/libxl_internal.h
>>>>>>
>>>>>> As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
>>>>>> has its own CODING_STYLE, I can't spot anything towards excessive use of
>>>>>> parentheses there.
>>>>> So, which parentheses do you think are excessive use?
>>>>
>>>> #define PCI_DEVID(bus, devfn)\
>>>>             (((uint16_t)(bus) << 8) | ((devfn) & 0xff))
>>>>
>>>> #define PCI_SBDF(seg, bus, devfn) \
>>>>             (((uint32_t)(seg) << 16) | PCI_DEVID(bus, devfn))
>>> Thanks, will change in next version.
>>>
>>>>
>>>>>>>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>>>>>>          goto out_no_irq;
>>>>>>>>>      }
>>>>>>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>>>>>>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>>>>>>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>>>>>>>> +        /*
>>>>>>>>> +         * Old kernel version may not support this function,
>>>>>>>>
>>>>>>>> Just kernel?
>>>>>>> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
>>>>>>
>>>>>> Okay, and when the kernel supports it but the underlying hypervisor doesn't
>>>>>> support what the kernel wants to use in order to fulfill the request, all
>>>>> I don't know what things you mentioned hypervisor doesn't support are,
>>>>> because xc_physdev_gsi_from_dev is to get the gsi of pcidev through sbdf information,
>>>>> that relationship can be got only in dom0 instead of Xen hypervisor.
>>>>>
>>>>>> is fine? (See also below for what may be needed in the hypervisor, even if
>>>>> You mean xc_physdev_map_pirq needs gsi?
>>>>
>>>> I'd put it slightly differently: You arrange for that function to now take a
>>>> GSI when the caller is PVH. But yes, the function, when used with
>>>> MAP_PIRQ_TYPE_GSI, clearly expects a GSI as input (see also below).
>>>>
>>>>>> this IOCTL would be satisfied by the kernel without needing to interact with
>>>>>> the hypervisor.)
>>>>>>
>>>>>>>>> +         * so if fail, keep using irq; if success, use gsi
>>>>>>>>> +         */
>>>>>>>>> +        if (gsi > 0) {
>>>>>>>>> +            irq = gsi;
>>>>>>>>
>>>>>>>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>>>>>>>> that IRQs and GSIs use two distinct numbering spaces.
>>>>>>>>
>>>>>>>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>>>>>>>> the assumption that it'll fail. What if we decide to make the functionality
>>>>>>>> available there, too (if only for informational purposes, or for
>>>>>>>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>>>>>>>> you'd call ...
>>>>>>>>
>>>>>>>>> +        }
>>>>>>>>> +#endif
>>>>>>>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>>>>>>>
>>>>>>>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>>>>>>>> you strictly want to avoid the call on PV Dom0.
>>>>>>>>
>>>>>>>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>>>>>>>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>>>>>>>> incoming GSI would need translating to IRQ somewhere? I can once again only
>>>>>>>> assume all your testing was done with IRQs whose numbers happened to match
>>>>>>>> their GSI numbers. (The difference, imo, would also need calling out in the
>>>>>>>> public header, where the respective interface struct(s) is/are defined.)
>>>>>>> I feel like you missed out on many of the previous discussions.
>>>>>>> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
>>>>>>> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
>>>>>>> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
>>>>>>> But not when using PVH, so passthrough failed.
>>>>>>> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
>>>>>>> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
>>>>>>> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 
>>>>>>
>>>>>> I understand all of this, to a (I think) sufficient degree at least. Yet what
>>>>>> you're effectively proposing (without explicitly saying so) is that e.g.
>>>>>> struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
>>>>>> number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
>>>>>> to make (simply because the caller may have no way to express things in pIRQ
>>>>>> terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
>>>>>> would be necessary. In fact that field is presently marked as "IN or OUT";
>>>>>> when re-purposed to take a GSI on input, it may end up being necessary to pass
>>>>>> back the pIRQ (in the subject domain's numbering space). Or alternatively it
>>>>>> may be necessary to add yet another sub-function so the GSI can be translated
>>>>>> to the corresponding pIRQ for the domain that's going to use the IRQ, for that
>>>>>> then to be passed into PHYSDEVOP_map_pirq.
>>>>> If I understood correctly, your concerns about this patch are two:
>>>>> First, when dom0 is PV, I should not use xc_physdev_gsi_from_dev to get gsi to do xc_physdev_map_pirq, I should keep the original code that use irq.
>>>>
>>>> Yes.
>>> OK, I can change to do this.
>>> But I still have a concern:
>>> Although without my changes, passthrough can work now when dom0 is PV.
>>> And you also agree that: for xc_physdev_map_pirq, when use with MAP_PIRQ_TYPE_GSI, it expects a GSI as input.
>>> Isn't it a wrong for PV dom0 to pass irq in? Why don't we use gsi as it should be used, since we have a function to get gsi now?
>>
>> Indeed this and ...
>>
>>>>> Second, when dom0 is PVH, I get the gsi, but I should not pass gsi as the fourth parameter of xc_physdev_map_pirq, I should create a new local parameter pirq=-1, and pass it in.
>>>>
>>>> I think so, yes. You also may need to record the output value, so you can later
>>>> use it for unmap. xc_physdev_map_pirq() may also need adjusting, as right now
>>>> it wouldn't put a negative incoming *pirq into the .pirq field. 
>>> xc_physdev_map_pirq's logic is if we pass a negative in, it sets *pirq to index(gsi).
>>> Is its logic right? If not how do we change it?
>>
>> ... this matches ...
>>
>>>> I actually wonder if that's even correct right now, i.e. independent of your change.
>>
>> ... the remark here.
> So, what should I do next step?
> If assume the logic of function xc_physdev_map_pirq and PHYSDEVOP_map_pirq is right,
> I think what I did now is right, both PV and PVH dom0 should pass gsi into xc_physdev_map_pirq.

It may sound unfriendly, and I'm willing to accept other maintainers
disagreeing with me, but I think before we think of any extensions of
what we have here, we want to address any issues with what we have.
Since you're working in the area, and since getting the additions right
requires properly understanding how things work (and where things may
not work correctly right now), I view you as being in the best position
to see about doing that (imo) prereq step.

> By the way, I found xc_physdev_map_pirq didn't support negative pirq is since your commit 934a5253d932b6f67fe40fc48975a2b0117e4cce, do you remember why?

Counter question: What is a negative pIRQ?

Jan
Anthony PERARD June 24, 2024, 12:08 p.m. UTC | #12
On Fri, Jun 21, 2024 at 08:34:11AM +0000, Chen, Jiqian wrote:
> On 2024/6/20 22:38, Anthony PERARD wrote:
> > On Mon, Jun 17, 2024 at 05:00:34PM +0800, Jiqian Chen wrote:
> >> diff --git a/tools/include/xencall.h b/tools/include/xencall.h
> >> index fc95ed0fe58e..750aab070323 100644
> >> --- a/tools/include/xencall.h
> >> +++ b/tools/include/xencall.h
> >> @@ -113,6 +113,8 @@ int xencall5(xencall_handle *xcall, unsigned int op,
> >>               uint64_t arg1, uint64_t arg2, uint64_t arg3,
> >>               uint64_t arg4, uint64_t arg5);
> >>  
> >> +int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf);
> > 
> > I don't think that an appropriate library for this new feature.
> > libxencall is a generic lib to make hypercall.
> Do you have a suggested place to put this new function?

This is an internal function, which doesn't need to be exposed in a
public interface, but the implementation is used by another function. So
that can be moved to libxenctrl.
Chen, Jiqian June 25, 2024, 7:38 a.m. UTC | #13
On 2024/6/24 16:13, Jan Beulich wrote:
> On 21.06.2024 10:15, Chen, Jiqian wrote:
>> On 2024/6/20 18:37, Jan Beulich wrote:
>>> On 20.06.2024 12:23, Chen, Jiqian wrote:
>>>> On 2024/6/20 15:43, Jan Beulich wrote:
>>>>> On 20.06.2024 09:03, Chen, Jiqian wrote:
>>>>>> On 2024/6/18 17:13, Jan Beulich wrote:
>>>>>>> On 18.06.2024 10:10, Chen, Jiqian wrote:
>>>>>>>> On 2024/6/17 23:10, Jan Beulich wrote:
>>>>>>>>> On 17.06.2024 11:00, Jiqian Chen wrote:
>>>>>>>>>> --- a/tools/libs/light/libxl_pci.c
>>>>>>>>>> +++ b/tools/libs/light/libxl_pci.c
>>>>>>>>>> @@ -1406,6 +1406,12 @@ static bool pci_supp_legacy_irq(void)
>>>>>>>>>>  #endif
>>>>>>>>>>  }
>>>>>>>>>>  
>>>>>>>>>> +#define PCI_DEVID(bus, devfn)\
>>>>>>>>>> +            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
>>>>>>>>>> +
>>>>>>>>>> +#define PCI_SBDF(seg, bus, devfn) \
>>>>>>>>>> +            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
>>>>>>>>>
>>>>>>>>> I'm not a maintainer of this file; if I were, I'd ask that for readability's
>>>>>>>>> sake all excess parentheses be dropped from these.
>>>>>>>> Isn't it a coding requirement to enclose each element in parentheses in the macro definition?
>>>>>>>> It seems other files also do this. See tools/libs/light/libxl_internal.h
>>>>>>>
>>>>>>> As said, I'm not a maintainer of this code. Yet while I'm aware that libxl
>>>>>>> has its own CODING_STYLE, I can't spot anything towards excessive use of
>>>>>>> parentheses there.
>>>>>> So, which parentheses do you think are excessive use?
>>>>>
>>>>> #define PCI_DEVID(bus, devfn)\
>>>>>             (((uint16_t)(bus) << 8) | ((devfn) & 0xff))
>>>>>
>>>>> #define PCI_SBDF(seg, bus, devfn) \
>>>>>             (((uint32_t)(seg) << 16) | PCI_DEVID(bus, devfn))
>>>> Thanks, will change in next version.
>>>>
>>>>>
>>>>>>>>>> @@ -1486,6 +1496,18 @@ static void pci_add_dm_done(libxl__egc *egc,
>>>>>>>>>>          goto out_no_irq;
>>>>>>>>>>      }
>>>>>>>>>>      if ((fscanf(f, "%u", &irq) == 1) && irq) {
>>>>>>>>>> +#ifdef CONFIG_X86
>>>>>>>>>> +        sbdf = PCI_SBDF(pci->domain, pci->bus,
>>>>>>>>>> +                        (PCI_DEVFN(pci->dev, pci->func)));
>>>>>>>>>> +        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
>>>>>>>>>> +        /*
>>>>>>>>>> +         * Old kernel version may not support this function,
>>>>>>>>>
>>>>>>>>> Just kernel?
>>>>>>>> Yes, xc_physdev_gsi_from_dev depends on the function implemented on linux kernel side.
>>>>>>>
>>>>>>> Okay, and when the kernel supports it but the underlying hypervisor doesn't
>>>>>>> support what the kernel wants to use in order to fulfill the request, all
>>>>>> I don't know what things you mentioned hypervisor doesn't support are,
>>>>>> because xc_physdev_gsi_from_dev is to get the gsi of pcidev through sbdf information,
>>>>>> that relationship can be got only in dom0 instead of Xen hypervisor.
>>>>>>
>>>>>>> is fine? (See also below for what may be needed in the hypervisor, even if
>>>>>> You mean xc_physdev_map_pirq needs gsi?
>>>>>
>>>>> I'd put it slightly differently: You arrange for that function to now take a
>>>>> GSI when the caller is PVH. But yes, the function, when used with
>>>>> MAP_PIRQ_TYPE_GSI, clearly expects a GSI as input (see also below).
>>>>>
>>>>>>> this IOCTL would be satisfied by the kernel without needing to interact with
>>>>>>> the hypervisor.)
>>>>>>>
>>>>>>>>>> +         * so if fail, keep using irq; if success, use gsi
>>>>>>>>>> +         */
>>>>>>>>>> +        if (gsi > 0) {
>>>>>>>>>> +            irq = gsi;
>>>>>>>>>
>>>>>>>>> I'm still puzzled by this, when by now I think we've sufficiently clarified
>>>>>>>>> that IRQs and GSIs use two distinct numbering spaces.
>>>>>>>>>
>>>>>>>>> Also, as previously indicated, you call this for PV Dom0 as well. Aiui on
>>>>>>>>> the assumption that it'll fail. What if we decide to make the functionality
>>>>>>>>> available there, too (if only for informational purposes, or for
>>>>>>>>> consistency)? Suddenly you're fallback logic wouldn't work anymore, and
>>>>>>>>> you'd call ...
>>>>>>>>>
>>>>>>>>>> +        }
>>>>>>>>>> +#endif
>>>>>>>>>>          r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
>>>>>>>>>
>>>>>>>>> ... the function with a GSI when a pIRQ is meant. Imo, as suggested before,
>>>>>>>>> you strictly want to avoid the call on PV Dom0.
>>>>>>>>>
>>>>>>>>> Also for PVH Dom0: I don't think I've seen changes to the hypercall
>>>>>>>>> handling, yet. How can that be when GSI and IRQ aren't the same, and hence
>>>>>>>>> incoming GSI would need translating to IRQ somewhere? I can once again only
>>>>>>>>> assume all your testing was done with IRQs whose numbers happened to match
>>>>>>>>> their GSI numbers. (The difference, imo, would also need calling out in the
>>>>>>>>> public header, where the respective interface struct(s) is/are defined.)
>>>>>>>> I feel like you missed out on many of the previous discussions.
>>>>>>>> Without my changes, the original codes use irq (read from file /sys/bus/pci/devices/<sbdf>/irq) to do xc_physdev_map_pirq,
>>>>>>>> but xc_physdev_map_pirq require passing into gsi instead of irq, so we need to use gsi whether dom0 is PV or PVH, so for the original codes, they are wrong.
>>>>>>>> Just because by chance, the irq value in the Linux kernel of pv dom0 is equal to the gsi value, so there was no problem with the original pv passthrough.
>>>>>>>> But not when using PVH, so passthrough failed.
>>>>>>>> With my changes, I got gsi through function xc_physdev_gsi_from_dev firstly, and to be compatible with old kernel versions(if the ioctl
>>>>>>>> IOCTL_PRIVCMD_GSI_FROM_DEV is not implemented), I still need to use the irq number, so I need to check the result
>>>>>>>> of gsi, if gsi > 0 means IOCTL_PRIVCMD_GSI_FROM_DEV is implemented I can use the right one gsi, otherwise keep using wrong one irq. 
>>>>>>>
>>>>>>> I understand all of this, to a (I think) sufficient degree at least. Yet what
>>>>>>> you're effectively proposing (without explicitly saying so) is that e.g.
>>>>>>> struct physdev_map_pirq's pirq field suddenly may no longer hold a pIRQ
>>>>>>> number, but (when the caller is PVH) a GSI. This may be a necessary adjustment
>>>>>>> to make (simply because the caller may have no way to express things in pIRQ
>>>>>>> terms), but then suitable adjustments to the handling of PHYSDEVOP_map_pirq
>>>>>>> would be necessary. In fact that field is presently marked as "IN or OUT";
>>>>>>> when re-purposed to take a GSI on input, it may end up being necessary to pass
>>>>>>> back the pIRQ (in the subject domain's numbering space). Or alternatively it
>>>>>>> may be necessary to add yet another sub-function so the GSI can be translated
>>>>>>> to the corresponding pIRQ for the domain that's going to use the IRQ, for that
>>>>>>> then to be passed into PHYSDEVOP_map_pirq.
>>>>>> If I understood correctly, your concerns about this patch are two:
>>>>>> First, when dom0 is PV, I should not use xc_physdev_gsi_from_dev to get gsi to do xc_physdev_map_pirq, I should keep the original code that use irq.
>>>>>
>>>>> Yes.
>>>> OK, I can change to do this.
>>>> But I still have a concern:
>>>> Although without my changes, passthrough can work now when dom0 is PV.
>>>> And you also agree that: for xc_physdev_map_pirq, when use with MAP_PIRQ_TYPE_GSI, it expects a GSI as input.
>>>> Isn't it a wrong for PV dom0 to pass irq in? Why don't we use gsi as it should be used, since we have a function to get gsi now?
>>>
>>> Indeed this and ...
>>>
>>>>>> Second, when dom0 is PVH, I get the gsi, but I should not pass gsi as the fourth parameter of xc_physdev_map_pirq, I should create a new local parameter pirq=-1, and pass it in.
>>>>>
>>>>> I think so, yes. You also may need to record the output value, so you can later
>>>>> use it for unmap. xc_physdev_map_pirq() may also need adjusting, as right now
>>>>> it wouldn't put a negative incoming *pirq into the .pirq field. 
>>>> xc_physdev_map_pirq's logic is if we pass a negative in, it sets *pirq to index(gsi).
>>>> Is its logic right? If not how do we change it?
>>>
>>> ... this matches ...
>>>
>>>>> I actually wonder if that's even correct right now, i.e. independent of your change.
>>>
>>> ... the remark here.
>> So, what should I do next step?
>> If assume the logic of function xc_physdev_map_pirq and PHYSDEVOP_map_pirq is right,
>> I think what I did now is right, both PV and PVH dom0 should pass gsi into xc_physdev_map_pirq.
> 
> It may sound unfriendly, and I'm willing to accept other maintainers
> disagreeing with me, but I think before we think of any extensions of
> what we have here, we want to address any issues with what we have.
> Since you're working in the area, and since getting the additions right
> requires properly understanding how things work (and where things may
> not work correctly right now), I view you as being in the best position
> to see about doing that (imo) prereq step.
Make sense.
OK, I think there may be two issues in the callstack of function xc_physdev_map_pirq
(xc_physdev_map_pirq-> physdev_map_pirq-> allocate_and_map_gsi_pirq-> allocate_pirq).
For function xc_physdev_map_pirq, I think it should have two use cases:
First, when caller pass a pirq that ">=0", they want to map gsi to a specific pirq. In this case, when it reaches allocate_pirq,
if the gsi already has a mapped pirq(current_pirq), and current_pirq is not equal with specific pirq, it fails due to EEXIST, if current_pirq
is equal with specific pirq or the gsi doesn't have a current_pirq, allocate_pirq return the specific pirq. It successes.
Second, when caller pass a pirq that "<0", they want to get a free pirq for gsi. In this case, when it reaches allocate_pirq,
if the gsi already has a mapped pirq(current_pirq), we should return current_pirq, otherwise, it allocate a free pirq through get_free_pirqs
and then return the free pirq, it successes.
Roadmap is below:

													caller pass gsi and pirq into xc_physdev_map_pirq
															(xc_physdev_map_pirq)
															/		\
														*pirq>=0		*pirq<0 [issue 1]
													(map gsi to a specific pirq)	(map gsi to a free pirq)
				 _									/						\
				|							gsi already has a mapped pirq				gsi already has a mapped pirq
				|								(current_pirq)						(current_pirq)
				|							/			\					/		\
				|							yes			no					yes		no
allocate_pirq-------------------------|						/					\			/				\
				|					current_pirq == pirq			return specific pirq	return current_pirq [issue2]		return free pirq
				|				/				\
				|				yes				no
				|			/						\
				|_		return specific pirq					fail -EEXIST

But for current code,
[issue 1]: when *pirq<0, in xc_physdev_map_pirq, it re-sets *pirq to index by default. " map.pirq = *pirq < 0 ? index : *pirq; ", so that it can't allocate a free pirq for gsi(above second case)
[issue 2]: when *pirq<0 and gsi already has mapped pirq and has no need to allocate a new free pirq, in allocate_pirq, it returns *pirq(<0) directly, it means allocate_pirq fail. Here should return the already mapped pirq for that gsi and mean successful.

> 
>> By the way, I found xc_physdev_map_pirq didn't support negative pirq is since your commit 934a5253d932b6f67fe40fc48975a2b0117e4cce, do you remember why?
> 
> Counter question: What is a negative pIRQ?
I mean when caller pass a pirq that "<0" in.

> 
> Jan
diff mbox series

Patch

diff --git a/tools/include/xen-sys/Linux/privcmd.h b/tools/include/xen-sys/Linux/privcmd.h
index bc60e8fd55eb..977f1a058797 100644
--- a/tools/include/xen-sys/Linux/privcmd.h
+++ b/tools/include/xen-sys/Linux/privcmd.h
@@ -95,6 +95,11 @@  typedef struct privcmd_mmap_resource {
 	__u64 addr;
 } privcmd_mmap_resource_t;
 
+typedef struct privcmd_gsi_from_dev {
+	__u32 sbdf;
+	int gsi;
+} privcmd_gsi_from_dev_t;
+
 /*
  * @cmd: IOCTL_PRIVCMD_HYPERCALL
  * @arg: &privcmd_hypercall_t
@@ -114,6 +119,8 @@  typedef struct privcmd_mmap_resource {
 	_IOC(_IOC_NONE, 'P', 6, sizeof(domid_t))
 #define IOCTL_PRIVCMD_MMAP_RESOURCE				\
 	_IOC(_IOC_NONE, 'P', 7, sizeof(privcmd_mmap_resource_t))
+#define IOCTL_PRIVCMD_GSI_FROM_DEV				\
+	_IOC(_IOC_NONE, 'P', 10, sizeof(privcmd_gsi_from_dev_t))
 #define IOCTL_PRIVCMD_UNIMPLEMENTED				\
 	_IOC(_IOC_NONE, 'P', 0xFF, 0)
 
diff --git a/tools/include/xencall.h b/tools/include/xencall.h
index fc95ed0fe58e..750aab070323 100644
--- a/tools/include/xencall.h
+++ b/tools/include/xencall.h
@@ -113,6 +113,8 @@  int xencall5(xencall_handle *xcall, unsigned int op,
              uint64_t arg1, uint64_t arg2, uint64_t arg3,
              uint64_t arg4, uint64_t arg5);
 
+int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf);
+
 /* Variant(s) of the above, as needed, returning "long" instead of "int". */
 long xencall2L(xencall_handle *xcall, unsigned int op,
                uint64_t arg1, uint64_t arg2);
diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 9ceca0cffc2f..a0381f74d24b 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1641,6 +1641,8 @@  int xc_physdev_unmap_pirq(xc_interface *xch,
                           uint32_t domid,
                           int pirq);
 
+int xc_physdev_gsi_from_dev(xc_interface *xch, uint32_t sbdf);
+
 /*
  *  LOGGING AND ERROR REPORTING
  */
diff --git a/tools/libs/call/core.c b/tools/libs/call/core.c
index 02c4f8e1aefa..6dae50c9a6ba 100644
--- a/tools/libs/call/core.c
+++ b/tools/libs/call/core.c
@@ -173,6 +173,11 @@  int xencall5(xencall_handle *xcall, unsigned int op,
     return osdep_hypercall(xcall, &call);
 }
 
+int xen_oscall_gsi_from_dev(xencall_handle *xcall, unsigned int sbdf)
+{
+    return osdep_oscall(xcall, sbdf);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libs/call/libxencall.map b/tools/libs/call/libxencall.map
index d18a3174e9dc..b92a0b5dc12c 100644
--- a/tools/libs/call/libxencall.map
+++ b/tools/libs/call/libxencall.map
@@ -10,6 +10,8 @@  VERS_1.0 {
 		xencall4;
 		xencall5;
 
+		xen_oscall_gsi_from_dev;
+
 		xencall_alloc_buffer;
 		xencall_free_buffer;
 		xencall_alloc_buffer_pages;
diff --git a/tools/libs/call/linux.c b/tools/libs/call/linux.c
index 6d588e6bea8f..92c740e176f2 100644
--- a/tools/libs/call/linux.c
+++ b/tools/libs/call/linux.c
@@ -85,6 +85,21 @@  long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall)
     return ioctl(xcall->fd, IOCTL_PRIVCMD_HYPERCALL, hypercall);
 }
 
+int osdep_oscall(xencall_handle *xcall, unsigned int sbdf)
+{
+    privcmd_gsi_from_dev_t dev_gsi = {
+        .sbdf = sbdf,
+        .gsi = -1,
+    };
+
+    if (ioctl(xcall->fd, IOCTL_PRIVCMD_GSI_FROM_DEV, &dev_gsi)) {
+        PERROR("failed to get gsi from dev");
+        return -1;
+    }
+
+    return dev_gsi.gsi;
+}
+
 static void *alloc_pages_bufdev(xencall_handle *xcall, size_t npages)
 {
     void *p;
diff --git a/tools/libs/call/private.h b/tools/libs/call/private.h
index 9c3aa432efe2..cd6eb5a3e66f 100644
--- a/tools/libs/call/private.h
+++ b/tools/libs/call/private.h
@@ -57,6 +57,15 @@  int osdep_xencall_close(xencall_handle *xcall);
 
 long osdep_hypercall(xencall_handle *xcall, privcmd_hypercall_t *hypercall);
 
+#if defined(__linux__)
+int osdep_oscall(xencall_handle *xcall, unsigned int sbdf);
+#else
+static inline int osdep_oscall(xencall_handle *xcall, unsigned int sbdf)
+{
+    return -1;
+}
+#endif
+
 void *osdep_alloc_pages(xencall_handle *xcall, size_t nr_pages);
 void osdep_free_pages(xencall_handle *xcall, void *p, size_t nr_pages);
 
diff --git a/tools/libs/ctrl/xc_physdev.c b/tools/libs/ctrl/xc_physdev.c
index 460a8e779ce8..c1458f3a38b5 100644
--- a/tools/libs/ctrl/xc_physdev.c
+++ b/tools/libs/ctrl/xc_physdev.c
@@ -111,3 +111,7 @@  int xc_physdev_unmap_pirq(xc_interface *xch,
     return rc;
 }
 
+int xc_physdev_gsi_from_dev(xc_interface *xch, uint32_t sbdf)
+{
+    return xen_oscall_gsi_from_dev(xch->xcall, sbdf);
+}
diff --git a/tools/libs/light/Makefile b/tools/libs/light/Makefile
index 37e4d1670986..6b616d5ee9b6 100644
--- a/tools/libs/light/Makefile
+++ b/tools/libs/light/Makefile
@@ -40,7 +40,7 @@  OBJS-$(CONFIG_X86) += $(ACPI_OBJS)
 
 CFLAGS += -Wno-format-zero-length -Wmissing-declarations -Wformat-nonliteral
 
-CFLAGS-$(CONFIG_X86) += -DCONFIG_PCI_SUPP_LEGACY_IRQ
+CFLAGS-$(CONFIG_X86) += -DCONFIG_PCI_SUPP_LEGACY_IRQ -DCONFIG_X86
 
 OBJS-$(CONFIG_X86) += libxl_cpuid.o
 OBJS-$(CONFIG_X86) += libxl_x86.o
diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index 96cb4da0794e..376f91759ac6 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -1406,6 +1406,12 @@  static bool pci_supp_legacy_irq(void)
 #endif
 }
 
+#define PCI_DEVID(bus, devfn)\
+            ((((uint16_t)(bus)) << 8) | ((devfn) & 0xff))
+
+#define PCI_SBDF(seg, bus, devfn) \
+            ((((uint32_t)(seg)) << 16) | (PCI_DEVID(bus, devfn)))
+
 static void pci_add_dm_done(libxl__egc *egc,
                             pci_add_state *pas,
                             int rc)
@@ -1418,6 +1424,10 @@  static void pci_add_dm_done(libxl__egc *egc,
     unsigned long long start, end, flags, size;
     int irq, i;
     int r;
+#ifdef CONFIG_X86
+    int gsi;
+    uint32_t sbdf;
+#endif
     uint32_t flag = XEN_DOMCTL_DEV_RDM_RELAXED;
     uint32_t domainid = domid;
     bool isstubdom = libxl_is_stubdom(ctx, domid, &domainid);
@@ -1486,6 +1496,18 @@  static void pci_add_dm_done(libxl__egc *egc,
         goto out_no_irq;
     }
     if ((fscanf(f, "%u", &irq) == 1) && irq) {
+#ifdef CONFIG_X86
+        sbdf = PCI_SBDF(pci->domain, pci->bus,
+                        (PCI_DEVFN(pci->dev, pci->func)));
+        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
+        /*
+         * Old kernel version may not support this function,
+         * so if fail, keep using irq; if success, use gsi
+         */
+        if (gsi > 0) {
+            irq = gsi;
+        }
+#endif
         r = xc_physdev_map_pirq(ctx->xch, domid, irq, &irq);
         if (r < 0) {
             LOGED(ERROR, domainid, "xc_physdev_map_pirq irq=%d (error=%d)",
@@ -2172,6 +2194,10 @@  static void pci_remove_detached(libxl__egc *egc,
     int  irq = 0, i, stubdomid = 0;
     const char *sysfs_path;
     FILE *f;
+#ifdef CONFIG_X86
+    int gsi;
+    uint32_t sbdf;
+#endif
     uint32_t domainid = prs->domid;
     bool isstubdom;
 
@@ -2239,6 +2265,18 @@  skip_bar:
     }
 
     if ((fscanf(f, "%u", &irq) == 1) && irq) {
+#ifdef CONFIG_X86
+        sbdf = PCI_SBDF(pci->domain, pci->bus,
+                        (PCI_DEVFN(pci->dev, pci->func)));
+        gsi = xc_physdev_gsi_from_dev(ctx->xch, sbdf);
+        /*
+         * Old kernel version may not support this function,
+         * so if fail, keep using irq; if success, use gsi
+         */
+        if (gsi > 0) {
+            irq = gsi;
+        }
+#endif
         rc = xc_physdev_unmap_pirq(ctx->xch, domid, irq);
         if (rc < 0) {
             /*