diff mbox

[v5,08/11] vpci/bars: add handlers to map the BARs

Message ID 20170814142850.39133-9-roger.pau@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Roger Pau Monné Aug. 14, 2017, 2:28 p.m. UTC
Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to snoop BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 2 (response to
memory space accesses), and maps/unmaps the BARs of the device into
the guest p2m. A rangeset is used in order to figure out which memory
to map/unmap. This makes it easier to keep track of the possible
overlaps with other BARs, and will also simplify MSI-X support, where
certain regions of a BAR might be used for the MSI-X table or PBA.

The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v4:
 - Expand commit message to mention the reason behind the usage of
   rangesets.
 - Fix comment related to the inclusiveness of rangesets.
 - Fix off-by-one error in the calculation of the end of memory
   regions.
 - Store the state of the BAR (mapped/unmapped) in the vpci_bar
   enabled field, previously was only used by ROMs.
 - Fix double negation of return code.
 - Modify vpci_cmd_write so it has a single call to pci_conf_write16.
 - Print a warning when trying to write to the BAR with memory
   decoding enabled (and ignore the write).
 - Remove header_type local variable, it's used only once.
 - Move the read of the command register.
 - Restore previous command register value in the exit paths.
 - Only set address to INVALID_PADDR if the initial BAR value matches
    ~0 & PCI_BASE_ADDRESS_MEM_MASK.
 - Don't disable the enabled bit in the expansion ROM register, memory
   decoding is already disabled and takes precedence.
 - Don't use INVALID_PADDR, just set the initial BAR address to the
   value found in the hardware.
 - Introduce rom_enabled to store the status of the
   PCI_ROM_ADDRESS_ENABLE bit.
 - Reorder fields of the structure to prevent holes.

Changes since v3:
 - Propagate previous changes: drop xen_ prefix and use u8/u16/u32
   instead of the previous half_word/word/double_word.
 - Constify some of the paramerters.
 - s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
 - Simplify the number of fields stored for each BAR, a single address
   field is stored and contains the address of the BAR both on Xen and
   in the guest.
 - Allow the guest to move the BARs around in the physical memory map.
 - Add support for expansion ROM BARs.
 - Do not cache the value of the command register.
 - Remove a label used in vpci_cmd_write.
 - Fix the calculation of the sizing mask in vpci_bar_write.
 - Check the memory decode bit in order to decide if a BAR is
   positioned or not.
 - Disable memory decoding before sizing the BARs in Xen.
 - When mapping/unmapping BARs check if there's overlap between BARs,
   in order to avoid unmapping memory required by another BAR.
 - Introduce a macro to check whether a BAR is mappable or not.
 - Add a comment regarding the lack of support for SR-IOV.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Detect unset BARs and allow the hardware domain to position them.
---
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c | 453 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h    |  29 +++
 3 files changed, 483 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/header.c

Comments

Jan Beulich Sept. 7, 2017, 9:53 a.m. UTC | #1
>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> +static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> +                           bool map)
> +{
> +    struct rangeset *mem;
> +    struct map_data data = { .d = d, .map = map };
> +    int rc;
> +
> +    ASSERT(MAPPABLE_BAR(bar));
> +
> +    mem = vpci_get_bar_memory(d, bar);
> +    if ( IS_ERR(mem) )
> +        return PTR_ERR(mem);
> +
> +    rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
> +    rangeset_destroy(mem);
> +    if ( rc )
> +        return rc;
> +
> +    return 0;
> +}

Please make this simply "return rc".

> +static int vpci_modify_bars(const struct pci_dev *pdev, const bool map)

We don't normally constify non-pointing-to types, and even less so
in function parameters.

> +static uint32_t vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
> +                              const void *data)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +
> +    return pci_conf_read16(seg, bus, slot, func, reg);
> +}

I can see why you may want to have the slot and func local
variables, but at least seg and bus look pretty pointless to have
here.

> +static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t cmd, void *data)
> +{
> +    uint16_t current_cmd;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +
> +    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);

Would you mind making this the initializer of the variable?

> +    /*
> +     * Let the guest play with all the bits directly except for the
> +     * memory decoding one.
> +     */
> +    if ( (cmd ^ current_cmd) & PCI_COMMAND_MEMORY )
> +    {
> +        /* Memory space access change. */
> +        int rc = vpci_modify_bars(pdev, cmd & PCI_COMMAND_MEMORY);
> +
> +        if ( rc )
> +        {
> +            gdprintk(XENLOG_ERR,
> +                     "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
> +                     seg, bus, slot, func,
> +                     cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);

Perhaps gprintk(), and perhaps XENLOG_WARNING (depending on
the device it may not be that big of a problem)?

> +            return;

I think you should not bail here when it is an unmap that failed -
you'd better disable memory decode in that case.

> +static uint32_t vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
> +                              const void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    uint32_t val;
> +    bool hi = false;
> +
> +    ASSERT(bar->type == VPCI_BAR_MEM32 || bar->type == VPCI_BAR_MEM64_LO ||
> +           bar->type == VPCI_BAR_MEM64_HI);
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( bar->sizing )
> +        val = ~(bar->size - 1) >> (hi ? 32 : 0);

I'm not going to insist on it, but "-bar->size" is certainly easier to
read, and may also produce shorter code (unless the compiler
does the transformation itself anyway).

> +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    bool hi = false;
> +
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY )
> +    {
> +         gdprintk(XENLOG_WARNING,
> +                  "%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled\n",
> +                  seg, bus, slot, func);

I'd again suggest gprintk().

> +        return;
> +    }
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( !hi )
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;

This could be the else to the earlier if().

> +
> +    /*
> +     * The PCI Local Bus Specification suggests writing ~0 to both the high
> +     * and the low part of the BAR registers before attempting to read back
> +     * the size.
> +     *
> +     * However real device BARs registers (at least the ones I've tried)
> +     * will return the size of the BAR just by having written ~0 to one half
> +     * of it, independently of the value of the other half of the register.
> +     * Hence here Xen will switch to returning the size as soon as one half
> +     * of the BAR register has been written with ~0.
> +     */

I don't believe this is correct behavior (but I'd have to play with
some hardware to see whether I can confirm the behavior you
describe): How would you place a BAR at, say, 0x1ffffff0?

> +    if ( val == (hi ? 0xffffffff : (uint32_t)PCI_BASE_ADDRESS_MEM_MASK) )
> +    {
> +        bar->sizing = true;
> +        return;
> +    }
> +    bar->sizing = false;
> +
> +    /* Update the relevant part of the BAR address. */
> +    bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->addr |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    /* Make sure Xen writes back the same value for the BAR RO bits. */
> +    if ( !hi )
> +        val |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                               PCI_FUNC(pdev->devfn), reg) &
> +                               ~PCI_BASE_ADDRESS_MEM_MASK;

Why don't you break out the logic from vpci_bar_read() into a
helper and use it here too, saving the (slow) config space access.

> +static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
> +                           uint32_t val, void *data)
> +{
> +    struct vpci_bar *rom = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> +    uint32_t addr = val & PCI_ROM_ADDRESS_MASK;
> +
> +    if ( addr == (uint32_t)PCI_ROM_ADDRESS_MASK )
> +    {
> +        rom->sizing = true;
> +        return;
> +    }
> +    rom->sizing = false;
> +
> +    rom->addr = addr;
> +
> +    /* Check if ROM BAR should be mapped. */
> +    if ( (cmd & PCI_COMMAND_MEMORY) &&

Why don't you disallow writes here when memory decoding is enabled
and the ROM BAR is enabled the same way (including a log message)
you do in vpci_bar_write()?

> +static int vpci_init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint16_t cmd;
> +    uint64_t addr, size;
> +    unsigned int i, num_bars, rom_reg;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    int rc;
> +
> +    switch ( pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f )
> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = 6;
> +        rom_reg = PCI_ROM_ADDRESS;
> +        break;
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = 2;
> +        rom_reg = PCI_ROM_ADDRESS1;
> +        break;
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    /* Setup a handler for the command register. */
> +    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    /* Disable memory decoding before sizing. */
> +    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> +    if ( cmd & PCI_COMMAND_MEMORY )
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> +                         cmd & ~PCI_COMMAND_MEMORY);
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                                   &bars[i]);
> +            if ( rc )
> +            {
> +                pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +                return rc;
> +            }
> +
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM32;
> +
> +        /* Size the BAR and map it. */
> +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> +                              &addr, &size, false, false);
> +        if ( rc < 0 )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +
> +        if ( size == 0 )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        bars[i].addr = addr;
> +        bars[i].size = size;
> +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                               &bars[i]);
> +        if ( rc )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +    }
> +
> +    /* Check expansion ROM. */
> +    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size,
> +                          false, true);
> +    if ( rc < 0 )
> +    {
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +        return rc;
> +    }
> +
> +    if ( size )
> +    {
> +        struct vpci_bar *rom = &header->bars[num_bars];
> +
> +        rom->type = VPCI_BAR_ROM;
> +        rom->size = size;
> +        rom->addr = addr;
> +
> +        rc = vpci_add_register(pdev, vpci_rom_read, vpci_rom_write, rom_reg, 4,
> +                               rom);
> +        if ( rc )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +    }
> +
> +    if ( cmd & PCI_COMMAND_MEMORY )
> +    {
> +        rc = vpci_modify_bars(pdev, true);
> +        if ( rc )
> +        {
> +            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
> +            return rc;
> +        }
> +
> +        /* Enable memory decoding. */
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);

No need for two pci_conf_write16() invocations here - you can
simply arrange to ...

> +    }
> +
> +    return 0;
> +}

... "return rc" here.

> +REGISTER_VPCI_INIT(vpci_init_bars);

Considering that ultimately an error returned from this function will
lead to nothing but a log message, I wonder whether you wouldn't
better do best effort here, i.e. only record the first error, but
continue rather than bailing. Especially for the ROM it may well be
that it's not really needed for proper operation of the device.

Jan
Roger Pau Monné Sept. 12, 2017, 9:54 a.m. UTC | #2
On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> > +
> > +    /*
> > +     * The PCI Local Bus Specification suggests writing ~0 to both the high
> > +     * and the low part of the BAR registers before attempting to read back
> > +     * the size.
> > +     *
> > +     * However real device BARs registers (at least the ones I've tried)
> > +     * will return the size of the BAR just by having written ~0 to one half
> > +     * of it, independently of the value of the other half of the register.
> > +     * Hence here Xen will switch to returning the size as soon as one half
> > +     * of the BAR register has been written with ~0.
> > +     */
> 
> I don't believe this is correct behavior (but I'd have to play with
> some hardware to see whether I can confirm the behavior you
> describe): How would you place a BAR at, say, 0x1ffffff0?

I don't think it's 'correct' behavior either, but FreeBSD has been
sizing BARs like that, and nobody noticed any issues. I just fixed it
recently:

https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863

Roger.
Jan Beulich Sept. 12, 2017, 10:06 a.m. UTC | #3
>>> On 12.09.17 at 11:54, <roger.pau@citrix.com> wrote:
> On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
>> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> > +
>> > +    /*
>> > +     * The PCI Local Bus Specification suggests writing ~0 to both the high
>> > +     * and the low part of the BAR registers before attempting to read back
>> > +     * the size.
>> > +     *
>> > +     * However real device BARs registers (at least the ones I've tried)
>> > +     * will return the size of the BAR just by having written ~0 to one half
>> > +     * of it, independently of the value of the other half of the register.
>> > +     * Hence here Xen will switch to returning the size as soon as one half
>> > +     * of the BAR register has been written with ~0.
>> > +     */
>> 
>> I don't believe this is correct behavior (but I'd have to play with
>> some hardware to see whether I can confirm the behavior you
>> describe): How would you place a BAR at, say, 0x1ffffff0?
> 
> I don't think it's 'correct' behavior either, but FreeBSD has been
> sizing BARs like that, and nobody noticed any issues. I just fixed it
> recently:
> 
> https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863 

Oh, no, that old code was fine afaict. You have to view the two
halves of a 64-bit BAR as completely distinct registers, and
sizing of the full BAR can be done either by writing both, then
reading both, or reading/writing each half. The problem with the
code in the patch here is that you don't treat the two halves as
fully separate registers, implying that one half being written with
all ones _also_ makes the other half return the sizing value
instead of the last written address.

Jan
Roger Pau Monné Sept. 12, 2017, 11:48 a.m. UTC | #4
On Tue, Sep 12, 2017 at 04:06:13AM -0600, Jan Beulich wrote:
> >>> On 12.09.17 at 11:54, <roger.pau@citrix.com> wrote:
> > On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
> >> > +
> >> > +    /*
> >> > +     * The PCI Local Bus Specification suggests writing ~0 to both the high
> >> > +     * and the low part of the BAR registers before attempting to read back
> >> > +     * the size.
> >> > +     *
> >> > +     * However real device BARs registers (at least the ones I've tried)
> >> > +     * will return the size of the BAR just by having written ~0 to one half
> >> > +     * of it, independently of the value of the other half of the register.
> >> > +     * Hence here Xen will switch to returning the size as soon as one half
> >> > +     * of the BAR register has been written with ~0.
> >> > +     */
> >> 
> >> I don't believe this is correct behavior (but I'd have to play with
> >> some hardware to see whether I can confirm the behavior you
> >> describe): How would you place a BAR at, say, 0x1ffffff0?
> > 
> > I don't think it's 'correct' behavior either, but FreeBSD has been
> > sizing BARs like that, and nobody noticed any issues. I just fixed it
> > recently:
> > 
> > https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863 
> 
> Oh, no, that old code was fine afaict. You have to view the two
> halves of a 64-bit BAR as completely distinct registers, and
> sizing of the full BAR can be done either by writing both, then
> reading both, or reading/writing each half.

OK, the example in the specification seems to suggest that you should
first write to both registers, and then read back the values.

> The problem with the
> code in the patch here is that you don't treat the two halves as
> fully separate registers, implying that one half being written with
> all ones _also_ makes the other half return the sizing value
> instead of the last written address.

Right, then I think adding a sizing_hi/sizing_lo field is the right
answer.

Thanks, Roger.
Jan Beulich Sept. 12, 2017, 12:56 p.m. UTC | #5
>>> On 12.09.17 at 13:48, <roger.pau@citrix.com> wrote:
> On Tue, Sep 12, 2017 at 04:06:13AM -0600, Jan Beulich wrote:
>> >>> On 12.09.17 at 11:54, <roger.pau@citrix.com> wrote:
>> > On Thu, Sep 07, 2017 at 03:53:07AM -0600, Jan Beulich wrote:
>> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote:
>> >> > +
>> >> > +    /*
>> >> > +     * The PCI Local Bus Specification suggests writing ~0 to both the 
> high
>> >> > +     * and the low part of the BAR registers before attempting to read 
> back
>> >> > +     * the size.
>> >> > +     *
>> >> > +     * However real device BARs registers (at least the ones I've tried)
>> >> > +     * will return the size of the BAR just by having written ~0 to one 
> half
>> >> > +     * of it, independently of the value of the other half of the 
> register.
>> >> > +     * Hence here Xen will switch to returning the size as soon as one 
> half
>> >> > +     * of the BAR register has been written with ~0.
>> >> > +     */
>> >> 
>> >> I don't believe this is correct behavior (but I'd have to play with
>> >> some hardware to see whether I can confirm the behavior you
>> >> describe): How would you place a BAR at, say, 0x1ffffff0?
>> > 
>> > I don't think it's 'correct' behavior either, but FreeBSD has been
>> > sizing BARs like that, and nobody noticed any issues. I just fixed it
>> > recently:
>> > 
>> > https://svnweb.freebsd.org/base/head/sys/dev/pci/pci.c?r1=312250&r2=321863 
>> 
>> Oh, no, that old code was fine afaict. You have to view the two
>> halves of a 64-bit BAR as completely distinct registers, and
>> sizing of the full BAR can be done either by writing both, then
>> reading both, or reading/writing each half.
> 
> OK, the example in the specification seems to suggest that you should
> first write to both registers, and then read back the values.
> 
>> The problem with the
>> code in the patch here is that you don't treat the two halves as
>> fully separate registers, implying that one half being written with
>> all ones _also_ makes the other half return the sizing value
>> instead of the last written address.
> 
> Right, then I think adding a sizing_hi/sizing_lo field is the right
> answer.

That was my first thought too, but meanwhile I think these flags
are pointless and misleading, and hence should be dropped.
There's really nothing special about those sizing writes: They
simply write a special address, but to the handler this shouldn't
matter. All you need to make sure is that hardwired to zero bits
come back as zero on the following read(s), entirely independent
of what precise value was written (the r/o bits at the bottom of
the first half left aside here, of course).

Jan
diff mbox

Patch

diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 840a906470..241467212f 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@ 
-obj-y += vpci.o
+obj-y += vpci.o header.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
new file mode 100644
index 0000000000..9b44c8441a
--- /dev/null
+++ b/xen/drivers/vpci/header.c
@@ -0,0 +1,453 @@ 
+/*
+ * Generic functionality for handling accesses to the PCI header from the
+ * configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xen/p2m-common.h>
+
+#define MAPPABLE_BAR(x)                                                 \
+    ((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||   \
+     (x)->type == VPCI_BAR_ROM)
+
+static struct rangeset *vpci_get_bar_memory(const struct domain *d,
+                                            const struct vpci_bar *map)
+{
+    const struct pci_dev *pdev;
+    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
+    int rc;
+
+    if ( !mem )
+        return ERR_PTR(-ENOMEM);
+
+    /*
+     * Create a rangeset that represents the current BAR memory region
+     * and compare it against all the currently active BAR memory regions.
+     * If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
+     *
+     * NB: the rangeset uses inclusive frame numbers.
+     */
+    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
+                            PFN_DOWN(map->addr + map->size - 1));
+    if ( rc )
+    {
+        rangeset_destroy(mem);
+        return ERR_PTR(rc);
+    }
+
+    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
+    {
+        unsigned int i;
+
+        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        {
+            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
+            unsigned long start = PFN_DOWN(bar->addr);
+            unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+
+            if ( bar == map || !bar->enabled || !MAPPABLE_BAR(bar) ||
+                 !rangeset_overlaps_range(mem, start, end) )
+                continue;
+
+            rc = rangeset_remove_range(mem, start, end);
+            if ( rc )
+            {
+                rangeset_destroy(mem);
+                return ERR_PTR(rc);
+            }
+        }
+    }
+
+    return mem;
+}
+
+struct map_data {
+    struct domain *d;
+    bool map;
+};
+
+static int vpci_map_range(unsigned long s, unsigned long e, void *data)
+{
+    const struct map_data *map = data;
+
+    return modify_mmio(map->d, _gfn(s), _mfn(s), e - s + 1, map->map);
+}
+
+static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
+                           bool map)
+{
+    struct rangeset *mem;
+    struct map_data data = { .d = d, .map = map };
+    int rc;
+
+    ASSERT(MAPPABLE_BAR(bar));
+
+    mem = vpci_get_bar_memory(d, bar);
+    if ( IS_ERR(mem) )
+        return PTR_ERR(mem);
+
+    rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
+    rangeset_destroy(mem);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+static int vpci_modify_bars(const struct pci_dev *pdev, const bool map)
+{
+    struct vpci_header *header = &pdev->vpci->header;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+        int rc;
+
+        if ( !MAPPABLE_BAR(bar) ||
+             (bar->type == VPCI_BAR_ROM && !bar->rom_enabled) )
+            continue;
+
+        rc = vpci_modify_bar(pdev->domain, bar, map);
+        if ( rc )
+            return rc;
+
+        bar->enabled = map;
+    }
+
+    return 0;
+}
+
+static uint32_t vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
+                              const void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    return pci_conf_read16(seg, bus, slot, func, reg);
+}
+
+static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
+                           uint32_t cmd, void *data)
+{
+    uint16_t current_cmd;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
+
+    /*
+     * Let the guest play with all the bits directly except for the
+     * memory decoding one.
+     */
+    if ( (cmd ^ current_cmd) & PCI_COMMAND_MEMORY )
+    {
+        /* Memory space access change. */
+        int rc = vpci_modify_bars(pdev, cmd & PCI_COMMAND_MEMORY);
+
+        if ( rc )
+        {
+            gdprintk(XENLOG_ERR,
+                     "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
+                     seg, bus, slot, func,
+                     cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
+            return;
+        }
+    }
+
+    pci_conf_write16(seg, bus, slot, func, reg, cmd);
+}
+
+static uint32_t vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
+                              const void *data)
+{
+    const struct vpci_bar *bar = data;
+    uint32_t val;
+    bool hi = false;
+
+    ASSERT(bar->type == VPCI_BAR_MEM32 || bar->type == VPCI_BAR_MEM64_LO ||
+           bar->type == VPCI_BAR_MEM64_HI);
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( bar->sizing )
+        val = ~(bar->size - 1) >> (hi ? 32 : 0);
+    else
+        val = bar->addr >> (hi ? 32 : 0);
+
+    if ( !hi )
+    {
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    return val;
+}
+
+static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
+                           uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    bool hi = false;
+
+    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY )
+    {
+         gdprintk(XENLOG_WARNING,
+                  "%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled\n",
+                  seg, bus, slot, func);
+        return;
+    }
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( !hi )
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * The PCI Local Bus Specification suggests writing ~0 to both the high
+     * and the low part of the BAR registers before attempting to read back
+     * the size.
+     *
+     * However real device BARs registers (at least the ones I've tried)
+     * will return the size of the BAR just by having written ~0 to one half
+     * of it, independently of the value of the other half of the register.
+     * Hence here Xen will switch to returning the size as soon as one half
+     * of the BAR register has been written with ~0.
+     */
+    if ( val == (hi ? 0xffffffff : (uint32_t)PCI_BASE_ADDRESS_MEM_MASK) )
+    {
+        bar->sizing = true;
+        return;
+    }
+    bar->sizing = false;
+
+    /* Update the relevant part of the BAR address. */
+    bar->addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    /* Make sure Xen writes back the same value for the BAR RO bits. */
+    if ( !hi )
+        val |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                               PCI_FUNC(pdev->devfn), reg) &
+                               ~PCI_BASE_ADDRESS_MEM_MASK;
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, val);
+}
+
+static uint32_t vpci_rom_read(struct pci_dev *pdev, unsigned int reg,
+                              const void *data)
+{
+    const struct vpci_bar *rom = data;
+    uint32_t val;
+
+    val = rom->sizing ? ~(rom->size - 1) : rom->addr;
+    val |= rom->rom_enabled ? PCI_ROM_ADDRESS_ENABLE : 0;
+
+    return val;
+}
+
+static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
+                           uint32_t val, void *data)
+{
+    struct vpci_bar *rom = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    uint32_t addr = val & PCI_ROM_ADDRESS_MASK;
+
+    if ( addr == (uint32_t)PCI_ROM_ADDRESS_MASK )
+    {
+        rom->sizing = true;
+        return;
+    }
+    rom->sizing = false;
+
+    rom->addr = addr;
+
+    /* Check if ROM BAR should be mapped. */
+    if ( (cmd & PCI_COMMAND_MEMORY) &&
+         rom->enabled != !!(val & PCI_ROM_ADDRESS_ENABLE) &&
+         vpci_modify_bar(pdev->domain, rom, val & PCI_ROM_ADDRESS_ENABLE) )
+        return;
+
+    rom->rom_enabled = val & PCI_ROM_ADDRESS_ENABLE;
+    pci_conf_write32(pdev->seg, pdev->bus, slot, func, reg, val);
+}
+
+static int vpci_init_bars(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint16_t cmd;
+    uint64_t addr, size;
+    unsigned int i, num_bars, rom_reg;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    int rc;
+
+    switch ( pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        num_bars = 6;
+        rom_reg = PCI_ROM_ADDRESS;
+        break;
+    case PCI_HEADER_TYPE_BRIDGE:
+        num_bars = 2;
+        rom_reg = PCI_ROM_ADDRESS1;
+        break;
+    default:
+        return -EOPNOTSUPP;
+    }
+
+    /* Setup a handler for the command register. */
+    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
+                           2, header);
+    if ( rc )
+        return rc;
+
+    /* Disable memory decoding before sizing. */
+    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    if ( cmd & PCI_COMMAND_MEMORY )
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
+                         cmd & ~PCI_COMMAND_MEMORY);
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
+
+        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
+        {
+            bars[i].type = VPCI_BAR_MEM64_HI;
+            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                                   &bars[i]);
+            if ( rc )
+            {
+                pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+                return rc;
+            }
+
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        {
+            bars[i].type = VPCI_BAR_IO;
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+             PCI_BASE_ADDRESS_MEM_TYPE_64 )
+            bars[i].type = VPCI_BAR_MEM64_LO;
+        else
+            bars[i].type = VPCI_BAR_MEM32;
+
+        /* Size the BAR and map it. */
+        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
+                              &addr, &size, false, false);
+        if ( rc < 0 )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+
+        if ( size == 0 )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        bars[i].addr = addr;
+        bars[i].size = size;
+        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                               &bars[i]);
+        if ( rc )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+    }
+
+    /* Check expansion ROM. */
+    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size,
+                          false, true);
+    if ( rc < 0 )
+    {
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+        return rc;
+    }
+
+    if ( size )
+    {
+        struct vpci_bar *rom = &header->bars[num_bars];
+
+        rom->type = VPCI_BAR_ROM;
+        rom->size = size;
+        rom->addr = addr;
+
+        rc = vpci_add_register(pdev, vpci_rom_read, vpci_rom_write, rom_reg, 4,
+                               rom);
+        if ( rc )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+    }
+
+    if ( cmd & PCI_COMMAND_MEMORY )
+    {
+        rc = vpci_modify_bars(pdev, true);
+        if ( rc )
+        {
+            pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+            return rc;
+        }
+
+        /* Enable memory decoding. */
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+    }
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_init_bars);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 12f7287d7b..3c6beaaf4a 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -65,6 +65,35 @@  void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    struct vpci_header {
+        /* Information about the PCI BARs of this device. */
+        struct vpci_bar {
+            paddr_t addr;
+            uint64_t size;
+            enum {
+                VPCI_BAR_EMPTY,
+                VPCI_BAR_IO,
+                VPCI_BAR_MEM32,
+                VPCI_BAR_MEM64_LO,
+                VPCI_BAR_MEM64_HI,
+                VPCI_BAR_ROM,
+            } type;
+            bool prefetchable;
+            bool sizing;
+            /* Store whether the BAR is mapped into guest p2m. */
+            bool enabled;
+            /*
+             * Store whether the ROM enable bit is set (doesn't imply ROM BAR
+             * is mapped into guest p2m). Only used for type VPCI_BAR_ROM.
+             */
+            bool rom_enabled;
+        } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
+        /* FIXME: currently there's no support for SR-IOV. */
+    } header;
+#endif
 };
 
 #endif