Message ID | 20170814142850.39133-3-roger.pau@citrix.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
> -----Original Message----- > From: Roger Pau Monne [mailto:roger.pau@citrix.com] > Sent: 14 August 2017 15:29 > To: xen-devel@lists.xenproject.org > Cc: boris.ostrovsky@oracle.com; konrad.wilk@oracle.com; Roger Pau Monne > <roger.pau@citrix.com>; Ian Jackson <Ian.Jackson@citrix.com>; Wei Liu > <wei.liu2@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper > <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com> > Subject: [PATCH v5 02/11] vpci: introduce basic handlers to trap accesses to > the PCI config space > > This functionality is going to reside in vpci.c (and the corresponding > vpci.h header), and should be arch-agnostic. The handlers introduced > in this patch setup the basic functionality required in order to trap > accesses to the PCI config space, and allow decoding the address and > finding the corresponding handler that should handle the access > (although no handlers are implemented). > > Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are > setup inside of a x86 HVM file, since that's not shared with other > arches. > > A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen > whether a domain should use the newly introduced vPCI handlers, this > is only enabled for PVH Dom0 at the moment. > > A very simple user-space test is also provided, so that the basic > functionality of the vPCI traps can be asserted. This has been proven > quite helpful during development, since the logic to handle partial > accesses or accesses that expand across multiple registers is not > trivial. > > The handlers for the registers are added to a linked list that's keep > sorted at all times. Both the read and write handlers support accesses > that expand across multiple emulated registers and contain gaps not > emulated. > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> > --- > Cc: Ian Jackson <ian.jackson@eu.citrix.com> > Cc: Wei Liu <wei.liu2@citrix.com> > Cc: Jan Beulich <jbeulich@suse.com> > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > Cc: Paul Durrant <paul.durrant@citrix.com> > --- > Changes since v4: > * User-space test harness: > - Do not redirect the output of the test. > - Add main.c and emul.h as dependencies of the Makefile target. > - Use the same rule to modify the vpci and list headers. > - Remove underscores from local macro variables. > - Add _check suffix to the test harness multiread function. > - Change the value written by every different size in the multiwrite > test. > - Use { } to initialize the r16 and r20 arrays (instead of { 0 }). > - Perform some of the read checks with the local variable directly. > - Expand some comments. > - Implement a dummy rwlock. > * Hypervisor code: > - Guard the linker script changes with CONFIG_HAS_PCI. > - Rename vpci_access_check to vpci_access_allowed and make it return > bool. > - Make hvm_pci_decode_addr return the register as return value. > - Use ~3 instead of 0xfffc to remove the register offset when > checking accesses to IO ports. > - s/head/prev in vpci_add_register. > - Add parentheses around & in vpci_add_register. > - Fix register removal. > - Change the BUGs in vpci_{read/write}_hw helpers to > ASSERT_UNREACHABLE. > - Make merge_result static and change the computation of the mask to > avoid using a uint64_t. > - Modify vpci_read to only read from hardware the not-emulated gaps. > - Remove the vpci_val union and use a uint32_t instead. > - Change handler read type to return a uint32_t instead of modifying > a variable passed by reference. > - Constify the data opaque parameter of read handlers. > - Change the size parameter of the vpci_{read/write} functions to > unsigned int. > - Place the array of initialization handlers in init.rodata or > .rodata depending on whether late-hwdom is enabled. > - Remove the pci_devs lock, assume the Dom0 is well behaved and won't > remove the device while trying to access it. > - Change the recursive spinlock into a rw lock for performance > reasons. > > Changes since v3: > * User-space test harness: > - Fix spaces in container_of macro. > - Implement a dummy locking functions. > - Remove 'current' macro make current a pointer to the statically > allocated vpcu. > - Remove unneeded parentheses in the pci_conf_readX macros. > - Fix the name of the write test macro. > - Remove the dummy EXPORT_SYMBOL macro (this was needed by the RB > code only). > - Import the max macro. > - Test all possible read/write size combinations with all possible > emulated register sizes. > - Introduce a test for register removal. > * Hypervisor code: > - Use a sorted list in order to store the config space handlers. > - Remove some unneeded 'else' branches. > - Make the IO port handlers always return X86EMUL_OKAY, and set the > data to all 1's in case of read failure (write are simply ignored). > - In hvm_select_ioreq_server reuse local variables when calling > XEN_DMOP_PCI_SBDF. > - Store the pointers to the initialization functions in the .rodata > section. > - Do not ignore the return value of xen_vpci_add_handlers in > setup_one_hwdom_device. > - Remove the vpci_init macro. > - Do not hide the pointers inside of the vpci_{read/write}_t > typedefs. > - Rename priv_data to private in vpci_register. > - Simplify checking for register overlap in vpci_register_cmp. > - Check that the offset and the length match before removing a > register in xen_vpci_remove_register. > - Make vpci_read_hw return a value rather than storing it in a > pointer passed by parameter. > - Handler dispatcher functions vpci_{read/write} no longer return an > error code, errors on reads/writes should be treated like hardware > (writes ignored, reads return all 1's or garbage). > - Make sure pcidevs is locked before calling pci_get_pdev_by_domain. > - Use a recursive spinlock for the vpci lock, so that spin_is_locked > checks that the current CPU is holding the lock. > - Make the code less error-chatty by removing some of the printk's. > - Pass the slot and the function as separate parameters to the > handler dispatchers (instead of passing devfn). > - Allow handlers to be registered with either a read or write > function only, the missing handler will be replaced by a dummy > handler (writes ignored, reads return 1's). > - Introduce PCI_CFG_SPACE_* defines from Linux. > - Simplify the handler dispatchers by removing the recursion, now the > dispatchers iterate over the list of sorted handlers and call them > in order. > - Remove the GENMASK_BYTES, SHIFT_RIGHT_BYTES and ADD_RESULT > macros, > and instead provide a merge_result function in order to merge a > register output into a partial result. > - Rename the fields of the vpci_val union to u8/u16/u32. > - Remove the return values from the read/write handlers, errors > should be handled internally and signaled as would be done on > native hardware. > - Remove the usage of the GENMASK macro. > > Changes since v2: > - Generalize the PCI address decoding and use it for IOREQ code also. > > Changes since v1: > - Allow access to cross a word-boundary. > - Add locking. > - Add cleanup to xen_vpci_add_handlers in case of failure. > --- > .gitignore | 3 + > tools/libxl/libxl_x86.c | 2 +- > tools/tests/Makefile | 1 + > tools/tests/vpci/Makefile | 37 ++++ > tools/tests/vpci/emul.h | 128 +++++++++++ > tools/tests/vpci/main.c | 314 +++++++++++++++++++++++++++ > xen/arch/arm/xen.lds.S | 10 + > xen/arch/x86/domain.c | 18 +- > xen/arch/x86/hvm/hvm.c | 2 + > xen/arch/x86/hvm/io.c | 118 +++++++++- > xen/arch/x86/setup.c | 3 +- > xen/arch/x86/xen.lds.S | 10 + > xen/drivers/Makefile | 2 +- > xen/drivers/passthrough/pci.c | 9 +- > xen/drivers/vpci/Makefile | 1 + > xen/drivers/vpci/vpci.c | 443 > ++++++++++++++++++++++++++++++++++++++ > xen/include/asm-x86/domain.h | 1 + > xen/include/asm-x86/hvm/domain.h | 3 + > xen/include/asm-x86/hvm/io.h | 3 + > xen/include/public/arch-x86/xen.h | 5 +- > xen/include/xen/pci.h | 3 + > xen/include/xen/pci_regs.h | 8 + > xen/include/xen/vpci.h | 80 +++++++ > 23 files changed, 1194 insertions(+), 10 deletions(-) > create mode 100644 tools/tests/vpci/Makefile > create mode 100644 tools/tests/vpci/emul.h > create mode 100644 tools/tests/vpci/main.c > create mode 100644 xen/drivers/vpci/Makefile > create mode 100644 xen/drivers/vpci/vpci.c > create mode 100644 xen/include/xen/vpci.h > [snip] > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > index 6cb903def5..cc73df8dc7 100644 > --- a/xen/arch/x86/hvm/hvm.c > +++ b/xen/arch/x86/hvm/hvm.c > @@ -36,6 +36,7 @@ > #include <xen/rangeset.h> > #include <xen/monitor.h> > #include <xen/warning.h> > +#include <xen/vpci.h> > #include <asm/shadow.h> > #include <asm/hap.h> > #include <asm/current.h> > @@ -629,6 +630,7 @@ int hvm_domain_initialise(struct domain *d, unsigned > long domcr_flags, > d->arch.hvm_domain.io_bitmap = hvm_io_bitmap; > > register_g2m_portio_handler(d); > + register_vpci_portio_handler(d); > > hvm_ioreq_init(d); > > diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c > index 074cba89da..c3b68eb257 100644 > --- a/xen/arch/x86/hvm/io.c > +++ b/xen/arch/x86/hvm/io.c > @@ -25,6 +25,7 @@ > #include <xen/trace.h> > #include <xen/event.h> > #include <xen/hypercall.h> > +#include <xen/vpci.h> > #include <asm/current.h> > #include <asm/cpufeature.h> > #include <asm/processor.h> > @@ -260,7 +261,7 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, > unsigned int addr, > unsigned int *bus, unsigned int *slot, > unsigned int *func) > { > - unsigned long bdf; > + unsigned int bdf; Shouldn't this be folded into the previous patch where you introduce this function? > > ASSERT(CF8_ENABLED(cf8)); > > @@ -275,6 +276,121 @@ unsigned int hvm_pci_decode_addr(unsigned int > cf8, unsigned int addr, > return CF8_ADDR_LO(cf8) | (addr & 3); > } > > +/* Do some sanity checks. */ > +static bool vpci_access_allowed(unsigned int reg, unsigned int len) > +{ > + /* Check access size. */ > + if ( len != 1 && len != 2 && len != 4 ) > + return false; > + > + /* Check that access is size aligned. */ > + if ( (reg & (len - 1)) ) > + return false; > + > + return true; > +} > + > +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */ > +static bool vpci_portio_accept(const struct hvm_io_handler *handler, > + const ioreq_t *p) > +{ > + return (p->addr == 0xcf8 && p->size == 4) || (p->addr & ~3) == 0xcfc; > +} > + > +static int vpci_portio_read(const struct hvm_io_handler *handler, > + uint64_t addr, uint32_t size, uint64_t *data) > +{ > + struct domain *d = current->domain; > + unsigned int bus, slot, func, reg; > + > + *data = ~(uint64_t)0; > + > + vpci_rlock(d); > + if ( addr == 0xcf8 ) > + { > + ASSERT(size == 4); > + *data = d->arch.hvm_domain.pci_cf8; > + vpci_runlock(d); > + return X86EMUL_OKAY; > + } > + if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) ) > + { > + vpci_runlock(d); > + return X86EMUL_OKAY; > + } > + > + reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, > &slot, > + &func); > + > + if ( !vpci_access_allowed(reg, size) ) > + { > + vpci_runlock(d); > + return X86EMUL_OKAY; > + } > + > + *data = vpci_read(0, bus, slot, func, reg, size); > + vpci_runlock(d); > + > + return X86EMUL_OKAY; > +} > + > +static int vpci_portio_write(const struct hvm_io_handler *handler, > + uint64_t addr, uint32_t size, uint64_t data) > +{ > + struct domain *d = current->domain; > + unsigned int bus, slot, func, reg; > + > + vpci_wlock(d); > + if ( addr == 0xcf8 ) > + { > + ASSERT(size == 4); > + d->arch.hvm_domain.pci_cf8 = data; > + vpci_wunlock(d); > + return X86EMUL_OKAY; > + } > + if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) ) > + { > + vpci_wunlock(d); > + return X86EMUL_OKAY; > + } > + > + reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, > &slot, > + &func); > + > + if ( !vpci_access_allowed(reg, size) ) > + { > + vpci_wunlock(d); > + return X86EMUL_OKAY; > + } > + > + vpci_write(0, bus, slot, func, reg, size, data); > + vpci_wunlock(d); > + > + return X86EMUL_OKAY; > +} > + > +static const struct hvm_io_ops vpci_portio_ops = { > + .accept = vpci_portio_accept, > + .read = vpci_portio_read, > + .write = vpci_portio_write, > +}; > + > +void register_vpci_portio_handler(struct domain *d) > +{ > + struct hvm_io_handler *handler; > + > + if ( !has_vpci(d) ) > + return; > + > + handler = hvm_next_io_handler(d); > + if ( !handler ) > + return; > + > + rwlock_init(&d->arch.hvm_domain.vpci_lock); > + handler->type = IOREQ_TYPE_PIO; > + handler->ops = &vpci_portio_ops; > +} > + > /* > * Local variables: > * mode: C > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c > index db5df6956d..5b2c0e3fc3 100644 > --- a/xen/arch/x86/setup.c > +++ b/xen/arch/x86/setup.c > @@ -1566,7 +1566,8 @@ void __init noreturn __start_xen(unsigned long > mbi_p) > domcr_flags |= DOMCRF_hvm | > ((hvm_funcs.hap_supported && !opt_dom0_shadow) ? > DOMCRF_hap : 0); > - config.emulation_flags = > XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC; > + config.emulation_flags = > XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC| > + XEN_X86_EMU_VPCI; > } > > /* Create initial domain 0. */ > diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S > index ff08bbe42a..af1b30cb2b 100644 > --- a/xen/arch/x86/xen.lds.S > +++ b/xen/arch/x86/xen.lds.S > @@ -76,6 +76,11 @@ SECTIONS > > __2M_rodata_start = .; /* Start of 2M superpages, mapped RO. */ > .rodata : { > +#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM) > + __start_vpci_array = .; > + *(.rodata.vpci) > + __end_vpci_array = .; > +#endif > _srodata = .; > /* Bug frames table */ > __start_bug_frames = .; > @@ -167,6 +172,11 @@ SECTIONS > _einittext = .; > } :text > .init.data : { > +#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM) > + __start_vpci_array = .; > + *(.init.rodata.vpci) > + __end_vpci_array = .; > +#endif > *(.init.rodata) > *(.init.rodata.rel) > *(.init.rodata.str*) > diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile > index 19391802a8..d51c766453 100644 > --- a/xen/drivers/Makefile > +++ b/xen/drivers/Makefile > @@ -1,6 +1,6 @@ > subdir-y += char > subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq > -subdir-$(CONFIG_HAS_PCI) += pci > +subdir-$(CONFIG_HAS_PCI) += pci vpci > subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough > subdir-$(CONFIG_ACPI) += acpi > subdir-$(CONFIG_VIDEO) += video > diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c > index 27bdb7163c..54326cf0b8 100644 > --- a/xen/drivers/passthrough/pci.c > +++ b/xen/drivers/passthrough/pci.c > @@ -30,6 +30,7 @@ > #include <xen/radix-tree.h> > #include <xen/softirq.h> > #include <xen/tasklet.h> > +#include <xen/vpci.h> > #include <xsm/xsm.h> > #include <asm/msi.h> > #include "ats.h" > @@ -1030,9 +1031,10 @@ static void __hwdom_init > setup_one_hwdom_device(const struct setup_hwdom *ctxt, > struct pci_dev *pdev) > { > u8 devfn = pdev->devfn; > + int err; > > do { > - int err = ctxt->handler(devfn, pdev); > + err = ctxt->handler(devfn, pdev); > > if ( err ) > { > @@ -1045,6 +1047,11 @@ static void __hwdom_init > setup_one_hwdom_device(const struct setup_hwdom *ctxt, > devfn += pdev->phantom_stride; > } while ( devfn != pdev->devfn && > PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) ); > + > + err = vpci_add_handlers(pdev); > + if ( err ) > + printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n", > + ctxt->d->domain_id, err); > } > > static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, > void *arg) > diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile > new file mode 100644 > index 0000000000..840a906470 > --- /dev/null > +++ b/xen/drivers/vpci/Makefile > @@ -0,0 +1 @@ > +obj-y += vpci.o > diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c > new file mode 100644 > index 0000000000..f63de97e89 > --- /dev/null > +++ b/xen/drivers/vpci/vpci.c > @@ -0,0 +1,443 @@ > +/* > + * Generic functionality for handling accesses to the PCI configuration space > + * from guests. > + * > + * Copyright (C) 2017 Citrix Systems R&D > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms and conditions of the GNU General Public > + * License, version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > GNU > + * General Public License for more details. > + * > + * You should have received a copy of the GNU General Public > + * License along with this program; If not, see > <http://www.gnu.org/licenses/>. > + */ > + > +#include <xen/sched.h> > +#include <xen/vpci.h> > + > +extern vpci_register_init_t *const __start_vpci_array[]; > +extern vpci_register_init_t *const __end_vpci_array[]; > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array) > + > +/* Internal struct to store the emulated PCI registers. */ > +struct vpci_register { > + vpci_read_t *read; > + vpci_write_t *write; > + unsigned int size; > + unsigned int offset; > + void *private; > + struct list_head node; > +}; > + > +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev) > +{ > + unsigned int i; > + int rc = 0; > + > + if ( !has_vpci(pdev->domain) ) > + return 0; > + > + pdev->vpci = xzalloc(struct vpci); > + if ( !pdev->vpci ) > + return -ENOMEM; > + > + INIT_LIST_HEAD(&pdev->vpci->handlers); > + > + for ( i = 0; i < NUM_VPCI_INIT; i++ ) > + { > + rc = __start_vpci_array[i](pdev); > + if ( rc ) > + break; > + } > + > + if ( rc ) > + { > + while ( !list_empty(&pdev->vpci->handlers) ) > + { > + struct vpci_register *r = list_first_entry(&pdev->vpci->handlers, > + struct vpci_register, > + node); > + > + list_del(&r->node); > + xfree(r); > + } > + xfree(pdev->vpci); > + } > + > + return rc; > +} > + > +static int vpci_register_cmp(const struct vpci_register *r1, > + const struct vpci_register *r2) > +{ > + /* Return 0 if registers overlap. */ > + if ( r1->offset < r2->offset + r2->size && > + r2->offset < r1->offset + r1->size ) > + return 0; > + if ( r1->offset < r2->offset ) > + return -1; > + if ( r1->offset > r2->offset ) > + return 1; > + > + ASSERT_UNREACHABLE(); > + return 0; > +} > + > +/* Dummy hooks, writes are ignored, reads return 1's */ > +static uint32_t vpci_ignored_read(struct pci_dev *pdev, unsigned int reg, > + const void *data) > +{ > + return ~(uint32_t)0; > +} > + > +static void vpci_ignored_write(struct pci_dev *pdev, unsigned int reg, > + uint32_t val, void *data) > +{ > +} > + > +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t > *read_handler, > + vpci_write_t *write_handler, unsigned int offset, > + unsigned int size, void *data) > +{ > + struct list_head *prev; > + struct vpci_register *r; > + > + /* Some sanity checks. */ > + if ( (size != 1 && size != 2 && size != 4) || > + offset >= PCI_CFG_SPACE_EXP_SIZE || (offset & (size - 1)) || > + (!read_handler && !write_handler) ) > + return -EINVAL; > + > + r = xmalloc(struct vpci_register); > + if ( !r ) > + return -ENOMEM; > + > + r->read = read_handler ?: vpci_ignored_read; > + r->write = write_handler ?: vpci_ignored_write; > + r->size = size; > + r->offset = offset; > + r->private = data; > + > + vpci_wlock(pdev->domain); > + > + /* The list of handlers must be keep sorted at all times. */ > + list_for_each ( prev, &pdev->vpci->handlers ) > + { > + const struct vpci_register *this = > + list_entry(prev, const struct vpci_register, node); > + int cmp = vpci_register_cmp(r, this); > + > + if ( cmp < 0 ) > + break; > + if ( cmp == 0 ) > + { > + vpci_wunlock(pdev->domain); > + xfree(r); > + return -EEXIST; > + } > + } > + > + list_add_tail(&r->node, prev); > + vpci_wunlock(pdev->domain); > + > + return 0; > +} > + > +int vpci_remove_register(const struct pci_dev *pdev, unsigned int offset, > + unsigned int size) > +{ > + const struct vpci_register r = { .offset = offset, .size = size }; > + struct vpci_register *rm; > + > + vpci_wlock(pdev->domain); > + list_for_each_entry ( rm, &pdev->vpci->handlers, node ) > + { > + int cmp = vpci_register_cmp(&r, rm); > + > + /* > + * NB: do not use a switch so that we can use break to > + * get out of the list loop earlier if required. > + */ > + if ( !cmp && rm->offset == offset && rm->size == size ) > + { > + list_del(&rm->node); > + vpci_wunlock(pdev->domain); > + xfree(rm); > + return 0; > + } > + if ( cmp <= 0 ) > + break; > + } > + vpci_wunlock(pdev->domain); > + > + return -ENOENT; > +} > + > +/* Wrappers for performing reads/writes to the underlying hardware. */ > +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus, > + unsigned int slot, unsigned int func, > + unsigned int reg, unsigned int size) > +{ > + uint32_t data; > + > + switch ( size ) > + { > + case 4: > + data = pci_conf_read32(seg, bus, slot, func, reg); > + break; > + case 3: > + /* > + * This is possible because a 4byte read can have 1byte trapped and > + * the rest passed-through. > + */ > + if ( reg & 1 ) > + { > + data = pci_conf_read8(seg, bus, slot, func, reg); > + data |= pci_conf_read16(seg, bus, slot, func, reg + 1) << 8; > + } > + else > + { > + data = pci_conf_read16(seg, bus, slot, func, reg); > + data |= pci_conf_read8(seg, bus, slot, func, reg + 2) << 16; > + } > + break; > + case 2: > + data = pci_conf_read16(seg, bus, slot, func, reg); > + break; > + case 1: > + data = pci_conf_read8(seg, bus, slot, func, reg); > + break; > + default: > + ASSERT_UNREACHABLE(); > + data = ~(uint32_t)0; > + break; > + } > + > + return data; > +} > + > +static void vpci_write_hw(unsigned int seg, unsigned int bus, > + unsigned int slot, unsigned int func, > + unsigned int reg, unsigned int size, uint32_t data) > +{ > + switch ( size ) > + { > + case 4: > + pci_conf_write32(seg, bus, slot, func, reg, data); > + break; > + case 3: > + /* > + * This is possible because a 4byte write can have 1byte trapped and > + * the rest passed-through. > + */ > + if ( reg & 1 ) > + { > + pci_conf_write8(seg, bus, slot, func, reg, data); > + pci_conf_write16(seg, bus, slot, func, reg + 1, data >> 8); > + } > + else > + { > + pci_conf_write16(seg, bus, slot, func, reg, data); > + pci_conf_write8(seg, bus, slot, func, reg + 2, data >> 16); > + } > + break; > + case 2: > + pci_conf_write16(seg, bus, slot, func, reg, data); > + break; > + case 1: > + pci_conf_write8(seg, bus, slot, func, reg, data); > + break; > + default: > + ASSERT_UNREACHABLE(); > + break; > + } > +} > + > +/* > + * Merge new data into a partial result. > + * > + * Zero the bytes of 'data' from [offset, offset + size), and > + * merge the value found in 'new' from [0, offset) left shifted > + * by 'offset'. > + */ > +static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, > + unsigned int offset) > +{ > + uint32_t mask = 0xffffffff >> (32 - 8 * size); > + > + return (data & ~(mask << (offset * 8))) | ((new & mask) << (offset * 8)); > +} > + > +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot, > + unsigned int func, unsigned int reg, unsigned int size) > +{ > + struct domain *d = current->domain; > + struct pci_dev *pdev; > + const struct vpci_register *r; > + unsigned int data_offset = 0; > + uint32_t data = ~(uint32_t)0; > + > + ASSERT(vpci_rlocked(d)); > + > + /* Find the PCI dev matching the address. */ > + pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func)); > + if ( !pdev ) > + return vpci_read_hw(seg, bus, slot, func, reg, size); > + > + /* Read from the hardware or the emulated register handlers. */ > + list_for_each_entry ( r, &pdev->vpci->handlers, node ) > + { > + const struct vpci_register emu = { > + .offset = reg + data_offset, > + .size = size - data_offset > + }; > + int cmp = vpci_register_cmp(&emu, r); > + uint32_t val; > + unsigned int read_size; > + > + if ( cmp < 0 ) > + break; > + if ( cmp > 0 ) > + continue; > + > + if ( emu.offset < r->offset ) > + { > + /* Heading gap, read partial content from hardware. */ > + read_size = r->offset - emu.offset; > + val = vpci_read_hw(seg, bus, slot, func, emu.offset, read_size); > + data = merge_result(data, val, read_size, data_offset); > + data_offset += read_size; > + } > + > + val = r->read(pdev, r->offset, r->private); > + > + /* Check if the read is in the middle of a register. */ > + if ( r->offset < emu.offset ) > + val >>= (emu.offset - r->offset) * 8; > + > + /* Find the intersection size between the two sets. */ > + read_size = min(emu.offset + emu.size, r->offset + r->size) - > + max(emu.offset, r->offset); > + /* Merge the emulated data into the native read value. */ > + data = merge_result(data, val, read_size, data_offset); > + data_offset += read_size; > + if ( data_offset == size ) > + break; > + ASSERT(data_offset < size); > + } > + > + if ( data_offset < size ) > + { > + /* Tailing gap, read the remaining. */ > + uint32_t tmp_data = vpci_read_hw(seg, bus, slot, func, > + reg + data_offset, > + size - data_offset); > + > + data = merge_result(data, tmp_data, size - data_offset, data_offset); > + } > + > + return data & (0xffffffff >> (32 - 8 * size)); > +} > + > +/* > + * Perform a maybe partial write to a register. > + * > + * Note that this will only work for simple registers, if Xen needs to > + * trap accesses to rw1c registers (like the status PCI header register) > + * the logic in vpci_write will have to be expanded in order to correctly > + * deal with them. > + */ > +static void vpci_write_helper(struct pci_dev *pdev, > + const struct vpci_register *r, unsigned int size, > + unsigned int offset, uint32_t data) > +{ > + ASSERT(size <= r->size); > + > + if ( size != r->size ) > + { > + uint32_t val; > + > + val = r->read(pdev, r->offset, r->private); > + data = merge_result(val, data, size, offset); > + } > + > + r->write(pdev, r->offset, data & (0xffffffff >> (32 - 8 * r->size)), > + r->private); > +} > + > +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot, > + unsigned int func, unsigned int reg, unsigned int size, > + uint32_t data) > +{ > + struct domain *d = current->domain; > + struct pci_dev *pdev; > + const struct vpci_register *r; > + unsigned int data_offset = 0; > + > + ASSERT(vpci_wlocked(d)); > + > + /* > + * Find the PCI dev matching the address. > + * Passthrough everything that's not trapped. > + */ > + pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func)); > + if ( !pdev ) > + { > + vpci_write_hw(seg, bus, slot, func, reg, size, data); > + return; > + } > + > + /* Write the value to the hardware or emulated registers. */ > + list_for_each_entry ( r, &pdev->vpci->handlers, node ) > + { > + const struct vpci_register emu = { > + .offset = reg + data_offset, > + .size = size - data_offset > + }; > + int cmp = vpci_register_cmp(&emu, r); > + unsigned int write_size; > + > + if ( cmp < 0 ) > + break; > + if ( cmp > 0 ) > + continue; > + > + if ( emu.offset < r->offset ) > + { > + /* Heading gap, write partial content to hardware. */ > + vpci_write_hw(seg, bus, slot, func, emu.offset, > + r->offset - emu.offset, data >> (data_offset * 8)); > + data_offset += r->offset - emu.offset; > + } > + > + /* Find the intersection size between the two sets. */ > + write_size = min(emu.offset + emu.size, r->offset + r->size) - > + max(emu.offset, r->offset); > + vpci_write_helper(pdev, r, write_size, reg + data_offset - r->offset, > + data >> (data_offset * 8)); > + data_offset += write_size; > + if ( data_offset == size ) > + break; > + ASSERT(data_offset < size); > + } > + > + if ( data_offset < size ) > + /* Tailing gap, write the remaining. */ > + vpci_write_hw(seg, bus, slot, func, reg + data_offset, > + size - data_offset, data >> (data_offset * 8)); > +} > + > +/* > + * Local variables: > + * mode: C > + * c-file-style: "BSD" > + * c-basic-offset: 4 > + * tab-width: 4 > + * indent-tabs-mode: nil > + * End: > + */ > diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h > index c10522b7f5..ec14343b27 100644 > --- a/xen/include/asm-x86/domain.h > +++ b/xen/include/asm-x86/domain.h > @@ -427,6 +427,7 @@ struct arch_domain > #define has_vpit(d) (!!((d)->arch.emulation_flags & > XEN_X86_EMU_PIT)) > #define has_pirq(d) (!!((d)->arch.emulation_flags & \ > XEN_X86_EMU_USE_PIRQ)) > +#define has_vpci(d) (!!((d)->arch.emulation_flags & > XEN_X86_EMU_VPCI)) > > #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) > > diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm- > x86/hvm/domain.h > index d2899c9bb2..3a54d50606 100644 > --- a/xen/include/asm-x86/hvm/domain.h > +++ b/xen/include/asm-x86/hvm/domain.h > @@ -184,6 +184,9 @@ struct hvm_domain { > /* List of guest to machine IO ports mapping. */ > struct list_head g2m_ioport_list; > > + /* Lock for the PCI emulation layer (vPCI). */ > + rwlock_t vpci_lock; > + > /* List of permanently write-mapped pages. */ > struct { > spinlock_t lock; > diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h > index 51659b6c7f..01322a2e21 100644 > --- a/xen/include/asm-x86/hvm/io.h > +++ b/xen/include/asm-x86/hvm/io.h > @@ -160,6 +160,9 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, > unsigned int addr, > */ > void register_g2m_portio_handler(struct domain *d); > > +/* HVM port IO handler for PCI accesses. */ > +void register_vpci_portio_handler(struct domain *d); > + > #endif /* __ASM_X86_HVM_IO_H__ */ > > > diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch- > x86/xen.h > index f21332e897..86a1a09a8d 100644 > --- a/xen/include/public/arch-x86/xen.h > +++ b/xen/include/public/arch-x86/xen.h > @@ -295,12 +295,15 @@ struct xen_arch_domainconfig { > #define XEN_X86_EMU_PIT (1U<<_XEN_X86_EMU_PIT) > #define _XEN_X86_EMU_USE_PIRQ 9 > #define XEN_X86_EMU_USE_PIRQ (1U<<_XEN_X86_EMU_USE_PIRQ) > +#define _XEN_X86_EMU_VPCI 10 > +#define XEN_X86_EMU_VPCI (1U<<_XEN_X86_EMU_VPCI) > > #define XEN_X86_EMU_ALL (XEN_X86_EMU_LAPIC | > XEN_X86_EMU_HPET | \ > XEN_X86_EMU_PM | XEN_X86_EMU_RTC | \ > XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC | \ > XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU | \ > - XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ) > + XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\ > + XEN_X86_EMU_VPCI) > uint32_t emulation_flags; > }; > > diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h > index ea6a66b248..ad5d3ca031 100644 > --- a/xen/include/xen/pci.h > +++ b/xen/include/xen/pci.h > @@ -88,6 +88,9 @@ struct pci_dev { > #define PT_FAULT_THRESHOLD 10 > } fault; > u64 vf_rlen[6]; > + > + /* Data for vPCI. */ > + struct vpci *vpci; > }; > > #define for_each_pdev(domain, pdev) \ > diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h > index ecd6124d91..cc4ee3b83e 100644 > --- a/xen/include/xen/pci_regs.h > +++ b/xen/include/xen/pci_regs.h > @@ -23,6 +23,14 @@ > #define LINUX_PCI_REGS_H > > /* > + * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of > + * configuration space. PCI-X Mode 2 and PCIe devices have 4096 bytes of > + * configuration space. > + */ > +#define PCI_CFG_SPACE_SIZE 256 > +#define PCI_CFG_SPACE_EXP_SIZE 4096 > + > +/* > * Under PCI, each device has 256 bytes of configuration address space, > * of which the first 64 bytes are standardized as follows: > */ > diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h > new file mode 100644 > index 0000000000..12f7287d7b > --- /dev/null > +++ b/xen/include/xen/vpci.h > @@ -0,0 +1,80 @@ > +#ifndef _VPCI_ > +#define _VPCI_ > + > +#include <xen/pci.h> > +#include <xen/types.h> > +#include <xen/list.h> > + > +/* > + * Helpers for locking/unlocking. > + * > + * NB: the recursive variants are used so that spin_is_locked > + * returns whether the lock is hold by the current CPU (instead > + * of just returning whether the lock is hold by any CPU). > + */ The comment doesn't seem to match the use of read-write locks below. > +#define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_rlocked(d) rw_is_locked(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_wlocked(d) rw_is_write_locked(&(d)- > >arch.hvm_domain.vpci_lock) > + > +/* > + * The vPCI handlers will never be called concurrently for the same domain, > it > + * is guaranteed that the vpci domain lock will always be locked when calling > + * any handler. > + */ > +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg, > + const void *data); > + > +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg, > + uint32_t val, void *data); > + > +typedef int vpci_register_init_t(struct pci_dev *dev); > + > +#ifdef CONFIG_LATE_HWDOM > +#define VPCI_SECTION ".rodata.vpci" > +#else > +#define VPCI_SECTION ".init.rodata.vpci" > +#endif > + > +#define REGISTER_VPCI_INIT(x) \ > + static vpci_register_init_t *const x##_entry \ > + __used_section(VPCI_SECTION) = x > + > +/* Add vPCI handlers to device. */ > +int __must_check vpci_add_handlers(struct pci_dev *dev); > + > +/* Add/remove a register handler. */ > +int __must_check vpci_add_register(const struct pci_dev *pdev, > + vpci_read_t *read_handler, > + vpci_write_t *write_handler, > + unsigned int offset, unsigned int size, > + void *data); > +int __must_check vpci_remove_register(const struct pci_dev *pdev, > + unsigned int offset, > + unsigned int size); > + > +/* Generic read/write handlers for the PCI config space. */ > +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot, > + unsigned int func, unsigned int reg, unsigned int size); > +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot, > + unsigned int func, unsigned int reg, unsigned int size, > + uint32_t data); > + > +struct vpci { > + /* List of vPCI handlers for a device. */ > + struct list_head handlers; > +}; > + > +#endif > + > +/* > + * Local variables: > + * mode: C > + * c-file-style: "BSD" > + * c-basic-offset: 4 > + * tab-width: 4 > + * indent-tabs-mode: nil > + * End: > + */ > -- > 2.11.0 (Apple Git-81)
>>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote: > Changes since v4: >[...] > * Hypervisor code: >[...] > - Constify the data opaque parameter of read handlers. Is that a good idea? Such callbacks should generally be allowed to modify their state even if the operation is just a read - think of a private lock or statistics/debugging data to be updated. > --- /dev/null > +++ b/tools/tests/vpci/main.c > @@ -0,0 +1,314 @@ > +/* > + * Unit tests for the generic vPCI handler code. > + * > + * Copyright (C) 2017 Citrix Systems R&D > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms and conditions of the GNU General Public > + * License, version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + * > + * You should have received a copy of the GNU General Public > + * License along with this program; If not, see <http://www.gnu.org/licenses/>. > + */ > + > +#include "emul.h" > + > +/* Single vcpu (current), and single domain with a single PCI device. */ > +static struct vpci vpci; > + > +static struct domain d = { > + .lock = false, UNLOCKED ? > +int > +main(int argc, char **argv) > +{ > + /* Index storage by offset. */ > + uint32_t r0 = 0xdeadbeef; > + uint8_t r5 = 0xef; > + uint8_t r6 = 0xbe; > + uint8_t r7 = 0xef; > + uint16_t r12 = 0x8696; > + uint8_t r16[4] = { }; > + uint16_t r20[2] = { }; > + uint32_t r24 = 0; > + uint8_t r28, r30; > + unsigned int i; > + int rc; > + > + INIT_LIST_HEAD(&vpci.handlers); > + > + VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0); > + VPCI_READ_CHECK(0, 4, r0); > + VPCI_WRITE_CHECK(0, 4, 0xbcbcbcbc); > + > + VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5); > + VPCI_READ_CHECK(5, 1, r5); > + VPCI_WRITE_CHECK(5, 1, 0xba); > + > + VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6); > + VPCI_READ_CHECK(6, 1, r6); > + VPCI_WRITE_CHECK(6, 1, 0xba); > + > + VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7); > + VPCI_READ_CHECK(7, 1, r7); > + VPCI_WRITE_CHECK(7, 1, 0xbd); > + > + VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12); > + VPCI_READ_CHECK(12, 2, r12); > + VPCI_READ_CHECK(12, 4, 0xffff8696); > + > + /* > + * At this point we have the following layout: > + * > + * Note that this refers to the position of the variables, > + * but the value has already changed from the one given at > + * initialization time because write tests have been performed. > + * > + * 32 24 16 8 0 > + * +-----+-----+-----+-----+ > + * | r0 | 0 > + * +-----+-----+-----+-----+ > + * | r7 | r6 | r5 |/////| 32 > + * +-----+-----+-----+-----| > + * |///////////////////////| 64 > + * +-----------+-----------+ > + * |///////////| r12 | 96 > + * +-----------+-----------+ > + * ... > + * / = empty. Maybe better "unwritten"? > --- a/xen/arch/x86/xen.lds.S > +++ b/xen/arch/x86/xen.lds.S > @@ -76,6 +76,11 @@ SECTIONS > > __2M_rodata_start = .; /* Start of 2M superpages, mapped RO. */ > .rodata : { > +#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM) > + __start_vpci_array = .; > + *(.rodata.vpci) > + __end_vpci_array = .; > +#endif > _srodata = .; Please don't put anything ahead of this label. And I'd prefer if ordinary .rodata contributions came first, i.e. please don't follow the bad example ... > /* Bug frames table */ > __start_bug_frames = .; ... this gives (there are plenty of good examples further down in this section). > @@ -167,6 +172,11 @@ SECTIONS > _einittext = .; > } :text > .init.data : { > +#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM) > + __start_vpci_array = .; > + *(.init.rodata.vpci) > + __end_vpci_array = .; > +#endif > *(.init.rodata) Same here then. > +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t *read_handler, > + vpci_write_t *write_handler, unsigned int offset, > + unsigned int size, void *data) > +{ > + struct list_head *prev; > + struct vpci_register *r; > + > + /* Some sanity checks. */ > + if ( (size != 1 && size != 2 && size != 4) || > + offset >= PCI_CFG_SPACE_EXP_SIZE || (offset & (size - 1)) || > + (!read_handler && !write_handler) ) > + return -EINVAL; > + > + r = xmalloc(struct vpci_register); > + if ( !r ) > + return -ENOMEM; > + > + r->read = read_handler ?: vpci_ignored_read; > + r->write = write_handler ?: vpci_ignored_write; > + r->size = size; > + r->offset = offset; > + r->private = data; > + > + vpci_wlock(pdev->domain); > + > + /* The list of handlers must be keep sorted at all times. */ kept > +/* > + * Merge new data into a partial result. > + * > + * Zero the bytes of 'data' from [offset, offset + size), and > + * merge the value found in 'new' from [0, offset) left shifted DYM [0, size) here? I also have to admit that I find it strange that you talk of zeroing something here - the net effect of the function is not producing any zeros anywhere afaict. Such a pre-function comment is normally describing the effect of the function as seen to the caller rather than its inner workings. > --- /dev/null > +++ b/xen/include/xen/vpci.h > @@ -0,0 +1,80 @@ > +#ifndef _VPCI_ > +#define _VPCI_ This is a little short (and unusual) for an inclusion guard. > + > +#include <xen/pci.h> > +#include <xen/types.h> > +#include <xen/list.h> > + > +/* > + * Helpers for locking/unlocking. > + * > + * NB: the recursive variants are used so that spin_is_locked > + * returns whether the lock is hold by the current CPU (instead > + * of just returning whether the lock is hold by any CPU). Stale comment? > + */ > +#define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_rlocked(d) rw_is_locked(&(d)->arch.hvm_domain.vpci_lock) > +#define vpci_wlocked(d) rw_is_write_locked(&(d)->arch.hvm_domain.vpci_lock) > + > +/* > + * The vPCI handlers will never be called concurrently for the same domain, it > + * is guaranteed that the vpci domain lock will always be locked when calling > + * any handler. One more? > + */ > +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg, > + const void *data); > + > +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg, > + uint32_t val, void *data); Do these two really need access to the struct pci_dev, rather than just struct vpci? And if they do, then are they really permitted to alter that struct pci_dev? Jan
On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote: > >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote: > > Changes since v4: > >[...] > > * Hypervisor code: > >[...] > > - Constify the data opaque parameter of read handlers. > > Is that a good idea? Such callbacks should generally be allowed to > modify their state even if the operation is just a read - think of a > private lock or statistics/debugging data to be updated. Right now the consistency of the opaque data is kept by the global vpci lock, which I like because it makes the code simpler. If the opaque data is not constified for the read handlers then I would be against using a read/write lock. Statistic/debug data IMHO should be kept in a separate structure with it's own lock, that's referenced by the opaque data. This would allow Xen to not allocate this for non-debug builds, reducing memory footprint and lock contention in production. > > +/* > > + * Merge new data into a partial result. > > + * > > + * Zero the bytes of 'data' from [offset, offset + size), and > > + * merge the value found in 'new' from [0, offset) left shifted > > DYM [0, size) here? Yes, will fix. > I also have to admit that I find it strange that > you talk of zeroing something here - the net effect of the function > is not producing any zeros anywhere afaict. Such a pre-function > comment is normally describing the effect of the function as seen > to the caller rather than its inner workings. OK, would it be better to write it as: /* * Merge new data into a partial result. * * Copy the value found in 'new' from [0, size) left shifted by * 'offset' into 'data'. */ > > + */ > > +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg, > > + const void *data); > > + > > +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg, > > + uint32_t val, void *data); > > Do these two really need access to the struct pci_dev, rather than > just struct vpci? And if they do, then are they really permitted to > alter that struct pci_dev? I'm leaning towards pdev because it already contains vpci. AFAICT it should be fine to constify it. Thanks, Roger.
>>> On 06.09.17 at 17:40, <roger.pau@citrix.com> wrote: > On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote: >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote: >> > Changes since v4: >> >[...] >> > * Hypervisor code: >> >[...] >> > - Constify the data opaque parameter of read handlers. >> >> Is that a good idea? Such callbacks should generally be allowed to >> modify their state even if the operation is just a read - think of a >> private lock or statistics/debugging data to be updated. > > Right now the consistency of the opaque data is kept by the global > vpci lock, which I like because it makes the code simpler. If the > opaque data is not constified for the read handlers then I would be > against using a read/write lock. > > Statistic/debug data IMHO should be kept in a separate structure with > it's own lock, that's referenced by the opaque data. This would allow > Xen to not allocate this for non-debug builds, reducing memory > footprint and lock contention in production. I don't like this, as it makes adding such transiently needlessly hard (as one would need to drop all the const-s or cast them away). What was the reason for switching to the rwlock anyway? Did you measure any performance problems? Are there Dom0 kernels not serializing PCI config space accesses anyway? Would it be an alternative to make the (spin) lock per-device rather than per- domain? That might also be a good idea for pass-through (as there Dom0 as well as the owning DomU fundamentally have access to the config space of a device, and they'd better be synchronized). >> I also have to admit that I find it strange that >> you talk of zeroing something here - the net effect of the function >> is not producing any zeros anywhere afaict. Such a pre-function >> comment is normally describing the effect of the function as seen >> to the caller rather than its inner workings. > > OK, would it be better to write it as: > > /* > * Merge new data into a partial result. > * > * Copy the value found in 'new' from [0, size) left shifted by > * 'offset' into 'data'. > */ Yes. >> > + */ >> > +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg, >> > + const void *data); >> > + >> > +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg, >> > + uint32_t val, void *data); >> >> Do these two really need access to the struct pci_dev, rather than >> just struct vpci? And if they do, then are they really permitted to >> alter that struct pci_dev? > > I'm leaning towards pdev because it already contains vpci. Well, I certainly guessed that to be the reason for the way things are now. But what data in the physical device structure would such a handler possibly require? Oh, looking at later patches, you take SBDF from there. That's certainly a good enough reason then. > AFAICT it should be fine to constify it. In which case please do. Jan
On Thu, Sep 07, 2017 at 03:06:50AM -0600, Jan Beulich wrote: > >>> On 06.09.17 at 17:40, <roger.pau@citrix.com> wrote: > > On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote: > >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote: > >> > Changes since v4: > >> >[...] > >> > * Hypervisor code: > >> >[...] > >> > - Constify the data opaque parameter of read handlers. > >> > >> Is that a good idea? Such callbacks should generally be allowed to > >> modify their state even if the operation is just a read - think of a > >> private lock or statistics/debugging data to be updated. > > > > Right now the consistency of the opaque data is kept by the global > > vpci lock, which I like because it makes the code simpler. If the > > opaque data is not constified for the read handlers then I would be > > against using a read/write lock. > > > > Statistic/debug data IMHO should be kept in a separate structure with > > it's own lock, that's referenced by the opaque data. This would allow > > Xen to not allocate this for non-debug builds, reducing memory > > footprint and lock contention in production. > > I don't like this, as it makes adding such transiently needlessly > hard (as one would need to drop all the const-s or cast them away). Hm, I'm not sure I follow. I was thinking of something along the lines of: struct vpci_msi_stats {...}; struct vpci_msi { [...] struct vpci_msi_stats *stats; }; Then you can easily have a handler like: static void vpci_msi_reg(..., const void *data) { const struct vpci_msi *msi = data; struct vpci_msi_stats *stats = msi->stats; [...] } That should work AFAICT. > What was the reason for switching to the rwlock anyway? Did you > measure any performance problems? Are there Dom0 kernels not > serializing PCI config space accesses anyway? Not that I know of, but the PCIe spec doesn't seem to require non concurrent accesses. PCI of course must not be concurrent. > Would it be an > alternative to make the (spin) lock per-device rather than per- > domain? That might also be a good idea for pass-through (as there > Dom0 as well as the owning DomU fundamentally have access to > the config space of a device, and they'd better be synchronized). That would also work, then I agree it should be a spin lock and the const from the read handlers can be dropped. Unless you say otherwise I'm going to implement this. Thanks, Roger.
>>> On 07.09.17 at 13:30, <roger.pau@citrix.com> wrote: > On Thu, Sep 07, 2017 at 03:06:50AM -0600, Jan Beulich wrote: >> >>> On 06.09.17 at 17:40, <roger.pau@citrix.com> wrote: >> > On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote: >> >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote: >> >> > Changes since v4: >> >> >[...] >> >> > * Hypervisor code: >> >> >[...] >> >> > - Constify the data opaque parameter of read handlers. >> >> >> >> Is that a good idea? Such callbacks should generally be allowed to >> >> modify their state even if the operation is just a read - think of a >> >> private lock or statistics/debugging data to be updated. >> > >> > Right now the consistency of the opaque data is kept by the global >> > vpci lock, which I like because it makes the code simpler. If the >> > opaque data is not constified for the read handlers then I would be >> > against using a read/write lock. >> > >> > Statistic/debug data IMHO should be kept in a separate structure with >> > it's own lock, that's referenced by the opaque data. This would allow >> > Xen to not allocate this for non-debug builds, reducing memory >> > footprint and lock contention in production. >> >> I don't like this, as it makes adding such transiently needlessly >> hard (as one would need to drop all the const-s or cast them away). > > Hm, I'm not sure I follow. I was thinking of something along the lines > of: > > struct vpci_msi_stats {...}; > > struct vpci_msi { > [...] > struct vpci_msi_stats *stats; > }; > > Then you can easily have a handler like: > > static void vpci_msi_reg(..., const void *data) > { > const struct vpci_msi *msi = data; > struct vpci_msi_stats *stats = msi->stats; > [...] > } > > That should work AFAICT. Sure, but the structure needs allocating. Which again is something I wouldn't want to have to worry about when adding _temporary_ debugging or statistics code. But this is all moot anyway with ... >> Would it be an >> alternative to make the (spin) lock per-device rather than per- >> domain? That might also be a good idea for pass-through (as there >> Dom0 as well as the owning DomU fundamentally have access to >> the config space of a device, and they'd better be synchronized). > > That would also work, then I agree it should be a spin lock and the > const from the read handlers can be dropped. Unless you say otherwise > I'm going to implement this. ... this. Jan
On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote: > >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote: > > + /* > > + * At this point we have the following layout: > > + * > > + * Note that this refers to the position of the variables, > > + * but the value has already changed from the one given at > > + * initialization time because write tests have been performed. > > + * > > + * 32 24 16 8 0 > > + * +-----+-----+-----+-----+ > > + * | r0 | 0 > > + * +-----+-----+-----+-----+ > > + * | r7 | r6 | r5 |/////| 32 > > + * +-----+-----+-----+-----| > > + * |///////////////////////| 64 > > + * +-----------+-----------+ > > + * |///////////| r12 | 96 > > + * +-----------+-----------+ > > + * ... > > + * / = empty. > > Maybe better "unwritten"? I've been thinking about this, and I'm not sure unwritten is better, in fact the test will write to this registers, it's just that there's no backing handlers so writes will be discarded and reads will return ~0. So I think "empty" or maybe "unhandled" is more descriptive. Thanks, Roger.
>>> On 08.09.17 at 16:41, <roger.pau@citrix.com> wrote: > On Mon, Sep 04, 2017 at 09:38:11AM -0600, Jan Beulich wrote: >> >>> On 14.08.17 at 16:28, <roger.pau@citrix.com> wrote: >> > + /* >> > + * At this point we have the following layout: >> > + * >> > + * Note that this refers to the position of the variables, >> > + * but the value has already changed from the one given at >> > + * initialization time because write tests have been performed. >> > + * >> > + * 32 24 16 8 0 >> > + * +-----+-----+-----+-----+ >> > + * | r0 | 0 >> > + * +-----+-----+-----+-----+ >> > + * | r7 | r6 | r5 |/////| 32 >> > + * +-----+-----+-----+-----| >> > + * |///////////////////////| 64 >> > + * +-----------+-----------+ >> > + * |///////////| r12 | 96 >> > + * +-----------+-----------+ >> > + * ... >> > + * / = empty. >> >> Maybe better "unwritten"? > > I've been thinking about this, and I'm not sure unwritten is better, > in fact the test will write to this registers, it's just that there's > no backing handlers so writes will be discarded and reads will return > ~0. > > So I think "empty" or maybe "unhandled" is more descriptive. "unhandled" then please - registers can't possibly be empty imo. Jan
Hi Roger, On 14/08/17 15:28, Roger Pau Monne wrote: > This functionality is going to reside in vpci.c (and the corresponding > vpci.h header), and should be arch-agnostic. The handlers introduced > in this patch setup the basic functionality required in order to trap > accesses to the PCI config space, and allow decoding the address and > finding the corresponding handler that should handle the access > (although no handlers are implemented). If I understand correctly this patch, the virtual BDF will always correspond to the physical BDF. Is that right? If so, would you mind to explain why such restriction? Cheers,
On Tue, Sep 12, 2017 at 11:42:38AM +0100, Julien Grall wrote: > Hi Roger, > > On 14/08/17 15:28, Roger Pau Monne wrote: > > This functionality is going to reside in vpci.c (and the corresponding > > vpci.h header), and should be arch-agnostic. The handlers introduced > > in this patch setup the basic functionality required in order to trap > > accesses to the PCI config space, and allow decoding the address and > > finding the corresponding handler that should handle the access > > (although no handlers are implemented). > > If I understand correctly this patch, the virtual BDF will always correspond > to the physical BDF. Is that right? If so, would you mind to explain why > such restriction? Yes, this is not a limitation of this patch, but of the implementation that follows. Likely this will be expanded when support for DomU is added, but for Dom0 at least on x86 adding such a translation layer is not needed, since I see no reason to present a different PCI topology to Dom0. Thanks, Roger.
On 12/09/17 11:58, Roger Pau Monné wrote: > On Tue, Sep 12, 2017 at 11:42:38AM +0100, Julien Grall wrote: >> Hi Roger, >> >> On 14/08/17 15:28, Roger Pau Monne wrote: >>> This functionality is going to reside in vpci.c (and the corresponding >>> vpci.h header), and should be arch-agnostic. The handlers introduced >>> in this patch setup the basic functionality required in order to trap >>> accesses to the PCI config space, and allow decoding the address and >>> finding the corresponding handler that should handle the access >>> (although no handlers are implemented). >> >> If I understand correctly this patch, the virtual BDF will always correspond >> to the physical BDF. Is that right? If so, would you mind to explain why >> such restriction? > > Yes, this is not a limitation of this patch, but of the implementation > that follows. Likely this will be expanded when support for DomU is > added, but for Dom0 at least on x86 adding such a translation layer is > not needed, since I see no reason to present a different PCI topology > to Dom0. I think this will be the same for ARM. Dom0 will always have pBDF == vBDF. For guest will likely want pBDF != vBDF. This answer to my other question regarding the plan to support vBDF != pBDF :). Thanks. > > Thanks, Roger. >
diff --git a/.gitignore b/.gitignore index 594ffd9a7f..7f24ce72b1 100644 --- a/.gitignore +++ b/.gitignore @@ -237,6 +237,9 @@ tools/tests/regression/build/* tools/tests/regression/downloads/* tools/tests/mem-sharing/memshrtool tools/tests/mce-test/tools/xen-mceinj +tools/tests/vpci/list.h +tools/tests/vpci/vpci.[hc] +tools/tests/vpci/test_vpci tools/xcutils/lsevtchn tools/xcutils/readnotes tools/xenbackendd/_paths.h diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c index 455f6f0bed..dd7fc78a99 100644 --- a/tools/libxl/libxl_x86.c +++ b/tools/libxl/libxl_x86.c @@ -11,7 +11,7 @@ int libxl__arch_domain_prepare_config(libxl__gc *gc, if (d_config->c_info.type == LIBXL_DOMAIN_TYPE_HVM) { if (d_config->b_info.device_model_version != LIBXL_DEVICE_MODEL_VERSION_NONE) { - xc_config->emulation_flags = XEN_X86_EMU_ALL; + xc_config->emulation_flags = (XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI); } else if (libxl_defbool_val(d_config->b_info.u.hvm.apic)) { /* * HVM guests without device model may want diff --git a/tools/tests/Makefile b/tools/tests/Makefile index 7162945121..f6942a93fb 100644 --- a/tools/tests/Makefile +++ b/tools/tests/Makefile @@ -13,6 +13,7 @@ endif SUBDIRS-$(CONFIG_X86) += x86_emulator SUBDIRS-y += xen-access SUBDIRS-y += xenstore +SUBDIRS-$(CONFIG_HAS_PCI) += vpci .PHONY: all clean install distclean uninstall all clean distclean: %: subdirs-% diff --git a/tools/tests/vpci/Makefile b/tools/tests/vpci/Makefile new file mode 100644 index 0000000000..e45fcb5cd9 --- /dev/null +++ b/tools/tests/vpci/Makefile @@ -0,0 +1,37 @@ +XEN_ROOT=$(CURDIR)/../../.. +include $(XEN_ROOT)/tools/Rules.mk + +TARGET := test_vpci + +.PHONY: all +all: $(TARGET) + +.PHONY: run +run: $(TARGET) + ./$(TARGET) + +$(TARGET): vpci.c vpci.h list.h main.c emul.h + $(HOSTCC) -g -o $@ vpci.c main.c + +.PHONY: clean +clean: + rm -rf $(TARGET) *.o *~ vpci.h vpci.c list.h + +.PHONY: distclean +distclean: clean + +.PHONY: install +install: + +vpci.c: $(XEN_ROOT)/xen/drivers/vpci/vpci.c + # Trick the compiler so it doesn't complain about missing symbols + sed -e '/#include/d' \ + -e '1s;^;#include "emul.h"\ + vpci_register_init_t *const __start_vpci_array[1]\;\ + vpci_register_init_t *const __end_vpci_array[1]\;\ + ;' <$< >$@ + +list.h: $(XEN_ROOT)/xen/include/xen/list.h +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h +list.h vpci.h: + sed -e '/#include/d' <$< >$@ diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h new file mode 100644 index 0000000000..69f08ba1e7 --- /dev/null +++ b/tools/tests/vpci/emul.h @@ -0,0 +1,128 @@ +/* + * Unit tests for the generic vPCI handler code. + * + * Copyright (C) 2017 Citrix Systems R&D + * + * This program is free software; you can redistribute it and/or + * modify it under the terms and conditions of the GNU General Public + * License, version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; If not, see <http://www.gnu.org/licenses/>. + */ + +#ifndef _TEST_VPCI_ +#define _TEST_VPCI_ + +#include <stdlib.h> +#include <stdio.h> +#include <stddef.h> +#include <stdint.h> +#include <stdbool.h> +#include <errno.h> +#include <assert.h> + +#define container_of(ptr, type, member) ({ \ + typeof(((type *)0)->member) *mptr = (ptr); \ + \ + (type *)((char *)mptr - offsetof(type, member)); \ +}) + +#define smp_wmb() +#define prefetch(x) __builtin_prefetch(x) +#define ASSERT(x) assert(x) +#define __must_check __attribute__((__warn_unused_result__)) + +#include "list.h" + +struct pci_dev { + struct domain *domain; + struct vpci *vpci; +}; + +struct domain { + enum { + UNLOCKED, + RLOCKED, + WLOCKED, + } lock; +}; + +struct vcpu +{ + struct domain *domain; +}; + +extern struct vcpu *current; +extern struct pci_dev test_pdev; + +#include "vpci.h" + +#define __hwdom_init + +#define has_vpci(d) true + +/* Define our own locks. */ +#undef vpci_rlock +#undef vpci_wlock +#undef vpci_runlock +#undef vpci_wunlock +#undef vpci_rlocked +#undef vpci_wlocked +#define vpci_rlock(d) ((d)->lock = RLOCKED) +#define vpci_wlock(d) ((d)->lock = WLOCKED) +#define vpci_runlock(d) ((d)->lock = UNLOCKED) +#define vpci_wunlock(d) ((d)->lock = UNLOCKED) +#define vpci_rlocked(d) ((d)->lock == RLOCKED) +#define vpci_wlocked(d) ((d)->lock == WLOCKED) + +#define xzalloc(type) ((type *)calloc(1, sizeof(type))) +#define xmalloc(type) ((type *)malloc(sizeof(type))) +#define xfree(p) free(p) + +#define pci_get_pdev_by_domain(...) &test_pdev + +/* Dummy native helpers. Writes are ignored, reads return 1's. */ +#define pci_conf_read8(...) 0xff +#define pci_conf_read16(...) 0xffff +#define pci_conf_read32(...) 0xffffffff +#define pci_conf_write8(...) +#define pci_conf_write16(...) +#define pci_conf_write32(...) + +#define PCI_CFG_SPACE_EXP_SIZE 4096 + +#define BUG() assert(0) +#define ASSERT_UNREACHABLE() assert(0) + +#define min(x, y) ({ \ + const typeof(x) tx = (x); \ + const typeof(y) ty = (y); \ + \ + (void) (&tx == &ty); \ + tx < ty ? tx : ty; \ +}) + +#define max(x, y) ({ \ + const typeof(x) tx = (x); \ + const typeof(y) ty = (y); \ + \ + (void) (&tx == &ty); \ + tx > ty ? tx : ty; \ +}) + +#endif + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c new file mode 100644 index 0000000000..33736937bf --- /dev/null +++ b/tools/tests/vpci/main.c @@ -0,0 +1,314 @@ +/* + * Unit tests for the generic vPCI handler code. + * + * Copyright (C) 2017 Citrix Systems R&D + * + * This program is free software; you can redistribute it and/or + * modify it under the terms and conditions of the GNU General Public + * License, version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; If not, see <http://www.gnu.org/licenses/>. + */ + +#include "emul.h" + +/* Single vcpu (current), and single domain with a single PCI device. */ +static struct vpci vpci; + +static struct domain d = { + .lock = false, +}; + +struct pci_dev test_pdev = { + .domain = &d, + .vpci = &vpci, +}; + +static struct vcpu v = { + .domain = &d +}; + +struct vcpu *current = &v; + +/* Dummy hooks, write stores data, read fetches it. */ +static uint32_t vpci_read8(struct pci_dev *pdev, unsigned int reg, + const void *data) +{ + return *(const uint8_t *)data; +} + +static void vpci_write8(struct pci_dev *pdev, unsigned int reg, + uint32_t val, void *data) +{ + *(uint8_t *)data = val; +} + +static uint32_t vpci_read16(struct pci_dev *pdev, unsigned int reg, + const void *data) +{ + return *(const uint16_t *)data; +} + +static void vpci_write16(struct pci_dev *pdev, unsigned int reg, + uint32_t val, void *data) +{ + *(uint16_t *)data = val; +} + +static uint32_t vpci_read32(struct pci_dev *pdev, unsigned int reg, + const void *data) +{ + return *(const uint32_t *)data; +} + +static void vpci_write32(struct pci_dev *pdev, unsigned int reg, + uint32_t val, void *data) +{ + *(uint32_t *)data = val; +} + +#define VPCI_READ(reg, size, data) ({ \ + vpci_rlock(&d); \ + data = vpci_read(0, 0, 0, 0, reg, size); \ + vpci_runlock(&d); \ +}) + +#define VPCI_READ_CHECK(reg, size, expected) ({ \ + uint32_t rd; \ + \ + VPCI_READ(reg, size, rd); \ + assert(rd == (expected)); \ +}) + +#define VPCI_WRITE(reg, size, data) ({ \ + vpci_wlock(&d); \ + vpci_write(0, 0, 0, 0, reg, size, data); \ + vpci_wunlock(&d); \ +}) + +#define VPCI_WRITE_CHECK(reg, size, data) ({ \ + VPCI_WRITE(reg, size, data); \ + VPCI_READ_CHECK(reg, size, data); \ +}) + +#define VPCI_ADD_REG(fread, fwrite, off, size, store) \ + assert(!vpci_add_register(&test_pdev, fread, fwrite, off, size, &store)) + +#define VPCI_ADD_INVALID_REG(fread, fwrite, off, size) \ + assert(vpci_add_register(&test_pdev, fread, fwrite, off, size, NULL)) + +#define VPCI_REMOVE_REG(off, size) \ + assert(!vpci_remove_register(&test_pdev, off, size)) + +#define VPCI_REMOVE_INVALID_REG(off, size) \ + assert(vpci_remove_register(&test_pdev, off, size)) + +/* Read a 32b register using all possible sizes. */ +void multiread4_check(unsigned int reg, uint32_t val) +{ + unsigned int i; + + /* Read using bytes. */ + for ( i = 0; i < 4; i++ ) + VPCI_READ_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX); + + /* Read using 2bytes. */ + for ( i = 0; i < 2; i++ ) + VPCI_READ_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX); + + VPCI_READ_CHECK(reg, 4, val); +} + +void multiwrite4_check(unsigned int reg) +{ + unsigned int i; + uint32_t val = 0xa2f51732; + + /* Write using bytes. */ + for ( i = 0; i < 4; i++ ) + VPCI_WRITE_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX); + multiread4_check(reg, val); + + /* Change the value each time to be sure writes work fine. */ + val = 0x2b836fda; + /* Write using 2bytes. */ + for ( i = 0; i < 2; i++ ) + VPCI_WRITE_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX); + multiread4_check(reg, val); + + val = 0xc4693beb; + VPCI_WRITE_CHECK(reg, 4, val); + multiread4_check(reg, val); +} + +int +main(int argc, char **argv) +{ + /* Index storage by offset. */ + uint32_t r0 = 0xdeadbeef; + uint8_t r5 = 0xef; + uint8_t r6 = 0xbe; + uint8_t r7 = 0xef; + uint16_t r12 = 0x8696; + uint8_t r16[4] = { }; + uint16_t r20[2] = { }; + uint32_t r24 = 0; + uint8_t r28, r30; + unsigned int i; + int rc; + + INIT_LIST_HEAD(&vpci.handlers); + + VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0); + VPCI_READ_CHECK(0, 4, r0); + VPCI_WRITE_CHECK(0, 4, 0xbcbcbcbc); + + VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5); + VPCI_READ_CHECK(5, 1, r5); + VPCI_WRITE_CHECK(5, 1, 0xba); + + VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6); + VPCI_READ_CHECK(6, 1, r6); + VPCI_WRITE_CHECK(6, 1, 0xba); + + VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7); + VPCI_READ_CHECK(7, 1, r7); + VPCI_WRITE_CHECK(7, 1, 0xbd); + + VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12); + VPCI_READ_CHECK(12, 2, r12); + VPCI_READ_CHECK(12, 4, 0xffff8696); + + /* + * At this point we have the following layout: + * + * Note that this refers to the position of the variables, + * but the value has already changed from the one given at + * initialization time because write tests have been performed. + * + * 32 24 16 8 0 + * +-----+-----+-----+-----+ + * | r0 | 0 + * +-----+-----+-----+-----+ + * | r7 | r6 | r5 |/////| 32 + * +-----+-----+-----+-----| + * |///////////////////////| 64 + * +-----------+-----------+ + * |///////////| r12 | 96 + * +-----------+-----------+ + * ... + * / = empty. + */ + + /* Try to add an overlapping register handler. */ + VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4); + + /* Try to add a non-aligned register. */ + VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2); + + /* Try to add a register with wrong size. */ + VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3); + + /* Try to add a register with missing handlers. */ + VPCI_ADD_INVALID_REG(NULL, NULL, 8, 2); + + /* Read/write of unset register. */ + VPCI_READ_CHECK(8, 4, 0xffffffff); + VPCI_READ_CHECK(8, 2, 0xffff); + VPCI_READ_CHECK(8, 1, 0xff); + VPCI_WRITE(10, 2, 0xbeef); + VPCI_READ_CHECK(10, 2, 0xffff); + + /* Read of multiple registers */ + VPCI_WRITE_CHECK(7, 1, 0xbd); + VPCI_READ_CHECK(4, 4, 0xbdbabaff); + + /* Partial read of a register. */ + VPCI_WRITE_CHECK(0, 4, 0x1a1b1c1d); + VPCI_READ_CHECK(2, 1, 0x1b); + VPCI_READ_CHECK(6, 2, 0xbdba); + + /* Write of multiple registers. */ + VPCI_WRITE_CHECK(4, 4, 0xaabbccff); + + /* Partial write of a register. */ + VPCI_WRITE_CHECK(2, 1, 0xfe); + VPCI_WRITE_CHECK(6, 2, 0xfebc); + + /* + * Test all possible read/write size combinations. + * + * Place 4 1B registers at 128bits (16B), 2 2B registers at 160bits + * (20B) and finally 1 4B register at 192bits (24B). + * + * Then perform all possible write and read sizes on each of them. + * + * ... + * 32 24 16 8 0 + * +------+------+------+------+ + * |r16[3]|r16[2]|r16[1]|r16[0]| 16 + * +------+------+------+------+ + * | r20[1] | r20[0] | 20 + * +-------------+-------------| + * | r24 | 24 + * +-------------+-------------+ + * + */ + VPCI_ADD_REG(vpci_read8, vpci_write8, 16, 1, r16[0]); + VPCI_ADD_REG(vpci_read8, vpci_write8, 17, 1, r16[1]); + VPCI_ADD_REG(vpci_read8, vpci_write8, 18, 1, r16[2]); + VPCI_ADD_REG(vpci_read8, vpci_write8, 19, 1, r16[3]); + + VPCI_ADD_REG(vpci_read16, vpci_write16, 20, 2, r20[0]); + VPCI_ADD_REG(vpci_read16, vpci_write16, 22, 2, r20[1]); + + VPCI_ADD_REG(vpci_read32, vpci_write32, 24, 4, r24); + + /* Check the initial value is 0. */ + multiread4_check(16, 0); + multiread4_check(20, 0); + multiread4_check(24, 0); + + multiwrite4_check(16); + multiwrite4_check(20); + multiwrite4_check(24); + + /* + * Check multiple non-consecutive gaps on the same read/write: + * + * 32 24 16 8 0 + * +------+------+------+------+ + * |//////| r30 |//////| r28 | 28 + * +------+------+------+------+ + * + */ + VPCI_ADD_REG(vpci_read8, vpci_write8, 28, 1, r28); + VPCI_ADD_REG(vpci_read8, vpci_write8, 30, 1, r30); + VPCI_WRITE_CHECK(28, 4, 0xffacffdc); + + /* Finally try to remove a couple of registers. */ + VPCI_REMOVE_REG(28, 1); + VPCI_REMOVE_REG(24, 4); + VPCI_REMOVE_REG(12, 2); + + VPCI_REMOVE_INVALID_REG(20, 1); + VPCI_REMOVE_INVALID_REG(16, 2); + VPCI_REMOVE_INVALID_REG(30, 2); + + return 0; +} + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S index 2d54f224ec..6690516ff1 100644 --- a/xen/arch/arm/xen.lds.S +++ b/xen/arch/arm/xen.lds.S @@ -41,6 +41,11 @@ SECTIONS . = ALIGN(PAGE_SIZE); .rodata : { +#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM) + __start_vpci_array = .; + *(.rodata.vpci) + __end_vpci_array = .; +#endif _srodata = .; /* Read-only data */ /* Bug frames table */ __start_bug_frames = .; @@ -131,6 +136,11 @@ SECTIONS } :text . = ALIGN(PAGE_SIZE); .init.data : { +#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM) + __start_vpci_array = .; + *(.init.rodata.vpci) + __end_vpci_array = .; +#endif *(.init.rodata) *(.init.rodata.rel) *(.init.rodata.str*) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index baaf8151d2..7a862ea671 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -376,11 +376,21 @@ static bool emulation_flags_ok(const struct domain *d, uint32_t emflags) if ( is_hvm_domain(d) ) { if ( is_hardware_domain(d) && - emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) ) - return false; - if ( !is_hardware_domain(d) && emflags && - emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC ) + emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC| + XEN_X86_EMU_VPCI) ) return false; + if ( !is_hardware_domain(d) ) + { + switch ( emflags ) + { + case XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI: + case XEN_X86_EMU_LAPIC: + case 0: + break; + default: + return false; + } + } } else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT ) { diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index 6cb903def5..cc73df8dc7 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -36,6 +36,7 @@ #include <xen/rangeset.h> #include <xen/monitor.h> #include <xen/warning.h> +#include <xen/vpci.h> #include <asm/shadow.h> #include <asm/hap.h> #include <asm/current.h> @@ -629,6 +630,7 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags, d->arch.hvm_domain.io_bitmap = hvm_io_bitmap; register_g2m_portio_handler(d); + register_vpci_portio_handler(d); hvm_ioreq_init(d); diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c index 074cba89da..c3b68eb257 100644 --- a/xen/arch/x86/hvm/io.c +++ b/xen/arch/x86/hvm/io.c @@ -25,6 +25,7 @@ #include <xen/trace.h> #include <xen/event.h> #include <xen/hypercall.h> +#include <xen/vpci.h> #include <asm/current.h> #include <asm/cpufeature.h> #include <asm/processor.h> @@ -260,7 +261,7 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr, unsigned int *bus, unsigned int *slot, unsigned int *func) { - unsigned long bdf; + unsigned int bdf; ASSERT(CF8_ENABLED(cf8)); @@ -275,6 +276,121 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr, return CF8_ADDR_LO(cf8) | (addr & 3); } +/* Do some sanity checks. */ +static bool vpci_access_allowed(unsigned int reg, unsigned int len) +{ + /* Check access size. */ + if ( len != 1 && len != 2 && len != 4 ) + return false; + + /* Check that access is size aligned. */ + if ( (reg & (len - 1)) ) + return false; + + return true; +} + +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */ +static bool vpci_portio_accept(const struct hvm_io_handler *handler, + const ioreq_t *p) +{ + return (p->addr == 0xcf8 && p->size == 4) || (p->addr & ~3) == 0xcfc; +} + +static int vpci_portio_read(const struct hvm_io_handler *handler, + uint64_t addr, uint32_t size, uint64_t *data) +{ + struct domain *d = current->domain; + unsigned int bus, slot, func, reg; + + *data = ~(uint64_t)0; + + vpci_rlock(d); + if ( addr == 0xcf8 ) + { + ASSERT(size == 4); + *data = d->arch.hvm_domain.pci_cf8; + vpci_runlock(d); + return X86EMUL_OKAY; + } + if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) ) + { + vpci_runlock(d); + return X86EMUL_OKAY; + } + + reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot, + &func); + + if ( !vpci_access_allowed(reg, size) ) + { + vpci_runlock(d); + return X86EMUL_OKAY; + } + + *data = vpci_read(0, bus, slot, func, reg, size); + vpci_runlock(d); + + return X86EMUL_OKAY; +} + +static int vpci_portio_write(const struct hvm_io_handler *handler, + uint64_t addr, uint32_t size, uint64_t data) +{ + struct domain *d = current->domain; + unsigned int bus, slot, func, reg; + + vpci_wlock(d); + if ( addr == 0xcf8 ) + { + ASSERT(size == 4); + d->arch.hvm_domain.pci_cf8 = data; + vpci_wunlock(d); + return X86EMUL_OKAY; + } + if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) ) + { + vpci_wunlock(d); + return X86EMUL_OKAY; + } + + reg = hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot, + &func); + + if ( !vpci_access_allowed(reg, size) ) + { + vpci_wunlock(d); + return X86EMUL_OKAY; + } + + vpci_write(0, bus, slot, func, reg, size, data); + vpci_wunlock(d); + + return X86EMUL_OKAY; +} + +static const struct hvm_io_ops vpci_portio_ops = { + .accept = vpci_portio_accept, + .read = vpci_portio_read, + .write = vpci_portio_write, +}; + +void register_vpci_portio_handler(struct domain *d) +{ + struct hvm_io_handler *handler; + + if ( !has_vpci(d) ) + return; + + handler = hvm_next_io_handler(d); + if ( !handler ) + return; + + rwlock_init(&d->arch.hvm_domain.vpci_lock); + handler->type = IOREQ_TYPE_PIO; + handler->ops = &vpci_portio_ops; +} + /* * Local variables: * mode: C diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index db5df6956d..5b2c0e3fc3 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -1566,7 +1566,8 @@ void __init noreturn __start_xen(unsigned long mbi_p) domcr_flags |= DOMCRF_hvm | ((hvm_funcs.hap_supported && !opt_dom0_shadow) ? DOMCRF_hap : 0); - config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC; + config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC| + XEN_X86_EMU_VPCI; } /* Create initial domain 0. */ diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S index ff08bbe42a..af1b30cb2b 100644 --- a/xen/arch/x86/xen.lds.S +++ b/xen/arch/x86/xen.lds.S @@ -76,6 +76,11 @@ SECTIONS __2M_rodata_start = .; /* Start of 2M superpages, mapped RO. */ .rodata : { +#if defined(CONFIG_HAS_PCI) && defined(CONFIG_LATE_HWDOM) + __start_vpci_array = .; + *(.rodata.vpci) + __end_vpci_array = .; +#endif _srodata = .; /* Bug frames table */ __start_bug_frames = .; @@ -167,6 +172,11 @@ SECTIONS _einittext = .; } :text .init.data : { +#if defined(CONFIG_HAS_PCI) && !defined(CONFIG_LATE_HWDOM) + __start_vpci_array = .; + *(.init.rodata.vpci) + __end_vpci_array = .; +#endif *(.init.rodata) *(.init.rodata.rel) *(.init.rodata.str*) diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile index 19391802a8..d51c766453 100644 --- a/xen/drivers/Makefile +++ b/xen/drivers/Makefile @@ -1,6 +1,6 @@ subdir-y += char subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq -subdir-$(CONFIG_HAS_PCI) += pci +subdir-$(CONFIG_HAS_PCI) += pci vpci subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough subdir-$(CONFIG_ACPI) += acpi subdir-$(CONFIG_VIDEO) += video diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index 27bdb7163c..54326cf0b8 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -30,6 +30,7 @@ #include <xen/radix-tree.h> #include <xen/softirq.h> #include <xen/tasklet.h> +#include <xen/vpci.h> #include <xsm/xsm.h> #include <asm/msi.h> #include "ats.h" @@ -1030,9 +1031,10 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt, struct pci_dev *pdev) { u8 devfn = pdev->devfn; + int err; do { - int err = ctxt->handler(devfn, pdev); + err = ctxt->handler(devfn, pdev); if ( err ) { @@ -1045,6 +1047,11 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt, devfn += pdev->phantom_stride; } while ( devfn != pdev->devfn && PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) ); + + err = vpci_add_handlers(pdev); + if ( err ) + printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n", + ctxt->d->domain_id, err); } static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg) diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile new file mode 100644 index 0000000000..840a906470 --- /dev/null +++ b/xen/drivers/vpci/Makefile @@ -0,0 +1 @@ +obj-y += vpci.o diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c new file mode 100644 index 0000000000..f63de97e89 --- /dev/null +++ b/xen/drivers/vpci/vpci.c @@ -0,0 +1,443 @@ +/* + * Generic functionality for handling accesses to the PCI configuration space + * from guests. + * + * Copyright (C) 2017 Citrix Systems R&D + * + * This program is free software; you can redistribute it and/or + * modify it under the terms and conditions of the GNU General Public + * License, version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; If not, see <http://www.gnu.org/licenses/>. + */ + +#include <xen/sched.h> +#include <xen/vpci.h> + +extern vpci_register_init_t *const __start_vpci_array[]; +extern vpci_register_init_t *const __end_vpci_array[]; +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array) + +/* Internal struct to store the emulated PCI registers. */ +struct vpci_register { + vpci_read_t *read; + vpci_write_t *write; + unsigned int size; + unsigned int offset; + void *private; + struct list_head node; +}; + +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev) +{ + unsigned int i; + int rc = 0; + + if ( !has_vpci(pdev->domain) ) + return 0; + + pdev->vpci = xzalloc(struct vpci); + if ( !pdev->vpci ) + return -ENOMEM; + + INIT_LIST_HEAD(&pdev->vpci->handlers); + + for ( i = 0; i < NUM_VPCI_INIT; i++ ) + { + rc = __start_vpci_array[i](pdev); + if ( rc ) + break; + } + + if ( rc ) + { + while ( !list_empty(&pdev->vpci->handlers) ) + { + struct vpci_register *r = list_first_entry(&pdev->vpci->handlers, + struct vpci_register, + node); + + list_del(&r->node); + xfree(r); + } + xfree(pdev->vpci); + } + + return rc; +} + +static int vpci_register_cmp(const struct vpci_register *r1, + const struct vpci_register *r2) +{ + /* Return 0 if registers overlap. */ + if ( r1->offset < r2->offset + r2->size && + r2->offset < r1->offset + r1->size ) + return 0; + if ( r1->offset < r2->offset ) + return -1; + if ( r1->offset > r2->offset ) + return 1; + + ASSERT_UNREACHABLE(); + return 0; +} + +/* Dummy hooks, writes are ignored, reads return 1's */ +static uint32_t vpci_ignored_read(struct pci_dev *pdev, unsigned int reg, + const void *data) +{ + return ~(uint32_t)0; +} + +static void vpci_ignored_write(struct pci_dev *pdev, unsigned int reg, + uint32_t val, void *data) +{ +} + +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t *read_handler, + vpci_write_t *write_handler, unsigned int offset, + unsigned int size, void *data) +{ + struct list_head *prev; + struct vpci_register *r; + + /* Some sanity checks. */ + if ( (size != 1 && size != 2 && size != 4) || + offset >= PCI_CFG_SPACE_EXP_SIZE || (offset & (size - 1)) || + (!read_handler && !write_handler) ) + return -EINVAL; + + r = xmalloc(struct vpci_register); + if ( !r ) + return -ENOMEM; + + r->read = read_handler ?: vpci_ignored_read; + r->write = write_handler ?: vpci_ignored_write; + r->size = size; + r->offset = offset; + r->private = data; + + vpci_wlock(pdev->domain); + + /* The list of handlers must be keep sorted at all times. */ + list_for_each ( prev, &pdev->vpci->handlers ) + { + const struct vpci_register *this = + list_entry(prev, const struct vpci_register, node); + int cmp = vpci_register_cmp(r, this); + + if ( cmp < 0 ) + break; + if ( cmp == 0 ) + { + vpci_wunlock(pdev->domain); + xfree(r); + return -EEXIST; + } + } + + list_add_tail(&r->node, prev); + vpci_wunlock(pdev->domain); + + return 0; +} + +int vpci_remove_register(const struct pci_dev *pdev, unsigned int offset, + unsigned int size) +{ + const struct vpci_register r = { .offset = offset, .size = size }; + struct vpci_register *rm; + + vpci_wlock(pdev->domain); + list_for_each_entry ( rm, &pdev->vpci->handlers, node ) + { + int cmp = vpci_register_cmp(&r, rm); + + /* + * NB: do not use a switch so that we can use break to + * get out of the list loop earlier if required. + */ + if ( !cmp && rm->offset == offset && rm->size == size ) + { + list_del(&rm->node); + vpci_wunlock(pdev->domain); + xfree(rm); + return 0; + } + if ( cmp <= 0 ) + break; + } + vpci_wunlock(pdev->domain); + + return -ENOENT; +} + +/* Wrappers for performing reads/writes to the underlying hardware. */ +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus, + unsigned int slot, unsigned int func, + unsigned int reg, unsigned int size) +{ + uint32_t data; + + switch ( size ) + { + case 4: + data = pci_conf_read32(seg, bus, slot, func, reg); + break; + case 3: + /* + * This is possible because a 4byte read can have 1byte trapped and + * the rest passed-through. + */ + if ( reg & 1 ) + { + data = pci_conf_read8(seg, bus, slot, func, reg); + data |= pci_conf_read16(seg, bus, slot, func, reg + 1) << 8; + } + else + { + data = pci_conf_read16(seg, bus, slot, func, reg); + data |= pci_conf_read8(seg, bus, slot, func, reg + 2) << 16; + } + break; + case 2: + data = pci_conf_read16(seg, bus, slot, func, reg); + break; + case 1: + data = pci_conf_read8(seg, bus, slot, func, reg); + break; + default: + ASSERT_UNREACHABLE(); + data = ~(uint32_t)0; + break; + } + + return data; +} + +static void vpci_write_hw(unsigned int seg, unsigned int bus, + unsigned int slot, unsigned int func, + unsigned int reg, unsigned int size, uint32_t data) +{ + switch ( size ) + { + case 4: + pci_conf_write32(seg, bus, slot, func, reg, data); + break; + case 3: + /* + * This is possible because a 4byte write can have 1byte trapped and + * the rest passed-through. + */ + if ( reg & 1 ) + { + pci_conf_write8(seg, bus, slot, func, reg, data); + pci_conf_write16(seg, bus, slot, func, reg + 1, data >> 8); + } + else + { + pci_conf_write16(seg, bus, slot, func, reg, data); + pci_conf_write8(seg, bus, slot, func, reg + 2, data >> 16); + } + break; + case 2: + pci_conf_write16(seg, bus, slot, func, reg, data); + break; + case 1: + pci_conf_write8(seg, bus, slot, func, reg, data); + break; + default: + ASSERT_UNREACHABLE(); + break; + } +} + +/* + * Merge new data into a partial result. + * + * Zero the bytes of 'data' from [offset, offset + size), and + * merge the value found in 'new' from [0, offset) left shifted + * by 'offset'. + */ +static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, + unsigned int offset) +{ + uint32_t mask = 0xffffffff >> (32 - 8 * size); + + return (data & ~(mask << (offset * 8))) | ((new & mask) << (offset * 8)); +} + +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot, + unsigned int func, unsigned int reg, unsigned int size) +{ + struct domain *d = current->domain; + struct pci_dev *pdev; + const struct vpci_register *r; + unsigned int data_offset = 0; + uint32_t data = ~(uint32_t)0; + + ASSERT(vpci_rlocked(d)); + + /* Find the PCI dev matching the address. */ + pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func)); + if ( !pdev ) + return vpci_read_hw(seg, bus, slot, func, reg, size); + + /* Read from the hardware or the emulated register handlers. */ + list_for_each_entry ( r, &pdev->vpci->handlers, node ) + { + const struct vpci_register emu = { + .offset = reg + data_offset, + .size = size - data_offset + }; + int cmp = vpci_register_cmp(&emu, r); + uint32_t val; + unsigned int read_size; + + if ( cmp < 0 ) + break; + if ( cmp > 0 ) + continue; + + if ( emu.offset < r->offset ) + { + /* Heading gap, read partial content from hardware. */ + read_size = r->offset - emu.offset; + val = vpci_read_hw(seg, bus, slot, func, emu.offset, read_size); + data = merge_result(data, val, read_size, data_offset); + data_offset += read_size; + } + + val = r->read(pdev, r->offset, r->private); + + /* Check if the read is in the middle of a register. */ + if ( r->offset < emu.offset ) + val >>= (emu.offset - r->offset) * 8; + + /* Find the intersection size between the two sets. */ + read_size = min(emu.offset + emu.size, r->offset + r->size) - + max(emu.offset, r->offset); + /* Merge the emulated data into the native read value. */ + data = merge_result(data, val, read_size, data_offset); + data_offset += read_size; + if ( data_offset == size ) + break; + ASSERT(data_offset < size); + } + + if ( data_offset < size ) + { + /* Tailing gap, read the remaining. */ + uint32_t tmp_data = vpci_read_hw(seg, bus, slot, func, + reg + data_offset, + size - data_offset); + + data = merge_result(data, tmp_data, size - data_offset, data_offset); + } + + return data & (0xffffffff >> (32 - 8 * size)); +} + +/* + * Perform a maybe partial write to a register. + * + * Note that this will only work for simple registers, if Xen needs to + * trap accesses to rw1c registers (like the status PCI header register) + * the logic in vpci_write will have to be expanded in order to correctly + * deal with them. + */ +static void vpci_write_helper(struct pci_dev *pdev, + const struct vpci_register *r, unsigned int size, + unsigned int offset, uint32_t data) +{ + ASSERT(size <= r->size); + + if ( size != r->size ) + { + uint32_t val; + + val = r->read(pdev, r->offset, r->private); + data = merge_result(val, data, size, offset); + } + + r->write(pdev, r->offset, data & (0xffffffff >> (32 - 8 * r->size)), + r->private); +} + +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot, + unsigned int func, unsigned int reg, unsigned int size, + uint32_t data) +{ + struct domain *d = current->domain; + struct pci_dev *pdev; + const struct vpci_register *r; + unsigned int data_offset = 0; + + ASSERT(vpci_wlocked(d)); + + /* + * Find the PCI dev matching the address. + * Passthrough everything that's not trapped. + */ + pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func)); + if ( !pdev ) + { + vpci_write_hw(seg, bus, slot, func, reg, size, data); + return; + } + + /* Write the value to the hardware or emulated registers. */ + list_for_each_entry ( r, &pdev->vpci->handlers, node ) + { + const struct vpci_register emu = { + .offset = reg + data_offset, + .size = size - data_offset + }; + int cmp = vpci_register_cmp(&emu, r); + unsigned int write_size; + + if ( cmp < 0 ) + break; + if ( cmp > 0 ) + continue; + + if ( emu.offset < r->offset ) + { + /* Heading gap, write partial content to hardware. */ + vpci_write_hw(seg, bus, slot, func, emu.offset, + r->offset - emu.offset, data >> (data_offset * 8)); + data_offset += r->offset - emu.offset; + } + + /* Find the intersection size between the two sets. */ + write_size = min(emu.offset + emu.size, r->offset + r->size) - + max(emu.offset, r->offset); + vpci_write_helper(pdev, r, write_size, reg + data_offset - r->offset, + data >> (data_offset * 8)); + data_offset += write_size; + if ( data_offset == size ) + break; + ASSERT(data_offset < size); + } + + if ( data_offset < size ) + /* Tailing gap, write the remaining. */ + vpci_write_hw(seg, bus, slot, func, reg + data_offset, + size - data_offset, data >> (data_offset * 8)); +} + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */ diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h index c10522b7f5..ec14343b27 100644 --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -427,6 +427,7 @@ struct arch_domain #define has_vpit(d) (!!((d)->arch.emulation_flags & XEN_X86_EMU_PIT)) #define has_pirq(d) (!!((d)->arch.emulation_flags & \ XEN_X86_EMU_USE_PIRQ)) +#define has_vpci(d) (!!((d)->arch.emulation_flags & XEN_X86_EMU_VPCI)) #define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list)) diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h index d2899c9bb2..3a54d50606 100644 --- a/xen/include/asm-x86/hvm/domain.h +++ b/xen/include/asm-x86/hvm/domain.h @@ -184,6 +184,9 @@ struct hvm_domain { /* List of guest to machine IO ports mapping. */ struct list_head g2m_ioport_list; + /* Lock for the PCI emulation layer (vPCI). */ + rwlock_t vpci_lock; + /* List of permanently write-mapped pages. */ struct { spinlock_t lock; diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h index 51659b6c7f..01322a2e21 100644 --- a/xen/include/asm-x86/hvm/io.h +++ b/xen/include/asm-x86/hvm/io.h @@ -160,6 +160,9 @@ unsigned int hvm_pci_decode_addr(unsigned int cf8, unsigned int addr, */ void register_g2m_portio_handler(struct domain *d); +/* HVM port IO handler for PCI accesses. */ +void register_vpci_portio_handler(struct domain *d); + #endif /* __ASM_X86_HVM_IO_H__ */ diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h index f21332e897..86a1a09a8d 100644 --- a/xen/include/public/arch-x86/xen.h +++ b/xen/include/public/arch-x86/xen.h @@ -295,12 +295,15 @@ struct xen_arch_domainconfig { #define XEN_X86_EMU_PIT (1U<<_XEN_X86_EMU_PIT) #define _XEN_X86_EMU_USE_PIRQ 9 #define XEN_X86_EMU_USE_PIRQ (1U<<_XEN_X86_EMU_USE_PIRQ) +#define _XEN_X86_EMU_VPCI 10 +#define XEN_X86_EMU_VPCI (1U<<_XEN_X86_EMU_VPCI) #define XEN_X86_EMU_ALL (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET | \ XEN_X86_EMU_PM | XEN_X86_EMU_RTC | \ XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC | \ XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU | \ - XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ) + XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\ + XEN_X86_EMU_VPCI) uint32_t emulation_flags; }; diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h index ea6a66b248..ad5d3ca031 100644 --- a/xen/include/xen/pci.h +++ b/xen/include/xen/pci.h @@ -88,6 +88,9 @@ struct pci_dev { #define PT_FAULT_THRESHOLD 10 } fault; u64 vf_rlen[6]; + + /* Data for vPCI. */ + struct vpci *vpci; }; #define for_each_pdev(domain, pdev) \ diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h index ecd6124d91..cc4ee3b83e 100644 --- a/xen/include/xen/pci_regs.h +++ b/xen/include/xen/pci_regs.h @@ -23,6 +23,14 @@ #define LINUX_PCI_REGS_H /* + * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of + * configuration space. PCI-X Mode 2 and PCIe devices have 4096 bytes of + * configuration space. + */ +#define PCI_CFG_SPACE_SIZE 256 +#define PCI_CFG_SPACE_EXP_SIZE 4096 + +/* * Under PCI, each device has 256 bytes of configuration address space, * of which the first 64 bytes are standardized as follows: */ diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h new file mode 100644 index 0000000000..12f7287d7b --- /dev/null +++ b/xen/include/xen/vpci.h @@ -0,0 +1,80 @@ +#ifndef _VPCI_ +#define _VPCI_ + +#include <xen/pci.h> +#include <xen/types.h> +#include <xen/list.h> + +/* + * Helpers for locking/unlocking. + * + * NB: the recursive variants are used so that spin_is_locked + * returns whether the lock is hold by the current CPU (instead + * of just returning whether the lock is hold by any CPU). + */ +#define vpci_rlock(d) read_lock(&(d)->arch.hvm_domain.vpci_lock) +#define vpci_wlock(d) write_lock(&(d)->arch.hvm_domain.vpci_lock) +#define vpci_runlock(d) read_unlock(&(d)->arch.hvm_domain.vpci_lock) +#define vpci_wunlock(d) write_unlock(&(d)->arch.hvm_domain.vpci_lock) +#define vpci_rlocked(d) rw_is_locked(&(d)->arch.hvm_domain.vpci_lock) +#define vpci_wlocked(d) rw_is_write_locked(&(d)->arch.hvm_domain.vpci_lock) + +/* + * The vPCI handlers will never be called concurrently for the same domain, it + * is guaranteed that the vpci domain lock will always be locked when calling + * any handler. + */ +typedef uint32_t vpci_read_t(struct pci_dev *pdev, unsigned int reg, + const void *data); + +typedef void vpci_write_t(struct pci_dev *pdev, unsigned int reg, + uint32_t val, void *data); + +typedef int vpci_register_init_t(struct pci_dev *dev); + +#ifdef CONFIG_LATE_HWDOM +#define VPCI_SECTION ".rodata.vpci" +#else +#define VPCI_SECTION ".init.rodata.vpci" +#endif + +#define REGISTER_VPCI_INIT(x) \ + static vpci_register_init_t *const x##_entry \ + __used_section(VPCI_SECTION) = x + +/* Add vPCI handlers to device. */ +int __must_check vpci_add_handlers(struct pci_dev *dev); + +/* Add/remove a register handler. */ +int __must_check vpci_add_register(const struct pci_dev *pdev, + vpci_read_t *read_handler, + vpci_write_t *write_handler, + unsigned int offset, unsigned int size, + void *data); +int __must_check vpci_remove_register(const struct pci_dev *pdev, + unsigned int offset, + unsigned int size); + +/* Generic read/write handlers for the PCI config space. */ +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot, + unsigned int func, unsigned int reg, unsigned int size); +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot, + unsigned int func, unsigned int reg, unsigned int size, + uint32_t data); + +struct vpci { + /* List of vPCI handlers for a device. */ + struct list_head handlers; +}; + +#endif + +/* + * Local variables: + * mode: C + * c-file-style: "BSD" + * c-basic-offset: 4 + * tab-width: 4 + * indent-tabs-mode: nil + * End: + */
This functionality is going to reside in vpci.c (and the corresponding vpci.h header), and should be arch-agnostic. The handlers introduced in this patch setup the basic functionality required in order to trap accesses to the PCI config space, and allow decoding the address and finding the corresponding handler that should handle the access (although no handlers are implemented). Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are setup inside of a x86 HVM file, since that's not shared with other arches. A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen whether a domain should use the newly introduced vPCI handlers, this is only enabled for PVH Dom0 at the moment. A very simple user-space test is also provided, so that the basic functionality of the vPCI traps can be asserted. This has been proven quite helpful during development, since the logic to handle partial accesses or accesses that expand across multiple registers is not trivial. The handlers for the registers are added to a linked list that's keep sorted at all times. Both the read and write handlers support accesses that expand across multiple emulated registers and contain gaps not emulated. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Paul Durrant <paul.durrant@citrix.com> --- Changes since v4: * User-space test harness: - Do not redirect the output of the test. - Add main.c and emul.h as dependencies of the Makefile target. - Use the same rule to modify the vpci and list headers. - Remove underscores from local macro variables. - Add _check suffix to the test harness multiread function. - Change the value written by every different size in the multiwrite test. - Use { } to initialize the r16 and r20 arrays (instead of { 0 }). - Perform some of the read checks with the local variable directly. - Expand some comments. - Implement a dummy rwlock. * Hypervisor code: - Guard the linker script changes with CONFIG_HAS_PCI. - Rename vpci_access_check to vpci_access_allowed and make it return bool. - Make hvm_pci_decode_addr return the register as return value. - Use ~3 instead of 0xfffc to remove the register offset when checking accesses to IO ports. - s/head/prev in vpci_add_register. - Add parentheses around & in vpci_add_register. - Fix register removal. - Change the BUGs in vpci_{read/write}_hw helpers to ASSERT_UNREACHABLE. - Make merge_result static and change the computation of the mask to avoid using a uint64_t. - Modify vpci_read to only read from hardware the not-emulated gaps. - Remove the vpci_val union and use a uint32_t instead. - Change handler read type to return a uint32_t instead of modifying a variable passed by reference. - Constify the data opaque parameter of read handlers. - Change the size parameter of the vpci_{read/write} functions to unsigned int. - Place the array of initialization handlers in init.rodata or .rodata depending on whether late-hwdom is enabled. - Remove the pci_devs lock, assume the Dom0 is well behaved and won't remove the device while trying to access it. - Change the recursive spinlock into a rw lock for performance reasons. Changes since v3: * User-space test harness: - Fix spaces in container_of macro. - Implement a dummy locking functions. - Remove 'current' macro make current a pointer to the statically allocated vpcu. - Remove unneeded parentheses in the pci_conf_readX macros. - Fix the name of the write test macro. - Remove the dummy EXPORT_SYMBOL macro (this was needed by the RB code only). - Import the max macro. - Test all possible read/write size combinations with all possible emulated register sizes. - Introduce a test for register removal. * Hypervisor code: - Use a sorted list in order to store the config space handlers. - Remove some unneeded 'else' branches. - Make the IO port handlers always return X86EMUL_OKAY, and set the data to all 1's in case of read failure (write are simply ignored). - In hvm_select_ioreq_server reuse local variables when calling XEN_DMOP_PCI_SBDF. - Store the pointers to the initialization functions in the .rodata section. - Do not ignore the return value of xen_vpci_add_handlers in setup_one_hwdom_device. - Remove the vpci_init macro. - Do not hide the pointers inside of the vpci_{read/write}_t typedefs. - Rename priv_data to private in vpci_register. - Simplify checking for register overlap in vpci_register_cmp. - Check that the offset and the length match before removing a register in xen_vpci_remove_register. - Make vpci_read_hw return a value rather than storing it in a pointer passed by parameter. - Handler dispatcher functions vpci_{read/write} no longer return an error code, errors on reads/writes should be treated like hardware (writes ignored, reads return all 1's or garbage). - Make sure pcidevs is locked before calling pci_get_pdev_by_domain. - Use a recursive spinlock for the vpci lock, so that spin_is_locked checks that the current CPU is holding the lock. - Make the code less error-chatty by removing some of the printk's. - Pass the slot and the function as separate parameters to the handler dispatchers (instead of passing devfn). - Allow handlers to be registered with either a read or write function only, the missing handler will be replaced by a dummy handler (writes ignored, reads return 1's). - Introduce PCI_CFG_SPACE_* defines from Linux. - Simplify the handler dispatchers by removing the recursion, now the dispatchers iterate over the list of sorted handlers and call them in order. - Remove the GENMASK_BYTES, SHIFT_RIGHT_BYTES and ADD_RESULT macros, and instead provide a merge_result function in order to merge a register output into a partial result. - Rename the fields of the vpci_val union to u8/u16/u32. - Remove the return values from the read/write handlers, errors should be handled internally and signaled as would be done on native hardware. - Remove the usage of the GENMASK macro. Changes since v2: - Generalize the PCI address decoding and use it for IOREQ code also. Changes since v1: - Allow access to cross a word-boundary. - Add locking. - Add cleanup to xen_vpci_add_handlers in case of failure. --- .gitignore | 3 + tools/libxl/libxl_x86.c | 2 +- tools/tests/Makefile | 1 + tools/tests/vpci/Makefile | 37 ++++ tools/tests/vpci/emul.h | 128 +++++++++++ tools/tests/vpci/main.c | 314 +++++++++++++++++++++++++++ xen/arch/arm/xen.lds.S | 10 + xen/arch/x86/domain.c | 18 +- xen/arch/x86/hvm/hvm.c | 2 + xen/arch/x86/hvm/io.c | 118 +++++++++- xen/arch/x86/setup.c | 3 +- xen/arch/x86/xen.lds.S | 10 + xen/drivers/Makefile | 2 +- xen/drivers/passthrough/pci.c | 9 +- xen/drivers/vpci/Makefile | 1 + xen/drivers/vpci/vpci.c | 443 ++++++++++++++++++++++++++++++++++++++ xen/include/asm-x86/domain.h | 1 + xen/include/asm-x86/hvm/domain.h | 3 + xen/include/asm-x86/hvm/io.h | 3 + xen/include/public/arch-x86/xen.h | 5 +- xen/include/xen/pci.h | 3 + xen/include/xen/pci_regs.h | 8 + xen/include/xen/vpci.h | 80 +++++++ 23 files changed, 1194 insertions(+), 10 deletions(-) create mode 100644 tools/tests/vpci/Makefile create mode 100644 tools/tests/vpci/emul.h create mode 100644 tools/tests/vpci/main.c create mode 100644 xen/drivers/vpci/Makefile create mode 100644 xen/drivers/vpci/vpci.c create mode 100644 xen/include/xen/vpci.h