Message ID | 1656433761-9163-1-git-send-email-akaher@vmware.com
---|---
State | Changes Requested
Series | MMIO should have more priority then IO
[+cc Matthew] On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote: > Port IO instructions (PIO) are less efficient than MMIO (memory > mapped I/O). They require twice as many PCI accesses and PIO > instructions are serializing. As a result, MMIO should be preferred > when possible over PIO. > > Bare metal test result > 1 million reads using raw_pci_read() took: > PIO: 0.433153 Sec. > MMIO: 0.268792 Sec. > > Virtual Machine test result > 1 hundred thousand reads using raw_pci_read() took: > PIO: 12.809 Sec. > MMIO: took 8.517 Sec. > > Signed-off-by: Ajay Kaher <akaher@vmware.com> > --- > arch/x86/pci/common.c | 8 ++++---- > 1 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c > index 3507f456f..0b3383d9c 100644 > --- a/arch/x86/pci/common.c > +++ b/arch/x86/pci/common.c > @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops; > int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn, > int reg, int len, u32 *val) > { > + if (raw_pci_ext_ops) > + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); > if (domain == 0 && reg < 256 && raw_pci_ops) > return raw_pci_ops->read(domain, bus, devfn, reg, len, val); > - if (raw_pci_ext_ops) > - return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); > return -EINVAL; This organization of raw_pci_read() dates to b6ce068a1285 ("Change pci_raw_ops to pci_raw_read/write"), by Matthew. Cc'd him for comment, since I think he considered the ordering at the time. > } > > int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn, > int reg, int len, u32 val) > { > + if (raw_pci_ext_ops) > + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); > if (domain == 0 && reg < 256 && raw_pci_ops) > return raw_pci_ops->write(domain, bus, devfn, reg, len, val); > - if (raw_pci_ext_ops) > - return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); > return -EINVAL; > } > > -- > 2.30.0 >
On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote: > Port IO instructions (PIO) are less efficient than MMIO (memory > mapped I/O). They require twice as many PCI accesses and PIO > instructions are serializing. As a result, MMIO should be preferred > when possible over PIO. > > Bare metal test result > 1 million reads using raw_pci_read() took: > PIO: 0.433153 Sec. > MMIO: 0.268792 Sec. > > Virtual Machine test result > 1 hundred thousand reads using raw_pci_read() took: > PIO: 12.809 Sec. > MMIO: took 8.517 Sec. > > Signed-off-by: Ajay Kaher <akaher@vmware.com> > --- > arch/x86/pci/common.c | 8 ++++---- > 1 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c > index 3507f456f..0b3383d9c 100644 > --- a/arch/x86/pci/common.c > +++ b/arch/x86/pci/common.c > @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops; > int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn, > int reg, int len, u32 *val) > { > + if (raw_pci_ext_ops) > + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); > if (domain == 0 && reg < 256 && raw_pci_ops) > return raw_pci_ops->read(domain, bus, devfn, reg, len, val); > - if (raw_pci_ext_ops) > - return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); > return -EINVAL; > } > > int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn, > int reg, int len, u32 val) > { > + if (raw_pci_ext_ops) > + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); > if (domain == 0 && reg < 256 && raw_pci_ops) > return raw_pci_ops->write(domain, bus, devfn, reg, len, val); > - if (raw_pci_ext_ops) > - return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); > return -EINVAL; > } > > -- > 2.30.0 > <formletter> This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly. </formletter>
On June 28, 2022 11:12:41 PM PDT, Greg KH <gregkh@linuxfoundation.org> wrote: >On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote: >> Port IO instructions (PIO) are less efficient than MMIO (memory >> mapped I/O). They require twice as many PCI accesses and PIO >> instructions are serializing. As a result, MMIO should be preferred >> when possible over PIO. >> >> Bare metal test result >> 1 million reads using raw_pci_read() took: >> PIO: 0.433153 Sec. >> MMIO: 0.268792 Sec. >> >> Virtual Machine test result >> 1 hundred thousand reads using raw_pci_read() took: >> PIO: 12.809 Sec. >> MMIO: took 8.517 Sec. >> >> Signed-off-by: Ajay Kaher <akaher@vmware.com> >> --- >> arch/x86/pci/common.c | 8 ++++---- >> 1 files changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c >> index 3507f456f..0b3383d9c 100644 >> --- a/arch/x86/pci/common.c >> +++ b/arch/x86/pci/common.c >> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops; >> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn, >> int reg, int len, u32 *val) >> { >> + if (raw_pci_ext_ops) >> + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); >> if (domain == 0 && reg < 256 && raw_pci_ops) >> return raw_pci_ops->read(domain, bus, devfn, reg, len, val); >> - if (raw_pci_ext_ops) >> - return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); >> return -EINVAL; >> } >> >> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn, >> int reg, int len, u32 val) >> { >> + if (raw_pci_ext_ops) >> + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); >> if (domain == 0 && reg < 256 && raw_pci_ops) >> return raw_pci_ops->write(domain, bus, devfn, reg, len, val); >> - if (raw_pci_ext_ops) >> - return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); >> return -EINVAL; >> } >> >> -- >> 2.30.0 >> > ><formletter> > >This is not the correct way to submit patches for inclusion in the >stable kernel tree. Please read: > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html >for how to do this properly. > ></formletter> The statement in the header is also incorrect.
On 28/06/22, 11:39 PM, "Bjorn Helgaas" <helgaas@kernel.org> wrote: > [+cc Matthew] > > On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote: >> Port IO instructions (PIO) are less efficient than MMIO (memory >> mapped I/O). They require twice as many PCI accesses and PIO >> instructions are serializing. As a result, MMIO should be preferred >> when possible over PIO. >> >> Bare metal test result >> 1 million reads using raw_pci_read() took: >> PIO: 0.433153 Sec. >> MMIO: 0.268792 Sec. >> >> Virtual Machine test result >> 1 hundred thousand reads using raw_pci_read() took: >> PIO: 12.809 Sec. >> MMIO: took 8.517 Sec. >> >> Signed-off-by: Ajay Kaher <akaher@vmware.com> >> --- >> arch/x86/pci/common.c | 8 ++++---- >> 1 files changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c >> index 3507f456f..0b3383d9c 100644 >> --- a/arch/x86/pci/common.c >> +++ b/arch/x86/pci/common.c >> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops; >> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn, >> int reg, int len, u32 *val) >> { >> + if (raw_pci_ext_ops) >> + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); >> if (domain == 0 && reg < 256 && raw_pci_ops) >> return raw_pci_ops->read(domain, bus, devfn, reg, len, val); >> - if (raw_pci_ext_ops) >> - return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); >> return -EINVAL; > > This organization of raw_pci_read() dates to b6ce068a1285 ("Change > pci_raw_ops to pci_raw_read/write"), by Matthew. Cc'd him for > comment, since I think he considered the ordering at the time. Thanks Bjorn for quick response. Matthew, b6ce068a1285 is old commit. It will be very helpful if you could provide some detail on ordering as Bjorn mentioned above. - Ajay >> } >> >> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn, >> int reg, int len, u32 val) >> { >> + if (raw_pci_ext_ops) >> + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); >> if (domain == 0 && reg < 256 && raw_pci_ops) >> return raw_pci_ops->write(domain, bus, devfn, reg, len, val); >> - if (raw_pci_ext_ops) >> - return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val); >> return -EINVAL; >> } >> >> -- >> 2.30.0 >>
On Fri, Jul 08, 2022 at 05:56:07AM +0000, Ajay Kaher wrote: > > On 28/06/22, 11:39 PM, "Bjorn Helgaas" <helgaas@kernel.org> wrote: > > [+cc Matthew] > > > > On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote: > >> Port IO instructions (PIO) are less efficient than MMIO (memory > >> mapped I/O). They require twice as many PCI accesses and PIO > >> instructions are serializing. As a result, MMIO should be preferred > >> when possible over PIO. > >> > >> Bare metal test result > >> 1 million reads using raw_pci_read() took: > >> PIO: 0.433153 Sec. > >> MMIO: 0.268792 Sec. > >> > >> Virtual Machine test result > >> 1 hundred thousand reads using raw_pci_read() took: > >> PIO: 12.809 Sec. > >> MMIO: took 8.517 Sec. While this is true, we're talking about config space accesses. These should be rare. What workload does this make any measurable difference to? And looking at the results above, it's not so much the PIO vs MMIO that makes a difference, it's the virtualisation. A mmio access goes from 269ns to 85us. Rather than messing around with preferring MMIO over PIO for config space, having an "enlightenment" to do config space accesses would be a more profitable path. > >> Signed-off-by: Ajay Kaher <akaher@vmware.com> > >> --- > >> arch/x86/pci/common.c | 8 ++++---- > >> 1 files changed, 4 insertions(+), 4 deletions(-) > >> > >> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c > >> index 3507f456f..0b3383d9c 100644 > >> --- a/arch/x86/pci/common.c > >> +++ b/arch/x86/pci/common.c > >> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops; > >> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn, > >> int reg, int len, u32 *val) > >> { > >> + if (raw_pci_ext_ops) > >> + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); > >> if (domain == 0 && reg < 256 && raw_pci_ops) > >> return raw_pci_ops->read(domain, bus, devfn, reg, len, val); > >> - if (raw_pci_ext_ops) > >> - return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val); > >> return -EINVAL; > > > > This organization of raw_pci_read() dates to b6ce068a1285 ("Change > > pci_raw_ops to pci_raw_read/write"), by Matthew. Cc'd him for > > comment, since I think he considered the ordering at the time. > > Thanks Bjorn for quick response. > > Matthew, b6ce068a1285 is old commit. It will be very helpful if you could > provide some detail on ordering as Bjorn mentioned above. Sorry for the delay; this came in while I was on holiday. I'll note that b6ce068a1285 does _not_ cause any changes in whether MMIO or PIO is used for accesses below 256 bytes. Yes, it introduces this code, but it also removes lines like this: - if (reg < 256) - return pci_conf1_read(seg,bus,devfn,reg,len,value); from the MMCONFIG accessors. Those were introduced in a0ca99096094 by Ivan Kokshaysky. But if you look further back in the file history, you can find all kinds of nasty bugs; broken BIOSes, resource conflicts, bleh. You'd hope they've all been fixed by now, but do you want to bet? I still have a working machine here which hung when using MMCONFIG for all accesses. The problem lay in, IIRC, the graphics BAR. When attempting to size it (by writing all-ones to it and seeing what bits remained as zero), it happens to overlay the MMCONFIG area. That meant that the subsequent attempt to write to the BAR actually ended up being a write to graphics memory ... and so did all subsequent MMCONFIG accesses. 
In short, here be dragons, and you need to move very VERY carefully to avoid breaking people's machines. We do have the possibility of a white-list approach. We can set 'raw_pci_ops' to NULL on machines where we're certain mmconfig works. I still think you're better off having a special raw_pci_ops for vmware than you are dinking about trying to save 50% of the time. If any of this is really worth it at all, which I doubt.
On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote: > And looking at the results above, it's not so much the PIO vs MMIO > that makes a difference, it's the virtualisation. A mmio access goes > from 269ns to 85us. Rather than messing around with preferring MMIO > over PIO for config space, having an "enlightenment" to do config > space accesses would be a more profitable path. I am unfamiliar with the motivation for this patch, but I just wanted to briefly regard the advice about enlightenments. “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so let’s regard the generic term. I think that you consider the bare-metal results as the possible results from a paravirtual machine, which is mostly wrong. Para-virtualization usually still requires a VM-exit and for the most part the hypervisor/host runs similar code for MMIO/hypercall (conceptually; the code of paravirtual and fully-virtual devices is often different, but IIUC, this is not what Ajay measured). Para-virtualization could have *perhaps* helped to reduce the number of PIO/MMIO and improve performance this way. If, for instance, all the PIO/MMIO are done during initialization, a paravirtual interface can be used to batch them together, and that would help. But it is more complicated to get a performance benefit from paravirtualization if the PIO/MMIO accesses are “spread”, for instance, done after each interrupt. Para-virtualization and full-virtualization both have pros and cons. Para-virtualization is many times more efficient, but requires the VM to have dedicated device drivers for the matter. Try to run a less-common OS than Linux and it would not work since the OS would not have drivers for the para-virtual devices. And even if you add support today for para-virtual devices, there are many deployed OSes that do not have such support, and you would not be able to run them in a VM. Regardless of virtualization, Ajay’s results show PIO is slower on bare-metal, and according to his numbers by 165ns, which is significant. Emulating PIO in hypervisors on x86 is inherently more complex than MMIO, so the results he got would most likely happen on all hypervisors. tl;dr: Let’s keep this discussion focused and put paravirtualization aside. It is not a solution for all the problems in the world.
On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote: > On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote: > > > And looking at the results above, it's not so much the PIO vs MMIO > > that makes a difference, it's the virtualisation. A mmio access goes > > from 269ns to 85us. Rather than messing around with preferring MMIO > > over PIO for config space, having an "enlightenment" to do config > > space accesses would be a more profitable path. > > I am unfamiliar with the motivation for this patch, but I just wanted to > briefly regard the advice about enlightments. > > “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so > let’s regard the generic term. I think that you consider the bare-metal > results as the possible results from a paravirtual machine, which is mostly > wrong. Para-virtualization usually still requires a VM-exit and for the most > part the hypervisor/host runs similar code for MMIO/hypercall (conceptually; > the code of paravirtual and fully-virtual devices is often different, but > IIUC, this is not what Ajay measured). > > Para-virtualization could have *perhaps* helped to reduce the number of > PIO/MMIO and improve performance this way. If, for instance, all the > PIO/MMIO are done during initialization, a paravirtual interface can be use > to batch them together, and that would help. But it is more complicated to > get a performance benefit from paravirtualization if the PIO/MMIO accesses > are “spread”, for instance, done after each interrupt. What kind of lousy programming interface requires you to do a config space access after every interrupt? This is looney-tunes. You've used a lot of words to not answer the question that was so important that I asked it twice. What's the use case, what's the workload that would benefit from this patch? > Para-virtauilzation and full-virtualization both have pros and cons. > Para-virtualization is many times more efficient, but requires the VM to > have dedicated device drivers for the matter. Try to run a less-common OS > than Linux and it would not work since the OS would not have drivers for the > paras-virtual devices. And even if you add support today for a para-virtual > devices, there are many deployed OSes that do not have such support, and you > would not be able to run them in a VM. > > Regardless to virtualization, Ajay’s results show PIO is slower on > bare-metal, and according to his numbers by 165ns, which is significant. > Emulating PIO in hypervisors on x86 is inherently more complex than MMIO, so > the results he got would most likely happen on all hypervisors. > > tl;dr: Let’s keep this discussion focused and put paravirtualization aside. > It is not a solution for all the problems in the world.
On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote: > ⚠ External Email > > On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote: >> On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote: >> >>> And looking at the results above, it's not so much the PIO vs MMIO >>> that makes a difference, it's the virtualisation. A mmio access goes >>> from 269ns to 85us. Rather than messing around with preferring MMIO >>> over PIO for config space, having an "enlightenment" to do config >>> space accesses would be a more profitable path. >> >> I am unfamiliar with the motivation for this patch, but I just wanted to >> briefly regard the advice about enlightments. >> >> “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so >> let’s regard the generic term. I think that you consider the bare-metal >> results as the possible results from a paravirtual machine, which is mostly >> wrong. Para-virtualization usually still requires a VM-exit and for the most >> part the hypervisor/host runs similar code for MMIO/hypercall (conceptually; >> the code of paravirtual and fully-virtual devices is often different, but >> IIUC, this is not what Ajay measured). >> >> Para-virtualization could have *perhaps* helped to reduce the number of >> PIO/MMIO and improve performance this way. If, for instance, all the >> PIO/MMIO are done during initialization, a paravirtual interface can be use >> to batch them together, and that would help. But it is more complicated to >> get a performance benefit from paravirtualization if the PIO/MMIO accesses >> are “spread”, for instance, done after each interrupt. > > What kind of lousy programming interface requires you to do a config > space access after every interrupt? This is looney-tunes. Wild example, hence the “for instance”. > > You've used a lot of words to not answer the question that was so > important that I asked it twice. What's the use case, what's the > workload that would benefit from this patch? Well, you used a lot of words to say “it causes problems” without saying which. It appeared you have misconceptions about paravirtualization that I wanted to correct. As I said before, I am not familiar with the exact motivation for this patch. I now understood from Ajay that it shortens VM boot time considerably. I was talking to Ajay to see if there is a possibility of a VMware specific solution. I am afraid that init_hypervisor_platform() might take place too late.
On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote: > On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote: > > > ⚠ External Email > > > > On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote: > >> On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote: > >> > >>> And looking at the results above, it's not so much the PIO vs MMIO > >>> that makes a difference, it's the virtualisation. A mmio access goes > >>> from 269ns to 85us. Rather than messing around with preferring MMIO > >>> over PIO for config space, having an "enlightenment" to do config > >>> space accesses would be a more profitable path. > >> > >> I am unfamiliar with the motivation for this patch, but I just wanted to > >> briefly regard the advice about enlightments. > >> > >> “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so > >> let’s regard the generic term. I think that you consider the bare-metal > >> results as the possible results from a paravirtual machine, which is mostly > >> wrong. Para-virtualization usually still requires a VM-exit and for the most > >> part the hypervisor/host runs similar code for MMIO/hypercall (conceptually; > >> the code of paravirtual and fully-virtual devices is often different, but > >> IIUC, this is not what Ajay measured). > >> > >> Para-virtualization could have *perhaps* helped to reduce the number of > >> PIO/MMIO and improve performance this way. If, for instance, all the > >> PIO/MMIO are done during initialization, a paravirtual interface can be use > >> to batch them together, and that would help. But it is more complicated to > >> get a performance benefit from paravirtualization if the PIO/MMIO accesses > >> are “spread”, for instance, done after each interrupt. > > > > What kind of lousy programming interface requires you to do a config > > space access after every interrupt? This is looney-tunes. > > Wild example, hence the “for instance”. Stupid example that doesn't help. > > You've used a lot of words to not answer the question that was so > > important that I asked it twice. What's the use case, what's the > > workload that would benefit from this patch? > > Well, you used a lot of words to say “it causes problems” without saying > which. It appeared you have misconceptions about paravirtualization that > I wanted to correct. Well now, that's some bullshit. I did my fucking research. I went back 14+ years in history to figure out what was going on back then. I cited commit IDs. You're just tossing off some opinions. I have no misconceptions about whatever you want to call the mechanism for communicating with the hypervisor at a higher level than "prod this byte". For example, one of the more intensive things we use config space for is sizing BARs. If we had a hypercall to siz a BAR, that would eliminate: - Read current value from BAR - Write all-ones to BAR - Read new value from BAR - Write original value back to BAR Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses. Just because I don't use your terminology, you think I have "misconceptions"? Fuck you, you condescending piece of shit. > As I said before, I am not familiar with the exact motivation for this > patch. I now understood from Ajay that it shortens VM boot time > considerably. And yet, no numbers. Yes, microbenchmark numbers that provde nothing, but no numbers about how much it improves boot time. > I was talking to Ajay to see if there is a possibility of a VMware specific > solution. 
> I am afraid that init_hypervisor_platform() might take place too late.
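For reference, the four-step sequence described above maps onto the classic BAR-sizing pattern. Below is a minimal sketch using the standard pci_read_config_dword()/pci_write_config_dword() helpers, assuming a 32-bit memory BAR; the function itself is illustrative, not the kernel's actual BAR-sizing code.

#include <linux/pci.h>

/*
 * Illustrative only: size a 32-bit memory BAR the classic way.
 * Each of the four config accesses below can be a separate VM exit
 * when the guest has no direct view of config space.
 */
static u32 sketch_size_bar(struct pci_dev *dev, int bar_reg)
{
        u32 orig, mask;

        pci_read_config_dword(dev, bar_reg, &orig);   /* read current value */
        pci_write_config_dword(dev, bar_reg, ~0U);    /* write all-ones */
        pci_read_config_dword(dev, bar_reg, &mask);   /* read back the mask */
        pci_write_config_dword(dev, bar_reg, orig);   /* restore original */

        return ~(mask & (u32)PCI_BASE_ADDRESS_MEM_MASK) + 1;
}

A single hypercall, or a directly readable ECAM page as proposed later in the thread, collapses all four round trips into at most one.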
On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@infradead.org> wrote: > ⚠ External Email > > On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote: >> On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote: >> >>> ⚠ External Email >>> >>> On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote: >>>> On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote: >>>> >>>>> And looking at the results above, it's not so much the PIO vs MMIO >>>>> that makes a difference, it's the virtualisation. A mmio access goes >>>>> from 269ns to 85us. Rather than messing around with preferring MMIO >>>>> over PIO for config space, having an "enlightenment" to do config >>>>> space accesses would be a more profitable path. >>>> >>>> I am unfamiliar with the motivation for this patch, but I just wanted to >>>> briefly regard the advice about enlightments. >>>> >>>> “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so >>>> let’s regard the generic term. I think that you consider the bare-metal >>>> results as the possible results from a paravirtual machine, which is mostly >>>> wrong. Para-virtualization usually still requires a VM-exit and for the most >>>> part the hypervisor/host runs similar code for MMIO/hypercall (conceptually; >>>> the code of paravirtual and fully-virtual devices is often different, but >>>> IIUC, this is not what Ajay measured). >>>> >>>> Para-virtualization could have *perhaps* helped to reduce the number of >>>> PIO/MMIO and improve performance this way. If, for instance, all the >>>> PIO/MMIO are done during initialization, a paravirtual interface can be use >>>> to batch them together, and that would help. But it is more complicated to >>>> get a performance benefit from paravirtualization if the PIO/MMIO accesses >>>> are “spread”, for instance, done after each interrupt. >>> >>> What kind of lousy programming interface requires you to do a config >>> space access after every interrupt? This is looney-tunes. >> >> Wild example, hence the “for instance”. > > Stupid example that doesn't help. > >>> You've used a lot of words to not answer the question that was so >>> important that I asked it twice. What's the use case, what's the >>> workload that would benefit from this patch? >> >> Well, you used a lot of words to say “it causes problems” without saying >> which. It appeared you have misconceptions about paravirtualization that >> I wanted to correct. > > Well now, that's some bullshit. I did my fucking research. I went > back 14+ years in history to figure out what was going on back then. > I cited commit IDs. You're just tossing off some opinions. > > I have no misconceptions about whatever you want to call the mechanism > for communicating with the hypervisor at a higher level than "prod this > byte". For example, one of the more intensive things we use config > space for is sizing BARs. If we had a hypercall to siz a BAR, that > would eliminate: > > - Read current value from BAR > - Write all-ones to BAR > - Read new value from BAR > - Write original value back to BAR > > Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses. > > Just because I don't use your terminology, you think I have > "misconceptions"? Fuck you, you condescending piece of shit. Matthew, I did not mean to sound condescending and I apologize if I did. You have my *full* respect for your coding/design skills. Out of my respect to you, I am giving you a pass on your conduct this time and *this time only*. 
Do not use such language with me or my colleagues again. The only reason I got involved in this discussion is that I feel that my colleagues have concerns about a toxic kernel environment. Back to the issue at hand: I think that a new paravirtual interface is a possible solution, with some serious drawbacks. Xen did something similar, IIRC, to a certain extent. More reasonable, I think, based on what you said before, is to check if we run on a hypervisor, and update raw_pci_ops accordingly. There is an issue of whether hypervisor detection might take place too late, but I think this can be relatively easily resolved. The question is whether assigned devices might still be broken. Based on the information that you provided - I do not know. If you can answer this question, that would be helpful. Let’s also wait for Ajay to give some numbers about boot time with this change.
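A minimal sketch of the direction suggested here, combined with Matthew's earlier idea of setting raw_pci_ops to NULL on machines where MMCONFIG is known-good. The function name and call site are hypothetical, and whether such a hook could run early enough relative to PCI initialization is exactly the open question raised above.

#include <linux/init.h>
#include <asm/cpufeature.h>
#include <asm/hypervisor.h>
#include <asm/pci_x86.h>

/*
 * Hypothetical: on a whitelisted hypervisor whose emulated ECAM is trusted,
 * drop the legacy PIO ops so raw_pci_read()/raw_pci_write() always take the
 * raw_pci_ext_ops (MMCONFIG) path, without reordering anything on bare metal.
 */
static void __init pci_prefer_mmconfig_if_virtualized(void)
{
        if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
                return;

        if (x86_hyper_type == X86_HYPER_VMWARE && raw_pci_ext_ops)
                raw_pci_ops = NULL;     /* whitelist, not a blanket reorder */
}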
On 09/07/22, 1:19 AM, "Nadav Amit" <namit@vmware.com> wrote: > On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@infradead.org> wrote: >> I have no misconceptions about whatever you want to call the mechanism >> for communicating with the hypervisor at a higher level than "prod this >> byte". For example, one of the more intensive things we use config >> space for is sizing BARs. If we had a hypercall to siz a BAR, that >> would eliminate: >> >> - Read current value from BAR >> - Write all-ones to BAR >> - Read new value from BAR >> - Write original value back to BAR >> >> Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses. To improve further, we can have the following mechanism: Map (as read-only) the virtual device config space, i.e. the 4KB ECAM, into VM MMIO. The VM will then have direct read access using MMIO but not using PIO. Virtual Machine test result with the above mechanism: 1 hundred thousand reads using raw_pci_read() took: PIO: 12.809 Sec. MMIO: 0.010 Sec. And during VM boot, PCI scan and initialization time is reduced by ~65%: in our case, from ~55 ms to ~18 ms. Thanks, Matthew, for sharing the history and your views on this patch. As you mentioned, the ordering change may impact some hardware, so it's better to make this change either VMware-specific or generic to all hypervisors. - Ajay > Back to the issue at hand: I think that a new paravirtual interface is a > possible solution, with some serious drawbacks. Xen did something similar, > IIRC, to a certain extent. > > More reasonable, I think, based on what you said before, is to check if we > run on a hypervisor, and update raw_pci_ops accordingly. There is an issue > of whether hypervisor detection might take place too late, but I think this > can be relatively easily resolved. The question is whether assigned devices > might still be broken. Based on the information that you provided - I do not > know. > > If you can answer this question, that would be helpful. Let’s also wait for > Ajay to give some numbers about boot time with this change.
On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@vmware.com> wrote: > On 09/07/22, 1:19 AM, "Nadav Amit" <namit@vmware.com> wrote: > >> On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@infradead.org> wrote: > >>> I have no misconceptions about whatever you want to call the mechanism >>> for communicating with the hypervisor at a higher level than "prod this >>> byte". For example, one of the more intensive things we use config >>> space for is sizing BARs. If we had a hypercall to siz a BAR, that >>> would eliminate: >>> >>> - Read current value from BAR >>> - Write all-ones to BAR >>> - Read new value from BAR >>> - Write original value back to BAR >>> >>> Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses. > > To improve further we can have following mechanism: > Map (as read only) the 'virtual device config i.e. 4KB ECAM' to > VM MMIO. VM will have direct read access using MMIO but > not using PIO. > > Virtual Machine test result with above mechanism: > 1 hundred thousand read using raw_pci_read() took: > PIO: 12.809 Sec. > MMIO: 0.010 Sec. > > And while VM booting, PCI scan and initialization time have been > reduced by ~65%. In our case it reduced to ~18 mSec from ~55 mSec. > > Thanks Matthew, for sharing history and your views on this patch. > > As you mentioned ordering change may impact some Hardware, so > it's better to have this change for VMware hypervisor or generic to > all hypervisor. I was chatting with Ajay, since I personally did not fully understand his use-case from the email. Others may have fully understood and can ignore this email. Here is a short summary of my understanding: During boot-time there are many PCI reads. Currently, when these reads are performed by a virtual machine, they all cause a VM-exit, and therefore each one of them induces a considerable overhead. When using MMIO (but not PIO), it is possible to map the PCI BARs of the virtual machine to some memory area that holds the values that the “emulated hardware” is supposed to return. The memory region is mapped as "read-only” in the NPT/EPT, so reads from these BAR regions would be treated as regular memory reads. Writes would still be trapped and emulated by the hypervisor. I have a vague recollection from some similar project that I had 10 years ago that this might not work for certain emulated device registers. For instance some hardware registers, specifically those the report hardware events, are “clear-on-read”. Apparently, Ajay took that into consideration. That is the reason for this quite amazing difference - several orders of magnitude - between the overhead that is caused by raw_pci_read(): 120us for PIO and 100ns for MMIO. Admittedly, I do not understand why PIO access would take 120us (I would have expected it to be 10 times faster, at least), but the benefit is quite clear.
On 11/07/22, 10:34 PM, "Nadav Amit" <namit@vmware.com> wrote: > On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@vmware.com> wrote: > > During boot-time there are many PCI reads. Currently, when these reads are > performed by a virtual machine, they all cause a VM-exit, and therefore each > one of them induces a considerable overhead. > > When using MMIO (but not PIO), it is possible to map the PCI BARs of the > virtual machine to some memory area that holds the values that the “emulated > hardware” is supposed to return. The memory region is mapped as "read-only” > in the NPT/EPT, so reads from these BAR regions would be treated as regular > memory reads. Writes would still be trapped and emulated by the hypervisor. I guess some typo mistake in above paragraph, it's per-device PCI config space i.e. 4KB ECAM not PCI BARs. Please read above paragraph as: When using MMIO (but not PIO), it is possible to map the PCI config space of the virtual machine to some memory area that holds the values that the “emulated hardware” is supposed to return. The memory region is mapped as "read-only” in the NPT/EPT, so reads from these PCI config space would be treated as regular memory reads. Writes would still be trapped and emulated by the hypervisor. We will send v2 or new patch which will be VMware specific. > I have a vague recollection from some similar project that I had 10 years > ago that this might not work for certain emulated device registers. For > instance some hardware registers, specifically those the report hardware > events, are “clear-on-read”. Apparently, Ajay took that into consideration. > > That is the reason for this quite amazing difference - several orders of > magnitude - between the overhead that is caused by raw_pci_read(): 120us for > PIO and 100ns for MMIO. Admittedly, I do not understand why PIO access would > take 120us (I would have expected it to be 10 times faster, at least), but > the benefit is quite clear.
On July 11, 2022 10:53:54 AM PDT, Ajay Kaher <akaher@vmware.com> wrote: > >On 11/07/22, 10:34 PM, "Nadav Amit" <namit@vmware.com> wrote: > >> On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@vmware.com> wrote: >> >> During boot-time there are many PCI reads. Currently, when these reads are >> performed by a virtual machine, they all cause a VM-exit, and therefore each >> one of them induces a considerable overhead. >> >> When using MMIO (but not PIO), it is possible to map the PCI BARs of the >> virtual machine to some memory area that holds the values that the “emulated >> hardware” is supposed to return. The memory region is mapped as "read-only” >> in the NPT/EPT, so reads from these BAR regions would be treated as regular >> memory reads. Writes would still be trapped and emulated by the hypervisor. > >I guess some typo mistake in above paragraph, it's per-device PCI config space >i.e. 4KB ECAM not PCI BARs. Please read above paragraph as: > >When using MMIO (but not PIO), it is possible to map the PCI config space of the >virtual machine to some memory area that holds the values that the “emulated >hardware” is supposed to return. The memory region is mapped as "read-only” >in the NPT/EPT, so reads from these PCI config space would be treated as regular >memory reads. Writes would still be trapped and emulated by the hypervisor. > >We will send v2 or new patch which will be VMware specific. > >> I have a vague recollection from some similar project that I had 10 years >> ago that this might not work for certain emulated device registers. For >> instance some hardware registers, specifically those the report hardware >> events, are “clear-on-read”. Apparently, Ajay took that into consideration. >> >> That is the reason for this quite amazing difference - several orders of >> magnitude - between the overhead that is caused by raw_pci_read(): 120us for >> PIO and 100ns for MMIO. Admittedly, I do not understand why PIO access would >> take 120us (I would have expected it to be 10 times faster, at least), but >> the benefit is quite clear. > > > For one thing, please correct the explanation. It does not take "more PCI cycles" to use PIO – they are exactly the same, in fact. The source of improvements are all in the CPU and VMM interfaces; on the PCI bus, they are (mostly) just address spaces. "Using MMIO may allow a VMM to map a shadow memory area readonly, so read transactions can be executed without needing any VMEXIT at all. In contrast, PIO transactions to PCI configuration space are done through an indirect address-data interface, requiring two VMEXITs per transaction regardless of the properties of the underlying register." You should call out exactly what is being done to prevent incorrect handling of registers with read side effects (I believe that would be all on the VMM side; unfortunately the presence of a register with read side effects probably would mean losing this optimization for the entire 4K page = this entire function, but read side effects have always been discouraged although not prohibited in config space.)
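A rough sketch to make the corrected explanation concrete. The port numbers and ECAM offset layout are architectural, but the helpers themselves are illustrative only. Legacy Configuration Mechanism #1 is an indirect address/data pair, so one config read is two port accesses and, in a guest, two VM exits; an ECAM read is a single memory-mapped load, which a hypervisor can back with a read-only guest page so that reads need no exit at all.

#include <asm/io.h>

#define PCI_CONF1_ADDR(bus, devfn, reg) \
        (0x80000000u | ((bus) << 16) | ((devfn) << 8) | ((reg) & 0xfc))

/* Legacy mechanism #1: write the address to 0xCF8, then read data from 0xCFC. */
static u32 conf1_read32(unsigned int bus, unsigned int devfn, unsigned int reg)
{
        outl(PCI_CONF1_ADDR(bus, devfn, reg), 0xCF8);   /* access #1 */
        return inl(0xCFC);                              /* access #2 */
}

/* ECAM: one aligned MMIO read; 4 KB of config space per bus/devfn. */
static u32 ecam_read32(void __iomem *ecam, unsigned int bus,
                       unsigned int devfn, unsigned int reg)
{
        return readl(ecam + ((bus << 20) | (devfn << 12) | (reg & 0xffc)));
}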
On 7/8/22 20:43, Matthew Wilcox wrote: > Just because I don't use your terminology, you think I have > "misconceptions"? Fuck you, you condescending piece of shit. Matthew, perhaps your vacation was not long enough? This is not even a violation of the code of conduct, it's a violation of basic human decency. Paolo
Hi Matthew, On 7/8/22 12:43 PM, Matthew Wilcox wrote: > On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote: >> On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote: >> > > Just because I don't use your terminology, you think I have > "misconceptions"? Fuck you, you condescending piece of shit. > This is clearly unacceptable and violates the Code of Conduct. Others including the person this was directed at already called out the violation: https://lore.kernel.org/all/85071FE5-E37A-44CF-9EF7-CB80C116A876@vmware.com/ https://lore.kernel.org/all/880aac01-9c78-34f7-8a11-d48179e1635c@redhat.com/ You should be well aware that this type of language and personal attack is a clear violation of the Linux kernel Contributor Covenant Code of Conduct as outlined in the following: https://www.kernel.org/doc/html/latest/process/code-of-conduct.html Refer to the Code of Conduct and refrain from violating the Code of Conduct in the future. thanks, -- Shuah (On behalf of the Code of Conduct Committee)
On Tue, Jul 12, 2022 at 03:20:57PM -0600, Shuah Khan wrote: > Hi Matthew, > > On 7/8/22 12:43 PM, Matthew Wilcox wrote: > > On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote: > > > On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > Just because I don't use your terminology, you think I have > > "misconceptions"? Fuck you, you condescending piece of shit. > > > > This is clearly unacceptable and violates the Code of Conduct. In a message I sent to this mailing list on July 8, I used language which offended some readers. I would like to apologise for my choice of words. I remain committed to fostering a community where disagreements can be handled respectfully and will do my best to uphold these important principles in the future. I would like to thank the Code of Conduct committee for their diligence in overseeing these matters.
On 8/31/22 15:44, Matthew Wilcox wrote: > On Tue, Jul 12, 2022 at 03:20:57PM -0600, Shuah Khan wrote: >> Hi Matthew, >> >> On 7/8/22 12:43 PM, Matthew Wilcox wrote: >>> On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote: >>>> On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote: >>>> >>> >>> Just because I don't use your terminology, you think I have >>> "misconceptions"? Fuck you, you condescending piece of shit. >>> >> >> This is clearly unacceptable and violates the Code of Conduct. > > In a message I sent to this mailing list on July 8, I used language > which offended some readers. I would like to apologise for my choice of > words. I remain committed to fostering a community where disagreements > can be handled respectfully and will do my best to uphold these important > principles in the future. I would like to thank the Code of Conduct > committee for their diligence in overseeing these matters. > Thank you Matthew for taking this important step to show your commitment to keeping our community a welcoming place that fosters respectful and productive technical discourse. thanks, -- Shuah (On behalf of the Code of Conduct Committee)
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 3507f456f..0b3383d9c 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
 int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
 						int reg, int len, u32 *val)
 {
+	if (raw_pci_ext_ops)
+		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
 	if (domain == 0 && reg < 256 && raw_pci_ops)
 		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
-	if (raw_pci_ext_ops)
-		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
 	return -EINVAL;
 }
 
 int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
 						int reg, int len, u32 val)
 {
+	if (raw_pci_ext_ops)
+		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
 	if (domain == 0 && reg < 256 && raw_pci_ops)
 		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
-	if (raw_pci_ext_ops)
-		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
 	return -EINVAL;
 }
Port IO instructions (PIO) are less efficient than MMIO (memory
mapped I/O). They require twice as many PCI accesses and PIO
instructions are serializing. As a result, MMIO should be preferred
when possible over PIO.

Bare metal test result
1 million reads using raw_pci_read() took:
PIO: 0.433153 Sec.
MMIO: 0.268792 Sec.

Virtual Machine test result
1 hundred thousand reads using raw_pci_read() took:
PIO: 12.809 Sec.
MMIO: 8.517 Sec.

Signed-off-by: Ajay Kaher <akaher@vmware.com>
---
 arch/x86/pci/common.c | 8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)
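Per call, the numbers above work out to roughly 433 ns (PIO) versus 269 ns (MMIO) on bare metal, and about 128 µs versus 85 µs in the virtual machine. Below is a minimal sketch of how such a measurement might be taken from code built into the kernel (raw_pci_read() is declared in arch/x86/include/asm/pci_x86.h but, to my knowledge, not exported to modules); the loop count and the choice of device 00:00.0, register PCI_VENDOR_ID, are arbitrary.

#include <linux/init.h>
#include <linux/ktime.h>
#include <linux/pci.h>
#include <linux/printk.h>
#include <asm/pci_x86.h>

/*
 * Illustrative benchmark: time 'loops' config reads of the vendor/device ID
 * of device 0000:00:00.0 through raw_pci_read().
 */
static void __init pci_cfg_read_bench(unsigned long loops)
{
        ktime_t start, end;
        unsigned long i;
        u32 val;

        start = ktime_get();
        for (i = 0; i < loops; i++)
                raw_pci_read(0, 0, PCI_DEVFN(0, 0), PCI_VENDOR_ID, 4, &val);
        end = ktime_get();

        pr_info("%lu reads took %lld us\n", loops, ktime_us_delta(end, start));
}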