Message ID | 1434751712-24333-6-git-send-email-mcgrof@do-not-panic.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote: > From: "Luis R. Rodriguez" <mcgrof@suse.com> > > PCI BARs tell us whether prefetching is safe, but they don't say anything > about write combining (WC). WC changes ordering rules and allows writes to > be collapsed, so it's not safe in general to use it on a prefetchable > region. Well, the PCIe spec at least specifies that a prefetchable BAR also tolerates write merging... > Add pci_iomap_wc() and pci_iomap_wc_range() so drivers can take advantage > of write combining when they know it's safe. > > On architectures that don't fully support WC, e.g., x86 without PAT, > drivers for legacy framebuffers may get some of the benefit by using > arch_phys_wc_add() in addition to pci_iomap_wc(). But arch_phys_wc_add() > is unreliable and should be avoided in general. On x86, it uses MTRRs, > which are limited in number and size, so the results will vary based on > driver loading order. > > The goals of adding pci_iomap_wc() are to: > > - Give drivers an architecture-independent way to use WC so they can stop > using interfaces like mtrr_add() (on x86, pci_iomap_wc() uses > PAT when available) > > - Move toward using _PAGE_CACHE_MODE_UC, not _PAGE_CACHE_MODE_UC_MINUS, > on x86 on ioremap_nocache() (see de33c442ed2a ("x86 PAT: fix > performance drop for glx, use UC minus for ioremap(), ioremap_nocache() > and pci_mmap_page_range()") > > Link: http://lkml.kernel.org/r/1426893517-2511-6-git-send-email-mcgrof@do-not-panic.com > Original-posting: http://lkml.kernel.org/r/1432163293-20965-1-git-send-email-mcgrof@do-not-panic.com > Cc: Toshi Kani <toshi.kani@hp.com> > Cc: Andy Lutomirski <luto@amacapital.net> > Cc: Suresh Siddha <sbsiddha@gmail.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Juergen Gross <jgross@suse.com> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > Cc: Dave Airlie <airlied@redhat.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Antonino Daplas <adaplas@gmail.com> > Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com> > Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Arnd Bergmann <arnd@arndb.de> > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: venkatesh.pallipadi@intel.com > Cc: Stefan Bader <stefan.bader@canonical.com> > Cc: Ville Syrjälä <syrjala@sci.fi> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Borislav Petkov <bp@suse.de> > Cc: Davidlohr Bueso <dbueso@suse.de> > Cc: konrad.wilk@oracle.com > Cc: ville.syrjala@linux.intel.com > Cc: david.vrabel@citrix.com > Cc: jbeulich@suse.com > Cc: Roger Pau Monné <roger.pau@citrix.com> > Cc: linux-fbdev@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: xen-devel@lists.xensource.com > Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> > --- > include/asm-generic/pci_iomap.h | 14 ++++++++++ > lib/pci_iomap.c | 61 +++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 75 insertions(+) > > diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h > index 7389c87..b1e17fc 100644 > --- a/include/asm-generic/pci_iomap.h > +++ b/include/asm-generic/pci_iomap.h > @@ -15,9 +15,13 @@ struct pci_dev; > #ifdef CONFIG_PCI > /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */ > extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max); > +extern void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max); > extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, > unsigned long offset, > unsigned long maxlen); > +extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar, > + unsigned long offset, > + unsigned long maxlen); > /* Create a virtual mapping cookie for a port on a given PCI device. > * Do not call this directly, it exists to make it easier for architectures > * to override */ > @@ -34,12 +38,22 @@ static inline void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned lon > return NULL; > } > > +static inline void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max) > +{ > + return NULL; > +} > static inline void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, > unsigned long offset, > unsigned long maxlen) > { > return NULL; > } > +static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar, > + unsigned long offset, > + unsigned long maxlen) > +{ > + return NULL; > +} > #endif > > #endif /* __ASM_GENERIC_IO_H */ > diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c > index bcce5f1..9604dcb 100644 > --- a/lib/pci_iomap.c > +++ b/lib/pci_iomap.c > @@ -52,6 +52,46 @@ void __iomem *pci_iomap_range(struct pci_dev *dev, > EXPORT_SYMBOL(pci_iomap_range); > > /** > + * pci_iomap_wc_range - create a virtual WC mapping cookie for a PCI BAR > + * @dev: PCI device that owns the BAR > + * @bar: BAR number > + * @offset: map memory at the given offset in BAR > + * @maxlen: max length of the memory to map > + * > + * Using this function you will get a __iomem address to your device BAR. > + * You can access it using ioread*() and iowrite*(). These functions hide > + * the details if this is a MMIO or PIO address space and will just do what > + * you expect from them in the correct way. When possible write combining > + * is used. > + * > + * @maxlen specifies the maximum length to map. If you want to get access to > + * the complete BAR from offset to the end, pass %0 here. > + * */ > +void __iomem *pci_iomap_wc_range(struct pci_dev *dev, > + int bar, > + unsigned long offset, > + unsigned long maxlen) > +{ > + resource_size_t start = pci_resource_start(dev, bar); > + resource_size_t len = pci_resource_len(dev, bar); > + unsigned long flags = pci_resource_flags(dev, bar); > + > + if (len <= offset || !start) > + return NULL; > + len -= offset; > + start += offset; > + if (maxlen && len > maxlen) > + len = maxlen; > + if (flags & IORESOURCE_IO) > + return NULL; > + if (flags & IORESOURCE_MEM) > + return ioremap_wc(start, len); > + /* What? */ > + return NULL; > +} > +EXPORT_SYMBOL_GPL(pci_iomap_wc_range); > + > +/** > * pci_iomap - create a virtual mapping cookie for a PCI BAR > * @dev: PCI device that owns the BAR > * @bar: BAR number > @@ -70,4 +110,25 @@ void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen) > return pci_iomap_range(dev, bar, 0, maxlen); > } > EXPORT_SYMBOL(pci_iomap); > + > +/** > + * pci_iomap_wc - create a virtual WC mapping cookie for a PCI BAR > + * @dev: PCI device that owns the BAR > + * @bar: BAR number > + * @maxlen: length of the memory to map > + * > + * Using this function you will get a __iomem address to your device BAR. > + * You can access it using ioread*() and iowrite*(). These functions hide > + * the details if this is a MMIO or PIO address space and will just do what > + * you expect from them in the correct way. When possible write combining > + * is used. > + * > + * @maxlen specifies the maximum length to map. If you want to get access to > + * the complete BAR without checking for its length first, pass %0 here. > + * */ > +void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long maxlen) > +{ > + return pci_iomap_wc_range(dev, bar, 0, maxlen); > +} > +EXPORT_SYMBOL_GPL(pci_iomap_wc); > #endif /* CONFIG_PCI */ -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 24, 2015 at 08:42:23AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote: > > From: "Luis R. Rodriguez" <mcgrof@suse.com> > > > > PCI BARs tell us whether prefetching is safe, but they don't say anything > > about write combining (WC). WC changes ordering rules and allows writes to > > be collapsed, so it's not safe in general to use it on a prefetchable > > region. > > Well, the PCIe spec at least specifies that a prefetchable BAR also > tolerates write merging... How can that be determined and can that be used as a full bullet proof hint to enable wc ? And are you sure? :) Reason all this was stated was to be apologetic over why we can't automate this behind the scenes. Otherwise we could amend what you stated into the commit log to elaborate on our technical apology. Let me know! Luis -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2015-06-24 at 18:38 +0200, Luis R. Rodriguez wrote: > On Wed, Jun 24, 2015 at 08:42:23AM +1000, Benjamin Herrenschmidt wrote: > > On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote: > > > From: "Luis R. Rodriguez" <mcgrof@suse.com> > > > > > > PCI BARs tell us whether prefetching is safe, but they don't say anything > > > about write combining (WC). WC changes ordering rules and allows writes to > > > be collapsed, so it's not safe in general to use it on a prefetchable > > > region. > > > > Well, the PCIe spec at least specifies that a prefetchable BAR also > > tolerates write merging... > > How can that be determined and can that be used as a full bullet proof hint > to enable wc ? And are you sure? :) Well, I"m sure the spec says that ;-) But it could be new to PCIe, I haven't checked legacy PCI. > Reason all this was stated was to be > apologetic over why we can't automate this behind the scenes. Otherwise > we could amend what you stated into the commit log to elaborate on our > technical apology. Let me know! At least on powerpc, for mmap of resource to userspace, we take off the garded bit in the PTE for prefetchable BARs. This has the effect architecturally of enabling both prefetch and write combine (ie. side effect) though afaik, the implementations probably don't actually prefetch. We've done that for years. In fact we don't have a way to split the notions, it's either G or no G, which carries both meanings. Do you have example/case of a device having problems ? Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 24, 2015 at 3:05 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Wed, 2015-06-24 at 18:38 +0200, Luis R. Rodriguez wrote: >> On Wed, Jun 24, 2015 at 08:42:23AM +1000, Benjamin Herrenschmidt wrote: >> > On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote: >> > > From: "Luis R. Rodriguez" <mcgrof@suse.com> >> > > >> > > PCI BARs tell us whether prefetching is safe, but they don't say anything >> > > about write combining (WC). WC changes ordering rules and allows writes to >> > > be collapsed, so it's not safe in general to use it on a prefetchable >> > > region. >> > >> > Well, the PCIe spec at least specifies that a prefetchable BAR also >> > tolerates write merging... >> >> How can that be determined and can that be used as a full bullet proof hint >> to enable wc ? And are you sure? :) > > Well, I"m sure the spec says that ;-) But it could be new to PCIe, I > haven't checked legacy PCI. OK cool so to be clear from what I gather you are suggesting (or not and letting me make it) is that we might be able to enforce write-merging on prefetchable areas, and if we can *ensure* we do this then automatically enable write-combining behind the scenes? >> Reason all this was stated was to be >> apologetic over why we can't automate this behind the scenes. Otherwise >> we could amend what you stated into the commit log to elaborate on our >> technical apology. Let me know! > > At least on powerpc, for mmap of resource to userspace, we take off the > garded bit in the PTE for prefetchable BARs. This has the effect > architecturally of enabling both prefetch and write combine (ie. side > effect) That's pretty darn sexy. > though afaik, the implementations probably don't actually > prefetch. We've done that for years. Neat! > In fact we don't have a way to split the notions, it's either G or no G, > which carries both meanings. Interesting. > Do you have example/case of a device having problems ? Nope but at least what made me squint at this being a possible "feature" was that in practice when reviewing all of the kernels pending device drivers using MTRR (potential write-combine candidates) I encountered a slew of them which had the architectural unfortunate practice of combining PCI bars for MMIO and their respective write-combined desirable area (framebuffer for video, PIO buffers for infiniband, etc). Now, to me that read more as a practice for old school devices when such things were likely still being evaluated, more modern devices seem to adhere to sticking a full PCI bar with write-combining or not. Did you not encounter such mismatch splits on powerpc ? Was such possibility addressed? If what you are implying here is applicable to the x86 world I'm all for enabling this as we'd have less code to maintain but I'll note that getting a clarification alone on that prefetchable != write-combining was in and of itself hard, I'd be surprised if we could get full architectural buy-in to this as an immediate automatic feature. Because of this and because PAT did have some errata as well, I would not be surprised if some PCI bridges / devices would end up finding corner cases, as such if we can really do what you're saying and unless we can get some super sane certainty over it across the board, I'd be inclined to leave such things as a part of a new API. Maybe have some folks test using the new API for all calls and after some sanity of testing / releases consider a full switch. That is, unless of course you're sure all this is sane and would wager all-in on it from the get-go. Luis -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2015-06-24 at 15:29 -0700, Luis R. Rodriguez wrote: > Nope but at least what made me squint at this being a possible > "feature" was that in practice when reviewing all of the kernels > pending device drivers using MTRR (potential write-combine candidates) > I encountered a slew of them which had the architectural unfortunate > practice of combining PCI bars for MMIO and their respective > write-combined desirable area (framebuffer for video, PIO buffers for > infiniband, etc). Now, to me that read more as a practice for old > school devices when such things were likely still being evaluated, > more modern devices seem to adhere to sticking a full PCI bar with > write-combining or not. Did you not encounter such mismatch splits on > powerpc ? Was such possibility addressed? Yes, I remember we dealt with some networking (or maybe IB) stuff back in the day. We dealt with it by using a WC mapping and explicit barriers to prevent combine when not wanted. It is to be noted that on powerpc at least, writel() and co will never combine due to the memory barriers in them. Only "normal" stores (or __raw_writel) will. On Intel things I different I assume... The problem I see is that architectures can provide widely different mechanisms and semantics in those areas and it's hard to define a generic driver interface. > If what you are implying here is applicable to the x86 world I'm all > for enabling this as we'd have less code to maintain but I'll note > that getting a clarification alone on that prefetchable != > write-combining was in and of itself hard, I'd be surprised if we > could get full architectural buy-in to this as an immediate automatic > feature. I'm happy not to make it automatic for kernel space. As for user mappings, maybe the right thing to do is to let us do what we do by default with a quirk that can set a flag in pci_dev to disable that behaviour (maybe on a per BAR basis ?). I think the common case is that WC works. > Because of this and because PAT did have some errata as well, > I would not be surprised if some PCI bridges / devices would end up > finding corner cases, as such if we can really do what you're saying > and unless we can get some super sane certainty over it across the > board, I'd be inclined to leave such things as a part of a new API. > Maybe have some folks test using the new API for all calls and after > some sanity of testing / releases consider a full switch. > > That is, unless of course you're sure all this is sane and would wager > all-in on it from the get-go. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 25, 2015 at 09:38:01AM +1000, Benjamin Herrenschmidt wrote: > On Wed, 2015-06-24 at 15:29 -0700, Luis R. Rodriguez wrote: > > > Nope but at least what made me squint at this being a possible > > "feature" was that in practice when reviewing all of the kernels > > pending device drivers using MTRR (potential write-combine candidates) > > I encountered a slew of them which had the architectural unfortunate > > practice of combining PCI bars for MMIO and their respective > > write-combined desirable area (framebuffer for video, PIO buffers for > > infiniband, etc). Now, to me that read more as a practice for old > > school devices when such things were likely still being evaluated, > > more modern devices seem to adhere to sticking a full PCI bar with > > write-combining or not. Did you not encounter such mismatch splits on > > powerpc ? Was such possibility addressed? > > Yes, I remember we dealt with some networking (or maybe IB) stuff back > in the day. We dealt with it by using a WC mapping and explicit barriers > to prevent combine when not wanted. > > It is to be noted that on powerpc at least, writel() and co will never > combine due to the memory barriers in them. Only "normal" stores (or > __raw_writel) will. > > On Intel things I different I assume... And the people who really know seem to be eaten by volcanoes or not have time. > The problem I see is that architectures can provide widely different > mechanisms and semantics in those areas and it's hard to define a > generic driver interface. Provided asm generic helpers are defined this should work though. The question is just if there is enough motivation. Doesn't sound like it or as you note maybe for userspace there might be. My position is that if it was too late for PCIE or if this was too ambigious for PCIE perhaps the next generation bus archicture or ammendments (I have no clue if this would would be possible) will make this part of future device negotiation clear and fully expected, not a wonderful side effect. > > If what you are implying here is applicable to the x86 world I'm all > > for enabling this as we'd have less code to maintain but I'll note > > that getting a clarification alone on that prefetchable != > > write-combining was in and of itself hard, I'd be surprised if we > > could get full architectural buy-in to this as an immediate automatic > > feature. > > I'm happy not to make it automatic for kernel space. OK thanks I'll proceed with these patches then. > As for user mappings, Which APIs were you considering in this regard BTW? > maybe the right thing to do is to let us do what we do by > default with a quirk that can set a flag in pci_dev to disable that > behaviour (maybe on a per BAR basis ?). That might mean it could restrict userspace WC to require devices to have WC parts on a full PCI BAR. Although this is restrictive having reviewed most WC uses in the kernel I'd think this would be a fair compromise to make, but again, if things are still murky perhaps best we kiss this idea good bye for now and hope for it to come in on future buses or ammendments (if that's even possible?). > I think the common case is that WC works. If WC does not I will note one hack which migh be worth mentioning -- just for the record, this was devised as a shortcoming of a device where they failed to split things properly and that *without* WC performance suffered quite a bit so they made one full PCI BAR WC and as a work around this: http://lkml.kernel.org/r/20150416041837.GA5712@hykim-PC That is for registers that needed it: write; wmb; Then if they wanted to wait till the NIC has seen the write, they did: write; wmb; read; Luis -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote: > > OK thanks I'll proceed with these patches then. > > > As for user mappings, > > Which APIs were you considering in this regard BTW? mmap of the generic /sys/bus/pci/.../resource* > > maybe the right thing to do is to let us do what we do by > > default with a quirk that can set a flag in pci_dev to disable that > > behaviour (maybe on a per BAR basis ?). > > That might mean it could restrict userspace WC to require devices > to have WC parts on a full PCI BAR. Although this is restrictive > having reviewed most WC uses in the kernel I'd think this would be > a fair compromise to make, but again, if things are still murky > perhaps best we kiss this idea good bye for now and hope for it > to come in on future buses or ammendments (if that's even possible?). > > > I think the common case is that WC works. > > If WC does not I will note one hack which migh be worth mentioning -- > just for > the record, this was devised as a shortcoming of a device where they > failed to > split things properly and that *without* WC performance suffered quite > a bit so > they made one full PCI BAR WC and as a work around this: > > http://lkml.kernel.org/r/20150416041837.GA5712@hykim-PC > > That is for registers that needed it: > > write; wmb; > > Then if they wanted to wait till the NIC has seen the write, they did: > > write; wmb; read; > Right, and as I mentioned, on some archs like powerpc (and possibly more), writel() and co contains an implicit mb() > Luis-- -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 24, 2015 at 5:52 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote: >> >> OK thanks I'll proceed with these patches then. >> >> > As for user mappings, >> >> Which APIs were you considering in this regard BTW? > > mmap of the generic /sys/bus/pci/.../resource* Like? Got a demo patch in mind ? :) Luis -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2015-06-24 at 17:58 -0700, Luis R. Rodriguez wrote: > On Wed, Jun 24, 2015 at 5:52 PM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote: > >> > >> OK thanks I'll proceed with these patches then. > >> > >> > As for user mappings, > >> > >> Which APIs were you considering in this regard BTW? > > > > mmap of the generic /sys/bus/pci/.../resource* > > Like? Got a demo patch in mind ? :) Nope. I was just thinking out loud. Today I have yet to see a problem with what we do so ... Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
/ From: Benjamin Herrenschmidt [benh@kernel.crashing.org] | Sent: Wednesday, June 24, 2015 4:38 PM | | It is to be noted that on powerpc at least, writel() and co will never | combine due to the memory barriers in them. Only "normal" stores (or \ __raw_writel) will. / From: Benjamin Herrenschmidt [benh@kernel.crashing.org] | Sent: Wednesday, June 24, 2015 5:52 PM | | Right, and as I mentioned, on some archs like powerpc (and possibly \ more), writel() and co contains an implicit mb() Hhmmm, interesting. I definitely hadn't known that. In our network drivers we explicitly manage all of our own memory barrier semantics. (wmb() between writes to DMA Queues and writes to Device Registers informing hardware of new data available to be DMA'ed, etc.) And we do have code to take advantage of Write Combining Mappings using writeq(). It looks like this is implemented eventually as out_be64() in arch/powerpc/include/asm/io.h for PPC-BE: #define DEF_MMIO_OUT_D(name, size, insn) \ static inline void name(volatile u##size __iomem *addr, u##size val) \ { \ __asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0" \ : "=m" (*addr) : "r" (val) : "memory"); \ IO_SET_SYNC_FLAG(); \ } DEF_MMIO_OUT_D(out_be64, 64, std); So, if understand your notes and the source correctly, all of our code to do register reads/writes are getting SYNC instructions, as are the 64-bit writeq()s we're attempting to use for our Write Combining code on PowerPC. Hhmmm indeed ... Is there a reference I can read on this so I can understand when and where we can use the __raw_*() APIs? Can these Raw Read/Write operations be reordered with respect to each other or are the use of the various flavors of SYNC instructions just to maintain order between Cached Memory Accesses and I/O Instructions? Casey-- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday 25 June 2015 15:01:56 Casey Leedom wrote: > > Is there a reference I can read on this so I can understand > when and where we can use the __raw_*() APIs? Can these > Raw Read/Write operations be reordered with respect to > each other or are the use of the various flavors of SYNC > instructions just to maintain order between Cached Memory > Accesses and I/O Instructions? The interpretation is not consistent across architectures. My best description would be that the __raw_*() accessors should only be used for accessing RAM areas that are known to have no side-effects and can be read in any size (8-bit to 64-bit wide), any alignment, and do not have a specific endianess. If you are dealing with MMIO registers that have a fixed endianess and size, the correct accessor would be readl_relaxed(), which is like readl() but lacks the barriers on certain architectures. Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
| From: Arnd Bergmann [arnd@arndb.de] | Sent: Thursday, June 25, 2015 1:44 PM | | On Thursday 25 June 2015 15:01:56 Casey Leedom wrote: | > | > Is there a reference I can read on this so I can understand | > when and where we can use the __raw_*() APIs? Can these | > Raw Read/Write operations be reordered with respect to | > each other or are the use of the various flavors of SYNC | > instructions just to maintain order between Cached Memory | > Accesses and I/O Instructions? | | The interpretation is not consistent across architectures. | | My best description would be that the __raw_*() accessors should | only be used for accessing RAM areas that are known to have no | side-effects and can be read in any size (8-bit to 64-bit wide), | any alignment, and do not have a specific endianess. | | If you are dealing with MMIO registers that have a fixed endianess | and size, the correct accessor would be readl_relaxed(), which | is like readl() but lacks the barriers on certain architectures. Ah, thanks. I see now that the __raw_*() APIs don't do any of the Endian Swizzling. Unfortunately the *_relaxed() APIs on PowerPC are just defined as the normal *() routines. From arch/powerpc/include/asm/io.h: /* * We don't do relaxed operations yet, at least not with this semantic */ #define readb_relaxed(addr) readb(addr) #define readw_relaxed(addr) readw(addr) #define readl_relaxed(addr) readl(addr) #define readq_relaxed(addr) readq(addr) #define writeb_relaxed(v, addr) writeb(v, addr) #define writew_relaxed(v, addr) writew(v, addr) #define writel_relaxed(v, addr) writel(v, addr) #define writeq_relaxed(v, addr) writeq(v, addr) (And in fact, this is true for most of the architectures.) So we could go ahead and use these but for now there wouldn't be any effect. Hhmmm, so what do PowerPC Drivers do when they want to take advantage of Write Combining? Do their own Endian Swizzling with the __raw_*() APIs? Casey-- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote: > Hhmmm, so what do PowerPC Drivers do when they want to take > advantage of Write Combining? Do their own Endian Swizzling > with the __raw_*() APIs? Yeah either, we need to fix our relaxed implementation (patch welcome :-) Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote: > > Ah, thanks. I see now that the __raw_*() APIs don't do any of the > Endian Swizzling. Unfortunately the *_relaxed() APIs on PowerPC > are just defined as the normal *() routines. From > arch/powerpc/include/asm/io.h: > > /* > * We don't do relaxed operations yet, at least not with this > semantic > */ Yes so I was looking at this but there are some difficulties. Architecturally, even with I=1 G=1 mappings (normal ioremap), we have no guarantee of ordering of load vs. store unless I misunderstood something. I think all current implementations provide some of that but without barriers in the accessors, we aren't architecturally correct. However, having those barriers will cause issues with G=0 (write combine). It's unclear whether eieio() will provide the required ordering for I=1 G=0 mappings and it will probably break write combine. I'm looking into it with our HW guys and will try to come up with a solution for power, but it doesn't help that our memory model conflates write combining with other relaxations and that all our barriers also prevent write combine. Maybe we can bias the relaxed accessors toward write, by having no barriers in it, and putting extra ones in reads. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Thanks for looking into this Ben. As it stands now, it seems as if Write Combined mappings simply aren't supported and/or all driver writers trying to utilize Write Combined mappings have to "hand roll" their own solutions which really isn't supportable. One thing that might be considered is simply to treat the desire to utilize the Write Combining hardware as a separate issue and develop writel_wc(), writeq_wc(), etc. Those could be defined as legal only for Write Combined Mappings and would "do the right thing" for each architecture. The initial implementations could simply be writel(), writeq(), etc. till each architecture'si ssues were investigated and understood. Additionally, it would be very useful to have some sort of predicate which could be used to determine if an architecture supported Write Combining. Our drivers would use this to eliminate a wasteful attempt to use write combining on those architectures where it didn't work. Casey
On Fri, Jun 26, 2015 at 08:51:58AM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote: > > Hhmmm, so what do PowerPC Drivers do when they want to take > > advantage of Write Combining? Do their own Endian Swizzling > > with the __raw_*() APIs? > > Yeah either, we need to fix our relaxed implementation (patch > welcome :-) Yikes, so although powerpc has useful heuristics to automatically enable WC the default write ops have been nullifying its effects? Shouldn't this have been blatantly obvious in performance benchmarks and expectations? If not a change should give considerable performance gains... If the WC was nullified on powerpc then the experience and value of the heuristics have less value and even if they are going to be only considered for userspace mmap() it should be taken with a bit grain of salt it seems. Luis -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-06-26 at 16:24 +0000, Casey Leedom wrote: > Thanks for looking into this Ben. As it stands now, it seems as > if Write Combined mappings simply aren't supported and/or all > driver writers trying to utilize Write Combined mappings have to > "hand roll" their own solutions which really isn't supportable. > > One thing that might be considered is simply to treat the desire > to utilize the Write Combining hardware as a separate issue and > develop writel_wc(), writeq_wc(), etc. Those could be defined > as legal only for Write Combined Mappings and would "do the > right thing" for each architecture. The question then is what is "the right thing". In the powerpc case, we'll have a non-garded mapping, which means we also get no ordering between load and stores. > The initial implementations > could simply be writel(), writeq(), etc. till each architecture'si > ssues were investigated and understood. > > Additionally, it would be very useful to have some sort of > predicate which could be used to determine if an architecture > supported Write Combining. Our drivers would use this to > eliminate a wasteful attempt to use write combining on those > architectures where it didn't work. > > Casey > > ________________________________________ > From: Benjamin Herrenschmidt [benh@kernel.crashing.org] > Sent: Thursday, June 25, 2015 7:02 PM > To: Casey Leedom > Cc: Arnd Bergmann; Luis R. Rodriguez; Michael S. Tsirkin; Bjorn Helgaas; Toshi Kani; Andy Lutomirski; Juergen Gross; Tomi Valkeinen; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; xen-devel@lists.xensource.com; linux-fbdev; Suresh Siddha; Ingo Molnar; Thomas Gleixner; Daniel Vetter; Dave Airlie; Antonino Daplas; Jean-Christophe Plagniol-Villard; Dave Hansen; venkatesh.pallipadi@intel.com; Stefan Bader; ville.syrjala@linux.intel.com; David Vrabel; Jan Beulich; Roger Pau Monné > Subject: Re: [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants > > On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote: > > > > Ah, thanks. I see now that the __raw_*() APIs don't do any of the > > Endian Swizzling. Unfortunately the *_relaxed() APIs on PowerPC > > are just defined as the normal *() routines. From > > arch/powerpc/include/asm/io.h: > > > > /* > > * We don't do relaxed operations yet, at least not with this > > semantic > > */ > > Yes so I was looking at this but there are some difficulties. > Architecturally, even with I=1 G=1 mappings (normal ioremap), we have no > guarantee of ordering of load vs. store unless I misunderstood > something. I think all current implementations provide some of that but > without barriers in the accessors, we aren't architecturally correct. > > However, having those barriers will cause issues with G=0 (write > combine). It's unclear whether eieio() will provide the required > ordering for I=1 G=0 mappings and it will probably break write combine. > > I'm looking into it with our HW guys and will try to come up with a > solution for power, but it doesn't help that our memory model conflates > write combining with other relaxations and that all our barriers also > prevent write combine. > > Maybe we can bias the relaxed accessors toward write, by having no > barriers in it, and putting extra ones in reads. > > Cheers, > Ben. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-06-26 at 21:31 +0200, Luis R. Rodriguez wrote: > > Yeah either, we need to fix our relaxed implementation (patch > > welcome :-) > > Yikes, so although powerpc has useful heuristics to automatically > enable WC the default write ops have been nullifying its effects? The heuristic is for userspace mapping which don't use the kernel accessors. To cut a long story short, this was all done back in the day to speed up X.org which doesn't use fancy accessors with barriers to access the framebuffer (neither does the kernel fbdev layer). > Shouldn't this have been blatantly obvious in performance benchmarks > and expectations? > If not a change should give considerable performance > gains... If the WC was nullified on powerpc then the experience and > value of the heuristics have less value and even if they are going > to be only considered for userspace mmap() it should be taken with > a bit grain of salt it seems. It wasn't nullified for the main user at the time, the fb. And I mentioned an IB adapter or two for which the code had been hand tuned. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-06-26 at 15:41 -0700, Luis R. Rodriguez wrote: > > It wasn't nullified for the main user at the time, the fb. And I > > mentioned an IB adapter or two for which the code had been hand > tuned. > > This still means there could be some affected drivers when used on > powerpc, no? Yes. In fact what about things like ARM who also have barriers in their writel() ? Won't they also break WC ? I'm trying to work with the architect and designers here to figure out exactly where we stand and what we can do. As spelled out by our architecture, things don't look great, because basically, we only have attribute bit (garded) which when not set implies both WC and out of order (& prefetch), and unclear barrier semantics in that case as well. I *think* we might be able to settle with something along the lines of "writel_relaxed() will allow combine on a WC mapping" but how I'm going to get there is TBD. It would be interesting to clarify the semantics of using the relaxed accessors in combination with WC anyway. I wouldn't mind if the definition involved also relaxing general ordering :-) It would definitely make my life easier. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 26, 2015 at 4:54 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Fri, 2015-06-26 at 15:41 -0700, Luis R. Rodriguez wrote: > >> > It wasn't nullified for the main user at the time, the fb. And I >> > mentioned an IB adapter or two for which the code had been hand >> tuned. >> >> This still means there could be some affected drivers when used on >> powerpc, no? > > Yes. In fact what about things like ARM who also have barriers in their > writel() ? Won't they also break WC ? Not sure if they have that write-combine notion on their CPUs / interconnects. Russel, Stefano, does ARM have write-combining ? Luis -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jun 27, 2015 at 08:00:48AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2015-06-26 at 16:24 +0000, Casey Leedom wrote: > > Thanks for looking into this Ben. As it stands now, it seems as > > if Write Combined mappings simply aren't supported and/or all > > driver writers trying to utilize Write Combined mappings have to > > "hand roll" their own solutions which really isn't supportable. > > > > One thing that might be considered is simply to treat the desire > > to utilize the Write Combining hardware as a separate issue and > > develop writel_wc(), writeq_wc(), etc. That seems rather sloppy and cumbersome, its best to just provide the infrastructure for initial mapping for an area and let the hardware do it for you. With the current design drivers would just use regular read/write on all areas and the only thing that will set them apart will be the mapping. With what you propose we'd end up having to shift a whole bunch of reads/writes for just the purpose of adding WC to an area that didn't have wc before. That's a waste of code and time. > > Those could be defined > > as legal only for Write Combined Mappings and would "do the > > right thing" for each architecture. Yuck. > The question then is what is "the right thing". In the powerpc case, > we'll have a non-garded mapping, which means we also get no ordering > between load and stores. I don't follow, you *ordering* between load and stores for WC? We should not need that for WC, its why WC is used for only very specific things such as framebuffer and PIO (which BTw I still don't quite get all this use case for infiniband to be honest, and I will note I do see some proprietary hardware extensions like bursts but nothing covering all this in a general doc, I think I think it all just has to do that this is a hardware hack in reality, which we sell as a feature). Luis -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-07-02 at 20:49 +0200, Luis R. Rodriguez wrote: > > The question then is what is "the right thing". In the powerpc case, > > we'll have a non-garded mapping, which means we also get no ordering > > between load and stores. > > I don't follow, you *ordering* between load and stores for WC? We should > not need that for WC, its why WC is used for only very specific things > such as framebuffer and PIO (which BTw I still don't quite get all this > use case for infiniband to be honest, and I will note I do see some > proprietary hardware extensions like bursts but nothing covering all > this in a general doc, I think I think it all just has to do that this > is a hardware hack in reality, which we sell as a feature). Well, that's the problem, the semantics that we provide to drivers aren't well defined. The words "write combine" themselves only specify that writes to subsequent addresses can be combined into larger transactions. That's in itself is already quite vague (are their boundaries, limits ? some can depend on bus type, etc...) though in practice is probably sufficient. However, overloading a _wc mapping with additional memory model differences such as loss of ordering between load and stores, etc... is not an obvious thing to do. I agree it would make *my* life easier if we did it since this is precisely the semantics provided by a "G=0" mapping on ppc, but we need to agree and *document* it, otherwise bad things will happen eventually. We also need to document in that case which barriers can be used to explicitly enforce the ordering on such a mapping and which barriers can be used to break write combine (they don't necessarily have to be the same). We also need to document which accessors will actually provide the write combine "feature" of a _wc mapping. For example while writel() will do it on Intel, it will not on ppc and I wouldn't be surprised if a bunch of other archs fall in the same bucket as ppc (basically anything that has barriers in their writel implementation to order vs. DMA etc...). So we might need to explicitly document that writel_relaxed() needs to be used. Finally what are the precise guaranteed semantics of writeX/readX, writeX_relaxed/readX_relaxed and __raw_ (everything else) on a _wc mapping, what do we mandate and document, what do we leave to be implementation dependent ? Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> On Jul 2, 2015, at 11:49 AM, Luis R. Rodriguez <mcgrof@suse.com> wrote: > > On Sat, Jun 27, 2015 at 08:00:48AM +1000, Benjamin Herrenschmidt wrote: >> On Fri, 2015-06-26 at 16:24 +0000, Casey Leedom wrote: >>> Thanks for looking into this Ben. As it stands now, it seems as >>> if Write Combined mappings simply aren't supported and/or all >>> driver writers trying to utilize Write Combined mappings have to >>> "hand roll" their own solutions which really isn't supportable. >>> >>> One thing that might be considered is simply to treat the desire >>> to utilize the Write Combining hardware as a separate issue and >>> develop writel_wc(), writeq_wc(), etc. > > That seems rather sloppy and cumbersome, its best to just provide the > infrastructure for initial mapping for an area and let the hardware do it for > you. With the current design drivers would just use regular read/write on all > areas and the only thing that will set them apart will be the mapping. With > what you propose we'd end up having to shift a whole bunch of reads/writes for > just the purpose of adding WC to an area that didn't have wc before. That's > a waste of code and time. > >>> Those could be defined >>> as legal only for Write Combined Mappings and would "do the >>> right thing" for each architecture. > > Yuck. Yeah, probably. But it sounds like from Ben’s response, we should really use write_relaxed() as an already existing API for this. But most of the architectures just define this as writel() so some work would need to be done to get it to work. >> The question then is what is "the right thing". In the powerpc case, >> we'll have a non-garded mapping, which means we also get no ordering >> between load and stores. > > I don't follow, you *ordering* between load and stores for WC? We should > not need that for WC, its why WC is used for only very specific things > such as framebuffer and PIO (which BTw I still don't quite get all this > use case for infiniband to be honest, and I will note I do see some > proprietary hardware extensions like bursts but nothing covering all > this in a general doc, I think I think it all just has to do that this > is a hardware hack in reality, which we sell as a feature). I was talking about the work that our drivers (cxgb4, cxgb4vf, etc.) do to ensure correct ordering between writes to memory and I/O space. For instance, issuing a wmb() between writes to a DMA buffer and the write to a register telling the hardware that the data is available. This isn’t necessary on the Strongly Ordered Intel architectures but is necessary on other architectures. Casey-- To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h index 7389c87..b1e17fc 100644 --- a/include/asm-generic/pci_iomap.h +++ b/include/asm-generic/pci_iomap.h @@ -15,9 +15,13 @@ struct pci_dev; #ifdef CONFIG_PCI /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */ extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max); +extern void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max); extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, unsigned long offset, unsigned long maxlen); +extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar, + unsigned long offset, + unsigned long maxlen); /* Create a virtual mapping cookie for a port on a given PCI device. * Do not call this directly, it exists to make it easier for architectures * to override */ @@ -34,12 +38,22 @@ static inline void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned lon return NULL; } +static inline void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max) +{ + return NULL; +} static inline void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, unsigned long offset, unsigned long maxlen) { return NULL; } +static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar, + unsigned long offset, + unsigned long maxlen) +{ + return NULL; +} #endif #endif /* __ASM_GENERIC_IO_H */ diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c index bcce5f1..9604dcb 100644 --- a/lib/pci_iomap.c +++ b/lib/pci_iomap.c @@ -52,6 +52,46 @@ void __iomem *pci_iomap_range(struct pci_dev *dev, EXPORT_SYMBOL(pci_iomap_range); /** + * pci_iomap_wc_range - create a virtual WC mapping cookie for a PCI BAR + * @dev: PCI device that owns the BAR + * @bar: BAR number + * @offset: map memory at the given offset in BAR + * @maxlen: max length of the memory to map + * + * Using this function you will get a __iomem address to your device BAR. + * You can access it using ioread*() and iowrite*(). These functions hide + * the details if this is a MMIO or PIO address space and will just do what + * you expect from them in the correct way. When possible write combining + * is used. + * + * @maxlen specifies the maximum length to map. If you want to get access to + * the complete BAR from offset to the end, pass %0 here. + * */ +void __iomem *pci_iomap_wc_range(struct pci_dev *dev, + int bar, + unsigned long offset, + unsigned long maxlen) +{ + resource_size_t start = pci_resource_start(dev, bar); + resource_size_t len = pci_resource_len(dev, bar); + unsigned long flags = pci_resource_flags(dev, bar); + + if (len <= offset || !start) + return NULL; + len -= offset; + start += offset; + if (maxlen && len > maxlen) + len = maxlen; + if (flags & IORESOURCE_IO) + return NULL; + if (flags & IORESOURCE_MEM) + return ioremap_wc(start, len); + /* What? */ + return NULL; +} +EXPORT_SYMBOL_GPL(pci_iomap_wc_range); + +/** * pci_iomap - create a virtual mapping cookie for a PCI BAR * @dev: PCI device that owns the BAR * @bar: BAR number @@ -70,4 +110,25 @@ void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen) return pci_iomap_range(dev, bar, 0, maxlen); } EXPORT_SYMBOL(pci_iomap); + +/** + * pci_iomap_wc - create a virtual WC mapping cookie for a PCI BAR + * @dev: PCI device that owns the BAR + * @bar: BAR number + * @maxlen: length of the memory to map + * + * Using this function you will get a __iomem address to your device BAR. + * You can access it using ioread*() and iowrite*(). These functions hide + * the details if this is a MMIO or PIO address space and will just do what + * you expect from them in the correct way. When possible write combining + * is used. + * + * @maxlen specifies the maximum length to map. If you want to get access to + * the complete BAR without checking for its length first, pass %0 here. + * */ +void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long maxlen) +{ + return pci_iomap_wc_range(dev, bar, 0, maxlen); +} +EXPORT_SYMBOL_GPL(pci_iomap_wc); #endif /* CONFIG_PCI */