[v7,5/9] PCI: Add pci_iomap_wc() variants

Message ID	1434751712-24333-6-git-send-email-mcgrof@do-not-panic.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fbdev-owner@kernel.org> From: "Luis R. Rodriguez" <mcgrof@do-not-panic.com> To: arnd@arndb.de Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, xen-devel@lists.xensource.com, linux-fbdev@vger.kernel.org, "Luis R. Rodriguez" <mcgrof@suse.com>, Toshi Kani <toshi.kani@hp.com>, Andy Lutomirski <luto@amacapital.net>, Suresh Siddha <sbsiddha@gmail.com>, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>, Juergen Gross <jgross@suse.com>, Daniel Vetter <daniel.vetter@ffwll.ch>, Dave Airlie <airlied@redhat.com>, Bjorn Helgaas <bhelgaas@google.com>, Antonino Daplas <adaplas@gmail.com>, Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>, Tomi Valkeinen <tomi.valkeinen@ti.com>, Dave Hansen <dave.hansen@linux.intel.com>, "Michael S. Tsirkin" <mst@redhat.com>, venkatesh.pallipadi@intel.com, Stefan Bader <stefan.bader@canonical.com>, =?UTF-8?q?Ville=20Syrj=C3=A4l=C3=A4?= <syrjala@sci.fi>, Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>, Borislav Petkov <bp@suse.de>, Davidlohr Bueso <dbueso@suse.de>, konrad.wilk@oracle.com, ville.syrjala@linux.intel.com, david.vrabel@citrix.com, jbeulich@suse.com, =?UTF-8?q?Roger=20Pau=20Monn=C3=A9?= <roger.pau@citrix.com> Subject: [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants Date: Fri, 19 Jun 2015 15:08:28 -0700 Message-Id: <1434751712-24333-6-git-send-email-mcgrof@do-not-panic.com> In-Reply-To: <1434751712-24333-1-git-send-email-mcgrof@do-not-panic.com> References: <1434751712-24333-1-git-send-email-mcgrof@do-not-panic.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-fbdev-owner@vger.kernel.org Precedence: bulk

Luis R. Rodriguez June 19, 2015, 10:08 p.m. UTC

From: "Luis R. Rodriguez" <mcgrof@suse.com>

PCI BARs tell us whether prefetching is safe, but they don't say anything
about write combining (WC).  WC changes ordering rules and allows writes to
be collapsed, so it's not safe in general to use it on a prefetchable
region.

Add pci_iomap_wc() and pci_iomap_wc_range() so drivers can take advantage
of write combining when they know it's safe.

On architectures that don't fully support WC, e.g., x86 without PAT,
drivers for legacy framebuffers may get some of the benefit by using
arch_phys_wc_add() in addition to pci_iomap_wc().  But arch_phys_wc_add()
is unreliable and should be avoided in general.  On x86, it uses MTRRs,
which are limited in number and size, so the results will vary based on
driver loading order.

The goals of adding pci_iomap_wc() are to:

- Give drivers an architecture-independent way to use WC so they can stop
  using interfaces like mtrr_add() (on x86, pci_iomap_wc() uses
  PAT when available)

- Move toward using _PAGE_CACHE_MODE_UC, not _PAGE_CACHE_MODE_UC_MINUS,
  on x86 on ioremap_nocache() (see de33c442ed2a ("x86 PAT: fix
  performance drop for glx, use UC minus for ioremap(), ioremap_nocache()
  and pci_mmap_page_range()")

Link: http://lkml.kernel.org/r/1426893517-2511-6-git-send-email-mcgrof@do-not-panic.com
Original-posting: http://lkml.kernel.org/r/1432163293-20965-1-git-send-email-mcgrof@do-not-panic.com
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: venkatesh.pallipadi@intel.com
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: konrad.wilk@oracle.com
Cc: ville.syrjala@linux.intel.com
Cc: david.vrabel@citrix.com
Cc: jbeulich@suse.com
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xensource.com
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
---
 include/asm-generic/pci_iomap.h | 14 ++++++++++
 lib/pci_iomap.c                 | 61 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

Benjamin Herrenschmidt June 23, 2015, 10:42 p.m. UTC | #1

On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" <mcgrof@suse.com>
> 
> PCI BARs tell us whether prefetching is safe, but they don't say anything
> about write combining (WC).  WC changes ordering rules and allows writes to
> be collapsed, so it's not safe in general to use it on a prefetchable
> region.

Well, the PCIe spec at least specifies that a prefetchable BAR also
tolerates write merging... 

> Add pci_iomap_wc() and pci_iomap_wc_range() so drivers can take advantage
> of write combining when they know it's safe.
> 
> On architectures that don't fully support WC, e.g., x86 without PAT,
> drivers for legacy framebuffers may get some of the benefit by using
> arch_phys_wc_add() in addition to pci_iomap_wc().  But arch_phys_wc_add()
> is unreliable and should be avoided in general.  On x86, it uses MTRRs,
> which are limited in number and size, so the results will vary based on
> driver loading order.
> 
> The goals of adding pci_iomap_wc() are to:
> 
> - Give drivers an architecture-independent way to use WC so they can stop
>   using interfaces like mtrr_add() (on x86, pci_iomap_wc() uses
>   PAT when available)
> 
> - Move toward using _PAGE_CACHE_MODE_UC, not _PAGE_CACHE_MODE_UC_MINUS,
>   on x86 on ioremap_nocache() (see de33c442ed2a ("x86 PAT: fix
>   performance drop for glx, use UC minus for ioremap(), ioremap_nocache()
>   and pci_mmap_page_range()")
> 
> Link: http://lkml.kernel.org/r/1426893517-2511-6-git-send-email-mcgrof@do-not-panic.com
> Original-posting: http://lkml.kernel.org/r/1432163293-20965-1-git-send-email-mcgrof@do-not-panic.com
> Cc: Toshi Kani <toshi.kani@hp.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Suresh Siddha <sbsiddha@gmail.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Antonino Daplas <adaplas@gmail.com>
> Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: venkatesh.pallipadi@intel.com
> Cc: Stefan Bader <stefan.bader@canonical.com>
> Cc: Ville Syrjälä <syrjala@sci.fi>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Davidlohr Bueso <dbueso@suse.de>
> Cc: konrad.wilk@oracle.com
> Cc: ville.syrjala@linux.intel.com
> Cc: david.vrabel@citrix.com
> Cc: jbeulich@suse.com
> Cc: Roger Pau Monné <roger.pau@citrix.com>
> Cc: linux-fbdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: xen-devel@lists.xensource.com
> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
> ---
>  include/asm-generic/pci_iomap.h | 14 ++++++++++
>  lib/pci_iomap.c                 | 61 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 75 insertions(+)
> 
> diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
> index 7389c87..b1e17fc 100644
> --- a/include/asm-generic/pci_iomap.h
> +++ b/include/asm-generic/pci_iomap.h
> @@ -15,9 +15,13 @@ struct pci_dev;
>  #ifdef CONFIG_PCI
>  /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
>  extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
> +extern void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max);
>  extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
>  				     unsigned long offset,
>  				     unsigned long maxlen);
> +extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> +					unsigned long offset,
> +					unsigned long maxlen);
>  /* Create a virtual mapping cookie for a port on a given PCI device.
>   * Do not call this directly, it exists to make it easier for architectures
>   * to override */
> @@ -34,12 +38,22 @@ static inline void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned lon
>  	return NULL;
>  }
>  
> +static inline void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long max)
> +{
> +	return NULL;
> +}
>  static inline void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
>  					    unsigned long offset,
>  					    unsigned long maxlen)
>  {
>  	return NULL;
>  }
> +static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
> +					       unsigned long offset,
> +					       unsigned long maxlen)
> +{
> +	return NULL;
> +}
>  #endif
>  
>  #endif /* __ASM_GENERIC_IO_H */
> diff --git a/lib/pci_iomap.c b/lib/pci_iomap.c
> index bcce5f1..9604dcb 100644
> --- a/lib/pci_iomap.c
> +++ b/lib/pci_iomap.c
> @@ -52,6 +52,46 @@ void __iomem *pci_iomap_range(struct pci_dev *dev,
>  EXPORT_SYMBOL(pci_iomap_range);
>  
>  /**
> + * pci_iomap_wc_range - create a virtual WC mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @offset: map memory at the given offset in BAR
> + * @maxlen: max length of the memory to map
> + *
> + * Using this function you will get a __iomem address to your device BAR.
> + * You can access it using ioread*() and iowrite*(). These functions hide
> + * the details if this is a MMIO or PIO address space and will just do what
> + * you expect from them in the correct way. When possible write combining
> + * is used.
> + *
> + * @maxlen specifies the maximum length to map. If you want to get access to
> + * the complete BAR from offset to the end, pass %0 here.
> + * */
> +void __iomem *pci_iomap_wc_range(struct pci_dev *dev,
> +				 int bar,
> +				 unsigned long offset,
> +				 unsigned long maxlen)
> +{
> +	resource_size_t start = pci_resource_start(dev, bar);
> +	resource_size_t len = pci_resource_len(dev, bar);
> +	unsigned long flags = pci_resource_flags(dev, bar);
> +
> +	if (len <= offset || !start)
> +		return NULL;
> +	len -= offset;
> +	start += offset;
> +	if (maxlen && len > maxlen)
> +		len = maxlen;
> +	if (flags & IORESOURCE_IO)
> +		return NULL;
> +	if (flags & IORESOURCE_MEM)
> +		return ioremap_wc(start, len);
> +	/* What? */
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_wc_range);
> +
> +/**
>   * pci_iomap - create a virtual mapping cookie for a PCI BAR
>   * @dev: PCI device that owns the BAR
>   * @bar: BAR number
> @@ -70,4 +110,25 @@ void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen)
>  	return pci_iomap_range(dev, bar, 0, maxlen);
>  }
>  EXPORT_SYMBOL(pci_iomap);
> +
> +/**
> + * pci_iomap_wc - create a virtual WC mapping cookie for a PCI BAR
> + * @dev: PCI device that owns the BAR
> + * @bar: BAR number
> + * @maxlen: length of the memory to map
> + *
> + * Using this function you will get a __iomem address to your device BAR.
> + * You can access it using ioread*() and iowrite*(). These functions hide
> + * the details if this is a MMIO or PIO address space and will just do what
> + * you expect from them in the correct way. When possible write combining
> + * is used.
> + *
> + * @maxlen specifies the maximum length to map. If you want to get access to
> + * the complete BAR without checking for its length first, pass %0 here.
> + * */
> +void __iomem *pci_iomap_wc(struct pci_dev *dev, int bar, unsigned long maxlen)
> +{
> +	return pci_iomap_wc_range(dev, bar, 0, maxlen);
> +}
> +EXPORT_SYMBOL_GPL(pci_iomap_wc);
>  #endif /* CONFIG_PCI */


--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luis Chamberlain June 24, 2015, 4:38 p.m. UTC | #2

On Wed, Jun 24, 2015 at 08:42:23AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote:
> > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> > 
> > PCI BARs tell us whether prefetching is safe, but they don't say anything
> > about write combining (WC).  WC changes ordering rules and allows writes to
> > be collapsed, so it's not safe in general to use it on a prefetchable
> > region.
> 
> Well, the PCIe spec at least specifies that a prefetchable BAR also
> tolerates write merging... 

How can that be determined and can that be used as a full bullet proof hint
to enable wc ? And are you sure? :) Reason all this was stated was to be
apologetic over why we can't automate this behind the scenes. Otherwise
we could amend what you stated into the commit log to elaborate on our
technical apology. Let me know!

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 24, 2015, 10:05 p.m. UTC | #3

On Wed, 2015-06-24 at 18:38 +0200, Luis R. Rodriguez wrote:
> On Wed, Jun 24, 2015 at 08:42:23AM +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote:
> > > From: "Luis R. Rodriguez" <mcgrof@suse.com>
> > > 
> > > PCI BARs tell us whether prefetching is safe, but they don't say anything
> > > about write combining (WC).  WC changes ordering rules and allows writes to
> > > be collapsed, so it's not safe in general to use it on a prefetchable
> > > region.
> > 
> > Well, the PCIe spec at least specifies that a prefetchable BAR also
> > tolerates write merging... 
> 
> How can that be determined and can that be used as a full bullet proof hint
> to enable wc ? And are you sure? :) 

Well, I"m sure the spec says that ;-) But it could be new to PCIe, I
haven't checked legacy PCI.

> Reason all this was stated was to be
> apologetic over why we can't automate this behind the scenes. Otherwise
> we could amend what you stated into the commit log to elaborate on our
> technical apology. Let me know!

At least on powerpc, for mmap of resource to userspace, we take off the
garded bit in the PTE for prefetchable BARs. This has the effect
architecturally of enabling both prefetch and write combine (ie. side
effect) though afaik, the implementations probably don't actually
prefetch. We've done that for years.

In fact we don't have a way to split the notions, it's either G or no G,
which carries both meanings.

Do you have example/case of a device having problems ?

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luis Chamberlain June 24, 2015, 10:29 p.m. UTC | #4

On Wed, Jun 24, 2015 at 3:05 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2015-06-24 at 18:38 +0200, Luis R. Rodriguez wrote:
>> On Wed, Jun 24, 2015 at 08:42:23AM +1000, Benjamin Herrenschmidt wrote:
>> > On Fri, 2015-06-19 at 15:08 -0700, Luis R. Rodriguez wrote:
>> > > From: "Luis R. Rodriguez" <mcgrof@suse.com>
>> > >
>> > > PCI BARs tell us whether prefetching is safe, but they don't say anything
>> > > about write combining (WC).  WC changes ordering rules and allows writes to
>> > > be collapsed, so it's not safe in general to use it on a prefetchable
>> > > region.
>> >
>> > Well, the PCIe spec at least specifies that a prefetchable BAR also
>> > tolerates write merging...
>>
>> How can that be determined and can that be used as a full bullet proof hint
>> to enable wc ? And are you sure? :)
>
> Well, I"m sure the spec says that ;-) But it could be new to PCIe, I
> haven't checked legacy PCI.

OK cool so to be clear from what I gather you are suggesting (or not
and letting me make it) is that we might be able to enforce
write-merging on prefetchable areas, and if we can *ensure* we do this
then automatically enable write-combining behind the scenes?

>> Reason all this was stated was to be
>> apologetic over why we can't automate this behind the scenes. Otherwise
>> we could amend what you stated into the commit log to elaborate on our
>> technical apology. Let me know!
>
> At least on powerpc, for mmap of resource to userspace, we take off the
> garded bit in the PTE for prefetchable BARs. This has the effect
> architecturally of enabling both prefetch and write combine (ie. side
> effect)

That's pretty darn sexy.

> though afaik, the implementations probably don't actually
> prefetch. We've done that for years.

Neat!

> In fact we don't have a way to split the notions, it's either G or no G,
> which carries both meanings.

Interesting.

> Do you have example/case of a device having problems ?

Nope but at least what made me squint at this being a possible
"feature" was that in practice when reviewing all of the kernels
pending device drivers using MTRR (potential write-combine candidates)
I encountered a slew of them which had the architectural unfortunate
practice of combining PCI bars for MMIO and their respective
write-combined desirable area (framebuffer for video, PIO buffers for
infiniband, etc). Now, to me that read more as a practice for old
school devices when such things were likely still being evaluated,
more modern devices seem to adhere to sticking a full PCI bar with
write-combining or not. Did you not encounter such mismatch splits on
powerpc ? Was such possibility addressed?

If what you are implying here is applicable to the x86 world I'm all
for enabling this as we'd have less code to maintain but I'll note
that getting a clarification alone on that prefetchable !=
write-combining was in and of itself hard, I'd be surprised if we
could get full architectural buy-in to this as an immediate automatic
feature. Because of this and because PAT did have some errata as well,
I would not be surprised if some PCI bridges / devices would end up
finding corner cases, as such if we can really do what you're saying
and unless we can get some super sane certainty over it across the
board, I'd be inclined to leave such things as a part of a new API.
Maybe have some folks test using the new API for all calls and after
some sanity of testing / releases consider a full switch.

That is, unless of course you're sure all this is sane and would wager
all-in on it from the get-go.

 Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 24, 2015, 11:38 p.m. UTC | #5

On Wed, 2015-06-24 at 15:29 -0700, Luis R. Rodriguez wrote:

> Nope but at least what made me squint at this being a possible
> "feature" was that in practice when reviewing all of the kernels
> pending device drivers using MTRR (potential write-combine candidates)
> I encountered a slew of them which had the architectural unfortunate
> practice of combining PCI bars for MMIO and their respective
> write-combined desirable area (framebuffer for video, PIO buffers for
> infiniband, etc). Now, to me that read more as a practice for old
> school devices when such things were likely still being evaluated,
> more modern devices seem to adhere to sticking a full PCI bar with
> write-combining or not. Did you not encounter such mismatch splits on
> powerpc ? Was such possibility addressed?

Yes, I remember we dealt with some networking (or maybe IB) stuff back
in the day. We dealt with it by using a WC mapping and explicit barriers
to prevent combine when not wanted.

It is to be noted that on powerpc at least, writel() and co will never
combine due to the memory barriers in them. Only "normal" stores (or
__raw_writel) will.

On Intel things I different I assume...

The problem I see is that architectures can provide widely different
mechanisms and semantics in those areas and it's hard to define a
generic driver interface.

> If what you are implying here is applicable to the x86 world I'm all
> for enabling this as we'd have less code to maintain but I'll note
> that getting a clarification alone on that prefetchable !=
> write-combining was in and of itself hard, I'd be surprised if we
> could get full architectural buy-in to this as an immediate automatic
> feature.

I'm happy not to make it automatic for kernel space. As for user
mappings, maybe the right thing to do is to let us do what we do by
default with a quirk that can set a flag in pci_dev to disable that
behaviour (maybe on a per BAR basis ?). I think the common case is
that WC works.

>  Because of this and because PAT did have some errata as well,
> I would not be surprised if some PCI bridges / devices would end up
> finding corner cases, as such if we can really do what you're saying
> and unless we can get some super sane certainty over it across the
> board, I'd be inclined to leave such things as a part of a new API.
> Maybe have some folks test using the new API for all calls and after
> some sanity of testing / releases consider a full switch.
> 
> That is, unless of course you're sure all this is sane and would wager
> all-in on it from the get-go.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luis Chamberlain June 25, 2015, 12:08 a.m. UTC | #6

On Thu, Jun 25, 2015 at 09:38:01AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2015-06-24 at 15:29 -0700, Luis R. Rodriguez wrote:
> 
> > Nope but at least what made me squint at this being a possible
> > "feature" was that in practice when reviewing all of the kernels
> > pending device drivers using MTRR (potential write-combine candidates)
> > I encountered a slew of them which had the architectural unfortunate
> > practice of combining PCI bars for MMIO and their respective
> > write-combined desirable area (framebuffer for video, PIO buffers for
> > infiniband, etc). Now, to me that read more as a practice for old
> > school devices when such things were likely still being evaluated,
> > more modern devices seem to adhere to sticking a full PCI bar with
> > write-combining or not. Did you not encounter such mismatch splits on
> > powerpc ? Was such possibility addressed?
> 
> Yes, I remember we dealt with some networking (or maybe IB) stuff back
> in the day. We dealt with it by using a WC mapping and explicit barriers
> to prevent combine when not wanted.
> 
> It is to be noted that on powerpc at least, writel() and co will never
> combine due to the memory barriers in them. Only "normal" stores (or
> __raw_writel) will.
> 
> On Intel things I different I assume...

And the people who really know seem to be eaten by volcanoes or not have time.

> The problem I see is that architectures can provide widely different
> mechanisms and semantics in those areas and it's hard to define a
> generic driver interface.

Provided asm generic helpers are defined this should work though. The question
is just if there is enough motivation. Doesn't sound like it or as you note
maybe for userspace there might be. My position is that if it was too late for
PCIE or if this was too ambigious for PCIE perhaps the next generation bus
archicture or ammendments (I have no clue if this would would be possible) will
make this part of future device negotiation clear and fully expected, not a
wonderful side effect.

> > If what you are implying here is applicable to the x86 world I'm all
> > for enabling this as we'd have less code to maintain but I'll note
> > that getting a clarification alone on that prefetchable !=
> > write-combining was in and of itself hard, I'd be surprised if we
> > could get full architectural buy-in to this as an immediate automatic
> > feature.
> 
> I'm happy not to make it automatic for kernel space.

OK thanks I'll proceed with these patches then.

> As for user mappings,

Which APIs were you considering in this regard BTW?

> maybe the right thing to do is to let us do what we do by
> default with a quirk that can set a flag in pci_dev to disable that
> behaviour (maybe on a per BAR basis ?).

That might mean it could restrict userspace WC to require devices
to have WC parts on a full PCI BAR. Although this is restrictive
having reviewed most WC uses in the kernel I'd think this would be
a fair compromise to make, but again, if things are still murky
perhaps best we kiss this idea good bye for now and hope for it
to come in on future buses or ammendments (if that's even possible?).

> I think the common case is that WC works.

If WC does not I will note one hack which migh be worth mentioning -- just for
the record, this was devised as a shortcoming of a device where they failed to
split things properly and that *without* WC performance suffered quite a bit so
they made one full PCI BAR WC and as a work around this:

http://lkml.kernel.org/r/20150416041837.GA5712@hykim-PC

That is for registers that needed it:

write; wmb;

Then if they wanted to wait till the NIC has seen the write, they did:

write; wmb; read;

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 25, 2015, 12:52 a.m. UTC | #7

On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote:
> 
> OK thanks I'll proceed with these patches then.
> 
> > As for user mappings,
> 
> Which APIs were you considering in this regard BTW?

mmap of the generic /sys/bus/pci/.../resource*

> > maybe the right thing to do is to let us do what we do by
> > default with a quirk that can set a flag in pci_dev to disable that
> > behaviour (maybe on a per BAR basis ?).
> 
> That might mean it could restrict userspace WC to require devices
> to have WC parts on a full PCI BAR. Although this is restrictive
> having reviewed most WC uses in the kernel I'd think this would be
> a fair compromise to make, but again, if things are still murky
> perhaps best we kiss this idea good bye for now and hope for it
> to come in on future buses or ammendments (if that's even possible?).
> 
> > I think the common case is that WC works.
> 
> If WC does not I will note one hack which migh be worth mentioning --
> just for
> the record, this was devised as a shortcoming of a device where they
> failed to
> split things properly and that *without* WC performance suffered quite
> a bit so
> they made one full PCI BAR WC and as a work around this:
> 
> http://lkml.kernel.org/r/20150416041837.GA5712@hykim-PC
> 
> That is for registers that needed it:
> 
> write; wmb;
> 
> Then if they wanted to wait till the NIC has seen the write, they did:
> 
> write; wmb; read;
> 

Right, and as I mentioned, on some archs like powerpc (and possibly
more), writel() and co contains an implicit mb()

>   Luis--


--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luis Chamberlain June 25, 2015, 12:58 a.m. UTC | #8

On Wed, Jun 24, 2015 at 5:52 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote:
>>
>> OK thanks I'll proceed with these patches then.
>>
>> > As for user mappings,
>>
>> Which APIs were you considering in this regard BTW?
>
> mmap of the generic /sys/bus/pci/.../resource*

Like? Got a demo patch in mind ? :)

 Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 25, 2015, 1:12 a.m. UTC | #9

On Wed, 2015-06-24 at 17:58 -0700, Luis R. Rodriguez wrote:
> On Wed, Jun 24, 2015 at 5:52 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Thu, 2015-06-25 at 02:08 +0200, Luis R. Rodriguez wrote:
> >>
> >> OK thanks I'll proceed with these patches then.
> >>
> >> > As for user mappings,
> >>
> >> Which APIs were you considering in this regard BTW?
> >
> > mmap of the generic /sys/bus/pci/.../resource*
> 
> Like? Got a demo patch in mind ? :)

Nope. I was just thinking out loud. Today I have yet to see a problem
with what we do so ...

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Casey Leedom June 25, 2015, 3:01 p.m. UTC | #10

/ From: Benjamin Herrenschmidt [benh@kernel.crashing.org]
| Sent: Wednesday, June 24, 2015 4:38 PM
| 
| It is to be noted that on powerpc at least, writel() and co will never
| combine due to the memory barriers in them. Only "normal" stores (or
\ __raw_writel) will.

/ From: Benjamin Herrenschmidt [benh@kernel.crashing.org]
| Sent: Wednesday, June 24, 2015 5:52 PM
| 
| Right, and as I mentioned, on some archs like powerpc (and possibly
\ more), writel() and co contains an implicit mb()

  Hhmmm, interesting.  I definitely hadn't known that.  In our network
drivers we explicitly manage all of our own memory barrier semantics.
(wmb() between writes to DMA Queues and writes to Device Registers
informing hardware of new data available to be DMA'ed, etc.)  And we
do have code to take advantage of Write Combining Mappings using
writeq().  It looks like this is implemented eventually as out_be64()
in arch/powerpc/include/asm/io.h for PPC-BE:

#define DEF_MMIO_OUT_D(name, size, insn)                                \
static inline void name(volatile u##size __iomem *addr, u##size val)    \
{                                                                       \
        __asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0"                 \
                : "=m" (*addr) : "r" (val) : "memory");                 \
        IO_SET_SYNC_FLAG();                                             \
}

DEF_MMIO_OUT_D(out_be64, 64, std);

  So, if understand your notes and the source correctly, all of
our code to do register reads/writes are getting SYNC instructions,
as are the 64-bit writeq()s we're attempting to use for our Write
Combining code on PowerPC.  Hhmmm indeed ...

  Is there a reference I can read on this so I can understand
when and where we can use the __raw_*() APIs?  Can these
Raw Read/Write operations be reordered with respect to
each other or are the use of the various flavors of SYNC
instructions just to maintain order between Cached Memory
Accesses and I/O Instructions?

Casey--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Arnd Bergmann June 25, 2015, 8:44 p.m. UTC | #11

On Thursday 25 June 2015 15:01:56 Casey Leedom wrote:
> 
>   Is there a reference I can read on this so I can understand
> when and where we can use the __raw_*() APIs?  Can these
> Raw Read/Write operations be reordered with respect to
> each other or are the use of the various flavors of SYNC
> instructions just to maintain order between Cached Memory
> Accesses and I/O Instructions?

The interpretation is not consistent across architectures.

My best description would be that the __raw_*() accessors should
only be used for accessing RAM areas that are known to have no
side-effects and can be read in any size (8-bit to 64-bit wide),
any alignment, and do not have a specific endianess.

If you are dealing with MMIO registers that have a fixed endianess
and size, the correct accessor would be readl_relaxed(), which
is like readl() but lacks the barriers on certain architectures.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Casey Leedom June 25, 2015, 9:40 p.m. UTC | #12

| From: Arnd Bergmann [arnd@arndb.de]
| Sent: Thursday, June 25, 2015 1:44 PM
| 
| On Thursday 25 June 2015 15:01:56 Casey Leedom wrote:
| >
| >   Is there a reference I can read on this so I can understand
| > when and where we can use the __raw_*() APIs?  Can these
| > Raw Read/Write operations be reordered with respect to
| > each other or are the use of the various flavors of SYNC
| > instructions just to maintain order between Cached Memory
| > Accesses and I/O Instructions?
| 
| The interpretation is not consistent across architectures.
| 
| My best description would be that the __raw_*() accessors should
| only be used for accessing RAM areas that are known to have no
| side-effects and can be read in any size (8-bit to 64-bit wide),
| any alignment, and do not have a specific endianess.
| 
| If you are dealing with MMIO registers that have a fixed endianess
| and size, the correct accessor would be readl_relaxed(), which
| is like readl() but lacks the barriers on certain architectures.

Ah, thanks.  I see now that the __raw_*() APIs don't do any of the
Endian Swizzling.  Unfortunately the *_relaxed() APIs on PowerPC
are just defined as the normal *() routines.  From
arch/powerpc/include/asm/io.h:

    /*
     * We don't do relaxed operations yet, at least not with this semantic
     */
    #define readb_relaxed(addr)     readb(addr)
    #define readw_relaxed(addr)     readw(addr)
    #define readl_relaxed(addr)     readl(addr)
    #define readq_relaxed(addr)     readq(addr)
    #define writeb_relaxed(v, addr) writeb(v, addr)
    #define writew_relaxed(v, addr) writew(v, addr)
    #define writel_relaxed(v, addr) writel(v, addr)
    #define writeq_relaxed(v, addr) writeq(v, addr)

    (And in fact, this is true for most of the architectures.)

So we could go ahead and use these but for now there wouldn't be
any effect.

Hhmmm, so what do PowerPC Drivers do when they want to take
advantage of Write Combining?  Do their own Endian Swizzling
with the __raw_*() APIs?

Casey--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 25, 2015, 10:51 p.m. UTC | #13

On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote:
> Hhmmm, so what do PowerPC Drivers do when they want to take
> advantage of Write Combining?  Do their own Endian Swizzling
> with the __raw_*() APIs?

Yeah either, we need to fix our relaxed implementation (patch
welcome :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 26, 2015, 2:02 a.m. UTC | #14

On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote:
> 
> Ah, thanks.  I see now that the __raw_*() APIs don't do any of the
> Endian Swizzling.  Unfortunately the *_relaxed() APIs on PowerPC
> are just defined as the normal *() routines.  From
> arch/powerpc/include/asm/io.h:
> 
>     /*
>      * We don't do relaxed operations yet, at least not with this
> semantic
>      */

Yes so I was looking at this but there are some difficulties.
Architecturally, even with I=1 G=1 mappings (normal ioremap), we have no
guarantee of ordering of load vs. store unless I misunderstood
something. I think all current implementations provide some of that but
without barriers in the accessors, we aren't architecturally correct.

However, having those barriers will cause issues with G=0 (write
combine). It's unclear whether eieio() will provide the required
ordering for I=1 G=0 mappings and it will probably break write combine.

I'm looking into it with our HW guys and will try to come up with a
solution for power, but it doesn't help that our memory model conflates
write combining with other relaxations and that all our barriers also
prevent write combine.

Maybe we can bias the relaxed accessors toward write, by having no
barriers in it, and putting extra ones in reads.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Casey Leedom June 26, 2015, 4:24 p.m. UTC | #15

Thanks for looking into this Ben.  As it stands now, it seems as
if Write Combined mappings simply aren't supported and/or all
driver writers trying to utilize Write Combined mappings have to
"hand roll" their own solutions which really isn't supportable.

  One thing that might be considered is simply to treat the desire
to utilize the Write Combining hardware as a separate issue and
develop writel_wc(), writeq_wc(), etc.  Those could be defined
as legal only for Write Combined Mappings and would "do the
right thing" for each architecture.  The initial implementations
could simply be writel(), writeq(), etc. till each architecture'si
ssues were investigated and understood.

  Additionally, it would be very useful to have some sort of
predicate which could be used to determine if an architecture
supported Write Combining.  Our drivers would use this to
eliminate a wasteful attempt to use write combining on those
architectures where it didn't work.

Casey

Luis Chamberlain June 26, 2015, 7:31 p.m. UTC | #16

On Fri, Jun 26, 2015 at 08:51:58AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote:
> > Hhmmm, so what do PowerPC Drivers do when they want to take
> > advantage of Write Combining?  Do their own Endian Swizzling
> > with the __raw_*() APIs?
> 
> Yeah either, we need to fix our relaxed implementation (patch
> welcome :-)

Yikes, so although powerpc has useful heuristics to automatically
enable WC the default write ops have been nullifying its effects?
Shouldn't this have been blatantly obvious in performance benchmarks
and expectations? If not a change should give considerable performance
gains... If the WC was nullified on powerpc then the experience and
value of the heuristics have less value and even if they are going
to be only considered for userspace mmap() it should be taken with
a bit grain of salt it seems.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 26, 2015, 10 p.m. UTC | #17

On Fri, 2015-06-26 at 16:24 +0000, Casey Leedom wrote:
>   Thanks for looking into this Ben.  As it stands now, it seems as
> if Write Combined mappings simply aren't supported and/or all
> driver writers trying to utilize Write Combined mappings have to
> "hand roll" their own solutions which really isn't supportable.
> 
>   One thing that might be considered is simply to treat the desire
> to utilize the Write Combining hardware as a separate issue and
> develop writel_wc(), writeq_wc(), etc.  Those could be defined
> as legal only for Write Combined Mappings and would "do the
> right thing" for each architecture.  

The question then is what is "the right thing". In the powerpc case,
we'll have a non-garded mapping, which means we also get no ordering
between load and stores.

> The initial implementations
> could simply be writel(), writeq(), etc. till each architecture'si
> ssues were investigated and understood.
> 
>   Additionally, it would be very useful to have some sort of
> predicate which could be used to determine if an architecture
> supported Write Combining.  Our drivers would use this to
> eliminate a wasteful attempt to use write combining on those
> architectures where it didn't work.
> 
> Casey
> 
> ________________________________________
> From: Benjamin Herrenschmidt [benh@kernel.crashing.org]
> Sent: Thursday, June 25, 2015 7:02 PM
> To: Casey Leedom
> Cc: Arnd Bergmann; Luis R. Rodriguez; Michael S. Tsirkin; Bjorn Helgaas; Toshi Kani; Andy Lutomirski; Juergen Gross; Tomi Valkeinen; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; xen-devel@lists.xensource.com; linux-fbdev; Suresh Siddha; Ingo Molnar; Thomas Gleixner; Daniel Vetter; Dave Airlie; Antonino Daplas; Jean-Christophe Plagniol-Villard; Dave Hansen; venkatesh.pallipadi@intel.com; Stefan Bader; ville.syrjala@linux.intel.com; David Vrabel; Jan Beulich; Roger Pau Monné
> Subject: Re: [PATCH v7 5/9] PCI: Add pci_iomap_wc() variants
> 
> On Thu, 2015-06-25 at 21:40 +0000, Casey Leedom wrote:
> >
> > Ah, thanks.  I see now that the __raw_*() APIs don't do any of the
> > Endian Swizzling.  Unfortunately the *_relaxed() APIs on PowerPC
> > are just defined as the normal *() routines.  From
> > arch/powerpc/include/asm/io.h:
> >
> >     /*
> >      * We don't do relaxed operations yet, at least not with this
> > semantic
> >      */
> 
> Yes so I was looking at this but there are some difficulties.
> Architecturally, even with I=1 G=1 mappings (normal ioremap), we have no
> guarantee of ordering of load vs. store unless I misunderstood
> something. I think all current implementations provide some of that but
> without barriers in the accessors, we aren't architecturally correct.
> 
> However, having those barriers will cause issues with G=0 (write
> combine). It's unclear whether eieio() will provide the required
> ordering for I=1 G=0 mappings and it will probably break write combine.
> 
> I'm looking into it with our HW guys and will try to come up with a
> solution for power, but it doesn't help that our memory model conflates
> write combining with other relaxations and that all our barriers also
> prevent write combine.
> 
> Maybe we can bias the relaxed accessors toward write, by having no
> barriers in it, and putting extra ones in reads.
> 
> Cheers,
> Ben.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 26, 2015, 10:04 p.m. UTC | #18

On Fri, 2015-06-26 at 21:31 +0200, Luis R. Rodriguez wrote:
> > Yeah either, we need to fix our relaxed implementation (patch
> > welcome :-)
> 
> Yikes, so although powerpc has useful heuristics to automatically
> enable WC the default write ops have been nullifying its effects?

The heuristic is for userspace mapping which don't use the kernel
accessors. To cut a long story short, this was all done back in the day
to speed up X.org which doesn't use fancy accessors with barriers to
access the framebuffer (neither does the kernel fbdev layer).

> Shouldn't this have been blatantly obvious in performance benchmarks
> and expectations? 

> If not a change should give considerable performance
> gains... If the WC was nullified on powerpc then the experience and
> value of the heuristics have less value and even if they are going
> to be only considered for userspace mmap() it should be taken with
> a bit grain of salt it seems.

It wasn't nullified for the main user at the time, the fb. And I
mentioned an IB adapter or two for which the code had been hand tuned.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt June 26, 2015, 11:54 p.m. UTC | #19

On Fri, 2015-06-26 at 15:41 -0700, Luis R. Rodriguez wrote:

> > It wasn't nullified for the main user at the time, the fb. And I
> > mentioned an IB adapter or two for which the code had been hand
> tuned.
> 
> This still means there could be some affected drivers when used on
> powerpc, no?

Yes. In fact what about things like ARM who also have barriers in their
writel() ? Won't they also break WC ?

I'm trying to work with the architect and designers here to figure out
exactly where we stand and what we can do. As spelled out by our
architecture, things don't look great, because basically, we only have
attribute bit (garded) which when not set implies both WC and out of
order (& prefetch), and unclear barrier semantics in that case as well.

I *think* we might be able to settle with something along the lines of
"writel_relaxed() will allow combine on a WC mapping" but how I'm going
to get there is TBD.

It would be interesting to clarify the semantics of using the relaxed
accessors in combination with WC anyway. I wouldn't mind if the
definition involved also relaxing general ordering :-) It would
definitely make my life easier.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luis Chamberlain June 27, 2015, 1:16 a.m. UTC | #20

On Fri, Jun 26, 2015 at 4:54 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Fri, 2015-06-26 at 15:41 -0700, Luis R. Rodriguez wrote:
>
>> > It wasn't nullified for the main user at the time, the fb. And I
>> > mentioned an IB adapter or two for which the code had been hand
>> tuned.
>>
>> This still means there could be some affected drivers when used on
>> powerpc, no?
>
> Yes. In fact what about things like ARM who also have barriers in their
> writel() ? Won't they also break WC ?

Not sure if they have that write-combine notion on their CPUs /
interconnects. Russel, Stefano, does ARM have write-combining ?

 Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Luis Chamberlain July 2, 2015, 6:49 p.m. UTC | #21

On Sat, Jun 27, 2015 at 08:00:48AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2015-06-26 at 16:24 +0000, Casey Leedom wrote:
> >   Thanks for looking into this Ben.  As it stands now, it seems as
> > if Write Combined mappings simply aren't supported and/or all
> > driver writers trying to utilize Write Combined mappings have to
> > "hand roll" their own solutions which really isn't supportable.
> > 
> >   One thing that might be considered is simply to treat the desire
> > to utilize the Write Combining hardware as a separate issue and
> > develop writel_wc(), writeq_wc(), etc.

That seems rather sloppy and cumbersome, its best to just provide the
infrastructure for initial mapping for an area and let the hardware do it for
you. With the current design drivers would just use regular read/write on all
areas and the only thing that will set them apart will be the mapping.  With
what you propose we'd end up having to shift a whole bunch of reads/writes for
just the purpose of adding WC to an area that didn't have wc before. That's
a waste of code and time.

> > Those could be defined
> > as legal only for Write Combined Mappings and would "do the
> > right thing" for each architecture.  

Yuck.

> The question then is what is "the right thing". In the powerpc case,
> we'll have a non-garded mapping, which means we also get no ordering
> between load and stores.

I don't follow, you *ordering* between load and stores for WC? We should
not need that for WC, its why WC is used for only very specific things
such as framebuffer and PIO (which BTw I still don't quite get all this
use case for infiniband to be honest, and I will note I do see some
proprietary hardware extensions like bursts but nothing covering all
this in a general doc, I think I think it all just has to do that this
is a hardware hack in reality, which we sell as a feature).

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Herrenschmidt July 2, 2015, 10:26 p.m. UTC | #22

On Thu, 2015-07-02 at 20:49 +0200, Luis R. Rodriguez wrote:
> > The question then is what is "the right thing". In the powerpc case,
> > we'll have a non-garded mapping, which means we also get no ordering
> > between load and stores.
> 
> I don't follow, you *ordering* between load and stores for WC? We should
> not need that for WC, its why WC is used for only very specific things
> such as framebuffer and PIO (which BTw I still don't quite get all this
> use case for infiniband to be honest, and I will note I do see some
> proprietary hardware extensions like bursts but nothing covering all
> this in a general doc, I think I think it all just has to do that this
> is a hardware hack in reality, which we sell as a feature).

Well, that's the problem, the semantics that we provide to drivers
aren't well defined.

The words "write combine" themselves only specify that writes to subsequent
addresses can be combined into larger transactions. That's in itself is already
quite vague (are their boundaries, limits ? some can depend on bus type, etc...)
though in practice is probably sufficient.

However, overloading a _wc mapping with additional memory model differences
such as loss of ordering between load and stores, etc... is not an obvious
thing to do. I agree it would make *my* life easier if we did it since this
is precisely the semantics provided by a "G=0" mapping on ppc, but we need
to agree and *document* it, otherwise bad things will happen eventually.

We also need to document in that case which barriers can be used to explicitly
enforce the ordering on such a mapping and which barriers can be used to break
write combine (they don't necessarily have to be the same).

We also need to document which accessors will actually provide the write combine
"feature" of a _wc mapping. For example while writel() will do it on Intel, it
will not on ppc and I wouldn't be surprised if a bunch of other archs fall in
the same bucket as ppc (basically anything that has barriers in their writel
implementation to order vs. DMA etc...).

So we might need to explicitly document that writel_relaxed() needs to be used.

Finally what are the precise guaranteed semantics of writeX/readX,
writeX_relaxed/readX_relaxed and __raw_ (everything else) on a _wc mapping, what
do we mandate and document, what do we leave to be implementation dependent ?

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Casey Leedom July 3, 2015, 12:14 a.m. UTC | #23

> On Jul 2, 2015, at 11:49 AM, Luis R. Rodriguez <mcgrof@suse.com> wrote:
> 
> On Sat, Jun 27, 2015 at 08:00:48AM +1000, Benjamin Herrenschmidt wrote:
>> On Fri, 2015-06-26 at 16:24 +0000, Casey Leedom wrote:
>>>  Thanks for looking into this Ben.  As it stands now, it seems as
>>> if Write Combined mappings simply aren't supported and/or all
>>> driver writers trying to utilize Write Combined mappings have to
>>> "hand roll" their own solutions which really isn't supportable.
>>> 
>>>  One thing that might be considered is simply to treat the desire
>>> to utilize the Write Combining hardware as a separate issue and
>>> develop writel_wc(), writeq_wc(), etc.
> 
> That seems rather sloppy and cumbersome, its best to just provide the
> infrastructure for initial mapping for an area and let the hardware do it for
> you. With the current design drivers would just use regular read/write on all
> areas and the only thing that will set them apart will be the mapping.  With
> what you propose we'd end up having to shift a whole bunch of reads/writes for
> just the purpose of adding WC to an area that didn't have wc before. That's
> a waste of code and time.
> 
>>> Those could be defined
>>> as legal only for Write Combined Mappings and would "do the
>>> right thing" for each architecture.  
> 
> Yuck.

  Yeah, probably.  But it sounds like from Ben’s response, we should really use write_relaxed() as an already existing API for this.  But most of the architectures just define this as writel() so some work would need to be done to get it to work.

>> The question then is what is "the right thing". In the powerpc case,
>> we'll have a non-garded mapping, which means we also get no ordering
>> between load and stores.
> 
> I don't follow, you *ordering* between load and stores for WC? We should
> not need that for WC, its why WC is used for only very specific things
> such as framebuffer and PIO (which BTw I still don't quite get all this
> use case for infiniband to be honest, and I will note I do see some
> proprietary hardware extensions like bursts but nothing covering all
> this in a general doc, I think I think it all just has to do that this
> is a hardware hack in reality, which we sell as a feature).

  I was talking about the work that our drivers (cxgb4, cxgb4vf, etc.) do to ensure correct ordering between writes to memory and I/O space.  For instance, issuing a wmb() between writes to a DMA buffer and the write to a register telling the hardware that the data is available.  This isn’t necessary on the Strongly Ordered Intel architectures but is necessary on other architectures.

Casey--
To unsubscribe from this list: send the line "unsubscribe linux-fbdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v7,5/9] PCI: Add pci_iomap_wc() variants

Commit Message

Comments

Patch