scsi: lpfc: use memcpy_toio instead of writeq

Message ID	20180223153700.2186058-1-arnd@arndb.de (mailing list archive)
State	Changes Requested
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: Arnd Bergmann <arnd@arndb.de> To: James Smart <james.smart@broadcom.com>, Dick Kennedy <dick.kennedy@broadcom.com>, "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>, "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de>, Hannes Reinecke <hare@suse.com>, Johannes Thumshirn <jthumshirn@suse.de>, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] scsi: lpfc: use memcpy_toio instead of writeq Date: Fri, 23 Feb 2018 16:36:49 +0100 Message-Id: <20180223153700.2186058-1-arnd@arndb.de> Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk

Arnd Bergmann Feb. 23, 2018, 3:36 p.m. UTC

32-bit architectures generally cannot use writeq(), so we now get a build
failure for the lpfc driver:

drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean 'writeb'? [-Werror=implicit-function-declaration]

Another problem here is that writing out actual data (unlike accessing
mmio registers) means we must write the data with the same endianess
that we have read from memory, but writeq() will perform byte swaps
and add barriers inbetween accesses as we do for registers.

Using memcpy_toio() should do the right thing here, using register
sized stores with correct endianess conversion and barriers (i.e. none),
but on some architectures might fall back to byte-size access.

Side note: shouldn't the driver use ioremap_wc() instead of ioremap()
to get a write-combining mapping on all architectures that support this?

Fixes: 1351e69fc6db ("scsi: lpfc: Add push-to-adapter support to sli4")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/scsi/lpfc/lpfc_sli.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

Andy Shevchenko Feb. 23, 2018, 3:59 p.m. UTC | #1

On Fri, Feb 23, 2018 at 5:36 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> 32-bit architectures generally cannot use writeq(), so we now get a build
> failure for the lpfc driver:
>
> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean 'writeb'? [-Werror=implicit-function-declaration]
>
> Another problem here is that writing out actual data (unlike accessing
> mmio registers) means we must write the data with the same endianess
> that we have read from memory, but writeq() will perform byte swaps
> and add barriers inbetween accesses as we do for registers.
>
> Using memcpy_toio() should do the right thing here, using register
> sized stores with correct endianess conversion and barriers (i.e. none),
> but on some architectures might fall back to byte-size access.
>
> Side note: shouldn't the driver use ioremap_wc() instead of ioremap()
> to get a write-combining mapping on all architectures that support this?

IIRC memcpy_toio() doesn't increment the destination address.

lo_hi or hi_lo helpers sound better.

> Fixes: 1351e69fc6db ("scsi: lpfc: Add push-to-adapter support to sli4")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

> @@ -115,7 +115,6 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe)
>         struct lpfc_register doorbell;
>         uint32_t host_index;
>         uint32_t idx;
> -       uint32_t i = 0;
>         uint8_t *tmp;
>
>         /* sanity check on queue memory */
> @@ -138,12 +137,10 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe)
>         if (q->phba->sli3_options & LPFC_SLI4_PHWQ_ENABLED)
>                 bf_set(wqe_wqid, &wqe->generic.wqe_com, q->queue_id);
>         lpfc_sli_pcimem_bcopy(wqe, temp_wqe, q->entry_size);
> -       if (q->dpp_enable && q->phba->cfg_enable_dpp) {
> +       if (q->dpp_enable && q->phba->cfg_enable_dpp)
>                 /* write to DPP aperture taking advatage of Combined Writes */
> -               tmp = (uint8_t *)wqe;
> -               for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
> -                       writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);
> -       }
> +               memcpy_toio(tmp, q->dpp_regaddr, q->entry_size);
> +
>         /* ensure WQE bcopy and DPP flushed before doorbell write */
>         wmb();
>
> --
> 2.9.0
>

Andy Shevchenko Feb. 23, 2018, 4:13 p.m. UTC | #2

On Fri, Feb 23, 2018 at 5:59 PM, Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Fri, Feb 23, 2018 at 5:36 PM, Arnd Bergmann <arnd@arndb.de> wrote:

> IIRC memcpy_toio() doesn't increment the destination address.
>
> lo_hi or hi_lo helpers sound better.

Ah, sorry, I messed up with writesl() / etc.

memcpy_toio() has another side-effect, but here it seems unimportant.

David Laight Feb. 23, 2018, 4:41 p.m. UTC | #3

From: Arnd Bergmann
> Sent: 23 February 2018 15:37
> 
> 32-bit architectures generally cannot use writeq(), so we now get a build
> failure for the lpfc driver:
> 
> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean
> 'writeb'? [-Werror=implicit-function-declaration]
> 
> Another problem here is that writing out actual data (unlike accessing
> mmio registers) means we must write the data with the same endianess
> that we have read from memory, but writeq() will perform byte swaps
> and add barriers inbetween accesses as we do for registers.
> 
> Using memcpy_toio() should do the right thing here, using register
> sized stores with correct endianess conversion and barriers (i.e. none),
> but on some architectures might fall back to byte-size access.
...

Have you looked at the performance impact of this on x86?
Last time I looked memcpy_toio() aliased directly to memcpy().
memcpy() is run-time patched between several different algorithms.
On recent Intel cpus memcpy() is implemented as 'rep movsb' relying
on the hardware to DTRT.
For uncached accesses (typical for io) the 'RT' has to be byte transfers.
So instead of the 8 byte transfers (on 64 bit) you get single bytes.
This won't be what is intended!
memcpy_toio() should probably use 'rep movsd' for the bulk of the transfer.

	David

Arnd Bergmann Feb. 23, 2018, 4:44 p.m. UTC | #4

On Fri, Feb 23, 2018 at 5:41 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Arnd Bergmann
>> Sent: 23 February 2018 15:37
>>
>> 32-bit architectures generally cannot use writeq(), so we now get a build
>> failure for the lpfc driver:
>>
>> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
>> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean
>> 'writeb'? [-Werror=implicit-function-declaration]
>>
>> Another problem here is that writing out actual data (unlike accessing
>> mmio registers) means we must write the data with the same endianess
>> that we have read from memory, but writeq() will perform byte swaps
>> and add barriers inbetween accesses as we do for registers.
>>
>> Using memcpy_toio() should do the right thing here, using register
>> sized stores with correct endianess conversion and barriers (i.e. none),
>> but on some architectures might fall back to byte-size access.
> ...
>
> Have you looked at the performance impact of this on x86?
> Last time I looked memcpy_toio() aliased directly to memcpy().
> memcpy() is run-time patched between several different algorithms.
> On recent Intel cpus memcpy() is implemented as 'rep movsb' relying
> on the hardware to DTRT.
> For uncached accesses (typical for io) the 'RT' has to be byte transfers.
> So instead of the 8 byte transfers (on 64 bit) you get single bytes.
> This won't be what is intended!
> memcpy_toio() should probably use 'rep movsd' for the bulk of the transfer.

I'm not that familiar with x86, but I would have guessed that on a
write-combining I/O mapping, the hardware will combine the 'rep movsb'
output data the same was as on a cacheable mapping.

      Arnd

Andy Shevchenko Feb. 23, 2018, 4:51 p.m. UTC | #5

On Fri, Feb 23, 2018 at 6:41 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Arnd Bergmann
>> Sent: 23 February 2018 15:37
>>
>> 32-bit architectures generally cannot use writeq(), so we now get a build
>> failure for the lpfc driver:
>>
>> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
>> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean
>> 'writeb'? [-Werror=implicit-function-declaration]
>>
>> Another problem here is that writing out actual data (unlike accessing
>> mmio registers) means we must write the data with the same endianess
>> that we have read from memory, but writeq() will perform byte swaps
>> and add barriers inbetween accesses as we do for registers.
>>
>> Using memcpy_toio() should do the right thing here, using register
>> sized stores with correct endianess conversion and barriers (i.e. none),
>> but on some architectures might fall back to byte-size access.
> ...
>
> Have you looked at the performance impact of this on x86?
> Last time I looked memcpy_toio() aliased directly to memcpy().
> memcpy() is run-time patched between several different algorithms.
> On recent Intel cpus memcpy() is implemented as 'rep movsb' relying
> on the hardware to DTRT.
> For uncached accesses (typical for io) the 'RT' has to be byte transfers.
> So instead of the 8 byte transfers (on 64 bit) you get single bytes.
> This won't be what is intended!
> memcpy_toio() should probably use 'rep movsd' for the bulk of the transfer.

Maybe I'm wrong but it uses movsq on 64-bit and movsl on 32-bit.

The side-effect I referred previously is about tails, i.e. unaligned
bytes are transferred in portions
like
  7 on 64-bit will be  4 + 2 + 1,
  5 = 4 + 1
etc

Similar way on 32-bit.

Andy Shevchenko Feb. 23, 2018, 4:53 p.m. UTC | #6

On Fri, Feb 23, 2018 at 6:51 PM, Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Fri, Feb 23, 2018 at 6:41 PM, David Laight <David.Laight@aculab.com> wrote:
>> From: Arnd Bergmann
>>> Sent: 23 February 2018 15:37
>>>
>>> 32-bit architectures generally cannot use writeq(), so we now get a build
>>> failure for the lpfc driver:
>>>
>>> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
>>> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean
>>> 'writeb'? [-Werror=implicit-function-declaration]
>>>
>>> Another problem here is that writing out actual data (unlike accessing
>>> mmio registers) means we must write the data with the same endianess
>>> that we have read from memory, but writeq() will perform byte swaps
>>> and add barriers inbetween accesses as we do for registers.
>>>
>>> Using memcpy_toio() should do the right thing here, using register
>>> sized stores with correct endianess conversion and barriers (i.e. none),
>>> but on some architectures might fall back to byte-size access.
>> ...
>>
>> Have you looked at the performance impact of this on x86?
>> Last time I looked memcpy_toio() aliased directly to memcpy().
>> memcpy() is run-time patched between several different algorithms.
>> On recent Intel cpus memcpy() is implemented as 'rep movsb' relying
>> on the hardware to DTRT.
>> For uncached accesses (typical for io) the 'RT' has to be byte transfers.
>> So instead of the 8 byte transfers (on 64 bit) you get single bytes.
>> This won't be what is intended!
>> memcpy_toio() should probably use 'rep movsd' for the bulk of the transfer.
>
> Maybe I'm wrong but it uses movsq on 64-bit and movsl on 32-bit.
>
> The side-effect I referred previously is about tails, i.e. unaligned
> bytes are transferred in portions
> like
>   7 on 64-bit will be  4 + 2 + 1,
>   5 = 4 + 1
> etc
>
> Similar way on 32-bit.

Same for leading bytes as well.

arch/x86/lib/memcpy_64.S

So, I *hope* that in the code in question there is no unaligned access is used.

David Laight Feb. 23, 2018, 5:09 p.m. UTC | #7

From: Andy Shevchenko

> Sent: 23 February 2018 16:51

> On Fri, Feb 23, 2018 at 6:41 PM, David Laight <David.Laight@aculab.com> wrote:

> > From: Arnd Bergmann

> >> Sent: 23 February 2018 15:37

> >>

> >> 32-bit architectures generally cannot use writeq(), so we now get a build

> >> failure for the lpfc driver:

> >>

> >> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':

> >> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean

> >> 'writeb'? [-Werror=implicit-function-declaration]

> >>

> >> Another problem here is that writing out actual data (unlike accessing

> >> mmio registers) means we must write the data with the same endianess

> >> that we have read from memory, but writeq() will perform byte swaps

> >> and add barriers inbetween accesses as we do for registers.

> >>

> >> Using memcpy_toio() should do the right thing here, using register

> >> sized stores with correct endianess conversion and barriers (i.e. none),

> >> but on some architectures might fall back to byte-size access.

> > ...

> >

> > Have you looked at the performance impact of this on x86?

> > Last time I looked memcpy_toio() aliased directly to memcpy().

> > memcpy() is run-time patched between several different algorithms.

> > On recent Intel cpus memcpy() is implemented as 'rep movsb' relying

> > on the hardware to DTRT.

> > For uncached accesses (typical for io) the 'RT' has to be byte transfers.

> > So instead of the 8 byte transfers (on 64 bit) you get single bytes.

> > This won't be what is intended!

> > memcpy_toio() should probably use 'rep movsd' for the bulk of the transfer.

> 

> Maybe I'm wrong but it uses movsq on 64-bit and movsl on 32-bit.


(Let's not argue about the instruction mnemonic). 

You might expect that, but last time I looked at the bus cycles on a PCIe slave
that wasn't what I saw.

> The side-effect I referred previously is about tails, i.e. unaligned

> bytes are transferred in portions

> like

>   7 on 64-bit will be  4 + 2 + 1,

>   5 = 4 + 1


on 64bit memcpy() is allowed to do:
	(long *)(tgt+len)[-1] = (long *)(src+len)[-1];
	rep_movsq(tgt, src, len >> 3);
provided the length is at least 8.

The misaligned PCIe transfer generates a single TLP covering 12 bytes with the
relevant byte enables set for the first and last 32bit words.

	David

Andy Shevchenko Feb. 23, 2018, 5:12 p.m. UTC | #8

On Fri, Feb 23, 2018 at 7:09 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Andy Shevchenko
>> Sent: 23 February 2018 16:51
>> On Fri, Feb 23, 2018 at 6:41 PM, David Laight <David.Laight@aculab.com> wrote:


>> The side-effect I referred previously is about tails, i.e. unaligned
>> bytes are transferred in portions
>> like
>>   7 on 64-bit will be  4 + 2 + 1,
>>   5 = 4 + 1
>
> on 64bit memcpy() is allowed to do:
>         (long *)(tgt+len)[-1] = (long *)(src+len)[-1];
>         rep_movsq(tgt, src, len >> 3);
> provided the length is at least 8.
>
> The misaligned PCIe transfer generates a single TLP covering 12 bytes with the
> relevant byte enables set for the first and last 32bit words.

But is it guaranteed on any type of bus?
memcpy_toio() is a generic helper, so, first of all we need to be sure
what CPU on its side does, this is I think is pretty straight forward
since it's all written in asm for 64-bit case.
So, what about buses?

David Laight Feb. 23, 2018, 5:45 p.m. UTC | #9

From: Andy Shevchenko

> Sent: 23 February 2018 17:13

> To: David Laight

> Cc: Arnd Bergmann; James Smart; Dick Kennedy; James E.J. Bottomley; Martin K. Petersen; Hannes

> Reinecke; Johannes Thumshirn; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org

> Subject: Re: [PATCH] scsi: lpfc: use memcpy_toio instead of writeq

> 

> On Fri, Feb 23, 2018 at 7:09 PM, David Laight <David.Laight@aculab.com> wrote:

> > From: Andy Shevchenko

> >> Sent: 23 February 2018 16:51

> >> On Fri, Feb 23, 2018 at 6:41 PM, David Laight <David.Laight@aculab.com> wrote:

> 

> 

> >> The side-effect I referred previously is about tails, i.e. unaligned

> >> bytes are transferred in portions

> >> like

> >>   7 on 64-bit will be  4 + 2 + 1,

> >>   5 = 4 + 1

> >

> > on 64bit memcpy() is allowed to do:

> >         (long *)(tgt+len)[-1] = (long *)(src+len)[-1];

> >         rep_movsq(tgt, src, len >> 3);

> > provided the length is at least 8.

> >

> > The misaligned PCIe transfer generates a single TLP covering 12 bytes with the

> > relevant byte enables set for the first and last 32bit words.

> 

> But is it guaranteed on any type of bus?

> memcpy_toio() is a generic helper, so, first of all we need to be sure

> what CPU on its side does, this is I think is pretty straight forward

> since it's all written in asm for 64-bit case.

I've just done a compile test, on x86-64 memcpy_toio() generates a
call to memcpy() (checked with objdump -dr).
That is on a system running a 4.14 kernel, so is probably using the system
headers from that release.
I'd need to do a run-time test on a newer system verify what the PCIe slave
sees - but I changed our driver to do its own copy loops in order to avoid
byte transfers some time ago.

FWIW I was originally doing copy_to/from_user() directly to PCIe memory addresses!

On x86 'memory' on devices can always be accesses by simple instructions.
Hardware 'IO' addresses are not valid for memcpy_to/fromio().

	David

Arnd Bergmann Feb. 23, 2018, 9:02 p.m. UTC | #10

On Fri, Feb 23, 2018 at 4:36 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> @@ -138,12 +137,10 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe)
>         if (q->phba->sli3_options & LPFC_SLI4_PHWQ_ENABLED)
>                 bf_set(wqe_wqid, &wqe->generic.wqe_com, q->queue_id);
>         lpfc_sli_pcimem_bcopy(wqe, temp_wqe, q->entry_size);
> -       if (q->dpp_enable && q->phba->cfg_enable_dpp) {
> +       if (q->dpp_enable && q->phba->cfg_enable_dpp)
>                 /* write to DPP aperture taking advatage of Combined Writes */
> -               tmp = (uint8_t *)wqe;
> -               for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
> -                       writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);
> -       }
> +               memcpy_toio(tmp, q->dpp_regaddr, q->entry_size);
> +
>         /* ensure WQE bcopy and DPP flushed before doorbell write */
>         wmb();
>

Not sure where we are with the question of whether memcpy_toio
is a good replacement or not, but further build testing showed that
my patch was completely broken in more than one way:

I mixed up the source and destination arguments, and I used
the uninitialized 'tmp' instead of 'wqe'. Don't try this patch.

       Arnd

James Smart Feb. 24, 2018, 10:24 p.m. UTC | #11

About to post a patch to fix. Rather than fidgeting with the copy 
routine, I want to go back to what we originally proposed - writeq() on 
64bit, writel() on 32-bit.

-- james


On 2/23/2018 1:02 PM, Arnd Bergmann wrote:
> On Fri, Feb 23, 2018 at 4:36 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>> @@ -138,12 +137,10 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe)
>>          if (q->phba->sli3_options & LPFC_SLI4_PHWQ_ENABLED)
>>                  bf_set(wqe_wqid, &wqe->generic.wqe_com, q->queue_id);
>>          lpfc_sli_pcimem_bcopy(wqe, temp_wqe, q->entry_size);
>> -       if (q->dpp_enable && q->phba->cfg_enable_dpp) {
>> +       if (q->dpp_enable && q->phba->cfg_enable_dpp)
>>                  /* write to DPP aperture taking advatage of Combined Writes */
>> -               tmp = (uint8_t *)wqe;
>> -               for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
>> -                       writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);
>> -       }
>> +               memcpy_toio(tmp, q->dpp_regaddr, q->entry_size);
>> +
>>          /* ensure WQE bcopy and DPP flushed before doorbell write */
>>          wmb();
>>
> Not sure where we are with the question of whether memcpy_toio
> is a good replacement or not, but further build testing showed that
> my patch was completely broken in more than one way:
>
> I mixed up the source and destination arguments, and I used
> the uninitialized 'tmp' instead of 'wqe'. Don't try this patch.
>
>         Arnd

Johannes Thumshirn Feb. 25, 2018, 10:02 a.m. UTC | #12

Arnd Bergmann <arnd@arndb.de> writes:
> 32-bit architectures generally cannot use writeq(), so we now get a build
> failure for the lpfc driver:
>
> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean 'writeb'? [-Werror=implicit-function-declaration]
Hi Arnd,
why can't we use the writeq() from 'io-64-nonatomic-lo-hi.h'? I always
thought these are compat versions for 32 Bit archs and even asked James
to do so, what's why he did the change in the first place.

My apologies for this James.

Thanks,
        Johannes


>
> Another problem here is that writing out actual data (unlike accessing
> mmio registers) means we must write the data with the same endianess
> that we have read from memory, but writeq() will perform byte swaps
> and add barriers inbetween accesses as we do for registers.
>
> Using memcpy_toio() should do the right thing here, using register
> sized stores with correct endianess conversion and barriers (i.e. none),
> but on some architectures might fall back to byte-size access.
>
> Side note: shouldn't the driver use ioremap_wc() instead of ioremap()
> to get a write-combining mapping on all architectures that support this?
>
> Fixes: 1351e69fc6db ("scsi: lpfc: Add push-to-adapter support to sli4")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/scsi/lpfc/lpfc_sli.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
> index 4ce3ca6f4b79..6749d41753b4 100644
> --- a/drivers/scsi/lpfc/lpfc_sli.c
> +++ b/drivers/scsi/lpfc/lpfc_sli.c
> @@ -115,7 +115,6 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe)
>  	struct lpfc_register doorbell;
>  	uint32_t host_index;
>  	uint32_t idx;
> -	uint32_t i = 0;
>  	uint8_t *tmp;
>  
>  	/* sanity check on queue memory */
> @@ -138,12 +137,10 @@ lpfc_sli4_wq_put(struct lpfc_queue *q, union lpfc_wqe *wqe)
>  	if (q->phba->sli3_options & LPFC_SLI4_PHWQ_ENABLED)
>  		bf_set(wqe_wqid, &wqe->generic.wqe_com, q->queue_id);
>  	lpfc_sli_pcimem_bcopy(wqe, temp_wqe, q->entry_size);
> -	if (q->dpp_enable && q->phba->cfg_enable_dpp) {
> +	if (q->dpp_enable && q->phba->cfg_enable_dpp)
>  		/* write to DPP aperture taking advatage of Combined Writes */
> -		tmp = (uint8_t *)wqe;
> -		for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
> -			writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);
> -	}
> +		memcpy_toio(tmp, q->dpp_regaddr, q->entry_size);
> +
>  	/* ensure WQE bcopy and DPP flushed before doorbell write */
>  	wmb();

Arnd Bergmann Feb. 26, 2018, 9:03 a.m. UTC | #13

On Sun, Feb 25, 2018 at 11:02 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>> 32-bit architectures generally cannot use writeq(), so we now get a build
>> failure for the lpfc driver:
>>
>> drivers/scsi/lpfc/lpfc_sli.c: In function 'lpfc_sli4_wq_put':
>> drivers/scsi/lpfc/lpfc_sli.c:145:4: error: implicit declaration of function 'writeq'; did you mean 'writeb'? [-Werror=implicit-function-declaration]
> Hi Arnd,
> why can't we use the writeq() from 'io-64-nonatomic-lo-hi.h'? I always
> thought these are compat versions for 32 Bit archs and even asked James
> to do so, what's why he did the change in the first place.

That could work as well, if someone can figure out what the correct incantation
is that works on big-endian 32-bit architectures. I think the simplest
version that
works everywhere would be

lo_hi_writeq(__le32_to_cpup(__le32 __force *)p) |
(u64)__le32_to_cpup(__le32 __force *)p +1) << 32));

but this is ugly as hell and makes my head spin. I definitely would't want that
applied without being tested properly first on a variety of architectures.

There are three variants that I'd prefer over that:

- use memcpy_toio() and change the x86 implementation to guarantee aligned
  word aligned accesses on the output if at all possible (which seems to be
  a good idea anyway)
- use a loop around __raw_writeq()/__raw_writel() depending on CONFIG_64BIT
- add generic memcpy_toio_32() and memcpy_toio_64() helpers in linux/io.h
  and use those.

       Arnd

scsi: lpfc: use memcpy_toio instead of writeq

Commit Message

Comments

Patch