diff mbox series

[v2,1/7] crypto: hisilicon/qm - obtain the mailbox configuration at one time

Message ID 20230811140749.5202-2-qianweili@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Herbert Xu
Headers show
Series crypto: hisilicon - fix some issues in hisilicon drivers | expand

Commit Message

Weili Qian Aug. 11, 2023, 2:07 p.m. UTC
The malibox needs to be triggered by a 128bit atomic operation. The
reason is that one QM hardware entity in one accelerator servers QM
mailbox MMIO interfaces in related PF and VFs. A mutex cannot lock
mailbox processes in different functions. When multiple functions access
the mailbox simultaneously, if the generic IO interface readq/writeq
is used to access the mailbox, the data read from mailbox or written to
mailbox is unpredictable. Therefore, the generic IO interface is changed
to a 128bit atomic operation.

Signed-off-by: Weili Qian <qianweili@huawei.com>
---
 drivers/crypto/hisilicon/qm.c | 160 ++++++++++++++++++++++------------
 include/linux/hisi_acc_qm.h   |   1 -
 2 files changed, 105 insertions(+), 56 deletions(-)
 mode change 100644 => 100755 drivers/crypto/hisilicon/qm.c

Comments

Herbert Xu Aug. 18, 2023, 8:13 a.m. UTC | #1
On Fri, Aug 11, 2023 at 10:07:43PM +0800, Weili Qian wrote:
> The malibox needs to be triggered by a 128bit atomic operation. The
> reason is that one QM hardware entity in one accelerator servers QM
> mailbox MMIO interfaces in related PF and VFs. A mutex cannot lock
> mailbox processes in different functions. When multiple functions access
> the mailbox simultaneously, if the generic IO interface readq/writeq
> is used to access the mailbox, the data read from mailbox or written to
> mailbox is unpredictable. Therefore, the generic IO interface is changed
> to a 128bit atomic operation.
> 
> Signed-off-by: Weili Qian <qianweili@huawei.com>
> ---
>  drivers/crypto/hisilicon/qm.c | 160 ++++++++++++++++++++++------------
>  include/linux/hisi_acc_qm.h   |   1 -
>  2 files changed, 105 insertions(+), 56 deletions(-)
>  mode change 100644 => 100755 drivers/crypto/hisilicon/qm.c

...

> -	qm_mb_write(qm, mailbox);
> +#if IS_ENABLED(CONFIG_ARM64)
> +	asm volatile("ldp %0, %1, %3\n"
> +		     "stp %0, %1, %2\n"
> +		     "dmb oshst\n"
> +		     : "=&r" (tmp0),
> +		       "=&r" (tmp1),
> +		       "+Q" (*((char *)dst))
> +		     : "Q" (*((char __iomem *)fun_base))
> +		     : "memory");
> +#endif

You should add a generic 128-bite write primitive for arm64 instead
of doing it in raw assembly in the driver.

Thanks,
Ard Biesheuvel Aug. 18, 2023, 10:21 a.m. UTC | #2
On Fri, 18 Aug 2023 at 10:13, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Aug 11, 2023 at 10:07:43PM +0800, Weili Qian wrote:
> > The malibox needs to be triggered by a 128bit atomic operation. The
> > reason is that one QM hardware entity in one accelerator servers QM
> > mailbox MMIO interfaces in related PF and VFs. A mutex cannot lock
> > mailbox processes in different functions. When multiple functions access
> > the mailbox simultaneously, if the generic IO interface readq/writeq
> > is used to access the mailbox, the data read from mailbox or written to
> > mailbox is unpredictable. Therefore, the generic IO interface is changed
> > to a 128bit atomic operation.
> >
> > Signed-off-by: Weili Qian <qianweili@huawei.com>
> > ---
> >  drivers/crypto/hisilicon/qm.c | 160 ++++++++++++++++++++++------------
> >  include/linux/hisi_acc_qm.h   |   1 -
> >  2 files changed, 105 insertions(+), 56 deletions(-)
> >  mode change 100644 => 100755 drivers/crypto/hisilicon/qm.c
>
> ...
>
> > -     qm_mb_write(qm, mailbox);
> > +#if IS_ENABLED(CONFIG_ARM64)
> > +     asm volatile("ldp %0, %1, %3\n"
> > +                  "stp %0, %1, %2\n"
> > +                  "dmb oshst\n"
> > +                  : "=&r" (tmp0),
> > +                    "=&r" (tmp1),
> > +                    "+Q" (*((char *)dst))
> > +                  : "Q" (*((char __iomem *)fun_base))
> > +                  : "memory");
> > +#endif
>
> You should add a generic 128-bite write primitive for arm64 instead
> of doing it in raw assembly in the driver.
>

IIRC there have been other cases (ThunderX?) where 128 bit MMIO
accessors were needed for some peripheral, but common arm64 helpers
were rejected on the basis that this atomic behavior is not
architectural.

Obviously, using inline asm in the driver is not the right way either,
so perhaps we should consider introducing some common infrastructure
anyway, including some expectation management about their guarantees.
Herbert Xu Aug. 19, 2023, 4:12 a.m. UTC | #3
On Fri, Aug 18, 2023 at 12:21:02PM +0200, Ard Biesheuvel wrote:
>
> IIRC there have been other cases (ThunderX?) where 128 bit MMIO
> accessors were needed for some peripheral, but common arm64 helpers
> were rejected on the basis that this atomic behavior is not
> architectural.
> 
> Obviously, using inline asm in the driver is not the right way either,
> so perhaps we should consider introducing some common infrastructure
> anyway, including some expectation management about their guarantees.

The ones in

	drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h

look better.  So perhaps copy them into hisilicon?

Cheers,
Ard Biesheuvel Aug. 19, 2023, 7:33 a.m. UTC | #4
On Sat, 19 Aug 2023 at 06:12, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Fri, Aug 18, 2023 at 12:21:02PM +0200, Ard Biesheuvel wrote:
> >
> > IIRC there have been other cases (ThunderX?) where 128 bit MMIO
> > accessors were needed for some peripheral, but common arm64 helpers
> > were rejected on the basis that this atomic behavior is not
> > architectural.
> >
> > Obviously, using inline asm in the driver is not the right way either,
> > so perhaps we should consider introducing some common infrastructure
> > anyway, including some expectation management about their guarantees.
>
> The ones in
>
>         drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h
>
> look better.  So perhaps copy them into hisilicon?
>

No, that otx2_write128() routine looks buggy, actually, The ! at the
end means writeback, and so the register holding addr will be
modified, which is not reflect in the asm constraints. It also lacks a
barrier.

The generic version just ORs the low and high qwords together, so it
obviously only exists for compile coverage (and the generic atomic add
is clearly broken too)
Herbert Xu Aug. 21, 2023, 8:19 a.m. UTC | #5
On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote:
>
> No, that otx2_write128() routine looks buggy, actually, The ! at the
> end means writeback, and so the register holding addr will be
> modified, which is not reflect in the asm constraints. It also lacks a
> barrier.

OK.  But at least having a helper called write128 looks a lot
cleaner than just having unexplained assembly in the code.

Cheers,
Will Deacon Aug. 21, 2023, 10:26 a.m. UTC | #6
On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote:
> On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote:
> >
> > No, that otx2_write128() routine looks buggy, actually, The ! at the
> > end means writeback, and so the register holding addr will be
> > modified, which is not reflect in the asm constraints. It also lacks a
> > barrier.
> 
> OK.  But at least having a helper called write128 looks a lot
> cleaner than just having unexplained assembly in the code.

I guess we want something similar to how writeq() is handled on 32-bit
architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h.

It's then CPU-dependent on whether you get atomicity.

Will
Weili Qian Aug. 21, 2023, 12:45 p.m. UTC | #7
On 2023/8/21 18:26, Will Deacon wrote:
> On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote:
>> On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote:
>>>
>>> No, that otx2_write128() routine looks buggy, actually, The ! at the
>>> end means writeback, and so the register holding addr will be
>>> modified, which is not reflect in the asm constraints. It also lacks a
>>> barrier.
>>
>> OK.  But at least having a helper called write128 looks a lot
>> cleaner than just having unexplained assembly in the code.
> 
> I guess we want something similar to how writeq() is handled on 32-bit
> architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h.
> 
> It's then CPU-dependent on whether you get atomicity.
> 
> Will
> .
> 

Thanks for the review.

Since the HiSilicon accelerator devices are supported only on the ARM64 platform,
the following 128bit read and write interfaces are added to the driver, is this OK?

#if defined(CONFIG_ARM64)
static void qm_write128(void __iomem *addr, const void *buffer)
{
	unsigned long tmp0 = 0, tmp1 = 0;

	asm volatile("ldp %0, %1, %3\n"
		     "stp %0, %1, %2\n"
		     "dmb oshst\n"
		     : "=&r" (tmp0),
		     "=&r" (tmp1),
		     "+Q" (*((char __iomem *)addr))
		     : "Q" (*((char *)buffer))
		     : "memory");
}

static void qm_read128(void *buffer, const void __iomem *addr)
{
	unsigned long tmp0 = 0, tmp1 = 0;

	asm volatile("ldp %0, %1, %3\n"
		     "stp %0, %1, %2\n"
		     "dmb oshst\n"
		     : "=&r" (tmp0),
		       "=&r" (tmp1),
		       "+Q" (*((char *)buffer))
		     : "Q" (*((char __iomem *)addr))
		     : "memory");
}

#else
static void qm_write128(void __iomem *addr, const void *buffer)
{

}

static void qm_read128(void *buffer, const void __iomem *addr)
{

}
#endif

Thanks,
Weili
Ard Biesheuvel Aug. 21, 2023, 2:36 p.m. UTC | #8
On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote:
>
>
>
> On 2023/8/21 18:26, Will Deacon wrote:
> > On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote:
> >> On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote:
> >>>
> >>> No, that otx2_write128() routine looks buggy, actually, The ! at the
> >>> end means writeback, and so the register holding addr will be
> >>> modified, which is not reflect in the asm constraints. It also lacks a
> >>> barrier.
> >>
> >> OK.  But at least having a helper called write128 looks a lot
> >> cleaner than just having unexplained assembly in the code.
> >
> > I guess we want something similar to how writeq() is handled on 32-bit
> > architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h.
> >
> > It's then CPU-dependent on whether you get atomicity.
> >
> > Will
> > .
> >
>
> Thanks for the review.
>
> Since the HiSilicon accelerator devices are supported only on the ARM64 platform,
> the following 128bit read and write interfaces are added to the driver, is this OK?

No, this does not belong in the driver. As Will is suggesting, we
should define some generic helpers for this, and provide an arm64
implementation if the generic one does not result in the correct
codegen.

>
> #if defined(CONFIG_ARM64)
> static void qm_write128(void __iomem *addr, const void *buffer)
> {
>         unsigned long tmp0 = 0, tmp1 = 0;
>
>         asm volatile("ldp %0, %1, %3\n"
>                      "stp %0, %1, %2\n"
>                      "dmb oshst\n"
>                      : "=&r" (tmp0),
>                      "=&r" (tmp1),
>                      "+Q" (*((char __iomem *)addr))

This constraint describes the first byte of addr, which is sloppy but
probably works fine.

>                      : "Q" (*((char *)buffer))

This constraint describes the first byte of buffer, which might cause
problems because the asm reads the entire buffer not just the first
byte.

>                      : "memory");
> }
>
> static void qm_read128(void *buffer, const void __iomem *addr)
> {
>         unsigned long tmp0 = 0, tmp1 = 0;
>
>         asm volatile("ldp %0, %1, %3\n"
>                      "stp %0, %1, %2\n"
>                      "dmb oshst\n"

Is this the right barrier for a read?

>                      : "=&r" (tmp0),
>                        "=&r" (tmp1),
>                        "+Q" (*((char *)buffer))
>                      : "Q" (*((char __iomem *)addr))
>                      : "memory");
> }
>
> #else
> static void qm_write128(void __iomem *addr, const void *buffer)
> {
>
> }
>
> static void qm_read128(void *buffer, const void __iomem *addr)
> {
>
> }
> #endif
>

Have you tried using __uint128_t accesses instead?

E.g., something like

static void qm_write128(void __iomem *addr, const void *buffer)
{
    volatile __uint128_t *dst = (volatile __uint128_t __force *)addr;
    const __uint128_t *src __aligned(1) = buffer;

    WRITE_ONCE(*dst, *src);
    dma_wmb();
}

should produce the right instruction sequence, and works on all
architectures that have CONFIG_ARCH_SUPPORTS_INT128=y

This needs a big fat comment describing that the MMIO access needs to
be atomic, but we could consider it as a fallback if we decide not to
bother with special MMIO accessors, although I suspect we'll be
needing them more widely at some point anyway.
Arnd Bergmann Aug. 21, 2023, 8:47 p.m. UTC | #9
On Mon, Aug 21, 2023, at 08:45, Weili Qian wrote:
> On 2023/8/21 18:26, Will Deacon wrote:
>> On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote:

> Thanks for the review.
>
> Since the HiSilicon accelerator devices are supported only on the ARM64 
> platform,
> the following 128bit read and write interfaces are added to the driver, 
> is this OK?
>
> #if defined(CONFIG_ARM64)
> static void qm_write128(void __iomem *addr, const void *buffer)
> {

The naming makes it specific to the driver, which is not
appropriate for a global definition. Just follow the
generic naming. I guess you wouldn't have to do both
the readl/writel and the ioread32/iowrite32 variants, so
just start with the ioread/iowrite version. That also
avoids having to come up with a new letter.

You have the arguments in the wrong order compared to iowrite32(),
which is very confusing.

Instead of the pointer to the buffer, I'd suggest passing the
value by value here, to make it behave like the other ones.

This does mean it won't build on targets that don't
define u128, but I think we can handle that in a Kconfig
symbol.

> unsigned long tmp0 = 0, tmp1 = 0;

Don't initialize local variable to zero, that is generally
a bad idea because it hides cases where they are not
initialized properly.

> 	asm volatile("ldp %0, %1, %3\n"
> 		     "stp %0, %1, %2\n"

This is missing the endian-swap for big-endian kernels.

> 		     "dmb oshst\n"

You have the barrier at the wrong end of the helper, it
needs to before the actual store to have the intended
effect.

Also, you should really use the generic __io_bw() helper
rather than open-coding it.

> 		     : "=&r" (tmp0),
> 		     "=&r" (tmp1),

The tmp0/tmp1 registers are technically a clobber, not
an in/out, though ideally these should be turned
into inputs.

> 		     "+Q" (*((char __iomem *)addr))
> 		     : "Q" (*((char *)buffer))

wrong length

> 		     : "memory");
> }

The memory clobber is usually part of the barrier.

> static void qm_read128(void *buffer, const void __iomem *addr)
> {
> 	unsigned long tmp0 = 0, tmp1 = 0;
>
> 	asm volatile("ldp %0, %1, %3\n"
> 		     "stp %0, %1, %2\n"
> 		     "dmb oshst\n"
> 		     : "=&r" (tmp0),
> 		       "=&r" (tmp1),
> 		       "+Q" (*((char *)buffer))
> 		     : "Q" (*((char __iomem *)addr))
> 		     : "memory");
> }

Same thing, but you are missing the control dependency
from __io_ar() here, rather than just open-coding it.

> #else
> static void qm_write128(void __iomem *addr, const void *buffer)
> {
>
> }

This is missing the entire implementation?

      Arnd
Weili Qian Aug. 25, 2023, 3:07 a.m. UTC | #10
On 2023/8/21 22:36, Ard Biesheuvel wrote:
> On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote:
>>
>>
>>
>> On 2023/8/21 18:26, Will Deacon wrote:
>>> On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote:
>>>> On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote:
>>>>>
>>>>> No, that otx2_write128() routine looks buggy, actually, The ! at the
>>>>> end means writeback, and so the register holding addr will be
>>>>> modified, which is not reflect in the asm constraints. It also lacks a
>>>>> barrier.
>>>>
>>>> OK.  But at least having a helper called write128 looks a lot
>>>> cleaner than just having unexplained assembly in the code.
>>>
>>> I guess we want something similar to how writeq() is handled on 32-bit
>>> architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h.
>>>
>>> It's then CPU-dependent on whether you get atomicity.
>>>
>>> Will
>>> .
>>>
>>
>> Thanks for the review.
>>
>> Since the HiSilicon accelerator devices are supported only on the ARM64 platform,
>> the following 128bit read and write interfaces are added to the driver, is this OK?
> 
> No, this does not belong in the driver. As Will is suggesting, we
> should define some generic helpers for this, and provide an arm64
> implementation if the generic one does not result in the correct
> codegen.
> 
Sorry, I misunderstood here.

>>
>> #if defined(CONFIG_ARM64)
>> static void qm_write128(void __iomem *addr, const void *buffer)
>> {
>>         unsigned long tmp0 = 0, tmp1 = 0;
>>
>>         asm volatile("ldp %0, %1, %3\n"
>>                      "stp %0, %1, %2\n"
>>                      "dmb oshst\n"
>>                      : "=&r" (tmp0),
>>                      "=&r" (tmp1),
>>                      "+Q" (*((char __iomem *)addr))
> 
> This constraint describes the first byte of addr, which is sloppy but
> probably works fine.
> 
>>                      : "Q" (*((char *)buffer))
> 
> This constraint describes the first byte of buffer, which might cause
> problems because the asm reads the entire buffer not just the first
> byte.
I don't understand why constraint describes the first byte of buffer,and the compilation result seems to be ok.

 1811     1afc:       a9400a61        ldp     x1, x2, [x19]
 1812     1b00:       a9000801        stp     x1, x2, [x0]
 1813     1b04:       d50332bf        dmb     oshst
Maybe I'm wrong about 'Q', could you explain it or where can I learn more about this?

> 
>>                      : "memory");
>> }
>>
>> static void qm_read128(void *buffer, const void __iomem *addr)
>> {
>>         unsigned long tmp0 = 0, tmp1 = 0;
>>
>>         asm volatile("ldp %0, %1, %3\n"
>>                      "stp %0, %1, %2\n"
>>                      "dmb oshst\n"
> 
> Is this the right barrier for a read?
Should be "dmb oshld\n".
> 
>>                      : "=&r" (tmp0),
>>                        "=&r" (tmp1),
>>                        "+Q" (*((char *)buffer))
>>                      : "Q" (*((char __iomem *)addr))
>>                      : "memory");
>> }
>>
>> #else
>> static void qm_write128(void __iomem *addr, const void *buffer)
>> {
>>
>> }
>>
>> static void qm_read128(void *buffer, const void __iomem *addr)
>> {
>>
>> }
>> #endif
>>
> 
> Have you tried using __uint128_t accesses instead?
> 
> E.g., something like
> 
> static void qm_write128(void __iomem *addr, const void *buffer)
> {
>     volatile __uint128_t *dst = (volatile __uint128_t __force *)addr;
>     const __uint128_t *src __aligned(1) = buffer;
> 
>     WRITE_ONCE(*dst, *src);
>     dma_wmb();
> }
> 
> should produce the right instruction sequence, and works on all
> architectures that have CONFIG_ARCH_SUPPORTS_INT128=y
> 

I tried this, but WRITE_ONCE does not support type __uint128_t.
->WRITE_ONCE
 ->compiletime_assert_rwonce_type
  ->compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),	\
		"Unsupported access size for {READ,WRITE}_ONCE().")

So can we define generic IO helpers based on patchset
https://lore.kernel.org/all/20180124090519.6680-4-ynorov@caviumnetworks.com/
Part of the implementation is as follows:

add writeo() in include/asm-generic/io.h

#ifdef CONFIG_ARCH_SUPPORTS_INT128
#ifndef writeo
#define writeo writeo
static inline void writeo(__uint128_t value, volatile void __iomem *addr)
{
	__io_bw();
	__raw_writeo((__uint128_t __force)__cpu_to_le128(value), addr); //__cpu_to_le128 will implement.
	__io_aw();
}
#endif
#endif /* CONFIG_ARCH_SUPPORTS_INT128 */


#ifdef CONFIG_ARCH_SUPPORTS_INT128
#ifndef iowrite128
#define iowrite128 iowrite128
static inline void iowrite128(__uint128_t value, volatile void __iomem *addr)
{
	writeo(value, addr);
}
#endif
#endif /* CONFIG_ARCH_SUPPORTS_INT128  */

in arch/arm64/include/asm/io.h

#ifdef CONFIG_ARCH_SUPPORTS_INT128
#define __raw_writeo __raw_writeo
static __always_inline void  __raw_writeo(__uint128_t val, volatile void __iomem *addr)
{
	u64 hi, lo;

	lo = (u64)val;
	hi = (u64)(val >> 64);

	asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(lo), "rZ"(hi), "r"(addr));
}
#endif /* CONFIG_ARCH_SUPPORTS_INT128 */

And add io{read|write}128bits in include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h.
static inline void lo_hi_writeo(__uint128_t val, volatile void __iomem *addr)
{
	writeq(val, addr);
	writeq(val >> 64, addr);
}
#ifndef writeo
#define writeo lo_hi_writeo
#endif

static inline void hi_lo_writeo(__uint128_t val, volatile void __iomem *addr)
{
	writeq(val >> 64, addr);
	writeq(val, addr);
}
#ifndef writeo
#define writeo hi_lo_writeo
#endif

If this is ok, I'm going to implement reado and writeo, then submit the patchset.

Thanks,
Weili

> This needs a big fat comment describing that the MMIO access needs to
> be atomic, but we could consider it as a fallback if we decide not to
> bother with special MMIO accessors, although I suspect we'll be
> needing them more widely at some point anyway.
> .
>
Weili Qian Aug. 25, 2023, 3:12 a.m. UTC | #11
On 2023/8/22 4:47, Arnd Bergmann wrote:
> On Mon, Aug 21, 2023, at 08:45, Weili Qian wrote:
>> On 2023/8/21 18:26, Will Deacon wrote:
>>> On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote:
> 
>> Thanks for the review.
>>
>> Since the HiSilicon accelerator devices are supported only on the ARM64 
>> platform,
>> the following 128bit read and write interfaces are added to the driver, 
>> is this OK?
>>
>> #if defined(CONFIG_ARM64)
>> static void qm_write128(void __iomem *addr, const void *buffer)
>> {
> 
> The naming makes it specific to the driver, which is not
> appropriate for a global definition. Just follow the
> generic naming. I guess you wouldn't have to do both
> the readl/writel and the ioread32/iowrite32 variants, so
> just start with the ioread/iowrite version. That also
> avoids having to come up with a new letter.
> 
> You have the arguments in the wrong order compared to iowrite32(),
> which is very confusing.
> 
> Instead of the pointer to the buffer, I'd suggest passing the
> value by value here, to make it behave like the other ones.
> 
> This does mean it won't build on targets that don't
> define u128, but I think we can handle that in a Kconfig
> symbol.
Ok, I will add generic IO helpers ioread128/iowrite128 for this,
keep it consistent with ioread32/iowrite32, submit patchset later.
And remove them from the driver.

> 
>> unsigned long tmp0 = 0, tmp1 = 0;
> 
> Don't initialize local variable to zero, that is generally
> a bad idea because it hides cases where they are not
> initialized properly.
> 
>> 	asm volatile("ldp %0, %1, %3\n"
>> 		     "stp %0, %1, %2\n"
> 
> This is missing the endian-swap for big-endian kernels.
The input parameter data has been endian-swap.
> 
>> 		     "dmb oshst\n"
> 
> You have the barrier at the wrong end of the helper, it
> needs to before the actual store to have the intended
> effect.
> 
> Also, you should really use the generic __io_bw() helper
> rather than open-coding it.
OK.
> 
>> 		     : "=&r" (tmp0),
>> 		     "=&r" (tmp1),
> 
> The tmp0/tmp1 registers are technically a clobber, not
> an in/out, though ideally these should be turned
> into inputs.
> 
>> 		     "+Q" (*((char __iomem *)addr))
>> 		     : "Q" (*((char *)buffer))
> 
> wrong length
> 
>> 		     : "memory");
>> }
> 
> The memory clobber is usually part of the barrier.
Yeah, the memory can be removed.
> 
>> static void qm_read128(void *buffer, const void __iomem *addr)
>> {
>> 	unsigned long tmp0 = 0, tmp1 = 0;
>>
>> 	asm volatile("ldp %0, %1, %3\n"
>> 		     "stp %0, %1, %2\n"
>> 		     "dmb oshst\n"
>> 		     : "=&r" (tmp0),
>> 		       "=&r" (tmp1),
>> 		       "+Q" (*((char *)buffer))
>> 		     : "Q" (*((char __iomem *)addr))
>> 		     : "memory");
>> }
> 
> Same thing, but you are missing the control dependency
> from __io_ar() here, rather than just open-coding it.
> 
>> #else
>> static void qm_write128(void __iomem *addr, const void *buffer)
>> {
>>
>> }
> 
> This is missing the entire implementation?
If the interface is implemented in the driver, the driver runs only on the ARM64 platform.
Therefore, there is no need to implement.

> 
>       Arnd
> .
> 

Thanks,
Weili
Arnd Bergmann Aug. 29, 2023, 7:37 p.m. UTC | #12
On Thu, Aug 24, 2023, at 23:07, Weili Qian wrote:
> On 2023/8/21 22:36, Ard Biesheuvel wrote:
>> On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote:

>>>                      : "Q" (*((char *)buffer))
>> 
>> This constraint describes the first byte of buffer, which might cause
>> problems because the asm reads the entire buffer not just the first
>> byte.
> I don't understand why constraint describes the first byte of 
> buffer,and the compilation result seems to be ok.
>
>  1811     1afc:       a9400a61        ldp     x1, x2, [x19]
>  1812     1b00:       a9000801        stp     x1, x2, [x0]
>  1813     1b04:       d50332bf        dmb     oshst
> Maybe I'm wrong about 'Q', could you explain it or where can I learn 
> more about this?

The "Q" is not the problem here, it's the cast to (char *), which
tells the compiler that only the first byte is used here, and
allows it to not actually store the rest of the buffer into
memory.

It's not a problem on the __iomem pointer side, since gcc never
accesses those directly, and for the version taking a __u128 literal
or two __u64 registers it is also ok.

>>>         unsigned long tmp0 = 0, tmp1 = 0;
>>>
>>>         asm volatile("ldp %0, %1, %3\n"
>>>                      "stp %0, %1, %2\n"
>>>                      "dmb oshst\n"
>> 
>> Is this the right barrier for a read?
> Should be "dmb oshld\n".

As I said, this needs to be __io_ar(), which might be
defined in a different way.

>> 
>> Have you tried using __uint128_t accesses instead?
>> 
>> E.g., something like
>> 
>> static void qm_write128(void __iomem *addr, const void *buffer)
>> {
>>     volatile __uint128_t *dst = (volatile __uint128_t __force *)addr;
>>     const __uint128_t *src __aligned(1) = buffer;
>> 
>>     WRITE_ONCE(*dst, *src);
>>     dma_wmb();
>> }
>> 
>> should produce the right instruction sequence, and works on all
>> architectures that have CONFIG_ARCH_SUPPORTS_INT128=y
>> 
>
> I tried this, but WRITE_ONCE does not support type __uint128_t.
> ->WRITE_ONCE
>  ->compiletime_assert_rwonce_type
>   ->compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),	\
> 		"Unsupported access size for {READ,WRITE}_ONCE().")

On top of that, WRITE_ONCE() does not guarantee atomicity, and
dma_wmb() might not be the correct barrier.

> So can we define generic IO helpers based on patchset
> https://lore.kernel.org/all/20180124090519.6680-4-ynorov@caviumnetworks.com/
> Part of the implementation is as follows:
>
> add writeo() in include/asm-generic/io.h
>
> #ifdef CONFIG_ARCH_SUPPORTS_INT128
> #ifndef writeo
> #define writeo writeo
> static inline void writeo(__uint128_t value, volatile void __iomem 
> *addr)
> {
> 	__io_bw();
> 	__raw_writeo((__uint128_t __force)__cpu_to_le128(value), addr); 
> //__cpu_to_le128 will implement.
> 	__io_aw();
> }
> #endif
> #endif /* CONFIG_ARCH_SUPPORTS_INT128 */

Right, this is fairly close to what we need. The 'o' notation
might be slightly controversial, which is why I suggested
definining only iowrite128() to avoid having to agree on
the correct letter for 16-byte stores.

> in arch/arm64/include/asm/io.h
>
> #ifdef CONFIG_ARCH_SUPPORTS_INT128
> #define __raw_writeo __raw_writeo
> static __always_inline void  __raw_writeo(__uint128_t val, volatile 
> void __iomem *addr)
> {
> 	u64 hi, lo;
>
> 	lo = (u64)val;
> 	hi = (u64)(val >> 64);
>
> 	asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(lo), "rZ"(hi), "r"(addr));
> }
> #endif /* CONFIG_ARCH_SUPPORTS_INT128 */

This definition looks fine.

> And add io{read|write}128bits in include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h.
> static inline void lo_hi_writeo(__uint128_t val, volatile void __iomem *addr)
> {
> 	writeq(val, addr);
> 	writeq(val >> 64, addr);
> }

This also looks fine. 

      Arnd
Weili Qian Aug. 31, 2023, 12:05 p.m. UTC | #13
On 2023/8/30 3:37, Arnd Bergmann wrote:
> On Thu, Aug 24, 2023, at 23:07, Weili Qian wrote:
>> On 2023/8/21 22:36, Ard Biesheuvel wrote:
>>> On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote:
> 
>>>>                      : "Q" (*((char *)buffer))
>>>
>>> This constraint describes the first byte of buffer, which might cause
>>> problems because the asm reads the entire buffer not just the first
>>> byte.
>> I don't understand why constraint describes the first byte of 
>> buffer,and the compilation result seems to be ok.
>>
>>  1811     1afc:       a9400a61        ldp     x1, x2, [x19]
>>  1812     1b00:       a9000801        stp     x1, x2, [x0]
>>  1813     1b04:       d50332bf        dmb     oshst
>> Maybe I'm wrong about 'Q', could you explain it or where can I learn 
>> more about this?
> 
> The "Q" is not the problem here, it's the cast to (char *), which
> tells the compiler that only the first byte is used here, and
> allows it to not actually store the rest of the buffer into
> memory.
> 
> It's not a problem on the __iomem pointer side, since gcc never
> accesses those directly, and for the version taking a __u128 literal
> or two __u64 registers it is also ok.

Thanks for your reply, I have got it.

> 
>>>>         unsigned long tmp0 = 0, tmp1 = 0;
>>>>
>>>>         asm volatile("ldp %0, %1, %3\n"
>>>>                      "stp %0, %1, %2\n"
>>>>                      "dmb oshst\n"
>>>
>>> Is this the right barrier for a read?
>> Should be "dmb oshld\n".
> 
> As I said, this needs to be __io_ar(), which might be
> defined in a different way.
> 
>>>
>>> Have you tried using __uint128_t accesses instead?
>>>
>>> E.g., something like
>>>
>>> static void qm_write128(void __iomem *addr, const void *buffer)
>>> {
>>>     volatile __uint128_t *dst = (volatile __uint128_t __force *)addr;
>>>     const __uint128_t *src __aligned(1) = buffer;
>>>
>>>     WRITE_ONCE(*dst, *src);
>>>     dma_wmb();
>>> }
>>>
>>> should produce the right instruction sequence, and works on all
>>> architectures that have CONFIG_ARCH_SUPPORTS_INT128=y
>>>
>>
>> I tried this, but WRITE_ONCE does not support type __uint128_t.
>> ->WRITE_ONCE
>>  ->compiletime_assert_rwonce_type
>>   ->compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),	\
>> 		"Unsupported access size for {READ,WRITE}_ONCE().")
> 
> On top of that, WRITE_ONCE() does not guarantee atomicity, and
> dma_wmb() might not be the correct barrier.
> 
>> So can we define generic IO helpers based on patchset
>> https://lore.kernel.org/all/20180124090519.6680-4-ynorov@caviumnetworks.com/
>> Part of the implementation is as follows:
>>
>> add writeo() in include/asm-generic/io.h
>>
>> #ifdef CONFIG_ARCH_SUPPORTS_INT128
>> #ifndef writeo
>> #define writeo writeo
>> static inline void writeo(__uint128_t value, volatile void __iomem 
>> *addr)
>> {
>> 	__io_bw();
>> 	__raw_writeo((__uint128_t __force)__cpu_to_le128(value), addr); 
>> //__cpu_to_le128 will implement.
>> 	__io_aw();
>> }
>> #endif
>> #endif /* CONFIG_ARCH_SUPPORTS_INT128 */
> 
> Right, this is fairly close to what we need. The 'o' notation
> might be slightly controversial, which is why I suggested
> definining only iowrite128() to avoid having to agree on
> the correct letter for 16-byte stores.

Okay, I'll just define iowrite128() and ioread128().

Thanks,
Weili

> 
>> in arch/arm64/include/asm/io.h
>>
>> #ifdef CONFIG_ARCH_SUPPORTS_INT128
>> #define __raw_writeo __raw_writeo
>> static __always_inline void  __raw_writeo(__uint128_t val, volatile 
>> void __iomem *addr)
>> {
>> 	u64 hi, lo;
>>
>> 	lo = (u64)val;
>> 	hi = (u64)(val >> 64);
>>
>> 	asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(lo), "rZ"(hi), "r"(addr));
>> }
>> #endif /* CONFIG_ARCH_SUPPORTS_INT128 */
> 
> This definition looks fine.
> 
>> And add io{read|write}128bits in include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h.
>> static inline void lo_hi_writeo(__uint128_t val, volatile void __iomem *addr)
>> {
>> 	writeq(val, addr);
>> 	writeq(val >> 64, addr);
>> }
> 
> This also looks fine. 
> 
>       Arnd
> .
>
diff mbox series

Patch

diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
old mode 100644
new mode 100755
index a99fd589445c..13cd2617170e
--- a/drivers/crypto/hisilicon/qm.c
+++ b/drivers/crypto/hisilicon/qm.c
@@ -33,6 +33,10 @@ 
 #define QM_MB_CMD_DATA_SHIFT		32
 #define QM_MB_CMD_DATA_MASK		GENMASK(31, 0)
 #define QM_MB_STATUS_MASK		GENMASK(12, 9)
+#define QM_MB_BUSY_MASK			BIT(13)
+#define QM_MB_SIZE			16
+#define QM_MB_MAX_WAIT_CNT		6000
+#define QM_MB_WAIT_READY_CNT		10
 
 /* sqc shift */
 #define QM_SQ_HOP_NUM_SHIFT		0
@@ -597,17 +601,6 @@  static void qm_mb_pre_init(struct qm_mailbox *mailbox, u8 cmd,
 	mailbox->rsvd = 0;
 }
 
-/* return 0 mailbox ready, -ETIMEDOUT hardware timeout */
-int hisi_qm_wait_mb_ready(struct hisi_qm *qm)
-{
-	u32 val;
-
-	return readl_relaxed_poll_timeout(qm->io_base + QM_MB_CMD_SEND_BASE,
-					  val, !((val >> QM_MB_BUSY_SHIFT) &
-					  0x1), POLL_PERIOD, POLL_TIMEOUT);
-}
-EXPORT_SYMBOL_GPL(hisi_qm_wait_mb_ready);
-
 /* 128 bit should be written to hardware at one time to trigger a mailbox */
 static void qm_mb_write(struct hisi_qm *qm, const void *src)
 {
@@ -618,7 +611,7 @@  static void qm_mb_write(struct hisi_qm *qm, const void *src)
 #endif
 
 	if (!IS_ENABLED(CONFIG_ARM64)) {
-		memcpy_toio(fun_base, src, 16);
+		memcpy_toio(fun_base, src, QM_MB_SIZE);
 		dma_wmb();
 		return;
 	}
@@ -635,35 +628,95 @@  static void qm_mb_write(struct hisi_qm *qm, const void *src)
 #endif
 }
 
-static int qm_mb_nolock(struct hisi_qm *qm, struct qm_mailbox *mailbox)
+/* 128 bit should be read from hardware at one time */
+static void qm_mb_read(struct hisi_qm *qm, void *dst)
 {
-	int ret;
-	u32 val;
+	const void __iomem *fun_base = qm->io_base + QM_MB_CMD_SEND_BASE;
+
+#if IS_ENABLED(CONFIG_ARM64)
+	unsigned long tmp0 = 0, tmp1 = 0;
+#endif
 
-	if (unlikely(hisi_qm_wait_mb_ready(qm))) {
-		dev_err(&qm->pdev->dev, "QM mailbox is busy to start!\n");
-		ret = -EBUSY;
-		goto mb_busy;
+	if (!IS_ENABLED(CONFIG_ARM64)) {
+		memcpy_fromio(dst, fun_base, QM_MB_SIZE);
+		dma_wmb();
+		return;
 	}
 
-	qm_mb_write(qm, mailbox);
+#if IS_ENABLED(CONFIG_ARM64)
+	asm volatile("ldp %0, %1, %3\n"
+		     "stp %0, %1, %2\n"
+		     "dmb oshst\n"
+		     : "=&r" (tmp0),
+		       "=&r" (tmp1),
+		       "+Q" (*((char *)dst))
+		     : "Q" (*((char __iomem *)fun_base))
+		     : "memory");
+#endif
+}
+
+int hisi_qm_wait_mb_ready(struct hisi_qm *qm)
+{
+	struct qm_mailbox mailbox;
+	int i = 0;
+
+	while (i++ < QM_MB_WAIT_READY_CNT) {
+		qm_mb_read(qm, &mailbox);
+		if (!(le16_to_cpu(mailbox.w0) & QM_MB_BUSY_MASK))
+			return 0;
 
-	if (unlikely(hisi_qm_wait_mb_ready(qm))) {
-		dev_err(&qm->pdev->dev, "QM mailbox operation timeout!\n");
-		ret = -ETIMEDOUT;
-		goto mb_busy;
+		usleep_range(WAIT_PERIOD_US_MIN, WAIT_PERIOD_US_MAX);
 	}
 
-	val = readl(qm->io_base + QM_MB_CMD_SEND_BASE);
-	if (val & QM_MB_STATUS_MASK) {
-		dev_err(&qm->pdev->dev, "QM mailbox operation failed!\n");
-		ret = -EIO;
-		goto mb_busy;
+	dev_err(&qm->pdev->dev, "QM mailbox is busy to start!\n");
+
+	return -EBUSY;
+}
+EXPORT_SYMBOL_GPL(hisi_qm_wait_mb_ready);
+
+static int qm_wait_mb_finish(struct hisi_qm *qm, struct qm_mailbox *mailbox)
+{
+	struct device *dev = &qm->pdev->dev;
+	int i = 0;
+
+	while (++i) {
+		qm_mb_read(qm, mailbox);
+		if (!(le16_to_cpu(mailbox->w0) & QM_MB_BUSY_MASK))
+			break;
+
+		if (i == QM_MB_MAX_WAIT_CNT) {
+			dev_err(dev, "QM mailbox operation timeout!\n");
+			return -ETIMEDOUT;
+		}
+
+		usleep_range(WAIT_PERIOD_US_MIN, WAIT_PERIOD_US_MAX);
+	}
+
+	if (le16_to_cpu(mailbox->w0) & QM_MB_STATUS_MASK) {
+		dev_err(dev, "QM mailbox operation failed!\n");
+		return -EIO;
 	}
 
 	return 0;
+}
+
+static int qm_mb_nolock(struct hisi_qm *qm, struct qm_mailbox *mailbox)
+{
+	int ret;
 
-mb_busy:
+	ret = hisi_qm_wait_mb_ready(qm);
+	if (ret)
+		goto mb_err_cnt_increase;
+
+	qm_mb_write(qm, mailbox);
+
+	ret = qm_wait_mb_finish(qm, mailbox);
+	if (ret)
+		goto mb_err_cnt_increase;
+
+	return 0;
+
+mb_err_cnt_increase:
 	atomic64_inc(&qm->debug.dfx.mb_err_cnt);
 	return ret;
 }
@@ -687,6 +740,24 @@  int hisi_qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue,
 }
 EXPORT_SYMBOL_GPL(hisi_qm_mb);
 
+static int hisi_qm_mb_read(struct hisi_qm *qm, u64 *base, u8 cmd, u16 queue)
+{
+	struct qm_mailbox mailbox;
+	int ret;
+
+	qm_mb_pre_init(&mailbox, cmd, 0, queue, 1);
+	mutex_lock(&qm->mailbox_lock);
+	ret = qm_mb_nolock(qm, &mailbox);
+	mutex_unlock(&qm->mailbox_lock);
+	if (ret)
+		return ret;
+
+	*base = le32_to_cpu(mailbox.base_l) |
+		((u64)le32_to_cpu(mailbox.base_h) << 32);
+
+	return 0;
+}
+
 static void qm_db_v1(struct hisi_qm *qm, u16 qn, u8 cmd, u16 index, u8 priority)
 {
 	u64 doorbell;
@@ -1308,12 +1379,10 @@  static int qm_get_vft_v2(struct hisi_qm *qm, u32 *base, u32 *number)
 	u64 sqc_vft;
 	int ret;
 
-	ret = hisi_qm_mb(qm, QM_MB_CMD_SQC_VFT_V2, 0, 0, 1);
+	ret = hisi_qm_mb_read(qm, &sqc_vft, QM_MB_CMD_SQC_VFT_V2, 0);
 	if (ret)
 		return ret;
 
-	sqc_vft = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
-		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) << 32);
 	*base = QM_SQC_VFT_BASE_MASK_V2 & (sqc_vft >> QM_SQC_VFT_BASE_SHIFT_V2);
 	*number = (QM_SQC_VFT_NUM_MASK_V2 &
 		   (sqc_vft >> QM_SQC_VFT_NUM_SHIFT_V2)) + 1;
@@ -1484,25 +1553,6 @@  static enum acc_err_result qm_hw_error_handle_v2(struct hisi_qm *qm)
 	return ACC_ERR_RECOVERED;
 }
 
-static int qm_get_mb_cmd(struct hisi_qm *qm, u64 *msg, u16 fun_num)
-{
-	struct qm_mailbox mailbox;
-	int ret;
-
-	qm_mb_pre_init(&mailbox, QM_MB_CMD_DST, 0, fun_num, 0);
-	mutex_lock(&qm->mailbox_lock);
-	ret = qm_mb_nolock(qm, &mailbox);
-	if (ret)
-		goto err_unlock;
-
-	*msg = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) |
-		  ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) << 32);
-
-err_unlock:
-	mutex_unlock(&qm->mailbox_lock);
-	return ret;
-}
-
 static void qm_clear_cmd_interrupt(struct hisi_qm *qm, u64 vf_mask)
 {
 	u32 val;
@@ -1522,7 +1572,7 @@  static void qm_handle_vf_msg(struct hisi_qm *qm, u32 vf_id)
 	u64 msg;
 	int ret;
 
-	ret = qm_get_mb_cmd(qm, &msg, vf_id);
+	ret = hisi_qm_mb_read(qm, &msg, QM_MB_CMD_DST, vf_id);
 	if (ret) {
 		dev_err(dev, "failed to get msg from VF(%u)!\n", vf_id);
 		return;
@@ -4755,7 +4805,7 @@  static int qm_wait_pf_reset_finish(struct hisi_qm *qm)
 	 * Whether message is got successfully,
 	 * VF needs to ack PF by clearing the interrupt.
 	 */
-	ret = qm_get_mb_cmd(qm, &msg, 0);
+	ret = hisi_qm_mb_read(qm, &msg, QM_MB_CMD_DST, 0);
 	qm_clear_cmd_interrupt(qm, 0);
 	if (ret) {
 		dev_err(dev, "failed to get msg from PF in reset done!\n");
@@ -4809,7 +4859,7 @@  static void qm_handle_cmd_msg(struct hisi_qm *qm, u32 fun_num)
 	 * Get the msg from source by sending mailbox. Whether message is got
 	 * successfully, destination needs to ack source by clearing the interrupt.
 	 */
-	ret = qm_get_mb_cmd(qm, &msg, fun_num);
+	ret = hisi_qm_mb_read(qm, &msg, QM_MB_CMD_DST, fun_num);
 	qm_clear_cmd_interrupt(qm, BIT(fun_num));
 	if (ret) {
 		dev_err(dev, "failed to get msg from source!\n");
diff --git a/include/linux/hisi_acc_qm.h b/include/linux/hisi_acc_qm.h
index 39fbfb4be944..0f83c19a8f36 100644
--- a/include/linux/hisi_acc_qm.h
+++ b/include/linux/hisi_acc_qm.h
@@ -52,7 +52,6 @@ 
 #define QM_MB_OP_SHIFT			14
 #define QM_MB_CMD_DATA_ADDR_L		0x304
 #define QM_MB_CMD_DATA_ADDR_H		0x308
-#define QM_MB_MAX_WAIT_CNT		6000
 
 /* doorbell */
 #define QM_DOORBELL_CMD_SQ              0