diff mbox series

[v1,1/1] ARM: Select DMA_DIRECT_REMAP to fix restricted DMA

Message ID 20230926175208.9298-2-james.quinlan@broadcom.com (mailing list archive)
State New, archived
Headers show
Series [v1,1/1] ARM: Select DMA_DIRECT_REMAP to fix restricted DMA | expand

Commit Message

Jim Quinlan Sept. 26, 2023, 5:52 p.m. UTC
Without this commit, the use of dma_alloc_coherent() while
using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
For example, the common Wifi 7260 chip (iwlwifi) works fine
on arm64 with restricted memory but not on arm, unless this
commit is applied.

Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
---
 arch/arm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

Comments

kernel test robot Sept. 27, 2023, 7:13 a.m. UTC | #1
Hi Jim,

kernel test robot noticed the following build errors:

[auto build test ERROR on arm/for-next]
[also build test ERROR on linus/master hch-configfs/for-next arm/fixes v6.6-rc3 next-20230927]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jim-Quinlan/ARM-Select-DMA_DIRECT_REMAP-to-fix-restricted-DMA/20230927-025212
base:   git://git.armlinux.org.uk/~rmk/linux-arm.git for-next
patch link:    https://lore.kernel.org/r/20230926175208.9298-2-james.quinlan%40broadcom.com
patch subject: [PATCH v1 1/1] ARM: Select DMA_DIRECT_REMAP to fix restricted DMA
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20230927/202309271425.sxoPXWOX-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230927/202309271425.sxoPXWOX-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309271425.sxoPXWOX-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/dma/pool.c: In function 'atomic_pool_expand':
>> kernel/dma/pool.c:105:44: error: implicit declaration of function 'pgprot_dmacoherent' [-Werror=implicit-function-declaration]
     105 |                                            pgprot_dmacoherent(PAGE_KERNEL),
         |                                            ^~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/pgprot_dmacoherent +105 kernel/dma/pool.c

d7e673ec2c8e0ea Nicolas Saenz Julienne 2020-08-14   78  
54adadf9b08571f David Rientjes         2020-04-20   79  static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
54adadf9b08571f David Rientjes         2020-04-20   80  			      gfp_t gfp)
e860c299ac0d738 David Rientjes         2020-04-14   81  {
54adadf9b08571f David Rientjes         2020-04-20   82  	unsigned int order;
892fc9f6835ecf0 Dan Carpenter          2020-08-26   83  	struct page *page = NULL;
e860c299ac0d738 David Rientjes         2020-04-14   84  	void *addr;
54adadf9b08571f David Rientjes         2020-04-20   85  	int ret = -ENOMEM;
54adadf9b08571f David Rientjes         2020-04-20   86  
23baf831a32c04f Kirill A. Shutemov     2023-03-15   87  	/* Cannot allocate larger than MAX_ORDER */
23baf831a32c04f Kirill A. Shutemov     2023-03-15   88  	order = min(get_order(pool_size), MAX_ORDER);
54adadf9b08571f David Rientjes         2020-04-20   89  
54adadf9b08571f David Rientjes         2020-04-20   90  	do {
54adadf9b08571f David Rientjes         2020-04-20   91  		pool_size = 1 << (PAGE_SHIFT + order);
d7e673ec2c8e0ea Nicolas Saenz Julienne 2020-08-14   92  		if (cma_in_zone(gfp))
d7e673ec2c8e0ea Nicolas Saenz Julienne 2020-08-14   93  			page = dma_alloc_from_contiguous(NULL, 1 << order,
d7e673ec2c8e0ea Nicolas Saenz Julienne 2020-08-14   94  							 order, false);
d7e673ec2c8e0ea Nicolas Saenz Julienne 2020-08-14   95  		if (!page)
c84dc6e68a1d246 David Rientjes         2020-04-14   96  			page = alloc_pages(gfp, order);
54adadf9b08571f David Rientjes         2020-04-20   97  	} while (!page && order-- > 0);
e860c299ac0d738 David Rientjes         2020-04-14   98  	if (!page)
e860c299ac0d738 David Rientjes         2020-04-14   99  		goto out;
e860c299ac0d738 David Rientjes         2020-04-14  100  
c84dc6e68a1d246 David Rientjes         2020-04-14  101  	arch_dma_prep_coherent(page, pool_size);
e860c299ac0d738 David Rientjes         2020-04-14  102  
76a19940bd62a81 David Rientjes         2020-04-14  103  #ifdef CONFIG_DMA_DIRECT_REMAP
c84dc6e68a1d246 David Rientjes         2020-04-14  104  	addr = dma_common_contiguous_remap(page, pool_size,
e860c299ac0d738 David Rientjes         2020-04-14 @105  					   pgprot_dmacoherent(PAGE_KERNEL),
e860c299ac0d738 David Rientjes         2020-04-14  106  					   __builtin_return_address(0));
e860c299ac0d738 David Rientjes         2020-04-14  107  	if (!addr)
54adadf9b08571f David Rientjes         2020-04-20  108  		goto free_page;
76a19940bd62a81 David Rientjes         2020-04-14  109  #else
76a19940bd62a81 David Rientjes         2020-04-14  110  	addr = page_to_virt(page);
76a19940bd62a81 David Rientjes         2020-04-14  111  #endif
76a19940bd62a81 David Rientjes         2020-04-14  112  	/*
76a19940bd62a81 David Rientjes         2020-04-14  113  	 * Memory in the atomic DMA pools must be unencrypted, the pools do not
2f5388a29be82a6 Christoph Hellwig      2020-08-17  114  	 * shrink so no re-encryption occurs in dma_direct_free().
76a19940bd62a81 David Rientjes         2020-04-14  115  	 */
76a19940bd62a81 David Rientjes         2020-04-14  116  	ret = set_memory_decrypted((unsigned long)page_to_virt(page),
76a19940bd62a81 David Rientjes         2020-04-14  117  				   1 << order);
76a19940bd62a81 David Rientjes         2020-04-14  118  	if (ret)
76a19940bd62a81 David Rientjes         2020-04-14  119  		goto remove_mapping;
54adadf9b08571f David Rientjes         2020-04-20  120  	ret = gen_pool_add_virt(pool, (unsigned long)addr, page_to_phys(page),
54adadf9b08571f David Rientjes         2020-04-20  121  				pool_size, NUMA_NO_NODE);
e860c299ac0d738 David Rientjes         2020-04-14  122  	if (ret)
76a19940bd62a81 David Rientjes         2020-04-14  123  		goto encrypt_mapping;
e860c299ac0d738 David Rientjes         2020-04-14  124  
2edc5bb3c5cc421 David Rientjes         2020-04-14  125  	dma_atomic_pool_size_add(gfp, pool_size);
e860c299ac0d738 David Rientjes         2020-04-14  126  	return 0;
e860c299ac0d738 David Rientjes         2020-04-14  127  
76a19940bd62a81 David Rientjes         2020-04-14  128  encrypt_mapping:
76a19940bd62a81 David Rientjes         2020-04-14  129  	ret = set_memory_encrypted((unsigned long)page_to_virt(page),
76a19940bd62a81 David Rientjes         2020-04-14  130  				   1 << order);
76a19940bd62a81 David Rientjes         2020-04-14  131  	if (WARN_ON_ONCE(ret)) {
76a19940bd62a81 David Rientjes         2020-04-14  132  		/* Decrypt succeeded but encrypt failed, purposely leak */
76a19940bd62a81 David Rientjes         2020-04-14  133  		goto out;
76a19940bd62a81 David Rientjes         2020-04-14  134  	}
e860c299ac0d738 David Rientjes         2020-04-14  135  remove_mapping:
76a19940bd62a81 David Rientjes         2020-04-14  136  #ifdef CONFIG_DMA_DIRECT_REMAP
c84dc6e68a1d246 David Rientjes         2020-04-14  137  	dma_common_free_remap(addr, pool_size);
76a19940bd62a81 David Rientjes         2020-04-14  138  #endif
76a19940bd62a81 David Rientjes         2020-04-14  139  free_page: __maybe_unused
c84dc6e68a1d246 David Rientjes         2020-04-14  140  	__free_pages(page, order);
e860c299ac0d738 David Rientjes         2020-04-14  141  out:
54adadf9b08571f David Rientjes         2020-04-20  142  	return ret;
54adadf9b08571f David Rientjes         2020-04-20  143  }
54adadf9b08571f David Rientjes         2020-04-20  144
Linus Walleij Sept. 27, 2023, 11:10 p.m. UTC | #2
Hi Jim,

thanks for your patch!

On Tue, Sep 26, 2023 at 7:52 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:

> Without this commit, the use of dma_alloc_coherent() while
> using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
> For example, the common Wifi 7260 chip (iwlwifi) works fine
> on arm64 with restricted memory but not on arm, unless this
> commit is applied.
>
> Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>

(...)
> +       select DMA_DIRECT_REMAP

Christoph invented that symbol so he can certainly
explain what is missing to use this on ARM.

This looks weird to me, because:
> git grep atomic_pool_init
arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)

Now you have two atomic DMA pools in the kernel,
and a lot more than that is duplicated. I'm amazed that it
compiles at all.

Clearly if you want to do this, surely the ARM-specific
arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
needs to be removed at the same time?

However I don't think it's that simple, because Christoph would surely
had done this a long time ago if it was that simple.

Yours,
Linus Walleij
Jim Quinlan Sept. 28, 2023, 12:07 p.m. UTC | #3
On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>
> Hi Jim,
>
> thanks for your patch!
>
> On Tue, Sep 26, 2023 at 7:52 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:
>
> > Without this commit, the use of dma_alloc_coherent() while
> > using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
> > For example, the common Wifi 7260 chip (iwlwifi) works fine
> > on arm64 with restricted memory but not on arm, unless this
> > commit is applied.
> >
> > Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
>
> (...)
> > +       select DMA_DIRECT_REMAP
>
> Christoph invented that symbol so he can certainly
> explain what is missing to use this on ARM.
>
> This looks weird to me, because:
> > git grep atomic_pool_init
> arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
> kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)
>
> Now you have two atomic DMA pools in the kernel,
> and a lot more than that is duplicated. I'm amazed that it
> compiles at all.
>
> Clearly if you want to do this, surely the ARM-specific
> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> needs to be removed at the same time?
>
> However I don't think it's that simple, because Christoph would surely
> had done this a long time ago if it was that simple.

Hello Linus,

Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
I debugged it enough to see that the host driver's
writes to the dma_alloc_coherent() region  were not appearing in
memory, and that
led me to DMA_DIRECT_REMAP.

BTW, I tested "restricted-dma" on the master-tip the other day and it
failed for both arm64 and arm.
Please take this with a large grain of salt  as this was a quick test
and  I won't  have time to
confirm and bisect until  next week at the earliest.

Regards,
Jim Quinlan
Broadcom STB/CM


>
> Yours,
> Linus Walleij
Jim Quinlan Sept. 28, 2023, 1:09 p.m. UTC | #4
On Thu, Sep 28, 2023 at 8:07 AM Jim Quinlan <james.quinlan@broadcom.com> wrote:
>
> On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> >
> > Hi Jim,
> >
> > thanks for your patch!
> >
> > On Tue, Sep 26, 2023 at 7:52 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:
> >
> > > Without this commit, the use of dma_alloc_coherent() while
> > > using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
> > > For example, the common Wifi 7260 chip (iwlwifi) works fine
> > > on arm64 with restricted memory but not on arm, unless this
> > > commit is applied.
> > >
> > > Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
> >
> > (...)
> > > +       select DMA_DIRECT_REMAP
> >
> > Christoph invented that symbol so he can certainly
> > explain what is missing to use this on ARM.
> >
> > This looks weird to me, because:
> > > git grep atomic_pool_init
> > arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
> > kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)
> >
> > Now you have two atomic DMA pools in the kernel,
> > and a lot more than that is duplicated. I'm amazed that it
> > compiles at all.

Ah, I did not communicate this well at all.  The patch compiles on our
ARM brcmstb_defconfig
for 5.15, 6.1, and upstream.  The kernel test-bot tells me it doesn't
compile on whatever
config it is using (looks like a missing header file).

My patch does not work on upstream; I only supplied it to show what
"fixes" 6.1 and 5.15.
For upstream on ARM, restricted-memory does not work w/ or w/o the patch.
For upstream on ARM64, restricted memory does not seem to be working either.

Regards,
Jim Quinlan
Broadcom STB/CM
> >
> > Clearly if you want to do this, surely the ARM-specific
> > arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> > needs to be removed at the same time?
> >
> > However I don't think it's that simple, because Christoph would surely
> > had done this a long time ago if it was that simple.
>
> Hello Linus,
>
> Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
> I debugged it enough to see that the host driver's
> writes to the dma_alloc_coherent() region  were not appearing in
> memory, and that
> led me to DMA_DIRECT_REMAP.
>
> BTW, I tested "restricted-dma" on the master-tip the other day and it
> failed for both arm64 and arm.
> Please take this with a large grain of salt  as this was a quick test
> and  I won't  have time to
> confirm and bisect until  next week at the earliest.
>
> Regards,
> Jim Quinlan
> Broadcom STB/CM
>
>
> >
> > Yours,
> > Linus Walleij
Arnd Bergmann Sept. 28, 2023, 1:32 p.m. UTC | #5
On Thu, Sep 28, 2023, at 08:07, Jim Quinlan wrote:
> On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>>
>> Clearly if you want to do this, surely the ARM-specific
>> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
>> needs to be removed at the same time?
>
>
> Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
> I debugged it enough to see that the host driver's
> writes to the dma_alloc_coherent() region  were not appearing in
> memory, and that
> led me to DMA_DIRECT_REMAP.

Usually when you see a mismatch between the data observed by the
device and the CPU, the problem is an incorrect "dma-coherent"
property in the DT: either the device is coherent and accesses
the cache but the CPU tries to bypass it because the property
is missing, or there is an extraneous property and the CPU
goes the through the cache but the devices bypasses it.

It could also be a driver bug if the device  mixes up the
address spaces, e.g. passing virt_to_phys(pointer) rather
than the DMA address returned by dma_alloc_coherent().

    Arnd
Jim Quinlan Sept. 28, 2023, 2 p.m. UTC | #6
On Thu, Sep 28, 2023 at 9:32 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Sep 28, 2023, at 08:07, Jim Quinlan wrote:
> > On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> >>
> >> Clearly if you want to do this, surely the ARM-specific
> >> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> >> needs to be removed at the same time?
> >
> >
> > Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
> > I debugged it enough to see that the host driver's
> > writes to the dma_alloc_coherent() region  were not appearing in
> > memory, and that
> > led me to DMA_DIRECT_REMAP.
>
> Usually when you see a mismatch between the data observed by the
> device and the CPU, the problem is an incorrect "dma-coherent"
> property in the DT: either the device is coherent and accesses
> the cache but the CPU tries to bypass it because the property
> is missing, or there is an extraneous property and the CPU
> goes the through the cache but the devices bypasses it.

I just searched, there are no "dt-coherent" properties in our device tree.
Also, even if we did have them, wouldn't things also fail when not using
restricted DMA?

>
> It could also be a driver bug if the device  mixes up the
> address spaces, e.g. passing virt_to_phys(pointer) rather
> than the DMA address returned by dma_alloc_coherent().

This is an Intel 7260 part using the iwlwifi driver, I doubt it has
errors of that kind.

Regards,
Jim Quinlan
Broadcom STB/CM
>
>     Arnd
Jim Quinlan Sept. 28, 2023, 2:01 p.m. UTC | #7
On Thu, Sep 28, 2023 at 10:00 AM Jim Quinlan <james.quinlan@broadcom.com> wrote:
>
> On Thu, Sep 28, 2023 at 9:32 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Thu, Sep 28, 2023, at 08:07, Jim Quinlan wrote:
> > > On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> > >>
> > >> Clearly if you want to do this, surely the ARM-specific
> > >> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> > >> needs to be removed at the same time?
> > >
> > >
> > > Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
> > > I debugged it enough to see that the host driver's
> > > writes to the dma_alloc_coherent() region  were not appearing in
> > > memory, and that
> > > led me to DMA_DIRECT_REMAP.
> >
> > Usually when you see a mismatch between the data observed by the
> > device and the CPU, the problem is an incorrect "dma-coherent"
> > property in the DT: either the device is coherent and accesses
> > the cache but the CPU tries to bypass it because the property
> > is missing, or there is an extraneous property and the CPU
> > goes the through the cache but the devices bypasses it.
>
> I just searched, there are no "dt-coherent" properties in our device tree.
> Also, even if we did have them, wouldn't things also fail when not using
> restricted DMA?

s/dt-coherent/dma-coherent/

>
> >
> > It could also be a driver bug if the device  mixes up the
> > address spaces, e.g. passing virt_to_phys(pointer) rather
> > than the DMA address returned by dma_alloc_coherent().
>
> This is an Intel 7260 part using the iwlwifi driver, I doubt it has
> errors of that kind.
>
> Regards,
> Jim Quinlan
> Broadcom STB/CM
> >
> >     Arnd
Arnd Bergmann Sept. 28, 2023, 3:16 p.m. UTC | #8
On Thu, Sep 28, 2023, at 10:00, Jim Quinlan wrote:
> On Thu, Sep 28, 2023 at 9:32 AM Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Thu, Sep 28, 2023, at 08:07, Jim Quinlan wrote:
>> > On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>> >>
>> >> Clearly if you want to do this, surely the ARM-specific
>> >> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
>> >> needs to be removed at the same time?
>> >
>> >
>> > Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
>> > I debugged it enough to see that the host driver's
>> > writes to the dma_alloc_coherent() region  were not appearing in
>> > memory, and that
>> > led me to DMA_DIRECT_REMAP.
>>
>> Usually when you see a mismatch between the data observed by the
>> device and the CPU, the problem is an incorrect "dma-coherent"
>> property in the DT: either the device is coherent and accesses
>> the cache but the CPU tries to bypass it because the property
>> is missing, or there is an extraneous property and the CPU
>> goes the through the cache but the devices bypasses it.
>
> I just searched, there are no "dt-coherent" properties in our device tree.
> Also, even if we did have them, wouldn't things also fail when not using
> restricted DMA?

Correct, it should be independent of restricted DMA, but it might
work by chance that way even if it's still wrong. If your DT
is marked as non-coherent (note: the property to look for
is "dma-coherent", not "dt-coherent"), can you check the
datasheet of the SoC to if that is actually correct?

If the chip is designed to support high-speed devices on
PCIe, it's likely that the PCIe root complex is either coherent
with the caches, or can (and should) be configured that way
for performance reasons.

>> It could also be a driver bug if the device  mixes up the
>> address spaces, e.g. passing virt_to_phys(pointer) rather
>> than the DMA address returned by dma_alloc_coherent().
>
> This is an Intel 7260 part using the iwlwifi driver, I doubt it has
> errors of that kind.

It's unlikely but not impossible, as the driver has some
unusual constructs, using a lot of coherent mappings that
might otherwise be streaming mappings, and relying on
dma_sync_single_for_device(..., DMA_BIDIRECTIONAL) for other
data, but without the corresponding dma_sync_single_for_cpu().
If all the testing happens on x86, this might easily lead
to a bug that only shows up on non-coherent systems but
is never seen during testing.

If the problem is not the "dma-coherent" property, can you
double-check if using a different PCIe device works, or narrow
down which specific buffer you saw get corrupted?

     Arnd
Robin Murphy Sept. 28, 2023, 3:33 p.m. UTC | #9
On 28/09/2023 4:16 pm, Arnd Bergmann wrote:
> On Thu, Sep 28, 2023, at 10:00, Jim Quinlan wrote:
>> On Thu, Sep 28, 2023 at 9:32 AM Arnd Bergmann <arnd@arndb.de> wrote:
>>>
>>> On Thu, Sep 28, 2023, at 08:07, Jim Quinlan wrote:
>>>> On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>>>>>
>>>>> Clearly if you want to do this, surely the ARM-specific
>>>>> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
>>>>> needs to be removed at the same time?
>>>>
>>>>
>>>> Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
>>>> I debugged it enough to see that the host driver's
>>>> writes to the dma_alloc_coherent() region  were not appearing in
>>>> memory, and that
>>>> led me to DMA_DIRECT_REMAP.
>>>
>>> Usually when you see a mismatch between the data observed by the
>>> device and the CPU, the problem is an incorrect "dma-coherent"
>>> property in the DT: either the device is coherent and accesses
>>> the cache but the CPU tries to bypass it because the property
>>> is missing, or there is an extraneous property and the CPU
>>> goes the through the cache but the devices bypasses it.
>>
>> I just searched, there are no "dt-coherent" properties in our device tree.
>> Also, even if we did have them, wouldn't things also fail when not using
>> restricted DMA?
> 
> Correct, it should be independent of restricted DMA, but it might
> work by chance that way even if it's still wrong. If your DT
> is marked as non-coherent (note: the property to look for
> is "dma-coherent", not "dt-coherent"), can you check the
> datasheet of the SoC to if that is actually correct?
> 
> If the chip is designed to support high-speed devices on
> PCIe, it's likely that the PCIe root complex is either coherent
> with the caches, or can (and should) be configured that way
> for performance reasons.
> 
>>> It could also be a driver bug if the device  mixes up the
>>> address spaces, e.g. passing virt_to_phys(pointer) rather
>>> than the DMA address returned by dma_alloc_coherent().
>>
>> This is an Intel 7260 part using the iwlwifi driver, I doubt it has
>> errors of that kind.
> 
> It's unlikely but not impossible, as the driver has some
> unusual constructs, using a lot of coherent mappings that
> might otherwise be streaming mappings, and relying on
> dma_sync_single_for_device(..., DMA_BIDIRECTIONAL) for other
> data, but without the corresponding dma_sync_single_for_cpu().
> If all the testing happens on x86, this might easily lead
> to a bug that only shows up on non-coherent systems but
> is never seen during testing.

Probably the significant thing about restricted DMA is that it forces 
all streaming DMA to be bounce-buffered. That should expose busted 
synchronisation even more decisively than a lack of coherency. If 
there's no IOMMU, then testing the driver in the absence of restricted 
DMA but with "swiotlb=force" should confirm or disprove that.

Robin.

> If the problem is not the "dma-coherent" property, can you
> double-check if using a different PCIe device works, or narrow
> down which specific buffer you saw get corrupted?
> 
>       Arnd
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Robin Murphy Sept. 28, 2023, 3:47 p.m. UTC | #10
On 28/09/2023 1:07 pm, Jim Quinlan wrote:
> On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>>
>> Hi Jim,
>>
>> thanks for your patch!
>>
>> On Tue, Sep 26, 2023 at 7:52 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:
>>
>>> Without this commit, the use of dma_alloc_coherent() while
>>> using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
>>> For example, the common Wifi 7260 chip (iwlwifi) works fine
>>> on arm64 with restricted memory but not on arm, unless this
>>> commit is applied.
>>>
>>> Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
>>
>> (...)
>>> +       select DMA_DIRECT_REMAP
>>
>> Christoph invented that symbol so he can certainly
>> explain what is missing to use this on ARM.
>>
>> This looks weird to me, because:
>>> git grep atomic_pool_init
>> arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
>> kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)
>>
>> Now you have two atomic DMA pools in the kernel,
>> and a lot more than that is duplicated. I'm amazed that it
>> compiles at all.
>>
>> Clearly if you want to do this, surely the ARM-specific
>> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
>> needs to be removed at the same time?
>>
>> However I don't think it's that simple, because Christoph would surely
>> had done this a long time ago if it was that simple.
> 
> Hello Linus,
> 
> Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
> I debugged it enough to see that the host driver's
> writes to the dma_alloc_coherent() region  were not appearing in
> memory, and that
> led me to DMA_DIRECT_REMAP.

Oh, another thing - the restricted-dma-pool is really only for streaming 
DMA - IIRC there can be cases where the emergency fallback of trying to 
allocate out of the bounce buffer won't work properly. Are you also 
using an additional shared-dma-pool carveout to satisfy the coherent 
allocations, per the DT binding?

Thanks,
Robin.
Arnd Bergmann Sept. 28, 2023, 4:20 p.m. UTC | #11
On Thu, Sep 28, 2023, at 11:33, Robin Murphy wrote:
> On 28/09/2023 4:16 pm, Arnd Bergmann wrote:
>
>> It's unlikely but not impossible, as the driver has some
>> unusual constructs, using a lot of coherent mappings that
>> might otherwise be streaming mappings, and relying on
>> dma_sync_single_for_device(..., DMA_BIDIRECTIONAL) for other
>> data, but without the corresponding dma_sync_single_for_cpu().
>> If all the testing happens on x86, this might easily lead
>> to a bug that only shows up on non-coherent systems but
>> is never seen during testing.
>
> Probably the significant thing about restricted DMA is that it forces 
> all streaming DMA to be bounce-buffered. That should expose busted 
> synchronisation even more decisively than a lack of coherency. If 
> there's no IOMMU, then testing the driver in the absence of restricted 
> DMA but with "swiotlb=force" should confirm or disprove that.

I see this sequence in the iwlwifi driver, in the
iwl_save_fw_paging() function:

       block = alloc_pages(GFP_KERNEL, order);
       phys = dma_map_page(dev, block, 0,
               PAGE_SIZE << order, DMA_BIDIRECTIONAL);
       memcpy(page_address(block), ...);
       dma_sync_single_for_device(dev, phys, size, DMA_BIDIRECTIONAL);

Which clearly violates the interface by writing into
a page that is already owned by the device, without
giving it back to the cpu first. Not sure if or how this
would explain actual data corruption on armv7, since we
write back the buffers in both the map and sync operations
and never invalidate the cache, but the driver also doesn't
ever read from the buffer (despite it being bidirectional).
If it's not this problem, there is a good chance of others.

     Arnd
Christophe Leroy Sept. 28, 2023, 4:24 p.m. UTC | #12
Hi,

For the form,

Le 26/09/2023 à 19:52, Jim Quinlan a écrit :
> Without this commit, the use of dma_alloc_coherent() while

Instead, say "Without selecting DMA_DIRECT_REMAP, the use of ...."

> using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
> For example, the common Wifi 7260 chip (iwlwifi) works fine
> on arm64 with restricted memory but not on arm, unless this
> commit is applied.

Say " .... unless DMA_DIRECT_REMAP is selected"

> 
> Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
> ---
>   arch/arm/Kconfig | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 9557808e8937..b6f1cea923cf 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -34,6 +34,7 @@ config ARM
>   	select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7
>   	select ARCH_SUPPORTS_ATOMIC_RMW
>   	select ARCH_SUPPORTS_HUGETLBFS if ARM_LPAE
> +	select DMA_DIRECT_REMAP

On powerpc we try to keep those in alphabetical order. Don't you do the 
same on ARM ?

>   	select ARCH_USE_BUILTIN_BSWAP
>   	select ARCH_USE_CMPXCHG_LOCKREF
>   	select ARCH_USE_MEMTEST
Jim Quinlan Sept. 29, 2023, 7:24 p.m. UTC | #13
On Thu, Sep 28, 2023 at 11:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Sep 28, 2023, at 10:00, Jim Quinlan wrote:
> > On Thu, Sep 28, 2023 at 9:32 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >>
> >> On Thu, Sep 28, 2023, at 08:07, Jim Quinlan wrote:
> >> > On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> >> >>
> >> >> Clearly if you want to do this, surely the ARM-specific
> >> >> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> >> >> needs to be removed at the same time?
> >> >
> >> >
> >> > Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
> >> > I debugged it enough to see that the host driver's
> >> > writes to the dma_alloc_coherent() region  were not appearing in
> >> > memory, and that
> >> > led me to DMA_DIRECT_REMAP.
> >>
> >> Usually when you see a mismatch between the data observed by the
> >> device and the CPU, the problem is an incorrect "dma-coherent"
> >> property in the DT: either the device is coherent and accesses
> >> the cache but the CPU tries to bypass it because the property
> >> is missing, or there is an extraneous property and the CPU
> >> goes the through the cache but the devices bypasses it.
> >
> > I just searched, there are no "dt-coherent" properties in our device tree.
> > Also, even if we did have them, wouldn't things also fail when not using
> > restricted DMA?
>
> Correct, it should be independent of restricted DMA, but it might
> work by chance that way even if it's still wrong. If your DT
> is marked as non-coherent (note: the property to look for
> is "dma-coherent", not "dt-coherent"), can you check the
> datasheet of the SoC to if that is actually correct?

Our PCIe RC device is not dma-coherent.

>
> If the chip is designed to support high-speed devices on
> PCIe, it's likely that the PCIe root complex is either coherent
> with the caches, or can (and should) be configured that way
> for performance reasons.

Our RC is definitely not coherent with the ARM/ARM64 caches.

>
> >> It could also be a driver bug if the device  mixes up the
> >> address spaces, e.g. passing virt_to_phys(pointer) rather
> >> than the DMA address returned by dma_alloc_coherent().
> >
> > This is an Intel 7260 part using the iwlwifi driver, I doubt it has
> > errors of that kind.
>
> It's unlikely but not impossible, as the driver has some
> unusual constructs, using a lot of coherent mappings that
> might otherwise be streaming mappings, and relying on
> dma_sync_single_for_device(..., DMA_BIDIRECTIONAL) for other
> data, but without the corresponding dma_sync_single_for_cpu().
> If all the testing happens on x86, this might easily lead
> to a bug that only shows up on non-coherent systems but
> is never seen during testing.
>
> If the problem is not the "dma-coherent" property, can you
> double-check if using a different PCIe device works, or narrow
> down which specific buffer you saw get corrupted?

I've done some testing, below are the results.  The new two devices, a
USB controller
and an M2 NVMe stick, behave the same as iwlwifi.

Note that I'm not advocating that "select DMA_DIRECT_REMAP" is the
anser, I'm just
showing that it fixes my examples.

Regards,
Jim Quinlan
Broadcom STB/CM


VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
                      ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
5.15     iwlwifi        P       P        P       F             P
5.15     nvme           P       P        P       F             P
5.15     usb            P       P        P       F             P

6.1      iwlwifi        P       P        P       F             P
6.1      nvme           P       P        P       F             P
6.1      usb            P        P       P       F             P

Upstrm   iwlwifi        P       P        F       F             F
Upstrm   nvme           P       P        F       F             F
Upstrm   usb            P       P        F       F             F
                      ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
VER      PCI-DEV                       <--------- RESTRICTED DMA --------->

LEGEND:
  P       := pass, driver probe and some functional test performed
  F       := fail, usually when probe is called; impossible to do
functional test
  Upstrm  := 633b47cb009d "Merge tag 'scsi-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi"

  iwlwifi := 7260 Wifi 8086:08b1
  nvme    := 1e95:1007
  usb     := Supahub, 1912:0014



>
>      Arnd
Arnd Bergmann Sept. 29, 2023, 7:52 p.m. UTC | #14
On Fri, Sep 29, 2023, at 15:24, Jim Quinlan wrote:
> On Thu, Sep 28, 2023 at 11:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
>> On Thu, Sep 28, 2023, at 10:00, Jim Quinlan wrote:
>
> Our RC is definitely not coherent with the ARM/ARM64 caches.

Ok, thanks for the confirmation.

>> It's unlikely but not impossible, as the driver has some
>> unusual constructs, using a lot of coherent mappings that
>> might otherwise be streaming mappings, and relying on
>> dma_sync_single_for_device(..., DMA_BIDIRECTIONAL) for other
>> data, but without the corresponding dma_sync_single_for_cpu().
>> If all the testing happens on x86, this might easily lead
>> to a bug that only shows up on non-coherent systems but
>> is never seen during testing.
>>
>> If the problem is not the "dma-coherent" property, can you
>> double-check if using a different PCIe device works, or narrow
>> down which specific buffer you saw get corrupted?
>
> I've done some testing, below are the results.  The new two devices, a
> USB controller
> and an M2 NVMe stick, behave the same as iwlwifi.
>
> Note that I'm not advocating that "select DMA_DIRECT_REMAP" is the
> anser, I'm just showing that it fixes my examples.

Ok, so I think we can stop looking at the device drivers for
bugs then.

> VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
>                       ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
> 5.15     iwlwifi        P       P        P       F             P
> 5.15     nvme           P       P        P       F             P
> 5.15     usb            P       P        P       F             P
>
> 6.1      iwlwifi        P       P        P       F             P
> 6.1      nvme           P       P        P       F             P
> 6.1      usb            P        P       P       F             P
>
> Upstrm   iwlwifi        P       P        F       F             F
> Upstrm   nvme           P       P        F       F             F
> Upstrm   usb            P       P        F       F             F
>                       ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
> VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
>
> LEGEND:
>   P       := pass, driver probe and some functional test performed
>   F       := fail, usually when probe is called; impossible to do
> functional test
>   Upstrm  := 633b47cb009d "Merge tag 'scsi-fixes' of
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi"
>
>   iwlwifi := 7260 Wifi 8086:08b1
>   nvme    := 1e95:1007
>   usb     := Supahub, 1912:0014

Thanks for the thorough testing, that looks very useful, even though
I don't have an answer immediately. Maybe Robin can see something
here that I don't.

It's particularly interesting how arm64 only started failing
on fairly recent kernels, so if nothing else helps you could
always try bisecting the history between 6.1 and 633b47cb009d,
hoping that the commit that broke it points us to the arm32
problem.

The only change I see in that time frame that seems related
is 7bd6680b47fa ("Revert "Revert "arm64: dma: Drop cache
invalidation from arch_dma_prep_coherent()"""), so you could
start by reverting that. However, it's probably something
else since this is for the coherent mappings, not the
streaming ones.

       Arnd
Jim Quinlan Sept. 29, 2023, 9:13 p.m. UTC | #15
On Fri, Sep 29, 2023 at 3:52 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Sep 29, 2023, at 15:24, Jim Quinlan wrote:
> > On Thu, Sep 28, 2023 at 11:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >> On Thu, Sep 28, 2023, at 10:00, Jim Quinlan wrote:
> >
> > Our RC is definitely not coherent with the ARM/ARM64 caches.
>
> Ok, thanks for the confirmation.
>
> >> It's unlikely but not impossible, as the driver has some
> >> unusual constructs, using a lot of coherent mappings that
> >> might otherwise be streaming mappings, and relying on
> >> dma_sync_single_for_device(..., DMA_BIDIRECTIONAL) for other
> >> data, but without the corresponding dma_sync_single_for_cpu().
> >> If all the testing happens on x86, this might easily lead
> >> to a bug that only shows up on non-coherent systems but
> >> is never seen during testing.
> >>
> >> If the problem is not the "dma-coherent" property, can you
> >> double-check if using a different PCIe device works, or narrow
> >> down which specific buffer you saw get corrupted?
> >
> > I've done some testing, below are the results.  The new two devices, a
> > USB controller
> > and an M2 NVMe stick, behave the same as iwlwifi.
> >
> > Note that I'm not advocating that "select DMA_DIRECT_REMAP" is the
> > anser, I'm just showing that it fixes my examples.
>
> Ok, so I think we can stop looking at the device drivers for
> bugs then.
>
> > VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
> >                       ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
> > 5.15     iwlwifi        P       P        P       F             P
> > 5.15     nvme           P       P        P       F             P
> > 5.15     usb            P       P        P       F             P
> >
> > 6.1      iwlwifi        P       P        P       F             P
> > 6.1      nvme           P       P        P       F             P
> > 6.1      usb            P        P       P       F             P
> >
> > Upstrm   iwlwifi        P       P        F       F             F
> > Upstrm   nvme           P       P        F       F             F
> > Upstrm   usb            P       P        F       F             F
> >                       ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
> > VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
> >
> > LEGEND:
> >   P       := pass, driver probe and some functional test performed
> >   F       := fail, usually when probe is called; impossible to do
> > functional test
> >   Upstrm  := 633b47cb009d "Merge tag 'scsi-fixes' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi"
> >
> >   iwlwifi := 7260 Wifi 8086:08b1
> >   nvme    := 1e95:1007
> >   usb     := Supahub, 1912:0014
>
> Thanks for the thorough testing, that looks very useful, even though
> I don't have an answer immediately. Maybe Robin can see something
> here that I don't.
>
> It's particularly interesting how arm64 only started failing
> on fairly recent kernels, so if nothing else helps you could
> always try bisecting the history between 6.1 and 633b47cb009d,
> hoping that the commit that broke it points us to the arm32
> problem.

Yes I plan to do that.

There's also one other datapoint I have but have only tried it on 6.1
and 5.15 ARM  builds.
I can set a board to run its PCIe  HW in EP mode, and then connect it
to a board running its
PCIe HW in RC mode.

On the RC side, I've written a small PCIe host driver [1].  On the EP
board, I just run this shell
command when I get the prompt:

while true ; do
    for i in `seq 0 15` ; do
        devmem $(printf "0x%x" $((0x50a000000 + ($i << 16))))
    done
    echo
    sleep 1
done

You will have to take my word that I've configured the PCIe window on
the EP board at 0x5_0a000000
to correspond to the restricted memory region on the RC side (0x4a000000).
With ARM+restricted DMA  I see something like this as the output [2].
The values should appear
all at once or at least in order -- as they do on ARM64 -- but they do
not for ARM.

FWIW,
Jim Quinlan
Broadcom STB/CM

[1]
#define NUM_DMA_REGIONS 16

for (i = 0; i < NUM_DMA_REGIONS; i++)
        va[i] = dma_alloc_coherent(dev, size, &dma_addr, flags);

for (i = 0; i < NUM_DMA_REGIONS; i++)
        memset(va[i], 0, size/sizeof(u32));

printk("\n================ START READING AT EP  ======================\n");
msleep(4000);
for (i = 0; i < NUM_DMA_REGIONS; i++) {
        pu32 = va[i];
        for (j = 0; j < size/sizeof(u32); j += 64) {
                uint32_t x = pu32[j];

                mdelay(3);
                pu32[j] = 0x12340000 + (i << 12) + j + (x & 0x0f00);
                wmb();
        }
}

[2]
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

0x00000000
0x00000000
0x12342000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

0x12340000
0x00000000
0x12342000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

0x12340000
0x00000000
0x12342000
0x12343000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

0x12340000
0x00000000
0x12342000
0x12343000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

0x12340000
0x12341000
0x12342000
0x12343000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

0x12340000
0x12341000
0x12342000
0x12343000
0x00000000
0x00000000
0x00000000
0x12347000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

0x12340000
0x12341000
0x12342000
0x12343000
0x00000000
0x00000000
0x00000000
0x12347000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000
0x00000000

 ^C
>
>
> The only change I see in that time frame that seems related
> is 7bd6680b47fa ("Revert "Revert "arm64: dma: Drop cache
> invalidation from arch_dma_prep_coherent()"""), so you could
> start by reverting that. However, it's probably something
> else since this is for the coherent mappings, not the
> streaming ones.
>
>        Arnd
Jim Quinlan Oct. 1, 2023, 12:48 p.m. UTC | #16
On Fri, Sep 29, 2023 at 3:52 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Sep 29, 2023, at 15:24, Jim Quinlan wrote:
> > On Thu, Sep 28, 2023 at 11:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >> On Thu, Sep 28, 2023, at 10:00, Jim Quinlan wrote:
> >
> > Our RC is definitely not coherent with the ARM/ARM64 caches.
>
> Ok, thanks for the confirmation.
>
> >> It's unlikely but not impossible, as the driver has some
> >> unusual constructs, using a lot of coherent mappings that
> >> might otherwise be streaming mappings, and relying on
> >> dma_sync_single_for_device(..., DMA_BIDIRECTIONAL) for other
> >> data, but without the corresponding dma_sync_single_for_cpu().
> >> If all the testing happens on x86, this might easily lead
> >> to a bug that only shows up on non-coherent systems but
> >> is never seen during testing.
> >>
> >> If the problem is not the "dma-coherent" property, can you
> >> double-check if using a different PCIe device works, or narrow
> >> down which specific buffer you saw get corrupted?
> >
> > I've done some testing, below are the results.  The new two devices, a
> > USB controller
> > and an M2 NVMe stick, behave the same as iwlwifi.
> >
> > Note that I'm not advocating that "select DMA_DIRECT_REMAP" is the
> > anser, I'm just showing that it fixes my examples.
>
> Ok, so I think we can stop looking at the device drivers for
> bugs then.
>
> > VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
> >                       ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
> > 5.15     iwlwifi        P       P        P       F             P
> > 5.15     nvme           P       P        P       F             P
> > 5.15     usb            P       P        P       F             P
> >
> > 6.1      iwlwifi        P       P        P       F             P
> > 6.1      nvme           P       P        P       F             P
> > 6.1      usb            P        P       P       F             P
> >
> > Upstrm   iwlwifi        P       P        F       F             F
> > Upstrm   nvme           P       P        F       F             F
> > Upstrm   usb            P       P        F       F             F
> >                       ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
> > VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
> >
> > LEGEND:
> >   P       := pass, driver probe and some functional test performed
> >   F       := fail, usually when probe is called; impossible to do
> > functional test
> >   Upstrm  := 633b47cb009d "Merge tag 'scsi-fixes' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi"
> >
> >   iwlwifi := 7260 Wifi 8086:08b1
> >   nvme    := 1e95:1007
> >   usb     := Supahub, 1912:0014
>
> Thanks for the thorough testing, that looks very useful, even though
> I don't have an answer immediately. Maybe Robin can see something
> here that I don't.
>
> It's particularly interesting how arm64 only started failing
> on fairly recent kernels, so if nothing else helps you could
> always try bisecting the history between 6.1 and 633b47cb009d,
> hoping that the commit that broke it points us to the arm32
> problem.
>
> The only change I see in that time frame that seems related
> is 7bd6680b47fa ("Revert "Revert "arm64: dma: Drop cache
> invalidation from arch_dma_prep_coherent()"""), so you could
> start by reverting that. However, it's probably something
> else since this is for the coherent mappings, not the
> streaming ones.

Hi Arnd,

I started to do the bisection and it was clear early on that something
else was afoot --
it turns out that our 6.1 and 5.15 builds included a NAK'd upstream
commit which was required
by our legacy DT tree.  Once I updated our DT tree, the upstream results matched
those of 6.1 and 5.15.  I've included the new results below.

Long story short, it appears that restricted DMA does not work with ARCH=arm
platforms using non-dma-coherent devices.  AFAICT, the
dma_alloc_coherent() function
should return a pointer to an uncacheable page -- the only way our system can
have coherent memory -- but it is not.

The "select DMA_DIRECT_REMAP" makes things work in 5.15, 6.1 and upstream,
although I cannot really say if that is a viable solution.

Regards,
Jim Quinlan
Broadcom STB/CM

VER      PCI-DEV                       <--------- RESTRICTED DMA --------->
                      ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
5.15     iwlwifi        P       P        P       F             P
5.15     nvme           P       P        P       F             P
5.15     usb            P       P        P       F             P

6.1      iwlwifi        P       P        P       F             P
6.1      nvme           P       P        P       F             P
6.1      usb            P       P        P       F             P

Upstrm   iwlwifi        P       P        P       F             P
Upstrm   nvme           P       P        P       F             P
Upstrm   usb            P       P        P       F             P
                      ARM64    ARM     ARM64    ARM    ARM+DMA_DIRECT_REMAP
VER      PCI-DEV                       <--------- RESTRICTED DMA --------->

LEGEND:
  P       := pass, driver probe and some functional test performed
  F       := fail, usually when probe is called; impossible to do
functional test
  Upstrm  := 633b47cb009d "Merge tag 'scsi-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi"

  iwlwifi := 7260 Wifi 8086:08b1
  nvme    := 1e95:1007
  usb     := Supahub, 1912:0014
>
>        Arnd
Christoph Hellwig Oct. 2, 2023, 6:16 a.m. UTC | #17
On Thu, Sep 28, 2023 at 01:10:27AM +0200, Linus Walleij wrote:
> (...)
> > +       select DMA_DIRECT_REMAP
> 
> Christoph invented that symbol so he can certainly
> explain what is missing to use this on ARM.
> 
> This looks weird to me, because:
> > git grep atomic_pool_init
> arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
> kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)
> 
> Now you have two atomic DMA pools in the kernel,
> and a lot more than that is duplicated. I'm amazed that it
> compiles at all.
> 
> Clearly if you want to do this, surely the ARM-specific
> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> needs to be removed at the same time?
> 
> However I don't think it's that simple, because Christoph would surely
> had done this a long time ago if it was that simple.

Yes, DMA_DIRECT_REMAP should only be used for platforms using the
generic generic remap that plus straight into dma-direct and
bypasses arch_dma_alloc.

ARM first needs support to directly set the uncached/wc bits on
the direct mapping for CMA, which should be fairly simple but require
wide spread testing.

I'd be happy to work with anyone who wants to look into this.
Jim Quinlan Oct. 2, 2023, 12:33 p.m. UTC | #18
On Thu, Sep 28, 2023 at 11:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 28/09/2023 1:07 pm, Jim Quinlan wrote:
> > On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> >>
> >> Hi Jim,
> >>
> >> thanks for your patch!
> >>
> >> On Tue, Sep 26, 2023 at 7:52 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:
> >>
> >>> Without this commit, the use of dma_alloc_coherent() while
> >>> using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
> >>> For example, the common Wifi 7260 chip (iwlwifi) works fine
> >>> on arm64 with restricted memory but not on arm, unless this
> >>> commit is applied.
> >>>
> >>> Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
> >>
> >> (...)
> >>> +       select DMA_DIRECT_REMAP
> >>
> >> Christoph invented that symbol so he can certainly
> >> explain what is missing to use this on ARM.
> >>
> >> This looks weird to me, because:
> >>> git grep atomic_pool_init
> >> arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
> >> kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)
> >>
> >> Now you have two atomic DMA pools in the kernel,
> >> and a lot more than that is duplicated. I'm amazed that it
> >> compiles at all.
> >>
> >> Clearly if you want to do this, surely the ARM-specific
> >> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> >> needs to be removed at the same time?
> >>
> >> However I don't think it's that simple, because Christoph would surely
> >> had done this a long time ago if it was that simple.
> >
> > Hello Linus,
> >
> > Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
> > I debugged it enough to see that the host driver's
> > writes to the dma_alloc_coherent() region  were not appearing in
> > memory, and that
> > led me to DMA_DIRECT_REMAP.
>
> Oh, another thing - the restricted-dma-pool is really only for streaming
> DMA - IIRC there can be cases where the emergency fallback of trying to
> allocate out of the bounce buffer won't work properly. Are you also
> using an additional shared-dma-pool carveout to satisfy the coherent
> allocations, per the DT binding?

Hello Robin,
Sorry for the delay.  We use "restricted DMA" as a poor person's IOMMU; we can
restrict the DMA memory of a device to a narrow region, and our memory
bus HW has
"checkers' to enforce said region for a specific memory client, e.g. PCIe.

We can confirm the assignment of restricted DMA in the bootlog when the device
is probed:

        iwlwifi 0001:01:00.0: assigned reserved memory node pcieSR1@4a000000
        iwlwifi 0001:01:00.0: enabling device (0000 -> 0002)

As far as your other question, why don't I  just post our relevant DT [1].

Regards,
Jim Quinlan
Broardcom STB/CM

[1]
memory {
        device_type = "memory";
        reg = <0x0 0x40000000 0x1 0x0>;
};

reserved-memory {
        #address-cells = <0x2>;
        #size-cells = <0x2>;
        ranges;
        /* ... */

        pcieSR1@4a000000 {
                linux,phandle = <0x2a>;
                phandle = <0x2a>;
                compatible = "restricted-dma-pool";
                reserved-names = "pcieSR1";
                reg = <0x0 0x4a000000 0x0 0x2400000>;
        };
};
pcie@8b20000 {
        /* ... */
        pci@0,0 {
                /* ... */
                pci-ep@0,0 {
                        memory-region = <0x2a>;
                        reg = <0x10000 0x0 0x0 0x0 0x0>;
                };
        };
};




>
> Thanks,
> Robin.
Robin Murphy Oct. 2, 2023, 3:08 p.m. UTC | #19
On 02/10/2023 1:33 pm, Jim Quinlan wrote:
> On Thu, Sep 28, 2023 at 11:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 28/09/2023 1:07 pm, Jim Quinlan wrote:
>>> On Wed, Sep 27, 2023 at 7:10 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>>>>
>>>> Hi Jim,
>>>>
>>>> thanks for your patch!
>>>>
>>>> On Tue, Sep 26, 2023 at 7:52 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:
>>>>
>>>>> Without this commit, the use of dma_alloc_coherent() while
>>>>> using CONFIG_DMA_RESTRICTED_POOL=y breaks devices from working.
>>>>> For example, the common Wifi 7260 chip (iwlwifi) works fine
>>>>> on arm64 with restricted memory but not on arm, unless this
>>>>> commit is applied.
>>>>>
>>>>> Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
>>>>
>>>> (...)
>>>>> +       select DMA_DIRECT_REMAP
>>>>
>>>> Christoph invented that symbol so he can certainly
>>>> explain what is missing to use this on ARM.
>>>>
>>>> This looks weird to me, because:
>>>>> git grep atomic_pool_init
>>>> arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
>>>> kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)
>>>>
>>>> Now you have two atomic DMA pools in the kernel,
>>>> and a lot more than that is duplicated. I'm amazed that it
>>>> compiles at all.
>>>>
>>>> Clearly if you want to do this, surely the ARM-specific
>>>> arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
>>>> needs to be removed at the same time?
>>>>
>>>> However I don't think it's that simple, because Christoph would surely
>>>> had done this a long time ago if it was that simple.
>>>
>>> Hello Linus,
>>>
>>> Yes, this is the reason I used "RFC" as the fix looked too easy to be viable :-)
>>> I debugged it enough to see that the host driver's
>>> writes to the dma_alloc_coherent() region  were not appearing in
>>> memory, and that
>>> led me to DMA_DIRECT_REMAP.
>>
>> Oh, another thing - the restricted-dma-pool is really only for streaming
>> DMA - IIRC there can be cases where the emergency fallback of trying to
>> allocate out of the bounce buffer won't work properly. Are you also
>> using an additional shared-dma-pool carveout to satisfy the coherent
>> allocations, per the DT binding?
> 
> Hello Robin,
> Sorry for the delay.  We use "restricted DMA" as a poor person's IOMMU; we can
> restrict the DMA memory of a device to a narrow region, and our memory
> bus HW has
> "checkers' to enforce said region for a specific memory client, e.g. PCIe.
> 
> We can confirm the assignment of restricted DMA in the bootlog when the device
> is probed:
> 
>          iwlwifi 0001:01:00.0: assigned reserved memory node pcieSR1@4a000000
>          iwlwifi 0001:01:00.0: enabling device (0000 -> 0002)
> 
> As far as your other question, why don't I  just post our relevant DT [1].

OK, I spent a while reminding myself of the restricted DMA code, and I'm 
at least 95% confident that I now recall the relevant details...

Restricted DMA has never actually been supported on ARM, or various 
other platforms which the config doesn't do a great job of reflecting. 
ARM does not fully use dma-direct, and its arch_dma_alloc() 
implementation doesn't understand how to call the swiotlb_alloc() 
fallback path. TBH I'm now rather puzzled that you get any semblance of 
working at all, since AFAICS what ARM's arch_dma_alloc() should end up 
doing is giving you a non-cacheable remap as expected, but of some 
regular kernel memory outside the restricted address range, which 
oughtn't to work at all if something is actually restricting device 
accesses.

Secondly, the case I was half-remembering above is that the 
aforementioned fallback path fundamentally *only* works for non-coherent 
devices where dma_alloc_coherent() calls are in non-blocking context. 
The only way to make atomic coherent allocations come from the 
restricted range is to set up part of it as a per-device coherent pool, 
via an additional reserved-memory region as mentioned.

Thanks,
Robin.

> 
> Regards,
> Jim Quinlan
> Broardcom STB/CM
> 
> [1]
> memory {
>          device_type = "memory";
>          reg = <0x0 0x40000000 0x1 0x0>;
> };
> 
> reserved-memory {
>          #address-cells = <0x2>;
>          #size-cells = <0x2>;
>          ranges;
>          /* ... */
> 
>          pcieSR1@4a000000 {
>                  linux,phandle = <0x2a>;
>                  phandle = <0x2a>;
>                  compatible = "restricted-dma-pool";
>                  reserved-names = "pcieSR1";
>                  reg = <0x0 0x4a000000 0x0 0x2400000>;
>          };
> };
> pcie@8b20000 {
>          /* ... */
>          pci@0,0 {
>                  /* ... */
>                  pci-ep@0,0 {
>                          memory-region = <0x2a>;
>                          reg = <0x10000 0x0 0x0 0x0 0x0>;
>                  };
>          };
> };
> 
> 
> 
> 
>>
>> Thanks,
>> Robin.
Jim Quinlan Oct. 5, 2023, 5:53 p.m. UTC | #20
On Mon, Oct 2, 2023 at 2:16 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Sep 28, 2023 at 01:10:27AM +0200, Linus Walleij wrote:
> > (...)
> > > +       select DMA_DIRECT_REMAP
> >
> > Christoph invented that symbol so he can certainly
> > explain what is missing to use this on ARM.
> >
> > This looks weird to me, because:
> > > git grep atomic_pool_init
> > arch/arm/mm/dma-mapping.c:static int __init atomic_pool_init(void)
> > kernel/dma/pool.c:static int __init dma_atomic_pool_init(void)
> >
> > Now you have two atomic DMA pools in the kernel,
> > and a lot more than that is duplicated. I'm amazed that it
> > compiles at all.
> >
> > Clearly if you want to do this, surely the ARM-specific
> > arch/arm/mm/dma-mapping.c and arch/arm/mm/dma-mapping-nommu.c
> > needs to be removed at the same time?
> >
> > However I don't think it's that simple, because Christoph would surely
> > had done this a long time ago if it was that simple.
>
> Yes, DMA_DIRECT_REMAP should only be used for platforms using the
> generic generic remap that plus straight into dma-direct and
> bypasses arch_dma_alloc.
>
> ARM first needs support to directly set the uncached/wc bits on
> the direct mapping for CMA, which should be fairly simple but require
> wide spread testing.
>
> I'd be happy to work with anyone who wants to look into this.
I'd like to look into this and help make it work for ARCH=arm but you
seem to be saying that you also need help from ARM the company?

Thanks,
Jim Quinlan
Broadcom STB/CM
Christoph Hellwig Oct. 6, 2023, 7:40 a.m. UTC | #21
On Thu, Oct 05, 2023 at 01:53:33PM -0400, Jim Quinlan wrote:
> > Yes, DMA_DIRECT_REMAP should only be used for platforms using the
> > generic generic remap that plus straight into dma-direct and
> > bypasses arch_dma_alloc.
> >
> > ARM first needs support to directly set the uncached/wc bits on
> > the direct mapping for CMA, which should be fairly simple but require
> > wide spread testing.
> >
> > I'd be happy to work with anyone who wants to look into this.
> I'd like to look into this and help make it work for ARCH=arm but you
> seem to be saying that you also need help from ARM the company?

No, I don't care about companies.  I just need someone (singular or
plural) to test a wide range of arm systems.

Here is my idea for the attack plan:

As step 1 ignore the whole CMA direct map issue, and just to the
trivial generic dma remap conversion.  This should involved:

 - select DMA_DIRECT_REMAP
 - provide arch_dma_prep_coherent to flush out all dirty data by
   calling __dma_clear_buffer
 - remove the existing arch_dma_alloc/arch_dma_free and all their
   infrastructure

With this things should work fine on any system not using CMA

Then attack the CMA direct mapping:

 - modify the core DMA mapping code so that the
   ARCH_HAS_DMA_SET_UNCACHED code is only used conditionally
   I'm not quite sure what the right checks and right place is,
   but the intent is that it should allow arm to only use that
   path for CMA allocations.  For all existing users of
   CONFIG_ARCH_HAS_DMA_SET_UNCACHED it should evaluate to
   a compile-time true to not change the behavior or code
   generation
 - then in arm select ARCH_HAS_DMA_SET_UNCACHED and implement
   arch_dma_set_uncached, arch_dma_clear_uncached and the new
   helper above
Marek Szyprowski Oct. 20, 2023, 8:16 a.m. UTC | #22
Dear All,

I didn't have enough time to follow the whole discussion, but it looks I 
can add some comments.

On 06.10.2023 09:40, Christoph Hellwig wrote:
> On Thu, Oct 05, 2023 at 01:53:33PM -0400, Jim Quinlan wrote:
>>> Yes, DMA_DIRECT_REMAP should only be used for platforms using the
>>> generic generic remap that plus straight into dma-direct and
>>> bypasses arch_dma_alloc.
>>>
>>> ARM first needs support to directly set the uncached/wc bits on
>>> the direct mapping for CMA, which should be fairly simple but require
>>> wide spread testing.
>>>
>>> I'd be happy to work with anyone who wants to look into this.
>> I'd like to look into this and help make it work for ARCH=arm but you
>> seem to be saying that you also need help from ARM the company?
> No, I don't care about companies.  I just need someone (singular or
> plural) to test a wide range of arm systems.
>
> Here is my idea for the attack plan:
>
> As step 1 ignore the whole CMA direct map issue, and just to the
> trivial generic dma remap conversion.  This should involved:
>
>   - select DMA_DIRECT_REMAP
>   - provide arch_dma_prep_coherent to flush out all dirty data by
>     calling __dma_clear_buffer
>   - remove the existing arch_dma_alloc/arch_dma_free and all their
>     infrastructure
>
> With this things should work fine on any system not using CMA

This won't be that easy.

For historical reasons (performance and limitations of the pre-ARM v7 
cores), on the 32bit ARM the whole kernel's direct mapping is done using 
so called 'sections' (1MiB size afair). Those sections are created in 
the per-process MMU page tables (there are no separate MMU table for the 
kernel mappings), so altering those mappings requires updating bits in 
all processes in the system. Practically this means that those mappings 
has to be static once created during boot time. That's why when no CMA 
is selected, the whole dma_alloc_coherent() allocations are limited to 
rather small region, which is already remapped as non-cached during boot.

This is a serious limitation, that's why some other approach was needed 
and it turned out that CMA can resolve that issue too.

CMA limits the DMA allocations to the specific memory regions, thus each 
such region (part of the kernel's direct map) CAN be easily remapped 
during boot time with standard 4K pages and then altered/updated 
on-demand when coherent allocation is performed. This slightly lowers 
the performance of the memory related operation on that region (access 
to 4K pages is a bit slower compared to the memory mapped with 
sections), but CMA is mainly used on the newer ARMv7 systems which often 
have a decent cache, which mitigates such performance drop.


> Then attack the CMA direct mapping:
>
>   - modify the core DMA mapping code so that the
>     ARCH_HAS_DMA_SET_UNCACHED code is only used conditionally
>     I'm not quite sure what the right checks and right place is,
>     but the intent is that it should allow arm to only use that
>     path for CMA allocations.  For all existing users of
>     CONFIG_ARCH_HAS_DMA_SET_UNCACHED it should evaluate to
>     a compile-time true to not change the behavior or code
>     generation
>   - then in arm select ARCH_HAS_DMA_SET_UNCACHED and implement
>     arch_dma_set_uncached, arch_dma_clear_uncached and the new
>     helper above

The plan for the CMA related case sounds good.

If you need any ARM related tests, let me know. I have a bunch of ARM 
based test machines, which I use for the tests of the linux-next on the 
day-to-day basis.

Best regards
Christoph Hellwig Oct. 23, 2023, 6:16 a.m. UTC | #23
On Fri, Oct 20, 2023 at 10:16:46AM +0200, Marek Szyprowski wrote:
> For historical reasons (performance and limitations of the pre-ARM v7 
> cores), on the 32bit ARM the whole kernel's direct mapping is done using 
> so called 'sections' (1MiB size afair). Those sections are created in 
> the per-process MMU page tables (there are no separate MMU table for the 
> kernel mappings), so altering those mappings requires updating bits in 
> all processes in the system. Practically this means that those mappings 
> has to be static once created during boot time.

That's actually the same on many architetures, and matches the
explanation I heard from Russell before.

> That's why when no CMA 
> is selected, the whole dma_alloc_coherent() allocations are limited to 
> rather small region, which is already remapped as non-cached during boot.

But this does not match my understanding of the code:

 - arch_dma_alloc calls __dma_alloc with is_coherent set to false
 - __dma_alloc then selects cma_allocator if CMA is supported for
   the device / allocation, else remap_allocator if the allocation
   is allowed to block and only if blocking is not allowed pool_allocator
   to allocate from the boot-time pool

This very match matches the dma-direct flow with DMA_DIRECT_REMAP
selected.  The major exception is the direct mapping of the CMA
allocations done by arm32.
diff mbox series

Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 9557808e8937..b6f1cea923cf 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -34,6 +34,7 @@  config ARM
 	select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_HUGETLBFS if ARM_LPAE
+	select DMA_DIRECT_REMAP
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_USE_MEMTEST