diff mbox series

[3/3] drm/panthor: Prevent potential overwrite of buffer objects

Message ID 20241024145432.934086-4-akash.goel@arm.com (mailing list archive)
State New, archived
Headers show
Series drm/panthor: Coherency related fixes | expand

Commit Message

Akash Goel Oct. 24, 2024, 2:54 p.m. UTC
All CPU mappings are forced as uncached for Panthor buffer objects when
system(IO) coherency is disabled. Physical backing for Panthor BOs is
allocated by shmem, which clears the pages also after allocation. But
there is no explicit cache flush done after the clearing of pages.
So it could happen that there are dirty cachelines in the CPU cache
for the BOs, when they are accessed from the CPU side through uncached
CPU mapping, and the eviction of cachelines overwrites the data of BOs.

This commit tries to avoid the potential overwrite scenario.

Signed-off-by: Akash Goel <akash.goel@arm.com>
---
 drivers/gpu/drm/panthor/panthor_gem.h | 10 ++++++++++
 drivers/gpu/drm/panthor/panthor_mmu.c |  5 +++++
 2 files changed, 15 insertions(+)

Comments

Akash Goel Oct. 31, 2024, 9:42 p.m. UTC | #1
On 10/24/24 16:39, Boris Brezillon wrote:
> On Thu, 24 Oct 2024 15:54:32 +0100
> Akash Goel <akash.goel@arm.com> wrote:
> 
>> All CPU mappings are forced as uncached for Panthor buffer objects when
>> system(IO) coherency is disabled. Physical backing for Panthor BOs is
>> allocated by shmem, which clears the pages also after allocation. But
>> there is no explicit cache flush done after the clearing of pages.
>> So it could happen that there are dirty cachelines in the CPU cache
>> for the BOs, when they are accessed from the CPU side through uncached
>> CPU mapping, and the eviction of cachelines overwrites the data of BOs.
> 
> Hm, this looks like something that should be handled at the
> drm_gem_shmem level when drm_gem_shmem_object::map_wc=true, as I
> suspect other drivers can hit the same issue (I'm thinking of panfrost
> and lima, but there might be others).
> 


I am sorry for the late reply.
Many thanks for the quick feedback.

I assume you also reckon that there is a potential problem here for arm64.

Fully agree with your suggestion that the handling needs to be at the 
drm_gem_shmem level. I was not sure if we really need to do anything, as 
I didn't observe any overwrite issue during the testing. So thought 
better to limit the change to Panthor and get some feedback.

shmem calls 'flush_dcache_folio()' after clearing the pages but that 
just clears the 'PG_dcache_clean' bit and CPU cache is not cleaned 
immediately.

I realize that this patch is not foolproof, as Userspace can try to 
populate the BO from CPU side before mapping it on the GPU side.

Not sure if we also need to consider the case when shmem pages are 
swapped out. Don't know if there could be a similar situation of dirty 
cachelines after the swap in.

Also not sure if dma_sync_sgtable_for_device() can be called from 
drm_gem_shmem_get_pages() as the sg_table won't be available at that point.

Please let me know your thoughts.

Best regards
Akash



>>
>> This commit tries to avoid the potential overwrite scenario.
>>
>> Signed-off-by: Akash Goel <akash.goel@arm.com>
>> ---
>>   drivers/gpu/drm/panthor/panthor_gem.h | 10 ++++++++++
>>   drivers/gpu/drm/panthor/panthor_mmu.c |  5 +++++
>>   2 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
>> index e43021cf6d45..4b0f43f1edf1 100644
>> --- a/drivers/gpu/drm/panthor/panthor_gem.h
>> +++ b/drivers/gpu/drm/panthor/panthor_gem.h
>> @@ -46,6 +46,16 @@ struct panthor_gem_object {
>>   
>>   	/** @flags: Combination of drm_panthor_bo_flags flags. */
>>   	u32 flags;
>> +
>> +	/**
>> +	 * @cleaned: The buffer object pages have been cleaned.
>> +	 *
>> +	 * There could be dirty CPU cachelines for the pages of buffer object
>> +	 * after allocation, as shmem will zero out the pages. The cachelines
>> +	 * need to be cleaned if the pages are going to be accessed with an
>> +	 * uncached CPU mapping.
>> +	 */
>> +	bool cleaned;
>>   };
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
>> index f522a116c1b1..d8cc9e7d064e 100644
>> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
>> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
>> @@ -1249,6 +1249,11 @@ static int panthor_vm_prepare_map_op_ctx(struct panthor_vm_op_ctx *op_ctx,
>>   
>>   	op_ctx->map.sgt = sgt;
>>   
>> +	if (bo->base.map_wc && !bo->cleaned) {
>> +		dma_sync_sgtable_for_device(vm->ptdev->base.dev, sgt, DMA_TO_DEVICE);
>> +		bo->cleaned = true;
>> +	}
>> +
>>   	preallocated_vm_bo = drm_gpuvm_bo_create(&vm->base, &bo->base.base);
>>   	if (!preallocated_vm_bo) {
>>   		if (!bo->base.base.import_attach)
>
Boris Brezillon Nov. 4, 2024, 11:16 a.m. UTC | #2
Hi Akash,

On Thu, 31 Oct 2024 21:42:27 +0000
Akash Goel <akash.goel@arm.com> wrote:

> I assume you also reckon that there is a potential problem here for arm64.

It impacts any system that's not IO-coherent I would say, and this
comment seems to prove this is a known issue [3].

> 
> Fully agree with your suggestion that the handling needs to be at the 
> drm_gem_shmem level. I was not sure if we really need to do anything, as 
> I didn't observe any overwrite issue during the testing. So thought 
> better to limit the change to Panthor and get some feedback.

Actually, I wonder if PowerVR isn't papering over the same issue with
[1], so it looks like at least two drivers would benefit from a fix at
the drm_gem_shmem level.

> 
> shmem calls 'flush_dcache_folio()' after clearing the pages but that 
> just clears the 'PG_dcache_clean' bit and CPU cache is not cleaned 
> immediately.
> 
> I realize that this patch is not foolproof, as Userspace can try to 
> populate the BO from CPU side before mapping it on the GPU side.
> 
> Not sure if we also need to consider the case when shmem pages are 
> swapped out. Don't know if there could be a similar situation of dirty 
> cachelines after the swap in.

I think we do. We basically need to flush CPU caches any time
pages are [re]allocated, because the shmem layer will either zero-out
(first allocation) or populate (swap-in) in that path, and in both
cases, it involves a CPU copy to a cached mapping.

> 
> Also not sure if dma_sync_sgtable_for_device() can be called from 
> drm_gem_shmem_get_pages() as the sg_table won't be available at that point.

Okay, that's indeed an issue. Maybe we should tie the sgt allocation to
the pages allocation, as I can't think of a case where we would
allocate pages without needing the sg table that goes with it. And if
there are driver that want the sgt to be lazily allocated, we can
always add a drm_gem_shmem_object::lazy_sgt_alloc flag.

Regards,

Boris

[1]https://elixir.bootlin.com/linux/v6.11.6/source/drivers/gpu/drm/imagination/pvr_gem.c#L363
[2]https://elixir.bootlin.com/linux/v6.11.6/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L177
[3]https://elixir.bootlin.com/linux/v6.11.6/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L185
Akash Goel Nov. 4, 2024, 12:49 p.m. UTC | #3
On 11/4/24 11:16, Boris Brezillon wrote:
> Hi Akash,
> 
> On Thu, 31 Oct 2024 21:42:27 +0000
> Akash Goel <akash.goel@arm.com> wrote:
> 
>> I assume you also reckon that there is a potential problem here for arm64.
> 
> It impacts any system that's not IO-coherent I would say, and this
> comment seems to prove this is a known issue [3].
> 

Thanks for confirming.

Actually I had tried to check with Daniel Vetter about [3], as it was 
not clear to me that how that code exactly helped in x86 case.
As far as I understand, [3] updates the attribute of direct kernel 
mapping of the shmem pages to WC, so as to be consistent with the 
Userspace mapping of the pages or their vmapping inside the kernel.
But didn't get how that alignment actually helped in cleaning the dirty 
cache lines.

>>
>> Fully agree with your suggestion that the handling needs to be at the
>> drm_gem_shmem level. I was not sure if we really need to do anything, as
>> I didn't observe any overwrite issue during the testing. So thought
>> better to limit the change to Panthor and get some feedback.
> 
> Actually, I wonder if PowerVR isn't papering over the same issue with
> [1], so it looks like at least two drivers would benefit from a fix at
> the drm_gem_shmem level.
> 

Thanks for giving the reference of PowerVR code.
It is unconditionally calling dma_sync_sgtable after acquiring the 
pages, so could be papering over the issue as you suspected.

>>
>> shmem calls 'flush_dcache_folio()' after clearing the pages but that
>> just clears the 'PG_dcache_clean' bit and CPU cache is not cleaned
>> immediately.
>>
>> I realize that this patch is not foolproof, as Userspace can try to
>> populate the BO from CPU side before mapping it on the GPU side.
>>
>> Not sure if we also need to consider the case when shmem pages are
>> swapped out. Don't know if there could be a similar situation of dirty
>> cachelines after the swap in.
> 
> I think we do. We basically need to flush CPU caches any time
> pages are [re]allocated, because the shmem layer will either zero-out
> (first allocation) or populate (swap-in) in that path, and in both
> cases, it involves a CPU copy to a cached mapping.
> 

Thanks for confirming.

I think we may have to do cache flush page by page.
Not all pages might get swapped out and the initial allocation of all 
pages may not happen at the same time.
Please correct me if my understanding is wrong.


>>
>> Also not sure if dma_sync_sgtable_for_device() can be called from
>> drm_gem_shmem_get_pages() as the sg_table won't be available at that point.
> 
> Okay, that's indeed an issue. Maybe we should tie the sgt allocation to
> the pages allocation, as I can't think of a case where we would
> allocate pages without needing the sg table that goes with it. And if
> there are driver that want the sgt to be lazily allocated, we can
> always add a drm_gem_shmem_object::lazy_sgt_alloc flag.
> 

Many thanks for the suggestion.

Will try to see how we can progress this work.

Best regards
Akash


> Regards,
> 
> Boris
> 
> [1]https://elixir.bootlin.com/linux/v6.11.6/source/drivers/gpu/drm/imagination/pvr_gem.c#L363
> [2]https://elixir.bootlin.com/linux/v6.11.6/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L177
> [3]https://elixir.bootlin.com/linux/v6.11.6/source/drivers/gpu/drm/drm_gem_shmem_helper.c#L185
Boris Brezillon Nov. 4, 2024, 1:41 p.m. UTC | #4
On Mon, 4 Nov 2024 12:49:56 +0000
Akash Goel <akash.goel@arm.com> wrote:

> On 11/4/24 11:16, Boris Brezillon wrote:
> > Hi Akash,
> > 
> > On Thu, 31 Oct 2024 21:42:27 +0000
> > Akash Goel <akash.goel@arm.com> wrote:
> >   
> >> I assume you also reckon that there is a potential problem here for arm64.  
> > 
> > It impacts any system that's not IO-coherent I would say, and this
> > comment seems to prove this is a known issue [3].
> >   
> 
> Thanks for confirming.
> 
> Actually I had tried to check with Daniel Vetter about [3], as it was 
> not clear to me that how that code exactly helped in x86 case.
> As far as I understand, [3] updates the attribute of direct kernel 
> mapping of the shmem pages to WC, so as to be consistent with the 
> Userspace mapping of the pages or their vmapping inside the kernel.
> But didn't get how that alignment actually helped in cleaning the dirty 
> cache lines.

Yeah, I was not referring to the code but rather the fact that x86,
with its IO coherency model, is a special case here, and that other
archs probably need explicit flushes in a few places.

> >>
> >> shmem calls 'flush_dcache_folio()' after clearing the pages but that
> >> just clears the 'PG_dcache_clean' bit and CPU cache is not cleaned
> >> immediately.
> >>
> >> I realize that this patch is not foolproof, as Userspace can try to
> >> populate the BO from CPU side before mapping it on the GPU side.
> >>
> >> Not sure if we also need to consider the case when shmem pages are
> >> swapped out. Don't know if there could be a similar situation of dirty
> >> cachelines after the swap in.  
> > 
> > I think we do. We basically need to flush CPU caches any time
> > pages are [re]allocated, because the shmem layer will either zero-out
> > (first allocation) or populate (swap-in) in that path, and in both
> > cases, it involves a CPU copy to a cached mapping.
> >   
> 
> Thanks for confirming.
> 
> I think we may have to do cache flush page by page.
> Not all pages might get swapped out and the initial allocation of all 
> pages may not happen at the same time.

If the pages are mapped GPU-side, it's always all pages at a time (at
least until we add support for lazy page allocation, AKA growing/heap
buffers). You're right that GPU buffers that have only been mapped
CPU-side with mmap() get their pages lazily allocated, though I'm not
really sure we care about optimizing that case just yet.

> Please correct me if my understanding is wrong.

Eviction should be rare enough that we can probably pay the price of a
flush on the entire BO range.
diff mbox series

Patch

diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
index e43021cf6d45..4b0f43f1edf1 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.h
+++ b/drivers/gpu/drm/panthor/panthor_gem.h
@@ -46,6 +46,16 @@  struct panthor_gem_object {
 
 	/** @flags: Combination of drm_panthor_bo_flags flags. */
 	u32 flags;
+
+	/**
+	 * @cleaned: The buffer object pages have been cleaned.
+	 *
+	 * There could be dirty CPU cachelines for the pages of buffer object
+	 * after allocation, as shmem will zero out the pages. The cachelines
+	 * need to be cleaned if the pages are going to be accessed with an
+	 * uncached CPU mapping.
+	 */
+	bool cleaned;
 };
 
 /**
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index f522a116c1b1..d8cc9e7d064e 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -1249,6 +1249,11 @@  static int panthor_vm_prepare_map_op_ctx(struct panthor_vm_op_ctx *op_ctx,
 
 	op_ctx->map.sgt = sgt;
 
+	if (bo->base.map_wc && !bo->cleaned) {
+		dma_sync_sgtable_for_device(vm->ptdev->base.dev, sgt, DMA_TO_DEVICE);
+		bo->cleaned = true;
+	}
+
 	preallocated_vm_bo = drm_gpuvm_bo_create(&vm->base, &bo->base.base);
 	if (!preallocated_vm_bo) {
 		if (!bo->base.base.import_attach)