[3/5] drm: add ARM64 flush implementations

Message ID 20180124025606.3020-3-gurchetansingh@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Gurchetan Singh Jan. 24, 2018, 2:56 a.m. UTC
This patch uses the __dma_map_area function to flush the cache
on ARM64.

v2: Don't use DMA API, call functions directly (Daniel)

Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
---
 drivers/gpu/drm/drm_cache.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Robin Murphy Jan. 24, 2018, noon UTC | #1
On 24/01/18 02:56, Gurchetan Singh wrote:
> This patch uses the __dma_map_area function to flush the cache
> on ARM64.
> 
> v2: Don't use DMA API, call functions directly (Daniel)
> 
> Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
> ---
>   drivers/gpu/drm/drm_cache.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index 5124582451c6..250cdfbb711f 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -159,6 +159,12 @@ drm_flush_pages(struct page *pages[], unsigned long num_pages)
>   	for (i = 0; i < num_pages; i++)
>   		drm_cache_maint_page(pages[i], 0, PAGE_SIZE, DMA_TO_DEVICE,
>   				     dmac_map_area);
> +#elif defined(CONFIG_ARM64)
> +	unsigned long i;
> +
> +	for (i = 0; i < num_pages; i++)
> +		__dma_map_area(phys_to_virt(page_to_phys(pages[i])), PAGE_SIZE,
> +			       DMA_TO_DEVICE);

Note that this is not exactly equivalent to clflush - if it's at all 
possible for a non-coherent GPU to write back to these pages and someone 
somewhere expects to be able to read the updated data from a CPU, that's 
going to be subtly broken.

This also breaks building DRM as a module. And I doubt that either of 
the two easy solutions to that are going to fly with the respective 
maintainers...

Given all the bodging around which seems to happen in DRM/ION/etc., it 
would be really nice to pin down what exactly the shortcomings of the 
DMA API are for these use-cases, and extend it to address them properly.

Robin.

>   #else
>   	pr_err("Architecture has no drm_cache.c support\n");
>   	WARN_ON_ONCE(1);
> @@ -196,6 +202,13 @@ drm_flush_sg(struct sg_table *st)
>   	for_each_sg_page(st->sgl, &sg_iter, st->nents, 0)
>   		drm_cache_maint_page(sg_page_iter_page(&sg_iter), 0, PAGE_SIZE,
>   				     DMA_TO_DEVICE, dmac_map_area);
> +#elif defined(CONFIG_ARM64)
> +	int i;
> +	struct scatterlist *sg;
> +
> +	for_each_sg(st->sgl, sg, st->nents, i)
> +		__dma_map_area(phys_to_virt(sg_phys(sg)), sg->length,
> +			       DMA_TO_DEVICE);
>   #else
>   	pr_err("Architecture has no drm_cache.c support\n");
>   	WARN_ON_ONCE(1);
>
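
For illustration, a hedged sketch of the other half of the maintenance Robin describes: __dma_map_area(..., DMA_TO_DEVICE) only cleans the cache to the point of coherency, so data written back by a non-coherent GPU would additionally need an invalidate on the CPU side, which the existing arm64 __dma_unmap_area entry point provides. The helper name below is invented for the example; wiring this into drm_cache.c is an assumption, not something the patch does.

	/*
	 * Sketch only, mirroring the drm_flush_pages() context above.
	 * DMA_FROM_DEVICE makes __dma_unmap_area invalidate, so the CPU
	 * doesn't read stale lines over data the GPU wrote back.
	 */
	static void drm_invalidate_pages(struct page *pages[],
					 unsigned long num_pages)
	{
		unsigned long i;

		for (i = 0; i < num_pages; i++)
			__dma_unmap_area(phys_to_virt(page_to_phys(pages[i])),
					 PAGE_SIZE, DMA_FROM_DEVICE);
	}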
Russell King (Oracle) Jan. 24, 2018, 12:36 p.m. UTC | #2
On Wed, Jan 24, 2018 at 12:00:59PM +0000, Robin Murphy wrote:
> On 24/01/18 02:56, Gurchetan Singh wrote:
> >This patch uses the __dma_map_area function to flush the cache
> >on ARM64.
> >
> >v2: Don't use DMA API, call functions directly (Daniel)
> >
> >Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
> >---
> >  drivers/gpu/drm/drm_cache.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> >diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> >index 5124582451c6..250cdfbb711f 100644
> >--- a/drivers/gpu/drm/drm_cache.c
> >+++ b/drivers/gpu/drm/drm_cache.c
> >@@ -159,6 +159,12 @@ drm_flush_pages(struct page *pages[], unsigned long num_pages)
> >  	for (i = 0; i < num_pages; i++)
> >  		drm_cache_maint_page(pages[i], 0, PAGE_SIZE, DMA_TO_DEVICE,
> >  				     dmac_map_area);
> >+#elif defined(CONFIG_ARM64)
> >+	unsigned long i;
> >+
> >+	for (i = 0; i < num_pages; i++)
> >+		__dma_map_area(phys_to_virt(page_to_phys(pages[i])), PAGE_SIZE,
> >+			       DMA_TO_DEVICE);
> 
> Note that this is not exactly equivalent to clflush - if it's at all
> possible for a non-coherent GPU to write back to these pages and someone
> somewhere expects to be able to read the updated data from a CPU, that's
> going to be subtly broken.
> 
> This also breaks building DRM as a module. And I doubt that either of the
> two easy solutions to that are going to fly with the respective
> maintainers...
> 
> Given all the bodging around which seems to happen in DRM/ION/etc., it would
> be really nice to pin down what exactly the shortcomings of the DMA API are
> for these use-cases, and extend it to address them properly.

Aren't the cache flushing instructions available from userspace on
ARM64?  (I've certainly used them in my spectre/meltdown PoC).  If
userspace wants to flush before reading a GPU buffer that has been
written to, why can't it use the already available cache maintenance
instructions?  It's likely to be quicker, and it can target just the
area of the buffer it wants to read.

32-bit ARM is a different matter, and I'll reply separately.
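
As a rough sketch of the userspace route Russell suggests (assuming EL0 cache maintenance access is enabled, which Linux does by setting SCTLR_EL1.UCI, and that CTR_EL0 is readable or trapped and emulated), a buffer range can be cleaned and invalidated to the PoC line by line; the helper names are made up for the example:

	#include <stddef.h>
	#include <stdint.h>

	static size_t dcache_line_size(void)
	{
		uint64_t ctr;

		/* CTR_EL0.DminLine = log2(words) of the smallest D-cache line. */
		asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
		return 4UL << ((ctr >> 16) & 0xf);
	}

	static void clean_inval_to_poc(void *buf, size_t len)
	{
		size_t line = dcache_line_size();
		uintptr_t p = (uintptr_t)buf & ~(line - 1);
		uintptr_t end = (uintptr_t)buf + len;

		/* DC CIVAC: clean+invalidate by VA to PoC, usable at EL0. */
		for (; p < end; p += line)
			asm volatile("dc civac, %0" : : "r" (p) : "memory");
		asm volatile("dsb sy" : : : "memory");
	}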
Robin Murphy Jan. 24, 2018, 1:14 p.m. UTC | #3
On 24/01/18 12:36, Russell King - ARM Linux wrote:
> On Wed, Jan 24, 2018 at 12:00:59PM +0000, Robin Murphy wrote:
>> On 24/01/18 02:56, Gurchetan Singh wrote:
>>> This patch uses the __dma_map_area function to flush the cache
>>> on ARM64.
>>>
>>> v2: Don't use DMA API, call functions directly (Daniel)
>>>
>>> Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
>>> ---
>>>   drivers/gpu/drm/drm_cache.c | 13 +++++++++++++
>>>   1 file changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
>>> index 5124582451c6..250cdfbb711f 100644
>>> --- a/drivers/gpu/drm/drm_cache.c
>>> +++ b/drivers/gpu/drm/drm_cache.c
>>> @@ -159,6 +159,12 @@ drm_flush_pages(struct page *pages[], unsigned long num_pages)
>>>   	for (i = 0; i < num_pages; i++)
>>>   		drm_cache_maint_page(pages[i], 0, PAGE_SIZE, DMA_TO_DEVICE,
>>>   				     dmac_map_area);
>>> +#elif defined(CONFIG_ARM64)
>>> +	unsigned long i;
>>> +
>>> +	for (i = 0; i < num_pages; i++)
>>> +		__dma_map_area(phys_to_virt(page_to_phys(pages[i])), PAGE_SIZE,
>>> +			       DMA_TO_DEVICE);
>>
>> Note that this is not exactly equivalent to clflush - if it's at all
>> possible for a non-coherent GPU to write back to these pages and someone
>> somewhere expects to be able to read the updated data from a CPU, that's
>> going to be subtly broken.
>>
>> This also breaks building DRM as a module. And I doubt that either of the
>> two easy solutions to that are going to fly with the respective
>> maintainers...
>>
>> Given all the bodging around which seems to happen in DRM/ION/etc., it would
>> be really nice to pin down what exactly the shortcomings of the DMA API are
>> for these use-cases, and extend it to address them properly.
> 
> Aren't the cache flushing instructions available from userspace on
> ARM64?  (I've certainly used them in my spectre/meltdown PoC).  If
> userspace wants to flush before reading a GPU buffer that has been
> written to, why can't it use the already available cache maintenance
> instructions?  It's likely to be quicker, and it can target just the
> area of the buffer it wants to read.

Indeed, whilst the main reason we enable EL0 cache maintenance access is 
for JITs cleaning to PoU, it inherently makes PoC operations available 
as well. The semantics of drm_clflush_*() weren't immediately obvious, 
but if it is purely a sync point around userspace reading/writing GPU 
buffers directly, then what you suggest does sound ideal (and 
__dma_map_area is definitely wrong for the read case as above).

> 
> 32-bit ARM is a different matter, and I'll reply separately.
> 

Thanks,
Robin.
Daniel Vetter Jan. 30, 2018, 9:53 a.m. UTC | #4
On Wed, Jan 24, 2018 at 12:00:59PM +0000, Robin Murphy wrote:
> On 24/01/18 02:56, Gurchetan Singh wrote:
> > This patch uses the __dma_map_area function to flush the cache
> > on ARM64.
> > 
> > v2: Don't use DMA API, call functions directly (Daniel)
> > 
> > Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org>
> > ---
> >   drivers/gpu/drm/drm_cache.c | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> > index 5124582451c6..250cdfbb711f 100644
> > --- a/drivers/gpu/drm/drm_cache.c
> > +++ b/drivers/gpu/drm/drm_cache.c
> > @@ -159,6 +159,12 @@ drm_flush_pages(struct page *pages[], unsigned long num_pages)
> >   	for (i = 0; i < num_pages; i++)
> >   		drm_cache_maint_page(pages[i], 0, PAGE_SIZE, DMA_TO_DEVICE,
> >   				     dmac_map_area);
> > +#elif defined(CONFIG_ARM64)
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < num_pages; i++)
> > +		__dma_map_area(phys_to_virt(page_to_phys(pages[i])), PAGE_SIZE,
> > +			       DMA_TO_DEVICE);
> 
> Note that this is not exactly equivalent to clflush - if it's at all
> possible for a non-coherent GPU to write back to these pages and someone
> somewhere expects to be able to read the updated data from a CPU, that's
> going to be subtly broken.
> 
> This also breaks building DRM as a module. And I doubt that either of the
> two easy solutions to that are going to fly with the respective
> maintainers...
> 
> Given all the bodging around which seems to happen in DRM/ION/etc., it would
> be really nice to pin down what exactly the shortcomings of the DMA API are
> for these use-cases, and extend it to address them properly.

tldr; a dma-buf exporter more-or-less has to act like a dma-api
architecture implementation. Which means flushing stuff at the right time.
Going through the dma-api means we need to pass around a struct device *
where none is needed, which seems silly. The original patches did add that
dummy struct device *, but after digging around in the actual
implementations we've noticed that there's no need for them.

Ofc you can question whether gpu drivers really need to noodle around in
such platform details, but given that all of them do that (all = those
that implement rendering, not just display) I'm just accepting that as a
fact of life. It's definitely unrealistic to demand those all get fixed,
even if the end result would be more maintainable.

We can ofc postpone this entire discussion by mandating that all shared
gpu buffers on ARM32 must be wc mapped by everyone. But at least on some
x86 machines (it's a function of how big your caches are and where your
gpu sits) cached access is actually faster for upload/download buffers.
Ofc since the dma-api tries to hide all this the wc vs. cached assumptions
are all implicit in dma-buf, which makes this all lots of fun.
-Daniel

> 
> Robin.
> 
> >   #else
> >   	pr_err("Architecture has no drm_cache.c support\n");
> >   	WARN_ON_ONCE(1);
> > @@ -196,6 +202,13 @@ drm_flush_sg(struct sg_table *st)
> >   	for_each_sg_page(st->sgl, &sg_iter, st->nents, 0)
> >   		drm_cache_maint_page(sg_page_iter_page(&sg_iter), 0, PAGE_SIZE,
> >   				     DMA_TO_DEVICE, dmac_map_area);
> > +#elif defined(CONFIG_ARM64)
> > +	int i;
> > +	struct scatterlist *sg;
> > +
> > +	for_each_sg(st->sgl, sg, st->nents, i)
> > +		__dma_map_area(phys_to_virt(sg_phys(sg)), sg->length,
> > +			       DMA_TO_DEVICE);
> >   #else
> >   	pr_err("Architecture has no drm_cache.c support\n");
> >   	WARN_ON_ONCE(1);
> > 
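
For comparison with the direct __dma_map_area calls, the DMA-API form of the flush Daniel describes is essentially one call, but it needs the struct device * he mentions having to pass around. A hedged sketch (the function name is invented, and dev must be the device the scatterlist was mapped for):

	/*
	 * Sketch: DMA-API equivalent of drm_flush_sg() for one device.
	 * Needs <linux/dma-mapping.h>; nents must be the value returned
	 * by dma_map_sg() for this table.
	 */
	static void flush_sg_for_device(struct device *dev, struct sg_table *st)
	{
		dma_sync_sg_for_device(dev, st->sgl, st->nents, DMA_TO_DEVICE);
	}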

Patch

diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index 5124582451c6..250cdfbb711f 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -159,6 +159,12 @@ drm_flush_pages(struct page *pages[], unsigned long num_pages)
 	for (i = 0; i < num_pages; i++)
 		drm_cache_maint_page(pages[i], 0, PAGE_SIZE, DMA_TO_DEVICE,
 				     dmac_map_area);
+#elif defined(CONFIG_ARM64)
+	unsigned long i;
+
+	for (i = 0; i < num_pages; i++)
+		__dma_map_area(phys_to_virt(page_to_phys(pages[i])), PAGE_SIZE,
+			       DMA_TO_DEVICE);
 #else
 	pr_err("Architecture has no drm_cache.c support\n");
 	WARN_ON_ONCE(1);
@@ -196,6 +202,13 @@ drm_flush_sg(struct sg_table *st)
 	for_each_sg_page(st->sgl, &sg_iter, st->nents, 0)
 		drm_cache_maint_page(sg_page_iter_page(&sg_iter), 0, PAGE_SIZE,
 				     DMA_TO_DEVICE, dmac_map_area);
+#elif defined(CONFIG_ARM64)
+	int i;
+	struct scatterlist *sg;
+
+	for_each_sg(st->sgl, sg, st->nents, i)
+		__dma_map_area(phys_to_virt(sg_phys(sg)), sg->length,
+			       DMA_TO_DEVICE);
 #else
 	pr_err("Architecture has no drm_cache.c support\n");
 	WARN_ON_ONCE(1);