diff mbox

[v2,22/22] gpu: host1x: At first try a non-blocking allocation for the gather copy

Message ID 036615566011ab4d52190439a12d943df9e704ef.1497394243.git.digetx@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dmitry Osipenko June 13, 2017, 11:16 p.m. UTC
The blocking gather copy allocation is a major performance downside of the
Host1x firewall, it may take hundreds milliseconds which is unacceptable
for the real-time graphics operations. Let's try a non-blocking allocation
first as a least invasive solution, it makes opentegra (Xorg driver)
performance indistinguishable with/without the firewall.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---

The other much more invasive solution could be: to have a memory pool
reserved for the gather copy and performing the firewall checks (and the
gather copying) on the actual job submission to the CDMA, not on the job
allocation like it is done now. This solution reduces the overhead of a
gather copy allocations.

 drivers/gpu/host1x/job.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Erik Faye-Lund June 14, 2017, 7:56 a.m. UTC | #1
On Wed, Jun 14, 2017 at 1:16 AM, Dmitry Osipenko <digetx@gmail.com> wrote:
> The blocking gather copy allocation is a major performance downside of the
> Host1x firewall, it may take hundreds milliseconds which is unacceptable
> for the real-time graphics operations. Let's try a non-blocking allocation
> first as a least invasive solution, it makes opentegra (Xorg driver)
> performance indistinguishable with/without the firewall.
>
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
>
> The other much more invasive solution could be: to have a memory pool
> reserved for the gather copy and performing the firewall checks (and the
> gather copying) on the actual job submission to the CDMA, not on the job
> allocation like it is done now. This solution reduces the overhead of a
> gather copy allocations.
>
>  drivers/gpu/host1x/job.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
> index fc194c676d91..ed0575f14093 100644
> --- a/drivers/gpu/host1x/job.c
> +++ b/drivers/gpu/host1x/job.c
> @@ -594,12 +594,20 @@ static inline int copy_gathers(struct host1x_job *job, struct device *dev)
>                 size += g->words * sizeof(u32);
>         }
>
> +       /*
> +        * Try a non-blocking allocation from a higher priority pools first,
> +        * as awaiting for the allocation here is a major performance hit.
> +        */
>         job->gather_copy_mapped = dma_alloc_wc(dev, size, &job->gather_copy,
> -                                              GFP_KERNEL);
> -       if (!job->gather_copy_mapped) {
> -               job->gather_copy_mapped = NULL;
> +                                              GFP_NOWAIT);
> +
> +       /* the higher priority allocation failed, try the generic-blocking */
> +       if (!job->gather_copy_mapped)
> +               job->gather_copy_mapped = dma_alloc_wc(dev, size,
> +                                                      &job->gather_copy,
> +                                                      GFP_KERNEL);
> +       if (!job->gather_copy_mapped)
>                 return -ENOMEM;
> -       }
>
>         job->gather_copy_size = size;

Can't we just have a static fixed-size buffer, and validate chunks at
the time? Or why can't we validate directly on the mapped pointers?

It feels pretty silly to have to allocate memory in the first place here...
Dmitry Osipenko June 14, 2017, 8:32 a.m. UTC | #2
On 14.06.2017 10:56, Erik Faye-Lund wrote:
> On Wed, Jun 14, 2017 at 1:16 AM, Dmitry Osipenko <digetx@gmail.com> wrote:
>> The blocking gather copy allocation is a major performance downside of the
>> Host1x firewall, it may take hundreds milliseconds which is unacceptable
>> for the real-time graphics operations. Let's try a non-blocking allocation
>> first as a least invasive solution, it makes opentegra (Xorg driver)
>> performance indistinguishable with/without the firewall.
>>
>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>> ---
>>
>> The other much more invasive solution could be: to have a memory pool
>> reserved for the gather copy and performing the firewall checks (and the
>> gather copying) on the actual job submission to the CDMA, not on the job
>> allocation like it is done now. This solution reduces the overhead of a
>> gather copy allocations.
>>
>>  drivers/gpu/host1x/job.c | 16 ++++++++++++----
>>  1 file changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
>> index fc194c676d91..ed0575f14093 100644
>> --- a/drivers/gpu/host1x/job.c
>> +++ b/drivers/gpu/host1x/job.c
>> @@ -594,12 +594,20 @@ static inline int copy_gathers(struct host1x_job *job, struct device *dev)
>>                 size += g->words * sizeof(u32);
>>         }
>>
>> +       /*
>> +        * Try a non-blocking allocation from a higher priority pools first,
>> +        * as awaiting for the allocation here is a major performance hit.
>> +        */
>>         job->gather_copy_mapped = dma_alloc_wc(dev, size, &job->gather_copy,
>> -                                              GFP_KERNEL);
>> -       if (!job->gather_copy_mapped) {
>> -               job->gather_copy_mapped = NULL;
>> +                                              GFP_NOWAIT);
>> +
>> +       /* the higher priority allocation failed, try the generic-blocking */
>> +       if (!job->gather_copy_mapped)
>> +               job->gather_copy_mapped = dma_alloc_wc(dev, size,
>> +                                                      &job->gather_copy,
>> +                                                      GFP_KERNEL);
>> +       if (!job->gather_copy_mapped)
>>                 return -ENOMEM;
>> -       }
>>
>>         job->gather_copy_size = size;
> 
> Can't we just have a static fixed-size buffer, and validate chunks at
> the time? Or why can't we validate directly on the mapped pointers?
> 
> It feels pretty silly to have to allocate memory in the first place here...
> 

We can't validate the mapped pointers because userspace may write to the buffers
memory at any time. Maybe it should be possible to write-protect cmdbuffers
memory for the time of its submission and execution, but I'm not sure whether
it's worth the effort. The current-proposed solution is efficient and least
invasive.
Erik Faye-Lund June 14, 2017, 8:44 a.m. UTC | #3
On Wed, Jun 14, 2017 at 10:32 AM, Dmitry Osipenko <digetx@gmail.com> wrote:
> On 14.06.2017 10:56, Erik Faye-Lund wrote:
>> On Wed, Jun 14, 2017 at 1:16 AM, Dmitry Osipenko <digetx@gmail.com> wrote:
>>> The blocking gather copy allocation is a major performance downside of the
>>> Host1x firewall, it may take hundreds milliseconds which is unacceptable
>>> for the real-time graphics operations. Let's try a non-blocking allocation
>>> first as a least invasive solution, it makes opentegra (Xorg driver)
>>> performance indistinguishable with/without the firewall.
>>>
>>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>>> ---
>>>
>>> The other much more invasive solution could be: to have a memory pool
>>> reserved for the gather copy and performing the firewall checks (and the
>>> gather copying) on the actual job submission to the CDMA, not on the job
>>> allocation like it is done now. This solution reduces the overhead of a
>>> gather copy allocations.
>>>
>>>  drivers/gpu/host1x/job.c | 16 ++++++++++++----
>>>  1 file changed, 12 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
>>> index fc194c676d91..ed0575f14093 100644
>>> --- a/drivers/gpu/host1x/job.c
>>> +++ b/drivers/gpu/host1x/job.c
>>> @@ -594,12 +594,20 @@ static inline int copy_gathers(struct host1x_job *job, struct device *dev)
>>>                 size += g->words * sizeof(u32);
>>>         }
>>>
>>> +       /*
>>> +        * Try a non-blocking allocation from a higher priority pools first,
>>> +        * as awaiting for the allocation here is a major performance hit.
>>> +        */
>>>         job->gather_copy_mapped = dma_alloc_wc(dev, size, &job->gather_copy,
>>> -                                              GFP_KERNEL);
>>> -       if (!job->gather_copy_mapped) {
>>> -               job->gather_copy_mapped = NULL;
>>> +                                              GFP_NOWAIT);
>>> +
>>> +       /* the higher priority allocation failed, try the generic-blocking */
>>> +       if (!job->gather_copy_mapped)
>>> +               job->gather_copy_mapped = dma_alloc_wc(dev, size,
>>> +                                                      &job->gather_copy,
>>> +                                                      GFP_KERNEL);
>>> +       if (!job->gather_copy_mapped)
>>>                 return -ENOMEM;
>>> -       }
>>>
>>>         job->gather_copy_size = size;
>>
>> Can't we just have a static fixed-size buffer, and validate chunks at
>> the time? Or why can't we validate directly on the mapped pointers?
>>
>> It feels pretty silly to have to allocate memory in the first place here...
>>
>
> We can't validate the mapped pointers because userspace may write to the buffers
> memory at any time. Maybe it should be possible to write-protect cmdbuffers
> memory for the time of its submission and execution, but I'm not sure whether
> it's worth the effort. The current-proposed solution is efficient and least
> invasive.

Right, good point.

Reviewed-by: Erik Faye-Lund <kusmabite@gmail.com>
diff mbox

Patch

diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index fc194c676d91..ed0575f14093 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -594,12 +594,20 @@  static inline int copy_gathers(struct host1x_job *job, struct device *dev)
 		size += g->words * sizeof(u32);
 	}
 
+	/*
+	 * Try a non-blocking allocation from a higher priority pools first,
+	 * as awaiting for the allocation here is a major performance hit.
+	 */
 	job->gather_copy_mapped = dma_alloc_wc(dev, size, &job->gather_copy,
-					       GFP_KERNEL);
-	if (!job->gather_copy_mapped) {
-		job->gather_copy_mapped = NULL;
+					       GFP_NOWAIT);
+
+	/* the higher priority allocation failed, try the generic-blocking */
+	if (!job->gather_copy_mapped)
+		job->gather_copy_mapped = dma_alloc_wc(dev, size,
+						       &job->gather_copy,
+						       GFP_KERNEL);
+	if (!job->gather_copy_mapped)
 		return -ENOMEM;
-	}
 
 	job->gather_copy_size = size;