diff mbox

[v7,6/6] drm/msm: iommu: Replace runtime calls with runtime suppliers

Message ID 1517999482-17317-7-git-send-email-vivek.gautam@codeaurora.org (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Vivek Gautam Feb. 7, 2018, 10:31 a.m. UTC
While handling the concerned iommu, there should not be a
need to power control the drm devices from iommu interface.
If these drm devices need to be powered around this time,
the respective drivers should take care of this.

Replace the pm_runtime_get/put_sync(<drm_device>) with
pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
the connected iommu through the device link interface.
In case the device link is not setup these get/put_suppliers()
calls will be a no-op, and the iommu driver should take care of
powering on its devices accordingly.

Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
---
 drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Comments

Tomasz Figa Feb. 13, 2018, 9:10 a.m. UTC | #1
Hi Vivek,

Thanks for the patch. Please see my comments inline.

On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
<vivek.gautam@codeaurora.org> wrote:
> While handling the concerned iommu, there should not be a
> need to power control the drm devices from iommu interface.
> If these drm devices need to be powered around this time,
> the respective drivers should take care of this.
>
> Replace the pm_runtime_get/put_sync(<drm_device>) with
> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
> the connected iommu through the device link interface.
> In case the device link is not setup these get/put_suppliers()
> calls will be a no-op, and the iommu driver should take care of
> powering on its devices accordingly.
>
> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
> ---
>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index b23d33622f37..1ab629bbee69 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>         int ret;
>
> -       pm_runtime_get_sync(mmu->dev);
> +       pm_runtime_get_suppliers(mmu->dev);
>         ret = iommu_attach_device(iommu->domain, mmu->dev);
> -       pm_runtime_put_sync(mmu->dev);
> +       pm_runtime_put_suppliers(mmu->dev);

For me, it looks like a wrong place to handle runtime PM of IOMMU
here. iommu_attach_device() calls into IOMMU driver's attach_device()
callback and that's where necessary runtime PM gets should happen, if
any. In other words, driver A (MSM DRM driver) shouldn't be dealing
with power state of device controlled by driver B (ARM SMMU).

This is also important for the reasons I stated in my comments to
"[PATCH v7 1/6] base: power: runtime: Export
pm_runtime_get/put_suppliers". Quoting for everyone's convenience:

>> There are however cases in which the consumer wants to power-on
>> the supplier, but not itself.
>> E.g., A Graphics or multimedia driver wants to power-on the SMMU
>> to unmap a buffer and finish the TLB operations without powering
>> on itself.
>
>This sounds strange to me. If the SMMU is powered down, wouldn't the
>TLB lose its contents as well (and so no flushing needed)?
>
>Other than that, what kind of hardware operations would be needed
>besides just updating the page tables from the CPU?
>

In other words, the SMMU driver can deal with hardware state based on
return value of pm_runtime_get_sync() or pm_runtime_get_if_in_use()
and decide whether some operations are necessary or not, e.g.
- a state restore is necessary if the domain was powered off, but we
are bringing the master on,
- a flush may not be required when (un)mapping with the domain powered off,
- etc.

Best regards,
Tomasz
Jordan Crouse Feb. 13, 2018, 4:42 p.m. UTC | #2
On Tue, Feb 13, 2018 at 06:10:38PM +0900, Tomasz Figa wrote:
> Hi Vivek,
> 
> Thanks for the patch. Please see my comments inline.
> 
> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
> <vivek.gautam@codeaurora.org> wrote:
> > While handling the concerned iommu, there should not be a
> > need to power control the drm devices from iommu interface.
> > If these drm devices need to be powered around this time,
> > the respective drivers should take care of this.
> >
> > Replace the pm_runtime_get/put_sync(<drm_device>) with
> > pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
> > the connected iommu through the device link interface.
> > In case the device link is not setup these get/put_suppliers()
> > calls will be a no-op, and the iommu driver should take care of
> > powering on its devices accordingly.
> >
> > Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
> > ---
> >  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> > index b23d33622f37..1ab629bbee69 100644
> > --- a/drivers/gpu/drm/msm/msm_iommu.c
> > +++ b/drivers/gpu/drm/msm/msm_iommu.c
> > @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
> >         struct msm_iommu *iommu = to_msm_iommu(mmu);
> >         int ret;
> >
> > -       pm_runtime_get_sync(mmu->dev);
> > +       pm_runtime_get_suppliers(mmu->dev);
> >         ret = iommu_attach_device(iommu->domain, mmu->dev);
> > -       pm_runtime_put_sync(mmu->dev);
> > +       pm_runtime_put_suppliers(mmu->dev);
> 
> For me, it looks like a wrong place to handle runtime PM of IOMMU
> here. iommu_attach_device() calls into IOMMU driver's attach_device()
> callback and that's where necessary runtime PM gets should happen, if
> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
> with power state of device controlled by driver B (ARM SMMU).
 
This whole thing is confused by the fact that on MSM the GPU and the GPU IOMMU
share some of the same clocks and power rail so turning on the GPU also
turned on the IOMMU register banks by extension.

But if we put that aside the question is who should be responsible for
controlling the power in this relationship and there are several good reasons to
leave it up to the client device. The most important reason is when we move to
the per-instance model where the GPU self-programmings the SMMU registers. In
that case, the driver will need to make sure that the SMMU is powered up before
submitting the command and then removing the power vote when the commands
are done to save energy.

Additionally, there might be legitimate reasons in the driver to batch
operations - you may wish to attach the device and then map several global
buffers immediately - having driver side control prevents several unneeded power
transitions.

Perhaps the right answer is to do both - allow for the driver to enable the
supplier but also do the right power operations at the appropriately places in
the IOMMU driver.

> This is also important for the reasons I stated in my comments to
> "[PATCH v7 1/6] base: power: runtime: Export
> pm_runtime_get/put_suppliers". Quoting for everyone's convenience:
> 
> >> There are however cases in which the consumer wants to power-on
> >> the supplier, but not itself.
> >> E.g., A Graphics or multimedia driver wants to power-on the SMMU
> >> to unmap a buffer and finish the TLB operations without powering
> >> on itself.
> >
> >This sounds strange to me. If the SMMU is powered down, wouldn't the
> >TLB lose its contents as well (and so no flushing needed)?
> >

> >Other than that, what kind of hardware operations would be needed
> >besides just updating the page tables from the CPU?
> >

> In other words, the SMMU driver can deal with hardware state based on
> return value of pm_runtime_get_sync() or pm_runtime_get_if_in_use()
> and decide whether some operations are necessary or not, e.g.
> - a state restore is necessary if the domain was powered off, but we
> are bringing the master on,
> - a flush may not be required when (un)mapping with the domain powered off,
> - etc.

I agree that there is probably some advanced logic that we can do to
conclusively figure out the state of the hardware and improve the behavior.
I would love to see the SMMU driver get smarter but for the moment we can't
trust it and so we need to force the behavior from the GPU driver. The current
code works for a5x and earlier but on sdm845 we can (no longer) treat the GPU
and the SMMU as the same device for power purposes so we need this code. If at
some point in the future we can start to selectively remove the supplier calls
I wouldn't mind one bit.

Jordan
Rob Clark Feb. 13, 2018, 6:03 p.m. UTC | #3
On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
> Hi Vivek,
>
> Thanks for the patch. Please see my comments inline.
>
> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
> <vivek.gautam@codeaurora.org> wrote:
>> While handling the concerned iommu, there should not be a
>> need to power control the drm devices from iommu interface.
>> If these drm devices need to be powered around this time,
>> the respective drivers should take care of this.
>>
>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>> the connected iommu through the device link interface.
>> In case the device link is not setup these get/put_suppliers()
>> calls will be a no-op, and the iommu driver should take care of
>> powering on its devices accordingly.
>>
>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>> ---
>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>> index b23d33622f37..1ab629bbee69 100644
>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>         int ret;
>>
>> -       pm_runtime_get_sync(mmu->dev);
>> +       pm_runtime_get_suppliers(mmu->dev);
>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>> -       pm_runtime_put_sync(mmu->dev);
>> +       pm_runtime_put_suppliers(mmu->dev);
>
> For me, it looks like a wrong place to handle runtime PM of IOMMU
> here. iommu_attach_device() calls into IOMMU driver's attach_device()
> callback and that's where necessary runtime PM gets should happen, if
> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
> with power state of device controlled by driver B (ARM SMMU).

Note that we end up having to do the same, because of iommu_unmap()
while DRM driver is powered off..  it might be cleaner if it was all
self contained in the iommu driver, but that would make it so other
drivers couldn't call iommu_unmap() from an irq handler, which is
apparently something that some of them want to do..

So I'm happy with the pm_runtime_get/put_suppliers() approach as a
reasonable compromise.

(Perhaps specifically, attach/detach this could move inside the iommu
driver, but we still need to get/put_suppliers() for unmap(), so meh)

BR,
-R

> This is also important for the reasons I stated in my comments to
> "[PATCH v7 1/6] base: power: runtime: Export
> pm_runtime_get/put_suppliers". Quoting for everyone's convenience:
>
>>> There are however cases in which the consumer wants to power-on
>>> the supplier, but not itself.
>>> E.g., A Graphics or multimedia driver wants to power-on the SMMU
>>> to unmap a buffer and finish the TLB operations without powering
>>> on itself.
>>
>>This sounds strange to me. If the SMMU is powered down, wouldn't the
>>TLB lose its contents as well (and so no flushing needed)?
>>
>>Other than that, what kind of hardware operations would be needed
>>besides just updating the page tables from the CPU?
>>
>
> In other words, the SMMU driver can deal with hardware state based on
> return value of pm_runtime_get_sync() or pm_runtime_get_if_in_use()
> and decide whether some operations are necessary or not, e.g.
> - a state restore is necessary if the domain was powered off, but we
> are bringing the master on,
> - a flush may not be required when (un)mapping with the domain powered off,
> - etc.
>
> Best regards,
> Tomasz
Tomasz Figa Feb. 14, 2018, 1:59 a.m. UTC | #4
On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>> Hi Vivek,
>>
>> Thanks for the patch. Please see my comments inline.
>>
>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>> <vivek.gautam@codeaurora.org> wrote:
>>> While handling the concerned iommu, there should not be a
>>> need to power control the drm devices from iommu interface.
>>> If these drm devices need to be powered around this time,
>>> the respective drivers should take care of this.
>>>
>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>> the connected iommu through the device link interface.
>>> In case the device link is not setup these get/put_suppliers()
>>> calls will be a no-op, and the iommu driver should take care of
>>> powering on its devices accordingly.
>>>
>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>> ---
>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>> index b23d33622f37..1ab629bbee69 100644
>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>         int ret;
>>>
>>> -       pm_runtime_get_sync(mmu->dev);
>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>> -       pm_runtime_put_sync(mmu->dev);
>>> +       pm_runtime_put_suppliers(mmu->dev);
>>
>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>> callback and that's where necessary runtime PM gets should happen, if
>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>> with power state of device controlled by driver B (ARM SMMU).
>
> Note that we end up having to do the same, because of iommu_unmap()
> while DRM driver is powered off..  it might be cleaner if it was all
> self contained in the iommu driver, but that would make it so other
> drivers couldn't call iommu_unmap() from an irq handler, which is
> apparently something that some of them want to do..

I'd assume that runtime PM status is already guaranteed to be active
when the IRQ handler is running, by some other means (e.g.
pm_runtime_get_sync() called earlier, when queuing some work to the
hardware). Otherwise, I'm not sure how a powered down device could
trigger an IRQ.

So, if the master device power is already on, suppliers should be
powered on as well, thanks to device links.

Best regards,
Tomasz
Rob Clark Feb. 14, 2018, 2:13 a.m. UTC | #5
On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>> Hi Vivek,
>>>
>>> Thanks for the patch. Please see my comments inline.
>>>
>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>> <vivek.gautam@codeaurora.org> wrote:
>>>> While handling the concerned iommu, there should not be a
>>>> need to power control the drm devices from iommu interface.
>>>> If these drm devices need to be powered around this time,
>>>> the respective drivers should take care of this.
>>>>
>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>> the connected iommu through the device link interface.
>>>> In case the device link is not setup these get/put_suppliers()
>>>> calls will be a no-op, and the iommu driver should take care of
>>>> powering on its devices accordingly.
>>>>
>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>> ---
>>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>> index b23d33622f37..1ab629bbee69 100644
>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>         int ret;
>>>>
>>>> -       pm_runtime_get_sync(mmu->dev);
>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>> -       pm_runtime_put_sync(mmu->dev);
>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>
>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>> callback and that's where necessary runtime PM gets should happen, if
>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>> with power state of device controlled by driver B (ARM SMMU).
>>
>> Note that we end up having to do the same, because of iommu_unmap()
>> while DRM driver is powered off..  it might be cleaner if it was all
>> self contained in the iommu driver, but that would make it so other
>> drivers couldn't call iommu_unmap() from an irq handler, which is
>> apparently something that some of them want to do..
>
> I'd assume that runtime PM status is already guaranteed to be active
> when the IRQ handler is running, by some other means (e.g.
> pm_runtime_get_sync() called earlier, when queuing some work to the
> hardware). Otherwise, I'm not sure how a powered down device could
> trigger an IRQ.
>
> So, if the master device power is already on, suppliers should be
> powered on as well, thanks to device links.
>

umm, that is kindof the inverse of the problem..  the problem is
things like gpu driver (and v4l2 drivers that import dma-buf's,
afaict).. they will potentially call iommu->unmap() when device is not
active (due to userspace or things beyond the control of the driver)..
so *they* would want iommu to do pm get/put calls.  But other drivers
trying to unmap from irq ctx would not.  Which is the contradictory
requirement that lead to the idea of iommu user powering up iommu for
unmap.

There has already been some discussion about this on various earlier
permutations of this patchset.  I think we have exhausted all other
options.

BR,
-R
Tomasz Figa Feb. 14, 2018, 3:01 a.m. UTC | #6
On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com> wrote:
> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>> Hi Vivek,
>>>>
>>>> Thanks for the patch. Please see my comments inline.
>>>>
>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>> While handling the concerned iommu, there should not be a
>>>>> need to power control the drm devices from iommu interface.
>>>>> If these drm devices need to be powered around this time,
>>>>> the respective drivers should take care of this.
>>>>>
>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>> the connected iommu through the device link interface.
>>>>> In case the device link is not setup these get/put_suppliers()
>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>> powering on its devices accordingly.
>>>>>
>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>> ---
>>>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>         int ret;
>>>>>
>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>
>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>>> callback and that's where necessary runtime PM gets should happen, if
>>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>>> with power state of device controlled by driver B (ARM SMMU).
>>>
>>> Note that we end up having to do the same, because of iommu_unmap()
>>> while DRM driver is powered off..  it might be cleaner if it was all
>>> self contained in the iommu driver, but that would make it so other
>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>> apparently something that some of them want to do..
>>
>> I'd assume that runtime PM status is already guaranteed to be active
>> when the IRQ handler is running, by some other means (e.g.
>> pm_runtime_get_sync() called earlier, when queuing some work to the
>> hardware). Otherwise, I'm not sure how a powered down device could
>> trigger an IRQ.
>>
>> So, if the master device power is already on, suppliers should be
>> powered on as well, thanks to device links.
>>
>
> umm, that is kindof the inverse of the problem..  the problem is
> things like gpu driver (and v4l2 drivers that import dma-buf's,
> afaict).. they will potentially call iommu->unmap() when device is not
> active (due to userspace or things beyond the control of the driver)..
> so *they* would want iommu to do pm get/put calls.

Which is fine and which is actually already done by one of the patches
in this series, not for map/unmap, but probe, add_device,
remove_device. Having parts of the API doing it inside the callback
and other parts outside sounds at least inconsistent.

> But other drivers
> trying to unmap from irq ctx would not.  Which is the contradictory
> requirement that lead to the idea of iommu user powering up iommu for
> unmap.

Sorry, maybe I wasn't clear. My last message was supposed to show that
it's not contradictory at all, because "other drivers trying to unmap
from irq ctx" would already have called pm_runtime_get_*() earlier
from a non-irq ctx, which would have also done the same on all the
linked suppliers, including the IOMMU. The ultimate result would be
that the map/unmap() of the IOMMU driver calling pm_runtime_get_sync()
would do nothing besides incrementing the reference count.

>
> There has already been some discussion about this on various earlier
> permutations of this patchset.  I think we have exhausted all other
> options.

I guess I should have read those. Let me do that now.

Best regards,
Tomasz
Tomasz Figa Feb. 14, 2018, 3:31 a.m. UTC | #7
Hi Jordan,

On Wed, Feb 14, 2018 at 1:42 AM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> On Tue, Feb 13, 2018 at 06:10:38PM +0900, Tomasz Figa wrote:
>> Hi Vivek,
>>
>> Thanks for the patch. Please see my comments inline.
>>
>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>> <vivek.gautam@codeaurora.org> wrote:
>> > While handling the concerned iommu, there should not be a
>> > need to power control the drm devices from iommu interface.
>> > If these drm devices need to be powered around this time,
>> > the respective drivers should take care of this.
>> >
>> > Replace the pm_runtime_get/put_sync(<drm_device>) with
>> > pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>> > the connected iommu through the device link interface.
>> > In case the device link is not setup these get/put_suppliers()
>> > calls will be a no-op, and the iommu driver should take care of
>> > powering on its devices accordingly.
>> >
>> > Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>> > ---
>> >  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>> >  1 file changed, 8 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>> > index b23d33622f37..1ab629bbee69 100644
>> > --- a/drivers/gpu/drm/msm/msm_iommu.c
>> > +++ b/drivers/gpu/drm/msm/msm_iommu.c
>> > @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>> >         struct msm_iommu *iommu = to_msm_iommu(mmu);
>> >         int ret;
>> >
>> > -       pm_runtime_get_sync(mmu->dev);
>> > +       pm_runtime_get_suppliers(mmu->dev);
>> >         ret = iommu_attach_device(iommu->domain, mmu->dev);
>> > -       pm_runtime_put_sync(mmu->dev);
>> > +       pm_runtime_put_suppliers(mmu->dev);
>>
>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>> callback and that's where necessary runtime PM gets should happen, if
>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>> with power state of device controlled by driver B (ARM SMMU).
>
> This whole thing is confused by the fact that on MSM the GPU and the GPU IOMMU
> share some of the same clocks and power rail so turning on the GPU also
> turned on the IOMMU register banks by extension.

This is surprisingly not a very surprising case. Exactly the same can
be seen on Rockchip SoCs and we're solving the problem using the
solution I suggested. In fact, my suggestions to this thread are based
on the design we chose for Rockchip, due to the high level of
similarity (+/- the GPU directly programming IOMMU registers, which is
not present there, but AFAICT it doesn't pose a problem here).

>
> But if we put that aside the question is who should be responsible for
> controlling the power in this relationship and there are several good reasons to
> leave it up to the client device. The most important reason is when we move to
> the per-instance model where the GPU self-programmings the SMMU registers. In
> that case, the driver will need to make sure that the SMMU is powered up before
> submitting the command and then removing the power vote when the commands
> are done to save energy.

I might need more insight on what's going on in your hardware, but
with my current understanding I'd argue that that is not right,
because:

- When submitting commands to the GPU, the GPU driver will
pm_runtime_get_sync() on the GPU device, which will automatically do
the same on all the linked suppliers, which would also include the
SMMU itself. The role of device links here is exactly that the GPU
driver doesn't have to care which other devices need to be brought up.

- When the GPU is operating, the SMMU power must be supplied anyway,
because it needs to be doing the translations, right? Note that by
"power" I really mean the physical power supply in the SoC, e.g. as
for a power domain. The runtime PM API in its current form (e.g.
binary off or on operation) is unsuitable for managing other things,
such as clocks (and there is ongoing work on improving it, e.g. by
adding support for multiple power states).

   ^^ The above would be actually guaranteed by your hardware design,
where SMMU and GPU share the power domain and clocks. (We used to rely
on this in old downstream implementation of Rockchip IOMMU and master
drivers in Chromium OS kernel, before we moved to handling the clocks
explicitly in the IOMMU driver and properly using device links to
manage the power domain and state restoration.)

>
> Additionally, there might be legitimate reasons in the driver to batch
> operations - you may wish to attach the device and then map several global
> buffers immediately - having driver side control prevents several unneeded power
> transitions.

As I mentioned before, these operations wouldn't normally need any
power transitions, since mapping with the TLB powered down boils down
to just updating the page tables in memory. However, as Robin
mentioned before, there might be some hardware factors, such as TLB
being powered separately (or retaining contents in some other way),
where this wouldn't be ensured indeed.

Still, that's where runtime PM autosuspend feature (i.e. delayed
suspend) comes to the rescue, with the advantage of handling the cases
when the master driver receives map/unmap requests not batched (but
maybe a slight drawback in terms of the suspend not happening
instantly and losing some power, but it's about power domains, so
mainly leakage current, isn't it?)

>
> Perhaps the right answer is to do both - allow for the driver to enable the
> supplier but also do the right power operations at the appropriately places in
> the IOMMU driver.
>
>> This is also important for the reasons I stated in my comments to
>> "[PATCH v7 1/6] base: power: runtime: Export
>> pm_runtime_get/put_suppliers". Quoting for everyone's convenience:
>>
>> >> There are however cases in which the consumer wants to power-on
>> >> the supplier, but not itself.
>> >> E.g., A Graphics or multimedia driver wants to power-on the SMMU
>> >> to unmap a buffer and finish the TLB operations without powering
>> >> on itself.
>> >
>> >This sounds strange to me. If the SMMU is powered down, wouldn't the
>> >TLB lose its contents as well (and so no flushing needed)?
>> >
>
>> >Other than that, what kind of hardware operations would be needed
>> >besides just updating the page tables from the CPU?
>> >
>
>> In other words, the SMMU driver can deal with hardware state based on
>> return value of pm_runtime_get_sync() or pm_runtime_get_if_in_use()
>> and decide whether some operations are necessary or not, e.g.
>> - a state restore is necessary if the domain was powered off, but we
>> are bringing the master on,
>> - a flush may not be required when (un)mapping with the domain powered off,
>> - etc.
>
> I agree that there is probably some advanced logic that we can do to
> conclusively figure out the state of the hardware and improve the behavior.
> I would love to see the SMMU driver get smarter but for the moment we can't
> trust it and so we need to force the behavior from the GPU driver. The current
> code works for a5x and earlier but on sdm845 we can (no longer) treat the GPU
> and the SMMU as the same device for power purposes so we need this code.

Hmm, you've lost me there. Above you  mention that "on MSM the GPU and
the GPU IOMMU share some of the same clocks and power rail". Is this
no longer the case for sdm845? If so, would you mind shedding a bit
more light on how this looks there?

Best regards,
Tomasz
Vivek Gautam Feb. 14, 2018, 4:17 a.m. UTC | #8
Hi Tomasz,

On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com> wrote:
>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>> Hi Vivek,
>>>>>
>>>>> Thanks for the patch. Please see my comments inline.
>>>>>
>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>> While handling the concerned iommu, there should not be a
>>>>>> need to power control the drm devices from iommu interface.
>>>>>> If these drm devices need to be powered around this time,
>>>>>> the respective drivers should take care of this.
>>>>>>
>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>> the connected iommu through the device link interface.
>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>> powering on its devices accordingly.
>>>>>>
>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>> ---
>>>>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>         int ret;
>>>>>>
>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>
>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>>>> callback and that's where necessary runtime PM gets should happen, if
>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>
>>>> Note that we end up having to do the same, because of iommu_unmap()
>>>> while DRM driver is powered off..  it might be cleaner if it was all
>>>> self contained in the iommu driver, but that would make it so other
>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>> apparently something that some of them want to do..
>>>
>>> I'd assume that runtime PM status is already guaranteed to be active
>>> when the IRQ handler is running, by some other means (e.g.
>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>> hardware). Otherwise, I'm not sure how a powered down device could
>>> trigger an IRQ.
>>>
>>> So, if the master device power is already on, suppliers should be
>>> powered on as well, thanks to device links.
>>>
>>
>> umm, that is kindof the inverse of the problem..  the problem is
>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>> afaict).. they will potentially call iommu->unmap() when device is not
>> active (due to userspace or things beyond the control of the driver)..
>> so *they* would want iommu to do pm get/put calls.
>
> Which is fine and which is actually already done by one of the patches
> in this series, not for map/unmap, but probe, add_device,
> remove_device. Having parts of the API doing it inside the callback
> and other parts outside sounds at least inconsistent.
>
>> But other drivers
>> trying to unmap from irq ctx would not.  Which is the contradictory
>> requirement that lead to the idea of iommu user powering up iommu for
>> unmap.
>
> Sorry, maybe I wasn't clear. My last message was supposed to show that
> it's not contradictory at all, because "other drivers trying to unmap
> from irq ctx" would already have called pm_runtime_get_*() earlier
> from a non-irq ctx, which would have also done the same on all the
> linked suppliers, including the IOMMU. The ultimate result would be
> that the map/unmap() of the IOMMU driver calling pm_runtime_get_sync()
> would do nothing besides incrementing the reference count.

The entire point was to avoid the slowpath that pm_runtime_get/put_sync()
would add in map/unmap. It would not be correct to add a slowpath in irq_ctx
for taking care of non-irq_ctx and for the situations where master is already
powered-off.

>
>>
>> There has already been some discussion about this on various earlier
>> permutations of this patchset.  I think we have exhausted all other
>> options.
>
> I guess I should have read those. Let me do that now.
Yea, i point to the thread in cover letter and [PATCH 1/6].
Thanks.

regards
Vivek

>
> Best regards,
> Tomasz
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tomasz Figa Feb. 14, 2018, 5:38 a.m. UTC | #9
On Wed, Feb 14, 2018 at 1:17 PM, Vivek Gautam
<vivek.gautam@codeaurora.org> wrote:
> Hi Tomasz,
>
> On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com> wrote:
>>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>> Hi Vivek,
>>>>>>
>>>>>> Thanks for the patch. Please see my comments inline.
>>>>>>
>>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>> While handling the concerned iommu, there should not be a
>>>>>>> need to power control the drm devices from iommu interface.
>>>>>>> If these drm devices need to be powered around this time,
>>>>>>> the respective drivers should take care of this.
>>>>>>>
>>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>>> the connected iommu through the device link interface.
>>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>>> powering on its devices accordingly.
>>>>>>>
>>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>>         int ret;
>>>>>>>
>>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>>
>>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>>>>> callback and that's where necessary runtime PM gets should happen, if
>>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>>
>>>>> Note that we end up having to do the same, because of iommu_unmap()
>>>>> while DRM driver is powered off..  it might be cleaner if it was all
>>>>> self contained in the iommu driver, but that would make it so other
>>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>>> apparently something that some of them want to do..
>>>>
>>>> I'd assume that runtime PM status is already guaranteed to be active
>>>> when the IRQ handler is running, by some other means (e.g.
>>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>>> hardware). Otherwise, I'm not sure how a powered down device could
>>>> trigger an IRQ.
>>>>
>>>> So, if the master device power is already on, suppliers should be
>>>> powered on as well, thanks to device links.
>>>>
>>>
>>> umm, that is kindof the inverse of the problem..  the problem is
>>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>>> afaict).. they will potentially call iommu->unmap() when device is not
>>> active (due to userspace or things beyond the control of the driver)..
>>> so *they* would want iommu to do pm get/put calls.
>>
>> Which is fine and which is actually already done by one of the patches
>> in this series, not for map/unmap, but probe, add_device,
>> remove_device. Having parts of the API doing it inside the callback
>> and other parts outside sounds at least inconsistent.
>>
>>> But other drivers
>>> trying to unmap from irq ctx would not.  Which is the contradictory
>>> requirement that lead to the idea of iommu user powering up iommu for
>>> unmap.
>>
>> Sorry, maybe I wasn't clear. My last message was supposed to show that
>> it's not contradictory at all, because "other drivers trying to unmap
>> from irq ctx" would already have called pm_runtime_get_*() earlier
>> from a non-irq ctx, which would have also done the same on all the
>> linked suppliers, including the IOMMU. The ultimate result would be
>> that the map/unmap() of the IOMMU driver calling pm_runtime_get_sync()
>> would do nothing besides incrementing the reference count.
>
> The entire point was to avoid the slowpath that pm_runtime_get/put_sync()
> would add in map/unmap. It would not be correct to add a slowpath in irq_ctx
> for taking care of non-irq_ctx and for the situations where master is already
> powered-off.

Correct me if I'm wrong, but I believe that with what I'm proposing
there wouldn't be any slow path.

a) For IRQ context, the master is already powered on and so the SMMU
is also powered on, through respective device link.
pm_runtime_get_sync() would ultimately just increment the runtime PM
usage count.

b) For a case when the master is already powered off (which wouldn't
be IRQ context, for the reason stated in a)), powering on the SMMU is
unavoidable, if the SMMU hardware really needs to be accessed (i.e.
some TLBs need to be invalidated, if their state is preserved despite
master being powered down).

>
>>
>>>
>>> There has already been some discussion about this on various earlier
>>> permutations of this patchset.  I think we have exhausted all other
>>> options.
>>
>> I guess I should have read those. Let me do that now.
> Yea, i point to the thread in cover letter and [PATCH 1/6].
> Thanks.

I read through all the links in the cover letter and I could see other
attempts not working out indeed, but they were different from what I'm
proposing.

There was also a point raised that __pm_runtime_resume() called from
pm_runtime_get_sync() would grab dev->power_lock spinlock, which is
true, except that if the device is already active, it would do it only
for the time of checking device state, so I doubt it would really be a
significant point of contention.

Best regards,
Tomasz
Vivek Gautam Feb. 14, 2018, 9:13 a.m. UTC | #10
Hi Tomasz,

On Wed, Feb 14, 2018 at 11:08 AM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Wed, Feb 14, 2018 at 1:17 PM, Vivek Gautam
> <vivek.gautam@codeaurora.org> wrote:
>> Hi Tomasz,
>>
>> On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>>> Hi Vivek,
>>>>>>>
>>>>>>> Thanks for the patch. Please see my comments inline.
>>>>>>>
>>>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>>> While handling the concerned iommu, there should not be a
>>>>>>>> need to power control the drm devices from iommu interface.
>>>>>>>> If these drm devices need to be powered around this time,
>>>>>>>> the respective drivers should take care of this.
>>>>>>>>
>>>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>>>> the connected iommu through the device link interface.
>>>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>>>> powering on its devices accordingly.
>>>>>>>>
>>>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>>>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>>>         int ret;
>>>>>>>>
>>>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>>>
>>>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>>>>>> callback and that's where necessary runtime PM gets should happen, if
>>>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>>>
>>>>>> Note that we end up having to do the same, because of iommu_unmap()
>>>>>> while DRM driver is powered off..  it might be cleaner if it was all
>>>>>> self contained in the iommu driver, but that would make it so other
>>>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>>>> apparently something that some of them want to do..
>>>>>
>>>>> I'd assume that runtime PM status is already guaranteed to be active
>>>>> when the IRQ handler is running, by some other means (e.g.
>>>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>>>> hardware). Otherwise, I'm not sure how a powered down device could
>>>>> trigger an IRQ.
>>>>>
>>>>> So, if the master device power is already on, suppliers should be
>>>>> powered on as well, thanks to device links.
>>>>>
>>>>
>>>> umm, that is kindof the inverse of the problem..  the problem is
>>>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>>>> afaict).. they will potentially call iommu->unmap() when device is not
>>>> active (due to userspace or things beyond the control of the driver)..
>>>> so *they* would want iommu to do pm get/put calls.
>>>
>>> Which is fine and which is actually already done by one of the patches
>>> in this series, not for map/unmap, but probe, add_device,
>>> remove_device. Having parts of the API doing it inside the callback
>>> and other parts outside sounds at least inconsistent.
>>>
>>>> But other drivers
>>>> trying to unmap from irq ctx would not.  Which is the contradictory
>>>> requirement that lead to the idea of iommu user powering up iommu for
>>>> unmap.
>>>
>>> Sorry, maybe I wasn't clear. My last message was supposed to show that
>>> it's not contradictory at all, because "other drivers trying to unmap
>>> from irq ctx" would already have called pm_runtime_get_*() earlier
>>> from a non-irq ctx, which would have also done the same on all the
>>> linked suppliers, including the IOMMU. The ultimate result would be
>>> that the map/unmap() of the IOMMU driver calling pm_runtime_get_sync()
>>> would do nothing besides incrementing the reference count.
>>
>> The entire point was to avoid the slowpath that pm_runtime_get/put_sync()
>> would add in map/unmap. It would not be correct to add a slowpath in irq_ctx
>> for taking care of non-irq_ctx and for the situations where master is already
>> powered-off.
>
> Correct me if I'm wrong, but I believe that with what I'm proposing
> there wouldn't be any slow path.

Yea, but only when the power domain is irq-safe? And not all platforms
enable irq-safe power domains. For instance, msm doesn't enable its
gdsc power domains as irq-safe.
Is it something i am missing?

>
> a) For IRQ context, the master is already powered on and so the SMMU
> is also powered on, through respective device link.
> pm_runtime_get_sync() would ultimately just increment the runtime PM
> usage count.
>
> b) For a case when the master is already powered off (which wouldn't
> be IRQ context, for the reason stated in a)), powering on the SMMU is
> unavoidable, if the SMMU hardware really needs to be accessed (i.e.
> some TLBs need to be invalidated, if their state is preserved despite
> master being powered down).
>
>>
>>>
>>>>
>>>> There has already been some discussion about this on various earlier
>>>> permutations of this patchset.  I think we have exhausted all other
>>>> options.
>>>
>>> I guess I should have read those. Let me do that now.
>> Yea, i point to the thread in cover letter and [PATCH 1/6].
>> Thanks.
>
> I read through all the links in the cover letter and I could see other
> attempts not working out indeed, but they were different from what I'm
> proposing.
>
> There was also a point raised that __pm_runtime_resume() called from
> pm_runtime_get_sync() would grab dev->power_lock spinlock, which is
> true, except that if the device is already active, it would do it only
> for the time of checking device state, so I doubt it would really be a
> significant point of contention.
>
> Best regards,
> Tomasz
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tomasz Figa Feb. 14, 2018, 9:16 a.m. UTC | #11
On Wed, Feb 14, 2018 at 6:13 PM, Vivek Gautam
<vivek.gautam@codeaurora.org> wrote:
> Hi Tomasz,
>
> On Wed, Feb 14, 2018 at 11:08 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>> On Wed, Feb 14, 2018 at 1:17 PM, Vivek Gautam
>> <vivek.gautam@codeaurora.org> wrote:
>>> Hi Tomasz,
>>>
>>> On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>>>> Hi Vivek,
>>>>>>>>
>>>>>>>> Thanks for the patch. Please see my comments inline.
>>>>>>>>
>>>>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>>>> While handling the concerned iommu, there should not be a
>>>>>>>>> need to power control the drm devices from iommu interface.
>>>>>>>>> If these drm devices need to be powered around this time,
>>>>>>>>> the respective drivers should take care of this.
>>>>>>>>>
>>>>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>>>>> the connected iommu through the device link interface.
>>>>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>>>>> powering on its devices accordingly.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>>>>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>>>>         int ret;
>>>>>>>>>
>>>>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>>>>
>>>>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>>>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>>>>>>> callback and that's where necessary runtime PM gets should happen, if
>>>>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>>>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>>>>
>>>>>>> Note that we end up having to do the same, because of iommu_unmap()
>>>>>>> while DRM driver is powered off..  it might be cleaner if it was all
>>>>>>> self contained in the iommu driver, but that would make it so other
>>>>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>>>>> apparently something that some of them want to do..
>>>>>>
>>>>>> I'd assume that runtime PM status is already guaranteed to be active
>>>>>> when the IRQ handler is running, by some other means (e.g.
>>>>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>>>>> hardware). Otherwise, I'm not sure how a powered down device could
>>>>>> trigger an IRQ.
>>>>>>
>>>>>> So, if the master device power is already on, suppliers should be
>>>>>> powered on as well, thanks to device links.
>>>>>>
>>>>>
>>>>> umm, that is kindof the inverse of the problem..  the problem is
>>>>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>>>>> afaict).. they will potentially call iommu->unmap() when device is not
>>>>> active (due to userspace or things beyond the control of the driver)..
>>>>> so *they* would want iommu to do pm get/put calls.
>>>>
>>>> Which is fine and which is actually already done by one of the patches
>>>> in this series, not for map/unmap, but probe, add_device,
>>>> remove_device. Having parts of the API doing it inside the callback
>>>> and other parts outside sounds at least inconsistent.
>>>>
>>>>> But other drivers
>>>>> trying to unmap from irq ctx would not.  Which is the contradictory
>>>>> requirement that lead to the idea of iommu user powering up iommu for
>>>>> unmap.
>>>>
>>>> Sorry, maybe I wasn't clear. My last message was supposed to show that
>>>> it's not contradictory at all, because "other drivers trying to unmap
>>>> from irq ctx" would already have called pm_runtime_get_*() earlier
>>>> from a non-irq ctx, which would have also done the same on all the
>>>> linked suppliers, including the IOMMU. The ultimate result would be
>>>> that the map/unmap() of the IOMMU driver calling pm_runtime_get_sync()
>>>> would do nothing besides incrementing the reference count.
>>>
>>> The entire point was to avoid the slowpath that pm_runtime_get/put_sync()
>>> would add in map/unmap. It would not be correct to add a slowpath in irq_ctx
>>> for taking care of non-irq_ctx and for the situations where master is already
>>> powered-off.
>>
>> Correct me if I'm wrong, but I believe that with what I'm proposing
>> there wouldn't be any slow path.
>
> Yea, but only when the power domain is irq-safe? And not all platforms
> enable irq-safe power domains. For instance, msm doesn't enable its
> gdsc power domains as irq-safe.
> Is it something i am missing?

irq-safe would matter if there would exist a case when the call is
done from IRQ context and the power is off. As I explained in a), it
shouldn't happen.

Best regards,
Tomasz

>
>>
>> a) For IRQ context, the master is already powered on and so the SMMU
>> is also powered on, through respective device link.
>> pm_runtime_get_sync() would ultimately just increment the runtime PM
>> usage count.
>>
>> b) For a case when the master is already powered off (which wouldn't
>> be IRQ context, for the reason stated in a)), powering on the SMMU is
>> unavoidable, if the SMMU hardware really needs to be accessed (i.e.
>> some TLBs need to be invalidated, if their state is preserved despite
>> master being powered down).
>>
>>>
>>>>
>>>>>
>>>>> There has already been some discussion about this on various earlier
>>>>> permutations of this patchset.  I think we have exhausted all other
>>>>> options.
>>>>
>>>> I guess I should have read those. Let me do that now.
>>> Yea, i point to the thread in cover letter and [PATCH 1/6].
>>> Thanks.
>>
>> I read through all the links in the cover letter and I could see other
>> attempts not working out indeed, but they were different from what I'm
>> proposing.
>>
>> There was also a point raised that __pm_runtime_resume() called from
>> pm_runtime_get_sync() would grab dev->power_lock spinlock, which is
>> true, except that if the device is already active, it would do it only
>> for the time of checking device state, so I doubt it would really be a
>> significant point of contention.
>>
>> Best regards,
>> Tomasz
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation
Vivek Gautam Feb. 14, 2018, 10:33 a.m. UTC | #12
On Wed, Feb 14, 2018 at 2:46 PM, Tomasz Figa <tfiga@chromium.org> wrote:

Adding Jordan to this thread as well.

> On Wed, Feb 14, 2018 at 6:13 PM, Vivek Gautam
> <vivek.gautam@codeaurora.org> wrote:
>> Hi Tomasz,
>>
>> On Wed, Feb 14, 2018 at 11:08 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>> On Wed, Feb 14, 2018 at 1:17 PM, Vivek Gautam
>>> <vivek.gautam@codeaurora.org> wrote:
>>>> Hi Tomasz,
>>>>
>>>> On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>>>>> Hi Vivek,
>>>>>>>>>
>>>>>>>>> Thanks for the patch. Please see my comments inline.
>>>>>>>>>
>>>>>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>>>>> While handling the concerned iommu, there should not be a
>>>>>>>>>> need to power control the drm devices from iommu interface.
>>>>>>>>>> If these drm devices need to be powered around this time,
>>>>>>>>>> the respective drivers should take care of this.
>>>>>>>>>>
>>>>>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>>>>>> the connected iommu through the device link interface.
>>>>>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>>>>>> powering on its devices accordingly.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>>>>>  1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>>>>>>>         struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>>>>>         int ret;
>>>>>>>>>>
>>>>>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>>>>>         ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>>>>>
>>>>>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>>>>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>>>>>>>> callback and that's where necessary runtime PM gets should happen, if
>>>>>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>>>>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>>>>>
>>>>>>>> Note that we end up having to do the same, because of iommu_unmap()
>>>>>>>> while DRM driver is powered off..  it might be cleaner if it was all
>>>>>>>> self contained in the iommu driver, but that would make it so other
>>>>>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>>>>>> apparently something that some of them want to do..
>>>>>>>
>>>>>>> I'd assume that runtime PM status is already guaranteed to be active
>>>>>>> when the IRQ handler is running, by some other means (e.g.
>>>>>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>>>>>> hardware). Otherwise, I'm not sure how a powered down device could
>>>>>>> trigger an IRQ.
>>>>>>>
>>>>>>> So, if the master device power is already on, suppliers should be
>>>>>>> powered on as well, thanks to device links.
>>>>>>>
>>>>>>
>>>>>> umm, that is kindof the inverse of the problem..  the problem is
>>>>>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>>>>>> afaict).. they will potentially call iommu->unmap() when device is not
>>>>>> active (due to userspace or things beyond the control of the driver)..
>>>>>> so *they* would want iommu to do pm get/put calls.
>>>>>
>>>>> Which is fine and which is actually already done by one of the patches
>>>>> in this series, not for map/unmap, but probe, add_device,
>>>>> remove_device. Having parts of the API doing it inside the callback
>>>>> and other parts outside sounds at least inconsistent.
>>>>>
>>>>>> But other drivers
>>>>>> trying to unmap from irq ctx would not.  Which is the contradictory
>>>>>> requirement that lead to the idea of iommu user powering up iommu for
>>>>>> unmap.
>>>>>
>>>>> Sorry, maybe I wasn't clear. My last message was supposed to show that
>>>>> it's not contradictory at all, because "other drivers trying to unmap
>>>>> from irq ctx" would already have called pm_runtime_get_*() earlier
>>>>> from a non-irq ctx, which would have also done the same on all the
>>>>> linked suppliers, including the IOMMU. The ultimate result would be
>>>>> that the map/unmap() of the IOMMU driver calling pm_runtime_get_sync()
>>>>> would do nothing besides incrementing the reference count.
>>>>
>>>> The entire point was to avoid the slowpath that pm_runtime_get/put_sync()
>>>> would add in map/unmap. It would not be correct to add a slowpath in irq_ctx
>>>> for taking care of non-irq_ctx and for the situations where master is already
>>>> powered-off.
>>>
>>> Correct me if I'm wrong, but I believe that with what I'm proposing
>>> there wouldn't be any slow path.
>>
>> Yea, but only when the power domain is irq-safe? And not all platforms
>> enable irq-safe power domains. For instance, msm doesn't enable its
>> gdsc power domains as irq-safe.
>> Is it something i am missing?
>
> irq-safe would matter if there would exist a case when the call is
> done from IRQ context and the power is off. As I explained in a), it
> shouldn't happen.

Hi Robin, Will

Does adding pm_runtime_get() in map/unmap sounds good to you?

Quoting Tomasz once again here:
>>> a) For IRQ context, the master is already powered on and so the SMMU
>>> is also powered on, through respective device link.
>>> pm_runtime_get_sync() would ultimately just increment the runtime PM
>>> usage count.
>>>
>>> b) For a case when the master is already powered off (which wouldn't
>>> be IRQ context, for the reason stated in a)), powering on the SMMU is
>>> unavoidable, if the SMMU hardware really needs to be accessed (i.e.
>>> some TLBs need to be invalidated, if their state is preserved despite
>>> master being powered down).

>>> There was also a point raised that __pm_runtime_resume() called from
>>> pm_runtime_get_sync() would grab dev->power_lock spinlock, which is
>>> true, except that if the device is already active, it would do it only
>>> for the time of checking device state, so I doubt it would really be a
>>> significant point of contention.

Regards
Vivek

>
> Best regards,
> Tomasz
>
>>
>>>
>>> a) For IRQ context, the master is already powered on and so the SMMU
>>> is also powered on, through respective device link.
>>> pm_runtime_get_sync() would ultimately just increment the runtime PM
>>> usage count.
>>>
>>> b) For a case when the master is already powered off (which wouldn't
>>> be IRQ context, for the reason stated in a)), powering on the SMMU is
>>> unavoidable, if the SMMU hardware really needs to be accessed (i.e.
>>> some TLBs need to be invalidated, if their state is preserved despite
>>> master being powered down).
>>>
>>>>
>>>>>
>>>>>>
>>>>>> There has already been some discussion about this on various earlier
>>>>>> permutations of this patchset.  I think we have exhausted all other
>>>>>> options.
>>>>>
>>>>> I guess I should have read those. Let me do that now.
>>>> Yea, i point to the thread in cover letter and [PATCH 1/6].
>>>> Thanks.
>>>
>>> I read through all the links in the cover letter and I could see other
>>> attempts not working out indeed, but they were different from what I'm
>>> proposing.
>>>
>>> There was also a point raised that __pm_runtime_resume() called from
>>> pm_runtime_get_sync() would grab dev->power_lock spinlock, which is
>>> true, except that if the device is already active, it would do it only
>>> for the time of checking device state, so I doubt it would really be a
>>> significant point of contention.
>>>
>>> Best regards,
>>> Tomasz
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
>> of Code Aurora Forum, hosted by The Linux Foundation
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jordan Crouse Feb. 14, 2018, 3:48 p.m. UTC | #13
On Wed, Feb 14, 2018 at 12:31:29PM +0900, Tomasz Figa wrote:
> Hi Jordan,
> 
> On Wed, Feb 14, 2018 at 1:42 AM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> > On Tue, Feb 13, 2018 at 06:10:38PM +0900, Tomasz Figa wrote:
> >> Hi Vivek,
> >>
> >> Thanks for the patch. Please see my comments inline.
> >>
> >> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
> >> <vivek.gautam@codeaurora.org> wrote:
> >> > While handling the concerned iommu, there should not be a
> >> > need to power control the drm devices from iommu interface.
> >> > If these drm devices need to be powered around this time,
> >> > the respective drivers should take care of this.
> >> >
> >> > Replace the pm_runtime_get/put_sync(<drm_device>) with
> >> > pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
> >> > the connected iommu through the device link interface.
> >> > In case the device link is not setup these get/put_suppliers()
> >> > calls will be a no-op, and the iommu driver should take care of
> >> > powering on its devices accordingly.
> >> >
> >> > Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
> >> > ---
> >> >  drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
> >> >  1 file changed, 8 insertions(+), 8 deletions(-)
> >> >
> >> > diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> >> > index b23d33622f37..1ab629bbee69 100644
> >> > --- a/drivers/gpu/drm/msm/msm_iommu.c
> >> > +++ b/drivers/gpu/drm/msm/msm_iommu.c
> >> > @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
> >> >         struct msm_iommu *iommu = to_msm_iommu(mmu);
> >> >         int ret;
> >> >
> >> > -       pm_runtime_get_sync(mmu->dev);
> >> > +       pm_runtime_get_suppliers(mmu->dev);
> >> >         ret = iommu_attach_device(iommu->domain, mmu->dev);
> >> > -       pm_runtime_put_sync(mmu->dev);
> >> > +       pm_runtime_put_suppliers(mmu->dev);
> >>
> >> For me, it looks like a wrong place to handle runtime PM of IOMMU
> >> here. iommu_attach_device() calls into IOMMU driver's attach_device()
> >> callback and that's where necessary runtime PM gets should happen, if
> >> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
> >> with power state of device controlled by driver B (ARM SMMU).
> >
> > This whole thing is confused by the fact that on MSM the GPU and the GPU IOMMU
> > share some of the same clocks and power rail so turning on the GPU also
> > turned on the IOMMU register banks by extension.
> 
> This is surprisingly not a very surprising case. Exactly the same can
> be seen on Rockchip SoCs and we're solving the problem using the
> solution I suggested. In fact, my suggestions to this thread are based
> on the design we chose for Rockchip, due to the high level of
> similarity (+/- the GPU directly programming IOMMU registers, which is
> not present there, but AFAICT it doesn't pose a problem here).
> 
> >
> > But if we put that aside the question is who should be responsible for
> > controlling the power in this relationship and there are several good reasons to
> > leave it up to the client device. The most important reason is when we move to
> > the per-instance model where the GPU self-programmings the SMMU registers. In
> > that case, the driver will need to make sure that the SMMU is powered up before
> > submitting the command and then removing the power vote when the commands
> > are done to save energy.
> 
> I might need more insight on what's going on in your hardware, but
> with my current understanding I'd argue that that is not right,
> because:
> 
> - When submitting commands to the GPU, the GPU driver will
> pm_runtime_get_sync() on the GPU device, which will automatically do
> the same on all the linked suppliers, which would also include the
> SMMU itself. The role of device links here is exactly that the GPU
> driver doesn't have to care which other devices need to be brought up.

This is true.  Assuming that the device link works correctly we would not need
to explicitly power the SMMU which makes my point entirely moot.

> - When the GPU is operating, the SMMU power must be supplied anyway,
> because it needs to be doing the translations, right? Note that by
> "power" I really mean the physical power supply in the SoC, e.g. as
> for a power domain. The runtime PM API in its current form (e.g.
> binary off or on operation) is unsuitable for managing other things,
> such as clocks (and there is ongoing work on improving it, e.g. by
> adding support for multiple power states).

As others have pointed out, the register banks and the translation unit are
powered separately (or at least, clocked separately).

>    ^^ The above would be actually guaranteed by your hardware design,
> where SMMU and GPU share the power domain and clocks. (We used to rely
> on this in old downstream implementation of Rockchip IOMMU and master
> drivers in Chromium OS kernel, before we moved to handling the clocks
> explicitly in the IOMMU driver and properly using device links to
> manage the power domain and state restoration.)

I wouldn't call it a guarantee. I would instead say that it works by a happy
coincidence that I don't think we should depend on.

> >
> > Additionally, there might be legitimate reasons in the driver to batch
> > operations - you may wish to attach the device and then map several global
> > buffers immediately - having driver side control prevents several unneeded power
> > transitions.
> 
> As I mentioned before, these operations wouldn't normally need any
> power transitions, since mapping with the TLB powered down boils down
> to just updating the page tables in memory. However, as Robin
> mentioned before, there might be some hardware factors, such as TLB
> being powered separately (or retaining contents in some other way),
> where this wouldn't be ensured indeed.
> 
> Still, that's where runtime PM autosuspend feature (i.e. delayed
> suspend) comes to the rescue, with the advantage of handling the cases
> when the master driver receives map/unmap requests not batched (but
> maybe a slight drawback in terms of the suspend not happening
> instantly and losing some power, but it's about power domains, so
> mainly leakage current, isn't it?)

> >
> > Perhaps the right answer is to do both - allow for the driver to enable the
> > supplier but also do the right power operations at the appropriately places in
> > the IOMMU driver.
> >
> >> This is also important for the reasons I stated in my comments to
> >> "[PATCH v7 1/6] base: power: runtime: Export
> >> pm_runtime_get/put_suppliers". Quoting for everyone's convenience:
> >>
> >> >> There are however cases in which the consumer wants to power-on
> >> >> the supplier, but not itself.
> >> >> E.g., A Graphics or multimedia driver wants to power-on the SMMU
> >> >> to unmap a buffer and finish the TLB operations without powering
> >> >> on itself.
> >> >
> >> >This sounds strange to me. If the SMMU is powered down, wouldn't the
> >> >TLB lose its contents as well (and so no flushing needed)?
> >> >
> >
> >> >Other than that, what kind of hardware operations would be needed
> >> >besides just updating the page tables from the CPU?
> >> >
> >
> >> In other words, the SMMU driver can deal with hardware state based on
> >> return value of pm_runtime_get_sync() or pm_runtime_get_if_in_use()
> >> and decide whether some operations are necessary or not, e.g.
> >> - a state restore is necessary if the domain was powered off, but we
> >> are bringing the master on,
> >> - a flush may not be required when (un)mapping with the domain powered off,
> >> - etc.
> >
> > I agree that there is probably some advanced logic that we can do to
> > conclusively figure out the state of the hardware and improve the behavior.
> > I would love to see the SMMU driver get smarter but for the moment we can't
> > trust it and so we need to force the behavior from the GPU driver. The current
> > code works for a5x and earlier but on sdm845 we can (no longer) treat the GPU
> > and the SMMU as the same device for power purposes so we need this code.
> 
> Hmm, you've lost me there. Above you  mention that "on MSM the GPU and
> the GPU IOMMU share some of the same clocks and power rail". Is this
> no longer the case for sdm845? If so, would you mind shedding a bit
> more light on how this looks there?

Sure. I've sent out the code, but it can be confusing, so I'll try to explain it
a little better.

On a5xx and earlier the GPU power/clocks were directly controlled by the CPU so
the pm resume consisted of a handful of clock controls plus the domain(s)
controlled by genpd. 

Starting on sdm845 we have added a new integrated microcontroller called the GMU
which takes over power control for the GPU. The GMU runs in real time and it
can bring the GPU power up and down very quickly - even quickly enough to
collapse between frames. If done right this can save significant leakage.

The problem is of course that the GMU is a fully featured processor in its own
right so its not longer a matter of just turning on clocks and rails.
We need to boot it, load the microcode, establish IPC and so on. As you
imagine,the GMU also uses the SMMU to share code with the CPU.

The kicker is that the while SMMU and GPU share common clocks, the GMU does not
and since from the perspective of the CPU the only device that we control is the
GMU and we have to treat the SMMU as a truly separate device and thats how we
get to where we are.

But as you said, as long as we have the device link correctly set up, I think we
might just be able to get away with depending on the supplier chain working
during pm resume.  I'll test it out today and see how it goes.

Thanks,
Jordan
Robin Murphy Feb. 14, 2018, 4:03 p.m. UTC | #14
On 14/02/18 10:33, Vivek Gautam wrote:
> On Wed, Feb 14, 2018 at 2:46 PM, Tomasz Figa <tfiga@chromium.org> wrote:
> 
> Adding Jordan to this thread as well.
> 
>> On Wed, Feb 14, 2018 at 6:13 PM, Vivek Gautam
>> <vivek.gautam@codeaurora.org> wrote:
>>> Hi Tomasz,
>>>
>>> On Wed, Feb 14, 2018 at 11:08 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>> On Wed, Feb 14, 2018 at 1:17 PM, Vivek Gautam
>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>> Hi Tomasz,
>>>>>
>>>>> On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>>>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>>>>>>>> Hi Vivek,
>>>>>>>>>>
>>>>>>>>>> Thanks for the patch. Please see my comments inline.
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>>>>>> While handling the concerned iommu, there should not be a
>>>>>>>>>>> need to power control the drm devices from iommu interface.
>>>>>>>>>>> If these drm devices need to be powered around this time,
>>>>>>>>>>> the respective drivers should take care of this.
>>>>>>>>>>>
>>>>>>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>>>>>>> the connected iommu through the device link interface.
>>>>>>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>>>>>>> powering on its devices accordingly.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>>>>>>> ---
>>>>>>>>>>>   drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>>>>>>   1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
>>>>>>>>>>>          struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>>>>>>          int ret;
>>>>>>>>>>>
>>>>>>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>>>>>>          ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>>>>>>
>>>>>>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>>>>>>> here. iommu_attach_device() calls into IOMMU driver's attach_device()
>>>>>>>>>> callback and that's where necessary runtime PM gets should happen, if
>>>>>>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be dealing
>>>>>>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>>>>>>
>>>>>>>>> Note that we end up having to do the same, because of iommu_unmap()
>>>>>>>>> while DRM driver is powered off..  it might be cleaner if it was all
>>>>>>>>> self contained in the iommu driver, but that would make it so other
>>>>>>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>>>>>>> apparently something that some of them want to do..
>>>>>>>>
>>>>>>>> I'd assume that runtime PM status is already guaranteed to be active
>>>>>>>> when the IRQ handler is running, by some other means (e.g.
>>>>>>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>>>>>>> hardware). Otherwise, I'm not sure how a powered down device could
>>>>>>>> trigger an IRQ.
>>>>>>>>
>>>>>>>> So, if the master device power is already on, suppliers should be
>>>>>>>> powered on as well, thanks to device links.
>>>>>>>>
>>>>>>>
>>>>>>> umm, that is kindof the inverse of the problem..  the problem is
>>>>>>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>>>>>>> afaict).. they will potentially call iommu->unmap() when device is not
>>>>>>> active (due to userspace or things beyond the control of the driver)..
>>>>>>> so *they* would want iommu to do pm get/put calls.
>>>>>>
>>>>>> Which is fine and which is actually already done by one of the patches
>>>>>> in this series, not for map/unmap, but probe, add_device,
>>>>>> remove_device. Having parts of the API doing it inside the callback
>>>>>> and other parts outside sounds at least inconsistent.
>>>>>>
>>>>>>> But other drivers
>>>>>>> trying to unmap from irq ctx would not.  Which is the contradictory
>>>>>>> requirement that lead to the idea of iommu user powering up iommu for
>>>>>>> unmap.
>>>>>>
>>>>>> Sorry, maybe I wasn't clear. My last message was supposed to show that
>>>>>> it's not contradictory at all, because "other drivers trying to unmap
>>>>>> from irq ctx" would already have called pm_runtime_get_*() earlier
>>>>>> from a non-irq ctx, which would have also done the same on all the
>>>>>> linked suppliers, including the IOMMU. The ultimate result would be
>>>>>> that the map/unmap() of the IOMMU driver calling pm_runtime_get_sync()
>>>>>> would do nothing besides incrementing the reference count.
>>>>>
>>>>> The entire point was to avoid the slowpath that pm_runtime_get/put_sync()
>>>>> would add in map/unmap. It would not be correct to add a slowpath in irq_ctx
>>>>> for taking care of non-irq_ctx and for the situations where master is already
>>>>> powered-off.
>>>>
>>>> Correct me if I'm wrong, but I believe that with what I'm proposing
>>>> there wouldn't be any slow path.
>>>
>>> Yea, but only when the power domain is irq-safe? And not all platforms
>>> enable irq-safe power domains. For instance, msm doesn't enable its
>>> gdsc power domains as irq-safe.
>>> Is it something i am missing?
>>
>> irq-safe would matter if there would exist a case when the call is
>> done from IRQ context and the power is off. As I explained in a), it
>> shouldn't happen.
> 
> Hi Robin, Will
> 
> Does adding pm_runtime_get() in map/unmap sounds good to you?

Given that we spent significant effort last year removing as much 
locking as we possibly could from the map/unmap path to minimise the 
significant performance impact it was having on networking/storage/etc. 
workloads, I really don't want to introduce more for the sake of one 
specific use-case, so no.

Robin.
Rob Clark Feb. 14, 2018, 4:12 p.m. UTC | #15
On Wed, Feb 14, 2018 at 10:48 AM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
> On Wed, Feb 14, 2018 at 12:31:29PM +0900, Tomasz Figa wrote:
>>
>> - When submitting commands to the GPU, the GPU driver will
>> pm_runtime_get_sync() on the GPU device, which will automatically do
>> the same on all the linked suppliers, which would also include the
>> SMMU itself. The role of device links here is exactly that the GPU
>> driver doesn't have to care which other devices need to be brought up.
>
> This is true.  Assuming that the device link works correctly we would not need
> to explicitly power the SMMU which makes my point entirely moot.

Just to point out what motivated this patchset, the biggest problem is
iommu_unmap() because that can happen when GPU is not powered on (or
in the v4l2 case, because some other device dropped it's reference to
the dma-buf allowing it to be free'd).  Currently we pm get/put the
GPU device around unmap, but it is kinda silly to boot up the GPU just
to unmap a buffer.

(Semi-related, I would also like to batch map/unmap's, I just haven't
gotten around to implementing it yet.. but that would be another case
where a single get_supplier()/put_supplier() outside of the iommu
would make sense instead of pm_get/put() inside the iommu driver's
->unmap().)

If you really dislike the get/put_supplier() approach, then perhaps we
need iommu_pm_get()/iommu_pm_put() operations that the iommu user
could use to accomplish the same thing?

BR,
-R
Tomasz Figa Feb. 15, 2018, 3:17 a.m. UTC | #16
On Thu, Feb 15, 2018 at 1:03 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 14/02/18 10:33, Vivek Gautam wrote:
>>
>> On Wed, Feb 14, 2018 at 2:46 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>
>> Adding Jordan to this thread as well.
>>
>>> On Wed, Feb 14, 2018 at 6:13 PM, Vivek Gautam
>>> <vivek.gautam@codeaurora.org> wrote:
>>>>
>>>> Hi Tomasz,
>>>>
>>>> On Wed, Feb 14, 2018 at 11:08 AM, Tomasz Figa <tfiga@chromium.org>
>>>> wrote:
>>>>>
>>>>> On Wed, Feb 14, 2018 at 1:17 PM, Vivek Gautam
>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>
>>>>>> Hi Tomasz,
>>>>>>
>>>>>> On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Vivek,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the patch. Please see my comments inline.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>>>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> While handling the concerned iommu, there should not be a
>>>>>>>>>>>> need to power control the drm devices from iommu interface.
>>>>>>>>>>>> If these drm devices need to be powered around this time,
>>>>>>>>>>>> the respective drivers should take care of this.
>>>>>>>>>>>>
>>>>>>>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>>>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>>>>>>>> the connected iommu through the device link interface.
>>>>>>>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>>>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>>>>>>>> powering on its devices accordingly.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>>>>>>>> ---
>>>>>>>>>>>>   drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>>>>>>>   1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>> b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu
>>>>>>>>>>>> *mmu, const char * const *names,
>>>>>>>>>>>>          struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>>>>>>>          int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>>>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>>>>>>>          ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>>>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>>>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>>>>>>>> here. iommu_attach_device() calls into IOMMU driver's
>>>>>>>>>>> attach_device()
>>>>>>>>>>> callback and that's where necessary runtime PM gets should
>>>>>>>>>>> happen, if
>>>>>>>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be
>>>>>>>>>>> dealing
>>>>>>>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Note that we end up having to do the same, because of
>>>>>>>>>> iommu_unmap()
>>>>>>>>>> while DRM driver is powered off..  it might be cleaner if it was
>>>>>>>>>> all
>>>>>>>>>> self contained in the iommu driver, but that would make it so
>>>>>>>>>> other
>>>>>>>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>>>>>>>> apparently something that some of them want to do..
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'd assume that runtime PM status is already guaranteed to be
>>>>>>>>> active
>>>>>>>>> when the IRQ handler is running, by some other means (e.g.
>>>>>>>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>>>>>>>> hardware). Otherwise, I'm not sure how a powered down device could
>>>>>>>>> trigger an IRQ.
>>>>>>>>>
>>>>>>>>> So, if the master device power is already on, suppliers should be
>>>>>>>>> powered on as well, thanks to device links.
>>>>>>>>>
>>>>>>>>
>>>>>>>> umm, that is kindof the inverse of the problem..  the problem is
>>>>>>>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>>>>>>>> afaict).. they will potentially call iommu->unmap() when device is
>>>>>>>> not
>>>>>>>> active (due to userspace or things beyond the control of the
>>>>>>>> driver)..
>>>>>>>> so *they* would want iommu to do pm get/put calls.
>>>>>>>
>>>>>>>
>>>>>>> Which is fine and which is actually already done by one of the
>>>>>>> patches
>>>>>>> in this series, not for map/unmap, but probe, add_device,
>>>>>>> remove_device. Having parts of the API doing it inside the callback
>>>>>>> and other parts outside sounds at least inconsistent.
>>>>>>>
>>>>>>>> But other drivers
>>>>>>>> trying to unmap from irq ctx would not.  Which is the contradictory
>>>>>>>> requirement that lead to the idea of iommu user powering up iommu
>>>>>>>> for
>>>>>>>> unmap.
>>>>>>>
>>>>>>>
>>>>>>> Sorry, maybe I wasn't clear. My last message was supposed to show
>>>>>>> that
>>>>>>> it's not contradictory at all, because "other drivers trying to unmap
>>>>>>> from irq ctx" would already have called pm_runtime_get_*() earlier
>>>>>>> from a non-irq ctx, which would have also done the same on all the
>>>>>>> linked suppliers, including the IOMMU. The ultimate result would be
>>>>>>> that the map/unmap() of the IOMMU driver calling
>>>>>>> pm_runtime_get_sync()
>>>>>>> would do nothing besides incrementing the reference count.
>>>>>>
>>>>>>
>>>>>> The entire point was to avoid the slowpath that
>>>>>> pm_runtime_get/put_sync()
>>>>>> would add in map/unmap. It would not be correct to add a slowpath in
>>>>>> irq_ctx
>>>>>> for taking care of non-irq_ctx and for the situations where master is
>>>>>> already
>>>>>> powered-off.
>>>>>
>>>>>
>>>>> Correct me if I'm wrong, but I believe that with what I'm proposing
>>>>> there wouldn't be any slow path.
>>>>
>>>>
>>>> Yea, but only when the power domain is irq-safe? And not all platforms
>>>> enable irq-safe power domains. For instance, msm doesn't enable its
>>>> gdsc power domains as irq-safe.
>>>> Is it something i am missing?
>>>
>>>
>>> irq-safe would matter if there would exist a case when the call is
>>> done from IRQ context and the power is off. As I explained in a), it
>>> shouldn't happen.
>>
>>
>> Hi Robin, Will
>>
>> Does adding pm_runtime_get() in map/unmap sounds good to you?
>
>
> Given that we spent significant effort last year removing as much locking as
> we possibly could from the map/unmap path to minimise the significant
> performance impact it was having on networking/storage/etc. workloads, I
> really don't want to introduce more for the sake of one specific use-case,
> so no.

Could you elaborate on what kind of locking you are concerned about?
As I explained before, the normally happening fast path would lock
dev->power_lock only for the brief moment of incrementing the runtime
PM usage counter.

Best regards,
Tomasz
Tomasz Figa Feb. 15, 2018, 4:09 a.m. UTC | #17
On Thu, Feb 15, 2018 at 1:12 AM, Rob Clark <robdclark@gmail.com> wrote:
> On Wed, Feb 14, 2018 at 10:48 AM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
>> On Wed, Feb 14, 2018 at 12:31:29PM +0900, Tomasz Figa wrote:
>>>
>>> - When submitting commands to the GPU, the GPU driver will
>>> pm_runtime_get_sync() on the GPU device, which will automatically do
>>> the same on all the linked suppliers, which would also include the
>>> SMMU itself. The role of device links here is exactly that the GPU
>>> driver doesn't have to care which other devices need to be brought up.
>>
>> This is true.  Assuming that the device link works correctly we would not need
>> to explicitly power the SMMU which makes my point entirely moot.
>
> Just to point out what motivated this patchset, the biggest problem is
> iommu_unmap() because that can happen when GPU is not powered on (or
> in the v4l2 case, because some other device dropped it's reference to
> the dma-buf allowing it to be free'd).  Currently we pm get/put the
> GPU device around unmap, but it is kinda silly to boot up the GPU just
> to unmap a buffer.

Note that in V4L2 both mapping and unmapping can happen completely
without involving the driver. So AFAICT the approach being implemented
by this patchset will not work, because there will be no one to power
up the IOMMU before the operation. Moreover, there are platforms for
which there is no reason to power up the IOMMU just for map/unmap,
because the hardware state is lost anyway and the only real work
needed is updating the page tables in memory. (I feel like this is
actually true for most of the platforms in the wild, but this is based
purely on the not so small number of platforms I worked with, haven't
bothered looking for more general evidence.)

>
> (Semi-related, I would also like to batch map/unmap's, I just haven't
> gotten around to implementing it yet.. but that would be another case
> where a single get_supplier()/put_supplier() outside of the iommu
> would make sense instead of pm_get/put() inside the iommu driver's
> ->unmap().)
>
> If you really dislike the get/put_supplier() approach, then perhaps we
> need iommu_pm_get()/iommu_pm_put() operations that the iommu user
> could use to accomplish the same thing?

I'm afraid this wouldn't work for V4L2 either. And I still haven't
been given any evidence that the approach I'm suggesting, which relies
only on existing pieces of infrastructure and which worked for both
Exynos and Rockchip, including V4L2, wouldn't work for SMMU and/or QC
SoCs.

Best regards,
Tomasz
Tomasz Figa Feb. 15, 2018, 4:17 a.m. UTC | #18
On Thu, Feb 15, 2018 at 12:17 PM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Thu, Feb 15, 2018 at 1:03 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> On 14/02/18 10:33, Vivek Gautam wrote:
>>>
>>> On Wed, Feb 14, 2018 at 2:46 PM, Tomasz Figa <tfiga@chromium.org> wrote:
>>>
>>> Adding Jordan to this thread as well.
>>>
>>>> On Wed, Feb 14, 2018 at 6:13 PM, Vivek Gautam
>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>
>>>>> Hi Tomasz,
>>>>>
>>>>> On Wed, Feb 14, 2018 at 11:08 AM, Tomasz Figa <tfiga@chromium.org>
>>>>> wrote:
>>>>>>
>>>>>> On Wed, Feb 14, 2018 at 1:17 PM, Vivek Gautam
>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>>
>>>>>>> Hi Tomasz,
>>>>>>>
>>>>>>> On Wed, Feb 14, 2018 at 8:31 AM, Tomasz Figa <tfiga@chromium.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Wed, Feb 14, 2018 at 11:13 AM, Rob Clark <robdclark@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Feb 13, 2018 at 8:59 PM, Tomasz Figa <tfiga@chromium.org>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 14, 2018 at 3:03 AM, Rob Clark <robdclark@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 13, 2018 at 4:10 AM, Tomasz Figa <tfiga@chromium.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Vivek,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the patch. Please see my comments inline.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 7, 2018 at 7:31 PM, Vivek Gautam
>>>>>>>>>>>> <vivek.gautam@codeaurora.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> While handling the concerned iommu, there should not be a
>>>>>>>>>>>>> need to power control the drm devices from iommu interface.
>>>>>>>>>>>>> If these drm devices need to be powered around this time,
>>>>>>>>>>>>> the respective drivers should take care of this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Replace the pm_runtime_get/put_sync(<drm_device>) with
>>>>>>>>>>>>> pm_runtime_get/put_suppliers(<drm_device>) calls, to power-up
>>>>>>>>>>>>> the connected iommu through the device link interface.
>>>>>>>>>>>>> In case the device link is not setup these get/put_suppliers()
>>>>>>>>>>>>> calls will be a no-op, and the iommu driver should take care of
>>>>>>>>>>>>> powering on its devices accordingly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   drivers/gpu/drm/msm/msm_iommu.c | 16 ++++++++--------
>>>>>>>>>>>>>   1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>>> b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>>> index b23d33622f37..1ab629bbee69 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_iommu.c
>>>>>>>>>>>>> @@ -40,9 +40,9 @@ static int msm_iommu_attach(struct msm_mmu
>>>>>>>>>>>>> *mmu, const char * const *names,
>>>>>>>>>>>>>          struct msm_iommu *iommu = to_msm_iommu(mmu);
>>>>>>>>>>>>>          int ret;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -       pm_runtime_get_sync(mmu->dev);
>>>>>>>>>>>>> +       pm_runtime_get_suppliers(mmu->dev);
>>>>>>>>>>>>>          ret = iommu_attach_device(iommu->domain, mmu->dev);
>>>>>>>>>>>>> -       pm_runtime_put_sync(mmu->dev);
>>>>>>>>>>>>> +       pm_runtime_put_suppliers(mmu->dev);
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> For me, it looks like a wrong place to handle runtime PM of IOMMU
>>>>>>>>>>>> here. iommu_attach_device() calls into IOMMU driver's
>>>>>>>>>>>> attach_device()
>>>>>>>>>>>> callback and that's where necessary runtime PM gets should
>>>>>>>>>>>> happen, if
>>>>>>>>>>>> any. In other words, driver A (MSM DRM driver) shouldn't be
>>>>>>>>>>>> dealing
>>>>>>>>>>>> with power state of device controlled by driver B (ARM SMMU).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Note that we end up having to do the same, because of
>>>>>>>>>>> iommu_unmap()
>>>>>>>>>>> while DRM driver is powered off..  it might be cleaner if it was
>>>>>>>>>>> all
>>>>>>>>>>> self contained in the iommu driver, but that would make it so
>>>>>>>>>>> other
>>>>>>>>>>> drivers couldn't call iommu_unmap() from an irq handler, which is
>>>>>>>>>>> apparently something that some of them want to do..
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'd assume that runtime PM status is already guaranteed to be
>>>>>>>>>> active
>>>>>>>>>> when the IRQ handler is running, by some other means (e.g.
>>>>>>>>>> pm_runtime_get_sync() called earlier, when queuing some work to the
>>>>>>>>>> hardware). Otherwise, I'm not sure how a powered down device could
>>>>>>>>>> trigger an IRQ.
>>>>>>>>>>
>>>>>>>>>> So, if the master device power is already on, suppliers should be
>>>>>>>>>> powered on as well, thanks to device links.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> umm, that is kindof the inverse of the problem..  the problem is
>>>>>>>>> things like gpu driver (and v4l2 drivers that import dma-buf's,
>>>>>>>>> afaict).. they will potentially call iommu->unmap() when device is
>>>>>>>>> not
>>>>>>>>> active (due to userspace or things beyond the control of the
>>>>>>>>> driver)..
>>>>>>>>> so *they* would want iommu to do pm get/put calls.
>>>>>>>>
>>>>>>>>
>>>>>>>> Which is fine and which is actually already done by one of the
>>>>>>>> patches
>>>>>>>> in this series, not for map/unmap, but probe, add_device,
>>>>>>>> remove_device. Having parts of the API doing it inside the callback
>>>>>>>> and other parts outside sounds at least inconsistent.
>>>>>>>>
>>>>>>>>> But other drivers
>>>>>>>>> trying to unmap from irq ctx would not.  Which is the contradictory
>>>>>>>>> requirement that lead to the idea of iommu user powering up iommu
>>>>>>>>> for
>>>>>>>>> unmap.
>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry, maybe I wasn't clear. My last message was supposed to show
>>>>>>>> that
>>>>>>>> it's not contradictory at all, because "other drivers trying to unmap
>>>>>>>> from irq ctx" would already have called pm_runtime_get_*() earlier
>>>>>>>> from a non-irq ctx, which would have also done the same on all the
>>>>>>>> linked suppliers, including the IOMMU. The ultimate result would be
>>>>>>>> that the map/unmap() of the IOMMU driver calling
>>>>>>>> pm_runtime_get_sync()
>>>>>>>> would do nothing besides incrementing the reference count.
>>>>>>>
>>>>>>>
>>>>>>> The entire point was to avoid the slowpath that
>>>>>>> pm_runtime_get/put_sync()
>>>>>>> would add in map/unmap. It would not be correct to add a slowpath in
>>>>>>> irq_ctx
>>>>>>> for taking care of non-irq_ctx and for the situations where master is
>>>>>>> already
>>>>>>> powered-off.
>>>>>>
>>>>>>
>>>>>> Correct me if I'm wrong, but I believe that with what I'm proposing
>>>>>> there wouldn't be any slow path.
>>>>>
>>>>>
>>>>> Yea, but only when the power domain is irq-safe? And not all platforms
>>>>> enable irq-safe power domains. For instance, msm doesn't enable its
>>>>> gdsc power domains as irq-safe.
>>>>> Is it something i am missing?
>>>>
>>>>
>>>> irq-safe would matter if there would exist a case when the call is
>>>> done from IRQ context and the power is off. As I explained in a), it
>>>> shouldn't happen.
>>>
>>>
>>> Hi Robin, Will
>>>
>>> Does adding pm_runtime_get() in map/unmap sounds good to you?
>>
>>
>> Given that we spent significant effort last year removing as much locking as
>> we possibly could from the map/unmap path to minimise the significant
>> performance impact it was having on networking/storage/etc. workloads, I
>> really don't want to introduce more for the sake of one specific use-case,
>> so no.
>
> Could you elaborate on what kind of locking you are concerned about?
> As I explained before, the normally happening fast path would lock
> dev->power_lock only for the brief moment of incrementing the runtime
> PM usage counter.

My bad, that's not even it.

The atomic usage counter is incremented beforehands, without any
locking [1] and the spinlock is acquired only for the sake of
validating that device's runtime PM state remained valid indeed [2],
which would be the case in the fast path of the same driver doing two
mappings in parallel, with the master powered on (and so the SMMU,
through device links; if master was not powered on already, powering
on the SMMU is unavoidable anyway and it would add much more latency
than the spinlock itself).

[1] http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L1028
[2] http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L613

In any case, I can't imagine this working with V4L2 or anything else
relying on any memory management more generic than calling IOMMU API
directly from the driver, with the IOMMU device having runtime PM
enabled, but without managing the runtime PM from the IOMMU driver's
callbacks that need access to the hardware. As I mentioned before,
only the IOMMU driver knows when exactly the real hardware access
needs to be done (e.g. Rockchip/Exynos don't need to do that for
map/unmap if the power is down, but some implementations of SMMU with
TLB powered separately might need to do so).

Best regards,
Tomasz
Rob Clark Feb. 15, 2018, 2:14 p.m. UTC | #19
On Wed, Feb 14, 2018 at 11:09 PM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Thu, Feb 15, 2018 at 1:12 AM, Rob Clark <robdclark@gmail.com> wrote:
>> On Wed, Feb 14, 2018 at 10:48 AM, Jordan Crouse <jcrouse@codeaurora.org> wrote:
>>> On Wed, Feb 14, 2018 at 12:31:29PM +0900, Tomasz Figa wrote:
>>>>
>>>> - When submitting commands to the GPU, the GPU driver will
>>>> pm_runtime_get_sync() on the GPU device, which will automatically do
>>>> the same on all the linked suppliers, which would also include the
>>>> SMMU itself. The role of device links here is exactly that the GPU
>>>> driver doesn't have to care which other devices need to be brought up.
>>>
>>> This is true.  Assuming that the device link works correctly we would not need
>>> to explicitly power the SMMU which makes my point entirely moot.
>>
>> Just to point out what motivated this patchset, the biggest problem is
>> iommu_unmap() because that can happen when GPU is not powered on (or
>> in the v4l2 case, because some other device dropped it's reference to
>> the dma-buf allowing it to be free'd).  Currently we pm get/put the
>> GPU device around unmap, but it is kinda silly to boot up the GPU just
>> to unmap a buffer.
>
> Note that in V4L2 both mapping and unmapping can happen completely
> without involving the driver. So AFAICT the approach being implemented
> by this patchset will not work, because there will be no one to power
> up the IOMMU before the operation. Moreover, there are platforms for
> which there is no reason to power up the IOMMU just for map/unmap,
> because the hardware state is lost anyway and the only real work
> needed is updating the page tables in memory. (I feel like this is
> actually true for most of the platforms in the wild, but this is based
> purely on the not so small number of platforms I worked with, haven't
> bothered looking for more general evidence.)
>

At least as far as drm/msm/adreno, I'm not terribly concerned about
other platforms that don't need to power up iommu.  It's not really
the same situation as a IP block that shows up in all different
vendor's SoCs.

But if you can convince Robin to go for get/put_sync() calls inside
the iommu driver, I'm fine with that approach too.  That is what I do
in qcom_iommu already.  But if not I'd like to at least solve this for
some platforms if we can't solve for all.

BR,
-R

>>
>> (Semi-related, I would also like to batch map/unmap's, I just haven't
>> gotten around to implementing it yet.. but that would be another case
>> where a single get_supplier()/put_supplier() outside of the iommu
>> would make sense instead of pm_get/put() inside the iommu driver's
>> ->unmap().)
>>
>> If you really dislike the get/put_supplier() approach, then perhaps we
>> need iommu_pm_get()/iommu_pm_put() operations that the iommu user
>> could use to accomplish the same thing?
>
> I'm afraid this wouldn't work for V4L2 either. And I still haven't
> been given any evidence that the approach I'm suggesting, which relies
> only on existing pieces of infrastructure and which worked for both
> Exynos and Rockchip, including V4L2, wouldn't work for SMMU and/or QC
> SoCs.
>
> Best regards,
> Tomasz
Robin Murphy Feb. 15, 2018, 5:14 p.m. UTC | #20
On 15/02/18 04:17, Tomasz Figa wrote:
[...]
>> Could you elaborate on what kind of locking you are concerned about?
>> As I explained before, the normally happening fast path would lock
>> dev->power_lock only for the brief moment of incrementing the runtime
>> PM usage counter.
> 
> My bad, that's not even it.
> 
> The atomic usage counter is incremented beforehands, without any
> locking [1] and the spinlock is acquired only for the sake of
> validating that device's runtime PM state remained valid indeed [2],
> which would be the case in the fast path of the same driver doing two
> mappings in parallel, with the master powered on (and so the SMMU,
> through device links; if master was not powered on already, powering
> on the SMMU is unavoidable anyway and it would add much more latency
> than the spinlock itself).

We now have no locking at all in the map path, and only a per-domain 
lock around TLB sync in unmap which is unfortunately necessary for 
correctness; the latter isn't too terrible, since in "serious" hardware 
it should only be serialising a few cpus serving the same device against 
each other (e.g. for multiple queues on a single NIC).

Putting in a global lock which serialises *all* concurrent map and unmap 
calls for *all* unrelated devices makes things worse. Period. Even if 
the lock itself were held for the minimum possible time, i.e. trivially 
"spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing 
that one cache line around between 96 CPUs across two sockets is not 
negligible.

> [1] http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L1028
> [2] http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L613
> 
> In any case, I can't imagine this working with V4L2 or anything else
> relying on any memory management more generic than calling IOMMU API
> directly from the driver, with the IOMMU device having runtime PM
> enabled, but without managing the runtime PM from the IOMMU driver's
> callbacks that need access to the hardware. As I mentioned before,
> only the IOMMU driver knows when exactly the real hardware access
> needs to be done (e.g. Rockchip/Exynos don't need to do that for
> map/unmap if the power is down, but some implementations of SMMU with
> TLB powered separately might need to do so).

It's worth noting that Exynos and Rockchip are relatively small 
self-contained IP blocks integrated closely with the interfaces of their 
relevant master devices; SMMU is an architecture, implementations of 
which may be large, distributed, and have complex and wildly differing 
internal topologies. As such, it's a lot harder to make 
hardware-specific assumptions and/or be correct for all possible cases.

Don't get me wrong, I do ultimately agree that the IOMMU driver is the 
only agent who ultimately knows what calls are going to be necessary for 
whatever operation it's performing on its own hardware*; it's just that 
for SMMU it needs to be implemented in a way that has zero impact on the 
cases where it doesn't matter, because it's not viable to specialise 
that driver for any particular IP implementation/use-case.

Robin.


*AFAICS it still makes some sense to have the get_suppliers option as 
well, though - the IOMMU driver does what it needs for correctness 
internally, but the external consumer doing something non-standard can 
can grab and hold the link around multiple calls to short-circuit that.
Tomasz Figa Feb. 16, 2018, 12:13 a.m. UTC | #21
On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 15/02/18 04:17, Tomasz Figa wrote:
> [...]
>>>
>>> Could you elaborate on what kind of locking you are concerned about?
>>> As I explained before, the normally happening fast path would lock
>>> dev->power_lock only for the brief moment of incrementing the runtime
>>> PM usage counter.
>>
>>
>> My bad, that's not even it.
>>
>> The atomic usage counter is incremented beforehands, without any
>> locking [1] and the spinlock is acquired only for the sake of
>> validating that device's runtime PM state remained valid indeed [2],
>> which would be the case in the fast path of the same driver doing two
>> mappings in parallel, with the master powered on (and so the SMMU,
>> through device links; if master was not powered on already, powering
>> on the SMMU is unavoidable anyway and it would add much more latency
>> than the spinlock itself).
>
>
> We now have no locking at all in the map path, and only a per-domain lock
> around TLB sync in unmap which is unfortunately necessary for correctness;
> the latter isn't too terrible, since in "serious" hardware it should only be
> serialising a few cpus serving the same device against each other (e.g. for
> multiple queues on a single NIC).
>
> Putting in a global lock which serialises *all* concurrent map and unmap
> calls for *all* unrelated devices makes things worse. Period. Even if the
> lock itself were held for the minimum possible time, i.e. trivially
> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that
> one cache line around between 96 CPUs across two sockets is not negligible.

Fair enough. Note that we're in a quite interesting situation now:
 a) We need to have runtime PM enabled on Qualcomm SoC to have power
properly managed,
 b) We need to have lock-free map/unmap on such distributed systems,
 c) If runtime PM is enabled, we need to call into runtime PM from any
code that does hardware accesses, otherwise the IOMMU API (and so DMA
API and then any V4L2 driver) becomes unusable.

I can see one more way that could potentially let us have all the
three. How about enabling runtime PM only on selected implementations
(e.g. qcom,smmu) and then having all the runtime PM calls surrounded
with if (pm_runtime_enabled()), which is lockless?

>
>> [1]
>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L1028
>> [2]
>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L613
>>
>> In any case, I can't imagine this working with V4L2 or anything else
>> relying on any memory management more generic than calling IOMMU API
>> directly from the driver, with the IOMMU device having runtime PM
>> enabled, but without managing the runtime PM from the IOMMU driver's
>> callbacks that need access to the hardware. As I mentioned before,
>> only the IOMMU driver knows when exactly the real hardware access
>> needs to be done (e.g. Rockchip/Exynos don't need to do that for
>> map/unmap if the power is down, but some implementations of SMMU with
>> TLB powered separately might need to do so).
>
>
> It's worth noting that Exynos and Rockchip are relatively small
> self-contained IP blocks integrated closely with the interfaces of their
> relevant master devices; SMMU is an architecture, implementations of which
> may be large, distributed, and have complex and wildly differing internal
> topologies. As such, it's a lot harder to make hardware-specific assumptions
> and/or be correct for all possible cases.
>
> Don't get me wrong, I do ultimately agree that the IOMMU driver is the only
> agent who ultimately knows what calls are going to be necessary for whatever
> operation it's performing on its own hardware*; it's just that for SMMU it
> needs to be implemented in a way that has zero impact on the cases where it
> doesn't matter, because it's not viable to specialise that driver for any
> particular IP implementation/use-case.

Still, exactly the same holds for the low power embedded use cases,
where we strive for the lowest possible power consumption, while
maintaining performance levels high as well. And so the SMMU code is
expected to also work with our use cases, such as V4L2 or DRM drivers.
Since these points don't hold for current SMMU code, I could say that
the it has been already specialized for large, distributed
implementations.

Best regards,
Tomasz
Tomasz Figa Feb. 22, 2018, 8:13 a.m. UTC | #22
On Fri, Feb 16, 2018 at 9:13 AM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> On 15/02/18 04:17, Tomasz Figa wrote:
>> [...]
>>>>
>>>> Could you elaborate on what kind of locking you are concerned about?
>>>> As I explained before, the normally happening fast path would lock
>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>> PM usage counter.
>>>
>>>
>>> My bad, that's not even it.
>>>
>>> The atomic usage counter is incremented beforehands, without any
>>> locking [1] and the spinlock is acquired only for the sake of
>>> validating that device's runtime PM state remained valid indeed [2],
>>> which would be the case in the fast path of the same driver doing two
>>> mappings in parallel, with the master powered on (and so the SMMU,
>>> through device links; if master was not powered on already, powering
>>> on the SMMU is unavoidable anyway and it would add much more latency
>>> than the spinlock itself).
>>
>>
>> We now have no locking at all in the map path, and only a per-domain lock
>> around TLB sync in unmap which is unfortunately necessary for correctness;
>> the latter isn't too terrible, since in "serious" hardware it should only be
>> serialising a few cpus serving the same device against each other (e.g. for
>> multiple queues on a single NIC).
>>
>> Putting in a global lock which serialises *all* concurrent map and unmap
>> calls for *all* unrelated devices makes things worse. Period. Even if the
>> lock itself were held for the minimum possible time, i.e. trivially
>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that
>> one cache line around between 96 CPUs across two sockets is not negligible.
>
> Fair enough. Note that we're in a quite interesting situation now:
>  a) We need to have runtime PM enabled on Qualcomm SoC to have power
> properly managed,
>  b) We need to have lock-free map/unmap on such distributed systems,
>  c) If runtime PM is enabled, we need to call into runtime PM from any
> code that does hardware accesses, otherwise the IOMMU API (and so DMA
> API and then any V4L2 driver) becomes unusable.
>
> I can see one more way that could potentially let us have all the
> three. How about enabling runtime PM only on selected implementations
> (e.g. qcom,smmu) and then having all the runtime PM calls surrounded
> with if (pm_runtime_enabled()), which is lockless?
>

Sorry for pinging, but any opinion on this kind of approach?

Best regards,
Tomasz

>>
>>> [1]
>>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L1028
>>> [2]
>>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L613
>>>
>>> In any case, I can't imagine this working with V4L2 or anything else
>>> relying on any memory management more generic than calling IOMMU API
>>> directly from the driver, with the IOMMU device having runtime PM
>>> enabled, but without managing the runtime PM from the IOMMU driver's
>>> callbacks that need access to the hardware. As I mentioned before,
>>> only the IOMMU driver knows when exactly the real hardware access
>>> needs to be done (e.g. Rockchip/Exynos don't need to do that for
>>> map/unmap if the power is down, but some implementations of SMMU with
>>> TLB powered separately might need to do so).
>>
>>
>> It's worth noting that Exynos and Rockchip are relatively small
>> self-contained IP blocks integrated closely with the interfaces of their
>> relevant master devices; SMMU is an architecture, implementations of which
>> may be large, distributed, and have complex and wildly differing internal
>> topologies. As such, it's a lot harder to make hardware-specific assumptions
>> and/or be correct for all possible cases.
>>
>> Don't get me wrong, I do ultimately agree that the IOMMU driver is the only
>> agent who ultimately knows what calls are going to be necessary for whatever
>> operation it's performing on its own hardware*; it's just that for SMMU it
>> needs to be implemented in a way that has zero impact on the cases where it
>> doesn't matter, because it's not viable to specialise that driver for any
>> particular IP implementation/use-case.
>
> Still, exactly the same holds for the low power embedded use cases,
> where we strive for the lowest possible power consumption, while
> maintaining performance levels high as well. And so the SMMU code is
> expected to also work with our use cases, such as V4L2 or DRM drivers.
> Since these points don't hold for current SMMU code, I could say that
> the it has been already specialized for large, distributed
> implementations.
>
> Best regards,
> Tomasz
Rob Clark Feb. 22, 2018, 1:30 p.m. UTC | #23
On Thu, Feb 22, 2018 at 3:13 AM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Fri, Feb 16, 2018 at 9:13 AM, Tomasz Figa <tfiga@chromium.org> wrote:
>> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>>> On 15/02/18 04:17, Tomasz Figa wrote:
>>> [...]
>>>>>
>>>>> Could you elaborate on what kind of locking you are concerned about?
>>>>> As I explained before, the normally happening fast path would lock
>>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>>> PM usage counter.
>>>>
>>>>
>>>> My bad, that's not even it.
>>>>
>>>> The atomic usage counter is incremented beforehands, without any
>>>> locking [1] and the spinlock is acquired only for the sake of
>>>> validating that device's runtime PM state remained valid indeed [2],
>>>> which would be the case in the fast path of the same driver doing two
>>>> mappings in parallel, with the master powered on (and so the SMMU,
>>>> through device links; if master was not powered on already, powering
>>>> on the SMMU is unavoidable anyway and it would add much more latency
>>>> than the spinlock itself).
>>>
>>>
>>> We now have no locking at all in the map path, and only a per-domain lock
>>> around TLB sync in unmap which is unfortunately necessary for correctness;
>>> the latter isn't too terrible, since in "serious" hardware it should only be
>>> serialising a few cpus serving the same device against each other (e.g. for
>>> multiple queues on a single NIC).
>>>
>>> Putting in a global lock which serialises *all* concurrent map and unmap
>>> calls for *all* unrelated devices makes things worse. Period. Even if the
>>> lock itself were held for the minimum possible time, i.e. trivially
>>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that
>>> one cache line around between 96 CPUs across two sockets is not negligible.
>>
>> Fair enough. Note that we're in a quite interesting situation now:
>>  a) We need to have runtime PM enabled on Qualcomm SoC to have power
>> properly managed,
>>  b) We need to have lock-free map/unmap on such distributed systems,
>>  c) If runtime PM is enabled, we need to call into runtime PM from any
>> code that does hardware accesses, otherwise the IOMMU API (and so DMA
>> API and then any V4L2 driver) becomes unusable.
>>
>> I can see one more way that could potentially let us have all the
>> three. How about enabling runtime PM only on selected implementations
>> (e.g. qcom,smmu) and then having all the runtime PM calls surrounded
>> with if (pm_runtime_enabled()), which is lockless?
>>
>
> Sorry for pinging, but any opinion on this kind of approach?
>

It is ok by me, for whatever that is worth

BR,
-R
Robin Murphy Feb. 22, 2018, 1:45 p.m. UTC | #24
[sorry, I had intended to reply sooner but clearly forgot]

On 16/02/18 00:13, Tomasz Figa wrote:
> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> On 15/02/18 04:17, Tomasz Figa wrote:
>> [...]
>>>>
>>>> Could you elaborate on what kind of locking you are concerned about?
>>>> As I explained before, the normally happening fast path would lock
>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>> PM usage counter.
>>>
>>>
>>> My bad, that's not even it.
>>>
>>> The atomic usage counter is incremented beforehands, without any
>>> locking [1] and the spinlock is acquired only for the sake of
>>> validating that device's runtime PM state remained valid indeed [2],
>>> which would be the case in the fast path of the same driver doing two
>>> mappings in parallel, with the master powered on (and so the SMMU,
>>> through device links; if master was not powered on already, powering
>>> on the SMMU is unavoidable anyway and it would add much more latency
>>> than the spinlock itself).
>>
>>
>> We now have no locking at all in the map path, and only a per-domain lock
>> around TLB sync in unmap which is unfortunately necessary for correctness;
>> the latter isn't too terrible, since in "serious" hardware it should only be
>> serialising a few cpus serving the same device against each other (e.g. for
>> multiple queues on a single NIC).
>>
>> Putting in a global lock which serialises *all* concurrent map and unmap
>> calls for *all* unrelated devices makes things worse. Period. Even if the
>> lock itself were held for the minimum possible time, i.e. trivially
>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that
>> one cache line around between 96 CPUs across two sockets is not negligible.
> 
> Fair enough. Note that we're in a quite interesting situation now:
>   a) We need to have runtime PM enabled on Qualcomm SoC to have power
> properly managed,
>   b) We need to have lock-free map/unmap on such distributed systems,
>   c) If runtime PM is enabled, we need to call into runtime PM from any
> code that does hardware accesses, otherwise the IOMMU API (and so DMA
> API and then any V4L2 driver) becomes unusable.
> 
> I can see one more way that could potentially let us have all the
> three. How about enabling runtime PM only on selected implementations
> (e.g. qcom,smmu) and then having all the runtime PM calls surrounded
> with if (pm_runtime_enabled()), which is lockless?

Yes, that's the kind of thing I was gravitating towards - my vague 
thought was adding some flag to the smmu_domain, but 
pm_runtime_enabled() does look conceptually a lot cleaner.

>>
>>> [1]
>>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L1028
>>> [2]
>>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L613
>>>
>>> In any case, I can't imagine this working with V4L2 or anything else
>>> relying on any memory management more generic than calling IOMMU API
>>> directly from the driver, with the IOMMU device having runtime PM
>>> enabled, but without managing the runtime PM from the IOMMU driver's
>>> callbacks that need access to the hardware. As I mentioned before,
>>> only the IOMMU driver knows when exactly the real hardware access
>>> needs to be done (e.g. Rockchip/Exynos don't need to do that for
>>> map/unmap if the power is down, but some implementations of SMMU with
>>> TLB powered separately might need to do so).
>>
>>
>> It's worth noting that Exynos and Rockchip are relatively small
>> self-contained IP blocks integrated closely with the interfaces of their
>> relevant master devices; SMMU is an architecture, implementations of which
>> may be large, distributed, and have complex and wildly differing internal
>> topologies. As such, it's a lot harder to make hardware-specific assumptions
>> and/or be correct for all possible cases.
>>
>> Don't get me wrong, I do ultimately agree that the IOMMU driver is the only
>> agent who ultimately knows what calls are going to be necessary for whatever
>> operation it's performing on its own hardware*; it's just that for SMMU it
>> needs to be implemented in a way that has zero impact on the cases where it
>> doesn't matter, because it's not viable to specialise that driver for any
>> particular IP implementation/use-case.
> 
> Still, exactly the same holds for the low power embedded use cases,
> where we strive for the lowest possible power consumption, while
> maintaining performance levels high as well. And so the SMMU code is
> expected to also work with our use cases, such as V4L2 or DRM drivers.
> Since these points don't hold for current SMMU code, I could say that
> the it has been already specialized for large, distributed
> implementations.

Heh, really it's specialised for ease of maintenance in terms of doing 
as little as we can get away with, but for what we have implemented, 
fast code does save CPU cycles and power on embedded systems too ;)

Robin.
Tomasz Figa Feb. 22, 2018, 2:12 p.m. UTC | #25
On Thu, Feb 22, 2018 at 10:45 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> [sorry, I had intended to reply sooner but clearly forgot]
>
>
> On 16/02/18 00:13, Tomasz Figa wrote:
>>
>> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@arm.com>
>> wrote:
>>>
>>> On 15/02/18 04:17, Tomasz Figa wrote:
>>> [...]
>>>>>
>>>>>
>>>>> Could you elaborate on what kind of locking you are concerned about?
>>>>> As I explained before, the normally happening fast path would lock
>>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>>> PM usage counter.
>>>>
>>>>
>>>>
>>>> My bad, that's not even it.
>>>>
>>>> The atomic usage counter is incremented beforehands, without any
>>>> locking [1] and the spinlock is acquired only for the sake of
>>>> validating that device's runtime PM state remained valid indeed [2],
>>>> which would be the case in the fast path of the same driver doing two
>>>> mappings in parallel, with the master powered on (and so the SMMU,
>>>> through device links; if master was not powered on already, powering
>>>> on the SMMU is unavoidable anyway and it would add much more latency
>>>> than the spinlock itself).
>>>
>>>
>>>
>>> We now have no locking at all in the map path, and only a per-domain lock
>>> around TLB sync in unmap which is unfortunately necessary for
>>> correctness;
>>> the latter isn't too terrible, since in "serious" hardware it should only
>>> be
>>> serialising a few cpus serving the same device against each other (e.g.
>>> for
>>> multiple queues on a single NIC).
>>>
>>> Putting in a global lock which serialises *all* concurrent map and unmap
>>> calls for *all* unrelated devices makes things worse. Period. Even if the
>>> lock itself were held for the minimum possible time, i.e. trivially
>>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing
>>> that
>>> one cache line around between 96 CPUs across two sockets is not
>>> negligible.
>>
>>
>> Fair enough. Note that we're in a quite interesting situation now:
>>   a) We need to have runtime PM enabled on Qualcomm SoC to have power
>> properly managed,
>>   b) We need to have lock-free map/unmap on such distributed systems,
>>   c) If runtime PM is enabled, we need to call into runtime PM from any
>> code that does hardware accesses, otherwise the IOMMU API (and so DMA
>> API and then any V4L2 driver) becomes unusable.
>>
>> I can see one more way that could potentially let us have all the
>> three. How about enabling runtime PM only on selected implementations
>> (e.g. qcom,smmu) and then having all the runtime PM calls surrounded
>> with if (pm_runtime_enabled()), which is lockless?
>
>
> Yes, that's the kind of thing I was gravitating towards - my vague thought
> was adding some flag to the smmu_domain, but pm_runtime_enabled() does look
> conceptually a lot cleaner.

Great, thanks. Looks like we're in agreement now. \o/

Vivek, does this sound reasonable to you?

Best regards,
Tomasz
Vivek Gautam Feb. 22, 2018, 5:24 p.m. UTC | #26
Hi,

On Thu, Feb 22, 2018 at 7:42 PM, Tomasz Figa <tfiga@chromium.org> wrote:
> On Thu, Feb 22, 2018 at 10:45 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>> [sorry, I had intended to reply sooner but clearly forgot]
>>
>>
>> On 16/02/18 00:13, Tomasz Figa wrote:
>>>
>>> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@arm.com>
>>> wrote:
>>>>
>>>> On 15/02/18 04:17, Tomasz Figa wrote:
>>>> [...]
>>>>>>
>>>>>>
>>>>>> Could you elaborate on what kind of locking you are concerned about?
>>>>>> As I explained before, the normally happening fast path would lock
>>>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>>>> PM usage counter.
>>>>>
>>>>>
>>>>>
>>>>> My bad, that's not even it.
>>>>>
>>>>> The atomic usage counter is incremented beforehands, without any
>>>>> locking [1] and the spinlock is acquired only for the sake of
>>>>> validating that device's runtime PM state remained valid indeed [2],
>>>>> which would be the case in the fast path of the same driver doing two
>>>>> mappings in parallel, with the master powered on (and so the SMMU,
>>>>> through device links; if master was not powered on already, powering
>>>>> on the SMMU is unavoidable anyway and it would add much more latency
>>>>> than the spinlock itself).
>>>>
>>>>
>>>>
>>>> We now have no locking at all in the map path, and only a per-domain lock
>>>> around TLB sync in unmap which is unfortunately necessary for
>>>> correctness;
>>>> the latter isn't too terrible, since in "serious" hardware it should only
>>>> be
>>>> serialising a few cpus serving the same device against each other (e.g.
>>>> for
>>>> multiple queues on a single NIC).
>>>>
>>>> Putting in a global lock which serialises *all* concurrent map and unmap
>>>> calls for *all* unrelated devices makes things worse. Period. Even if the
>>>> lock itself were held for the minimum possible time, i.e. trivially
>>>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing
>>>> that
>>>> one cache line around between 96 CPUs across two sockets is not
>>>> negligible.
>>>
>>>
>>> Fair enough. Note that we're in a quite interesting situation now:
>>>   a) We need to have runtime PM enabled on Qualcomm SoC to have power
>>> properly managed,
>>>   b) We need to have lock-free map/unmap on such distributed systems,
>>>   c) If runtime PM is enabled, we need to call into runtime PM from any
>>> code that does hardware accesses, otherwise the IOMMU API (and so DMA
>>> API and then any V4L2 driver) becomes unusable.
>>>
>>> I can see one more way that could potentially let us have all the
>>> three. How about enabling runtime PM only on selected implementations
>>> (e.g. qcom,smmu) and then having all the runtime PM calls surrounded
>>> with if (pm_runtime_enabled()), which is lockless?
>>
>>
>> Yes, that's the kind of thing I was gravitating towards - my vague thought
>> was adding some flag to the smmu_domain, but pm_runtime_enabled() does look
>> conceptually a lot cleaner.
>
> Great, thanks. Looks like we're in agreement now. \o/
>
> Vivek, does this sound reasonable to you?

Yea, sound good to me. I will respin the patches.

Thanks & Regards
Vivek

>
> Best regards,
> Tomasz
diff mbox

Patch

diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index b23d33622f37..1ab629bbee69 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -40,9 +40,9 @@  static int msm_iommu_attach(struct msm_mmu *mmu, const char * const *names,
 	struct msm_iommu *iommu = to_msm_iommu(mmu);
 	int ret;
 
-	pm_runtime_get_sync(mmu->dev);
+	pm_runtime_get_suppliers(mmu->dev);
 	ret = iommu_attach_device(iommu->domain, mmu->dev);
-	pm_runtime_put_sync(mmu->dev);
+	pm_runtime_put_suppliers(mmu->dev);
 
 	return ret;
 }
@@ -52,9 +52,9 @@  static void msm_iommu_detach(struct msm_mmu *mmu, const char * const *names,
 {
 	struct msm_iommu *iommu = to_msm_iommu(mmu);
 
-	pm_runtime_get_sync(mmu->dev);
+	pm_runtime_get_suppliers(mmu->dev);
 	iommu_detach_device(iommu->domain, mmu->dev);
-	pm_runtime_put_sync(mmu->dev);
+	pm_runtime_put_suppliers(mmu->dev);
 }
 
 static int msm_iommu_map(struct msm_mmu *mmu, uint64_t iova,
@@ -63,9 +63,9 @@  static int msm_iommu_map(struct msm_mmu *mmu, uint64_t iova,
 	struct msm_iommu *iommu = to_msm_iommu(mmu);
 	size_t ret;
 
-//	pm_runtime_get_sync(mmu->dev);
+	pm_runtime_get_suppliers(mmu->dev);
 	ret = iommu_map_sg(iommu->domain, iova, sgt->sgl, sgt->nents, prot);
-//	pm_runtime_put_sync(mmu->dev);
+	pm_runtime_put_suppliers(mmu->dev);
 	WARN_ON(ret < 0);
 
 	return (ret == len) ? 0 : -EINVAL;
@@ -76,9 +76,9 @@  static int msm_iommu_unmap(struct msm_mmu *mmu, uint64_t iova,
 {
 	struct msm_iommu *iommu = to_msm_iommu(mmu);
 
-	pm_runtime_get_sync(mmu->dev);
+	pm_runtime_get_suppliers(mmu->dev);
 	iommu_unmap(iommu->domain, iova, len);
-	pm_runtime_put_sync(mmu->dev);
+	pm_runtime_put_suppliers(mmu->dev);
 
 	return 0;
 }