TTM Locking order of bo::reserve -> vm::mmap_sem

Message ID	5264EA79.3050505@vmware.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces+patchwork-dri-devel=patchwork.kernel.org@lists.freedesktop.org> Message-ID: <5264EA79.3050505@vmware.com> Date: Mon, 21 Oct 2013 10:48:57 +0200 From: Thomas Hellstrom <thellstrom@vmware.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Maarten Lankhorst <maarten.lankhorst@canonical.com> Subject: TTM Locking order of bo::reserve -> vm::mmap_sem Content-Type: multipart/mixed; boundary="------------070303050700000509040803" Cc: "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org> Precedence: list Sender: dri-devel-bounces+patchwork-dri-devel=patchwork.kernel.org@lists.freedesktop.org Errors-To: dri-devel-bounces+patchwork-dri-devel=patchwork.kernel.org@lists.freedesktop.org

Thomas Hellstrom Oct. 21, 2013, 8:48 a.m. UTC

Hi!

As discussed previously the current locking order in TTM of these locks 
is bo::reserve -> vm::mmap_sem. This leads to a hack in
the TTM fault() handle to try and revert the locking order. If a 
tryreserve failed, we tried to have the vm code release the mmap_sem() 
and then schedule, to give the holder of bo::reserve a chance to release 
the lock. This solution is no longer legal, since we've been more or 
less kindly asked to remove the set_need_resched() call.

Maarten has proposed to invert the locking order. I've previously said I 
had no strong preference. The current locking order dates back from the 
time when TTM wasn't using unmap_mapping_range() but walked the page 
tables itself, updating PTEs as needed. Furthermore it was needed for 
user bos that used get_user_pages() in the TTM populate and swap-in 
methods. User-bos were removed some time ago but I'm looking at 
re-adding them. They would suite the VMware model of cached-only pages 
very well. I see uses both in the gallium API, XA's DMA functionality 
and openCL.

We would then need a somewhat nicer way to invert the locking order. 
I've attached a solution that ups the mmap_sem and then reserves, but 
due to how the fault API is done, we then need to release the reserve 
and retry the fault. This of course opens up for starvation, but I don't 
think starvation at this point is very likely: One thread being refused 
to write or read from a buffer object because the GPU is continously 
busy with it. If this *would* become a problem, it's probably possible 
to modify the fault code to allow us to hold locks until the retried 
fault, but that would be a bit invasive, since it touches the arch code....

Basically I'm proposing to keep the current locking order.


/Thomas

Maarten Lankhorst Oct. 21, 2013, 9:01 a.m. UTC | #1

op 21-10-13 10:48, Thomas Hellstrom schreef:
> Hi!
>
> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>
> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>
> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>
> Basically I'm proposing to keep the current locking order.

I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
no extra locking is needed for get_user_pages. Releasing it is a bit silly. I think we should keep mmap_sem as outer
lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.

~Maarten

Jakob Bornecrantz Oct. 21, 2013, 9:04 a.m. UTC | #2

On Mon, Oct 21, 2013 at 10:48 AM, Thomas Hellstrom
<thellstrom@vmware.com> wrote:
> Hi!
>
> As discussed previously the current locking order in TTM of these locks is
> bo::reserve -> vm::mmap_sem. This leads to a hack in
> the TTM fault() handle to try and revert the locking order. If a tryreserve
> failed, we tried to have the vm code release the mmap_sem() and then
> schedule, to give the holder of bo::reserve a chance to release the lock.
> This solution is no longer legal, since we've been more or less kindly asked
> to remove the set_need_resched() call.
>
> Maarten has proposed to invert the locking order. I've previously said I had
> no strong preference. The current locking order dates back from the time
> when TTM wasn't using unmap_mapping_range() but walked the page tables
> itself, updating PTEs as needed. Furthermore it was needed for user bos that
> used get_user_pages() in the TTM populate and swap-in methods. User-bos were
> removed some time ago but I'm looking at re-adding them. They would suite
> the VMware model of cached-only pages very well. I see uses both in the
> gallium API, XA's DMA functionality and openCL.

In particular OpenCL's clEnqueueWriteBuffer[1] function and friends
which can move quite a bit of data in and out of VRAM.

>
> We would then need a somewhat nicer way to invert the locking order. I've
> attached a solution that ups the mmap_sem and then reserves, but due to how
> the fault API is done, we then need to release the reserve and retry the
> fault. This of course opens up for starvation, but I don't think starvation
> at this point is very likely: One thread being refused to write or read from
> a buffer object because the GPU is continously busy with it. If this *would*
> become a problem, it's probably possible to modify the fault code to allow
> us to hold locks until the retried fault, but that would be a bit invasive,
> since it touches the arch code....
>
> Basically I'm proposing to keep the current locking order.

Cheers, Jakob.

[1] http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueWriteBuffer.html

Thomas Hellstrom Oct. 21, 2013, 9:37 a.m. UTC | #3

On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
> op 21-10-13 10:48, Thomas Hellstrom schreef:
>> Hi!
>>
>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>
>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>
>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>
>> Basically I'm proposing to keep the current locking order.
> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
> no extra locking is needed for get_user_pages.

Typically, they are populated outside of fault, as part of execbuf, 
where we don't hold and don't want to hold mmap_sem(). In fact,
user bo's should not be remappable through the TTM VM system. Anyway, we 
need to grab the mmap_sem inside ttm_populate for user buffers.

>   Releasing it is a bit silly. I think we should keep mmap_sem as outer
> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>
> ~Maarten
Using DMA API analogy, user BOs correspond to using streaming DMA 
whereas normal BOs correspond to alloced DMA memory buffers.
We boost performance and save resources.

Thanks,
/Thomas


>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel

Maarten Lankhorst Oct. 21, 2013, 9:48 a.m. UTC | #4

op 21-10-13 11:37, Thomas Hellstrom schreef:
> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>> Hi!
>>>
>>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>>
>>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>>
>>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>>
>>> Basically I'm proposing to keep the current locking order.
>> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
>> no extra locking is needed for get_user_pages.
>
> Typically, they are populated outside of fault, as part of execbuf, where we don't hold and don't want to hold mmap_sem(). In fact,
> user bo's should not be remappable through the TTM VM system. Anyway, we need to grab the mmap_sem inside ttm_populate for user buffers.
If we don't allow mmapping user bo's through TTM, we can use special lockdep annotation when user-bo's are used. Normal bo's would have
mmap_sem outer lock, bo::reserve inner lock, while those bo's would have the other way around.

This might complicate validation a little, since you would have to reserve and validate all user-bo's before any normal bo's are reserved. But since this
is meant to be a vmwgfx specific optimization I think it might be worth it.

>>   Releasing it is a bit silly. I think we should keep mmap_sem as outer
>> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
>> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>>
>> ~Maarten
> Using DMA API analogy, user BOs correspond to using streaming DMA whereas normal BOs correspond to alloced DMA memory buffers.
> We boost performance and save resources.
Yeah but it's vmwgfx specific. Nouveau and radeon have dedicated copy engines that can be used. Flushing the vm and stalling is probably more
expensive than performing a memcpy.

Thomas Hellstrom Oct. 21, 2013, 10:10 a.m. UTC | #5

On 10/21/2013 11:48 AM, Maarten Lankhorst wrote:
> op 21-10-13 11:37, Thomas Hellstrom schreef:
>> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>>> Hi!
>>>>
>>>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>>>
>>>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>>>
>>>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>>>
>>>> Basically I'm proposing to keep the current locking order.
>>> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
>>> no extra locking is needed for get_user_pages.
>> Typically, they are populated outside of fault, as part of execbuf, where we don't hold and don't want to hold mmap_sem(). In fact,
>> user bo's should not be remappable through the TTM VM system. Anyway, we need to grab the mmap_sem inside ttm_populate for user buffers.
> If we don't allow mmapping user bo's through TTM, we can use special lockdep annotation when user-bo's are used. Normal bo's would have
> mmap_sem outer lock, bo::reserve inner lock, while those bo's would have the other way around.
>
> This might complicate validation a little, since you would have to reserve and validate all user-bo's before any normal bo's are reserved. But since this
> is meant to be a vmwgfx specific optimization I think it might be worth it.

Would that work (lockdep-wise) with user BO swapout as part of a normal 
BO validation? During user BO swapout, we don't need mmap_sem, but the 
BO needs to be reserved. But I guess it would work, since we use 
tryreserve when walking the LRU lists?

>
>>>    Releasing it is a bit silly. I think we should keep mmap_sem as outer
>>> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
>>> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>>>
>>> ~Maarten
>> Using DMA API analogy, user BOs correspond to using streaming DMA whereas normal BOs correspond to alloced DMA memory buffers.
>> We boost performance and save resources.
> Yeah but it's vmwgfx specific. Nouveau and radeon have dedicated copy engines that can be used. Flushing the vm and stalling is probably more
> expensive than performing a memcpy
In the end, I'm not sure it will be vmwgfx specific once there is a 
working example, and there are user-space APIs that will benefit from 
it. There are other examples out there today that uses streaming DMA to 
feed the DMA engines, although not through TTM, see for example 
via_dmablit.c.

Also if we need a separate locking order for User BOs, what would be the 
big benefit of having the locking order mmap_sem()->bo_reserve() ?

/Thomas

Maarten Lankhorst Oct. 21, 2013, 10:24 a.m. UTC | #6

op 21-10-13 12:10, Thomas Hellstrom schreef:
> On 10/21/2013 11:48 AM, Maarten Lankhorst wrote:
>> op 21-10-13 11:37, Thomas Hellstrom schreef:
>>> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>>>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>>>> Hi!
>>>>>
>>>>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>>>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>>>>
>>>>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>>>>
>>>>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>>>>
>>>>> Basically I'm proposing to keep the current locking order.
>>>> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
>>>> no extra locking is needed for get_user_pages.
>>> Typically, they are populated outside of fault, as part of execbuf, where we don't hold and don't want to hold mmap_sem(). In fact,
>>> user bo's should not be remappable through the TTM VM system. Anyway, we need to grab the mmap_sem inside ttm_populate for user buffers.
>> If we don't allow mmapping user bo's through TTM, we can use special lockdep annotation when user-bo's are used. Normal bo's would have
>> mmap_sem outer lock, bo::reserve inner lock, while those bo's would have the other way around.
>>
>> This might complicate validation a little, since you would have to reserve and validate all user-bo's before any normal bo's are reserved. But since this
>> is meant to be a vmwgfx specific optimization I think it might be worth it.
>
> Would that work (lockdep-wise) with user BO swapout as part of a normal BO validation? During user BO swapout, we don't need mmap_sem, but the BO needs to be reserved. But I guess it would work, since we use tryreserve when walking the LRU lists?
Correct.

>>
>>>>    Releasing it is a bit silly. I think we should keep mmap_sem as outer
>>>> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
>>>> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>>>>
>>>> ~Maarten
>>> Using DMA API analogy, user BOs correspond to using streaming DMA whereas normal BOs correspond to alloced DMA memory buffers.
>>> We boost performance and save resources.
>> Yeah but it's vmwgfx specific. Nouveau and radeon have dedicated copy engines that can be used. Flushing the vm and stalling is probably more
>> expensive than performing a memcpy
> In the end, I'm not sure it will be vmwgfx specific once there is a working example, and there are user-space APIs that will benefit from it. There are other examples out there today that uses streaming DMA to feed the DMA engines, although not through TTM, see for example via_dmablit.c.
>
> Also if we need a separate locking order for User BOs, what would be the big benefit of having the locking order mmap_sem()->bo_reserve() ?
Deterministic locking. I fear about livelocks that could happen otherwise with the right timing. Especially but not exclusively on -rt kernels.

~Maarten

Thomas Hellstrom Oct. 21, 2013, 11:10 a.m. UTC | #7

On 10/21/2013 12:24 PM, Maarten Lankhorst wrote:
> op 21-10-13 12:10, Thomas Hellstrom schreef:
>> On 10/21/2013 11:48 AM, Maarten Lankhorst wrote:
>>> op 21-10-13 11:37, Thomas Hellstrom schreef:
>>>> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>>>>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>>>>> Hi!
>>>>>>
>>>>>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>>>>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>>>>>
>>>>>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>>>>>
>>>>>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>>>>>
>>>>>> Basically I'm proposing to keep the current locking order.
>>>>> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
>>>>> no extra locking is needed for get_user_pages.
>>>> Typically, they are populated outside of fault, as part of execbuf, where we don't hold and don't want to hold mmap_sem(). In fact,
>>>> user bo's should not be remappable through the TTM VM system. Anyway, we need to grab the mmap_sem inside ttm_populate for user buffers.
>>> If we don't allow mmapping user bo's through TTM, we can use special lockdep annotation when user-bo's are used. Normal bo's would have
>>> mmap_sem outer lock, bo::reserve inner lock, while those bo's would have the other way around.
>>>
>>> This might complicate validation a little, since you would have to reserve and validate all user-bo's before any normal bo's are reserved. But since this
>>> is meant to be a vmwgfx specific optimization I think it might be worth it.
>> Would that work (lockdep-wise) with user BO swapout as part of a normal BO validation? During user BO swapout, we don't need mmap_sem, but the BO needs to be reserved. But I guess it would work, since we use tryreserve when walking the LRU lists?
> Correct.
>
>>>>>     Releasing it is a bit silly. I think we should keep mmap_sem as outer
>>>>> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
>>>>> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>>>>>
>>>>> ~Maarten
>>>> Using DMA API analogy, user BOs correspond to using streaming DMA whereas normal BOs correspond to alloced DMA memory buffers.
>>>> We boost performance and save resources.
>>> Yeah but it's vmwgfx specific. Nouveau and radeon have dedicated copy engines that can be used. Flushing the vm and stalling is probably more
>>> expensive than performing a memcpy
>> In the end, I'm not sure it will be vmwgfx specific once there is a working example, and there are user-space APIs that will benefit from it. There are other examples out there today that uses streaming DMA to feed the DMA engines, although not through TTM, see for example via_dmablit.c.
>>
>> Also if we need a separate locking order for User BOs, what would be the big benefit of having the locking order mmap_sem()->bo_reserve() ?
> Deterministic locking. I fear about livelocks that could happen otherwise with the right timing. Especially but not exclusively on -rt kernels.
But livelocks wouldn't be an issue anymore since we use a waiting 
reserve in the retry path, right? The only issue we might theoretically 
be facing is starvation in the fault path.

With a two-step validation scheme I think we're up to real issues, like 
bouncing ordinary BOs in and out of swap during a single execbuf...

/Thomas


>
> ~Maarten

Maarten Lankhorst Oct. 21, 2013, 1:21 p.m. UTC | #8

op 21-10-13 13:10, Thomas Hellstrom schreef:
> On 10/21/2013 12:24 PM, Maarten Lankhorst wrote:
>> op 21-10-13 12:10, Thomas Hellstrom schreef:
>>> On 10/21/2013 11:48 AM, Maarten Lankhorst wrote:
>>>> op 21-10-13 11:37, Thomas Hellstrom schreef:
>>>>> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>>>>>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>>>>>> Hi!
>>>>>>>
>>>>>>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>>>>>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>>>>>>
>>>>>>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>>>>>>
>>>>>>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>>>>>>
>>>>>>> Basically I'm proposing to keep the current locking order.
>>>>>> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
>>>>>> no extra locking is needed for get_user_pages.
>>>>> Typically, they are populated outside of fault, as part of execbuf, where we don't hold and don't want to hold mmap_sem(). In fact,
>>>>> user bo's should not be remappable through the TTM VM system. Anyway, we need to grab the mmap_sem inside ttm_populate for user buffers.
>>>> If we don't allow mmapping user bo's through TTM, we can use special lockdep annotation when user-bo's are used. Normal bo's would have
>>>> mmap_sem outer lock, bo::reserve inner lock, while those bo's would have the other way around.
>>>>
>>>> This might complicate validation a little, since you would have to reserve and validate all user-bo's before any normal bo's are reserved. But since this
>>>> is meant to be a vmwgfx specific optimization I think it might be worth it.
>>> Would that work (lockdep-wise) with user BO swapout as part of a normal BO validation? During user BO swapout, we don't need mmap_sem, but the BO needs to be reserved. But I guess it would work, since we use tryreserve when walking the LRU lists?
>> Correct.
>>
>>>>>>     Releasing it is a bit silly. I think we should keep mmap_sem as outer
>>>>>> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
>>>>>> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>>>>>>
>>>>>> ~Maarten
>>>>> Using DMA API analogy, user BOs correspond to using streaming DMA whereas normal BOs correspond to alloced DMA memory buffers.
>>>>> We boost performance and save resources.
>>>> Yeah but it's vmwgfx specific. Nouveau and radeon have dedicated copy engines that can be used. Flushing the vm and stalling is probably more
>>>> expensive than performing a memcpy
>>> In the end, I'm not sure it will be vmwgfx specific once there is a working example, and there are user-space APIs that will benefit from it. There are other examples out there today that uses streaming DMA to feed the DMA engines, although not through TTM, see for example via_dmablit.c.
>>>
>>> Also if we need a separate locking order for User BOs, what would be the big benefit of having the locking order mmap_sem()->bo_reserve() ?
>> Deterministic locking. I fear about livelocks that could happen otherwise with the right timing. Especially but not exclusively on -rt kernels.
> But livelocks wouldn't be an issue anymore since we use a waiting reserve in the retry path, right? The only issue we might theoretically be facing is starvation in the fault path.
>
> With a two-step validation scheme I think we're up to real issues, like bouncing ordinary BOs in and out of swap during a single execbuf...
I would need to see some code, but I think it can be avoided by having a slowpath similar to i915 if it fails.

Anyway have 2 threads fault on the same bo in different offsets, so fault handler is called twice on the same bo.

You could end up a loop, first one grabbed the bo lock without mmap_sem, second one does trylock, which fails then unlocks mmap_sem, and acquires the bo lock again. At which point the first thread acquired mmap_sem and does a trylock. So the same thing can theoretically happen infinitely over and over again.

~Maarten

Thomas Hellstrom Oct. 21, 2013, 6:20 p.m. UTC | #9

On 10/21/2013 03:21 PM, Maarten Lankhorst wrote:
> op 21-10-13 13:10, Thomas Hellstrom schreef:
>> On 10/21/2013 12:24 PM, Maarten Lankhorst wrote:
>>> op 21-10-13 12:10, Thomas Hellstrom schreef:
>>>> On 10/21/2013 11:48 AM, Maarten Lankhorst wrote:
>>>>> op 21-10-13 11:37, Thomas Hellstrom schreef:
>>>>>> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>>>>>>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> As discussed previously the current locking order in TTM of these locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>>>>>>> the TTM fault() handle to try and revert the locking order. If a tryreserve failed, we tried to have the vm code release the mmap_sem() and then schedule, to give the holder of bo::reserve a chance to release the lock. This solution is no longer legal, since we've been more or less kindly asked to remove the set_need_resched() call.
>>>>>>>>
>>>>>>>> Maarten has proposed to invert the locking order. I've previously said I had no strong preference. The current locking order dates back from the time when TTM wasn't using unmap_mapping_range() but walked the page tables itself, updating PTEs as needed. Furthermore it was needed for user bos that used get_user_pages() in the TTM populate and swap-in methods. User-bos were removed some time ago but I'm looking at re-adding them. They would suite the VMware model of cached-only pages very well. I see uses both in the gallium API, XA's DMA functionality and openCL.
>>>>>>>>
>>>>>>>> We would then need a somewhat nicer way to invert the locking order. I've attached a solution that ups the mmap_sem and then reserves, but due to how the fault API is done, we then need to release the reserve and retry the fault. This of course opens up for starvation, but I don't think starvation at this point is very likely: One thread being refused to write or read from a buffer object because the GPU is continously busy with it. If this *would* become a problem, it's probably possible to modify the fault code to allow us to hold locks until the retried fault, but that would be a bit invasive, since it touches the arch code....
>>>>>>>>
>>>>>>>> Basically I'm proposing to keep the current locking order.
>>>>>>> I'm not sure why we have to worry about mmap_sem lock being taken before bo::reserve. If we already hold mmap_sem,
>>>>>>> no extra locking is needed for get_user_pages.
>>>>>> Typically, they are populated outside of fault, as part of execbuf, where we don't hold and don't want to hold mmap_sem(). In fact,
>>>>>> user bo's should not be remappable through the TTM VM system. Anyway, we need to grab the mmap_sem inside ttm_populate for user buffers.
>>>>> If we don't allow mmapping user bo's through TTM, we can use special lockdep annotation when user-bo's are used. Normal bo's would have
>>>>> mmap_sem outer lock, bo::reserve inner lock, while those bo's would have the other way around.
>>>>>
>>>>> This might complicate validation a little, since you would have to reserve and validate all user-bo's before any normal bo's are reserved. But since this
>>>>> is meant to be a vmwgfx specific optimization I think it might be worth it.
>>>> Would that work (lockdep-wise) with user BO swapout as part of a normal BO validation? During user BO swapout, we don't need mmap_sem, but the BO needs to be reserved. But I guess it would work, since we use tryreserve when walking the LRU lists?
>>> Correct.
>>>
>>>>>>>      Releasing it is a bit silly. I think we should keep mmap_sem as outer
>>>>>>> lock, and have bo::reserve as inner, even if it might complicate support for user-bo's. I'm not sure what you can do
>>>>>>> with user-bo's that can't be done by allocating the same bo from kernel first and map + populate it.
>>>>>>>
>>>>>>> ~Maarten
>>>>>> Using DMA API analogy, user BOs correspond to using streaming DMA whereas normal BOs correspond to alloced DMA memory buffers.
>>>>>> We boost performance and save resources.
>>>>> Yeah but it's vmwgfx specific. Nouveau and radeon have dedicated copy engines that can be used. Flushing the vm and stalling is probably more
>>>>> expensive than performing a memcpy
>>>> In the end, I'm not sure it will be vmwgfx specific once there is a working example, and there are user-space APIs that will benefit from it. There are other examples out there today that uses streaming DMA to feed the DMA engines, although not through TTM, see for example via_dmablit.c.
>>>>
>>>> Also if we need a separate locking order for User BOs, what would be the big benefit of having the locking order mmap_sem()->bo_reserve() ?
>>> Deterministic locking. I fear about livelocks that could happen otherwise with the right timing. Especially but not exclusively on -rt kernels.
>> But livelocks wouldn't be an issue anymore since we use a waiting reserve in the retry path, right? The only issue we might theoretically be facing is starvation in the fault path.
>>
>> With a two-step validation scheme I think we're up to real issues, like bouncing ordinary BOs in and out of swap during a single execbuf...
> I would need to see some code, but I think it can be avoided by having a slowpath similar to i915 if it fails.

There's no slowpath needed, but the following situation might happen. 
Let's assume there is a memory shortage.

1. user_bo reference
2. bo reference
3. user_bo validate. bos are swapped out.
4. bo validate bos are swapped in again.

While it's not an error, it's not a situation we should get into.
>
> Anyway have 2 threads fault on the same bo in different offsets, so fault handler is called twice on the same bo.
>
> You could end up a loop, first one grabbed the bo lock without mmap_sem, second one does trylock, which fails then unlocks mmap_sem, and acquires the bo lock again. At which point the first thread acquired mmap_sem and does a trylock. So the same thing can theoretically happen infinitely over and over again.

Indeed. This situation would be extremely unlikely but could however be 
easily fixed with wait_for_unreserved semantics instead of reserve; 
unreserve. Like in mm/filemap.c line 666 and onwards. Either by an API 
addition or using the current API. The latter would, however require an 
extra mutex per bo.

/Thomas



>
> ~Maarten

TTM Locking order of bo::reserve -> vm::mmap_sem

Commit Message

Comments

Patch