[for-next,0/4] RDMA/hns: Introduce delay-destruction mechanism

Message ID 20250217070123.3171232-1-huangjunxian6@hisilicon.com

Message

Junxian Huang Feb. 17, 2025, 7:01 a.m. UTC
When a mailbox command for resource (QP/CQ/SRQ) destruction fails, the
driver is unable to notify HW about the destruction. In this case, the
driver still frees the resources while HW may still access them, which
leads to a UAF.

This series introduces a delay-destruction mechanism to fix such HW UAFs,
covering the HW CTX and doorbells.

Junxian Huang (2):
  RDMA/hns: Change mtr member to pointer in hns QP/CQ/MR/SRQ/EQ struct
  RDMA/hns: Fix HW doorbell UAF by adding delay-destruction mechanism

wenglianfa (2):
  RDMA/hns: Fix HW CTX UAF by adding delay-destruction mechanism
  Revert "RDMA/hns: Do not destroy QP resources in the hw resetting
    phase"

 drivers/infiniband/hw/hns/hns_roce_cq.c       |  34 +++--
 drivers/infiniband/hw/hns/hns_roce_db.c       |  91 ++++++++++----
 drivers/infiniband/hw/hns/hns_roce_device.h   |  73 ++++++++---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c    |  97 +++++++--------
 drivers/infiniband/hw/hns/hns_roce_main.c     |  13 ++
 drivers/infiniband/hw/hns/hns_roce_mr.c       | 117 ++++++++++++++----
 drivers/infiniband/hw/hns/hns_roce_qp.c       |  30 +++--
 drivers/infiniband/hw/hns/hns_roce_restrack.c |   4 +-
 drivers/infiniband/hw/hns/hns_roce_srq.c      |  45 ++++---
 9 files changed, 348 insertions(+), 156 deletions(-)

--
2.33.0

Comments

Leon Romanovsky Feb. 19, 2025, 12:14 p.m. UTC | #1
On Mon, Feb 17, 2025 at 03:01:19PM +0800, Junxian Huang wrote:
> When a mailbox command for resource (QP/CQ/SRQ) destruction fails, the
> driver is unable to notify HW about the destruction. In this case, the
> driver still frees the resources while HW may still access them, which
> leads to a UAF.

> This series introduces a delay-destruction mechanism to fix such HW UAFs,
> covering the HW CTX and doorbells.

And why can't you fix FW instead?

Thanks
Junxian Huang Feb. 19, 2025, 1:07 p.m. UTC | #2
On 2025/2/19 20:14, Leon Romanovsky wrote:
> And why can't you fix FW instead?
> 

The key is the mailbox failure, and there are some cases that can lead
to it which we don't really consider FW bugs.

For example, when a random fatal error such as a RAS error occurs in FW,
our FW will be reset. The driver's mailbox commands will fail during the
FW reset.

Another case is a mailbox timeout when FW is under heavy load, as it is
shared by multiple functions.

Thanks,
Junxian

Leon Romanovsky Feb. 19, 2025, 2:35 p.m. UTC | #3
On Wed, Feb 19, 2025 at 09:07:36PM +0800, Junxian Huang wrote:
> 
> The key is the mailbox failure, and there are some cases that can lead
> to it which we don't really consider FW bugs.
> 
> For example, when a random fatal error such as a RAS error occurs in FW,
> our FW will be reset. The driver's mailbox commands will fail during the
> FW reset.

I don't understand this scenario. You said at the beginning that HW can
access host memory and that this triggers a UAF. However, now you are
presenting a case where the driver tries to access the mailbox.

> 
> Another case is a mailbox timeout when FW is under heavy load, as it is
> shared by multiple functions.

It is no different from any other mailbox error. FW needs to handle
these cases.

Thanks

Junxian Huang Feb. 20, 2025, 3:48 a.m. UTC | #4
On 2025/2/19 22:35, Leon Romanovsky wrote:
> 
> I don't understand this scenario. You said at the beginning that HW can
> access host memory and that this triggers a UAF. However, now you are
> presenting a case where the driver tries to access the mailbox.
> 

No, I'm saying that mailbox errors are the cause of the HW UAF. Let me
explain this scenario in more detail.

The driver notifies HW about the memory release through a mailbox. The
mailbox procedure is:
	a) the driver posts the mailbox to FW
	b) FW writes the mailbox data into HW

In this scenario, step a) fails due to the FW reset, so HW doesn't get
notified, which may lead to a UAF.
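
The failure mode above, and the delay-destruction idea from the cover
letter, can be sketched in plain C. This is a minimal user-space model
under stated assumptions, not the hns driver code: post_destroy_mailbox(),
destroy_resource(), flush_delayed_list(), alloc_resource() and the
fw_in_reset flag are hypothetical names introduced only for illustration.
Step a) is modelled as failing whenever FW is resetting, and the driver
then parks the buffer on a delayed list instead of freeing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a HW-visible resource (QP/CQ/SRQ context). */
struct resource {
	void *buf;              /* memory that HW may still access */
	struct resource *next;  /* link in the delayed-destruction list */
};

static bool fw_in_reset;              /* assumption: true while FW resets */
static struct resource *delayed_list; /* resources HW may still reference */

/* Step a): the driver posts the destroy mailbox to FW. Fails during reset. */
static int post_destroy_mailbox(const struct resource *res)
{
	(void)res;
	return fw_in_reset ? -1 : 0;
}

/* Allocate a resource with a zeroed list link. */
static struct resource *alloc_resource(size_t size)
{
	struct resource *res = calloc(1, sizeof(*res));

	assert(res);
	res->buf = malloc(size);
	assert(res->buf);
	return res;
}

/* Returns true if the memory was freed now, false if it was deferred. */
static bool destroy_resource(struct resource *res)
{
	if (post_destroy_mailbox(res) == 0) {
		/* HW was notified, so freeing is safe. */
		free(res->buf);
		free(res);
		return true;
	}
	/* HW was never notified: keep the memory alive on the list. */
	res->next = delayed_list;
	delayed_list = res;
	return false;
}

/*
 * Assumed to run only once HW can no longer touch the memory
 * (for example at device teardown). Returns the number freed.
 */
static size_t flush_delayed_list(void)
{
	size_t freed = 0;

	while (delayed_list) {
		struct resource *res = delayed_list;

		delayed_list = res->next;
		free(res->buf);
		free(res);
		freed++;
	}
	return freed;
}
```

The trade-off in this sketch is that deferred memory stays allocated until
the flush point, so a long-lived device with many failed mailboxes would
hold on to that memory for a while.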

Junxian

Leon Romanovsky Feb. 20, 2025, 7:32 a.m. UTC | #5
On Thu, Feb 20, 2025 at 11:48:49AM +0800, Junxian Huang wrote:
> 
> No, I'm saying that mailbox errors are the cause of the HW UAF. Let me
> explain this scenario in more detail.
> 
> The driver notifies HW about the memory release through a mailbox. The
> mailbox procedure is:
> 	a) the driver posts the mailbox to FW
> 	b) FW writes the mailbox data into HW
> 
> In this scenario, step a) fails due to the FW reset, so HW doesn't get
> notified, which may lead to a UAF.

Exactly, FW performed a reset and didn't prevent HW from accessing it.

Thanks

Junxian Huang Feb. 20, 2025, 8:45 a.m. UTC | #6
On 2025/2/20 15:32, Leon Romanovsky wrote:
> 
> Exactly, FW performed a reset and didn't prevent HW from accessing it.
> 

Yes, but the problem is that our HW doesn't provide a method to prevent
the access. There's nothing FW can do in this scenario, so we can only
prevent the UAF by adding this code to the driver.

Thanks,
Junxian

Leon Romanovsky Feb. 20, 2025, 9:08 a.m. UTC | #7
On Thu, Feb 20, 2025 at 04:45:54PM +0800, Junxian Huang wrote:
> 
> Yes, but the problem is that our HW doesn't provide a method to prevent
> the access. There's nothing FW can do in this scenario, so we can only
> prevent the UAF by adding this code to the driver.

Somehow HW doesn't access the mailbox if the destroy was successful, so why
can't FW use the same "method" to inform HW before the reset?

Thanks
Junxian Huang Feb. 20, 2025, 11:05 a.m. UTC | #8
On 2025/2/20 17:08, Leon Romanovsky wrote:
> 
> Somehow HW doesn't access the mailbox if the destroy was successful, so why
> can't FW use the same "method" to inform HW before the reset?
> 

The mailbox carries information about the specific resource (QP/CQ/SRQ/MR)
that is being destroyed. It's impossible for FW to predict, before the
reset starts, which QP/CQ/SRQ/MR the driver will destroy during the
reset.

Thanks,
Junxian

Jason Gunthorpe Feb. 20, 2025, 2:10 p.m. UTC | #9
On Thu, Feb 20, 2025 at 11:48:49AM +0800, Junxian Huang wrote:

> The driver notifies HW about the memory release through a mailbox. The
> mailbox procedure is:
> 	a) the driver posts the mailbox to FW
> 	b) FW writes the mailbox data into HW
> 
> In this scenario, step a) fails due to the FW reset, so HW doesn't get
> notified, which may lead to a UAF.

That's just wrong. A FW reset must fully stop and sanitize the HW as
well. You can't have HW running rogue with no way for FW to control it
anymore.

Jason
Jason Gunthorpe Feb. 20, 2025, 2:13 p.m. UTC | #10
On Thu, Feb 20, 2025 at 07:05:06PM +0800, Junxian Huang wrote:

> The mailbox carries information about the specific resource (QP/CQ/SRQ/MR)
> that is being destroyed. It's impossible for FW to predict, before the
> reset starts, which QP/CQ/SRQ/MR the driver will destroy during the
> reset.

That doesn't make any sense. The device reset is supposed to clean up
everything. It doesn't matter what the mailbox was doing: after the
reset finishes it is no longer necessary, because the reset was the
thing that cleaned it up.

You need a way to track the reset completion and cancel all
outstanding commands with a reset failure so cleanup can
happen. Combined with disassociate and some other locking, you need to
create a strong fence across the reset where there is no leakage of
'before' and 'after' reset objects and kernel state.
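
One way to picture the suggestion above is a reset epoch plus a table of
outstanding commands. The following is a single-threaded sketch under
stated assumptions, with no locking and no real device: struct device,
submit_cmd(), complete_reset() and object_is_stale() are hypothetical
names, and using -EIO as the "reset failure" status is an arbitrary
choice made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_CMDS 8

enum cmd_state { CMD_FREE, CMD_PENDING, CMD_DONE };

struct cmd {
	enum cmd_state state;
	int status;          /* 0 on success, negative errno on failure */
	unsigned int epoch;  /* reset epoch at submission time */
};

struct device {
	unsigned int reset_epoch;  /* bumped on every completed reset */
	struct cmd cmds[MAX_CMDS];
};

/* Submit a command. Returns the slot index, or -EBUSY if none is free. */
int submit_cmd(struct device *dev)
{
	for (int i = 0; i < MAX_CMDS; i++) {
		if (dev->cmds[i].state == CMD_FREE) {
			dev->cmds[i].state = CMD_PENDING;
			dev->cmds[i].status = 0;
			dev->cmds[i].epoch = dev->reset_epoch;
			return i;
		}
	}
	return -EBUSY;
}

/*
 * Reset completion: fail every outstanding command with the reset
 * status, then open a new epoch. Returns the number cancelled.
 */
int complete_reset(struct device *dev)
{
	int cancelled = 0;

	for (int i = 0; i < MAX_CMDS; i++) {
		if (dev->cmds[i].state == CMD_PENDING) {
			dev->cmds[i].state = CMD_DONE;
			dev->cmds[i].status = -EIO;
			cancelled++;
		}
	}
	dev->reset_epoch++;
	return cancelled;
}

/* An object tagged with an older epoch must not be used after the reset. */
int object_is_stale(const struct device *dev, unsigned int obj_epoch)
{
	return obj_epoch != dev->reset_epoch;
}
```

In this model, a destroy command that completes with the reset status can
proceed to cleanup, on the assumption stated in the reply that the reset
itself has already stopped the HW. The epoch check stands in for the
fence between 'before' and 'after' reset objects.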

Jason