Message ID | 20250217070123.3171232-1-huangjunxian6@hisilicon.com (mailing list archive) |
---|---|
Series | RDMA/hns: Introduce delay-destruction mechanism |
On Mon, Feb 17, 2025 at 03:01:19PM +0800, Junxian Huang wrote:
> When mailboxes for resource(QP/CQ/SRQ) destruction fail, it's unable
> to notify HW about the destruction. In this case, driver will still
> free the resources, while HW may still access them, thus leading to
> a UAF.
>
> This series introduces delay-destruction mechanism to fix such HW UAF,
> including the HW CTX and doorbells.

And why can't you fix FW instead?

Thanks

On 2025/2/19 20:14, Leon Romanovsky wrote:
> And why can't you fix FW instead?

The key is the failure of mailbox, and there are some cases that would
lead to it, which we don't really consider FW bugs.

For example, when some random fatal error like RAS error occurs in FW,
our FW will be reset. Driver's mailbox will fail during the FW reset.

Another case is the mailbox timeout when FW is under heavy load, as it is
shared by multiple functions.

Thanks,
Junxian

On Wed, Feb 19, 2025 at 09:07:36PM +0800, Junxian Huang wrote:
> The key is the failure of mailbox, and there are some cases that would
> lead to it, which we don't really consider FW bugs.
>
> For example, when some random fatal error like RAS error occurs in FW,
> our FW will be reset. Driver's mailbox will fail during the FW reset.

I don't understand this scenario. You said at the beginning that HW can
access host memory and this triggers UAF. However, now you are presenting
a case where the driver tries to access the mailbox.

> Another case is the mailbox timeout when FW is under heavy load, as it is
> shared by multiple functions.

It is not different from any other mailbox errors. FW needs to handle
these cases.

Thanks

On 2025/2/19 22:35, Leon Romanovsky wrote:
> I don't understand this scenario. You said at the beginning that HW can
> access host memory and this triggers UAF. However, now you are presenting
> a case where the driver tries to access the mailbox.

No, I'm saying that mailbox errors are the cause of the HW UAF. Let me
explain this scenario in more detail.

Driver notifies HW about the memory release with mailbox. The procedure
of a mailbox is:
a) driver posts the mailbox to FW
b) FW writes the mailbox data into HW

In this scenario, step a) will fail due to the FW reset, HW won't get
notified and thus may lead to UAF.

Junxian

On Thu, Feb 20, 2025 at 11:48:49AM +0800, Junxian Huang wrote:
> Driver notifies HW about the memory release with mailbox. The procedure
> of a mailbox is:
> a) driver posts the mailbox to FW
> b) FW writes the mailbox data into HW
>
> In this scenario, step a) will fail due to the FW reset, HW won't get
> notified and thus may lead to UAF.

Exactly, FW performed a reset and didn't prevent HW from accessing it.

Thanks

On 2025/2/20 15:32, Leon Romanovsky wrote:
> Exactly, FW performed a reset and didn't prevent HW from accessing it.

Yes, but the problem is that our HW doesn't provide a method to prevent
the access. There's nothing FW can do in this scenario, so we can only
prevent UAF by adding this code in the driver.

Thanks,
Junxian

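To make the driver-side idea in this message concrete, here is a minimal user-space C sketch of delayed destruction. It is not the hns driver code and not taken from the series: `struct res`, `struct dev`, `post_destroy_mbox()`, `destroy_res()`, `flush_deferred()` and `alloc_res()` are hypothetical names, and the thread does not say at which point the real patches release the deferred memory. The sketch only shows the principle: when the destroy mailbox fails, the memory is parked on a deferred list instead of being freed, and it is released later once HW is assumed to be quiesced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical resource, standing in for a QP/CQ/SRQ context buffer. */
struct res {
    void *hw_buf;       /* memory the HW may still access */
    struct res *next;   /* link in the deferred-free list */
};

/* Hypothetical device state for this model. */
struct dev {
    bool mbox_ok;          /* models whether the destroy mailbox succeeds */
    struct res *deferred;  /* resources whose destroy mailbox failed */
    size_t n_deferred;
};

/* Models posting the destroy mailbox: 0 on success, -1 on failure. */
static int post_destroy_mbox(struct dev *d, struct res *r)
{
    (void)r;
    return d->mbox_ok ? 0 : -1;
}

/* Returns true if the resource was freed now, false if it was deferred. */
static bool destroy_res(struct dev *d, struct res *r)
{
    if (post_destroy_mbox(d, r) == 0) {
        /* HW was told about the destruction, so freeing is safe. */
        free(r->hw_buf);
        free(r);
        return true;
    }
    /* HW was not notified: keep the memory alive to avoid a HW UAF. */
    r->next = d->deferred;
    d->deferred = r;
    d->n_deferred++;
    return false;
}

/* Call only when HW can no longer touch the memory (assumption of this model). */
static size_t flush_deferred(struct dev *d)
{
    size_t freed = 0;

    while (d->deferred) {
        struct res *r = d->deferred;

        d->deferred = r->next;
        free(r->hw_buf);
        free(r);
        freed++;
    }
    assert(freed == d->n_deferred);
    d->n_deferred = 0;
    return freed;
}

/* Allocates a resource with a small dummy buffer; NULL on failure. */
static struct res *alloc_res(void)
{
    struct res *r = calloc(1, sizeof(*r));

    if (!r)
        return NULL;
    r->hw_buf = malloc(64);
    if (!r->hw_buf) {
        free(r);
        return NULL;
    }
    return r;
}
```

The cost of this approach is that memory stays pinned for as long as the mailbox keeps failing, which is part of what the reviewers question in the rest of the thread.
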
On Thu, Feb 20, 2025 at 04:45:54PM +0800, Junxian Huang wrote:
> Yes, but the problem is that our HW doesn't provide a method to prevent
> the access. There's nothing FW can do in this scenario, so we can only
> prevent UAF by adding this code in the driver.

Somehow HW doesn't access mailbox if destroy was successful, so why
can't FW use the same "method" to inform HW before reset?

Thanks

On 2025/2/20 17:08, Leon Romanovsky wrote:
> Somehow HW doesn't access mailbox if destroy was successful, so why
> can't FW use the same "method" to inform HW before reset?

Mailbox carries information of the specific resource (QP/CQ/SRQ/MR)
that are being destroyed. It's impossible for FW to predict which
QP/CQ/SRQ/MR will be destroyed by driver during reset before the
reset starts.

Thanks,
Junxian

On Thu, Feb 20, 2025 at 11:48:49AM +0800, Junxian Huang wrote:
> Driver notifies HW about the memory release with mailbox. The procedure
> of a mailbox is:
> a) driver posts the mailbox to FW
> b) FW writes the mailbox data into HW
>
> In this scenario, step a) will fail due to the FW reset, HW won't get
> notified and thus may lead to UAF.

That's just wrong, a FW reset must fully stop and sanitize the HW as
well. You can't have HW running rogue with no way for FW to control it
anymore.

Jason

On Thu, Feb 20, 2025 at 07:05:06PM +0800, Junxian Huang wrote:
> Mailbox carries information of the specific resource (QP/CQ/SRQ/MR)
> that are being destroyed. It's impossible for FW to predict which
> QP/CQ/SRQ/MR will be destroyed by driver during reset before the
> reset starts.

That doesn't make any sense, the device reset is supposed to clean up
everything. It doesn't matter what the mailbox was doing, after the
reset finishes it is no longer necessary because the reset was the
thing that cleaned it up.

You need a way to track the reset completion and cancel all
outstanding commands with a reset failure so cleanup can happen.

Combined with disassociate and some other locking you need to create a
strong fence across the reset where there is no leakage of 'before'
and 'after' reset objects and kernel state.

Jason

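The reset fence described in this message could look roughly like the following user-space C model. This is only a sketch under stated assumptions, not kernel or hns code: `struct dev`, `struct cmd`, `struct obj`, `post_cmd()`, `reset_begin()`, `reset_done()` and `obj_needs_hw_destroy()` are hypothetical names, the choice of `-EIO` as the "reset failure" status is an assumption, and the locking and disassociate handling mentioned above are left out. It shows two things: outstanding commands are cancelled with an error when a reset starts, and a generation counter separates objects created before the reset from those created after it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CMDS 8

enum cmd_state { CMD_FREE, CMD_PENDING, CMD_DONE };

/* Hypothetical mailbox command slot. */
struct cmd {
    enum cmd_state state;
    int status;             /* 0 on success, negative errno on failure */
};

/* Hypothetical device state for this model. */
struct dev {
    unsigned int reset_gen; /* bumped each time a reset completes */
    bool in_reset;
    struct cmd cmds[MAX_CMDS];
};

/* Hypothetical object (QP/CQ/SRQ/MR) tagged with the generation it was created in. */
struct obj {
    unsigned int gen;
};

/* Posts a command; returns NULL while a reset is in progress or if no slot is free. */
static struct cmd *post_cmd(struct dev *d)
{
    if (d->in_reset)
        return NULL;
    for (size_t i = 0; i < MAX_CMDS; i++) {
        if (d->cmds[i].state == CMD_FREE) {
            d->cmds[i].state = CMD_PENDING;
            d->cmds[i].status = 0;
            return &d->cmds[i];
        }
    }
    return NULL;
}

/* Starts a reset and cancels every outstanding command with a reset failure. */
static size_t reset_begin(struct dev *d)
{
    size_t cancelled = 0;

    d->in_reset = true;
    for (size_t i = 0; i < MAX_CMDS; i++) {
        if (d->cmds[i].state == CMD_PENDING) {
            d->cmds[i].state = CMD_DONE;
            d->cmds[i].status = -EIO;   /* assumed "failed by reset" code */
            cancelled++;
        }
    }
    return cancelled;
}

/* Marks the reset as finished; everything from the old generation is gone in HW. */
static void reset_done(struct dev *d)
{
    assert(d->in_reset);
    d->reset_gen++;
    d->in_reset = false;
}

/* An object from an older generation was already cleaned up by the reset. */
static bool obj_needs_hw_destroy(const struct dev *d, const struct obj *o)
{
    return o->gen == d->reset_gen;
}
```

In this model a destroy path would skip the mailbox for any object whose generation predates the last completed reset and free its memory directly, since the reset is assumed to have stopped the HW.
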