Message ID | 20220218005521.172832-3-baolu.lu@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fix BUG_ON in vfio_iommu_group_notifier() | expand |
So we are back to the callback madness instead of the nice and simple flag? Sigh.
On 2022-02-19 07:32, Christoph Hellwig wrote: > So we are back to the callback madness instead of the nice and simple > flag? Sigh. TBH, I *think* this part could be a fair bit simpler. It looks like this whole callback mess is effectively just to decrement group->owner_cnt, but since we should only care about ownership at probe, hotplug, and other places well outside critical fast-paths, I'm not sure we really need to keep track of that anyway - it can always be recalculated by walking the group->devices list, and some of the relevant places have to do that anyway. It should be pretty straightforward for iommu_bus_notifier to clear group->owner automatically upon an unbind of the matching driver when it's no longer bound to any other devices in the group either. And if we still want to entertain the notion of VFIO being able to release ownership without unbinding (I'm not entirely convinced that's a realistically necessary use-case) then it should be up to VFIO to decide when it's finally finished with the whole group, rather than pretending we can keep track of nested ownership claims from inside the API. Furthermore, If Greg was willing to compromise just far enough to let us put driver_managed_dma in the 3-byte hole in the generic struct device_driver, we wouldn't have to have quite so much boilerplate repeated across the various bus implementations (I'm not suggesting to move any actual calls back into the driver core, just the storage of flag itself). FWIW I have some ideas for re-converging .dma_configure in future which I think should probably be able to subsume this into a completely generic common path, given a common flag. Robin.
On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote: > On 2022-02-19 07:32, Christoph Hellwig wrote: > > So we are back to the callback madness instead of the nice and simple > > flag? Sigh. > > TBH, I *think* this part could be a fair bit simpler. It looks like this > whole callback mess is effectively just to decrement > group->owner_cnt, but Right, the new callback is because of Greg's push to put all the work into the existing bus callback. Having symetrical callbacks is cleaner. > since we should only care about ownership at probe, hotplug, and other > places well outside critical fast-paths, I'm not sure we really need to keep > track of that anyway - it can always be recalculated by walking the > group->devices list, It has to be locked against concurrent probe, and there isn't currently any locking scheme that can support this. The owner_cnt is effectively a new lock for this purpose. It is the same issue we talked about with that VFIO patch you showed me. So, using the group->device_list would require adding something else somewhere - which I think should happen when someone has justification for another use of whatever that something else is. Also, Greg's did have an objection to the the first version, with code living in dd.c, that was basically probe time performance. I'm not sure making this slower would really be welcomed.. > and some of the relevant places have to do that anyway. ??? > It has to be s It should be pretty straightforward for > iommu_bus_notifier to clear group->owner automatically upon an > unbind of the matching driver when it's no longer bound to any other > devices in the group either. That not_bound/unbind notifier isn't currently triggred during necessary failure paths of really_probe(). Even if this was patched up, it looks like spaghetti to me.. > use-case) then it should be up to VFIO to decide when it's finally > finished with the whole group, rather than pretending we can keep > track of nested ownership claims from inside the API. What nesting? > Furthermore, If Greg was willing to compromise just far enough to let us put > driver_managed_dma in the 3-byte hole in the generic struct > device_driver, Space was not an issue, the earlier version of this switched an existing bool to a bitfield. > we wouldn't have to have quite so much boilerplate repeated across the > various bus implementations (I'm not suggesting to move any actual calls > back into the driver core, just the storage of flag itself). Not sure that makes sense.. But I don't understand why we need to copy and paste this code into every bus's dma_configure *shrug* > FWIW I have some ideas for re-converging .dma_configure in future > which I think should probably be able to subsume this into a > completely generic common path, given a common flag. This would be great! Jason
On 2/22/22 7:48 AM, Jason Gunthorpe wrote: >> since we should only care about ownership at probe, hotplug, and other >> places well outside critical fast-paths, I'm not sure we really need to keep >> track of that anyway - it can always be recalculated by walking the >> group->devices list, > It has to be locked against concurrent probe, and there isn't > currently any locking scheme that can support this. The owner_cnt is > effectively a new lock for this purpose. It is the same issue we > talked about with that VFIO patch you showed me. > > So, using the group->device_list would require adding something else > somewhere - which I think should happen when someone has > justification for another use of whatever that something else is. This series was originated from the similar idea by adding some fields in driver structure and intercepting it in iommu core. We stopped doing that due to the lack of lock mechanism between iommu and driver core. It then evolved into what it is today. Best regards, baolu
On 2022-02-21 23:48, Jason Gunthorpe wrote: > On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote: >> On 2022-02-19 07:32, Christoph Hellwig wrote: >>> So we are back to the callback madness instead of the nice and simple >>> flag? Sigh. >> >> TBH, I *think* this part could be a fair bit simpler. It looks like this >> whole callback mess is effectively just to decrement >> group->owner_cnt, but > > Right, the new callback is because of Greg's push to put all the work > into the existing bus callback. Having symetrical callbacks is > cleaner. I'll continue to disagree that having tons more code purely for the sake of it is cleaner. The high-level requirements are fundamentally asymmetrical - ownership has to be actively claimed by the bus code at a point during probe where it can block probing if necessary, but it can be released anywhere at all during remove since that cannot fail. I don't personally see the value in a bunch of code bloat for no reason other than trying to pretend that an asymmetrical thing isn't. We already have other concepts in the IOMMU API, like the domain ops lifecycle, which are almost self-contained but for needing an external prod to get started, so I'm naturally viewing this one the same way. >> since we should only care about ownership at probe, hotplug, and other >> places well outside critical fast-paths, I'm not sure we really need to keep >> track of that anyway - it can always be recalculated by walking the >> group->devices list, > > It has to be locked against concurrent probe, and there isn't > currently any locking scheme that can support this. The owner_cnt is > effectively a new lock for this purpose. It is the same issue we > talked about with that VFIO patch you showed me. Huh? How hard is it to hold group->mutex when reading or writing group->owner? Walking the list would only have to be done for *releasing* ownership and I'm pretty sure all the races there are benign - only probe/remove of the driver (or DMA API token) matching a current non-NULL owner matter; if two removes race, the first might end up releasing ownership "early", but the second is waiting to do that anyway so it's OK; if a remove races with a probe, the remove may end up leaving the owner set, but the probe is waiting to do that anyway so it's OK. > So, using the group->device_list would require adding something else > somewhere - which I think should happen when someone has > justification for another use of whatever that something else is. > > Also, Greg's did have an objection to the the first version, with code > living in dd.c, that was basically probe time performance. I'm not > sure making this slower would really be welcomed.. Again, this does not affect probe at all, only remove, and TBH I'd expect the performance impact to be negligible. On any sensible system, IOMMU groups are not large. Heck, in the typical case I'd guess it's no worse than the time we currently spend on group notifiers. I was just making the point that there should not be a significant performance argument for needing to cache a count value. >> and some of the relevant places have to do that anyway. > > ??? I was looking at iommu_group_remove_device() at the time, but of course we should always have seen an unbind before we get there - that one's on me, sorry for the confusion. >> It has to be s It should be pretty straightforward for >> iommu_bus_notifier to clear group->owner automatically upon an >> unbind of the matching driver when it's no longer bound to any other >> devices in the group either. > > That not_bound/unbind notifier isn't currently triggred during > necessary failure paths of really_probe(). Eh? Just look at the context of patch #2, let alone the rest of the function, and tell me how, if we can't rely on BUS_NOTIFY_DRIVER_NOT_BOUND, calling .dma_cleanup *from the exact same place* is somehow more reliable? AFAICS, a notifier handling both BUS_NOTIFY_UNBOUND_DRIVER and BUS_NOTIFY_DRIVER_NOT_BOUND would be directly equivalent to the callers of .dma_cleanup here. > Even if this was patched up, it looks like spaghetti to me.. > >> use-case) then it should be up to VFIO to decide when it's finally >> finished with the whole group, rather than pretending we can keep >> track of nested ownership claims from inside the API. > > What nesting? The current implementation of iommu_group_claim_dma_owner() allows owner_cnt to increase beyond 1, and correspondingly requires iommu_group_release_dma_owner() to be called the same number of times. It doesn't appear that VFIO needs that, and I'm not sure I'd trust any other potential users to get it right either. >> Furthermore, If Greg was willing to compromise just far enough to let us put >> driver_managed_dma in the 3-byte hole in the generic struct >> device_driver, > > Space was not an issue, the earlier version of this switched an > existing bool to a bitfield. > >> we wouldn't have to have quite so much boilerplate repeated across the >> various bus implementations (I'm not suggesting to move any actual calls >> back into the driver core, just the storage of flag itself). > > Not sure that makes sense.. But I don't understand why we need to copy > and paste this code into every bus's dma_configure *shrug* That's what I'm saying - right now every bus *has* to have a specific .dma_configure implementation if only to retrieve a semantically-identical flag from each bus-specific structure; there is zero possible code-sharing. With a generically-defined flag, there is some possibility for code-sharing now (e.g. patch #3 wouldn't be needed), and the potential for more in future. >> FWIW I have some ideas for re-converging .dma_configure in future >> which I think should probably be able to subsume this into a >> completely generic common path, given a common flag. > > This would be great! Indeed, so if we're enthusiastic about future cleanup that necessitates a generic flag, why not make the flag generic to start with? Thanks, Robin.
On Tue, Feb 22, 2022 at 10:58:37AM +0000, Robin Murphy wrote: > On 2022-02-21 23:48, Jason Gunthorpe wrote: > > On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote: > > > On 2022-02-19 07:32, Christoph Hellwig wrote: > > > > So we are back to the callback madness instead of the nice and simple > > > > flag? Sigh. > > > > > > TBH, I *think* this part could be a fair bit simpler. It looks like this > > > whole callback mess is effectively just to decrement > > > group->owner_cnt, but > > > > Right, the new callback is because of Greg's push to put all the work > > into the existing bus callback. Having symetrical callbacks is > > cleaner. > > I'll continue to disagree that having tons more code purely for the sake of > it is cleaner. The high-level requirements are fundamentally asymmetrical - > ownership has to be actively claimed by the bus code at a point during probe > where it can block probing if necessary, but it can be released anywhere at > all during remove since that cannot fail. I don't personally see the value > in a bunch of code bloat for no reason other than trying to pretend that an > asymmetrical thing isn't. Then we should put this in the share core code like most of us want. If we are doing this distorted thing then it may as well make some kind of self consistent sense with a configure/unconfigure op pair. > group->owner? Walking the list would only have to be done for *releasing* > ownership and I'm pretty sure all the races there are benign - only > probe/remove of the driver (or DMA API token) matching a current non-NULL > owner matter; if two removes race, the first might end up releasing > ownership "early", but the second is waiting to do that anyway so it's OK; > if a remove races with a probe, the remove may end up leaving the owner set, > but the probe is waiting to do that anyway so it's OK. With a lockless algorithm the race is probably wrongly releasing an ownership that probe just set in the multi-device group case. Still not sure I see what you are thinking though.. How did we get from adding a few simple lines to dd.c into building some complex lockless algorithm and hoping we did it right? > > > It has to be s It should be pretty straightforward for > > > iommu_bus_notifier to clear group->owner automatically upon an > > > unbind of the matching driver when it's no longer bound to any other > > > devices in the group either. > > > > That not_bound/unbind notifier isn't currently triggred during > > necessary failure paths of really_probe(). > > Eh? Just look at the context of patch #2, let alone the rest of the > function, and tell me how, if we can't rely on BUS_NOTIFY_DRIVER_NOT_BOUND, > calling .dma_cleanup *from the exact same place* is somehow more reliable? Yeah, OK > AFAICS, a notifier handling both BUS_NOTIFY_UNBOUND_DRIVER and > BUS_NOTIFY_DRIVER_NOT_BOUND would be directly equivalent to the callers of > .dma_cleanup here. Yes, but why hide this in a notifier, it is still spaghetti > > > use-case) then it should be up to VFIO to decide when it's finally > > > finished with the whole group, rather than pretending we can keep > > > track of nested ownership claims from inside the API. > > > > What nesting? > > The current implementation of iommu_group_claim_dma_owner() allows owner_cnt > to increase beyond 1, and correspondingly requires > iommu_group_release_dma_owner() to be called the same number of times. It > doesn't appear that VFIO needs that, and I'm not sure I'd trust any other > potential users to get it right either. That isn't for "nesting" it is keeping track of multi-device groups. Each count represents a device, not a nest. > > > FWIW I have some ideas for re-converging .dma_configure in future > > > which I think should probably be able to subsume this into a > > > completely generic common path, given a common flag. > > > > This would be great! > > Indeed, so if we're enthusiastic about future cleanup that necessitates a > generic flag, why not make the flag generic to start with? Maybe when someone has patches to delete the bus ops completely they can convince Greg. The good news is that it isn't much work to flip the flag, Lu has already done it 3 times in the previous versions.. It has already been 8 weeks on this point, lets just move on please. Jason
On 2022-02-22 15:16, Jason Gunthorpe wrote: > On Tue, Feb 22, 2022 at 10:58:37AM +0000, Robin Murphy wrote: >> On 2022-02-21 23:48, Jason Gunthorpe wrote: >>> On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote: >>>> On 2022-02-19 07:32, Christoph Hellwig wrote: >>>>> So we are back to the callback madness instead of the nice and simple >>>>> flag? Sigh. >>>> >>>> TBH, I *think* this part could be a fair bit simpler. It looks like this >>>> whole callback mess is effectively just to decrement >>>> group->owner_cnt, but >>> >>> Right, the new callback is because of Greg's push to put all the work >>> into the existing bus callback. Having symetrical callbacks is >>> cleaner. >> >> I'll continue to disagree that having tons more code purely for the sake of >> it is cleaner. The high-level requirements are fundamentally asymmetrical - >> ownership has to be actively claimed by the bus code at a point during probe >> where it can block probing if necessary, but it can be released anywhere at >> all during remove since that cannot fail. I don't personally see the value >> in a bunch of code bloat for no reason other than trying to pretend that an >> asymmetrical thing isn't. > > Then we should put this in the share core code like most of us want. > > If we are doing this distorted thing then it may as well make some > kind of self consistent sense with a configure/unconfigure op pair. > >> group->owner? Walking the list would only have to be done for *releasing* >> ownership and I'm pretty sure all the races there are benign - only >> probe/remove of the driver (or DMA API token) matching a current non-NULL >> owner matter; if two removes race, the first might end up releasing >> ownership "early", but the second is waiting to do that anyway so it's OK; >> if a remove races with a probe, the remove may end up leaving the owner set, >> but the probe is waiting to do that anyway so it's OK. > > With a lockless algorithm the race is probably wrongly releasing an > ownership that probe just set in the multi-device group case. > > Still not sure I see what you are thinking though.. What part of "How hard is it to hold group->mutex when reading or writing group->owner?" sounded like "complex lockless algorithm", exactly? To spell it out, the scheme I'm proposing looks like this: probe/claim: void *owner = driver_or_DMA_API_token(dev);//oversimplification! if (owner) { mutex_lock(group->mutex); if (!group->owner) group->owner = owner; else if (group->owner != owner); ret = -EBUSY; mutex_unlock(group->mutex); } remove: bool still_owned = false; mutex_lock(group->mutex); list_for_each_entry(tmp, &group->devices, list) { void *owner = driver_or_DMA_API_token(tmp); if (tmp == dev || !owner || owner != group->owner) continue; still_owned = true; break; } if (!still_owned) group->owner = NULL; mutex_unlock(group->mutex); Of course now that I've made it more concrete I realise that the remove hook does need to run *after* dev->driver is cleared, so not quite "anywhere at all", but the main point remains: as long as actual changes of ownership are always serialised, even if the list walk in the remove hook sees "future" information WRT other devices' drivers, at worst it should merely short-cut to a corresponding pending reclaim of ownership. > How did we get from adding a few simple lines to dd.c into building > some complex lockless algorithm and hoping we did it right? Because the current alternative to adding a few simple lines to dd.c is adding loads of lines all over the place to end up calling back into common IOMMU code, to do something I'm 99% certain the common IOMMU code could do for itself in private. That said, having worked through the above, it does start looking like a bit of a big change for this series at this point, so I'm happy to keep it on the back burner for when I have to rip .dma_configure to pieces anyway. According to lockdep, I think I've solved the VFIO locking issue provided vfio_group_viable() goes away, so I'm certainly keen not to delay that for another cycle! >>>> It has to be s It should be pretty straightforward for >>>> iommu_bus_notifier to clear group->owner automatically upon an >>>> unbind of the matching driver when it's no longer bound to any other >>>> devices in the group either. >>> >>> That not_bound/unbind notifier isn't currently triggred during >>> necessary failure paths of really_probe(). >> >> Eh? Just look at the context of patch #2, let alone the rest of the >> function, and tell me how, if we can't rely on BUS_NOTIFY_DRIVER_NOT_BOUND, >> calling .dma_cleanup *from the exact same place* is somehow more reliable? > > Yeah, OK > >> AFAICS, a notifier handling both BUS_NOTIFY_UNBOUND_DRIVER and >> BUS_NOTIFY_DRIVER_NOT_BOUND would be directly equivalent to the callers of >> .dma_cleanup here. > > Yes, but why hide this in a notifier, it is still spaghetti Quick quiz! 1: The existing IOMMU group management has spent the last 10 years being driven from: A - All over random bits of bus code and the driver core B - A private bus notifier 2: The functionality that this series replaces and improves upon was split between VFIO and... A - Random bits of bus code and the driver core B - The same private bus notifier >>>> use-case) then it should be up to VFIO to decide when it's finally >>>> finished with the whole group, rather than pretending we can keep >>>> track of nested ownership claims from inside the API. >>> >>> What nesting? >> >> The current implementation of iommu_group_claim_dma_owner() allows owner_cnt >> to increase beyond 1, and correspondingly requires >> iommu_group_release_dma_owner() to be called the same number of times. It >> doesn't appear that VFIO needs that, and I'm not sure I'd trust any other >> potential users to get it right either. > > That isn't for "nesting" it is keeping track of multi-device > groups. Each count represents a device, not a nest. I was originally going to say "recursion", but then thought that might carry too much risk of misinterpretation, oh well. Hold your favourite word for "taking a mutual-exclusion token that you already hold" in mind and read my paragraph quoted above again. I'm not talking about automatic DMA API claiming, that clearly happens per-device; I'm talking about explicit callers of iommu_group_claim_dma_owner(). Does VFIO call that multiple times for individual devices? No. Should it? No. Is it reasonable that any other future callers should need to? I don't think so. Would things be easier to reason about if we just disallowed it outright? For sure. >>>> FWIW I have some ideas for re-converging .dma_configure in future >>>> which I think should probably be able to subsume this into a >>>> completely generic common path, given a common flag. >>> >>> This would be great! >> >> Indeed, so if we're enthusiastic about future cleanup that necessitates a >> generic flag, why not make the flag generic to start with? > > Maybe when someone has patches to delete the bus ops completely they > can convince Greg. The good news is that it isn't much work to flip > the flag, Lu has already done it 3 times in the previous versions.. > > It has already been 8 weeks on this point, lets just move on please. Sure, if it was rc7 with the merge window looming I'd be saying "this is close enough, let's get it in now and fix the small stuff next cycle". However while there's still potentially time to get things right first time, I for one am going to continue to point them out because I'm not a fan of avoidable churn. I'm sorry I haven't had a chance to look properly at this series between v1 and v6, but that's just how things have been. Robin.
On Tue, Feb 22, 2022 at 09:18:23PM +0000, Robin Murphy wrote: > > Still not sure I see what you are thinking though.. > > What part of "How hard is it to hold group->mutex when reading or writing > group->owner?" sounded like "complex lockless algorithm", exactly? group->owner is not the issue, this series is already using the group lock to protect the group_cnt and owner. It is how you inspect the struct device it iterates over to decide if it is still using the DMA API or not that is the problem. Hint: this is why I keep mentioning the device_lock() as it is the only locking we for this today. > To spell it out, the scheme I'm proposing looks like this: Well, I already got this, it is what is in driver_or_DMA_API_token() that matters I think you are suggesting to do something like: if (!READ_ONCE(dev->driver) || ???) return NULL; return group; // A DMA_API 'token' Which is locklessly reading dev->driver, and why you are talking about races, I guess. > remove: > bool still_owned = false; > mutex_lock(group->mutex); > list_for_each_entry(tmp, &group->devices, list) { > void *owner = driver_or_DMA_API_token(tmp); > if (tmp == dev || !owner || owner != group->owner) And here you expect this will never be called if a group is owned by VFIO? Which bakes in that weird behavior of really_probe() that only some errors deserve to get a notifier? How does the next series work? The iommu_attach_device() work relies on the owner_cnt too. I don't think this list can replace it there. > always serialised, even if the list walk in the remove hook sees "future" > information WRT other devices' drivers, at worst it should merely short-cut > to a corresponding pending reclaim of ownership. Depending on what the actual logic is I could accept this argument, assuming it came with a WRITE_ONCE on the store side and we all thought carefully about how all this is ordered. > Because the current alternative to adding a few simple lines to dd.c is > adding loads of lines all over the place to end up calling back into common > IOMMU code, to do something I'm 99% certain the common IOMMU code > could do *shrug* both Christoph and I tried to convince Greg. He never really explained why, but currently he thinks this is the right way to design it, and so here we are. > for itself in private. That said, having worked through the above, it does > start looking like a bit of a big change for this series at this point, so > I'm happy to keep it on the back burner for when I have to rip > .dma_configure to pieces anyway. OK, thanks. > According to lockdep, I think I've solved the VFIO locking issue provided > vfio_group_viable() goes away, so I'm certainly keen not to delay that for > another cycle! Indeed. Keep in mind that lockdep is disabled on the device_lock().. > paragraph quoted above again. I'm not talking about automatic DMA API > claiming, that clearly happens per-device; I'm talking about explicit > callers of iommu_group_claim_dma_owner(). Does VFIO call that multiple times > for individual devices? No. Should it? No. Is it reasonable that any other > future callers should need to? I don't think so. Would things be easier to > reason about if we just disallowed it outright? For sure. iommufd is device centric and the current draft does call iommu_group_claim_dma_owner() once for each device. It doesn't have any reason to track groups, so it has no way to know if it is "nesting" or not. I hope the iommu_attach_device() work will progress and iommufd can eventualy call a cleaner device API, it is setup to work this way at least. So, yes, currently future calls need the owner_cnt to work right. (and we are doing all this to allow iommufd and vfio to share the ownership logic - adding VFIO-like group tracking complexity to iommufd to save a few bus callbacks is not a win, IMHO) > > It has already been 8 weeks on this point, lets just move on please. > > Sure, if it was rc7 with the merge window looming I'd be saying "this is > close enough, let's get it in now and fix the small stuff next cycle". > However while there's still potentially time to get things right first time, > I for one am going to continue to point them out because I'm not a fan of > avoidable churn. I'm sorry I haven't had a chance to look properly at this > series between v1 and v6, but that's just how things have been. Well, it is understandable, but this was supposed to be a smallish cleanup series. All the improvments from the discussion are welcomed and certainly did improve it, but this started in November and is dragging on.. Sometimes we need churn to bring everyone along the journey. Jason
On 2/23/22 7:53 AM, Jason Gunthorpe wrote: >> To spell it out, the scheme I'm proposing looks like this: > Well, I already got this, it is what is in driver_or_DMA_API_token() > that matters > > I think you are suggesting to do something like: > > if (!READ_ONCE(dev->driver) || ???) > return NULL; > return group; // A DMA_API 'token' > > Which is locklessly reading dev->driver, and why you are talking about > races, I guess. > I am afraid that we are not able to implement a race-free driver_or_DMA_API_token() helper. The lock problem between the IOMMU core and driver core always exists. For example, when we implemented iommu_group_store_type() to change the default domain type of a device through sysfs, we could only comprised and limited this functionality to singleton groups to avoid the lock issue. Unfortunately, that compromise cannot simply applied to the problem to be solved by this series, because the iommu core cannot abort the driver binding when the conflict is detected in the bus notifier. Best regards, baolu
On 2022-02-23 05:01, Lu Baolu wrote: > On 2/23/22 7:53 AM, Jason Gunthorpe wrote: >>> To spell it out, the scheme I'm proposing looks like this: >> Well, I already got this, it is what is in driver_or_DMA_API_token() >> that matters >> >> I think you are suggesting to do something like: >> >> if (!READ_ONCE(dev->driver) || ???) >> return NULL; >> return group; // A DMA_API 'token' >> >> Which is locklessly reading dev->driver, and why you are talking about >> races, I guess. >> > > I am afraid that we are not able to implement a race-free > driver_or_DMA_API_token() helper. The lock problem between the IOMMU > core and driver core always exists. It's not race-free. My point is that the races aren't harmful because what we might infer from the "wrong" information still leads to the right action. dev->driver is obviously always valid and constant for *claiming* ownership, since that either happens for the DMA API in the middle of really_probe() binding driver to dev, or while driver is actively using dev and calling iommu_group_claim_dma_owner(). The races exist during remove, but both probe and remove are serialised on the group mutex after respectively setting/clearing dev->driver, there are only 4 possibilities for the state of any other group sibling "tmp" during the time that dev holds that mutex in its remove path: 1 - tmp->driver is non-NULL because tmp is already bound. 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be DMA-API-owned as a whole. Regardless of what driver dev has unbound from, its removal does not release someone else's DMA API (co-)ownership. 1.b - If tmp->driver->driver_managed_dma == 1 and tmp->driver == group->owner, then dev must have unbound from the same driver, but either way that driver has not yet released ownership so dev's removal does not change anything. 1.c - If tmp->driver->driver_managed_dma == 1 and tmp->driver != group->owner, it doesn't matter. Even if tmp->driver is currently waiting to attempt to claim ownership it can't do so until we release the mutex. 2 - tmp->driver is non-NULL because tmp is in the process of binding. 2.a - If tmp->driver->driver_managed_dma == 0, tmp can be assumed to be waiting on the group mutex to claim DMA API ownership. 2.a.i - If the group is DMA API owned, this race is simply a short-cut to case 1.a - dev's ownership is effectively handed off directly to tmp, rather than potentially being released and immediately reclaimed. Once tmp gets its turn, it finds the group already DMA-API-owned as it wanted and all is well. This may be "unfair" if an explicit claim was also waiting, but not incorrect. 2.a.ii - If the group is driver-owned, it doesn't matter. Removing dev does not change the current ownership, and tmp's probe will eventually get its turn and find whatever it finds at that point in future. 2.b - If tmp->driver->driver_managed_dma == 1, it doesn't matter. Either that driver already owns the group, or it might try to claim it after we've resolved dev's removal and released the mutex, in which case it will find whatever it finds. 3 - tmp->driver is NULL because tmp is unbound. Obviously no impact. 4 - tmp->driver is NULL because tmp is in the process of unbinding. 4.a - If the group is DMA-API-owned, either way tmp has no further influence. 4.a.i - If tmp has unbound from a driver_managed_dma=0 driver, it must be waiting to release its DMA API ownership, thus if tmp would otherwise be the only remaining DMA API owner, the race is that dev's removal releases ownership on behalf of both devices. When tmp's own removal subsequently gets the mutex, it will either see that the group is already unowned, or maybe that someone else has re-claimed it in the interim, and either way do nothing, which is fine. 4.a.ii - If tmp has unbound from a driver_managed_dma=1 driver, it doesn't matter, as in case 1.c. 4.b - If the group is driver-owned, it doesn't matter. That ownership can only change if that driver releases it, which isn't happening while we hold the mutex. As I said yesterday, I'm really just airing out an idea here; I might write up some proper patches as part of the bus ops work, and we can give it proper scrutiny then. > For example, when we implemented iommu_group_store_type() to change the > default domain type of a device through sysfs, we could only comprised > and limited this functionality to singleton groups to avoid the lock > issue. Indeed, but once the probe and remove paths for grouped devices have to serialise on the group mutex, as we're introducing here, the story changes and we gain a lot more power. In fact that's a good point I hadn't considered yet - that sysfs constraint is functionally equivalent to the one in iommu_attach_device(), so once we land this ownership concept we should be free to relax it from "singleton" to "unowned" in much the same way as your other series is doing for attach. > Unfortunately, that compromise cannot simply applied to the problem to > be solved by this series, because the iommu core cannot abort the driver > binding when the conflict is detected in the bus notifier. No, I've never proposed that probe-time DMA ownership can be claimed from a notifier, we all know why that doesn't work. It's only the dma_cleanup() step that *could* be punted back to iommu_bus_notifier vs. the driver core having to know about it. Either way we're still serialising remove/failure against probe/remove of other devices in a group, and that's the critical aspect. Thanks, Robin.
On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote: > 1 - tmp->driver is non-NULL because tmp is already bound. > 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be > DMA-API-owned as a whole. Regardless of what driver dev has unbound from, > its removal does not release someone else's DMA API (co-)ownership. This is an uncommon locking pattern, but it does work. It relies on the mutex being an effective synchronization barrier for an unlocked store: WRITE_ONCE(dev->driver, NULL) mutex_lock(&group->lock) READ_ONCE(dev->driver) != NULL and no UAF mutex_unlock(&group->lock) mutex_lock(&group->lock) tmp = READ_ONCE(dev1->driver); if (tmp && tmp->blah) [..] mutex_unlock(&group->lock) mutex_lock(&group->lock) READ_ONCE(dev->driver) == NULL mutex_unlock(&group->lock) /* No other CPU can UAF dev->driver */ kfree(driver) Ie the CPU setting driver cannot pass to the next step without all other CPUs observing the new value because of the release/acquire built into the mutex_lock. It is tricky, and can work in this instance, but the pattern's unlocked design relies on ordering between the WRITE_ONCE and the locks - and that ordering in dd.c isn't like that today. Jason
On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote: > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote: > > > 1 - tmp->driver is non-NULL because tmp is already bound. > > 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from, > > its removal does not release someone else's DMA API (co-)ownership. > > This is an uncommon locking pattern, but it does work. It relies on > the mutex being an effective synchronization barrier for an unlocked > store: > > WRITE_ONCE(dev->driver, NULL) Only the driver core should be messing with the dev->driver pointer as when it does so, it already has the proper locks held. Do I need to move that to a "private" location so that nothing outside of the driver core can mess with it? thanks, greg k-h
On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote: > On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote: > > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote: > > > > > 1 - tmp->driver is non-NULL because tmp is already bound. > > > 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be > > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from, > > > its removal does not release someone else's DMA API (co-)ownership. > > > > This is an uncommon locking pattern, but it does work. It relies on > > the mutex being an effective synchronization barrier for an unlocked > > store: > > > > WRITE_ONCE(dev->driver, NULL) > > Only the driver core should be messing with the dev->driver pointer as > when it does so, it already has the proper locks held. Do I need to > move that to a "private" location so that nothing outside of the driver > core can mess with it? It would be nice, I've seen a abuse and mislocking of it in drivers Jason
On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote: > On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote: > > On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote: > > > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote: > > > > > > > 1 - tmp->driver is non-NULL because tmp is already bound. > > > > 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be > > > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from, > > > > its removal does not release someone else's DMA API (co-)ownership. > > > > > > This is an uncommon locking pattern, but it does work. It relies on > > > the mutex being an effective synchronization barrier for an unlocked > > > store: > > > > > > WRITE_ONCE(dev->driver, NULL) > > > > Only the driver core should be messing with the dev->driver pointer as > > when it does so, it already has the proper locks held. Do I need to > > move that to a "private" location so that nothing outside of the driver > > core can mess with it? > > It would be nice, I've seen a abuse and mislocking of it in drivers Though to be clear, what Robin is describing is still keeping the dev->driver stores in dd.c, just reading it in a lockless way from other modules. Jason
On Wed, Feb 23, 2022 at 10:30:11AM -0400, Jason Gunthorpe wrote: > On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote: > > On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote: > > > On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote: > > > > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote: > > > > > > > > > 1 - tmp->driver is non-NULL because tmp is already bound. > > > > > 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be > > > > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from, > > > > > its removal does not release someone else's DMA API (co-)ownership. > > > > > > > > This is an uncommon locking pattern, but it does work. It relies on > > > > the mutex being an effective synchronization barrier for an unlocked > > > > store: > > > > > > > > WRITE_ONCE(dev->driver, NULL) > > > > > > Only the driver core should be messing with the dev->driver pointer as > > > when it does so, it already has the proper locks held. Do I need to > > > move that to a "private" location so that nothing outside of the driver > > > core can mess with it? > > > > It would be nice, I've seen a abuse and mislocking of it in drivers > > Though to be clear, what Robin is describing is still keeping the > dev->driver stores in dd.c, just reading it in a lockless way from > other modules. "other modules" should never care if a device has a driver bound to it because instantly after the check happens, it can change so what ever logic it wanted to do with that knowledge is gone. Unless the bus lock is held that the device is on, but that should be only accessable from within the driver core as it controls that type of stuff, not any random other part of the kernel. And in looking at this, ick, there are loads of places in the kernel that are thinking that this pointer being set to something actually means something. Sometimes it does, but lots of places, it doesn't as it can change. In a semi-related incident right now, we currently have a syzbot failure in the usb gadget code where it was manipulating the ->driver pointer directly and other parts of the kernel are crashing. See https://lore.kernel.org/r/PH0PR11MB58805E3C4CF7D4C41D49BFCFDA3C9@PH0PR11MB5880.namprd11.prod.outlook.com for the thread. I'll poke at this as a background task to try to clean up over time. thanks, greg k-h
On 2022-02-23 16:03, Greg Kroah-Hartman wrote: > On Wed, Feb 23, 2022 at 10:30:11AM -0400, Jason Gunthorpe wrote: >> On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote: >>> On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote: >>>> On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote: >>>>> On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote: >>>>> >>>>>> 1 - tmp->driver is non-NULL because tmp is already bound. >>>>>> 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be >>>>>> DMA-API-owned as a whole. Regardless of what driver dev has unbound from, >>>>>> its removal does not release someone else's DMA API (co-)ownership. >>>>> >>>>> This is an uncommon locking pattern, but it does work. It relies on >>>>> the mutex being an effective synchronization barrier for an unlocked >>>>> store: >>>>> >>>>> WRITE_ONCE(dev->driver, NULL) >>>> >>>> Only the driver core should be messing with the dev->driver pointer as >>>> when it does so, it already has the proper locks held. Do I need to >>>> move that to a "private" location so that nothing outside of the driver >>>> core can mess with it? >>> >>> It would be nice, I've seen a abuse and mislocking of it in drivers >> >> Though to be clear, what Robin is describing is still keeping the >> dev->driver stores in dd.c, just reading it in a lockless way from >> other modules. > > "other modules" should never care if a device has a driver bound to it > because instantly after the check happens, it can change so what ever > logic it wanted to do with that knowledge is gone. > > Unless the bus lock is held that the device is on, but that should be > only accessable from within the driver core as it controls that type of > stuff, not any random other part of the kernel. > > And in looking at this, ick, there are loads of places in the kernel > that are thinking that this pointer being set to something actually > means something. Sometimes it does, but lots of places, it doesn't as > it can change. That's fine. In this case we're only talking about the low-level IOMMU code which has to be in cahoots with the driver core to some degree (via these new callbacks) anyway, but if you're uncomfortable about relying on dev->driver even there, I can live with that. There are several potential places to capture the relevant information in IOMMU API private data, from the point in really_probe() where it *is* stable, and then never look at dev->driver ever again - even from .dma_cleanup() or future equivalent, which is the aspect from whence this whole proof-of-concept tangent span out. Cheers, Robin.
On Wed, Feb 23, 2022 at 05:05:23PM +0000, Robin Murphy wrote: > On 2022-02-23 16:03, Greg Kroah-Hartman wrote: > > On Wed, Feb 23, 2022 at 10:30:11AM -0400, Jason Gunthorpe wrote: > > > On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote: > > > > On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote: > > > > > On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote: > > > > > > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote: > > > > > > > > > > > > > 1 - tmp->driver is non-NULL because tmp is already bound. > > > > > > > 1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be > > > > > > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from, > > > > > > > its removal does not release someone else's DMA API (co-)ownership. > > > > > > > > > > > > This is an uncommon locking pattern, but it does work. It relies on > > > > > > the mutex being an effective synchronization barrier for an unlocked > > > > > > store: > > > > > > > > > > > > WRITE_ONCE(dev->driver, NULL) > > > > > > > > > > Only the driver core should be messing with the dev->driver pointer as > > > > > when it does so, it already has the proper locks held. Do I need to > > > > > move that to a "private" location so that nothing outside of the driver > > > > > core can mess with it? > > > > > > > > It would be nice, I've seen a abuse and mislocking of it in drivers > > > > > > Though to be clear, what Robin is describing is still keeping the > > > dev->driver stores in dd.c, just reading it in a lockless way from > > > other modules. > > > > "other modules" should never care if a device has a driver bound to it > > because instantly after the check happens, it can change so what ever > > logic it wanted to do with that knowledge is gone. > > > > Unless the bus lock is held that the device is on, but that should be > > only accessable from within the driver core as it controls that type of > > stuff, not any random other part of the kernel. > > > > And in looking at this, ick, there are loads of places in the kernel > > that are thinking that this pointer being set to something actually > > means something. Sometimes it does, but lots of places, it doesn't as > > it can change. > > That's fine. In this case we're only talking about the low-level IOMMU code > which has to be in cahoots with the driver core to some degree (via these > new callbacks) anyway, but if you're uncomfortable about relying on > dev->driver even there, I can live with that. There are several potential > places to capture the relevant information in IOMMU API private data, from > the point in really_probe() where it *is* stable, and then never look at > dev->driver ever again - even from .dma_cleanup() or future equivalent, > which is the aspect from whence this whole proof-of-concept tangent span > out. For a specific driver core callback, like dma_cleanup(), all is fine, but you shouldn't be caring about a driver pointer in your bus callback for stuff like this as you "know" what happened by virtue of the callback being called. thanks, greg k-h
diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h index a039ab809753..d8b29ccd07e5 100644 --- a/include/linux/device/bus.h +++ b/include/linux/device/bus.h @@ -59,6 +59,8 @@ struct fwnode_handle; * bus supports. * @dma_configure: Called to setup DMA configuration on a device on * this bus. + * @dma_cleanup: Called to cleanup DMA configuration on a device on + * this bus. * @pm: Power management operations of this bus, callback the specific * device driver's pm-ops. * @iommu_ops: IOMMU specific operations for this bus, used to attach IOMMU @@ -103,6 +105,7 @@ struct bus_type { int (*num_vf)(struct device *dev); int (*dma_configure)(struct device *dev); + void (*dma_cleanup)(struct device *dev); const struct dev_pm_ops *pm; diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 9eaaff2f556c..de05c5c60c6b 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -662,6 +662,8 @@ static int really_probe(struct device *dev, struct device_driver *drv) if (dev->bus) blocking_notifier_call_chain(&dev->bus->p->bus_notifier, BUS_NOTIFY_DRIVER_NOT_BOUND, dev); + if (dev->bus && dev->bus->dma_cleanup) + dev->bus->dma_cleanup(dev); pinctrl_bind_failed: device_links_no_driver(dev); devres_release_all(dev); @@ -1205,6 +1207,9 @@ static void __device_release_driver(struct device *dev, struct device *parent) else if (drv->remove) drv->remove(dev); + if (dev->bus && dev->bus->dma_cleanup) + dev->bus->dma_cleanup(dev); + device_links_driver_cleanup(dev); devres_release_all(dev);