diff mbox series

[v6,02/11] driver core: Add dma_cleanup callback in bus_type

Message ID 20220218005521.172832-3-baolu.lu@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Fix BUG_ON in vfio_iommu_group_notifier() | expand

Commit Message

Baolu Lu Feb. 18, 2022, 12:55 a.m. UTC
The bus_type structure defines dma_configure() callback for bus drivers
to configure DMA on the devices. This adds the paired dma_cleanup()
callback and calls it during driver unbinding so that bus drivers can do
some cleanup work.

One use case for this paired DMA callbacks is for the bus driver to check
for DMA ownership conflicts during driver binding, where multiple devices
belonging to a same IOMMU group (the minimum granularity of isolation and
protection) may be assigned to kernel drivers or user space respectively.

Without this change, for example, the vfio driver has to listen to a bus
BOUND_DRIVER event and then BUG_ON() in case of dma ownership conflict.
This leads to bad user experience since careless driver binding operation
may crash the system if the admin overlooks the group restriction. Aside
from bad design, this leads to a security problem as a root user, even with
lockdown=integrity, can force the kernel to BUG.

With this change, the bus driver could check and set the DMA ownership in
driver binding process and fail on ownership conflicts. The DMA ownership
should be released during driver unbinding.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/linux/device/bus.h | 3 +++
 drivers/base/dd.c          | 5 +++++
 2 files changed, 8 insertions(+)

Comments

Christoph Hellwig Feb. 19, 2022, 7:32 a.m. UTC | #1
So we are back to the callback madness instead of the nice and simple
flag?  Sigh.
Robin Murphy Feb. 21, 2022, 8:43 p.m. UTC | #2
On 2022-02-19 07:32, Christoph Hellwig wrote:
> So we are back to the callback madness instead of the nice and simple
> flag?  Sigh.

TBH, I *think* this part could be a fair bit simpler. It looks like this 
whole callback mess is effectively just to decrement group->owner_cnt, 
but since we should only care about ownership at probe, hotplug, and 
other places well outside critical fast-paths, I'm not sure we really 
need to keep track of that anyway - it can always be recalculated by 
walking the group->devices list, and some of the relevant places have to 
do that anyway. It should be pretty straightforward for 
iommu_bus_notifier to clear group->owner automatically upon an unbind of 
the matching driver when it's no longer bound to any other devices in 
the group either. And if we still want to entertain the notion of VFIO 
being able to release ownership without unbinding (I'm not entirely 
convinced that's a realistically necessary use-case) then it should be 
up to VFIO to decide when it's finally finished with the whole group, 
rather than pretending we can keep track of nested ownership claims from 
inside the API.

Furthermore, If Greg was willing to compromise just far enough to let us 
put driver_managed_dma in the 3-byte hole in the generic struct 
device_driver, we wouldn't have to have quite so much boilerplate 
repeated across the various bus implementations (I'm not suggesting to 
move any actual calls back into the driver core, just the storage of 
flag itself). FWIW I have some ideas for re-converging .dma_configure in 
future which I think should probably be able to subsume this into a 
completely generic common path, given a common flag.

Robin.
Jason Gunthorpe Feb. 21, 2022, 11:48 p.m. UTC | #3
On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote:
> On 2022-02-19 07:32, Christoph Hellwig wrote:
> > So we are back to the callback madness instead of the nice and simple
> > flag?  Sigh.
> 
> TBH, I *think* this part could be a fair bit simpler. It looks like this
> whole callback mess is effectively just to decrement
> group->owner_cnt, but

Right, the new callback is because of Greg's push to put all the work
into the existing bus callback. Having symetrical callbacks is
cleaner.

> since we should only care about ownership at probe, hotplug, and other
> places well outside critical fast-paths, I'm not sure we really need to keep
> track of that anyway - it can always be recalculated by walking the
> group->devices list, 

It has to be locked against concurrent probe, and there isn't
currently any locking scheme that can support this. The owner_cnt is
effectively a new lock for this purpose. It is the same issue we
talked about with that VFIO patch you showed me.

So, using the group->device_list would require adding something else
somewhere - which I think should happen when someone has
justification for another use of whatever that something else is.

Also, Greg's did have an objection to the the first version, with code
living in dd.c, that was basically probe time performance. I'm not
sure making this slower would really be welcomed..

> and some of the relevant places have to do that anyway.

???

> It has to be s It should be pretty straightforward for
> iommu_bus_notifier to clear group->owner automatically upon an
> unbind of the matching driver when it's no longer bound to any other
> devices in the group either.

That not_bound/unbind notifier isn't currently triggred during
necessary failure paths of really_probe().

Even if this was patched up, it looks like spaghetti to me..

> use-case) then it should be up to VFIO to decide when it's finally
> finished with the whole group, rather than pretending we can keep
> track of nested ownership claims from inside the API.

What nesting?
 
> Furthermore, If Greg was willing to compromise just far enough to let us put
> driver_managed_dma in the 3-byte hole in the generic struct
> device_driver,

Space was not an issue, the earlier version of this switched an
existing bool to a bitfield.

> we wouldn't have to have quite so much boilerplate repeated across the
> various bus implementations (I'm not suggesting to move any actual calls
> back into the driver core, just the storage of flag itself). 

Not sure that makes sense.. But I don't understand why we need to copy
and paste this code into every bus's dma_configure *shrug*

> FWIW I have some ideas for re-converging .dma_configure in future
> which I think should probably be able to subsume this into a
> completely generic common path, given a common flag.

This would be great!

Jason
Baolu Lu Feb. 22, 2022, 4:48 a.m. UTC | #4
On 2/22/22 7:48 AM, Jason Gunthorpe wrote:
>> since we should only care about ownership at probe, hotplug, and other
>> places well outside critical fast-paths, I'm not sure we really need to keep
>> track of that anyway - it can always be recalculated by walking the
>> group->devices list,
> It has to be locked against concurrent probe, and there isn't
> currently any locking scheme that can support this. The owner_cnt is
> effectively a new lock for this purpose. It is the same issue we
> talked about with that VFIO patch you showed me.
> 
> So, using the group->device_list would require adding something else
> somewhere - which I think should happen when someone has
> justification for another use of whatever that something else is.

This series was originated from the similar idea by adding some fields
in driver structure and intercepting it in iommu core. We stopped doing
that due to the lack of lock mechanism between iommu and driver core.
It then evolved into what it is today.

Best regards,
baolu
Robin Murphy Feb. 22, 2022, 10:58 a.m. UTC | #5
On 2022-02-21 23:48, Jason Gunthorpe wrote:
> On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote:
>> On 2022-02-19 07:32, Christoph Hellwig wrote:
>>> So we are back to the callback madness instead of the nice and simple
>>> flag?  Sigh.
>>
>> TBH, I *think* this part could be a fair bit simpler. It looks like this
>> whole callback mess is effectively just to decrement
>> group->owner_cnt, but
> 
> Right, the new callback is because of Greg's push to put all the work
> into the existing bus callback. Having symetrical callbacks is
> cleaner.

I'll continue to disagree that having tons more code purely for the sake 
of it is cleaner. The high-level requirements are fundamentally 
asymmetrical - ownership has to be actively claimed by the bus code at a 
point during probe where it can block probing if necessary, but it can 
be released anywhere at all during remove since that cannot fail. I 
don't personally see the value in a bunch of code bloat for no reason 
other than trying to pretend that an asymmetrical thing isn't.

We already have other concepts in the IOMMU API, like the domain ops 
lifecycle, which are almost self-contained but for needing an external 
prod to get started, so I'm naturally viewing this one the same way.

>> since we should only care about ownership at probe, hotplug, and other
>> places well outside critical fast-paths, I'm not sure we really need to keep
>> track of that anyway - it can always be recalculated by walking the
>> group->devices list,
> 
> It has to be locked against concurrent probe, and there isn't
> currently any locking scheme that can support this. The owner_cnt is
> effectively a new lock for this purpose. It is the same issue we
> talked about with that VFIO patch you showed me.

Huh? How hard is it to hold group->mutex when reading or writing 
group->owner? Walking the list would only have to be done for 
*releasing* ownership and I'm pretty sure all the races there are benign 
- only probe/remove of the driver (or DMA API token) matching a current 
non-NULL owner matter; if two removes race, the first might end up 
releasing ownership "early", but the second is waiting to do that anyway 
so it's OK; if a remove races with a probe, the remove may end up 
leaving the owner set, but the probe is waiting to do that anyway so 
it's OK.

> So, using the group->device_list would require adding something else
> somewhere - which I think should happen when someone has
> justification for another use of whatever that something else is.
> 
> Also, Greg's did have an objection to the the first version, with code
> living in dd.c, that was basically probe time performance. I'm not
> sure making this slower would really be welcomed..

Again, this does not affect probe at all, only remove, and TBH I'd 
expect the performance impact to be negligible. On any sensible system, 
IOMMU groups are not large. Heck, in the typical case I'd guess it's no 
worse than the time we currently spend on group notifiers. I was just 
making the point that there should not be a significant performance 
argument for needing to cache a count value.

>> and some of the relevant places have to do that anyway.
> 
> ???

I was looking at iommu_group_remove_device() at the time, but of course 
we should always have seen an unbind before we get there - that one's on 
me, sorry for the confusion.

>> It has to be s It should be pretty straightforward for
>> iommu_bus_notifier to clear group->owner automatically upon an
>> unbind of the matching driver when it's no longer bound to any other
>> devices in the group either.
> 
> That not_bound/unbind notifier isn't currently triggred during
> necessary failure paths of really_probe().

Eh? Just look at the context of patch #2, let alone the rest of the 
function, and tell me how, if we can't rely on 
BUS_NOTIFY_DRIVER_NOT_BOUND, calling .dma_cleanup *from the exact same 
place* is somehow more reliable?

AFAICS, a notifier handling both BUS_NOTIFY_UNBOUND_DRIVER and 
BUS_NOTIFY_DRIVER_NOT_BOUND would be directly equivalent to the callers 
of .dma_cleanup here.

> Even if this was patched up, it looks like spaghetti to me..
> 
>> use-case) then it should be up to VFIO to decide when it's finally
>> finished with the whole group, rather than pretending we can keep
>> track of nested ownership claims from inside the API.
> 
> What nesting?

The current implementation of iommu_group_claim_dma_owner() allows 
owner_cnt to increase beyond 1, and correspondingly requires 
iommu_group_release_dma_owner() to be called the same number of times. 
It doesn't appear that VFIO needs that, and I'm not sure I'd trust any 
other potential users to get it right either.

>> Furthermore, If Greg was willing to compromise just far enough to let us put
>> driver_managed_dma in the 3-byte hole in the generic struct
>> device_driver,
> 
> Space was not an issue, the earlier version of this switched an
> existing bool to a bitfield.
> 
>> we wouldn't have to have quite so much boilerplate repeated across the
>> various bus implementations (I'm not suggesting to move any actual calls
>> back into the driver core, just the storage of flag itself).
> 
> Not sure that makes sense.. But I don't understand why we need to copy
> and paste this code into every bus's dma_configure *shrug*

That's what I'm saying - right now every bus *has* to have a specific 
.dma_configure implementation if only to retrieve a 
semantically-identical flag from each bus-specific structure; there is 
zero possible code-sharing. With a generically-defined flag, there is 
some possibility for code-sharing now (e.g. patch #3 wouldn't be 
needed), and the potential for more in future.

>> FWIW I have some ideas for re-converging .dma_configure in future
>> which I think should probably be able to subsume this into a
>> completely generic common path, given a common flag.
> 
> This would be great!

Indeed, so if we're enthusiastic about future cleanup that necessitates 
a generic flag, why not make the flag generic to start with?

Thanks,
Robin.
Jason Gunthorpe Feb. 22, 2022, 3:16 p.m. UTC | #6
On Tue, Feb 22, 2022 at 10:58:37AM +0000, Robin Murphy wrote:
> On 2022-02-21 23:48, Jason Gunthorpe wrote:
> > On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote:
> > > On 2022-02-19 07:32, Christoph Hellwig wrote:
> > > > So we are back to the callback madness instead of the nice and simple
> > > > flag?  Sigh.
> > > 
> > > TBH, I *think* this part could be a fair bit simpler. It looks like this
> > > whole callback mess is effectively just to decrement
> > > group->owner_cnt, but
> > 
> > Right, the new callback is because of Greg's push to put all the work
> > into the existing bus callback. Having symetrical callbacks is
> > cleaner.
> 
> I'll continue to disagree that having tons more code purely for the sake of
> it is cleaner. The high-level requirements are fundamentally asymmetrical -
> ownership has to be actively claimed by the bus code at a point during probe
> where it can block probing if necessary, but it can be released anywhere at
> all during remove since that cannot fail. I don't personally see the value
> in a bunch of code bloat for no reason other than trying to pretend that an
> asymmetrical thing isn't.

Then we should put this in the share core code like most of us want.

If we are doing this distorted thing then it may as well make some
kind of self consistent sense with a configure/unconfigure op pair.

> group->owner?  Walking the list would only have to be done for *releasing*
> ownership and I'm pretty sure all the races there are benign - only
> probe/remove of the driver (or DMA API token) matching a current non-NULL
> owner matter; if two removes race, the first might end up releasing
> ownership "early", but the second is waiting to do that anyway so it's OK;
> if a remove races with a probe, the remove may end up leaving the owner set,
> but the probe is waiting to do that anyway so it's OK.

With a lockless algorithm the race is probably wrongly releasing an
ownership that probe just set in the multi-device group case.

Still not sure I see what you are thinking though..

How did we get from adding a few simple lines to dd.c into building
some complex lockless algorithm and hoping we did it right?

> > > It has to be s It should be pretty straightforward for
> > > iommu_bus_notifier to clear group->owner automatically upon an
> > > unbind of the matching driver when it's no longer bound to any other
> > > devices in the group either.
> > 
> > That not_bound/unbind notifier isn't currently triggred during
> > necessary failure paths of really_probe().
> 
> Eh? Just look at the context of patch #2, let alone the rest of the
> function, and tell me how, if we can't rely on BUS_NOTIFY_DRIVER_NOT_BOUND,
> calling .dma_cleanup *from the exact same place* is somehow more reliable?

Yeah, OK

> AFAICS, a notifier handling both BUS_NOTIFY_UNBOUND_DRIVER and
> BUS_NOTIFY_DRIVER_NOT_BOUND would be directly equivalent to the callers of
> .dma_cleanup here.

Yes, but why hide this in a notifier, it is still spaghetti

> > > use-case) then it should be up to VFIO to decide when it's finally
> > > finished with the whole group, rather than pretending we can keep
> > > track of nested ownership claims from inside the API.
> > 
> > What nesting?
> 
> The current implementation of iommu_group_claim_dma_owner() allows owner_cnt
> to increase beyond 1, and correspondingly requires
> iommu_group_release_dma_owner() to be called the same number of times. It
> doesn't appear that VFIO needs that, and I'm not sure I'd trust any other
> potential users to get it right either.

That isn't for "nesting" it is keeping track of multi-device
groups. Each count represents a device, not a nest.

> > > FWIW I have some ideas for re-converging .dma_configure in future
> > > which I think should probably be able to subsume this into a
> > > completely generic common path, given a common flag.
> > 
> > This would be great!
> 
> Indeed, so if we're enthusiastic about future cleanup that necessitates a
> generic flag, why not make the flag generic to start with?

Maybe when someone has patches to delete the bus ops completely they
can convince Greg. The good news is that it isn't much work to flip
the flag, Lu has already done it 3 times in the previous versions..

It has already been 8 weeks on this point, lets just move on please.

Jason
Robin Murphy Feb. 22, 2022, 9:18 p.m. UTC | #7
On 2022-02-22 15:16, Jason Gunthorpe wrote:
> On Tue, Feb 22, 2022 at 10:58:37AM +0000, Robin Murphy wrote:
>> On 2022-02-21 23:48, Jason Gunthorpe wrote:
>>> On Mon, Feb 21, 2022 at 08:43:33PM +0000, Robin Murphy wrote:
>>>> On 2022-02-19 07:32, Christoph Hellwig wrote:
>>>>> So we are back to the callback madness instead of the nice and simple
>>>>> flag?  Sigh.
>>>>
>>>> TBH, I *think* this part could be a fair bit simpler. It looks like this
>>>> whole callback mess is effectively just to decrement
>>>> group->owner_cnt, but
>>>
>>> Right, the new callback is because of Greg's push to put all the work
>>> into the existing bus callback. Having symetrical callbacks is
>>> cleaner.
>>
>> I'll continue to disagree that having tons more code purely for the sake of
>> it is cleaner. The high-level requirements are fundamentally asymmetrical -
>> ownership has to be actively claimed by the bus code at a point during probe
>> where it can block probing if necessary, but it can be released anywhere at
>> all during remove since that cannot fail. I don't personally see the value
>> in a bunch of code bloat for no reason other than trying to pretend that an
>> asymmetrical thing isn't.
> 
> Then we should put this in the share core code like most of us want.
> 
> If we are doing this distorted thing then it may as well make some
> kind of self consistent sense with a configure/unconfigure op pair.
> 
>> group->owner?  Walking the list would only have to be done for *releasing*
>> ownership and I'm pretty sure all the races there are benign - only
>> probe/remove of the driver (or DMA API token) matching a current non-NULL
>> owner matter; if two removes race, the first might end up releasing
>> ownership "early", but the second is waiting to do that anyway so it's OK;
>> if a remove races with a probe, the remove may end up leaving the owner set,
>> but the probe is waiting to do that anyway so it's OK.
> 
> With a lockless algorithm the race is probably wrongly releasing an
> ownership that probe just set in the multi-device group case.
> 
> Still not sure I see what you are thinking though..

What part of "How hard is it to hold group->mutex when reading or 
writing group->owner?" sounded like "complex lockless algorithm", exactly?

To spell it out, the scheme I'm proposing looks like this:

probe/claim:
	void *owner = driver_or_DMA_API_token(dev);//oversimplification!
	if (owner) {
		mutex_lock(group->mutex);
		if (!group->owner)
			group->owner = owner;
		else if (group->owner != owner);
			ret = -EBUSY;
		mutex_unlock(group->mutex);
	}

remove:
	bool still_owned = false;
	mutex_lock(group->mutex);
	list_for_each_entry(tmp, &group->devices, list) {
		void *owner = driver_or_DMA_API_token(tmp);
		if (tmp == dev || !owner || owner != group->owner)
			continue;
		still_owned = true;
		break;
	}
	if (!still_owned)
		group->owner = NULL;
	mutex_unlock(group->mutex);

Of course now that I've made it more concrete I realise that the remove 
hook does need to run *after* dev->driver is cleared, so not quite 
"anywhere at all", but the main point remains: as long as actual changes 
of ownership are always serialised, even if the list walk in the remove 
hook sees "future" information WRT other devices' drivers, at worst it 
should merely short-cut to a corresponding pending reclaim of ownership.

> How did we get from adding a few simple lines to dd.c into building
> some complex lockless algorithm and hoping we did it right?

Because the current alternative to adding a few simple lines to dd.c is 
adding loads of lines all over the place to end up calling back into 
common IOMMU code, to do something I'm 99% certain the common IOMMU code 
could do for itself in private. That said, having worked through the 
above, it does start looking like a bit of a big change for this series 
at this point, so I'm happy to keep it on the back burner for when I 
have to rip .dma_configure to pieces anyway.

According to lockdep, I think I've solved the VFIO locking issue 
provided vfio_group_viable() goes away, so I'm certainly keen not to 
delay that for another cycle!

>>>> It has to be s It should be pretty straightforward for
>>>> iommu_bus_notifier to clear group->owner automatically upon an
>>>> unbind of the matching driver when it's no longer bound to any other
>>>> devices in the group either.
>>>
>>> That not_bound/unbind notifier isn't currently triggred during
>>> necessary failure paths of really_probe().
>>
>> Eh? Just look at the context of patch #2, let alone the rest of the
>> function, and tell me how, if we can't rely on BUS_NOTIFY_DRIVER_NOT_BOUND,
>> calling .dma_cleanup *from the exact same place* is somehow more reliable?
> 
> Yeah, OK
> 
>> AFAICS, a notifier handling both BUS_NOTIFY_UNBOUND_DRIVER and
>> BUS_NOTIFY_DRIVER_NOT_BOUND would be directly equivalent to the callers of
>> .dma_cleanup here.
> 
> Yes, but why hide this in a notifier, it is still spaghetti

Quick quiz!

1: The existing IOMMU group management has spent the last 10 years being 
driven from:

   A - All over random bits of bus code and the driver core
   B - A private bus notifier


2: The functionality that this series replaces and improves upon was 
split between VFIO and...

   A - Random bits of bus code and the driver core
   B - The same private bus notifier

>>>> use-case) then it should be up to VFIO to decide when it's finally
>>>> finished with the whole group, rather than pretending we can keep
>>>> track of nested ownership claims from inside the API.
>>>
>>> What nesting?
>>
>> The current implementation of iommu_group_claim_dma_owner() allows owner_cnt
>> to increase beyond 1, and correspondingly requires
>> iommu_group_release_dma_owner() to be called the same number of times. It
>> doesn't appear that VFIO needs that, and I'm not sure I'd trust any other
>> potential users to get it right either.
> 
> That isn't for "nesting" it is keeping track of multi-device
> groups. Each count represents a device, not a nest.

I was originally going to say "recursion", but then thought that might 
carry too much risk of misinterpretation, oh well. Hold your favourite 
word for "taking a mutual-exclusion token that you already hold" in mind 
and read my paragraph quoted above again. I'm not talking about 
automatic DMA API claiming, that clearly happens per-device; I'm talking 
about explicit callers of iommu_group_claim_dma_owner(). Does VFIO call 
that multiple times for individual devices? No. Should it? No. Is it 
reasonable that any other future callers should need to? I don't think 
so. Would things be easier to reason about if we just disallowed it 
outright? For sure.

>>>> FWIW I have some ideas for re-converging .dma_configure in future
>>>> which I think should probably be able to subsume this into a
>>>> completely generic common path, given a common flag.
>>>
>>> This would be great!
>>
>> Indeed, so if we're enthusiastic about future cleanup that necessitates a
>> generic flag, why not make the flag generic to start with?
> 
> Maybe when someone has patches to delete the bus ops completely they
> can convince Greg. The good news is that it isn't much work to flip
> the flag, Lu has already done it 3 times in the previous versions..
> 
> It has already been 8 weeks on this point, lets just move on please.

Sure, if it was rc7 with the merge window looming I'd be saying "this is 
close enough, let's get it in now and fix the small stuff next cycle". 
However while there's still potentially time to get things right first 
time, I for one am going to continue to point them out because I'm not a 
fan of avoidable churn. I'm sorry I haven't had a chance to look 
properly at this series between v1 and v6, but that's just how things 
have been.

Robin.
Jason Gunthorpe Feb. 22, 2022, 11:53 p.m. UTC | #8
On Tue, Feb 22, 2022 at 09:18:23PM +0000, Robin Murphy wrote:

> > Still not sure I see what you are thinking though..
> 
> What part of "How hard is it to hold group->mutex when reading or writing
> group->owner?" sounded like "complex lockless algorithm", exactly?

group->owner is not the issue, this series is already using the group
lock to protect the group_cnt and owner.

It is how you inspect the struct device it iterates over to decide if
it is still using the DMA API or not that is the problem. Hint: this
is why I keep mentioning the device_lock() as it is the only locking we
for this today.

> To spell it out, the scheme I'm proposing looks like this:

Well, I already got this, it is what is in driver_or_DMA_API_token()
that matters

I think you are suggesting to do something like:

   if (!READ_ONCE(dev->driver) ||  ???)
       return NULL;
   return group;  // A DMA_API 'token'

Which is locklessly reading dev->driver, and why you are talking about
races, I guess.

> remove:
>        bool still_owned = false;
>        mutex_lock(group->mutex);
>        list_for_each_entry(tmp, &group->devices, list) {
>                void *owner = driver_or_DMA_API_token(tmp);
>                if (tmp == dev || !owner || owner != group->owner)

And here you expect this will never be called if a group is owned by
VFIO? Which bakes in that weird behavior of really_probe() that only
some errors deserve to get a notifier?

How does the next series work? The iommu_attach_device() work relies
on the owner_cnt too. I don't think this list can replace it there.

> always serialised, even if the list walk in the remove hook sees "future"
> information WRT other devices' drivers, at worst it should merely short-cut
> to a corresponding pending reclaim of ownership.

Depending on what the actual logic is I could accept this argument,
assuming it came with a WRITE_ONCE on the store side and we all
thought carefully about how all this is ordered.

> Because the current alternative to adding a few simple lines to dd.c is
> adding loads of lines all over the place to end up calling back into common
> IOMMU code, to do something I'm 99% certain the common IOMMU code
> could do

*shrug* both Christoph and I tried to convince Greg. He never really
explained why, but currently he thinks this is the right way to design
it, and so here we are.

> for itself in private. That said, having worked through the above, it does
> start looking like a bit of a big change for this series at this point, so
> I'm happy to keep it on the back burner for when I have to rip
> .dma_configure to pieces anyway.

OK, thanks.

> According to lockdep, I think I've solved the VFIO locking issue provided
> vfio_group_viable() goes away, so I'm certainly keen not to delay that for
> another cycle!

Indeed.

Keep in mind that lockdep is disabled on the device_lock()..

> paragraph quoted above again. I'm not talking about automatic DMA API
> claiming, that clearly happens per-device; I'm talking about explicit
> callers of iommu_group_claim_dma_owner(). Does VFIO call that multiple times
> for individual devices? No. Should it? No. Is it reasonable that any other
> future callers should need to? I don't think so. Would things be easier to
> reason about if we just disallowed it outright? For sure.

iommufd is device centric and the current draft does call
iommu_group_claim_dma_owner() once for each device. It doesn't have
any reason to track groups, so it has no way to know if it is
"nesting" or not.

I hope the iommu_attach_device() work will progress and iommufd can
eventualy call a cleaner device API, it is setup to work this way at
least.

So, yes, currently future calls need the owner_cnt to work right.
(and we are doing all this to allow iommufd and vfio to share the
ownership logic - adding VFIO-like group tracking complexity to
iommufd to save a few bus callbacks is not a win, IMHO)

> > It has already been 8 weeks on this point, lets just move on please.
> 
> Sure, if it was rc7 with the merge window looming I'd be saying "this is
> close enough, let's get it in now and fix the small stuff next cycle".
> However while there's still potentially time to get things right first time,
> I for one am going to continue to point them out because I'm not a fan of
> avoidable churn. I'm sorry I haven't had a chance to look properly at this
> series between v1 and v6, but that's just how things have been.

Well, it is understandable, but this was supposed to be a smallish
cleanup series. All the improvments from the discussion are welcomed
and certainly did improve it, but this started in November and is
dragging on..

Sometimes we need churn to bring everyone along the journey.

Jason
Baolu Lu Feb. 23, 2022, 5:01 a.m. UTC | #9
On 2/23/22 7:53 AM, Jason Gunthorpe wrote:
>> To spell it out, the scheme I'm proposing looks like this:
> Well, I already got this, it is what is in driver_or_DMA_API_token()
> that matters
> 
> I think you are suggesting to do something like:
> 
>     if (!READ_ONCE(dev->driver) ||  ???)
>         return NULL;
>     return group;  // A DMA_API 'token'
> 
> Which is locklessly reading dev->driver, and why you are talking about
> races, I guess.
> 

I am afraid that we are not able to implement a race-free
driver_or_DMA_API_token() helper. The lock problem between the IOMMU
core and driver core always exists.

For example, when we implemented iommu_group_store_type() to change the
default domain type of a device through sysfs, we could only comprised
and limited this functionality to singleton groups to avoid the lock
issue.

Unfortunately, that compromise cannot simply applied to the problem to
be solved by this series, because the iommu core cannot abort the driver
binding when the conflict is detected in the bus notifier.

Best regards,
baolu
Robin Murphy Feb. 23, 2022, 1:04 p.m. UTC | #10
On 2022-02-23 05:01, Lu Baolu wrote:
> On 2/23/22 7:53 AM, Jason Gunthorpe wrote:
>>> To spell it out, the scheme I'm proposing looks like this:
>> Well, I already got this, it is what is in driver_or_DMA_API_token()
>> that matters
>>
>> I think you are suggesting to do something like:
>>
>>     if (!READ_ONCE(dev->driver) ||  ???)
>>         return NULL;
>>     return group;  // A DMA_API 'token'
>>
>> Which is locklessly reading dev->driver, and why you are talking about
>> races, I guess.
>>
> 
> I am afraid that we are not able to implement a race-free
> driver_or_DMA_API_token() helper. The lock problem between the IOMMU
> core and driver core always exists.

It's not race-free. My point is that the races aren't harmful because 
what we might infer from the "wrong" information still leads to the 
right action. dev->driver is obviously always valid and constant for 
*claiming* ownership, since that either happens for the DMA API in the 
middle of really_probe() binding driver to dev, or while driver is 
actively using dev and calling iommu_group_claim_dma_owner(). The races 
exist during remove, but both probe and remove are serialised on the 
group mutex after respectively setting/clearing dev->driver, there are 
only 4 possibilities for the state of any other group sibling "tmp" 
during the time that dev holds that mutex in its remove path:

1 - tmp->driver is non-NULL because tmp is already bound.
   1.a - If tmp->driver->driver_managed_dma == 0, the group must 
currently be DMA-API-owned as a whole. Regardless of what driver dev has 
unbound from, its removal does not release someone else's DMA API 
(co-)ownership.
   1.b - If tmp->driver->driver_managed_dma == 1 and tmp->driver == 
group->owner, then dev must have unbound from the same driver, but 
either way that driver has not yet released ownership so dev's removal 
does not change anything.
   1.c - If tmp->driver->driver_managed_dma == 1 and tmp->driver != 
group->owner, it doesn't matter. Even if tmp->driver is currently 
waiting to attempt to claim ownership it can't do so until we release 
the mutex.

2 - tmp->driver is non-NULL because tmp is in the process of binding.
   2.a - If tmp->driver->driver_managed_dma == 0, tmp can be assumed to 
be waiting on the group mutex to claim DMA API ownership.
     2.a.i - If the group is DMA API owned, this race is simply a 
short-cut to case 1.a - dev's ownership is effectively handed off 
directly to tmp, rather than potentially being released and immediately 
reclaimed. Once tmp gets its turn, it finds the group already 
DMA-API-owned as it wanted and all is well. This may be "unfair" if an 
explicit claim was also waiting, but not incorrect.
     2.a.ii - If the group is driver-owned, it doesn't matter. Removing 
dev does not change the current ownership, and tmp's probe will 
eventually get its turn and find whatever it finds at that point in future.
   2.b - If tmp->driver->driver_managed_dma == 1, it doesn't matter. 
Either that driver already owns the group, or it might try to claim it 
after we've resolved dev's removal and released the mutex, in which case 
it will find whatever it finds.

3 - tmp->driver is NULL because tmp is unbound. Obviously no impact.

4 - tmp->driver is NULL because tmp is in the process of unbinding.
   4.a - If the group is DMA-API-owned, either way tmp has no further 
influence.
     4.a.i - If tmp has unbound from a driver_managed_dma=0 driver, it 
must be waiting to release its DMA API ownership, thus if tmp would 
otherwise be the only remaining DMA API owner, the race is that dev's 
removal releases ownership on behalf of both devices. When tmp's own 
removal subsequently gets the mutex, it will either see that the group 
is already unowned, or maybe that someone else has re-claimed it in the 
interim, and either way do nothing, which is fine.
     4.a.ii - If tmp has unbound from a driver_managed_dma=1 driver, it 
doesn't matter, as in case 1.c.
   4.b - If the group is driver-owned, it doesn't matter. That ownership 
can only change if that driver releases it, which isn't happening while 
we hold the mutex.

As I said yesterday, I'm really just airing out an idea here; I might 
write up some proper patches as part of the bus ops work, and we can 
give it proper scrutiny then.

> For example, when we implemented iommu_group_store_type() to change the
> default domain type of a device through sysfs, we could only comprised
> and limited this functionality to singleton groups to avoid the lock
> issue.

Indeed, but once the probe and remove paths for grouped devices have to 
serialise on the group mutex, as we're introducing here, the story 
changes and we gain a lot more power. In fact that's a good point I 
hadn't considered yet - that sysfs constraint is functionally equivalent 
to the one in iommu_attach_device(), so once we land this ownership 
concept we should be free to relax it from "singleton" to "unowned" in 
much the same way as your other series is doing for attach.

> Unfortunately, that compromise cannot simply applied to the problem to
> be solved by this series, because the iommu core cannot abort the driver
> binding when the conflict is detected in the bus notifier.

No, I've never proposed that probe-time DMA ownership can be claimed 
from a notifier, we all know why that doesn't work. It's only the 
dma_cleanup() step that *could* be punted back to iommu_bus_notifier vs. 
the driver core having to know about it. Either way we're still 
serialising remove/failure against probe/remove of other devices in a 
group, and that's the critical aspect.

Thanks,
Robin.
Jason Gunthorpe Feb. 23, 2022, 1:46 p.m. UTC | #11
On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote:

> 1 - tmp->driver is non-NULL because tmp is already bound.
>   1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be
> DMA-API-owned as a whole. Regardless of what driver dev has unbound from,
> its removal does not release someone else's DMA API (co-)ownership.

This is an uncommon locking pattern, but it does work. It relies on
the mutex being an effective synchronization barrier for an unlocked
store:

				      WRITE_ONCE(dev->driver, NULL)

 mutex_lock(&group->lock)
 READ_ONCE(dev->driver) != NULL and no UAF
 mutex_unlock(&group->lock)

				      mutex_lock(&group->lock)
				      tmp = READ_ONCE(dev1->driver);
				      if (tmp && tmp->blah) [..]
				      mutex_unlock(&group->lock)
 mutex_lock(&group->lock)
 READ_ONCE(dev->driver) == NULL
 mutex_unlock(&group->lock)

				      /* No other CPU can UAF dev->driver */
                                      kfree(driver)

Ie the CPU setting driver cannot pass to the next step without all
other CPUs observing the new value because of the release/acquire built
into the mutex_lock.

It is tricky, and can work in this instance, but the pattern's unlocked
design relies on ordering between the WRITE_ONCE and the locks - and
that ordering in dd.c isn't like that today.

Jason
Greg KH Feb. 23, 2022, 2:06 p.m. UTC | #12
On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote:
> 
> > 1 - tmp->driver is non-NULL because tmp is already bound.
> >   1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be
> > DMA-API-owned as a whole. Regardless of what driver dev has unbound from,
> > its removal does not release someone else's DMA API (co-)ownership.
> 
> This is an uncommon locking pattern, but it does work. It relies on
> the mutex being an effective synchronization barrier for an unlocked
> store:
> 
> 				      WRITE_ONCE(dev->driver, NULL)

Only the driver core should be messing with the dev->driver pointer as
when it does so, it already has the proper locks held.  Do I need to
move that to a "private" location so that nothing outside of the driver
core can mess with it?

thanks,

greg k-h
Jason Gunthorpe Feb. 23, 2022, 2:09 p.m. UTC | #13
On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote:
> On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote:
> > 
> > > 1 - tmp->driver is non-NULL because tmp is already bound.
> > >   1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be
> > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from,
> > > its removal does not release someone else's DMA API (co-)ownership.
> > 
> > This is an uncommon locking pattern, but it does work. It relies on
> > the mutex being an effective synchronization barrier for an unlocked
> > store:
> > 
> > 				      WRITE_ONCE(dev->driver, NULL)
> 
> Only the driver core should be messing with the dev->driver pointer as
> when it does so, it already has the proper locks held.  Do I need to
> move that to a "private" location so that nothing outside of the driver
> core can mess with it?

It would be nice, I've seen a abuse and mislocking of it in drivers

Jason
Jason Gunthorpe Feb. 23, 2022, 2:30 p.m. UTC | #14
On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote:
> > On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote:
> > > 
> > > > 1 - tmp->driver is non-NULL because tmp is already bound.
> > > >   1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be
> > > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from,
> > > > its removal does not release someone else's DMA API (co-)ownership.
> > > 
> > > This is an uncommon locking pattern, but it does work. It relies on
> > > the mutex being an effective synchronization barrier for an unlocked
> > > store:
> > > 
> > > 				      WRITE_ONCE(dev->driver, NULL)
> > 
> > Only the driver core should be messing with the dev->driver pointer as
> > when it does so, it already has the proper locks held.  Do I need to
> > move that to a "private" location so that nothing outside of the driver
> > core can mess with it?
> 
> It would be nice, I've seen a abuse and mislocking of it in drivers

Though to be clear, what Robin is describing is still keeping the
dev->driver stores in dd.c, just reading it in a lockless way from
other modules.

Jason
Greg KH Feb. 23, 2022, 4:03 p.m. UTC | #15
On Wed, Feb 23, 2022 at 10:30:11AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote:
> > > On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote:
> > > > 
> > > > > 1 - tmp->driver is non-NULL because tmp is already bound.
> > > > >   1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be
> > > > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from,
> > > > > its removal does not release someone else's DMA API (co-)ownership.
> > > > 
> > > > This is an uncommon locking pattern, but it does work. It relies on
> > > > the mutex being an effective synchronization barrier for an unlocked
> > > > store:
> > > > 
> > > > 				      WRITE_ONCE(dev->driver, NULL)
> > > 
> > > Only the driver core should be messing with the dev->driver pointer as
> > > when it does so, it already has the proper locks held.  Do I need to
> > > move that to a "private" location so that nothing outside of the driver
> > > core can mess with it?
> > 
> > It would be nice, I've seen a abuse and mislocking of it in drivers
> 
> Though to be clear, what Robin is describing is still keeping the
> dev->driver stores in dd.c, just reading it in a lockless way from
> other modules.

"other modules" should never care if a device has a driver bound to it
because instantly after the check happens, it can change so what ever
logic it wanted to do with that knowledge is gone.

Unless the bus lock is held that the device is on, but that should be
only accessable from within the driver core as it controls that type of
stuff, not any random other part of the kernel.

And in looking at this, ick, there are loads of places in the kernel
that are thinking that this pointer being set to something actually
means something.  Sometimes it does, but lots of places, it doesn't as
it can change.

In a semi-related incident right now, we currently have a syzbot failure
in the usb gadget code where it was manipulating the ->driver pointer
directly and other parts of the kernel are crashing.  See
https://lore.kernel.org/r/PH0PR11MB58805E3C4CF7D4C41D49BFCFDA3C9@PH0PR11MB5880.namprd11.prod.outlook.com
for the thread.

I'll poke at this as a background task to try to clean up over time.

thanks,

greg k-h
Robin Murphy Feb. 23, 2022, 5:05 p.m. UTC | #16
On 2022-02-23 16:03, Greg Kroah-Hartman wrote:
> On Wed, Feb 23, 2022 at 10:30:11AM -0400, Jason Gunthorpe wrote:
>> On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote:
>>> On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote:
>>>> On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote:
>>>>> On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote:
>>>>>
>>>>>> 1 - tmp->driver is non-NULL because tmp is already bound.
>>>>>>    1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be
>>>>>> DMA-API-owned as a whole. Regardless of what driver dev has unbound from,
>>>>>> its removal does not release someone else's DMA API (co-)ownership.
>>>>>
>>>>> This is an uncommon locking pattern, but it does work. It relies on
>>>>> the mutex being an effective synchronization barrier for an unlocked
>>>>> store:
>>>>>
>>>>> 				      WRITE_ONCE(dev->driver, NULL)
>>>>
>>>> Only the driver core should be messing with the dev->driver pointer as
>>>> when it does so, it already has the proper locks held.  Do I need to
>>>> move that to a "private" location so that nothing outside of the driver
>>>> core can mess with it?
>>>
>>> It would be nice, I've seen a abuse and mislocking of it in drivers
>>
>> Though to be clear, what Robin is describing is still keeping the
>> dev->driver stores in dd.c, just reading it in a lockless way from
>> other modules.
> 
> "other modules" should never care if a device has a driver bound to it
> because instantly after the check happens, it can change so what ever
> logic it wanted to do with that knowledge is gone.
> 
> Unless the bus lock is held that the device is on, but that should be
> only accessable from within the driver core as it controls that type of
> stuff, not any random other part of the kernel.
> 
> And in looking at this, ick, there are loads of places in the kernel
> that are thinking that this pointer being set to something actually
> means something.  Sometimes it does, but lots of places, it doesn't as
> it can change.

That's fine. In this case we're only talking about the low-level IOMMU 
code which has to be in cahoots with the driver core to some degree (via 
these new callbacks) anyway, but if you're uncomfortable about relying 
on dev->driver even there, I can live with that. There are several 
potential places to capture the relevant information in IOMMU API 
private data, from the point in really_probe() where it *is* stable, and 
then never look at dev->driver ever again - even from .dma_cleanup() or 
future equivalent, which is the aspect from whence this whole 
proof-of-concept tangent span out.

Cheers,
Robin.
Greg KH Feb. 23, 2022, 5:47 p.m. UTC | #17
On Wed, Feb 23, 2022 at 05:05:23PM +0000, Robin Murphy wrote:
> On 2022-02-23 16:03, Greg Kroah-Hartman wrote:
> > On Wed, Feb 23, 2022 at 10:30:11AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Feb 23, 2022 at 10:09:01AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Feb 23, 2022 at 03:06:35PM +0100, Greg Kroah-Hartman wrote:
> > > > > On Wed, Feb 23, 2022 at 09:46:27AM -0400, Jason Gunthorpe wrote:
> > > > > > On Wed, Feb 23, 2022 at 01:04:00PM +0000, Robin Murphy wrote:
> > > > > > 
> > > > > > > 1 - tmp->driver is non-NULL because tmp is already bound.
> > > > > > >    1.a - If tmp->driver->driver_managed_dma == 0, the group must currently be
> > > > > > > DMA-API-owned as a whole. Regardless of what driver dev has unbound from,
> > > > > > > its removal does not release someone else's DMA API (co-)ownership.
> > > > > > 
> > > > > > This is an uncommon locking pattern, but it does work. It relies on
> > > > > > the mutex being an effective synchronization barrier for an unlocked
> > > > > > store:
> > > > > > 
> > > > > > 				      WRITE_ONCE(dev->driver, NULL)
> > > > > 
> > > > > Only the driver core should be messing with the dev->driver pointer as
> > > > > when it does so, it already has the proper locks held.  Do I need to
> > > > > move that to a "private" location so that nothing outside of the driver
> > > > > core can mess with it?
> > > > 
> > > > It would be nice, I've seen a abuse and mislocking of it in drivers
> > > 
> > > Though to be clear, what Robin is describing is still keeping the
> > > dev->driver stores in dd.c, just reading it in a lockless way from
> > > other modules.
> > 
> > "other modules" should never care if a device has a driver bound to it
> > because instantly after the check happens, it can change so what ever
> > logic it wanted to do with that knowledge is gone.
> > 
> > Unless the bus lock is held that the device is on, but that should be
> > only accessable from within the driver core as it controls that type of
> > stuff, not any random other part of the kernel.
> > 
> > And in looking at this, ick, there are loads of places in the kernel
> > that are thinking that this pointer being set to something actually
> > means something.  Sometimes it does, but lots of places, it doesn't as
> > it can change.
> 
> That's fine. In this case we're only talking about the low-level IOMMU code
> which has to be in cahoots with the driver core to some degree (via these
> new callbacks) anyway, but if you're uncomfortable about relying on
> dev->driver even there, I can live with that. There are several potential
> places to capture the relevant information in IOMMU API private data, from
> the point in really_probe() where it *is* stable, and then never look at
> dev->driver ever again - even from .dma_cleanup() or future equivalent,
> which is the aspect from whence this whole proof-of-concept tangent span
> out.

For a specific driver core callback, like dma_cleanup(), all is fine,
but you shouldn't be caring about a driver pointer in your bus callback
for stuff like this as you "know" what happened by virtue of the
callback being called.

thanks,

greg k-h
diff mbox series

Patch

diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
index a039ab809753..d8b29ccd07e5 100644
--- a/include/linux/device/bus.h
+++ b/include/linux/device/bus.h
@@ -59,6 +59,8 @@  struct fwnode_handle;
  *		bus supports.
  * @dma_configure:	Called to setup DMA configuration on a device on
  *			this bus.
+ * @dma_cleanup:	Called to cleanup DMA configuration on a device on
+ *			this bus.
  * @pm:		Power management operations of this bus, callback the specific
  *		device driver's pm-ops.
  * @iommu_ops:  IOMMU specific operations for this bus, used to attach IOMMU
@@ -103,6 +105,7 @@  struct bus_type {
 	int (*num_vf)(struct device *dev);
 
 	int (*dma_configure)(struct device *dev);
+	void (*dma_cleanup)(struct device *dev);
 
 	const struct dev_pm_ops *pm;
 
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 9eaaff2f556c..de05c5c60c6b 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -662,6 +662,8 @@  static int really_probe(struct device *dev, struct device_driver *drv)
 	if (dev->bus)
 		blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 					     BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
+	if (dev->bus && dev->bus->dma_cleanup)
+		dev->bus->dma_cleanup(dev);
 pinctrl_bind_failed:
 	device_links_no_driver(dev);
 	devres_release_all(dev);
@@ -1205,6 +1207,9 @@  static void __device_release_driver(struct device *dev, struct device *parent)
 		else if (drv->remove)
 			drv->remove(dev);
 
+		if (dev->bus && dev->bus->dma_cleanup)
+			dev->bus->dma_cleanup(dev);
+
 		device_links_driver_cleanup(dev);
 
 		devres_release_all(dev);