Message ID | 20170620154830.17487.1861.stgit@gimli.home (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Jun 20, 2017 at 09:48:31AM -0600, Alex Williamson wrote: > If a device is bound to a non-vfio, non-whitelisted driver while a > group is in use, then the integrity of the group is compromised and > will result in hitting a BUG_ON. This code tries to avoid this case > by mangling driver_override to force a no-match for the driver. The > driver-core will either follow-up with a DRIVER_NOT_BOUND (preferred) > or BOUND_DRIVER, at which point we can remove the driver_override > mangling. Rather than mangling the driver override string to prevent driver binding, I wonder if it would make more sense to allow the BUS_NOTIFY_BIND_DRIVER notifier to fail the device probe? The driver override strings are, after all, exposed to userspace, and it strikes me that this kind of mangling is racy - userspace can read or change the override string at any time.
On Mon, 26 Jun 2017 10:08:55 +0100 Russell King - ARM Linux <linux@armlinux.org.uk> wrote: > On Tue, Jun 20, 2017 at 09:48:31AM -0600, Alex Williamson wrote: > > If a device is bound to a non-vfio, non-whitelisted driver while a > > group is in use, then the integrity of the group is compromised and > > will result in hitting a BUG_ON. This code tries to avoid this case > > by mangling driver_override to force a no-match for the driver. The > > driver-core will either follow-up with a DRIVER_NOT_BOUND (preferred) > > or BOUND_DRIVER, at which point we can remove the driver_override > > mangling. > > Rather than mangling the driver override string to prevent driver binding, > I wonder if it would make more sense to allow the BUS_NOTIFY_BIND_DRIVER > notifier to fail the device probe? > > The driver override strings are, after all, exposed to userspace, and > it strikes me that this kind of mangling is racy - userspace can read > or change the override string at any time. Indeed, that looks easier. I sent and RFC, let's see what Greg has to say. Thanks, Alex
On Mon, 26 Jun 2017 10:08:55 +0100 Russell King - ARM Linux <linux@armlinux.org.uk> wrote: > On Tue, Jun 20, 2017 at 09:48:31AM -0600, Alex Williamson wrote: > > If a device is bound to a non-vfio, non-whitelisted driver while a > > group is in use, then the integrity of the group is compromised and > > will result in hitting a BUG_ON. This code tries to avoid this case > > by mangling driver_override to force a no-match for the driver. The > > driver-core will either follow-up with a DRIVER_NOT_BOUND (preferred) > > or BOUND_DRIVER, at which point we can remove the driver_override > > mangling. > > Rather than mangling the driver override string to prevent driver binding, > I wonder if it would make more sense to allow the BUS_NOTIFY_BIND_DRIVER > notifier to fail the device probe? Well, it seemed like a good idea, but I don't think we're getting any traction here, the thread has gone cold: https://lkml.org/lkml/2017/6/27/1002 Greg, any further comments? > The driver override strings are, after all, exposed to userspace, and > it strikes me that this kind of mangling is racy - userspace can read > or change the override string at any time. As an alternative, I think we can make this not racy. BIND_DRIVER is notified through device_bind_driver() which specifies that the device lock is held. This covers not only BIND_DRIVER, but also BOUND_DRIVER and DRIVER_NOT_BOUND. So if the user entry points in sysfs were to require the device lock, we could easily mangle and de-mangle without interference from a user. It also seems like a rather good idea in general to exclude the user from changing driver_override while we're evaluating a match using it. Do you still have an objection to mangling driver_override if we can avoid the user race? Thanks, Alex
On Mon, Jul 10, 2017 at 03:34:12PM -0600, Alex Williamson wrote: > On Mon, 26 Jun 2017 10:08:55 +0100 > Russell King - ARM Linux <linux@armlinux.org.uk> wrote: > > > On Tue, Jun 20, 2017 at 09:48:31AM -0600, Alex Williamson wrote: > > > If a device is bound to a non-vfio, non-whitelisted driver while a > > > group is in use, then the integrity of the group is compromised and > > > will result in hitting a BUG_ON. This code tries to avoid this case > > > by mangling driver_override to force a no-match for the driver. The > > > driver-core will either follow-up with a DRIVER_NOT_BOUND (preferred) > > > or BOUND_DRIVER, at which point we can remove the driver_override > > > mangling. > > > > Rather than mangling the driver override string to prevent driver binding, > > I wonder if it would make more sense to allow the BUS_NOTIFY_BIND_DRIVER > > notifier to fail the device probe? > > Well, it seemed like a good idea, but I don't think we're getting any > traction here, the thread has gone cold: > > https://lkml.org/lkml/2017/6/27/1002 > > Greg, any further comments? I still think your drivers should be fixed, adding yet-another-odd-interaction with the driver core is ripe for added complexity... And, as there's no real patch for me to do anything with (hint, I can't apply RFC patches), I don't know what I can do here... thanks, greg k-h
On Tue, 11 Jul 2017 11:46:27 +0200 Greg KH <greg@kroah.com> wrote: > On Mon, Jul 10, 2017 at 03:34:12PM -0600, Alex Williamson wrote: > > On Mon, 26 Jun 2017 10:08:55 +0100 > > Russell King - ARM Linux <linux@armlinux.org.uk> wrote: > > > > > On Tue, Jun 20, 2017 at 09:48:31AM -0600, Alex Williamson wrote: > > > > If a device is bound to a non-vfio, non-whitelisted driver while a > > > > group is in use, then the integrity of the group is compromised and > > > > will result in hitting a BUG_ON. This code tries to avoid this case > > > > by mangling driver_override to force a no-match for the driver. The > > > > driver-core will either follow-up with a DRIVER_NOT_BOUND (preferred) > > > > or BOUND_DRIVER, at which point we can remove the driver_override > > > > mangling. > > > > > > Rather than mangling the driver override string to prevent driver binding, > > > I wonder if it would make more sense to allow the BUS_NOTIFY_BIND_DRIVER > > > notifier to fail the device probe? > > > > Well, it seemed like a good idea, but I don't think we're getting any > > traction here, the thread has gone cold: > > > > https://lkml.org/lkml/2017/6/27/1002 > > > > Greg, any further comments? > > I still think your drivers should be fixed, adding > yet-another-odd-interaction with the driver core is ripe for added > complexity... Hi Greg, Let me give a concrete scenario, I have a dual-port conventional PCI e1000 NIC. The IOMMU operates on PCIe requester IDs and therefore both NIC functions are masked behind the requester ID of a PCIe-to-PCI bridge. We cannot have the e1000 driver managing one function and a user managing the other (via vfio-pci). In this case, not only is the DMA not isolated but the functions share the same IOMMU context. Therefore in order to allow the user access to one function via vfio-pci, the other function needs to be in a known state, either also bound to vfio-pci, bound to an innocuous driver like pci-stub, or unbound from any driver. Given this state, user now has access to one function of the device, but how can we fix our driver to manage the other function? If the other function is also bound to vfio-pci, the driver core does not allow us to refuse a driver remove request, the best we can do is block for a while, but we best not do that too long so we end up in the device unbound state. Likewise, if the other function was bound to pci-stub, this driver won't block remove, so the device for the other port can transition to an unbound state. Once in an unbound state, how would fixing either the vfio-pci or the core vfio driver prevent the scenario which can now happen of the unbound device being bound to the host e1000 driver? This can happen in pure PCIe topologies as well where perhaps the IOMMU context is not shared, but the devices still lack DMA isolation within the group. The only tool we currently have to manage this scenario is that the vfio core driver can pull BUG_ON after the fact of the other device being bound to a host driver. Understandably, users aren't so keen on this, which is why I'm trying to allow vfio to block binding of that other device before it happens. None of this really seems to fall within the capabilities of the existing driver core, so simply fixing my driver doesn't seem to be a well defined option. Is there a simple solution I'm missing? We're not concerned only with auto-probing, we need to protect against explicit bind attempts as well. > And, as there's no real patch for me to do anything with (hint, I can't > apply RFC patches), I don't know what I can do here... Certainly continuing the discussion is all I'm asking for at this point. The RFC didn't tickle your fancy, but the reply also didn't convey an appreciation of the circumstances. I hope that perhaps this gets us a step closer so we can decide which way to go. Thanks, Alex
On Tue, Jul 11, 2017 at 10:41:16AM -0600, Alex Williamson wrote: > Let me give a concrete scenario, I have a dual-port conventional PCI > e1000 NIC. The IOMMU operates on PCIe requester IDs and therefore both > NIC functions are masked behind the requester ID of a PCIe-to-PCI > bridge. We cannot have the e1000 driver managing one function and a > user managing the other (via vfio-pci). Agreed, but really, if a user asks to do such a thing, they deserve the pieces the kernel ends up in, right? > In this case, not only is the > DMA not isolated but the functions share the same IOMMU context. > Therefore in order to allow the user access to one function via > vfio-pci, the other function needs to be in a known state, either also > bound to vfio-pci, bound to an innocuous driver like pci-stub, or > unbound from any driver. Given this state, user now has access to one > function of the device, but how can we fix our driver to manage the > other function? We have USB drivers that do this all the time, due to crazy USB specs that required it. The cdc-acm driver is one example, and I think there are a number of USB sound devices that also have this issue. Just have the driver of the "first" device grab the second one as well and "know" about the resources involved, as you are doing today. But, then you somehow seem to have the requirement to prevent userspace from mucking around in your driver bindings, and really, you shouldn't care about this, because again, if it messes up here, all bets are off. > If the other function is also bound to vfio-pci, the driver core does > not allow us to refuse a driver remove request, the best we can do is > block for a while, but we best not do that too long so we end up in the > device unbound state. > > Likewise, if the other function was bound to pci-stub, this driver won't > block remove, so the device for the other port can transition to an > unbound state. > > Once in an unbound state, how would fixing either the vfio-pci or the > core vfio driver prevent the scenario which can now happen of the > unbound device being bound to the host e1000 driver? This can happen > in pure PCIe topologies as well where perhaps the IOMMU context is not > shared, but the devices still lack DMA isolation within the group. > > The only tool we currently have to manage this scenario is that the > vfio core driver can pull BUG_ON after the fact of the other device > being bound to a host driver. Well, how about just locking the device down, don't crash the kernel. That at least gives userspace a chance to figure out what they did was wrong. > Understandably, users aren't so keen on > this, which is why I'm trying to allow vfio to block binding of that > other device before it happens. None of this really seems to fall > within the capabilities of the existing driver core, so simply fixing > my driver doesn't seem to be a well defined option. Is there a simple > solution I'm missing? We're not concerned only with auto-probing, we > need to protect against explicit bind attempts as well. Again, if userspace does an explicit bind/unbind attempt, it _HAS_ to know what it is doing, we can't protect ourselves from that, that's always been the case. The bind/unbind was done as a way for people to say "I know more about what is going on than the kernel does right now, so I'm going to override it." and we have to trust that they do know that. Don't spend a lot of time and energy trying to protect yourself from that please. thanks, greg k-h
Hi Greg, On Thu, 13 Jul 2017 10:23:14 +0200 Greg KH <greg@kroah.com> wrote: > On Tue, Jul 11, 2017 at 10:41:16AM -0600, Alex Williamson wrote: > > Let me give a concrete scenario, I have a dual-port conventional PCI > > e1000 NIC. The IOMMU operates on PCIe requester IDs and therefore both > > NIC functions are masked behind the requester ID of a PCIe-to-PCI > > bridge. We cannot have the e1000 driver managing one function and a > > user managing the other (via vfio-pci). > > Agreed, but really, if a user asks to do such a thing, they deserve the > pieces the kernel ends up in, right? I think that's asking a fair bit from users to understand these nuances; at one point in time they can bind the device to e1000 and it works, at another point in time the same operation crashes the system. Perhaps the user is not even using manual binding, maybe the e1000 driver is freshly loaded and probes any devices that are not bound. Maybe the device is passed through /sys/bus/pci/drivers_probe. If the user attempts to bind a device to the wrong driver we don't intentionally run the system into the ground to make that work. Of course if the user is overriding a match with dynamic IDs or driver_override, then I fully agree, the user should be responsible for the action they've dictated. I tend to think of this more towards the former than the latter. > > In this case, not only is the > > DMA not isolated but the functions share the same IOMMU context. > > Therefore in order to allow the user access to one function via > > vfio-pci, the other function needs to be in a known state, either also > > bound to vfio-pci, bound to an innocuous driver like pci-stub, or > > unbound from any driver. Given this state, user now has access to one > > function of the device, but how can we fix our driver to manage the > > other function? > > We have USB drivers that do this all the time, due to crazy USB specs > that required it. The cdc-acm driver is one example, and I think there > are a number of USB sound devices that also have this issue. Just have > the driver of the "first" device grab the second one as well and "know" > about the resources involved, as you are doing today. > > But, then you somehow seem to have the requirement to prevent userspace > from mucking around in your driver bindings, and really, you shouldn't > care about this, because again, if it messes up here, all bets are off. > > > If the other function is also bound to vfio-pci, the driver core does > > not allow us to refuse a driver remove request, the best we can do is > > block for a while, but we best not do that too long so we end up in the > > device unbound state. > > > > Likewise, if the other function was bound to pci-stub, this driver won't > > block remove, so the device for the other port can transition to an > > unbound state. > > > > Once in an unbound state, how would fixing either the vfio-pci or the > > core vfio driver prevent the scenario which can now happen of the > > unbound device being bound to the host e1000 driver? This can happen > > in pure PCIe topologies as well where perhaps the IOMMU context is not > > shared, but the devices still lack DMA isolation within the group. > > > > The only tool we currently have to manage this scenario is that the > > vfio core driver can pull BUG_ON after the fact of the other device > > being bound to a host driver. > > Well, how about just locking the device down, don't crash the kernel. > That at least gives userspace a chance to figure out what they did was > wrong. Locking down the device is an interesting strategy, I'll look into this more. I think we'd be locking down the user/vfio owned device, locking down the device for non-vfio use is essentially what I've been trying to do with blocking driver binding. With PCI devices we have the bus-master bit that we could clear and prevent the user from changing to theoretically stop all DMA for the device. We'd also need to destroy the IOMMU domain or else the IOMMU driver is going to break because the wrong domain type is setup for the host owned device. However, if we manage to do all of this perfectly, the result would be that the user owned device quietly stops working, perhaps only a dmesg log entry to identify what happened. I can imagine many bugs filed and much time wasted looking for that magic breadcrumb. Perhaps the better approach is to try to request the conflicting device(s) back from the user, and failing that kill the user process, ultimately falling back to the existing BUG_ON if we haven't resolved the compromising scenario. A harsher approach, but there's certainly the argument that we're trying to follow the user command of binding to the native host driver. > > Understandably, users aren't so keen on > > this, which is why I'm trying to allow vfio to block binding of that > > other device before it happens. None of this really seems to fall > > within the capabilities of the existing driver core, so simply fixing > > my driver doesn't seem to be a well defined option. Is there a simple > > solution I'm missing? We're not concerned only with auto-probing, we > > need to protect against explicit bind attempts as well. > > Again, if userspace does an explicit bind/unbind attempt, it _HAS_ to > know what it is doing, we can't protect ourselves from that, that's > always been the case. The bind/unbind was done as a way for people to > say "I know more about what is going on than the kernel does right now, > so I'm going to override it." and we have to trust that they do know > that. Don't spend a lot of time and energy trying to protect yourself > from that please. Thanks for the advice, I appreciate your time on this. I'd still like to improve the current behavior, but I'll see what I can do on the side of imposing user consequences before taking the head shot. Thanks, Alex
On Fri, Jul 14, 2017 at 10:03:27AM -0600, Alex Williamson wrote: > Hi Greg, > > On Thu, 13 Jul 2017 10:23:14 +0200 > Greg KH <greg@kroah.com> wrote: > > > On Tue, Jul 11, 2017 at 10:41:16AM -0600, Alex Williamson wrote: > > > Let me give a concrete scenario, I have a dual-port conventional PCI > > > e1000 NIC. The IOMMU operates on PCIe requester IDs and therefore both > > > NIC functions are masked behind the requester ID of a PCIe-to-PCI > > > bridge. We cannot have the e1000 driver managing one function and a > > > user managing the other (via vfio-pci). > > > > Agreed, but really, if a user asks to do such a thing, they deserve the > > pieces the kernel ends up in, right? > > I think that's asking a fair bit from users to understand these > nuances; There is no "nuance" here. > at one point in time they can bind the device to e1000 and it > works, Yeah, they got lucky! > at another point in time the same operation crashes the system. And now they didn't! What did they do differently? Oh look, one other device needed to be unbound/bound/whatever to get this to work properly, let's do that instead next time. > Perhaps the user is not even using manual binding, maybe the e1000 > driver is freshly loaded and probes any devices that are not bound. No, that's not the issue here at all, that should always work as the kernel driver is telling us that it can support this device just fine. Nothing "grey" or "nuanced" here at all. > Maybe the device is passed through /sys/bus/pci/drivers_probe. Then the user gets to keep the pieces of the kernel that might get spit out at them. Again, doing manual binding is a risk that if a user takes, it might or might not work. It's always been that way. > If the user attempts to bind a device to the wrong driver we don't > intentionally run the system into the ground to make that work. Are you kidding? That happens all the time, try to do it yourself and bind a device to a driver that doesn't expect to be handling it. If you are lucky your kernel will crash, if unlucky, it will limp along and do odd things to the device. It's been this way since these sysfs files were added over a decade ago. > Of course if the user is overriding a match with dynamic IDs or > driver_override, then I fully agree, the user should be responsible > for the action they've dictated. I tend to think of this more towards > the former than the latter. The user is always responsible if they are using sysfs to bind/unbind devices from drivers. It's not complex or nuanced at all. thanks, greg k-h
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index f40d1508d368..20e57fecf652 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -25,6 +25,7 @@ #include <linux/miscdevice.h> #include <linux/module.h> #include <linux/mutex.h> +#include <linux/platform_device.h> #include <linux/pci.h> #include <linux/rwsem.h> #include <linux/sched.h> @@ -32,6 +33,7 @@ #include <linux/stat.h> #include <linux/string.h> #include <linux/uaccess.h> +#include <linux/uuid.h> #include <linux/vfio.h> #include <linux/wait.h> @@ -95,6 +97,7 @@ struct vfio_group { bool noiommu; struct kvm *kvm; struct blocking_notifier_head notifier; + unsigned char uuid[16]; }; struct vfio_device { @@ -352,6 +355,8 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group) group->nb.notifier_call = vfio_iommu_group_notifier; + generate_random_uuid(group->uuid); + /* * blocking notifiers acquire a rwsem around registering and hold * it around callback. Therefore, need to register outside of @@ -728,6 +733,111 @@ static int vfio_group_nb_verify(struct vfio_group *group, struct device *dev) return vfio_dev_viable(dev, group); } +#define VFIO_TAG_PREFIX "#vfio_group:" + +static char **vfio_find_driver_override(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + return &pdev->driver_override; + } else if (dev->bus == &platform_bus_type) { + struct platform_device *pdev = to_platform_device(dev); + return &pdev->driver_override; + } + + return NULL; +} + +/* + * If we're about to bind to something other than a known whitelisted driver + * or known vfio bus driver, try to avert it with driver_override. + */ +static void vfio_group_nb_pre_bind(struct vfio_group *group, struct device *dev) +{ + struct vfio_bus_driver *driver; + struct device_driver *drv = ACCESS_ONCE(dev->driver); + char **driver_override; + + if (vfio_dev_whitelisted(dev, drv)) + return; /* Binding to known "innocuous" device/driver */ + + mutex_lock(&vfio.bus_drivers_lock); + list_for_each_entry(driver, &vfio.bus_drivers_list, vfio_next) { + if (driver->drv == drv) { + mutex_unlock(&vfio.bus_drivers_lock); + return; /* Binding to known vfio bus driver, ok */ + } + } + mutex_unlock(&vfio.bus_drivers_lock); + + /* Can we stall slightly to let users fall off? */ + if (list_empty(&group->device_list)) { + if (wait_event_timeout(vfio.release_q, + !atomic_read(&group->container_users), HZ)) + return; + } + + driver_override = vfio_find_driver_override(dev); + if (driver_override) { + char tag[50], *new = NULL, *old = *driver_override; + + snprintf(tag, sizeof(tag), "%s%pU", + VFIO_TAG_PREFIX, group->uuid); + + if (old && strstr(old, tag)) + return; /* block already in place */ + + new = kasprintf(GFP_KERNEL, "%s%s", old ? old : "", tag); + if (new) { + *driver_override = new; + kfree(old); + vfio_group_get(group); + dev_warn(dev, "vfio: Blocking unsafe driver bind\n"); + return; + } + } + + dev_warn(dev, "vfio: Unsafe driver binding to in-use group!\n"); +} + +/* If we've mangled driver_override, remove it */ +static void vfio_group_nb_post_bind(struct vfio_group *group, + struct device *dev) +{ + char **driver_override = vfio_find_driver_override(dev); + + if (driver_override && *driver_override) { + char tag[50], *new, *start, *end, *old = *driver_override; + + snprintf(tag, sizeof(tag), "%s%pU", + VFIO_TAG_PREFIX, group->uuid); + + start = strstr(old, tag); + if (start) { + end = start + strlen(tag); + + if (old + strlen(old) > end) + memmove(start, end, + strlen(old) - (end - old) + 1); + else + *start = 0; + + if (strlen(old)) { + new = kasprintf(GFP_KERNEL, "%s", old); + if (new) { + *driver_override = new; + kfree(old); + } /* else, in-place terminated, ok */ + } else { + *driver_override = NULL; + kfree(old); + } + + vfio_group_put(group); + } + } +} + static int vfio_iommu_group_notifier(struct notifier_block *nb, unsigned long action, void *data) { @@ -757,14 +867,23 @@ static int vfio_iommu_group_notifier(struct notifier_block *nb, */ break; case IOMMU_GROUP_NOTIFY_BIND_DRIVER: - pr_debug("%s: Device %s, group %d binding to driver\n", + pr_debug("%s: Device %s, group %d binding to driver %s\n", __func__, dev_name(dev), - iommu_group_id(group->iommu_group)); + iommu_group_id(group->iommu_group), dev->driver->name); + if (vfio_group_nb_verify(group, dev)) + vfio_group_nb_pre_bind(group, dev); + break; + case IOMMU_GROUP_NOTIFY_DRIVER_NOT_BOUND: + pr_debug("%s: Device %s, group %d binding fail to driver %s\n", + __func__, dev_name(dev), + iommu_group_id(group->iommu_group), dev->driver->name); + vfio_group_nb_post_bind(group, dev); break; case IOMMU_GROUP_NOTIFY_BOUND_DRIVER: pr_debug("%s: Device %s, group %d bound to driver %s\n", __func__, dev_name(dev), iommu_group_id(group->iommu_group), dev->driver->name); + vfio_group_nb_post_bind(group, dev); BUG_ON(vfio_group_nb_verify(group, dev)); break; case IOMMU_GROUP_NOTIFY_UNBIND_DRIVER: @@ -1351,6 +1470,7 @@ static int vfio_group_unset_container(struct vfio_group *group) if (users != 1) return -EBUSY; + wake_up(&vfio.release_q); __vfio_group_unset_container(group); return 0; @@ -1364,7 +1484,11 @@ static int vfio_group_unset_container(struct vfio_group *group) */ static void vfio_group_try_dissolve_container(struct vfio_group *group) { - if (0 == atomic_dec_if_positive(&group->container_users)) + int users = atomic_dec_if_positive(&group->container_users); + + wake_up(&vfio.release_q); + + if (!users) __vfio_group_unset_container(group); } @@ -1433,19 +1557,26 @@ static bool vfio_group_viable(struct vfio_group *group) static int vfio_group_add_container_user(struct vfio_group *group) { + int ret; + if (!atomic_inc_not_zero(&group->container_users)) return -EINVAL; if (group->noiommu) { - atomic_dec(&group->container_users); - return -EPERM; + ret = -EPERM; + goto out; } if (!group->container->iommu_driver || !vfio_group_viable(group)) { - atomic_dec(&group->container_users); - return -EINVAL; + ret = -EINVAL; + goto out; } return 0; + +out: + atomic_dec(&group->container_users); + wake_up(&vfio.release_q); + return ret; } static const struct file_operations vfio_device_fops;